Clustering and knowledge integration

Nguyen-Viet-Dung Nghiem

Abstract

Clustering is one of the essential topics in data mining. Although it is designed to work in a fully unsupervised way, its application to real-world data is often guided by expert knowledge. Constrained clustering (a generalization of semi-supervised clustering) aims to exploit this knowledge during the clustering task. In this thesis, we develop two frameworks to integrate expert constraints into the clustering task. In the first work, we propose a declarative post-processing method that adapts the output of a clustering algorithm to satisfy the constraints. The originality is to consider an allocation matrix that gives the scores for assigning points to each cluster and to find the best partition satisfying all the constraints. In the second work, we propose a unified framework to integrate general constraints into a clustering model with deep learning. The genericity is obtained by formulating the constraints in propositional logic, defining two versions of semantic loss, and computing them through Weighted Model Counting. Experimental results on well-known datasets show that our approach is competitive with other constraint-specific methods while being general. In addition, we have defined and formulated new types of constraints in clustering: the cluster coverage constraint, which limits the number of clusters to which a group of points can belong, and the combined fairness constraint, which takes into account both group fairness and individual fairness.
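The post-processing idea can be illustrated with a small sketch. It assumes a hypothetical allocation matrix `scores` (one row per point, one column per cluster) and only pairwise must-link and cannot-link constraints, and it searches by brute-force enumeration. The enumeration is a stand-in chosen for readability, not the declarative method of the thesis, and the function name and data are illustrative assumptions.

```python
from itertools import product


def best_constrained_partition(scores, must_link=(), cannot_link=()):
    """Return (labels, total): the assignment with the highest summed
    allocation score among those satisfying every pairwise constraint.

    Brute force over k**n assignments, so only usable on toy inputs.
    """
    n, k = len(scores), len(scores[0])
    best_labels, best_total = None, float("-inf")
    for labels in product(range(k), repeat=n):
        # Must-link: both points must share a cluster.
        if any(labels[i] != labels[j] for i, j in must_link):
            continue
        # Cannot-link: both points must be in different clusters.
        if any(labels[i] == labels[j] for i, j in cannot_link):
            continue
        total = sum(scores[i][labels[i]] for i in range(n))
        if total > best_total:
            best_labels, best_total = labels, total
    return best_labels, best_total


# Hypothetical scores for 4 points and 2 clusters.
scores = [
    [0.90, 0.10],
    [0.60, 0.40],
    [0.20, 0.80],
    [0.45, 0.55],
]
labels, total = best_constrained_partition(
    scores, must_link=[(0, 3)], cannot_link=[(0, 1)]
)
print(labels, round(total, 2))
```

Without constraints, the row-wise best assignment would be `(0, 0, 1, 1)`; the two constraints force points 1 and 3 to change cluster, and the search picks the highest-scoring assignment that remains feasible.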

Constrained clustering (a generalization of semi-supervised clustering) aims to exploit expert knowledge during the clustering task. This knowledge is often expressed as a set of constraints and can take various forms. In this thesis, we develop two mechanisms to integrate constraints into the clustering task. In the first part, we propose a declarative post-processing method that adapts the output of a clustering algorithm to satisfy the constraints. The originality is to consider an allocation matrix that gives the scores for assigning points to each cluster and to find the best partition satisfying all the constraints. In the second part, we propose a unified framework to integrate general constraints into a clustering model with deep learning. Genericity is achieved by formalizing the constraints in logic and considering their models. Experimental results on well-known datasets show that our approach is competitive with other constraint-specific methods while remaining general. In addition, we have defined and formulated new types of constraints in clustering: the cluster coverage constraint, which limits the number of clusters to which a group of points can belong, and the combined fairness constraint, which takes into account both group fairness and individual fairness.
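To make the logic-based part more concrete, here is a minimal sketch of a semantic loss computed by naive weighted model counting. It assumes the common definition of semantic loss as the negative log of the total probability mass of the assignments (models) that satisfy a constraint, with each point's cluster drawn independently from a soft assignment. The thesis defines two versions of semantic loss that may differ from this one, and the function name, the predicate interface, and the probabilities below are all illustrative assumptions.

```python
import math
from itertools import product


def semantic_loss(probs, constraint):
    """Negative log of the weighted model count of `constraint`.

    probs      : list of per-point cluster distributions (each sums to 1)
    constraint : predicate taking a tuple of cluster labels, one per point

    Enumerates all k**n label tuples, so it is only a toy illustration;
    it also assumes at least one satisfying tuple has nonzero weight.
    """
    k = len(probs[0])
    wmc = 0.0
    for labels in product(range(k), repeat=len(probs)):
        if constraint(labels):
            # Weight of a model: product of the chosen cluster probabilities.
            wmc += math.prod(p[c] for p, c in zip(probs, labels))
    return -math.log(wmc)


# Hypothetical soft assignments of two points to two clusters.
probs = [[0.8, 0.2], [0.3, 0.7]]

must_link_loss = semantic_loss(probs, lambda l: l[0] == l[1])
cannot_link_loss = semantic_loss(probs, lambda l: l[0] != l[1])
print(must_link_loss, cannot_link_loss)
```

Here the must-link constraint has weight 0.8 × 0.3 + 0.2 × 0.7 = 0.38 and the cannot-link constraint has the complementary weight 0.62, so the must-link loss is the larger of the two. Because the constraint is passed as an arbitrary predicate, the same function covers other constraint types without a dedicated loss for each.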
