| HAL : hal-00636111, version 1 |
| DOI : 10.1016/j.ipm.2006.03.017 |
|
|
| Information Processing and Management 42, 6 (2006) 1532-1552 |
|
| Text mining without document context. |
|
|
Eric Sanjuan 1, Fidelia Ibekwe-Sanjuan 2 |
|
|
| (15/12/2006) |
|
|
| We consider a challenging clustering task: the clustering of multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology taking as input multi-word terms and lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the TermWatch system. We compared CPCL to other existing clustering algorithms, namely hierarchical and partitioning (k-means, k-medoids) methods. This out-of-context clustering task led us to adapt multi-word term representation for statistical methods and also to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field which comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. Also, CPCL showed good adaptability for handling very large and sparse matrices. |
|
| 1 : | Laboratoire Informatique d'Avignon (LIA) |
| Université d'Avignon – Centre d'Enseignement et de Recherche en Informatique - CERI | |
| 2 : | Equipe de recherche de Lyon en sciences de l'information et de la communication (ELICO) |
| Université Lumière - Lyon II : EA4147 – Université Claude Bernard - Lyon I – Université Jean Moulin - Lyon III – Institut d'Études Politiques [IEP] - Lyon – École Nationale Supérieure des Sciences de l'Information et des Bibliothèques - Lyon | |
|
| Domain | : | Humanities and Social Sciences/Information and communication sciences |
|
|
| Multi-word term clustering – Lexico-syntactic relations – Text mining – Informetrics – Cluster evaluation |
|
|
|
|
|
| http://hal.archives-ouvertes.fr/hal-00636111 | |
| oai:hal.archives-ouvertes.fr:hal-00636111 | |
| Contributor : Fidelia Ibekwe-Sanjuan | |
| Submitted on : Wednesday 2 November 2011, 19:36:22 | |
| Last modified on : Thursday 3 November 2011, 17:02:07 | |