Textmining without document context.

Abstract: We consider a challenging clustering task: the clustering of multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology that takes as input multi-word terms and the lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the TermWatch system. We compared CPCL to other existing clustering algorithms, namely hierarchical and partitioning methods (k-means, k-medoids). This out-of-context clustering task led us to adapt the multi-word term representation for statistical methods and also to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field, which comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. CPCL also showed good adaptability for handling very large and sparse matrices.
Document type:
Journal article
Information Processing and Management, Elsevier, 2006, 42 (6), pp.1532-1552. <10.1016/j.ipm.2006.03.017>


https://hal.archives-ouvertes.fr/hal-00636111
Contributor: Fidelia Ibekwe-Sanjuan
Submitted on: Wednesday, 2 November 2011 - 19:36:22
Last modified on: Wednesday, 23 March 2016 - 09:48:34
Document(s) archived on: Friday, 3 February 2012 - 02:21:12

File

ipm.pdf
Files produced by the author(s)

Citation

Eric Sanjuan, Fidelia Ibekwe-Sanjuan. Textmining without document context. Information Processing and Management, Elsevier, 2006, 42 (6), pp.1532-1552. <10.1016/j.ipm.2006.03.017>. <hal-00636111>
