Textmining without document context. - HAL open archive
Journal article, Information Processing and Management, Year: 2006

Textmining without document context.

Abstract

We consider a challenging clustering task: clustering multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology that takes as input multi-word terms and the lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the TermWatch system. We compared CPCL to other existing clustering algorithms, namely hierarchical and partitioning algorithms (k-means, k-medoids). This out-of-context clustering task led us to adapt multi-word term representation for statistical methods and to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field that comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. CPCL also showed good adaptability for handling very large and sparse matrices.
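As a rough illustration of clustering terms from relations between them rather than from document co-occurrence, the following Python sketch groups terms into the connected components of a relation graph. It is not the CPCL algorithm, whose details are not given in this abstract. The relation used here (`same_head`, two terms sharing their last word), the helper names, and the sample terms are all assumptions made for the example.

```python
from collections import defaultdict


def head(term):
    """Assumed head of an English noun phrase: its last word (a simplification)."""
    return term.lower().split()[-1]


def same_head(a, b):
    """Hypothetical lexical relation: two terms share the same head word."""
    return head(a) == head(b)


def cluster_by_relations(terms, related):
    """Group distinct terms into connected components of the graph whose
    edges are the pairs (a, b) for which related(a, b) is true.
    No document co-occurrence information is used."""
    parent = {t: t for t in terms}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if related(a, b):
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for t in terms:
        groups[find(t)].append(t)
    return sorted(sorted(g) for g in groups.values())


# Made-up sample terms, not taken from the evaluation data.
terms = ["gene expression", "protein expression", "T cell", "B cell", "cell line"]
clusters = cluster_by_relations(terms, same_head)
for cluster in clusters:
    print(cluster)
```

Any other pairwise relation can be passed in place of `same_head`; the grouping step itself only needs to know which pairs of terms are related.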
Main file
ipm.pdf (259.05 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00636111 , version 1 (02-11-2011)

Cite

Eric Sanjuan, Fidelia Ibekwe-Sanjuan. Textmining without document context. Information Processing and Management, 2006, 42 (6), pp.1532-1552. ⟨10.1016/j.ipm.2006.03.017⟩. ⟨hal-00636111⟩