Textmining without document context.

Eric Sanjuan; Fidelia Ibekwe-Sanjuan

doi:10.1016/j.ipm.2006.03.017

Journal article, Information Processing and Management, Year: 2006

Eric Sanjuan

Role: Corresponding author
PersonId: 912763
IdHAL: eric-sanjuan
ORCID: 0000-0002-4057-6691

Laboratoire Informatique d'Avignon

Fidelia Ibekwe-Sanjuan

Role: Corresponding author
PersonId: 180321
IdHAL: fidelia-ibekwe
ORCID: 0000-0001-8862-7729
IdRef: 11366396X

Equipe de recherche de Lyon en sciences de l'information et de la communication

Abstract

We consider a challenging clustering task: clustering multi-word terms, without document co-occurrence information, to form coherent groups of topics. For this task, we developed a methodology that takes as input multi-word terms and the lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the TermWatch system. We compared CPCL to existing clustering algorithms, namely hierarchical and partitioning (k-means, k-medoids) methods. This out-of-context clustering task led us to adapt the multi-word term representation for statistical methods and to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field that comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. CPCL also showed good adaptability for handling very large and sparse matrices.
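
As a rough illustration of why term length and sparsity matter when terms are compared without any document context, the sketch below represents each multi-word term by the set of its words and compares terms with Jaccard similarity. This is an assumed, simplified representation chosen for illustration only: it is neither CPCL nor the representation used in the paper, and the sample terms and the helper names (`word_set`, `jaccard`, `pairwise_similarities`) are hypothetical.

```python
from itertools import combinations

# Hypothetical sample terms (not taken from the paper's genomic term list).
TERMS = [
    "gene expression",
    "gene expression profile",
    "T cell",
    "T cell activation",
    "protein kinase",
]

def word_set(term):
    """Represent a multi-word term by the set of its lowercased words."""
    return frozenset(term.lower().split())

def jaccard(a, b):
    """Jaccard similarity between two word sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_similarities(terms):
    """Return {(term1, term2): similarity} for every unordered pair of terms."""
    vectors = {t: word_set(t) for t in terms}
    return {
        (t1, t2): jaccard(vectors[t1], vectors[t2])
        for t1, t2 in combinations(terms, 2)
    }

if __name__ == "__main__":
    sims = pairwise_similarities(TERMS)
    nonzero = {pair: s for pair, s in sims.items() if s > 0}
    print(f"{len(nonzero)} of {len(sims)} pairs share at least one word")
    for (t1, t2), s in sorted(nonzero.items()):
        print(f"{t1!r} ~ {t2!r}: {s:.2f}")
```

With words as the only features, most term pairs have zero similarity, which hints at the sparsity the abstract mentions. For short terms, a single added word already changes the score noticeably, which is one way a word-based representation can become sensitive to term length.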

Keywords

Multi-word term clustering; Lexico-syntactic relations; Text mining; Informetrics; Cluster evaluation

Domains

Information and communication sciences

Main file

ipm.pdf (259.05 KB)

Origin: Files produced by the author(s)

Contributor: Fidelia Ibekwe

https://hal.science/hal-00636111

Submitted on: Wednesday, 2 November 2011, 19:36:22

Last modified on: Tuesday, 3 October 2023, 14:14:03

Long-term archiving on: Friday, 3 February 2012, 02:21:12

Dates and versions

hal-00636111, version 1 (02-11-2011)

Identifiers

HAL Id: hal-00636111, version 1
DOI: 10.1016/j.ipm.2006.03.017

Cite

Eric Sanjuan, Fidelia Ibekwe-Sanjuan. Textmining without document context. Information Processing and Management, 2006, 42 (6), pp.1532-1552. ⟨10.1016/j.ipm.2006.03.017⟩. ⟨hal-00636111⟩

Collections

UNIV-LYON3 UNIV-AVIGNON UNIV-LYON1 UNIV-LYON2 ELICO LIA UDL

420 views

604 downloads
