Textual data summarization using the Self-Organized Co-Clustering model

Margot Selosse; Julien Jacques; Christophe Biernacki

doi:10.1016/j.patcog.2020.107315

Journal article, Pattern Recognition, Year: 2020


Margot Selosse

Role: Author

Entrepôts, Représentation et Ingénierie des Connaissances

Julien Jacques

Role: Author
PersonId : 173226
IdHAL : julien-jacques
ORCID : 0000-0003-4808-2781
IdRef : 098191551

Entrepôts, Représentation et Ingénierie des Connaissances

Christophe Biernacki

Role: Author
PersonId : 923939

MOdel for Data Analysis and Learning

Abstract

Recently, different studies have demonstrated the use of co-clustering, a data mining technique that simultaneously produces row-clusters of observations and column-clusters of features. The present work introduces a novel co-clustering model to easily summarize textual data in a document-term format. In addition to highlighting homogeneous co-clusters, as other existing algorithms do, we also distinguish noisy co-clusters from significant co-clusters, which is particularly useful for sparse document-term matrices. Furthermore, our model proposes a structure among the significant co-clusters, thus providing improved interpretability to users. The proposed approach competes with state-of-the-art methods for document and term clustering and offers user-friendly results. The model relies on the Poisson distribution and on a constrained version of the Latent Block Model, which is a probabilistic approach for co-clustering. A Stochastic Expectation-Maximization algorithm is proposed to run the model’s inference, as well as a model selection criterion to choose the number of co-clusters. Both simulated and real data sets illustrate the efficiency of this model by its ability to easily identify relevant co-clusters.
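As a rough illustration of the Poisson block structure the abstract refers to, the sketch below evaluates a generic, unconstrained Poisson block model on a small document-term matrix with fixed row and column partitions. It is not the authors' constrained model or their Stochastic Expectation-Maximization algorithm; the function names `block_rate_estimates` and `poisson_block_loglik` and the toy matrix are assumptions made for illustration only.

```python
import numpy as np
from math import lgamma


def block_rate_estimates(X, z, w, K, L):
    """Estimate one Poisson rate per (row-cluster, column-cluster) block.

    With the partitions z and w held fixed, the maximum-likelihood rate
    of a block is simply the mean count of the cells in that block.
    """
    gamma = np.zeros((K, L))
    for k in range(K):
        for l in range(L):
            block = X[np.ix_(z == k, w == l)]
            gamma[k, l] = block.mean() if block.size else 0.0
    return gamma


def poisson_block_loglik(X, z, w, gamma):
    """Poisson log-likelihood of X given partitions and block rates.

    Assumes every rate in gamma is strictly positive.
    """
    rates = gamma[np.ix_(z, w)]                # rate of each cell (i, j)
    log_fact = np.vectorize(lgamma)(X + 1.0)   # log(x_ij!)
    return float(np.sum(X * np.log(rates) - rates - log_fact))


# Hypothetical toy document-term matrix: 3 documents, 3 terms.
X = np.array([[3, 5, 1],
              [5, 3, 1],
              [1, 1, 6]])
z = np.array([0, 0, 1])   # document (row) cluster labels
w = np.array([0, 0, 1])   # term (column) cluster labels

gamma_hat = block_rate_estimates(X, z, w, K=2, L=2)
print(gamma_hat)
print(poisson_block_loglik(X, z, w, gamma_hat))
```

In this toy setting the matrix of block rates acts as the summary of the document-term matrix: blocks with high rates mark terms that are frequent in a given group of documents.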

Keywords

Co-clustering; Latent Block Model; document-term matrix

Domains

Statistics [math.ST]

Main file

manuscript.pdf (1.23 MB)

Origin: Files produced by the author(s)


https://hal.science/hal-02115294

Submitted on: Monday, 24 February 2020, 18:27:34

Last modified on: Saturday, 20 April 2024, 03:09:11

Dates and versions

hal-02115294 , version 1 (30-04-2019)

hal-02115294 , version 2 (09-12-2019)

hal-02115294 , version 3 (24-02-2020)

Identifiers

HAL Id : hal-02115294 , version 3
DOI : 10.1016/j.patcog.2020.107315

Cite

Margot Selosse, Julien Jacques, Christophe Biernacki. Textual data summarization using the Self-Organized Co-Clustering model. Pattern Recognition, 2020, 103, pp.107315. ⟨10.1016/j.patcog.2020.107315⟩. ⟨hal-02115294v3⟩


Collections

CNRS INRIA UNIV-LYON1 UNIV-LYON2 ERIC INRIA2 UNIV-LILLE UDL LPP-MATH

