Correlation between textual similarity and quality of LDA topic model results

Abstract : The LDA topic model describes a corpus on the basis of its vocabulary. Our experiment aims to determine whether the quality of LDA outputs can be estimated through text similarity metrics and, if so, which metric is the most relevant. To do so, we use a categorised corpus and apply these metrics to every pair of categories. We present correlation scores between several metrics and the quality of the topic model. The experiments also include a comparison between simple and complex term extraction within our framework. We observed very high correlations with the Hellinger distance, with or without complex terms, while the Soergel distance is most effective when complex terms are included. These experiments are a case study on a categorised corpus of 20,000 article abstracts.
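
For readers unfamiliar with the two distances named in the abstract, the following is a minimal Python sketch of their standard textbook definitions, applied to a pair of categories. It is not the authors' implementation. The abstract does not say how categories are turned into vectors, so the representation used here (relative term frequencies over a shared vocabulary) is an assumption, and the helper name term_distribution and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary):
    """Relative frequency of each vocabulary term in a token list.

    Assumption: a category is represented by normalised term counts;
    the abstract does not specify the representation actually used.
    """
    counts = Counter(tokens)
    total = sum(counts[term] for term in vocabulary)
    if total == 0:
        return [0.0 for _ in vocabulary]
    return [counts[term] / total for term in vocabulary]


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Ranges from 0 (identical) to 1 (disjoint support).
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def soergel(x, y):
    """Soergel distance between two non-negative vectors.

    Sum of absolute differences divided by the sum of element-wise maxima.
    """
    denominator = sum(max(a, b) for a, b in zip(x, y))
    if denominator == 0:
        return 0.0  # both vectors are all zeros, treated as identical
    return sum(abs(a - b) for a, b in zip(x, y)) / denominator


# Hypothetical toy data: two "categories" as flat token lists.
category_a = ["topic", "model", "topic", "corpus"]
category_b = ["corpus", "model", "distance", "distance"]
vocabulary = sorted(set(category_a) | set(category_b))

p = term_distribution(category_a, vocabulary)
q = term_distribution(category_b, vocabulary)
print("Hellinger:", hellinger(p, q))
print("Soergel:", soergel(p, q))
```

Both functions return 0 for identical inputs and 1 for inputs with no terms in common, which makes the scores for different category pairs directly comparable.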
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-02390357
Contributor : Florent Breuil
Submitted on : Tuesday, December 3, 2019 - 9:06:36 AM
Last modification on : Monday, January 13, 2020 - 5:46:07 PM

Citation

Amaury Delamaire, Mihaela Juganaru-Mathieu, Michel Beigbeder. Correlation between textual similarity and quality of LDA topic model results. 2019 13th International Conference on Research Challenges in Information Science (RCIS), May 2019, Brussels, Belgium. pp.1-6, ⟨10.1109/RCIS.2019.8877076⟩. ⟨hal-02390357⟩
