A comparison between latent semantic analysis and correspondence analysis

Julie Séguéla; Gilbert Saporta

Conference paper. Year: 2011


Julie Séguéla

Role: Author
PersonId : 964638

Centre d'études et de recherche en informatique et communications

Gilbert Saporta

Role: Author
PersonId : 180161
IdHAL : gilbert-saporta
ORCID : 0000-0002-3406-5887
IdRef : 027122565

Centre d'études et de recherche en informatique et communications

Abstract

Latent Semantic Analysis (LSA) is a technique for analyzing textual data through a singular value decomposition (SVD) of term-document matrices (Deerwester et al., 1990; Landauer et al., 2007). The basic postulate is that word usage data has an underlying latent semantic structure that is partially hidden or obscured by the variability of word choice (the synonymy problem). LSA is also called Latent Semantic Indexing (LSI) in information retrieval, where the main application consists of computing similarities between a user's query and all documents in the space, or between documents. Since LSA is an SVD of a contingency table, it strongly resembles Correspondence Analysis (CA); see Lebart et al. (1998). Before performing the SVD, practitioners of LSA recommend several weighting functions for the frequencies, but not the one leading to the chi-square metric. Typically, LSA reduces the dimensionality of a huge but sparse data matrix from several thousand to several hundred. With that many dimensions, graphical representations are of little use. In statistical applications, the coordinates can be used for categorization tasks (in supervised or unsupervised frameworks). We first compare basic LSA with CA on a toy example. Then the performances of CA and of LSA with several weighting functions are compared on a large data set of job offers posted on the web. When posted on the internet, the job offers were labeled by recruiters according to job category (e.g. Marketing, Information Systems, Finance). We are interested in the capacity of these document representation techniques to recover the real job category with a clustering method. After preprocessing the job offers, we compute similarities between texts based on their coordinates in the reduced spaces and apply a hybrid method combining hierarchical clustering and the k-means algorithm. The performance of the text representation methods is assessed with three different measures (Cohen's kappa, Rand index, F-measure) and discussed according to the number of dimensions kept.
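The contrast the abstract draws between basic LSA and CA can be illustrated with a minimal sketch, assuming NumPy. The toy count matrix, the function names and the choice of two dimensions are hypothetical and are not taken from the paper. Basic LSA is taken here as an SVD of the raw counts, while CA is computed as an SVD of the standardized residuals, which is what introduces the chi-square metric.

```python
import numpy as np

# Hypothetical toy term-document count matrix (rows: terms, columns: documents).
N = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [0, 0, 2, 2],
    [1, 0, 1, 3],
], dtype=float)


def lsa_doc_coords(counts, k):
    """Basic LSA: truncated SVD of the raw counts; documents in k dimensions."""
    _, s, vt = np.linalg.svd(counts, full_matrices=False)
    return vt[:k].T * s[:k]


def ca_doc_coords(counts, k):
    """CA: SVD of standardized residuals; returns document principal
    coordinates and the principal inertias of the k axes kept."""
    p = counts / counts.sum()          # correspondence matrix
    r = p.sum(axis=1)                  # term (row) masses
    c = p.sum(axis=0)                  # document (column) masses
    expected = np.outer(r, c)          # independence model
    resid = (p - expected) / np.sqrt(expected)
    _, s, vt = np.linalg.svd(resid, full_matrices=False)
    coords = (vt[:k].T * s[:k]) / np.sqrt(c)[:, None]
    return coords, s[:k] ** 2


def cosine_similarities(coords):
    """Pairwise cosine similarities between the rows of coords."""
    unit = coords / np.linalg.norm(coords, axis=1, keepdims=True)
    return unit @ unit.T


lsa = lsa_doc_coords(N, 2)
ca, inertias = ca_doc_coords(N, 2)
print(np.round(cosine_similarities(lsa), 2))
print(np.round(cosine_similarities(ca), 2))
```

The two similarity matrices generally differ because CA centers the table on the independence model and rescales by the row and column masses, whereas the unweighted SVD is dominated by the most frequent terms and the longest documents. Either set of coordinates could then be passed to a clustering routine, as the abstract describes.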

Keywords

textual data; web data; correspondence analysis; latent semantic analysis

Domains

Computer Science [cs]; Statistics [math.ST]

SeguelaSaporta070211.pdf (746.78 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-01125854

Submitted on: Friday, 11 December 2020, 14:37:52

Last modified on: Monday, 8 April 2024, 14:27:14

Dates and versions

hal-01125854 , version 1 (11-12-2020)

Identifiers

HAL Id : hal-01125854 , version 1

Cite

Julie Séguéla, Gilbert Saporta. A comparison between latent semantic analysis and correspondence analysis. CARME 2011 International conference on Correspondence Analysis and Related Methods, Feb 2011, Rennes, France. pp.20. ⟨hal-01125854⟩

Collections

CNAM CEDRIC-CNAM HESAM
