Conference paper, Year: 2011

A comparison between latent semantic analysis and correspondence analysis

Abstract

Latent Semantic Analysis (LSA) is a technique for analyzing textual data through a singular value decomposition (SVD) of term-document matrices (Deerwester et al., 1990; Landauer et al., 2007). The basic postulate is that word usage data have an underlying latent semantic structure that is partially hidden or obscured by the variability of word choice (the synonymy problem). In information retrieval, LSA is also called Latent Semantic Indexing (LSI); its main application there is computing similarities between a user's query and all documents in the space, or between documents. Since LSA is an SVD of a contingency table, it strongly resembles Correspondence Analysis (CA); see Lebart et al. (1998). Before performing the SVD, practitioners of LSA recommend several weighting functions for the frequencies, but not the one leading to the chi-square metric. Typically, LSA reduces the dimensionality of a huge but sparse data matrix from several thousand to several hundred. With that many dimensions retained, graphical representations are useless. In statistical applications, the coordinates can instead be used for categorization tasks, in supervised or unsupervised frameworks. We first compare basic LSA with CA on a toy example. We then compare the performance of CA and of LSA with several weighting functions on a large data set of job offers posted on the web. When posting the offers on the internet, recruiters labeled them by job category (e.g. Marketing, Information Systems, Finance). We are interested in the ability of these document representation techniques to recover the real job category through a clustering method. After preprocessing the job offers, we compute similarities between texts from their coordinates in the reduced spaces and apply a hybrid method combining hierarchical clustering and the k-means algorithm. The performance of the text representation methods will be assessed with three different measures (Cohen's Kappa, Rand index, F-measure) and discussed according to the number of dimensions kept.
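
The resemblance between basic LSA and CA noted in the abstract can be illustrated with a minimal sketch. It assumes a small hypothetical term-by-document count matrix (not the toy example from the paper), and the function names `lsa_doc_coords` and `ca_doc_coords` are illustrative only. The weighting functions used with LSA in practice are omitted here; the only difference shown is that CA centers the table and rescales it by the row and column masses, which is what induces the chi-square metric.

```python
import numpy as np

def lsa_doc_coords(N, k):
    """Basic LSA: truncated SVD of the raw term-by-document counts."""
    U, s, Vt = np.linalg.svd(N, full_matrices=False)
    # Documents are the columns of N; scale right singular vectors by s.
    return Vt[:k].T * s[:k], s

def ca_doc_coords(N, k):
    """CA: SVD of the standardized residuals of the same table."""
    P = N / N.sum()                # correspondence matrix
    r = P.sum(axis=1)              # row (term) masses
    c = P.sum(axis=0)              # column (document) masses
    expected = np.outer(r, c)      # independence model
    S = (P - expected) / np.sqrt(expected)
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the columns (documents).
    coords = (Vt[:k].T * s[:k]) / np.sqrt(c)[:, None]
    return coords, s

# Hypothetical counts: 5 terms (rows) by 4 documents (columns).
N = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [1, 1, 1, 1],
], dtype=float)

lsa_coords, lsa_s = lsa_doc_coords(N, k=2)
ca_coords, ca_s = ca_doc_coords(N, k=2)

# Total inertia in CA is the sum of squared singular values.
total_inertia = float((ca_s ** 2).sum())
print("LSA document coordinates:\n", lsa_coords)
print("CA document coordinates:\n", ca_coords)
print("CA total inertia:", total_inertia)
```

In this sketch the CA total inertia equals the chi-square statistic of the table divided by its grand total, whereas the LSA singular values depend directly on the raw frequencies.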
SeguelaSaporta070211.pdf (746.78 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01125854, version 1 (11-12-2020)

Identifiers

  • HAL Id: hal-01125854, version 1

Cite

Julie Séguéla, Gilbert Saporta. A comparison between latent semantic analysis and correspondence analysis. CARME 2011 International conference on Correspondence Analysis and Related Methods, Feb 2011, Rennes, France. pp.20. ⟨hal-01125854⟩