Word Sense Clustering and Clusterability

Diana Mccarthy; Marianna Apidianaki; Katrin Erk

doi:10.1162/COLI

Journal article, Computational Linguistics, Year: 2016


Diana Mccarthy

Role: Author

Marianna Apidianaki

Role: Author
PersonId : 20607
IdHAL : marianna-apidianaki

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Katrin Erk

Role: Author

Abstract

Word sense disambiguation and the related field of automated word sense induction traditionally assume that the occurrences of a lemma can be partitioned into senses. But this seems to be a much easier task for some lemmas than others. Our work builds on recent work that proposes describing word meaning in a graded fashion rather than through a strict partition into senses; in this article we argue that not all lemmas may need the more complex graded analysis, depending on their partitionability. Although there is plenty of evidence from previous studies and from the linguistics literature that there is a spectrum of partitionability of word meanings, this is the first attempt to measure the phenomenon and to couple the machine learning literature on clusterability with word usage data used in computational linguistics. We propose to operationalize partitionability as clusterability, a measure of how easy the occurrences of a lemma are to cluster. We test two ways of measuring clusterability: (1) existing measures from the machine learning literature that aim to measure the goodness of optimal k-means clusterings, and (2) the idea that if a lemma is more clusterable, two clusterings based on two different views of the same data points will be more congruent. The two views that we use are two different sets of manually constructed lexical substitutes for the target lemma, on the one hand monolingual paraphrases, and on the other hand translations. We apply automatic clustering to the manual annotations. We use manual annotations because we want the representations of the instances that we cluster to be as informative and clean as possible. We show that when we control for polysemy, our measures of clusterability tend to correlate with partitionability, in particular some of the type-(1) clusterability measures, and that these measures outperform a baseline that relies on the amount of overlap in a soft clustering.
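The two families of measures described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the specific measures, so the Rand index (as a congruence score between two clusterings) and a within-to-total sum-of-squares ratio (as a goodness score for a given clustering) are stand-ins chosen for illustration. The function names `rand_index` and `within_to_total_ss` and the toy data are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of instance pairs on which two clusterings agree.

    A pair counts as agreement when both clusterings put the two
    instances together, or both keep them apart. Cluster names are
    irrelevant; only the induced partition matters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same instances")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two instances")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def within_to_total_ss(points, labels):
    """Within-cluster sum of squares divided by total sum of squares.

    Values near 0 indicate tight, well-separated clusters; values near 1
    indicate that the clustering explains little of the spread.
    """
    def sum_of_squares(rows):
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        return sum(
            sum((x - m) ** 2 for x, m in zip(row, centroid)) for row in rows
        )

    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = sum_of_squares(points)
    if total == 0:
        raise ValueError("all points are identical")
    return sum(sum_of_squares(rows) for rows in clusters.values()) / total


# Hypothetical cluster labels for six occurrences of one lemma, obtained
# from two views (assumed here: paraphrases and translations).
paraphrase_view = [0, 0, 0, 1, 1, 1]
translation_view = [1, 1, 1, 0, 0, 0]   # same partition, different names
noisy_view = [0, 0, 1, 1, 0, 1]         # partially conflicting partition

congruent = rand_index(paraphrase_view, translation_view)
conflicting = rand_index(paraphrase_view, noisy_view)

# Hypothetical one-dimensional instance representations with two tight groups.
toy_points = [[0.0], [1.0], [10.0], [11.0]]
tightness = within_to_total_ss(toy_points, [0, 0, 1, 1])
```

Under this reading, a lemma whose two views yield a high `rand_index` and whose clustering has a low `within_to_total_ss` would count as more clusterable. In practice a chance-corrected congruence score may be preferable to the raw Rand index, which is used here only because it is easy to verify by hand.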

Keywords

word sense clustering clusterability

Domains

Computer Science [cs]

Main file

COLI_a_00247.pdf (431.78 KB)

Origin: Files produced by the author(s)


https://hal.science/hal-01838502

Submitted on: Friday, 13 July 2018, 14:42:24

Last modified on: Saturday, 7 October 2023, 21:36:20

Long-term archiving on: Monday, 15 October 2018, 10:03:52

Dates and versions

hal-01838502 , version 1 (13-07-2018)

Identifiers

HAL Id : hal-01838502 , version 1
DOI : 10.1162/COLI

Cite

Diana Mccarthy, Marianna Apidianaki, Katrin Erk. Word Sense Clustering and Clusterability. Computational Linguistics, 2016, 42, pp.245-275. ⟨10.1162/COLI⟩. ⟨hal-01838502⟩


Collections

CNRS LIMSI UNIV-PARIS-SACLAY SORBONNE-UNIVERSITE LISN GS-ENGINEERING GS-COMPUTER-SCIENCE

32 views

66 downloads
