Proper Noun Semantic Clustering using Bag-Of-Vectors

Ali-Reza Ebadat [1], Vincent Claveau [1], Pascale Sébillot [1]
[1] TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : In this paper, we propose a model for semantic clustering of entities extracted from a text, and we apply it to a Proper Noun classification task. This model is based on a new method to compute the similarity between the entities. The classical way of calculating similarity is to build a feature vector, or Bag-of-Features, for each entity and then apply a standard similarity function such as cosine. In practice, the features are contextual ones, such as the words around the different occurrences of each entity. Here, we propose an alternative representation for entities, called Bag-Of-Vectors, or Bag-of-Bags-of-Features. In this new model, each entity is not defined as a single vector but as a set of vectors, in which each vector is built from the contextual features of one occurrence of the entity. In order to use Bag-Of-Vectors for clustering, we introduce new versions of classical similarity functions such as Cosine, Jaccard and Scalar Product. Experimentally, we show that the Bag-Of-Vectors representation always improves the clustering results compared to classical Bag-Of-Features representations.
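To make the contrast between the two representations concrete, here is a minimal Python sketch. It is not taken from the paper: the helper names (`context_vector`, `bag_of_features`, `bov_similarity`), the window size, and above all the choice of averaging pairwise cosines to compare two bags of vectors are assumptions made for illustration. The abstract does not give the authors' actual definitions of their generalized Cosine, Jaccard and Scalar Product, which may differ.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, index, window=2):
    """Count the words within `window` positions of one entity occurrence.

    The occurrence itself (tokens[index]) is excluded. The window size is
    an illustrative assumption.
    """
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return Counter(left + right)


def cosine(u, v):
    """Classical cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def bag_of_features(occurrence_vectors):
    """Bag-of-Features: merge all occurrence contexts into a single vector."""
    total = Counter()
    for vec in occurrence_vectors:
        total.update(vec)
    return total


def bov_similarity(bag_a, bag_b, sim=cosine):
    """Compare two Bag-Of-Vectors (lists of per-occurrence vectors).

    ASSUMPTION: the average of all pairwise similarities is used here as one
    plausible way to lift a vector similarity to sets of vectors. It is not
    claimed to be the paper's definition.
    """
    if not bag_a or not bag_b:
        return 0.0
    total = sum(sim(u, v) for u in bag_a for v in bag_b)
    return total / (len(bag_a) * len(bag_b))


# Two toy entities, each seen in two contexts (hypothetical data).
paris = [Counter({"capital": 1, "france": 1}), Counter({"flight": 1, "to": 1})]
london = [Counter({"capital": 1, "england": 1}), Counter({"flight": 1, "to": 1})]

bof_score = cosine(bag_of_features(paris), bag_of_features(london))
bov_score = bov_similarity(paris, london)
```

In this toy case the Bag-of-Features score mixes both contexts into one comparison, whereas the Bag-Of-Vectors score keeps each occurrence separate, so a clustering algorithm could also use other aggregations (for instance the best-matching pair) instead of the average assumed here.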

Contributor : Vincent Claveau
Submitted on : Monday, December 3, 2012 - 2:48:54 PM
Last modification on : Friday, November 16, 2018 - 1:24:20 AM
Long-term archiving on : Monday, March 4, 2013 - 3:49:51 AM




  • HAL Id : hal-00760105, version 1


Ali-Reza Ebadat, Vincent Claveau, Pascale Sébillot. Proper Noun Semantic Clustering using Bag-Of-Vectors. ANLP - Applied Natural Language Processing conference. Special track at the 25th International FLAIRS Conference, May 2012, Marco Island, FL, United States. ⟨hal-00760105⟩


