Journal article. Proceedings of Machine Learning Research. Year: 2014

Scalable Graph Building from Text Data

Thibault Debatty
  • Role: Author
Pietro Michiardi
  • Role: Author
  • PersonId: 1084771
Olivier Thonnard
  • Role: Author
  • PersonId: 965107
Wim Mees
  • Role: Author
  • PersonId: 1008623

Abstract

In this paper we propose NNCTPH, a new MapReduce algorithm that builds an approximate k-NN graph from large text datasets. The algorithm uses a modified version of Context Triggered Piecewise Hashing (CTPH) to bin the input data into buckets, then performs an exhaustive search inside each bucket to build the graph. It also uses multiple stages to join the otherwise unconnected subgraphs. We test the algorithm experimentally on several datasets consisting of spam email subject lines. Although the algorithm is still at an early stage of development, it already proves to be four times faster than a MapReduce implementation of NN-Descent for the same quality of produced graph.
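The bucket-then-search idea in the abstract can be illustrated with a small single-machine sketch. This is not the authors' NNCTPH implementation: `toy_bucket_key` is a hypothetical stand-in for the modified CTPH hash, token-set Jaccard similarity is an assumed similarity measure, and neither the MapReduce execution nor the multi-stage joining of subgraphs is modelled. All function names are illustrative.

```python
from collections import defaultdict
import heapq


def toy_bucket_key(text: str, length: int = 2) -> str:
    """Hypothetical stand-in for the paper's modified CTPH (NOT CTPH itself).

    Texts that share their first `length` letters land in the same bucket.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    return "".join(letters[:length])


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an assumed similarity measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def build_knn_graph(texts, k=2):
    """Build an approximate k-NN graph: bucket first, then brute force per bucket."""
    # Binning step: group item indices by bucket key.
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        buckets[toy_bucket_key(text)].append(idx)

    # Search step: exhaustive comparison, but only inside each bucket.
    graph = {idx: [] for idx in range(len(texts))}
    for members in buckets.values():
        for i in members:
            candidates = [(jaccard(texts[i], texts[j]), j) for j in members if j != i]
            # Keep the k most similar neighbours as (similarity, index) pairs.
            graph[i] = heapq.nlargest(k, candidates)
    return graph
```

Because edges are only created inside a bucket, the result consists of disconnected subgraphs, one per bucket. That is the gap the additional joining stages mentioned in the abstract are meant to close.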
Main file
bigmine2014_debatty.pdf (238.32 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01525743, version 1 (22-05-2017)

Identifiers

  • HAL Id: hal-01525743, version 1

Cite

Thibault Debatty, Pietro Michiardi, Olivier Thonnard, Wim Mees. Scalable Graph Building from Text Data. Proceedings of Machine Learning Research, 2014, Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, 36, pp.120-132. ⟨hal-01525743⟩

Collections

EURECOM
48 views
159 downloads
