Benchmarking Top-K Keyword and Top-K Document Processing with T²K² and T²K²D²

Abstract: Top-k keyword and top-k document extraction are widely used text analysis techniques. Top-k keywords and documents are often computed on the fly, but they rely on weighted vocabularies that are costly to build. Benchmarking is the customary way to compare competing weighting schemes and database implementations, yet to the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present T²K², a top-k keywords and documents benchmark, and its decision support-oriented evolution T²K²D². Both benchmarks feature a real tweet dataset and queries with various complexities and selectivities. They help evaluate weighting schemes and database implementations in terms of computing performance. To illustrate our benchmarks' relevance and genericity, we successfully ran performance tests on the TF-IDF and Okapi BM25 weighting schemes, on the one hand, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations, on the other hand.
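
As a rough illustration of what a top-k keyword computation over a weighted vocabulary involves, the sketch below scores terms with TF-IDF or Okapi BM25 and returns the k best. It is a minimal sketch under stated assumptions, not the benchmark's implementation: the formulas are common textbook variants and may differ from the ones the paper uses, and the function name `top_k_keywords`, the whitespace tokenizer, and the defaults `k1=1.2` and `b=0.75` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def top_k_keywords(docs, k, scheme="tfidf", k1=1.2, b=0.75):
    """Return the k terms with the highest summed weight over a non-empty corpus.

    scheme="tfidf" uses (term count / document length) * log(N / n_t).
    Any other value uses a textbook Okapi BM25 term weight with the same IDF.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in tokenized:
        counts = Counter(tokens)
        for term, f in counts.items():
            idf = math.log(n_docs / doc_freq[term])
            if scheme == "tfidf":
                weight = (f / len(tokens)) * idf
            else:
                # Length normalisation relative to the average document length.
                norm = k1 * (1 - b + b * len(tokens) / avg_len)
                weight = idf * f * (k1 + 1) / (f + norm)
            scores[term] += weight
    return scores.most_common(k)


if __name__ == "__main__":
    sample = [
        "big data benchmark",
        "big data tweets",
        "big query benchmark benchmark",
    ]
    print(top_k_keywords(sample, 2, scheme="tfidf"))
    print(top_k_keywords(sample, 2, scheme="bm25"))
```

The per-term weights are the expensive part to build, which is why a system would typically precompute or store them and answer top-k queries against the stored values; this sketch recomputes everything on each call for brevity.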

https://hal.archives-ouvertes.fr/hal-01717121
Contributor: Jérôme Darmont
Submitted on: Thursday, April 19, 2018 - 14:13:27
Last modified on: Wednesday, October 31, 2018 - 12:24:20
Document(s) archived on: Tuesday, September 18, 2018 - 14:15:52

Files

benchmark.pdf
Files produced by the author(s)

Citation

Ciprian-Octavian Truica, Jérôme Darmont, Alexandru Boicea, Florin Radulescu. Benchmarking Top-K Keyword and Top-K Document Processing with T²K² and T²K²D². Future Generation Computer Systems, Elsevier, 2018, 85, pp.60-75. 〈https://www.sciencedirect.com/science/article/pii/S0167739X17323580〉. 〈10.1016/j.future.2018.02.037〉. 〈hal-01717121〉

Metrics

Record views: 111

File downloads: 66