Benchmarking Top-K Keyword and Top-K Document Processing with T²K² and T²K²D²

Ciprian-Octavian Truica; Jérôme Darmont; Alexandru Boicea; Florin Radulescu

doi:10.1016/j.future.2018.02.037

Journal article, Future Generation Computer Systems, Year: 2018


Ciprian-Octavian Truica

Role: Author
PersonId : 5322
IdHAL : ciprian-octavian-truica
IdRef : 253129265

University Politehnica of Bucharest [Romania]

Jérôme Darmont

Role: Author
PersonId : 14011
IdHAL : jerome-darmont
ORCID : 0000-0003-1491-384X
IdRef : 081304668

Entrepôts, Représentation et Ingénierie des Connaissances

Alexandru Boicea

Role: Author
PersonId : 969674

University Politehnica of Bucharest [Romania]

Florin Radulescu

Role: Author
PersonId : 1028757

University Politehnica of Bucharest [Romania]

Abstract

Top-k keyword and top-k document extraction are very popular text analysis techniques. Top-k keywords and documents are often computed on the fly, but they exploit weighted vocabularies that are costly to build. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present T²K², a top-k keywords and documents benchmark, and its decision support-oriented evolution T²K²D². Both benchmarks feature a real tweet dataset and queries with various complexities and selectivities. They help evaluate weighting schemes and database implementations in terms of computing performance. To illustrate our benchmarks' relevance and genericity, we successfully ran performance tests on the TF-IDF and Okapi BM25 weighting schemes, on one hand, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations, on the other hand.
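To make the terms in the abstract concrete, the following is a minimal Python sketch of top-k keyword and top-k document extraction over a weighted vocabulary. It is not the benchmark's implementation: it uses common textbook forms of TF-IDF and Okapi BM25 (the BM25 IDF here adds 1 inside the logarithm so weights stay non-negative), a whitespace tokenizer, and hypothetical function names (`tf_idf_weights`, `bm25_weights`, `top_k_keywords`, `top_k_documents`) chosen only for illustration. The default parameters `k1=1.2` and `b=0.75` are conventional values, assumed rather than taken from the paper.

```python
import heapq
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def document_frequencies(tokenized):
    # Number of documents in which each term occurs at least once.
    return Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf_weights(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n, df = len(docs), document_frequencies(tokenized)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in counts.items()})
    return weights


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: weight} dict per document using an Okapi BM25 variant."""
    tokenized = [tokenize(d) for d in docs]
    n, df = len(docs), document_frequencies(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(tokens) / avg_len)
        weights.append({
            t: math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1) * c * (k1 + 1) / (c + norm)
            for t, c in counts.items()
        })
    return weights


def top_k_keywords(weights, k):
    """Sum each term's weight over all documents and keep the k heaviest terms."""
    totals = Counter()
    for doc_weights in weights:
        totals.update(doc_weights)
    return totals.most_common(k)


def top_k_documents(weights, query_terms, k):
    """Score each document by the summed weights of the query terms; return indices."""
    scores = [(sum(w.get(t, 0.0) for t in query_terms), i)
              for i, w in enumerate(weights)]
    return [i for score, i in heapq.nlargest(k, scores) if score > 0]


if __name__ == "__main__":
    corpus = ["big data benchmark", "big data tweets tweets", "tweets about football"]
    for scheme in (tf_idf_weights, bm25_weights):
        w = scheme(corpus)
        print(scheme.__name__, top_k_keywords(w, 3), top_k_documents(w, ["tweets"], 2))
```

In this sketch the weighted vocabulary is built once by `tf_idf_weights` or `bm25_weights`, and the top-k functions only aggregate and rank it, which mirrors the split the abstract describes between a costly weighting step and on-the-fly top-k queries.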

Keywords

top-k keywords; top-k documents; text analytics; benchmarking; weighting schemes; database systems

Domains

Databases [cs.DB]; Document and Text Processing

Main file

benchmark.pdf (944.06 KB)

Origin: Files produced by the author(s)

Contributor: Jérôme Darmont

https://hal.science/hal-01717121

Submitted on: Thursday, April 19, 2018, 14:13:27

Last modified on: Saturday, December 10, 2022, 20:17:00

Long-term archiving on: Tuesday, September 18, 2018, 14:15:52

Dates and versions

hal-01717121, version 1 (19-04-2018)

License

Attribution

Identifiers

HAL Id: hal-01717121, version 1
arXiv: 1804.07525
DOI: 10.1016/j.future.2018.02.037

Cite

Ciprian-Octavian Truica, Jérôme Darmont, Alexandru Boicea, Florin Radulescu. Benchmarking Top-K Keyword and Top-K Document Processing with T²K² and T²K²D². Future Generation Computer Systems, 2018, 85, pp.60-75. ⟨10.1016/j.future.2018.02.037⟩. ⟨hal-01717121⟩

Collections

UNIV-LYON1 UNIV-LYON2 ERIC LABEXIMU UDL

234 views

331 downloads
