Conference paper, Year: 2020

MLWIKIR: A Python toolkit for building large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and more

Abstract

Deep learning has enabled new state-of-the-art performance on ad-hoc information retrieval (IR). This approach usually requires large amounts of annotated data to outperform traditional baselines such as BM25. However, most standard ad-hoc IR datasets publicly available for academic research (e.g., Robust04, ClueWeb09) have at most 250 annotated queries and are usually in English only. Deep learning models for IR (e.g., DUET, Conv-KNRM) perform poorly on such datasets because they are trained and evaluated on large-scale datasets collected from commercial search engines, which are not publicly available for academic research. This is a problem for reproducibility and the advancement of research. Moreover, most datasets are in English or Chinese only, and deep learning models for ad-hoc IR are not evaluated on other languages. In this paper, we propose MLWIKIR: an open-source toolkit to automatically build large-scale information retrieval datasets based on Wikipedia in 10 different languages, which can be adapted to any Wikipedia language given a tokenizer.
Main file
CIRCLE20_22.pdf (426.75 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-03263816, version 1 (17-06-2021)

Identifiers

  • HAL Id: hal-03263816, version 1

Cite

Jibril Frej, Didier Schwab, Jean-Pierre Chevallet. MLWIKIR: A Python toolkit for building large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and more. Joint Conference of the Information Retrieval Communities in Europe, Jul 2020, Toulouse, France. ⟨hal-03263816⟩
