Istex: A Database of Twenty Million Scientific Papers with a Mining Tool Which Uses Named Entities - HAL open archive
Journal article. Information. Year: 2019

Abstract

Istex is a database of twenty million full-text scientific papers purchased by the French Government for use by academic libraries. Papers are usually searched by title, authors, keywords, or possibly the abstract. To enable new types of queries in Istex, we ran named entity recognition on all papers and let users search on these entities. After presenting the French Istex project, this paper details named entity recognition with CasEN, a cascade of graphs implemented on the Unitex software. CasEN exists for French, but not for English. The first challenge was therefore to build a new cascade for English in a short time. Its evaluation showed good Precision, although Recall was not very good. Precision mattered most for this project, so that a query does not return unwanted papers. The second challenge was deploying Unitex to parse around twenty million documents; we used a dockerized application. Finally, we also explain how to query the resulting named entities on the Istex website.
Main file
information-10-00178.pdf (514.19 KB)
Origin: Publication funded by an institution

Dates and versions

hal-02152978, version 1 (05-12-2023)

License

Attribution

Cite

Denis Maurel, Enza Morale, Nicolas Thouvenin, Patrice Ringot, Angel Turri. Istex: A Database of Twenty Million Scientific Papers with a Mining Tool Which Uses Named Entities. Information, 2019, Natural Language Processing and Text Mining, 10 (5), pp. 178. ⟨10.3390/info10050178⟩. ⟨hal-02152978⟩