Duplicate Detection with Efficient Language Models for Automatic Bibliographic Heterogeneous Data Integration
Abstract
We present a new method for detecting duplicates when merging different bibliographic record corpora, using lexical and social information. As we show, no trivial key is available for discarding redundant documents. Merging heterogeneous document databases to gather as much information as possible is of interest: in our case, we build a document corpus about the TOR molecule from the PubMed and WebOfScience databases in order to extract its relationships with other gene components. Our approach builds key fingerprints based on n-grams. We constructed two gold-standard document sets from this corpus for evaluation. Compared with other well-known deduplication methods, ours achieves the best recall (95\%) and precision (100\%) scores.
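The abstract does not detail how the n-gram key fingerprints are computed. As an illustration only, the following is a minimal sketch of one common way to build such a deduplication key from character n-grams; the function name `fingerprint` and its parameters `n` and `k` are our own assumptions, not the paper's method.

```python
import hashlib
import re

def fingerprint(text: str, n: int = 3, k: int = 16) -> str:
    """Build a compact deduplication key from character n-grams.

    Normalizes the text, extracts character n-grams, keeps a stable
    sorted selection of them, and hashes the result so that records
    with near-identical titles map to the same key.
    (Illustrative sketch, not the paper's exact algorithm.)
    """
    # Normalize: lowercase, keep only letters and digits, so that
    # punctuation and spacing variants collapse to the same string.
    norm = re.sub(r"[^a-z0-9]", "", text.lower())
    # Character n-grams of the normalized string (as a set, so
    # repeated n-grams and their order do not matter).
    grams = {norm[i:i + n] for i in range(len(norm) - n + 1)}
    # A stable selection: the k lexicographically smallest n-grams.
    key = "".join(sorted(grams)[:k])
    return hashlib.md5(key.encode()).hexdigest()

# Two citation variants of the same article collide on the key,
# since both normalize to the same character string:
a = fingerprint("TOR signalling in growth control")
b = fingerprint("TOR Signalling in Growth Control.")
print(a == b)
```

In a deduplication pipeline, records sharing a fingerprint would be grouped into candidate blocks and then compared pairwise with the lexical and social criteria described in the paper.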