A distantly supervised dataset for automated data extraction from diagnostic studies

Christopher Norman; Mariska Leeflang; Rene Spijker; Evangelos Kanoulas; Aurélie Névéol

Conference paper, Year: 2019

Christopher Norman

Role: Author
PersonId : 1034700

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Mariska Leeflang

Role: Author

Rene Spijker

Role: Author

Evangelos Kanoulas

Role: Author

Aurélie Névéol

Role: Author
PersonId : 20620
IdHAL : aurelie-neveol
ORCID : 0000-0002-1846-9144
IdRef : 094239428

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Abstract

Systematic reviews are important in evidence-based medicine, but are expensive to produce. Automating or semi-automating the extraction of the index test, target condition, and reference standard from articles could decrease the cost of conducting systematic reviews of diagnostic test accuracy, but relevant training data is not available. We create a distantly supervised dataset of approximately 90,000 sentences, and have two experts manually annotate a small subset of around 1,000 sentences for evaluation. We evaluate the performance of BioBERT and logistic regression for ranking the sentences, and compare the performance of distant and direct supervision. Our results suggest that distant supervision can work as well as, or better than, direct supervision on this problem, and that distantly trained models can perform as well as, or better than, human annotators.

Domains

Computer Science [cs]; Computation and Language [cs.CL]

Contributor: Limsi Publications

https://hal.science/hal-02282792

Submitted on: Tuesday, 10 September 2019, 11:33:19

Last modified on: Saturday, 7 October 2023, 21:36:20

Dates et versions

hal-02282792 , version 1 (10-09-2019)

Identifiers

HAL Id : hal-02282792 , version 1

Cite

Christopher Norman, Mariska Leeflang, Rene Spijker, Evangelos Kanoulas, Aurélie Névéol. A distantly supervised dataset for automated data extraction from diagnostic studies. ACL Workshop on Biomedical Natural Language Processing, Aug 2019, Florence, Italy. ⟨hal-02282792⟩

Collections

CNRS LIMSI UNIV-PARIS-SACLAY SORBONNE-UNIVERSITE LISN GS-ENGINEERING GS-COMPUTER-SCIENCE

44 views

0 downloads
