Content-based subject classification at article level in biomedical context

Eric Jeangirard

Preprint, Working Paper. Year: 2021


Eric Jeangirard

Role: Author
PersonId: 175736
IdHAL: eric-jeangirard
ORCID: 0000-0002-3767-7125
IdRef: 242241344

Ministère de l'Education nationale, de l’Enseignement supérieur et de la Recherche

Abstract

Subject classification is an important task in the analysis of scholarly publications. Two kinds of approaches are mainly used: classification at the journal level and classification at the article level. We propose a mixed approach that leverages NLP embedding techniques to train classifiers on article metadata (title, abstract, and keywords in particular) labelled with the journal-level FoR (Fields of Research) classification, and then applies these classifiers at the article level. We use this approach in the context of biomedical publications, using metadata from PubMed. fastText classifiers are trained with FoR codes and used to classify publications based on their available metadata. Results show that using a stratified sampling strategy for training helps reduce the bias due to the unbalanced field distribution. An implementation of the method is available in the repository https://github.com/dataesr/scientific_tagger
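
The data preparation described in the abstract can be sketched as follows. This is a minimal illustration, not the code from the repository: it assumes that "stratified sampling" means capping the number of training records per field, and that records are formatted in the `__label__` line convention used by fastText supervised training. The function names, the record layout, and the labels `field_a` and `field_b` are hypothetical placeholders, not real FoR codes.

```python
import random
from collections import defaultdict


def to_fasttext_line(label, title, abstract="", keywords=()):
    """Format one record as a fastText supervised-training line."""
    text = " ".join([title, abstract, " ".join(keywords)])
    # Lowercase and collapse all whitespace (including newlines) to single spaces.
    text = " ".join(text.lower().split())
    return f"__label__{label.replace(' ', '_')} {text}"


def stratified_sample(records, per_label, seed=0):
    """Keep at most `per_label` randomly chosen records for each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        sample.extend(group[:per_label])
    return sample


# Hypothetical, deliberately unbalanced toy data: 8 records vs. 2 records.
records = [
    {"label": "field_a", "title": f"Clinical study {i}", "abstract": "", "keywords": []}
    for i in range(8)
] + [
    {"label": "field_b", "title": f"Genetics paper {i}", "abstract": "", "keywords": []}
    for i in range(2)
]

balanced = stratified_sample(records, per_label=2)
lines = [
    to_fasttext_line(r["label"], r["title"], r["abstract"], r["keywords"])
    for r in balanced
]
```

Assuming the `fasttext` Python package is installed, lines like these could be written to a text file and passed to `fasttext.train_supervised(input=...)`. The sampling rule and hyperparameters actually used by the author are not stated in the abstract.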

Keywords

open science, subject classification, fasttext, word embeddings, fields of research

Domains

Digital Libraries [cs.DL]

Main file

scientific_tagger.pdf (527.68 KB)

Origin: Files produced by the author(s)


https://hal.science/hal-03212544

Submitted on: Thursday, 29 April 2021, 16:47:04

Last modified on: Friday, 30 April 2021, 09:41:36

Long-term archiving on: Friday, 30 July 2021, 19:01:58

Dates and versions

hal-03212544, version 1 (29-04-2021)

License

Attribution

Identifiers

HAL Id: hal-03212544, version 1

Cite

Eric Jeangirard. Content-based subject classification at article level in biomedical context. 2021. ⟨hal-03212544⟩


70 views

42 downloads
