Classifying encyclopedia articles: Comparing machine and deep learning methods and exploring their predictions

Alice Brenon; Ludovic Moncla; Katherine Mcdonough

doi:10.1016/j.datak.2022.102098

Journal article, Data and Knowledge Engineering, Year: 2022

Alice Brenon

Role: Author
PersonId : 753084
IdHAL : alice-brenon
ORCID : 0000-0003-2276-7061

Data Mining and Machine Learning

Interactions, Corpus, Apprentissages, Représentations

Ludovic Moncla

Role: Author
PersonId : 172
IdHAL : ludovic-moncla
ORCID : 0000-0002-1590-9546
IdRef : 190370491

Data Mining and Machine Learning

Katherine Mcdonough

Role: Author

The Alan Turing Institute

Abstract

This article presents a comparative study of supervised classification approaches applied to the automatic classification of encyclopedia articles written in French. Our dataset is composed of 17 volumes of text from the Encyclopédie by Diderot and d'Alembert (1751-72), comprising about 70,000 articles. We combine text vectorization (bag-of-words and word embeddings) with machine learning methods, deep learning, and transformer architectures. In addition to evaluating these approaches, we review the classification predictions using a variety of quantitative and qualitative methods. The best model obtains an average F-score of 86% across 38 classes. Using network analysis, we highlight the difficulty of classifying semantically close classes. We also introduce examples of opportunities for qualitative evaluation of "misclassifications" in order to understand the relationship between content and different ways of ordering knowledge. We openly release all code and results obtained during this research.
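As an illustration of the kind of pipeline the abstract describes, here is a minimal sketch that pairs a bag-of-words representation with a supervised classifier and reports an averaged F-score. It is not the authors' released code. The multinomial Naive Bayes classifier, the whitespace tokenizer, the unweighted (macro) averaging, the function names, and the toy articles and class labels are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately naive assumption)."""
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Collect the bag-of-words counts needed by a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)  # label -> token -> count
    doc_counts = Counter(labels)        # label -> number of training articles
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    # Tokens never seen in training are ignored.
    tokens = [t for t in tokenize(text) if t in vocabulary]
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def macro_f_score(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Invented toy data: these are NOT articles from the Encyclopédie.
train_texts = [
    "rivière ville province royaume",
    "ville port mer province",
    "plante feuille fleur racine",
    "fleur tige plante graine",
]
train_labels = ["Géographie", "Géographie", "Botanique", "Botanique"]

model = train_naive_bayes(train_texts, train_labels)
test_texts = ["ville sur une rivière", "plante à fleur jaune"]
test_labels = ["Géographie", "Botanique"]
predictions = [predict(model, text) for text in test_texts]
score = macro_f_score(test_labels, predictions)
```

With many classes of unequal size, an unweighted average gives rare classes the same influence as frequent ones, whereas a support-weighted average favors the large classes. The abstract does not say which averaging its 86% figure uses, so the macro average here is only one plausible reading.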

Keywords

classification; supervised machine learning; deep learning; encyclopedia; computational humanities; networks

Domains

Information Retrieval [cs.IR]; Machine Learning [cs.LG]

Main file

Classifying_encyclopedia_DKE_preprint_submitted.pdf (12.01 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-03821073

Submitted on: Wednesday, 19 October 2022, 13:44:11

Last modified on: Thursday, 1 February 2024, 14:27:07

Dates and versions

hal-03821073 , version 1 (19-10-2022)

Identifiers

HAL Id : hal-03821073 , version 1
DOI : 10.1016/j.datak.2022.102098

Cite

Alice Brenon, Ludovic Moncla, Katherine Mcdonough. Classifying encyclopedia articles: Comparing machine and deep learning methods and exploring their predictions. Data and Knowledge Engineering, 2022, 142, pp.102098. ⟨10.1016/j.datak.2022.102098⟩. ⟨hal-03821073⟩

Collections

ENS-LYON CNRS UNIV-LYON1 UNIV-LYON2 INSA-LYON EC-LYON LIRIS INSA-GROUPE UDL ANR

85 views

72 downloads
