
Évaluation intrinsèque et extrinsèque du nettoyage de pages Web (Intrinsic and Extrinsic Evaluation of Web Page Cleaning)

Gaël Lejeune 1, Romain Brixtel 2, Charlotte Lecluze 3
1 TALN
LINA - Laboratoire d'Informatique de Nantes Atlantique
3 Equipe Hultech - Laboratoire GREYC - UMR6072
GREYC - Groupe de Recherche en Informatique, Image et Instrumentation de Caen
Abstract: This article addresses the evaluation of web page cleaning tools. The task is seldom studied in the literature, although it affects the linguistic processing performed on web-based corpora. We propose two types of evaluation: (I) an intrinsic (content-based) evaluation with measures on words, tags and characters; (II) an extrinsic (task-based) evaluation on the same corpus, studying the effect of the cleaning step on the performance of an NLP pipeline. We show that the two evaluations do not yield consistent results, and that the results differ substantially across the languages studied. We conclude that a web page cleaning tool should be chosen with the target task in mind rather than on the basis of its performance in an intrinsic evaluation.
Document type: Conference papers

Cited literature [9 references]

https://hal.archives-ouvertes.fr/hal-01170005
Contributor: Gaël Lejeune
Submitted on: Tuesday, June 30, 2015 - 4:34:55 PM
Last modification on: Tuesday, October 19, 2021 - 11:34:56 PM
Long-term archiving on: Tuesday, April 25, 2017 - 8:37:45 PM

File

article_court.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01170005, version 1

Citation

Gaël Lejeune, Romain Brixtel, Charlotte Lecluze. Évaluation intrinsèque et extrinsèque du nettoyage de pages Web. Traitement Automatique des Langues Naturelles 2015, Jun 2015, Caen, France. ⟨hal-01170005⟩
