Out-of-the-Box and Into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools

Abstract : This article examines extraction methods designed to retain the main text content of web pages and discusses how extraction could be oriented and evaluated: can and should it be as generic as possible to ensure opportunistic corpus construction? The evaluation is grounded in a comparative benchmark of open-source tools applied to pages in five languages (Chinese, English, Greek, Polish and Russian) and features several metrics to obtain more fine-grained distinctions. Our experiments highlight the diversity of web page layouts across languages or publishing countries. These discrepancies are reflected in diverging performance, so the right tool has to be chosen accordingly.
Document type : Journal articles
Cited literature : 40 references

https://hal.archives-ouvertes.fr/hal-02732851
Contributor : Gaël Lejeune
Submitted on : Tuesday, June 2, 2020 - 7:40:38 AM
Last modification on : Wednesday, December 9, 2020 - 3:10:08 PM
Long-term archiving on : Friday, September 25, 2020 - 11:41:01 PM

File

2020.wac-1.2.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id : hal-02732851, version 1

Citation

Adrien Barbaresi, Gaël Lejeune. Out-of-the-Box and Into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools. Language Resources and Evaluation Conference (LREC 2020), 2020, pp.5-13. ⟨hal-02732851⟩
