Journal articles

A New Proposal for Evaluating Web Page Cleaning Tools

Abstract: In this article, we tackle the problem of evaluating Web Content Extraction tools. This task is seldom studied in the literature, although it has important consequences for the linguistic processing of web-based corpora. We compare two types of evaluation: first, an intrinsic (content-based) evaluation carried out in a multilingual setting (five languages); second, an extrinsic (task-based) evaluation on the same corpus, which studies the effect of the cleaning step on the performance of an NLP pipeline. We show that the results of the intrinsic evaluation are not consistent with those of the extrinsic evaluation, and that the results differ greatly across the studied languages. We conclude that a web page cleaning tool should be chosen with respect to the task at hand rather than the performance observed under the intrinsic evaluation scheme.
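
To make the notion of an intrinsic (content-based) evaluation concrete, the sketch below scores the text returned by a cleaning tool against a manually cleaned reference using token-level precision, recall and F1. This is only an illustration under stated assumptions: the abstract does not say which content-based measures the article uses, and the function name `token_prf` and the whitespace tokenization are hypothetical choices made for this example.

```python
from collections import Counter


def token_prf(extracted: str, reference: str) -> tuple[float, float, float]:
    """Token-level precision, recall and F1 of extracted text vs. a reference.

    Assumption: tokens are whitespace-separated and compared as multisets;
    this is not necessarily the measure used in the article.
    """
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    # Multiset intersection: each token counts at most as often as it
    # appears in both the extracted text and the reference.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical example: the tool kept two navigation tokens as noise.
p, r, f = token_prf("Home Menu the article text", "the article text")
```

An extrinsic evaluation, by contrast, would run the downstream NLP pipeline on each tool's output and compare task scores, which is why the two schemes can rank tools differently.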

https://hal.archives-ouvertes.fr/hal-02467732
Contributor: Gaël Lejeune
Submitted on: Wednesday, February 5, 2020 - 11:18:00 AM
Last modification on: Thursday, March 5, 2020 - 12:06:13 PM

Citation

Gaël Lejeune, Lichao Zhu. A New Proposal for Evaluating Web Page Cleaning Tools. Computación y sistemas, Instituto Politécnico Nacional IPN Centro de Investigación en Computación, 2018, ⟨10.13053/CyS-22-4-3062⟩. ⟨hal-02467732⟩
