A. Barbaresi, Generic Web Content Extraction with Open-Source Software, Proceedings of KONVENS 2019, Kaleidoscope Abstracts, pp.267-268, 2019.

A. Barbaresi and G. Lejeune, Out-of-the-Box and Into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools, Proceedings of the 12th Web as Corpus workshop, 2020.

M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta, The WaCky Wide Web: a collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation, vol.43, issue.3, pp.209-226, 2009.

M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff, Cleaneval: a Competition for Cleaning Web Pages, Proceedings of LREC, pp.638-643, 2008.

A. Geyken, A. Barbaresi, J. Didakowski, B. Jurish, F. Wiegand, and L. Lemnitzer, [...] "Digitalen Wörterbuchs der deutschen Sprache" (DWDS), vol.45, pp.327-344, 2017.

I. Habernal, O. Zayed, and I. Gurevych, C4Corpus: Multilingual Web-size corpus with free license, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp.914-922, 2016.

F. Hamborg, N. Meuschke, C. Breitinger, and B. Gipp, news-please: A generic news crawler and extractor, in M. Gaede, V. Trkulja et al. (eds.), Proceedings of the 15th International Symposium of Information Science, pp.218-223, 2017.

A. Kilgarriff, Googleology is bad science, Computational Linguistics, vol.33, issue.1, pp.147-151, 2007.

C. Kohlschütter, P. Fankhauser, and W. Nejdl, Boilerplate detection using shallow text features, Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pp.441-450, 2010.

G. Lejeune, R. Brixtel, A. Doucet, and N. Lucas, Daniel: Language independent character-based news surveillance, International Conference on NLP, pp.64-75, 2012.
URL: https://hal.archives-ouvertes.fr/hal-01071903

G. Lejeune and L. Zhu, A New Proposal for Evaluating Web Page Cleaning Tools, Computación y Sistemas, vol.22, issue.4, 2018.

M. E. Peters and D. Lecocq, Content extraction using diverse feature sets, Proceedings of the 22nd International Conference on World Wide Web, pp.89-90, 2013.

J. Pomikálek, Removing boilerplate and duplicate content from web corpora, 2011.

R. Schäfer, A. Barbaresi, and F. Bildhauer, The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction, Proceedings of the 8th Web as Corpus Workshop, pp.7-15, 2013.

T. Weninger, R. Palacios, V. Crescenzi, T. Gottron, and P. Merialdo, Web Content Extraction - a Meta-Analysis of its Past and Thoughts on its Future, ACM SIGKDD Explorations Newsletter, vol.17, issue.2, pp.17-23, 2016.