Generic Web Content Extraction with Open-Source Software, Proceedings of KONVENS 2019, Kaleidoscope Abstracts, pp.267-268, 2019.
Out-of-the-Box and Into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools, Proceedings of the 12th Web as Corpus workshop, 2020.
The WaCky Wide Web: a collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation, vol.43, issue.3, pp.209-226, 2009.
Cleaneval: a Competition for Cleaning Web Pages, Proceedings of LREC, pp.638-643, 2008.
"Digitalen Wörterbuchs der deutschen Sprache" (DWDS), vol.45, pp.327-344, 2017.
C4Corpus: Multilingual Web-size corpus with free license, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp.914-922, 2016.
news-please: A generic news crawler and extractor, Proceedings of the 15th International Symposium of Information Science, pp.218-223, 2017.
Googleology is bad science, Computational Linguistics, vol.33, issue.1, pp.147-151, 2007.
Boilerplate detection using shallow text features, Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pp.441-450, 2010.
Daniel: Language independent character-based news surveillance, International Conference on NLP, pp.64-75, 2012.
URL: https://hal.archives-ouvertes.fr/hal-01071903
A New Proposal for Evaluating Web Page Cleaning Tools, Computación y Sistemas, vol.22, issue.4, 2018.
Content extraction using diverse feature sets, Proceedings of the 22nd International Conference on World Wide Web, pp.89-90, 2013.
Removing boilerplate and duplicate content from web corpora, 2011.
The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction, Proceedings of the 8th Web as Corpus Workshop, pp.7-15, 2013.
Web Content Extraction - a Meta-Analysis of its Past and Thoughts on its Future, ACM SIGKDD Explorations Newsletter, vol.17, issue.2, pp.17-23, 2016.