Z. Bar-Yossef and S. Rajagopalan, Template Detection via Data Mining and its Applications, Proceedings of the 11th International Conference on World Wide Web, pp.580-591, 2002.

A. Barbaresi, Ad hoc and general-purpose corpus construction from web sources, 2015.
URL : https://hal.archives-ouvertes.fr/tel-01167309

A. Barbaresi, Efficient construction of metadata-enhanced web corpora, Proceedings of the 10th Web as Corpus Workshop, pp.7-16, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01371704

A. Barbaresi, The Vast and the Focused: On the need for thematic web and blog corpora, Proceedings of the CMLC-7 workshop, pp.29-32, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02447305

M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff, Cleaneval: a Competition for Cleaning Web Pages, Proceedings of LREC, pp.638-643, 2008.

M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta, The WaCky Wide Web: a collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation, vol.43, issue.3, pp.209-226, 2009.

D. Bauer, J. Degen, X. Deng, P. Herger, J. Gasthaus et al., FIASCO: Filtering the internet by automatic subtree classification, Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop (WAC-3), pp.111-121, 2007.

N. Brügger and D. Laursen, Introduction: Digital humanities, the web, and national web domains, The Historical Web and Digital Humanities, pp.1-9, 2019.

C. Buck and P. Koehn, Quick and reliable document alignment via TF/IDF-weighted cosine distance, Proceedings of the First Conference on Machine Translation, vol.2, pp.672-678, 2016.

D. Cai, S. Yu, J. Wen, and W. Ma, VIPS: a Vision-based Page Segmentation Algorithm, 2003.

H. J. Carey and M. Manic, HTML web content extraction using paragraph tags, 25th International Symposium on Industrial Electronics (ISIE), pp.1099-1105, 2016.

A. Geyken, A. Barbaresi, J. Didakowski, B. Jurish, F. Wiegand et al., Die Korpusplattform des "Digitalen Wörterbuchs der deutschen Sprache" (DWDS), Zeitschrift für germanistische Linguistik, vol.45, issue.2, pp.327-344, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01575661

M. Ghasemisharif, P. Snyder, A. Aucinas, and B. Livshits, SpeedReader: Reader Mode Made Fast and Private, Proceedings of the World Wide Web Conference, pp.526-537, 2019.

T. Gottron, Evaluating content extraction on HTML documents, Proceedings of the 2nd International Conference on Internet Technologies and Applications, pp.123-132, 2007.

S. Gupta, G. Kaiser, D. Neistadt, and P. Grimm, DOM-based content extraction of HTML documents, Proceedings of the 12th international conference on World Wide Web, pp.207-214, 2003.

I. Habernal, O. Zayed, and I. Gurevych, C4Corpus: Multilingual Web-size corpus with free license, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp.914-922, 2016.

F. Hamborg, N. Meuschke, C. Breitinger, and B. Gipp, news-please: A generic news crawler and extractor, Proceedings of the 15th International Symposium of Information Science, pp.218-223, 2017.

H. Kao, S. Lin, J. Ho, and M. Chen, Mining web informative structures and contents based on entropy analysis, IEEE Transactions on Knowledge and Data Engineering, vol.16, issue.1, pp.41-55, 2004.

A. Kilgarriff, Googleology is bad science, Computational Linguistics, vol.33, pp.147-151, 2007.

C. Kohlschütter and W. Nejdl, A Densitometric Approach to Web Page Segmentation, Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp.1173-1182, 2008.

C. Kohlschütter, P. Fankhauser, and W. Nejdl, Boilerplate detection using shallow text features, Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pp.441-450, 2010.

G. Lejeune and L. Zhu, A New Proposal for Evaluating Web Page Cleaning Tools, vol.22, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02467732

G. Lejeune, R. Brixtel, A. Doucet, and N. Lucas, Daniel: Language independent character-based news surveillance, International Conference on NLP, pp.64-75, 2012.
URL : https://hal.archives-ouvertes.fr/hal-01071903

M. E. Peters and D. Lecocq, Content extraction using diverse feature sets, Proceedings of the 22nd International Conference on World Wide Web, pp.89-90, 2013.

J. Platt, K. Toutanova, and W. Yih, Translingual document representations from discriminative projections, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp.251-261, 2010.

J. Pomikálek, Removing boilerplate and duplicate content from web corpora, 2011.

P. A. Qureshi and N. Memon, Hybrid model of content extraction, Journal of Computer and System Sciences, vol.78, issue.4, pp.1248-1257, 2012.

A. R. Rae, J. Kim, D. Le, and G. R. Thoma, Main Content Detection in HTML Journal Articles, Proceedings of the ACM Symposium on Document Engineering 2018, pp.1-4, 2018.

J. W. Ratcliff and D. E. Metzener, Pattern Matching: The Gestalt Approach, Dr. Dobb's Journal, vol.13, issue.7, p.46, 1988.

R. Schäfer, A. Barbaresi, and F. Bildhauer, The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction, Proceedings of the 8th Web as Corpus Workshop, pp.7-15, 2013.

R. Schäfer, A. Barbaresi, and F. Bildhauer, Focused Web Corpus Crawling, Proceedings of the 9th Web as Corpus workshop (WAC-9), pp.9-15, 2014.

R. Schäfer, CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws, Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16), pp.4500-4504, 2016.

M. Spousta, M. Marek, and P. Pecina, Victor: the Web-Page Cleaning Tool, 4th Web as Corpus Workshop (WAC-4), pp.12-17, 2008.

F. Sun, D. Song, and L. Liao, DOM-based content extraction via text density, Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp.245-254, 2011.

T. Vogels, O. Ganea, and C. Eickhoff, Web2text: Deep structured boilerplate removal, European Conference on Information Retrieval, pp.167-179, 2018.

T. Weninger, W. H. Hsu, and J. Han, CETR: content extraction via tag ratios, Proceedings of the 19th international conference on World Wide Web, pp.971-980, 2010.

T. Weninger, R. Palacios, V. Crescenzi, T. Gottron, and P. Merialdo, Web Content Extraction - a Meta-Analysis of its Past and Thoughts on its Future, ACM SIGKDD Explorations Newsletter, vol.17, issue.2, pp.17-23, 2016.