S. Baluja, Browsing on small screens: recasting web-page segmentation into an efficient machine learning framework, Proceedings of the 15th international conference on World Wide Web, WWW '06, pp.33-42, 2006.

A. Barbaresi, Ad hoc and general-purpose corpus construction from web sources, 2015.
URL : https://hal.archives-ouvertes.fr/tel-01167309

C. Biemann, F. Bildhauer, S. Evert, D. Goldhahn, U. Quasthoff et al., Scalable construction of high-quality web corpora, JLCL, vol.28, issue.2, pp.23-59, 2013.

R. Brixtel, G. Lejeune, A. Doucet, and N. Lucas, Any Language Early Detection of Epidemic Diseases from Web News Streams, International Conference on Healthcare Informatics (ICHI), pp.159-168, 2013.
URL : https://hal.archives-ouvertes.fr/hal-01073195

D. Chakrabarti, R. Kumar, and K. Punera, A graph-theoretic approach to webpage segmentation, Proceedings of the 17th international conference on World Wide Web, WWW '08, pp.377-386, 2008.

S. N. Das, P. K. Vijayaraghavan, and M. Mathew, Eliminating noisy information in web pages using featured DOM tree, International Journal of Applied Information Systems, vol.2, issue.2, pp.27-34, 2012.

A. Doucet, G. Kazai, and J. Meunier, ICDAR 2011 Book Structure Extraction Competition, Proceedings of the Eleventh International Conference on Document Analysis and Recognition (ICDAR'2011), pp.1501-1505, 2011.
URL : https://hal.archives-ouvertes.fr/hal-01069019

S. Evert, A lightweight and efficient tool for cleaning web pages, Proceedings of LREC, pp.3489-3493, 2008.

Computación y Sistemas, vol.22, pp.1249-1258, 2018.

A. Ferraresi, E. Zanchetta, M. Baroni, and S. Bernardini, Introducing and evaluating ukWaC, a very large web-derived corpus of English, Proceedings of the 4th Web as Corpus Workshop, LREC, pp.47-54, 2008.

L. Fu, Y. Meng, Y. Xia, and H. Yu, Web content extraction based on webpage layout analysis, Second International Conference on Information Technology and Computer Science, pp.40-43, 2010.

C. Kohlschütter, P. Fankhauser, and W. Nejdl, Boilerplate detection using shallow text features, Proceedings of the third ACM international conference on Web search and data mining, WSDM '10, pp.441-450, 2010.

J. Pasternack and D. Roth, Extracting article text from the web with maximum subsequence segmentation, pp.971-980, 2009.

J. Pomikálek, Removing boilerplate and duplicate content from web corpora, Disertační práce, Masarykova univerzita, 2011.

J. W. Ratcliff and D. E. Metzener, Pattern matching: The gestalt approach, Dr. Dobb's Journal, vol.13, issue.7, pp.68-72, 1988.

M. Spousta, M. Marek, and P. Pecina, Victor: the Web-Page Cleaning Tool, Proceedings of the 4th Web as Corpus Workshop, LREC, pp.12-17, 2008.

K. Vieira, A. S. da Silva, N. Pinto, E. S. de Moura, J. A. Cavalcanti et al., A fast and robust method for web page template detection and removal, ACM international conference on Information and knowledge management, CIKM '06, pp.258-267, 2006.
