Browsing on small screens: recasting web-page segmentation into an efficient machine learning framework, Proceedings of the 15th international conference on World Wide Web, WWW '06, pp.33-42, 2006.
Ad hoc and general-purpose corpus construction from web sources, 2015.
URL : https://hal.archives-ouvertes.fr/tel-01167309
Scalable construction of high-quality web corpora, JLCL, vol.28, issue.2, pp.23-59, 2013.
Any Language Early Detection of Epidemic Diseases from Web News Streams, International Conference on Healthcare Informatics (ICHI), pp.159-168, 2013.
URL : https://hal.archives-ouvertes.fr/hal-01073195
A graph-theoretic approach to webpage segmentation, Proceedings of the 17th international conference on World Wide Web, WWW '08, pp.377-386, 2008.
Eliminating noisy information in web pages using featured DOM tree, International Journal of Applied Information Systems, vol.2, issue.2, pp.27-34, 2012.
ICDAR 2011 Book Structure Extraction Competition, Proceedings of the Eleventh International Conference on Document Analysis and Recognition (ICDAR'2011), pp.1501-1505, 2011.
URL : https://hal.archives-ouvertes.fr/hal-01069019
A lightweight and efficient tool for cleaning web pages, Proceedings of LREC, pp.3489-3493, 2008.
, , vol.22, pp.1249-1258, 2018.
Introducing and evaluating ukWaC, a very large web-derived corpus of English, Proceedings of the 4th Web as Corpus Workshop, LREC, pp.47-54, 2008.
Web content extraction based on webpage layout analysis, Second International Conference on Information Technology and Computer Science, pp.40-43, 2010.
Boilerplate detection using shallow text features, Proceedings of the third ACM international conference on Web search and data mining, WSDM '10, pp.441-450, 2010.
Extracting article text from the web with maximum subsequence segmentation, pp.971-980, 2009.
Removing boilerplate and duplicate content from web corpora, Doctoral dissertation, Masarykova univerzita, 2011.
Pattern matching: The gestalt approach, Dr. Dobb's Journal, vol.13, issue.7, pp.68-72, 1988.
Victor: the Web-Page Cleaning Tool, Proceedings of the 4th Web as Corpus Workshop, LREC, pp.12-17, 2008.
A fast and robust method for web page template detection and removal, ACM international conference on Information and knowledge management, CIKM '06, pp.258-267, 2006.