A. Ait-elhadj and M. Boughanem, Using structural similarity for clustering XML documents, Knowledge and Information Systems, vol.32, issue.1, pp.109-139, 2012.

J. Alarte, J. Silva, and S. Tamarit, What web template extractor should I use? A benchmarking and comparison for five template extractors, ACM Trans. Web, vol.13, issue.2, p.19, 2019.

J. Beel, S. Langer, M. Genzmehr, and C. Müller, Docear's PDF Inspector: Title extraction from PDF files, Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pp.443-444, 2013.

R. Brixtel, Maximal repeats enhance substring-based authorship attribution, Proceedings of the International Conference Recent Advances in Natural Language Processing, pp.63-71, 2015.

R. De-busser and M. Moens, A general learning method for automatic title extraction from HTML pages, Machine Learning and Data Mining in Pattern Recognition, pp.704-718, 2006.

B. Daille, E. Jacquey, G. Lejeune, L. F. Melo, and Y. Toussaint, Ambiguity Diagnosis for Terms in Digital Humanities, Language Resources and Evaluation Conference, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01423650

H. Déjean, 2007.

M. Denil, A. Demiraj, and N. de Freitas, Extraction of salient sentences from labelled documents, 2014.

A. Doucet, G. Kazai, S. Colutto, and G. Mühlberger, Overview of the ICDAR 2013 Competition on Book Structure Extraction, Proc. of the 12th International Conference on Document Analysis and Recognition (ICDAR'2013), pp.1438-1443, 2013.

A. Doucet and M. Lehtonen, Unsupervised classification of text-centric XML document collections, Comparative Evaluation of XML Information Retrieval Systems, 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX, Lecture Notes in Computer Science, vol.4518, pp.497-509, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00324994

E. Giguet, A. Baudrillart, and N. Lucas, Resurgence for the book structure extraction competition, INEX 2009 Workshop PreProceedings, pp.136-142, 2009.
URL : https://hal.archives-ouvertes.fr/hal-01069909

E. Giguet and N. Lucas, The book structure extraction competition with the Resurgence software at Caen University, Focused Retrieval and Evaluation, pp.170-178, 2010.
URL : https://hal.archives-ouvertes.fr/hal-01071717

E. Giguet and N. Lucas, The book structure extraction competition with the Resurgence software for part and chapter detection at Caen University, Comparative Evaluation of Focused Retrieval, 9th International Workshop, INEX, vol.6932, pp.128-139, 2010.
URL : https://hal.archives-ouvertes.fr/hal-01069909

S. Huttunen, A. Vihavainen, P. von Etter, and R. Yangarber, Relevance prediction in information extraction using discourse and lexical features, Proceedings of the 18th Nordic Conference of Computational Linguistics, pp.114-121, 2011.

S. Klampfl, M. Granitzer, K. Jack, and R. Kern, Unsupervised document structure analysis of digital scientific articles, Int. J. Digit. Libr., vol.14, issue.3-4, pp.83-99, 2014.

G. Lejeune, R. Brixtel, A. Doucet, and N. Lucas, Multilingual event extraction for epidemic detection, Artificial Intelligence in Medicine, vol.65, issue.2, pp.131-143, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01712956

G. Lejeune and L. Zhu, A new proposal for evaluating web page cleaning tools, Computación y Sistemas, vol.22, 2018.

T. Nguyen, A. Doucet, and M. Coustaty, Enhancing table of contents extraction by system aggregation, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol.01, pp.242-247, 2017.

The FinTOC-2019 shared task: Financial document structure extraction, The Second Workshop on Financial Narrative Processing of NoDaLiDa, 2019.

D. Tkaczyk, A. Collins, and J. Beel, Who did what?: Identifying author contributions in biomedical publications using naïve Bayes, Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL '18, pp.387-388, 2018.

Y. Xue, Y. Hu, G. Xin, R. Song, S. Shi et al., Web page title extraction and its application, Information Processing & Management, vol.43, issue.5, pp.1332-1347, 2007.