Conference paper, Year: 2011

The Book Structure Extraction Competition with the Resurgence software for part and chapter detection at Caen University

Abstract

The GREYC Island team participated for the second time in the Structure Extraction Competition, part of the INEX Book track, with the Resurgence software. We used a minimal strategy based primarily on a top-down document representation with two levels, part and chapter. The main idea is to use a model that describes the relationships between elements of the document structure. Boundaries between high-level units are detected, first for parts and then for chapters. The page is also used. The periphery-center relationship is calculated over the entire document and reflected on each page. The strengths of the approach are that it deals with the entire document; it handles books without tables of contents (ToCs), as well as titles that are not represented in the ToC (e.g. the preface); it does not depend on a lexicon, and is hence tolerant of OCR errors and language independent; and it is simple and fast.
No file deposited

Dates and versions

hal-01069909, version 1 (30-09-2014)

Identifiers

  • HAL Id: hal-01069909, version 1

Cite

Emmanuel Giguet, Nadine Lucas. The Book Structure Extraction Competition with the Resurgence software for part and chapter detection at Caen University. INEX'10 Proceedings of the 9th international conference on Initiative for the evaluation of XML retrieval: comparative evaluation of focused retrieval, Dec 2011, Saarbrücken, Germany. p. 128-139. ⟨hal-01069909⟩