Daniel at the FinSBD-2 Task: Extracting List and Sentence Boundaries from PDF Documents, a model-driven approach to PDF document analysis

In this paper, we present the method we designed and implemented to identify lists and sentences in PDF documents as part of our participation in the FinSBD-2 Financial Document Analysis Shared Task. We propose a model-driven approach for the French and English datasets. It relies on a top-down process that starts from the PDF itself in order to keep control of the workflow. Our objective is to use PDF structure extraction to improve text segment boundary detection in an end-to-end fashion.

Domains

Computer Science [cs] / Document and Text Processing

Contributor: Emmanuel Giguet

https://hal.science/hal-02927304

Submitted on: Tuesday, September 1, 2020, 15:30:33

Last modified on: Wednesday, March 20, 2024, 16:20:04

Dates and versions

hal-02927304, version 1 (01-09-2020)

Identifiers

HAL Id: hal-02927304, version 1

Cite

Emmanuel Giguet, Gaël Lejeune. Daniel at the FinSBD-2 Task: Extracting List and Sentence Boundaries from PDF Documents, a model-driven approach to PDF document analysis. Proceedings of the Second Workshop on Financial Technology and Natural Language Processing, 2021, pp.67-74. https://aclanthology.org/2020.finnlp-1.11/. ⟨hal-02927304⟩


Collections

CNRS GREYC GREYC-HULTECH COMUE-NORMANDIE ENSICAEN UNICAEN SORBONNE-UNIVERSITE STIH SU-LETTRES
