Conference proceedings, Year: 2021

Daniel at the FinSBD-2 Task: Extracting List and Sentence Boundaries from PDF Documents, a model-driven approach to PDF document analysis

Abstract

In this paper, we present the method we designed and implemented to identify lists and sentences in PDF documents as participants in the FinSBD-2 Financial Document Analysis Shared Task. We propose a model-driven approach for the French and English datasets. It relies on a top-down process that starts from the PDF itself in order to keep control of the workflow. Our objective is to use PDF structure extraction to improve the detection of text segment boundaries in an end-to-end fashion.
No file deposited

Dates and versions

hal-02927304, version 1 (01-09-2020)

Identifiers

  • HAL Id: hal-02927304, version 1

Cite

Emmanuel Giguet, Gaël Lejeune. Daniel at the FinSBD-2 Task: Extracting List and Sentence Boundaries from PDF Documents, a model-driven approach to PDF document analysis. Proceedings of the Second Workshop on Financial Technology and Natural Language Processing, pp. 67-74, 2021. https://aclanthology.org/2020.finnlp-1.11/ ⟨hal-02927304⟩