Conference papers

End-to-End Extraction of Structured Information from Business Documents with Pointer-Generator Networks

Clément Sage (1, 2), Alex Aussem (2), Véronique Eglin (1), Haytham Elghazel (2), Jérémy Espinas
1 imagine - Extraction de Caractéristiques et Identification
LIRIS - Laboratoire d'InfoRmatique en Image et Systèmes d'information
2 DM2L - Data Mining and Machine Learning
LIRIS - Laboratoire d'InfoRmatique en Image et Systèmes d'information
Abstract : The predominant approaches for extracting key information from documents rely on classifiers that predict the information type of each word. However, the word-level ground truth used for learning is expensive to obtain since it is not naturally produced by the extraction task. In this paper, we discuss a new method for training extraction models directly from the textual value of information. The extracted information of a document is represented as a sequence of tokens in the XML language. We learn to output this representation with a pointer-generator network that alternately copies the document words carrying information and generates the XML tags delimiting the types of information. The ability of our end-to-end method to retrieve structured information is assessed on a large set of business documents. We show that it performs competitively with a standard word classifier without requiring costly word-level supervision.
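The two ideas in the abstract, an XML token sequence as the training target and a decoder that either copies a document word or generates a tag, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the field names (`invoice_number`, `total_amount`), the toy document, and the helper names are hypothetical, and the mixing rule is the standard pointer-generator formulation, which the abstract does not confirm is the exact form used in the paper.

```python
from typing import Dict, List, Tuple


def build_target(fields: List[Tuple[str, str]]) -> List[str]:
    """Serialize (field_type, value) pairs as a flat sequence of XML tokens.

    Tags are tokens the decoder would have to generate; value words are
    tokens it could copy from the document. Field names are hypothetical.
    """
    tokens: List[str] = []
    for field_type, value in fields:
        tokens.append(f"<{field_type}>")
        tokens.extend(value.split())
        tokens.append(f"</{field_type}>")
    return tokens


def mix_distributions(
    p_gen: float,
    vocab_dist: Dict[str, float],
    attention: List[float],
    doc_words: List[str],
) -> Dict[str, float]:
    """Standard pointer-generator mixture (assumed, not taken from the paper).

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention weights
    over the document positions whose word equals w.
    """
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    for word, weight in zip(doc_words, attention):
        final[word] = final.get(word, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy document and the structured information to extract from it.
doc_words = "Invoice 4512 Total 99.50 EUR".split()
target = build_target([("invoice_number", "4512"), ("total_amount", "99.50")])

# One decoding step: the generation vocabulary holds only tags, the
# attention weights point mostly at the amount, and a low p_gen favors copying.
vocab_dist = {"<total_amount>": 0.7, "</total_amount>": 0.3}
attention = [0.1, 0.1, 0.1, 0.6, 0.1]
final = mix_distributions(0.2, vocab_dist, attention, doc_words)
best_token = max(final, key=final.get)
```

Only the field values are needed to build `target`, with no per-word labels on the document, which is the kind of supervision the abstract argues is cheaper to obtain.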

Cited literature [37 references]

https://hal.archives-ouvertes.fr/hal-02958913
Contributor : Clément Sage
Submitted on : Tuesday, October 6, 2020 - 12:55:35 PM
Last modification on : Tuesday, June 1, 2021 - 2:08:09 PM
Long-term archiving on : Thursday, January 7, 2021 - 7:36:41 PM

Identifiers

  • HAL Id : hal-02958913, version 1

Citation

Clément Sage, Alex Aussem, Véronique Eglin, Haytham Elghazel, Jérémy Espinas. End-to-End Extraction of Structured Information from Business Documents with Pointer-Generator Networks. EMNLP 2020 Workshop on Structured Prediction for NLP, Nov 2020, Punta Cana (online), Dominican Republic. ⟨hal-02958913⟩
