Automatic Corpus Extension for Data-Driven Natural Language Generation

Abstract : As data-driven approaches made their way into the Natural Language Generation (NLG) domain, the need to automate corpus building and extension became apparent. Corpus creation and extension in data-driven NLG has traditionally relied on manual paraphrasing, performed either by a group of experts or through crowd-sourcing. Building training corpora manually is costly and requires considerable time and human resources. We propose to automate corpus extension by integrating automatically obtained synonyms and paraphrases. Our methodology allowed us to significantly increase the size of the training corpus and its level of variability (the number of distinct tokens and specific syntactic structures). Our extension solutions are fully automatic and require only some initial validation. The human evaluation results confirm that in many cases native users favor the outputs of the model built on the extended corpus.
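
The idea of extending a corpus with synonyms can be illustrated with a minimal sketch. This is not the authors' pipeline: the synonym table, the function names (`extend_corpus`, `distinct_tokens`) and the example sentence are all hypothetical, and paraphrase integration and validation are not modelled. The sketch only shows how substituting synonyms increases both corpus size and the number of distinct tokens.

```python
from itertools import product

# Hypothetical synonym table. In practice such entries would be obtained
# automatically and validated; this sketch does not attempt either step.
SYNONYMS = {
    "cheap": ["inexpensive"],
    "restaurant": ["eatery"],
}


def extend_corpus(sentences, synonyms):
    """Return the original sentences plus every variant obtained by
    swapping words for their listed synonyms (duplicates removed, order kept)."""
    extended = []
    seen = set()
    for sentence in sentences:
        tokens = sentence.split()
        # Alternatives for each token: the token itself, then its synonyms.
        options = [[tok] + synonyms.get(tok, []) for tok in tokens]
        for variant in product(*options):
            text = " ".join(variant)
            if text not in seen:
                seen.add(text)
                extended.append(text)
    return extended


def distinct_tokens(sentences):
    """Count distinct whitespace-separated tokens, a rough proxy for variability."""
    return len({tok for s in sentences for tok in s.split()})


corpus = ["a cheap restaurant in the centre"]
extended = extend_corpus(corpus, SYNONYMS)

for line in extended:
    print(line)
print(distinct_tokens(corpus), "->", distinct_tokens(extended))
```

Exhaustive substitution grows combinatorially with the number of replaceable words, so a practical system would need some filtering or sampling; that choice is left open here.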

https://hal.archives-ouvertes.fr/hal-02021894
Contributor : Stéphane Huet
Submitted on : Saturday, February 16, 2019 - 9:19:36 PM
Last modification on : Wednesday, May 15, 2019 - 10:12:03 AM
Long-term archiving on : Friday, May 17, 2019 - 4:24:08 PM

File

LREC16.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02021894, version 1

Citation

Elena Manishina, Bassam Jabaian, Stéphane Huet, Fabrice Lefèvre. Automatic Corpus Extension for Data-Driven Natural Language Generation. 10th International Conference on Language Resources and Evaluation (LREC), 2016, Portorož, Slovenia. pp.3624-3631. ⟨hal-02021894⟩
