Automatic Corpus Extension for Data-Driven Natural Language Generation

Elena Manishina; Bassam Jabaian; Stéphane Huet; Fabrice Lefèvre

Conference paper. Year: 2016

Elena Manishina

Role: Author
PersonId: 779656
IdRef: 196554608

Laboratoire Informatique d'Avignon

Bassam Jabaian

Role: Author
PersonId: 172824
IdHAL: bassam-jabaian
IdRef: 171425081

Laboratoire Informatique d'Avignon

Stéphane Huet

Role: Author
PersonId: 10005
IdHAL: shuet
ORCID: 0000-0003-1838-3807
IdRef: 110355245

Laboratoire Informatique d'Avignon

Fabrice Lefèvre

Role: Author
PersonId: 175133
IdHAL: fabricelefevre
IdRef: 089427092

Laboratoire Informatique d'Avignon

Abstract

As data-driven approaches started to make their way into the Natural Language Generation (NLG) domain, the need for automation of corpus building and extension became apparent. Corpus creation and extension in data-driven NLG have traditionally involved manual paraphrasing, performed either by a group of experts or through crowd-sourcing. Building training corpora manually is a costly enterprise that requires considerable time and human resources. We propose to automate the process of corpus extension by integrating automatically obtained synonyms and paraphrases. Our methodology allowed us to significantly increase the size of the training corpus and its level of variability (the number of distinct tokens and specific syntactic structures). Our extension solutions are fully automatic and require only some initial validation. The human evaluation results confirm that in many cases native users favor the outputs of the model built on the extended corpus.
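
The idea of extending a corpus with automatically obtained synonyms can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `extend_corpus` and `distinct_tokens`, the toy sentences, and the hand-written synonym table are all assumptions made for illustration, and the abstract does not specify how synonyms or paraphrases are obtained or validated. The sketch substitutes one word at a time and counts distinct tokens as a rough proxy for the variability the abstract mentions.

```python
from typing import Dict, List, Set


def extend_corpus(corpus: List[str], synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original sentences plus single-word synonym variants.

    `synonyms` is a hypothetical, already validated lookup table
    (lowercased word -> list of replacements).
    """
    seen: Set[str] = set(corpus)
    extended = list(corpus)
    for sentence in corpus:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            for alt in synonyms.get(token.lower(), []):
                # Replace only position i, keep the rest of the sentence intact.
                variant = " ".join(tokens[:i] + [alt] + tokens[i + 1:])
                if variant not in seen:
                    seen.add(variant)
                    extended.append(variant)
    return extended


def distinct_tokens(corpus: List[str]) -> int:
    """Count distinct lowercased whitespace-separated tokens."""
    return len({tok.lower() for sentence in corpus for tok in sentence.split()})


# Toy data, invented for this example.
corpus = ["the restaurant serves cheap food", "the hotel is near the station"]
synonyms = {"cheap": ["inexpensive", "affordable"], "near": ["close to"]}

extended = extend_corpus(corpus, synonyms)
print(len(corpus), "->", len(extended), "sentences")
print(distinct_tokens(corpus), "->", distinct_tokens(extended), "distinct tokens")
```

Substituting one word per variant keeps the growth of the corpus linear in the number of synonyms; combining several substitutions in one sentence would produce more variants but also more risk of unnatural output, which is where the initial validation step mentioned in the abstract would matter.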

Keywords

corpus building, natural language generation, automatic paraphrasing

Domains

Document and Text Processing; Computation and Language [cs.CL]

Main file

LREC16.pdf (218.09 KB)

Origin: Files produced by the author(s)

Contributor: Stéphane Huet

https://hal.science/hal-02021894

Submitted on: Saturday, 16 February 2019, 21:19:36

Last modified on: Friday, 23 October 2020, 16:48:30

Long-term archiving on: Friday, 17 May 2019, 16:24:08

Dates and versions

hal-02021894 , version 1 (16-02-2019)

Identifiers

HAL Id : hal-02021894 , version 1

Cite

Elena Manishina, Bassam Jabaian, Stéphane Huet, Fabrice Lefèvre. Automatic Corpus Extension for Data-Driven Natural Language Generation. 10th International Conference on Language Resources and Evaluation (LREC), 2016, Portorož, Slovenia. pp.3624-3631. ⟨hal-02021894⟩

Collections

UNIV-AVIGNON LIA
