Normalizing speech transcriptions for Natural Language Processing

Anne Dister; Mathieu Constant; Gérald Prunelle

Conference paper. Year: 2010


Anne Dister

Role: Author
PersonId : 861886

Mathieu Constant

Role: Author
PersonId : 19722
IdHAL : constant-mathieu
IdRef : 158098188

Laboratoire d'Informatique Gaspard-Monge

Gérald Prunelle

Role: Author

Abstract

Researchers in spoken text processing face specific problems that stem from the nature of the data. In particular, spoken texts are full of disfluencies, which pose practical problems for automatic analysis. Using a corpus of almost 500,000 words from Valibel's textual data bank of spontaneous spoken French (http://www.uclouvain.be/valibel.html), we studied four types of disfluency in particular: repetitions, word fragments, immediate self-corrections, and the word euh, known as a "filled pause". In this paper, we show how these four types of disfluency were automatically preprocessed in the texts. The principle was to annotate the part of the disfluency called the reparandum (following the terminology of Shriberg 1994), so that only the repair part is kept.
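As a rough illustration of the principle described in the abstract, the sketch below labels reparandum tokens and keeps only the rest. It is not the authors' implementation: the function names `annotate` and `normalize`, the label strings, and the trailing-hyphen convention for word fragments are all assumptions made for this example, and immediate self-corrections are left out because they cannot be detected by simple token matching.

```python
FILLED_PAUSE = "euh"  # the filled pause named in the abstract
FRAGMENT_MARK = "-"   # ASSUMED marker for truncated words; not taken from the paper


def annotate(tokens):
    """Pair each token with a reparandum label, or None if it should be kept."""
    labels = [None] * len(tokens)

    # Pass 1: filled pauses and word fragments are identified token by token.
    for i, tok in enumerate(tokens):
        if tok.lower() == FILLED_PAUSE:
            labels[i] = "filled_pause"
        elif len(tok) > 1 and tok.endswith(FRAGMENT_MARK):
            labels[i] = "fragment"

    # Pass 2: among the remaining tokens, a token identical to the next
    # remaining token is treated as the reparandum of a repetition.
    kept = [i for i, lab in enumerate(labels) if lab is None]
    for a, b in zip(kept, kept[1:]):
        if tokens[a].lower() == tokens[b].lower():
            labels[a] = "repetition"

    return list(zip(tokens, labels))


def normalize(tokens):
    """Drop every annotated reparandum token and return the repair part."""
    return [tok for tok, lab in annotate(tokens) if lab is None]


if __name__ == "__main__":
    utterance = "je je euh vou- voulais le le livre".split()
    print(annotate(utterance))
    print(" ".join(normalize(utterance)))
```

Annotating first and filtering afterwards keeps the original transcription recoverable, which matches the idea of marking the reparandum rather than deleting it outright. A naive rule like this one would also remove legitimate repetitions (for example French "nous nous"), so a realistic system would need additional linguistic constraints.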

Domains

Text and document processing

Contributor: Matthieu Constant

https://hal.science/hal-00866252

Submitted on: Thursday, 26 September 2013, 14:06:47

Last modified on: Monday, 13 May 2024, 12:33:15

Dates and versions

hal-00866252 , version 1 (26-09-2013)

Identifiers

HAL Id : hal-00866252 , version 1

Cite

Anne Dister, Mathieu Constant, Gérald Prunelle. Normalizing speech transcriptions for Natural Language Processing. 3rd International Conference on Spoken Communication (GSCP'09), Università degli Studi di Napoli L'Orientale, Feb 2009, Naples, Italy. pp.507-520. ⟨hal-00866252⟩


Collections

ENPC CNRS UNIV-MLV LIGM_LINGU PARISTECH LIGM LIGM_MOA UNIV-EIFFEL LIGM_ADA

