
Normalizing speech transcriptions for Natural Language Processing

Abstract : Researchers working on spoken text processing face specific problems that all stem from the nature of the data. In particular, spoken texts are full of disfluencies, which pose practical problems for automatic analysis. Working from a corpus of almost 500,000 words drawn from the Valibel textual data bank of spontaneous spoken French, we studied four types of disfluency in particular: repetitions, word fragments, immediate self-corrections, and the word euh, known as a "filled pause". In this paper, we show how these four types of disfluency were automatically preprocessed in the texts. The principle was to annotate the part of the disfluency called the reparandum (following the terminology of Shriberg 1994), so that only the repair part is kept.
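
The annotate-then-filter principle described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the trailing-hyphen convention for word fragments, and the rule that only immediately adjacent identical tokens count as a repetition are all assumptions made for illustration. Immediate self-corrections are left out because detecting them would require lexical resources that the abstract does not describe.

```python
FILLED_PAUSES = {"euh"}  # the filled pause named in the abstract
FRAGMENT_MARK = "-"      # assumption: fragments are transcribed with a trailing hyphen


def annotate(tokens):
    """Return (token, is_reparandum) pairs for a list of transcribed tokens."""
    annotated = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        is_filled_pause = tok.lower() in FILLED_PAUSES
        is_fragment = len(tok) > 1 and tok.endswith(FRAGMENT_MARK)
        # Naive repetition rule (assumption): a token immediately followed
        # by an identical token is the reparandum; the last copy is the repair.
        is_repeated = nxt is not None and tok.lower() == nxt.lower()
        annotated.append((tok, is_filled_pause or is_fragment or is_repeated))
    return annotated


def normalize(tokens):
    """Keep only the tokens that were not annotated as reparandum."""
    return [tok for tok, is_reparandum in annotate(tokens) if not is_reparandum]


if __name__ == "__main__":
    utterance = "je je vais euh à la mai- maison".split()
    print(" ".join(normalize(utterance)))
```

Keeping the annotation step separate from the filtering step means the original transcription is never lost: the flagged tokens can be inspected or restored. A rule this naive would also flag legitimate repetitions, such as the French reflexive "nous nous", so a real system would need additional checks.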
Document type : Conference papers
Contributor : Matthieu Constant
Submitted on : Thursday, September 26, 2013 - 2:06:47 PM
Last modification on : Thursday, September 29, 2022 - 2:21:15 PM


  • HAL Id : hal-00866252, version 1


Anne Dister, Mathieu Constant, Gérald Prunelle. Normalizing speech transcriptions for Natural Language Processing. 3rd International Conference on Spoken Communication (GSCP'09), Università degli Studi di Napoli L'Orientale, Feb 2009, Naples, Italy. pp.507-520. ⟨hal-00866252⟩


