Automatically Finding Semantically Consistent N-grams to Add New Words in LVCSR Systems

Gwénolé Lecorvé 1, *, Guillaume Gravier 1, Pascale Sébillot 1
* Corresponding author
1 TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : This paper presents a new method for automatically adding n-grams that contain out-of-vocabulary (OOV) words to a baseline language model (LM). The added n-grams are intended to be grammatically correct and consistent with the meaning of the OOV words. The method first determines the word sequences, i.e., n-grams, in which the use of a given OOV word is most semantically consistent, and then computes conditional probabilities for these n-grams. To do so, semantic relations between words are used to map each OOV word to several equivalent in-vocabulary words. Based on these equivalent words, n-grams from the baseline LM are reused to find the word sequences to add and to compute their probabilities. In experiments where the vocabulary is augmented and recognition is rerun, the method yields word error rate (WER) improvements comparable to those obtained with a state-of-the-art open-vocabulary LM.
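The n-gram reuse step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `add_oov_ngrams`, the dictionary representation of the LM, the similarity weights, and the weighted-average rule for the probabilities are all assumptions made for illustration. The abstract does not specify how the equivalent words are selected or how the probabilities are combined, and a real LM would also need renormalization and updated back-off weights.

```python
from collections import defaultdict


def add_oov_ngrams(baseline_ngrams, oov_word, equivalents):
    """Derive candidate n-grams for an OOV word from a baseline LM.

    baseline_ngrams: dict mapping a tuple of words to a conditional
        probability (hypothetical, simplified LM representation).
    oov_word: the new word to add to the vocabulary.
    equivalents: dict mapping in-vocabulary words judged semantically
        equivalent to the OOV word to a positive similarity weight
        (assumed to come from some external semantic-relatedness measure).
    """
    total_weight = sum(equivalents.values())
    new_ngrams = defaultdict(float)
    for ngram, prob in baseline_ngrams.items():
        for position, word in enumerate(ngram):
            weight = equivalents.get(word)
            if weight is None:
                continue
            # Replace the equivalent word with the OOV word at this position.
            candidate = ngram[:position] + (oov_word,) + ngram[position + 1:]
            # Assumption: mix the source probabilities by normalized weight.
            new_ngrams[candidate] += (weight / total_weight) * prob
    return dict(new_ngrams)


if __name__ == "__main__":
    # Toy data, invented for this example only.
    baseline = {
        ("the", "cat", "sleeps"): 0.2,
        ("the", "dog", "sleeps"): 0.4,
        ("a", "bird", "sings"): 0.5,
    }
    for ngram, prob in add_oov_ngrams(
        baseline, "ferret", {"cat": 3.0, "dog": 1.0}
    ).items():
        print(" ".join(ngram), round(prob, 4))
```

In this sketch, only baseline n-grams that contain an equivalent word produce candidates, so the OOV word inherits contexts in which semantically similar words already occur.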
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-00645223
Contributor : Pascale Sébillot
Submitted on : Sunday, November 27, 2011 - 2:37:39 PM
Last modification on : Friday, November 16, 2018 - 1:24:06 AM
Long-term archiving on : Tuesday, February 28, 2012 - 2:21:45 AM

File

lecorve_icassp2011.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00645223, version 1

Citation

Gwénolé Lecorvé, Guillaume Gravier, Pascale Sébillot. Automatically Finding Semantically Consistent N-grams to Add New Words in LVCSR Systems. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, May 2011, Prague, Czech Republic. 4 p., 2 columns. ⟨hal-00645223⟩
