Char+CV-CTC: Combining Graphemes and Consonant/Vowel Units for CTC-Based ASR Using Multitask Learning

Abdelwahab Heba; Thomas Pellegrini; Jean-Pierre Lorré; Régine André-Obrecht

Conference paper, Year: 2019



Abdelwahab Heba

Role: Author
PersonId: 1129713
IdRef: 221309241

Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio

Linagora [Puteaux]

Thomas Pellegrini

Role: Author
PersonId: 741962
IdHAL: thomas-pellegrini
ORCID: 0000-0001-8984-1399
IdRef: 127577955

Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio

Jean-Pierre Lorré

Role: Author
PersonId: 998553

Linagora [Puteaux]

Régine André-Obrecht

Role: Author
PersonId: 740810
IdHAL: obrecht
IdRef: 060375965

Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio

Université Toulouse III - Paul Sabatier

Abstract

Previous work has shown that end-to-end neural speech recognition systems can be improved by adding auxiliary tasks at intermediate layers. In this paper, we report multitask learning (MTL) experiments in the context of connectionist temporal classification (CTC) based speech recognition at the character level. We compare several MTL architectures that jointly learn to predict characters (sometimes called graphemes) and binary consonant/vowel (CV) labels. The best approach, which we call Char+CV-CTC, sums the character and CV logits to obtain the final character predictions. The idea is to put more weight on vowel characters when the vowel symbol ‘V’ is predicted in the auxiliary-task branch of the network, and likewise on consonant characters when the consonant symbol ‘C’ is predicted. Experiments were carried out on the Wall Street Journal (WSJ) corpus. Char+CV-CTC achieved the best ASR results, with a 2.2% Character Error Rate (CER) and a 6.1% Word Error Rate (WER) on the Eval92 evaluation subset. This model outperformed its monotask counterpart by 0.7% absolute in WER and came close to the 6.0% WER of a strong baseline phone-based Time Delay Neural Network (“TDNN-Phone+TR2”) model.
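
The logit summation described in the abstract can be illustrated with a minimal sketch for a single frame. This is one plausible reading of the abstract, not the authors' implementation: the toy alphabet, the auxiliary label set (blank, space, ‘C’, ‘V’), the vowel inventory, and the helper names `cv_class` and `combine_logits` are all assumptions made for illustration.

```python
import math

# Assumed toy alphabet; index 0 is the CTC blank. Not taken from the paper.
ALPHABET = ["<blank>", " ", "a", "b", "c", "e", "t"]
VOWELS = set("aeiou")  # assumption: every other letter counts as a consonant

# Assumed auxiliary label set: blank, space, consonant, vowel.
CV_LABELS = ["<blank>", " ", "C", "V"]


def cv_class(symbol):
    """Map a character to the index of its auxiliary CV label."""
    if symbol in ("<blank>", " "):
        return CV_LABELS.index(symbol)
    return CV_LABELS.index("V" if symbol in VOWELS else "C")


def combine_logits(char_logits, cv_logits):
    """Add to each character logit the logit of its CV class (one frame)."""
    return [c + cv_logits[cv_class(s)] for s, c in zip(ALPHABET, char_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One frame where the character branch hesitates between 'a' and 'b',
# while the auxiliary branch is confident that the frame is a vowel.
char_logits = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
cv_logits = [0.0, 0.0, -1.0, 2.0]

combined = combine_logits(char_logits, cv_logits)
probs = softmax(combined)
best = ALPHABET[probs.index(max(probs))]
print(best)  # a
```

In this toy frame the character branch alone cannot separate ‘a’ from ‘b’; adding the vowel logit raises every vowel character and lowers every consonant character, so ‘a’ wins. In a full system the combined logits would feed the CTC loss for every frame, which this sketch leaves out.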

Keywords

Multi-task learning; Connectionist temporal classification; Automatic speech recognition

Domains

Computation and Language [cs.CL]

Main file

heba_25028.pdf (169.28 KB)

Origin: Files produced by the author(s)

Contributor: Open Archive Toulouse Archive Ouverte (OATAO)

https://hal.science/hal-02419431

Submitted on: Thursday, December 19, 2019, 14:41:40

Last modified on: Wednesday, January 31, 2024, 16:36:53

Long-term archiving on: Friday, March 20, 2020, 18:31:43

Dates and versions

hal-02419431, version 1 (19-12-2019)

Identifiers

HAL Id: hal-02419431, version 1
OATAO: 25028

Cite

Abdelwahab Heba, Thomas Pellegrini, Jean-Pierre Lorré, Régine André-Obrecht. Char+CV-CTC: Combining Graphemes and Consonant/Vowel Units for CTC-Based ASR Using Multitask Learning. 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), Sep 2019, Graz, Austria. pp.1611-1615. ⟨hal-02419431⟩


Collections

UNIV-TLSE2 CNRS SMS UT1-CAPITOLE IRIT IRIT-SAMOVA IRIT-SI TOULOUSE-INP UNIV-UT3 UT3-TOULOUSEINP

