End-to-end acoustic modelling for phone recognition of young readers

Lucile Gelin; Morgane Daniel; Julien Pinquier; Thomas Pellegrini

doi:10.1016/j.specom.2021.08.003

Journal article, Speech Communication, Year: 2021

Lucile Gelin

Role: Author
PersonId : 742641
IdHAL : lucile-gelin
ORCID : 0000-0002-5623-9438

Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio

Lalilo, Paris

Morgane Daniel

Role: Author
PersonId : 1072616

Lalilo, Paris

Julien Pinquier

Role: Author
PersonId : 21789
IdHAL : julien-pinquier
ORCID : 0000-0003-1556-1284
IdRef : 086752839

Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio

Thomas Pellegrini

Role: Author
PersonId : 741962
IdHAL : thomas-pellegrini
ORCID : 0000-0001-8984-1399
IdRef : 127577955

Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio

Abstract

Automatic recognition systems for child speech lag behind those dedicated to adult speech in performance. This gap is due to the high acoustic and linguistic variability of child speech, caused by children’s physical development, as well as to the lack of available child speech data. Young readers’ speech additionally displays peculiarities, such as a slow reading rate and the presence of reading mistakes, that make the task harder. This work attempts to tackle the main challenges in phone acoustic modelling for young child speech with limited data, and to improve understanding of the strengths and weaknesses of a wide selection of model architectures in this domain. We find that transfer learning techniques are highly efficient on end-to-end architectures for adult-to-child adaptation with a small amount of child speech data. Through transfer learning, a Transformer model complemented with a Connectionist Temporal Classification (CTC) objective function reaches a phone error rate of 28.1%, outperforming a state-of-the-art DNN–HMM model by 6.6% relative, as well as other end-to-end architectures by more than 8.5% relative. An analysis of the models’ performance on two specific reading tasks (isolated words and sentences) is provided, showing the influence of utterance length on attention-based and CTC-based models. The Transformer+CTC model displays a better ability to detect reading mistakes made by children, which can be attributed to the CTC objective function effectively constraining the attention mechanisms to be monotonic.
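The abstract reports results as a phone error rate (PER) and as relative improvements. As a reading aid, the sketch below shows how these two quantities are commonly computed: PER as the Levenshtein (edit) distance between reference and recognised phone sequences divided by the reference length, and relative improvement as the error reduction divided by the baseline error. The function names, the toy phone sequences, and the example error rates are illustrative assumptions, not taken from the paper, and the authors' exact scoring procedure may differ.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (hypothetical helper)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # reaching an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference phone
                            curr[j - 1] + 1,      # insertion of a hypothesis phone
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER in percent: edit operations divided by the number of reference phones."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative error reduction in percent, assuming (baseline - new) / baseline."""
    return 100.0 * (baseline - new) / baseline


# Toy example (made-up phone labels): one substitution and one deletion.
ref = ["b", "o~", "Z", "u", "R"]
hyp = ["b", "o", "Z", "u"]
print(phone_error_rate(ref, hyp))      # 2 errors over 5 reference phones
print(relative_reduction(30.0, 27.0))  # illustrative error rates, not from the paper
```

Under this definition, a relative figure such as "6.6% relative" describes the error reduction as a fraction of the baseline model's error rate, not a difference in absolute percentage points.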

Keywords

child speech; phone recognition; transformer; connectionist temporal classification; transfer learning; low-resource

Domains

Signal and image processing [eess.SP]; Multimedia [cs.MM]; Artificial intelligence [cs.AI]

Main file

S0167639321000959.pdf (920.02 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-03373156

Submitted on: Monday, 16 October 2023, 09:35:47

Last modified on: Monday, 20 November 2023, 11:44:22

Long-term archiving on: Wednesday, 17 January 2024, 19:20:47

Dates and versions

hal-03373156 , version 1 (16-10-2023)

License

Attribution

Identifiers

HAL Id : hal-03373156 , version 1
DOI : 10.1016/j.specom.2021.08.003
PII : S0167-6393(21)00095-9

Cite

Lucile Gelin, Morgane Daniel, Julien Pinquier, Thomas Pellegrini. End-to-end acoustic modelling for phone recognition of young readers. Speech Communication, 2021, 134, pp.71-84. ⟨10.1016/j.specom.2021.08.003⟩. ⟨hal-03373156⟩

Collections

UNIV-TLSE2 CNRS UT1-CAPITOLE IRIT IRIT-SAMOVA ANR IRIT-SI IRIT-UT3 TOULOUSE-INP UNIV-UT3 UT3-TOULOUSEINP
