A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

Sameer Khurana; Antoine Laurent; Wei-Ning Hsu; Jan Chorowski; Adrian Łańcucki; Ricard Marxer; James Glass

Conference paper, Year: 2020

Sameer Khurana

Role: Author
PersonId : 1075440

Antoine Laurent

Role: Author
PersonId : 13586
IdHAL : antoine-laurent
ORCID : 0000-0002-2653-1008
IdRef : 147099072

Laboratoire d'Informatique de l'Université du Mans

Wei-Ning Hsu

Role: Author

Jan Chorowski

Role: Author

University of Wrocław [Poland]

Adrian Łańcucki

Role: Author

University of Wrocław [Poland]

Ricard Marxer

Role: Author
PersonId : 19391
IdHAL : ricard-marxer
ORCID : 0000-0001-5099-5059
IdRef : 240437713

DYNamiques de l’Information

James Glass

Role: Author

Abstract

Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation in which the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen renewed interest due to the introduction of Variational Autoencoders (VAEs), their use for speech representation learning remains largely unexplored. In this work, we propose the Convolutional Deep Markov Model (ConvDMM), a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. This unsupervised model is trained using black box variational inference. A deep convolutional neural network is used as an inference network for structured variational approximation. When trained on a large-scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extraction methods on linear phone classification and recognition on the Wall Street Journal dataset. Furthermore, we find that ConvDMM complements self-supervised methods like Wav2Vec and PASE, improving on the results achieved with any of the methods alone. Lastly, we find that ConvDMM features enable learning better phone recognizers than any other features in an extreme low-resource regime with few labelled training examples.
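As a rough illustration of what "a Gaussian state-space model with non-linear emission and transition functions" means, the sketch below draws a short sequence by ancestral sampling. It is not the authors' implementation: the helper names (`make_mlp`, `sample_sequence`), the layer sizes, the random untrained weights, and the fixed noise standard deviation are all assumptions made for illustration, and the convolutional inference network and variational training described in the abstract are omitted entirely.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Return a tiny one-hidden-layer tanh MLP with random (untrained) weights."""
    w1 = [[rng.gauss(0.0, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 0.5) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def forward(x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return [sum(w * hi for w, hi in zip(row, hidden)) for row in w2]

    return forward


def sample_sequence(num_steps, latent_dim=4, obs_dim=8, noise_std=0.1, seed=0):
    """Ancestral sampling from a toy Gaussian state-space model.

    z_1 ~ N(0, I), z_t ~ N(f(z_{t-1}), s^2 I), x_t ~ N(g(z_t), s^2 I),
    where f and g are small MLPs. The fixed standard deviation s is an
    assumption of this sketch, not something stated in the abstract.
    """
    rng = random.Random(seed)
    transition = make_mlp(latent_dim, 16, latent_dim, rng)  # mean of z_t given z_{t-1}
    emission = make_mlp(latent_dim, 16, obs_dim, rng)       # mean of x_t given z_t

    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]  # initial latent state
    latents, observations = [], []
    for _ in range(num_steps):
        # Emit an observation from the current latent state.
        x = [m + noise_std * rng.gauss(0.0, 1.0) for m in emission(z)]
        latents.append(z)
        observations.append(x)
        # Move the latent state forward one step.
        z = [m + noise_std * rng.gauss(0.0, 1.0) for m in transition(z)]
    return latents, observations


latents, observations = sample_sequence(num_steps=5)
print(len(latents), len(latents[0]), len(observations[0]))  # prints "5 4 8"
```

In a trained model of this kind, the latent states would be inferred from speech features and then used as the learned representation; here they are only sampled to show the Markov structure.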

Keywords

Neural Variational Latent Variable Model; Structured Variational Inference; Unsupervised Speech Representation Learning

Domains

Artificial Intelligence [cs.AI]; Computation and Language [cs.CL]; Machine Learning [cs.LG]; Neural Networks [cs.NE]

Main file

convDMM_arxiv.pdf (1023.33 KB)

Origin: Files produced by the author(s)

Contributor: Antoine Laurent

https://hal.science/hal-02912029

Submitted on: Wednesday, 5 August 2020, 09:49:18

Last modified on: Friday, 22 March 2024, 18:24:04

Long-term archiving on: Monday, 30 November 2020, 14:36:57

Dates and versions

hal-02912029, version 1 (05-08-2020)

Identifiers

HAL Id: hal-02912029, version 1

Cite

Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Łańcucki, et al. A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning. Interspeech 2020, Oct 2020, Shanghai, China. ⟨hal-02912029⟩

Collections

UNIV-TLN CNRS UNIV-AMU UNIV-LEMANS LIUM LIS-LAB INCIAM
