Phonemic transcription of low-resource languages: To what extent can preprocessing be automated?

Guillaume Wisniewski; Alexis Michaud; Séverine Guillaume

Conference paper. Year: 2020

Guillaume Wisniewski

Role: Author
PersonId: 748468
IdHAL: guillaume-wisniewski
ORCID: 0000-0002-4445-080X
IdRef: 128062290

Laboratoire de Linguistique Formelle

Alexis Michaud

Role: Author
PersonId: 419
IdHAL: alexis-michaud
ORCID: 0000-0003-1165-2680
IdRef: 095131507

Langues et civilisations à tradition orale

Séverine Guillaume

Role: Author
PersonId: 12704
IdHAL: severine-guillaume
ORCID: 0000-0003-1772-2600

Langues et civilisations à tradition orale

Abstract

Automatic Speech Recognition for low-resource languages has been an active field of research for more than a decade. It holds promise for facilitating the urgent task of documenting the world's dwindling linguistic diversity. Various methodological hurdles are encountered in the course of this exciting development, however. A well-identified difficulty is that data preprocessing is not at all trivial. The tests reported here (on Yongning Na and other languages from the Pangloss Collection, an open archive of endangered languages) explore some possibilities for automating data preprocessing, assessing to what extent it is possible to bypass the involvement of language experts for menial tasks of data preparation for Natural Language Processing (NLP) purposes. What is at stake is the accessibility of language archive data for a range of NLP tasks and beyond.

Keywords

Speech Resource/Database; Endangered Languages; Speech Recognition/Understanding

Domains

Linguistics; Artificial Intelligence [cs.AI]; Methods and statistics; Signal and Image Processing [eess.SP]

Main file

PersephonePangloss_SLTU_published.pdf (1.22 MB)

Origin: Explicit agreement for this deposit

https://shs.hal.science/hal-02513914

Submitted on: Wednesday, 27 May 2020, 11:55:59

Last modified on: Tuesday, 2 April 2024, 15:48:04

Dates and versions

hal-02513914, version 1 (26-03-2020)

hal-02513914, version 2 (18-04-2020)

hal-02513914, version 3 (27-05-2020)

License

Attribution - NonCommercial

Identifiers

HAL Id: hal-02513914, version 3

Cite

Guillaume Wisniewski, Alexis Michaud, Séverine Guillaume. Phonemic transcription of low-resource languages: To what extent can preprocessing be automated? 1st Joint SLTU (Spoken Language Technologies for Under-resourced languages) and CCURL (Collaboration and Computing for Under-Resourced Languages) Workshop, 2020, Marseille, France. pp. 306-315. ⟨hal-02513914v3⟩

Collections

CNRS UNIV-PARIS3 LLF INALCO LACITO CAMPUS-AAR AAI USPC UP-SOCIETES-HUMANITES ASIES_ET_PACIFIQUE ANR
