Conference paper, Year: 2017

CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube

Abstract

This paper addresses the comparability of comments extracted from YouTube. The comments concern spoken Algerian, which can be local Arabic, Modern Standard Arabic, or French. This diversity of expression raises a large number of data-processing problems. In this article, several alignment methods are proposed and tested. The method that aligns best is a Word2Vec-based approach applied iteratively. Calling Word2Vec repeatedly significantly improves the comparability results: a dictionary-based approach leads to a Recall of 4, while our approach achieves a Recall of 33 at rank 1. Using this approach, we built CALYOU, a comparable corpus of spoken Algerian harvested from YouTube.
Main file
KarimaKAmelInterspeech2017.pdf (223.05 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01531591, version 1 (01-06-2017)

Identifiers

  • HAL Id: hal-01531591, version 1

Cite

Karima Abidi, Mohamed Amine Menacer, Kamel Smaili. CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube. 18th Annual Conference of the International Speech Communication Association (Interspeech), Aug 2017, Stockholm, Sweden. ⟨hal-01531591⟩
