Report (Research Report) Year: 2005

Multilingual Aligned Corpora From Movie Subtitles

Abstract

This paper describes a methodology for building aligned multilingual corpora from movie subtitles found on the Web. The subtitles come in specific formats and encodings. In a first step, we convert them to our XML-based multilingual subtitle format. In a second step, we align the subtitle sentences using the times at which they are displayed on the screen. We implemented the tool Jimaku to perform these steps semi-automatically. The last step consists of aligning the sentences at the sub-sentence level and indexing the corpus for contextual lookup. For this step, we use the WIMS platform, the result of previous research on text collection management.
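To make the second step more concrete, the following is a minimal sketch of time-based alignment. It is not the Jimaku tool and does not reproduce the authors' XML format, neither of which is specified in this abstract. It assumes, for illustration only, that both subtitle files are in the common SubRip (.srt) layout and that a cue in one language is paired with the cue in the other language that overlaps it most in display time. All names (`Cue`, `parse_srt`, `align`) and the sample subtitles are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # display start, in seconds
    end: float    # display end, in seconds
    text: str


# Assumption: SubRip-style stamps such as 00:01:02,500 (three-digit milliseconds).
TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def to_seconds(stamp: str) -> float:
    h, m, s, ms = TIME.fullmatch(stamp.strip()).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(data: str) -> list[Cue]:
    """Parse SubRip text: blank-line separated blocks of index, timing, text."""
    cues = []
    for block in re.split(r"\n\s*\n", data.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip malformed blocks instead of guessing
        start, end = lines[1].split("-->")
        text = " ".join(line.strip() for line in lines[2:])
        cues.append(Cue(to_seconds(start), to_seconds(end), text))
    return cues


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the interval during which both cues are on screen."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align(src: list[Cue], tgt: list[Cue]) -> list[tuple[str, str]]:
    """Pair each source cue with the target cue it overlaps most, if any."""
    pairs = []
    for a in src:
        best = max(tgt, key=lambda b: overlap(a, b), default=None)
        if best is not None and overlap(a, best) > 0:
            pairs.append((a.text, best.text))
    return pairs


# Hypothetical sample data, not taken from the report.
EN = """1
00:00:01,000 --> 00:00:03,000
Hello.

2
00:00:04,000 --> 00:00:06,500
How are you?
"""

FR = """1
00:00:01,200 --> 00:00:03,100
Bonjour.

2
00:00:04,100 --> 00:00:06,400
Comment allez-vous ?
"""

pairs = align(parse_srt(EN), parse_srt(FR))
```

A greedy best-overlap rule like this one ignores cases where one cue corresponds to several cues in the other language, and it assumes both files were timed against the same video release. The semi-automatic nature of the process described in the abstract suggests that real data needs manual correction beyond what such a rule provides.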
Main file
Subtitles_MM-EG.pdf (249.56 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00968632, version 1 (02-04-2014)

Identifiers

  • HAL Id: hal-00968632, version 1

Cite

Mathieu Mangeot, Emmanuel Giguet. Multilingual Aligned Corpora From Movie Subtitles. [Research Report] LISTIC. 2005. ⟨hal-00968632⟩