
Linguistic documents synchronizing sound and text

Abstract : The goal of the LACITO linguistic archive project is to conserve, and to make available for research, recorded and transcribed oral traditions and other linguistic materials in (mainly) unwritten languages, giving simultaneous access to sound recordings and text annotation. The project uses simple, TEI-inspired XML markup for the kinds of annotation traditionally used in field linguistics. Transcriptions are segmented at the levels of, roughly, the sentence and the word, and annotation is associated with each level: metadata at the text level, free translation at the sentence level, interlinear glosses at the word level, etc. Time alignment is at the sentence (and optionally the word) level. To minimize in-house development and maintenance, the project uses standard software to the extent possible. Marked-up data is processed using widely available XML/XSL/XSLT/XQL software tools, and displayed using standard browsers. The project has developed (1) an authoring tool, SoundIndex, to facilitate time-alignment, (2) a Java applet which enables standard browsers to access time-aligned speech, (3) XSL stylesheets which determine "views" on the data, and (4) a simple CGI interface permitting the user to choose documents and views and to enter queries. The paper describes these elements in detail. Current objectives are further development of the annotation with a view to linguistic research beyond simple browsing, and of a querying system (using a standard XML query processor) to exploit the annotated material.
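The layered annotation described in the abstract (metadata on the text, a free translation and a time span on each sentence, a gloss on each word) can be illustrated with a minimal Python sketch. The element and attribute names (`TEXT`, `S`, `AUDIO`, `TRANSL`, `W`, `FORM`, `GLOSS`, `start`, `end`) and the sample forms are assumptions invented for illustration; they are not taken from the project's actual markup, and the helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names, attributes, and word forms are
# invented for illustration and are NOT the project's actual schema.
SAMPLE = """
<TEXT id="t1">
  <HEADER><TITLE>Sample story</TITLE></HEADER>
  <S id="s1">
    <AUDIO start="0.00" end="2.35"/>
    <TRANSL>He went to the river.</TRANSL>
    <W><FORM>ta</FORM><GLOSS>3SG</GLOSS></W>
    <W><FORM>miru</FORM><GLOSS>go</GLOSS></W>
  </S>
  <S id="s2">
    <AUDIO start="2.35" end="4.10"/>
    <TRANSL>He saw a fish.</TRANSL>
    <W><FORM>ta</FORM><GLOSS>3SG</GLOSS></W>
    <W><FORM>kepa</FORM><GLOSS>see</GLOSS></W>
  </S>
</TEXT>
"""


def load_sentences(xml_text):
    """Parse the sample markup into a list of sentence records."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("S"):
        audio = s.find("AUDIO")
        sentences.append({
            "id": s.get("id"),
            # Sentence-level time alignment, in seconds.
            "start": float(audio.get("start")),
            "end": float(audio.get("end")),
            # Sentence-level free translation.
            "translation": s.findtext("TRANSL"),
            # Word-level (form, gloss) pairs.
            "words": [(w.findtext("FORM"), w.findtext("GLOSS"))
                      for w in s.findall("W")],
        })
    return sentences


def sentence_at(sentences, t):
    """Return the sentence whose time span contains t, or None."""
    for s in sentences:
        if s["start"] <= t < s["end"]:
            return s
    return None


def find_gloss(sentences, gloss):
    """Return the ids of sentences containing a word with this gloss."""
    return [s["id"] for s in sentences
            if any(g == gloss for _, g in s["words"])]


if __name__ == "__main__":
    sents = load_sentences(SAMPLE)
    hit = sentence_at(sents, 3.0)
    print(hit["id"], hit["translation"])
    print(find_gloss(sents, "3SG"))
```

Keeping the time span on the sentence element means that a lookup by playback time and a lookup by gloss run over the same records, which is roughly what simultaneous access to sound and text requires.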
Document type : Journal articles
Contributor : Michel Jacobson
Submitted on : Wednesday, June 22, 2005 - 3:32:53 PM
Last modification on : Tuesday, June 23, 2020 - 8:48:00 AM
Long-term archiving on : Thursday, April 1, 2010 - 9:44:58 PM


  • HAL Id : hal-00005544, version 1


Michel Jacobson, Boyd Michailovsky, John Lowe. Linguistic documents synchronizing sound and text. Speech Communication, Elsevier : North-Holland, 2001, 33, p. 79-96. ⟨hal-00005544⟩


