Conference paper, Year: 2012

Compiling terminological data using comparable corpora: from term extraction to dictionaries

Abstract

For scientific domains, terminological resources such as dictionaries are often unavailable or out of date. In addition, term variation (Daille, 2005) is often undocumented. As a result, translators working in technical domains usually spend a lot of time building terminological resources. The TTC project aims to compile domain-specific lexical resources for integration into computer-assisted translation (CAT) tools and statistical machine translation (SMT) systems. Since parallel data is often not available, comparable corpora are used: they are available for a wide range of domains in many languages.

The TTC tool suite consists of the following steps:

1. corpus collection using a focused crawler (Groc, 2011);
2. pattern-based extraction of terminologically relevant noun phrases from tagged and lemmatized text (Schmid, 1995);
3. identification of term variants, e.g. (DE) "Korrosionsschutz = Schutz gegen Korrosion" (corrosion protection = protection against corrosion);
4. term alignment: for a given source-language term, equivalents in the target language are searched for and aligned. The term lists of both the source and the target language, as well as a general-language dictionary, are taken as input to this step.

In our poster presentation, we focus on term alignment, presenting two approaches: (1) lexical strategies and (2) the use of context vectors.

1. Terms do not necessarily have an equivalent with the same syntactic structure in other languages; this applies in particular to German compounds. By applying term variation patterns, compounds can be reformulated, resulting in term variants with different syntactic structures (Morin and Daille, 2010). This makes it possible to look up the components of a compound individually in the dictionary and to identify matching target-language terms: "Stromspeicherung = Speicherung von Strom = storage of power / storage of electricity".
2. Terms and their translations tend to appear in comparable lexical contexts. For each source-language term, context vectors are computed and translated into the target language. The translated vectors are then compared with target-language context vectors (using the cosine measure): terms with similar context vectors are likely to be translations.

Since both approaches depend on the coverage of the dictionary, we consider the lexical strategies as an input for the context vector method.
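The two alignment approaches can be illustrated with a minimal Python sketch. It is not the TTC implementation: the toy dictionary, the co-occurrence counts, and the function names (`lexical_candidates`, `translate_vector`, `cosine`, `rank_candidates`) are assumptions made for illustration only. The sketch also assumes that a compound has already been split into head and modifier, and that every dictionary equivalent of a context word inherits the full weight of that word, which is a simplification.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy German-English dictionary; not TTC data.
DICTIONARY = {
    "Strom": ["power", "electricity"],
    "Speicherung": ["storage"],
    "Batterie": ["battery"],
    "laden": ["charge"],
    "Netz": ["grid"],
}


def lexical_candidates(head, modifier, target_terms):
    """Lexical strategy: translate the variant 'HEAD von MODIFIER' word by
    word and keep only candidates found in the target-language term list."""
    candidates = [
        f"{h} of {m}"
        for h, m in product(DICTIONARY.get(head, []), DICTIONARY.get(modifier, []))
    ]
    return [c for c in candidates if c in target_terms]


def translate_vector(vector):
    """Map a source-language context vector into the target language.
    Context words missing from the dictionary are dropped."""
    translated = Counter()
    for word, weight in vector.items():
        for equivalent in DICTIONARY.get(word, []):
            translated[equivalent] += weight
    return translated


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(source_vector, target_vectors):
    """Context vector strategy: rank target terms by similarity to the
    translated context vector of the source term."""
    translated = translate_vector(source_vector)
    scored = [(term, cosine(translated, vec)) for term, vec in target_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # "Stromspeicherung" reformulated as "Speicherung von Strom".
    terms = {"storage of power", "storage of electricity", "power grid"}
    print(lexical_candidates("Speicherung", "Strom", terms))

    # Invented co-occurrence counts for the source term and two target terms.
    source = {"Batterie": 3, "laden": 2, "Netz": 1}
    targets = {
        "storage of power": {"battery": 2, "charge": 2, "grid": 1},
        "power grid": {"grid": 4, "battery": 1},
    }
    print(rank_candidates(source, targets))
```

Both functions rely on the same dictionary, which mirrors the dependence on dictionary coverage noted in the abstract: a context word or compound component without a dictionary entry contributes nothing to either strategy.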
File not deposited

Dates and versions

hal-00819590, version 1 (01-05-2013)

Identifiers

  • HAL Id: hal-00819590, version 1

Cite

M. Weller, Anita Gojun, Ulrich Heid, Béatrice Daille, Emmanuel Morin. Compiling terminological data using comparable corpora: from term extraction to dictionaries. 34th Annual Conference of the German Linguistic Society (DGfS), Mar 2012, Frankfurt, Germany. ⟨hal-00819590⟩