Compilation of specialized comparable corpus in French and Japanese

Abstract: We present the development of a tool for compiling specialized comparable corpora whose quality should be close to that of a manually compiled corpus. Comparability is based on three levels: domain, topic and type of discourse. Domain and topic can be filtered through the keywords used in web search, but detecting the type of discourse requires broad linguistic analysis. The first step of our work is to automate the detection of the types of discourse found in a scientific domain (science and popular science) in French and Japanese. First, a contrastive stylistic analysis of the two types of discourse is carried out in both languages. This analysis leads to a reusable, generic and robust typology. Machine learning algorithms are then applied to the typology, using shallow parsing. We obtain good results, with an average precision of 80% and an average recall of 70%, which demonstrate the effectiveness of this typology. The classification tool is then integrated into a corpus compilation tool, a text-collection processing chain built on the IBM UIMA system. Starting from two collections of specialized web documents in French and Japanese, this tool creates the corresponding corpus.
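As an illustration of how averaged precision and recall figures like those in the abstract can be computed for a two-class discourse classifier, here is a minimal sketch. The abstract does not say how the averages were taken, so averaging over the two discourse classes (macro-averaging) is an assumption. The labels "scientific" and "popular" and the toy data are invented for the example and are not the paper's data or results.

```python
def precision_recall(gold, predicted, label):
    """Return (precision, recall) for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_average(gold, predicted, labels):
    """Average per-class precision and recall with equal class weights."""
    scores = [precision_recall(gold, predicted, label) for label in labels]
    avg_precision = sum(p for p, _ in scores) / len(scores)
    avg_recall = sum(r for _, r in scores) / len(scores)
    return avg_precision, avg_recall


# Hypothetical toy data: gold discourse types and classifier output.
gold = ["scientific", "scientific", "scientific", "popular", "popular"]
predicted = ["scientific", "scientific", "popular", "popular", "popular"]

avg_p, avg_r = macro_average(gold, predicted, ["scientific", "popular"])
print(f"average precision = {avg_p:.2f}, average recall = {avg_r:.2f}")
```

Macro-averaging weights both discourse types equally, which matters when one type is much rarer than the other in the collected documents.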
Document type:
Conference paper
Pascale Fung, Reinhard Rapp, Pierre Zweigenbaum. ACL-IJCNLP workshop “Building and Using Comparable Corpora” (BUCC 2009), Aug 2009, Singapore. pp.55-63, 2009

https://hal.archives-ouvertes.fr/hal-00411258
Contributor: Lorraine Goeuriot
Submitted on: Wednesday, 26 August 2009 - 17:18:19
Last modified on: Wednesday, 24 June 2015 - 11:00:40

Identifiers

  • HAL Id: hal-00411258, version 1

Citation

Lorraine Goeuriot, Béatrice Daille, Emmanuel Morin. Compilation of specialized comparable corpus in French and Japanese. Pascale Fung, Reinhard Rapp, Pierre Zweigenbaum. ACL-IJCNLP workshop “Building and Using Comparable Corpora” (BUCC 2009), Aug 2009, Singapore. pp.55-63, 2009. <hal-00411258>
