Conference paper, Year: 2014

MarsaTag, a tagger for French written texts and speech transcriptions

Stéphane Rauzy
Philippe Blache

Abstract

In this paper we present MarsaTag, a new system for segmenting, tagging and chunking French input. Beyond its efficiency, the originality of the tool lies in its ability to process both written texts and speech transcriptions. The tagger performs three operations. First, a rule-based tokenizer splits the raw textual input into a sequence of tokens. Second, a broad-coverage morphosyntactic lexicon associates each token form with a tag distribution. Third, the tagging is disambiguated by selecting the POS tag sequence with the highest probability. The probability of a tag sequence is computed with a stochastic model based on the Hidden Markov Model machinery. The states, or patterns, of our model are extracted from the GraceLPL resource (700,000 tokens with morphosyntactic annotation). The tagger reaches an F-measure of 0.974 on written material. The tagger has also been adapted to spontaneous speech transcriptions: the system was trained on a large spoken French corpus (CID, see Bertrand et al. 2008), and phenomena specific to speech (filled pauses, disfluencies, truncations, etc.) were identified and included in a model dedicated to speech transcription input. Evaluated against the manually corrected tags of the CID corpus, the tagger reaches an F-measure of 0.948. MarsaTag is distributed with a software interface that offers a choice of input and output formats (see hdl:11041/sldr000841). Because the technique is generic, an extension to other languages for which annotated treebanks are available (e.g. Chinese Penn Treebank) is currently in progress.
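The three operations described in the abstract can be illustrated with a minimal Python sketch. This is not MarsaTag's implementation: the regex tokenizer, the three-word `LEXICON`, the `TRANSITIONS` table and every probability in them are hypothetical toy values, and a plain first-order (bigram) model decoded with the Viterbi algorithm stands in for the pattern-based model extracted from GraceLPL. The lexicon's P(tag | form) is also used directly as the per-token score, which is a simplification of a textbook HMM emission probability.

```python
import math
import re

# Toy lexicon (hypothetical values): token form -> {tag: P(tag | form)}.
LEXICON = {
    "il": {"PRO": 1.0},
    "la": {"DET": 0.8, "PRO": 0.2},
    "porte": {"NOUN": 0.6, "VERB": 0.4},
}

# Toy first-order transition table (hypothetical values): P(tag | previous tag).
TRANSITIONS = {
    "<s>": {"DET": 0.5, "PRO": 0.4, "NOUN": 0.05, "VERB": 0.05},
    "DET": {"NOUN": 0.9, "PRO": 0.05, "DET": 0.03, "VERB": 0.02},
    "PRO": {"VERB": 0.6, "PRO": 0.3, "NOUN": 0.08, "DET": 0.02},
    "NOUN": {"VERB": 0.5, "DET": 0.2, "PRO": 0.2, "NOUN": 0.1},
    "VERB": {"DET": 0.5, "PRO": 0.2, "NOUN": 0.2, "VERB": 0.1},
}
FLOOR = 1e-6  # smoothing value for transitions missing from the table


def tokenize(text):
    """Step 1: crude stand-in for a rule-based tokenizer."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def tag_distribution(token):
    """Step 2: look up the tag distribution; unknown forms default to NOUN."""
    return LEXICON.get(token, {"NOUN": 1.0})


def viterbi(tokens):
    """Step 3: return the tag sequence with the highest log-probability."""
    if not tokens:
        return []
    # best[tag] = (log-score of the best path ending in tag, that path)
    best = {"<s>": (0.0, [])}
    for token in tokens:
        new_best = {}
        for tag, p_tag in tag_distribution(token).items():
            candidates = []
            for prev, (score, path) in best.items():
                p_trans = TRANSITIONS.get(prev, {}).get(tag, FLOOR)
                step = math.log(p_trans) + math.log(p_tag)
                candidates.append((score + step, path + [tag]))
            new_best[tag] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]


def tag(text):
    """Run the three steps and pair each token with its selected tag."""
    tokens = tokenize(text)
    return list(zip(tokens, viterbi(tokens)))
```

With these toy tables, `tag("la porte")` selects DET then NOUN, while `tag("Il la porte")` selects PRO, PRO, VERB: the same two forms receive different tags depending on the most probable sequence as a whole, which is the point of the disambiguation step.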
No file deposited

Dates and versions

hal-01500736, version 1 (03-04-2017)

Identifiers

  • HAL Id: hal-01500736, version 1

Cite

Stéphane Rauzy, Grégoire Montcheuil, Philippe Blache. MarsaTag, a tagger for French written texts and speech transcriptions. Second Asian Pacific Corpus linguistics Conference, Mar 2014, Hong Kong, China. pp.220-220. ⟨hal-01500736⟩