Data-Driven Identification of German Phrasal Compounds

Abstract: We present a method to identify and document a phenomenon on which there is very little empirical data: German phrasal compounds occurring as a single token (without punctuation between their components). Relying on linguistic criteria, our approach requires an operational notion of compounds that can be applied systematically, as well as (web) corpora that are large and diverse enough to contain rarely seen phenomena. The method is based on word segmentation and morphological analysis, and it takes advantage of a data-driven learning process. Our results show that coarse-grained identification of phrasal compounds is best performed with empirical data, whereas fine-grained detection could be improved with a combination of rule-based and frequency-based word lists. Along with the characteristics of web texts, the orthographic realizations seem to be linked to the degree of expressivity.
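
As a rough illustration of what segmentation-based candidate detection could look like, the sketch below splits a single token into known words and flags it when a function word appears between other segments. This is not the authors' implementation: the lexicon, the function-word list, the example tokens, and the names `segment` and `is_phrasal_candidate` are all hypothetical, and the function-word heuristic is only an assumed stand-in for the linguistic criteria mentioned in the abstract.

```python
from functools import lru_cache

# Toy resources, assumed for illustration only (not the authors' word lists).
LEXICON = {"vater", "und", "sohn", "konflikt", "haus", "tür"}
FUNCTION_WORDS = {"und", "oder", "mit", "ohne"}


def segment(token, lexicon=LEXICON):
    """Split a token into lexicon words, preferring the fewest parts.

    Returns a list of lowercase segments, or None if no full split exists.
    """
    word = token.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # An empty remainder is covered by an empty segmentation.
        if start == len(word):
            return ()
        candidates = []
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in lexicon:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


def is_phrasal_candidate(token):
    """Flag tokens with at least three segments and an internal function word."""
    parts = segment(token)
    if parts is None or len(parts) < 3:
        return False
    return any(part in FUNCTION_WORDS for part in parts[1:-1])


if __name__ == "__main__":
    for tok in ["Vaterundsohnkonflikt", "Haustür", "Xyz"]:
        print(tok, segment(tok), is_phrasal_candidate(tok))
```

A hand-made lexicon like this one would only support coarse filtering. The abstract suggests that finer-grained detection would need rule-based and frequency-based word lists in combination.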
Document type:
Book chapter
Kamil Ekštein; Václav Matoušek. Text, Speech, and Dialogue, 10415, Springer International Publishing, pp.192-200, 2017, Lecture Notes in Computer Science, 978-3-319-64205-5. 〈10.1007/978-3-319-64206-2_22〉. 〈https://link.springer.com/bookseries/558〉

Cited literature: 30 references

https://hal.archives-ouvertes.fr/hal-01575651
Contributor: Adrien Barbaresi
Submitted on: Monday, 21 August 2017 - 13:30:54
Last modified on: Tuesday, 22 August 2017 - 01:05:15

File

Barbaresi&Hein_2017_Data-drive...
Files produced by the author(s)

License

Distributed under a Creative Commons Attribution 4.0 International License

Citation

Adrien Barbaresi, Katrin Hein. Data-Driven Identification of German Phrasal Compounds. Kamil Ekštein; Václav Matoušek. Text, Speech, and Dialogue, 10415, Springer International Publishing, pp.192-200, 2017, Lecture Notes in Computer Science, 978-3-319-64205-5. 〈10.1007/978-3-319-64206-2_22〉. 〈https://link.springer.com/bookseries/558〉. 〈hal-01575651〉

Metrics

Record views: 98

File downloads: 103