Data-Driven Identification of German Phrasal Compounds

Adrien Barbaresi; Katrin Hein

doi:10.1007/978-3-319-64206-2_22

Book chapter. Year: 2017

Adrien Barbaresi

Role: Author
PersonId: 1134
IdHAL: adrien-barbaresi
ORCID: 0000-0002-8079-8694

Berlin-Brandenburgische Akademie der Wissenschaften

Austrian Academy of Sciences

Katrin Hein

Role: Author

Institute for the German Language

Abstract

We present a method to identify and document a phenomenon on which there is very little empirical data: German phrasal compounds occurring as a single token (without punctuation between their components). Relying on linguistic criteria, our approach requires an operational notion of compounds that can be applied systematically, as well as (web) corpora that are large and diverse enough to contain rarely seen phenomena. The method is based on word segmentation and morphological analysis, and it takes advantage of a data-driven learning process. Our results show that coarse-grained identification of phrasal compounds is best performed with empirical data, whereas fine-grained detection could be improved with a combination of rule-based and frequency-based word lists. Along with the characteristics of web texts, the orthographic realizations seem to be linked to the degree of expressivity.

Keywords

corpus linguistics; word segmentation; morphological analysis; web corpora

Domains

Linguistics; Computation and Language [cs.CL]

Main file

Barbaresi&Hein_2017_Data-driven-Identification-of-German-Phrase-Compounds.pdf (125.15 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-01575651

Submitted on: Monday, 21 August 2017, 13:30:54

Last modified on: Wednesday, 12 December 2018, 13:32:04

Dates and versions

hal-01575651, version 1 (21-08-2017)

Licence

Attribution

Identifiers

HAL Id: hal-01575651, version 1
DOI: 10.1007/978-3-319-64206-2_22

Cite

Adrien Barbaresi, Katrin Hein. Data-Driven Identification of German Phrasal Compounds. Kamil Ekštein; Václav Matoušek. Text, Speech, and Dialogue, 10415, Springer International Publishing, pp.192-200, 2017, Lecture Notes in Computer Science, 978-3-319-64205-5. ⟨10.1007/978-3-319-64206-2_22⟩. ⟨hal-01575651⟩
