Comparing Recurring Lexico-Syntactic Trees (RLTs) and Ngram Techniques for Extended Phraseology Extraction: a Corpus-based Study on French Scientific Articles

Agnès Tutin; Olivier Kraif

Conference paper, Year: 2017


Agnès Tutin

Role: Author
PersonId : 17094
IdHAL : agnes-tutin
ORCID : 0000-0003-3008-093X
IdRef : 059803150

LInguistique et DIdactique des Langues Étrangères et Maternelles

Olivier Kraif

Role: Author
PersonId : 20769
IdHAL : olivier-kraif
IdRef : 067256759

LInguistique et DIdactique des Langues Étrangères et Maternelles

Abstract

This paper assesses to what extent a syntax-based method, Recurring Lexico-syntactic Tree (RLT) extraction, can extract large phraseological units such as prefabricated routines in scientific writing, e.g. "as previously said" or "as far as we/I know". To evaluate this method, we compare it with the classical ngram extraction technique on a subset of recurring segments that include speech verbs, drawn from a French corpus of scientific writing. Results show that the RLT extraction technique is far more accurate for extended MWEs such as routines or collocations, but performs more poorly for surface phenomena such as syntactic constructions or fully frozen expressions.
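As a rough illustration of the ngram baseline mentioned in the abstract, the following minimal sketch counts recurring contiguous token sequences and keeps only those containing a pivot word (here a form of a speech verb). It is not the authors' implementation: the function name `recurring_ngrams`, its parameters, the frequency threshold, the use of surface tokens rather than lemmas, and the toy French sentences are all assumptions made for illustration.

```python
from collections import Counter

def recurring_ngrams(sentences, n_min=2, n_max=5, min_freq=2, pivots=None):
    """Count contiguous n-grams and keep those seen at least min_freq times.

    sentences: iterable of token lists (already tokenized; hypothetical input).
    pivots: optional set of tokens; when given, an n-gram is counted only
            if it contains at least one pivot (e.g. a speech-verb form).
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if pivots is None or any(tok in pivots for tok in gram):
                    counts[gram] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Toy data invented for this example, not taken from the paper's corpus.
toy = [
    "comme nous l' avons dit précédemment".split(),
    "comme nous l' avons dit plus haut".split(),
    "nous avons dit que".split(),
]
result = recurring_ngrams(toy, n_min=3, n_max=4, min_freq=2, pivots={"dit"})
for gram, freq in sorted(result.items()):
    print(" ".join(gram), freq)
```

Because ngrams are strictly contiguous, a variant such as "nous avons dit" in the third toy sentence does not merge with "nous l' avons dit". This is the kind of surface variation that a syntax-based approach such as RLT extraction is designed to abstract over.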

Keywords

multiword expressions; phraseology; scientific writing

Domains

Linguistics

Main file

comparing-recurring-lexico.pdf (242.07 KB)

Origin: Files produced by the author(s)


https://hal.science/hal-01524862

Submitted on: Friday, May 19, 2017, 07:01:19

Last modified on: Thursday, April 4, 2024, 20:56:44

Dates and versions

hal-01524862 , version 1 (19-05-2017)

Identifiers

HAL Id: hal-01524862, version 1

Cite

Agnès Tutin, Olivier Kraif. Comparing Recurring Lexico-Syntactic Trees (RLTs) and Ngram Techniques for Extended Phraseology Extraction: a Corpus-based Study on French Scientific Articles. 13th Workshop on Multiword Expressions - EACL, Apr 2017, Valencia, Spain. ⟨hal-01524862⟩


Collections

UGA LIDILEM

91 views

51 downloads
