Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce.

Ludovic Tanguy; Franck Sajous; Basilio Calderone; Nabil Hathout

Conference paper, Year: 2012

Ludovic Tanguy

Role: Author
PersonId : 34
IdHAL : ludovic-tanguy
IdRef : 11839777X

Cognition, Langues, Langage, Ergonomie

Franck Sajous

Role: Author
PersonId : 10494
IdHAL : franck-sajous
ORCID : 0000-0001-9439-3658
IdRef : 253130522

Cognition, Langues, Langage, Ergonomie

Basilio Calderone

Role: Author
PersonId : 17229
IdHAL : basilio-calderone
ORCID : 0000-0002-0160-7512

Cognition, Langues, Langage, Ergonomie

Nabil Hathout

Role: Author
PersonId : 173055
IdHAL : nabil-hathout
ORCID : 0000-0003-4492-171X
IdRef : 118073397

Cognition, Langues, Langage, Ergonomie

Abstract

We describe the technical details of our participation in the "traditional" authorship attribution tasks of PAN 2012. The main originality of our approach lies in the use of a large number of varied features to represent textual data, processed by a maximum entropy machine learning tool. Most of these features make intensive use of natural language processing annotation techniques as well as generic language resources such as lexicons and other linguistic databases. Some of the features were designed specifically for the target data type (contemporary fiction). Our belief is that richer features, which integrate external knowledge about language, have an advantage over knowledge-poorer ones (such as word and character n-gram frequencies) when training data is scarce (both in raw volume and in the number of training items for each target author). Although overall results were average (66% accuracy over the main tasks for the best run), we focus in this paper on the differences between feature sets. While the "rich" linguistic features proved to be better than character trigrams and word frequencies, the most efficient features vary widely from task to task. For the intrusive paragraphs tasks, we obtained better results (73% and 93%) while still using the maximum entropy engine as an unsupervised clustering tool.

Domains

Linguistics

Main file

PAN2012-Tanguy-etal.pdf (135.91 KB)

Origin: Publisher files allowed on an open archive

Contributor: Franck Sajous

https://hal.science/hal-00736452

Submitted on: Friday, September 28, 2012, 11:40:25

Last modified on: Friday, April 19, 2024, 16:18:56

Long-term archiving on: Saturday, December 29, 2012, 04:55:10

Dates and versions

hal-00736452 , version 1 (28-09-2012)

Identifiers

HAL Id : hal-00736452 , version 1

Cite

Ludovic Tanguy, Franck Sajous, Basilio Calderone, Nabil Hathout. Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce. PAN Lab at CLEF, Sep 2012, Rome, Italy. ⟨hal-00736452⟩

Collections

EPHE UNIV-TLSE2 CNRS CLLE PSL UNIV-BORDEAUX-MONTAIGNE

384 views

525 downloads
