Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce. - HAL open archive
Conference paper. Year: 2012

Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce.

Ludovic Tanguy
Franck Sajous
Basilio Calderone
Nabil Hathout

Abstract

We describe here the technical details of our participation in the "traditional" authorship attribution tasks of PAN 2012. The main originality of our approach lies in the use of a large number of varied features to represent textual data, processed by a maximum entropy machine learning tool. Most of these features make intensive use of natural language processing annotation techniques as well as generic language resources such as lexicons and other linguistic databases. Some features were even designed specifically for the target data type (contemporary fiction). Our belief is that richer features that integrate external knowledge about language have an advantage over knowledge-poorer ones (such as word and character n-gram frequencies) when training data is scarce (both in raw volume and in number of training items for each target author). Although overall results were average (66% accuracy over the main tasks for the best run), we focus in this paper on the differences between feature sets. While the "rich" linguistic features proved to be better than character trigrams and word frequencies, the most effective features vary widely from task to task. For the intrusive paragraph tasks, we obtained better results (73% and 93%) while still using the maximum entropy engine as an unsupervised clustering tool.
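To make the general setup concrete, the following is a minimal sketch, not the authors' system. It assumes that a maximum entropy classifier can be written as multinomial logistic regression trained by stochastic gradient ascent, and it uses a handful of illustrative features (character trigrams, word frequencies, and two simple global statistics standing in for "richer" features). The function names, the toy corpus, and the hyperparameters are all hypothetical; the abstract does not specify the actual tool or feature set.

```python
import math
from collections import Counter


def extract_features(text):
    """Map a text to a sparse dict of feature values (illustrative features only)."""
    words = text.lower().split()
    if not words:
        return {}
    chars = " ".join(words)
    n_tri = len(chars) - 2
    feats = Counter()
    # Knowledge-poor features: relative frequencies of character trigrams and words.
    for i in range(n_tri):
        feats["tri=" + chars[i:i + 3]] += 1.0 / n_tri
    for w in words:
        feats["word=" + w] += 1.0 / len(words)
    # Hypothetical stand-ins for "richer" features: simple global statistics.
    feats["mean_word_len"] = sum(len(w) for w in words) / len(words) / 10.0
    feats["type_token_ratio"] = len(set(words)) / len(words)
    return dict(feats)


def predict_proba(weights, feats, labels):
    """Softmax over per-author linear scores."""
    scores = {y: sum(weights[y].get(f, 0.0) * v for f, v in feats.items())
              for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def train_maxent(samples, epochs=300, lr=0.5, l2=1e-3):
    """Fit weights by stochastic gradient ascent on the L2-penalised log-likelihood."""
    labels = sorted({y for _, y in samples})
    weights = {y: {} for y in labels}
    for _ in range(epochs):
        for feats, gold in samples:
            probs = predict_proba(weights, feats, labels)
            for y in labels:
                # Gradient: observed indicator minus predicted probability.
                err = (1.0 if y == gold else 0.0) - probs[y]
                w = weights[y]
                for f, v in feats.items():
                    old = w.get(f, 0.0)
                    w[f] = old + lr * (err * v - l2 * old)
    return weights, labels


def classify(weights, labels, text):
    """Return the most probable author label for a text."""
    probs = predict_proba(weights, extract_features(text), labels)
    return max(probs, key=probs.get)


# Toy corpus: two invented "authors" with two short training texts each.
toy_corpus = [
    ("the sea was grey and the wind was cold", "author_A"),
    ("the sea rolled under a cold grey sky", "author_A"),
    ("she laughed loudly and danced all night long", "author_B"),
    ("they danced and laughed until the morning", "author_B"),
]
weights, labels = train_maxent([(extract_features(t), a) for t, a in toy_corpus])
print(classify(weights, labels, toy_corpus[0][0]))
```

With so little data per author, the L2 penalty (`l2`) is what keeps the many sparse features from being overfitted; in a sketch like this, comparing feature sets amounts to switching groups of keys on or off in `extract_features` and retraining.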

Domains

Linguistics
Main file
PAN2012-Tanguy-etal.pdf (135.91 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-00736452, version 1 (28-09-2012)

Identifiers

  • HAL Id: hal-00736452, version 1

Cite

Ludovic Tanguy, Franck Sajous, Basilio Calderone, Nabil Hathout. Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce. PAN Lab at CLEF, Sep 2012, Rome, Italy. ⟨hal-00736452⟩
384 views
525 downloads
