Conference paper, Year: 2011

A Multitude of Linguistically-rich Features for Authorship Attribution

Ludovic Tanguy
Assaf Urieli
  • Role: Author
  • PersonId: 955287
Basilio Calderone
Nabil Hathout
Franck Sajous

Abstract

This paper reports on the procedure and learning models we adopted for the 'PAN 2011 Author Identification' challenge targeting real-world email messages. The novelty of our approach lies in a design which combines shallow characteristics of the emails (word and trigram frequencies) with a large number of ad hoc linguistically-rich features addressing different language levels. For the author attribution tasks, all these features were used to train a maximum entropy model which gave very good results. For the single author verification tasks, a set of features exclusively based on the linguistic description of the email messages was considered as input for symbolic learning techniques (rules and decision trees), and gave weak results. This paper presents in detail the features extracted from the corpus, the learning models and the results obtained.
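The attribution setup described in the abstract (shallow frequency features combined with stylistic features, fed to a maximum entropy model) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several things here are assumptions made for illustration only: trigrams are taken to be character trigrams, the two `style=` features are hypothetical stand-ins for the linguistically-rich features (which this record does not list), the toy emails and author names are invented, and the model is a plain multinomial logistic regression trained by stochastic gradient ascent, which is the standard formulation of a maximum entropy classifier.

```python
import math
from collections import Counter

# Hypothetical greeting list, used only by the illustrative style feature.
GREETINGS = {"hi", "hello", "dear"}


def extract_features(text):
    """Map an email body to a sparse dict of feature values."""
    text = text.lower()
    tokens = text.split()
    feats = Counter()
    # Shallow features: relative word frequencies.
    for tok in tokens:
        feats["w=" + tok] += 1.0 / len(tokens)
    # Shallow features: relative character-trigram frequencies (assumption).
    trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
    for tri in trigrams:
        feats["t=" + tri] += 1.0 / len(trigrams)
    # Hypothetical stand-ins for linguistically-rich features.
    if tokens and tokens[0].strip(",.!?") in GREETINGS:
        feats["style=opens_with_greeting"] = 1.0
    if "??" in text or "!!" in text:
        feats["style=doubled_punctuation"] = 1.0
    return dict(feats)


def predict_proba(weights, feats):
    """Softmax over per-author linear scores."""
    scores = {
        author: sum(w.get(name, 0.0) * value for name, value in feats.items())
        for author, w in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {author: math.exp(s - top) for author, s in scores.items()}
    total = sum(exps.values())
    return {author: e / total for author, e in exps.items()}


def predict(weights, feats):
    """Return the most probable author."""
    probs = predict_proba(weights, feats)
    return max(probs, key=probs.get)


def train_maxent(samples, labels, epochs=100, lr=0.5):
    """Fit one weight vector per author by stochastic gradient ascent."""
    weights = {author: {} for author in sorted(set(labels))}
    for _ in range(epochs):
        for feats, gold in zip(samples, labels):
            probs = predict_proba(weights, feats)
            for author, w in weights.items():
                # Gradient of the log-likelihood: observed minus expected.
                err = (1.0 if author == gold else 0.0) - probs[author]
                for name, value in feats.items():
                    w[name] = w.get(name, 0.0) + lr * err * value
    return weights


# Invented toy corpus: (email text, author label).
TRAINING = [
    ("hi team, please find the report attached. thanks!", "alice"),
    ("hi all, please review the report. thanks!", "alice"),
    ("yo dude wanna grab lunch??", "bob"),
    ("yo dude lunch was awesome lol", "bob"),
]

model = train_maxent(
    [extract_features(text) for text, _ in TRAINING],
    [author for _, author in TRAINING],
)
print(predict(model, extract_features("hi team, thanks for the report")))  # alice
```

All feature types share one sparse dictionary, so the classifier weighs word, trigram and stylistic evidence jointly, which is the design choice the abstract highlights. A real system would add regularization and a far larger feature set.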

Domains

Linguistics
Main file
PAN2011-Tanguy-etal.pdf (301.19 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00703987, version 1 (04-06-2012)

Identifiers

  • HAL Id: hal-00703987, version 1

Cite

Ludovic Tanguy, Assaf Urieli, Basilio Calderone, Nabil Hathout, Franck Sajous. A Multitude of Linguistically-rich Features for Authorship Attribution. PAN Lab at CLEF, Sep 2011, Amsterdam, Netherlands. ⟨hal-00703987⟩
218 views
246 downloads
