Understanding Social Media Texts with Minimum Human Effort on #Twitter

Named Entity Recognition (NER) is a long-established Natural Language Processing (NLP) task, but traditional machine learning methods face new problems when applied to social media data such as Twitter, where performance often degrades. Twitter messages have distinctive features. Consider the example "Today wasz Fun cusz anna Came juss for me <3: hahaha". The difficulties here are manifold: 1) Spelling mistakes: wasz (was), cusz (because), juss (just); 2) Uppercase/lowercase inversion: Fun (fun), anna (Anna), Came (came); 3) Emoticon: <3; 4) Interjection: hahaha. The uppercase/lowercase alternation is a major problem for NER because the only personal proper noun in the tweet, "anna", begins with a lowercase letter rather than the uppercase letter it would have in grammatically well-formed text. In this paper, we present our work on recognizing named entities on Twitter.
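The capitalization problem described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the whitespace tokenizer, the feature names, and the helper functions `token_features` and `capitalized_candidates` are assumptions made for illustration only. It computes the kind of per-token features a sequence labeler such as a CRF might consume, and shows that a naive "capitalized, not sentence-initial" heuristic picks the wrong tokens on the example tweet.

```python
# Illustrative sketch only: hypothetical features, not the paper's feature set.
TWEET = "Today wasz Fun cusz anna Came juss for me <3: hahaha"


def tokenize(text):
    """Split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.split()


def token_features(tokens, i):
    """Build a small feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),                       # case-normalized form
        "is_capitalized": tok[:1].isupper(),        # first character uppercase
        "is_sentence_initial": i == 0,              # capitalization is uninformative here
        "has_alpha": any(c.isalpha() for c in tok), # False for emoticons like "<3:"
        "suffix3": tok[-3:].lower(),                # crude morphological cue
    }


def capitalized_candidates(tokens):
    """Naive proper-noun heuristic: capitalized tokens that are not sentence-initial."""
    return [t for i, t in enumerate(tokens) if i > 0 and t[:1].isupper()]


tokens = tokenize(TWEET)
features = [token_features(tokens, i) for i in range(len(tokens))]
candidates = capitalized_candidates(tokens)

print(candidates)
```

On this tweet the heuristic selects "Fun" and "Came", neither of which is a named entity, and misses "anna", the only person name. This is why features that rely on capitalization alone tend to transfer poorly from well-formed text to tweets.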

Keywords

Natural Language Processing; Machine Learning; Named Entity Recognition; Domain Adaptation; Conditional Random Fields (CRF)

Domains

Computation and Language [cs.CL]; Machine Learning [cs.LG]

Main file

PLIN2016.pdf (191.29 KB)

Origin: Files produced by the author(s)

Contributor: Marco Dinarelli

https://hal.science/hal-01490018

Submitted on: Tuesday, March 14, 2017, 18:09:04

Last modified on: Friday, April 19, 2024, 16:18:57

Long-term archiving on: Thursday, June 15, 2017, 15:13:08

Dates and versions

hal-01490018, version 1 (14-03-2017)

Identifiers

HAL Id: hal-01490018, version 1

Cite

Tian Tian, Isabelle Tellier, Marco Dinarelli, Pedro Cardoso. Understanding Social Media Texts with Minimum Human Effort on #Twitter. Language and the new (instant) media (PLIN), May 2016, Louvain-la-Neuve, Belgium. ⟨hal-01490018⟩

Collections

ENS-PARIS CNRS UNIV-PARIS3 LATTICE PSL USPC
