A Topic-Based Hidden Markov Model for Real-Time Spam Tweets Filtering

Mahdi Washha; Aziz Qaroush; Manel Mezghani; Florence Sèdes

doi:10.1016/j.procs.2017.08.075

Conference paper, Year: 2017


Mahdi Washha

Role: Author

Systèmes d’Informations Généralisées

Aziz Qaroush

Role: Author

Birzeit University

Manel Mezghani

Role: Author
PersonId: 1165320
IdRef: 195589572

Systèmes d’Informations Généralisées

Florence Sèdes

Role: Author
PersonId: 735498
IdHAL: florence-sedes
ORCID: 0000-0002-9273-302X
IdRef: 033232679

Systèmes d’Informations Généralisées

Abstract

Online social networks (OSNs) have become an important source of information for a wide range of applications and research areas, such as search engines and summarization systems. However, the high usability and accessibility of OSNs have exposed many information quality (IQ) problems, which in turn degrade the performance of OSN-dependent applications. Social spammers are a particular kind of ill-intentioned user who degrade the quality of OSN information by misusing the services that OSNs provide. They post large volumes of tweets to lure legitimate users to malicious or commercial sites offering malware downloads, phishing, and drug sales. Because Twitter is not immune to the social spam problem, researchers have designed various detection methods that inspect individual tweets or accounts for spam content. However, despite their high detection rates, account-based spam detection methods are not suitable for real-time tweet filtering because they need information from Twitter’s servers. At the tweet level, many lightweight features have been proposed for real-time filtering; however, existing classification models classify each tweet in isolation, without considering the state of previously handled tweets associated with the same topic. These models also require periodic retraining on ground-truth data to stay up to date. Hence, in this paper, we formalize a Hidden Markov Model (HMM) as a time-dependent model for real-time topical spam tweet filtering. More precisely, our method leverages only the meta-data available in the tweet object to detect spam tweets existing in a stream of tweets related to a topic (e.g., #Trump), while considering the state of previously handled tweets associated with the same topic. Compared to classical time-independent classification methods such as Random Forest, the experimental evaluation demonstrates the effectiveness of the method at increasing the quality of topics in terms of precision, recall, and F-measure.
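
The time-dependent idea in the abstract can be illustrated with a minimal sketch of HMM forward filtering over one topic's tweet stream. This is not the authors' model: the two hidden states, the single binary observation (whether a tweet contains a URL), and every probability below are hypothetical values chosen only for illustration. The sketch shows how the belief about the current tweet depends on the tweets handled before it, which a time-independent classifier would ignore.

```python
from typing import List, Sequence

# Hidden states of the sketch (assumed names, not from the paper).
STATES = ("ham", "spam")

# All parameters are hypothetical illustration values.
INITIAL = [0.8, 0.2]                 # P(state at first tweet)
TRANSITION = [[0.9, 0.1],            # P(next state | current = ham)
              [0.3, 0.7]]            # P(next state | current = spam)
# Observation symbol: 0 = tweet has no URL, 1 = tweet has a URL (assumed feature).
EMISSION = [[0.7, 0.3],              # P(observation | ham)
            [0.2, 0.8]]              # P(observation | spam)


def forward_filter(observations: Sequence[int]) -> List[List[float]]:
    """Return P(state_t | obs_1..obs_t) for each tweet, in arrival order."""
    beliefs: List[List[float]] = []
    prior = list(INITIAL)
    for obs in observations:
        # Update step: weight the prior by how well each state explains obs.
        unnormalized = [prior[s] * EMISSION[s][obs] for s in range(len(STATES))]
        total = sum(unnormalized)
        posterior = [u / total for u in unnormalized]
        beliefs.append(posterior)
        # Predict step: push the posterior through the transition matrix
        # so the next tweet starts from the state of the previous ones.
        prior = [
            sum(posterior[i] * TRANSITION[i][j] for i in range(len(STATES)))
            for j in range(len(STATES))
        ]
    return beliefs


def label_stream(observations: Sequence[int]) -> List[str]:
    """Label each tweet with its most probable state given the stream so far."""
    return [
        STATES[max(range(len(STATES)), key=lambda s: belief[s])]
        for belief in forward_filter(observations)
    ]


if __name__ == "__main__":
    stream = [0, 0, 1, 1, 1]  # toy stream for one topic
    for obs, belief, label in zip(stream, forward_filter(stream), label_stream(stream)):
        print(obs, [round(p, 3) for p in belief], label)
```

Because each update starts from the previous posterior, a run of suspicious tweets raises the spam belief step by step, and processing one tweet costs a constant amount of work, which is what makes this kind of filter usable on a live stream.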

Keywords

Hidden Markov Model; Social Spam; Real-Time; Twitter

Domains

Social and Information Networks [cs.SI]

Main file

washha_22076.pdf (726.96 KB)

Origin: Files produced by the author(s)

Contributor: Open Archive Toulouse Archive Ouverte (OATAO)

https://hal.science/hal-02871345

Submitted on: Wednesday, June 17, 2020, 11:20:15

Last modified on: Thursday, February 8, 2024, 15:00:58

Dates and versions

hal-02871345, version 1 (17-06-2020)

Identifiers

HAL Id: hal-02871345, version 1
DOI: 10.1016/j.procs.2017.08.075
OATAO: 22076

Cite

Mahdi Washha, Aziz Qaroush, Manel Mezghani, Florence Sèdes. A Topic-Based Hidden Markov Model for Real-Time Spam Tweets Filtering. 21st International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2017), Sep 2017, Marseille, France. pp.833-843, ⟨10.1016/j.procs.2017.08.075⟩. ⟨hal-02871345⟩


Collections

UNIV-TLSE2 CNRS SMS UT1-CAPITOLE IRIT IRIT-SIG IRIT-GD IRIT-UT3 TOULOUSE-INP UNIV-UT3 UT3-TOULOUSEINP

43 views

185 downloads
