Unsupervised Collective-based Framework for Dynamic Retraining of Supervised Real-Time Spam Tweets Detection Model

Mahdi Washha; Aziz Qaroush; Manel Mezghani; Florence Sèdes

doi:10.1016/j.eswa.2019.05.052

Journal article, Expert Systems with Applications, Year: 2019



Mahdi Washha

Role: Author

Systèmes d’Informations Généralisées

Aziz Qaroush

Role: Author

Birzeit University

Manel Mezghani

Role: Author
PersonId : 1165320
IdRef : 195589572

Systèmes d’Informations Généralisées

Florence Sèdes

Role: Author
PersonId : 735498
IdHAL : florence-sedes
ORCID : 0000-0002-9273-302X
IdRef : 033232679

Systèmes d’Informations Généralisées

Abstract

Twitter is one of the most popular social platforms. Its real-time messaging mechanism has changed the way people communicate and disseminate information. Recently, researchers and industry have used it as a new source of data for various intelligent systems, such as tweet sentiment analysis and recommendation systems, which require high data quality. However, due to its flexibility and popularity, Twitter has become a main target for spamming activities such as phishing legitimate users or spreading malicious software, which introduce new security issues and waste resources. Researchers have therefore developed various machine-learning algorithms to reveal Twitter spam. However, as spammers have become smarter and craftier, the characteristics of spam tweets vary over time, making these methods ineffective at detecting new spammer tricks and strategies. In addition, some of the employed methods (e.g. blacklisting) or spammer features (e.g. graph-based features) are extremely time-consuming, which hinders the ability to detect spammer activities in real time. In this paper, we introduce a framework to deal with the volatility of spam content and new spamming patterns, called spam drift. The framework uses the strength of an unsupervised machine-learning approach, which learns from unlabeled tweets, to retrain a real-time supervised tweet-level spam detection model in batch mode. A set of experiments on a large-scale data set shows that the proposed online unsupervised method adaptively discovers and learns the patterns of new spam activities and achieves stable recall values above 95%. Although the average spam precision of our method is around 60%, the high spam recall values show the ability of our proposed method to reduce spam drift problems compared to traditional machine-learning algorithms.
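The retraining idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the unsupervised step or the classifier, so the sketch swaps in a simple duplicate-content heuristic (`pseudo_label`, hypothetical) to label an unlabeled batch, and a tiny token-level Naive Bayes model (`TokenNaiveBayes`, hypothetical) as the supervised detector that is refit on each batch. It is not the authors' method.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase a tweet and split it into simple word tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def pseudo_label(batch, min_copies=3):
    """Stand-in unsupervised step (an assumption, not the paper's method).

    A tweet is labeled spam (1) when its normalized token sequence occurs
    at least `min_copies` times in the batch, otherwise legitimate (0).
    """
    keys = [" ".join(tokenize(t)) for t in batch]
    counts = Counter(keys)
    return [1 if counts[k] >= min_copies else 0 for k in keys]


class TokenNaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative)."""

    def fit(self, texts, labels):
        self.token_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocab = set(self.token_counts[0]) | set(self.token_counts[1])
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        scores = {}
        for label in (0, 1):
            # Smoothed log prior for the class.
            score = math.log((self.class_counts[label] + 1) / (total + 2))
            # Smoothed denominator; the extra 1 covers unseen tokens.
            denom = sum(self.token_counts[label].values()) + len(self.vocab) + 1
            for tok in tokenize(text):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            scores[label] = score
        return 1 if scores[1] > scores[0] else 0


def retrain_on_batch(batch, min_copies=3):
    """Label an unlabeled batch without supervision, then refit the model."""
    labels = pseudo_label(batch, min_copies)
    return TokenNaiveBayes().fit(batch, labels), labels


# Hypothetical batch of unlabeled tweets.
batch = [
    "Win a FREE phone now http://spam.example",
    "win a free phone now http://spam.example",
    "WIN a free phone NOW http://spam.example",
    "Lovely weather in Toulouse today",
    "Reading a paper on spam drift",
    "Coffee with friends this morning",
]
model, labels = retrain_on_batch(batch)
print(labels)
print(model.predict("free phone now"), model.predict("weather in Toulouse"))
```

In this sketch the detector is rebuilt from scratch on every batch, which keeps prediction cheap for real-time use. A practical system would more plausibly merge new pseudo-labeled tweets with earlier training data instead of discarding it.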

Keywords

Real-time Twitter Twitter stream Spam Social spammers

Domains

Information Retrieval [cs.IR]; Social and Information Networks [cs.SI]

Main file

washha_25032.pdf (8.91 MB)

Origin: Files produced by the author(s)

Contributor: Open Archive Toulouse Archive Ouverte (OATAO)

https://hal.science/hal-02419466

Submitted on: Thursday, December 19, 2019, 14:53:01

Last modified on: Thursday, February 8, 2024, 15:00:58

Long-term archiving on: Friday, March 20, 2020, 18:55:41

Dates and versions

hal-02419466 , version 1 (19-12-2019)

Identifiers

HAL Id : hal-02419466 , version 1
DOI : 10.1016/j.eswa.2019.05.052
OATAO : 25022

Cite

Mahdi Washha, Aziz Qaroush, Manel Mezghani, Florence Sèdes. Unsupervised Collective-based Framework for Dynamic Retraining of Supervised Real-Time Spam Tweets Detection Model. Expert Systems with Applications, 2019, 135, pp.129-152. ⟨10.1016/j.eswa.2019.05.052⟩. ⟨hal-02419466⟩


Collections

UNIV-TLSE2 CNRS SMS UT1-CAPITOLE IRIT IRIT-SIG IRIT-GD IRIT-UT3 TOULOUSE-INP UNIV-UT3 UT3-TOULOUSEINP

138 views

84 downloads
