Re-ranking Approach to Classification in Large-scale Power-law Distributed Category Systems

Abstract: For large-scale category systems, such as Directory Mozilla, which consist of tens of thousands of categories, it has been empirically verified in earlier studies that the distribution of documents among categories can be modeled as a power-law distribution. This implies that a significant fraction of categories, referred to as rare categories, have very few documents assigned to them. This characteristic of the data makes it harder for learning algorithms to learn effective decision boundaries that can correctly detect such categories in the test set. In this work, we exploit the distribution of documents among categories to (i) derive an upper bound on the accuracy of any classifier, and (ii) propose a ranking-based algorithm which aims to maximize this upper bound. The empirical evaluation on publicly available large-scale datasets demonstrates that the proposed method achieves not only higher accuracy but also much higher coverage of rare categories compared to state-of-the-art methods.
Document type:
Conference paper
Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 2014, New York, NY, USA, United States. ACM, pp.1059-1062, 2014, SIGIR '14. 〈10.1145/2600428.2609509〉

https://hal.archives-ouvertes.fr/hal-01071752
Contributor: Maria-Irina Nicolae
Submitted on: Monday, 6 October 2014 - 15:57:15
Last modified on: Tuesday, 28 October 2014 - 18:34:06

Collections

UGA | LIG

Citation

Rohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini. Re-ranking Approach to Classification in Large-scale Power-law Distributed Category Systems. Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 2014, New York, NY, USA, United States. ACM, pp.1059-1062, 2014, SIGIR '14. 〈10.1145/2600428.2609509〉. 〈hal-01071752〉
