A First Experimental Study on Functional Dependencies for Imbalanced Datasets Classification

Marie Le Guilly (1), Jean-Marc Petit (1), Vasile-Marian Scuturici (1)
(1) BD - Base de Données, LIRIS - Laboratoire d'InfoRmatique en Image et Systèmes d'information
Abstract: Imbalanced datasets are a recurring problem in machine learning classification, as most real-life datasets have classes that are not evenly distributed. This causes many problems for classification algorithms trained on such datasets, as they are often biased towards the majority class. Moreover, the minority class is often of greater interest to data scientists, while also being the hardest to predict. Many different approaches have been proposed to tackle the problem of imbalanced datasets: they often rely on sampling the majority class, or on creating synthetic examples for the minority one. In this paper, we take a completely different perspective on this problem: we propose to use the notion of distance between databases to sample from the majority class, so that the minority and majority classes are as distant as possible. The chosen distance is based on functional dependencies, with the intuition of capturing inherent constraints of the database. We propose algorithms to generate distant synthetic datasets, as well as experiments to verify our conjecture on the classification of distant instances. Despite the mixed results obtained so far, we believe this is a promising research direction, at the intersection of machine learning and databases, and that it deserves further investigation.
Marie Le Guilly, Jean-Marc Petit, Vasile-Marian Scuturici. A First Experimental Study on Functional Dependencies for Imbalanced Datasets Classification. 12th International Workshop on Information Search, Integration, and Personalization (ISIP2018), May 2018, Fukuoka, Japan. ⟨hal-02190890⟩



