Data Exploration with SQL using Machine Learning Techniques

Abstract: Data scientists now have access to very large volumes of data, much of it accessible through SQL. Despite the inherent simplicity of SQL, writing relevant and efficient SQL queries is known to be difficult, especially for databases with a large number of attributes or meaningless attribute names. In this paper, we propose a "rewriting" technique that helps data scientists formulate SQL queries and explore their big data rapidly and intuitively, while keeping user input to a minimum, with no manual tuple specification or labeling. For a user-specified query, we define a negation query, which produces tuples that are not wanted in the initial query's answer. Since there is an exponential number of such negation queries, we describe a pseudo-polynomial heuristic to pick the negation closest in size to the initial query, and construct a balanced learning set whose positive examples correspond to the results desired by analysts and whose negative examples correspond to those they do not want. The initial query is then reformulated using machine learning techniques, yielding a new query that is more efficient and diverse. We have implemented a prototype and conducted experiments on real-life datasets and synthetic query workloads to assess the scalability and precision of our proposal. A preliminary qualitative experiment conducted with astrophysicists is also described.
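
The negation-query step described in the abstract can be illustrated with a minimal sketch. It assumes that the initial query is a conjunction of predicates and that a negation query negates a non-empty subset of them, which is one reading consistent with the exponential count mentioned above. It also enumerates every candidate by brute force; this stands in for the paper's pseudo-polynomial heuristic and is not that heuristic. The function names, the `stars` table and its columns are hypothetical, chosen only for illustration.

```python
import sqlite3
from itertools import product


def negation_candidates(predicates):
    """Yield each keep/negate combination with at least one negated predicate.

    For n predicates this yields 2**n - 1 candidates (assumed definition).
    """
    for mask in product([False, True], repeat=len(predicates)):
        if any(mask):
            yield [f"NOT ({p})" if negate else f"({p})"
                   for p, negate in zip(predicates, mask)]


def count_rows(conn, table, conditions):
    """Count rows satisfying a conjunction of conditions (trusted strings only)."""
    where = " AND ".join(conditions)
    return conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {where}").fetchone()[0]


def closest_negation(conn, table, predicates):
    """Brute-force stand-in: pick the negation whose size is closest to the query's."""
    target = count_rows(conn, table, [f"({p})" for p in predicates])
    return min(negation_candidates(predicates),
               key=lambda conds: abs(count_rows(conn, table, conds) - target))


def learning_set(conn, table, predicates):
    """Label initial-query tuples 1 and closest-negation tuples 0."""
    pos_where = " AND ".join(f"({p})" for p in predicates)
    neg_where = " AND ".join(closest_negation(conn, table, predicates))
    pos = conn.execute(f"SELECT * FROM {table} WHERE {pos_where}").fetchall()
    neg = conn.execute(f"SELECT * FROM {table} WHERE {neg_where}").fetchall()
    return [(row, 1) for row in pos] + [(row, 0) for row in neg]


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper's experiments.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stars (id INTEGER, mag REAL, dist REAL)")
    conn.executemany("INSERT INTO stars VALUES (?, ?, ?)", [
        (1, 5.0, 10), (2, 6.0, 20), (3, 12.0, 15),
        (4, 13.0, 200), (5, 14.0, 300), (6, 4.0, 500),
    ])
    for row, label in learning_set(conn, "stars", ["mag < 10", "dist < 100"]):
        print(label, row)
```

The labeled tuples returned by `learning_set` are the kind of input a classifier could then be trained on to derive a reformulated query. The sketch stops before that step, since the abstract does not name the learning method, and it ignores NULL values, for which `NOT (p)` does not select the complement of `p` in SQL.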
Document type:
Conference paper
International Conference on Extending Database Technology - EDBT, Mar 2017, Venice, Italy. Proc. 20th International Conference on Extending Database Technology (EDBT), 2017, <http://edbticdt2017.unive.it/>


https://hal.archives-ouvertes.fr/hal-01455715
Contributor: Vasile-Marian Scuturici
Submitted on: Friday, 3 February 2017 - 17:29:24
Last modified on: Thursday, 9 February 2017 - 01:05:11
Document(s) archived on: Friday, 5 May 2017 - 13:50:48

File

iSQL_EDBT.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01455715, version 1

Citation

Julien Cumin, Jean-Marc Petit, Vasile-Marian Scuturici, Sabina Surdu. Data Exploration with SQL using Machine Learning Techniques. International Conference on Extending Database Technology - EDBT, Mar 2017, Venice, Italy. Proc. 20th International Conference on Extending Database Technology (EDBT), 2017, <http://edbticdt2017.unive.it/>. <hal-01455715>
