Data Exploration with SQL using Machine Learning Techniques

Abstract : Data scientists now have access to very large datasets, many of which are accessible through SQL. Despite the inherent simplicity of SQL, writing relevant and efficient SQL queries is known to be difficult, especially for databases with a large number of attributes or meaningless attribute names. In this paper, we propose a "rewriting" technique that helps data scientists formulate SQL queries and explore their data rapidly and intuitively, while keeping user input to a minimum: no manual tuple specification or labeling is required. For a user-specified query, we define a negation query, which produces tuples that are not wanted in the initial query's answer. Since there is an exponential number of such negation queries, we describe a pseudo-polynomial heuristic to pick the negation whose answer is closest in size to that of the initial query, and we construct a balanced learning set whose positive examples correspond to the results desired by analysts and whose negative examples correspond to those they do not want. The initial query is then reformulated using machine learning techniques, yielding a new query that is more efficient and more diverse. We have implemented a prototype and conducted experiments on real-life datasets and synthetic query workloads to assess the scalability and precision of our proposal. A preliminary qualitative experiment conducted with astrophysicists is also described.
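
The negation-query idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the initial query is a conjunction of selection predicates and that a negation query keeps some predicates and negates at least one of them, which is one plausible reading of the abstract. It also searches all negations by brute force, which is exponential, whereas the paper describes a pseudo-polynomial heuristic that is not reproduced here. The table contents, attribute names (`mag`, `dist`) and function names are hypothetical.

```python
from itertools import product

def answer(rows, predicates, negated):
    """Rows satisfying every predicate, where predicate i is flipped if negated[i] is True."""
    return [r for r in rows
            if all(bool(p(r)) != neg for p, neg in zip(predicates, negated))]

def closest_negation(rows, predicates):
    """Brute-force search (assumption: NOT the paper's heuristic) for the negation
    whose answer size is closest to the size of the initial query's answer."""
    n = len(predicates)
    positives = answer(rows, predicates, [False] * n)
    best_mask, best_rows = None, None
    for mask in product([False, True], repeat=n):
        if not any(mask):
            continue  # all predicates kept: this is the initial query itself
        candidate = answer(rows, predicates, mask)
        gap = abs(len(candidate) - len(positives))
        if best_rows is None or gap < abs(len(best_rows) - len(positives)):
            best_mask, best_rows = mask, candidate
    return positives, best_mask, best_rows

# Hypothetical toy table and query: mag < 6 AND dist < 50.
rows = [
    {"id": 1, "mag": 4.0, "dist": 10},
    {"id": 2, "mag": 5.5, "dist": 40},
    {"id": 3, "mag": 9.0, "dist": 15},
    {"id": 4, "mag": 8.0, "dist": 80},
    {"id": 5, "mag": 3.0, "dist": 90},
    {"id": 6, "mag": 2.5, "dist": 20},
    {"id": 7, "mag": 7.0, "dist": 30},
    {"id": 8, "mag": 10.0, "dist": 5},
]
predicates = [lambda r: r["mag"] < 6, lambda r: r["dist"] < 50]

positives, mask, negatives = closest_negation(rows, predicates)

# Balanced learning set: label 1 for wanted tuples, 0 for unwanted ones.
learning_set = [(r, 1) for r in positives] + [(r, 0) for r in negatives]
```

On this toy table the selected negation is NOT(mag < 6) AND dist < 50, whose answer has the same size as the initial answer, so the learning set is balanced. The abstract's reformulation step would then train a classifier on `learning_set`; that step is left out of the sketch.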
Cited literature : 29 references

https://hal.archives-ouvertes.fr/hal-01455715
Contributor : Vasile-Marian Scuturici
Submitted on : Friday, February 3, 2017 - 5:29:24 PM
Last modification on : Tuesday, February 26, 2019 - 4:07:31 PM
Long-term archiving on : Friday, May 5, 2017 - 1:50:48 PM

File

iSQL_EDBT.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01455715, version 1

Citation

Julien Cumin, Jean-Marc Petit, Vasile-Marian Scuturici, Sabina Surdu. Data Exploration with SQL using Machine Learning Techniques. International Conference on Extending Database Technology - EDBT, Mar 2017, Venice, Italy. pp.96-107. ⟨hal-01455715⟩
