Discovering Editing Rules for Data Cleaning

Thierno Diallo (1), Jean-Marc Petit (1), Sylvie Servigne (1)
(1) BD - Base de Données, LIRIS - Laboratoire d'InfoRmatique en Image et Systèmes d'information
Abstract : Dirty data continues to be an important issue for companies, and the database community pays particular attention to it. A variety of integrity constraints, such as Conditional Functional Dependencies (CFDs), have been studied for data cleaning. Data repair methods based on these constraints are good at detecting inconsistencies but are limited in how they correct data; worse, they can even introduce new errors. Based on Master Data Management principles, a new class of data quality rules known as Editing Rules (eRs) tells how to fix errors, pointing out which attributes are wrong and what values they should take. However, finding data quality rules is an expensive process that involves intensive manual effort. In this paper, we develop pattern mining techniques for discovering eRs from existing source relations (possibly dirty) with respect to master relations (assumed to be clean and accurate). In this setting, we propose a new semantics of eRs that takes advantage of both source and master data. The problem turns out to be strongly related to the discovery of both CFDs and one-to-one correspondences between source and target attributes. We propose efficient techniques to address the eR discovery problem, together with heuristics to clean data. We have implemented and evaluated our techniques on real-life databases. Experiments show the feasibility, scalability, and robustness of our proposal.
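To make the idea of an editing rule more concrete, the following Python sketch shows one simplified reading of how such a rule might repair a source tuple using a master relation. It is an illustration only and does not reproduce the semantics proposed in the paper: the `EditingRule` class, the `apply_rule` function, the attribute names, and the toy data are all hypothetical. The sketch assumes that a rule names the attributes used to look up a master tuple, the attributes to overwrite, and a constant pattern the source tuple must satisfy.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, str]


@dataclass(frozen=True)
class EditingRule:
    """Hypothetical, simplified editing rule (not the paper's definition)."""
    match: Dict[str, str]    # source attribute -> master attribute used for lookup
    fix: Dict[str, str]      # source attribute to overwrite -> master attribute giving the value
    pattern: Dict[str, str]  # source attribute -> constant the source tuple must carry


def apply_rule(rule: EditingRule, source_row: Row, master: List[Row]) -> Row:
    """Return a repaired copy of source_row, or an unchanged copy if the rule does not apply."""
    # The rule only fires on source tuples that satisfy its pattern.
    if any(source_row.get(attr) != value for attr, value in rule.pattern.items()):
        return dict(source_row)

    # Find master tuples that agree with the source tuple on the lookup attributes.
    candidates = [
        m for m in master
        if all(source_row[s] == m[t] for s, t in rule.match.items())
    ]

    # Only repair when the master data gives exactly one answer for the fixed attributes.
    answers = {tuple(m[t] for t in rule.fix.values()) for m in candidates}
    if len(answers) != 1:
        return dict(source_row)

    fixed = dict(source_row)
    for source_attr, master_attr in rule.fix.items():
        fixed[source_attr] = candidates[0][master_attr]
    return fixed


# Toy data, made up for this illustration.
master = [
    {"zip": "69100", "city": "Villeurbanne"},
    {"zip": "75001", "city": "Paris"},
]
rule = EditingRule(match={"zip": "zip"}, fix={"city": "city"}, pattern={"country": "FR"})
dirty = {"name": "A. Martin", "country": "FR", "zip": "69100", "city": "Lyon"}

fixed = apply_rule(rule, dirty, master)
print(fixed["city"])
```

In this sketch the rule states both which attribute is considered wrong (`city`) and where the correct value comes from (the master tuple with the same `zip`), which is the property the abstract contrasts with constraint-based repair. Leaving the tuple untouched when the master lookup is empty or ambiguous is a design choice of the sketch, not a claim about the paper's heuristics.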
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-01353088
Contributor : Équipe Gestionnaire Des Publications Si Liris
Submitted on : Wednesday, August 10, 2016 - 4:22:01 PM
Last modification on : Tuesday, February 26, 2019 - 4:07:29 PM

Identifiers

  • HAL Id : hal-01353088, version 1

Citation

Thierno Diallo, Jean-Marc Petit, Sylvie Servigne. Discovering Editing Rules for Data Cleaning. 10th International Workshop on Quality in Databases In conjunction with VLDB (Very Large Databases) 2012, Aug 2012, Istanbul, Turkey. pp.1-8. ⟨hal-01353088⟩
