Discovering data quality rules in a master data management context

Thierno Mahamoudou Diallo 1
1 BD - Base de Données
LIRIS - Laboratoire d'InfoRmatique en Image et Systèmes d'information
Abstract : Dirty data continues to be an important issue for companies. The datawarehouse institute [Eckerson, 2002], [Rockwell, 2012] stated poor data costs US businesses 611 billion dollars annually and erroneously priced data in retail databases costs US customers 2.5 billion each year. Data quality becomes more and more critical. The database community pays a particular attention to this subject where a variety of integrity constraints like Conditional Functional Dependencies (CFD) have been studied for data cleaning. Repair techniques based on these constraints are precise to catch inconsistencies but are limited on how to exactly correct data. Master data brings a new alternative for data cleaning with respect to it quality property. Thanks to the growing importance of Master Data Management (MDM), a new class of data quality rule known as Editing Rules (ER) tells how to fix errors, pointing which attributes are wrong and what values they should take. The intuition is to correct dirty data using high quality data from the master. However, finding data quality rules is an expensive process that involves intensive manual efforts. It remains unrealistic to rely on human designers. In this thesis, we develop pattern mining techniques for discovering ER from existing source relations with respect to master relations. In this set- ting, we propose a new semantics of ER taking advantage of both source and master data. Thanks to the semantics proposed in term of satisfaction, the discovery problem of ER turns out to be strongly related to the discovery of both CFD and one-to-one correspondences between sources and target attributes. We first attack the problem of discovering CFD. We concentrate our attention to the particular class of constant CFD known as very expressive to detect inconsistencies. We extend some well know concepts introduced for traditional Functional Dependencies to solve the discovery problem of CFD. Secondly, we propose a method based on INclusion Dependencies to extract one-to-one correspondences from source to master attributes before automatically building ER. Finally we propose some heuristics of applying ER to clean data. We have implemented and evaluated our techniques on both real life and synthetic databases. Experiments show both the feasibility, the scalability and the robustness of our proposal.
Document type :
Preprints, Working Papers, ...
Complete list of metadatas

https://hal.archives-ouvertes.fr/hal-01458835
Contributor : Équipe Gestionnaire Des Publications Si Liris <>
Submitted on : Tuesday, February 7, 2017 - 9:20:16 AM
Last modification on : Wednesday, November 20, 2019 - 3:18:27 AM

Identifiers

  • HAL Id : hal-01458835, version 1

Citation

Thierno Mahamoudou Diallo. Discovering data quality rules in a master data management context. 2013. ⟨hal-01458835⟩

Share

Metrics

Record views

241