Declarative Data Cleaning : Language, Model, and Algorithms - HAL open archive
Report (Research Report) Year: 2001

Declarative Data Cleaning : Language, Model, and Algorithms

Daniela Florescu
  • Role: Author
Dennis Shasha
  • Role: Author
  • PersonId : 833427

Abstract

The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for non-conventional applications, such as the migration of largely unstructured data into structured data, or the integration of heterogeneous scientific data sets in interdisciplinary fields (e.g., in environmental science), existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge with them is the design of a data flow graph that effectively generates clean data and performs efficiently on large sets of input data. The difficulty comes from (i) a lack of clear separation between the logical specification of data transformations and their physical implementation and (ii) a lack of explanation of cleaning results and of user interaction facilities to tune a data cleaning program. This paper addresses these two problems and presents a language, an execution model, and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results report on the assessment of the proposed framework for data cleaning.

Domains

Other [cs.OH]
Main file
RR-4149.pdf (654.74 KB)

Dates and versions

inria-00072476 , version 1 (24-05-2006)

Identifiers

  • HAL Id : inria-00072476 , version 1

Cite

Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian Saita. Declarative Data Cleaning : Language, Model, and Algorithms. [Research Report] RR-4149, INRIA. 2001. ⟨inria-00072476⟩
1511 views
1437 downloads