Cleaning Integrated Data: An Approach based on Conditional Constraints and Data Provenance

Lead Research Organisation: University of Edinburgh
Department Name: Lab. for Foundations of Computer Science

Abstract

Dirty data is estimated to cost US enterprises $600 billion annually. Errors in data hamper service offerings and lead to losses of revenue, credibility and customers, and there is no reason to believe that the scale of the problem is any different in the UK or any other society that depends on information technology. Enterprises are not the only victims of poor data: inaccuracies and discrepancies in scientific data also have severe consequences for the quality of scientific research.

While errors and inconsistencies may be introduced for many reasons at various stages of data processing, data quality problems are particularly evident when overlapping or redundant information from multiple heterogeneous sources is integrated. With this comes the need for cleaning integrated data: detecting and removing errors, conflicts and inconsistencies in data integrated from multiple sources, in order to improve the quality of the integrated data. However important, no practical system is yet in place for effectively cleaning data integrated from multiple sources, in either traditional database or XML format. Data integration and cleaning, already highly challenging when considered separately, are significantly more difficult when tackled together, and a number of intriguing problems remain open or unexplored. How can the quality of data (its consistency, accuracy, timeliness and completeness) be captured? How can data be integrated automatically? In the presence of inconsistencies, how should data be chosen from the most accurate and up-to-date sources? How can inconsistencies in the integrated data be detected and removed effectively?

In response to this compelling need from both industry and scientific data management, the project will develop a principled basis and working tools for integrating and cleaning data. It will provide a new model, reasoning systems and complexity bounds for the analysis of data quality, as well as practical (approximation or heuristic) algorithms and techniques for conducting lossless integration, inconsistency detection and reconciliation, for data in traditional databases or in XML format. The novelty of the proposed research consists in the following.

1. Novel constraints to specify the consistency of data, data provenance analysis to determine and keep track of the accuracy and timeliness of the data, and lossless schema mappings to deal with the completeness of the data.
2. Automatic generation of lossless schema mappings in parallel with data provenance analysis.
3. Practical techniques for reasoning about constraints, including the analysis of constraint propagation from sources to the integrated data, in order to discover constraints on the integrated data and eliminate redundancies.
4. Efficient detection of errors, conflicts or inconsistencies in integrated data, based on automatic generation of detecting queries (a minimal sketch of constraint-based detection follows this abstract).
5. Effective methods for removing errors and inconsistencies from integrated data, based on the accuracy and timeliness of the data.

The project will lead to the first uniform system that both integrates and cleans data. It will produce research results of considerable interest to the international database theory and systems communities and beyond, to be published in top computer science journals and at major international database conferences. The project involves extensive collaboration between the University of Edinburgh and Bell Labs, Lucent Technologies. Bell Labs will provide a testbed for deploying and evaluating the system and tools to be developed, using real-world data from Lucent, so the project will generate immediate impact on industry. The tools will also find immediate applications in scientific data management, e.g. in the Generation Scotland project, a partnership between academics and the National Health Service in Scotland.
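By way of illustration, the conditional functional dependencies (CFDs) developed in this project (see the 2008 ACM TODS paper listed under Publications) extend classical functional dependencies with constant patterns, so that violations can be detected mechanically. The Python sketch below is illustrative only: the relation, attribute names and the rule itself are hypothetical examples, and it handles only constant patterns, not the general tuple-pair case.

# Hypothetical CFD: for UK records, area code "131" must map to city
# "Edinburgh". Roughly, in CFD notation:
#   ([country, area_code] -> [city], ('UK', '131' || 'Edinburgh'))

def cfd_violations(rows, pattern, conclusion):
    """Return rows that match the constant pattern but contradict the conclusion."""
    return [
        r for r in rows
        if all(r.get(a) == v for a, v in pattern.items())
        and any(r.get(a) != v for a, v in conclusion.items())
    ]

records = [
    {"country": "UK", "area_code": "131", "city": "Edinburgh"},  # consistent
    {"country": "UK", "area_code": "131", "city": "Glasgow"},    # violation
    {"country": "US", "area_code": "131", "city": "Portland"},   # pattern not applicable
]

for bad in cfd_violations(records,
                          pattern={"country": "UK", "area_code": "131"},
                          conclusion={"city": "Edinburgh"}):
    print("inconsistent record:", bad)

In practice such checks are compiled into automatically generated detecting queries over the integrated data rather than run in application code, as described in item 4 above.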

Publications

Fan W (2010) Relative information completeness in ACM Transactions on Database Systems

Fan W (2008) Conditional functional dependencies for capturing data inconsistencies in ACM Transactions on Database Systems

Fan W (2008) Information preserving XML schema embedding in ACM Transactions on Database Systems

Fan W (2008) Expressiveness and complexity of XML publishing transducers in ACM Transactions on Database Systems

Fan W (2011) Discovering Conditional Functional Dependencies in IEEE Transactions on Knowledge and Data Engineering

Cong G (2012) On the Complexity of View Update Analysis and Its Application to Annotation Propagation in IEEE Transactions on Knowledge and Data Engineering

Fan W (2008) Propagating functional dependencies with conditions in Proceedings of the VLDB Endowment

Fan W (2009) Reasoning about record matching rules in Proceedings of the VLDB Endowment

Cong G (2007) Improving Data Quality: Consistency and Accuracy in The 33rd International Conference on Very Large Data Bases (VLDB)

 
Description We have established a fundamental theory for each of five central issues of data quality: data consistency, data accuracy, data currency, information completeness and entity resolution. We have also developed a complete package of practical techniques for cleaning data, including methods and algorithms for (a) discovering data quality rules, (b) reasoning about the rules, (c) propagating data quality rules through data integration and transformations, (d) automatically detecting errors in the data, and (e) repairing data based on the rules (a naive sketch of (d) and (e) follows below).
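As a minimal, hypothetical illustration of steps (d) and (e), the sketch below (in the same style as the one following the abstract, and with the same made-up data) repairs tuples that violate a constant CFD by enforcing the rule's conclusion. The project's actual repair methods, such as the cost-based algorithm in the VLDB 2007 paper listed above, weigh alternative fixes against each other; this naive version simply trusts the rule over the data.

def repair_with_cfd(rows, pattern, conclusion):
    """Enforce the CFD's constant conclusion on every row matching its pattern."""
    repaired = []
    for r in rows:
        r = dict(r)  # copy, so the input is left untouched
        if all(r.get(a) == v for a, v in pattern.items()):
            r.update(conclusion)  # naive fix: overwrite conflicting attributes
        repaired.append(r)
    return repaired

dirty = [{"country": "UK", "area_code": "131", "city": "Glasgow"}]
print(repair_with_cfd(dirty,
                      pattern={"country": "UK", "area_code": "131"},
                      conclusion={"city": "Edinburgh"}))
# [{'country': 'UK', 'area_code': '131', 'city': 'Edinburgh'}]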

The success of the project is evidenced by (a) two best paper awards at major international database conferences; (b) invited tutorials on data cleaning at top-ranked international database theory and systems conferences, as well as invited public lectures; (c) 27+ publications in major database conferences and journals; (d) a functional prototype system for data cleaning; (e) four US patents filed on topics related to data cleaning; and (f) two PhD dissertations. The outcomes of the project have established a firm lead for UK researchers in the area of data cleaning.
Exploitation Route Engage industry. Several world-leading IT companies have already shown interest in using the research outcomes. Three US patents have been filed and granted.
Sectors Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare

URL http://homepages.inf.ed.ac.uk/wenfei/publication.html
 
Description Our dependency theory is being taught in database courses at several universities, and researchers in industry are following the approach. Several prototype systems have been developed and demonstrated at major database conferences.
Sector Digital/Communication/Information Technologies (including Software)
 
Description Bell Laboratories 
Organisation Bell Laboratories
Country United States 
Sector Private 
Start Year 2007
 