Schema Mappings and Automated Services for Data Integration and Exchange

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

This project, which is predominantly in the area of database theory, deals with schema mappings in the context of data exchange and data integration. Data exchange is the problem of inserting data structured under one or more source schemas into a target schema of different structure (possibly with integrity constraints) while reflecting the source data as accurately as possible. Data integration is the problem of answering queries formulated over a target schema that integrates the information provided by several data sources over one or more source schemas. In order to achieve significant progress, we will study classes of schema mappings for which these two problems become tractable and practicable. Rather than restricting ourselves to the study of single schema mappings between one source and one target schema, we will also consider more general settings, where several schemas are connected by various schema mappings and their compositions. We will start from the framework where the integrity constraints on schema mappings and databases are given by arbitrary embedded dependencies. This setting is far too general and needs to be restricted in order to obtain tractability. In the context of a comprehensive complexity analysis, we will therefore identify critical parameters, analyse restrictions (e.g. imposing bounds on some parameters), and investigate tractable settings. Based on these results, we will then design new algorithms for data exchange and integration. We will study how data exchange and integration can best be performed via Web services and will develop a corresponding model. Finally, we will implement and test our new algorithms in this context.

The project is organised into four main work packages. WP1 studies the efficient computation of solutions to data exchange problems, mainly through new variants of the chase procedure. WP2 focuses on the efficient computation of succinct solutions, so-called cores, and on the study of algebraic properties of schema mappings. WP3 consists of a comprehensive complexity study tracing the tractability frontier for data exchange problems based on various parameters; in WP3, we will also develop heuristics for dealing with computationally hard problems. In WP4, we will elaborate a new model for using Web service technology for data integration and exchange, implement and test our new algorithms and methods in that model, and test the feasibility of the overall approach.

The scientific project staff will consist of the applicant (part time), one post-doc, and two doctoral students. While both students are expected to co-operate intensively with each other, one will do more theory-oriented work, while the other will spend more time on developing the Web-service model for data integration and exchange. We plan to publish the results in top database journals and at leading international conferences. The Web services will be made accessible to the public.
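To make the key notions concrete, here is a minimal Python sketch of a single chase step for one source-to-target tuple-generating dependency (st-tgd). The relation names (Emp, Works, Manages) and the example dependency are hypothetical illustrations chosen for this sketch, not artefacts of the project:

    import itertools

    # Source instance: relation name -> set of tuples.
    source = {"Emp": {("alice", "sales"), ("bob", "hr")}}

    # Example st-tgd:
    #   Emp(name, dept) -> exists m: Works(name, m) & Manages(m, dept)
    # The existential variable m is instantiated with fresh labelled nulls.
    _null_counter = itertools.count()

    def fresh_null():
        """Return a fresh labelled null (a placeholder for an unknown value)."""
        return f"_N{next(_null_counter)}"

    def chase_st_tgd(source_inst):
        """Apply the example st-tgd to every source tuple, producing a
        canonical target instance (a universal solution for this mapping)."""
        target = {"Works": set(), "Manages": set()}
        for name, dept in source_inst["Emp"]:
            m = fresh_null()  # witness for the existentially quantified manager
            target["Works"].add((name, m))
            target["Manages"].add((m, dept))
        return target

    print(chase_st_tgd(source))

The labelled nulls (_N0, _N1, ...) stand for unknown values introduced by existential quantification; computing succinct solutions (the cores of WP2) amounts to eliminating redundant nulls from such chase results.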

Publications

Abiteboul S (2011) Distributed XML design in Journal of Computer and System Sciences

Abiteboul S (2009) Distributed XML design

Cali A (2010) Query Rewriting under Non-Guarded Rules

Cali A (2010) Advanced Processing for Ontological Queries

Cali A (2009) Dynamic Query Optimization under Access Limitations and Dependencies in Journal of Universal Computer Science

Cali A (2010) Querying the deep web

Bárány V (2009) Computer Science Logic

 
Description Data exchange is the problem of inserting data structured under a source schema into a target schema of different structure (possibly with integrity constraints), while reflecting the source data as accurately as possible. In this project we have significantly advanced the field of data exchange by solving previously open questions and by designing and implementing new algorithms. Our results were published at top conferences (ACM PODS, VLDB) and in top journals. In particular, two papers were published in the prestigious Journal of the ACM (JACM), which is seen by many as the top journal of Computer Science. Another journal paper was published in the VLDB Journal, which is one of the two top journals in the database area. The main findings are:

1. We have solved an open problem on core computation by proving that cores of database instances generated in the data exchange process (via weakly acyclic tuple-generating dependencies and equality-generating dependencies) are computable in polynomial time. BACKGROUND: In their paper "Data Exchange: Getting to the Core" (ACM PODS'03), Fagin, Kolaitis, and Popa proposed an elegant solution to the data exchange problem that is based on the notion of the core of a structure and of the core of a data exchange problem. They also studied computational issues related to this approach. We proved that such cores are computable in polynomial time in the presence of weakly acyclic tuple-generating dependencies, which answered an important open problem posed by Fagin and colleagues. The results were published in JACM.

2. We have developed new methods and algorithms for the normalisation and optimisation of schema mappings and have implemented these algorithms. Schema mappings are high-level specifications that describe the relationship between two database schemas. They are an important tool in several areas of database research, notably in data integration and data exchange. However, a concrete theory of schema mapping optimisation, including the formulation of optimality criteria and the construction of algorithms for computing optimal schema mappings, had been completely lacking to date. The goal of this work was to fill this gap. We presented a system of rewrite rules to minimise sets of source-to-target tuple-generating dependencies (st-tgds, for short). Moreover, we showed that the result of this minimisation is unique up to variable renaming; hence, our optimisation also yields a schema mapping normalisation. By appropriately extending our rewrite rule system, we also provided a normalisation of schema mappings containing equality-generating target dependencies (egds). An important application of such a normalisation is in defining the semantics of query answering in data exchange, since several definitions in this area depend on the concrete syntactic representation of the st-tgds. This is, in particular, the case for queries with negated atoms and for aggregate queries. The normalisation of schema mappings allows us to eliminate the effect of the concrete syntactic representation of the st-tgds from the semantics of query answering. A tool for tgd normalisation was implemented and tested. The results were published in the VLDB Journal.
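As a concrete illustration of the core notion in finding 1, the following brute-force Python sketch repeatedly searches for an endomorphism that maps an instance onto a strictly smaller subinstance, stopping when none exists. It runs in exponential time and is purely didactic; the polynomial-time algorithm from our JACM paper is substantially more involved. The relation name Works and the example facts are hypothetical:

    import itertools

    def is_null(v):
        # Labelled nulls are marked with the (hypothetical) prefix "_N".
        return isinstance(v, str) and v.startswith("_N")

    def apply_hom(h, instance):
        # Apply a value mapping h (identity on constants) to every fact.
        return {(rel, tuple(h.get(v, v) for v in tup)) for rel, tup in instance}

    def core(instance):
        instance = set(instance)
        while True:
            nulls = sorted({v for _, tup in instance for v in tup if is_null(v)})
            dom = sorted({v for _, tup in instance for v in tup})
            shrunk = False
            # Brute force: try every mapping of nulls into the active domain.
            for values in itertools.product(dom, repeat=len(nulls)):
                h = dict(zip(nulls, values))
                image = apply_hom(h, instance)
                # h is an endomorphism iff every mapped fact is already a fact;
                # a strictly smaller image means the instance is not yet a core.
                if image <= instance and len(image) < len(instance):
                    instance, shrunk = image, True
                    break
            if not shrunk:
                return instance

    # Works(alice, _N0) is subsumed by Works(alice, bob), so the null folds away.
    I = {("Works", ("alice", "_N0")), ("Works", ("alice", "bob"))}
    print(core(I))  # {('Works', ('alice', 'bob'))}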
3. We have studied how schema mappings can be automatically synthesised from existing data. In particular, we introduced a theoretical framework for discovering relationships between two relational database instances over distinct and unknown schemata. This framework is grounded in the context of data exchange. We formalised the problem of understanding the relationship between two instances as that of obtaining a schema mapping such that a minimum repair of this mapping provides a perfect description of the target instance given the source instance. We showed that this definition yields intuitive results when applied to database instances derived from each other by basic operations. We studied the complexity of decision problems related to this optimality notion in the context of different logical languages and showed that, even in very restricted cases, the problem is of high complexity. The results were published in JACM.
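To convey the flavour of finding 3 without its technical machinery, here is a toy Python sketch that searches a tiny space of candidate full st-tgds (column permutations of a single source relation) and scores each candidate by how far its chase result is from the target instance; a score of 0 corresponds to a perfect description. The relations, the candidate space, and the scoring function are hypothetical simplifications of the actual framework:

    import itertools

    source = {("alice", "sales"), ("bob", "hr")}   # S(name, dept)
    target = {("sales", "alice"), ("hr", "bob")}   # T(dept, name)

    def chase_with(perm, inst):
        # Candidate tgd  S(x0, x1) -> T(x_perm(0), x_perm(1)).
        return {tuple(t[i] for i in perm) for t in inst}

    def score(candidate, tgt):
        # Size of the symmetric difference: 0 means a perfect description.
        return len(candidate ^ tgt)

    best = min(itertools.permutations(range(2)),
               key=lambda p: score(chase_with(p, source), target))
    print(best, score(chase_with(best, source), target))  # (1, 0) 0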
Exploitation Route Apart from the uses described here, we currently see no further non-academic uses. Various companies (including IBM) have shown interest in our algorithms. There is not yet a large market for commercial data exchange software tools; however, we hope that our algorithms will be used once data exchange technology is ripe for commercialisation. We then hope to deliver software, services, and consulting to industry. Dr Bruno Marnette, whose doctoral studies were supported by this project, was hired by the British investment capital firm "Winton Capital" as a database expert. At Winton, Dr Marnette uses the skills acquired in this EPSRC project to solve data management problems. Our project has thus helped provide highly skilled staff to the British financial industry, strengthening it technologically and making it more competitive.
Sectors Digital/Communication/Information Technologies (including Software)

Financial Services and Management Consultancy

 
Description European Research Council
Amount £2,074,000 (GBP)
Funding ID ERC Advanced Investigators Grant, Project Nr. 246858: DIADEM 
Organisation European Research Council (ERC) 
Sector Public
Country Belgium
Start 03/2010 
End 03/2015
 