MAGIC: MAnaGing InComplete Data - New Foundations

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

In our data-driven world, one can hardly spend a day without using highly complex software systems that we have learned to rely on, but that at the same time are known to produce incorrect results. These are systems we have on our laptops, they power websites of companies, and they keep companies and government offices running. And yet the incorrect behaviour is built into them, it is a part of multiple standards, and very little effort is made to change things.

The systems are commercial DBMSs (database management systems). We can rely on them as long as information they store is complete. In an autonomous environment, this is a reasonable assumption, but these days data is generated by a huge number of users and applications, and its incompleteness is a fact of life. The moment incompleteness enters the picture, everything changes: unexpected behaviour occurs; queries that we teach students to write and give them full marks for stop working, and one loses trust in the results one gets from such data.

To make matters worse, many modern applications of data, including data integration, data exchange, ontology based data access, data quality, inconsistency management and a host of others, have incompleteness built into them, and try to rely on standard techniques for handling it. This inevitably leads to questionable, or sometimes plain incorrect results. The key reason behind this is the complexity vs correctness tradeoff: current techniques guaranteeing correctness carry a huge complexity price, and applications look for ways around it, sacrificing correctness in the process.

Our main goal is to end this sorry state of affairs. Correctness and efficiency can co-exist, but we need to develop new foundations of the field of incomplete information, and a new set of techniques based on these foundations, to reconcile the two.

To do so, we need to rethink the very basics of the field; crucially, we need to understand what it means to answer queries over incomplete data with correctness guarantees. The classical theory uses a single one-size-fits-all definition, that, upon a careful examination, does not appear to be universally correct. We have an approach that will let us develop a proper theory of correctness and apply it in standard data management tasks and their new applications as well.

It is crucial to make this approach practical. Commercial systems concentrate on efficient evaluation, sacrificing correctness. Our solutions for delivering correct answers will be implementable on top of existing systems, to guarantee their applicability. There will be no need to throw away existing products and techniques to take advantage of new approaches to handling incomplete information.

Our initial set of goals is to deliver solutions for fixing problems with commercial DBMSs, namely to show how to provide correctness guarantees at low and acceptable costs. We shall do so for a wide variety of queries, going far beyond what is now known to be possible. After that, we shall look at applications that will let us, for the first time, ensure correctness of very expressive queries over integrated and exchanged data. We shall expand further, both in terms of data models (e.g., graphs, XML, noSQL databases), and applications (inconsistent data, ontologies). We shall also look at solutions that take into account very large amounts of data, and produce approximate answers in those scenarios.

With the toolkit we develop, the "curse of incomplete information", i.e., the perceived impossibility of achieving correctness and efficiency simultaneously, should be a thing of the past.

Planned Impact

We expect this research to have an impact in a wide range of data analytics problems. Delivering understanding from information is one of the key themes not only for EPSRC but in general for today's data-driven world. Technologies unlocked by the big-data phenomenon are of key importance to the UK and are expected to bring £40B of economic benefit to the UK by 2017. While no one tried to quantify the cost of not having good techniques of handling uncertain data, evidence suggests that this cost is significant: for instance, imprecise and unclean data per se is estimated to cost US companies over $600B annually. The cost of relying on poor quality analytics run over such data therefore is bound to be huge.

We are planning activities to deliver the impact of this work in both private and public sectors, and in the academic world, going beyond the traditional data management community. In private sector, both traditional and new vendors of database software will benefit from our research, as well as vendors of products that crucially rely on incomplete data (for instance, those handling data integration). In public sector, the users of various types of data analytics software will be the primary beneficiaries. In research community, our results will have an impact across several disciplines including knowledge representation, knowledge discovery, algorithms, and the Semantic Web.

They will benefit in the following ways:

(a) The project will provide a clear guidance for designing new algorithms for querying incomplete data with correctness guarantees, and new principles for handling the complexity of evaluation vs the degree of correctness tradeoff;

(b) Our query evaluation algorithmic toolbox will help developers of database software better understand how to handle the notoriously hard problem of uncertain data;

(c) Our tools will provide applications that naturally use large amounts of incomplete data (e.g., data integration, inconsistent data, data translation)with new and reliable data handling techniques.

One of the key elements in delivering the impact is the right choice of partners who complement our work by providing both the necessary expertise and the added breadth, expanding both into industry and into different academic disciplines. These partners include an innovative data management company that fully understands the importance of incomplete information (LogicBlox Inc), a world leading group in the area of knowledge representation (Rome, La Sapienza), a group specialising in efficient processing of aggregate queries, which are of special importance in data analytics (Paris 7), and a leading Semantic Web group (Center for the Semantic Web Research).

To deliver the impact, we shall do the following:

- continue ongoing collaborations and establish new ones, especially with industry (some discussion with DBMS vendors are already happening); provide solutions at various stages for our industrial and academic collaborators to evaluate, to ensure that our tools and techniques are properly tested to guarantee their practicality;

- continue active interaction with leading researchers on data science, and publish results of our research in the very top venues to ensure maximum visibility; and

- deliver seminars, public lectures, as well as keynotes at conferences (for which the PI is regularly invited);

- at the end of the project, produce a book on new foundations of incompleteness.

Publications

10 25 50
publication icon
Barcelo P (2016) Order-Invariant Types and Their Applications in Logical Methods in Computer Science

publication icon
Guagliardo P (2017) A formal semantics of SQL queries, its validation, and applications in Proceedings of the VLDB Endowment

publication icon
Guagliardo P (2017) Correctness of SQL Queries on Databases with Nulls in ACM SIGMOD Record

publication icon
Libkin L (2016) Certain answers as objects and knowledge in Artificial Intelligence

publication icon
Libkin L (2016) SQLâ??s Three-Valued Logic and Certain Answers in ACM Transactions on Database Systems