MAGIC: MAnaGing InComplete Data - New Foundations

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

In our data-driven world, one can hardly spend a day without using highly complex software systems that we have learned to rely on, but that at the same time are known to produce incorrect results. These are systems we have on our laptops, they power websites of companies, and they keep companies and government offices running. And yet the incorrect behaviour is built into them, it is a part of multiple standards, and very little effort is made to change things.

The systems are commercial DBMSs (database management systems). We can rely on them as long as information they store is complete. In an autonomous environment, this is a reasonable assumption, but these days data is generated by a huge number of users and applications, and its incompleteness is a fact of life. The moment incompleteness enters the picture, everything changes: unexpected behaviour occurs; queries that we teach students to write and give them full marks for stop working, and one loses trust in the results one gets from such data.

To make matters worse, many modern applications of data, including data integration, data exchange, ontology based data access, data quality, inconsistency management and a host of others, have incompleteness built into them, and try to rely on standard techniques for handling it. This inevitably leads to questionable, or sometimes plain incorrect results. The key reason behind this is the complexity vs correctness tradeoff: current techniques guaranteeing correctness carry a huge complexity price, and applications look for ways around it, sacrificing correctness in the process.

Our main goal is to end this sorry state of affairs. Correctness and efficiency can co-exist, but we need to develop new foundations of the field of incomplete information, and a new set of techniques based on these foundations, to reconcile the two.

To do so, we need to rethink the very basics of the field; crucially, we need to understand what it means to answer queries over incomplete data with correctness guarantees. The classical theory uses a single one-size-fits-all definition, that, upon a careful examination, does not appear to be universally correct. We have an approach that will let us develop a proper theory of correctness and apply it in standard data management tasks and their new applications as well.

It is crucial to make this approach practical. Commercial systems concentrate on efficient evaluation, sacrificing correctness. Our solutions for delivering correct answers will be implementable on top of existing systems, to guarantee their applicability. There will be no need to throw away existing products and techniques to take advantage of new approaches to handling incomplete information.

Our initial set of goals is to deliver solutions for fixing problems with commercial DBMSs, namely to show how to provide correctness guarantees at low and acceptable costs. We shall do so for a wide variety of queries, going far beyond what is now known to be possible. After that, we shall look at applications that will let us, for the first time, ensure correctness of very expressive queries over integrated and exchanged data. We shall expand further, both in terms of data models (e.g., graphs, XML, noSQL databases), and applications (inconsistent data, ontologies). We shall also look at solutions that take into account very large amounts of data, and produce approximate answers in those scenarios.

With the toolkit we develop, the "curse of incomplete information", i.e., the perceived impossibility of achieving correctness and efficiency simultaneously, should be a thing of the past.

Planned Impact

We expect this research to have an impact in a wide range of data analytics problems. Delivering understanding from information is one of the key themes not only for EPSRC but in general for today's data-driven world. Technologies unlocked by the big-data phenomenon are of key importance to the UK and are expected to bring £40B of economic benefit to the UK by 2017. While no one tried to quantify the cost of not having good techniques of handling uncertain data, evidence suggests that this cost is significant: for instance, imprecise and unclean data per se is estimated to cost US companies over $600B annually. The cost of relying on poor quality analytics run over such data therefore is bound to be huge.

We are planning activities to deliver the impact of this work in both private and public sectors, and in the academic world, going beyond the traditional data management community. In private sector, both traditional and new vendors of database software will benefit from our research, as well as vendors of products that crucially rely on incomplete data (for instance, those handling data integration). In public sector, the users of various types of data analytics software will be the primary beneficiaries. In research community, our results will have an impact across several disciplines including knowledge representation, knowledge discovery, algorithms, and the Semantic Web.

They will benefit in the following ways:

(a) The project will provide a clear guidance for designing new algorithms for querying incomplete data with correctness guarantees, and new principles for handling the complexity of evaluation vs the degree of correctness tradeoff;

(b) Our query evaluation algorithmic toolbox will help developers of database software better understand how to handle the notoriously hard problem of uncertain data;

(c) Our tools will provide applications that naturally use large amounts of incomplete data (e.g., data integration, inconsistent data, data translation)with new and reliable data handling techniques.

One of the key elements in delivering the impact is the right choice of partners who complement our work by providing both the necessary expertise and the added breadth, expanding both into industry and into different academic disciplines. These partners include an innovative data management company that fully understands the importance of incomplete information (LogicBlox Inc), a world leading group in the area of knowledge representation (Rome, La Sapienza), a group specialising in efficient processing of aggregate queries, which are of special importance in data analytics (Paris 7), and a leading Semantic Web group (Center for the Semantic Web Research).

To deliver the impact, we shall do the following:

- continue ongoing collaborations and establish new ones, especially with industry (some discussion with DBMS vendors are already happening); provide solutions at various stages for our industrial and academic collaborators to evaluate, to ensure that our tools and techniques are properly tested to guarantee their practicality;

- continue active interaction with leading researchers on data science, and publish results of our research in the very top venues to ensure maximum visibility; and

- deliver seminars, public lectures, as well as keynotes at conferences (for which the PI is regularly invited);

- at the end of the project, produce a book on new foundations of incompleteness.

Publications

10 25 50
publication icon
Amendola G (2018) Explainable Certain Answers

publication icon
Barcelo P (2016) Order-Invariant Types and Their Applications in Logical Methods in Computer Science

publication icon
Guagliardo P (2017) Correctness of SQL Queries on Databases with Nulls in ACM SIGMOD Record

publication icon
Guagliardo P (2017) A formal semantics of SQL queries, its validation, and applications in Proceedings of the VLDB Endowment

publication icon
Guagliardo P (2018) On the Codd semantics of SQL nulls in Information Systems

publication icon
Guagliardo P (2018) How Standard is the SQL Standard?

 
Description It is well known that traditional relational databases produced by companies such as Oracle, IBM, Microsoft, are very unreliable when it comes to handling incomplete data. This is often accepted as inevitable, as theoretical solutions that overcome these issues carry a very high complexity price. So far in this project we managed to bridge this very large gap between theory and practice in three possible ways. First, we showed that in many relevant cases, good solutions do exist, and they are computationally very efficient. They involve finding approximate but correct and reliable answers to many queries that would be problematic under current technologies. Second, we analyzed the main cause of problems with commercial database systems, namely their use of a 3-valued logic. We showed that this is not necessary, and languages can eliminate the use of this logic that is not natural for the programmer, and thus leads to many errors. Third, we looked at approximations with probabilistic guarantees, and showed that surprisingly small modification to standard query evaluation algorithms produce answers that are correct with very high probability.
Exploitation Route They provide algorithms that can be directly implemented in commercial databases, and restore trust in results of data analytics produced by such systems.
Sectors Digital/Communication/Information Technologies (including Software)

URL https://www.ed.ac.uk/informatics/news-events/stories/2017/researchers-propose-improvements-to-sql-databases
 
Description While fixing issues with the problematic handling of incomplete information in SQL, we had produce the first fully formal semantics of the core of the language. This was taken further by the leading vendor of graph databases, Neo technology, in collaboration with us, to produce a first fully formal semantics of its query language Cypher. The language is used and re-implemented by others, and now one has a proper definition of that language. This work is now being used as a foundation to propose a new standard language for querying graphs.
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Description Neo Technology - industry funding
Amount £150,000 (GBP)
Organisation Neo4j 
Start 01/2017 
End 09/2018
 
Description RS Wolfson
Amount £50,000 (GBP)
Organisation The Royal Society 
Sector Academic/University
Country United Kingdom
Start 01/2017 
End 12/2021
 
Description Neo technology (Neo4j) 
Organisation Neo4j
PI Contribution We established a joint research project with a leading vendor of graph databases, Neo Technology (based in the UK and Sweden; the name of their product is Neo4j). Our main goal was to produce a formal semantics of their query language Cypher and to make suggestions about further development of the language.
Collaborator Contribution In addition to committing significant amount of time of the core staff, Neo also funded a postdoc position in Edinburgh.
Impact Full formal semantics of the core language has been developed. A paper describing it will appear at SIGMOD 2018 in June.
Start Year 2017
 
Description Moderator of a session at the joint W3C/ISO standardization workshop on graph data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact I was moderating a session on the procedures and nodes of writing a new standard for query languages for graph databases. This activity will take place in an ISO committee, in collaboration with relevant W3C committees. The outcomes of the debate were presented to multiple interested industry players, and will form a part of final workshop recommendations.
Year(s) Of Engagement Activity 2019
 
Description making industry aware of issues with common products discovered in this project 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Talking to companies making customised installations of RDBMSs (e.g. Postgres Professional) to make them aware of issues with handling incomplete data, and our proposed solutions
Year(s) Of Engagement Activity 2017