MAGIC: MAnaGing InComplete Data - New Foundations

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

In our data-driven world, one can hardly spend a day without using highly complex software systems that we have learned to rely on, but that at the same time are known to produce incorrect results. These are systems we have on our laptops, they power websites of companies, and they keep companies and government offices running. And yet the incorrect behaviour is built into them, it is a part of multiple standards, and very little effort is made to change things.

The systems are commercial DBMSs (database management systems). We can rely on them as long as information they store is complete. In an autonomous environment, this is a reasonable assumption, but these days data is generated by a huge number of users and applications, and its incompleteness is a fact of life. The moment incompleteness enters the picture, everything changes: unexpected behaviour occurs; queries that we teach students to write and give them full marks for stop working, and one loses trust in the results one gets from such data.

To make matters worse, many modern applications of data, including data integration, data exchange, ontology based data access, data quality, inconsistency management and a host of others, have incompleteness built into them, and try to rely on standard techniques for handling it. This inevitably leads to questionable, or sometimes plain incorrect results. The key reason behind this is the complexity vs correctness tradeoff: current techniques guaranteeing correctness carry a huge complexity price, and applications look for ways around it, sacrificing correctness in the process.

Our main goal is to end this sorry state of affairs. Correctness and efficiency can co-exist, but we need to develop new foundations of the field of incomplete information, and a new set of techniques based on these foundations, to reconcile the two.

To do so, we need to rethink the very basics of the field; crucially, we need to understand what it means to answer queries over incomplete data with correctness guarantees. The classical theory uses a single one-size-fits-all definition, that, upon a careful examination, does not appear to be universally correct. We have an approach that will let us develop a proper theory of correctness and apply it in standard data management tasks and their new applications as well.

It is crucial to make this approach practical. Commercial systems concentrate on efficient evaluation, sacrificing correctness. Our solutions for delivering correct answers will be implementable on top of existing systems, to guarantee their applicability. There will be no need to throw away existing products and techniques to take advantage of new approaches to handling incomplete information.

Our initial set of goals is to deliver solutions for fixing problems with commercial DBMSs, namely to show how to provide correctness guarantees at low and acceptable costs. We shall do so for a wide variety of queries, going far beyond what is now known to be possible. After that, we shall look at applications that will let us, for the first time, ensure correctness of very expressive queries over integrated and exchanged data. We shall expand further, both in terms of data models (e.g., graphs, XML, noSQL databases), and applications (inconsistent data, ontologies). We shall also look at solutions that take into account very large amounts of data, and produce approximate answers in those scenarios.

With the toolkit we develop, the "curse of incomplete information", i.e., the perceived impossibility of achieving correctness and efficiency simultaneously, should be a thing of the past.

Planned Impact

We expect this research to have an impact in a wide range of data analytics problems. Delivering understanding from information is one of the key themes not only for EPSRC but in general for today's data-driven world. Technologies unlocked by the big-data phenomenon are of key importance to the UK and are expected to bring £40B of economic benefit to the UK by 2017. While no one tried to quantify the cost of not having good techniques of handling uncertain data, evidence suggests that this cost is significant: for instance, imprecise and unclean data per se is estimated to cost US companies over $600B annually. The cost of relying on poor quality analytics run over such data therefore is bound to be huge.

We are planning activities to deliver the impact of this work in both private and public sectors, and in the academic world, going beyond the traditional data management community. In private sector, both traditional and new vendors of database software will benefit from our research, as well as vendors of products that crucially rely on incomplete data (for instance, those handling data integration). In public sector, the users of various types of data analytics software will be the primary beneficiaries. In research community, our results will have an impact across several disciplines including knowledge representation, knowledge discovery, algorithms, and the Semantic Web.

They will benefit in the following ways:

(a) The project will provide a clear guidance for designing new algorithms for querying incomplete data with correctness guarantees, and new principles for handling the complexity of evaluation vs the degree of correctness tradeoff;

(b) Our query evaluation algorithmic toolbox will help developers of database software better understand how to handle the notoriously hard problem of uncertain data;

(c) Our tools will provide applications that naturally use large amounts of incomplete data (e.g., data integration, inconsistent data, data translation) with new and reliable data handling techniques.

One of the key elements in delivering the impact is the right choice of partners who complement our work by providing both the necessary expertise and the added breadth, expanding both into industry and into different academic disciplines. These partners include an innovative data management company that fully understands the importance of incomplete information (LogicBlox Inc), a world leading group in the area of knowledge representation (Rome, La Sapienza), a group specialising in efficient processing of aggregate queries, which are of special importance in data analytics (Paris 7), and a leading Semantic Web group (Center for the Semantic Web Research).

To deliver the impact, we shall do the following:

- continue ongoing collaborations and establish new ones, especially with industry (some discussion with DBMS vendors are already happening); provide solutions at various stages for our industrial and academic collaborators to evaluate, to ensure that our tools and techniques are properly tested to guarantee their practicality;

- continue active interaction with leading researchers on data science, and publish results of our research in the very top venues to ensure maximum visibility; and

- deliver seminars, public lectures, as well as keynotes at conferences (for which the PI is regularly invited);

- at the end of the project, produce a book on new foundations of incompleteness.
 
Description It is well known that traditional relational databases produced by companies such as Oracle, IBM, Microsoft, are very unreliable when it comes to handling incomplete data. This is often accepted as inevitable, as theoretical solutions that overcome these issues carry a very high complexity price. So far in this project we managed to bridge this very large gap between theory and practice in five possible ways.

First, we showed that in many relevant cases, good solutions do exist, and they are computationally very efficient. They involve finding approximate but correct and reliable answers to many queries that would be problematic under current technologies.

Second, we analyzed the main cause of problems with commercial database systems, namely their use of a 3-valued logic. We showed that this is not necessary, and languages can eliminate the use of this logic that is not natural for the programmer, and thus leads to many errors. Initially shown for a simple language primarily used for teaching these days, this has now been extended to the full SQL, the most common database language used in industry.

Third, we looked at approximations with probabilistic guarantees, and showed that surprisingly small modification to standard query evaluation algorithms produce answers that are correct with very high probability. While the latter two findings were for the very core of database query languages, we are now using these techniques to extend results to real-life queries as used by programmers on everyday basis.

Fourth, we engaged with users of relational databases to understand how they expect to use incomplete data, and are in the process of coming up with concrete proposals based on the understanding of what real users want.

Fifth, being active participants in the area of graph query language design, we are trying to introduce the discoveries of this project into the new generation of standards for query languages.
Exploitation Route They provide algorithms that can be directly implemented in commercial databases, and restore trust in results of data analytics produced by such systems. They suggest language features that could be reflected in future standards and that are likely to eliminate several major sources of programmers' mistakes.
Sectors Digital/Communication/Information Technologies (including Software)

Education

URL https://www.ed.ac.uk/informatics/news-events/stories/2017/researchers-propose-improvements-to-sql-databases
 
Description While fixing issues with the problematic handling of incomplete information in SQL, we had produce the first fully formal semantics of the core of the language. This was taken further by the leading vendor of graph databases, Neo technology (as known then, now known as Neo4j), in collaboration with us, to produce a first fully formal semantics of its query language Cypher. The language is used and re-implemented by others (eg SAP, Amazon), and now one has a proper definition of that language. This work is now being used as a foundation to propose a new standard language for querying graphs. In 2019, the proposal for the new standard of the language (now called GQL) has been approved by the International Organization for Standards (ISO) and we are providing a formal semantics of it, resulting in significant language changes. This also affected the latest release of the SQL Standard (SQL 2023) which includes support for property graph querying. In addition, we provided the first accessible descriptions of the new graph facilities of the SQL standard and the GQL standard, enabling researchers to start analysing them. This work received multiple awards from both industry and academia.
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Societal

Economic

 
Description Efficient Querying of Inconsistent Data
Amount £606,439 (GBP)
Funding ID EP/S003800/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 08/2018 
End 08/2024
 
Description Neo Technology - industry funding
Amount £180,000 (GBP)
Organisation Neo4j 
Sector Private
Country United States
Start 01/2017 
End 09/2021
 
Description New generation of graph query languages
Amount £59,999 (GBP)
Organisation The Leverhulme Trust 
Sector Charity/Non Profit
Country United Kingdom
Start 08/2022 
End 03/2024
 
Description RS Wolfson
Amount £50,000 (GBP)
Organisation The Royal Society 
Sector Charity/Non Profit
Country United Kingdom
Start 01/2017 
End 12/2021
 
Description Data.world, nulls survey 
Organisation data.world, Inc.
Country United States 
Sector Private 
PI Contribution We have designed and processed the results of a survey of database practitioners on the use of nulls and incomplete data in relational databases.
Collaborator Contribution Using their network of customers, data.world helped us reach hundreds of database professionals to complete the survey. We actively collaborated with their chief scientist, who spent roughly 2 hours per week in the last 9 months on this project.
Impact We have a much better understanding of industry's expectations of the treatments of nulls and are working on a report that will call for a significant refocus of this area of academic research as the result.
Start Year 2020
 
Description LDBC 
Organisation Linked Data Benchmark Council (LDBC)
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution LDBC is a key international organisation supported by multiple industrial partners (Oracle, Neo4j, AWS, TigerGraph etc) in charge of bringing together industry and academia in developing new standards and benchmarks for graph databases. I chair one of its working groups, on formal semantics of query languages, and actively participate in two others (on treatment of null values and on property graph schemas).
Collaborator Contribution They provide us with tools to facilitate collaboration.
Impact The most visible one so far is the forthcoming SIGMOD 2021 paper on keys for property graphs. Others go via ISO.
Start Year 2020
 
Description Neo technology (Neo4j) 
Organisation Neo4j
Country United States 
Sector Private 
PI Contribution We established a joint research project with a leading vendor of graph databases, Neo Technology (based in the UK and Sweden; the name of their product is Neo4j). Our initial goal was to produce a formal semantics of their query language Cypher and to make suggestions about further development of the language. It was then expanded to the design of the new graph query language GQL.
Collaborator Contribution In addition to committing significant amount of time of the core staff, Neo4j has provided funding continuously since 2017.
Impact Full formal semantics of the core language has been developed. Paper describing appeared in SIGMOD 2018 and VLDB 2019. Since then the focus shifted to GQL (paper are to be written).
Start Year 2017
 
Description Relational.AI 
Organisation RelationalAI
Country United States 
Sector Private 
PI Contribution A new evaluation strategy for SQL queries based on 2-valued logic. It will be used in their implementation of SQL.
Collaborator Contribution They will provide programming help and implement our ideas in an industrial product in a way that would be impossible to achieve in the context of a research prototype.
Impact Once done, this will lead to a research paper describing the new implementation.
Start Year 2021
 
Description Chair, Formal Semantics Working Group of the Linked Data Benchmark Council 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact The Linked Data Benchmark Council is an organisation that arranges work by academics on behalf of graph database vendors as well as groups that produce new standards for graph query languages. Since 2020, Leonid Libkin leads the formal semantics working group that comprises academics from the UK, France, Germany, Poland, and Chile, and that analyses the emerging standard of graph querying called GQL. The group works in close collaboration with companies such as Neo4j (UK/Sweden), Oracle and TigerGraph (US). Its contributions are already reflected in the new part of the SQL standard for querying graphs, SQL/PGQ.
Year(s) Of Engagement Activity 2020,2021,2022
 
Description Member of INCITS and working groups on GQL and SQL/PGQ 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Member of the US based standards group in charge of data management standards, in particular query languages: SQL, its graph extension SQL/PGQ and a new language for graphs GQL
Year(s) Of Engagement Activity 2022,2023
 
Description Member of the SQL Standard ISO Committee (officially: ISO/IEC JTC1 SC32 WG3) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact SQL is the main language of relational database systems, used by practically all business and governmental organizations. It is standardized by ISO (International Organization for Standardization). Since 2018, Leonid Libkin is a member of that committee, currently one of only 4 academics influencing the design of this ubiquitous query languages in such areas as handling graph queries and incomplete data.
Year(s) Of Engagement Activity 2018,2019,2020,2021,2022
 
Description Moderator of a session at the joint W3C/ISO standardization workshop on graph data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact I was moderating a session on the procedures and nodes of writing a new standard for query languages for graph databases. This activity will take place in an ISO committee, in collaboration with relevant W3C committees. The outcomes of the debate were presented to multiple interested industry players, and will form a part of final workshop recommendations.
Year(s) Of Engagement Activity 2019
 
Description making industry aware of issues with common products discovered in this project 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Talking to companies making customised installations of RDBMSs (e.g. Postgres Professional) to make them aware of issues with handling incomplete data, and our proposed solutions
Year(s) Of Engagement Activity 2017