MAGIC: MAnaGing InComplete Data - New Foundations

Lead Research Organisation: University of Edinburgh

Department Name: Sch of Informatics

Abstract

In our data-driven world, one can hardly spend a day without using highly complex software systems that we have learned to rely on, but that at the same time are known to produce incorrect results. These are systems we have on our laptops, they power websites of companies, and they keep companies and government offices running. And yet the incorrect behaviour is built into them, it is a part of multiple standards, and very little effort is made to change things.

The systems are commercial DBMSs (database management systems). We can rely on them as long as information they store is complete. In an autonomous environment, this is a reasonable assumption, but these days data is generated by a huge number of users and applications, and its incompleteness is a fact of life. The moment incompleteness enters the picture, everything changes: unexpected behaviour occurs; queries that we teach students to write and give them full marks for stop working, and one loses trust in the results one gets from such data.

To make matters worse, many modern applications of data, including data integration, data exchange, ontology based data access, data quality, inconsistency management and a host of others, have incompleteness built into them, and try to rely on standard techniques for handling it. This inevitably leads to questionable, or sometimes plain incorrect results. The key reason behind this is the complexity vs correctness tradeoff: current techniques guaranteeing correctness carry a huge complexity price, and applications look for ways around it, sacrificing correctness in the process.

Our main goal is to end this sorry state of affairs. Correctness and efficiency can co-exist, but we need to develop new foundations of the field of incomplete information, and a new set of techniques based on these foundations, to reconcile the two.

To do so, we need to rethink the very basics of the field; crucially, we need to understand what it means to answer queries over incomplete data with correctness guarantees. The classical theory uses a single one-size-fits-all definition, that, upon a careful examination, does not appear to be universally correct. We have an approach that will let us develop a proper theory of correctness and apply it in standard data management tasks and their new applications as well.

It is crucial to make this approach practical. Commercial systems concentrate on efficient evaluation, sacrificing correctness. Our solutions for delivering correct answers will be implementable on top of existing systems, to guarantee their applicability. There will be no need to throw away existing products and techniques to take advantage of new approaches to handling incomplete information.

Our initial set of goals is to deliver solutions for fixing problems with commercial DBMSs, namely to show how to provide correctness guarantees at low and acceptable costs. We shall do so for a wide variety of queries, going far beyond what is now known to be possible. After that, we shall look at applications that will let us, for the first time, ensure correctness of very expressive queries over integrated and exchanged data. We shall expand further, both in terms of data models (e.g., graphs, XML, noSQL databases), and applications (inconsistent data, ontologies). We shall also look at solutions that take into account very large amounts of data, and produce approximate answers in those scenarios.

With the toolkit we develop, the "curse of incomplete information", i.e., the perceived impossibility of achieving correctness and efficiency simultaneously, should be a thing of the past.

Planned Impact

We expect this research to have an impact in a wide range of data analytics problems. Delivering understanding from information is one of the key themes not only for EPSRC but in general for today's data-driven world. Technologies unlocked by the big-data phenomenon are of key importance to the UK and are expected to bring £40B of economic benefit to the UK by 2017. While no one tried to quantify the cost of not having good techniques of handling uncertain data, evidence suggests that this cost is significant: for instance, imprecise and unclean data per se is estimated to cost US companies over $600B annually. The cost of relying on poor quality analytics run over such data therefore is bound to be huge.

We are planning activities to deliver the impact of this work in both private and public sectors, and in the academic world, going beyond the traditional data management community. In private sector, both traditional and new vendors of database software will benefit from our research, as well as vendors of products that crucially rely on incomplete data (for instance, those handling data integration). In public sector, the users of various types of data analytics software will be the primary beneficiaries. In research community, our results will have an impact across several disciplines including knowledge representation, knowledge discovery, algorithms, and the Semantic Web.

They will benefit in the following ways:

(a) The project will provide a clear guidance for designing new algorithms for querying incomplete data with correctness guarantees, and new principles for handling the complexity of evaluation vs the degree of correctness tradeoff;

(b) Our query evaluation algorithmic toolbox will help developers of database software better understand how to handle the notoriously hard problem of uncertain data;

(c) Our tools will provide applications that naturally use large amounts of incomplete data (e.g., data integration, inconsistent data, data translation) with new and reliable data handling techniques.

One of the key elements in delivering the impact is the right choice of partners who complement our work by providing both the necessary expertise and the added breadth, expanding both into industry and into different academic disciplines. These partners include an innovative data management company that fully understands the importance of incomplete information (LogicBlox Inc), a world leading group in the area of knowledge representation (Rome, La Sapienza), a group specialising in efficient processing of aggregate queries, which are of special importance in data analytics (Paris 7), and a leading Semantic Web group (Center for the Semantic Web Research).

To deliver the impact, we shall do the following:

- continue ongoing collaborations and establish new ones, especially with industry (some discussion with DBMS vendors are already happening); provide solutions at various stages for our industrial and academic collaborators to evaluate, to ensure that our tools and techniques are properly tested to guarantee their practicality;

- continue active interaction with leading researchers on data science, and publish results of our research in the very top venues to ensure maximum visibility; and

- deliver seminars, public lectures, as well as keynotes at conferences (for which the PI is regularly invited);

- at the end of the project, produce a book on new foundations of incompleteness.

Funded Value:

£1,140,110

Funded Period:

Sep 16 - Aug 22

Funder:

EPSRC

Project Status:

Closed

Project Category:

Fellowship

Project Reference:

EP/N023056/1

Principal Investigator:

Leonid Libkin

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Information & Knowledge Mgmt (100%)

Organisations

People	ORCID iD
Leonid Libkin (Principal Investigator / Fellow)

Publications

Author Name

Title Publication Date Published

|< < 1 2 3 4 > >|

10 25 50

Abiteboul S (2017) Research Directions for Principles of Data Management (Abridged) in ACM SIGMOD Record

Amendola G (2018) Explainable Certain Answers

Angles R (2021) PG-Keys: Keys for Property Graphs

Arenas M (2018) Encyclopedia of Big Data Technologies

Barcelo P (2016) Order-Invariant Types and Their Applications in Logical Methods in Computer Science

Calautti M (2018) An Operational Approach to Consistent Query Answering

Calautti M (2019) Counting Database Repairs under Primary Keys Revisited

Capodicasa C (2022) Design of a Diagnostic Immunoassay for Aflatoxin M1 Based on a Plant-Produced Antibody. in Toxins

Civili C (2018) Approximating Certainty in Querying Data and Metadata

Console M (2020) Reasoning about Measures of Unmeasurable Sets

Key Findings
Impact Summary
Further Funding
Collaboration
Engagement Activities


Description	It is well known that traditional relational databases produced by companies such as Oracle, IBM, Microsoft, are very unreliable when it comes to handling incomplete data. This is often accepted as inevitable, as theoretical solutions that overcome these issues carry a very high complexity price. So far in this project we managed to bridge this very large gap between theory and practice in five possible ways. First, we showed that in many relevant cases, good solutions do exist, and they are computationally very efficient. They involve finding approximate but correct and reliable answers to many queries that would be problematic under current technologies. Second, we analyzed the main cause of problems with commercial database systems, namely their use of a 3-valued logic. We showed that this is not necessary, and languages can eliminate the use of this logic that is not natural for the programmer, and thus leads to many errors. Initially shown for a simple language primarily used for teaching these days, this has now been extended to the full SQL, the most common database language used in industry. Third, we looked at approximations with probabilistic guarantees, and showed that surprisingly small modification to standard query evaluation algorithms produce answers that are correct with very high probability. While the latter two findings were for the very core of database query languages, we are now using these techniques to extend results to real-life queries as used by programmers on everyday basis. Fourth, we engaged with users of relational databases to understand how they expect to use incomplete data, and are in the process of coming up with concrete proposals based on the understanding of what real users want. Fifth, being active participants in the area of graph query language design, we are trying to introduce the discoveries of this project into the new generation of standards for query languages.
Exploitation Route	They provide algorithms that can be directly implemented in commercial databases, and restore trust in results of data analytics produced by such systems. They suggest language features that could be reflected in future standards and that are likely to eliminate several major sources of programmers' mistakes.
Sectors	Digital/Communication/Information Technologies (including Software) Education
URL	https://www.ed.ac.uk/informatics/news-events/stories/2017/researchers-propose-improvements-to-sql-databases


Description	While fixing issues with the problematic handling of incomplete information in SQL, we had produce the first fully formal semantics of the core of the language. This was taken further by the leading vendor of graph databases, Neo technology (as known then, now known as Neo4j), in collaboration with us, to produce a first fully formal semantics of its query language Cypher. The language is used and re-implemented by others (eg SAP, Amazon), and now one has a proper definition of that language. This work is now being used as a foundation to propose a new standard language for querying graphs. In 2019, the proposal for the new standard of the language (now called GQL) has been approved by the International Organization for Standards (ISO) and we are providing a formal semantics of it, resulting in significant language changes. This also affected the latest release of the SQL Standard (SQL 2023) which includes support for property graph querying. In addition, we provided the first accessible descriptions of the new graph facilities of the SQL standard and the GQL standard, enabling researchers to start analysing them. This work received multiple awards from both industry and academia.
Sector	Digital/Communication/Information Technologies (including Software)
Impact Types	Societal Economic


Description	Efficient Querying of Inconsistent Data
Amount	£606,439 (GBP)
Funding ID	EP/S003800/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	08/2018
End	08/2024


Description	Neo Technology - industry funding
Amount	£180,000 (GBP)
Organisation	Neo4j
Sector	Private
Country	United States
Start	01/2017
End	09/2021


Description	New generation of graph query languages
Amount	£59,999 (GBP)
Organisation	The Leverhulme Trust
Sector	Charity/Non Profit
Country	United Kingdom
Start	08/2022
End	03/2024


Description	RS Wolfson
Amount	£50,000 (GBP)
Organisation	The Royal Society
Sector	Charity/Non Profit
Country	United Kingdom
Start	01/2017
End	12/2021


Description	Data.world, nulls survey
Organisation	data.world, Inc.
Country	United States
Sector	Private
PI Contribution	We have designed and processed the results of a survey of database practitioners on the use of nulls and incomplete data in relational databases.
Collaborator Contribution	Using their network of customers, data.world helped us reach hundreds of database professionals to complete the survey. We actively collaborated with their chief scientist, who spent roughly 2 hours per week in the last 9 months on this project.
Impact	We have a much better understanding of industry's expectations of the treatments of nulls and are working on a report that will call for a significant refocus of this area of academic research as the result.
Start Year	2020


Description	LDBC
Organisation	Linked Data Benchmark Council (LDBC)
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	LDBC is a key international organisation supported by multiple industrial partners (Oracle, Neo4j, AWS, TigerGraph etc) in charge of bringing together industry and academia in developing new standards and benchmarks for graph databases. I chair one of its working groups, on formal semantics of query languages, and actively participate in two others (on treatment of null values and on property graph schemas).
Collaborator Contribution	They provide us with tools to facilitate collaboration.
Impact	The most visible one so far is the forthcoming SIGMOD 2021 paper on keys for property graphs. Others go via ISO.
Start Year	2020


Description	Neo technology (Neo4j)
Organisation	Neo4j
Country	United States
Sector	Private
PI Contribution	We established a joint research project with a leading vendor of graph databases, Neo Technology (based in the UK and Sweden; the name of their product is Neo4j). Our initial goal was to produce a formal semantics of their query language Cypher and to make suggestions about further development of the language. It was then expanded to the design of the new graph query language GQL.
Collaborator Contribution	In addition to committing significant amount of time of the core staff, Neo4j has provided funding continuously since 2017.
Impact	Full formal semantics of the core language has been developed. Paper describing appeared in SIGMOD 2018 and VLDB 2019. Since then the focus shifted to GQL (paper are to be written).
Start Year	2017


Description	Relational.AI
Organisation	RelationalAI
Country	United States
Sector	Private
PI Contribution	A new evaluation strategy for SQL queries based on 2-valued logic. It will be used in their implementation of SQL.
Collaborator Contribution	They will provide programming help and implement our ideas in an industrial product in a way that would be impossible to achieve in the context of a research prototype.
Impact	Once done, this will lead to a research paper describing the new implementation.
Start Year	2021


Description	Chair, Formal Semantics Working Group of the Linked Data Benchmark Council
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	The Linked Data Benchmark Council is an organisation that arranges work by academics on behalf of graph database vendors as well as groups that produce new standards for graph query languages. Since 2020, Leonid Libkin leads the formal semantics working group that comprises academics from the UK, France, Germany, Poland, and Chile, and that analyses the emerging standard of graph querying called GQL. The group works in close collaboration with companies such as Neo4j (UK/Sweden), Oracle and TigerGraph (US). Its contributions are already reflected in the new part of the SQL standard for querying graphs, SQL/PGQ.
Year(s) Of Engagement Activity	2020,2021,2022


Description	Member of INCITS and working groups on GQL and SQL/PGQ
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	Member of the US based standards group in charge of data management standards, in particular query languages: SQL, its graph extension SQL/PGQ and a new language for graphs GQL
Year(s) Of Engagement Activity	2022,2023


Description	Member of the SQL Standard ISO Committee (officially: ISO/IEC JTC1 SC32 WG3)
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	SQL is the main language of relational database systems, used by practically all business and governmental organizations. It is standardized by ISO (International Organization for Standardization). Since 2018, Leonid Libkin is a member of that committee, currently one of only 4 academics influencing the design of this ubiquitous query languages in such areas as handling graph queries and incomplete data.
Year(s) Of Engagement Activity	2018,2019,2020,2021,2022


Description	Moderator of a session at the joint W3C/ISO standardization workshop on graph data
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	I was moderating a session on the procedures and nodes of writing a new standard for query languages for graph databases. This activity will take place in an ISO committee, in collaboration with relevant W3C committees. The outcomes of the debate were presented to multiple interested industry players, and will form a part of final workshop recommendations.
Year(s) Of Engagement Activity	2019


Description	making industry aware of issues with common products discovered in this project
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	Talking to companies making customised installations of RDBMSs (e.g. Postgres Professional) to make them aware of issues with handling incomplete data, and our proposed solutions
Year(s) Of Engagement Activity	2017

Abstract

Planned Impact

Organisations

People

ORCID iD

Publications