MAGIC: MAnaGing InComplete Data - New Foundations
Lead Research Organisation: University of Edinburgh
Department Name: School of Informatics
Abstract
In our data-driven world, one can hardly spend a day without using highly complex software systems that we have learned to rely on, but that are at the same time known to produce incorrect results. These systems sit on our laptops, power company websites, and keep businesses and government offices running. And yet the incorrect behaviour is built into them: it is part of multiple standards, and very little effort is made to change things.
The systems in question are commercial DBMSs (database management systems). We can rely on them as long as the information they store is complete. In an autonomous environment this is a reasonable assumption, but these days data is generated by a huge number of users and applications, and its incompleteness is a fact of life. The moment incompleteness enters the picture, everything changes: unexpected behaviour occurs, queries that we teach students to write and give them full marks for stop working, and one loses trust in the results obtained from such data.
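To see concretely what goes wrong, consider a standard illustration (the tables and columns below are our own toy example, not tied to any particular system). The following textbook query asks for customers with no orders:

    SELECT c.name
    FROM customers c
    WHERE c.id NOT IN (SELECT o.customer_id FROM orders o);

    -- If even a single orders.customer_id is NULL, SQL's 3-valued logic
    -- makes the NOT IN condition evaluate to "unknown" for every customer,
    -- so the query silently returns no rows at all.

A query that is perfectly correct on complete data thus stops working the moment a single null appears.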
To make matters worse, many modern applications of data, including data integration, data exchange, ontology-based data access, data quality, inconsistency management and a host of others, have incompleteness built into them, and try to rely on standard techniques for handling it. This inevitably leads to questionable, or sometimes plainly incorrect, results. The key reason is the complexity vs correctness tradeoff: current techniques that guarantee correctness carry a huge complexity price, and applications look for ways around it, sacrificing correctness in the process.
Our main goal is to end this sorry state of affairs. Correctness and efficiency can co-exist, but we need to develop new foundations of the field of incomplete information, and a new set of techniques based on these foundations, to reconcile the two.
To do so, we need to rethink the very basics of the field; crucially, we need to understand what it means to answer queries over incomplete data with correctness guarantees. The classical theory uses a single one-size-fits-all definition that, upon careful examination, does not appear to be universally correct. We have an approach that will let us develop a proper theory of correctness and apply it both in standard data management tasks and in their new applications.
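That classical definition is the notion of certain answers: tuples that are answers in every complete database the incomplete one may represent. Even against this yardstick, standard evaluation misbehaves. In the following toy example (tables are ours, for exposition only), SQL returns an answer that is not certain:

    -- R(A) and S(A) each contain a single row whose A-value is NULL.
    SELECT r.A
    FROM R r
    WHERE NOT EXISTS (SELECT 1 FROM S s WHERE s.A = r.A);

    -- The comparison NULL = NULL evaluates to "unknown", so the subquery
    -- selects nothing, NOT EXISTS holds, and SQL returns R's row. Yet if
    -- the two nulls happen to stand for the same value, the true answer
    -- is empty: SQL produces a false positive, an answer that holds in
    -- some completions of the data but not in all of them.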
It is crucial to make this approach practical. Commercial systems concentrate on efficient evaluation, sacrificing correctness. Our solutions for delivering correct answers will be implementable on top of existing systems, to guarantee their applicability. There will be no need to throw away existing products and techniques to take advantage of new approaches to handling incomplete information.
Our initial goal is to deliver solutions that fix these problems in commercial DBMSs, namely to show how to provide correctness guarantees at low, acceptable cost. We shall do so for a wide variety of queries, going far beyond what is now known to be possible. After that, we shall look at applications that will let us, for the first time, ensure correctness of very expressive queries over integrated and exchanged data. We shall expand further, both in terms of data models (e.g., graphs, XML, NoSQL databases) and applications (inconsistent data, ontologies). We shall also look at solutions that take into account very large amounts of data and produce approximate answers in those scenarios.
With the toolkit we develop, the "curse of incomplete information", i.e., the perceived impossibility of achieving correctness and efficiency simultaneously, should be a thing of the past.
Planned Impact
We expect this research to have an impact on a wide range of data analytics problems. Delivering understanding from information is one of the key themes not only for EPSRC but for today's data-driven world in general. Technologies unlocked by the big-data phenomenon are of key importance and are expected to bring £40B of economic benefit to the UK by 2017. While no one has tried to quantify the cost of lacking good techniques for handling uncertain data, evidence suggests that this cost is significant: for instance, imprecise and unclean data per se is estimated to cost US companies over $600B annually. The cost of relying on poor-quality analytics run over such data is therefore bound to be huge.
We are planning activities to deliver the impact of this work in both the private and public sectors, and in the academic world, going beyond the traditional data management community. In the private sector, both traditional and new vendors of database software will benefit from our research, as well as vendors of products that crucially rely on incomplete data (for instance, those handling data integration). In the public sector, the users of various types of data analytics software will be the primary beneficiaries. In the research community, our results will have an impact across several disciplines, including knowledge representation, knowledge discovery, algorithms, and the Semantic Web.
They will benefit in the following ways:
(a) The project will provide clear guidance for designing new algorithms for querying incomplete data with correctness guarantees, and new principles for handling the tradeoff between evaluation complexity and degree of correctness;
(b) Our query evaluation algorithmic toolbox will help developers of database software better understand how to handle the notoriously hard problem of uncertain data;
(c) Our tools will provide applications that naturally use large amounts of incomplete data (e.g., data integration, inconsistent data, data translation) with new and reliable data handling techniques.
One of the key elements in delivering the impact is the right choice of partners who complement our work by providing both the necessary expertise and added breadth, expanding both into industry and into different academic disciplines. These partners include an innovative data management company that fully understands the importance of incomplete information (LogicBlox Inc), a world-leading group in knowledge representation (Sapienza University of Rome), a group specialising in efficient processing of aggregate queries, which are of special importance in data analytics (Paris 7), and a leading Semantic Web group (Center for the Semantic Web Research).
To deliver the impact, we shall do the following:
- continue ongoing collaborations and establish new ones, especially with industry (some discussions with DBMS vendors are already happening); provide solutions at various stages for our industrial and academic collaborators to evaluate, to ensure that our tools and techniques are properly tested and practical;
- continue active interaction with leading researchers in data science, and publish the results of our research in the very top venues to ensure maximum visibility;
- deliver seminars, public lectures, and keynotes at conferences (for which the PI is regularly invited); and
- at the end of the project, produce a book on new foundations of incompleteness.
Organisations
- University of Edinburgh (Fellow, Lead Research Organisation)
- RelationalAI (Collaboration)
- data.world, Inc. (Collaboration)
- Neo4j (Collaboration)
- Linked Data Benchmark Council (LDBC) (Collaboration)
- Logicblox (Project Partner)
- Centre for Semantic Web Research (Project Partner)
- Sapienza University of Rome (Project Partner)
- Laboratoire d'Informatique Algorithmique: Fondements et Applications (Project Partner)
People
Leonid Libkin (Principal Investigator / Fellow)
Publications
- Abiteboul S (2017) Research Directions for Principles of Data Management (Abridged), in ACM SIGMOD Record
- Amendola G (2018) Explainable Certain Answers
- Angles R (2021) PG-Keys: Keys for Property Graphs
- Arenas M (2018) Encyclopedia of Big Data Technologies
- Barcelo P (2016) Order-Invariant Types and Their Applications, in Logical Methods in Computer Science
- Calautti M (2018) An Operational Approach to Consistent Query Answering
- Calautti M (2019) Counting Database Repairs under Primary Keys Revisited
- Civili C (2018) Approximating Certainty in Querying Data and Metadata
- Console M (2020) Reasoning about Measures of Unmeasurable Sets
- Console M (2020) Epistemic Integrity Constraints for Ontology-Based Data Management, in Proceedings of the AAAI Conference on Artificial Intelligence
- Console M (2020) Coping with Incomplete Data: Recent Advances
- Console M (2019) Propositional and Predicate Logics of Incomplete Information
- Console M (2019) Measuring the Likelihood of Numerical Constraints
- Console M (2020) Queries with Arithmetic on Incomplete Databases
- Console M (2022) Propositional and predicate logics of incomplete information, in Artificial Intelligence
- Console M (2022) Fragments of bag relational algebra: Expressiveness and certain answers, in Information Systems
- Deutsch A (2022) Graph Pattern Matching in GQL and SQL/PGQ
- Francis N (2017) Schema Mappings for Data Graphs
- Guagliardo P (2017) A formal semantics of SQL queries, its validation, and applications, in Proceedings of the VLDB Endowment
- Guagliardo P (2018) How Standard is the SQL Standard?
- Guagliardo P (2017) On the Codd Semantics of SQL Nulls
- Guagliardo P (2019) On the Codd semantics of SQL nulls, in Information Systems
- Guagliardo P (2016) Making SQL Queries Correct on Incomplete Databases
- Guagliardo P (2017) Correctness of SQL Queries on Databases with Nulls, in ACM SIGMOD Record
- Libkin L (2016) Web Reasoning and Rule Systems
- Libkin L (2016) SQL's Three-Valued Logic and Certain Answers, in ACM Transactions on Database Systems
- Libkin L (2017) Technical Perspective: Data distribution for fast joins, in Communications of the ACM
- Libkin L (2016) Certain answers as objects and knowledge, in Artificial Intelligence
- Libkin L (2018) TriAL: A Navigational Algebra for RDF Triplestores, in ACM Transactions on Database Systems
- Libkin L (2018) Encyclopedia of Database Systems
- Libkin L (2018) Certain Answers Meet Zero-One Laws
- Console M (2019) Fragments of Bag Relational Algebra: Expressiveness and Certain Answers, in 22nd International Conference on Database Theory (ICDT 2019)
- Toussaint E (2020) Knowledge-Preserving Certain Answers for SQL-like Queries
Description | It is well known that traditional relational databases produced by companies such as Oracle, IBM, and Microsoft are very unreliable when it comes to handling incomplete data. This is often accepted as inevitable, since theoretical solutions that overcome these issues carry a very high complexity price. So far in this project we have managed to bridge this large gap between theory and practice in five ways. First, we showed that in many relevant cases good solutions do exist, and they are computationally very efficient. They involve finding approximate but correct and reliable answers to many queries that would be problematic under current technologies. Second, we analysed the main cause of problems with commercial database systems, namely their use of a 3-valued logic. We showed that this is not necessary: languages can eliminate the use of this logic, which is unnatural for programmers and leads to many errors. Initially shown for a simple language primarily used for teaching, this has now been extended to full SQL, the most common database language used in industry. Third, we looked at approximations with probabilistic guarantees, and showed that surprisingly small modifications to standard query evaluation algorithms produce answers that are correct with very high probability. While the latter two findings were for the very core of database query languages, we are now using these techniques to extend the results to real-life queries as used by programmers on an everyday basis. Fourth, we engaged with users of relational databases to understand how they expect to use incomplete data, and are in the process of producing concrete proposals based on this understanding of what real users want. Fifth, being active participants in the area of graph query language design, we are introducing the discoveries of this project into the new generation of query language standards. (A small illustrative SQL sketch of the rewriting idea follows this record.) |
Exploitation Route | These findings provide algorithms that can be directly implemented in commercial databases, restoring trust in the results of data analytics produced by such systems. They also suggest language features that could be reflected in future standards and are likely to eliminate several major sources of programmer mistakes. |
Sectors | Digital/Communication/Information Technologies (including Software) Education |
URL | https://www.ed.ac.uk/informatics/news-events/stories/2017/researchers-propose-improvements-to-sql-databases |
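To give a flavour of the approach described above, here is a minimal sketch of the rewriting idea (a toy example of our own; the project's actual algorithms are more general). The problematic pattern is the NOT IN query from the abstract; the rewritten version, evaluated by a standard SQL engine, returns only answers that hold in every completion of the data:

    -- Original query: silently returns no rows once any
    -- orders.customer_id is NULL.
    SELECT c.name
    FROM customers c
    WHERE c.id NOT IN (SELECT o.customer_id FROM orders o);

    -- Guarded rewriting (an under-approximation of the certain answers):
    -- a customer is returned only if no order could possibly refer to it,
    -- under any interpretation of the NULLs.
    SELECT c.name
    FROM customers c
    WHERE c.id IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM orders o
                      WHERE o.customer_id = c.id
                         OR o.customer_id IS NULL);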
Description | While fixing issues with the problematic handling of incomplete information in SQL, we produced the first fully formal semantics of the core of the language. This was taken further, in collaboration with us, by the leading vendor of graph databases, Neo Technology (as it was then known; now Neo4j), to produce the first fully formal semantics of its query language Cypher. The language is used and re-implemented by others (e.g. SAP, Amazon), and one now has a proper definition of it. This work is being used as a foundation for a proposed new standard language for querying graphs. In 2019, the proposal for the new standard (now called GQL) was approved by the International Organization for Standardization (ISO), and we are providing a formal semantics of it, resulting in significant language changes. This also affected the latest release of the SQL standard (SQL:2023), which includes support for property graph querying. In addition, we provided the first accessible descriptions of the new graph facilities of the SQL standard and the GQL standard, enabling researchers to start analysing them. This work received multiple awards from both industry and academia. (An illustrative SQL/PGQ query sketch follows this record.) |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Societal Economic |
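For illustration, a property graph query in the new SQL/PGQ part of the SQL standard looks roughly as follows (a sketch based on public descriptions of SQL:2023; the graph, labels, and property names are hypothetical):

    SELECT gt.friend_name
    FROM GRAPH_TABLE (
      social_graph
      MATCH (a IS Person)-[k IS KNOWS]->(b IS Person)
      WHERE a.name = 'Ada'
      COLUMNS (b.name AS friend_name)
    ) AS gt;

The same pattern-matching core (MATCH, label tests, path patterns) is shared with the standalone GQL standard, which is why a single formal semantics effort informs both languages.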
Description | Efficient Querying of Inconsistent Data |
Amount | £606,439 (GBP) |
Funding ID | EP/S003800/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 08/2018 |
End | 08/2024 |
Description | Neo Technology - industry funding |
Amount | £180,000 (GBP) |
Organisation | Neo4j |
Sector | Private |
Country | United States |
Start | 01/2017 |
End | 09/2021 |
Description | New generation of graph query languages |
Amount | £59,999 (GBP) |
Organisation | The Leverhulme Trust |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 08/2022 |
End | 03/2024 |
Description | RS Wolfson |
Amount | £50,000 (GBP) |
Organisation | The Royal Society |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 01/2017 |
End | 12/2021 |
Description | Data.world, nulls survey |
Organisation | data.world, Inc. |
Country | United States |
Sector | Private |
PI Contribution | We have designed and processed the results of a survey of database practitioners on the use of nulls and incomplete data in relational databases. |
Collaborator Contribution | Using their network of customers, data.world helped us reach hundreds of database professionals to complete the survey. We actively collaborated with their chief scientist, who spent roughly 2 hours per week in the last 9 months on this project. |
Impact | We now have a much better understanding of industry's expectations regarding the treatment of nulls, and as a result are working on a report that will call for a significant refocus of this area of academic research. |
Start Year | 2020 |
Description | LDBC |
Organisation | Linked Data Benchmark Council (LDBC) |
Country | United Kingdom |
Sector | Charity/Non Profit |
PI Contribution | LDBC is a key international organisation supported by multiple industrial partners (Oracle, Neo4j, AWS, TigerGraph etc) in charge of bringing together industry and academia in developing new standards and benchmarks for graph databases. I chair one of its working groups, on formal semantics of query languages, and actively participate in two others (on treatment of null values and on property graph schemas). |
Collaborator Contribution | They provide us with tools to facilitate collaboration. |
Impact | The most visible one so far is the forthcoming SIGMOD 2021 paper on keys for property graphs. Others go via ISO. |
Start Year | 2020 |
Description | Neo technology (Neo4j) |
Organisation | Neo4j |
Country | United States |
Sector | Private |
PI Contribution | We established a joint research project with a leading vendor of graph databases, Neo Technology (based in the UK and Sweden; the name of their product is Neo4j). Our initial goal was to produce a formal semantics of their query language Cypher and to make suggestions about further development of the language. It was then expanded to the design of the new graph query language GQL. |
Collaborator Contribution | In addition to committing a significant amount of core staff time, Neo4j has provided funding continuously since 2017. |
Impact | A full formal semantics of the core language has been developed. Papers describing it appeared in SIGMOD 2018 and VLDB 2019. Since then the focus has shifted to GQL (papers are to be written). |
Start Year | 2017 |
Description | Relational.AI |
Organisation | RelationalAI |
Country | United States |
Sector | Private |
PI Contribution | A new evaluation strategy for SQL queries based on 2-valued logic. It will be used in their implementation of SQL. |
Collaborator Contribution | They will provide programming help and implement our ideas in an industrial product in a way that would be impossible to achieve in the context of a research prototype. |
Impact | Once done, this will lead to a research paper describing the new implementation. |
Start Year | 2021 |
Description | Chair, Formal Semantics Working Group of the Linked Data Benchmark Council |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | The Linked Data Benchmark Council is an organisation that arranges work by academics on behalf of graph database vendors, as well as groups producing new standards for graph query languages. Since 2020, Leonid Libkin has led the formal semantics working group, which comprises academics from the UK, France, Germany, Poland, and Chile, and analyses the emerging graph querying standard GQL. The group works in close collaboration with companies such as Neo4j (UK/Sweden), Oracle and TigerGraph (US). Its contributions are already reflected in the new part of the SQL standard for querying graphs, SQL/PGQ. |
Year(s) Of Engagement Activity | 2020,2021,2022 |
Description | Member of INCITS and working groups on GQL and SQL/PGQ |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | Member of the US-based standards group in charge of data management standards, in particular query languages: SQL, its graph extension SQL/PGQ, and a new language for graphs, GQL. |
Year(s) Of Engagement Activity | 2022,2023 |
Description | Member of the SQL Standard ISO Committee (officially: ISO/IEC JTC1 SC32 WG3) |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | SQL is the main language of relational database systems, used by practically all business and governmental organizations. It is standardized by ISO (the International Organization for Standardization). Since 2018, Leonid Libkin has been a member of that committee, currently one of only 4 academics influencing the design of this ubiquitous query language in such areas as handling graph queries and incomplete data. |
Year(s) Of Engagement Activity | 2018,2019,2020,2021,2022 |
Description | Moderator of a session at the joint W3C/ISO standardization workshop on graph data |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | I moderated a session on the procedures and modes of writing a new standard for query languages for graph databases. This activity will take place in an ISO committee, in collaboration with relevant W3C committees. The outcomes of the debate were presented to multiple interested industry players, and will form part of the final workshop recommendations. |
Year(s) Of Engagement Activity | 2019 |
Description | making industry aware of issues with common products discovered in this project |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | Talking to companies that make customised installations of RDBMSs (e.g. Postgres Professional) to make them aware of issues with handling incomplete data, and of our proposed solutions. |
Year(s) Of Engagement Activity | 2017 |