Efficient Querying of Inconsistent Data

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

Data is everywhere, produced by a multitude of applications, devices, and users. The utilisation of Big Data is of paramount importance to many activities in domains that include science, industry, governments, and healthcare. The economic impact of Big Data is projected to exceed £62B by 2020. However, in this data-driven world, managing the data relevant to a given application in the way we have become accustomed to, namely keeping it in a well-maintained database or data warehouse, is no longer tenable, because the data typically comes from continuously evolving, heterogeneous, and unreliable sources. This creates a notable gap between the opportunity provided by the availability of Big Data and the ability of users to make the most of it, and calls for new tools and techniques. The key challenges posed by Big Data are usually summarised as the four V's: Volume (scale of the data), Velocity (speed of change), Variety (different forms of data), and Veracity (inconsistency and incompleteness).

The proposed research addresses the Veracity, and more precisely, the inconsistency aspect of Big Data. Inconsistency refers to the fact that a database does not conform to its specification. For example, in a database that stores registered companies, each company must have a unique registration number (this is known as a key constraint). However, in an environment where data arrives from multiple sources, two disagreeing sources may associate different companies with the same registration number. Storing both facts in an integrated database makes the database inconsistent.
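
To make this concrete, the following Python sketch (purely illustrative; the relation, the attribute names, and the sample tuples are our own, not taken from the project) checks the key constraint and reports the registration numbers on which the sources disagree:

    from collections import defaultdict

    # Toy 'companies' relation of (registration_number, company_name) tuples.
    # The registration number is declared to be a key, so no two tuples
    # may share it.
    companies = [
        ("12345", "Acme Ltd"),    # from source A
        ("12345", "Apex Ltd"),    # from source B: same key, different company
        ("67890", "Zenith Ltd"),
    ]

    by_key = defaultdict(set)
    for reg_no, name in companies:
        by_key[reg_no].add(name)

    # The key is violated wherever one number maps to several companies.
    violations = {k: names for k, names in by_key.items() if len(names) > 1}
    print(violations)  # e.g. {'12345': {'Acme Ltd', 'Apex Ltd'}}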

The problem of querying data that is both very large and inconsistent has been recognised as a common challenge in Big Data analysis that must be urgently addressed. While this problem has attracted considerable attention from the database community, we do not yet have good solutions that come with theoretical guarantees and can be implemented in practical systems.

The most common approach to the problem is known as Consistent Query Answering (CQA). The idea is to find answers to queries that are consistent, i.e., true no matter how the inconsistencies are resolved. Resolving inconsistencies means repairing the data to make it conform to its specification. Since this elegant idea was introduced about 20 years ago, various notions of repair have been proposed, all essentially imposing some form of "minimal change" condition with respect to the inconsistent database.
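
As a minimal illustration (continuing the toy relation above; for key constraints, the standard minimal-change repairs keep a maximal consistent subset of the tuples, i.e., exactly one tuple per registration number), the consistent answers to a query are those returned in every repair:

    from collections import defaultdict
    from itertools import product

    companies = [
        ("12345", "Acme Ltd"),
        ("12345", "Apex Ltd"),
        ("67890", "Zenith Ltd"),
    ]

    # Group tuples by key; every repair keeps exactly one tuple per group.
    groups = defaultdict(list)
    for t in companies:
        groups[t[0]].append(t)
    repairs = [set(choice) for choice in product(*groups.values())]

    # Query 1: which registration numbers occur? Both survive every repair,
    # so both are consistent answers.
    print(set.intersection(*({r for r, _ in rep} for rep in repairs)))
    # {'12345', '67890'}

    # Query 2: which company names occur? Only 'Zenith Ltd' survives every
    # repair; 'Acme Ltd' and 'Apex Ltd' are not consistent answers.
    print(set.intersection(*({n for _, n in rep} for rep in repairs)))
    # {'Zenith Ltd'}

Note that enumerating all repairs, as this sketch does, takes exponential time in general, which is one reason why consistent query answering is computationally hard.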

Much of the research on querying inconsistent data has focused on isolating convenient scenarios where the problem can be solved efficiently, but many realistic scenarios remain beyond reach. This is reflected in CQA's limited practical applicability.

The goal of this project is to change this state of affairs by proposing a practically applicable approach to the problem of querying inconsistent data. We are convinced that the ultimate goal of this new approach should be efficient approximation algorithms that quickly deliver sufficiently good consistent answers with explicit error guarantees.

To achieve this, we need to rethink the very notion of what it means to repair an inconsistent database. Our key argument is that repairs should be viewed operationally, as sequences of operations that take the database closer to a consistent state. Viewing repairs in this way opens up many new opportunities to design approximation algorithms for consistent query answering.
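
The sketch below (our own illustration in the same toy setting, not the project's actual algorithm) hints at this operational view: a repair is a sequence of delete operations, each of which resolves one violation, and sampling such sequences yields an estimate, which can be equipped with standard statistical error guarantees, of how often an answer survives:

    import random

    def violations(db):
        # Pairs of distinct tuples that share a registration number (the key).
        return [(s, t) for s in db for t in db if s < t and s[0] == t[0]]

    def sample_repair_sequence(db, rng):
        """Apply random delete operations until the database is consistent."""
        db, ops = set(db), []
        while (v := violations(db)):
            victim = rng.choice(rng.choice(v))  # resolve one violation
            db.discard(victim)                  # each step reduces inconsistency
            ops.append(("delete", victim))
        return db, ops

    rng = random.Random(0)
    db = [("12345", "Acme Ltd"), ("12345", "Apex Ltd"), ("67890", "Zenith Ltd")]

    repaired, ops = sample_repair_sequence(db, rng)
    print(ops)  # one repair, seen as a sequence of operations

    # Sampling many sequences estimates how often an answer survives.
    hits = sum("Acme Ltd" in {n for _, n in sample_repair_sequence(db, rng)[0]}
               for _ in range(1000))
    print(hits / 1000)  # about 0.5, so 'Acme Ltd' is not a consistent answer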

The proposed research programme is structured around three main themes: a foundational study of the operational approach to consistent query answering and its complexity; the design and analysis of efficient approximation algorithms for finding consistent query answers; and the implementation and evaluation of these algorithms in a realistic setting, carried out with our project partner, a leading vendor of business intelligence software.

Planned Impact

We are proposing to lay the foundations of a new framework for querying large inconsistent databases, which will revolutionise how we deal with inconsistency in Big Data. Indeed, one of the pressing challenges of Big Data management is handling inconsistency, most notably in the context of the World Wide Web or the Internet of Things (IoT).

It is widely agreed that Big Data is extremely important for many different activities and domains, such as science, industry, governments, and healthcare. Thus, the beneficiaries could include, in the long term, anyone who uses or depends on Big Data - effectively every business, organisation, and individual.

In the shorter term, the techniques for inconsistency handling developed in this project will exert a major influence on the theory and practice of Big Data management, both within and outside the academic community. Our work will thus benefit all those working in the broad area of information systems, including researchers - both in academia and industry - in the fields of databases and knowledge representation and reasoning. Specifically, in the context of Web data, our work will make it possible to deal with inconsistency in databases that result from the extraction and integration of Big Data from the Web. Other short-term beneficiaries will therefore include researchers, in both academia and industry, working on Web data extraction and integration.

As a confirmation of this ambitious impact strategy, strong interest in the project has been shown by the UK company Wrapidity, which has recently been acquired by Meltwater, a leading media monitoring and business intelligence software company. Wrapidity has found our techniques to be very important for its data wrangling techniques and has agreed to provide a substantial contribution to the project.

In the longer term, beneficiaries will be data analysts working with Big Data who need to handle inconsistent data, as well as researchers and developers who need to exploit Big Data in applications. Hence, our work will also be of benefit to researchers in science, industry, governments, and healthcare, in such diverse areas as biology, medicine, geography, astronomy, agriculture, and aerospace.

The project will also be of wider benefit to the UK's research community by answering important open questions, contributing to the UK's research base, and helping to establish the UK's world leadership in this area. In summary, the project will contribute to enhancing the UK's scientific relevance and excellence.

Description A new approach to the central problem of querying inconsistent data has been proposed. This approach opens up the possibility of efficient approximation algorithms that quickly deliver sufficiently good consistent answers with explicit error guarantees. Our results indicate that it will lead to practical solutions to an important database problem for which no practical solution currently exists.
Exploitation Route We are proposing to lay the foundations of a new framework for querying inconsistent data. Thus, the beneficiaries could, in the long term, include anyone who uses or depends on data - effectively every business, organisation, and individual. In the shorter term, our work will be of benefit to all those working in the broad area of information systems, including researchers - both in academia and industry - in the fields of databases and knowledge representation and reasoning.
Sectors Digital/Communication/Information Technologies (including Software)