PDQ: Proof-driven Query Planning

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

Current data management solutions have several bottlenecks. One concerns scale -- how to get complex queries to run more quickly over ever-larger datasets. Another one, increasingly recognized by the research community, concerns usability: the most common data management solutions require data to be available in an SQL schema, with application programmers needing to write custom code to transform data from a myriad of other formats into the one "gold standard'' flat data description.

This project provides assistance on both of these problems through the development of an advanced query planning system that can deal with sources that have complex interfaces and rich integrity constraints.
By query planning we refer to a process that takes as input a query specified in terms of one vocabulary, translating it into a description in another vocabulary that can be more efficiently executed. Our approach to query planning, proof-driven query planning (PDQ), is based on
foundational ideas from computational logic: we search for "a proof that the query is answerable'' relative to the interfaces and constraints.
For each such proof we can use a variation of a technique from logic -- interpolation -- to produce a query plan that abides by the interfaces while making use of the constraints. As we search for a proof, we can estimate the cost of the generated plan, thus taking into
account proof structure and cost in searching for the optimal plan. Thus PDQ combines ideas from logic, query optimization, and search.
The importance of taking into account interface restrictions and data semantics in new data-driven applications, along with recent advances in reasoning systems for relational data, make this exactly the right time to take a fresh look at exploiting reasoning within query planning.

Proof-driven query planning provides benefits in diverse application scenarios. It can be applied within a middleware setting in which the user queries refer to external data that is difficult to access. It applies also to the problem of finding more efficient plans within a single database manager, either running on top of the DBMS or subsuming the setting of traditional database query optimization.

The impact of PDQ is foundational as well as practical:
proof-driven query planning gives a new methodology for transforming a logical plan to a physical plan that unifies application-level integrity constraints with logical/physical mappings, giving the prospect of a fully logic-based approach to query optimization in database management systems.

We will develop not only the underlying foundation of proof-driven planning, but also create proof-of-concept systems for the middleware and centralized settings.

Planned Impact

The project outlines a new way to inject logical methods into database systems. This has significant industrial impact as well as foundational impact.


I. Impact for users of data management products includes:

-- the project provides an approach to improve performance in running complex queries over datasets that are mushrooming in size, by allowing more powerful integrity constraint-aware optimization. The impact can be particularly significant when the data to be accessed resides on the web, and thus the overhead of accessing data dominates processing time.

-- the project gives a significant increase to the usability of data management systems by:

--- allowing data from web-based sources to be pulled in transparently to answer queries

--- insulating database developers from the details of mappings -- both mappings between the programmer view of data objects and the underlying stored data and mappings among datasources -- while exploiting these mappings in generating efficient query plans. The current need to create and maintain these mappings is known to be a huge "pain point" for database users.

Furthermore:

--- The technology makes the benefits of logic-based optimization available to current database management environments, targeting "closed world'' systems based on the relational model and its extensions (e.g. nested relations). This proposal thus represents an excellent complement to existing research initiatives focusing on newer data models.

--- The techniques can be used in a ``pay-as-you-go'' manner, allowing a trade-off between the sophistication of plan search applied and the time spent during search.


II. For designers of data management systems, the project investigates a more general approach to accounting for reasoning in query processing compared to that used in current systems. The PDQ approach can incorporate reasoning within optimization for formulas with arbitrary quantifier nesting, and allows by default a set of background integrity constraints. This is a significant improvement over current practice.

III. From the foundational perspective, the full realization of the project's goals holds out the promise of a new logic-based approach to query
processing based on proof search and interpolation. This is exciting not just from the perspective of database systems, but for the wider program of using declarative methods within computing.

Constraint-based optimization of database queries has long been a goal of the research community, but the time is ripe for taking the next step in making constraint-aware systems the norm. The explosion of web-accessible datasources has made querying in non-traditional settings, with greater overlap of sources and longer-running and more expensive computations, more urgent.
At the same time, research on reasoning systems for relational database constraints has made great strides in recent years, providing the basis for reasoning-based components in a query optimizer.

Not only is this the right time to carry out this research, Oxford is certainly the ideal place for it. It has world-class research groups in data management and knowledge representation as well as in all aspects of computational logic. It also has research strengths in many areas that border on the project, from programming languages to verification.

Publications

10 25 50
publication icon
Amarilli A (2020) Finite Open-world Query Answering with Number Restrictions in ACM Transactions on Computational Logic

publication icon
Amarilli A (2015) Combining existential rules and description logics in IJCAI International Joint Conference on Artificial Intelligence

publication icon
Amarilli A (2018) Query Answering with Transitive and Linear-Ordered Data in Journal of Artificial Intelligence Research

publication icon
Amarilli A. (2015) Combining existential rules and description logics in IJCAI International Joint Conference on Artificial Intelligence

publication icon
Benedikt M (2017) Benchmarking the Chase

 
Description We have used the techniques developed two bio-medical partners -- one in the public sector (EBI-EBML) during the course of the project and one in the private sector in a follow-up project.
First Year Of Impact 2019
Sector Digital/Communication/Information Technologies (including Software),Healthcare
 
Title Supporting data for "Scalable Analysis of Multi-Modal Biomedical Data" 
Description Targeted diagnosis and treatment options are dependent on insights drawn from multimodal analysis of large-scale biomedical datasets. Advances in genomics sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough look at the impact of a disease on the underlying system. The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis. Thus, scalable data integration solutions play a key role in the future of targeted medicine. Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex data types. To address these issues and achieve scalable processing of multi-modal biomedical data, we present TraNCE, a framework that automates the difficulties of designing distributed analyses with complex biomedical data types. We outline research and clinical applications for the platform, including data integration support for building feature sets for classification. We show that the system is capable of outperforming the common alternative, based on "flattening'' complex data structures, and runs efficiently when alternative approaches are unable to perform at all. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL http://gigadb.org/dataset/100914
 
Description Collaboration with Omics Data Automation 
Organisation Omics Data Automation
Country United States 
Sector Private 
PI Contribution We began a collaboration with a US-based biomedical company interested in data integration. Thus far we have done some demonstrations of the ability of the technology developed by the project to assist the company. The collaboration is at an early stage, and we are applying for funding to deepen it.
Collaborator Contribution They provided the usecases and some of their computing facilities
Impact The collaboration has just started this year, and is still ongoing. It is multi-disciplinary, being on the borderline of computer science and computational biology.
Start Year 2019
 
Title PDQ software 
Description The PDQ system, 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact This is the main system for query planning developed as part of PDQ. The formal release was delayed due to covid-related problems, but we should finish it this year. 
URL https://github.com/ProofDrivenQuerying/pdq/
 
Description Keynote at workshop in Germany 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Keynote in workshop on logic
Year(s) Of Engagement Activity 2017
URL http://2017.soqe.org/
 
Description Keynote on data integration at database conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Keynote at the Web Age Information Management (WAIM) on data integration; WAIM is the main database conference
in China.
Year(s) Of Engagement Activity 2015
URL http://www.cs.sdu.edu.cn/waim2015/
 
Description Keynote speech at Database conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact I was the keynote speaker at one of the main conferences for database researchers, Principles of Database Systems (PODS). I gave an overview of work on reasoning within data management.
Year(s) Of Engagement Activity 2018
URL https://sigmod2018.org/
 
Description Summer School Course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I gave a 1-week summer school course on the topic of this grant
Year(s) Of Engagement Activity 2017
URL https://www.irit.fr/esslli2017/