QUINTON -- QUerying and INTegrating Over Nested data

Lead Research Organisation: University of Oxford

Department Name: Computer Science

Abstract

It has long been recognized that nested data models -- in which information is modelled as collections of tuples whose attributes may in turn take values that are collections -- are the most natural modelling formalism for a wide variety of information management scenarios. Query languages that support nested data were developed decades ago. But even as emerging applications have made querying nested data more crucial, and even as many of the most important big data management frameworks assume programmatic interfaces based on nested data, processing large-scale nested data remains extremely cumbersome, radically more so than in the case of flat data. Our research hypothesis is that fundamental problems in querying and integrating nested data need to be resolved for this situation to change.

This project will provide new foundations for both querying and integrating nested data. On the side of querying we will establish a standard processing pipeline for queries over nested data. This will include a foundational study of the basic transformations involved in any such pipeline, such as the "shredding" of nested queries into relational queries. It will also include the development of algorithms and tools that implement this pipeline, working on top of scalable infrastructure for flat data, such as the Apache Spark project. On the side of integration, we will establish the foundations of specifying and querying virtual data sources consisting of nested data, and develop middleware that can implement queries over virtual data on top of heterogeneous nested data sources.
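As a rough illustration of the "shredding" idea mentioned above, the sketch below splits a small nested collection into two flat relations linked by labels, runs an aggregate over the flat form only, and then rebuilds the nested value. It is a minimal sketch under stated assumptions, not the project's pipeline or the TraNCE implementation: the function names `shred` and `unshred`, the `label` column, and the sample customer data are all hypothetical and chosen only for this example.

```python
def shred(records, nested_attr):
    """Split nested records into two flat relations (illustrative only).

    The inner collection under `nested_attr` is replaced by an integer label;
    its rows are stored in a separate flat relation tagged with that label.
    """
    top, inner = [], []
    for label, rec in enumerate(records, start=1):
        flat = {k: v for k, v in rec.items() if k != nested_attr}
        flat[nested_attr] = label          # label stands in for the collection
        top.append(flat)
        for row in rec[nested_attr]:
            inner.append({"label": label, **row})
    return top, inner


def unshred(top, inner, nested_attr):
    """Rebuild the nested records from the two flat relations."""
    by_label = {}
    for row in inner:
        rest = {k: v for k, v in row.items() if k != "label"}
        by_label.setdefault(row["label"], []).append(rest)
    return [{**rec, nested_attr: by_label.get(rec[nested_attr], [])} for rec in top]


# Hypothetical nested input: each customer carries a collection of orders.
customers = [
    {"name": "Ada", "orders": [{"item": "disk", "qty": 2},
                               {"item": "cable", "qty": 5}]},
    {"name": "Bo", "orders": []},
]

top, inner = shred(customers, "orders")

# A query evaluated purely on the flat relations: total quantity per customer.
totals = {}
for row in inner:
    totals[row["label"]] = totals.get(row["label"], 0) + row["qty"]
result = [{"name": r["name"], "total_qty": totals.get(r["orders"], 0)} for r in top]

print(result)
```

The point of the flat form is that each relation can be handled by an engine built for flat data, and empty inner collections (such as the second customer's) survive the round trip because the label stays in the top-level relation.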

The impact of QUINTON is both practical and foundational. We will build infrastructure for querying and integration, but we will also investigate the fundamental problems of scalable querying over materialized and virtual data sources, providing foundations that can guide the research community in future implementations. We will also drill down into a particularly compelling and timely application of nested data integration and management, working with an industrial partner to build components and novel analyses in the area of management for biomedical data. Our partner deals with unified interfaces to diverse biomedical data sources -- clinical, imaging, and genomic data -- and their use cases are a perfect fit for the technology we are developing.

Planned Impact

The project will rethink the groundwork of nested data management. This has impact for a variety of applied areas dealing with large-scale data management, as well as significant foundational impact.

I. Impact for users of data management products includes:

-- The project provides a framework for running complex queries over nested data. The querying infrastructure works on top of the most powerful and popular open source tools for big data, and already has interest and support from some of the teams that build these tools.

-- The project provides support for defining virtual nested data items on top of a variety of nested data sources, and middleware that extends the querying support for nested data to work on top of virtual nested data items as well as materialized data items. This significantly eases the management of queries that span many autonomous heterogeneous data sources, a situation that is extremely common in practice.

-- Although our emphasis in the project is on generic data management infrastructure, we will layer on top of our infrastructure specific support for querying biomedical data. Our biomedical application software will give a particularly compelling and immediate path to impact for the growing application area of biomedical data management, as well as providing a showcase for our infrastructure and a feedback loop for tuning it based on the requirements of biomedical practitioners.

II. Academic impact

The project provides a robust foundation for querying and integrating nested data that has been lacking in the past. In particular, we establish a standard processing pipeline for declarative queries over nested data, including both algorithmic foundations and a thorough theoretical understanding of the fundamental transformations involved in processing nested data queries. We also establish the fundamental algorithms and properties for defining virtual data sources and querying such sources, lifting the theory developed over past decades for flat relational data to the nested setting.

Funded Value:

£1,039,798

Funded Period:

Jan 21 - Jan 25

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/T022124/1

Principal Investigator:

Michael Benedikt

Research Subject:

Info. & commun. Technol. (90%)

Tools, technologies & methods (10%)

Research Topic:

Bioinformatics (10%)

Information & Knowledge Mgmt (90%)

Organisations

People	ORCID iD
Michael Benedikt (Principal Investigator)
Dan Olteanu (Co-Investigator)
Boris Motik (Co-Investigator)	http://orcid.org/0000-0003-2506-4118
Milos Nikolic (Co-Investigator)	http://orcid.org/0000-0002-1548-6803

Publications


Benedikt M (2024) Two Variable Logic with Ultimately Periodic Counting in SIAM Journal on Computing

Benedikt M (2023) Embedded Finite Models beyond Restricted Quantifier Collapse

Benedikt M (2021) Balancing Expressiveness and Inexpressiveness in View Design in ACM Transactions on Database Systems

Benedikt M (2022) Rewriting the Infinite Chase

Benedikt M (2022) Rewriting the infinite chase in Proceedings of the VLDB Endowment

Benedikt M (2024) Synthesizing nested relational queries from implicit specifications: via model theory and via proof theory in Logical Methods in Computer Science

Benedikt M (2024) Rewriting the Infinite Chase for Guarded TGDs in ACM Transactions on Database Systems

Benedikt M (2023) On Monotonic Determinacy and Rewritability for Recursive Queries and Views in ACM Transactions on Computational Logic

Benedikt M (2022) Synthesizing Nested Relational Queries from Implicit Specifications

Benedikt M (2024) Monotone Rewritability and the Analysis of Queries, Views, and Rules

Benedikt M (2024) Decidability of Graph Neural Networks via Logical Characterizations

Benedikt M (2023) Synthesizing Nested Relational Queries from Implicit Specifications

Benedikt M (2023) The Complexity of Presburger Arithmetic with Power or Powers

Benedikt M (2023) Embedded Finite Models Beyond Restricted Quantifier Collapse

Benedikt M (2021) Generating collection transformations from proofs in Proceedings of the ACM on Programming Languages

Shaikhha A (2022) Functional collection programming with semi-ring dictionaries in Proceedings of the ACM on Programming Languages

Shaikhha A (2021) Functional Collection Programming with Semi-Ring Dictionaries

Smith J (2021) Scalable querying of nested data in Proceedings of the VLDB Endowment

Smith J (2021) Scalable analysis of multi-modal biomedical data in GigaScience

Smith J (2021) TraNCE: transforming nested collections efficiently in Proceedings of the VLDB Endowment

Zombori Z (2023) Towards Unbiased Exploration in Partial Label Learning

Zombori Z (2024) Towards Unbiased Exploration in Partial Label Learning in Journal of Machine Learning Research

Research Databases and Models
Software and Technical Products


Title	Supporting data for "Scalable Analysis of Multi-Modal Biomedical Data"
Description	Targeted diagnosis and treatment options are dependent on insights drawn from multimodal analysis of large-scale biomedical datasets. Advances in genomics sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough look at the impact of a disease on the underlying system. The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis. Thus, scalable data integration solutions play a key role in the future of targeted medicine. Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex data types. To address these issues and achieve scalable processing of multi-modal biomedical data, we present TraNCE, a framework that automates the difficulties of designing distributed analyses with complex biomedical data types. We outline research and clinical applications for the platform, including data integration support for building feature sets for classification. We show that the system is capable of outperforming the common alternative, based on "flattening" complex data structures, and runs efficiently when alternative approaches are unable to perform at all.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
Impact	This dataset and the associated analysis were made available for follow-on research.
URL	http://gigadb.org/dataset/100914


Title	TraNCE
Description	Software for querying nested data
Type Of Technology	Software
Year Produced	2020
Open Source License?	Yes
Impact	Software has been used in several research papers and theses analyzing nested data.
URL	https://github.com/jacmarjorie/trance
