QUINTON -- QUerying and INTegrating Over Nested data

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

It has long been recognized that nested data models -- in which information is modelled as collections of tuples whose attributes may in turn take values that are collections -- are the most natural modelling formalism for a wide variety of information management scenarios. Query languages that support nested data were developed decades ago. But even as emerging applications have made querying nested data more crucial, and even as many of the most important big data management frameworks assume programmatic interfaces based on nested data, processing large-scale nested data remains extremely cumbersome, radically more so than processing flat data. Our research hypothesis is that fundamental problems in querying and integrating nested data need to be resolved for this situation to change.

This project will provide new foundations for both querying and integrating nested data. On the querying side, we will establish a standard processing pipeline for queries over nested data. This will include a foundational study of the basic transformations involved in any such pipeline, such as the "shredding" of nested queries into relational queries. It will also include the development of algorithms and tools that implement this pipeline, working on top of scalable infrastructure for flat data, such as the Apache Spark project. On the integration side, we will establish the foundations of specifying and querying virtual data sources consisting of nested data, and develop middleware that can implement queries over virtual data on top of heterogeneous nested data sources.
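
As a rough illustration of what "shredding" means at the data level, the sketch below splits a one-level nested collection into two flat relations linked by a generated label, and then rebuilds the nested form. It is a toy Python example under illustrative assumptions, not the project's pipeline or the TraNCE implementation: the function names `shred` and `unshred`, the field names, and the sample data are all hypothetical.

```python
from itertools import count


def shred(customers):
    """Split a one-level nested collection into two flat relations.

    Each customer's inner `orders` collection is replaced by a label, and the
    inner tuples are stored in a separate flat relation keyed by that label.
    """
    labels = count(1)  # hypothetical label generator: 1, 2, 3, ...
    top, orders = [], []
    for customer in customers:
        label = next(labels)
        top.append({"name": customer["name"], "orders_label": label})
        for order in customer["orders"]:
            orders.append({"label": label, "item": order["item"], "qty": order["qty"]})
    return top, orders


def unshred(top, orders):
    """Rebuild the nested collection from the two flat relations."""
    by_label = {}
    for order in orders:
        by_label.setdefault(order["label"], []).append(
            {"item": order["item"], "qty": order["qty"]}
        )
    return [
        {"name": row["name"], "orders": by_label.get(row["orders_label"], [])}
        for row in top
    ]


# Hypothetical sample data: one customer with two orders, one with none.
nested = [
    {"name": "alice", "orders": [{"item": "pen", "qty": 2}, {"item": "ink", "qty": 1}]},
    {"name": "bob", "orders": []},
]

top, orders = shred(nested)
assert unshred(top, orders) == nested  # the round trip preserves the nested value
```

In this sketch the customer with an empty `orders` collection survives the round trip; a naive flattening based on an inner join between customers and orders would drop that customer.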

The impact of QUINTON is both practical and foundational. We will build infrastructure for querying and integration, but we will also investigate the fundamental problems of scalable querying over materialized and virtual data sources, providing the foundations that can guide the research community in future implementations. We will also drill down into a particularly compelling and timely application of nested data integration and management, working with an industrial partner to build components and novel analyses in the area of management for biomedical data. Our partner deals with unified interfaces to diverse biomedical data sources -- clinical, imaging, and genomic data -- and their use cases are a perfect fit for the technology we are developing.

Planned Impact

The project will rethink the groundwork of nested data management. This has impact on a variety of applied areas dealing with large-scale data management, as well as significant foundational impact.


I. Impact for users of data management products includes:

-- The project provides a framework for running complex queries over nested data. The querying infrastructure works on top of the most powerful and popular open source tools for big data, and already has interest and support from some of the teams that build these tools.

-- The project provides support for defining virtual nested data items on top of a variety of nested data sources, and middleware that extends the querying support for nested data to work on top of virtual nested data items as well as materialized data items. This significantly eases querying when queries span many autonomous, heterogeneous data sources, a situation that is extremely common in practice.

-- Although our emphasis in the project is on generic data management infrastructure, we will layer on top of our infrastructure specific support for querying biomedical data. Our biomedical application software will give a particularly compelling and immediate path to impact for the growing application area of biomedical data management, as well as providing a showcase for our infrastructure and a feedback loop for tuning it based on the requirements of biomedical practitioners.

II. Academic impact

The project provides a robust foundation for querying and integrating nested data that has been lacking in the past. In particular, we establish a standard processing pipeline for declarative queries over nested data, including both algorithmic foundations and a thorough theoretical understanding of the fundamental transformations involved in processing nested data queries. We also establish the fundamental algorithms and properties for defining virtual data sources and querying such sources, lifting the theory that has been developed over the past decades for flat relational data to the nested setting.

Publications

Benedikt M (2024) Two Variable Logic with Ultimately Periodic Counting in SIAM Journal on Computing

Benedikt M (2021) Balancing Expressiveness and Inexpressiveness in View Design in ACM Transactions on Database Systems

Benedikt M (2022) Rewriting the Infinite Chase

Benedikt M (2022) Rewriting the infinite chase in Proceedings of the VLDB Endowment

Benedikt M (2024) Rewriting the Infinite Chase for Guarded TGDs in ACM Transactions on Database Systems

Benedikt M (2023) On Monotonic Determinacy and Rewritability for Recursive Queries and Views in ACM Transactions on Computational Logic

Benedikt M (2021) Generating collection transformations from proofs in Proceedings of the ACM on Programming Languages

Shaikhha A (2022) Functional collection programming with semi-ring dictionaries in Proceedings of the ACM on Programming Languages

Smith J (2021) Scalable querying of nested data in Proceedings of the VLDB Endowment

Smith J (2021) TraNCE transforming nested collections efficiently in Proceedings of the VLDB Endowment

Zombori Z (2024) Towards Unbiased Exploration in Partial Label Learning in Journal of Machine Learning Research

 
Title: Supporting data for "Scalable Analysis of Multi-Modal Biomedical Data"
Description: Targeted diagnosis and treatment options are dependent on insights drawn from multimodal analysis of large-scale biomedical datasets. Advances in genomics sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough look at the impact of a disease on the underlying system. The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis. Thus, scalable data integration solutions play a key role in the future of targeted medicine. Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex data types. To address these issues and achieve scalable processing of multi-modal biomedical data, we present TraNCE, a framework that automates the difficulties of designing distributed analyses with complex biomedical data types. We outline research and clinical applications for the platform, including data integration support for building feature sets for classification. We show that the system is capable of outperforming the common alternative, based on "flattening" complex data structures, and runs efficiently when alternative approaches are unable to perform at all.
Type Of Material: Database/Collection of data
Year Produced: 2021
Provided To Others? Yes
Impact: This dataset and the associated analysis were made available for follow-on research.
URL: http://gigadb.org/dataset/100914
 
Title: Trance
Description: Software for querying nested data
Type Of Technology: Software
Year Produced: 2020
Open Source License? Yes
Impact: Software has been used in several research papers and theses analyzing nested data.
URL: https://github.com/jacmarjorie/trance