QUINTON -- QUerying and INTegrating Over Nested data

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

It has long been recognized that nested data models -- in which information is modelled as collections of tuples whose attributes may in turn take values that are collections -- are the most natural modelling formalism for a wide variety of information management scenarios. Query languages that support nested data were developed decades ago. But even as emerging applications have made the need for querying nested data more crucial, and even as many of the most important big data management frameworks assume programmatic interfaces based on nested data, processing large-scale nested data remains extremely cumbersome, radically more so than in the case of flat data. Our research hypothesis is that fundamental problems in querying and integrating nested data need to be resolved for this situation to change.

This project will provide new foundations for both querying and integrating nested data. On the side of querying, we will establish a standard processing pipeline for queries over nested data. This will include a foundational study of the basic transformations involved in any such pipeline, such as the "shredding" of nested queries into relational queries. It will also include the development of algorithms and tools that implement this pipeline, working on top of scalable infrastructure for flat data, such as the Apache Spark project. On the side of integration, we will establish the foundations of specifying and querying virtual data sources consisting of nested data, and develop middleware that can implement queries over virtual data on top of heterogeneous nested data sources.
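To make the idea of shredding concrete, the following is a minimal sketch in Python. It shreds nested data rather than a nested query, and it is only an illustration of the general idea, not the project's algorithm: the `customers` collection, the `shred` and `unshred` helpers, and the integer labels are all hypothetical names chosen for this example. Each inner collection is replaced by a label, and its members are stored in a separate flat relation keyed by that label.

```python
from itertools import count

# Hypothetical nested collection: each tuple has a collection-valued attribute.
customers = [
    {"name": "Ada", "orders": [{"item": "disk", "qty": 2},
                               {"item": "cable", "qty": 1}]},
    {"name": "Bo", "orders": []},
]

def shred(nested):
    """Split a nested collection into two flat relations linked by labels."""
    fresh = count(1)
    top, orders = [], []
    for c in nested:
        label = next(fresh)  # label standing in for the inner collection
        top.append({"name": c["name"], "orders_label": label})
        for o in c["orders"]:
            orders.append({"label": label, "item": o["item"], "qty": o["qty"]})
    return top, orders

def unshred(top, orders):
    """Rebuild the nested collection by grouping the flat rows by label."""
    by_label = {}
    for row in orders:
        by_label.setdefault(row["label"], []).append(
            {"item": row["item"], "qty": row["qty"]})
    return [{"name": t["name"], "orders": by_label.get(t["orders_label"], [])}
            for t in top]

top, orders = shred(customers)
```

Both `top` and `orders` contain only atomic values, so they could in principle be handled by an engine for flat data, and `unshred(top, orders)` recovers the original nested collection.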

The impact of QUINTON is both practical and foundational. We will build infrastructure for querying and integration, but we will also investigate the fundamental problems of scalable querying over materialized and virtual data sources, providing the foundations that can guide the research community in future implementations. We will also drill down into a particularly compelling and timely application of nested data integration and management, working with an industrial partner to build components and novel analyses in the area of biomedical data management. Our partner deals with unified interfaces to diverse biomedical data sources -- clinical, imaging, and genomic data -- and their use cases are a perfect fit for the technology we are developing.

Planned Impact

The project will rethink the groundwork of nested data management. This has impact for a variety of applied areas dealing with large-scale data management, as well as significant foundational impact.


I. Impact for users of data management products includes:

-- The project provides a framework for running complex queries over nested data. The querying infrastructure works on top of the most powerful and popular open source tools for big data, and already has interest and support from some of the teams that build these tools.

-- The project provides support for defining virtual nested data items on top of a variety of nested data sources, and middleware that extends the querying support for nested data to work on top of virtual nested data items as well as materialized data items. This significantly eases query management when queries span many autonomous heterogeneous data sources, a situation that is extremely common in practice.

-- Although our emphasis in the project is on generic data management infrastructure, we will layer on top of our infrastructure specific support for querying biomedical data. Our biomedical application software will give a particularly compelling and immediate path to impact for the growing application area of biomedical data management, as well as providing a showcase for our infrastructure and a feedback loop for tuning it based on the requirements of biomedical practitioners.
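The notion of a virtual nested data item in the second point above can be sketched as follows. This is an illustrative Python example under stated assumptions, not the project's middleware: the two sources (`clinical_rows`, `imaging_docs`), their fields, and the functions `patient_view` and `patients_with` are hypothetical. The view is computed on demand from two differently shaped sources and is never stored, yet a query can be written against it as if it were a materialized nested collection.

```python
# Two hypothetical autonomous sources with different shapes.
clinical_rows = [  # flat source: one row per visit
    {"patient": "p1", "visit": "2020-01-03", "diagnosis": "A"},
    {"patient": "p1", "visit": "2020-02-10", "diagnosis": "B"},
    {"patient": "p2", "visit": "2020-01-15", "diagnosis": "A"},
]
imaging_docs = {  # nested source: scans grouped per patient
    "p1": {"scans": [{"modality": "MRI"}]},
    "p2": {"scans": [{"modality": "CT"}, {"modality": "MRI"}]},
}

def patient_view():
    """Virtual nested item: one tuple per patient, built lazily from both sources."""
    patients = sorted({r["patient"] for r in clinical_rows} | set(imaging_docs))
    for pid in patients:
        yield {
            "patient": pid,
            "visits": [{"visit": r["visit"], "diagnosis": r["diagnosis"]}
                       for r in clinical_rows if r["patient"] == pid],
            "scans": imaging_docs.get(pid, {}).get("scans", []),
        }

def patients_with(diagnosis, modality):
    """Query over the virtual view: patients with a given diagnosis and scan type."""
    return [p["patient"] for p in patient_view()
            if any(v["diagnosis"] == diagnosis for v in p["visits"])
            and any(s["modality"] == modality for s in p["scans"])]

result = patients_with("B", "MRI")
```

A real system would translate such a query into requests against the underlying sources instead of scanning them in full; the sketch only shows the separation between the view definition and the queries written over it.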

II. Academic impact

The project provides a robust foundation for querying and integrating nested data that has been lacking in the past. In particular, we establish a standard processing pipeline for declarative queries over nested data, including both algorithmic foundations and a thorough theoretical understanding of the fundamental transformations involved in processing nested data queries. We also establish the fundamental algorithms and properties for defining virtual data sources and querying such sources, lifting the theory developed over the past decades for flat relational data to the nested setting.

Publications

Benedikt M (2021) Balancing Expressiveness and Inexpressiveness in View Design in ACM Transactions on Database Systems

Benedikt M (2022) Rewriting the infinite chase in Proceedings of the VLDB Endowment

Benedikt M (2023) On Monotonic Determinacy and Rewritability for Recursive Queries and Views in ACM Transactions on Computational Logic

Benedikt M (2021) Generating collection transformations from proofs in Proceedings of the ACM on Programming Languages

Shaikhha A (2022) Functional collection programming with semi-ring dictionaries in Proceedings of the ACM on Programming Languages

Smith J (2021) Scalable querying of nested data in Proceedings of the VLDB Endowment

Smith J (2021) TraNCE: transforming nested collections efficiently in Proceedings of the VLDB Endowment

 
Title: Trance
Description: Software for querying of nested data
Type Of Technology: Software
Year Produced: 2020
Open Source License: Yes
Impact: Software has been used in several research papers and theses analyzing nested data.
URL: https://github.com/jacmarjorie/trance