WELD: Integrated Cyber-Infrastructure for Scalable Data-Driven Research into COVID-19

Lead Research Organisation: University of Oxford
Department Name: Engineering Science

Abstract

The project will implement a unifying cyberinfrastructure based on automated machine-actionable policies that will enable data-driven clinical and medical research across the four nations and produce evidence essential for making rapid decisions on the UK's response to the COVID-19 outbreak. The research explores the development of integrated computing, modelling, simulation, and information technologies as the basis of cross-disciplinary research for collaborating teams investigating COVID-19 and SARS-CoV-2. The application will implement a "cloud native" software infrastructure for federating distributed healthcare data across the Trusted Research Environments and include methods for working with ensemble collections of data. The result will quantifiably improve existing model-based and statistically based methodologies for evaluating vaccine efficacy and risk prediction using live NHS data in clinical research environments. A unique contribution will be "policy-based" data federation techniques that allow each devolved nation or Trusted Research Environment to control its use of a shared UK-wide data collection, across regulatory boundaries, while guarding against data exfiltration. This capability, which has never been addressed, is fundamental to providing a secure evidence base for effective data analysis and scientific discovery relating to COVID-19 at UK-wide levels. The methodology draws on the state of the art in data abstractions and virtualisation levels, provides greater flexibility for automation and re-use across the lifecycle, and addresses the need for a more systematic, community-based approach to data and metadata as part of a shared solution. The architectural model will inform HDR-UK strategic decision-making in response to COVID-19 and future pandemics that require an improved technology readiness level.

Publications

Description The project developed and implemented a unifying e-Infrastructure based on automated machine-actionable policies that enables secure data-driven research to be conducted in highly distributed environments. The work is in response to the recommendations set out in the Goldacre review for Trusted Research Environments, including data curation and information governance, that are specific to healthcare, but which have broader applicability across the entire UK research landscape. To this end, the strategy follows the development of the DARE UK programme to drive modern, efficient, open, collaborative approaches to data science across the four nations.

Although the focus was initially on clinical science and healthcare domains, the generic approach is applicable across scientific and engineering communities, promoting cross-disciplinary collaborations that are conducted in a virtual research environment across global distances. The methodology draws on the state of the art in data abstraction and logic programming semantics, extending concepts used to organise distributed data for data sharing architectures that emerged from the data grid community.

The prototype system was achieved through extensions to open-source, web foundational technologies, almost entirely built from Apache Software Foundation open-source components, to meet big data challenges for which streaming is fundamental. The policy-driven software stack is consequently designed to enable secure, real-time digital twinning and cyber-physical systems with an integration path involving enterprise-based technology providers.

The project was extended to additional scientific and engineering domains in which collaborative research requires sophisticated policies. The policies automate the enforcement of compliance, governance, copyright, intellectual property, confidentiality, closed records, and classified records on access. Significantly, the e-Infrastructure is extensible to support a complete edge-cloud-supercomputing ecosystem. It thereby addresses the need for interoperability that arises when moving from cloud to data centres, a move that is expected to generate significant cost savings. Developments of this nature are critical to avoid lock-in to the proprietary cloud providers that currently dominate the TRE landscape.

The workplan adopts the model of shared working practice, with an open-source repository maintained for the reusable software code that is under development. In terms of engagement and diverse community outreach, every attempt has been made to extend the findings to national and international programmes in a range of scientific and engineering disciplines. The methods pioneered for the management of healthcare data have relevance to interdisciplinary collaborations that have equivalent (or greater) requirements for data curation across a distributed supply chain, with an extended lifecycle of 70-100 years.

To achieve sustainability, the methodologies and logical data model developed through this project are now being adopted by larger vendors and systems, including Nvidia (which has recently agreed to support integration into the Omniverse platform). A strategic aim is to engage with industry and embed the findings seamlessly into enterprise platforms capable of providing service and support at scale. As the software matures, the result will enable best-practice science and engineering collaborative engagement as a continuing contribution within the scope of the DARE UK programme.
Exploitation Route The project implemented a policy-oriented architecture in the framework of data curation services that are highly distributed and involve heterogeneous resources brokering petabytes of data collected in hundreds of millions of files and used by thousands of cooperating researchers. Within the broader context of the Goldacre review, the logical data model encapsulates knowledge and knowledge relationships as the basis of Reproducible Analytical Pipelines with audit trails maintained in state (although outside the immediate scope). Further development of these capabilities is undertaken through separately funded projects, with explorations of how the logical data model can be incorporated as part of enterprise-grade data fabric technologies that provide the connectivity and software methodology required for national-scale infrastructure (e.g., DevOps, MLOps).

In the context of the DARE UK and HDRUK programmes, the software product provides the underpinning for a highly distributed, extensible, and interoperable national data fabric, which reasons about data management policies using logic programmes with advanced security and data attestation. Experiments at EPCC demonstrated the feasibility of using the data model to support data analysis and machine learning without moving raw data between the Trusted Research Environments (TREs).
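As an illustration of reasoning about data management policies with a logic programme, the following is a minimal sketch, not the project's implementation. It assumes propositional definite Horn clauses only (no negation), for which forward chaining computes the unique least model. All atom names, the `forward_chain` helper, and the two example rules are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (body, head): the head atom holds when every body atom holds.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Return the least model of a set of propositional definite Horn clauses."""
    derived = set(facts)
    changed = True
    while changed:  # repeat until no rule adds a new atom (fixpoint)
        changed = False
        for body, head in rules:
            if head not in derived and body <= derived:
                derived.add(head)
                changed = True
    return derived


# Hypothetical policy rules; these are assumptions for illustration only.
RULES: List[Rule] = [
    (frozenset({"researcher_accredited", "project_approved"}), "may_query"),
    (frozenset({"may_query", "output_checked"}), "may_export_aggregate"),
]

facts = {"researcher_accredited", "project_approved"}
model = forward_chain(facts, RULES)
print(sorted(model))
```

Here `may_query` is derived, while `may_export_aggregate` is not, because the fact `output_checked` has not been asserted. Policies that need negation or defaults would require a richer semantics than this sketch covers.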

The experiments targeted the healthcare data use cases identified and described in the project's Scope and Use Case document, incorporating use cases identified from SMI, ODAP, and similar TRE-like platforms, extended further by other WELD project partners. Analysis of the use cases informed a "design pattern" for on-demand elastic virtual infrastructure and virtual services, enabling policy-based Information Governance (IG) and security in shared virtual services and High-Performance Computing (HPC) environments.

Specific experimentation scenarios were conducted with PPZ design and implementation in OpenStack, which allows creation of public-private networks and provides full control over internal and external connections. The result enables "zones" of Virtual Machines that are configured with a firewall server using an open-source proxy, in combination with firewall restrictions to control outgoing and incoming routes. The approach was further developed through the application of the TensorFlow Federated (TFF) framework for enabling federated analytics and federated machine learning, and additional effort investigated PySyft as an alternative open-source federated learning platform. Current work explores the development of this activity using swarm learning technologies based on the HPE Swarm Learning framework.
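The idea behind federated analytics and federated learning, in which only summaries or model parameters leave each environment, can be sketched in plain Python. This is a minimal illustration under stated assumptions and does not use the TFF, PySyft, or HPE Swarm Learning APIs. The function names and the example values are hypothetical, and the sketch omits the disclosure controls (for example, minimum cell counts) that a real TRE would apply.

```python
from typing import List, Sequence, Tuple


def local_summary(values: Sequence[float]) -> Tuple[float, int]:
    """Runs inside one site: only the sum and the count leave the site."""
    return (sum(values), len(values))


def federated_mean(summaries: List[Tuple[float, int]]) -> float:
    """Coordinator side: combine per-site (sum, count) pairs into a global mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    if count == 0:
        raise ValueError("no records contributed by any site")
    return total / count


def federated_average(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Weighted average of per-site parameter vectors (FedAvg-style aggregation)."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]


# Hypothetical data held at two sites; the raw values never leave local_summary.
site_a = [1.0, 2.0, 3.0]
site_b = [5.0]
global_mean = federated_mean([local_summary(site_a), local_summary(site_b)])

# Hypothetical locally trained parameters, weighted by local record count.
global_params = federated_average([([1.0, 0.0], 3), ([5.0, 4.0], 1)])
print(global_mean, global_params)
```

Weighting by record count means that a site contributing more data has proportionally more influence on the aggregate, which is the usual choice in federated averaging.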

The logical data model, which orchestrates federated discovery processes and maintains distributed state information, is relevant to the Goldacre review and the DARE UK agenda. We have already interfaced with funded DARE UK projects announced in February 2023 and are in active discussion with HDRUK and with technical and policy-led reviews. A recent response from SATRE, for example, is as follows: "The work you're doing sounds very relevant to us. It will add some requirements that may not have been brought up by others, and your goals of defining infrastructure across organisations fits very well with our goals, and those of the other DARE projects." The concept of policy-based object management is particularly relevant to the DARE TRE-FX project, which has as its objective the use of secure Research Objects to move between TREs while supporting the Five Safes principles of FAIR use data. Each of these projects requires a data fabric that will enable shared tools for data management, analysis, and visualisation. For DARE UK projects requiring a secure environment, the encapsulated policies and procedures can more generally be used to control object decryption, object parsing, object redaction, and object integrity.

The project has been highly active in seeking community engagement and input for future development activities. Workshops at CCFE have assessed recent progress and identified further gaps and challenges, providing strong motivation for developing integrated capabilities for interdisciplinary collaborations to pursue fundamental advances in virtual research that can fully leverage emerging computer architectures. An example of such extensibility was the potential use for the design and manufacture of a fusion reactor device ("STEP") for the UK Atomic Energy Authority. The project demonstrated how the generic policy management could be tailored for a radically different use case, applying sophisticated access control policies (for example, requiring redaction or access limitation). The outcomes can be used to enforce copyright, intellectual property, confidentiality, closed records, and classified records on access.
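A policy that enforces redaction on access can be sketched as follows. This is a minimal illustration, not the project's rule engine: the role names, clearance levels, field names, and the `apply_policy` helper are all hypothetical, and the sketch assumes a simple default-deny model in which any field without an explicit policy is redacted.

```python
from typing import Dict

REDACTED = "[REDACTED]"

# Hypothetical clearance levels, ordered from least to most privileged.
CLEARANCE: Dict[str, int] = {"public": 0, "partner": 1, "internal": 2}

# Hypothetical per-field policy: the minimum role needed to see each field.
FIELD_POLICY: Dict[str, str] = {
    "component_id": "public",
    "supplier": "partner",
    "design_tolerance": "internal",
}


def apply_policy(record: Dict[str, str], role: str) -> Dict[str, str]:
    """Return a copy of the record with fields above the caller's clearance redacted."""
    if role not in CLEARANCE:
        raise PermissionError(f"unknown role: {role}")  # access limitation
    level = CLEARANCE[role]
    result: Dict[str, str] = {}
    for field, value in record.items():
        required = FIELD_POLICY.get(field)
        # Default deny: a field with no policy entry is always redacted.
        if required is not None and level >= CLEARANCE[required]:
            result[field] = value
        else:
            result[field] = REDACTED
    return result


record = {"component_id": "C-17", "supplier": "ExampleCo", "design_tolerance": "0.02 mm"}
partner_view = apply_policy(record, "partner")
print(partner_view)
```

Returning a redacted copy rather than mutating the stored record keeps the original intact for audit purposes and for callers with higher clearance.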
Sectors Digital/Communication/Information Technologies (including Software), Healthcare, Other

 
Description The logical data model and virtualisation architecture have been shown to transform scientific and engineering practice for highly complex and scalable collaborative projects, thereby extending competitive advantage to the UK industrial landscape. An example is the application to in silico fusion reactor design now undertaken at the UK Atomic Energy Authority, which requires seventy-year management of an extended industrial supply chain across jurisdictions and governance environments. Multiple large-scale enterprises are now involved in the co-development of the technology, including HPE, Nvidia, IBM, Rolls Royce, Tokamak Energy, and Atkins Global, and many of the outputs are also being extended to first-class Apache Software projects, for example Daffodil (Data Format Description Language), on an open-source development platform. The ability to enable secure academic/industry supply-chain collaborations conducted in a virtual environment across global distances will give significant primacy to UK-based initiatives, contribute in real terms to the BEIS agenda, and accelerate attainment of UK Government impact targets (e.g., healthcare, environment). Significant work in mapping compliance and regulatory frameworks has resulted in a more complete understanding of the complex regulatory policies that inform healthcare and related governance issues. While the implementation of a fully compliant set of UK national healthcare policies remains to be done, there is an improved analytical understanding of how trans-national policies interface.
First Year Of Impact 2023
Sector Other
 
Title The working title is "RADON" adopted for historical reasons; however, the intention is to change the name of the technical product to reflect more accurately its purpose and capabilities. 
Description The software product is a containerised peer-to-peer federated server-client architecture that uses a distributed rule engine for highly extensible, extended-lifecycle data management. Lifecycle management policies are defined and executed by a rule engine and micro-services, with event information recorded in state (for audit purposes). The software incorporates a logical data model for knowledge encapsulation with the ability to reason across rules and policies mapped to Horn clauses and stable model semantics. The software interoperates with multiple enterprise data fabrics (HPE Ezmeral, IBM / Red Hat OpenShift, VMware Tanzu) and other environments (Omniverse). The software stack is based almost entirely on Apache Software Foundation components and is designed to enable real-time apps with a scale-out cloud native NoSQL database built on Apache Cassandra. The current version is based on extensions to the bundled "SMACK" software stack, which enables massive throughput, low latency, and elastic scalability as required for big data applications. A production rule system (Drools) applies the data management policies encoded as rules. All components are modular, web foundational, production oriented, and open source. Based on the lessons learned from the present project, the next step is to build a production-ready (turnkey) version in partnership with Nvidia and associated companies. 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact Academic impact is achieved through the development of a logical data model that represents a conceptual advance in knowledge management, which may be used to enable disruptive change in UK capability for virtual collaborations "in challenging environment". The project addresses the central industrial strategy goal of increasing global competitiveness across sectors, with existing support for Industry 4.0 technology and scaling out of the pioneering work conducted in national laboratories (CCFE) and data centres (EPCC). The data model, and its use for reasoning with a policy description language, is the cornerstone of an innovation that potentially enables transformational capabilities applicable across domains, in particular across the evolving data fabric of DARE UK. The result gives potential for automating knowledge generation across disciplines, enabling collaborative research conducted in a virtual environment, with data security and attestation performed at all levels of the fabric. The methodology, for example, has been incorporated into the DARPA-funded META Cyber-Physical Systems architecture and is now used in the programme (at Vanderbilt University) for systems integration. Economic and societal impact is achieved through the current integration with industry-standard and enterprise-grade data fabrics that will drive innovation and UK industrial partnerships with academia. Initial use in the nuclear sector is spreading across domains and companies (Rolls Royce, Atkins Global) investigating cross-sector virtual supply chain management across jurisdictions and governance environments (for example, in the case of UK data export controls for certain industry sectors).