Open Epidemiology for pandemic modelling: a transparent, traceable, reusable, open source pipeline for reproducible science
Lead Research Organisation:
University of Glasgow
Department Name: College of Medical, Veterinary & Life Sciences
Abstract
Historically, models used to support advice to government have not been publicly available, at least not readily, prior to publication. Technological advances and the growth of open source and reproducible science mean this is no longer tenable. Although current models feeding into UK policy are publicly available, they still lack the transparent, readily traceable chain of evidence connecting data and assumptions to model outputs that would allow them to be independently assessed.
Our Data Pipeline supports the implementation of the COVID-19 epidemiological models that we, the Scottish COVID-19 Response Consortium (SCRC), have developed using volunteer resources within the RAMP initiative to create new, complementary models. The Data Pipeline fulfils a critical role in our assessment of the models' fitness for purpose in providing policy advice, by managing and documenting a chain of trust that connects the primary data, analyses, and published and unpublished literature on COVID-19 to model outputs, thereby documenting the provenance of the conclusions being reached. The software interfaces we develop will be powerful, generic tools, useful to any policy-oriented modelling community.
Publications

Mitchell SN et al. (2022) FAIR data pipeline: provenance-driven data management for traceable scientific workflows. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.
Description | Modern epidemiological analyses to understand and combat the spread of disease depend critically on access to, and use of, data. Rapidly evolving data, such as data streams changing during a disease outbreak, are particularly challenging. Data management is further complicated by data being imprecisely identified when used. Public trust in policy decisions resulting from such analyses is easily damaged and is often low, with cynicism arising where claims of "following the science" are made without accompanying evidence. Tracing the provenance of such decisions back through open software to primary data would clarify this evidence, enhancing the transparency of the decision-making process. Here, we demonstrate a Findable, Accessible, Interoperable and Reusable (FAIR) data pipeline. Although developed during the COVID-19 pandemic, it allows easy annotation of any data as they are consumed by analyses, or conversely traces the provenance of scientific outputs back through the analytical or modelling source code to primary data. Such a tool provides a mechanism for the public, and fellow scientists, to better assess scientific evidence by inspecting its provenance, while allowing scientists to support policy-makers in openly justifying their decisions. We believe that such tools should be promoted for use across all areas of policy-facing research. |
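To make this concrete, the sketch below illustrates in Python the general shape of a pipeline-wrapped analysis: a session is opened against a run configuration, each input and output is recorded as it is read or written, and the session is finalised so that a registry can hold a complete record of the code run. The class and function names here (FairSession, read_input, write_output, finalise) are hypothetical placeholders, not the published pyDataPipeline API; consult the FAIRDataPipeline documentation for the actual calls.

# Illustrative sketch only: FairSession and its methods are hypothetical
# placeholders standing in for a real pipeline API, not pyDataPipeline itself.
import csv


class FairSession:
    """Minimal stand-in for a pipeline session that records provenance."""

    def __init__(self, config_path):
        self.config_path = config_path
        self.inputs = []
        self.outputs = []

    def read_input(self, data_product, path):
        # Record which named data product was consumed, and from which file.
        self.inputs.append((data_product, path))
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def write_output(self, data_product, path, rows):
        # Record the data product produced by this code run, then write it.
        self.outputs.append((data_product, path))
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)

    def finalise(self):
        # A real pipeline would push this metadata to the registry; here we
        # simply print the recorded provenance of the run.
        print("inputs:", self.inputs)
        print("outputs:", self.outputs)


if __name__ == "__main__":
    # Create a toy input file so the example runs end to end.
    with open("cases.csv", "w", newline="") as f:
        f.write("date,cases\n2020-03-01,12\n2020-03-02,20\n")

    session = FairSession("config.yaml")  # hypothetical run configuration
    cases = session.read_input("records/cases", "cases.csv")
    summary = [{"total_cases": sum(int(row["cases"]) for row in cases)}]
    session.write_output("outputs/case_summary", "case_summary.csv", summary)
    session.finalise()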
Exploitation Route | The plan to operationalise the pipeline is to integrate a suite of realistic policy-oriented models into the data pipeline. The use cases we have identified so far cover a wide range of activities likely to be carried out by our target users (including mathematical modellers, science-policy brokers, policy-makers and the wider public); each will be implemented for the integrated mathematical models as part of a process of analysing user-software interactions and developing documented procedures. We hope to involve science-policy brokers in this process; their involvement will be invaluable, since they are a key target user group and can also plausibly serve as proxies for policy-makers and the general public. In particular, some of these use cases are the different inspections of data and results that they (and other individuals, such as members of the public) might wish to make in order to understand the origins of the conclusions that researchers present. The data registry's web interface currently addresses these use cases only in a limited way, but further work is underway to improve this. Tools for provenance visualisation are also limited, and we believe further work is needed in this area to reduce the complexity of the diagrams produced and to make them easier to use for exploring data and results. If gaps become evident in the portfolio of use cases, these will be documented and carried forward for further attention. In the longer term we intend to pilot uptake in groups delivering model-based evidence to policy; initial implementation and evaluation would likely best be carried out as part of an emergency simulation exercise, where the utility, costs and robustness of the data pipeline could be assessed within the context of the wider demands made of the scientists by policy-makers. |
Sectors | Environment, Healthcare |
URL | https://github.com/FAIRDataPipeline |
Description | EPIC |
Organisation | EPIC Centre of Expertise on Animal Disease Outbreaks |
Country | United Kingdom |
Sector | Charity/Non Profit |
PI Contribution | We are working with EPIC to translate our work so that it is directly usable by them in their policy work over the coming years. |
Collaborator Contribution | They are dedicating time to developing use cases for the work and to further development of the software itself. |
Impact | None yet. |
Start Year | 2021 |
Title | C++ Implementation of the API for the FAIR Data Pipeline |
Description | A C++ API for interacting with the FAIR Data Pipeline |
Type Of Technology | Software |
Year Produced | 2022 |
Open Source License? | Yes |
Impact | This software, and the many other associated components of the FAIR Data Pipeline, allow traceability of research results in a FAIR manner. |
URL | https://zenodo.org/record/5877992 |
Title | DataPipeline.jl - FAIR Data Pipeline in Julia |
Description | Package for interfacing with the FAIR Data Pipeline in Julia |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | This software, and the many other associated components of the FAIR Data Pipeline, allow traceability of research results in a FAIR manner. |
URL | https://zenodo.org/record/5557281 |
Title | FAIRDataPipeline/javaDataPipeline |
Description | Java implementation of the FAIR Data Pipeline API |
Type Of Technology | Software |
Year Produced | 2021 |
Impact | This software, and the many other associated components of the FAIR Data Pipeline, allow traceability of research results in a FAIR manner. |
URL | https://zenodo.org/record/5547493 |
Title | The FAIR Data Pipeline command line tool |
Description | Command-line interface for the FAIR Data Pipeline system; it provides the commands necessary for integrating analysis and data processing with the FAIR registry. |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | This software, and the many other associated components of the FAIR Data Pipeline, allow traceability of research results in a FAIR manner. |
URL | https://zenodo.org/record/5708045 |
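As a rough illustration of how such a command-line tool slots into a workflow, the sketch below drives it from Python via subprocess: the inputs named in a run configuration are pulled from the registry, the analysis is run with its metadata recorded, and the results are pushed back. The subcommand names (pull, run, push) and the config filename are assumptions made for illustration; check them against the installed CLI's own help before use.

# Hedged sketch of a pull/run/push workflow driven through the command-line
# tool; subcommand names and the config filename are assumptions, so verify
# them against the installed CLI ("fair --help") before relying on this.
import subprocess


def fair(*args):
    """Invoke one CLI command and raise if it exits with an error."""
    subprocess.run(["fair", *args], check=True)


if __name__ == "__main__":
    config = "config.yaml"   # assumed name of the user-written run configuration
    fair("pull", config)     # fetch the registered inputs named in the config
    fair("run", config)      # execute the analysis, recording run metadata
    fair("push")             # send the new metadata and outputs to the remote registry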
Title | The FAIR Data Registry |
Description | The FAIR Data Registry is a Django website and REST API used by the data pipeline to store metadata about code runs and their inputs and outputs. |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | This software, and the many other associated components of the FAIR Data Pipeline, allow traceability of research results in a FAIR manner. |
URL | https://zenodo.org/record/5562750 |
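Because the registry exposes its metadata over a REST API, a code run and its inputs and outputs can in principle be inspected with any HTTP client, which is what makes the provenance trail browsable. The sketch below queries a locally running registry with Python's requests library; the base URL, the endpoint name and the response fields are assumptions for illustration, so the registry's own API documentation should be checked for the actual routes.

# Hedged sketch: browsing code-run metadata from a locally running registry.
# The base URL, the "code_run" endpoint and the field names are assumptions
# made for illustration, not the registry's documented schema.
import requests

REGISTRY = "http://127.0.0.1:8000/api"  # assumed address of a local registry


def list_code_runs(limit=5):
    """Fetch up to `limit` recent code-run records from the registry."""
    response = requests.get(f"{REGISTRY}/code_run/", timeout=10)
    response.raise_for_status()
    return response.json().get("results", [])[:limit]


if __name__ == "__main__":
    for run in list_code_runs():
        # Each record links a run to its inputs and outputs, which is what
        # lets a result be traced back through the code to primary data.
        print(run.get("description"), run.get("inputs"), run.get("outputs"))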
Title | pyDataPipeline - FAIR Data Pipeline in Python |
Description | Package for interfacing with the FAIR Data Pipeline in Python |
Type Of Technology | Software |
Year Produced | 2022 |
Open Source License? | Yes |
Impact | This software, and the many other associated components of the FAIR Data Pipeline, allow traceability of research results in a FAIR manner. |
URL | https://zenodo.org/record/6010921 |
Title | rDataPipeline - FAIR Data Pipeline in R |
Description | Package for interfacing with the FAIR Data Pipeline in R |
Type Of Technology | Software |
Year Produced | 2022 |
Open Source License? | Yes |
Impact | This software, and the many other associated components of the FAIR Data Pipeline, allow traceability of research results in a FAIR manner. |
URL | https://zenodo.org/record/5921117 |