ProvTemp: Provenance templates as a method for facilitating provenance capture and simulating provenance data

Lead Research Organisation: King's College London
Department Name: Health and Social Care Research

Abstract

Our world is increasingly driven by data. Medical, economic and political decisions are made based on the results of automatically analysing ever-growing volumes of data. Whether these are patient treatment decisions or stock trading recommendations, if we are to trust the decisions being made, we need to have insight into the workings of these systems and achieve understanding of their outputs - referred to as their provenance.

Related to the issue of trust is the concept of reproducibility in scientific discovery, as the ultimate test of findings' validity. Science is now all but impossible without data-intensive infrastructures, but these changes make research harder to verify and follow using traditional "pen-and-paper" methods, and new techniques are required to ensure correctness. A number of recent studies looked into published research in certain areas, only to find that a minority could be reproduced using the information provided. Understanding the provenance of the data and processes that we are relying on has never been more critical.

Data provenance is a research field dedicated to the standardised, uniform representation of the network of data products, the tasks that create and use those data, and the human and software actors who perform these tasks - typically represented as provenance graphs. Popular in "computational" disciplines that have long relied on scientific software, provenance is now becoming relevant and necessary to areas which have only recently become data-driven and which operate using multiple disjointed software tools.

In order to facilitate the adoption of provenance in these disciplines, the ProvTemp project is modelling provenance templates - provenance graph fragments that multiple software tools can compose into a unified, meaningful trace of conducted research. A set of templates is defined by the scientists, describing the research details that need to be captured, and these are then translated into concrete provenance data. This theoretical work has two immediate applications. The first is a method for introducing provenance into scientific environments by integrating with existing software tools, minimising the effort needed for the developers of those tools to start capturing provenance. The second is a mechanism for using the templates to simulate realistic provenance data that would be produced from those templates, allowing them to be tested to ensure they are sufficiently informative for the intended purpose, e.g. publishing details of a research task or providing a legally required audit trail.
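
The idea of translating a template into concrete provenance data, and of generating synthetic data from the same template, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the project's actual data model: a template is taken to be a list of (subject, relation, object) edges, variables are marked with a hypothetical "var:" prefix, the relation names are borrowed from the W3C PROV vocabulary, and the simulated values are numbered placeholders rather than realistic domain data.

```python
VAR_PREFIX = "var:"  # hypothetical marker for template variables

# Illustrative template fragment for one clinical-trial step (assumed, not from ProvTemp).
RANDOMISATION_TEMPLATE = [
    ("var:allocation", "wasGeneratedBy", "var:randomisation"),
    ("var:randomisation", "used", "var:patient_record"),
    ("var:randomisation", "wasAssociatedWith", "var:tool"),
]


def instantiate(template, bindings):
    """Return a concrete fragment with every variable node replaced by its binding."""
    def resolve(node):
        if not node.startswith(VAR_PREFIX):
            return node  # already a concrete identifier
        name = node[len(VAR_PREFIX):]
        if name not in bindings:
            raise KeyError(f"no binding for template variable {name!r}")
        return bindings[name]

    return [(resolve(s), rel, resolve(o)) for s, rel, o in template]


def merge(*fragments):
    """Compose fragments into one edge list, keeping order and dropping duplicates."""
    graph, seen = [], set()
    for fragment in fragments:
        for edge in fragment:
            if edge not in seen:
                seen.add(edge)
                graph.append(edge)
    return graph


def simulate(template, runs, fixed=None):
    """Generate synthetic fragments by binding each variable to a numbered placeholder.

    Variables listed in `fixed` keep the same value in every run, for example
    a single software tool that performs all the simulated steps.
    """
    fixed = dict(fixed or {})
    names = sorted(
        node[len(VAR_PREFIX):]
        for node in {n for s, _, o in template for n in (s, o)}
        if node.startswith(VAR_PREFIX)
    )
    fragments = []
    for i in range(runs):
        bindings = {name: f"{name}:{i}" for name in names}
        bindings.update(fixed)
        fragments.append(instantiate(template, bindings))
    return merge(*fragments)


# One real capture event, followed by a small simulated trace.
captured = instantiate(
    RANDOMISATION_TEMPLATE,
    {
        "allocation": "alloc:1",
        "randomisation": "rand:1",
        "patient_record": "ehr:42",
        "tool": "tool:trial-app",
    },
)
simulated = simulate(RANDOMISATION_TEMPLATE, runs=4, fixed={"tool": "tool:trial-app"})
```

In this sketch the same template drives both capture and simulation, which is what would let a template set be checked for informativeness before any tool integration is written.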

The ProvTemp approach shall be evaluated on the example of modelling a clinical trial. The medical research community is a typical example of a non-computational discipline becoming increasingly data-driven, and it is currently moving towards big data enabled, intelligent infrastructures through use of data routinely captured in Electronic Health Record (EHR) systems. The trend in medical research is towards Learning Health Systems, which seek to maximise and optimise the use and benefit of EHR data in clinical research and practice. The EU TRANSFoRm project implemented a prototype software infrastructure for the Learning Health System, and conducted an international clinical trial driven by EHR data. The ProvTemp approach will replicate the trial execution using provenance templates, and examine the produced provenance data to ensure our method is valid and applicable to future clinical trials.

In addition to the clinical trial work, we shall work closely with the UK's Software Sustainability Institute (SSI), which promotes sustainable software technologies. SSI shall assist in ensuring that ProvTemp is generalisable and relevant to other scientific disciplines. We shall also engage the public in defining the wider questions around reproducibility and quality of research. Finally, ProvTemp will produce a roadmap for further research, taking stock of the work done and identifying future opportunities.

Planned Impact

ProvTemp will facilitate adoption of provenance technologies in non-computational disciplines, such as medicine, that rely on heterogeneous software and lack the computational infrastructure that existing provenance solutions integrate with. By doing this, we shall move these disciplines closer to the vision of reproducible research, reduce administrative costs by automating the audit task, and support trust by increasing transparency of the research task.

These are issues that are of high importance to academics, industry and the public. Aside from an active data provenance research community in the UK, there is a fresh initiative around Research Objects, portable abstractions of a research task, encompassing source data, provenance, structured documentation, associated software snapshots and all other entities contributing to scientific findings. Industry, particularly Contract Research Organisations (CROs) conducting clinical trials and pharmaceutical companies, is under increased scrutiny to have readily available proof that its data management procedures are effective and conform to national and international legislation. With personal data on the Web, such as browsing habits and purchase histories, being routinely mined for commercial information, the public is rightfully concerned about how their private data will be used in research, and increasingly seeks assurance and understanding of the process before giving consent to the use of their data.

We shall address public engagement through an active social media presence on a blog, Twitter, and LinkedIn, complemented by a series of public talks via the Cafe Scientifique grassroots science movement and the Pint of Science initiative. King's College London offers several venues for involving the public in science, which will be pursued, together with similar mechanisms at the Imperial partner.

The Software Sustainability Institute, also a partner on the project, seeks to improve the quality of software engineering in science and sees provenance work as one of the key contributing factors to that. As part of its remit, SSI runs the Research Software Engineers community, with its annual Collaborations Workshop and a range of other workshops and activities. By promoting ProvTemp through these channels, we shall ensure that we reach a wide audience of research software developers who are key to popularising our approach.

The PI shall engage key UK academic institutions involved in provenance work through a series of academic visits and presentations at relevant workshops and conferences, including the International Provenance and Annotation Workshop (IPAW) and Theory and Practice of Provenance (TAPP). The clinical trial use case will also be presented in medical informatics journals and conferences, such as the American Medical Informatics Association (AMIA) Summits. Of particular use will be aligning our work with the wider Research Objects initiative for sharing and publishing academic research, and we shall investigate the impact of ProvTemp on the applicability of Research Objects to non-computational disciplines that have not been tackled yet.

At the end of the project, a roadmap will be produced, drawing on the insights gained through the engagements and with input from academics in relevant fields. This roadmap will articulate the open problems that need to be addressed to achieve reproducibility of data-producing tasks in scientific research and position these problems in relation to the results of ProvTemp. We will review the current state of the art that addresses these problems, identify the research communities contributing to these areas and provide examples of applications that either already do or could in the future benefit from provenance templates.
 
Description Transparency of research is increasingly recognised as a major problem in all data-intensive settings, as the increase in volume and velocity of data makes it impossible to use manual methods to track all the ways in which data is transformed and utilised. Data provenance offers a model for representing the history of what happened to data, expressed in domain-relevant concepts, but it has traditionally suffered from difficulties in implementation. We have successfully piloted a light-weight methodology for implementing provenance support in software tools, enabling automatic generation of a computable audit trail, based on high-level abstract provenance fragments - provenance templates. The methodology consists of a software tool, the Provenance Template Server, and an associated method for designing a provenance template-based solution using UML constructs. This has a significant effect on making software more transparent and accountable, which is particularly relevant in the health sector. In addition to that, we have developed mechanisms to simulate realistic provenance data based on the high-level models, allowing us to prototype sample provenance data for a particular problem domain before the actual implementation. The methodology and associated technology are being taken forward in a number of follow-up projects on medical decision support and clinical trials, as well as in a commercial software tool that performs visual analytics on health data. We worked with the Software Sustainability Institute to get a critical evaluation of our tools, and to effectively plan for future developments. Finally, in addition to engaging with the scientific community through workshops and conferences, we also discussed our ideas with the public through more informal events such as Café Scientifique and the forthcoming Pint of Science 2018.
Exploitation Route This work will lower the barrier to entry for implementing data provenance. The tools and the methods are being refined in follow-up projects, and we hope to see significant take-up after presentations of the demonstrators at relevant conferences. The technology and the approach have applications in all data-intensive domains where there is a need to trace the origin of entities. In addition to the software and health industries (including pharma), this also applies to the food industry (tracing the origin of food through the supply chain) and any other manufacturing processes where it is important to trace the transformation and combination of raw goods into products. Our main focus will be on promoting and encouraging provenance template adoption in the three areas where we are working at the moment: data analytics, clinical trials and decision support systems. The PI of the grant is one of the Co-Is on the newly funded Health Data Research UK London substantive site (successor to the Farr centres), which will fund an ambitious programme of research in the next five years, including 21st Century Clinical Trials and Actionable Analytics, allowing for a fantastic opportunity to further develop the role of data provenance in health data science. Finally, the collaboration with NICE (National Institute for Health and Care Excellence) has opened the door to usage of ProvTemp technology as the audit trail backend to policy development processes.
Sectors Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology

 
Description The research work in ProvTemp has directly impacted a number of projects and collaborations. The most obvious industrial impact was on the collaboration with Imosphere Ltd. (fka Face Ltd.) that is happening through an InnovateUK KTP, whereby the company is embedding data provenance support, using the Provenance Template Server, into their Imolytics data analysis software. Thus, Imosphere is now adopting and adapting the full framework and software tooling developed in ProvTemp, which is becoming an official new feature of the system in 2018, and is being promoted in their marketing materials and to existing and prospective customers. Health informatics impact has been notable. The clinical trial templates designed as exemplars are now being used as the basis for the provenance support in the Runny Ear Study (REST) clinical trial that has been funded by the NIHR and is starting in 2018. The decision support template examples have been used in the CLAHRC South London Stroke theme decision aid tool, and will be adapted in the ROAD2H and CONSULT EPSRC projects. Two further decision support projects have been funded recently and will look into the usage of provenance: NIHR DSS for reducing antibiotic prescription and CRUK DSS for early detection of cancer in primary care. Finally, the National Institute for Health and Care Excellence (NICE) has been investigating template-based provenance as a metadata management tool for their guidelines, recommendations and evidence, potentially impacting the way health care is conducted in the UK. The initial findings, funded through an MRC Partnership award, have been encouraging, and we have just started an EPSRC Impact Acceleration Award to deliver the first prototype, leading to a larger grant proposal later in the year.
First Year Of Impact 2018
Sector Digital/Communication/Information Technologies (including Software),Healthcare
Impact Types Economic,Policy & public services

 
Description EPSRC Impact Acceleration Award: NICEProv Pilot - Evaluating the feasibility of PROV Temp technology for managing the lifecycle of NICE evidence-based guidelines
Amount £39,833 (GBP)
Funding ID EP/R511559/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 02/2019 
End 01/2020
 
Description EPSRC Intelligent Technologies to Support Collaborative Care
Amount £1,400,000 (GBP)
Funding ID EP/P010105/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 03/2017 
End 02/2020
 
Description Global Challenges Research Fund
Amount £1,515,900 (GBP)
Funding ID EP/P029558/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 05/2017 
End 04/2020
 
Description Health Technology Assessment programme Antibiotics for acute otitis media with discharge
Amount £1,475,606 (GBP)
Funding ID 16/85/01 
Organisation National Institute for Health Research 
Sector Public
Country United Kingdom
Start 01/2018 
End 12/2020
 
Description Industrial Proximity Awards
Amount £29,500 (GBP)
Organisation Medical Research Council (MRC) 
Sector Academic/University
Country United Kingdom
Start 02/2018 
End 07/2018
 
Description London Substantive Site for HDR UK
Amount £6,000,000 (GBP)
Organisation Health Data Research UK 
Start 04/2018 
End 03/2023
 
Description Population Research Committee project grant
Amount £300,000 (GBP)
Funding ID C37891/A25310 
Organisation Cancer Research UK 
Sector Charity/Non Profit
Country United Kingdom
Start 04/2018 
End 03/2021
 
Description Collaboration with FACE Ltd. on use of provenance templates 
Organisation FACE Recording & Measurement Systems Ltd
Country United Kingdom 
Sector Private 
PI Contribution The work done on ProvTemp has influenced the Knowledge Transfer Partnership my group has with Face Ltd. on developing a software infrastructure for data provenance in their software system.
Collaborator Contribution The partners contributed the use case: a health data analysis and reporting framework.
Impact Provenance module for Face's iMoLYTICS software is in development.
Start Year 2016
 
Description Collaboration with Imperial College on use of template-based provenance for Learning Health System applications 
Organisation Imperial College London
Country United Kingdom 
Sector Academic/University 
PI Contribution The provenance template technology developed by myself and my team in ProvTemp is being used as the basis for the provenance infrastructure in Learning Health System decision support research at Imperial's Institute of Global Health Innovation.
Collaborator Contribution The partners are providing the medical use cases that could benefit from the provenance infrastructure.
Impact Successfully funded projects - EPSRC ROAD2H and CRUK Demonstrating the feasibility of a Learning Health System for cancer diagnosis in primary care. Through the former, our decision support infrastructure, including the provenance module, will be deployed in health systems in Serbia and China. This is a multi-disciplinary collaboration spanning improvement science, medicine, computer science, and medical informatics.
Start Year 2017
 
Description Collaboration with National Institute for Health and Care Excellence on use of provenance for managing recommendations and evidence 
Organisation National Institute for Health and Care Excellence (NICE)
Country United Kingdom 
Sector Public 
PI Contribution My team is applying the Provenance Template modeling method to the challenges that NICE has in managing their research metadata. Specifically, they are interested in exploring the versioning of their guideline recommendations, and its relationship to the changing evidence base.
Collaborator Contribution NICE is conducting a survey of their stakeholders (industrial partners, clinical organizations etc.) to understand their needs with respect to data provenance of NICE guidelines. This is a valuable piece of work that they are uniquely positioned to deliver, and will be of significant use to my team in developing the provenance research portfolio further. As a secondary benefit, the NICE employee placed in my group is advising on the ROAD2H and CONSULT projects which both have elements of guideline modeling, to ensure its applicability to the UK national requirements.
Impact We have jointly obtained "Towards Computable Guidelines", an MRC Industry Proximity Award worth £30K, to establish a pilot collaboration through hosting a NICE employee in my group, and are currently working on a paper and a larger grant proposal. This is a multi-disciplinary collaboration that spans informatics and public health.
Start Year 2017
 
Title Provenance Template Server 
Description Provenance Template Server (PTS) allows other software tools to capture data provenance information through a set of pre-defined templates which are loaded into the PTS, and then invoked through a programmatic API. In practice, this allows third-party tools to easily store their provenance, without having to worry about storage and graph data management. The tool has three backends: MySQL relational, Neo4j graph, and OrientDB graph - using the TinkerPop API that supports some further graph databases as well. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact The software is being used in a number of follow-up projects, including CONSULT, ROAD2H, the NICE collaboration on guideline recommendation provenance, InnovateUK EmProv and others. EmProv is particularly relevant, since it allows a commercial company (Imosphere, fka FACE Ltd.) to embed our software into their product. 
URL https://bitbucket.org/kclbig/templates-server
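
The pattern described for the Provenance Template Server (templates loaded once, then invoked by client tools, with storage delegated to an interchangeable backend) might look roughly like the following sketch. It does not reproduce the PTS API: the class names, method names and the in-memory backend are all hypothetical, and here every node in a template is assumed to be a placeholder that the caller must bind.

```python
from typing import Dict, List, Protocol, Tuple

Edge = Tuple[str, str, str]  # (subject, relation, object)


class Backend(Protocol):
    """Anything that can persist concrete provenance edges (hypothetical interface)."""

    def store(self, edges: List[Edge]) -> None: ...


class InMemoryBackend:
    """Stand-in for a relational or graph store; simply keeps edges in a list."""

    def __init__(self) -> None:
        self.edges: List[Edge] = []

    def store(self, edges: List[Edge]) -> None:
        self.edges.extend(edges)


class TemplateRegistry:
    """Holds named templates and expands them on request (hypothetical, not the PTS API)."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend
        self._templates: Dict[str, List[Edge]] = {}

    def load_template(self, name: str, edges: List[Edge]) -> None:
        self._templates[name] = list(edges)

    def record(self, name: str, bindings: Dict[str, str]) -> List[Edge]:
        """Bind every placeholder in the named template and store the result."""
        if name not in self._templates:
            raise KeyError(f"unknown template {name!r}")
        concrete = [
            (bindings[s], relation, bindings[o])
            for s, relation, o in self._templates[name]
        ]
        self._backend.store(concrete)
        return concrete


# A client tool only supplies bindings; it never touches storage directly.
backend = InMemoryBackend()
registry = TemplateRegistry(backend)
registry.load_template(
    "analysis",
    [("report", "wasGeneratedBy", "analysis_run"), ("analysis_run", "used", "dataset")],
)
registry.record(
    "analysis",
    {"report": "report:7", "analysis_run": "run:7", "dataset": "dataset:3"},
)
```

Keeping the backend behind a small interface is what would allow a relational store to be swapped for a graph store without changing the calling tools.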
 
Description Cafe Scientifique event (London) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact On 31st January 2017, I gave a talk on data provenance in health research at Cosy Science, a Cafe Scientifique event in Holborn, London. The attendance was around 20-25 people, comprising general public, academics, postdocs and medical practitioners. The format was a 20-minute oral presentation with no displays or other helper tools, followed by a 20-minute discussion. The discussion ended up lasting around 40 minutes due to some interesting questions being asked on the role of provenance in establishing the route of patients' data through the research ecosystem.
Year(s) Of Engagement Activity 2017
URL http://www.cafescientifique.org/
 
Description Data provenance: Principles and why it matters for biomedical applications: tutorial at Informatics for Health conference in Manchester 22nd April 2017 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact My Research Associate and I gave a four-hour training course on data provenance at the Informatics for Health conference in Manchester on 22nd April 2017. The feedback from the audience was excellent and we raised awareness of both the area and our work on the template approach.
Year(s) Of Engagement Activity 2017
 
Description Pinar's presentation in Manchester 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Dr Pinar Alper visited Prof Carole Goble's group at the University of Manchester on 27th January 2017 and presented the work done on provenance templates in our project. We got some good feedback from the audience and have started discussions around potential future collaborations.
Year(s) Of Engagement Activity 2017
 
Description Talk to the Petnica Science Camp students in Serbia 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Schools
Results and Impact Presentation on data provenance and the Learning Health System given to 120-odd high school students attending the Petnica Science Camp in Serbia. The talk coincided with the summer term at PSC, and the students present there were all involved in their own research projects. The presentation kick-started a longer discussion on the role of data provenance in alleviating privacy fears around how people's private medical data are being used.
Year(s) Of Engagement Activity 2017
 
Description The 2nd Learning Health Systems Summit Washington December 2016 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact On 8th-9th December 2016, I was invited to attend the 2nd Learning Health Systems summit in Washington DC, where I talked about the work I am doing in data provenance and the impact it can have on the Learning Health System work in general. The event was sponsored by the Kanter Health Foundation (https://webserver.kanterhealth.org/).
Year(s) Of Engagement Activity 2016
URL http://www.learninghealth.org/2016-second-lhs-summit
 
Description Visit to Erasmus MC, Rotterdam, to present ProvTemp outputs 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Vasa Curcin and Martin Chapman visited Rotterdam to present the data provenance and CONSULT technologies developed in the group to the researchers at Erasmus MC. The specific goal was to see how our Provenance Template Server could be used to provide reproducibility features for the tooling around the OHDSI Observational Medical Outcomes Partnership (OMOP) Common Data Model. Several possibilities were identified and we shall aim to submit a joint proposal around it.
Year(s) Of Engagement Activity 2019