ReComp: sustained value extraction from analytics by recurring selective re-computation

Lead Research Organisation: Newcastle University
Department Name: Sch of Computing

Abstract

As the cost of allocating computing resources to data-intensive tasks continues to decrease, large-scale data analytics becomes ever more affordable, continuously providing new insights from vast amounts of data. Increasingly, predictive models that encode knowledge from data are used to drive decisions in a broad range of areas, from science to public policy, to marketing and business strategy. The process of learning such actionable knowledge relies upon information assets, including the data itself, the know-how that is encoded in the analytical processes and algorithms, as well as any additional background and prior knowledge. Because these assets continuously change and evolve, models may become obsolete over time, leading to poor decisions in the future, unless they are periodically updated.

This project is concerned with the need and opportunities for selective recomputation of resource-intensive analytical workloads. The decision on how to respond to changes in these information assets requires striking a balance between the estimated cost of recomputing the model, and the expected benefits of doing so. In some cases, for instance when using predictive models to diagnose a patient's genetic disease, new medical knowledge may invalidate a large number of past cases. On the other hand, such changes in knowledge may be marginal or even irrelevant for some of the cases. It is therefore important to be able, firstly, to determine which past results may potentially benefit from recomputation, secondly, to determine whether it is technically possible to reproduce an old computation, and thirdly, when this is the case, to assess the costs and relative benefits associated with the recomputation.
The project investigates the hypothesis that, based on these determinations, and given a budget for allocating computing resources, it should be possible to accurately identify and prioritise analytical tasks that should be considered for recomputation.
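As a purely illustrative sketch of the prioritisation described above, the following Python fragment ranks candidate tasks by their estimated benefit per unit of cost and selects them greedily until the budget is spent. The `Candidate` fields, the example figures, and the greedy ratio heuristic are all assumptions made for illustration; the project text does not prescribe a particular selection algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A past analytical task that might be worth recomputing (hypothetical)."""
    task_id: str
    est_cost: float      # estimated cost of recomputation, in budget units
    est_benefit: float   # estimated benefit of a refreshed result
    reproducible: bool   # can the old computation still be run?


def prioritise(candidates, budget):
    """Greedily pick reproducible tasks by benefit/cost ratio within a budget."""
    # Drop tasks that cannot be re-run or have no positive cost/benefit estimate.
    viable = [c for c in candidates
              if c.reproducible and c.est_cost > 0 and c.est_benefit > 0]
    # Highest estimated benefit per unit of cost first.
    viable.sort(key=lambda c: c.est_benefit / c.est_cost, reverse=True)
    selected, spent = [], 0.0
    for c in viable:
        if spent + c.est_cost <= budget:
            selected.append(c)
            spent += c.est_cost
    return selected, spent


# Made-up example data, not taken from the project.
candidates = [
    Candidate("case-A", est_cost=4.0, est_benefit=8.0, reproducible=True),
    Candidate("case-B", est_cost=5.0, est_benefit=5.0, reproducible=True),
    Candidate("case-C", est_cost=1.0, est_benefit=9.0, reproducible=False),
    Candidate("case-D", est_cost=2.0, est_benefit=1.0, reproducible=True),
]
selected, spent = prioritise(candidates, budget=6.0)
print([c.task_id for c in selected], spent)
```

A greedy ratio ordering is simple but not guaranteed to be optimal; an exact knapsack formulation could replace it where the number of candidates is small.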

Our approach considers three types of meta-knowledge that are associated with analytics tasks, namely (i) knowledge of the history of past results, that is, the provenance metadata that describes which assets were used in the computation, and how; (ii) knowledge of the technical reproducibility of the tasks; and (iii) cost/benefit estimation models.
Element (i) is required to determine which prior outcomes may potentially benefit from changes in information assets, while reproducibility analysis (ii) is required to determine whether an old analytical task is still functional and can actually be performed again, possibly with new components and on newer input data.
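The way elements (i) and (ii) combine can be sketched as follows, assuming a deliberately simplified provenance store that maps each past outcome to the versioned assets it used. The record layout, asset names, and version labels are hypothetical and stand in for whatever provenance model the framework actually adopts.

```python
# Hypothetical provenance records: for each past outcome, the versioned
# assets it used. Names and versions are made up for illustration.
provenance = {
    "outcome-1": {"reference_db": "v1", "tool": "2.0"},
    "outcome-2": {"reference_db": "v2", "tool": "2.0"},
    "outcome-3": {"reference_db": "v1", "tool": "1.5"},
}

# Hypothetical reproducibility flags: can the task still be executed?
reproducible = {"outcome-1": True, "outcome-2": True, "outcome-3": False}


def affected_outcomes(provenance, asset, old_version):
    """Return outcomes whose provenance shows they used `asset` at `old_version`."""
    return sorted(o for o, used in provenance.items()
                  if used.get(asset) == old_version)


def recomputation_candidates(provenance, reproducible, asset, old_version):
    """Keep only affected outcomes that can still be re-executed."""
    return [o for o in affected_outcomes(provenance, asset, old_version)
            if reproducible.get(o, False)]


# Suppose reference_db has moved on from v1: which outcomes are in scope?
affected = affected_outcomes(provenance, "reference_db", "v1")
candidates = recomputation_candidates(provenance, reproducible, "reference_db", "v1")
print(affected)
print(candidates)
```

Only the outcomes that survive both filters would then be passed to the cost/benefit estimation step.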

As the first two of these elements are independent of the data domain, we aim to develop a general framework that can then be instantiated with domain-specific models, namely for cost/benefit analysis, to provide decision support for prioritising and then carrying out resource-intensive recomputations over a broad range of analytics application domains.

Both (i) and (ii) entail technical challenges, as systematically collecting the provenance of complex analytical tasks, and ensuring their reproducibility, require instrumentation of the data processing environments. We plan to experiment with workflows, a form of high-level programming and middleware technology, to address both these problems.

To show the flexibility and generality of our framework, we will test and validate it on two very different case studies where decision making is driven by analytical knowledge, namely genetic diagnostics and policy making for Smart Cities.

Planned Impact

The ReComp software will benefit consumers of the knowledge outcomes from analytics, by ensuring that the value initially associated with those assets is retained over time. It will also benefit the organisations performing the analyses, by helping them manage their analytics budget through prioritisation of their re-computation efforts.
In the short term, our impact strategy will focus on the user groups associated with the Genomics and Urban Observatory case studies. This will give us iterative feedback on the ReComp tools and techniques, ensuring that they are both useful and useable. The software will be developed through a user-centric design model, whereby users and subject experts from the case studies will be engaged from the start and in each phase of the development and release cycles.

Genomics: local experts are biomedical researchers at the IGM, who will provide unsolved or inconclusive patient cases to the project and validate the usability of the ReComp software. The main benefit for this group is the ability to systematically identify opportunities to address previously unsolved cases. This will translate into substantial savings, as IGM researchers estimate the success rate for diagnosis of rare diseases at about 25%, and typical per-sample processing fees charged by commercial genetic diagnostics services are about £500. The engagement of IGM researchers is underpinned by the participation of their director, P. Chinnery, as CO-I.
Urban Observatory: the anticipated impact is to inform public policy by providing local government with precise and current knowledge of threats to public welfare, as well as of opportunities for optimisation of public services. P. James, leader of the Urban Observatory and CO-I on the project, will manage user engagement with the project.
As an additional benefit, the Meta-K repository will provide a structured knowledge base for the growing volume of sensor data that the group is expected to generate for years to come.
Both these case studies offer a pathway to societal as well as economic benefit - improving health outcomes and city management will both have major benefits for the well-being of society.

In the medium and long term, ReComp has the potential to benefit knowledge-driven companies and organisations that rely on analytics for decision-making. The impact for them lies in continued value from analytics knowledge over time, and in financial savings through selective re-computation.
Promotion of ReComp uptake by these organisations will be pursued through non-academic dissemination and promotion channels, as explained in the Pathway document.

Academic impact will be achieved through publication and dissemination in the big data area, but also specifically in e-science venues including the annual IEEE International e-science conference, and the Future Generation Computer Systems journal, which are known to welcome contributions on data architectures in support of science.

ReComp will be released as open source to encourage adoption and customisation, and made available as a service on a cloud infrastructure. A sustainability plan will be formulated to facilitate this, possibly in collaboration with the Software Sustainability Institute (SSI). In addition to adopting a user-centric development model for the software deliverables, we will promote the framework through demo events, academic dissemination, and user training events.

We will also leverage active collaborations between the Digital Institute (DI) at Newcastle University, led by Prof. Watson (CO-I), and data science researchers in a wide range of disciplines, including e-Health, Internet of Things, and wearable computing.

The DI also actively engages with external organisations through its Cloud Innovation Centre. This gives direct routes for the outputs of the project to be widely deployed within and beyond Newcastle University, in companies as well as by research teams.
 
Description We have addressed the "impact estimation problem" that is at the core of the selective re-computation challenge. Specifically, we have identified two separate "impact estimation" functions that can be used to predict the consequences of input and reference data changes on the outcome of a complex computation. The first is in high throughput Genomics (high-variant calling process), the second is in Flood modelling (using a simulator developed in Newcastle). Furthermore, we have demonstrated the kind of data analytics that are required on the provenance of past executions, and more broadly the value of systematically collecting a detailed history of past executions of complex processes, as a way to inform decisions on re-executions in reaction to changes. We have also explored the structure of a generic "ReComp" meta-process which, by observing and controlling an underlying data-intensive computational process, informs domain experts of the opportunities for selective re-computation in reaction to changes, and of their associated costs.
Exploitation Route We expect our results to inform both researchers and practitioners in the "big data analytics" space, as well as those in computational science and engineering. Our results may be used to help experts faced with the costs of periodically refreshing computational results to decide when and to what extent those refreshes are needed.

Since the previous submission, we have been awarded an EPSRC IAA grant to pursue pre-commercial opportunities to exploit ReComp technology.
This has so far resulted in a new interdisciplinary collaboration with Newcastle University Medical School, and in the preparation of tutorial material to enable third parties to experiment with our system implementation.

Update: the IAA grant has generated a number of engagements with global companies (including IBM and Microsoft) and public organisations in the UK (DAFNI, ONS, and Ordnance Survey, amongst others). These are highlighted in the "engagements" section as new entries for this submission.
Sectors Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Environment; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

URL http://recomp.org.uk/
 
Description ReComp Impact Accelerator Grant
Amount £62,942 (GBP)
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 07/2019 
End 03/2020
 
Description DataONE 
Organisation DataONE
Country United States 
Sector Learned Society 
PI Contribution Both partners will hopefully be able to benefit from ReComp towards the end of the project.
Collaborator Contribution Both partners are listed on the grant as in-kind contributors. Both partners provide technology that ReComp can use to build its implementation of the reference selective re-computation framework.
Impact ReComp was tested on the DataONE infrastructure, but this was not taken any further.
Start Year 2016
 
Description DataONE 
Organisation University of Manchester
Department School of Computer Science
Country United Kingdom 
Sector Academic/University 
PI Contribution Both partners will hopefully be able to benefit from ReComp towards the end of the project.
Collaborator Contribution Both partners are listed on the grant as in-kind contributors. Both partners provide technology that ReComp can use to build its implementation of the reference selective re-computation framework.
Impact ReComp was tested on the DataONE infrastructure, but this was not taken any further.
Start Year 2016
 
Description Collaboration with the Institute of Neuroscience at Newcastle University
Organisation Newcastle University
Department Institute of Neuroscience
Country United Kingdom 
Sector Academic/University 
PI Contribution Investigated new forms of "impact functions" that apply to a dataset of relevance to the Brain and Movement Group (within the partner institution). A full report is available.
Collaborator Contribution Contributed their knowledge of the data set and of the research domain, and participated in numerous dedicated meetings.
Impact The output is a full internal report, with the final version yet to be released. This is a multi-disciplinary effort; the main discipline is the study of cognitive decline in Parkinson's patients using accelerometry data analysis.
Start Year 2019
 
Description joint funding project with Heriot-Watt 
Organisation Heriot-Watt University
Department School of Engineering & Physical Sciences
Country United Kingdom 
Sector Academic/University 
PI Contribution Successful joint grant proposal; project currently in progress (CEM-DIT)
Collaborator Contribution Contribution to joint proposal
Impact None yet; papers in progress
Start Year 2015
 
Title ReComp reference implementation, with a tutorial designed to enable third parties to experiment with the system 
Description Reference implementation designed to experiment with the ReComp process monitoring model. A tutorial is available here: https://github.com/ReComp-team/ReComp-Main/tree/master/docs. Please note that the GitHub link is still private; we will open it up when finalised.
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact too early to tell 
URL http://recomp.org.uk
 
Description A community workshop organised as part of the ProvenanceWeek bi-annual event 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Study participants or study members
Results and Impact This one-day workshop aimed to bring together researchers and practitioners from multiple communities around the general problem of incremental re-computation of knowledge outcomes that are produced using data-intensive and computationally expensive processes (workflows, simulations, training of predictive models). Incremental recomputation is recomputing in response to changes in the elements that contributed to the original computation, i.e., inputs, reference datasets, tools, libraries, and deployment environment. This need for incremental recomputation and its optimisation across multiple computations is fundamental to current trends in data science, big data, and machine learning, where online learning approaches can sometimes be employed to achieve incremental model re-training.
Year(s) Of Engagement Activity 2018
URL https://sites.google.com/view/incremental-recomp-workshop/home
 
Description Invited academic talk given at Keele University (colloquia series) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact An invited talk given in Dec. 2016 at Keele University, UK, on the formulation of two ReComp problems: change impact estimation, and conditional stream processing.
Year(s) Of Engagement Activity 2016
URL https://www.slideshare.net/pmissier/recomppreserving-the-value-of-large-scale-data-analytics-over-ti...
 
Description Invited participation in the Big Data Quality Panel for the Diachron Workshop at the EDBT 2016 conference
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Invited participant in a panel organised as part of the EDBT conference (European DB conference): http://edbticdt2016.labri.fr/. This was an opportunity to present the project to a qualified audience and to discuss it with other experts in the database community.
Year(s) Of Engagement Activity 2016
URL http://www.diachron-fp7.eu/2nd-diachron-workshop.html
 
Description Invited talk at Universidad de La Rioja, Spain, as part of a provenance workshop organised by the Dept. of Computer Science
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact This invited talk was part of a multi-day event organised by colleagues at Universidad de La Rioja, Spain, and was attended by faculty and students.
Year(s) Of Engagement Activity 2019
 
Description Invited talk at University of Leeds School of Computing (School colloquia series)
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Academic talk given as part of the Leeds School of Computing colloquia series, mostly to academics in the School.
Year(s) Of Engagement Activity 2017
URL https://www.slideshare.net/pmissier/preserving-the-currency-of-analytics-outcomes-over-time-through-...
 
Description Invited talk given at the 14th Annual Meeting of the Bioinformatics Italian Society (BITS) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Patients, carers and/or patient groups
Results and Impact This invited talk focused on the Genomics case study for the ReComp project. It addressed a broad audience from the research (bioinformatics) as well as clinical (health) sectors.
Year(s) Of Engagement Activity 2017
URL http://bioinformatics.it/bits2017
 
Description Paper presentation given at the IPAW 2018 conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk given at the ProvenanceWeek event, a gathering of academics and practitioners around the topic of data provenance.
Year(s) Of Engagement Activity 2018
URL https://www.slideshare.net/pmissier/provenance-annotation-and-analysis-to-support-process-recomputat...
 
Description engagement with Microsoft Research Open Data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Research into data platforms and analytics at Microsoft Research focuses on all aspects of large-scale cloud and edge data platforms and services, and novel ways to accelerate discovering insight from data. Microsoft is interested in a variety of topics including resource management, storage, caching, query processing, query optimization, security, and privacy, self-service data cleaning and transformation at scale, search and discovery of structured data, information extraction, time-series data analytics, and metadata management.
As Microsoft is a major global technology player with an interest in data platforms and analytics in multiple end-user domains, including the environment and health, we were interested to explore whether there was potential for Microsoft to apply or further extend the outputs of ReComp.
We reached out to Dr Kenji Takeda, Director of Academic Partnerships at Microsoft Cambridge. He introduced us to his colleague, Dr Vani Mandava, Director of Data Science Outreach, based in Redmond.

Vani Mandava is a software engineer and data architect by background, and spent a considerable part of her career at Microsoft developing products such as Office and Bing. Her current role is focused on reproducibility and openness, and she has led the development of Microsoft Research Open Data, a service which gives access to several curated open datasets which have been used in Microsoft research in computer science, social sciences, health, biology, earth sciences, information science and other areas.

Vani Mandava was interested in the W3C PROV-based database architecture which underpins the ReComp architecture.

The ReComp team has been considering how to apply ReComp to AI models that need maintenance, and how to periodically re-tune or retrain them without needing to run the entire model. One solution could be to learn impact functions using AI from heuristics of the data inputs and outputs. This is an interesting potential direction for future research.
Year(s) Of Engagement Activity 2021
 
Description engagement with Ordnance Survey (OS) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Ordnance Survey creates, maintains and distributes detailed location information for Great Britain. Ordnance Survey records 500 million geospatial features in the Ordnance Survey (OS) master map and keeps them up to date. The company provides geospatial data and mapping products in three main sectors: leisure/consumer mapping, business, and government and the public sector.
Ordnance Survey has an active programme of research and development, with the goal of ensuring that the UK remains at the forefront of geospatial data innovation. The main research areas are: mobile mapping and feature extraction, for example using street-level vehicles to record and extract street-level features such as the locations of drains and pipes, which can be integrated with IoT devices to support connected societies; data fusion, which investigates ways to fuse together data from multiple sources to provide an integrated view of the world; crowdsourcing new types of information from people; and secure and reliable information, to ensure that privacy and security are protected in the use of geospatial data. Ordnance Survey increasingly employs AI-based models in its R&D efforts. These models use substantial computational resources, taking several days or weeks to compute, so there is interest in being able to compute them more efficiently. There was interest in exploring the application of ReComp to deep learning models with OS.
Year(s) Of Engagement Activity 2020
 
Description engagement with the Office for National Statistics, Data Science Campus
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact The ONS Data Science Campus is a cross-government data science centre of excellence. It delivers projects for the Office for National Statistics (ONS), government and wider UK and international organisations. The processing time of the computations involved in these projects is often very long, in some cases many weeks. Therefore there was potential for ReComp to be applied and for new use cases to be developed.
We reached out to:
Dr Tom Smith, Director of the ONS Data Science Campus
Dr Louisa Nolan, Chief Data Scientist and Deputy Director of the ONS Data Science Campus
Dr Li Chen, Senior Data Scientist, ONS Data Science Campus

Other contacts with whom we connected by e-mail are:
Hillary Juma, Data Science Community and Engagement Manager; Paul Littler and Isabela Breton, Academic

Potential application areas could include land-use change from census data; use of satellite images to detect surface changes; population estimation models; and a model which uses Boots pharmacy data and retail data to understand Covid transmission (in addition to testing data). These data are dynamic, sparse and incomplete; the retraining of these models takes 1-2 weeks, and the model outputs have changed substantially with the new incoming data. Both the data and the datasets used to train these models change.

A one-hour talk was also given to a mixed audience of applied researchers at the ONS Data Science Campus.
Year(s) Of Engagement Activity 2020,2021
 
Description exploration talk with IBM / Cloudpak for Data / OpenScale group
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Dissemination to IBM researchers and managers (listed below) to identify common areas of interest where ReComp can be applied or further developed:

Dr Petrena Prince, Director of Academic Programmes, EMEA
Dr Peter Waggett, Director of Research, IBM UK
Dr Phil Tetlow, IBM Academy of Technology VP (Emerging Technology), CTO Data Ecosystems
Richard Snell, Senior Client Manager, covering local government and education in the UK.

Professor Missier met separately with Ed Pyzer Knapp, Worldwide Lead for AI Enriched Modelling and Simulation at IBM.
Year(s) Of Engagement Activity 2020
 
Description invited talk given at the Dept. of Computer Science, University of Cardiff, 2019 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact An invited research talk attended by faculty and students.
Year(s) Of Engagement Activity 2019
 
Description invited talk given at the Massive Analytics Quality Control conference, 2019 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This was an engagement opportunity in the form of a talk to a mixed audience of healthcare professionals as well as academics.
Event and theme: Reproducibility of Artificial Intelligence in Medicine, 3rd Annual MAQC Society Conference,
organised by the MAQC International Society: https://www.pmgenomics.ca/maqcsociety/, a major "big data" player in the healthcare space: "The objective of the MAQC Society is to communicate, promote, and advance reproducible science principles and quality control for analysis of the massive data generated from the existing and emerging technologies in solving biological, health, and medical problems"
Year(s) Of Engagement Activity 2019
URL https://maqc2019.fbk.eu/program.html
 
Description talk on joint initiatives with DAFNI (Data & Analytics Facility for National Infrastructure) to advance UK infrastructure research
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact We explored whether ReComp could be applied to the DAFNI architecture and be useful in solving real-world large-scale data problems relating to physical infrastructure in cities.
Specifically, OpenCLIM is a research project led by Professor Robert Nicholls at the University of East Anglia, with Brian Matthews and Luke Smith as co-investigators. OpenCLIM will develop and apply a first UK integrated assessment for climate impacts and adaptation. The model will consider UK-wide climate impacts and adaptation in biodiversity, agriculture, infrastructure and urban areas, considering the impacts of flooding, heat stress and changing temperature and precipitation. It will consider two detailed case studies: an urban analysis of Glasgow and the Clyde, and a more rural analysis of the Norfolk Broads. The case studies will demonstrate application of the model to inform the national analysis.
Year(s) Of Engagement Activity 2020