Inference: Capturing Provenance Information with Minimal Intrusion

Lead Research Organisation: University of Southampton

Department Name: Sch of Electronics and Computer Sci

Abstract

Commercial and government decisions are driven by data. Provenance is the record of how data and processes were created, modified and used. It is used to support quality assessments for data, provide traceability, identify possible system intrusions, etc. Unfortunately, all of the uses of provenance require that provenance information be captured by each system within a system of systems. This "capture problem" is costly and does not scale. To date, only applications that have a high value to scientists have been provenance capture-enabled [9, 17]. Instead, we seek to build observation points external to any pre-built system that will create partial, or inferred, provenance that can be reused across any system that uses the same architectural components.

In order to facilitate the adoption of provenance within enterprise systems built from a heterogeneous software stack that is unique to each organization, the Infer-Proven-ence project is researching the underlying feasibility and creating a toolbox of techniques that will reduce the number of applications that must be provenance-enabled. Unlike a provenance-enabled application that can report observed provenance, inferred provenance has a probability of being what actually happened. Depending on the overall architecture, different provenance inference techniques need to be available. An inference technique that works within a database and its limited set of transformations will not work over streaming data. This work is establishing the theoretical underpinnings for two different provenance-capture inference mechanisms that work within common architectures. It will create implementations of each technique that can be evaluated within real-world scenarios.

The Infer-Proven-ence approach shall be evaluated across two distinct architectures: one for stream processing of sensor data, and one for data analytics. While architectures exist that combine all of these components, we intentionally split them into the smallest representative unit with respect to data flow and application type. Evaluation will consider: ability to correctly infer provenance; accuracy of inferred provenance; cost of implementation within the given architecture; scalability of the approach; and the utility of the inferred provenance for a use case specific to each problem domain. For the first technique, we will work with partners at Roke Manor Research and their autonomous vehicle program, in which data from disparate sensors is streamed through a set of micro-processors and driving decisions are made. Provenance within this use case will be used to highlight anomalies and likely sources of decision errors. For the second technique, we will work within a data analytic architecture in which source data is transformed and manipulated during the process of analysis. Provenance within this use case will be used to reproduce the analytic results.

In addition to the real-world evaluation, we shall work closely with the UK's Software Sustainability Institute (SSI), which promotes sustainable software technologies in order to build software that can be transitioned and reused by others. SSI shall assist in ensuring that Infer-Proven-ence is generalizable and relevant to any discipline based only on the architecture required by that discipline. Finally, Infer-Proven-ence will produce a roadmap for further research, taking stock of the work done and identifying future opportunities.

Infer-Proven-ence also builds partnerships across several institutions including Southampton's Cyber Security Research Centre, the University of Massachusetts Amherst, the Software Sustainability Institute and Roke Manor Research in order to investigate provenance inference in real-world situations.

Planned Impact

National Importance
Infer-Proven-ence will facilitate adoption of provenance technologies in non-computational disciplines, such as those within government and commercial organizations that rely on heterogeneous software and lack the computational infrastructure the existing provenance solutions integrate with. By doing this, we will expand the ability to capture provenance for later usage, while minimizing the system impact. This pervasive provenance capture will facilitate: reduced audit costs, reproducible research, highlighting best practices, flagging data anomalies and supporting understanding and trust of data.
These issues are of high importance to academics, industry and the public. In 2009, leading provenance researchers wrote a prospective article extolling the uses of, and future research requirements for, provenance once it is widely adopted "10 years in the future", with highlights from the financial sector through scientific research [7]. Today, almost 10 years later, very few of those visions have become a reality, mostly because of slow adoption of provenance by commercial and government organizations. While these organizations are interested in the use and benefits of provenance [6], the high cost of implementing provenance capture within their environments has been prohibitive. The use cases for provenance within [6] span themes such as assisting with: context and understanding; curation and reuse; identification of good practice; integrity; interoperability; linking entities; quality; reproducibility; uncertainty. However, provenance information must exist to work over in order to provide any benefit. While some high-value projects with a very specific focus and minimal technology heterogeneity have end-to-end provenance support [9, 17, 18], this is rare. By providing a mechanism that delivers high-quality provenance without high system creation and maintenance costs, we will enable commercial and government agencies to actually use provenance.
Moreover, new technologies, such as autonomous vehicles, require provenance both for debugging and to establish liability, and have new requirements for capture and storage. In order to facilitate adoption of these new technologies, the provenance component needs to be addressed. Outside of debug mode, there is limited storage in the "automotive black box" should a later investigation be necessary. As such, we cannot spend large quantities of processing power or large amounts of storage space capturing provenance. Instead, industry needs help identifying the "sweet spot" in which some provenance is captured and stored, and the rest is inferred to reproduce what happened.

Academic Impact
The UK has a strong presence in research on data provenance tools, with established teams at, among others, Southampton, KCL, Manchester, Edinburgh, and Newcastle, and a track record of EPSRC-funded projects. Infer-Proven-ence will lower the barrier to entry for systems that would like provenance information but find implementation of full provenance capture too costly. This will directly benefit other provenance researchers by providing more provenance information over which to develop their research. This project also has significant promise for the scientific communities. Many scientific communities, e.g. biologists, geologists, chemists, astrophysicists, have eagerly adopted the use of provenance. However, they are often constrained to particular toolsets if they desire provenance capture. This project will make provenance capture in the scientific domain a less costly and hopefully more commonplace occurrence. This will then feed back to the provenance community by providing more provenance across a wider range of systems for their research and experimentation.

Funded Value:

£222,787

Funded Period:

Jun 19 - Sep 21

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/S028366/1

Principal Investigator:

Adriane Chapman

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Information & Knowledge Mgmt (70%)

Software Engineering (30%)

Organisations

People	ORCID iD
Adriane Chapman (Principal Investigator)	http://orcid.org/0000-0002-3814-2587
Stephen Crouch (Researcher)

Publications

Almuntashiri A (2024) LLMs for the Post-Hoc Creation of Provenance

Blount T. (2021) Observed vs. possible provenance research track in TaPP 2021 - 13th International Workshop on Theory and Practice of Provenance

Chapman A (2021) Capturing and querying fine-grained provenance of preprocessing pipelines in data science in Proceedings of the VLDB Endowment

Chapman A (2022) DPDS assisting data science with data provenance in Proceedings of the VLDB Endowment

Chapman A (2021) Provenance in Data Science - From Data Models to Context-Aware Knowledge Graphs

Chapman A (2024) Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance in ACM Transactions on Database Systems

Chapman A. (2021) Fine-grained provenance for high-quality data science in CEUR Workshop Proceedings

Cheney J (2022) Advanced Mathematical and Computational Tools in Metrology and Testing XII

Deutch D (2022) Theory and Practice of Provenance

Holub P (2021) Towards a Common Standard for Data and Specimen Provenance in Life Sciences

Key Findings
Impact Summary
Policy Influence
Further Funding
Collaboration
Engagement Activities


Description	Going into the project, we knew that provenance was important to support many business decisions, but was resource intensive to collect. This project has created two different solutions to this problem that are applicable in two very distinct domains: data analytic processing environments and streaming environments. In the data analytic processing environment, the problem with collection arises because there are so many possible applications that allow people to modify and work over the data. In this case, we have created a new technique that uses abductive reasoning to create possible provenance based on a snapshot of the start data state and the end data state. We have researched the bounds within which this technique works and can be deployed successfully, as well as the characteristics of situations in which it should not be applied. In the streaming environment, the problem with collection arises because of volume. There are good processing points that allow easy collection in a "one stop shop" manner, but the size of the provenance makes it unwieldy for many scenarios (such as autonomous vehicles that must carry around this data). As such, this component focused on the storage reduction techniques available, from making on-the-fly decisions about what provenance to keep to improving compression. Additionally, across both threads we have been able to identify the following: 1. It is possible and computationally feasible to create possible provenance as opposed to observed provenance. 2. There are key moments when possible provenance is just as helpful as observed provenance. However, the converse is also true: we have identified key situations in which observed provenance must be used. 3. We have a functioning prototype for each thread of the project. One has been applied to 5 real-world scenarios from ETL pipelines, machine learning pipelines, games and classic data analysis. The other has been applied to real-world streaming scenarios from smart cities.
Exploitation Route	We have a public repository for all code, and all code samples (games, ETL, etc.) so that anyone who wishes to employ our methods can use the tooling provided. We are actively working on spinning out the technology in the data analytic scenario for easier industrial access.
Sectors	Digital/Communication/Information Technologies (including Software), Financial Services and Management Consultancy
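
As an illustration only, the idea of abducing possible provenance from a start snapshot and an end snapshot can be sketched as a search over candidate operations. This is a minimal toy sketch, not the project's prototype: the operation library (`drop_negative`, `double`, `sort_ascending`), the function `possible_provenance` and the sample data are all hypothetical names and values introduced here, and a real system would need a far richer operation set and a smarter search.

```python
from itertools import product

# Hypothetical library of candidate transformations (assumed for illustration;
# not taken from the project's code).
def drop_negative(rows):
    return [r for r in rows if r >= 0]

def double(rows):
    return [r * 2 for r in rows]

def sort_ascending(rows):
    return sorted(rows)

CANDIDATES = {
    "drop_negative": drop_negative,
    "double": double,
    "sort_ascending": sort_ascending,
}

def possible_provenance(start, end, max_steps=2):
    """Return every sequence of candidate operations (at most max_steps long)
    that transforms the start snapshot into the end snapshot."""
    explanations = []
    for n in range(1, max_steps + 1):
        for names in product(CANDIDATES, repeat=n):
            state = list(start)
            for name in names:
                state = CANDIDATES[name](state)
            if state == end:
                explanations.append(names)
    return explanations

# Hypothetical snapshots of the data before and after an unobserved analysis.
start_snapshot = [3, -1, 2]
end_snapshot = [6, 4]
explanations = possible_provenance(start_snapshot, end_snapshot)
for steps in explanations:
    print(" -> ".join(steps))
```

In this toy case more than one operation sequence explains the end state, which reflects why possible provenance is a set of plausible histories and not a single observed record.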


Description	Please see previous record. This is still an active and ongoing consideration.
First Year Of Impact	2022
Sector	Aerospace, Defence and Marine, Transport
Impact Types	Policy & public services


Description	NPL and scientific records
Geographic Reach	Europe
Policy Influence Type	Contribution to a national consultation/review


Description	PROVAnon: Anonymisation and provenance
Amount	£92,596 (GBP)
Funding ID	R-SOU-008
Organisation	Alan Turing Institute
Sector	Academic/University
Country	United Kingdom
Start	09/2019
End	03/2022


Description	Abduction and Possible Provenance
Organisation	University of Illinois at Urbana-Champaign
Country	United States
Sector	Academic/University
PI Contribution	My team has identified the core research problem, identified all real world examples, and formalised the research problem.
Collaborator Contribution	This partner has provided the underlying tool, and a developer to program in that tool, to evaluate the use of abduction to generate possible provenance.
Impact	There are several papers in preparation. There is a code repository that uses the underlying tool from the partner, and the real-world scenarios our team has identified.
Start Year	2019


Description	Data Sharing for Reverse Engineering Investigation
Organisation	University of Massachusetts Amherst
Country	United States
Sector	Academic/University
PI Contribution	We have used their data to independently verify a reverse engineering technique different to that used by the University of Massachusetts, Amherst.
Collaborator Contribution	Sharing of the data they used for initial reverse engineering evaluation. This also included time spent in helping us set up and effectively understand the data and supporting code.
Impact	Research Output in submission now.
Start Year	2020


Description	Fine-grained provenance for data science
Organisation	Newcastle University
Country	United Kingdom
PI Contribution	Formal analysis of the provenance queries, experimental design, marshalling project goals.
Collaborator Contribution	Formalized the system capture process, provided 3 datasets and data science pipelines.
Impact	2 papers that are associated with the project
Start Year	2019


Description	Fine-grained provenance for data science
Organisation	Roma Tre University
Country	Italy
Sector	Academic/University
PI Contribution	Provided the experimental design, formal modelling and cohesion of goals.
Collaborator Contribution	This partner brought 2 part-time workers to the project over a period of 1 year. These workers provided software engineering of the underlying system that was used for testing.
Impact	2 papers (attributed to this project) 1 open source code of the system for other academics to use.
Start Year	2019


Description	Alan Turing Workshop on Provenance, Machine Learning and Security
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Researchers, industrial representatives and graduate students outside the domain of provenance from around the world gathered to discuss the implications of security and machine learning, and the tool that provenance provides. Several groups broke out after the initial workshop to carry on ideas and seek additional research funding.
Year(s) Of Engagement Activity	2019
URL	https://www.turing.ac.uk/events/provenance-security-machine-learning


Description	InferProvenance Escape Room
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Schools
Results and Impact	To provide an outreach activity that could be used in any school in the nation, particularly during COVID lockdowns, and to introduce students to our University without Visit Days, we developed an Escape Room. By playing it, students learn about the concepts, problems and opportunities developed in our research project; the game also features highlights of our University.
Year(s) Of Engagement Activity	2021


Description	School Visit - King's High School
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	100 pupils attended a talk at the school as part of a Festival of Ideas: "Our Future". This all-girls school is passionate about educating and encouraging girls to pursue degrees and careers in STEM subjects. The visit was so well received that the students submitted follow-up questions and continued the discussion afterwards. The school reported increased interest in related subject areas. One student contacted me afterward: "Good evening, I hope you are well. I am ZP, a student from King's High in Warwick, the school you recently did a talk for. I found your talk fascinating and felt it had close links to what I have decided to base my EPQ (extended project qualification) on. I am writing to ask if you had the time to answer some questions I had and if you would be willing for me to write about any responses in my EPQ process."
Year(s) Of Engagement Activity	2021
