Data-Driven Algorithms for Data Acquisition

Lead Research Organisation: University of Oxford

Department Name: Statistics

Abstract

Advances in machine learning have transformed our ability to utilize data. But far less progress has been made on intelligently acquiring such data in the first place. Consequently, though data-driven approaches are now ubiquitous across science and industry, hand-crafted and heuristic approaches are typically still the norm for data acquisition itself.

My goal is to address this shortfall by developing principled quantitative methods for data acquisition. In particular, I will construct adaptive algorithms that leverage information from previous data to guide future data acquisition. The basis for doing this will be the framework of Bayesian adaptive design (BAD), which formalizes the utility of data through the information it provides, then exploits this to optimize the controllable aspects of the acquisition process.
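As a minimal illustration of the adaptive-design loop described above, the following sketch scores each candidate design by its expected information gain (EIG), runs the best one, and updates the belief before choosing again. Everything specific here is an assumption made for illustration and is not taken from the project: the logistic threshold model, the parameter grid, the function names, and the noise-free rule used to simulate outcomes.

```python
import math


def bernoulli_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) outcome."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def likelihood(theta, design, slope=2.0):
    """Hypothetical model: P(y = 1 | theta, design) follows a logistic curve."""
    return 1.0 / (1.0 + math.exp(-slope * (design - theta)))


def expected_information_gain(design, thetas, belief):
    """EIG(design) = H[p(y | design)] - E_theta[H[p(y | theta, design)]]."""
    probs = [likelihood(t, design) for t in thetas]
    marginal = sum(b * p for b, p in zip(belief, probs))
    conditional = sum(b * bernoulli_entropy(p) for b, p in zip(belief, probs))
    return bernoulli_entropy(marginal) - conditional


def update(belief, thetas, design, outcome):
    """Bayes' rule on the grid after observing a binary outcome (0 or 1)."""
    weights = []
    for b, t in zip(belief, thetas):
        p = likelihood(t, design)
        weights.append(b * (p if outcome == 1 else 1.0 - p))
    total = sum(weights)
    return [w / total for w in weights]


thetas = [i / 2.0 for i in range(-6, 7)]    # parameter grid from -3.0 to 3.0
designs = [i / 2.0 for i in range(-6, 7)]   # candidate designs on the same grid
belief = [1.0 / len(thetas)] * len(thetas)  # uniform prior over the grid
true_theta = 1.5                            # used only to simulate outcomes

history = []
for step in range(5):
    # Choose the design that is expected to be most informative right now.
    best = max(designs, key=lambda d: expected_information_gain(d, thetas, belief))
    # Noise-free stand-in for running the experiment (an assumption).
    outcome = 1 if likelihood(true_theta, best) >= 0.5 else 0
    belief = update(belief, thetas, best, outcome)
    history.append((best, outcome))

posterior_mean = sum(b * t for b, t in zip(belief, thetas))
```

Recomputing the EIG under the updated belief at every step is the kind of repeated computation that becomes expensive for realistic models, where neither the posterior nor the EIG is available in closed form.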

Despite its principled foundations, BAD has not yet seen substantial uptake because of key challenges in its deployment. Most notably, it has crippling computational bottlenecks that undermine its usage. By overcoming these with a new policy-based approach, I hope to turn BAD's potential into a reality, providing a powerful basis for intelligent data acquisition in domains ranging from interactive surveys and virtual assistants to laboratory experiments and psychology trials.

One area of particular focus will be active learning, wherein one iteratively selects points to label from an unlabelled pool. Here BAD has already had some success, but I believe it is currently fundamentally misapplied. I hope to substantially improve the state of the art in the area through various innovations, such as targeting information gain in predictions rather than parameters, properly utilizing unlabelled data, and developing policy-based approaches. I further propose to revisit the foundations of the Bayesian neural network models often used in such settings, questioning their fundamental assumptions and developing radically new approaches.
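To make the distinction between information gain in parameters and in predictions concrete, the sketch below contrasts two acquisition scores computed from sampled predictive distributions. The first measures the mutual information between a candidate label and the model parameters; the second measures the mutual information between a candidate label and the label at a target input, averaged over target inputs. The function names, data layout, and toy numbers are assumptions made for illustration, and this is not the interface of any particular package.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def parameter_information_gain(pred_samples):
    """I(y; theta | x), estimated from K sampled predictive distributions."""
    k = len(pred_samples)
    n_classes = len(pred_samples[0])
    mean = [sum(s[c] for s in pred_samples) / k for c in range(n_classes)]
    return entropy(mean) - sum(entropy(s) for s in pred_samples) / k


def prediction_information_gain(pred_samples, target_samples_list):
    """Average over target inputs of I(y; y* | x, x*), using the same K samples."""
    k = len(pred_samples)
    total = 0.0
    for target_samples in target_samples_list:
        n_y, n_t = len(pred_samples[0]), len(target_samples[0])
        # Joint predictive: p(y, y*) = (1/K) * sum_k p(y | theta_k) p(y* | theta_k)
        joint = [
            [sum(pred_samples[i][a] * target_samples[i][b] for i in range(k)) / k
             for b in range(n_t)]
            for a in range(n_y)
        ]
        marg_y = [sum(row) for row in joint]
        marg_t = [sum(joint[a][b] for a in range(n_y)) for b in range(n_t)]
        for a in range(n_y):
            for b in range(n_t):
                if joint[a][b] > 0.0:
                    total += joint[a][b] * math.log(joint[a][b] / (marg_y[a] * marg_t[b]))
    return total / len(target_samples_list)


# Toy case: two parameter samples disagree about the candidate's label.
candidate = [[0.9, 0.1], [0.1, 0.9]]
# Target input whose prediction does not depend on which sample is correct.
unaffected_target = [[0.7, 0.3], [0.7, 0.3]]
# Target input whose prediction changes with the parameter sample.
affected_target = [[0.8, 0.2], [0.2, 0.8]]

param_score = parameter_information_gain(candidate)
score_unaffected = prediction_information_gain(candidate, [unaffected_target])
score_affected = prediction_information_gain(candidate, [affected_target])
```

In this toy case the parameter-focused score is the same regardless of the target, whereas the prediction-focused score is essentially zero when labelling the candidate would not change any prediction of interest.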

Funded Value:

£1,229,206

Funded Period:

Mar 24 - Feb 29

Funder:

Horizon Europe Guarantee

Project Status:

Active

Project Category:

Research Grant

Project Reference:

EP/Y037200/1

Principal Investigator:

Tom Rainforth

Research Subject:

Info. & commun. Technol. (45%)

Mathematical sciences (54%)

Research Topic:

Artificial Intelligence (45%)

Statistics & Appl. Probability (54%)

Organisations

People	ORCID iD
Tom Rainforth (Principal Investigator)	http://orcid.org/0000-0001-7939-4230

Publications

Campbell A. (2024) Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design in Proceedings of Machine Learning Research

Dhillon G.S. (2024) On the Expected Size of Conformal Prediction Sets in Proceedings of Machine Learning Research

Kossen J. (2024) In-Context Learning Learns Label Relationships but Is Not Conventional Learning in 12th International Conference on Learning Representations, ICLR 2024

Miao N. (2024) SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning in 12th International Conference on Learning Representations, ICLR 2024

Reichelt T. (2024) Beyond Bayesian Model Averaging over Paths in Probabilistic Programs with Stochastic Support in Proceedings of Machine Learning Research

Smith F.B. (2024) Making Better Use of Unlabelled Data in Bayesian Active Learning in Proceedings of Machine Learning Research

Collaboration

Description	Microsoft D.Phil Co-supervision
Organisation	Microsoft Research
Country	Global
Sector	Private
PI Contribution	Co-supervision of the D.Phil student Freddie Bickford Smith.
Collaborator Contribution	Co-supervision of the D.Phil student Freddie Bickford Smith by Adam Foster.
Impact	Publication "Making better use of unlabelled data in Bayesian active learning" at AISTATS 2024.
Start Year	2024

Description	Sanger Institute Co-Supervision
Organisation	The Wellcome Trust Sanger Institute
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	Co-supervision of the D.Phil student Benjamin Chang.
Collaborator Contribution	Co-supervision of the D.Phil student Benjamin Chang by Mo Lotfollahi.
Impact	Co-supervision of D.Phil student, no publications yet.
Start Year	2024

Software and Technical Products

Title	Bayesian active learning with EPIG acquisition
Description	Code package for performing target-orientated Bayesian active learning using the EPIG acquisition strategy.
Type Of Technology	New/Improved Technique/Technology
Year Produced	2024
Open Source License?	Yes
Impact	Code has been used by other research teams in the production of research papers external to our group.
URL	https://github.com/fbickfordsmith/epig

Engagement Activities

Description	Meeting Minds Public Lecture
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Public/other audiences
Results and Impact	Presented a public lecture on "Intelligent Data Acquisition" as part of the University's "Meeting Minds" lecture series. This sparked discussions with various audience members and some follow-ups from attendees who felt the talk had helped them in their line of work.
Year(s) Of Engagement Activity	2024
