Data-Driven Algorithms for Data Acquisition
Lead Research Organisation:
University of Oxford
Department Name: Statistics
Abstract
Advances in machine learning have transformed our ability to utilize data. But far less progress has been made on intelligently acquiring such data in the first place. Consequently, though data-driven approaches are now ubiquitous across science and industry, hand-crafted and heuristic approaches are typically still the norm for data acquisition itself.
My goal is to address this shortfall by developing principled quantitative methods for data acquisition. In particular, I will construct adaptive algorithms that leverage information from previous data to guide future data acquisition. The basis for doing this will be the framework of Bayesian adaptive design (BAD), which formalizes the utility of data through the information it provides, then exploits this to optimize the controllable aspects of the acquisition process.
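As a hedged illustration of what "utility through information" usually means in the Bayesian experimental design literature, the expected information gain (EIG) of a design can be written as below. The symbols are assumptions introduced here, not notation from the abstract: $\xi$ is the controllable design, $y$ the outcome it yields, and $\theta$ the unknown model parameters.

```latex
\mathrm{EIG}(\xi)
  = \mathbb{E}_{p(\theta)\,p(y \mid \theta, \xi)}
    \left[ \log p(y \mid \theta, \xi) - \log p(y \mid \xi) \right],
\qquad
p(y \mid \xi) = \mathbb{E}_{p(\theta)}\left[ p(y \mid \theta, \xi) \right].
```

This is the mutual information between $\theta$ and $y$ under design $\xi$. The marginal $p(y \mid \xi)$ sits inside the outer expectation and is generally intractable, so a direct estimate needs a nested Monte Carlo computation for every candidate design.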
Despite its principled foundations, BAD has not yet seen substantial uptake due to some key challenges in its deployment. Most notably, it has crippling computational bottlenecks that undermine its usage. By overcoming these with a new policy-based approach, I hope to turn BAD's potential into a reality, providing a powerful basis for intelligent data acquisition in domains ranging from interactive surveys and virtual assistants to laboratory experiments and psychology trials.
One area of particular focus will be active learning, wherein one iteratively selects points to label from an unlabelled pool. Here BAD has already had some success, but I believe it is currently fundamentally misapplied. I hope to substantially improve the state of the art in this area through various innovations, such as targeting information gain in predictions rather than parameters, properly utilizing unlabelled data, and developing policy-based approaches. I further propose to revisit the foundations of the Bayesian neural network models often used in such settings, questioning their fundamental assumptions and developing radically new approaches.
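The contrast between information gain in parameters and in predictions can be sketched for a classifier as follows. This is a minimal illustration under stated assumptions, not the project's code or the API of any released package: the function names `bald_scores` and `epig_scores` are hypothetical, and the inputs are assumed to be class probabilities from K posterior parameter samples, evaluated on N pool points and on M inputs drawn from the target distribution.

```python
import numpy as np


def entropy(p, axis=-1):
    """Shannon entropy in nats; probabilities are clipped to avoid log(0)."""
    return -np.sum(p * np.log(np.clip(p, 1e-12, None)), axis=axis)


def bald_scores(probs_pool):
    """Parameter-focused score: mutual information between label and parameters.

    probs_pool: array [K, N, C] of class probabilities, one slice per
    posterior parameter sample (an assumed input format).
    """
    marginal = probs_pool.mean(axis=0)                            # [N, C]
    return entropy(marginal) - entropy(probs_pool).mean(axis=0)   # [N]


def epig_scores(probs_pool, probs_targ, eps=1e-12):
    """Prediction-focused score: mutual information between a pool label and
    a target label, averaged over M sampled target inputs.

    probs_pool: [K, N, C]; probs_targ: [K, M, C], same K parameter samples.
    """
    n_samples = probs_pool.shape[0]
    # Joint predictive over (pool label, target label), averaged over samples.
    joint = np.einsum("knc,kmd->nmcd", probs_pool, probs_targ) / n_samples
    # Product of the two marginal predictives.
    indep = (probs_pool.mean(axis=0)[:, None, :, None]
             * probs_targ.mean(axis=0)[None, :, None, :])
    # KL divergence between joint and product of marginals, per (pool, target).
    mi = np.sum(joint * (np.log(joint + eps) - np.log(indep + eps)), axis=(2, 3))
    return mi.mean(axis=1)                                        # [N]


# Toy case: two parameter samples disagree about one pool point, but both
# make the same prediction at the single target input.
demo_pool = np.array([[[0.9, 0.1]], [[0.1, 0.9]]])   # [K=2, N=1, C=2]
demo_targ = np.array([[[0.5, 0.5]], [[0.5, 0.5]]])   # [K=2, M=1, C=2]
demo_bald = bald_scores(demo_pool)
demo_epig = epig_scores(demo_pool, demo_targ)
```

In the toy case the parameter-focused score is clearly positive because the samples disagree about the pool label, while the prediction-focused score is essentially zero because resolving that disagreement would not change the prediction at the target input. In a pool-based loop one would label the point with the highest score, for example `np.argmax(epig_scores(probs_pool, probs_targ))`.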
Publications
Campbell A.
(2024)
Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design
in Proceedings of Machine Learning Research
Dhillon G.S.
(2024)
On the Expected Size of Conformal Prediction Sets
in Proceedings of Machine Learning Research
Kossen J.
(2024)
In-Context Learning Learns Label Relationships but Is Not Conventional Learning
in 12th International Conference on Learning Representations, ICLR 2024
Miao N.
(2024)
SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning
in 12th International Conference on Learning Representations, ICLR 2024
Reichelt T.
(2024)
Beyond Bayesian Model Averaging over Paths in Probabilistic Programs with Stochastic Support
in Proceedings of Machine Learning Research
Smith F.B.
(2024)
Making Better Use of Unlabelled Data in Bayesian Active Learning
in Proceedings of Machine Learning Research
| Description | Microsoft D.Phil Co-supervision |
| Organisation | Microsoft Research |
| Country | Global |
| Sector | Private |
| PI Contribution | Co-supervision of the D.Phil student Freddie Bickford Smith. |
| Collaborator Contribution | Co-supervision of the D.Phil student Freddie Bickford Smith by Adam Foster. |
| Impact | Publication "Making better use of unlabelled data in Bayesian active learning" at AISTATS 2024. |
| Start Year | 2024 |
| Description | Sanger Institute Co-Supervision |
| Organisation | The Wellcome Trust Sanger Institute |
| Country | United Kingdom |
| Sector | Charity/Non Profit |
| PI Contribution | Co-supervision of the D.Phil student Benjamin Chang. |
| Collaborator Contribution | Co-supervision of the D.Phil student Benjamin Chang by Mo Lotfollahi. |
| Impact | Co-supervision of D.Phil student, no publications yet. |
| Start Year | 2024 |
| Title | Bayesian active learning with EPIG acquisition |
| Description | Code package for performing target-orientated Bayesian active learning using the EPIG acquisition strategy. |
| Type Of Technology | New/Improved Technique/Technology |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | Code has been used by other research teams in the production of research papers external to our group. |
| URL | https://github.com/fbickfordsmith/epig |
| Description | Meeting Minds Public Lecture |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | Local |
| Primary Audience | Public/other audiences |
| Results and Impact | Presented a public lecture on "Intelligent Data Acquisition" as part of the University's "Meeting Minds" lecture series. This sparked discussions with various audience members and some follow-ups from attendees who felt the talk had helped them in their line of work. |
| Year(s) Of Engagement Activity | 2024 |
