New challenges in robust statistical learning

Lead Research Organisation: University of Edinburgh

Department Name: School of Mathematics

Abstract

In recent years, our ability to collect, store and process vast amounts of data, coupled with rapid advances in technology, has led to the widespread adoption of data-driven decision-making. This includes new application areas, such as precision medicine, where doctors are using data to inform their diagnoses and treatment recommendations. In other areas, such as finance, banks use huge amounts of historical data to decide whether a new customer is likely to default on their loan repayments. We are often required to make a discrete prediction about some future patient or customer, based on (training) data relating to existing patients or customers. In statistics, problems of this type are called classification problems.

Many methods for classification are built on the assumption that any future data we may encounter has the same distribution as our training data. Of course, this assumption is not always valid -- data relating to one set of patients or customers will not necessarily follow the same distribution as data from a new set of people. In this research, we will develop new robust classification algorithms that can deal with noisy and incomplete data. In particular, we will develop methodology that enables practitioners to combine multiple sources of noisy data, propose modifications to existing methods that guarantee they are robust to corruptions in the data, and introduce novel ways of overcoming the issues caused by missing data. We will also provide new theoretical understanding of the limitations of decision-making algorithms when faced with noisy, corrupted and incomplete data.

There are a number of scenarios where our new approaches will be applicable:

- We may have data collected from patients in a particular location (lab or hospital) but wish to make predictions in a different location.

- We may not have access to the full dataset. For example, for privacy reasons, users may not disclose some of their personal information. In other settings, we may be required to anonymise the data by removing some identifying covariates.

- Often the complexity of the data involved will mean that we don't observe the true data. Instead, we only have access to an approximation of the data. This typically occurs in modern settings, where practitioners use crowd-sourcing services such as Amazon Mechanical Turk to label their data -- such services are rarely perfectly accurate.

- It may be that an adversary is able to arbitrarily contaminate a small proportion of the data (for instance by performing artificial activity online).

Our work will enable practitioners to utilise data that is currently not appropriate for use. We will also provide new insight into the kinds of data that are most useful for a particular purpose.

Planned Impact

Classification is a canonical problem in statistical learning. The technology industry relies on data-driven decision-making. For instance, hundreds of hours of video content are uploaded to YouTube every minute, and it is infeasible for human workers to screen that volume of content. More widely, data-based classifiers are now utilised in application areas such as precision medicine, computer vision, machine translation, speech recognition, credit scoring, fraud detection, spam identification and many others.

This proposal outlines a number of projects that will result in significant contributions to the fundamental understanding of robustness in statistical learning. This includes identifying the types of data that are most useful in a classification problem, developing new robust methods for noisy and incomplete data, and providing theoretical guarantees. I believe there will be impact in a wide range of industry sectors, including new insights for the two industrial partners:

- BIOS (www.BIOS.health) is a leading neural engineering start-up, developing open standard hardware and software to connect the human nervous system and AI. For example, BIOS are building prosthetic limbs that are controlled directly by the user's brain. Their work involves using vast amounts of data and advanced machine learning and statistical methods to interpret the complex neural signals produced by the human body. BIOS recently opened a research and development office at Mila, the AI Hub in Montreal, Canada.

- Cambridge Cancer Genomics (CCG -- www.ccg.ai) are building precision oncology solutions for all patients. Their technology can detect relapse earlier than standard of care, predict response to therapy more accurately and reduce ineffective treatment regimens. They give your oncologist the head start they need to stay ahead of an evolving tumour. CCG are using vast amounts of data to build an AI platform that will enable oncologists to move beyond the "one-size-fits-all" approach to cancer therapy. CCG also have offices in San Francisco, CA.

Companies like CCG and BIOS use data to inform their technology. One of their primary problems is collecting enough of the right kinds of data: data is often found to be useless because portions are missing or because it is too noisy. Much of the work of these companies involves combining data from different experiments, labs and sources. The methods developed during the projects in this proposal will show companies like CCG and BIOS how to combine their data in a systematic way, so that they use it efficiently and extract maximal value. The theoretical results in this project will provide insight into where and how noisy data, which was perhaps previously discarded, can still be used.

Society: The societal impact of this project will be realised by working closely with the two industrial partners. CCG are building the tools needed to empower oncologists to make the best therapeutic decisions for their patients. There is, therefore, potential for direct impact as the work in this proposal helps to improve CCG's products. BIOS are "creating the open standard hardware and software interface between the human nervous system and AI". Their core team combines the expertise of machine learners and medical clinicians. The work outlined in this proposal will inform BIOS on the types of algorithms and techniques that are best suited to their products.

Both CCG and BIOS are growing. A partnership linked to this proposal will showcase the potential that Edinburgh and Scotland can offer, both in terms of the dynamic and interdisciplinary research environment and the quality of the University of Edinburgh graduate students.

Funded Value:

£266,365

Funded Period:

May 21 - Apr 24

Funder:

EPSRC

Project Status:

Active

Project Category:

Research Grant

Project Reference:

EP/V002694/1

Principal Investigator:

Timothy Cannings

Research Subject:

Mathematical sciences (100%)

Research Topic:

Statistics & Appl. Probability (100%)

Organisations

People	ORCID iD
Timothy Cannings (Principal Investigator)	http://orcid.org/0000-0002-2111-4168

Publications

Cannings T.I. (2022) The correlation-assisted missing data estimator in Journal of Machine Learning Research

Reeve H (2021) Adaptive transfer learning in The Annals of Statistics

Reeve H (2021) Optimal subgroup selection

Reeve H (2023) Optimal subgroup selection in The Annals of Statistics

Sell T (2023) Trace-class Gaussian priors for Bayesian learning of neural networks with MCMC in Journal of the Royal Statistical Society Series B: Statistical Methodology

Sell T (2023) Nonparametric classification with missing data
