Breaking the curse of dimensionality in low-data tasks

Lead Research Organisation: University of Cambridge

Department Name: Computer Science and Technology

Abstract

This research seeks to create data-efficient machine learning algorithms that can learn in complex domains with little data and high dimensionality. Specifically, the research will focus on tasks with hundreds of samples and 1,000 to 20,000 features, such as diagnosing patients from clinical trials where sequencing data is available. Learning from little data usually requires machine learning models that incorporate adequate invariances and inductive biases. The Bayesian machine learning framework allows domain knowledge to be incorporated by specifying prior distributions and kernel functions. However, specifying kernels for complex domains remains challenging, and it is common practice to use uninformative priors and rely almost entirely on learning from the data.
This research aims to circumvent the apparent limitations of learning from little data by designing methods that learn priors capturing the rich interactions between features, and that incorporate human knowledge. It will investigate ways to learn kernel functions for Bayesian models. Using data-driven kernels could enable integrating learned information about complex feature interactions (e.g., gene interactions in particular diseases) and facilitate reliable prediction from small datasets. The first research direction is learning rich kernel functions for semi-parametric models via the framework of deep kernel learning in the context of Gaussian Processes. A second research direction will investigate transfer learning for the 'kernel function' stored in the parameters of the recently introduced neural processes.
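As an illustration of the first research direction, the sketch below shows the general idea behind deep kernel learning: a standard RBF kernel is applied to the output of a neural network feature map, and the resulting kernel is used for Gaussian Process regression. This is a minimal sketch under stated assumptions, not the project's method. The function names, layer sizes, and noise level are hypothetical, the data is synthetic, and the network weights are random rather than learned (in practice they would typically be fitted jointly with the kernel hyperparameters, for example by maximising the GP marginal likelihood).

```python
import numpy as np

def feature_map(X, W1, b1, W2, b2):
    """Hypothetical two-layer network g(x) mapping inputs to a small embedding."""
    hidden = np.tanh(X @ W1 + b1)
    return hidden @ W2 + b2

def rbf_kernel(A, B, lengthscale=1.0):
    """Standard RBF kernel between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def deep_kernel(XA, XB, params, lengthscale=1.0):
    """Deep kernel: k(x, x') = k_rbf(g(x), g(x'))."""
    return rbf_kernel(feature_map(XA, *params), feature_map(XB, *params), lengthscale)

def gp_posterior_mean(X_train, y_train, X_test, params, noise=1e-2):
    """GP regression posterior mean under the deep kernel (zero prior mean)."""
    K = deep_kernel(X_train, X_train, params) + noise * np.eye(len(X_train))
    K_star = deep_kernel(X_test, X_train, params)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    return K_star @ alpha

# Synthetic data in the regime described above: few samples, many features.
rng = np.random.default_rng(0)
n_samples, n_features, n_hidden, n_embed = 100, 2000, 32, 8

# Random (untrained) weights, used here only to make the sketch runnable.
params = (
    rng.normal(scale=n_features**-0.5, size=(n_features, n_hidden)),
    np.zeros(n_hidden),
    rng.normal(scale=n_hidden**-0.5, size=(n_hidden, n_embed)),
    np.zeros(n_embed),
)

X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

K = deep_kernel(X, X, params)
mean = gp_posterior_mean(X[:80], y[:80], X[80:], params)
```

Because the kernel is computed on the low-dimensional embedding rather than on the raw features, the feature map is where learned or transferred knowledge about feature interactions could be stored.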
Potential impact: This research has the potential to enable reliable estimation from small, high-dimensional datasets in domains such as medicine, drug discovery and beyond. Ultimately, the proposed advancements could capture rich interactions between variables and transfer this knowledge to similar scenarios in which relying on data alone is insufficient. I hope the proposed approach will become standard practice for enabling fast learning, similar to the transfer learning framework in domains such as computer vision or natural language processing.

Student:

Andrei Margeloiu

Period of Study:

Oct 2020 - Mar 2024

Funder:

ESRC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

2615996

Research Topic:

Unclassified

Organisations

University of Cambridge (Lead Research Organisation)

People	ORCID iD
Mateja Jamnik (Primary Supervisor)
Andrei Margeloiu (Student)

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
ES/P000738/1			01/10/2017	30/09/2027
2615996	Studentship	ES/P000738/1	01/10/2020	20/03/2024	Andrei Margeloiu
