Breaking the curse of dimensionality in low-data tasks

Lead Research Organisation: University of Cambridge
Department Name: Computer Science and Technology

Abstract

This research seeks to create data-efficient machine learning algorithms that can learn in complex domains where data is scarce and dimensionality is high. Specifically, the research will focus on tasks with roughly hundreds of samples and 1,000 to 20,000 features, such as diagnosing patients from clinical trials where sequencing data is available. Learning from little data usually requires machine learning models that incorporate adequate invariances and inductive biases. The Bayesian machine learning framework allows domain knowledge to be encoded by specifying prior distributions and kernel functions. However, specifying kernels for complex domains remains challenging, and it is common practice to use uninformative priors and rely almost entirely on learning from the data.
This research aims to circumvent the apparent limitations of learning from little data by designing methods that learn priors capturing the rich interactions between features, and that incorporate human knowledge. It will investigate ways to learn kernel functions for Bayesian models. Using data-driven kernels could make it possible to integrate learned information about complex feature interactions (e.g., gene interactions in particular diseases) and to predict reliably from small datasets. The first research direction is learning rich kernel functions for semi-parametric models via the framework of deep kernel learning in the context of Gaussian Processes. The second research direction will investigate transfer learning for the 'kernel function' stored in the parameters of the recently introduced neural processes.
Potential impact: This research has the potential to enable reliable estimation from small, high-dimensional datasets in domains such as medicine, drug discovery and beyond. Ultimately, the proposed advancements could capture rich interactions between variables and transfer this knowledge to similar scenarios in which relying on data alone is insufficient. I hope the proposed approach will become standard practice for fast learning, much as transfer learning has in domains such as computer vision and natural language processing.
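
The first research direction, deep kernel learning with Gaussian Processes, can be illustrated with a minimal sketch. It is not the project's method: the network sizes, the random (untrained) weights `W1` and `W2`, the `noise` level and the synthetic data are all assumptions made for illustration. The idea shown is only that a kernel is applied to learned features g(x) rather than to the raw high-dimensional inputs.

```python
import numpy as np

def feature_map(X, W1, W2):
    """Hypothetical two-layer network g(x) mapping inputs to a small latent space."""
    return np.tanh(X @ W1) @ W2

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def deep_kernel(XA, XB, W1, W2, lengthscale=1.0):
    """Deep kernel: an RBF kernel evaluated on network features, k(g(x), g(x'))."""
    return rbf_kernel(feature_map(XA, W1, W2), feature_map(XB, W1, W2), lengthscale)

def gp_posterior(X_train, y_train, X_test, W1, W2, noise=0.1):
    """Posterior mean and variance of a zero-mean GP using the deep kernel."""
    K = deep_kernel(X_train, X_train, W1, W2) + noise**2 * np.eye(len(X_train))
    K_s = deep_kernel(X_test, X_train, W1, W2)
    K_ss = deep_kernel(X_test, X_test, W1, W2)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, var

# Synthetic low-data, high-dimensional setting (assumed sizes: 100 samples, 2000 features).
rng = np.random.default_rng(0)
n_train, n_test, n_features, hidden, latent = 100, 5, 2000, 32, 4
X_train = rng.standard_normal((n_train, n_features))
y_train = rng.standard_normal(n_train)
X_test = rng.standard_normal((n_test, n_features))

# Random weights stand in for parameters that would normally be learned.
W1 = rng.standard_normal((n_features, hidden)) / np.sqrt(n_features)
W2 = rng.standard_normal((hidden, latent)) / np.sqrt(hidden)

mean, var = gp_posterior(X_train, y_train, X_test, W1, W2)
print(mean.shape, var.shape)
```

In deep kernel learning the network weights and kernel hyperparameters are typically fitted jointly, for example by maximising the GP marginal likelihood; that training step is omitted here to keep the sketch short.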

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
ES/P000738/1                                   01/10/2017  30/09/2027
2615996            Studentship   ES/P000738/1  01/10/2020  20/03/2024  Andrei Margeloiu