Learning from biological context for protein fitness estimation and design

Lead Research Organisation: University of Oxford

Abstract

Proteins, sequences of amino acids, are the fundamental building blocks of life, serving as the workhorses of cellular processes. Their diverse functions, from catalysing chemical reactions to providing structural support, underpin the complex machinery of living organisms. Evolution has conducted a massive experiment over the space of all possible amino acid sequences: those that encode a functional protein survive; those that do not go extinct. By looking at a set of related proteins across the tree of life, we can begin to understand their evolutionary history, which is key to unlocking numerous advancements in medicine, biotechnology, and various fields of biology. This project falls within the EPSRC Artificial Intelligence Technologies research area and is co-advised by Professor Debbie Marks at Harvard University.

Computational biologists have used increasingly complex statistical models to analyse protein evolution. Extending these models to the whole proteome has recently been made possible by large protein language models, which aim to uncover the language of life. These models have been shown to recapitulate protein evolution, or phylogeny, even when the set of related homologs is small. The project aims to apply new methodologies from the fields of in-context learning and non-parametric modelling to the statistical modelling of proteins.

In-context learning allows models to learn from context, for example from a few examples added to the input. Non-parametric modelling allows the model to learn from explicit data points instead of having to memorise an entire dataset in its parametrised weights. By combining these methods, the aim is to better leverage and retrieve the context provided by protein evolution, improving performance at the same compute cost. These methods will make it possible to better study unalignable protein sequences, such as disordered regions or antibodies, as well as to model insertions and deletions.
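As a rough illustration of how retrieval and in-context learning could fit together, the toy Python sketch below retrieves the sequences most similar to a query from an explicit database and concatenates them with the query into a single context. It is not the project's method: the function names, the `|` separator, and the k-mer overlap score are all assumptions made for illustration, and a real system would likely use a learned retriever and a protein language model. The k-mer score is used here only because it needs no alignment, which matches the motivation of handling unalignable sequences.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a sequence (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k=3):
    """Alignment-free similarity: multiset Jaccard overlap of k-mers.

    A stand-in for a learned retrieval score, chosen only for simplicity.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    inter = sum((ca & cb).values())  # shared k-mers
    union = sum((ca | cb).values())  # all k-mers seen in either sequence
    return inter / union if union else 0.0


def retrieve_context(query, database, n=2, k=3):
    """Non-parametric step: pick the n database sequences closest to the query."""
    ranked = sorted(database, key=lambda s: kmer_similarity(query, s, k), reverse=True)
    return ranked[:n]


def build_prompt(query, context, sep="|"):
    """In-context step: place retrieved sequences before the query.

    The separator is an assumed convention, not a real model's token.
    """
    return sep.join(list(context) + [query])


# Toy usage with made-up sequences.
database = ["MKTAYIAKQL", "GGGGSGGGGS", "MKTAYLAKQR", "PPPPPPPPPP"]
query = "MKTAYIAKQR"
context = retrieve_context(query, database, n=2)
prompt = build_prompt(query, context)
```

In this sketch the database stays outside the model, so new sequences can be added without retraining; a sequence model would then be conditioned on `prompt` rather than on the query alone.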

Developing such methods makes it possible both to quantify the pathogenicity of a given protein sequence, supporting disease diagnosis, and to optimise a sequence for its function, with relevance to the bioengineering of proteins for chemical processes and drug development.

Planned Impact

In the same way that bioinformatics has transformed genomic research and clinical practice, health data science will have a dramatic and lasting impact upon the broader fields of medical research, population health, and healthcare delivery. The beneficiaries of the proposed training programme, and of the research that it delivers and enables, will include academia, industry, healthcare, and the broader UK economy.

Academia: Graduates of the training programme will be well placed to start their post-doctoral careers in leading academic institutions, engaging in high-impact multi-disciplinary research, helping to build training and research capacity, and sharing their experience within the wider academic community.

Industry: Partner organisations will benefit from close collaboration with leading researchers, from the joint exploration of research priorities, and from the commercialisation of arising intellectual property. Other organisations will benefit from the availability of highly qualified graduates with skills in big health data analytics.

Healthcare: Healthcare organisations and patients will benefit from the results of enabled and accelerated health research, leading to new treatments and technologies, and an improved ability to identify and evaluate potential improvements in practice through the analysis of real-world health data.

Economy: The life sciences sector is a key component of the UK economy. The programme will provide partner companies with direct access to leading-edge research. Graduates of the programme will be well-qualified to contribute to economic growth - supporting health research and the development of new products and services - and will be able to inform policy and decision making at organisational, regional, and national levels.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/S02428X/1                                   01/04/2019  30/09/2027
2593955            Studentship   EP/S02428X/1  01/10/2021  30/09/2025  Ruben Weitzman