Transparent Deep Learning for Directed Protein Evolution

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Biological Sciences

Abstract

Protein engineering is a complex process that requires finding an amino acid sequence associated with a desired function. Because the design space grows exponentially with the number of residues, de novo design is currently an intractable problem. To overcome this complexity, scientists routinely rely on Directed Evolution (DE) [1], an iterative process of random mutagenesis and selection of protein variants. While this process has led to remarkable results, it is extremely slow, low-throughput and expensive, because the probability of generating functional proteins at each step is low. Thus, for the last 30 years, scientists have developed biophysical models and optimisation methods to predict protein structure and function in silico; however, these methods usually do not scale to large proteins and are limited by the accuracy of the underlying biophysical models.
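The mutagenesis-and-selection loop described above can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions: the function names, the 12-residue sequences and the `toy_fitness` score (fraction of residues matching an arbitrary target) are hypothetical stand-ins for a real laboratory assay, not part of the project.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mutate(seq, rng):
    """Return a copy of seq with one randomly chosen position substituted."""
    pos = rng.randrange(len(seq))
    choices = [a for a in AMINO_ACIDS if a != seq[pos]]
    return seq[:pos] + rng.choice(choices) + seq[pos + 1:]


def toy_fitness(seq, target):
    """Hypothetical stand-in for an assay: fraction of residues matching target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def directed_evolution(start, target, rounds=50, library_size=100, seed=0):
    """Repeat mutagenesis and selection, keeping the best variant of each round."""
    rng = random.Random(seed)
    parent = start
    history = [toy_fitness(parent, target)]
    for _ in range(rounds):
        # Random mutagenesis: build a library of single-point mutants.
        library = [mutate(parent, rng) for _ in range(library_size)]
        # Selection: keep the best mutant only if it is no worse than the parent.
        best = max(library, key=lambda s: toy_fitness(s, target))
        if toy_fitness(best, target) >= toy_fitness(parent, target):
            parent = best
        history.append(toy_fitness(parent, target))
    return parent, history


if __name__ == "__main__":
    # Arbitrary illustrative start and target sequences.
    final, history = directed_evolution("A" * 12, "MKTAYIAKQRQI")
    print(final, history[0], history[-1])
```

Each round screens `library_size` variants to gain at most one improved residue, which illustrates why the experimental version of this loop is slow and expensive.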

Recently, Machine Learning (ML) and, in particular, Deep Learning (DL) have largely overcome these problems by learning functional relationships associated with protein folding and function directly from data [2]. However, how a DL model arrives at its structural and functional predictions remains opaque [3], which limits the utility of such models for understanding the biological design principles associated with functional proteins.

AIMS AND OBJECTIVES: In collaboration with ZenithAI (OT/ZAI), we propose to design and build transparent and explainable deep learning models for protein design. The protein design space increases exponentially with the number of amino acid positions considered, yet functional proteins are extremely rare. Transparent models can therefore provide a principled protein selection method that focuses only on important and uncertain amino acid positions, ultimately reducing the burden of experimental screening of protein variants.
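To make the scale of the design space concrete, the short sketch below counts the sequences available when each position can hold any of the 20 standard amino acids. The protein length of 100 and the choice of 5 selected positions are illustrative assumptions, not figures from the project.

```python
N_AMINO_ACIDS = 20  # standard amino acid alphabet


def design_space_size(n_positions):
    """Number of distinct sequences when n_positions are free to vary."""
    return N_AMINO_ACIDS ** n_positions


def reduction_factor(total_positions, selected_positions):
    """How many times smaller the space is when only some positions vary."""
    return design_space_size(total_positions) // design_space_size(selected_positions)


if __name__ == "__main__":
    # Illustrative numbers only: a 100-residue protein, 5 positions chosen to vary.
    print(f"{design_space_size(100):.3e}")  # about 1.268e+130 sequences
    print(design_space_size(5))             # 3200000 sequences
```

Restricting the search to a handful of positions turns an astronomically large space into one that a screening campaign could plausibly cover, which is the motivation for identifying important and uncertain positions first.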

WORKPLAN. The project is structured in 3 work packages.
- WP1 - The student will develop a deep learning framework for protein engineering, using state-of-the-art variational and adversarial models coupled with sequence-to-sequence models, which will be trained using curated protein sequence information stratified by species and function.
- WP2 - The student will then develop probabilistic models to quantify uncertainty in designs by exploiting gradient and weight information learned by the model, with the ultimate aim of defining a score to prioritise proteins for experimental testing.
- WP3 - The student will use the model to design variants of the human S1PL enzyme, which will then be tested in the lab. S1PL is a central enzyme in the sphingolipid pathway, which is essential for proper cell function, and it has a causal role in many diseases, including cancer and neurodegenerative disorders.
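One way to picture the prioritisation score mentioned in WP2 is sketched below. It is not the gradient- and weight-based method the project proposes; as a simpler stand-in, it assumes an ensemble of models that each predict an activity for every variant, and ranks variants by an upper-confidence-bound score (mean prediction plus a multiple of the ensemble spread). The variant names, predicted values and the `beta` parameter are hypothetical.

```python
from statistics import mean, pstdev


def ucb_score(predictions, beta=1.0):
    """Upper-confidence-bound score: mean prediction plus beta times the spread."""
    return mean(predictions) + beta * pstdev(predictions)


def prioritise(ensemble_predictions, beta=1.0, top_k=2):
    """Return the top_k variant names ranked by descending UCB score."""
    ranked = sorted(
        ensemble_predictions,
        key=lambda name: ucb_score(ensemble_predictions[name], beta),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical predicted activities from three ensemble members per variant.
example = {
    "variant_A": [0.50, 0.52, 0.51],  # confident, moderate activity
    "variant_B": [0.40, 0.80, 0.60],  # uncertain, possibly high activity
    "variant_C": [0.20, 0.22, 0.21],  # confident, low activity
}

if __name__ == "__main__":
    print(prioritise(example))
```

Raising `beta` favours variants the models disagree on, so that experiments are spent where they are most informative; setting it to zero ranks purely by predicted activity.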

TRAINING PROGRAM. The student will receive training in machine learning, statistical learning and deep learning, and will build a competitive profile in biological sequence modelling and design. The student will also be introduced to the emerging field of synthetic biology and will learn modern DNA cloning and assembly techniques and the use of protein expression systems at scale. We also place a strong emphasis on reproducible research: the student will receive training in advanced research software engineering and in reproducible workflows for data analysis.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
BB/T00875X/1                                   01/10/2020  30/09/2028
2745409            Studentship   BB/T00875X/1  01/10/2022  30/09/2026