Using deep learning and deep mutational scanning to map, understand and predict the "splicing code"

Lead Research Organisation: University of Cambridge
Department Name: School of Biological Sciences

Abstract

BBSRC strategic theme: Transformative technologies

My PhD project will explore the regulatory and modulatory architecture of mRNA splicing in introns and exons using deep learning (DL) models, interpretability methods, and deep mutational scanning (DMS) data.

I plan to use SpliceAI and other splicing predictors to perform in-silico DMS across all protein-coding genes to reveal their "splicing code". This work builds on my previous development of a bioinformatics tool that maps the architecture of exons and their adjacent introns, shared with my colleagues at Sanger/CRG Barcelona. I will extend this tool to include deep intronic regions, which remain under-explored in the splicing literature. Working with colleagues who are performing DMS on 1000 exons and their proximal intronic regions, random exons, and random introns, I aim to benchmark model performance and develop algorithms that predict splicing architecture more accurately.
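The in-silico DMS idea described above can be sketched in a few lines: enumerate every single-nucleotide substitution of a sequence, score each mutant with a predictor, and record the change relative to wild type. The sketch below is illustrative only. `toy_splice_score` is a hypothetical stand-in that counts GT dinucleotides; it is not SpliceAI and does not use SpliceAI's API, and the function names are assumptions made for this example.

```python
from typing import Callable, Dict, Tuple

BASES = "ACGT"


def toy_splice_score(seq: str) -> float:
    """Hypothetical stand-in for a splicing predictor (NOT SpliceAI).

    Returns the fraction of dinucleotides in `seq` that are 'GT'.
    """
    if len(seq) < 2:
        return 0.0
    hits = sum(seq[i:i + 2] == "GT" for i in range(len(seq) - 1))
    return hits / (len(seq) - 1)


def in_silico_dms(
    seq: str, score_fn: Callable[[str], float]
) -> Dict[Tuple[int, str, str], float]:
    """Score every single-nucleotide substitution of `seq`.

    Keys are (position, reference base, alternative base); values are
    the mutant score minus the wild-type score.
    """
    wild_type_score = score_fn(seq)
    effects: Dict[Tuple[int, str, str], float] = {}
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt == ref:
                continue  # skip the wild-type base itself
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, ref, alt)] = score_fn(mutant) - wild_type_score
    return effects


effects = in_silico_dms("ACGTAC", toy_splice_score)
# Substitutions with the most negative effect disrupt the toy "GT" signal.
most_disruptive = min(effects, key=effects.get)
```

In a real analysis the scoring function would wrap a trained predictor, and the resulting effect map would be computed per gene rather than for a six-base toy sequence.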

Using DL interpretability methods and model architectures developed by the Peter Koo lab at Cold Spring Harbor Laboratory, I will apply Surrogate Quantitative Interpretability for Deepnets (SQUID) to my in-silico DMS data. SQUID treats the deep neural network (DNN) as an oracle, fitting a MAVE-NN to specific genomic regions and using simple models with interpretable parameters to approximate the DNN's function. Combined with Global Importance Analysis, this approach will help quantify the significance of genomic features in DNNs and their epistatic interactions within cis-regulatory mechanisms. These methods will help me generate biological hypotheses for large-scale library design and massively parallel experimental testing.
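The surrogate-modelling idea can be illustrated with a deliberately crude sketch: query a black-box oracle on a library of mutagenised sequences, then summarise its behaviour with per-position, per-base parameters. This is not the SQUID or MAVE-NN API, and it does not reproduce their fitting procedure; the marginal-mean estimate below is a simplified additive surrogate chosen for illustration, and `mutagenise`, `fit_additive_surrogate` and `toy_oracle` are hypothetical names.

```python
import itertools
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

BASES = "ACGT"


def mutagenise(wild_type: str, n: int, rate: float, seed: int = 0) -> List[str]:
    """Build a library of `n` variants by randomly redrawing each base.

    Each position is redrawn with probability `rate`; a redraw may
    return the original base.
    """
    rng = random.Random(seed)
    library = []
    for _ in range(n):
        variant = [
            rng.choice(BASES) if rng.random() < rate else base
            for base in wild_type
        ]
        library.append("".join(variant))
    return library


def fit_additive_surrogate(
    library: List[str], oracle: Callable[[str], float]
) -> Dict[Tuple[int, str], float]:
    """Estimate an additive effect for each (position, base).

    The effect is the mean oracle score of sequences carrying that base
    at that position, minus the mean score over the whole library.
    """
    scores = [oracle(seq) for seq in library]
    grand_mean = sum(scores) / len(scores)
    totals: Dict[Tuple[int, str], float] = defaultdict(float)
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for seq, score in zip(library, scores):
        for pos, base in enumerate(seq):
            totals[(pos, base)] += score
            counts[(pos, base)] += 1
    return {key: totals[key] / counts[key] - grand_mean for key in totals}


def toy_oracle(seq: str) -> float:
    """Hypothetical black box: responds only to a G at position 2."""
    return 1.0 if seq[2] == "G" else 0.0


# An exhaustive 4-mer library keeps this example exact and deterministic.
exhaustive = ["".join(p) for p in itertools.product(BASES, repeat=4)]
surrogate = fit_additive_surrogate(exhaustive, toy_oracle)
```

With the exhaustive library the surrogate assigns the only non-zero parameters to position 2, which is the behaviour an interpretable surrogate is meant to expose. On a real DNN the library would come from something like `mutagenise` applied around a sequence of interest, and the parameters would be fitted rather than averaged.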

I will also contribute to developing a new splicing predictor using experimental DMS data from the Lehner lab. My research will focus on how different model architectures and optimisation methods affect the accuracy, interpretability, and biological fidelity of our splicing models. I will explore important trade-offs between model accuracy, robustness and interpretability, incorporating methods such as attention mechanisms, exponential activation functions, variable filtering, adversarial training, Gaussian noise injection, and regularisation.

Additionally, I am interested in addressing open questions in DL for genomics, such as using additional output heads and loss functions to bias the model towards learned representations that are faithful to biology. For instance, besides predicting percent spliced in (PSI) values, the models could predict alternative splicing events and their sequences. I plan to develop a new data analysis pipeline and error model to quantify these novel splicing events from experimental DMS data, which will improve the training data available.

Lastly, I am intrigued by the use of masked language modelling in genomic large language models (LLMs). I will explore the advantages and challenges of traditional one-hot encoding versus advanced DNA/RNA models like Nucleotide Transformer/SegmentNT, particularly in splicing analysis. Interpreting what these LLM embeddings signify, assessing their performance in non-coding regions, and addressing their inability to capture cell-type-specific information all present notable difficulties. However, models based on RNA features that may be relatively constant across cell types could potentially be integrated with our splicing DMS data across 5 cell lines to classify splicing architecture with nucleotide-level resolution, cell specificity and spatial clarity.
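For reference, the traditional one-hot encoding mentioned above can be sketched as follows. The A, C, G, T column order and the all-zero row for ambiguous bases such as N are assumptions made for this example; individual models differ on both conventions.

```python
from typing import List

ALPHABET = "ACGT"  # assumed column order


def one_hot_encode(seq: str) -> List[List[int]]:
    """Encode a DNA sequence as one row of four 0/1 values per base.

    Bases outside ALPHABET (for example 'N') are encoded as an all-zero
    row, which is an assumption of this sketch.
    """
    rows = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        index = ALPHABET.find(base)
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("ACGN")
```

Unlike a learned LLM embedding, each row here carries no context from neighbouring bases, which is part of what makes one-hot inputs easy to interpret.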

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
BB/X010899/1       -             -             30/09/2023  29/09/2028  -
2888208            Studentship   BB/X010899/1  30/09/2023  19/10/2027  -