Graph learning methods for engineering mammalian promoters in bioproduction

Lead Research Organisation: Imperial College London
Department Name: Bioengineering

Abstract

mRNA COVID-19 vaccines have effectively prevented hospitalization and death during the pandemic. Bioproduction of valuable vaccines and biotherapeutics in mammalian cell lines can be achieved, but it is difficult to develop robust, predictable, and sustainable expression. The design of enhanced mammalian promoters and genetic circuits is therefore a key strategic industrial target. This project aims to explore the structural properties of large-scale transcriptomic datasets with graph representation learning to optimise engineered promoters in mammalian cells. Demonstration will be performed in mammalian cell lines using automated DNA assembly and analytics available at the London Biofoundry.

As part of this project, multiple publicly available transcriptomic datasets will be used, including the EPD [2], the DEE2 uniform transcriptomic database [7], and SRA [5], for constructing heterogeneous graphs in which each node represents a set of DNA sequences (i.e. promoters and genes) and edge types are determined by the biological context shared by two nodes. The initial node representation may be generated using the internal DNA sequence structure.

With the heterogeneous network constructed, there are two objectives. First, we will use graph neural networks (GNNs) [3, 4, 8] to learn a representation of each node in the network, mapping the nodes into an embedding space that reflects the graph structure. In this way, the cellular and environmental context will be embedded into the learnt node representations; we can then use the resulting node embeddings to predict promoter activity (i.e. the speed and quantity of gene expression). The promoter activity prediction task may be integrated with the representation learning process or solved using separate predictors.

The second stage of the project will combine results from experiments to iteratively refine the GNNs.
The trained model will select a base set of 50-100 promoters that are functional over a range of conditions and have the desired performance, to be used as a starting point for synthetic promoter design. In collaboration with the London Biofoundry, their characteristics will be tested in transient transfection to eliminate changes in behaviour due to differences in copy number and physical context. Promoters with verified behaviour will be used by the GNN algorithm as modular building blocks to build a set of 50 synthetic promoters with designed behaviour. Construct performance will be compared against predictions generated by the GNNs and used to refine model performance.

Overall, the project will leverage the expressive power of GNNs and large-scale transcriptomic datasets to develop a broadly applicable framework for gene expression problem analysis, with applications in predicting contextual promoter behaviours and engineering new mammalian promoters. In addition, interpretability of the learnt model can be achieved by applying symbolic regression [1] and GNN-explainer [6].
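
As an illustration only, the graph construction and node-embedding idea described in the abstract can be sketched in plain Python. Everything below is an assumption made for the sketch and is not taken from the project: the node names, the toy sequences, the edge types ("same_tissue", "same_condition"), the use of 2-mer frequencies as the sequence-derived initial node representation, and the unweighted mean aggregation, which stands in for a trained GNN layer and has no learned parameters.

```python
from collections import defaultdict
from itertools import product


def kmer_features(seq, k=2):
    """Normalised k-mer frequencies: a simple sequence-derived node feature (assumed choice)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid division by zero for very short sequences
    return [counts[km] / total for km in kmers]


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def propagate(features, typed_edges):
    """One round of per-edge-type mean aggregation over an undirected heterogeneous graph.

    Neighbours are averaged separately for each edge type, then the per-type
    messages are averaged together with the node's own features.
    """
    neighbours = defaultdict(lambda: defaultdict(list))
    for src, edge_type, dst in typed_edges:
        neighbours[src][edge_type].append(dst)
        neighbours[dst][edge_type].append(src)
    updated = {}
    for node, feat in features.items():
        messages = [
            mean_vec([features[n] for n in nbrs])
            for nbrs in neighbours[node].values()
        ]
        updated[node] = mean_vec([feat] + messages)
    return updated


# Hypothetical toy graph: two promoter nodes and one gene node.
nodes = {
    "promoter_A": "TATAAAGGCC",
    "promoter_B": "GGGCGGTATA",
    "gene_X": "ATGGCCAAGT",
}
# Hypothetical edge types standing in for shared biological context.
edges = [
    ("promoter_A", "same_tissue", "gene_X"),
    ("promoter_B", "same_condition", "gene_X"),
]

features = {name: kmer_features(seq) for name, seq in nodes.items()}
embeddings = propagate(features, edges)
```

In a real pipeline, each propagation round would apply learned, edge-type-specific weights, and the resulting embeddings would feed a predictor of promoter activity; the sketch only shows how context from neighbouring nodes enters a node's representation.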

Publications

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
BB/W509966/1                                   01/10/2021  30/09/2025
2786047            Studentship   BB/W509966/1  01/10/2021  30/09/2025