Machine learning driven codon optimisation for heterologous protein expression

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Biological Sciences

Abstract

BACKGROUND. Expression of proteins requires the transcription of DNA into RNA, followed by its translation into amino acid sequences. Each amino acid is encoded by triplets of nucleotides, called codons, under a genetic code that is nearly universal in nature. However, an amino acid can be encoded by several different codons, a phenomenon known as degeneracy of the genetic code, and the choice of one codon over another affects downstream protein abundance. Interestingly, although the synthesis machinery is relatively conserved across species, synonymous codon usage varies across species and even across genes, as a function of a number of factors, including GC content, recombination rates, mRNA stability and codon position [Novoa et al, 2019]. Moreover, it has been shown that once a given codon is used, subsequent codons encoding the same amino acid are not picked at random but follow complex combinatorial patterns [Cannarozzi et al, 2010].
Despite the wealth of knowledge generated by high-throughput sequencing and proteomics experiments, the rules underpinning codon usage are mostly unknown.
From an industrial biotechnology perspective, this knowledge gap limits our ability to efficiently express heterologous proteins and to optimise properties for end-user applications, such as solubility [Pellizza et al, 2018].
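To make degeneracy and synonymous codon usage concrete, the following is a minimal Python sketch, not part of the project itself. It builds the standard genetic code table and measures, for one coding sequence, how often each codon is used relative to its synonyms. The function name `relative_codon_usage` and the toy sequence are illustrative assumptions.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid they encode ('*' marks a stop codon).
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)


def relative_codon_usage(cds):
    """Return, for each codon, its share among the synonymous codons
    of the same amino acid in `cds` (hypothetical helper).

    Amino acids that do not occur in `cds` are left out of the result.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    usage = {}
    for aa, synonyms in SYNONYMS.items():
        total = sum(counts[c] for c in synonyms)
        if total == 0:
            continue
        for c in synonyms:
            usage[c] = counts[c] / total
    return usage


if __name__ == "__main__":
    # Toy sequence (an assumption, not real data): Met-Leu-Leu-Leu-stop.
    toy = "ATGCTGCTGCTTTAA"
    print(len(SYNONYMS["L"]), "codons encode leucine")
    print(relative_codon_usage(toy)["CTG"])
```

Applied to a whole transcriptome rather than a single toy sequence, the same counting gives the kind of species-level codon usage table that existing optimisation methods typically start from.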

AIMS AND OBJECTIVES. In collaboration with Fujifilm Diosynth Biotechnologies UK (FDBK), we propose to learn codon usage rules by rephrasing protein synthesis as a language modelling problem. We will then use deep learning to capture complex epistatic and evolutionary patterns associated with highly expressed genes and with optimal solubility. Ultimately, these models will be validated in silico and in vivo.

WORKPLAN. The project is structured in three work packages.
- WP1 - the student will collect transcriptomic data for E. coli from public repositories and generate a dataset of curated transcripts and associated protein sequences.
- WP2 - the student will develop a neural language model to convert amino acid sequences into DNA sequences, by taking into account evolutionary information and protein function.
- WP3 - the student will experimentally validate the models' effectiveness by synthesising, building and expressing codon-optimised proteins in E. coli, and by comparing them downstream against wild-type variants and genes optimised with existing methods.
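For orientation, the task in WP2 (amino acid sequence in, DNA sequence out) can be illustrated with a very simple frequency-based baseline. The sketch below is written in Python under stated assumptions: it is not the neural language model proposed here, and it is only one plausible example of the kind of existing method WP3 might compare against. It picks, for every amino acid, the codon seen most often in a set of reference coding sequences. The names `preferred_codons` and `back_translate` and the toy reference sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def preferred_codons(reference_cds):
    """Map each amino acid to its most frequent codon in the reference
    coding sequences (hypothetical helper).

    Amino acids never seen in the reference fall back to their first
    codon in table order, which is an arbitrary choice for this sketch.
    """
    counts = Counter()
    for cds in reference_cds:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1

    best = {}
    for codon, aa in CODON_TABLE.items():
        # Strict '>' keeps the earlier codon in table order on ties.
        if aa not in best or counts[codon] > counts[best[aa]]:
            best[aa] = codon
    return best


def back_translate(protein, preferred):
    """Convert an amino acid string into DNA, one preferred codon each."""
    return "".join(preferred[aa] for aa in protein.upper())


if __name__ == "__main__":
    # Toy reference (an assumption, not real E. coli data).
    reference = ["ATGCTGCTGCTTAAATAA"]
    table = preferred_codons(reference)
    print(back_translate("MLK", table))
```

A baseline of this kind treats every position independently, so it cannot represent the context-dependent patterns between successive synonymous codons described in the background, which is the gap a sequence model is meant to address.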

TRAINING PROGRAMME. The student will receive training in machine learning, statistical learning and deep learning, and will build a competitive profile in biological sequence modelling and design. The student will also be introduced to the emerging field of synthetic biology and will learn modern DNA cloning and assembly techniques and the use of protein expression systems at scale. We also place a strong emphasis on reproducible research: the student will receive training in advanced research software engineering and in reproducible workflows for data analysis.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
BB/T00875X/1                                   01/10/2020  30/09/2028
2599698            Studentship   BB/T00875X/1  01/10/2021  30/09/2025