Computational studies of genome evolution and regulation

Lead Research Organisation: University College London
Department Name: Genetics, Evolution and Environment

Abstract

Strategic Research Priority: World Class Underpinning Bioscience
This thesis takes on the challenge of extracting information from large volumes of biological data produced with newly established experimental techniques. The computational studies of genome evolution and regulation presented here aim to strike a balance between maximising the information gained from the data and identifying types of information that are absent from the data and hence cannot be inferred by any analysis.
In the first part of the thesis I examine the evolutionary origins of de novo taxonomically restricted genes (TRGs) in the Drosophila subgenus. De novo TRGs are genes that arose from previously non-coding regions after a particular clade diverged: from functional ncRNAs, from within introns or alternative reading frames of older protein-coding genes, or from intergenic sequences. TRGs are clade-specific tool-kits that are likely to contain proteins with as yet undocumented functions and undiscovered protein folds. One of the main challenges in studying de novo TRGs is the trade-off between false positives (non-functional open reading frames) and false negatives (true TRGs whose properties differ from those of well-established genes). Here I identify two de novo TRG families in the Drosophila subgenus that have not previously been reported as de novo genes; to my knowledge they are the best candidates identified so far for experimental studies aimed at elucidating the properties of de novo genes.
In the second part of the thesis I examine the information contained in single-cell RNA sequencing (scRNA-seq) data and propose a method for extracting biological knowledge from these data using generative neural networks. The main challenge is the noisiness of scRNA-seq data: the number of transcripts sequenced is not proportional to the number of mRNAs present in the cell. I use a variational autoencoder (VAE) to reduce the dimensionality of the data without making untestable assumptions about them. The resulting lower-dimensional embedding, together with the features learned by the VAE, contains information about cell populations, differentiation trajectories and the regulatory relationships between genes. Unlike most methods currently in use, the VAE does not assume that these regulatory relationships are the same in all cells in the data set. The main advantages of this approach are that it makes minimal assumptions about the data, is robust to noise, and allows its performance to be assessed.
In the final part of the thesis I summarise lessons learnt from analysing various types of biological data and make suggestions for the future direction of similar computational studies.
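
As an illustration only, the quantity that a VAE minimises when compressing a cell's gene counts into a low-dimensional embedding can be sketched as below. This is not the model used in the thesis. The Poisson likelihood for counts, the toy linear decoder, and every function and parameter name are assumptions made for the sake of a short, self-contained example.

```python
import math
import random

def reparameterise(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(sigma^2)) to N(0, I), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def decode(z, weights, bias):
    """Hypothetical linear decoder: maps the embedding z to one positive rate per gene."""
    return [math.exp(sum(w * zi for w, zi in zip(row, z)) + b)
            for row, b in zip(weights, bias)]

def poisson_nll(counts, rates):
    """Negative log-likelihood of observed counts under independent Poisson rates (an assumed noise model)."""
    return sum(r - k * math.log(r) + math.lgamma(k + 1) for k, r in zip(counts, rates))

def negative_elbo(counts, mu, log_var, weights, bias, rng):
    """Single-sample estimate of the negative ELBO: reconstruction term plus KL term."""
    z = reparameterise(mu, log_var, rng)
    rates = decode(z, weights, bias)
    return poisson_nll(counts, rates) + kl_to_standard_normal(mu, log_var)

if __name__ == "__main__":
    rng = random.Random(0)
    counts = [3, 0, 1]                      # made-up transcript counts for three genes
    mu, log_var = [0.2, -0.1], [0.0, 0.0]   # made-up encoder output for one cell
    weights = [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]]
    bias = [0.0, 0.0, 0.0]
    print(negative_elbo(counts, mu, log_var, weights, bias, rng))
```

In a trained model the encoder outputs (`mu`, `log_var`) and the decoder parameters would be learned by minimising this loss over all cells. The reconstruction term can be evaluated on held-out cells, which is one way the performance of such a model can be assessed.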

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
BB/M009513/1                                   01/10/2015  31/03/2024
1618923            Studentship   BB/M009513/1  01/10/2015  30/09/2019  Karina Zile