Computational studies of genome evolution and regulation

Lead Research Organisation: UNIVERSITY COLLEGE LONDON

Department Name: Structural Molecular Biology

Abstract

Strategic Research Priority: World Class Underpinning Bioscience
This thesis takes on the challenge of extracting information from large volumes of biological data produced with newly established experimental techniques. The computational studies of genome evolution and regulation presented here aim to strike a balance between maximising the information gained from the data and identifying the types of information that are absent from the data and hence cannot be inferred by any analysis.

In the first part of the thesis I examine the evolutionary origins of de novo taxonomically restricted genes (TRGs) in the Drosophila subgenus. De novo TRGs are genes that arose from previously non-coding regions after a particular clade diverged. These regions include functional ncRNAs, introns or alternative reading frames of older protein-coding genes, and intergenic sequences. TRGs form clade-specific toolkits that are likely to include proteins with as yet undocumented functions and undiscovered protein folds. One of the main challenges in studying de novo TRGs is the trade-off between false positives (non-functional open reading frames) and false negatives (true TRGs whose properties differ from those of well-established genes). Here I identify two de novo TRG families in the Drosophila subgenus that have not previously been reported as having originated de novo. To my knowledge, they are the best candidates identified so far for experimental studies aimed at elucidating the properties of de novo genes.

In the second part of the thesis I examine the information contained in single-cell RNA sequencing (scRNA-seq) data and propose a method for extracting biological knowledge from these data using generative neural networks. The main challenge is the noisiness of scRNA-seq data: the number of transcripts sequenced is not proportional to the number of mRNAs present in the cell. I use a variational autoencoder (VAE) to reduce the dimensionality of the data without making untestable assumptions about them. The resulting embedding into a lower-dimensional space, together with the features learned by the VAE, contains information about cell populations, differentiation trajectories and the regulatory relationships between genes. Unlike most methods currently in use, the VAE does not assume that these regulatory relationships are the same in all cells in the data set. The main advantages of this approach are that it makes minimal assumptions about the data, is robust to noise, and allows its performance to be assessed.

In the final part of the thesis I summarise lessons learnt from analysing various types of biological data and make suggestions for the future direction of similar computational studies.

Sep 15 - Sep 19

Funder:

BBSRC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

1618923

Research Topic:

Unclassified

Organisations

UNIVERSITY COLLEGE LONDON (Lead Research Organisation)

Publications

Altenhoff AM (2018) The OMA orthology database in 2018: retrieving evolutionary relationships among all domains of life through richer web and programmatic interfaces. Nucleic Acids Research.

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
BB/M009513/1			30/09/2015	31/03/2024
1618923	Studentship	BB/M009513/1	30/09/2015	29/09/2019