The systematic functional analysis of regulatory novel open reading frames.

Lead Research Organisation: University of Cambridge
Department Name: Genetics


Nearly 93% of all disease- and trait-associated single nucleotide polymorphisms (SNPs) map to non-coding regions. To gain a better understanding of the pathogenicity of non-coding regions, we introduce the gene class of novel open reading frames (nORFs) and investigate the functional identity of non-canonical transcriptional and translational products. At present, the nORF database openProt provides mass spectrometry-based translational evidence for 143,102 proteins. Despite decisive evidence of nORF translation and regulation, a comprehensive understanding of nORF function and disease association remains challenging. However, recent advances in bioinformatics and data acquisition offer new opportunities to develop data-driven pipelines that might unveil the identity and regulatory function of previously overlooked (non-canonical) proteins.

In this report, we share preliminary results and propose a plan to further establish the relevance of nORFs by means of bioinformatic analysis and experimental validation, with the aim of better understanding a wide array of diseases. The report begins with a general introduction to the field's relevance and progress, followed by four chapters that each contribute to the rigorous top-down analysis and prioritization of functional nORF candidates.

Chapter one introduces the creation of the nORF database and platform, which from then on serves as the reference database for all later research. Chapter two uses this database to engineer feature descriptor dimensions for later analysis pipelines. Chapter three introduces a high-dimensional analysis pipeline to quality-control and analyze a combined dataset of nORFs, RefSeq genes and random sequences. In this analysis, we use unsupervised dimensionality-reduction methods (t-SNE and UMAP) as groundwork for more sophisticated machine learning methods. In chapter four, 37 sequence samples are taken from t-SNE clusters to identify potential functional relationships. Following these analysis steps, the report concludes with preliminary insights into regulatory nORFs and considerations for future research. (VIVA excerpt)
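
The abstract does not state which feature descriptors are engineered in chapter two. Purely as an illustration, the sketch below assumes one common choice, normalized k-mer frequencies, and computes them for a made-up sequence and a length-matched random control. The names `kmer_descriptor`, `random_sequence` and `toy_orf` are hypothetical and are not taken from the project.

```python
from collections import Counter
from itertools import product
from typing import List
import random

ALPHABET = "ACGT"


def kmer_descriptor(seq: str, k: int = 2) -> List[float]:
    """Return the normalized k-mer frequency vector of a DNA sequence.

    Assumption: k-mer frequencies stand in for the (unspecified)
    feature descriptors mentioned in the abstract.
    """
    seq = seq.upper()
    # Fixed ordering of all 4**k possible k-mers, so vectors are comparable.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only count k-mers made of unambiguous bases (skips e.g. 'N').
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def random_sequence(length: int, rng: random.Random) -> str:
    """Draw a uniformly random DNA sequence to serve as a background control."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


rng = random.Random(0)              # fixed seed for reproducibility
toy_orf = "ATGGCGTGCAAATAG"         # made-up example, not a real nORF
background = random_sequence(len(toy_orf), rng)

features = {
    "toy_orf": kmer_descriptor(toy_orf),
    "random": kmer_descriptor(background),
}
```

Each sequence becomes a 16-dimensional vector for k = 2. A matrix of such vectors, one row per nORF, RefSeq gene or random sequence, is the kind of input that an embedding method such as t-SNE or UMAP could then project to two dimensions.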



Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
BB/M011194/1                                   01/10/2015  31/03/2024
2116112            Studentship   BB/M011194/1  01/10/2018  30/09/2022  Robin Kohze