Bayesian models of expression in the transcriptome for clinical RNA-seq

Lead Research Organisation: University of Sheffield
Department Name: Neurosciences

Abstract

Background
--
Science is discovering the exciting world of genes, how they interact, how they differ from person to person and the process by which they ultimately form the templates for proteins, the building blocks of life. But genes are more complicated than we thought. It seems that genes can be transformed: somewhere between the DNA code for a gene and the final protein, genes can be 'spliced' to make different types of proteins and to form other molecules that interact with the gene system.

In a healthy cell, this splicing is useful! It allows us to store the code for different proteins in a single gene. But splicing has been implicated in disease, in particular, Motor Neurone Disease (MND), a debilitating and poorly understood disease. Recently, researchers have found a small part of the genetic code (which we call C9ORF72) which may be a tiny clue in understanding MND, and it seems to have a huge effect on splicing.

New technology is allowing us to uncover the world of genes. Technology developed to sequence the human genome allows us to measure the genes in a cell, right at the point where splicing happens: we call this RNA-Seq. But this sequencing generates LOTS of data, and the amount is increasing. In fact, it's increasing faster than computers are improving. If we want to analyse the data to uncover the world of splicing and the effect it has in MND, we need to create new computational tools which allow us to deal with the data using limited computing resources.

The Problem
--
RNA-Seq presents us with millions of short sequences which represent the genes after splicing has occurred. It's a bit like being presented with a huge bag of jigsaw pieces from thousands of different puzzles: how do the pieces fit together? How many types of picture are there? Which pictures occur most often? There is a lot of uncertainty in the problem, and so we use probabilities to express how the pieces might fit together, and thus how genes have been spliced.

In recent work, I've examined how to make probabilistic algorithms like this one more efficient; now I will look at how to apply this type of method to RNA-Seq data. My research will create such tools using methods based on approximate Bayesian inference. I'll devise algorithms which can deal with these increasing quantities of data, and allow scientists to make statistical inferences from RNA-Seq data about splicing in disease.

Transferring knowledge
--
To develop such tools, I'll draw inspiration from the related field of natural language processing. With the explosion of the web, data scientists have created methods for organising and categorising our data. One particular statistical method, called a "topic model", closely resembles the analysis of RNA-Seq. With so much focus on the web, there have been lots of developments in topic models that we can borrow to make our algorithms for RNA-Seq faster and better. I'll investigate how we can transfer these ideas to RNA-Seq analysis.

There are lots of statistical models that we might want to adapt to study disease through RNA-Seq. For example, we might want to build a time-series model of gene progression, or we might want to find groups of genes which follow the same pattern or trend. To get the maximum efficiency from the data, I'll build these models right into the reverse-jigsaw problem described above.

Investigating disease
--
My colleagues are in the process of collecting RNA-Seq data on MND. But the nature of the data will present unforeseen statistical challenges. For example, we might have unknown groupings in patients from whom we have data. I'll use the investigations into MND to inspire statistical models built around the RNA-Seq algorithms that I develop. These methods will be inspired by problems in MND research, but will lead to algorithms that can be used by the wider scientific community in experiments which involve RNA-Seq.

Technical Summary

Background
--
RNA-Seq technology is enabling investigation of gene expression at the transcript level, including the identification of alternatively spliced isoforms. In Motor Neurone Disease, alternative splicing has been strongly implicated as a pathogenic mechanism. Applying RNA-Seq to clinical data, and to MND in particular, requires the development of new statistical methodologies to tackle challenges specific to such data. Bayesian statistical methods for RNA-Seq are desirable to deal with the uncertainty in quantifying transcript expression, but existing approaches are prohibitively slow for large datasets.

Aims & Objectives
--
1) To develop practical algorithms for transcript quantification from RNA-Seq in the Bayesian statistical framework.
2) To build statistical models *around* the transcript quantification problem, addressing problems specific to clinical data.
3) To use the developed algorithms to investigate the effects of splicing in Motor Neurone Disease.

Methodology
--
The Bayesian statistical framework will be the cornerstone of the project. Whilst Bayesian methods are often computationally demanding, I shall make use of approximate posterior inference. I'll build on recent work in this area to make fast algorithms for the analysis of RNA-Seq data. I'll collaborate closely with clinical and wet-lab staff in the SITraN neuroscience facility, giving my work immediate impact on research into MND.
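As a rough illustration of what approximate posterior inference can look like for transcript quantification, the sketch below runs mean-field variational Bayes on a toy mixture model. It is not the project's algorithm: the function names, the symmetric Dirichlet prior, the toy read-to-transcript compatibility matrix and the fixed iteration count are all assumptions made for this example.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0 (recurrence + asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift x upwards until the asymptotic series is accurate
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def variational_transcript_proportions(likelihoods, prior=1.0, n_iter=200):
    """Mean-field VB for a mixture with known per-read likelihoods.

    likelihoods[n][k] is the (assumed known) likelihood of read n under
    transcript k. Returns the Dirichlet parameters of the approximate
    posterior over transcript proportions, and its mean.
    """
    n_reads = len(likelihoods)
    n_transcripts = len(likelihoods[0])
    # Hypothetical initialisation: spread the reads evenly over transcripts.
    alpha = [prior + n_reads / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        total = sum(alpha)
        expected_log_theta = [digamma(a) - digamma(total) for a in alpha]
        counts = [0.0] * n_transcripts
        for row in likelihoods:
            # Responsibility of each transcript for this read, then normalise.
            weights = [lik * math.exp(e) for lik, e in zip(row, expected_log_theta)]
            norm = sum(weights)
            for k, w in enumerate(weights):
                counts[k] += w / norm
        # Dirichlet update: prior pseudo-count plus expected read counts.
        alpha = [prior + c for c in counts]
    total = sum(alpha)
    return alpha, [a / total for a in alpha]


# Toy data: 6 reads match only transcript 0, 2 match only transcript 1,
# and 4 are ambiguous between the two.
toy_reads = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 2 + [[1.0, 1.0]] * 4
alpha, posterior_mean = variational_transcript_proportions(toy_reads)
print(alpha, posterior_mean)
```

Each iteration costs time proportional to the number of non-zero read-transcript pairs, which is one reason approximations of this kind are attractive when sampling-based inference is too slow.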

Scientific opportunities
--
The quantification of transcripts in RNA-Seq bears a close resemblance to Latent Dirichlet Allocation (LDA), a statistical model used for the analysis of text corpora. Investigation of this link will enable the transfer of knowledge from this field to enable statistical advances for processing RNA-Seq.
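One way to write out the resemblance is to set the two generative models side by side. The notation below is assumed for illustration and does not come from the proposal: $\theta$ denotes mixture proportions, $z$ a latent assignment, $w_{dn}$ the $n$-th word of document $d$, $\beta_k$ the word distribution of topic $k$, and $r_n$ the $n$-th read.

```latex
\begin{align*}
\text{LDA (per document } d\text{):}\quad
  & \theta_d \sim \mathrm{Dirichlet}(\alpha), \quad
    z_{dn} \mid \theta_d \sim \mathrm{Categorical}(\theta_d), \quad
    w_{dn} \mid z_{dn} \sim \mathrm{Categorical}(\beta_{z_{dn}}) \\
\text{RNA-Seq (per sample):}\quad
  & \theta \sim \mathrm{Dirichlet}(\alpha), \quad
    z_n \mid \theta \sim \mathrm{Categorical}(\theta), \quad
    r_n \mid z_n \sim p(r_n \mid \text{transcript } z_n)
\end{align*}
```

Under this reading, reads play the role of words, transcripts the role of topics, and a sample the role of a document. One difference is that LDA learns its topics from the corpus, whereas in quantification against a known annotation the transcript sequences are typically given.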

Planned Impact

The project has the potential to enhance the research capacity of other research institutions by creating better tools for the analysis of RNA-Seq data.

A workshop for RNA-Seq practitioners will be held. This will advertise the work amongst the target user group and help ensure best practice in its use.

Commercial exploitation of the software is a further opportunity for impact. This could be achieved through the production of licensed software within Sheffield, in collaboration with groups within the computer science department (such as the GeneSys project), or by collaboration with RNA-Seq manufacturers. In accordance with the guidelines, if a commercial opportunity is not available, an open source software implementation will maximise the availability of the resulting software.

Statistical methodologies developed during the project will have the potential to impact more widely than the area of RNA-Seq analysis. Whilst I intend to borrow technology from the area of Natural Language Processing, it is possible that the methodologies for RNA-Seq analysis are also applicable in NLP. To explore this, I can engage with the NLP group in Sheffield.

The project will contribute toward the health of academic disciplines by crossing boundaries between machine learning and biostatistics. The ensuing collaborations may lead to further knowledge transfer in the future.

The application is aligned with major elements of the UK National Strategic Agenda as outlined in reviews and reports from Government Departments and Non-Departmental Public Bodies:

- Joint HM Treasury/DTI/DES 2004-2014 Science and Innovation Investment Framework objectives: building world class UK centres of research excellence to support growth in internationally mobile R&D and highly skilled people.

- The Department for Business, Innovation and Skills Strategy for Life Sciences 2011. The project is aligned with key elements of this strategy including: attracting, developing and rewarding talent;

Publications

Durrande N (2016) Detecting periodicities with Gaussian processes in PeerJ Computer Science

Hensman J (2015) Fast Nonparametric Clustering of Structured Time-Series in IEEE Transactions on Pattern Analysis and Machine Intelligence

Papastamoulis P (2014) Improved variational Bayes inference for transcript expression estimation in Statistical Applications in Genetics and Molecular Biology

Yeung C-Y C (2014) Regulation of BMP signalling by the tendon peripheral clock in International Journal of Experimental Pathology

 
Description NCSML postdoctoral award
Amount £2,000 (GBP)
Organisation Network on Computational Statistics and Machine Learning 
Sector Academic/University
Country United Kingdom
Start 05/2014 
End 06/2015
 
Description Matthews-Ghahramani 
Organisation University of Cambridge
Country United Kingdom 
Sector Academic/University 
PI Contribution During a recent research visit, I developed statistical methodologies with Alex Matthews and Zoubin Ghahramani.
Collaborator Contribution Collaborative research
Impact Accepted papers at international workshops and international statistics conferences.
Start Year 2014
 
Description Nick Golding 
Organisation University of Oxford
Country United Kingdom 
Sector Academic/University 
PI Contribution With Nick Golding (Dept of Zoology, University of Oxford), I was awarded an NCSML travel award. We plan to apply statistical methods developed by me to large cross-sectional data on disease prevalence.
Collaborator Contribution Collaborative study
Impact None as yet
Start Year 2014
 
Description Qing Jun Meng 
Organisation University of Manchester
Department Faculty of Life Sciences
Country United Kingdom 
Sector Academic/University 
PI Contribution Statistical analysis on longitudinal studies.
Collaborator Contribution Provision of data for longitudinal studies.
Impact The collaboration is multidisciplinary and involves biomedicine and biostatistics.
Start Year 2013
 
Title GPflow 
Description A Gaussian process toolbox using TensorFlow
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Contribution award from Google awarded to collaborator Alexander G. de G. Matthews.
URL https://github.com/GPflow/GPflow
 
Title GPy 
Description A Gaussian process framework in Python
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact GPy has hundreds of users across academia and industry. I'm aware of users at: NASA; BAE Systems; two leading F1 teams; EMBL/EBI; and universities including Cambridge, Oxford, Manchester, Edinburgh, KTH Stockholm, DTU (Technical University of Denmark) and Ecole des Mines St Etienne. It is also the basis of follow-on software products including GPclust and GPyOpt.
URL http://github.com/sheffieldML/GPy