Bayesian models of expression in the transcriptome for clinical RNA-seq

Lead Research Organisation: University of Sheffield

Department Name: Neurosciences

Abstract

Background
--
Science is discovering the exciting world of genes, how they interact, how they differ from person to person and the process by which they ultimately form the templates for proteins, the building blocks of life. But genes are more complicated than we thought. It seems that genes can be transformed: somewhere between the DNA code for a gene and the final protein, genes can be 'spliced' to make different types of proteins and to form other molecules that interact with the gene system.

In a healthy cell, this splicing is useful! It allows us to store the code for different proteins in a single gene. But splicing has been implicated in disease, in particular, Motor Neurone Disease (MND), a debilitating and poorly understood disease. Recently, researchers have found a small part of the genetic code (which we call C9ORF72) which may be a tiny clue in understanding MND, and it seems to have a huge effect on splicing.

New technology is allowing us to uncover the world of genes. Technology developed to sequence the human genome allows us to measure the genes in a cell, right at the point where splicing happens: we call this RNA-Seq. But this sequencing generates LOTS of data, and the amount is increasing. In fact, it's increasing faster than computers are improving. If we want to analyse the data to uncover the world of splicing and the effect it has in MND, we need to create new computational tools which allow us to deal with the data using limited computing resources.

The Problem
--
RNA-Seq presents us with millions of short sequences which represent the genes after splicing has occurred. It's a bit like being presented with a huge bag of jigsaw pieces from thousands of different puzzles: how do the pieces fit together? How many types of picture are there? Which pictures occur most often? There is a lot of uncertainty in the problem, and so we use probabilities to express how the pieces might fit together, and thus how genes have been spliced.

In recent work, I've examined how to make probabilistic algorithms like this one more efficient. Now I will look at how to apply this type of method to RNA-Seq data. My research will create such tools using methods based on approximate Bayesian inference. I'll devise algorithms which can deal with these increasing quantities of data, and allow scientists to make statistical inferences from RNA-Seq data about splicing in disease.

Transferring knowledge
--
To develop such tools, I'll draw inspiration from the related field of natural language processing. With the explosion of the web, data scientists have created methods for organising and categorising our data. One particular statistical method, called a "topic model", closely resembles the analysis of RNA-Seq. With so much focus on the web, there have been lots of developments in topic models that we can borrow to make our algorithms for RNA-Seq faster and better. I'll investigate how we can transfer these ideas to RNA-Seq analysis.

There are lots of statistical models that we might want to adapt to study disease through RNA-Seq. For example, we might want to build a time-series model of gene progression, or we might want to find groups of genes which follow the same pattern or trend. To get the maximum efficiency from the data, I'll build these models right into the reverse-jigsaw problem described above.

Investigating disease
--
My colleagues are in the process of collecting RNA-Seq data on MND. But the nature of the data will present unforeseen statistical challenges. For example, we might have unknown groupings in patients from whom we have data. I'll use the investigations into MND to inspire statistical models built around the RNA-Seq algorithms that I develop. These methods will be inspired by problems in MND research, but will lead to algorithms that can be used by the wider scientific community in experiments which involve RNA-Seq.

Technical Summary

Background
--
RNA-Seq technology is enabling investigation of gene expression at the transcript level, including the identification of alternatively spliced isoforms. In Motor Neurone Disease, alternative splicing has been strongly implicated as a pathogenic mechanism. Applying RNA-Seq to clinical data, and to MND in particular, requires the development of new statistical methodologies to tackle challenges specific to such data. Bayesian statistical methods for RNA-Seq are desirable to deal with the uncertainty in quantifying transcript expression, but existing approaches are prohibitively slow for big data.

Aims & Objectives
--
1) To develop practical algorithms for transcript quantification from RNA-Seq in the Bayesian statistical framework.
2) To build statistical models *around* the transcript quantification problem, addressing problems specific to clinical data.
3) To use the developed algorithms to investigate the effects of splicing in Motor Neurone Disease.

Methodology
--
The Bayesian statistical framework will be the cornerstone of the project. Whilst Bayesian methods are often computationally demanding, I shall make use of approximate posterior inference. I'll build on recent work in this area to make fast algorithms for the analysis of RNA-Seq data. I'll collaborate closely with clinical and wet-lab staff in the SITraN neuroscience facility, giving my work immediate impact on research into MND.
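
As a point of reference, one common form of approximate posterior inference is variational inference. The summary does not state which approximation is intended, so this choice and the symbols below are assumptions. With $R$ the observed reads, $Z$ the unobserved read-to-transcript assignments and $\theta$ the transcript proportions, a tractable distribution $q(Z,\theta)$ is fitted by maximising a lower bound on the log marginal likelihood:

```latex
\log p(R) \;\ge\; \mathbb{E}_{q(Z,\theta)}\big[\log p(R, Z, \theta)\big]
          \;-\; \mathbb{E}_{q(Z,\theta)}\big[\log q(Z,\theta)\big]
```

The gap between the two sides is the Kullback-Leibler divergence $\mathrm{KL}\big(q(Z,\theta)\,\|\,p(Z,\theta \mid R)\big)$, so tightening the bound moves $q$ towards the exact posterior by optimisation rather than sampling.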

Scientific opportunities
--
The quantification of transcripts in RNA-Seq bears a close resemblance to Latent Dirichlet Allocation (LDA), a statistical model used for the analysis of text corpora. Investigation of this link will enable the transfer of knowledge from this field to enable statistical advances for processing RNA-Seq.
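
The resemblance can be made concrete with a small sketch. It is illustrative only and is not the project's algorithm. It treats reads like words and transcripts like topics whose contents are already known, and fits the Dirichlet-distributed transcript proportions with mean-field variational Bayes. The function names, the `read_likelihoods` input (a precomputed probability of each read under each transcript) and the prior `alpha` are all hypothetical.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0.

    Shifts x upwards with the recurrence psi(x) = psi(x + 1) - 1/x,
    then applies the standard asymptotic series.
    """
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def vb_transcript_proportions(read_likelihoods, alpha=1.0, n_iter=100):
    """Mean-field variational Bayes for a mixture with fixed components.

    read_likelihoods[n][t] is an assumed, precomputed p(read n | transcript t);
    zero means read n is incompatible with transcript t. Every read must be
    compatible with at least one transcript.
    Returns (alpha_post, phi): the Dirichlet parameters of q(theta) and the
    per-read assignment probabilities.
    """
    n_reads = len(read_likelihoods)
    n_transcripts = len(read_likelihoods[0])
    # Start from an even split of the reads across transcripts.
    alpha_post = [alpha + n_reads / n_transcripts] * n_transcripts
    phi = []
    for _ in range(n_iter):
        # Update q(Z): phi[n][t] is proportional to
        # p(read n | t) * exp(E[log theta_t]). The term shared by all
        # transcripts, digamma(sum(alpha_post)), cancels on normalisation.
        weights = [math.exp(digamma(a)) for a in alpha_post]
        phi = []
        for row in read_likelihoods:
            unnorm = [lik * w for lik, w in zip(row, weights)]
            total = sum(unnorm)
            phi.append([u / total for u in unnorm])
        # Update q(theta): prior pseudo-count plus expected read counts.
        alpha_post = [
            alpha + sum(p[t] for p in phi) for t in range(n_transcripts)
        ]
    return alpha_post, phi


# Toy data (made up): three reads match only transcript 0, one matches
# only transcript 1, and one is ambiguous between the two.
reads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
alpha_post, phi = vb_transcript_proportions(reads)
```

In this toy case the ambiguous read is shared between the two transcripts according to how much unambiguous evidence each already has, which is the same mechanism by which a topic model assigns an ambiguous word to the topic that dominates its document.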

Planned Impact

The project has the potential to enhance the research capacity of other research institutions by creating better tools for the analysis of RNA-Seq data.

A workshop for RNA-Seq practitioners will be held. This would advertise the work amongst the target user group, as well as ensuring best practice in utilisation of the work.

Commercial exploitation of the software exists as an opportunity for impact. This could be achieved through the production of licensed software within Sheffield, through collaboration with groups within the computer science department (such as the GeneSys project), or by collaboration with RNA-Seq manufacturers. In accordance with the guidelines, if a commercial opportunity is not available, an open source software implementation will maximise the availability of the resulting software.

Statistical methodologies developed during the project will have potential to impact more widely than the area of RNA-Seq analysis. Whilst I intend to borrow technology from the area of Natural Language Processing, it is possible that the methodologies for RNA-Seq analysis are applicable in NLP also. To explore this, I can engage with the NLP group in Sheffield.

The project will contribute toward the health of academic disciplines by crossing boundaries between machine learning and biostatistics. The ensuing collaborations may lead to further knowledge transfer in the future.

The application is aligned with major elements of the UK National Strategic Agenda as outlined in reviews and reports from Government Departments and Non-Departmental Public Bodies:

-Joint HM Treasury/DTI/DES 2004-2014 Science and Innovation Investment Framework objectives: building world class UK centres of research excellence to support growth in internationally mobile R&D and highly skilled people.

-The Department for Business, Innovation and Skills Strategy for Life Sciences 2011. The project is aligned with key elements of this strategy including: attracting, developing and rewarding talent;

Funded Value:

£376,962

Funded Period:

Aug 13 - Aug 15

Funder:

MRC

Project Status:

Closed

Project Category:

Fellowship

Project Reference:

MR/K022016/1

Principal Investigator:

James Hensman

Health Category:

Unclassified

Organisations

People	ORCID iD
James Hensman (Principal Investigator / Fellow)
Neil Lawrence (Researcher)	http://orcid.org/0000-0001-9258-1030

Publications

Matthews A (2016) On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes

Saul A (2016) Chained Gaussian Processes

Amin S (2015) Hoxa2 selectively enhances Meis binding to change a branchial arch ground state. in Developmental cell

Durrande N (2016) Detecting periodicities with Gaussian processes in PeerJ Computer Science

Hensman J (2015) Fast Nonparametric Clustering of Structured Time-Series. in IEEE transactions on pattern analysis and machine intelligence

Hensman J (2015) Fast and accurate approximate inference of transcript expression from RNA-seq data. in Bioinformatics (Oxford, England)

Hensman J (2015) MCMC for Variationally sparse Gaussian Processes

Hensman J (2015) Scalable Variational Gaussian Process Classification

Papastamoulis P (2014) Improved variational Bayes inference for transcript expression estimation. in Statistical applications in genetics and molecular biology

Yeung C-Y C (2014) Regulation of BMP signalling by the tendon peripheral clock in International Journal of Experimental Pathology

Related Projects

Project Reference	Relationship	Related To	Start	End	Award Value
MR/K022016/1			31/08/2013	30/08/2015	£376,962
MR/K022016/2	Transfer	MR/K022016/1	31/08/2015	30/08/2017	£198,408

Further Funding
Collaboration
Software and Technical Products


Description	NCSML postdoctoral award
Amount	£2,000 (GBP)
Organisation	Network on Computational Statistics and Machine Learning
Sector	Academic/University
Country	United Kingdom
Start	05/2014
End	06/2015


Description	Matthews-Ghahramani
Organisation	University of Cambridge
Country	United Kingdom
Sector	Academic/University
PI Contribution	During a recent research visit, I developed statistical methodologies with Alex Matthews and Zoubin Ghahramani.
Collaborator Contribution	Collaborative research
Impact	Accepted papers at international workshops and international statistics conferences.
Start Year	2014


Description	Nick Golding
Organisation	University of Oxford
Country	United Kingdom
Sector	Academic/University
PI Contribution	With Nick Golding (Dept of Zoology, University of Oxford), I was awarded an NCSML travel award. We plan to apply statistical methods developed by me to large cross-sectional data on disease prevalence.
Collaborator Contribution	Collaborative study
Impact	None as yet
Start Year	2014


Description	Qing Jun Meng
Organisation	University of Manchester
Department	Faculty of Life Sciences
Country	United Kingdom
Sector	Academic/University
PI Contribution	Statistical analysis on longitudinal studies.
Collaborator Contribution	Provision of data for longitudinal studies.
Impact	The collaboration is multidisciplinary and involves biomedicine and biostatistics.
Start Year	2013


Title	GPflow
Description	A Gaussian process toolbox using TensorFlow
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	Contribution award from Google awarded to collaborator Alexander G. de G. Matthews.
URL	https://github.com/GPflow/GPflow


Title	GPy
Description	A Gaussian Process framework in Python
Type Of Technology	Software
Year Produced	2013
Open Source License?	Yes
Impact	GPy has hundreds of users across academia and industry. I'm aware of users at NASA; BAE Systems; two leading F1 teams; EMBL/EBI; and universities including Cambridge, Oxford, Manchester, Edinburgh, KTH Stockholm, DTU Danish Technical University and Ecole des Mines St Etienne. It is also the basis of follow-on software products including GPclust and GPyOpt.
URL	http://github.com/sheffieldML/GPy