Bayesian models of expression in the transcriptome for clinical RNA-seq

Lead Research Organisation: Lancaster University
Department Name: Medicine

Abstract

Background
--
Science is discovering the exciting world of genes: how they interact, how they differ from person to person, and the process by which they ultimately form the templates for proteins, the building blocks of life. But genes are more complicated than we thought. It seems that genes can be transformed: somewhere between the DNA code for a gene and the final protein, genes can be 'spliced' to make different types of proteins and to form other molecules that interact with the gene system.

In a healthy cell, this splicing is useful! It allows us to store the code for different proteins in a single gene. But splicing has been implicated in disease, in particular, Motor Neurone Disease (MND), a debilitating and poorly understood disease. Recently, researchers have found a small part of the genetic code (which we call C9ORF72) which may be a tiny clue in understanding MND, and it seems to have a huge effect on splicing.

New technology is allowing us to uncover the world of genes. Technology developed to sequence the human genome allows us to measure the genes in a cell, right at the point where splicing happens: we call this RNA-Seq. But this sequencing generates LOTS of data, and the amount is increasing. In fact, it's increasing faster than computers are improving. If we want to analyse the data to uncover the world of splicing and the effect it has in MND, we need to create new computational tools which allow us to deal with the data using limited computing resources.

The Problem
--
RNA-Seq presents us with millions of short sequences which represent the genes after splicing has occurred. It's a bit like being presented with a huge bag of jigsaw pieces from thousands of different puzzles: how do the pieces fit together? How many types of picture are there? Which pictures occur most often? There is a lot of uncertainty in the problem, and so we use probabilities to express how the pieces might fit together, and thus how genes have been spliced.

In recent work, I've examined how to make probabilistic algorithms like this one more efficient. Now I will look at how to apply this type of method to RNA-Seq data. My research will create such tools using methods based on approximate Bayesian inference. I'll devise algorithms which can deal with these increasing quantities of data, and allow scientists to make statistical inferences from RNA-Seq data about splicing in disease.

Transferring knowledge
--
To develop such tools, I'll draw inspiration from the related field of natural language processing. With the explosion of the web, data scientists have created methods for organising and categorising our data. One particular statistical method, called a "topic model", closely resembles the analysis of RNA-Seq. With so much focus on the web, there have been lots of developments in topic models that we can borrow to make our algorithms for RNA-Seq faster and better. I'll investigate how we can transfer these ideas to RNA-Seq analysis.

There are lots of statistical models that we might want to adapt to study disease through RNA-Seq. For example, we might want to build a time-series model of gene progression, or we might want to find groups of genes which follow the same pattern or trend. To get the maximum efficiency from the data, I'll build these models right into the reverse-jigsaw problem described above.

Investigating disease
--
My colleagues are in the process of collecting RNA-Seq data on MND. But the nature of the data will present unforeseen statistical challenges. For example, we might have unknown groupings in patients from whom we have data. I'll use the investigations into MND to inspire statistical models built around the RNA-Seq algorithms that I develop. These methods will be inspired by problems in MND research, but will lead to algorithms that can be used by the wider scientific community in experiments which involve RNA-Seq.

Technical Summary

Background
--
RNA-Seq technology is enabling investigation of gene expression at the transcript level, including the identification of alternatively spliced isoforms. In Motor Neurone Disease, alternative splicing has been strongly implicated as a pathogenic mechanism. RNA-Seq for clinical data, and for MND in particular, requires the development of new statistical methodologies to tackle challenges specific to such data. Bayesian statistical methods for RNA-Seq are desirable to deal with the uncertainty in quantifying transcript expression, but existing approaches are prohibitively slow for big data.

Aims & Objectives
--
1) To develop practical algorithms for transcript quantification from RNA-Seq in the Bayesian statistical framework.
2) To build statistical models *around* the transcript quantification problem, addressing problems specific to clinical data.
3) To use the developed algorithms to investigate the effects of splicing in Motor Neurone Disease.

Methodology
--
The Bayesian statistical framework will be the cornerstone of the project. Whilst Bayesian methods are often computationally demanding, I shall make use of approximate posterior inference. I'll build on recent work in this area to make fast algorithms for the analysis of RNA-Seq data. I'll collaborate closely with clinical and wet-lab staff in the SITraN neuroscience facility, giving my work immediate impact on research into MND.

Scientific opportunities
--
The quantification of transcripts in RNA-Seq bears a close resemblance to Latent Dirichlet Allocation (LDA), a statistical model used for the analysis of text corpora. Investigating this link will allow knowledge from that field to be transferred, driving statistical advances in the processing of RNA-Seq data.
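
To make the resemblance concrete, the following is a minimal sketch and not the project's method. It treats each read as a "word" drawn from a mixture over transcripts (the "topics") and estimates the mixture proportions with plain expectation-maximisation, i.e. maximum likelihood, rather than the approximate Bayesian inference proposed here. The function name `estimate_abundances` and the toy compatibility matrix are hypothetical and chosen only for illustration.

```python
def estimate_abundances(compat, n_iter=200):
    """EM for the mixture proportions over transcripts.

    compat[i][k] is an assumed likelihood of read i under transcript k
    (0.0 when the read cannot have come from that transcript).
    Returns a list of proportions, one per transcript, summing to 1.
    """
    n_tx = len(compat[0])
    theta = [1.0 / n_tx] * n_tx  # start from equal proportions

    for _ in range(n_iter):
        counts = [0.0] * n_tx
        for row in compat:
            # E-step: responsibility of each transcript for this read
            weights = [theta[k] * row[k] for k in range(n_tx)]
            total = sum(weights)
            if total == 0.0:
                continue  # read is incompatible with every transcript
            for k in range(n_tx):
                counts[k] += weights[k] / total

        assigned = sum(counts)
        if assigned == 0.0:
            break  # nothing could be assigned; keep current estimate
        # M-step: proportions are the normalised expected read counts
        theta = [c / assigned for c in counts]

    return theta


# Toy data: 6 reads map only to transcript 0, 2 only to transcript 1,
# and 2 are ambiguous between the two.
toy_reads = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 2 + [[1.0, 1.0]] * 2
proportions = estimate_abundances(toy_reads)
```

On this toy input the ambiguous reads are shared out in proportion to the current estimates, so the proportions settle at about 0.75 and 0.25. A Bayesian treatment would instead place a Dirichlet prior on the proportions, as LDA does, and report posterior uncertainty rather than a single point estimate.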

Planned Impact

The project has the potential to enhance the research capacity of other research institutions by creating better tools for the analysis of RNA-Seq data.

A workshop for RNA-Seq practitioners will be held. This will advertise the work amongst the target user group and help ensure best practice in its use.

Commercial exploitation of the software is one opportunity for impact. This could be achieved through the production of licensed software within Sheffield, in collaboration with groups within the computer science department (such as the GeneSys project), or by collaboration with RNA-Seq manufacturers. In accordance with the guidelines, if a commercial opportunity is not available, an open source software implementation will maximise the availability of the resulting software.

Statistical methodologies developed during the project will have the potential for impact beyond the area of RNA-Seq analysis. Whilst I intend to borrow technology from the area of Natural Language Processing (NLP), it is possible that the methodologies for RNA-Seq analysis are also applicable in NLP. To explore this, I can engage with the NLP group in Sheffield.

The project will contribute toward the health of academic disciplines by crossing boundaries between machine learning and biostatistics. The ensuing collaborations may lead to further knowledge transfer in the future.

The application is aligned with major elements of the UK National Strategic Agenda as outlined in reviews and reports from Government Departments and Non-Departmental Public Bodies:

-Joint HM Treasury/DTI/DES 2004-2014 Science and Innovation Investment Framework objectives: building world class UK centres of research excellence to support growth in internationally mobile R&D and highly skilled people.

-The Department for Business, Innovation and Skills Strategy for Life Sciences 2011. The project is aligned with key elements of this strategy including: attracting, developing and rewarding talent;
 
Description Nicolas Durrande 
Organisation National School of Mines of Saint-Étienne
Country France 
Sector Academic/University 
PI Contribution Collaboration with Dr Nicolas Durrande, Associate Professor at École Nationale Supérieure des Mines de Saint-Étienne. We investigated Fourier methods for scaling computation in Gaussian processes. A pre-print paper is available.
Collaborator Contribution Dr Durrande co-wrote a paper. École des Mines de Saint-Étienne funded an academic visit by James Hensman to Saint-Étienne.
Impact ArXiv pre-print: https://arxiv.org/abs/1611.06740; open source code available: github.com/jameshensman/vff
Start Year 2016
 
Description Spacelabs 
Organisation Spacelabs Healthcare
Country United States 
Sector Private 
PI Contribution I supervised two MSc research projects on blood pressure monitoring; the results were presented to Spacelabs Healthcare.
Collaborator Contribution Spacelabs Healthcare sponsored two MSc research projects as part of the Data Science Institute at Lancaster.
Impact Two MSc projects completed as part of the Lancaster Data Science MSc.
Start Year 2016