Bayesian models of expression in the transcriptome for clinical RNA-seq
Lead Research Organisation: University of Sheffield
Department Name: Neurosciences
Abstract
Background
--
Science is discovering the exciting world of genes: how they interact, how they differ from person to person and the process by which they ultimately form the templates for proteins, the building blocks of life. But genes are more complicated than we thought. It seems that genes can be transformed: somewhere between the DNA code for a gene and the final protein, genes can be 'spliced' to make different types of proteins and to form other molecules that interact with the gene system.
In a healthy cell, this splicing is useful! It allows us to store the code for different proteins in a single gene. But splicing has been implicated in disease, in particular Motor Neurone Disease (MND), a debilitating and poorly understood condition. Recently, researchers have found a small part of the genetic code (which we call C9ORF72) which may be a tiny clue in understanding MND, and it seems to have a huge effect on splicing.
New technology is allowing us to uncover the world of genes. Technology developed to sequence the human genome allows us to measure the genes in a cell, right at the point where splicing happens: we call this RNA-Seq. But this sequencing generates LOTS of data, and the amount is increasing. In fact, it's increasing faster than computers are improving. If we want to analyse the data to uncover the world of splicing and the effect it has in MND, we need to create new computational tools which allow us to deal with the data using limited computing resources.
The Problem
--
RNA-Seq presents us with millions of short sequences which represent the genes after splicing has occurred. It's a bit like being presented with a huge bag of jigsaw pieces from thousands of different puzzles: how do the pieces fit together? How many types of picture are there? Which pictures occur most often? There is a lot of uncertainty in the problem, and so we use probabilities to express how the pieces might fit together, and thus how genes have been spliced.
In recent work, I've examined how to make probabilistic algorithms like this one more efficient. Now I will look at how to make use of this type of method in RNA-Seq data. My research will create such tools using methods based on approximate Bayesian inference. I'll devise algorithms which can deal with these increasing quantities of data, and allow scientists to make statistical inferences from RNA-Seq data about splicing in disease.
Transferring knowledge
--
To develop such tools, I'll draw inspiration from the related field of natural language processing. With the explosion of the web, data scientists have created methods for organising and categorising our data. One particular statistical method, called a "topic model", closely resembles the analysis of RNA-Seq. With so much focus on the web, there have been lots of developments in topic models that we can borrow to make our algorithms for RNA-Seq faster and better. I'll investigate how we can transfer these ideas to RNA-Seq analysis.
There are lots of statistical models that we might want to adapt to study disease through RNA-Seq. For example, we might want to build a time-series model of gene progression, or we might want to find groups of genes which follow the same pattern or trend. To get the maximum efficiency from the data, I'll build these models right into the reverse-jigsaw problem described above.
Investigating disease
--
My colleagues are in the process of collecting RNA-Seq data on MND. But the nature of the data will present unforeseen statistical challenges. For example, we might have unknown groupings in patients from whom we have data. I'll use the investigations into MND to inspire statistical models built around the RNA-Seq algorithms that I develop. These methods will be inspired by problems in MND research, but will lead to algorithms that can be used by the wider scientific community in experiments which involve RNA-Seq.
Technical Summary
Background
--
RNA-Seq technology is enabling investigation of gene expression at the transcript level, including the identification of alternatively spliced isoforms. In Motor Neurone Disease, alternative splicing has been strongly implicated as a pathogenic mechanism. Applying RNA-Seq to clinical data, and to MND in particular, requires the development of new statistical methodologies to tackle challenges specific to such data. Bayesian statistical methods for RNA-Seq are desirable to deal with the uncertainty in quantifying transcript expression, but existing approaches are prohibitively slow for big data.
Aims & Objectives
--
1) To develop practical algorithms for transcript quantification from RNA-Seq in the Bayesian statistical framework.
2) To build statistical models *around* the transcript quantification problem, addressing problems specific to clinical data.
3) To use the developed algorithms to investigate the effects of splicing in Motor Neurone Disease.
Methodology
--
The Bayesian statistical framework will be the cornerstone of the project. Whilst Bayesian methods are often computationally demanding, I shall make use of approximate posterior inference. I'll build on recent work in this area to make fast algorithms for the analysis of RNA-Seq data. I'll collaborate closely with clinical and wet-lab staff in the SITraN neuroscience facility, giving my work immediate impact on research into MND.
Scientific opportunities
--
The quantification of transcripts in RNA-Seq bears a close resemblance to Latent Dirichlet Allocation (LDA), a statistical model used for the analysis of text corpora. Investigating this link will allow knowledge to be transferred from that field, enabling statistical advances in the processing of RNA-Seq data.
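As a rough illustration of this resemblance, the sketch below treats each read as a "word" and each transcript as a "topic": every read has a latent transcript assignment, and the transcript proportions carry a Dirichlet prior. It uses a collapsed Gibbs sampler, a standard inference scheme for LDA-style models, chosen here only for brevity; it is not the approximate posterior inference proposed in this project. The function name, the `lik` matrix of read-given-transcript probabilities and the toy data are all assumptions made for the example.

```python
import numpy as np


def gibbs_transcript_proportions(lik, alpha=1.0, n_sweeps=100, burn_in=20, seed=0):
    """Collapsed Gibbs sampler for a mixture of reads over transcripts.

    lik[n, t] is a hypothetical probability of read n given that it came
    from transcript t (zero where the read is incompatible with t).
    Returns a posterior-mean estimate of the transcript proportions.
    """
    rng = np.random.default_rng(seed)
    n_reads, n_transcripts = lik.shape
    # Initialise each read's assignment from its normalised likelihood row.
    z = np.array([rng.choice(n_transcripts, p=row / row.sum()) for row in lik])
    counts = np.bincount(z, minlength=n_transcripts).astype(float)
    theta_sum = np.zeros(n_transcripts)
    for sweep in range(n_sweeps):
        for n in range(n_reads):
            counts[z[n]] -= 1  # take read n out of the counts
            # Conditional for read n: (other reads' counts + prior) * likelihood.
            p = (counts + alpha) * lik[n]
            z[n] = rng.choice(n_transcripts, p=p / p.sum())
            counts[z[n]] += 1
        if sweep >= burn_in:
            # Posterior mean of the proportions given the current assignments.
            theta_sum += (counts + alpha) / (n_reads + alpha * n_transcripts)
    return theta_sum / (n_sweeps - burn_in)


# Toy data (made up): 60 reads unique to transcript 0, 20 reads shared
# between transcripts 0 and 1, and 20 reads unique to transcript 2.
lik = np.vstack([
    np.tile([1.0, 0.0, 0.0], (60, 1)),
    np.tile([0.5, 0.5, 0.0], (20, 1)),
    np.tile([0.0, 0.0, 1.0], (20, 1)),
])
theta = gibbs_transcript_proportions(lik)
print(theta.round(3))
```

In this toy case the shared reads are drawn towards transcript 0, because the reads unique to that transcript make it the more plausible source. The per-read loop makes clear why sampling of this kind becomes slow at the scale of real RNA-Seq data, which is the motivation for faster approximate inference.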
Planned Impact
The project has the potential to enhance the research capacity of other research institutions by creating better tools for the analysis of RNA-Seq data.
A workshop for RNA-Seq practitioners will be held. This will advertise the work amongst the target user group and help ensure best practice in its use.
Commercial exploitation of the software is a further opportunity for impact. This could be achieved through the production of licensed software within Sheffield, in collaboration with groups within the computer science department (such as the GeneSys project), or through collaboration with RNA-Seq manufacturers. In accordance with the guidelines, if a commercial opportunity is not available, an open source software implementation will maximise the availability of the resulting software.
Statistical methodologies developed during the project have the potential for impact beyond the area of RNA-Seq analysis. Whilst I intend to borrow technology from the area of Natural Language Processing (NLP), it is possible that the methodologies for RNA-Seq analysis are also applicable in NLP. To explore this, I can engage with the NLP group in Sheffield.
The project will contribute toward the health of academic disciplines by crossing boundaries between machine learning and biostatistics. The ensuing collaborations may lead to further knowledge transfer in the future.
The application is aligned with major elements of the UK National Strategic Agenda as outlined in reviews and reports from Government Departments and Non-Departmental Public Bodies:
-Joint HM Treasury/DTI/DES 2004-2014 Science and Innovation Investment Framework objectives: building world class UK centres of research excellence to support growth in internationally mobile R&D and highly skilled people.
-The Department for Business, Innovation and Skills Strategy for Life Sciences 2011. The project is aligned with key elements of this strategy including: attracting, developing and rewarding talent;
Publications
Saul A
(2016)
Chained Gaussian Processes
Amin S
(2015)
Hoxa2 selectively enhances Meis binding to change a branchial arch ground state.
in Developmental cell
Durrande N
(2016)
Detecting periodicities with Gaussian processes
in PeerJ Computer Science
Hensman J
(2015)
Fast Nonparametric Clustering of Structured Time-Series.
in IEEE transactions on pattern analysis and machine intelligence
Hensman J
(2015)
Fast and accurate approximate inference of transcript expression from RNA-seq data.
in Bioinformatics (Oxford, England)
Hensman J
(2015)
Scalable Variational Gaussian Process Classification
Hensman J
(2015)
MCMC for Variationally sparse Gaussian Processes
Papastamoulis P
(2014)
Improved variational Bayes inference for transcript expression estimation.
in Statistical applications in genetics and molecular biology
Yeung C-Y C.
(2014)
Regulation of BMP signalling by the tendon peripheral clock
in International Journal of Experimental Pathology
Description | NCSML postdoctoral award |
Amount | £2,000 (GBP) |
Organisation | Network on Computational Statistics and Machine Learning |
Sector | Academic/University |
Country | United Kingdom |
Start | 05/2014 |
End | 06/2015 |
Description | Matthews-Ghahramani |
Organisation | University of Cambridge |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | During a recent research visit, I developed statistical methodologies with Alex Matthews and Zoubin Ghahramani. |
Collaborator Contribution | Collaborative research |
Impact | Accepted papers at international workshops and an international statistics conference. |
Start Year | 2014 |
Description | Nick Golding |
Organisation | University of Oxford |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | With Nick Golding (Dept of Zoology, University of Oxford), I was awarded an NCSML travel award. We plan to apply statistical methods developed by me to large cross-sectional data on disease prevalence. |
Collaborator Contribution | Collaborative study |
Impact | None as yet |
Start Year | 2014 |
Description | Qing Jun Meng |
Organisation | University of Manchester |
Department | Faculty of Life Sciences |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Statistical analysis on longitudinal studies. |
Collaborator Contribution | Provision of data for longitudinal studies. |
Impact | The collaboration is multidisciplinary and involves biomedicine and biostatistics. |
Start Year | 2013 |
Title | GPflow |
Description | A Gaussian process toolbox using TensorFlow |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | Contribution award from Google awarded to collaborator Alexander G. de G. Matthews. |
URL | https://github.com/GPflow/GPflow |
Title | GPy |
Description | A Gaussian process framework in Python |
Type Of Technology | Software |
Year Produced | 2013 |
Open Source License? | Yes |
Impact | GPy has hundreds of users across academia and industry. I'm aware of users at NASA, BAE Systems, two leading F1 teams, EMBL/EBI, and universities including Cambridge, Oxford, Manchester, Edinburgh, KTH Stockholm, DTU Danish Technical University and Ecole des Mines St Etienne. It is also the basis of follow-on software products including GPclust and GPyOpt. |
URL | http://github.com/sheffieldML/GPy |