Latent variable models for dissecting cell-to-cell heterogeneity in single-cell profiling data

Lead Research Organisation: European Bioinformatics Institute
Department Name: Stegle Group

Abstract

The cell is often viewed as the fundamental unit in biology. The cells within the organs of our bodies and other organisms differ from each other in their molecular state and can be characterised in terms of the set of genes they express. Conventional techniques for measuring this gene expression are based on large populations of cells; rather than quantifying the gene expression of single cells, they rely on measuring the average gene expression of a very large number of cells. Many important research questions, however, can only be addressed when analysing cells individually. For example, single-cell analyses are vital for identifying rare cell types (e.g. tumour cells in blood samples), or for providing insights into disease development (e.g. better characterisation of the differentiation of blood stem cells is crucial for understanding leukaemia). Many recent technological advances now allow us to perform such single-cell profiling and tackle some of these open questions. However, as with any new technology, there are challenges that must be overcome before this powerful approach can attain its full potential.

In this proposal I am planning to address the fundamental challenges which arise in the statistical analysis of this new type of data. In particular, these include problems in the reliable identification of rare cell types and the presence of confounding factors. When measuring the expression of all genes in an ensemble of individual cells, the observed RNA output (which is a measure of the gene expression strength) between cells will always vary. Some of the variation may arise from technical artefacts (the experimental procedure may just be more efficient in one cell than another) while other variation might be due to a multitude of hidden biological processes that cannot be observed directly, e.g. effects of the cell cycle, which is a confounding factor that in turn can mask other processes of interest, such as differentiation. Consequently, for any analysis of single-cell data, it is vital to understand and uncover the reasons behind the observed variability in gene expression.
In the course of this fellowship, I will use a statistical approach called latent variable models to unravel the observed variability between single cells. This in turn will enable the reliable identification of rare cell types within an ensemble of single cells and could be used to formally test the statistical significance of individual biological processes.
A related challenge I will address is the difficulty of identifying the order in which gene expression patterns change during differentiation. As gene expression can only be measured once for each cell, data-sets from cell differentiation studies consist of multiple 'snapshots', each containing a mixed population of cells that are in different stages of differentiation. I will therefore extend latent variable models to reveal the hidden temporal order of the cells and identify cells at various stages within the differentiation process.
I will work together with leading experimental groups and apply the newly developed methods to study differentiation processes in blood stem cells as well as in immune cells. This real-world application of my methods has the potential ultimately to provide new insights into the biology of infection, autoimmune diseases and leukaemia.
Finally, I will provide freely available and user-friendly open-source software, facilitating the broad applicability of my proposed methods in any lab performing single-cell experiments.

Technical Summary

Conventional techniques for profiling cells quantify the average RNA abundance in large populations of cells. Recent technological advances now enable the quantification of transcriptional abundance at the single-cell level, by assaying the complete transcriptome of individual cells (e.g. using single-cell RNA sequencing technology). Datasets derived using this technology can be used to identify distinct molecular states (sub-populations) within heterogeneous populations of cells and yield new insights in core cellular processes. The aim of this proposal is to develop the necessary statistical methods to address pertinent statistical challenges when analysing these data. In particular, we aim to model the cell-to-cell heterogeneity that is caused by the complex interplay of multiple hidden biological factors. Importantly, both technical noise and biological processes underlie this variability. These hidden (latent) processes may include both processes of interest (e.g. differentiation) and confounding processes (e.g. cell cycle). I propose to develop latent variable models to dissect the observed variability. To this end, I will exploit prior knowledge on the sources of heterogeneity to establish a dictionary of latent variables that is able to fully capture the observed variability in gene expression data. In turn, these inferred latent variables will enable the efficient identification of sub-populations and can be used to formally test for the significance of individual latent processes. Furthermore, I will use Gaussian Process models to infer the transcriptional dynamics of differentiating cells from snapshot data of largely unsynchronised cells. Finally, I will develop user-friendly open-source software in order to facilitate broad application of the proposed methods. In collaboration with leading experimental groups, I will apply the newly developed methods to study differentiation processes in blood stem/progenitor cells as well as in immune cells.
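To make the idea of a latent variable concrete, the following is a minimal sketch and not the models proposed here: it simulates toy expression data driven by a single hidden per-cell factor (standing in for, say, cell cycle) and recovers that factor as the first principal component via power iteration. All function names, data sizes and noise levels are assumptions chosen for illustration; the proposed methods use richer, annotation-informed factor models.

```python
import math
import random


def simulate(n_cells=60, n_genes=40, noise_sd=0.5, seed=0):
    """Toy data (assumed): one hidden per-cell factor times per-gene loadings, plus noise."""
    rng = random.Random(seed)
    factor = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]    # hidden cell state
    loadings = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]  # effect on each gene
    expr = [[f * w + rng.gauss(0.0, noise_sd) for w in loadings] for f in factor]
    return expr, factor


def first_latent_factor(expr, n_iter=100, seed=1):
    """Estimate one latent factor (first principal component) by power iteration.

    Returns the unit-norm per-cell factor and the fraction of total variance it explains.
    """
    n_cells, n_genes = len(expr), len(expr[0])
    # Centre each gene so the factor captures variation between cells.
    means = [sum(row[j] for row in expr) / n_cells for j in range(n_genes)]
    centred = [[row[j] - means[j] for j in range(n_genes)] for row in expr]

    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]  # random start vector
    for _ in range(n_iter):
        # Map to gene space (centred^T x), back to cell space (centred v), then normalise.
        v = [sum(centred[i][j] * x[i] for i in range(n_cells)) for j in range(n_genes)]
        x = [sum(centred[i][j] * v[j] for j in range(n_genes)) for i in range(n_cells)]
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]

    # Variance captured by the factor relative to the total variance.
    proj = [sum(centred[i][j] * x[i] for i in range(n_cells)) for j in range(n_genes)]
    total = sum(c * c for row in centred for c in row)
    explained = sum(p * p for p in proj) / total
    return x, explained


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    expr, true_factor = simulate()
    estimate, explained = first_latent_factor(expr)
    # The sign of a latent factor is arbitrary, so compare absolute correlation.
    print("abs. correlation with hidden factor:", abs(pearson(estimate, true_factor)))
    print("fraction of variance explained:", explained)
```

In this toy setting the estimated factor can then be regressed out or tested for significance; real single-cell data would additionally require count-appropriate noise models and several factors at once.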

Planned Impact

The proposed research on developing statistical methods for analysis of single-cell profiling data has the potential to impact researchers from a wide range of disciplines, mainly in the biomedical sciences but also in the field of statistical learning. In the short term, my work will benefit the increasing number of researchers from different disciplines conducting single-cell studies, as the challenges I am planning to address are likely to be issues in a considerable fraction of such experiments. While I will mainly develop methods for analysis of single-cell RNA-seq data, these methods are also highly relevant in other single-cell "omics" applications, e.g. single-cell metabolomics, which I will explore in work package 3.
In order to maximise the circle of potential users of the proposed methods, I am planning to develop a user-friendly, open-source software package that includes detailed documentation. This will enable many researchers from different backgrounds to use the proposed methods. I am confident that my proposed methods will help some of these researchers gain novel, biologically relevant insights from their data.
Part of the proposed research involves the development and application of efficient inference methods in structured covariance models. Therefore, the community of researchers working in statistical learning will also benefit from the proposed work, potentially facilitating the development of novel methods and applications built upon its foundations.

In the long term, the proposed research has the potential to have an impact on central public health issues. I am planning to collaborate with Dr Teichmann and Prof Göttgens, whose research is focused on transcriptional regulation of the mouse immune system and blood stem cell differentiation, respectively. Ultimately, within these collaborations, the proposed research may contribute to new insights into the biology of infection, autoimmune diseases and leukaemia.

In addition, the proposed research may have an impact in the biotechnological sector. The EMBL-EBI Industry Programme provides a forum for knowledge exchange with industrial partners, which could pave the way for development of new commercial products.

I will also make my research accessible to the general public via press releases and non-technical summaries of scientific articles. By communicating my research to the general public I am hoping to raise awareness of the importance of biostatistics in the age of ever-increasing data and perhaps ignite the interest of the next generation of researchers.

On a related note, I am hoping to continue to engage with current students and researchers, as well as professionals from other disciplines, through various teaching activities. In this way, I am keen to contribute to the training of future computational biologists and biostatisticians.

In summary, the proposed research has the potential to be of great benefit to the knowledge economy of the UK, including the advancement of scientific knowledge, and to the long-term improvement of public health.

Publications

 
Title Computational assignment of cell-cycle stage from single-cell transcriptome data 
Description We developed a method that assigns cells to a specific phase of the cell cycle based on gene expression data. We validated our model on single-cell and bulk data.
Type Of Material Computer model/algorithm 
Year Produced 2016 
Provided To Others? Yes  
Impact Other research groups interested in cell cycle can use this tool to advance their research, performing in-silico cell cycle staging. 
 
Title Diffusion maps for visualizing single cell gene expression data 
Description We developed a new model for visualizing the high dimensional gene expression data from single cell experiments. Our tool is based on diffusion maps and allows for efficient visualisation of smooth changes of intrinsic cell state within an ensemble of single cells. 
Type Of Material Data analysis technique 
Year Produced 2016 
Provided To Others? Yes  
Impact Our tool is used by different research groups working with single cell data around the world. 
 
Title Diffusion pseudotime to reconstruct lineage branching 
Description We developed a model that reconstructs branching lineages of differentiating stem cells.
Type Of Material Computer model/algorithm 
Year Produced 2016 
Provided To Others? Yes  
Impact Other research groups interested in analyzing cell differentiation can use this tool to advance their research, performing in-silico lineage tracing. 
 
Title Prospective identification of hematopoietic lineage choice 
Description We developed a tool to prospectively identify hematopoietic lineage choice based on time lapse microscopy data; we implemented and trained a deep learning network to solve this task. 
Type Of Material Computer model/algorithm 
Year Produced 2017 
Provided To Others? Yes  
Impact Other research groups interested in analyzing time lapse data of differentiating cells can use this tool to advance their research. 
 
Title Factorial single-cell latent variable model
Description We developed a scalable modelling framework for single-cell RNA-seq data that uses gene set annotations to dissect single-cell transcriptome heterogeneity, making it possible to identify biological drivers of cell-to-cell variability and to model confounding factors.
Type Of Material Computer model/algorithm 
Year Produced 2016 
Provided To Others? Yes  
Impact Other research groups interested in dissecting the variability in single-cell data can use this tool to advance their research. 
 
Description Factor analysis model for multi-omics data 
Organisation European Molecular Biology Laboratory
Department European Molecular Biology Laboratory Heidelberg
Country Germany 
Sector Academic/University 
PI Contribution Contributed to the development of a factor analysis model for analysing multi-omics data
Collaborator Contribution Interpreted the results and implemented the model.
Impact paper in revision at MSB
Start Year 2016
 
Description Multi-omics single-cell analysis of differentiating mouse embryonic stem cells 
Organisation Babraham Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution Understanding the differentiation process of mouse embryonic stem cells in terms of coordinated changes of the methylome and transcriptome, at a single cell level
Collaborator Contribution Contributed to experimental design, bioinformatics analysis of single cell data, development of custom latent variable models to dissect sources of variability and correlation patterns between omics layers.
Impact None; I left academia before the paper was written.
Start Year 2016
 
Description Pluripotency of human HSCs 
Organisation Cardiff University
Country United Kingdom 
Sector Academic/University 
PI Contribution Bioinformatics analysis of single cell data using latent variable models
Collaborator Contribution Experimental design, performed experiments.
Impact paper in preparation
Start Year 2016
 
Description The Global Molecular Landscape of dormant Hematopoietic Stem Cells 
Organisation German Cancer Research Center
Country Germany 
Sector Academic/University 
PI Contribution I processed and analyzed population RNA-Seq data, developed and implemented custom preprocessing and analysis tools for single-cell RNA-seq data, and analyzed the single-cell RNA-seq data. This included the development of a latent variable model to account for batch effects.
Collaborator Contribution The partners at DKFZ performed RNA-seq experiments, including chasing mice and isolating stem cells.
Impact https://doi.org/10.1016/j.cell.2017.04.018
Start Year 2015
 
Description Understand differentiation dynamics on a single cell level 
Organisation Helmholtz Zentrum München
Department Institute of Computational Biology
Country Germany 
Sector Private 
PI Contribution Contribute to the development of statistical tools to understand the dynamics underlying cell differentiation processes on a single cell level
Collaborator Contribution Contribute to the development of statistical tools to understand the dynamics underlying cell differentiation processes on a single cell level
Impact Several publications: https://doi.org/10.1093/bioinformatics/btv715 doi:10.1038/nmeth.3971 doi:10.1038/nmeth.4182
Start Year 2013
 
Description Understanding the differentiation process of human induced pluripotent stem cells towards the endoderm lineage 
Organisation The Wellcome Trust Sanger Institute
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution Contributed to experimental design, bioinformatics analysis of single cell data, development of custom latent variable models to dissect sources of variability.
Collaborator Contribution Contributed to experimental design, performed experiments.
Impact None; I left academia before the paper was written.
Start Year 2015
 
Title Computational assignment of cell-cycle stage from single-cell transcriptome data 
Description cyclone is a set of computational methods to assign cell-cycle stage from single-cell transcriptome data. They include a PCA-based method, a random forest based method and a custom-built predictor. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Enables researchers to computationally assign cell cycle stage to RNA-seq samples. Resulted in a publication: Scialdone A, Natarajan KN, Saraiva LR, Proserpio V, Teichmann SA, Stegle O, Marioni JC and Buettner F, "Computational assignment of cell-cycle stage from single-cell transcriptome data", Methods, 2015, doi:10.1016/j.ymeth.2015.06.021
URL https://github.com/PMBio/cyclone
 
Title Factorial latent variable model 
Description A Python framework for single-cell RNA-seq data that uses gene set annotations to dissect single-cell transcriptome heterogeneity, making it possible to identify biological drivers of cell-to-cell variability and to model confounding factors.
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact part of a paper currently in submission: http://biorxiv.org/content/early/2016/11/15/087775 
URL https://github.com/PMBio/f-scLVM
 
Title HaematoFatePrediction 
Description Based on the image patches along with a displacement feature, our models can be applied to obtain cell-specific predictions of lineage choice. We illustrate the workflow in an IPython notebook that can be viewed interactively. This workflow includes processing of image patches, the extraction of convolutional neural network (CNN)-based patch-specific features, and the final prediction of cell-specific lineage scores using a recurrent neural network (RNN).
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact part of doi:10.1038/nmeth.4182 
URL https://github.com/QSCD/HematoFatePrediction
 
Title Multi-Omics Factor Analysis Model (MOFA) 
Description Multi-Omics Factor Analysis Model (MOFA) 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Widely used by bioinformaticians; > 100 stars on github. 
 
Title destiny: diffusion maps for large-scale single-cell data in R 
Description Create and plot diffusion maps to visualize single cell gene expression data. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Helps applied researchers generate interpretable visualizations of single cell gene expression data. Resulted in a publication: Angerer, Philipp, Laleh Haghverdi, Maren Büttner, Fabian J. Theis, Carsten Marr, and Florian Buettner. "destiny: diffusion maps for large-scale single-cell data in R." Bioinformatics (2015): btv715.
URL https://www.bioconductor.org/packages/release/bioc/html/destiny.html
 
Description Cambridge UK Computational Biology & Bioinformatics Meetups 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact I was invited to give a talk at the Cambridge UK Computational Biology & Bioinformatics Meetups on my project; the talk sparked lots of good discussions with people from industry and academia.
Year(s) Of Engagement Activity 2016
 
Description ISMB conference, Florida
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Gave presentation in the Late Breaking Track at ISMB 2016 in Orlando, Florida.
Year(s) Of Engagement Activity 2016
 
Description Joint Statistical Meeting (Chicago) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Was invited to give a talk at a topic contributed session of the Section on Statistics in Genomics and Genetics at the Joint Statistical Meeting which sparked questions and discussions afterwards.
Year(s) Of Engagement Activity 2016
 
Description MBI Workshop 2 on Models for Oncogenesis, Clonality and Tumor Progression 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Participated in the MBI Workshop 2 on Models for Oncogenesis, Clonality and Tumor Progression by giving a talk and being part of discussion panels. The workshop sparked lots of discussions and ideas for future collaborations were developed.
Year(s) Of Engagement Activity 2016
 
Description Single Cell Biology conference (Wellcome Genome Campus, Hinxton, Cambridge, UK) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Attended conference and presented research as poster.
Year(s) Of Engagement Activity 2016
 
Description Single Cell Conference, Netherlands 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact About 50 scientists and people from industry attended my poster presentation at the conference.
Year(s) Of Engagement Activity 2016