Latent variable models for dissecting cell-to-cell heterogeneity in single-cell profiling data

Lead Research Organisation: European Bioinformatics Institute

Department Name: Stegle Group

Abstract

The cell is often viewed as the fundamental unit in biology. The cells within the organs of our bodies and other organisms differ from each other in their molecular state and can be characterised in terms of the set of genes they express. Conventional techniques for measuring this gene expression are based on large populations of cells; rather than quantifying the gene expression of single cells, they rely on measuring the average gene expression of a very large number of cells. Many important research questions, however, can only be addressed when analysing cells individually. For example, single-cell analyses are vital for identifying rare cell types (e.g. tumour cells in blood samples), or for providing insights into disease development (e.g. better characterisation of the differentiation of blood stem cells is crucial for understanding leukaemia). Many recent technological advances now allow us to perform such single-cell profiling and tackle some of these open questions. However, as with any new technology, there are challenges that must be overcome before this powerful approach can attain its full potential.

In this proposal I am planning to address the fundamental challenges which arise in the statistical analysis of this new type of data. In particular, these include problems in the reliable identification of rare cell types and the presence of confounding factors. When measuring the expression of all genes in an ensemble of individual cells, the observed RNA output (which is a measure of the gene expression strength) between cells will always vary. Some of the variation may arise from technical artefacts (the experimental procedure may just be more efficient in one cell than another) while other variation might be due to a multitude of hidden biological processes that cannot be observed directly, e.g. effects of the cell cycle, which is a confounding factor that in turn can mask other processes of interest, such as differentiation. Consequently, for any analysis of single-cell data, it is vital to understand and uncover the reasons behind the observed variability in gene expression.
In the course of this fellowship, I will use a statistical approach called latent variable models to unravel the observed variability between single cells. This in turn will enable the reliable identification of rare cell types within an ensemble of single cells and could be used to formally test the statistical significance of individual biological processes.
A related challenge I will address is the difficulty of identifying the order in which gene expression patterns change during differentiation. As gene expression can only be measured once for each cell, data-sets from cell differentiation studies consist of multiple 'snapshots', each containing a mixed population of cells that are in different stages of differentiation. I will therefore extend latent variable models to reveal the hidden temporal order of the cells and identify cells at various stages within the differentiation process.
I will work together with leading experimental groups and apply the newly developed methods to study differentiation processes in blood stem cells as well as in immune cells. This real-world application of my methods has the potential ultimately to provide new insights into the biology of infection, autoimmune diseases and leukaemia.
Finally, I will provide freely available and user-friendly open-source software, facilitating the broad applicability of my proposed methods in any lab performing single-cell experiments.

Technical Summary

Conventional techniques for profiling cells quantify the average RNA abundance in large populations of cells. Recent technological advances now enable the quantification of transcriptional abundance at the single-cell level, by assaying the complete transcriptome of individual cells (e.g. using single-cell RNA sequencing technology). Datasets derived using this technology can be used to identify distinct molecular states (sub-populations) within heterogeneous populations of cells and yield new insights into core cellular processes. The aim of this proposal is to develop the statistical methods needed to address the challenges that arise when analysing these data. In particular, we aim to model the cell-to-cell heterogeneity that is caused by the complex interplay of multiple hidden biological factors. Importantly, both technical noise and biological processes underlie this variability. These hidden (latent) processes may include both processes of interest (e.g. differentiation) and confounding processes (e.g. cell cycle). I propose to develop latent variable models to dissect the observed variability. To this end, I will exploit prior knowledge on the sources of heterogeneity to establish a dictionary of latent variables that is able to fully capture the observed variability in gene expression data. In turn, these inferred latent variables will enable the efficient identification of sub-populations and can be used to formally test for the significance of individual latent processes. Furthermore, I will use Gaussian Process models to infer the transcriptional dynamics of differentiating cells from snapshot data of largely unsynchronised cells. Finally, I will develop user-friendly open-source software in order to facilitate broad application of the proposed methods. In collaboration with leading experimental groups, I will apply the newly developed methods to study differentiation processes in blood stem/progenitor cells as well as in immune cells.
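
As an illustration only, a latent variable model of the kind described above is often written as a linear factor model. The notation below is an assumption made for this sketch and is not taken from the proposal: y_{n,g} is the expression of gene g in cell n, x_{n,k} is the activity of hidden factor k (for example cell cycle or differentiation) in cell n, w_{g,k} is the loading of gene g on that factor, and the residual term stands for unexplained noise.

```latex
y_{n,g} = \sum_{k=1}^{K} x_{n,k}\, w_{g,k} + \varepsilon_{n,g},
\qquad \varepsilon_{n,g} \sim \mathcal{N}\!\left(0, \sigma_g^{2}\right)

% If the factors are assumed mutually uncorrelated and independent of the noise,
% the variance of gene g splits into one term per factor plus a residual term:
\operatorname{Var}(y_{g}) = \sum_{k=1}^{K} w_{g,k}^{2}\, \operatorname{Var}(x_{k}) + \sigma_g^{2}
```

Under these assumptions, each term of the second expression can be read as the share of a gene's variability attributed to one hidden process, which is the sense in which such a model "dissects" the observed variability.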

Planned Impact

The proposed research on developing statistical methods for analysis of single-cell profiling data has the potential to impact researchers from a wide range of disciplines, mainly in the biomedical sciences but also in the field of statistical learning. In the short term, my work will benefit the increasing number of researchers from different disciplines conducting single-cell studies, as the challenges I am planning to address are likely to be issues in a considerable fraction of such experiments. While I will mainly develop methods for analysis of single-cell RNA-seq data, these methods are also highly relevant in other single-cell "omics" applications, e.g. single-cell metabolomics, which I will explore in work package 3.
In order to maximise the circle of potential users of the proposed methods, I am planning to develop a user-friendly, open-source software package that includes detailed documentation. This will enable many researchers from different backgrounds to use the proposed methods. I am confident that my proposed methods will help some of these researchers gain novel, biologically relevant insights from their data.
Part of the proposed research involves the development and application of efficient inference methods in structured covariance models. Therefore, the community of researchers working in statistical learning will also benefit from the proposed work, potentially facilitating the development of novel methods and applications built upon its foundations.

In the long term, the proposed research has the potential to have an impact on central public health issues. I am planning to collaborate with Dr Teichmann and Prof Göttgens, whose research is focused on transcriptional regulation of the mouse immune system and blood stem cell differentiation, respectively. Ultimately, within these collaborations, the proposed research may contribute to new insights into the biology of infection, autoimmune diseases and leukaemia.

In addition, the proposed research may have an impact in the biotechnological sector. The EMBL-EBI Industry Programme provides a forum for knowledge exchange with industrial partners, which could pave the way for development of new commercial products.

I will also make my research accessible to the general public via press releases and non-technical summaries of scientific articles. By communicating my research to the general public I am hoping to raise awareness of the importance of biostatistics in the age of ever-increasing data and perhaps ignite the interest of the next generation of researchers.

On a related note, I am hoping to continue to engage with current students and researchers, as well as professionals from other disciplines, through various teaching activities. In this way, I am keen to contribute to the training of future computational biologists and biostatisticians.

In summary, the proposed research has the potential to be of great benefit for the knowledge economy of the UK (including the advancement of scientific knowledge) and the long-term improvement of public health.

Funded Value:

£258,529

Funded Period:

Mar 15 - Dec 16

Funder:

MRC

Project Status:

Closed

Project Category:

Fellowship

Project Reference:

MR/M01536X/1

Principal Investigator:

Florian Buettner

Health Category:

Unclassified

Organisations

People	ORCID iD
Florian Buettner (Principal Investigator / Fellow)

Publications

Angerer P (2016) destiny: diffusion maps for large-scale single-cell data in R. in Bioinformatics (Oxford, England)

Argelaguet R (2018) Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets in Molecular Systems Biology

Argelaguet R (2019) Multi-omics profiling of mouse gastrulation at single-cell resolution. in Nature

Buettner F (2017) f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. in Genome biology

Buggenthin F (2017) Prospective identification of hematopoietic lineage choice by deep learning. in Nature methods

Cabezas-Wallscheid N (2017) Vitamin A-Retinoic Acid Signaling Regulates Hematopoietic Stem Cell Dormancy. in Cell

Haghverdi L (2016) Diffusion pseudotime robustly reconstructs lineage branching. in Nature methods

Heninger AK (2017) A divergent population of autoantigen-responsive CD4+ T cells in infants prior to β cell autoimmunity. in Science translational medicine

Laimighofer M (2016) Unbiased Prediction and Feature Selection in High-Dimensional Survival Regression. in Journal of computational biology : a journal of computational molecular cell biology

Scialdone A (2015) Computational assignment of cell-cycle stage from single-cell transcriptome data. in Methods (San Diego, Calif.)

Vanneste B (2019) Ano-rectal wall dose-surface maps localize the dosimetric benefit of hydrogel rectum spacers in prostate cancer radiotherapy in Clinical and Translational Radiation Oncology

Research Databases and Models
Collaboration
Software and Technical Products
Engagement Activities


Title	Computational assignment of cell-cycle stage from single-cell transcriptome data
Description	We developed a method that assigns cells to a specific phase of the cell cycle based on gene expression data. We validated our model on single-cell and bulk data.
Type Of Material	Computer model/algorithm
Year Produced	2016
Provided To Others?	Yes
Impact	Other research groups interested in cell cycle can use this tool to advance their research, performing in-silico cell cycle staging.
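
To make the idea of in-silico cell cycle staging concrete, here is a deliberately simple sketch. It is not the published method: it scores each cell by the mean expression of marker genes per phase and picks the highest-scoring phase. The gene names, the marker lists and the function `assign_phase` are hypothetical and exist only for this illustration.

```python
from statistics import mean

# Hypothetical marker genes per phase (placeholders, not real annotations).
PHASE_MARKERS = {
    "G1": ["geneA", "geneB"],
    "S": ["geneC", "geneD"],
    "G2M": ["geneE", "geneF"],
}


def assign_phase(expression, markers=PHASE_MARKERS):
    """Return (best_phase, scores) for one cell.

    `expression` maps gene name -> normalised expression value.
    Genes missing from the cell are treated as not expressed (0.0).
    """
    scores = {
        phase: mean(expression.get(gene, 0.0) for gene in genes)
        for phase, genes in markers.items()
    }
    best_phase = max(scores, key=scores.get)
    return best_phase, scores


# Toy cells with made-up expression values.
cells = {
    "cell_1": {"geneA": 5.0, "geneB": 4.0, "geneC": 1.0, "geneD": 0.5, "geneE": 0.2},
    "cell_2": {"geneC": 3.0, "geneD": 6.0, "geneE": 1.0, "geneF": 1.0},
}

for name, expression in cells.items():
    phase, _ = assign_phase(expression)
    print(name, phase)
```

A trained classifier would normally replace the fixed marker lists; the sketch only shows the input (per-cell expression) and output (a phase label per cell) of such a tool.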


Title	Diffusion maps for visualizing single cell gene expression data
Description	We developed a new model for visualizing the high dimensional gene expression data from single cell experiments. Our tool is based on diffusion maps and allows for efficient visualisation of smooth changes of intrinsic cell state within an ensemble of single cells.
Type Of Material	Data analysis technique
Year Produced	2016
Provided To Others?	Yes
Impact	Our tool is used by different research groups working with single cell data around the world.
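
The following is a minimal sketch of the diffusion map idea, written in plain Python for illustration; it is not the destiny package or its API. It assumes a Gaussian kernel with a hand-picked width `sigma`, builds a random-walk matrix between cells, and extracts the first non-trivial diffusion component by power iteration. The toy data (three hypothetical genes that change linearly with a hidden progression value) and all function names are assumptions of this sketch.

```python
import math
import random


def diffusion_component(cells, sigma=0.5, n_iter=500, seed=0):
    """First non-trivial diffusion component for a list of expression vectors."""
    n = len(cells)
    # Gaussian kernel on pairwise Euclidean distances between cells.
    kernel = [[math.exp(-math.dist(a, b) ** 2 / (2 * sigma ** 2)) for b in cells]
              for a in cells]
    degree = [sum(row) for row in kernel]
    # Symmetric normalisation; it has the same eigenvalues as the
    # row-normalised random-walk (transition) matrix.
    sym = [[kernel[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
           for i in range(n)]
    # Trivial top eigenvector (eigenvalue 1), scaled to unit length.
    total = math.sqrt(sum(degree))
    top = [math.sqrt(d) / total for d in degree]

    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        # Project out the trivial direction, apply the matrix, renormalise.
        overlap = sum(v * t for v, t in zip(vec, top))
        vec = [v - overlap * t for v, t in zip(vec, top)]
        vec = [sum(sym[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Convert to the right eigenvector of the random-walk matrix.
    return [v / math.sqrt(d) for v, d in zip(vec, degree)]


# Toy data: cells observed in arbitrary order along a hidden progression.
hidden_t = [0.55, 0.05, 0.9, 0.3, 0.7, 0.15, 1.0, 0.4]
cells = [(t, 2.0 * t, 1.0 - t) for t in hidden_t]

component = diffusion_component(cells)
order = sorted(range(len(cells)), key=lambda i: component[i])
recovered = [hidden_t[i] for i in order]
print(recovered)  # hidden progression values, sorted up to direction
```

Sorting cells by this component recovers the hidden progression up to its direction, which is the sense in which a diffusion map exposes smooth changes of cell state. Real data would need a data-driven choice of `sigma` and a proper eigensolver.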


Title	Diffusion pseudotime to reconstruct lineage branching
Description	We developed a model that allows the reconstruction of branching lineages of differentiating stem cells.
Type Of Material	Computer model/algorithm
Year Produced	2016
Provided To Others?	Yes
Impact	Other research groups interested in analyzing cell differentiation can use this tool to advance their research, performing in-silico lineage tracing.


Title	Prospective identification of hematopoietic lineage choice
Description	We developed a tool to prospectively identify hematopoietic lineage choice based on time lapse microscopy data; we implemented and trained a deep learning network to solve this task.
Type Of Material	Computer model/algorithm
Year Produced	2017
Provided To Others?	Yes
Impact	Other research groups interested in analyzing time lapse data of differentiating cells can use this tool to advance their research.


Title	factorial single cell latent variable model
Description	We developed a scalable modelling framework for single-cell RNA-seq data that uses gene set annotations to dissect single-cell transcriptome heterogeneity, thereby allowing the identification of biological drivers of cell-to-cell variability and the modelling of confounding factors.
Type Of Material	Computer model/algorithm
Year Produced	2016
Provided To Others?	Yes
Impact	Other research groups interested in dissecting the variability in single-cell data can use this tool to advance their research.
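
As a rough illustration of how a gene set annotation can drive a latent factor, the sketch below estimates one factor from an annotated gene set and measures how much of each gene's variance it explains. This is a crude one-factor approximation, not the f-scLVM model or its API; the gene names, the expression values and the helper functions are hypothetical.

```python
from statistics import mean


def center(values):
    """Subtract the mean from a list of numbers."""
    m = mean(values)
    return [v - m for v in values]


def gene_set_factor(expr, gene_set):
    """Per-cell factor activity: mean centred expression over annotated genes."""
    centred = [center(expr[gene]) for gene in gene_set]
    n_cells = len(centred[0])
    return [mean(row[c] for row in centred) for c in range(n_cells)]


def regress_out(values, factor):
    """Least-squares removal of `factor` from one gene.

    Returns (residuals, fraction of the gene's variance explained).
    """
    y = center(values)
    x = center(factor)
    weight = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    resid = [b - weight * a for a, b in zip(x, y)]
    explained = 1.0 - sum(r * r for r in resid) / sum(b * b for b in y)
    return resid, explained


# Toy expression values: gene -> one value per cell (six cells).
expr = {
    "cc1": [1.0, 3.0, 5.0, 1.2, 2.8, 5.1],    # hypothetical cell-cycle gene
    "cc2": [2.0, 4.1, 6.0, 1.9, 4.0, 6.2],    # hypothetical cell-cycle gene
    "other": [3.0, 1.0, 2.0, 2.5, 3.5, 1.5],  # gene outside the annotated set
}

factor = gene_set_factor(expr, ["cc1", "cc2"])
resid_cc1, explained_cc1 = regress_out(expr["cc1"], factor)
resid_other, explained_other = regress_out(expr["other"], factor)
print(round(explained_cc1, 3), round(explained_other, 3))
```

In this toy example the annotated genes are almost fully explained by the factor while the unrelated gene is not, and the residuals can be passed on to downstream analyses once the confounding factor has been accounted for.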


Description	Factor analysis model for multi-omics data
Organisation	European Molecular Biology Laboratory
Department	European Molecular Biology Laboratory Heidelberg
Country	Germany
Sector	Academic/University
PI Contribution	Contributed to the development of a factor analysis model for analysing multi-omics data
Collaborator Contribution	Interpreted the results and implemented the model.
Impact	paper in revision at MSB
Start Year	2016


Description	Multi-omics single-cell analysis of differentiating mouse embryonic stem cells
Organisation	Babraham Institute
Country	United Kingdom
Sector	Academic/University
PI Contribution	Understanding the differentiation process of mouse embryonic stem cells in terms of coordinated changes of the methylome and transcriptome, at a single cell level
Collaborator Contribution	Contributed to experimental design, bioinformatics analysis of single cell data, development of custom latent variable models to dissect sources of variability and correlation patterns between omics layers.
Impact	None, I left academia before paper was written
Start Year	2016


Description	Pluripotency of human HSCs
Organisation	Cardiff University
Country	United Kingdom
Sector	Academic/University
PI Contribution	Bioinformatics analysis of single cell data using latent variable models
Collaborator Contribution	Experimental design, performed experiments.
Impact	paper in preparation
Start Year	2016


Description	The Global Molecular Landscape of dormant Hematopoietic Stem Cells
Organisation	German Cancer Research Center
Country	Germany
Sector	Academic/University
PI Contribution	I processed and analyzed population RNA-Seq data, developed and implemented custom preprocessing and analysis tools for single-cell RNA-seq data, and analyzed the single-cell RNA-seq data. This included the development of a latent variable model to account for batch effects.
Collaborator Contribution	The partners at DKFZ performed RNA-seq experiments, including chasing mice and isolating stem cells.
Impact	https://doi.org/10.1016/j.cell.2017.04.018
Start Year	2015


Description	Understand differentiation dynamics on a single cell level
Organisation	Helmholtz Zentrum München
Department	Institute of Computational Biology
Country	Germany
Sector	Private
PI Contribution	Contribute to the development of statistical tools to understand the dynamics underlying cell differentiation processes on a single cell level
Collaborator Contribution	Contribute to the development of statistical tools to understand the dynamics underlying cell differentiation processes on a single cell level
Impact	Several publications: https://doi.org/10.1093/bioinformatics/btv715 doi:10.1038/nmeth.3971 doi:10.1038/nmeth.4182
Start Year	2013


Description	Understanding the differentiation process of human induced pluripotent stem cells towards the endoderm lineage
Organisation	The Wellcome Trust Sanger Institute
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	Contributed to experimental design, bioinformatics analysis of single cell data, development of custom latent variable models to dissect sources of variability.
Collaborator Contribution	Contributed to experimental design, performed experiments.
Impact	None, I left academia before paper was written.
Start Year	2015


Title	Computational assignment of cell-cycle stage from single-cell transcriptome data
Description	cyclone is a set of computational methods to assign cell-cycle stage from single-cell transcriptome data. They include a PCA-based method, a random forest based method and a custom-built predictor.
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	Enables researchers to computationally assign cell cycle stage to RNA-seq samples. Resulted in a publication: Scialdone A, Natarajan KN, Saraiva LR, Proserpio V, Teichmann SA, Stegle O, Marioni JC and Buettner F, "Computational assignment of cell-cycle stage from single-cell transcriptome data", Methods, 2015, doi:10.1016/j.ymeth.2015.06.021
URL	https://github.com/PMBio/cyclone


Title	Factorial latent variable model
Description	A Python framework for single-cell RNA-seq data that uses gene set annotations to dissect single-cell transcriptome heterogeneity, thereby allowing the identification of biological drivers of cell-to-cell variability and the modelling of confounding factors.
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	part of a paper currently in submission: http://biorxiv.org/content/early/2016/11/15/087775
URL	https://github.com/PMBio/f-scLVM


Title	HaematoFatePrediction
Description	Based on the image patches along with a displacement feature, our models can be applied to obtain cell-specific predictions of lineage choice. We illustrate the workflow in an IPython notebook that can be viewed interactively. This workflow includes processing of image patches, the extraction of convolutional neural network (CNN)-based patch-specific features as well as the final prediction of cell-specific lineage scores using a recurrent neural network (RNN).
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	part of doi:10.1038/nmeth.4182
URL	https://github.com/QSCD/HematoFatePrediction


Title	Multi-Omics Factor Analysis Model (MOFA)
Description	Multi-Omics Factor Analysis Model (MOFA)
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	Widely used by bioinformaticians; > 100 stars on github.


Title	destiny: diffusion maps for large-scale single-cell data in R
Description	Create and plot diffusion maps to visualize single cell gene expression data.
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	Enables applied researchers to generate interpretable visualizations of single cell gene expression data. Resulted in a publication: Angerer, Philipp, Laleh Haghverdi, Maren Büttner, Fabian J. Theis, Carsten Marr, and Florian Buettner. "destiny: diffusion maps for large-scale single-cell data in R." Bioinformatics (2015): btv715.
URL	https://www.bioconductor.org/packages/release/bioc/html/destiny.html


Description	Cambridge UK Computational Biology & Bioinformatics Meetups
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	I was invited to give a talk at the Cambridge UK Computational Biology & Bioinformatics Meetups on my project; the talk sparked lots of good discussions with people from industry and academia.
Year(s) Of Engagement Activity	2016


Description	ISMB conference, Florida
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Gave presentation in the Late Breaking Track at ISMB 2016 in Orlando, Florida.
Year(s) Of Engagement Activity	2016


Description	Joint Statistical Meeting (Chicago)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Was invited to give a talk at a topic contributed session of the Section on Statistics in Genomics and Genetics at the Joint Statistical Meeting which sparked questions and discussions afterwards.
Year(s) Of Engagement Activity	2016


Description	MBI Workshop 2 on Models for Oncogenesis, Clonality and Tumor Progression
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Participated in the MBI Workshop 2 on Models for Oncogenesis, Clonality and Tumor Progression by giving a talk and being part of discussion panels. The workshop sparked lots of discussions and ideas for future collaborations were developed.
Year(s) Of Engagement Activity	2016


Description	Single Cell Biology conference (Wellcome Genome Campus, Hinxton, Cambridge, UK)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Attended conference and presented research as poster.
Year(s) Of Engagement Activity	2016


Description	Single Cell Conference, Netherlands
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	About 50 scientists and people from industry attended my poster presentation at the conference.
Year(s) Of Engagement Activity	2016