Latent variable models for dissecting cell-to-cell heterogeneity in single-cell profiling data
Lead Research Organisation:
European Bioinformatics Institute
Department Name: Stegle Group
Abstract
The cell is often viewed as the fundamental unit in biology. The cells within the organs of our bodies and other organisms differ from each other in their molecular state and can be characterised in terms of the set of genes they express. Conventional techniques for measuring this gene expression are based on large populations of cells; rather than quantifying the gene expression of single cells, they rely on measuring the average gene expression of a very large number of cells. Many important research questions, however, can only be addressed when analysing cells individually. For example, single-cell analyses are vital for identifying rare cell types (e.g. tumour cells in blood samples), or for providing insights into disease development (e.g. better characterisation of the differentiation of blood stem cells is crucial for understanding leukaemia). Many recent technological advances now allow us to perform such single-cell profiling and tackle some of these open questions. However, as with any new technology, there are challenges that must be overcome before this powerful approach can attain its full potential.
In this proposal I am planning to address the fundamental challenges which arise in the statistical analysis of this new type of data. In particular, these include problems in the reliable identification of rare cell types and the presence of confounding factors. When measuring the expression of all genes in an ensemble of individual cells, the observed RNA output (which is a measure of the gene expression strength) between cells will always vary. Some of the variation may arise from technical artefacts (the experimental procedure may just be more efficient in one cell than another) while other variation might be due to a multitude of hidden biological processes that cannot be observed directly, e.g. effects of the cell cycle, which is a confounding factor that in turn can mask other processes of interest, such as differentiation. Consequently, for any analysis of single-cell data, it is vital to understand and uncover the reasons behind the observed variability in gene expression.
In the course of this fellowship, I will use a statistical approach called latent variable models to unravel the observed variability between single cells. This in turn will enable the reliable identification of rare cell types within an ensemble of single cells and could be used to formally test the statistical significance of individual biological processes.
A related challenge I will address is the difficulty of identifying the order in which gene expression patterns change during differentiation. As gene expression can only be measured once for each cell, data-sets from cell differentiation studies consist of multiple 'snapshots', each containing a mixed population of cells that are in different stages of differentiation. I will therefore extend latent variable models to reveal the hidden temporal order of the cells and identify cells at various stages within the differentiation process.
I will work together with leading experimental groups and apply the newly developed methods to study differentiation processes in blood stem cells as well as in immune cells. This real-world application of my methods has the potential ultimately to provide new insights into the biology of infection, autoimmune diseases and leukaemia.
Finally, I will provide freely available and user-friendly open-source software, facilitating the broad applicability of my proposed methods in any lab performing single-cell experiments.
Technical Summary
Conventional techniques for profiling cells quantify the average RNA abundance in large populations of cells. Recent technological advances now enable the quantification of transcriptional abundance at the single-cell level, by assaying the complete transcriptome of individual cells (e.g. using single-cell RNA sequencing technology). Datasets derived using this technology can be used to identify distinct molecular states (sub-populations) within heterogeneous populations of cells and yield new insights in core cellular processes. The aim of this proposal is to develop the necessary statistical methods to address pertinent statistical challenges when analysing these data. In particular, we aim to model the cell-to-cell heterogeneity that is caused by the complex interplay of multiple hidden biological factors. Importantly, both technical noise and biological processes underlie this variability. These hidden (latent) processes may include both processes of interest (e.g. differentiation) and confounding processes (e.g. cell cycle). I propose to develop latent variable models to dissect the observed variability. To this end, I will exploit prior knowledge on the sources of heterogeneity to establish a dictionary of latent variables that is able to fully capture the observed variability in gene expression data. In turn, these inferred latent variables will enable the efficient identification of sub-populations and can be used to formally test for the significance of individual latent processes. Furthermore, I will use Gaussian Process models to infer the transcriptional dynamics of differentiating cells from snapshot data of largely unsynchronised cells. Finally, I will develop user-friendly open-source software in order to facilitate broad application of the proposed methods. In collaboration with leading experimental groups, I will apply the newly developed methods to study differentiation processes in blood stem/progenitor cells as well as in immune cells.
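The idea of separating a known confounder from a hidden process of interest can be illustrated with a minimal sketch. It is not the model proposed here: it assumes a purely linear model with one observed covariate (standing in for a cell-cycle score) and one unobserved factor (standing in for differentiation), and it swaps in least-squares regression followed by a singular value decomposition of the residuals in place of a full probabilistic latent variable model. All function names, dimensions and noise levels are hypothetical choices for illustration on simulated data.

```python
import numpy as np


def simulate(n_cells=200, n_genes=50, seed=0):
    """Simulate log-expression driven by one known and one hidden factor."""
    rng = np.random.default_rng(seed)
    known = rng.normal(size=n_cells)    # observed covariate, e.g. a cell-cycle score
    hidden = rng.normal(size=n_cells)   # unobserved process, e.g. differentiation
    w_known = rng.normal(size=n_genes)  # per-gene loadings on the known factor
    w_hidden = rng.normal(size=n_genes) # per-gene loadings on the hidden factor
    noise = 0.5 * rng.normal(size=(n_cells, n_genes))
    Y = np.outer(known, w_known) + np.outer(hidden, w_hidden) + noise
    return Y, known, hidden


def fit_latent_factor(Y, known):
    """Regress out the known factor, then take the leading residual component."""
    Yc = Y - Y.mean(axis=0)                    # centre each gene
    X = (known - known.mean())[:, None]        # cells x 1 design matrix
    beta, *_ = np.linalg.lstsq(X, Yc, rcond=None)
    resid = Yc - X @ beta                      # variation not explained by the confounder
    U, S, _ = np.linalg.svd(resid, full_matrices=False)
    factor = U[:, 0] * S[0]                    # one inferred latent value per cell
    explained = S[0] ** 2 / np.sum(S ** 2)     # share of residual variance captured
    return factor, explained


Y, known, hidden = simulate()
factor, explained = fit_latent_factor(Y, known)
# The sign of a latent factor is arbitrary, so compare absolute correlation.
corr = abs(np.corrcoef(factor, hidden)[0, 1])
print(f"|corr| with true hidden factor: {corr:.2f}; residual variance explained: {explained:.2f}")
```

In this toy setting the inferred factor tracks the simulated hidden process closely because the confounder is observed and the model is linear. Real single-cell data violate both assumptions to some degree, which is the motivation for the richer latent variable and Gaussian Process models described above.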
Planned Impact
The proposed research on developing statistical methods for analysis of single-cell profiling data has the potential to impact researchers from a wide range of disciplines, mainly in the biomedical sciences but also in the field of statistical learning. In the short term, my work will benefit the increasing number of researchers from different disciplines conducting single-cell studies, as the challenges I am planning to address are likely to be issues in a considerable fraction of such experiments. While I will mainly develop methods for analysis of single-cell RNA-seq data, these methods are also highly relevant in other single-cell "omics" applications, e.g. single-cell metabolomics, which I will explore in work package 3.
In order to maximise the circle of potential users of the proposed methods, I am planning to develop a user-friendly, open-source software package that includes detailed documentation. This will enable many researchers from different backgrounds to use the proposed methods. I am confident that my proposed methods will help some of these researchers gain novel, biologically relevant insights from their data.
Part of the proposed research involves the development and application of efficient inference methods in structured covariance models. Therefore, the community of researchers working in statistical learning will also benefit from the proposed work, potentially facilitating the development of novel methods and applications built upon its foundations.
In the long term, the proposed research has the potential to have an impact on central public health issues. I am planning to collaborate with Dr Teichmann and Prof Göttgens, whose research is focused on transcriptional regulation of the mouse immune system and blood stem cell differentiation, respectively. Ultimately, within these collaborations, the proposed research may contribute to new insights into the biology of infection, autoimmune diseases and leukaemia.
In addition, the proposed research may have an impact in the biotechnological sector. The EMBL-EBI Industry Programme provides a forum for knowledge exchange with industrial partners, which could pave the way for development of new commercial products.
I will also make my research accessible to the general public via press releases and non-technical summaries of scientific articles. By communicating my research to the general public I am hoping to raise awareness of the importance of biostatistics in the age of ever-increasing data and perhaps ignite the interest of the next generation of researchers.
On a related note, I am hoping to continue to engage with current students and researchers, as well as professionals from other disciplines, through various teaching activities. In this way, I am keen to contribute to the training of future computational biologists and biostatisticians.
In summary, the proposed research has the potential to be of great benefit for the knowledge economy of the UK (including the advancement of scientific knowledge) and the long-term improvement of public health.
Organisations
- European Bioinformatics Institute (Lead Research Organisation)
- Cardiff University (Collaboration)
- Babraham Institute (Collaboration)
- German Cancer Research Center (Collaboration)
- European Molecular Biology Laboratory (Collaboration)
- The Wellcome Trust Sanger Institute (Collaboration)
- Helmholtz Zentrum München (Collaboration)
- German Res Ctr for Env Health, Helmholtz (Fellow)
People |
ORCID iD |
Florian Buettner (Principal Investigator / Fellow) |
Publications
Angerer P
(2016)
destiny: diffusion maps for large-scale single-cell data in R.
in Bioinformatics (Oxford, England)
Argelaguet R
(2018)
Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets
in Molecular Systems Biology
Argelaguet R
(2019)
Multi-omics profiling of mouse gastrulation at single-cell resolution.
in Nature
Buettner F
(2017)
f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq.
in Genome biology
Buggenthin F
(2017)
Prospective identification of hematopoietic lineage choice by deep learning.
in Nature methods
Cabezas-Wallscheid N
(2017)
Vitamin A-Retinoic Acid Signaling Regulates Hematopoietic Stem Cell Dormancy.
in Cell
Haghverdi L
(2016)
Diffusion pseudotime robustly reconstructs lineage branching.
in Nature methods
Heninger AK
(2017)
A divergent population of autoantigen-responsive CD4+ T cells in infants prior to β cell autoimmunity.
in Science translational medicine
Laimighofer M
(2016)
Unbiased Prediction and Feature Selection in High-Dimensional Survival Regression.
in Journal of computational biology : a journal of computational molecular cell biology
Scialdone A
(2015)
Computational assignment of cell-cycle stage from single-cell transcriptome data.
in Methods (San Diego, Calif.)
Vanneste B
(2019)
Ano-rectal wall dose-surface maps localize the dosimetric benefit of hydrogel rectum spacers in prostate cancer radiotherapy
in Clinical and Translational Radiation Oncology
Title | Computational assignment of cell-cycle stage from single-cell transcriptome data |
Description | We developed a method that assigns cells to a specific phase of the cell cycle based on gene expression data. We validated our model on single-cell and bulk data. |
Type Of Material | Computer model/algorithm |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | Other research groups interested in cell cycle can use this tool to advance their research, performing in-silico cell cycle staging. |
Title | Diffusion maps for visualizing single cell gene expression data |
Description | We developed a new model for visualizing the high dimensional gene expression data from single cell experiments. Our tool is based on diffusion maps and allows for efficient visualisation of smooth changes of intrinsic cell state within an ensemble of single cells. |
Type Of Material | Data analysis technique |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | Our tool is used by different research groups working with single cell data around the world. |
Title | Diffusion pseudotime to reconstruct lineage branching |
Description | We developed a model that reconstructs branching lineages of differentiating stem cells. |
Type Of Material | Computer model/algorithm |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | Other research groups interested in analyzing cell differentiation can use this tool to advance their research, performing in-silico lineage tracing. |
Title | Prospective identification of hematopoietic lineage choice |
Description | We developed a tool to prospectively identify hematopoietic lineage choice based on time lapse microscopy data; we implemented and trained a deep learning network to solve this task. |
Type Of Material | Computer model/algorithm |
Year Produced | 2017 |
Provided To Others? | Yes |
Impact | Other research groups interested in analyzing time lapse data of differentiating cells can use this tool to advance their research. |
Title | factorial single cell latent variable model |
Description | We developed a scalable modelling framework for single-cell RNA-seq data that uses gene set annotations to dissect single-cell transcriptome heterogeneity, making it possible to identify biological drivers of cell-to-cell variability and to model confounding factors. |
Type Of Material | Computer model/algorithm |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | Other research groups interested in dissecting the variability in single-cell data can use this tool to advance their research. |
Description | Factor analysis model for multi-omics data |
Organisation | European Molecular Biology Laboratory |
Department | European Molecular Biology Laboratory Heidelberg |
Country | Germany |
Sector | Academic/University |
PI Contribution | Contributed to the development of a factor analysis model for analysing multi-omics data |
Collaborator Contribution | Interpreted the results and implemented the model. |
Impact | paper in revision at MSB |
Start Year | 2016 |
Description | Multi-omics single-cell analysis of differentiating mouse embryonic stem cells |
Organisation | Babraham Institute |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Understanding the differentiation process of mouse embryonic stem cells in terms of coordinated changes of the methylome and transcriptome, at a single cell level |
Collaborator Contribution | Contributed to experimental design, bioinformatics analysis of single cell data, development of custom latent variable models to dissect sources of variability and correlation patterns between omics layers. |
Impact | None, I left academia before paper was written |
Start Year | 2016 |
Description | Pluripotency of human HSCs |
Organisation | Cardiff University |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Bioinformatics analysis of single cell data using latent variable models |
Collaborator Contribution | Experimental design, performed experiments. |
Impact | paper in preparation |
Start Year | 2016 |
Description | The Global Molecular Landscape of dormant Hematopoietic Stem Cells |
Organisation | German Cancer Research Center |
Country | Germany |
Sector | Academic/University |
PI Contribution | I processed and analyzed population RNA-Seq data, developed and implemented custom preprocessing and analysis tools for single-cell RNA-seq data and analyzed the single-cell RNA-seq data. This included the development of a latent variable model to account for batch effects. |
Collaborator Contribution | The partners at DKFZ performed RNA-seq experiments, including chasing mice and isolating stem cells. |
Impact | https://doi.org/10.1016/j.cell.2017.04.018 |
Start Year | 2015 |
Description | Understand differentiation dynamics on a single cell level |
Organisation | Helmholtz Zentrum München |
Department | Institute of Computational Biology |
Country | Germany |
Sector | Private |
PI Contribution | Contribute to the development of statistical tools to understand the dynamics underlying cell differentiation processes on a single cell level |
Collaborator Contribution | Contribute to the development of statistical tools to understand the dynamics underlying cell differentiation processes on a single cell level |
Impact | Several publications: https://doi.org/10.1093/bioinformatics/btv715 doi:10.1038/nmeth.3971 doi:10.1038/nmeth.4182 |
Start Year | 2013 |
Description | Understanding the differentiation process of human induced pluripotent stem cells towards the endoderm lineage |
Organisation | The Wellcome Trust Sanger Institute |
Country | United Kingdom |
Sector | Charity/Non Profit |
PI Contribution | Contributed to experimental design, bioinformatics analysis of single cell data, development of custom latent variable models to dissect sources of variability. |
Collaborator Contribution | Contributed to experimental design, performed experiments. |
Impact | None, I left academia before paper was written. |
Start Year | 2015 |
Title | Computational assignment of cell-cycle stage from single-cell transcriptome data |
Description | cyclone is a set of computational methods to assign cell-cycle stage from single-cell transcriptome data. They include a PCA-based method, a random forest based method and a custom-built predictor. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | Enables researchers to computationally assign cell cycle stage to RNA-seq samples. Resulted in a publication: Scialdone A, Natarajan KN, Saraiva LR, Proserpio V, Teichmann SA, Stegle O, Marioni JC and Buettner F, "Computational assignment of cell-cycle stage from single-cell transcriptome data", Methods, 2015, doi:10.1016/j.ymeth.2015.06.021 |
URL | https://github.com/PMBio/cyclone |
Title | Factorial latent variable model |
Description | A Python framework for single-cell RNA-seq data that uses gene set annotations to dissect single-cell transcriptome heterogeneity, making it possible to identify biological drivers of cell-to-cell variability and to model confounding factors. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | part of a paper currently in submission: http://biorxiv.org/content/early/2016/11/15/087775 |
URL | https://github.com/PMBio/f-scLVM |
Title | HaematoFatePrediction |
Description | Based on the image patches along with a displacement feature, our models can be applied to obtain cell-specific predictions of lineage choice. We illustrate the workflow in an IPython notebook that can be viewed interactively. This workflow includes processing of image patches, the extraction of convolutional neural network (CNN)-based patch-specific features as well as the final prediction of cell-specific lineage scores using a recurrent neural network (RNN). |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | part of doi:10.1038/nmeth.4182 |
URL | https://github.com/QSCD/HematoFatePrediction |
Title | Multi-Omics Factor Analysis Model (MOFA) |
Description | Multi-Omics Factor Analysis Model (MOFA) |
Type Of Technology | Software |
Year Produced | 2018 |
Open Source License? | Yes |
Impact | Widely used by bioinformaticians; > 100 stars on github. |
Title | destiny: diffusion maps for large-scale single-cell data in R |
Description | Create and plot diffusion maps to visualize single cell gene expression data. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | Enables applied researchers to generate interpretable visualizations of single cell gene expression data. Resulted in a publication: Angerer, Philipp, Laleh Haghverdi, Maren Büttner, Fabian J. Theis, Carsten Marr, and Florian Buettner. "destiny: diffusion maps for large-scale single-cell data in R." Bioinformatics (2015): btv715. |
URL | https://www.bioconductor.org/packages/release/bioc/html/destiny.html |
Description | Cambridge UK Computational Biology & Bioinformatics Meetups |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | I was invited to give a talk at the Cambridge UK Computational Biology & Bioinformatics Meetups on my project; the talk sparked lots of good discussions with people from industry and academia. |
Year(s) Of Engagement Activity | 2016 |
Description | ISMB conference, Florida |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Gave presentation in the Late Breaking Track at ISMB 2016 in Orlando, Florida. |
Year(s) Of Engagement Activity | 2016 |
Description | Joint Statistical Meeting (Chicago) |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Was invited to give a talk at a topic contributed session of the Section on Statistics in Genomics and Genetics at the Joint Statistical Meeting which sparked questions and discussions afterwards. |
Year(s) Of Engagement Activity | 2016 |
Description | MBI Workshop 2 on Models for Oncogenesis, Clonality and Tumor Progression |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Participated in the MBI Workshop 2 on Models for Oncogenesis, Clonality and Tumor Progression by giving a talk and being part of discussion panels. The workshop sparked lots of discussions and ideas for future collaborations were developed. |
Year(s) Of Engagement Activity | 2016 |
Description | Single Cell Biology conference (Wellcome Genome Campus, Hinxton, Cambridge, UK) |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Attended conference and presented research as poster. |
Year(s) Of Engagement Activity | 2016 |
Description | Single Cell Conference, Netherlands |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | About 50 scientists and industry professionals attended my poster presentation at the conference. |
Year(s) Of Engagement Activity | 2016 |