From single cells to populations: generalized pseudotime analysis to identify patient trajectories from cross-sectional data in cancer genomics

Lead Research Organisation: University of Manchester
Department Name: School of Health Sciences

Abstract

Cancer continually evolves at the genetic level through the acquisition of mutations that reprogram normal cellular activity and ultimately lead to abnormal function. The evolution of cancer is unique to each patient, even among patients with the same type of cancer, although cancers of the same type share some core similarities. To understand how cancers evolve over time, the ideal study would follow individual patients and repeatedly sample their tumours to track the ongoing molecular changes. In practice, this is both logistically impossible and unethical: multiple invasive surgeries to obtain biopsies would be costly and distressing to patients, and treatment cannot be withheld to enable prospective monitoring of the disease.

The most practical clinical studies involve obtaining a single tumour biopsy from each patient (or multiple biopsies collected during the same surgery) for a large number of patients. This cross-sectional profile across a random patient population does not give us direct information about how the disease of an individual patient evolves, but we can combine the information across patients to identify sub-groups of individuals who appear to have similar disease trajectories. For example, suppose we have two patients who share a similar set of core mutations, one at an advanced stage of disease with many mutations and another who presents at a relatively earlier stage. The molecular status of the advanced patient could indicate the future molecular profile of the early-stage patient, if left untreated. This project proposes to develop novel statistical machine learning algorithms that apply this logic to molecular profiles obtained from whole genome sequencing of patients in a cross-sectional study, in order to learn temporal information that is not directly observed but may leave tell-tale clues behind.
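The two-patient reasoning above can be illustrated with a small sketch. This is a toy illustration only, not the statistical method proposed here: the function name `candidate_future_profiles`, the patient identifiers and the gene sets are hypothetical, and real profiles would be noisy and probabilistic rather than exact sets.

```python
def candidate_future_profiles(patients, core):
    """Pair each earlier-stage patient with more advanced patients who share
    the core mutations and carry every mutation the earlier patient has."""
    pairs = []
    for early_id, early in patients.items():
        for late_id, late in patients.items():
            if early_id == late_id:
                continue
            shares_core = core <= early and core <= late
            # A proper subset means the later profile has strictly more mutations.
            if shares_core and early < late:
                pairs.append((early_id, late_id, sorted(late - early)))
    return pairs


# Hypothetical cross-sectional cohort: one mutation set per patient.
patients = {
    "P1": {"TP53", "KRAS"},                     # earlier-stage profile
    "P2": {"TP53", "KRAS", "SMAD4", "CDKN2A"},  # advanced profile
    "P3": {"BRAF"},                             # does not share the core set
}
core = {"TP53", "KRAS"}

pairs = candidate_future_profiles(patients, core)
for early_id, late_id, gained in pairs:
    print(f"{late_id} may indicate the future profile of {early_id}; extra mutations: {gained}")
```

Here only P1 and P2 are paired, because P3 lacks the shared core mutations. The proposed algorithms would replace this exact set comparison with a probabilistic ordering across the whole cohort.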

We will apply these algorithms to the national Genomics England 100,000 Genomes Project, which seeks to sequence tens of thousands of cancer genomes across a range of cancer types over the next few years. The project will give insight into how cancers evolve and, importantly, provide a means of developing prognostic indicators based on molecular information that tell us the severity of a patient's disease and its likely trajectory.

Technical Summary

Longitudinal studies with repeated tumour sampling are the ideal approach for studying cancer evolution and its impact on patient outcome. However, such studies are logistically challenging and expensive to operate, and typically involve smaller cohorts than a cross-sectional study in which a single tumour sample (or perhaps multiple samples obtained at the same time) is collected from each patient.

This project proposes novel statistical approaches to inferring (pseudo)temporal information by integrating cross-sectional studies involving whole genome sequencing and gene expression analysis of cancers along with clinical covariates. The model probabilistically assigns patients to a latent "pseudotime", a marker of disease progression, based on their molecular and clinical profiles. The transformation from high-dimensional molecular observations to the one-dimensional pseudotime is implemented using a covariate-adjusted Gaussian Process Latent Variable Model that can model a different disease progression trajectory for each combination of the covariates. This will allow us to test whether putative molecular mechanisms distinguish between patients, for example those whose disease progresses to metastatic status or becomes resistant to radiotherapy and/or chemotherapy, by looking for statistically significant differences in the trajectories of each phenotypic group.
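The core idea of covariate-specific trajectories over a latent pseudotime can be sketched as follows. This is a minimal sketch under stated assumptions, not the proposed model: it assumes an RBF kernel with fixed hyperparameters, independent Gaussian processes per molecular feature, and a single binary covariate, and it only scores candidate pseudotimes by their marginal likelihood rather than inferring them. All function names and the simulated data are hypothetical.

```python
import numpy as np


def rbf_kernel(t, lengthscale=0.3, variance=1.0):
    """Squared-exponential covariance between all pairs of pseudotimes."""
    diff = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


def gp_log_marginal(Y, t, noise=0.1):
    """Log marginal likelihood of Y (patients x features) when each feature
    is an independent zero-mean GP over pseudotime t plus Gaussian noise."""
    n, n_features = Y.shape
    K = rbf_kernel(t) + noise ** 2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, Y)  # L^{-1} Y, so sum(alpha**2) = tr(Y^T K^{-1} Y)
    quad = np.sum(alpha ** 2)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (quad + n_features * logdet + n * n_features * np.log(2 * np.pi))


def covariate_adjusted_loglik(Y, t, groups, noise=0.1):
    """Fit a separate set of trajectories for each covariate group."""
    return sum(
        gp_log_marginal(Y[groups == c], t[groups == c], noise)
        for c in np.unique(groups)
    )


# Simulated cross-sectional cohort (assumption: two phenotypic groups whose
# features follow opposite trajectories along the same latent pseudotime).
rng = np.random.default_rng(0)
n, n_features = 40, 5
t_true = rng.uniform(0.0, 1.0, size=n)
groups = np.repeat([0, 1], n // 2)
phases = rng.uniform(0.0, np.pi, size=n_features)
sign = np.where(groups == 0, 1.0, -1.0)[:, None]
Y = sign * np.sin(2 * np.pi * t_true[:, None] + phases[None, :])
Y = Y + 0.1 * rng.normal(size=(n, n_features))

ll_true = covariate_adjusted_loglik(Y, t_true, groups)
ll_shuffled = covariate_adjusted_loglik(Y, rng.permutation(t_true), groups)
ll_pooled = gp_log_marginal(Y, t_true)  # ignores the covariate

print(f"true ordering, covariate-adjusted: {ll_true:.1f}")
print(f"shuffled ordering:                 {ll_shuffled:.1f}")
print(f"true ordering, pooled:             {ll_pooled:.1f}")
```

In this simulation the correct ordering with group-specific trajectories should score higher than either a shuffled ordering or a pooled model that ignores the covariate. A full latent variable model would additionally place a prior on the pseudotimes and infer them together with the kernel hyperparameters.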

This project will also develop novel approaches for integrating methods that can predict the functional annotation of somatic mutations and their impact on cellular programming. These models will provide probabilistic genome annotations that can be used as input to the pseudotime ordering algorithms. We will integrate all of this work and apply it to the Genomics England 100,000 Genomes Project.

Planned Impact

Precision medicine is an emerging approach for disease treatment and prevention that takes into account each person's individual variability in environment, lifestyle and genes. Its practical realisation is challenging, however, and one of the key obstacles is identifying sub-groups of individuals and then developing specific treatment options for each. The proposed research primarily falls within the remit of patient stratification, as we will be developing statistical algorithms to learn different patient disease progression trajectories. However, the ability of our models to identify molecular markers associated with each of these trajectories could also provide insight into the development of new therapeutic approaches.

The immediate beneficiary is the Genomics England 100,000 Genomes Project. I would share our research openly within the project and with its academic and non-academic partners, and work towards feeding back any useful knowledge to clinicians and patients via the national network of Genomic Medicine Centres that have been set up to support the project and are embedded within the NHS.

If successful, I would also work with clinical practitioners and research partners embedded within the University of Birmingham Institute for Translational Medicine to explore the direct clinical applications of this technology. The ability to extract useful temporal information from cross-sectional studies could significantly alter future approaches and decision making regarding clinical study design.

I would also work with our Technology Transfer Office to explore opportunities to involve commercial partners who could potentially embed this technology in a suitable genomic analysis platform for wider use in a healthcare environment that is governed by strict regulations. A significant aspect of the research involves fundamental technical development of statistical algorithms, which is an expensive activity for commercial environments,
