Sequence data and the ecology of pathogens: phylogeny and beyond
Lead Research Organisation:
Imperial College London
Department Name: Dept of Mathematics
Abstract
This proposal aims to improve our ability to infer the ecological processes shaping a pathogen's evolution by understanding pathogen phylogenies in a novel way. It is important to understand the ecology of pathogen spread. Indeed, ecological ideas have much to offer in understanding pathogens in particular: ecologists are accustomed to complex datasets without the opportunity for truly controlled experiments, and ecological concepts such as competition and competitive exclusion, niche adaptation, and habitat filtering are increasingly the paradigm of choice for understanding pathogen evolution. An example of a question for which ecological ideas are particularly relevant is that of how and why some pathogens evolve widespread drug resistance rapidly while others maintain long-term coexistence of resistant and sensitive strains.
Pathogen phylogenies contain a lot of information, in principle, both about the specifics of where certain strains or sequences originate and about the general underlying processes shaping when, where, and which pathogen strains are able to spread. Mathematicians have developed tools to create maximum-likelihood phylogenetic trees representing the estimated ancestral patterns of a dataset, as well as tools to simultaneously infer a phylogeny and the population's demographic history. However, existing tools offer no systematic approaches to infer the ecological context shaping a pathogen's spread or evolution. In addition, current methodologies use very limited metrics to assess phylogenies, so to the extent that approaches do exist to use genetic data to understand epidemiological and ecological processes, these are based on very little of the rich information in genetic data.
The work proposed here aims to fill the gap between the rich datasets of pathogen genomes being gathered and our ability to analyse them. I will first develop a suite of ways to summarise phylogenetic trees, taking the topology of the tree into account. Existing methods identify the probability of a tree with the probabilities of its branching times, neglecting the tree's topology. Developing informative measures is likely to be challenging because of the other factors that affect trees (stochastic transmission and mutation, selection, and others). For each new summary I aim to find its distribution on random trees drawn uniformly from tree space, to determine how rare a given tree is. In the second stage of the work I aim to improve inference of the underlying ecological processes shaping pathogen evolution, by better understanding what features of phylogenetic trees (including the novel summary measures developed in the first stage) make them able to account for observed data. Inference of ecological processes from phylogenies will carry some of the same challenges that occur in the inference of population demographics, one of these being that the number of possible trees on n leaves is too high for summing over all such trees to be feasible. Yet such sums are at the heart of likelihood-based inference. I propose to use the features identified in the first stage of the proposal, together with an improved understanding of how the likelihood of the data D given a tree G, L(D|G), is distributed over tree space, to simplify the sum and improve inference.
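To give a sense of why summing over all trees is infeasible, the short sketch below counts tree topologies as the number of leaves grows. It assumes rooted, binary trees with labelled leaves, for which the count is the double factorial (2n-3)!!; the abstract does not say which class of trees is meant, and the function name is purely illustrative.

```python
def count_rooted_binary_trees(n_leaves: int) -> int:
    """Count rooted, binary, leaf-labelled tree topologies on n_leaves leaves.

    Assumption (not stated in the abstract): trees are rooted, binary and
    leaf-labelled, so the count is (2n-3)!! = 1 * 3 * 5 * ... * (2n-3).
    """
    if n_leaves < 1:
        raise ValueError("a tree needs at least one leaf")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n <= 2).
    for odd in range(3, 2 * n_leaves - 2, 2):
        count *= odd
    return count
```

Under this assumption there are 15 topologies on 4 leaves and 34,459,425 on 10 leaves, and the count exceeds 10^21 by 20 leaves, so any practical method has to restrict or approximate the sum.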
Planned Impact
The main immediate-term beneficiaries of this project are academic. However, I anticipate that the project will also benefit the government/public health sector, and ultimately the general public. The key government beneficiary is the Health Protection Agency (HPA). Internationally, groups analysing genomic data in public health settings, including the Centers for Disease Control and Prevention (USA) and the British Columbia Centre for Disease Control (BCCDC, in Canada, who recently released a high-profile paper on TB genomic data), are also likely beneficiaries. The HPA has indicated that over the next 5 years it is feasible for hospital settings to begin gathering genomic data in "real time", meaning as cases are detected; hospitals making use of genome sequencing for pathogens like methicillin-resistant Staphylococcus aureus (MRSA) will likely benefit from this work. I anticipate annual meetings with the Health Protection Agency to ensure that the new tools have impact in the public health domain. This work also has long-term implications for vaccine design, because of the role ecological competition can play in undermining the effects of vaccines through strain replacement. I am in contact with industrial and public health parties regarding vaccine effects and will disseminate results to them as appropriate.
There are additional benefits of this work for the broader public. This work combines mathematics with both systems biology (using whole-genome analyses) and epidemiology. Infectious diseases are of great relevance to the general public and are frequently in the media. Research Councils have agreed that it is a priority to continue to engage the public regarding the ethical, legal and social issues (ELSI) associated particularly with synthetic biology but also broadly with systems biology. To ensure that the broader public has access to, and is engaged with, this work, I propose to make simulations of pathogen evolution available through intuitive web-based tools incorporating graphical representations of the transmission patterns, genomic data and phylogenetic trees. This will also aid academic beneficiaries. Because pathogens and drug resistance are frequently reported in the media, I anticipate that adding pathogens and evolution to this discussion will be an exciting aspect of this work from the public's point of view. For this reason I propose to hold public discussion events hosted at the Science Museum's Dana Centre. I have a track record of public engagement and look forward to continuing these activities.
Organisations
- Imperial College London, United Kingdom (Lead Research Organisation)
- Federal University of Grande Dourados (Collaboration)
- Public Health England, Salisbury (Collaboration)
- United States Agency for International Development (Collaboration)
- Stanford University, United States (Collaboration)
- University of Bath, United Kingdom (Collaboration)
- Simon Fraser University, Canada (Fellow)
People |
ORCID iD |
Caroline Colijn (Principal Investigator / Fellow) |
Publications

Yang C
(2018)
Internal migration and transmission dynamics of tuberculosis in Shanghai, China: an epidemiological, spatial, genomic analysis.
in The Lancet. Infectious diseases



Walter KS
(2020)
Genomic variant-identification methods may alter Mycobacterium tuberculosis transmission inferences.
in Microbial genomics

Stimson J
(2019)
Beyond the SNP Threshold: Identifying Outbreak Clusters Using Inferred Transmissions.
in Molecular biology and evolution

Sartelli M
(2016)
Antimicrobials: a global alliance for optimizing their rational use in intra-abdominal infections (AGORA).
in World journal of emergency surgery : WJES

Sartelli M
(2017)
Erratum to: Antimicrobials: a global alliance for optimizing their rational use in intra-abdominal infections (AGORA).
in World journal of emergency surgery : WJES

Ratmann O
(2017)
HIV-1 full-genome phylogenetics of generalized epidemics in sub-Saharan Africa: impact of missing nucleotide characters in next-generation sequences.
in AIDS research and human retroviruses

Ratmann O
(2017)
Phylogenetic Tools for Generalized HIV-1 Epidemics: Findings from the PANGEA-HIV Methods Comparison.
in Molecular biology and evolution

Plazzotta G
(2016)
Effects of memory on the shapes of simple outbreak trees.
in Scientific reports
Description | We have discovered a number of informative summary features of phylogenetic trees which allow inference of aspects of the evolutionary process of the system. These include the number of cherries, which we have shown has a very tight link to the basic reproduction number of an infection (under some assumptions). Informative features also include novel summaries derived from network science. We have developed a new test for neutral phylogenetic branching: does branching now in a phylogenetic tree predict that one's descendants are next to branch? We have compared the ability of different features to identify the underlying contact networks over which a pathogen is spreading. We find that when the network is dynamic (edges change in time), this task is much harder than when it is static. Real human contact networks are of course dynamic. The new neutral branching test appears to be the best at distinguishing underlying contact networks using pathogen phylogenies. Alongside this work we have developed new metrics (distance functions) for comparing phylogenetic trees to each other. This work has already found applications in data from bacteria, viruses and higher organisms. Tree comparisons can reveal distinct alternative patterns of evolution that are consistent with a set of data, and can resolve uncertainty in trees estimated from data. As trees are a key starting point in many evolutionary analyses, this is a powerful tool for many applications. We have extended the distance functions to apply to partially sampled transmission trees from outbreaks, as estimated by our Bayesian inference tool, TransPhylo. We have developed the TransPhylo inference method to link timed phylogenetic trees (inferred from pathogen sequence data) to transmission trees (which describe who infected whom in a disease outbreak). The MCMC method we have designed relies on a novel tree colouring approach, and allows the user to obtain an estimate of who likely infected whom together with the uncertainty in that inference. We have recently extended the approach to capture a wide range of possible timing scenarios, and to estimate where there may be cases that have not been sampled by the public health practitioners analysing the outbreak. This tool is being applied widely. |
Exploitation Route | The MCMC inference of transmission is being used as part of outbreak response analyses that comprise genomic data. The tree metrics are being used in data analysis ranging from comparisons of gene trees to descriptions of Bayesian tree posteriors in wide-ranging datasets. The summary features are applicable to approximate Bayesian computation using genomic data from pathogens. |
Sectors | Environment, Healthcare, Pharmaceuticals and Medical Biotechnology |
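As an illustration of the simplest summary feature mentioned in the key findings, the sketch below counts cherries (internal nodes whose two children are both tips) in a rooted binary tree. The nested-tuple tree representation and the function name are assumptions made for this example only; they are not taken from the project's software.

```python
from typing import Tuple, Union

# Hypothetical representation: a tip is a string label, an internal node
# is a pair (left_subtree, right_subtree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def count_cherries(tree: Tree) -> int:
    """Return the number of cherries in a rooted binary tree."""
    if isinstance(tree, str):
        # A single tip contains no cherry.
        return 0
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        # Both children are tips: this node is a cherry.
        return 1
    # Otherwise, cherries can only occur further down in the subtrees.
    return count_cherries(left) + count_cherries(right)


# Two four-tip trees with different shapes.
caterpillar: Tree = ((("a", "b"), "c"), "d")   # maximally unbalanced
balanced: Tree = (("a", "b"), ("c", "d"))      # fully balanced
```

Here `count_cherries(caterpillar)` is 1 and `count_cherries(balanced)` is 2, which shows how a single number can already separate tree shapes that have the same tips.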
Description | Methods to use pathogen sequence data to infer transmission are being used by public health bodies to interpret outbreak data. The tree comparison tools we have developed are being used by several public health bodies to analyse genomic data for a range of pathogens. |
First Year Of Impact | 2015 |
Sector | Healthcare |
Impact Types | Policy & public services |
Description | Chair of advisory board for Statistics and Applied Mathematics at Bath (SAMBa) CDT |
Geographic Reach | Local/Municipal/Regional |
Policy Influence Type | Participation in a guidance/advisory committee |
Impact | The CDT trains a highly effective workforce at the interface of statistics and applied mathematics, with strong industry involvement throughout the PhD training. |
Description | Collaboration on inhomogeneous branching processes |
Organisation | University of Bath |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | We provided novel problems arising from pathogen phylogenies. |
Collaborator Contribution | We obtained advice on mathematical methods from branching processes. |
Impact | Multi-disciplinary: probability, statistics, mathematical biology, epidemiology. |
Start Year | 2014 |
Description | Collaboration with Public Health England |
Organisation | Public Health England |
Country | United Kingdom |
Sector | Public |
PI Contribution | We provide methods to analyse pathogen sequence data. |
Collaborator Contribution | They provide pathogen sequence data, relevant questions to ask, and interpretation of aspects of the data. |
Impact | The collaboration is multi-disciplinary: applied mathematics; epidemiology; bioinformatics; statistics. |
Start Year | 2013 |
Description | Collaboration with Public Health England (TB growth) |
Organisation | Public Health England |
Country | United Kingdom |
Sector | Public |
PI Contribution | We provide statistical and mathematical models. |
Collaborator Contribution | They provide data on growth in chemostat, and sequence data from chemostat cultures. |
Impact | Multi-disciplinary: applied mathematics; statistics; microbiology |
Start Year | 2014 |
Description | Jason Andrews, Julio Croda TB in Brazil |
Organisation | Federal University of Grande Dourados |
Country | Brazil |
Sector | Academic/University |
PI Contribution | I will consult on an NIH-funded project on controlling TB in Brazilian prisons. |
Collaborator Contribution | The partners are leading the project, which is an epidemiology and public health project primarily aiming to improve TB control. |
Impact | Multi-disciplinary: mathematics and computational biology (me); epidemiology, public health, medicine. |
Start Year | 2016 |
Description | Jason Andrews, Julio Croda TB in Brazil |
Organisation | Stanford University |
Department | Graduate School of Business |
Country | United States |
Sector | Academic/University |
PI Contribution | I will consult on an NIH-funded project on controlling TB in Brazilian prisons. |
Collaborator Contribution | The partners are leading the project, which is an epidemiology and public health project primarily aiming to improve TB control. |
Impact | Multi-disciplinary: mathematics and computational biology (me); epidemiology, public health, medicine. |
Start Year | 2016 |
Description | USAID TB Control |
Organisation | United States Agency for International Development |
Country | United States |
Sector | Public |
PI Contribution | I consulted for USAID and attended a meeting in Washington DC in 2015 to participate in discussions on developing the role of whole-genome sequencing in TB control programmes. |
Collaborator Contribution | USAID has agreed in principle to fund a study and capacity-building project in Moldova where we will collect and analyse approximately 2400 TB isolates. |
Impact | No papers have been published yet. The work is multidisciplinary, spanning mathematics, computational biology, epidemiology and medicine. |
Start Year | 2015 |
Title | OutbreakTools |
Description | R package for outbreak analysis |
Type Of Technology | Webtool/Application |
Year Produced | 2014 |
Impact | Improves outbreak analysis by allowing multiple data sources, visualisation and statistical modelling in one platform. |
URL | http://sites.google.com/site/therepiproject/r-pac/about |
Title | TransPhylo |
Description | TransPhylo performs Bayesian inference of transmission using phylogenetic data, capturing uncertainty. |
Type Of Technology | Webtool/Application |
Year Produced | 2016 |
Impact | TransPhylo is attracting international interest and leading to numerous collaborations. |
URL | https://github.com/xavierdidelot/TransPhylo |
Title | Treescape |
Description | R package for tree comparison. Note: owing to a copyright issue, the package is being renamed from "treescape" to "treespace". |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | Easy comparison of phylogenetic trees reveals distinct alternative patterns of evolution in many datasets. |
URL | https://cran.r-project.org/web/packages/treescape/index.html |
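To make the idea of a distance function between trees concrete, here is a minimal Python sketch. It is an illustrative assumption rather than the R package's interface or its exact metric: each rooted tree is summarised by the depth (number of edges from the root) of the most recent common ancestor of every pair of tips, and two trees on the same tips are compared by the Euclidean distance between those vectors. Branch lengths are ignored, and the nested-tuple representation and all function names are hypothetical.

```python
import math
from typing import Dict, FrozenSet, List, Optional, Tuple, Union

# Hypothetical representation: a tip is a string label, an internal node
# is a pair (left_subtree, right_subtree).
Tree = Union[str, Tuple["Tree", "Tree"]]
PairDepths = Dict[FrozenSet[str], int]


def tips(tree: Tree) -> List[str]:
    """List the tip labels below a node."""
    if isinstance(tree, str):
        return [tree]
    return tips(tree[0]) + tips(tree[1])


def mrca_depths(tree: Tree, depth: int = 0,
                out: Optional[PairDepths] = None) -> PairDepths:
    """Map each unordered pair of tips to the depth of its MRCA."""
    if out is None:
        out = {}
    if isinstance(tree, str):
        return out
    left, right = tree
    # Tips on opposite sides of this node have this node as their MRCA.
    for a in tips(left):
        for b in tips(right):
            out[frozenset((a, b))] = depth
    mrca_depths(left, depth + 1, out)
    mrca_depths(right, depth + 1, out)
    return out


def tree_distance(t1: Tree, t2: Tree) -> float:
    """Euclidean distance between the MRCA-depth vectors of two trees."""
    d1, d2 = mrca_depths(t1), mrca_depths(t2)
    if d1.keys() != d2.keys():
        raise ValueError("trees must have the same set of tip labels")
    return math.sqrt(sum((d1[pair] - d2[pair]) ** 2 for pair in d1))


tree_a: Tree = (("a", "b"), ("c", "d"))
tree_b: Tree = (("a", "c"), ("b", "d"))
```

With these inputs, `tree_distance(tree_a, tree_a)` is 0.0 and `tree_distance(tree_a, tree_b)` is 2.0. Swapping the left and right children of any node leaves the distance unchanged, because only the tip pairs and their MRCA depths enter the comparison.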