Sequence data and the ecology of pathogens: phylogeny and beyond

Lead Research Organisation: Imperial College London
Department Name: Dept of Mathematics

Abstract

This proposal aims to improve our ability to infer the ecological processes shaping a pathogen's evolution by understanding pathogen phylogenies in a novel way. It is important to understand the ecology of pathogen spread. Indeed, ecological ideas have much to offer in understanding pathogens in particular: ecologists are accustomed to complex datasets without the opportunity for truly controlled experiments, and ecological concepts such as competition and competitive exclusion, niche adaptation, and habitat filtering are increasingly the paradigm of choice for understanding pathogen evolution. An example of a question for which ecological ideas are particularly relevant is that of how and why some pathogens evolve widespread drug resistance rapidly while others maintain long-term coexistence of resistant and sensitive strains.

Pathogen phylogenies contain a lot of information, in principle, both about the specifics of where certain strains or sequences originate and about the general underlying processes shaping when, where, and which pathogen strains are able to spread. Mathematicians have developed tools to create maximum-likelihood phylogenetic trees representing the estimated ancestral patterns of a dataset, as well as tools to simultaneously infer a phylogeny and the population's demographic history. However, existing tools offer no systematic approaches to infer the ecological context shaping a pathogen's spread or evolution. In addition, current methodologies use very limited metrics to assess phylogenies, so to the extent that approaches do exist to use genetic data to understand epidemiological and ecological processes, these are based on very little of the rich information in genetic data.

The work proposed here aims to fill the gap between the rich datasets of pathogen genomes being gathered and our ability to analyse them. I will first develop a suite of ways to summarise phylogenetic trees, taking the topology of the tree into account. Existing methods identify the probability of a tree with the probabilities of its branching times, neglecting the tree's topology. Developing informative measures is likely to be challenging because of the other factors that affect trees (stochastic transmission and mutation, selection, and others). For each new summary I aim to find its distribution on random trees drawn uniformly from tree space, to determine how rare a given tree is. In the second stage of the work I aim to improve inference of the underlying ecological processes shaping pathogen evolution, by better understanding what features of phylogenetic trees (including the novel summary measures developed in the first stage) make them able to account for observed data. Inference of ecological processes from phylogenies will carry some of the same challenges that occur in the inference of population demographics, one of these being that the number of possible trees on n leaves is too high for summing over all such trees to be feasible. Yet such sums are at the heart of likelihood-based inference. I propose to use the features identified in the first stage of the proposal, together with an improved understanding of how the likelihood of the data D given a tree G, L(D|G), is distributed over tree space, to simplify the sum and improve inference.
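To give a sense of scale for the sum over trees described above: assuming the trees in question are rooted, binary, and have n labelled leaves (an assumption; the abstract does not fix the tree class), their number is the double factorial (2n-3)!!. The Python sketch below, with a hypothetical function name, computes that count.

```python
def num_rooted_binary_trees(n_leaves: int) -> int:
    """Count rooted binary trees with n labelled leaves, i.e. (2n-3)!!.

    Assumes rooted, binary, leaf-labelled trees (illustrative assumption).
    """
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product when n <= 2).
    for odd in range(3, 2 * n_leaves - 2, 2):
        count *= odd
    return count
```

For example, num_rooted_binary_trees(10) is 34459425, which shows why exhaustive summation over trees quickly becomes infeasible.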

Planned Impact

The main immediate-term beneficiaries of this project are academic. However, I anticipate that the project will also benefit the government/public health sector, and ultimately the general public. The key government beneficiary is the Health Protection Agency (HPA). Internationally, groups analysing genomic data in public health settings, including the Centers for Disease Control and Prevention (USA) and the British Columbia Centre for Disease Control (BCCDC, in Canada, who recently released a high-profile paper on TB genomic data), are also likely beneficiaries. The HPA has indicated that over the next 5 years it is feasible for hospital settings to begin gathering genomic data in "real time", meaning as cases are detected; hospitals making use of genome sequencing for pathogens like methicillin-resistant Staphylococcus aureus (MRSA) will likely benefit from this work. I anticipate annual meetings with the Health Protection Agency to ensure that the new tools have impact in the public health domain. This work also has long-term implications for vaccine design, because of the role ecological competition can play in undermining the effects of vaccines through strain replacement. I am in contact with industrial and public health parties regarding vaccine effects and will disseminate results to them as appropriate.

There are additional benefits of this work for the broader public. This work combines mathematics with both systems biology (using whole-genome analyses) and epidemiology. Clearly infectious diseases are of great relevance to the general public and are frequently in the media. Research Councils have agreed that it is a priority to continue to engage the public regarding the ethical, legal and social issues (ELSI) associated particularly with synthetic biology but also broadly with systems biology. To ensure that the broader public has access to, and is engaged with, this work, I propose to make simulations of pathogen evolution available through intuitive web-based tools incorporating graphical representations of the transmission patterns, genomic data and phylogenetic trees. This will also aid academic beneficiaries. Because pathogens and drug resistance are frequently reported in the media, I anticipate that adding pathogens and evolution to this discussion will be an exciting aspect of this work from the public's point of view. For this reason I propose to hold public discussion events hosted at the Science Museum's Dana Centre. I have a track record of public engagement and look forward to continuing these activities.

Publications

Hall MD (2019) Transmission Trees on a Known Pathogen Phylogeny: Enumeration and Sampling. In Molecular Biology and Evolution

 
Description We have discovered a number of informative summary features of phylogenetic trees which allow inference of aspects of the evolutionary process of the system. These include the number of cherries, which we have shown has a very tight link to the basic reproduction number of an infection (under some assumptions). Informative features also include novel summaries derived from network science. We have developed a new test for neutral phylogenetic branching, which asks whether a branching event in a phylogenetic tree predicts that the descendants of that branching event will be the next to branch. We have compared the ability of different features to identify the underlying contact networks over which a pathogen is spreading. We find that when the network is dynamic (edges change in time), this task is much harder than when it is static. Real human contact networks are of course dynamic. The new neutral branching test appears to be the best at distinguishing underlying contact networks using pathogen phylogenies.
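As an illustration of one such summary feature: a cherry is a pair of leaves that share the same parent node. The sketch below counts cherries, assuming a binary tree stored as nested Python tuples with strings as leaves. This representation and the function name are chosen here for illustration and are not the project's own data format.

```python
def count_cherries(tree) -> int:
    """Count cherries (internal nodes whose two children are both leaves).

    Assumes a binary tree given as nested tuples, with strings as leaves.
    """
    if isinstance(tree, str):
        # A single leaf contains no cherry.
        return 0
    if len(tree) == 2 and all(isinstance(child, str) for child in tree):
        # Both children are leaves: this node is a cherry.
        return 1
    # Otherwise, add up the cherries found in each subtree.
    return sum(count_cherries(child) for child in tree)
```

A balanced four-leaf tree such as (("a", "b"), ("c", "d")) has two cherries, while the caterpillar tree ((("a", "b"), "c"), "d") has one, so the count already separates these two shapes.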

Alongside this work we have developed new metrics (distance functions) for comparing phylogenetic trees to each other. This work has already found exciting applications to data from bacteria, viruses and higher organisms. Tree comparisons can reveal distinct alternative patterns of evolution that are consistent with a set of data, and can resolve uncertainty in trees estimated from data. As trees are a key starting point in many evolutionary analyses, this is a powerful tool for many applications. We have extended the distance functions to apply to partially sampled transmission trees from outbreaks, as estimated by our Bayesian inference tool, TransPhylo.
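To make the idea of a distance function between trees concrete, here is a minimal sketch of one simple topology-based distance. It is an illustrative choice and not necessarily the metric implemented in the project's software. For every pair of leaves it records how many edges separate the root from the pair's most recent common ancestor (MRCA), then takes the Euclidean distance between the resulting vectors for the two trees. The nested-tuple tree representation and function names are assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def mrca_depths(tree) -> dict:
    """Map each unordered leaf pair to the depth (in edges) of its MRCA."""
    depths = {}

    def walk(node, depth):
        if isinstance(node, str):
            return [node]
        # Collect the leaves under each child of this node.
        child_leaves = [walk(child, depth + 1) for child in node]
        # Leaves under different children have this node as their MRCA.
        for left, right in combinations(child_leaves, 2):
            for a in left:
                for b in right:
                    depths[frozenset((a, b))] = depth
        return [leaf for leaves in child_leaves for leaf in leaves]

    walk(tree, 0)
    return depths


def tree_distance(t1, t2) -> float:
    """Euclidean distance between the MRCA-depth vectors of two trees."""
    d1, d2 = mrca_depths(t1), mrca_depths(t2)
    if d1.keys() != d2.keys():
        raise ValueError("trees must share the same leaf labels")
    return sqrt(sum((d1[pair] - d2[pair]) ** 2 for pair in d1))
```

Two trees with the same shape and labelling are at distance zero. Regrouping the leaves, for example comparing (("a", "b"), ("c", "d")) with (("a", "c"), ("b", "d")), gives a positive distance.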

We have developed the TransPhylo inference method to link timed phylogenetic trees (inferred from pathogen sequence data) to transmission trees (which describe who infected whom in a disease outbreak). The MCMC method we have designed relies on a novel tree colouring approach, and allows the user to obtain an estimate of who likely infected whom together with the uncertainty in that inference. We have recently extended the approach to capture a wide range of possible timing scenarios, and to estimate where there may be cases that have not been sampled by the public health practitioners analysing the outbreak. This tool is being applied widely.
Exploitation Route The MCMC inference of transmission is being used as part of outbreak response analyses that include genomic data. The tree metrics are being used in data analysis ranging from comparisons of gene trees to descriptions of Bayesian tree posteriors in wide-ranging datasets. The summary features are applicable to approximate Bayesian computation using genomic data from pathogens.
Sectors Environment, Healthcare, Pharmaceuticals and Medical Biotechnology

 
Description Methods to use pathogen sequence data to infer transmission are being used by public health bodies to interpret outbreak data. The tree comparison tools we have developed are being used by several public health bodies to analyse genomic data for a range of pathogens.
First Year Of Impact 2015
Sector Healthcare
Impact Types Policy & public services

 
Description Chair of advisory board for Statistics and Applied Mathematics at Bath (SAMBa) CDT
Geographic Reach Local/Municipal/Regional 
Policy Influence Type Participation in an advisory committee
Impact The CDT trains a highly effective workforce at the interface of statistics and applied mathematics, with strong industry involvement throughout the PhD training.
 
Description Collaboration on inhomogeneous branching processes 
Organisation University of Bath
Country United Kingdom 
Sector Academic/University 
PI Contribution We provided novel problems arising from pathogen phylogenies.
Collaborator Contribution We obtained advice on mathematical methods from branching processes.
Impact Multi-disciplinary: probability, statistics, mathematical biology, epidemiology.
Start Year 2014
 
Description Collaboration with Public Health England 
Organisation Public Health England
Country United Kingdom 
Sector Public 
PI Contribution We provide methods to analyse pathogen sequence data.
Collaborator Contribution They provide pathogen sequence data, relevant questions to ask, and interpretation of aspects of the data.
Impact The collaboration is multi-disciplinary: applied mathematics; epidemiology; bioinformatics; statistics.
Start Year 2013
 
Description Collaboration with Public Health England (TB growth) 
Organisation Public Health England
Country United Kingdom 
Sector Public 
PI Contribution We provide statistical and mathematical models.
Collaborator Contribution They provide data on growth in chemostat, and sequence data from chemostat cultures.
Impact Multi-disciplinary: applied mathematics; statistics; microbiology
Start Year 2014
 
Description Jason Andrews, Julio Croda TB in Brazil 
Organisation Federal University of Grande Dourados
Country Brazil 
Sector Academic/University 
PI Contribution I will consult on an NIH-funded project on controlling TB in Brazilian prisons.
Collaborator Contribution The partners are leading the project, which is an epidemiology and public health project primarily aiming to improve TB control.
Impact Multi-disciplinary: mathematics and computational biology (me); epidemiology, public health, medicine.
Start Year 2016
 
Description Jason Andrews, Julio Croda TB in Brazil 
Organisation Stanford University
Department Graduate School of Business
Country United States 
Sector Academic/University 
PI Contribution I will consult on an NIH-funded project on controlling TB in Brazilian prisons.
Collaborator Contribution The partners are leading the project, which is an epidemiology and public health project primarily aiming to improve TB control.
Impact Multi-disciplinary: mathematics and computational biology (me); epidemiology, public health, medicine.
Start Year 2016
 
Description USAID TB Control 
Organisation United States Agency for International Development
Country United States 
Sector Public 
PI Contribution I consulted for USAID and attended a meeting in Washington DC in 2015 to participate in discussions on developing the role of whole-genome sequencing in TB control programmes.
Collaborator Contribution USAID has agreed in principle to fund a study and capacity-building project in Moldova where we will collect and analyse approximately 2400 TB isolates.
Impact No papers have been published yet. The work is multidisciplinary, spanning mathematics, computational biology, epidemiology and medicine.
Start Year 2015
 
Title OutbreakTools 
Description R package for outbreak analysis 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact Improves outbreak analysis by allowing multiple data sources, visualisation and statistical modelling in one platform. 
URL http://sites.google.com/site/therepiproject/r-pac/about
 
Title TransPhylo 
Description TransPhylo performs Bayesian inference of transmission using phylogenetic data, capturing uncertainty. 
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact TransPhylo is attracting international interest and leading to numerous collaborations. 
URL https://github.com/xavierdidelot/TransPhylo
 
Title Treescape 
Description R package for tree comparison. Note: owing to a copyright issue discovered after release, the package is being renamed from "treescape" to "treespace".
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Easy comparison of phylogenetic trees reveals distinct alternative patterns of evolution in many datasets. 
URL https://cran.r-project.org/web/packages/treescape/index.html