A genealogical approach to tracking bacterial transmission

Lead Research Organisation: University of Warwick
Department Name: School of Life Sciences

Abstract

Bacteria are extremely diverse organisms: some live freely in the oceans, some live in the soil, some infect animals, some colonize the human gut, where we could not live without them, whereas others cause serious human diseases. In that last category, we find the causes of some of the most deadly threats that mankind has ever faced, such as the Black Death epidemic that killed half of the European population in the Middle Ages, or the tuberculosis global pandemic that currently affects a third of the worldwide population, killing over a million people every year. Many of these bacterial pathogens are transmitted directly from human to human, and to design control measures that can limit their spread it is necessary to understand more precisely how transmission happens. For example, bacterial infections are often uncovered during hospital visits, but it is not always clear whether transmission occurred in the hospital or beforehand.

Such questions could easily be investigated if we knew the transmission tree, that is, a tree indicating who infected whom and when. Reconstructing this transmission tree is an important aim in traditional infectious disease epidemiology, but a new complementary strategy is starting to emerge based on bacterial genome sequencing. In the past few years, sequencing technology has made huge progress in cost, speed and accuracy, to the point that very large numbers of genomes can now be sequenced. This genomic revolution enables a new 'genomic epidemiology' strategy to track transmission. The basic idea is that if a pathogen has been transmitted between two hosts, the pathogen genomes sampled from those hosts should be very similar.
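The basic idea can be illustrated with a minimal sketch that counts the positions at which aligned pathogen genomes from different hosts differ. The host names and the short sequences below are hypothetical toy data, and a simple difference count is only a stand-in for similarity; it is not the analysis proposed in this project.

```python
from itertools import combinations


def snp_distance(a: str, b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(genomes: dict[str, str]) -> dict[tuple[str, str], int]:
    """Return the distance for every unordered pair of hosts."""
    return {
        (h1, h2): snp_distance(genomes[h1], genomes[h2])
        for h1, h2 in combinations(sorted(genomes), 2)
    }


# Hypothetical toy alignment: one short sequence per host.
genomes = {
    "host_A": "ACGTACGTAC",
    "host_B": "ACGTACGTAT",
    "host_C": "TCGAACGGAT",
}
distances = pairwise_distances(genomes)
closest_pair = min(distances, key=distances.get)
```

Here the pair with the smallest distance would be the most plausible candidate for a direct transmission link, although similarity alone cannot say who infected whom.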

This genomic strategy has already been implemented in a few studies, which showed that genomics can often help determine whether transmission happened. However, to unlock the full potential of the genomic approach, it is necessary to consider both the between-host transmission process and the within-host genomic evolution process. New analytical methods can then be implemented to reconstruct a transmission tree from genomic data in a rigorous and flexible way, including quantification of uncertainty in the output. To ensure that the newly developed approach reliably gives correct answers, simulations will be used in which an artificial dataset is generated with a known transmission tree, so that the accuracy of the reconstruction can be measured.

The genomic approach will also be applied to several state-of-the-art datasets of important human pathogens for which transmission routes are still unclear. Most importantly, the method will be released freely online as a user-friendly software package. This will enable other scientists to apply the new analytical approach to their own datasets, to help them understand transmission of many pathogens and settings, and thus inform which infectious disease control measures are likely to be successful in improving public health.

Technical Summary

Whole genome sequencing (WGS) data is increasingly accepted as a powerful approach to track host-to-host transmission of bacterial infections. However, to analyse WGS data correctly, it is necessary to account for both within-host evolution and between-host transmission. Here we propose a two-step pipeline to perform such analysis: first a timed phylogeny is reconstructed based on the time-stamped WGS data, and then this tree is subdivided (i.e. 'coloured') into parts corresponding to the evolution going on in different hosts.
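As a rough illustration of what a 'coloured' phylogeny encodes, the sketch below labels each node of a small timed tree with the host in which that lineage resides and reads off transmission events wherever the colour changes between a parent and its child. The `Node` class, the node-level colouring and the toy tree are assumptions for this example; in a full method the colour change would be placed at some point along the branch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    time: float  # time of the node (e.g. sampling date for a leaf)
    host: str    # the 'colour': host in which this lineage resides
    children: list["Node"] = field(default_factory=list)


def transmission_events(root: Node) -> list[tuple[str, str]]:
    """Return (infector, infectee) pairs implied by colour changes."""
    events = []
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            if child.host != node.host:
                events.append((node.host, child.host))
            stack.append(child)
    return events


# Hypothetical coloured tree with three sampled hosts A, B and C.
tree = Node("root", 0.0, "A", [
    Node("leaf_a", 3.0, "A"),
    Node("inner", 2.0, "B", [
        Node("leaf_b", 4.0, "B"),
        Node("leaf_c", 5.0, "C"),
    ]),
])
events = transmission_events(tree)
```

In this toy tree the colouring implies that host A infected host B, and host B infected host C.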

For the first step, we propose to exploit the ClonalFrame model that includes the effect of recombination, and to use recently developed maximum likelihood techniques to efficiently reconstruct a timed phylogeny under this model from WGS data.

For the second step, we propose to develop a Markov chain Monte Carlo (MCMC) algorithm that can colour a phylogeny while accounting for the possibility of missing transmission links, realistic epidemiological models, varying within-host pathogen population dynamics, and the possibility of an incomplete transmission bottleneck. This MCMC algorithm will also be extended to use other epidemiological information, and to allow some hosts to be sampled multiple times.
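To show the general shape of an MCMC sampler over transmission hypotheses, here is a toy Metropolis-Hastings sketch. It does not colour a phylogeny and is not the algorithm proposed here: it assumes that hosts are indexed in a known order of infection, that all hosts are sampled, and that the unnormalised posterior of a link is exp(-distance) for a hypothetical genetic distance matrix `dist`.

```python
import math
import random
from collections import Counter


def sample_infectors(dist, n_iter=5000, seed=0):
    """Sample infector assignments; host 0 is the index case.

    Assumed toy target: product over hosts of exp(-dist[host][infector]).
    """
    rng = random.Random(seed)
    n = len(dist)
    infector = {h: 0 for h in range(1, n)}  # start: all infected by host 0
    counts = {h: Counter() for h in range(1, n)}
    for _ in range(n_iter):
        h = rng.randrange(1, n)   # pick a non-index host
        proposal = rng.randrange(h)  # symmetric proposal among earlier hosts
        log_ratio = dist[h][infector[h]] - dist[h][proposal]
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            infector[h] = proposal
        # Record the current state (no burn-in, for brevity).
        for host, source in infector.items():
            counts[host][source] += 1
    return counts


# Hypothetical distances between three hosts.
dist = [
    [0, 1, 6],
    [1, 0, 1],
    [6, 1, 0],
]
counts = sample_infectors(dist, n_iter=5000, seed=1)
most_likely_source_of_2 = counts[2].most_common(1)[0][0]
```

The visit counts approximate posterior probabilities for each candidate infector, which is how an MCMC approach can quantify uncertainty rather than report a single tree.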

The new pipeline will be applied to both simulated and real datasets, to guarantee its correctness, efficiency, relevance and general applicability to the large genomic datasets that are increasingly becoming available. The approach will be implemented in software that will be distributed freely and open source via the Internet. This software will be developed so that it does not require bioinformatics expertise to be used. This will guarantee that the new approach is available for microbiologists to apply to a wide range of epidemiological investigations.

Planned Impact

The project will generate impact for a wide range of academic and non-academic beneficiaries. In particular:

(1) Academic researchers who work in bacterial genomic epidemiology. This includes researchers with a wide variety of backgrounds, ranging from statistics and infectious diseases to bioinformatics and microbiology. These researchers typically have specific epidemiological questions they want to investigate about a given organism, and have collected large amounts of genomic data to answer them. They will directly benefit from the software implementation of our proposed approach for tracking the transmission of bacterial pathogens, since they will be able to directly apply it to their data.

(2) Non-academic microbiologists and health professionals. As the new method is applied to a variety of systems, this will reveal new insights into the transmission of many important bacterial pathogens. Some of this impact will originate from our own proposed applications of the method to four important bacterial pathogens, but most of the impact will come from applications of the methodology by other academic researchers rather than ourselves. A better understanding of how bacterial pathogens spread from host to host will underpin new infectious disease control measures.

(3) The general public. Application of the proposed methodology will create a better understanding of the evolution and epidemiology of many bacterial pathogens. This will allow more effective measures to be taken to limit their burden on public health and therefore benefit the general public.

Individuals in the first category will be directly impacted by the proposed research, even before the research programme is completed, since they will be able to download and apply the software while it is still in development. Members of the second and third categories will be less directly and less immediately impacted by the proposed research than members of the first category, but this impact could nevertheless be far reaching. For example, the applicant previously developed the software ClonalFrame to infer relationships between bacterial isolates in a way that accounts for recombination, and this software had its most direct impact on the academic researchers who applied it to their molecular datasets. Over the past few years, there have been over 400 published studies that have applied ClonalFrame to a wide range of bacterial species. These studies have had a significant impact on our understanding of the population biology and epidemiology of these species, many of which are important human pathogens.
