Combining epidemiological and phylogenetic models of infectious disease dynamics

Lead Research Organisation: University of Cambridge
Department Name: Veterinary Medicine

Abstract

Viruses such as influenza A and HIV-1 mutate extremely rapidly, so that the viruses in one individual are genetically different from those in another. While this presents significant hurdles to developing effective vaccines, the genetic variation can be used to determine how closely viruses in different individuals are related, and to generate a 'family tree', or phylogeny, of viruses. Viral phylogenies contain a great deal of information about the past spread of the virus, but this information is difficult to extract, because many factors affect viral transmission, such as the probability of infection per contact and the duration of infection.

This project will combine epidemiological models of infectious diseases, which are commonly used to consider how incidence and prevalence change over time, with models of viral evolution. We will endeavour to make our models as biologically realistic as possible, allowing us to consider the different pathways by which viruses may spread over a geographic area, as well as helping us to understand how the transmission of viruses may be affected by genetic changes in the virus.

Technical Summary

Coalescent models are commonly used to model the population dynamics of viruses using viral sequence data. This approach is attractive epidemiologically, as information on the past transmission dynamics of a virus can be obtained even from a single, cross-sectional sample of viruses. Coalescent models are also appealing computationally, as only the evolutionary past of the sample of sequences needs to be considered, rather than that of the population from which the sample was obtained. However, these coalescent models originate from considering the dynamics of single populations or species. While they can be fitted to viral sequence data, the resulting parameter estimates are extremely hard to interpret, as they are far removed from meaningful epidemiological quantities. For example, we have previously shown that the 'effective population size' of a viral epidemic is not, as is commonly assumed, proportional to the number of infected individuals; rather, it is related to both the incidence and the prevalence.
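To make the computational appeal concrete, the following is a minimal sketch of the standard constant-size (Kingman) coalescent, written in Python for illustration. It is not one of the project's models: the function name `simulate_coalescent_times` and the parameter values are hypothetical, and a constant effective population size is assumed. Only the lineages ancestral to the sample are tracked, never the whole population.

```python
import random


def simulate_coalescent_times(n_samples, effective_size, rng):
    """Return times (measured backwards from sampling) of the coalescent events.

    Assumes a single, constant-size population, i.e. the simplest setting
    from which these models originate.
    """
    times = []
    t = 0.0
    k = n_samples  # number of lineages ancestral to the sample
    while k > 1:
        # With k lineages there are k*(k-1)/2 pairs, and each pair
        # coalesces at rate 1/effective_size.
        rate = k * (k - 1) / (2.0 * effective_size)
        t += rng.expovariate(rate)
        times.append(t)
        k -= 1
    return times


rng = random.Random(1)  # fixed seed so the run is reproducible
times = simulate_coalescent_times(10, 1000.0, rng)
print(len(times), "coalescent events; time to common ancestor:", times[-1])
```

Under these assumptions the expected time to the most recent common ancestor of n samples is 2 * effective_size * (1 - 1/n). The difficulty described above is that, for an epidemic, the quantity playing the role of `effective_size` is not simply the number of infected individuals.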

We propose the development of evolutionary models that incorporate explicit models of viral transmission, considering factors such as geographic spread of viruses, differences in sampling effort, demographic stochasticity and selection. While progress has recently been made in some of these areas, we argue that the failure to consider the details of the transmission process may lead to incorrect conclusions being drawn. This additional flexibility in allowing biological realism comes at the cost of considering the evolutionary dynamics of the entire population via simulation-based techniques. Part of the research associated with addressing the main aims therefore involves alleviating at least some of this computational cost, partly through the adoption of approaches that parallelize easily, and partly through algorithmic development.

Planned Impact

Impact Summary

Development of the models will improve our understanding and interpretation of epidemiological dynamics obtained using viral sequence data. The proposed models will be designed to scale up to large sequence datasets using high-performance, parallel computing facilities, yet will provide a simple interface to scientists for their data analysis. Dissemination of results will be by the usual publication routes and conferences, but we will also particularly discuss our results with our collaborators on other projects.

Wider impact

The emergence or re-emergence of viral species or strains, whether by chance natural events, zoonotic transmissions or selection pressure due to therapeutic drug use or vaccination, can present a significant risk to human and/or animal populations. Consequently, it is important to track the transmission of viral pathogens through both space and time, and the proposed programme enables this. Ultimately, epidemiological insights resulting from analyses with the proposed software can be used by medical professionals and policy makers. By increasing understanding of pathogen evolution, this project may also have an economic impact, contributing to a reduction in the detrimental effects of infectious disease on human and animal health and productivity.

Impact timescales

The immediate impact of this research is likely to be a re-evaluation of previous results. During the methodological development associated with the project, we will also present new results on the molecular epidemiology of HIV, hepatitis C, and West Nile Virus. We will release software implementing the methods on a regular, incremental basis over the course of the project, to maximise the exposure of these approaches. We are confident that these methods will be well received by the scientific community, and towards the end of the project, we will provide training in the use of the methods. Such sessions are likely to be important in assessing the ongoing needs of the scientific community in these methods, beyond those of our own projects.
 
Description Molecular Epidemiology of Viruses Course
Geographic Reach Europe 
Policy Influence Type Influenced training of practitioners or researchers
Impact A course was run at the Gulbenkian Institute near Lisbon, Portugal, in 2015, to train researchers in the use of the R programming language for molecular epidemiological analysis of viral sequence data.
URL http://sdwfrost.github.io/mevr/
 
Description Cambridge-Africa Alborada Research Fund
Amount £12,991 (GBP)
Organisation University of Cambridge 
Department Alborada Research Fund
Sector Academic/University
Country United Kingdom
Start 09/2016 
End 09/2017
 
Description Genetics Society Summer Studentship
Amount £1,712 (GBP)
Organisation The Genetics Society 
Sector Charity/Non Profit
Country United Kingdom
Start 06/2014 
End 09/2014
 
Description International Exchanges Scheme
Amount £6,250 (GBP)
Funding ID IE160720 
Organisation The Royal Society 
Sector Charity/Non Profit
Country United Kingdom
Start 12/2016 
End 11/2017
 
Description Isaac Newton Trust
Amount £34,500 (GBP)
Funding ID 16.07(d) 
Organisation University of Cambridge 
Department Isaac Newton Trust
Sector Academic/University
Country United Kingdom
Start 06/2016 
End 06/2017
 
Description UK-Indonesia Joint Health Research
Amount £266,705 (GBP)
Funding ID MR/P017541/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 09/2017 
End 09/2019
 
Title Gillespie.jl 
Description This is a Julia library to simulate stochastic models (e.g. epidemiological models) in the Julia programming language. It is notable for both its simplicity and speed. 
Type Of Material Improvements to research infrastructure 
Year Produced 2016 
Provided To Others? Yes  
Impact The paper describing the release of this tool has already been cited twice, and the tool is increasingly being used for research purposes.
URL http://github.com/sdwfrost/Gillespie.jl
 
Title PDMP.jl 
Description This is a Julia library for simulating piecewise deterministic Markov processes, a general class of stochastic processes that allows one to simulate, for example, seasonally forced stochastic epidemiological models. 
Type Of Material Improvements to research infrastructure 
Year Produced 2016 
Provided To Others? Yes  
Impact None as yet 
URL http://github.com/sdwfrost/PDMP.jl
 
Title distributions.nim 
Description This is a library for the Nim programming language that provides the basic building blocks - random numbers and distributions - for simulating stochastic processes. 
Type Of Material Improvements to research infrastructure 
Year Produced 2016 
Provided To Others? Yes  
Impact None at present 
URL http://github.com/sdwfrost/distributions
 
Title liblsoda 
Description This is a refactoring of the widely used LSODA algorithm for numerical solution of ordinary differential equations. 
Type Of Material Improvements to research infrastructure 
Year Produced 2016 
Provided To Others? Yes  
Impact The library has already been incorporated into higher level libraries in R and Julia by other researchers. 
URL http://github.com/sdwfrost/liblsoda
 
Title libtn93 
Description This is a portable C library for calculating genetic distances between sequences according to the TN93 model of sequence evolution. It is notable for its speed, and that it can be used in conjunction with high level languages such as Python, R, and Julia. 
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact None as yet 
URL http://github.com/sdwfrost/libtn93
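
As an illustration of the quantity involved, the sketch below computes a pairwise distance using the standard published closed-form expression for the Tamura-Nei (1993) model. It is written in Python and is not taken from libtn93: the function name `tn93_distance`, the pooling of base frequencies over the two sequences, and the skipping of any site that is not an unambiguous A, C, G or T are all assumptions made for this example.

```python
import math

PURINES = frozenset("AG")


def tn93_distance(seq1, seq2):
    """TN93 distance between two aligned nucleotide sequences (a sketch)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: keep only sites where both bases are unambiguous.
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    n = len(pairs)

    # Count purine transitions (A<->G), pyrimidine transitions (C<->T)
    # and transversions, plus base counts pooled over both sequences.
    counts = {base: 0 for base in "ACGT"}
    p1 = p2 = q = 0
    for a, b in pairs:
        counts[a] += 1
        counts[b] += 1
        if a == b:
            continue
        a_pur, b_pur = a in PURINES, b in PURINES
        if a_pur and b_pur:
            p1 += 1
        elif not a_pur and not b_pur:
            p2 += 1
        else:
            q += 1
    if p1 + p2 + q == 0:
        return 0.0
    p1, p2, q = p1 / n, p2 / n, q / n

    f = {base: counts[base] / (2 * n) for base in "ACGT"}
    if min(f.values()) == 0:
        raise ValueError("formula undefined when a base is absent")
    f_r = f["A"] + f["G"]  # purine frequency
    f_y = f["C"] + f["T"]  # pyrimidine frequency

    k1 = 2 * f["A"] * f["G"] / f_r
    k2 = 2 * f["C"] * f["T"] / f_y
    k3 = 2 * (f_r * f_y - f["A"] * f["G"] * f_y / f_r
              - f["C"] * f["T"] * f_r / f_y)
    w1 = 1 - p1 / k1 - q / (2 * f_r)
    w2 = 1 - p2 / k2 - q / (2 * f_y)
    w3 = 1 - q / (2 * f_r * f_y)
    if min(w1, w2, w3) <= 0:
        raise ValueError("sequences too divergent for the TN93 formula")
    return -k1 * math.log(w1) - k2 * math.log(w2) - k3 * math.log(w3)


print(tn93_distance("AAAACCCCGGGGTTTT", "GAAATCCCAGGGCTTT"))
```

When the four base frequencies are equal and the two transition types are equally common, this expression reduces to the Kimura two-parameter distance, which gives a convenient sanity check.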
 
Title saphy: sequential analysis of phylogenies 
Description This is an R software library to analyse phylogenetic data in an 'on-line' fashion, with taxa added sequentially over time.
Type Of Material Improvements to research infrastructure 
Year Produced 2016 
Provided To Others? Yes  
Impact None at present 
URL http://github.com/hackout3/saphy
 
Title treeImbalance 
Description Phylogenetic trees of viruses sampled from different individuals provide clues to the dynamics of transmission. The extent to which the tree is asymmetric may be influenced by biological factors such as differences in infectiousness or contact rates between individuals, but also by nuisance factors such as the pattern of sampling. We have devised a simple statistical test for asymmetry, which controls for sampling patterns and potentially complex temporal dynamics by conditioning on the sampling and coalescence times in a phylogeny, and can also detect whether specific clades in the phylogeny drive patterns of asymmetry. We have developed an open-source R package for detecting asymmetry in time-sampled phylogenetic trees using this test. 
Type Of Material Improvements to research infrastructure 
Year Produced 2015 
Provided To Others? Yes  
Impact None as yet. 
URL https://github.com/bdearlove/treeImbalance
 
Title treedater 
Description This is a method to infer time-calibrated phylogenies from sequence data, developed in collaboration with Erik Volz at Imperial College London. 
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact None at present 
URL http://github.com/emvolz/treedater
 
Title PANGEA-HIV methods comparison 
Description The PANGEA-HIV consortium, funded by the Bill and Melinda Gates Foundation, is investigating the dynamics of HIV transmission in sub-Saharan Africa, and will generate a large number of full length HIV genomes to provide insights into transmission. As a prelude to the release of the data, simulated datasets have been generated by researchers at Imperial College London and the University of Edinburgh to provide a testbed for different methods. We have generated a database of analyses and 'metadata', such as reconstructed phylogenetic trees. 
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact The results of our analyses will be presented at an upcoming meeting in December, together with results from other groups. 
URL https://github.com/sdwfrost/pangea
 
Title Artificial Neural Networks for Viral Lineages (ANVIL) 
Description ANVIL is a software package to identify viral lineages based on a supervised set of sequences. ANVIL uses neural networks to infer the genotype of short segments of sequence, from which it can conclude the genotype of a virus, and determine whether there has been inter-subtype recombination. 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact None as yet 
URL https://github.com/asmmhossain/ANVIL
 
Title Gillespie.jl 
Description Gillespie.jl is a library for the programming language Julia that implements Gillespie's stochastic simulation algorithm, a widely used method for simulating stochastic processes.
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact None at present; the library is being used for research purposes 
URL https://github.com/sdwfrost/Gillespie.jl
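
For readers unfamiliar with the algorithm, the following is a minimal sketch of Gillespie's direct method applied to a stochastic SIR epidemic model. It is written in Python for illustration and does not reproduce the Gillespie.jl interface: the function name `gillespie_sir`, the choice of model, and the parameter values are all assumptions made for this example.

```python
import random


def gillespie_sir(beta, gamma, s0, i0, r0, t_max, rng):
    """Simulate an SIR model with Gillespie's direct method.

    Assumes gamma > 0, so the total event rate is positive while
    infected individuals remain.
    """
    t = 0.0
    s, i, r = s0, i0, r0
    n = s0 + i0 + r0
    trajectory = [(t, s, i, r)]
    while i > 0:
        infection_rate = beta * s * i / n
        recovery_rate = gamma * i
        total_rate = infection_rate + recovery_rate
        # The time to the next event is exponential with the total rate.
        t += rng.expovariate(total_rate)
        if t > t_max:
            break
        # Choose which event occurs, in proportion to its rate.
        if rng.random() * total_rate < infection_rate:
            s -= 1
            i += 1
        else:
            i -= 1
            r += 1
        trajectory.append((t, s, i, r))
    return trajectory


rng = random.Random(42)  # fixed seed so the run is reproducible
trajectory = gillespie_sir(beta=0.5, gamma=0.25, s0=990, i0=10, r0=0,
                           t_max=100.0, rng=rng)
print("events simulated:", len(trajectory) - 1)
print("final state (t, S, I, R):", trajectory[-1])
```

Each run gives one realisation of the epidemic. Repeating the simulation with different seeds shows the demographic stochasticity that deterministic models omit.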
 
Title OutbreakTools 
Description OutbreakTools is an R package for the analysis and visualisation of epidemiological data. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact None as yet 
URL https://sites.google.com/site/therepiproject/r-pac/about
 
Title PDMP.jl 
Description PDMP.jl is a library written in the Julia programming language to perform simulation of piecewise deterministic Markov processes; examples of this include stochastic simulations with time-varying rates and hybrid discrete/continuous systems. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact None as yet 
URL https://github.com/rveltz/PDMP.jl
 
Title Phlow 
Description We are developing a workflow to streamline phylogenetic analyses of viral sequence datasets 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact None at present; a publication describing the software is in preparation 
URL https://github.com/asmmhossain/phlow
 
Title Pipelign 
Description An automated pipeline for generating multiple sequence alignments of viral sequences. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact None as yet 
URL https://github.com/asmmhossain/pipelign
 
Title epiwidgets 
Description A collection of dynamic 'widgets' useful for visualising epidemiological data 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact None as yet 
URL https://github.com/sdwfrost/epiwidgets
 
Title mathmodels 
Description A collection of dynamic widgets for demonstrating mathematical models of epidemiology, genetics, and ecology 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact This was used in a sixth-form extension course for BSix College in Hackney. 
URL https://github.com/sdwfrost/mathmodels
 
Title merlin 
Description merlin is a software library for the R programming language to aid in the analysis of sequence data, particularly that of viruses, in molecular epidemiology studies 
Type Of Technology Software 
Year Produced 2013 
Impact This package was used as part of a training course in bioinformatics held in Cambridge 
URL https://r-forge.r-project.org/projects/merlin/
 
Title nextHIV 
Description nextHIV is a platform for real-time HIV surveillance that combines clinical data with HIV sequence data, automatically processes the data, and generates an interactive report that can be shared with public health officials and policy makers.
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact A new collaboration with researchers at UNC-Chapel Hill has begun to use nextHIV as a platform for an intervention trial to determine whether HIV sequence data can be harnessed for guiding prevention. 
URL https://github.com/sdwfrost/nextHIV
 
Title treeImbalance 
Description treeImbalance is a software library for the R programming language to calculate measures of imbalance in phylogenetic trees, and assess their statistical significance using permutation tests. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact None as yet; a manuscript employing this approach is currently in preparation 
URL https://github.com/bdearlove/treeImbalance
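
As an example of an imbalance measure, the sketch below computes the Colless index, a widely used statistic that sums, over all internal nodes of a binary tree, the absolute difference in the number of tips on either side. It is written in Python and is not the treeImbalance code: representing trees as nested two-element tuples and the function name `colless_index` are assumptions made for this example, and the permutation test is not shown.

```python
def colless_index(tree):
    """Colless imbalance of a binary tree given as nested 2-tuples.

    A tip is any non-tuple value; an internal node is a (left, right) tuple.
    """
    def visit(node):
        # Returns (number of tips below node, imbalance summed below node).
        if not isinstance(node, tuple):
            return 1, 0
        left, right = node
        n_left, imb_left = visit(left)
        n_right, imb_right = visit(right)
        return n_left + n_right, imb_left + imb_right + abs(n_left - n_right)

    return visit(tree)[1]


balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
print(colless_index(balanced), colless_index(caterpillar))
```

A fully balanced tree scores zero, while a maximally asymmetric ('caterpillar') tree with n tips scores (n - 1)(n - 2) / 2. Significance would then be assessed by comparing the observed value with values from permuted trees.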