Bayesian Inference under the Structured Coalescent Model

Lead Research Organisation: University of Warwick
Department Name: Statistics

Abstract

The coalescent is a population genetics model for the inheritance of genetic material over time. Genetic sequences are sampled at known times and, based on genetic similarities between sequences, the times until pairs of lineages coalesce at a common ancestor can be estimated. An important factor that the ordinary coalescent model does not account for is spatial structure. For example, if two lineages were separated geographically at some time in the past, they cannot find a common ancestor until both lineages exist in a common location. This motivates an extension of the ordinary coalescent that incorporates spatial constraints, known as the structured coalescent. Individuals are assumed to exist in a fixed, and possibly unknown, number of distinct demes, with migrations occurring between demes at fixed rates backwards in time. A dataset consists of a number of genomes sampled at various timepoints and from various demes. From such a dataset, we would like to infer several evolutionary parameters of the structured coalescent model, including the migration rates between demes and the effective population size of each deme. The migration history that led to the current locations of the samples is also often of interest. The uncertainty in at least some of these parameters is likely to be important, which motivates a Bayesian approach to inference. Current methods to infer these parameters are either computationally expensive or rely on approximations of the structured coalescent in place of the full model, which can introduce significant biases (Muller et al., 2017).
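
To make the spatial constraint concrete, the following is a minimal sketch of the event rates in a structured coalescent viewed backwards in time. It assumes a common parameterisation that the abstract does not itself specify: each pair of lineages in deme i coalesces at rate 1/theta_i, and each lineage in deme i migrates to deme j at rate lambda_ij. The function and argument names are hypothetical and chosen for illustration only.

```python
import random

def event_rates(lineages, pop_sizes, mig_rates):
    """Backward-in-time event rates (assumed parameterisation).

    lineages[i]     : number of lineages currently in deme i
    pop_sizes[i]    : effective population size theta_i of deme i
    mig_rates[i][j] : backward-in-time migration rate from deme i to deme j
    """
    n = len(lineages)
    # Each of the k_i * (k_i - 1) / 2 pairs in deme i coalesces at rate 1 / theta_i.
    coal = [lineages[i] * (lineages[i] - 1) / 2 / pop_sizes[i] for i in range(n)]
    # Each of the k_i lineages in deme i migrates to deme j at rate lambda_ij.
    mig = [[lineages[i] * mig_rates[i][j] if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    return coal, mig

def next_event(lineages, pop_sizes, mig_rates, rng):
    """Draw the waiting time and type of the next event, or None if no event is possible."""
    coal, mig = event_rates(lineages, pop_sizes, mig_rates)
    n = len(lineages)
    events = [("coalescence", i, i, coal[i]) for i in range(n) if coal[i] > 0]
    events += [("migration", i, j, mig[i][j])
               for i in range(n) for j in range(n) if mig[i][j] > 0]
    total = sum(rate for *_, rate in events)
    if total == 0:
        return None
    wait = rng.expovariate(total)      # exponential waiting time at the total rate
    u = rng.random() * total           # choose an event in proportion to its rate
    for kind, i, j, rate in events:
        if u < rate:
            return wait, kind, i, j
        u -= rate
    kind, i, j, _ = events[-1]         # guard against floating-point round-off
    return wait, kind, i, j
```

With one lineage in each of two demes, every coalescence rate is zero, so the next event is necessarily a migration. This is the constraint described above: the two lineages cannot find a common ancestor until they share a deme.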

To address this lack of scalable approaches to inference under the structured coalescent, I intend to construct a reversible jump Markov chain Monte Carlo (MCMC) algorithm which will infer migration histories and evolutionary parameters for a fixed coalescent genealogy. Multiple robust methods are currently available to infer a genealogy from genomic data, including BEAST (Suchard et al., 2018), LSD (To et al., 2016) and TreeTime (Sagulenko et al., 2018). My work will build upon previous MCMC schemes proposed by Drummond et al. (2002) and Ewing et al. (2004) for the coalescent and structured coalescent respectively. Further, I will release an implementation of my algorithm as an open source R package.
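
As a small illustration of why conditioning on a migration history simplifies inference of the evolutionary parameters, the sketch below updates a single migration rate given a fixed history. It is not the reversible jump algorithm proposed here, and it rests on assumptions that are not part of the project description: a Gamma(shape, rate) prior on the migration rate, and a structured coalescent likelihood in which the rate lambda_ij enters only through lambda_ij^M * exp(-lambda_ij * T), where M is the number of backward-in-time migrations from deme i to deme j and T is the total lineage-time spent in deme i. Under those assumptions the posterior is again a Gamma distribution. All names are hypothetical.

```python
import random

def posterior_params(prior_shape, prior_rate, n_migrations, lineage_time):
    """Conjugate update for one migration rate (assumed Gamma prior).

    The likelihood contribution lambda^M * exp(-lambda * T) combined with a
    Gamma(prior_shape, prior_rate) prior gives Gamma(prior_shape + M, prior_rate + T).
    """
    return prior_shape + n_migrations, prior_rate + lineage_time

def sample_migration_rate(prior_shape, prior_rate, n_migrations, lineage_time, rng):
    """Draw one migration rate from its conditional posterior."""
    shape, rate = posterior_params(prior_shape, prior_rate, n_migrations, lineage_time)
    # random.gammavariate takes a scale parameter, so pass 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)
```

In a full sampler, a step of this kind would alternate with moves that alter the migration history itself, which is where the change of dimension, and hence a reversible jump construction, arises.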

The correctness and computational efficiency of my algorithm will be assessed by benchmarking on simulated datasets. Applications to real datasets from infectious disease pathogens will demonstrate the usefulness of my algorithm, for example a global dataset of cholera genomes from the seventh pandemic (Didelot et al., 2015) and a collection of Ebola genomes from the 2013-2016 West African epidemic (Dudas et al., 2017). I anticipate that this project will contribute to advances in the accuracy of statistical methods for genetic sequences. It will also be relevant to generic MCMC methods on constrained and non-Euclidean spaces, which have applications across the applied sciences and engineering.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/T51794X/1                                   01/10/2020  30/09/2025
2435782            Studentship   EP/T51794X/1  05/10/2020  31/03/2024  Ian Roberts