Exact scalable inference for coalescent processes

Lead Research Organisation: University of Warwick
Department Name: Statistics

Abstract

Modern genetic data sets are vast, both in terms of the number of sequenced individuals and the length of the sequenced DNA segments. Patterns within these data sets carry information about the biological and demographic histories of the population, which cannot usually be observed directly.
The central tool connecting observed patterns to predictions and inference is the Kingman coalescent: a random tree that provides a model for the unobserved ancestry of the sampled DNA sequences. Since the ancestry is unobserved, inferences are made by averaging over all possible ancestries.

In simple cases the average over ancestries can be calculated analytically, but in most biologically relevant scenarios the average has to be approximated. This is usually done by simulating an ensemble of possible ancestral trees, and treating the ensemble average as an approximation of the true, unknown average. The quality of the approximation depends on the degree to which the ensemble is representative of the set of all possible ancestries.
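The simulation-based averaging described above can be illustrated in a few lines. The toy sketch below (purely illustrative, not the project's algorithm) estimates the expected time to the most recent common ancestor under the Kingman coalescent by averaging over simulated ancestries, and can be checked against the known closed form E[T_MRCA] = 2(1 - 1/n):

```python
import random

def sample_tmrca(n, rng):
    """One draw of the time to the most recent common ancestor (TMRCA)
    under the Kingman coalescent started from n lineages."""
    t, k = 0.0, n
    while k > 1:
        # Each of the C(k, 2) pairs of lineages merges at rate 1, so the
        # next merger happens after an Exp(k * (k - 1) / 2) waiting time.
        t += rng.expovariate(k * (k - 1) / 2)
        k -= 1
    return t

rng = random.Random(1)
n, reps = 5, 20000
estimate = sum(sample_tmrca(n, rng) for _ in range(reps)) / reps
exact = 2 * (1 - 1 / n)  # closed-form expectation, here 1.6
```

The ensemble average `estimate` approximates `exact`; its quality improves as the ensemble size `reps` grows, which is exactly the trade-off discussed above.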

Ensuring that an ensemble is both representative and not infeasibly large is a challenging problem. Existing methods for producing ensembles fall into two categories: importance sampling (IS) and Markov chain Monte Carlo (MCMC), of which the latter is typically more flexible and easier to implement. Both are known to scale poorly with the size and complexity of the data set. This proposal seeks to improve the scalability of state-of-the-art MCMC methods in three related ways:

1. Much work has been done to characterise optimal IS algorithms, which have been observed to perform roughly as well as naive implementations of the more flexible MCMC. Preliminary results for this project show that optimality results for IS can also be used to characterise optimal MCMC algorithms, a connection which has not been exploited before. This work will investigate and thoroughly benchmark the performance of the resulting optimised MCMC algorithms.

2. The practical utility of MCMC algorithms has improved dramatically through so-called optimal scaling results, which provide a guide for how to tune the algorithm as the data set grows. However, these typically apply only to settings in which the distribution being simulated consists of independent, real-valued components. In genetics, the distributions of interest are distributions over trees, and are hence much more complicated. This project will investigate extensions of optimal scaling results to tree-valued settings using the recently developed machinery of optimal scaling via Dirichlet forms, which are a natural tool for analysing tree-valued algorithms.

3. A recently published simulator called msprime uses a novel data structure, called a sparse tree, to reduce the runtime and memory consumption of naive coalescent simulation by many orders of magnitude. This does not immediately translate into improved inference algorithms, because naive simulation typically results in ensembles that are poor representations of the true average. The sparse tree structure cannot be inserted directly into an MCMC algorithm, but preliminary work has identified several ways in which MCMC can be modified to use data structures resembling sparse trees. This project will implement and benchmark the resulting algorithms to determine which is most effective.
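As a concrete instance of the optimal scaling results mentioned in point 2, the classical analysis of random-walk Metropolis on a product of independent real-valued components prescribes a proposal step of 2.38/sqrt(d) in dimension d, yielding an asymptotic acceptance rate of about 0.234. The sketch below is a hypothetical toy on a Gaussian target, far removed from the tree-valued setting this project targets:

```python
import math
import random

def rwm_acceptance(d, n_iter, seed=0):
    """Empirical acceptance rate of random-walk Metropolis on a
    d-dimensional standard normal target, with the optimally scaled
    proposal step 2.38 / sqrt(d)."""
    rng = random.Random(seed)
    sigma = 2.38 / math.sqrt(d)
    x = [0.0] * d
    logp = -0.5 * sum(xi * xi for xi in x)
    accepted = 0
    for _ in range(n_iter):
        # Gaussian random-walk proposal around the current state.
        y = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        logq = -0.5 * sum(yi * yi for yi in y)
        # Metropolis accept/reject step.
        if rng.random() < math.exp(min(0.0, logq - logp)):
            x, logp = y, logq
            accepted += 1
    return accepted / n_iter

rate = rwm_acceptance(d=30, n_iter=5000)  # roughly in the region of 0.234
```

For real-valued targets of this form, the observed acceptance rate sits near the theoretical 0.234; it is precisely this kind of tuning guidance that is missing for tree-valued targets.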
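To make the sparse tree idea in point 3 concrete, here is a toy illustration of the underlying edge-table representation (layout and names are purely illustrative, not msprime's actual data structures or API): each parent-child edge is stored once together with the genomic interval on which it is active, so adjacent marginal trees that share most of their structure also share most of their storage, and any single tree can be recovered cheaply:

```python
# Toy "tree sequence": each edge (left, right, parent, child) records a
# parent-child link that is active on the genomic interval [left, right).
# The two marginal trees below differ only in the parent of nodes 2 and 3,
# so four of the six edges are shared between them.
EDGES = [
    (0, 10, 3, 0),
    (0, 10, 3, 1),
    (0, 5, 4, 2),
    (0, 5, 4, 3),
    (5, 10, 5, 2),
    (5, 10, 5, 3),
]

def tree_at(pos, n_nodes=6):
    """Recover the parent array of the marginal tree covering position pos;
    -1 marks a root."""
    parent = [-1] * n_nodes
    for left, right, p, c in EDGES:
        if left <= pos < right:
            parent[c] = p
    return parent
```

For example, `tree_at(2)` and `tree_at(7)` return the two distinct marginal trees without ever duplicating the shared edges, which is the source of the memory savings described above.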

The end result of these three streams will be a highly optimised, flexible, open source algorithm for inference in genetics. It will have unprecedented performance on large data sets due to a combination of mathematical optimisation (objectives 1 and 2) and optimisation of the underlying data structure (objective 3). MCMC algorithms also provide automatic, rigorous uncertainty quantification for their estimates, which many state-of-the-art competitors are not able to provide. This makes MCMC particularly well suited to settings such as clinical practice, where understanding uncertainty is crucial for medical outcomes.

Planned Impact

The immediate beneficiaries of the proposed work are other researchers working in the fields of population genetics, computational statistics, and probability theory. Less direct beneficiaries are the plethora of people relying on the work of these three groups, such as clinicians or conservation groups making use of genetic data, and practitioners in virtually every field of applied science using statistical and probabilistic methods and modelling.

Population geneticists are faced with a wealth of data, but no rigorous, scalable statistical methods by which to interrogate it. Existing methods are either limited to small data sets, or are based on heuristic models which are difficult to interpret, and which provide no coherent estimates of uncertainty or statistical significance. This work will bridge this gap by delivering an inference algorithm with unprecedented scaling properties based on rigorous, gold-standard population genetic models. This is a crucial step in the pipeline that connects genetic data to inferences, and ultimately to decision-making in applied fields such as medicine and conservation, which stand to benefit from this work in turn.

This new methodology will rely on rigorous mathematical analysis, both of the underlying population genetic model and of the inference algorithm itself. Broadly speaking, both sets of analyses involve taking a theory which has already been developed for simple settings, and generalising it to the complex setting of ancestral modelling in genetics. There is every reason to believe that the methods by which these generalisations are achieved will be transferable, and will facilitate the discovery of new probabilistic results, as well as the development of optimised algorithms in other areas of computational statistics. The immediate beneficiaries will be academics working in the relevant areas of probability theory and statistical methodology. If these advances end up facilitating improved methodology in other areas of applied science, the benefit will trickle down to the relevant scientists as well. It is all but impossible to predict which areas will end up as beneficiaries in this way.

Publications

 
Description So-called non-reversible methods for simulation-based parameter fitting are well-suited to coalescent models of genetic evolution. As a byproduct, the range of models for which such methods can be implemented has also increased considerably.

The class of models of evolution from which DNA sequence data can be feasibly simulated has been considerably expanded.
Exploitation Route This work has produced novel and improved simulation and inference tools for genetic data. There is scope for them to feature in a myriad of analysis pipelines in applied settings in the future.
Sectors Digital/Communication/Information Technologies (including Software),Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology

 
Description Mathematical foundations of non-reversible MCMC for genome-scale inference
Amount £76,221 (GBP)
Funding ID EP/V049208/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 02/2021 
End 01/2022
 
Description Implementation of an MCMC algorithm for genomic data 
Organisation University of Melbourne
Country Australia 
Sector Academic/University 
PI Contribution This collaboration is a joint effort to produce an inference algorithm for genomic data based on a mechanistic, interpretable biological model, and provably unbiased Markov chain Monte Carlo methodology. I instigated this collaboration and have been directing the work, which is ongoing.
Collaborator Contribution My partners at the University of Oxford are experts on data structures in genomics, and have been acting in a consulting role in the design of the scheme. My collaborator at the University of Melbourne is carrying out the actual coding.
Impact Outputs have not yet been made available as the collaboration is ongoing.
Start Year 2019
 
Description Implementation of an MCMC algorithm for genomic data 
Organisation University of Oxford
Department Big Data Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution This collaboration is a joint effort to produce an inference algorithm for genomic data based on a mechanistic, interpretable biological model, and provably unbiased Markov chain Monte Carlo methodology. I instigated this collaboration and have been directing the work, which is ongoing.
Collaborator Contribution My partners at the University of Oxford are experts on data structures in genomics, and have been acting in a consulting role in the design of the scheme. My collaborator at the University of Melbourne is carrying out the actual coding.
Impact Outputs have not yet been made available as the collaboration is ongoing.
Start Year 2019
 
Title Additional features to the msprime genomic simulator 
Description msprime is a state-of-the-art simulator of genomic data. 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact This record pertains to two additional features incorporated into the software package: evaluation of likelihoods, and simulation of so-called multiple merger coalescents to model high fecundity organisms undergoing sweepstakes reproduction. The latter extends the functionality of the simulator to a broader class of species, while the former opens the door for its use in likelihood-based statistics. 
URL https://github.com/tskit-dev/msprime
 
Title tree-zig-zag 
Description A Markov chain Monte Carlo model-fitting algorithm for the Kingman coalescent model, based on the nonreversible zig-zag process. 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact None at time of writing. 
URL https://github.com/JereKoskela/tree-zig-zag
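For illustration, the non-reversible zig-zag process on which this package is based can be sketched in one dimension for a standard normal target. This is a hypothetical toy, far simpler than the tree-valued sampler in the repository: the state moves at constant velocity and the velocity flips at a rate given by the positive part of v * U'(x), where U is the negative log-density:

```python
import random

def zigzag_gaussian(n_events, seed=0):
    """One-dimensional zig-zag sampler for a standard normal target.

    With U(x) = x^2 / 2, the velocity flips at rate max(0, v * x);
    between flips the state moves linearly. Returns time-averaged
    estimates of E[X] and E[X^2] along the trajectory.
    """
    rng = random.Random(seed)
    x, v = 0.0, 1.0
    total_t = sum_x = sum_x2 = 0.0
    for _ in range(n_events):
        a = v * x
        e = rng.expovariate(1.0)
        # Invert the integrated flip rate of max(0, a + t) exactly
        # to draw the next flip time tau.
        tau = (max(a, 0.0) ** 2 + 2.0 * e) ** 0.5 - a
        # Accumulate exact path integrals of x and x^2 over the segment.
        sum_x += x * tau + v * tau ** 2 / 2.0
        sum_x2 += x ** 2 * tau + v * x * tau ** 2 + tau ** 3 / 3.0
        total_t += tau
        x += v * tau
        v = -v
    return sum_x / total_t, sum_x2 / total_t

mean, second_moment = zigzag_gaussian(50000)
```

The ergodic averages converge to the target moments (0 and 1 here) despite the dynamics being non-reversible; the package applies the same principle to tree-valued coalescent states.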