Exact scalable inference for coalescent processes

Lead Research Organisation: University of Warwick

Department Name: Statistics

Abstract

Modern genetic data sets are vast, both in terms of the number of sequenced individuals and the length of the sequenced DNA segments. Patterns within these data sets carry information about the biological and demographic histories of the population, which cannot usually be observed directly.
The central tool connecting observed patterns to predictions and inference is the Kingman coalescent: a random tree that provides a model for the unobserved ancestry of the sampled DNA sequences. Since the ancestry is unobserved, inferences are made by averaging over all possible ancestries.

In simple cases the average over ancestries can be calculated analytically, but in most biologically relevant scenarios the average has to be approximated. This is usually done by simulating an ensemble of possible ancestral trees, and treating the ensemble average as an approximation of the true, unknown average. The quality of the approximation depends on the degree to which the ensemble is representative of the set of all possible ancestries.

Ensuring that an ensemble is both representative, and not infeasibly large, is a challenging problem. Existing methods for producing ensembles split into two categories: importance sampling (IS), and Markov chain Monte Carlo (MCMC), of which the latter is typically more flexible and easier to implement. Both are known to scale poorly with the size and complexity of the data set. This proposal seeks to improve the scalability of state of the art MCMC methods in three related ways:

1. Much work has been done to characterise optimal IS algorithms, which have been observed to perform roughly as well as naive implementations of the more flexible MCMC. Preliminary results for this project show that optimality results for IS can also be used to characterise optimal MCMC algorithms, but this has never been done. This work will investigate and thoroughly benchmark the performance of the resulting, optimised MCMC algorithms.

2. The practical utility of MCMC algorithms has improved dramatically through so-called optimal scaling results, which provide a guide for how to tweak the algorithm as the data set grows. However, these typically apply only to settings in which the distribution being simulated consists of independent, real-valued components. In genetics, the distributions of interests consist of trees, and is hence much more complicated. This project will investigate extensions of optimal scaling results to tree-valued settings using recently developed machinery of optimal scaling via Dirichlet forms, which are a natural way to analyse tree-valued algorithms.

3. A recently published algorithm called msprime uses a novel data structure, called a sparse tree, to improve the speed and memory consumption of naive coalescent simulation by many orders of magnitude. This does not immediately translate to improved inference algorithms, because naive simulation typically results in ensembles that are poor representations of the true average. The sparse tree structure cannot be directly inserted into an MCMC algorithm, but preliminary work has identified several ways in which MCMC can be modified to use data structures resembling sparse trees. This project will implement and benchmark all of the resulting algorithms to determine which of these ways is the most effective.

The end result of these three streams will be a highly optimised, flexible, open source algorithm for inference in genetics. It will have unprecedented performance on large data sets due to a combination of mathematical optimisation (objectives 1 and 2) and optimisation of the underlying data structure (objective 3). MCMC algorithms also provide automatic, rigorous uncertainty quantification for their estimates, which many state-of-the-art competitors are not able to provide. This makes MCMC particularly well suited to e.g. clinical practice, where understanding uncertainties is crucial for medical outcomes.

Planned Impact

The immediate beneficiaries of the proposed work are other researchers working in the fields of population genetics, computational statistics, and probability theory. Less direct beneficiaries are the plethora of people relying on the work of these three groups, such as clinicians or conservation groups making use of genetic data, and practitioners in virtually every field of applied science using statistical and probabilistic methods and modelling.

Population geneticists are faced with a wealth of data, but no rigorous, scalable statistical methods by which to interrogate it. Existing methods are either limited to small data sets, or are based on heuristic models which are difficult to interpret, and which provide no coherent estimates of uncertainty or statistical significance. This work will bridge this gap by delivering an inference algorithm with unprecedented scaling properties based on rigorous, gold-standard population genetic models. This is a crucial step in the pipeline that connects genetic data to inferences, and ultimately to decision-making in applied fields such as medicine and conservation, and ultimately these fields will also benefit from this work.

This new methodology will be rely on rigorous mathematical analysis, both of the underlying population genetic model, and of the inference algorithm itself. Broadly speaking, both sets of analyses involve taking a theory which has already been developed for simple settings, and generalising it to the complex setting of ancestral modelling in genetics. There is every reason to believe that the methods by which these generalisation are achieved will be transferable, and facilitate the discovery of new probabilistic results, as well as the development of optimised algorithms in other areas of computational statistics. The immediate beneficiaries will be academics working in the relevant areas of probability theory and statistical methodology. If these advances end up facilitating improved methodology in other areas of applied science, the benefit will obviously trickle down to the relevant scientists as well. It is all but impossible to predict which areas will end up as beneficiaries in this way.

Funded Value:

£99,492

Funded Period:

Jul 18 - Jul 21

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/R044732/1

Principal Investigator:

Jere Koskela

Research Subject:

Mathematical sciences (100%)

Research Topic:

Statistics & Appl. Probability (100%)

Organisations

People	ORCID iD
Jere Koskela (Principal Investigator)

Publications

Author Name

Title Publication Date Published

|< < 1 2 > >|

10 25 50

Baumdicker F (2022) Efficient ancestry and mutation simulation with msprime 1.0. in Genetics

Blath J (2020) Statistical tools for seed bank detection. in Theoretical population biology

Brown S (2021) Simple conditions for convergence of sequential Monte Carlo genealogies with applications in Electronic Journal of Probability

Brown S. (2023) Weak convergence of non-neutral genealogies to Kingman's coalescent in Stochastic Processes and their Applications

Koskela J (2022) Erratum: Asymptotic genealogies of interacting particle systems with an application to sequential Monte Carlo in The Annals of Statistics

Koskela J (2022) Zig-Zag Sampling for Discrete Structures and Nonreversible Phylogenetic MCMC in Journal of Computational and Graphical Statistics

Koskela J (2020) Asymptotic genealogies of interacting particle systems with an application to sequential Monte Carlo in The Annals of Statistics

Koskela J (2019) Robust model selection between population growth and multiple merger coalescents. in Mathematical biosciences

Koskela J (2020) Asymptotic genealogies of interacting particle systems with an application to sequential Monte Carlo in Annals of Statistics

Mahmoudi A (2022) Bayesian inference of ancestral recombination graphs. in PLoS computational biology

Key Findings
Further Funding
Collaboration
Software and Technical Products


Description	So-called non-reversible methods for simulation-based parameter fitting are well-suited to coalescent models of genetic evolution. As a byproduct, the range of models for which they can be implemented has also considerably increased. The class of models of evolution from which DNA sequence data can be feasibly simulated has been considerably expanded.
Exploitation Route	This work has produced novel and improved simulation and inference tools for genetic data. There is scope for them to feature in a myriad of analysis pipelines in applied settings in the future.
Sectors	Digital/Communication/Information Technologies (including Software) Healthcare Manufacturing including Industrial Biotechology Pharmaceuticals and Medical Biotechnology


Description	Mathematical foundations of non-reversible MCMC for genome-scale inference
Amount	£76,221 (GBP)
Funding ID	EP/V049208/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	07/2021
End	01/2022


Description	Implementation of an MCMC algorithm for genomic data
Organisation	University of Melbourne
Country	Australia
Sector	Academic/University
PI Contribution	This collaboration is a joint effort to produce an inference algorithm for genomic data based on a mechanistic, interpretable biological model, and provably unbiased Markov chain Monte Carlo methodology. I instigated this collaboration and have been directing the work, which is ongoing.
Collaborator Contribution	My partners at the University of Oxford are experts on data structures in genomics, and have been acting in a consulting role in the design of the scheme. My collaborator at the University of Melbourne is carrying out the actual coding.
Impact	Outputs have not yet been made available as the collaboration is ongoing.
Start Year	2019


Description	Implementation of an MCMC algorithm for genomic data
Organisation	University of Oxford
Department	Big Data Institute
Country	United Kingdom
Sector	Academic/University
PI Contribution	This collaboration is a joint effort to produce an inference algorithm for genomic data based on a mechanistic, interpretable biological model, and provably unbiased Markov chain Monte Carlo methodology. I instigated this collaboration and have been directing the work, which is ongoing.
Collaborator Contribution	My partners at the University of Oxford are experts on data structures in genomics, and have been acting in a consulting role in the design of the scheme. My collaborator at the University of Melbourne is carrying out the actual coding.
Impact	Outputs have not yet been made available as the collaboration is ongoing.
Start Year	2019


Title	Additional features to the msprime genomic simulator
Description	msprime is a state-of-the-art simulator of genomic data.
Type Of Technology	Software
Year Produced	2020
Open Source License?	Yes
Impact	This record pertains to two additional features incorporated into the software package: evaluation of likelihoods, and simulation of so-called mutliple merger coalescents to model high fecundity organisms undergoing sweepstakes reproduction. The latter extends the functionality of the simulator to a broader class of species, while the former opens the door for its use in likelihood-based statistics.
URL	https://github.com/tskit-dev/msprime


Title	EWF
Description	An efficient simulator for exact Wright-Fisher diffusion and diffusion bridge paths, accounting for a wide class of selective regimes (genic, diploid and arbitrary polynomial selection), and the presence/absence of mutation.
Type Of Technology	Software
Year Produced	2022
Open Source License?	Yes
Impact	Synthetic allele frequency time series data can now be generated much more easily than before, for a broad class of models accounting for genetic drift, various forms of natural selection, and recurrent mutation.
URL	https://github.com/JaroSant/EWF


Title	arginfer
Description	An exact MCMC inference method for ancestral recombination graphs using highly optimised data structures.
Type Of Technology	Software
Year Produced	2021
Open Source License?	Yes
Impact	This method facilitates exact inference for population genetics of recombining organisms at a scale which exceeds previously available methods by a factor of ~10. The main utility of the method is in benchmarking more approximate methods, which tend to be far more scalable.
URL	https://github.com/alimahmoudi29/arginfer


Title	tree-zig-zag
Description	A Markov chain Monte Carlo model-fitting algorithm for the Kingman coalescent model, based on the nonreversible zig-zag process.
Type Of Technology	Software
Year Produced	2020
Open Source License?	Yes
Impact	None at time of writing.
URL	https://github.com/JereKoskela/tree-zig-zag

Abstract

Planned Impact

Organisations

People

ORCID iD

Publications