Mathematical foundations of non-reversible MCMC for genome-scale inference

Lead Research Organisation: University of Warwick
Department Name: Statistics

Abstract

The world is undergoing an explosion of DNA sequence data. Patterns within these data sets carry information about the unobservable biological and demographic histories of populations, and this information in turn fuels discoveries in areas such as medicine, demography, and conservation. A central tool connecting observed patterns to testable predictions and inference is the Ancestral Recombination Graph, which models the patterns of common ancestry along sampled DNA sequences. Since common ancestry is typically not observable directly, inferences are made by averaging over possible ancestries.

In very simple cases the averaging can be carried out exactly, but in biologically relevant settings it typically has to be approximated. Typical approximation methods create an ensemble of candidate ancestries and use the ensemble average as a proxy for the true average. The accuracy of this procedure depends on how representative the ensemble is of the set of all possible ancestries. The computational time required to guarantee a representative ensemble grows rapidly with the size of the data set, so in practice such ensemble-based methods can only be applied to data sets which are small by modern standards. Researchers therefore resort to computationally faster methods whose theoretical performance is less well understood. The lack of theoretical foundations can complicate the interpretation of findings and makes it difficult to quantify their associated uncertainty accurately.
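As a purely illustrative sketch, and not part of any software produced by this project, the ensemble-averaging idea can be written in a few lines of Python. The functions sample_ancestry and quantity_of_interest are hypothetical placeholders for an ancestry sampler and a summary of interest.

    import random

    def ensemble_average(sample_ancestry, quantity_of_interest, n_samples=10_000):
        """Approximate the true average of a quantity over ancestries by the
        average over a sampled ensemble of candidate ancestries.

        sample_ancestry and quantity_of_interest are hypothetical placeholders:
        the first draws one candidate ancestry, the second evaluates the
        quantity whose average we want (for example a likelihood contribution).
        """
        total = 0.0
        for _ in range(n_samples):
            ancestry = sample_ancestry()
            total += quantity_of_interest(ancestry)
        return total / n_samples

    # Toy usage: the "ancestry" is just a random tree height here.
    if __name__ == "__main__":
        estimate = ensemble_average(
            sample_ancestry=lambda: random.expovariate(1.0),  # stand-in for an ancestry sampler
            quantity_of_interest=lambda t: t,                 # stand-in for a summary statistic
        )
        print(estimate)  # approaches 1.0 as n_samples grows

The estimate converges to the true average as the ensemble grows; the difficulty described above is that, for realistic models and data sets, guaranteeing a representative ensemble within feasible computing time is hard.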

A new class of methods for building representative ensembles, called zig-zag algorithms, has been developed and has become increasingly widespread over the last several years. It has also shown promise in pilot applications in genetics, but an effective data structure for applying zig-zag methods to genome-scale data is an essential ingredient which remains unknown. This project aims to develop and test suitable data structures, making it possible to engineer software packages which combine feasible run times with efficient use of the statistical signal in data sets. A small sketch of the zig-zag mechanism itself is given below.
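For concreteness, the following sketch shows the zig-zag mechanism on a deliberately simple target, a one-dimensional standard Gaussian, rather than on any genetic model; it is an illustration under that assumption, not the project's genomic implementation. The state moves at constant velocity +1 or -1, and the velocity flips at random event times whose rate depends on the current position.

    import math
    import random

    def zigzag_gaussian(total_time=10_000.0, x0=0.0, theta0=1.0, sample_dt=1.0):
        """Minimal one-dimensional zig-zag sampler targeting a standard Gaussian.

        The state moves at constant velocity theta in {-1, +1}; the velocity
        flips at event times occurring with rate max(0, theta * x), the
        switching rate for the potential U(x) = x**2 / 2.  Positions are
        recorded on a regular time grid along the piecewise-linear trajectory.
        """
        x, theta, t = x0, theta0, 0.0
        next_sample_time, samples = sample_dt, []
        while t < total_time:
            # Along the current segment the rate is max(0, a + s) with a = theta * x,
            # so the first event time can be drawn exactly by inverting its integral.
            a = theta * x
            e = random.expovariate(1.0)
            tau = -a + math.sqrt(max(a, 0.0) ** 2 + 2.0 * e)
            # Record grid points crossed before the next velocity flip.
            while next_sample_time <= t + tau:
                samples.append(x + theta * (next_sample_time - t))
                next_sample_time += sample_dt
            # Advance to the event and flip the velocity.
            x += theta * tau
            theta = -theta
            t += tau
        return samples

    if __name__ == "__main__":
        draws = zigzag_gaussian()
        mean = sum(draws) / len(draws)
        var = sum(d * d for d in draws) / len(draws)
        print("mean ~", round(mean, 3), "variance ~", round(var, 3))  # near 0 and 1

The challenge addressed by the project is not this mechanism itself but the data structures needed to evaluate the analogous switching rates efficiently when the state is a genome-scale ancestry rather than a single number.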
 
Description

The work undertaken under this award has produced a very clear characterisation of the computational bottleneck associated with genome-scale statistical inference using coalescent-type models of genetic ancestry. Roughly speaking, the amount of "missing data" that needs to be imputed grows linearly in the sequence length when the aim is merely to store or post-process data, but likelihood-based inference under the same models requires an approximately quadratic amount of imputation: doubling the sequence length roughly quadruples the work. This severely limits the extent to which likelihood-based inference can be expected to remain feasible, even in principle.

Even if likelihood-based methods cannot scale to very long sequences, they still have value as gold-standard tools for benchmarking other methods and for uncertainty quantification. A non-reversible MCMC algorithm for genomic data has been formulated, but an implementation is not yet available; the mathematical description of the algorithm is complete, and the remaining work is software engineering.
Exploitation Route

The characterisation of the computational bottleneck in likelihood-based methods suggests the development of new models of genetic ancestry, which would likely be cruder approximations of the underlying biology but able to avoid the bottleneck.

A non-reversible algorithm for genomic inference is still likely to be useful in practice when it is released, because very few gold-standard tools are available for the relevant models.
Sectors

Digital/Communication/Information Technologies (including Software), Education, Pharmaceuticals and Medical Biotechnology