📣 Help Shape the Future of UKRI's Gateway to Research (GtR)

We're improving UKRI's Gateway to Research and are seeking your input! If you would be interested in being interviewed about the improvements we're making and to have your say about how we can make GtR more user-friendly, impactful, and effective for the Research and Innovation community, please email gateway@ukri.org.

Mathematical foundations of non-reversible MCMC for genome-scale inference

Lead Research Organisation: University of Warwick
Department Name: Statistics

Abstract

The world is undergoing an explosion of genetic DNA sequence data. Patterns within data sets carry information about unobservable biological and demographic histories of populations, which in turn are fueling discoveries in areas such as medicine, demography, and conservation. A central tool connecting observed patterns to testable predictions and inference is the Ancestral Recombination Graph, which models the patterns of common ancestry along sampled DNA sequences. Since common ancestry is typically not observable directly, inferences are made by averaging over possible ancestries.

In very simple cases the averaging can be carried out exactly, but in biologically relevant settings it typically has to be approximated. Typical approximation methods create an ensemble of candidate ancestries, and use the ensemble average as a proxy for the true average. The accuracy of this procedure depends on the degree to which the ensemble is representative of the set of all possible ancestries. The computational time required to guarantee a representative ensemble grows rapidly as the size of a data set increases, and in practice, such ensemble-based methods can only be applied to small data sets by modern standards. In practice, researchers resort to computationally faster methods, the theoretical performance of which is less well understood. The lack of theoretical foundations can complicate the interpretability of findings, and makes it difficult to accurately quantify their associated uncertainty.

A new class of methods for building representative ensembles, called zig-zag algorithms, has been developed and become increasingly widespread over the last several years. It has also shown promise in pilot applications in genetics, but an effective data structure for applying zig-zag methods to genome-scale data is an essential ingredient, and remains unknown. This project aims to develop and test suitable data structures, making possible the engineering of software packages which combine feasible run times with efficient use of statistical signal in data sets.
 
Description The work undertaken as a result of this award has resulted in a very clear characterisation of the computational bottleneck associated with genome-scale statistical inference using coalescent-type models of genetic ancestry. Roughly speaking, the amount of "missing data" that needs to be imputed is linear in the sequence length when the aim is merely to store or post-process data, but likelihood-based inference methods for the same models require an approximately quadratic amount of imputation. This severely limits the extent to which likelihood-based inference can be expected to remain feasible, even in principle.

Even if likelihood-based methods cannot scale to very long sequences, they still have value as gold-standard tools for benchmarking other tools, and in uncertainty quantification. A nonreversible MCMC algorithm for genomic data has been formulated, but an implementation is not yet available. The remainder of the work is software engineering - the mathematical description of the algorithm is complete.

Instead of the envisaged focus on exact, nonreversible methodology, this award has facilitated the creation of many inference and simulation tools in genomics. Examples include inference and simulation methods for allele frequency time series, methods for using lengths of shared sequence blocks for detecting inversions, and the first method for dating phylogenetic trees for highly fecund organisms.
Exploitation Route The characterisation of the computational bottleneck in likelihood-based methods suggests the development of new models of genetic ancestry, which would likely be cruder approximations of the underlying biology but able to avoid the bottleneck.

A nonreversible algorithm for genomic inference is still likely to be useful in practice when it is released, because there are very few gold-standard tools available for relevant models.
Sectors Digital/Communication/Information Technologies (including Software)

Education

Pharmaceuticals and Medical Biotechnology

 
Title Dolores 
Description A computational tool for detecting localised suppression of recombination and thus detect structural variation in DNA sequences. 
Type Of Technology Software 
Year Produced 2024 
Impact None yet. The paper is still at a preprint stage. 
URL https://github.com/a-ignatieva/dolores
 
Title EWF 
Description An efficient simulator for exact Wright-Fisher diffusion and diffusion bridge paths, accounting for a wide class of selective regimes (genic, diploid and arbitrary polynomial selection), and the presence/absence of mutation. 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact Synthetic allele frequency time series data can now be generated much more easily than before, for a broad class of models accounting for genetic drift, various forms of natural selection, and recurrent mutation. 
URL https://github.com/JaroSant/EWF
 
Title MCMC4WF 
Description An efficient MCMC inference scheme for genetic data assuming a Wright-Fisher diffusion model. 
Type Of Technology Software 
Year Produced 2025 
Open Source License? Yes  
Impact None yet. This is still at preprint stage and has only just been released. 
URL https://github.com/JaroSant/MCMC4WF
 
Title MMCTime 
Description MMCTime is a package for timing (dating) mutation scaled pathogen phylogenies under Lambda coalescents. Lambda coalescents, also known as Multiple Merger Coalescents are genealogic models that admit multiple mergers. This enables timing of phylogenies which may contain true polytomies. Polytomies in pathogen genealogies can be related to various phenomena of interest, such as superspreading, selective sweeps, or rapid adaptation. 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact An improved understanding of how multiple merger coalescent models can be used in scalable inference methods. 
URL https://github.com/dhelekal/MMCTime