Mathematical foundations of non-reversible MCMC for genome-scale inference

Lead Research Organisation: University of Warwick
Department Name: Statistics

Abstract

The world is undergoing an explosion of DNA sequence data. Patterns within these data sets carry information about the unobservable biological and demographic histories of populations, and this information in turn fuels discoveries in areas such as medicine, demography, and conservation. A central tool connecting observed patterns to testable predictions and inference is the Ancestral Recombination Graph, which models the patterns of common ancestry along sampled DNA sequences. Since common ancestry is typically not observable directly, inferences are made by averaging over possible ancestries.

In very simple cases the averaging can be carried out exactly, but in biologically relevant settings it typically has to be approximated. Typical approximation methods create an ensemble of candidate ancestries and use the ensemble average as a proxy for the true average. The accuracy of this procedure depends on how representative the ensemble is of the set of all possible ancestries. The computational time required to guarantee a representative ensemble grows rapidly with the size of the data set, so in practice such ensemble-based methods can only be applied to data sets which are small by modern standards. Researchers therefore resort to computationally faster methods whose theoretical performance is less well understood. The lack of theoretical foundations can complicate the interpretation of findings and makes it difficult to quantify their associated uncertainty accurately.
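As a purely illustrative sketch, and not part of any software produced by this project, the ensemble-averaging idea can be written in a few lines of Python. The functions sample_ancestry and quantity_of_interest are hypothetical placeholders for an ancestry sampler and a summary of interest.

    import random

    def ensemble_average(sample_ancestry, quantity_of_interest, n_samples=10_000):
        """Approximate the true average of a quantity over ancestries by the
        average over a sampled ensemble of candidate ancestries.

        sample_ancestry and quantity_of_interest are hypothetical placeholders:
        the first draws one candidate ancestry, the second evaluates the
        quantity whose average we want (for example a likelihood contribution).
        """
        total = 0.0
        for _ in range(n_samples):
            ancestry = sample_ancestry()
            total += quantity_of_interest(ancestry)
        return total / n_samples

    # Toy usage: the "ancestry" is just a random tree height here.
    if __name__ == "__main__":
        estimate = ensemble_average(
            sample_ancestry=lambda: random.expovariate(1.0),  # stand-in for an ancestry sampler
            quantity_of_interest=lambda t: t,                 # stand-in for a summary statistic
        )
        print(estimate)  # approaches 1.0 as n_samples grows

The estimate converges to the true average as the ensemble grows; the difficulty described above is that, for realistic models and data sets, guaranteeing a representative ensemble within feasible computing time is hard.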

A new class of methods for building representative ensembles, called zig-zag algorithms, has been developed and has become increasingly widespread over the last several years. It has also shown promise in pilot applications in genetics, but an effective data structure for applying zig-zag methods to genome-scale data is an essential ingredient which remains unknown. This project aims to develop and test suitable data structures, making it possible to engineer software packages which combine feasible run times with efficient use of the statistical signal in data sets. A small sketch of the zig-zag mechanism itself is given below.
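For concreteness, the following sketch shows the zig-zag mechanism on a deliberately simple target, a one-dimensional standard Gaussian, rather than on any genetic model; it is an illustration under that assumption, not the project's genomic implementation. The state moves at constant velocity +1 or -1, and the velocity flips at random event times whose rate depends on the current position.

    import math
    import random

    def zigzag_gaussian(total_time=10_000.0, x0=0.0, theta0=1.0, sample_dt=1.0):
        """Minimal one-dimensional zig-zag sampler targeting a standard Gaussian.

        The state moves at constant velocity theta in {-1, +1}; the velocity
        flips at event times occurring with rate max(0, theta * x), the
        switching rate for the potential U(x) = x**2 / 2.  Positions are
        recorded on a regular time grid along the piecewise-linear trajectory.
        """
        x, theta, t = x0, theta0, 0.0
        next_sample_time, samples = sample_dt, []
        while t < total_time:
            # Along the current segment the rate is max(0, a + s) with a = theta * x,
            # so the first event time can be drawn exactly by inverting its integral.
            a = theta * x
            e = random.expovariate(1.0)
            tau = -a + math.sqrt(max(a, 0.0) ** 2 + 2.0 * e)
            # Record grid points crossed before the next velocity flip.
            while next_sample_time <= t + tau:
                samples.append(x + theta * (next_sample_time - t))
                next_sample_time += sample_dt
            # Advance to the event and flip the velocity.
            x += theta * tau
            theta = -theta
            t += tau
        return samples

    if __name__ == "__main__":
        draws = zigzag_gaussian()
        mean = sum(draws) / len(draws)
        var = sum(d * d for d in draws) / len(draws)
        print("mean ~", round(mean, 3), "variance ~", round(var, 3))  # near 0 and 1

The challenge addressed by the project is not this mechanism itself but the data structures needed to evaluate the analogous switching rates efficiently when the state is a genome-scale ancestry rather than a single number.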
 
Description

The work undertaken under this award has produced a very clear characterisation of the computational bottleneck associated with genome-scale statistical inference using coalescent-type models of genetic ancestry. Roughly speaking, the amount of "missing data" that needs to be imputed grows linearly in the sequence length when the aim is merely to store or post-process data, but likelihood-based inference under the same models requires an approximately quadratic amount of imputation: doubling the sequence length roughly quadruples the work. This severely limits the extent to which likelihood-based inference can be expected to remain feasible, even in principle.

Even if likelihood-based methods cannot scale to very long sequences, they still have value as gold-standard tools for benchmarking other methods and for uncertainty quantification. A non-reversible MCMC algorithm for genomic data has been formulated, but an implementation is not yet available; the mathematical description of the algorithm is complete, and the remaining work is software engineering.
Exploitation Route

The characterisation of the computational bottleneck in likelihood-based methods suggests the development of new models of genetic ancestry, which would likely be cruder approximations of the underlying biology but able to avoid the bottleneck.

A non-reversible algorithm for genomic inference is still likely to be useful in practice when it is released, because very few gold-standard tools are available for the relevant models.
Sectors

Digital/Communication/Information Technologies (including Software), Education, Pharmaceuticals and Medical Biotechnology