Efficient simulation and inference under approximate models of ancestry

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Biological Sciences

Abstract

While large whole-genome data sets are now being generated routinely for many taxa and populations, analyses of these data remain superficial and largely descriptive. In order to make sense of the genetic variation present in samples of genomes, we need to relate it mathematically to the evolutionary processes that generated it. This requires mathematical models of genetic ancestry that are tractable, yet realistic, and general enough to capture all fundamental evolutionary forces. At a minimum, a null model of genomes sampled from a population should capture the randomness of meiotic recombination and the fact that most mutations are either neutral or deleterious, and so are likely to be removed from the population as a result of genetic drift and (background) selection. Although the ancestry of a sample of recombining genomes can be described mathematically as a graph, this full backward-in-time description does not scale to large populations and currently does not include background selection. This means that it is currently impossible to efficiently simulate genomic variation even under the simplest biologically plausible null model. Statistical inference from genomic data is even more limited: state-of-the-art statistical approaches for inferring past selection or demography from genomic data are based on crude (and extremely lossy) summaries of genome-wide variation.

This cross-disciplinary project brings together experts in computer science and mathematical biology and builds on recent breakthroughs to develop efficient approximate algorithms that accurately capture the effect of recombination and background selection on genome-wide ancestry and sequence variation. These algorithms will be implemented both as part of standard simulation software and as tools that calculate the fit of sequence data to models of past demography and selection. Such tools are fundamental for interpreting the vast volumes of genome sequence data that are now being generated across the tree of life. While the algorithms and tools to be developed are general, this project will immediately improve our ability to scan genomic data for signals of past positive selection whilst accounting for the randomness of ancestry.
