From Population Genomes to Global Pedigrees

Lead Research Organisation: University of Oxford
Department Name: Statistics

Abstract

Homology and genealogical relationships are central to biology. Owing to the phenomenal growth in sequence data from different species, phylogenetics has risen to prominence and been put on a firm statistical footing. Similarly, intra-population variation has been catalogued on an unprecedented scale, leading to a better characterization of the genealogical relationships of sequences sampled from a population. Besides the genealogical concepts of the phylogeny and the ancestral recombination graph, and their inference, the influx of sequence data also heralds the possibility of pedigree inference on an unprecedented scale. The concepts of phylogeny and pedigree are well known throughout the scientific community and are very old: hundreds and thousands of years, respectively. The term 'pedigree', together with 'individual', 'life', and 'species', could be among the oldest biological concepts. In terms of mathematical and combinatorial study, however, pedigrees have received less attention than phylogenies. This project aims to investigate the combinatorial space of pedigree graphs, and the connections between a pedigree and its embedded phylogenies, ancestral recombination graphs, and local pedigrees. As a practical outcome, this will further our understanding of the amount of data required to reliably reconstruct a pedigree, and lead to novel algorithms for pedigree reconstruction.
In the current setting, a pedigree refers to a graph in which extant individuals are labelled with distinct names and ancestors are unlabelled. As a simplifying assumption we initially require that ancestors occur at discrete generations going back in time, though counterexamples abound in natural pedigrees as well as in pedigrees from animal and plant studies. The individuals are the nodes of the graph and may or may not be labelled by gender.
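
The discrete-generation setting can be made concrete with a small data structure. The following is a minimal sketch, not the project's own representation: the class name `Pedigree`, the index-based encoding, the restriction to exactly two parents per non-founder, and the example population are all assumptions made for illustration. It also shows that tracing a single parent per individual leaves every node with at most one parent, so the pruned structure is a forest.

```python
from typing import List, Tuple

Parents = Tuple[int, int]  # indices of the two parents, one generation further back


class Pedigree:
    """Discrete-generation pedigree (hypothetical encoding, not from the source).

    Generation 0 holds the extant, named individuals. Ancestors are unnamed
    and identified only by their position within their generation.
    """

    def __init__(self, labels: List[str], parents: List[List[Parents]], founders: int):
        self.labels = labels      # distinct names of the extant individuals
        self.parents = parents    # parents[g][i]: parents (in generation g+1) of node i in generation g
        self.founders = founders  # size of the oldest generation, which has no recorded parents
        self._validate()

    def generation_sizes(self) -> List[int]:
        """Number of individuals per generation, extant generation first."""
        return [len(gen) for gen in self.parents] + [self.founders]

    def _validate(self) -> None:
        if len(set(self.labels)) != len(self.labels):
            raise ValueError("extant individuals need distinct names")
        if len(self.labels) != len(self.parents[0]):
            raise ValueError("one label per extant individual")
        sizes = self.generation_sizes()
        for g, gen in enumerate(self.parents):
            for a, b in gen:
                if not (0 <= a < sizes[g + 1] and 0 <= b < sizes[g + 1]):
                    raise ValueError(f"parent index out of range in generation {g}")

    def prune(self, choice: List[List[int]]) -> List[List[int]]:
        """Keep one parent per individual: choice[g][i] is 0 or 1."""
        return [[pair[c] for pair, c in zip(gen, picks)]
                for gen, picks in zip(self.parents, choice)]


def founder_of(pruned: List[List[int]], i: int) -> int:
    """Follow the single-parent pointers from extant individual i to the oldest generation."""
    node = i
    for gen in pruned:
        node = gen[node]
    return node


# Hypothetical example: three extant individuals, three parents, two founders.
ped = Pedigree(
    labels=["a", "b", "c"],
    parents=[[(0, 1), (0, 1), (1, 2)],   # a and b are full siblings; c shares one parent with them
             [(0, 1), (0, 1), (0, 1)]],  # all three parents descend from the same founder pair
    founders=2,
)
pruned = ped.prune([[0, 1, 0], [0, 0, 0]])
roots = [founder_of(pruned, i) for i in range(len(ped.labels))]
```

In this example all three extant individuals trace back to the same founder under the chosen pruning, so the pruned structure is a single tree on the labels a, b and c. A different choice of parents can yield a different tree from the same pedigree.
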
The three basic genealogical structures are closely related: the ancestral recombination graph (ARG) for a point (not an interval) is a phylogeny; if a pedigree is pruned by tracing only one parent for each individual, a phylogeny is obtained; and the ARG is embedded within a pedigree. The pedigree of a population thus constrains the ARGs and phylogenies observable for the population. Conversely, the ARGs and phylogenies observable in a population are informative about the pedigree of the population.
The main question is how reliable pedigree inference is as a function of the number of generations back in time. Reliability should be expected to be good for a few generations and then to tail off rapidly, but whether this happens after 3-4 generations or after more than 15 is presently unknown. If it were beyond the latter number, large-scale sequencing would eventually have great potential to aid historical demography going back many centuries.
We will address a series of central problems. Understanding the underlying structure and size of the set of pedigrees is essential for evaluating the hardness of pedigree inference and for formulating appropriate algorithms operating on the set of pedigrees. Whether sequences under idealized models determine pedigrees is a major open problem and deserves serious attention. Global pedigrees can be reconstructed (up to isomorphism) from local pedigrees on pairs of individuals, provided these local pedigrees are gender labelled. However, in the case where this gender information is not available for the local pedigrees, the authors showed that reconstruction from pairs can fail. Nevertheless, this suggests a tantalizing question, posed in that paper: do local pedigrees (without gender) on k-tuples of individuals (for some fixed k, independent of the size of the population) suffice for reconstructing a global pedigree?
Steel and Hein showed that the number of segregating sites required to accurately reconstruct a pedigree up to depth d (generations into the past), for an extant population of size n, must grow at least as fast as d log(n). Can this be improved?
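
Written as a formula, the lower bound above reads as follows. This is only a restatement of the sentence above: the symbol S(n, d) for the required number of segregating sites and the constant c are names assumed here for illustration, and neither the base of the logarithm nor the value of c is given in this abstract.

```latex
\[
  S(n, d) \;\ge\; c \, d \, \log n
  \qquad \text{for some constant } c > 0 ,
  \qquad \text{i.e.} \qquad
  S(n, d) = \Omega(d \log n) .
\]
```

Read this way, doubling the depth d doubles the bound, while squaring the population size n also only doubles it, since log(n^2) = 2 log(n).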

Publications

Sainudiin R (2016) Ancestries of a recombining diploid population in Journal of Mathematical Biology

Schwerdtfeger U (2010) Area Limit Laws for Symmetry Classes of Staircase Polygons in Combinatorics, Probability and Computing

Thatte B (2008) Combinatorics of Pedigrees I: Counterexamples to a Reconstruction Question in SIAM Journal on Discrete Mathematics

Thatte BD (2008) Reconstructing pedigrees: a stochastic perspective in Journal of Theoretical Biology

 
Description: Pedigrees are directed acyclic graphs that represent ancestral relationships between individuals in a population. Based on a schematic recombination process, we describe two simple Markov models for sequences evolving on pedigrees: Model R (recombinations without mutations) and Model RM (recombinations with mutations). For these models, we ask an identifiability question: is it possible to construct a pedigree from the joint probability distribution of extant sequences? We present partial identifiability results for general pedigrees: we show that when the crossover probabilities are sufficiently small, certain spanning subgraph sequences can be counted from the joint distribution of extant sequences. We demonstrate how pedigrees that earlier seemed difficult to distinguish are distinguished by counting their spanning subgraph sequences.
Exploitation Route: To follow
Sectors: Agriculture, Food and Drink; Pharmaceuticals and Medical Biotechnology