A graphical model approach to pedigree construction using constrained optimisation

Lead Research Organisation: University of York
Department Name: Computer Science

Abstract

Population biobanks with genetic data on large numbers of unrelated individuals have been enormously successful in detecting common genetic variants affecting diseases of public health concern. Attention is now shifting towards finding rarer variants and to investigating gene-gene and gene-environment interaction effects. Ideally, related individuals are required for this, but family studies are no longer routinely collected. In reality most large population studies, especially those collected from a particular geographical region, will contain sets of (undeclared) relatives. Identification of relatives from existing biobank data would be highly beneficial, both in furthering the use of these studies to search for rare variants and in adjusting statistical analyses to take account of relatedness. Although a crude or general measure of relatedness might be enough if the aim is solely to find individuals who might share rare variants, having a good estimate of the true relationship, or pedigree, would be much better if this could be obtained efficiently: it would enable better adjustment methods and facilitate the search for genes with many variants segregating in different families rather than a single variant across the population.

Our proposal is to develop efficient methods for reconstructing pedigrees from genetic data in large population studies. We will use fast combinatorial optimisation algorithms developed in computer science. These are general graph-searching algorithms but, because a pedigree is a special kind of graph and genetic data are correlated in very particular ways, we will adapt the algorithms to search for valid structures. Adaptation is performed by imposing constraints. One of the main challenges in the project is to formulate constraints that work efficiently and incorporate the relevant biology.
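
To make the idea of "valid structures" concrete, the following is a minimal sketch of the kind of constraints a candidate pedigree might have to satisfy: no individual has more than two parents, two parents of known sex must differ in sex, and nobody is their own ancestor. The function name `is_valid_pedigree` and the dictionary representation are illustrative assumptions, not part of the proposed method, which would encode such constraints inside the optimiser rather than check them afterwards.

```python
from typing import Dict, Optional, Tuple


def is_valid_pedigree(
    parents: Dict[str, Tuple[str, ...]],
    sex: Optional[Dict[str, str]] = None,
) -> bool:
    """Check basic pedigree constraints on a candidate graph (hypothetical helper).

    parents maps each child to a tuple of 0, 1 or 2 parent identifiers.
    sex optionally maps an identifier to "M" or "F"; missing entries are unknown.
    """
    sex = sex or {}

    for child, pars in parents.items():
        # At most two distinct parents per individual.
        if len(pars) > 2 or len(set(pars)) != len(pars):
            return False
        # When both parents have a recorded sex, the sexes must differ.
        known = [sex[p] for p in pars if p in sex]
        if len(known) == 2 and known[0] == known[1]:
            return False

    # Acyclicity: following child -> parent links must never return to the start.
    IN_PROGRESS, DONE = 1, 2
    state: Dict[str, int] = {}

    def has_cycle(node: str) -> bool:
        if state.get(node) == IN_PROGRESS:
            return True
        if state.get(node) == DONE:
            return False
        state[node] = IN_PROGRESS
        for parent in parents.get(node, ()):
            if has_cycle(parent):
                return True
        state[node] = DONE
        return False

    return not any(has_cycle(node) for node in parents)
```

For instance, a trio with one mother and one father passes the check, whereas a graph in which two individuals are each other's parent is rejected.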

The general algorithms assume that all individuals in the pedigree are in the study and have complete genetic data. This does not hold for this application, as unobserved individuals will typically be required to provide the missing links connecting the relatives in the study. Our algorithms will therefore search over all possible pedigrees, including those with missing individuals. Finally, we will incorporate additional non-genetic information via a Bayesian framework, for example to inform the search that some relationships are known with certainty or only up to some degree of confidence. All our methods will be developed using simulated data but will be tested using real data from the Avon Longitudinal Study of Parents and Children (ALSPAC). Fast and efficient pedigree reconstruction would permit much fuller use of existing population cohort studies for genetic research.

Technical Summary

Our proposed research aims to construct pedigrees (family trees) from
genetic marker data. Uncovering relationships between groups of
individuals is important for epidemiological and genealogical
research. For example, when investigating the genetic risk factors
underlying the common complex diseases of major public health concern
it is difficult to discover rare genes unless *related* individuals
are used as they are more likely to share longer haplotypes around
susceptibility loci and are hence biologically more informative than
unrelated "cases" and "controls".

We propose a novel cross-disciplinary approach to the task of
pedigree reconstruction by formulating it as a probabilistic graphical
model selection problem and exploiting state-of-the-art methods from
combinatorial optimisation for model selection within a Bayesian
framework. We will focus on applications where we may not have
complete marker data for each individual. We do not assume age or
sex information but can exploit it when it is available. Any known
relationships can also be incorporated.
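
As a sketch of what selecting a most probable pedigree could look like, assume the standard Bayesian network factorisation, in which each individual's marker data depend only on the data of their parents. The symbols below are illustrative assumptions and are not taken from the proposal: G is a candidate pedigree, \mathcal{P} is the set of graphs satisfying the pedigree constraints, x_i is the marker data for individual i, \mathrm{pa}_G(i) is the set of parents of i in G, and P(G) is a prior that can encode age, sex and known relationships.

```latex
\[
  \hat{G} \;=\; \arg\max_{G \in \mathcal{P}}
  \Bigl[\, \log P(G) \;+\; \sum_{i=1}^{n} \log P\bigl(x_i \mid x_{\mathrm{pa}_G(i)}\bigr) \Bigr]
\]
```

Under this assumption the score is a sum of one term per individual, which is the form that combinatorial optimisation methods for graphical model selection typically exploit.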

We will not be restricted to small pedigrees and in many cases can
guarantee to return a most probable pedigree, conditional on the
available information. Where this is infeasible, the degree of
approximation will be precisely quantified. By being able to deliver
multiple high probability pedigrees, we will also allow for a more
complete picture of the inherent uncertainty in any particular
pedigree reconstruction.

Our method is most reliably evaluated when the true pedigree is known
and so we will make extensive use of simulated data for testing
purposes. However, we will also use real data from the Avon
Longitudinal Study of Parents and Children (ALSPAC), which will serve
as an important test case for the potential usefulness of our approach
to medical research. Being able to construct pedigrees from such
existing data is faster and cheaper than starting a family-based study
from scratch.
