A graphical model approach to pedigree construction using constrained optimisation

Lead Research Organisation: University of York
Department Name: Computer Science

Abstract

Population biobanks with genetic data on large numbers of unrelated individuals have been enormously successful in detecting common genetic variants affecting diseases of public health concern. Attention is now shifting towards finding rarer variants and to investigating gene-gene and gene-environment interaction effects. Ideally, related individuals are required for this, but family studies are no longer routinely collected. In reality most large population studies, especially those collected from a particular geographical region, will contain sets of (undeclared) relatives. Identification of relatives from existing biobank data would be highly beneficial, both in furthering the use of these studies to search for rare variants and in adjusting statistical analyses to take account of relatedness. Although a crude or general measure of relatedness might be enough if the aim is solely to find individuals who might share rare variants, having a good estimate of the true relationship, or pedigree, would be much better if this could be obtained efficiently: it would enable better adjustment methods and facilitate the search for genes with many variants segregating in different families rather than a single variant across the population.

Our proposal is to develop efficient methods for reconstructing pedigrees from genetic data in large population studies. We will use fast combinatorial optimisation algorithms developed in computer science. These are general graph-searching algorithms but, because a pedigree is a special kind of graph and genetic data are correlated in very particular ways, we will adapt the algorithms to search for valid structures. Adaptation is performed by imposing constraints. One of the main challenges in the project is to formulate constraints that work efficiently and incorporate the relevant biology.

The general algorithms assume that all individuals in the pedigree are in the study and have complete genetic data. This does not hold for this application as unobserved individuals will typically be required to provide the missing links connecting the relatives in the study. Our algorithms will search over all possible pedigrees with missing individuals. Finally, we will incorporate additional non-genetic information via a Bayesian framework to inform the search that some relationships are known with certainty or up to some degree of confidence, for example. All our methods will be developed using simulated data but will be tested using real data from the Avon Longitudinal Study of Parents and Children (ALSPAC). Fast and efficient pedigree reconstruction would permit much fuller use of existing population cohort studies for genetic research.

Technical Summary

Our proposed research aims to construct pedigrees (family trees) from
genetic marker data. Uncovering relationships between groups of
individuals is important for epidemiological and genealogical
research. For example, when investigating the genetic risk factors
underlying the common complex diseases of major public health concern
it is difficult to discover rare genes unless *related* individuals
are used as they are more likely to share longer haplotypes around
susceptibility loci and are hence biologically more informative than
unrelated `cases and `controls .

We propose a novel cross-disciplinary approach to the task of
pedigree reconstruction by formulating it as a probabilistic graphical
model selection problem and exploiting state-of-the-art methods from
combinatorial optimisation for model selection within a Bayesian
framework. We will focus on applications where we may not have
complete marker data for each individual. We do not assume age or
sex information but can exploit it when it is available. Any known
relationships can also be incorporated.

We will not be restricted to small pedigrees and in many cases can
guarantee to return a most probable pedigree, conditional on the
available information. Where this is infeasible, the degree of
approximation will be precisely quantified. By being able to deliver
multiple high probability pedigrees, we will also allow for a more
complete picture of the inherent uncertainty in any particular
pedigree reconstruction.

Our method is most reliably evaluated when the true pedigree is known
and so we will make extensive use of simulated data for testing
purposes. However, we will also use real data from the Avon
Longitudinal Study on Parents and Children (ALSPAC) which will serve
as an important test case for the potential usefulness of our approach
to medical research. Being able to construct pedigrees from such
existing data is faster and cheaper than starting a family-based study
from scratch.

Publications

10 25 50
 
Description ICMS Workshops
Amount £20,500 (GBP)
Organisation International Centre for Mathematical Sciences (ICMS) 
Sector Academic/University
Country United Kingdom
Start 09/2014 
End 09/2014
 
Title GOBNILP 
Description Software to induce Bayesian networks (particularly pedigrees) from data. 
Type Of Technology Software 
Year Produced 2012 
Open Source License? Yes  
Impact GOBNILP has been used by a number of researchers. The following is a list of related publications: Siqi Nie, Cassio P. de Campos, and Qiang Ji. Learning Bayesian Networks with Bounded Tree-width via Guided Search, In AAAI Conference on Artificial Intelligence (AAAI), 2016. Meng Sun. Estimating relationships and relatedness from dense genome-wide data PhD thesis, University of Leicester, 2015. Brandon Malone. Empirical Behavior of Bayesian Network Structure Learning Algorithms, Proc. Advanced Methodologies for Bayesian Networks: Second International Workshop, AMBN 2015, Yokohama, Japan, November 16-18, Springer, 2015. Yoni Halpern, Steven Horng and David Sontag. Anchored Discrete Factor Analysis . Arxiv 1511.03299, Nov 2015. Paul Saikko. Re-implementing and Extending a Hybrid SAT-IP Approach to Maximum Satisfiability . MSc thesis, University of Helsinki, Nov 2015. Janne H. Korhonen and Pekka Parviainen. Tractable Bayesian Network Structure Learning with Bounded Vertex Cover Number . NIPS 2015: Proceedings of the 29th Annual Conference on Neural Information Processing Systems, 2015 Mauro Scanagatta, Cassio P. de Campos, Giorgio Corani and Marco Zaffalon. Learning Bayesian networks with thousands of variables . NIPS 2015: Proceedings of the 29th Annual Conference on Neural Information Processing Systems, 2015 Pekka Parviainen and Samuel Kaski. Bayesian Networks for Variable Groups. Arxiv 1508.07753, August, 2015 Arnoud Pastink and Linda can der Gaag. Multi-classifiers of Small Treewidth. Proc. ECSQARU 2015, LNAI 9161, pp. 199-209, 2015. Peter van Beek and Hella-Franziska Hoffmann. Machine learning of Bayesian networks using constraint programming. Proceedings of CP 2015, Cork, Ireland, August, 2015. Vera Djordjilovic, Monica Chiogna and Jirka Vomlel. An empirical comparison of popular algorithms for learning gene networks. Proc. CEB-EIB 2015, 2015 Chris J. Oates, Lilia Carneiro da Costa and Tom Nichols. Towards a Multi-Subject Analysis of Neural Connectivity. Neural Computation, 27:151-170, 2015. (Also Arkiv 1404.1239, April 2014.) Waleed Alsanie and James Cussens. Learning Failure-free PRISM Programs. International Journal of Approximate Reasoning, 67:73-110, 2015. Brandon Malone, Matti Järvisalo and Petri Myllymäki. Impact of Learning Strategies on the Quality of Bayesian Networks: An Empirical Evaluation. Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI-15) July 2015. Mark Bartlett and James Cussens. Integer Linear Programming for the Bayesian Network Structure Learning Problem. Artificial Intelligence. Forthcoming. (Preprint) Chris J. Oates, Jim Q. Smith, Sach Mukherjee and James Cussens. Exact Estimation of Multiple Directed Acyclic Graphs. Statistics and Computing, Forthcoming. (Arkiv 1404.1238, April 2014.) Paul Saikko, Brandon Malone and Matti Järvisalo. MaxSAT-based Cutting Planes for Learning Graphical Models. Proc. CPAIOR 2015 Kustaa Kangas, Teppo Niinimäki and Mikko Koivisto. Learning Chordal Markov Networks by Dynamic Programming. NIPS 2014. Chris J. Oates, Jim Q. Smith, Sach Mukherjee. Estimating causal structure using conditional DAG models. Arkiv 1411.2755, November 2014. Giorgio Corani, Alessandro Antonucci, Denis D. Maua and Sandra Gabaglio. Trading off Speed and Accuracy in Multilabel Classification Proc. PGM 2014, 145-159, LNAI 8754, Springer, 2014. (Proceedings). Mauro Scanagatta, Cassio P. de Campos and Marco Zaffalon. Min-BDeu and Max-BDeu Scores for Learning Bayesian Networks. Proc. PGM 2014, 426-441, LNAI 8754, Springer, 2014. (Preprint). (Proceedings). Ilya Korsunsky, Daniele Ramazzotti, Giulio Caravagna and Bud Mishra. Inference of Cancer Progression Models with Biological Noise. Arkiv 1408.6032, August 2014 Lilia Costa, Jim Smith, Thomas Nichols and James Cussens. Searching Multiregression Dynamic Models of Resting-State fMRI Networks using Integer Programming. Bayesian Analysis. To appear. Nuala Sheehan, Mark Bartlett and James Cussens. Improved Maximum Likelihood Reconstruction of Complex Multi-generational Pedigrees. Theoretical Population Biology. 97:11-19, November 2014. Brandon Malone, Kustaa Kangas, Matti Järvisalo, Mikko Koivisto, and Petri Myllymäki. Predicting the Hardness of Learning Bayesian Networks. In Carla E. Brodley and Peter Stone, editors, Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI 2014) AAAI Press, 2014. Alberto Franzin. An Integer Programming approach to Bayesian Network Structure Learning. Masters dissertation, University of Padua, April 2014. Milan Studený, David Haws Learning Bayesian network structure: Towards the essential graph by integer linear programming tools International Journal of Approximate Reasoning, 2013 Changhe Yuan and Brandon Malone Learning Optimal Bayesian Networks: A Shortest Path Perspective, Journal of Artificial Intelligence Research, Volume 48, pages 23-65, 2013 Pekka Parviainen and Mikko Koivisto. Finding Optimal Bayesian Networks Using Precedence Constraints. Journal of Machine Learning Resarch. vol 14. pp1387-1415, 2013. Eman Aljohani and James Cussens. Informative Priors For Learning Graphical Models. Proc. AIGM-13, Paris, 2013. Antonucci A., Corani G., Maua D., Gabaglio S., An Ensemble of Bayesian Networks for Multilabel Classification. Proc. 23rd Int. Joint Conference on Artificial Intelligence (IJCAI-13) Brandon Malone and Changhe Yuan. Evaluating Anytime Algorithms for Learning Optimal Bayesian Networks . Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI-13) July 2013. Eliot Brenner and David Sontag. SparsityBoost: A New Scoring Function for Learning Bayesian Network Structure. Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI-13) July 2013. Mark Bartlett and James Cussens. Advances in Bayesian Network Learning using Integer Programming. Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI-13) July 2013. Waleed Alsanie. Learning Failure-free PRISM Programs. PhD Thesis, University of York, September 2012. Zhenxing Wang and Laiwan Chan. Learning Bayesian Networks from Markov Random Fields: An Efficient Algorithm for Linear Models. ACM Transactions on Knowledge Discovery from Data (TKDD). 6(3) 2012 10:1-10:31. Robert G. Cowell. A simple greedy algorithm for reconstructing pedigrees. Theoretical Population Biology. 83 (2013) 55-63. 
URL http://www.cs.york.ac.uk/aig/sw/gobnilp/
 
Title Gurobi pedigree learning sw 
Description This sw uses a Python interface to the Gurobi system to learn pedigrees from data. 
Type Of Technology Software 
Year Produced 2012 
Impact This sw was used to produce the results in this paper: James Cussens, Mark Bartlett, Elinor M. Jones and Nuala A. Sheehan. Maximum Likelihood Pedigree Reconstruction using Integer Linear Programming. Genetic Epidemiology. http://onlinelibrary.wiley.com/doi/10.1002/gepi.21686/full 
URL http://onlinelibrary.wiley.com/doi/10.1002/gepi.21686/full
 
Description COSA Workshop 2012 attended by James Cussens and Mark Bartlett 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Type Of Presentation Keynote/Invited Speaker
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Combinatorial Optimization, Statistics, and Applications
September 6-7, 2012, TU Munich, Germany
http://www.cosa-workshop.de/

Ongoing collaboration with David Haws (IBM Research, New York) and Milan Studeny (Prague Academy of Sciences)
Year(s) Of Engagement Activity 2012
 
Description EURAC visit by Nuala Sheehan 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Type Of Presentation Keynote/Invited Speaker
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Nuala Sheehan visited EURAC (European Academy of Bozen/Bolzano) at the beginning of October to get the microsatellite data on the Micros study. Prof Sheehan also was invited to present our work.

microsatellite data obtained
Year(s) Of Engagement Activity 2013
 
Description European Institute in Statistical Genetics 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Nuala Sheehan attended a 3-day course (European Institute in Statistical Genetics) in Edinburgh by
Elizabeth Thompson in order to network with pedigree community.

Nuala Sheehan met Jane Reid there, who came to the later ICMS workshop and with whom we are collaborating.
Year(s) Of Engagement Activity 2012
 
Description European Mathematical Genetics meeting G√∂ttingen 2012 - contributed talk by Nuala Sheehan 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Type Of Presentation Paper Presentation
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact European Mathematical Genetics meeting Göttingen 2012 - contributed talk


Abstract published in Annals of Human Genetics.
Year(s) Of Engagement Activity 2012
 
Description Helsinki visit by James Cussens 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Type Of Presentation Keynote/Invited Speaker
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This was actually a week-long visit to University of Helsinki (UH) financially supported by UH. I gave a talk and had many scientific meetings about network (esp pedigree) reconstruction.

We are now working with UH on a paper on network (esp pedigree reconstruction).
Year(s) Of Engagement Activity 2013
URL http://www.cs.york.ac.uk/aig/talks/cussens_2013b.pdf
 
Description ICMS workshop organised by James Cussens and Nuala Sheehan 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The participants' feedback from the workshop was that is was a success where people had learned a lot from each other.

The talks from "Generation Scotland" and "The Avon Longitudinal Study of Parents and Children" on human cohort studies were particularly successful.
There will be an associated special issue of "Theoretical Population Biology" with papers submitted by the participants.
Year(s) Of Engagement Activity 2014
URL http://www.icms.org.uk/workshops/statistical
 
Description KU Leuven talk by James Cussens 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Type Of Presentation Keynote/Invited Speaker
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact About 40 people, mainly students (mostly PG) came to an invited talk I gave at KU Leuven, Belgium about techniques for network (esp pedigree) reconstruction.

Luc De Raedt from KU Leuven has invited me to a "Dagstuhl" seminar:
Constraints, Optimization and Data
http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=14411
Year(s) Of Engagement Activity 2013
URL http://www.cs.york.ac.uk/aig/talks/cussens_2013a.pdf
 
Description LSHTM talk by Nuala Sheehan 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Type Of Presentation Keynote/Invited Speaker
Geographic Reach National
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Will occur on 20 November 2013
Year(s) Of Engagement Activity 2013
 
Description Leeds seminar given by James Cussens 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Type Of Presentation Keynote/Invited Speaker
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Invited talk to the Leeds University Statistics Dept

None so far
Year(s) Of Engagement Activity 2013
URL http://www.cs.york.ac.uk/aig/talks/cussens_2013e.pdf
 
Description Prague visit by James Cussens 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Type Of Presentation Keynote/Invited Speaker
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact This was a week-long visit to the Czech Academy of Sciences (CAS) partly funded by CAS.

We are now working on joint publications with the Czech Academy of Sciences.
Year(s) Of Engagement Activity 2013
URL http://www.cs.york.ac.uk/aig/talks/cussens_2013d.pdf
 
Description Presentation at ECAI workshop on workshop on algorithmic issues for inference in graphical models by James Cussens 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Type Of Presentation Paper Presentation
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact ECAI workshop on workshop on algorithmic issues for inference in graphical models, Montpelier, France, 2012.
Author: James Cussens
http://carlit.toulouse.inra.fr/wikiz/index.php/AIGM12

I have since done a little collaboration with Uni, of Toulouse
on Bayesian network learning
Year(s) Of Engagement Activity 2012
URL http://carlit.toulouse.inra.fr/wikiz/index.php/AIGM12
 
Description SCIP workshop attended by James Cussens and Mark Bartlett 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Type Of Presentation Paper Presentation
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact SCIP Workshop 2012
October 8 and 9, 2012, in Darmstadt.
http://www3.mathematik.tu-darmstadt.de/ags/optimierung/events/activities/scip-workshop-2012.html

Ongoing collaboration with the SCIP 'team' particularly those from Darmstadt University
Year(s) Of Engagement Activity 2012
 
Description Structure and uncertainty workshop attended by James Cussens 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Type Of Presentation Paper Presentation
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Structure and uncertainty
modelling, inference and computation in complex stochastic systems
Research workshop: 24-27 September 2012, Bristol, http://www.sustain.bris.ac.uk/ws-structure/index.html

None yet.
Year(s) Of Engagement Activity 2012
 
Description TPB special issue 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Nuala Sheehan and James Cussens liaised with the editor of "Theoretical Population Biology" to organise a special issue of pedigree and relatedness inference emanating from the ICMS workshop.

Too soon to say.
Year(s) Of Engagement Activity 2014
URL http://www.journals.elsevier.com/theoretical-population-biology/