Computational Statistical Methods for Population Genomics

Lead Research Organisation: Imperial College London

Department Name: Dept of Medicine

Abstract

It is now possible, and relatively cheap, to scan the entire genomes of multiple individuals within a population. The resulting data can be used to infer aspects of the history of a population, including the values of parameters such as population growth and migration rates, recombination rates, and selection coefficients, as well as levels of admixture. The Bayesian statistical paradigm offers a good framework for such inferences, because it allows maximal extraction of information from data under the specified model, and because background information can be incorporated via the prior distribution. Although straightforward in principle, exact application of the Bayesian paradigm is virtually impossible in practice in this setting, because the large datasets and complex models make computation times prohibitive.

In the past few years a number of exciting developments have arisen that push back the boundaries of the model complexity and dataset size that can be analysed, at the cost of an extra approximation (see, for example, Hey J, Machado CA, Nature Reviews Genetics 4(7): 535-543, July 2003). In the presence of ample data, this approximation is often worthwhile to achieve inferences in more realistic models than would otherwise be possible. Two such advances are:

(a) Computation of the likelihood may be replaced by a simulation step in which data are simulated under the model given the current parameter settings, and the parameter settings are accepted if the simulated data are close to the observed data.

(b) Instead of the full likelihood, an analogous function is calculated or approximated, but with the full data replaced by a vector of summary statistics.

Computational Bayesian methods based on this approach have come to be known as ABC, Approximate Bayesian Computation.

The applicants have contributed substantially to both these advances, and now propose to investigate systematically ways to make them work more efficiently, and to develop user-friendly computer software to make them more widely available to research workers in population genomics, conservation genetics, and related fields. These tasks will be pursued by a post-doctoral research associate at Imperial College. At the same time, a PhD student at Reading will work on applications of the new methods developed at Imperial to specific problems in population genomics. The result will be that at least approximate inferences will be possible for many more complex situations than was previously feasible, for example detailed aspects of the history of entire animal species. Other researchers will also have explicit examples of the usefulness of this new methodology.

The methods we will be developing are very general, and can be applied in any area of science that uses complex models and large amounts of data. Although our project focusses on population genomics, which seems the most fruitful area for application, the modelling of disease transmission in epidemiology is another field that is likely to benefit from the methods that we will develop.
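The two advances described in the abstract, simulation in place of a likelihood and comparison through summary statistics, can be illustrated with a minimal rejection-ABC sketch. Everything specific here is an assumption made for illustration and is not taken from the project: the toy model (normal data with unknown mean), the uniform prior, the sample mean as the single summary statistic, the tolerance, and all function names.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy model (an assumption): n independent draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summary(data):
    """Reduce a dataset to one summary statistic, here the sample mean."""
    return statistics.fmean(data)


def rejection_abc(observed, n_accept, tol, rng, prior_low=-5.0, prior_high=5.0):
    """Collect n_accept prior draws whose simulated summary lies within tol
    of the observed summary; these approximate a posterior sample."""
    s_obs = summary(observed)
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.uniform(prior_low, prior_high)      # draw from the prior
        s_sim = summary(simulate(theta, len(observed), rng))  # simulate, summarise
        if abs(s_sim - s_obs) <= tol:                   # keep theta if close enough
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    rng = random.Random(1)
    observed = simulate(2.0, 50, rng)   # stand-in for real data
    sample = rejection_abc(observed, n_accept=200, tol=0.1, rng=rng)
    print("approximate posterior mean:", statistics.fmean(sample))
```

A smaller tolerance gives a better approximation to the posterior but a lower acceptance rate, which is the efficiency problem that motivates the more elaborate schemes discussed in this record.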

Funded Value:

£181,349

Funded Period:

Mar 06 - Feb 10

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/C533542/1

Principal Investigator:

David Balding

Research Subject:

Mathematical sciences (50%)

Omic sciences & technologies (25%)

Tools, technologies & methods (25%)

Research Topic:

Bioinformatics (25%)

Genomics (25%)

Statistics & Appl. Probability (50%)

Organisations

Imperial College London (Lead Research Organisation)

People	ORCID iD
David Balding (Principal Investigator)

Publications

Adhikari K (2016) A genome-wide association scan implicates DCHS2, RUNX2, GLI3, PAX1 and EDAR in human facial variation. in Nature communications

Adhikari K (2016) A genome-wide association scan in admixed Latin Americans identifies loci influencing facial and scalp hair features. in Nature communications

Beaumont MA (2010) In defence of model-based inference in phylogeography. in Molecular ecology

Cornuet JM (2008) Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. in Bioinformatics (Oxford, England)

Lopes JS (2009) PopABC: a program to infer historical demographic parameters. in Bioinformatics (Oxford, England)

Nunes M (2010) On optimal selection of summary statistics for inference from high-dimensional datasets. in Statistical Applications in Genetics and Molecular Biology

Persing A (2015) A simulation approach for change-points on phylogenetic trees. in Journal of computational biology : a journal of computational molecular cell biology

Scutari M (2016) Using Genetic Distance to Infer the Accuracy of Genomic Prediction. in PLoS genetics

Toni T (2009) Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. in Journal of the Royal Society, Interface

Key Findings
Impact Summary


Description	We contributed to the development of approximate Bayesian computation (ABC) methods, in particular sequential Monte Carlo ABC (SMC-ABC) methods, which in many settings are much more efficient than the earlier rejection-ABC and Markov chain Monte Carlo (MCMC-ABC) methods. This work contributed to a J Roy Soc Interface paper that was the second most cited article in that journal during 2009 (rsif.royalsocietypublishing.org/site/misc/top_ten_citations.xhtml). Moreover, we have developed new methodology for choosing the set of summary statistics used in ABC inference. Our novel two-stage method uses minimum entropy as the criterion in Stage 1, and in Stage 2 minimises the mean root integrated square error (MRISE) over simulated datasets similar to the observed data. In our extensive simulation study, the two-stage method almost halved the gap in MRISE between current best practice and a theoretical optimum (which is unachievable in practice). This work has been published in Stat Appl Genet Mol Biol (2010), and we have also released an R software package (http://nunes.homelinux.com/~matt/computerstuff/ABC.html) to facilitate the take-up of the method by other researchers. We also contributed substantially to two software packages, PopABC and DIY-ABC, both aimed at making the computational methods more accessible to all population genomics researchers. The papers describing these packages have both been well cited.
Exploitation Route	Computational statistical methods that we have helped develop are being used in many fields as described above.
Sectors	Healthcare; Leisure Activities, including Sports, Recreation and Tourism; Culture, Heritage, Museums and Collections; Other
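The efficiency gain of SMC-ABC over rejection-ABC mentioned in the key findings comes from passing a weighted population of parameter values through a sequence of shrinking tolerances, rather than sampling from the prior every time. The following is a minimal sketch of that general idea, not the scheme from the cited paper: the toy model (normal data with unknown mean, summarised by the sample mean), the uniform prior, the Gaussian perturbation kernel with variance set to twice the weighted population variance, the tolerance schedule, and all names are assumptions made for illustration.

```python
import math
import random
import statistics


def simulate_mean(theta, n, rng):
    """Toy simulator (an assumption): mean of n draws from Normal(theta, 1)."""
    return statistics.fmean(rng.gauss(theta, 1.0) for _ in range(n))


def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def abc_smc(s_obs, n_data, tolerances, n_particles, rng, low=-5.0, high=5.0):
    """Return weighted particles approximating the posterior of theta."""
    prior_density = 1.0 / (high - low)          # Uniform(low, high) prior
    particles, weights, kernel_sd = [], [], None
    for t, eps in enumerate(tolerances):
        if t > 0:
            # Kernel spread from the previous population (a heuristic choice).
            mean = sum(w * p for p, w in zip(particles, weights))
            var = sum(w * (p - mean) ** 2 for p, w in zip(particles, weights))
            kernel_sd = math.sqrt(2.0 * var)
        new_particles, new_weights = [], []
        while len(new_particles) < n_particles:
            if t == 0:
                theta = rng.uniform(low, high)  # first population: prior draws
            else:
                centre = rng.choices(particles, weights)[0]
                theta = rng.gauss(centre, kernel_sd)   # resample, then perturb
                if not low <= theta <= high:
                    continue                    # zero prior density: reject
            if abs(simulate_mean(theta, n_data, rng) - s_obs) > eps:
                continue                        # simulated data too far away
            if t == 0:
                w = 1.0
            else:
                # Importance weight: prior density over proposal mixture density.
                mix = sum(wj * normal_pdf(theta, pj, kernel_sd)
                          for pj, wj in zip(particles, weights))
                w = prior_density / mix
            new_particles.append(theta)
            new_weights.append(w)
        total = sum(new_weights)
        particles = new_particles
        weights = [w / total for w in new_weights]
    return particles, weights


if __name__ == "__main__":
    rng = random.Random(1)
    parts, wts = abc_smc(s_obs=2.0, n_data=50, tolerances=[1.0, 0.5, 0.1],
                         n_particles=200, rng=rng)
    print("weighted posterior mean:", sum(w * p for p, w in zip(parts, wts)))
```

Because later populations are proposed near values that already passed a looser tolerance, far fewer simulations are wasted than when every proposal is drawn from the prior at the final tolerance.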


Description	Our research develops "scientific infrastructure": methodology to help other scientists do their research more effectively. Scientists who benefit most from our infrastructure include those using large datasets and complex models that are composed of simple components amenable to simulation. This includes research workers interested in using human genomic data to infer detailed aspects of human demographic history, for example the times and magnitudes of migrations and periods of population growth. As well as being of intrinsic interest, the details of human genomic history are important for understanding the causes of current human genetic variation, and hence for identifying outliers that might indicate intense selection and/or a causal role in disease, drug response, or other important phenotypes. Thus medical and pharmaceutical researchers will benefit indirectly. Similarly, inferences about population size and history are important in conservation genetics, and workers interested in the conservation and management of endangered species will benefit from improved understanding of natural genetic variation. Away from genetics and genomics, infectious disease epidemiology is a field that has benefitted from our computational advances in ABC and related methods. In particular, inference about many unobserved parameters of disease infection processes can be achieved, at least approximately, much more rapidly using the methods that we have developed than was previously possible. Complex models arising in economics may also be usefully analysed using our methods.
First Year Of Impact	2010
Sector	Healthcare; Culture, Heritage, Museums and Collections; Other
Impact Types	Cultural
