Managing and exploiting high dimensionality in genetic epidemiology

Lead Research Organisation: London School of Hygiene & Tropical Medicine
Department Name: Epidemiology and Population Health


In the last five years there has been a great deal of progress in discovering the genetic variations that explain differences in risk for complex diseases such as diabetes, heart disease, and many cancers. But it is clear that much remains to be discovered, and in order to explain this so-called missing heritability researchers are conducting increasingly large-scale studies, enrolling greater numbers of patients and including much larger numbers of genetic variations. At the same time, there are increased efforts to use these new genetic discoveries to learn about the biological processes involved in disease. These new studies in the field of genetic epidemiology need specialised statistical methods to analyse their data, as the path from gene to disease is complex and subject to apparently random influences. This research aims to develop powerful methods that allow for, and take advantage of, the large number of genes measured in a typical study. One important aspect is statistical significance: when many genes are studied, some will appear to be associated with disease simply due to chance fluctuations in the data. Standard guidelines have been given for existing studies in order to differentiate true associations from chance effects, and we will extend these recommendations to the next generation of studies, which consider a larger number of genetic variants, many of them rare, and which study the range of populations worldwide. We will then study the best way to analyse new genotyping products that are designed to obtain enhanced information for specific classes of disease such as autoimmune, cancer and psychiatric conditions.
We will also study an emerging application of genetics, called Mendelian randomisation, which can help to explain whether an observed association between an exposure such as alcohol intake and an outcome such as heart disease is due to a true causal effect or only to common causes shared by both. We will consider how large numbers of genetic variants can be combined into a single tool that can improve the resolution and power of this method. Taken together, this research will provide a methodological basis for gaining further insights from large-scale genetic studies.
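
The point made above about chance associations can be illustrated with some simple arithmetic. The sketch below is only an illustration under stated assumptions: it treats the tests as independent, uses a hypothetical figure of one million variants, and applies a plain Bonferroni correction, which is not necessarily the approach this project adopts. The function names are made up for this example.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of chance 'hits' when every variant is truly null
    and each test is carried out at significance level alpha."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, family_alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the chance of any false
    positive across all tests at or below family_alpha."""
    return family_alpha / n_tests


# Hypothetical study size, chosen only for illustration.
n_variants = 1_000_000

print("Expected chance hits at p < 0.05:",
      expected_false_positives(n_variants, 0.05))
print("Bonferroni per-test threshold:",
      bonferroni_threshold(n_variants))
```

Under these assumptions, testing each variant at the usual 5% level would produce tens of thousands of chance findings, which is why a much stricter per-test threshold is needed. Studies with more variants, rarer variants, or different patterns of correlation between variants change the effective number of tests, and so the appropriate threshold.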

Technical Summary

A salient characteristic of genetic epidemiology in recent years has been the high dimensionality of studies in terms of the number of genetic variants assayed, phenotypic outcomes analysed, and sample sizes attained. This leads, for example, to more opportunities for false positive findings and to selection bias when attention is focused on the most significant results. However, high dimensionality can be usefully exploited by pooling information across multiple genes or phenotypes, which can improve power by means such as aggregating evidence across a biological pathway, estimating the total heritability explained by a genomewide association scan (GWAS), or eliciting stronger instruments for Mendelian randomisation (MR) analyses. The next generation of genetic epidemiology studies can be broadly dichotomised into 1) continued disease mapping efforts using larger sample sizes and complete DNA sequence data, focussing on rare mutations and structural variants; and 2) efforts to gain insight from GWAS results by relating DNA variation to biological function. This research will address high dimensionality in both cases, where it will continue to be a key feature. We will address genomewide significance for whole genome and exome sequencing studies in diverse worldwide populations. We will develop improved methods for analysing the new generation of disease-specific genotyping chips, which are optimised for the study of particular classes of disease, using empirical Bayes optimal discovery procedures. We will develop improved methods for analysing standard GWAS, including more powerful two-stage approaches within a single data set, and proper reporting of significance levels from replication studies. Finally, we will study the use of whole genome instruments in Mendelian randomisation, by using shrinkage methods within two-stage least squares and by inferring non-linear causal effects from observational data.
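
As a rough illustration of the Mendelian randomisation idea mentioned above, the following sketch simulates data in which an unmeasured confounder distorts the ordinary regression of outcome on exposure, then applies textbook two-stage least squares with many variants pooled into a single instrument. Everything here is an assumption made for illustration: the simulated effect sizes, the unweighted allele score, and the helper names are hypothetical, and the shrinkage and non-linear extensions proposed in this project are not implemented.

```python
import numpy as np


def simulate(n=20000, m=50, beta=0.3, seed=0):
    """Simulate genotypes G, exposure X and outcome Y with a shared confounder.

    beta is the true causal effect of X on Y (an assumed value).
    """
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.1, 0.5, size=m)             # minor allele frequencies
    G = rng.binomial(2, maf, size=(n, m)).astype(float)
    alpha = np.full(m, 0.1)                         # per-allele effect on X
    U = rng.normal(size=n)                          # unmeasured confounder
    X = G @ alpha + U + rng.normal(size=n)
    Y = beta * X + U + rng.normal(size=n)
    return G, X, Y


def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def two_stage_least_squares(z, x, y):
    """Stage 1: predict x from the instrument z. Stage 2: regress y on that prediction."""
    b1 = slope(z, x)
    x_hat = x.mean() + b1 * (z - z.mean())
    return slope(x_hat, y)


G, X, Y = simulate()
score = G.sum(axis=1)        # unweighted allele score pooling all variants

naive = slope(X, Y)                              # confounded estimate
iv = two_stage_least_squares(score, X, Y)        # instrumental-variable estimate

print(f"naive regression estimate: {naive:.3f}")
print(f"2SLS estimate:             {iv:.3f}")
print("true causal effect:        0.300")
```

In this simulation the naive estimate is pulled away from the true effect by the confounder, while the two-stage estimate should fall much closer to it, because the allele score influences the outcome only through the exposure. With real data that assumption has to be justified, and instruments built from many weak variants can introduce bias of their own.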

Planned Impact

The immediate beneficiaries will be a wide range of academic researchers working in genetic epidemiology. Beyond this immediate application, this research will benefit basic scientists aiming to translate genomics-era discoveries into new clinical treatments and healthcare policies. The pharmaceutical industry will also benefit in its efforts to translate genomics into healthcare. There is also a high demand for training in genetic epidemiology, both from career geneticists and non-specialists, and this research will benefit this training both by advancing the state of knowledge in the field and, more indirectly, by ensuring that leading-edge training and information exchange can be maintained by the applicants. In the longer term this research will contribute, in a technical but necessary way, to improvements in public health through the exploitation of genetic knowledge via better understanding of disease pathways, applications to personalised medicine and clearer understanding of the causal role of environmental risk factors in common diseases.
Title: Polygenic score
Description: Methods for calculating power and predictive accuracy of polygenic risk scores
Type Of Material: Model of mechanisms or symptoms - human
Year Produced: 2013
Provided To Others? Yes
Impact: Invitations to present methods to international research institutes
Description: CEU
Organisation: University of Cambridge
Country: United Kingdom
Sector: Academic/University
PI Contribution: Statistical methods for genetic epidemiology
Collaborator Contribution: Research priorities in statistical genetics
Impact: None
Start Year: 2016
Description: ICR
Organisation: Institute of Cancer Research UK
Country: United Kingdom
Sector: Academic/University
PI Contribution: Statistical analysis and development of methods
Collaborator Contribution: Data generation and project management
Impact: Publications
Start Year: 2010
Description: IEU
Organisation: University of Bristol
Department: MRC Integrative Epidemiology Unit
Country: United Kingdom
Sector: Academic/University
PI Contribution: Development of statistical methods for causal inference
Collaborator Contribution: Development of statistical methods for causal inference
Impact: None
Start Year: 2016
Description: PGC
Organisation: Psychiatric GWAS Consortium (PGC)
Country: Global
Sector: Academic/University
PI Contribution: Statistical advice
Collaborator Contribution: Data analysis
Impact: Publications
Start Year: 2009
Description: UWA
Organisation: University of Western Australia
Country: Australia
Sector: Academic/University
PI Contribution: Expertise in statistical genetics
Collaborator Contribution: Genetic studies of osteoporosis and thyroid disease
Impact: Several publications
Description: Estimation of genetic model parameters from polygenic association statistics
Type Of Technology: Software
Year Produced: 2015
Open Source License? Yes
Impact: Used as a research tool by a number of independent groups.
Description: International Innovation
Form Of Engagement Activity: A magazine, newsletter or online publication
Part Of Official Scheme? No
Type Of Presentation: Keynote/Invited Speaker
Geographic Reach: International
Primary Audience: Public/other audiences
Results and Impact: Interview for a magazine presenting recent scientific advances to an informed lay readership. Invitations to present work to international research institutes.
Year(s) Of Engagement Activity: 2013