Genomic prediction of anthropometric traits using hundreds of thousands of individuals

Lead Research Organisation: University of Edinburgh
Department Name: The Roslin Institute


Most of the common diseases that afflict humans, and other traits of medical relevance (for instance, blood pressure or cholesterol levels), are determined by the interplay of genetic and environmental factors. Unlike environmental factors, genetic factors can be measured accurately and inexpensively, are constant over time and can be measured as early as birth. Genetic information could therefore help to identify which people are at highest risk of disease, so that preventative strategies can be designed for the individuals who need them most. However, predicting disease risk or other traits that are determined by thousands of genes has been very challenging because, until now, there were neither enough people with both trait and genetic information recorded, nor computational tools capable of analysing the large volumes of data needed to yield accurate predictions. The UK Biobank (a large epidemiological study) now has around 500,000 individuals who have been genotyped (i.e. have genetic information) and phenotyped, and we have developed computer software to analyse this cohort on ARCHER, the UK national supercomputer. We will develop prediction models for nine exemplar traits (e.g. height or body weight) in this cohort to show that prediction from genetic markers is feasible. If we can show that these traits can be accurately predicted, it would mean that prediction could also work for diseases such as colorectal or breast cancer. This would open the way to personalised medicine.

Technical Summary

Genome-wide association studies have identified a large number of genetic variants associated with complex traits. Despite the importance of these discoveries, it is clear that their translation into medically useful tools that could help tailor disease management to the genetic make-up of the individual has lagged behind. We will use genotypes to predict nine quantitative traits measured in the UK Biobank, which comprises around 500,000 individuals. We have shown by simulation that, for a range of heritabilities and genetic architectures, this sample size should allow us to achieve prediction accuracies that are clinically relevant. We now propose to test this using real data. Together, the large sample size of the UK Biobank and the software that we have developed will allow us to develop accurate genetic predictors of these complex phenotypes.
We will use mixed linear models to model additive and non-additive sources of genetic variation, and will develop computationally efficient approaches for combining information from multiple datasets when individual-level data cannot be shared across cohorts. Our predictions will be validated using internal and external cross-validation.
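As an illustration only, the additive part of such a model can be sketched as a ridge regression on standardised genotypes, which is equivalent to a mixed linear model with independent, normally distributed marker effects. This is not the DISSECT implementation: the function names (`simulate`, `gblup_predict`, `cross_validated_accuracy`) are hypothetical, the data are simulated, the heritability `h2` is assumed known rather than estimated, and non-additive effects are ignored.

```python
import numpy as np

def simulate(n=1000, m=200, h2=0.5, seed=1):
    """Simulate standardised genotypes and a purely additive trait (toy data)."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.9, size=m)            # allele frequencies
    G = rng.binomial(2, freqs, size=(n, m)).astype(float)
    Z = (G - G.mean(axis=0)) / G.std(axis=0)         # standardise each marker
    b = rng.normal(0.0, np.sqrt(h2 / m), size=m)     # marker effects
    y = Z @ b + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return Z, y

def gblup_predict(Z_train, y_train, Z_test, h2):
    """Predict phenotypes with the ridge form of an additive mixed linear model.

    Assumes y = mu + Z b + e with b ~ N(0, sigma_g^2 / m) and e ~ N(0, sigma_e^2),
    so the ridge penalty is sigma_e^2 / (sigma_g^2 / m) = m * (1 - h2) / h2.
    """
    m = Z_train.shape[1]
    lam = m * (1.0 - h2) / h2
    mu = y_train.mean()
    A = Z_train.T @ Z_train + lam * np.eye(m)
    b_hat = np.linalg.solve(A, Z_train.T @ (y_train - mu))
    return mu + Z_test @ b_hat

def cross_validated_accuracy(Z, y, h2, k=5, seed=0):
    """Mean correlation between predicted and observed phenotypes over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accuracies = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        pred = gblup_predict(Z[train_mask], y[train_mask], Z[test_idx], h2)
        accuracies.append(np.corrcoef(pred, y[test_idx])[0, 1])
    return float(np.mean(accuracies))

if __name__ == "__main__":
    Z, y = simulate()
    print("5-fold prediction accuracy:", cross_validated_accuracy(Z, y, h2=0.5))
```

The folds here correspond to internal cross-validation within one cohort; external validation would instead train on one cohort and predict in an independent one. In this toy setting accuracy rises as the training sample grows relative to the number of markers, which is the reason a cohort the size of the UK Biobank matters.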
This is a proof-of-principle proposal: if genomic prediction works for exemplar quantitative traits, it would likely work for other diseases or quantitative traits, provided that large training datasets are available for analysis. This would take us one step closer to personalised medicine.

Planned Impact

In the short term, our research will benefit the scientific community working on genomic and phenotypic prediction of complex traits. Extension of our research to other phenotypes is straightforward, so it will benefit researchers working on a broad variety of traits, provided that sufficiently large numbers of samples are available for those traits or diseases. As there are currently GWAS meta-analysis consortia for a wide variety of human traits, the number of researchers who might benefit from our research is large, especially if our work on meta-analytical methods demonstrates that predictions made by combining results from individual cohorts can still achieve acceptable accuracies. Researchers in phenotypic prediction in humans will gain insights into the achievable accuracy of prediction in human populations, and will also benefit from our method comparison work. Both will direct further methodological and translational research in the field. Translational research stemming from our work may apply our findings first to cancer GWAS and then to cancer screening programs.
Our research will clarify whether accurate individual prediction of phenotypes is possible using genomic information, or is significantly aided by it. Extension of this research to other clinically relevant phenotypes is straightforward. Therefore, the outcome of our research will benefit health-care professionals and policy makers, as it will, in the medium term, guide practice and policy with regard to the use of genome-wide information either to develop personalised treatments for individual patients or to tailor interventions to specific strata of the population.
The possibility of targeting particular clinical interventions to the individual offers the opportunity of benefiting patients and allows for a more efficient use of health-care resources, which has economic implications.
As well as guiding treatment allocation for patients, accurate genomic prediction could also be beneficial in the context of clinical trials, where patients could be allocated to groups on the basis of their predicted response to drugs or treatments, thereby reducing the costs of the trials.
Our computational tools will be useful for other efforts such as the 100,000 Genomes Project from Genomics England. We will also have an impact on the plant and animal breeding industry, not only through our genomic prediction research, but also through the provision of our software DISSECT, which can be used by plant and animal breeders to predict genomic values of selection candidates, and to complement or replace current selection tools. Our tools would increase the competitiveness of the UK breeding industry, as they will facilitate the sustained improvement of breeding stock in plants and animals. This in turn would lead to an increase in sales and market share for the breeding companies, improved margins for their customers, lower food costs for consumers, and higher government revenue from taxes paid by successful companies.
Our research will showcase the benefits of the UK Biobank to the scientific community and the general public (including the half a million participants of the UK Biobank). The UK Biobank is not only a great scientific resource but also has the potential, partly given the large number of volunteers that make up the cohort, to be a great tool for science communication and public engagement in science. It makes clear that the public is involved and needed in research, not only through the public funding provided, but also through active participation in building up the resources.
Finally, the post-doctoral researcher employed on the grant will benefit from excellent training and exposure to our industrial and academic collaborators, which will increase her/his opportunities for future employment both within and outside academia.


Title Database of genetic associations 
Description This is the largest atlas of genetic associations with complex traits. It includes associations of over 9 million genetic polymorphisms and 778 complex traits. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact This web site has had over 180,000 queries from around 10,000 researchers from across 100 countries. 
Description Collaboration with GSK 
Organisation GlaxoSmithKline (GSK)
Country Global 
Sector Private 
PI Contribution We provide the tool for analyses of GWAS and the expertise in mixed linear models.
Collaborator Contribution GSK provides curated phenotypic data from UK Biobank
Impact None yet. A CDA is being negotiated.
Start Year 2017
Title UpdateDISSECT 
Description The software can perform genome-wide association studies in large structured populations. The software was designed with farm animal populations in mind. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact We used the software in the largest genotype-phenotype dataset publicly available (UK Biobank) as an exemplar. 
Description Maths and biology. James Gillespie's High School 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact 20-30 pupils and 3-4 teachers attended presentations from my lab on how numerical skills (mathematics and computing) are applied in biological settings. One of these students, now at university, has since visited the Roslin Institute to speak to other researchers.
Year(s) Of Engagement Activity 2018
Description Michigan State University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Michigan State University research seminar with a wide-ranging audience, from animal breeders and quantitative geneticists to medical doctors.
Year(s) Of Engagement Activity 2017
Description Participating in Sciennes Primary Science Fair 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact We used balloons and other materials to create cells, and explain the function of each part of the cell.
Year(s) Of Engagement Activity 2016
Description Seminar - MRC Centre for Neuropsychiatric Genetics and Genomics 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Part of research institution seminar series
Year(s) Of Engagement Activity 2017