Genomic prediction of anthropometric traits using hundreds of thousands of individuals

Lead Research Organisation: University of Edinburgh

Department Name: The Roslin Institute

Abstract

Most common human diseases, and other traits of medical relevance (for instance, blood pressure or cholesterol levels), are determined by the interplay of genetic and environmental factors. Unlike environmental factors, genetic factors can be measured accurately and inexpensively, are constant over time and can be measured as early as birth. Genetic information could therefore help to identify which people are at highest risk of disease, so that preventative strategies could be designed for the individuals who need them most. However, predicting disease risk or other traits that are determined by thousands of genes has been very challenging because, until now, there were neither enough people with both trait and genetic information recorded nor the computational tools to analyse the large volumes of data needed to yield accurate predictions. The UK Biobank (a large epidemiological study) now has around 500,000 individuals genotyped (i.e. with genetic information) and with phenotypes (i.e. measured traits), and we have developed computer software to analyse this cohort on the UK national supercomputer, ARCHER. We will develop prediction models for nine exemplar traits (e.g. height or body weight) in this cohort to show that prediction from genetic markers is feasible. If we can show that these traits can be accurately predicted, it would suggest that prediction could also work for diseases such as colorectal or breast cancer. This would open the way to personalised medicine.

Technical Summary

Genome-wide association studies have identified a large number of genetic variants associated with complex traits. Despite the importance of these discoveries, their translation into medically useful tools that could help tailor disease management to the genetic make-up of the individual has lagged behind. We will use genotypes to predict nine quantitative traits measured in the UK Biobank, which comprises around 500,000 individuals. We have shown by simulation that, for a range of heritabilities and genetic architectures, this sample size should allow us to achieve prediction accuracies that are clinically relevant. We now propose to test this using real data. Both the large sample size of the UK Biobank and the software that we have developed will allow us to develop accurate genetic predictors of these complex phenotypes.
We will use mixed linear models to model additive and non-additive sources of genetic variation, and will develop computationally efficient approaches for combining information from multiple datasets when individual-level data cannot be shared across cohorts. Our predictions will be validated using internal and external cross-validation.
This is a proof-of-principle proposal: if genomic prediction works for exemplar quantitative traits, then it is likely to work for other diseases or quantitative traits, provided that large training datasets are available for analysis. This would take us one step closer to personalised medicine.
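
As an illustration of the kind of analysis described in this summary, the following is a minimal sketch of additive genomic prediction by ridge regression on marker genotypes, with accuracy measured in a hold-out sample. It is not the project's DISSECT software and does not use UK Biobank data: the genotypes and phenotypes are simulated, a single train/test split stands in for cross-validation, non-additive effects are ignored, and all function names (solve, ridge_fit, ridge_predict, simulate, run_demo) and parameter values are hypothetical choices made for this example.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back-substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(genotypes, phenotypes, lam):
    """Estimate marker effects from (X'X + lam*I) b = X'y on mean-centred data.

    lam is the shrinkage parameter; in the mixed-model view it plays the role
    of residual variance divided by marker-effect variance.
    """
    n, p = len(genotypes), len(genotypes[0])
    col_means = [sum(row[j] for row in genotypes) / n for j in range(p)]
    y_mean = sum(phenotypes) / n
    xc = [[row[j] - col_means[j] for j in range(p)] for row in genotypes]
    yc = [y - y_mean for y in phenotypes]
    xtx = [[sum(xc[i][j] * xc[i][k] for i in range(n)) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    return solve(xtx, xty), col_means, y_mean


def ridge_predict(genotypes, effects, col_means, y_mean):
    """Predict phenotypes for new individuals from estimated marker effects."""
    return [y_mean + sum((g - m) * b for g, m, b in zip(row, col_means, effects))
            for row in genotypes]


def correlation(u, v):
    """Pearson correlation, used here as the measure of prediction accuracy."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


def simulate(n, effects, h2, rng):
    """Simulate genotypes (0/1/2 copies, allele frequency 0.5) and phenotypes."""
    genotypes = [[rng.choice((0, 1, 1, 2)) for _ in effects] for _ in range(n)]
    var_g = 0.5 * sum(b * b for b in effects)      # expected genetic variance
    sd_e = (var_g * (1.0 - h2) / h2) ** 0.5        # noise SD giving heritability h2
    phenotypes = [sum(x * b for x, b in zip(row, effects)) + rng.gauss(0.0, sd_e)
                  for row in genotypes]
    return genotypes, phenotypes


def run_demo(seed=1):
    """Train on 300 simulated individuals and return accuracy in 100 held-out ones."""
    rng = random.Random(seed)
    effects = [rng.gauss(0.0, 1.0) for _ in range(20)]  # 20 hypothetical markers
    x, y = simulate(400, effects, 0.5, rng)
    # lam = 10 is roughly residual variance / marker-effect variance for these settings.
    fit = ridge_fit(x[:300], y[:300], lam=10.0)
    predicted = ridge_predict(x[300:], *fit)
    return correlation(predicted, y[300:])


if __name__ == "__main__":
    print("hold-out prediction accuracy (correlation):", round(run_demo(), 3))
```

In this toy setting the markers are few and independent, so effects can be estimated from a few hundred individuals; with hundreds of thousands of markers and polygenic traits, far larger training samples and specialised software are needed, which is the motivation given above.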

Planned Impact

In the short term, our research will benefit the scientific community working on genomic and phenotypic prediction of complex traits. Extension of our research to other phenotypes is straightforward, and therefore it will benefit researchers working on a broad variety of traits, provided that sufficiently large numbers of samples are available for those traits or diseases. As there are currently GWAS meta-analysis consortia for a wide variety of traits in humans, the number of researchers that might benefit from our research is large, especially if our work on meta-analytical methods demonstrates that predictions made by combining results from individual cohorts can still achieve acceptable accuracies. Researchers in phenotypic prediction in humans will gain insights into the achievable accuracy of prediction in human populations and also from our method comparison work. Both will direct further methodological and translational research in the field. Translational research stemming from our work may look at applying our findings first to cancer GWAS and then to cancer screening programmes.
Our research will clarify whether accurate individual prediction of phenotypes is possible using genomic information, or is significantly aided by it. Extension of this research to other clinically relevant phenotypes is straightforward. Therefore, the outcome of our research will benefit health-care professionals and policy makers, as it will, in the medium term, guide practice and policy with regard to the use of genome-wide genomic information either to develop personalised treatments for individual patients or to tailor interventions to specific strata of the population.
The possibility of targeting particular clinical interventions to the individual offers the opportunity of benefiting patients and allows for a more efficient use of health-care resources, which has economic implications.
As well as guiding treatment allocation for patients, accurate genomic prediction could also be beneficial in the context of clinical trials, where patients could be allocated to groups on the basis of their predicted response to drugs or treatments, thereby reducing the costs of the trials.
Our computational tools will be useful for other efforts such as the 100,000 Genomes Project from Genomics England. We will also have an impact on the plant and animal breeding industry, not only through our genomic prediction research, but also through the provision of our software DISSECT, which can be used by plant and animal breeders to predict genomic values of selection candidates, and to complement or replace current selection tools. Our tools would increase the competitiveness of the UK breeding industry as they will facilitate the sustained improvement of the breeding stock in plants and animals, which in turn would lead to an increase in sales and market share for the breeding companies, improved margins for their customers, lower food costs for consumers and increased government revenue from taxes paid by successful companies.
Our research will showcase the benefits of the UK Biobank to the scientific community and the general public (including the half a million participants of the UK Biobank). The UK Biobank is not only a great scientific resource but also has the potential, partly given the large number of volunteers that make up the cohort, to be a great tool for science communication and public engagement in science. It makes clear that the public is involved and needed in research, not only through the public funding provided, but also through active participation in building up the resources.
Finally, the post-doctoral researcher employed on the grant will benefit from excellent training and exposure to our industrial and academic collaborators, which will increase her/his opportunities of future employment both within and outside of academia.

Funded Value:

£373,123

Funded Period:

Nov 15 - Jul 19

Funder:

MRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

MR/N003179/1

Principal Investigator:

Albert Tenesa

Health Category:

Unclassified

Organisations

People	ORCID iD
Albert Tenesa (Principal Investigator)
Pau Navarro (Co-Investigator)

Publications

Caballero A (2015) The Nature of Genetic Variation for Complex Traits Revealed by GWAS and Regional Heritability Mapping Analyses. in Genetics

Rawlik K (2016) Imputation of DNA Methylation Levels in the Brain Implicates a Risk Factor for Parkinson's Disease. in Genetics

Rawlik K (2016) Evidence for sex-specific genetic architectures across a spectrum of human complex traits. in Genome biology

Tenesa A (2016) Genetic determination of height-mediated mate choice. in Genome biology

Rawlik K (2019) Indirect assortative mating for human disease and longevity. in Heredity

Rowlatt A (2016) The heritability and patterns of DNA methylation in normal human colorectum. in Human molecular genetics

Rawlik K (2017) Evidence of epigenetic admixture in the Colombian population. in Human molecular genetics

Li Y (2020) Statistical and Functional Studies Identify Epistasis of Cardiovascular Risk Genomic Variants From Genome-Wide Association Studies. in Journal of the American Heart Association

Canela-Xandri O (2015) A new tool called DISSECT for analysing large genomic data sets using a Big Data approach. in Nature Communications

Bernabeu E (2021) Sex differences in genetic architecture in the UK Biobank. in Nature genetics

Research Databases and Models
Collaboration
Software and Technical Products
Spin Outs
Engagement Activities


Title	Database of genetic associations
Description	This is the largest atlas of genetic associations with complex traits. It includes associations between over 9 million genetic polymorphisms and 778 complex traits.
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes
Impact	This web site has had over 180,000 queries from around 10,000 researchers from across 100 countries.
URL	http://geneatlas.roslin.ed.ac.uk


Description	Collaboration with GSK
Organisation	GlaxoSmithKline (GSK)
Country	Global
Sector	Private
PI Contribution	We provide the tool for analyses of GWAS and the expertise in mixed linear models.
Collaborator Contribution	GSK provides curated phenotypic data from UK Biobank
Impact	None yet. A CDA is being negotiated.
Start Year	2017


Description	UK Biobank Research Analysis Platform
Organisation	UK Biobank
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	We were invited by Mark Effingham (Depute CEO of UK Biobank) to be one of the avant-garde teams to access the UK Biobank research analysis platform to adapt and deploy some of the tools we have developed for the analysis of genomic data.
Collaborator Contribution	We are working with UK Biobank and DNAnexus to set up the compute configuration to allow fast genome-wide association studies with array genotypes, imputed genotypes, whole exome and whole genome data.
Impact	No outputs yet.
Start Year	2020


Title	UpdateDISSECT
Description	The software can perform genome-wide association studies in large structured populations. The software was designed with farm animal populations in mind.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	We used the software in the largest genotype-phenotype dataset publicly available (UK Biobank) as an exemplar.


Company Name	OMECU LIMITED
Description	Software development for analysis of big data.
Year Established	2021
Impact	Received support from the Wellcome iTPA programme, participated in the SETSquared ICURe programme, and received Medical Research Council grants. They also received funding from the University's Data-Driven Entrepreneurship Seed Fund and Fast Track Mentor initiatives, supported by the Scottish Funding Council.
Website	https://www.omecu.com


Description	Maths and biology. James Gillespie's High School
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Schools
Results and Impact	20-30 pupils and 3-4 teachers attended presentations from my lab on how numerical skills (mathematics and computing) are applied in biological settings. One of these students, now at university, has since visited the Roslin Institute to speak to other researchers.
Year(s) Of Engagement Activity	2018


Description	Michigan State University
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Michigan State University research seminar with a varied audience, ranging from animal breeders and quantitative geneticists to medical doctors.
Year(s) Of Engagement Activity	2017


Description	Participating in Sciennes Primary Science Fair
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Public/other audiences
Results and Impact	We used balloons and other materials to create cells, and explain the function of each part of the cell.
Year(s) Of Engagement Activity	2016


Description	Seminar - MRC Centre for Neuropsychiatric Genetics and Genomics
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	Part of research institution seminar series
Year(s) Of Engagement Activity	2017