Development of GPGPU tools for modelling complex phenotypes

Lead Research Organisation: University of Edinburgh
Department Name: The Roslin Institute

Abstract

We are investigating how genes make some people or animals more susceptible to certain diseases (e.g. cancer) or better at production traits (e.g. milk yield) than others. In the long term this research could be used to predict which diseases people and animals are prone to and at what age they are likely to develop them. With this information better drugs and preventative treatments could be developed. This will also help to improve food production and safety for an increasing human population. To investigate this, we take samples from a large number of diseased and healthy people or animals. The genomes of these two groups are then studied and particular parts of the genome (called genes) pinpointed as contributing to the differences between the groups. Making these comparisons requires complex mathematical and statistical models.
We have developed statistical methods that model the traits of animals and people as a function of their genetic make-up and help us identify which genes contribute to the differences between groups. However, these methods require a large number of calculations that take a long time to complete on standard computer processors (or clusters of them). This research proposal will develop software tools that speed up these calculations substantially and hence help us achieve our scientific aims more quickly. The software tools will run on Graphics Processing Units (GPUs), the fast processors used in graphics cards that allow people to play fast and fun computer games. We will use the same programming 'tricks' and technology used by the computer games industry to understand how genes work, and how they interact with each other to make people or animals more or less prone to disease.

Technical Summary

In recent years, genome-wide association studies (GWAS) have allowed an unprecedented search for genetic variants contributing to complex traits. GWAS have genotyped thousands of human and animal genomes with very dense single nucleotide polymorphism arrays and correlated genetic variation with phenotypic variation. Despite the arguable success of GWAS for most complex traits, most of the standing genetic variation remains unidentified. Although the 'missing heritability' problem is currently most evident in human studies, it is very likely that the same problem will arise in wild, farm and companion animals as data become available.
One strategy to identify the 'missing heritability' is to fit non-additive genetic models. Fitting these models is computationally intensive and we lack fast tools to perform global and unbiased searches of the genome in a reasonable amount of time. We will exploit the power of Graphics Processing Units (GPUs) to address one of the most important unanswered questions in complex trait genetics: where is the missing genetic variation hidden?
We have developed an analytical approach to identify quantitative trait and disease susceptibility loci, i.e. to capture genetic variation at functional genomic regions. Our approach estimates genomic relationships among individuals at a particular position in the genome from the observed genotypes and fits the individuals' additive genetic value at that position as a random effect in a mixed linear model framework. However, current tools are slow, which makes global epistatic searches and the estimation of empirical significance thresholds impossible with our analytical approach. We estimate that the proposed project will deliver a five-fold increase in performance over our current CPU software implementation.
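As an illustration of the first step described above, the sketch below estimates genomic relationships among individuals from the genotypes in one region. It is a minimal example under stated assumptions, not the project's implementation: it assumes genotypes coded as 0/1/2 copies of a reference allele and uses a commonly used standardised estimator (each SNP centred by twice its allele frequency and scaled by 2p(1-p)). The summary does not say which estimator the project uses. The function names and the snp_indices parameter are hypothetical.

```python
def allele_frequencies(genotypes):
    """Return the reference-allele frequency of each SNP.

    genotypes: list of individuals, each a list of 0/1/2 allele counts.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    return [sum(ind[i] for ind in genotypes) / (2.0 * n) for i in range(m)]


def genomic_relationship_matrix(genotypes, snp_indices=None):
    """Estimate pairwise genomic relationships from a set of SNPs.

    snp_indices (hypothetical parameter) restricts the estimate to the SNPs
    in one genomic region; by default every SNP is used.
    Assumed estimator: average over SNPs of
    (x_j - 2p)(x_k - 2p) / (2p(1 - p)).
    """
    n = len(genotypes)
    freqs = allele_frequencies(genotypes)
    if snp_indices is None:
        snp_indices = range(len(freqs))
    # Monomorphic SNPs carry no information and would divide by zero.
    used = [i for i in snp_indices if 0.0 < freqs[i] < 1.0]
    if not used:
        raise ValueError("no polymorphic SNPs in the selected region")

    grm = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(j, n):
            total = 0.0
            for i in used:
                p = freqs[i]
                total += ((genotypes[j][i] - 2.0 * p)
                          * (genotypes[k][i] - 2.0 * p)
                          / (2.0 * p * (1.0 - p)))
            # The matrix is symmetric, so fill both triangles at once.
            grm[j][k] = grm[k][j] = total / len(used)
    return grm


if __name__ == "__main__":
    # Four individuals genotyped at four SNPs (toy data, not real genotypes).
    toy = [[0, 1, 2, 1], [1, 1, 0, 2], [2, 0, 1, 1], [1, 2, 1, 0]]
    for row in genomic_relationship_matrix(toy, snp_indices=[0, 1]):
        print(row)
```

The triple loop performs on the order of n x n x m multiplications for n individuals and m SNPs, and each pair of individuals can be processed independently, which is the kind of workload that maps naturally onto a GPU. In a mixed linear model the resulting matrix would serve as the covariance structure of the regional random effect.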

Planned Impact

Impact on the academic community
The proposed research will benefit complex trait geneticists working on model organisms, humans, livestock, and wild and companion animals. It will help them identify the genes and loci that code for and control complex traits and diseases. This in turn will help to understand how genes interact with each other and with the environment.
Identifying the genes that contribute to particular traits (e.g. diseases) makes it feasible to study the molecular mechanisms that lead to them. Molecular biologists will be primary beneficiaries of the successful application of our tools to complex traits in humans and animals.
Impact on the industry
Our research will help the breeding industry to maintain a competitive advantage through improved breeding schemes. Identifying the loci contributing to production traits will help to build better prediction models and hence achieve higher genetic gains. It will also help to maintain sustainable food (protein) production and reduce the environmental burden of the livestock industry.
Our tools will allow the discovery of genes associated with disease onset and progression. Mechanistic insights generated by the discovery of those genes will help the pharmaceutical industry select candidate chemical compounds, thereby increasing the success rate of potentially useful compounds and speeding up drug discovery and development.

Impact on human and animal health
Predicting phenotypes is important in human disease: better prediction models will lead to better screening strategies, allocation of resources and intervention strategies, hence informing public health policy.
Our methods will help to understand the genetic architecture of complex diseases in livestock and companion animals. This will help to develop better screening programmes, improve public health policies and facilitate the development of better therapeutics.

Impact on users
The impact on users will be tremendous; our GPU code is likely to be a hundred times faster than available software. This means that global searches for epistasis would be feasible and that empirical significance thresholds could be obtained. Neither of these analyses is currently feasible.

Timescale
Uptake of the software is likely to be quick because the results from genome-wide association studies have, to a degree, not fulfilled their original expectations and there is a need to try new approaches to identify the missing genetic variation.
 
Description We developed computer software to perform complex statistical analyses of genomic data. The software is very fast because it uses both computer CPUs (used in standard computers) and GPUs (usually used for gaming). This has allowed us to make better use of the genomic data available and has helped us identify parts of the genome that are important in traits such as height, colorectal cancer or milk yield.
Exploitation Route Researchers have used this software to study the genetics underlying hip dysplasia in dogs, non-pathological cognitive decline in humans, and in theoretical simulations.
Sectors Agriculture, Food and Drink, Healthcare

URL http://www.roslin.ed.ac.uk/albert-tenesa/software/
 
Description The development of the software led to training in high performance computing and statistics. Training in these two hard-to-find skills is the main impact of this grant. The person trained moved to New Zealand for a senior industry-research post.
First Year Of Impact 2013
Sector Agriculture, Food and Drink
Impact Types Economic

 
Description UK Biobank Research Analysis Platform 
Organisation UK Biobank
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution We were invited by Mark Effingham (Depute CEO of UK Biobank) to be one of the avant-garde teams to access the UK Biobank research analysis platform to adapt and deploy some of the tools we have developed for the analysis of genomic data.
Collaborator Contribution We are working with UK Biobank and DNAnexus to set up the compute configuration to allow fast genome-wide association studies with array genotypes, imputed genotypes, whole-exome and whole-genome data.
Impact No outputs yet.
Start Year 2020
 
Title REACTA 
Description The software performs mixed linear models using genomic information. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact The software has been used widely within the University of Edinburgh, and our algorithms were incorporated into the original GCTA software that we started from. GCTA, as the original software, is itself widely used. 
URL http://www.roslin.ed.ac.uk/albert-tenesa/software/
 
Company Name Omecu 
Description Omecu develops a cloud-based platform for the analysis of large-scale genetic and epidemiologic datasets, with the aim of democratising genome data. 
Year Established 2021 
Impact Received support from the Wellcome iTPA programme, participated in the SETSquared ICURe programme, and received Medical Research Council grants. They also received funding from the University's Data-Driven Entrepreneurship Seed Fund and Fast Track Mentor initiatives, supported by the Scottish Funding Council.
Website https://omecu.com/