Vast-scale linear mixed modelling genetic discovery approaches for genome- and exome-wide association analyses to enable therapeutic target validation

Lead Research Organisation: University of Edinburgh
Department Name: MRC Human Genetics Unit

Abstract

Large-scale publicly available datasets, such as the UK Biobank (n=500,000 participants), which combine genome-wide genotyping and exome sequencing data with linkage to detailed phenotype measurement and electronic healthcare records, have the potential to transform human genetic discovery analyses. Such datasets are transformative both in their scale and in the depth and diversity of quantitative and disease phenotypes available, and have raised strong interest in both academia and industry. In this regard, we have identified partners in Target Sciences (TSci) at GlaxoSmithKline (GSK), a leading team in the application of genetics in drug target discovery and validation. They have previously shown that drugs developed against targets with genetic support for the proposed disease are more likely to reach approval (PMID: 26121088), have used existing GWAS results to search for drug repurposing opportunities (PMID: 22491277) and to develop databases of gene-disease pairs to inform target discovery and validation decisions (PMID: 27899665, 28472345), and have used other biobank samples to influence selection of cardiovascular endpoints (PMID: 26791069) and search for drug repurposing opportunities (PMID: 27301456). GSK have previously performed large-scale targeted sequencing studies (PMID: 22604722) and recently funded exome sequencing of 50,000 participants in UK Biobank, with the aim of further supporting drug target discovery and validation. A major aim at GSK is to use UK Biobank data to conduct phenome-wide association studies (PheWAS) for variants known or predicted to affect gene function for drug targets of interest. The approach currently used is to test each single variant against thousands of disease traits in the subset of unrelated individuals.
However, this approach needs to be improved to distinguish associations where the drug target variants are likely causal from associations where the drug target variants are merely correlated with a causal variant (in linkage disequilibrium).

Testing all variants (potentially thousands) in order to fine map in the genomic context of each association of interest is inefficient. A preferable approach is to conduct PheWAS and fine mapping in genomic context by querying a database of genome-wide association results for all diseases and phenotypes of interest. To maximize discovery power and fine mapping resolution, it is preferable to populate this database with results calculated using the largest possible sample size. However, an almost inevitable consequence of increasing sample sizes from human populations is that a larger fraction of participants are related to other participants in the sample. Traditional approaches, such as removing one participant from each related pair, may lead to the removal of a significant proportion of participants from the analysis, with consequent loss of statistical power. An alternative is to use linear mixed models to account for relatedness and population structure. These approaches, however, require the development of new software tools to deal with large sample sizes and large numbers of variants and phenotypes. GSK TSci scientists lack the technical expertise required to implement efficient mixed model association testing at the scale required, so this joint project aims to develop, in collaboration with them, the methods required to populate the database. Our work has the potential to have an impact on drug discovery and development.
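
For reference, a linear mixed model of the kind referred to above is commonly written as follows. This is the standard textbook formulation, given here as an illustrative assumption; the symbols are not taken from the proposal, and the exact model fitted in the project may differ.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \mathbf{g} + \boldsymbol{\varepsilon}, \\
\mathbf{g} &\sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_g^2 \mathbf{A}\right), \qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_e^2 \mathbf{I}\right)
\end{aligned}
```

Here y is the phenotype vector, X holds the tested variant and covariates with fixed effects β, g is a random polygenic effect whose covariance is proportional to a genomic relationship matrix A, and ε is residual noise. Because A captures pairwise relatedness, related participants can remain in the analysis instead of being removed.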

Technical Summary

To address the objectives of the fellowship, we will further develop DISSECT (PMID: 26657010). This is a software tool developed within the group, which was designed to overcome the compute and memory limitations of single compute nodes by taking advantage of the aggregate power of the thousands of processor cores and large distributed memory available on supercomputers or large compute clusters. For this purpose, DISSECT distributes the available data over multiple nodes. At any given time, each node has access to only a small portion of the data, on which it performs local computations. When the algorithm requires access to blocks of data currently held on other nodes, the nodes communicate to coordinate data redistribution. This approach provides access to much larger computational resources for a single analysis (i.e. it increases scalability) than standard tools that can only use the resources of a single compute node for each analysis, even when running on similar compute cluster environments. In addition, building on our current development, we will further develop, evaluate and implement previous approaches (PMID: 25642633, 21465547) that propose approximations to reduce the computational cost of fitting these models on large datasets, and find a balance between speed, accuracy, and computational requirements. The proposed analyses will be run on Tier-1 and Tier-2 High Performance Computing Centres such as ARCHER (https://www.archer.ac.uk) and CIRRUS (http://www.cirrus.ac.uk).
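
The idea of distributing data over nodes, computing locally, and then communicating can be illustrated with a toy sketch. This is not DISSECT code: the "nodes" are simulated as list entries inside one process, all function names are hypothetical, and the quantity computed (a genotype cross-product of the kind used to build relationship matrices) is chosen only as an example.

```python
# Illustrative sketch only: "nodes" are simulated in one process,
# and every name here is hypothetical (not part of DISSECT).
from typing import List

Matrix = List[List[float]]


def split_columns(genotypes: Matrix, n_nodes: int) -> List[Matrix]:
    """Give each simulated node a contiguous block of variant columns."""
    n_variants = len(genotypes[0])
    bounds = [round(i * n_variants / n_nodes) for i in range(n_nodes + 1)]
    return [[row[bounds[k]:bounds[k + 1]] for row in genotypes]
            for k in range(n_nodes)]


def local_cross_product(block: Matrix) -> Matrix:
    """Local work on one node: Z_k Z_k^T for the variants it holds."""
    n = len(block)
    return [[sum(a * b for a, b in zip(block[i], block[j])) for j in range(n)]
            for i in range(n)]


def reduce_sum(partials: List[Matrix]) -> Matrix:
    """Communication step: add the per-node partial results element-wise."""
    n = len(partials[0])
    return [[sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(n)]


def distributed_cross_product(genotypes: Matrix, n_nodes: int) -> Matrix:
    """Split the data, compute locally on each block, then combine."""
    blocks = split_columns(genotypes, n_nodes)
    return reduce_sum([local_cross_product(b) for b in blocks])


# Three individuals (rows) by four variants (columns), coded 0/1/2.
genotypes = [[0, 1, 2, 1], [2, 0, 1, 1], [1, 1, 0, 2]]
single_node = local_cross_product(genotypes)
multi_node = distributed_cross_product(genotypes, n_nodes=2)
```

In this sketch `multi_node` equals `single_node`, because the cross-product is a sum over variants and can therefore be accumulated block by block. On a real cluster the reduction would be a network operation between processes rather than a loop over a list.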
 
Description Analysis of non-additive genetic effects affecting complex traits in large datasets
Amount £30,000 (GBP)
Funding ID IS3-R86 
Organisation University of Edinburgh 
Sector Academic/University
Country United Kingdom
Start 04/2019 
End 04/2020
 
Description GOLEM: High Performance Computing platform for a paradigm shift in genetic analysis
Amount £94,329 (GBP)
Funding ID MRC/CIC8/76 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 03/2021 
End 02/2022
 
Description Golem: A disruptive platform to access and interactively analyse genetic data
Amount £288,694 (GBP)
Funding ID 10001080 
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 09/2021 
End 09/2022
 
Description What genomic analyses and iTunes have in common?
Amount £54,022 (GBP)
Funding ID 29-34 / 520268126 
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 08/2020 
End 03/2021
 
Title Genetic analyses on demand 
Description We developed a computational system that will greatly improve the capacity to perform common genetic analyses on large datasets (GWAS, GxE, GxG, etc.). Its key strengths are the ability to compute orders of magnitude faster on large datasets, privacy, and UX/UI: 1) The back-end enables users to perform an analysis in seconds: it can currently run ~80,000-160,000 GWAS per day on datasets with >500k individuals and >10M genetic variants on a very small set of servers. 2) The analyst does not need direct access to the data, so the owner can keep it safe if any restrictions are in place. 3) A web tool enables researchers without programming skills to explore and prepare the data rapidly and efficiently, combined with a front-end that allows the researcher to explore the results interactively and integrate them with information from different public databases. The system will greatly optimize the use of researchers' time and the cost of performing analyses by enabling them to explore the data and perform analyses interactively in real time. It would be equivalent to querying a database of pre-computed results. However, because analyses are performed on demand, the approach allows researchers to modify and adapt the models efficiently and re-run the analysis interactively.
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? No  
Impact The tool is not yet public. We expect to release it this year.
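
As a rough check on the throughput quoted in the description above, the daily GWAS counts can be converted into an average time per analysis. This is a back-of-the-envelope sketch that assumes the analyses run back to back over a full 24-hour day; the function name is hypothetical, and the latency of any single analysis may differ if several run in parallel.

```python
# Back-of-the-envelope conversion; assumes analyses run back to back
# for a full day. The function name is illustrative only.
SECONDS_PER_DAY = 24 * 60 * 60


def mean_seconds_per_gwas(gwas_per_day: int) -> float:
    """Average wall-clock seconds per GWAS at a given daily throughput."""
    return SECONDS_PER_DAY / gwas_per_day


slowest = mean_seconds_per_gwas(80_000)    # about 1.08 seconds
fastest = mean_seconds_per_gwas(160_000)   # about 0.54 seconds
```

Under that assumption, the quoted range corresponds to roughly half a second to one second per GWAS on average.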
 
Title A comprehensive catalogue of regulatory variants in the cattle transcriptome 
Description Understanding the functional consequences of genetic variants on the transcriptome of livestock is essential for interpreting the molecular mechanisms underlying traits of economic value, and for improving the rate of genetic gain through artificial selection. Here, we build a cattle Genotype-Tissue Expression atlas (cGTEx) for the research community based on 11,642 publicly available RNA-seq datasets (as of July 2019), representing over 100 tissues/cell types among over 40 breeds. We describe the transcriptome landscape across tissues and report thousands of cis- and trans- genetic variants (QTLs) associated with gene expression and alternative splicing for 24 major tissues in cattle. Additionally, we detect 496 gene-tissue pairs significantly associated with 43 economically important traits in cattle via a large transcriptome-wide association study (TWAS). All the genome annotation files are based on ARS-UCD1.2 (Ensembl 96 version). The cGTEx Portal allows researchers to query gene expression, alternative splicing and QTLs across tissues in an easy and uniform way, and can serve as a primary source of reference for cattle genomics, cattle breeding, adaptive evolution, comparative genomics, and veterinary medicine.
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact It has been accepted in Nature Genetics (not yet published) and led to an international collaboration to create a much more comprehensive database (farmGTEx) that aims to combine data from different species.
URL https://cgtex.roslin.ed.ac.uk/
 
Title Comprehensive analyses of 723 transcriptomes enhance genetic and biological interpretations for complex traits in cattle 
Description We here uniformly analyzed 723 (156 newly generated and 567 existing) RNA-seq datasets to build a gene atlas in cattle, which included 91 tissues and cell types from 447 individuals. We summarized the sample information, their NCBI accession numbers, and expression (FPKM) of 24,616 Ensembl genes (based on UMD3.1) here. Through integrative analyses of this gene atlas with large-scale genome-wide association studies, we detected relevant tissues/cell types and candidate genes for 45 economically important traits in cattle (under review in Genome Research). This cattle gene atlas will serve as a primary source for biological interpretation and functional validation of GWAS findings, studies of adaptive evolution and population genetics, as well as genomic improvement in cattle. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact This created the basis of a collaboration that led to the creation of a bigger dataset (cGTEx) and ultimately the farmGTEx international collaboration.
URL http://cattlegeneatlas.roslin.ed.ac.uk/
 
Title Gene ATLAS GWAS database 
Description Database containing genome-wide association analysis results for 778 human traits and ~30 million genetic variants. In the analysis we used ~450,000 individuals from UK Biobank. We also developed a web tool to browse this database (see also "Software & Technical Products" section). 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact The website where the database is hosted has received more than 140,000 visits from researchers around the world since it was created. We also published an article in a high impact journal (https://www.nature.com/articles/s41588-018-0248-z).
URL http://geneatlas.roslin.ed.ac.uk/
 
Description Vast-scale linear mixed modelling genetic discovery approaches for genetic by environment association analyses 
Organisation GlaxoSmithKline (GSK)
Country Global 
Sector Private 
PI Contribution Analyzing large datasets, such as those of the size of UK Biobank, is computationally expensive. The challenge is bigger when thousands of phenotypes have to be analyzed. Although different software solutions are emerging, they are in general limited in the types of models they can fit. To address this problem, we are expanding our tool (http://www.dissect.ed.ac.uk/) to test genetic by environment interactions on thousands of phenotypes in datasets of the size of UK Biobank.
Collaborator Contribution UK Biobank provided access to thousands of measurements on very large numbers of individuals. However, several of those require expertise in a particular field to properly prepare the data, or combine different data fields to generate or curate a new one. GSK has the expertise and resources to do this.
Impact Work in progress. There is no output from this collaboration yet.
Start Year 2018
 
Title Interactive analysis of large datasets whilst keeping the data protected. 
Description Human genomic data is doubling in size every seven months and will soon exceed other Big Data generators such as astronomy, YouTube and Twitter. Extracting value from this data is a key step in areas such as drug targeting and personalized medicine. According to Global Market Insights, the digital genome market is projected to hit $50.4 billion by 2025, and is key for two UK Grand Challenges: AI & data and ageing society. Accordingly, the UK is positioning itself as a major player through strategic investments to create world leading resources such as UK Biobank and Genomics England. Reaching the full potential of this substantial investment relies on developing associated industries around it to unlock value from the data. However, several barriers still exist: a) Legal, political, or economic restrictions hamper access to multi-institutional and multi-national fragmented data. b) Preparing and analysing the data may require days or even weeks of a highly skilled individual's work performing repetitive low-value tasks. c) Analysing genetic data requires multidisciplinary skills. d) Large computational resources are required. Not all organizations perceive these problems the same way. Whilst public organisations and small biotechs struggle to find adequate skills and allocate the required computational resources, this does not seem to be a major concern for big pharma companies. On the other hand, difficulties accessing multi-organization scattered data affect all organizations. Other problems in the field include scalability, evidence, equity, democratization, information, health, and carbon footprint. Several companies have been created to address these problems, some of which we have met (LifeBit and DNA Nexus). We do not believe their solutions satisfactorily address the major challenges we have identified in the field.
In particular, the analyst still needs to "see" the sensitive data through their platforms, and the analyses still incur high computational costs and long run times. Our development proposes to overcome these challenges through: 1) An extremely efficient computation engine we developed. Using inexpensive hardware, it reduces analysis times for large datasets from days to seconds. 2) An easy-to-use web system that enables the engine to be queried interactively without requiring direct data access, even by the person analysing the data. Together, these technologies can be disruptive to how data is currently accessed and analysed. Our solution can move the field from the current situation, where an analyst struggles to reach data and then spends weeks in an iterative cycle of data preparation and analysis, to a situation where multi-organization fragmented data is accessed easily and queried interactively and on demand.
IP Reference  
Protection Trade Mark
Year Protection Granted
Licensed Yes
Impact It is used in an early stage spin-out
 
Title Real time diagnosis of rare diseases 
Description One in two hundred babies will be born with a developmental disorder. The success of human genetics means there is an ever-increasing number of human diseases for which genome-wide sequencing of DNA (DNA-GWS) can provide confident diagnosis. A unifying and robust diagnosis is key to improving prognosis and identifying effective therapeutic strategies. This is particularly critical in rapid DNA-GWS in acutely unwell infants, one of the most rapidly growing sectors of diagnostic genomics. As sequencing technologies improve, comprehensive analysis of sequence data has become the bottleneck. Currently, analysis of different classes of genomic variants is done serially, using distinct and computationally intensive quality control (QC) measures prior to diagnostic analysis to determine the clinical (or research) usefulness of each test. The scale of DNA-GWS data means that each step requires many hours of computational time, with review by specially trained scientists. Our disruptive innovation answers an important unmet need: to provide flexible and robust methods to identify all major classes of disease-associated genomic pathologies in real time. Our analytical protocols can QC and optimise the data on a per-individual, per-family or per-cohort basis as required, enabling novel analytical approaches to be developed, tested and implemented. Our current prototype, comprising the computation engine and a rudimentary user interface, has been tested on parent-child trio DNA-GWS data, where it performs variant tests at a rate of 119 million variants/second, allowing filter manipulations sufficient to compare 100 parameters on 50 million variants within 0.42 seconds on consumer-grade hardware.
IP Reference  
Protection Trade Mark
Year Protection Granted
Licensed No
Impact Still under development.
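
The two performance figures quoted in the description above appear mutually consistent. Assuming the stated rate is simply the number of variants divided by the elapsed time (an interpretation, not something the record states), the arithmetic is:

```latex
\frac{50 \times 10^{6}\ \text{variants}}{0.42\ \text{s}} \approx 1.19 \times 10^{8}\ \text{variants/s} \approx 119\ \text{million variants/s}
```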
 
Title Expansion of DISSECT, a tool to use High Performance Distributed Computing environments to perform analysis on large datasets. 
Description DISSECT is designed to distribute computationally expensive analyses between large numbers of computing nodes connected through a network. This allows very large scalability by enabling the use of thousands of processors and large amounts of memory to perform a single analysis on very large datasets. The tool is not designed to perform just one particular analysis: it is easy to extend with new analyses on this distributed computing schema. During this period, we extended the capabilities of DISSECT by adding the possibility of performing genome-wide association studies on large numbers of phenotypes in one analysis. We are also extending it to add more complex tests such as genetic by environment or genetic by genetic interactions.
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Since we started developing DISSECT, we have used it to produce several works published in different high impact journals. In this last year, the expansion of this tool allowed us to pre-compute genome-wide association studies for 778 human traits using ~450,000 related and unrelated UK Biobank individuals. The results of these analyses allowed us to develop and publish the Gene ATLAS database and web tool. This has been published in a high impact journal (https://www.nature.com/articles/s41588-018-0248-z) and the website has received more than 140,000 visits from researchers around the world since it was made public.
URL http://www.dissect.ed.ac.uk/
 
Title Gene ATLAS 
Description Web tool for exploring the Gene ATLAS database (see also "Research Databases & Models"), which contains results from pre-computed genome-wide association studies on 778 human traits analyzed using ~450,000 related and unrelated individuals from UK Biobank.
Type Of Technology Webtool/Application 
Year Produced 2018 
Impact The tool led to a publication in a high impact journal (https://www.nature.com/articles/s41588-018-0248-z), and has accumulated more than 140,000 visits from researchers all around the world since publication.
URL http://geneatlas.roslin.ed.ac.uk/
 
Company Name OMECU LIMITED 
Description Extracting clinically actionable insights from increasingly large and diverse genetic and health datasets is a key step in addressing unmet medical needs in the areas of target and biomarker discovery, as well as personalised medicine. For example, it has been shown that targeting drugs using Genome-Wide Association Studies (GWAS) results increases approval success more than two-fold. This change has been embraced by the biotech and pharmaceutical industries, leading to new unmet needs (e.g. expensive computing requirements, long processing times, multi-disciplinary teams, restricted access to highly regulated multi-organization fragmented data) around how these ever-growing and complex datasets are analysed and securely shared. Our research within MRC-HGU (University of Edinburgh) led us to develop a disruptive and ambitious step forward to address these problems and transform how genetic data is accessed and analysed. The two key elements of our solution, based on a set of novel complex algorithms never publicly disclosed, are: 1) A core computation engine, placed where the data sits, that massively reduces the computation time on large datasets (e.g. from days to seconds on datasets such as UK Biobank) using inexpensive commercial hardware. The engine enables clustering, transforming, querying, and analysing the data without exposing the data itself. By enabling on-the-fly analysis and real-time exploration of the data, it has the potential to change how data is explored. 2) A web platform that enables simple querying of the engine, exploring the data, and exploring the results returned from the engine. On top of providing strict protection of the data (even from the analyst), the web platform developed around the computation engine also addresses other common problems in the field such as reproducibility, storage requirements, and automatic reporting of data usage by strictly logging any action performed on the data.
This could enable, for instance, automatic reporting of data usage to study participants who have given their data. Our ambition for our platform is the creation of an international data market that democratizes data access while greatly reducing overall costs, without requiring data holding organizations to expose their data. Alternatively, the same structure could be created internally within organizations, so that they can enable controlled access to internal data for their own researchers. Ultimately, our objective is to move the field from a paradigm where highly skilled individuals need to spend large amounts of time on low value tasks such as accessing, preparing, and analysing the data, to a paradigm where data access is democratised to individuals without a computer science background (e.g. a clinician expert on a specific disease), who would be able to query fragmented datasets easily, interactively, and on demand.
Year Established 2021 
Impact It has been created recently.
 
Description Talk to medical students about entrepreneurship
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Undergraduate students
Results and Impact The Edinburgh Innovation Ambassador for the College of Medicine at the University of Edinburgh invited me to give a presentation to students about my experiences of spinning out from the university. I made a presentation to them in the Student Enterprise Hub.
Year(s) Of Engagement Activity 2022
URL https://events.irm.ed.ac.uk/Events/Event/7015J000000HU5Y