Vast-scale linear mixed modelling genetic discovery approaches for genome- and exome-wide association analyses to enable therapeutic target validation

Lead Research Organisation: University of Edinburgh

Department Name: MRC Human Genetics Unit

Abstract

Large-scale publicly available datasets such as the UK Biobank (n=500,000 participants), which combine genome-wide genotyping and exome sequencing data with linkage to detailed phenotype measurements and electronic healthcare records, have the potential to transform human genetic discovery analyses. Such datasets are transformative both in their scale and in the depth and diversity of the quantitative and disease phenotypes available, and have attracted strong interest in both academia and industry. In this regard, we have identified partners in Target Sciences (TSci) at GlaxoSmithKline (GSK), a leading team in the application of genetics to drug target discovery and validation. They have previously shown that drugs developed against targets with genetic support for the proposed disease are more likely to reach approval (PMID: 26121088), have used existing GWAS results to search for drug repurposing opportunities (PMID: 22491277) and to develop databases of gene-disease pairs to inform target discovery and validation decisions (PMID: 27899665, 28472345), and have used other biobank samples to influence selection of cardiovascular endpoints (PMID: 26791069) and to search for drug repurposing opportunities (PMID: 27301456). GSK have previously performed large-scale targeted sequencing studies (PMID: 22604722) and recently funded exome sequencing of 50,000 participants in UK Biobank, with the aim of further supporting drug target discovery and validation. A major aim at GSK is to use UK Biobank data to conduct phenome-wide association studies (PheWAS) for variants known or predicted to affect gene function for drug targets of interest. The approach currently used is to test each single variant against thousands of disease traits in the subset of unrelated individuals. However, this approach needs to be improved to distinguish associations where the drug target variants are likely causal from associations where the drug target variants are merely correlated with a causal variant (in linkage disequilibrium).
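The single-variant PheWAS described above can be sketched as follows. This is a minimal illustration, not the pipeline used by GSK or by the authors: it assumes quantitative traits and ordinary least-squares regression (disease traits would more typically use logistic regression), and the function names and toy data are hypothetical.

```python
from math import sqrt


def single_variant_test(genotypes, phenotype):
    """Regress one phenotype on allele count (0/1/2); return (beta, se, t)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotype))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    # Residual sum of squares of the fitted line
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(genotypes, phenotype))
    se = sqrt(rss / (n - 2) / sxx)
    return beta, se, beta / se


def phewas(genotypes, traits):
    """Test one variant against every trait in a {name: values} mapping."""
    return {name: single_variant_test(genotypes, values)
            for name, values in traits.items()}


# Toy data (made up for illustration): 10 unrelated individuals, 2 traits
variant = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
traits = {
    "trait_a": [1.1, 0.9, 2.2, 1.8, 3.1, 2.9, 1.0, 2.1, 3.0, 1.9],  # tracks genotype
    "trait_b": [5.0, 4.0, 5.0, 4.0, 5.0, 4.0, 4.5, 4.5, 4.5, 4.5],  # unrelated
}
results = phewas(variant, traits)
```

A test like this only reports whether the variant is associated with a trait. It cannot tell a causal variant from one in linkage disequilibrium with it, which is the limitation the text raises.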

Testing all variants (potentially thousands) in the genomic context of each association of interest in order to fine map it is inefficient. A preferable approach is to conduct PheWAS and fine mapping in genomic context by querying a database of genome-wide association results for all diseases and phenotypes of interest. To maximize discovery power and fine-mapping resolution, this database should be populated with results calculated using the largest possible sample size. However, an almost inevitable consequence of increasing sample sizes from human populations is that a larger fraction of participants are related to other participants in the sample. Traditional approaches, such as removing one participant from each related pair, may lead to the removal of a significant proportion of participants from the analysis, with a consequent loss of statistical power. An alternative is to use mixed linear models to correct for population structure and relatedness. These approaches, however, require the development of new software tools to deal with large sample sizes and large numbers of variants and phenotypes. GSK TSci scientists lack the technical expertise required to implement efficient mixed model association testing at the scale required, so this joint project aims to collaborate with them to develop the methods needed to populate the database. Our work has the potential to have an impact on drug discovery and development.
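To make the cost of the traditional approach concrete, the sketch below prunes a sample so that no related pair remains and reports the fraction of participants lost. It is an illustration only: the greedy rule (drop whoever has the most remaining relatives) is an assumption chosen for brevity, not a method described in this project, and the sample identifiers and pairs are made up.

```python
def prune_related(sample_ids, related_pairs):
    """Greedily remove participants until no related pair is left intact."""
    relatives = {s: set() for s in sample_ids}
    for a, b in related_pairs:
        relatives[a].add(b)
        relatives[b].add(a)
    removed = set()
    while True:
        # Participant involved in the most unresolved related pairs
        worst = max(relatives, key=lambda s: len(relatives[s]))
        if not relatives[worst]:
            break  # no related pairs remain
        removed.add(worst)
        for other in relatives[worst]:
            relatives[other].discard(worst)
        relatives[worst] = set()
    kept = [s for s in sample_ids if s not in removed]
    return kept, removed


# Toy data: 8 participants, 4 related pairs (hypothetical)
samples = list("ABCDEFGH")
pairs = [("A", "B"), ("B", "C"), ("D", "E"), ("F", "G")]
kept, removed = prune_related(samples, pairs)
fraction_removed = len(removed) / len(samples)
```

Even in this small example more than a third of the participants are discarded. A mixed model instead keeps everyone and accounts for the relatedness in the model itself.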

Technical Summary

To address the objectives of the fellowship, we will further develop DISSECT (PMID: 26657010), a software tool developed within the group. It was designed to overcome the compute and memory limitations of single compute nodes by taking advantage of the aggregate power of the thousands of processor cores and the large distributed memory available on supercomputers or large compute clusters. For this purpose, DISSECT distributes the available data over multiple nodes. At any given time, each node has access to only a small portion of the data, on which it performs local computations. When the algorithm requires access to blocks of data currently held on other nodes, the nodes communicate to coordinate data redistribution. This approach gives a single analysis access to much larger computational resources (i.e. it increases scalability) than standard tools, which can only use the resources of a single compute node for each analysis even when running on similar compute clusters. In addition, building on our current development, we will further develop, evaluate and implement previously proposed approximations (PMID: 25642633, 21465547) that reduce the computational cost of fitting these models on large datasets, and find a balance between speed, accuracy, and computational requirements. The proposed analyses will be run on Tier-1 and Tier-2 High Performance Computing Centres such as ARCHER (https://www.archer.ac.uk) and CIRRUS (http://www.cirrus.ac.uk).
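The idea of local computation on a slice of the data followed by communication between nodes can be illustrated with a toy example. The sketch below is not DISSECT's implementation: it simulates the "nodes" inside a single Python process, and it assumes the quantity of interest is a genetic relationship matrix (a common ingredient of mixed-model analyses) computed from standardized genotypes. All function names and data are hypothetical.

```python
from statistics import mean, pstdev


def standardize(column):
    """Centre and scale one variant's genotype column (0/1/2 allele counts)."""
    mu, sd = mean(column), pstdev(column)
    return [(g - mu) / sd for g in column]


def split_into_blocks(columns, n_nodes):
    """Deal variant columns out to the simulated nodes, round-robin."""
    return [columns[i::n_nodes] for i in range(n_nodes)]


def local_partial_grm(block, n_individuals):
    """Work done on one node: sum of outer products over its own variants only."""
    partial = [[0.0] * n_individuals for _ in range(n_individuals)]
    for column in block:
        z = standardize(column)
        for i in range(n_individuals):
            for j in range(n_individuals):
                partial[i][j] += z[i] * z[j]
    return partial


def reduce_sum(partials, n_variants):
    """Communication step: add the per-node partial matrices and normalise."""
    n = len(partials[0])
    return [[sum(p[i][j] for p in partials) / n_variants for j in range(n)]
            for i in range(n)]


def distributed_grm(columns, n_nodes):
    n_individuals = len(columns[0])
    blocks = split_into_blocks(columns, n_nodes)
    partials = [local_partial_grm(b, n_individuals) for b in blocks]
    return reduce_sum(partials, len(columns))


# Toy data: 6 variants (one list each) genotyped in 4 individuals, made up
genotypes = [
    [0, 1, 2, 1],
    [2, 1, 0, 1],
    [0, 0, 1, 2],
    [1, 2, 2, 0],
    [0, 1, 1, 2],
    [2, 0, 1, 1],
]
grm_one_node = distributed_grm(genotypes, n_nodes=1)
grm_three_nodes = distributed_grm(genotypes, n_nodes=3)
```

Splitting the variants across one or three simulated nodes gives the same matrix, because each node only contributes a partial sum. On a real cluster the reduction would be a message-passing operation between machines rather than a Python loop.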

Funded Value:

£299,616

Funded Period:

Feb 18 - Feb 21

Funder:

MRC

Project Status:

Closed

Project Category:

Fellowship

Project Reference:

MR/R025851/1

Principal Investigator:

Oriol Canela-Xandri

Health Category:

Unclassified

Organisations

People	ORCID iD
Oriol Canela-Xandri (Principal Investigator / Fellow)

Publications

Bernabeu E (2021) Sex differences in genetic architecture in the UK Biobank. in Nature genetics

Bretherick AD (2020) Linking protein to phenotype with Mendelian Randomization detects 38 proteins with causal roles in human diseases and traits. in PLoS genetics

Canela-Xandri O (2018) An atlas of genetic associations in UK Biobank. in Nature genetics

Canela-Xandri O (2020) TiFoSi: an efficient tool for mechanobiology simulations of epithelia. in Bioinformatics (Oxford, England)

Fang L (2020) Comprehensive analyses of 723 transcriptomes enhance genetic and biological interpretations for complex traits in cattle. in Genome research

Fang L (2022) A compendium of genetic regulatory effects across pig tissues

Li Y (2020) Statistical and Functional Studies Identify Epistasis of Cardiovascular Risk Genomic Variants From Genome-Wide Association Studies. in Journal of the American Heart Association

Liu S (2022) A multi-tissue atlas of regulatory variants in cattle. in Nature genetics

Teng J (2024) A compendium of genetic regulatory effects across pig tissues. in Nature genetics

The PigGTEx Consortium (2024) A compendium of genetic regulatory effects across pig tissues

Further Funding
Research Databases and Models
Research Tools and Methods
Collaboration
Intellectual Property
Software and Technical Products
Spin Outs
Engagement Activities


Description	Analysis of non-additive genetic effects affecting complex traits in large datasets
Amount	£30,000 (GBP)
Funding ID	IS3-R86
Organisation	University of Edinburgh
Sector	Academic/University
Country	United Kingdom
Start	03/2019
End	04/2020


Description	GOLEM: High Performance Computing platform for a paradigm shift in genetic analysis
Amount	£94,329 (GBP)
Funding ID	MRC/CIC8/76
Organisation	Medical Research Council (MRC)
Sector	Public
Country	United Kingdom
Start	03/2021
End	02/2022


Description	Golem: A disruptive platform to access and interactively analyse genetic data
Amount	£288,694 (GBP)
Funding ID	10001080
Organisation	Innovate UK
Sector	Public
Country	United Kingdom
Start	08/2021
End	09/2022


Description	What genomic analyses and iTunes have in common?
Amount	£54,022 (GBP)
Funding ID	29-34 / 520268126
Organisation	Innovate UK
Sector	Public
Country	United Kingdom
Start	07/2020
End	03/2021


Title	Genetic analyses on demand
Description	We developed a computational system that will greatly improve the capacity to perform common genetic analyses (GWAS, GxE, GxG, etc.) on large datasets. Its key strengths are speed (analyses run orders of magnitude faster on large datasets), privacy, and UX/UI: · The back-end enables users to perform an analysis in seconds: it can currently run ~80,000-160,000 GWAS per day on datasets with >500k individuals and >10M genetic variants on a very small set of servers. · The analyst does not need direct access to the data, so the owner can keep it secure if any restrictions are in place. · A web tool enables researchers without programming skills to explore and prepare the data rapidly and efficiently, combined with a front-end that allows the researcher to explore the results interactively and integrate them with information from different public databases. The system will greatly reduce the researcher time and cost needed to perform analyses by enabling researchers to explore the data and perform analyses interactively in real time. The experience is equivalent to querying a database of pre-computed results. However, because analyses are performed on demand, the approach allows researchers to modify and adapt the models efficiently and re-run the analysis interactively.
Type Of Material	Improvements to research infrastructure
Year Produced	2020
Provided To Others?	No
Impact	The tool is not yet public. We expect to release it this year.


Title	A comprehensive catalogue of regulatory variants in the cattle transcriptome
Description	Understanding the functional consequences of genetic variants on the transcriptome of livestock is essential for interpreting the molecular mechanisms underlying traits of economic value, and for improving the rate of genetic gain through artificial selection. Here, we build a cattle Genotype-Tissue Expression atlas (cGTEx) for the research community based on 11,642 publicly available RNA-seq datasets (as of July 2019), representing over 100 tissues/cell types among over 40 breeds. We describe the landscape of the transcriptome across tissues and report thousands of cis- and trans- genetic variants (QTLs) associated with gene expression and alternative splicing for 24 major tissues in cattle. Additionally, we detect 496 gene-tissue pairs significantly associated with 43 economically important traits in cattle via a large transcriptome-wide association study (TWAS). All the genome annotation files are based on ARS-UCD1.2 (Ensembl 96 version). The cGTEx Portal allows researchers to query gene expression, alternative splicing and QTLs across tissues in an easy and uniform way, and can serve as a primary source of reference for cattle genomics, cattle breeding, adaptive evolution, comparative genomics, and veterinary medicine.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
Impact	It has been accepted in Nature Genetics (not yet published) and led to an international collaboration to create a much more comprehensive database (farmGTEx) that aims to combine data from different species.
URL	https://cgtex.roslin.ed.ac.uk/


Title	Additional file 1 of Comparative transcriptome in large-scale human and cattle populations
Description	Additional file 1: Table S1. Summary of RNA-seq samples in humans and cattle.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_1_of_Comparative_transcriptome_...


Title	Additional file 10 of Comparative transcriptome in large-scale human and cattle populations
Description	Additional file 10: Table S9. Partitioning heritability with expression-conserved and divergent genes in milk production traits using GREML-LDMS.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_10_of_Comparative_transcriptome...


Title	Additional file 11 of Comparative transcriptome in large-scale human and cattle populations
Description	Additional file 11: Table S10. Summary of novel variants detected by PolyFun + SuSiE in human height.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_11_of_Comparative_transcriptome...


Title	Additional file 3 of Comparative transcriptome in large-scale human and cattle populations
Description	Additional file 3: Table S2. Significantly enriched Gene Ontology terms for three groups of tissue-specific genes.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_3_of_Comparative_transcriptome_...


Title	Additional file 4 of Comparative transcriptome in large-scale human and cattle populations
Description	Additional file 4: Table S3. Significantly enriched Gene Ontology terms for up-regulated genes in cattle and humans.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_4_of_Comparative_transcriptome_...


Title	Additional file 5 of Comparative transcriptome in large-scale human and cattle populations
Description	Additional file 5: Table S4. Significantly enriched Gene Ontology terms for genes with more conserved expression between human and cattle than between human and mouse.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_5_of_Comparative_transcriptome_...


Title	Additional file 6 of Comparative transcriptome in large-scale human and cattle populations
Description	Additional file 6: Table S5. Significantly enriched Gene Ontology terms for genes with variable and consistent expression across tissues in humans and cattle.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_6_of_Comparative_transcriptome_...


Title	Additional file 7 of Comparative transcriptome in large-scale human and cattle populations
Description	Additional file 7: Table S6. Summary of 46 GWAS in humans.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_7_of_Comparative_transcriptome_...


Title	Additional file 8 of Comparative transcriptome in large-scale human and cattle populations
Description	Additional file 8: Table S7. Summary of LDSC results of base model (without partitioning heritability) for 46 human complex traits.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_8_of_Comparative_transcriptome_...


Title	Additional file 9 of Comparative transcriptome in large-scale human and cattle populations
Description	Additional file 9: Table S8. Heritability enrichment analysis of expression-conserved and divergent genes in human complex traits using LDSC.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_9_of_Comparative_transcriptome_...


Title	Comprehensive analyses of 723 transcriptomes enhance genetic and biological interpretations for complex traits in cattle
Description	We here uniformly analyzed 723 (156 newly generated and 567 existing) RNA-seq datasets to build a gene atlas in cattle, which included 91 tissues and cell types from 447 individuals. We summarized the sample information, their NCBI accession numbers, and expression (FPKM) of 24,616 Ensembl genes (based on UMD3.1) here. Through integrative analyses of this gene atlas with large-scale genome-wide association studies, we detected relevant tissues/cell types and candidate genes for 45 economically important traits in cattle (under review in Genome Research). This cattle gene atlas will serve as a primary source for biological interpretation and functional validation of GWAS findings, studies of adaptive evolution and population genetics, as well as genomic improvement in cattle.
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
Impact	This created the basis of a collaboration that led to the creation of a bigger dataset (cGTEx) and ultimately the farmGTEx international collaboration.
URL	http://cattlegeneatlas.roslin.ed.ac.uk/


Title	Gene ATLAS GWAS database
Description	Database containing genome-wide association analysis results for 778 human traits and ~30 million genetic variants. In the analysis we used ~450,000 individuals from UK Biobank. We also developed a web tool to browse this database (see also "Software & Technical Products" section).
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes
Impact	The website where the database is publicly available has received more than 140,000 visits from researchers around the world since it was created. We also published an article in a high impact journal (https://www.nature.com/articles/s41588-018-0248-z).
URL	http://geneatlas.roslin.ed.ac.uk/


Title	PigGTEx_v0 - Significant molQTL
Description	Summary statistics of significant molQTL from the pilot phase of PigGTEx (http://piggtex.farmgtex.org).
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
URL	https://www.scidb.cn/en/detail?dataSetId=8c6e14efcc8d4e37b5e5294c86439367


Title	cGTEx_dataset:A multi-tissue atlas of regulatory variants in cattle
Description	The files are the raw data of the cGTEx dataset used in the publication https://doi.org/10.1038/s41588-022-01153-5. For details, please read the Methods section.
1. cGTEx_meta_data_8646sample.xlsx: Metadata consisting of sample names with their sample accession, including information such as data size, cleaned reads, mapping rate, and age. The data is extracted from SRA (https://www.ncbi.nlm.nih.gov/sra/) and BIGD (https://bigd.big.ac.cn/bioproject/) (samples starting with CRS).
2. cGTEx_count_8646sample_27607gene.txt.gz: Raw RNA-seq read counts of 27607 genes (column names are Ensembl gene ids) for 8646 samples (row names).
3. cGTEx_TPM_8646sample_27607gene.txt.gz: TPM values of 27607 genes (column names are Ensembl gene ids) for 8646 samples (row names).
4. cGTEx_imputed_vcf.tar.gz: Imputed genotypes (SNPs) of 7297 RNA-seq samples on 29 autosomes.
5. cGTEx_exon_junction_8646sample.tar.gz: Exon junction files for the 8646 samples.
Note: Small discrepancies in some sample names, or the absence of headers in some datasets, compared to https://cgtex.roslin.ed.ac.uk/ are sorted out in this upload.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://zenodo.org/record/7560234


Description	Vast-scale linear mixed modelling genetic discovery approaches for genetic by environment association analyses
Organisation	GlaxoSmithKline (GSK)
Country	Global
Sector	Private
PI Contribution	Analyzing large datasets, such as those of the size of UK Biobank, is computationally expensive. The challenge is greater when thousands of phenotypes have to be analyzed. Although different software solutions are emerging, they are in general limited in the types of models they can fit. To address this problem, we are expanding our tool (http://www.dissect.ed.ac.uk/) to test genetic by environment interactions on thousands of phenotypes in datasets of the size of UK Biobank.
Collaborator Contribution	UK Biobank provided access to thousands of measurements on very large numbers of individuals. However, several of those require expertise in a particular field to properly prepare the data, or combine different data fields to generate or curate a new one. GSK has the expertise and resources to do this.
Impact	Work in progress. There is no output from this collaboration yet.
Start Year	2018


Title	Interactive analysis of large datasets whilst keeping the data protected.
Description	Human genomic data is doubling in size every seven months and will soon exceed other Big Data generators such as astronomy, YouTube and Twitter. Extracting value from this data is a key step in areas such as drug targeting and personalized medicine. According to Global Market Insights, the digital genome market is projected to hit $50.4 billion by 2025, and is key for two UK Grand Challenges: AI & data and ageing society. Accordingly, the UK is positioning itself as a big player through strategic investments to create world-leading resources such as UK Biobank and Genomics England. Reaching the full potential of this substantial investment relies on developing associated industries around it to unlock value from the data. However, several barriers still exist: a) Legal, political, or economic restrictions hamper access to fragmented multi-institutional and multi-national data. b) Preparing and analysing the data may require days or even weeks of a highly skilled individual's work performing repetitive low-value tasks. c) Analysing genetic data requires multidisciplinary skills. d) Large computational resources are required. Not all organisations perceive these problems the same way. Whilst public organisations and small biotechs struggle to find adequate skills and allocate the required computational resources, this does not seem to be a major concern for big pharma companies. On the other hand, difficulty accessing data scattered across multiple organisations affects all organisations. Other problems in the field include scalability, evidence, equity, democratization, information, health, and carbon footprint. Several companies have been created to address these problems, some of which we have met (LifeBit and DNA Nexus). We do not believe their solutions satisfactorily address the major challenges we have identified in the field. In particular, the analyst still needs to "see" the sensitive data through their platforms, and analyses still incur high computational costs and long run times. The development proposes to overcome these challenges through: 1) An extremely efficient computation engine that we developed. Using inexpensive hardware, it reduces large-dataset analysis times from days to seconds. 2) An easy-to-use web system that enables the engine to be queried interactively without requiring direct data access, even by the person analysing it. Together, these technologies can be disruptive to how data is currently accessed and analysed. Our solution can move the field from the current situation, where an analyst struggles to reach data and then spends weeks in an iterative cycle of data preparation and analysis, to a situation where fragmented multi-organisation data is accessed easily and queried interactively and on demand.
IP Reference
Protection	Trade Mark
Year Protection Granted
Licensed	Yes
Impact	It is used in an early stage spin-out


Title	Real time diagnosis of rare diseases
Description	One in two hundred babies will be born with a developmental disorder. The success of human genetics means there is an ever-increasing number of human diseases for which genome-wide sequencing of DNA (DNA-GWS) can provide a confident diagnosis. A unifying and robust diagnosis is key to improving prognosis and identifying effective therapeutic strategies. This is particularly critical in rapid DNA-GWS in acutely unwell infants, one of the most rapidly growing sectors of diagnostic genomics. As sequencing technologies improve, comprehensive analysis of sequence data has become the bottleneck. Currently, analysis of different classes of genomic variants is done serially, using distinct and computationally intensive quality control (QC) measures prior to diagnostic analysis to determine the clinical (or research) usefulness of each test. The scale of DNA-GWS data means that each step requires many hours of computational time, with review by specially trained scientists. Our disruptive innovation answers an important unmet need: to provide flexible and robust methods to identify all major classes of disease-associated genomic pathologies in real time. Our analytical protocols can QC and optimise the data on a per-individual, per-family or per-cohort basis as required, enabling novel analytical approaches to be developed, tested and implemented. Our current prototype, comprising the computation engine and a rudimentary user interface, has been tested on parent-child trio DNA-GWS data, where it performs variant tests at a rate of 119 million variants/second, allowing filter manipulations sufficient to compare 100 parameters on 50 million variants within 0.42 seconds on consumer-grade hardware.
IP Reference
Protection	Trade Mark
Year Protection Granted
Licensed	No
Impact	Still under development.


Title	Expansion of DISSECT, a tool to use High Performance Distributed Computing environments to perform analysis on large datasets.
Description	DISSECT is designed to distribute computationally expensive analyses across large numbers of computing nodes connected through a network. This gives very high scalability by enabling the use of thousands of processors and large amounts of memory to perform a single analysis on very large datasets. The tool is not designed to perform just one particular analysis; it is easy to extend with new analyses on this distributed computing scheme. During this period, we extended the capabilities of DISSECT by adding the possibility of performing genome-wide association studies on large numbers of phenotypes in one analysis. We are also extending it to add more complex tests such as genetic by environment or genetic by genetic interactions.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	Since we started developing DISSECT, we have used it to produce several works published in different high impact journals. In the last year, the expansion of this tool allowed us to pre-compute genome-wide association studies for 778 human traits using ~450,000 related and unrelated UK Biobank individuals. The results of these analyses allowed us to develop and publish the Gene ATLAS database and web tool. This has been published in a high impact journal (https://www.nature.com/articles/s41588-018-0248-z), and the website has received more than 140,000 visits from researchers around the world since it was made public.
URL	http://www.dissect.ed.ac.uk/


Title	Gene ATLAS
Description	Web tool for exploring the Gene ATLAS database (see also "Research Databases & Models"), which contains results from pre-computed genome-wide association studies on 778 human traits analyzed using ~450,000 related and unrelated individuals from UK Biobank.
Type Of Technology	Webtool/Application
Year Produced	2018
Impact	The tool led to a publication in a high impact journal (https://www.nature.com/articles/s41588-018-0248-z), and has accumulated more than 140,000 visits from researchers around the world since publication.
URL	http://geneatlas.roslin.ed.ac.uk/


Company Name	Omecu
Description	Omecu develops a cloud-based platform for the analysis of large-scale genetic and epidemiologic datasets, with the aim of democratising genome data.
Year Established	2021
Impact	It has been created recently.
Website	http://omecu.com


Description	Talk to medical students about entrepreneurship
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Undergraduate students
Results and Impact	The Edinburgh Innovation Ambassador for the College of Medicine at the University of Edinburgh invited me to give a presentation to students about my experiences of spinning out from the university. I made a presentation to them in the Student Enterprise Hub.
Year(s) Of Engagement Activity	2022
URL	https://events.irm.ed.ac.uk/Events/Event/7015J000000HU5Y
