Using whole genome sequencing to identify non-coding elements associated with diabetes and related traits across ancestries

Lead Research Organisation: UNIVERSITY OF EXETER
Department Name: Institute of Biomed & Clinical Science

Abstract

The complete genetic sequences, medical records, and extensive health data of over 1 million people will become available to researchers this year. Major progress has recently been made in understanding the regulatory sequences in the human genome that act as switches, turning genes on and off in cells. There are only a few examples of variants in these DNA switches causing disease. We have identified variants in these switches that cause very rare diseases. In one case, variants in a short sequence mean that children are born without a pancreas; we showed that this short sequence is a master switch that turns on the key gene leading to pancreas development. We have also identified very rare variants in another switch that lead to children producing too much insulin and having dangerously low glucose levels. In this case the switch is inappropriately turned on, so a protein is produced in the pancreas that should not be. In this project we will use whole genome sequencing data from more than 1 million individuals to identify the switches that are important for common type 2 diabetes.

As preliminary data and proof of principle, we have already analysed height in 150,000 UK Biobank participants and identified 31 previously unknown associations. One example is variants in a switch that turns on a gene called HMGA1. People with these switch variants are, on average, 5 cm taller. This is particularly interesting because changing the protein sequence of HMGA1 does not affect height. We have confirmed these associations in 200,000 people from the All of Us and TOPMed cohorts. We have also performed preliminary analyses for diabetes and identified an association with a rare variant near HNF1A that occurs in a long non-coding RNA, a specific type of switch. We have recently demonstrated that this long non-coding RNA is important for turning on HNF1A.

Analysing data on 1,000,000 whole genomes is extremely challenging computationally, and interpreting the results is also a substantial challenge. This project will build on our initial work by refining our whole genome sequencing (WGS) analysis pipeline to make it efficient, cost-effective and publicly available. The project is timely because UK Biobank will release whole genome sequence data on 500,000 people by the end of this year. We will use these data to perform single variant and group testing of regulatory switches. The analyses will be performed in different ancestry groups as well as in a combined analysis. We will confirm our findings using the US cohorts All of Us and TOPMed, which will have >500,000 individuals of diverse ancestries available for analysis. We will test the identified regions in our rare familial diabetes cohort and in the 100,000 Genomes Project. These are collections of people who are expected to have a single genetic cause of their diabetes. This is important because we have an excellent track record of translating genetic diagnosis into treatment change. We will also perform functional follow-up of a subset of switches to provide new insights into pancreas development and function.

This project will provide a substantial advance in our understanding of the role of non-coding variants in human disease. It will allow us to develop efficient and cost-effective approaches for analysing whole genome sequence data. We will provide new insights into the regulation of pancreas development and function. The project may also dramatically improve the quality of life for some patients with rare forms of diabetes. It is important if we are to make major advances in understanding disease mechanisms using whole genome sequencing.

Technical Summary

We will use data on >1 million people with whole genome sequencing, together with regulatory and epigenomic annotation, to identify new non-coding causes of type 2 diabetes and related traits. We will build on our preliminary work by refining and developing our WGS analysis pipeline to make it efficient and cost-effective for dealing with millions of whole human genomes. We will expand our discovery analysis to the full 500,000 whole genomes from UK Biobank, performing single variant and aggregate testing across all ancestries for 10 metabolic traits. We will adjust for all known variants associated with the different diseases and traits. We will annotate variants using data from publicly available and new, specific regulatory epigenetic maps and novel transcripts derived from long read sequencing. We will replicate using the US cohorts All of Us and TOPMed, which will have >500,000 individuals of diverse ancestries available for analysis. We will leverage the diverse ancestries to aid fine-mapping of previously published GWAS signals and to identify allelic series of multiple causal variants within loci. We will test the identified regions in our monogenic diabetes cohort and in the 100,000 Genomes Project to identify new causes of monogenic diabetes and related single gene conditions. For selected associated regions and variants we will perform functional follow-up of variants and elements to provide new insights into pancreas development and function.
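
The difference between single variant and aggregate testing can be illustrated with a minimal sketch. It is not the project's pipeline: the data are invented, the variant names (`v1` to `v3`) and both helper functions are hypothetical, and a plain least-squares slope stands in for the regression models, covariate adjustment and relatedness handling that a real analysis would need. The idea shown is that rare variants in one regulatory element are collapsed into a per-person carrier count, which is then tested against the trait as a single predictor.

```python
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (simple linear regression)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor has no variance")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def burden_scores(genotypes, variant_ids):
    """Collapse the listed variants into one rare-allele count per person."""
    return [sum(person[v] for v in variant_ids) for person in genotypes]


# Invented toy data: six people, three rare variants in one hypothetical element.
genotypes = [
    {"v1": 0, "v2": 0, "v3": 0},
    {"v1": 1, "v2": 0, "v3": 0},
    {"v1": 0, "v2": 1, "v3": 0},
    {"v1": 0, "v2": 0, "v3": 1},
    {"v1": 0, "v2": 0, "v3": 0},
    {"v1": 0, "v2": 0, "v3": 0},
]
trait = [1.0, 6.0, 6.0, 6.0, 1.0, 1.0]  # arbitrary units, made up

# Single variant test: each variant has only one carrier in this sample.
single_v1 = ols_slope([g["v1"] for g in genotypes], trait)

# Aggregate (burden) test: all three carriers contribute to one predictor.
burden = burden_scores(genotypes, ["v1", "v2", "v3"])
aggregate = ols_slope(burden, trait)

print("single variant slope (v1):", single_v1)
print("aggregate burden slope:", aggregate)
```

In this toy sample each variant on its own is carried by one person, so a single variant estimate rests on one observation, while the aggregate estimate pools all three carriers. Which variants to group is the key design choice, and in practice it would be guided by the regulatory annotation described above.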
