Developing methods and software to analyse and interpret Normal Human Genome Variation and Disease

Lead Research Organisation: University of Manchester
Department Name: Computer Science


A common strategy for selecting disease-causing genes/variants is to compare variants in
affected individuals with those in population databases of unaffected individuals. Patients with severe
phenotypes are expected to carry variants which are very rare or absent in healthy individuals. The challenge is
to identify them among the thousands of other variants. This is complicated by issues such as the
presence of rare benign variants in affected individuals (1-4% have frequency <0.5%[1]); multiple
genes causing the same disease phenotypes; and deleterious mutations occurring in healthy
populations, possibly due to reduced penetrance[2].
The overall aim of this project is to link an individual's genetic variation to disease potential by
studying mutations and their propensity to be tolerated in the context of population variation. A
computer-based multidimensional approach will be used to develop software that considers
multiple factors (e.g. genome/gene/protein/phenotype/network) and analyses human genome variants
at individual (where data is available) and group levels in both disease and population datasets (e.g.
ExAC[3], gnomAD[4], 1000 Genomes[1]), to help interpret and predict where disease-causing
variation is likely to arise in the human genome. Automating existing manual procedures and
applying them to large-scale 'omics' datasets, as well as developing new methods for more detailed
computational analysis of genomic data, would aid variant prioritization and provide a ranked subset
of variants for further biological investigation.



Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/N509565/1                                      01/10/2016   30/09/2021
1959618             Studentship    EP/N509565/1   13/07/2017   30/09/2020   Nikita Abramovs
Description Normally, humans have two copies of each gene (except X chromosome genes in males), and variants can affect one or both gene copies (called heterozygous and homozygous, respectively). The balance between the number of heterozygous and homozygous individuals in a population can be assessed against the proportions expected under Hardy-Weinberg equilibrium. Previous studies showed that deviations from Hardy-Weinberg equilibrium in large population databases (i.e. an excess or deficiency of heterozygotes) can be caused by sequencing errors. Individuals with severe childhood disorders are often excluded from these databases. Therefore, we hypothesised that in some cases deviations from Hardy-Weinberg equilibrium due to an excess of heterozygotes might also be caused by this filtration and by natural selection, when homozygous variants result in severe recessive diseases. We developed a filtering strategy to detect variants with a heterozygote excess that is unlikely to be caused by sequencing errors and applied it to data from 137,842 individuals in the gnomAD database. We identified 161 such variants in 149 genes, most of which were specific to African/African American populations (~79.5%). Although the majority of them were not associated with known diseases or were classified as clinically "benign," they were enriched in genes associated with autosomal recessive diseases. The resulting dataset also contained two known recessive disease-causing variants with evidence of heterozygote advantage, in sickle-cell anemia (HBB) and cystic fibrosis (CFTR). We anticipate that our approach will aid the detection of rare recessive disease-causing variants in the future.
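
As an illustration of the kind of check described above, the following minimal sketch compares observed genotype counts for a single biallelic variant with the counts expected under Hardy-Weinberg equilibrium and reports whether heterozygotes are in excess. It is not the project's filtering strategy: the function name `hwe_summary`, the returned field names, and the use of a one-degree-of-freedom chi-square approximation are assumptions made for this example, and the approximation is known to be unreliable for rare variants, where an exact test would normally be preferred.

```python
import math


def hwe_summary(n_ref_hom, n_het, n_alt_hom):
    """Compare observed genotype counts with Hardy-Weinberg expectations.

    Hypothetical helper for illustration only; assumes a biallelic,
    autosomal variant and at least one genotyped individual.
    """
    n = n_ref_hom + n_het + n_alt_hom
    q = (2 * n_alt_hom + n_het) / (2 * n)  # alternate allele frequency
    p = 1.0 - q                            # reference allele frequency

    observed = (n_ref_hom, n_het, n_alt_hom)
    expected = (p * p * n, 2 * p * q * n, q * q * n)

    # Pearson chi-square statistic; skip classes with zero expectation.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))

    # Inbreeding coefficient F: negative values mean more heterozygotes
    # than expected, positive values mean fewer.
    exp_het = expected[1]
    f = 1.0 - n_het / exp_het if exp_het > 0 else 0.0

    return {
        "alt_allele_freq": q,
        "expected_het": exp_het,
        "chi2": chi2,
        "p_value": p_value,
        "inbreeding_coeff": f,
        "het_excess": n_het > exp_het,
    }
```

For example, `hwe_summary(600, 400, 0)` describes 1,000 individuals with 400 heterozygotes and no alternate homozygotes; the expected heterozygote count is about 320, so the sketch flags a heterozygote excess with a negative inbreeding coefficient.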

Gene variation intolerance metrics are widely used in genetic studies as in silico evidence that candidate genes might play an important role in human health. The main idea is that genes, or regions within them, that contain fewer variants in large healthy population databases (e.g. gnomAD) are more intolerant to variation and therefore might be more important. In other words, there might be few or no variants in these regions/genes because individuals who carried them were affected by some disease and were therefore absent from the healthy population database. Previous variation intolerance metrics focused on measuring the overall variant load in a gene or on identifying its more intolerant regions. We found that gene intolerance can also be estimated by measuring the distribution of variants within a gene. In other words, the lower the chance that the variants are spread randomly within a gene, the more their distribution could have been shaped by natural selection, and therefore the more intolerant/important the gene could be. This approach showed the potential to prioritise short and medium-length intolerant genes which might be associated with recessive diseases.
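
The idea of asking how likely an observed variant distribution is under random placement can be sketched with a simple permutation test. This is an illustrative stand-in, not the GeVIR algorithm: the choice of the longest variant-free stretch as the test statistic, the uniform random placement model, and the names `longest_gap` and `gap_p_value` are all assumptions made for this example.

```python
import random


def longest_gap(positions, gene_length):
    """Length of the longest variant-free stretch in a gene.

    Positions are 1-based coordinates within the gene; stretches at
    both ends of the gene are counted.
    """
    points = sorted(set(positions))
    edges = [0] + points + [gene_length + 1]
    return max(b - a - 1 for a, b in zip(edges, edges[1:]))


def gap_p_value(positions, gene_length, n_sim=2000, seed=0):
    """Estimate how often random placement yields a gap at least as long.

    Places the same number of variants uniformly at random n_sim times
    and returns the fraction of simulations whose longest gap is at least
    the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    n_variants = len(set(positions))
    observed = longest_gap(positions, gene_length)

    hits = 0
    for _ in range(n_sim):
        simulated = rng.sample(range(1, gene_length + 1), n_variants)
        if longest_gap(simulated, gene_length) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)
```

A small value from `gap_p_value` suggests that the variants are clustered away from some region more strongly than chance placement would explain; a real metric would also need to account for sequence context, mutation rates, and coverage, which this sketch ignores.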
Exploitation Route The metrics which we have developed can facilitate the prediction of the inheritance mode (autosomal dominant, AD, or autosomal recessive, AR) for potential disease genes, and should aid the interpretation of genome sequencing data in a clinical setting and advance human disease gene discovery.
Sectors Healthcare, Pharmaceuticals and Medical Biotechnology

Title Gene Variation Intolerance Rank (GeVIR) website 
Description GeVIR is a recently published continuous gene-level metric that uses variant distribution patterns to prioritize disease candidate genes. The current version of the GeVIR website allows users to search for gene intolerance scores (the same data as in the manuscript's supplementary material) and to check whether a variant is located in a potentially variant-intolerant region. GeVIR scores were calculated using population variant data from gnomAD, which is continuously updated. The website was designed so that GeVIR can be updated (scores recalculated using new population variant data) or improved after publication, and so that new gene-level metrics or tools that estimate gene importance/intolerance can be added.
Type Of Technology Webtool/Application 
Year Produced 2019 
Impact The manuscript with a link to the website was published only 2 months ago, so there is no notable impact yet.