Computational analysis of protein covariation for the identification of disease-associated variants in coding regions

Lead Research Organisation: University of Manchester
Department Name: School of Medical Sciences

Abstract

Attaining good health and well-being for everyone is the main goal of many diverse initiatives. Local and global institutions are targeting a reduction in mortality and an improvement in the standard of care over the coming years. A key element in achieving these goals is the development of better diagnostic tools: more accurate diagnoses may mean more precise treatments with fewer side effects.
Because most human diseases have a genetic component, it is essential that we improve the computational methods for precisely identifying the genome variants that cause congenital disorders (e.g. cystic fibrosis) or increase the risk of multifactorial diseases such as type-2 diabetes.
Current approaches to discovering disease-causing variants and to genetic diagnostics focus on identifying genome variants that 1) are present in patients but absent or extremely rare in healthy populations, and 2) are deemed detrimental because they alter evolutionarily conserved positions. Both of these conditions implicitly assume that single variants are responsible for the disease. Nevertheless, in many cases the analyses remain inconclusive as to which particular variants cause the disorder.
My hypothesis is that in many cases the disease is caused not by single variants but by an unfavourable combination of genomic variants; that is, variants that are each found separately in the human population but that have a harmful effect when they occur together in an individual. I will focus on variants within the coding genome (the part of the genome that codes for proteins), because proteins carry out the vast majority of biological processes within cells and tissues. First, I will identify which protein positions show evidence of covariation throughout evolution or within the human population; namely, those pairs of positions where changes have often occurred concurrently in order to maintain the fitness of the organism. Second, I will identify which combinations of amino acids are favoured at those covarying positions. Third, I will use this information to reanalyse genetic testing data from NHS patients suffering from cardiac or eye genetic disorders. My goal is to increase the proportion of cases that can be genetically diagnosed. Finally, I will analyse sequencing data from patients suffering from Tetralogy of Fallot, a type of Congenital Heart Disease. This is a complex genetic disorder involving variants in various genes; however, the exact causes are unknown. I will identify cases where the disorder is caused by a harmful combination of variants.
My research will produce a series of computational tools that will be made freely available to the academic and clinical communities. These tools will make a decisive contribution to the goal of achieving better genetic diagnoses.

Technical Summary

I will develop a series of computational tools for analysing the consequences of human variants while taking the genetic background into account. My hypotheses are 1) that some protein residues tend to covary (i.e., residues are replaced simultaneously more often than expected if their variation were not correlated), and 2) that the risk posed by some variants is associated with the absence of compensatory variants at covarying positions.
I will use evolutionary and population data to identify the covarying positions within and between human proteins. First, I will download or build multiple sequence alignments of the homologous sequences. Second, I will infer the phylogenetic tree associated with each alignment. Third, I will use a Maximum Parsimony approach to identify the branches of the tree on which residues change. For each pair of protein positions, I will count how many times both positions change on the same branch (double changes) and how many times only one of them changes (single changes). As the numbers of single and double changes are linearly related for each given position, I will use regression analysis to identify the pairs of residues with fewer single changes than expected. Statistical significance will be assessed through multiple Monte Carlo simulations. After that, I will identify which amino acid couplings are favoured at each of those covarying positions.
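The counting, regression and simulation steps described above could be sketched as follows. This is a minimal illustration only, not the project's implementation. It assumes that the parsimony step has already produced, for each position, the set of tree branches on which that position changed. All function names are hypothetical, and the null model (placing one position's changes uniformly at random on the branches, ignoring branch lengths) is a simplifying assumption.

```python
import random


def count_changes(branches_i, branches_j):
    """Return (double, single) change counts for two positions.

    Each argument is the set of tree branches on which that position changed.
    """
    double = len(branches_i & branches_j)  # both positions change on the branch
    single = len(branches_i ^ branches_j)  # exactly one position changes
    return double, single


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0  # no spread in x: fall back to a flat line
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def single_change_residuals(focal, partners):
    """Observed minus fitted single changes for each partner of a focal position.

    A strongly negative residual means fewer single changes than the
    regression line predicts for that number of double changes.
    """
    counts = {name: count_changes(focal, br) for name, br in partners.items()}
    xs = [d for d, _ in counts.values()]
    ys = [s for _, s in counts.values()]
    a, b = fit_line(xs, ys)
    return {name: s - (a + b * d) for name, (d, s) in counts.items()}


def monte_carlo_p_value(branches_i, branches_j, all_branches, n_sim=1000, seed=0):
    """Estimate how often random placement gives at least the observed double changes.

    Assumed null model: the changes of position j are placed uniformly at
    random on the branches of the tree, keeping their number fixed.
    """
    rng = random.Random(seed)
    observed, _ = count_changes(branches_i, branches_j)
    pool = sorted(all_branches)
    hits = 0
    for _ in range(n_sim):
        simulated_j = set(rng.sample(pool, len(branches_j)))
        if len(branches_i & simulated_j) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)  # add-one correction avoids a p-value of zero
```

In this sketch a pair of positions would be a candidate for covariation when its residual is strongly negative and its simulated p-value is small. A realistic null model would probably need to account for branch lengths and site-specific substitution rates.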
In a second stage, I will use this information in order to analyse gene panels from patients of the Manchester Centre for Genomic Medicine and exome data from patients with Tetralogy of Fallot. The strategy will be to identify uncoupling variants that affect the normal pattern of variation within covarying positions. I will release the code for bioinformaticians interested in including my tool in their analytical pipelines.
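As an illustration of the second stage, the check for uncoupling variants might look like the sketch below. It assumes that the first stage has produced, for each covarying pair of positions, the set of favoured amino acid combinations, and that a patient's residues at those positions are known. The data layout, the position labels and the function name are hypothetical.

```python
def find_uncoupled_pairs(patient_residues, favoured_couplings):
    """Return covarying position pairs whose residue combination is not favoured.

    patient_residues: dict mapping a position label to a one-letter amino acid.
    favoured_couplings: dict mapping (pos_a, pos_b) to the set of favoured
        (aa_a, aa_b) combinations for that covarying pair.
    """
    flagged = []
    for (pos_a, pos_b), favoured in favoured_couplings.items():
        if pos_a not in patient_residues or pos_b not in patient_residues:
            continue  # position not covered by the gene panel or exome data
        combo = (patient_residues[pos_a], patient_residues[pos_b])
        if combo not in favoured:
            flagged.append(((pos_a, pos_b), combo))
    return flagged


# Toy example with made-up position labels and couplings.
couplings = {
    ("GENE1:45", "GENE1:112"): {("K", "E"), ("R", "D")},
    ("GENE1:60", "GENE2:8"): {("L", "I")},
}
patient = {"GENE1:45": "K", "GENE1:112": "D", "GENE1:60": "L", "GENE2:8": "I"}
uncoupled = find_uncoupled_pairs(patient, couplings)
```

Here the first pair is flagged because the combination K with D is not among the favoured couplings, even though K and D each appear in a favoured combination on their own.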

Planned Impact

The main goal of this research programme is the development of computational tools
that help achieve more accurate diagnoses in the area of clinical genetics. The
improvement of diagnostic methodologies and instruments is important to both the
local and international socio-political agendas: e.g., Devo Manc (the Health and Social
Care Devolution occurring in Greater Manchester), Health Innovation Manchester
(the partnership for speeding up discovery, development and delivery of innovative
healthcare solutions), and the UN Agenda for Sustainable Development.
The prediction of disease-associated variants is an area of research that has
attracted the interest of many researchers; however, I am proposing a novel
approach that focuses on variants that are undetectable with standard strategies.
This means that my research has the potential to make a huge impact in both the
academic and non-academic worlds. The academic beneficiaries include not only the
staff involved in the project and my close collaborators, but also other researchers at
the University of Manchester and beyond. The non-academic beneficiaries
of this research include clinical geneticists, patients with congenital
diseases, the NHS, and software companies.
I will become a fully independent researcher with this New Investigator Research
Grant. The postdoctoral scientist involved in the project will acquire expertise in
areas that will allow them to pursue an academic career in basic science or devote
themselves to Medical Informatics. I will collaborate with Professors Black, Keavney
and Newman throughout this research programme. They will provide me with
sequencing data from patients with genetic disorders, and in return they will gain
knowledge of variants potentially involved in the disease and of possible molecular
mechanisms of deleteriousness. I also have collaborations with several research groups at the
School of Biology (e.g. Lovell, Pavitt and Hentges groups) that are interested in the
causes and effects of protein variation. I am also exploring collaborations with
researchers involved in genetic diagnostics in other University Hospitals or Research
Institutes; e.g., Professor de la Cruz from the Vall d'Hebron Hospital. Finally, many
other researchers (locally or worldwide) will benefit from my methods for identifying
covarying positions and detrimental variants. They will learn of my findings
through seminars, publications, and conference talks and posters.
Clinical geneticists will be able to include my approach in their diagnostic procedures.
Thus they will be able to reach a definitive genetic diagnosis in more cases than they
do now. This is important because the same disease or syndrome might
comprise different subtypes, which would differ in the variants that cause them.
This opens the door to treating patients differently according to subtype. This
personalised medicine is integral to the NHS Five Year Forward View. To encourage NHS
professionals to use my methods, I will seek advice on the design of the
computational tools from my collaborators at the Manchester Centre for Genomic
Medicine. The aim is to make the tools accessible not only to clinical
bioinformaticians but also to clinical geneticists and a wider population of clinical and
healthcare staff.
Health care providers such as the NHS, equivalent foreign health systems, and
private suppliers will benefit from their professionals being able to achieve more
accurate diagnoses. In many cases this will result in more precise and cost-effective
treatments, which will improve the quality of life of patients suffering from genetic
disorders. Finally, software companies may be interested in using the knowledge that
I generate for developing their own commercial tools for genetic analyses. In that
case, I will take advice from the University of Manchester Intellectual Property office
and the Business Engagement services.