Developing methods for identifying clinical phenotypes from routinely collected health data with applications to stroke genetics.

Lead Research Organisation: University of Edinburgh
Department Name: Centre of Population Health Sciences

Abstract

Genetic and other risk factor associations with stroke are known to be largely type and subtype specific. Systematic review data and UK Biobank pilot work have shown that stroke type (ischaemic versus haemorrhagic) is not specified for around 40%, and further stroke subtype (TOAST, OCSP, haemorrhage location, etc.) is not specified for around 70%, of stroke cases ascertained from routinely collected coded health data. With expert adjudication it is possible to assign a type and subtype for around 80% of these cases, but this method does not scale to very large studies (e.g., UK Biobank). I propose to develop scalable, automated methods that will allow further stroke typing and subtyping from routinely collected health data by investigating the use of various algorithmic code combinations and of natural language processing methods that could be applied to free-text medical records and imaging reports. I then propose to validate these methods directly, as well as indirectly by comparing them with other phenotyping approaches in genetic studies.
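As an illustration only, one very simple form of an "algorithmic code combination" could look like the sketch below. It is not the method developed in this project: the function name, the category labels, and the rule that a specific code overrides an unspecified one are all assumptions made for the example. The ICD-10 categories used (I60, I61, I63, I64) are the standard three-character stroke codes.

```python
# Hypothetical mapping from ICD-10 three-character categories to stroke type.
# The grouping is an assumption made for this sketch.
ICD10_STROKE_TYPE = {
    "I60": "haemorrhagic",  # subarachnoid haemorrhage
    "I61": "haemorrhagic",  # intracerebral haemorrhage
    "I63": "ischaemic",     # cerebral infarction
    "I64": "unspecified",   # stroke, not specified as haemorrhage or infarction
}


def assign_stroke_type(codes):
    """Combine all coded diagnoses for one participant into a single type.

    Assumed rule: a specific code (ischaemic or haemorrhagic) recorded
    anywhere overrides an unspecified stroke code; disagreeing specific
    codes are flagged rather than resolved.
    """
    # Reduce each code to its three-character category, e.g. "I63.9" -> "I63".
    types = {
        ICD10_STROKE_TYPE[code[:3]]
        for code in codes
        if code[:3] in ICD10_STROKE_TYPE
    }
    specific = types - {"unspecified"}
    if len(specific) == 1:
        return specific.pop()
    if len(specific) > 1:
        return "conflicting"
    return "unspecified" if types else "no stroke code"


# An unspecified code from one source plus a specific code from another.
print(assign_stroke_type(["I64", "I63.9"]))
```

Cases that remain "unspecified" or "conflicting" under a rule like this are the ones that would still need free-text methods or expert adjudication.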

Phenome-wide association studies can be used to systematically examine the impact of one or many genetic variants across a broad range of human phenotypes. They have the potential to reveal novel insights into underlying disease mechanisms and to identify novel drug targets and drug repurposing opportunities. UK Biobank, with its vast and varied phenotypic data, is highly suitable for these studies. However, to date there is a relative lack of sophisticated phenotypic methods to select and identify outcomes of interest. I propose to apply existing phenome-wide association study methods to investigate hypothesis-based associations with stroke as a model disease, and to develop these methods further for wider use.

During the past decade, findings of genome-wide association studies have improved our knowledge and understanding of complex disease genetics. Statistical analysis typically looks for association between a phenotype and single genetic variants taken individually via single-variant tests. However, this approach oversimplifies the complexity of the underlying biological mechanisms. The next step is to also consider the interactions between genetic variants, or epistasis. Epistasis detection gives rise to new analytic challenges, since analysing every combination of single nucleotide polymorphisms is at present impractical at a genome-wide scale. I propose to apply existing methods and develop these further for wider use, starting with a hypothesis-driven approach to investigate epistatic associations between selected stroke genes.
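The scale of the problem can be shown with a short calculation. The sketch below is illustrative only: the figure of one million variants and the placeholder gene names are assumptions, not values taken from this project.

```python
from itertools import combinations
from math import comb


def n_pairwise_tests(n_variants):
    """Number of distinct variant pairs, i.e. n choose 2."""
    return comb(n_variants, 2)


# Assumed genome-wide panel of one million variants (illustrative figure).
genome_wide_pairs = n_pairwise_tests(1_000_000)

# Hypothesis-driven alternative: test interactions only between a short
# list of pre-selected candidates (placeholder names, not real genes).
candidates = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
candidate_pairs = list(combinations(candidates, 2))

print(f"Genome-wide pairs: {genome_wide_pairs:,}")
print(f"Candidate pairs:   {len(candidate_pairs)}")
```

Because the number of pairs grows quadratically with the number of variants, restricting the search to pre-selected candidates reduces the testing burden by many orders of magnitude.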

Technical Summary

For objective (1) I will start by using data from the UK Biobank participants, building on the pilot work undertaken in the UK Biobank to date, which has generated "case vignettes" composed of relevant medical record free text. Following validation in UK Biobank, the developed algorithms and methods can then be applied to and further validated in other consented cohorts (e.g., Generation Scotland). For objective (2), I will build on an ongoing systematic review that will lead to a set of phenotypes that will form the hypothesis for a phenome-wide association study using data from the UK Biobank. For objective (3) I will use my existing network of collaborations within the International Stroke Genetics Consortium (ISGC) to access relevant datasets to test the proposed methods. I will also collaborate with other researchers from the University of Edinburgh and across the UK, as well as with other HDR UK fellows, to build the skills and capacity to undertake and further develop these proposed directions.
This project aligns with HDR UK priorities, as it will develop national leadership, partnerships, and interdisciplinary skills and capacity through the development of novel analytical methods and tools, which can in the future be applied and taken up for research of health conditions beyond stroke.

Publications

Zhang H (in Press) (2021) Benchmarking network-based gene prioritization methods for cerebral small vessel disease in Briefings in Bioinformatics

Whittaker E (2022) Systematic Review of Cerebral Phenotypes Associated With Monogenic Cerebral Small-Vessel Disease. in Journal of the American Heart Association

Rannikmäe K (2021) Developing automated methods for disease subtyping in UK Biobank: an exemplar study on stroke in BMC Medical Informatics and Decision Making

 
Description BHF REA3 pump priming award for project "Clinical consequences of rare variants in Cerebral Small Vessel Disease genes"
Amount £44,375 (GBP)
Organisation British Heart Foundation (BHF) 
Sector Charity/Non Profit
Country United Kingdom
Start 01/2020 
End 06/2021
 
Description Carnegie Vacation Scholarship for medical student summer project
Amount £1,000 (GBP)
Organisation Carnegie Trust 
Sector Charity/Non Profit
Country United Kingdom
Start 06/2019 
End 07/2019
 
Description Cerebral phenotypes associated with monogenic cSVDs
Amount £1,800 (GBP)
Organisation The Genetics Society 
Sector Charity/Non Profit
Country United Kingdom
Start 06/2021 
End 07/2021
 
Description Wellcome Trust University of Edinburgh ISSF funding
Amount £24,000 (GBP)
Organisation University of Edinburgh 
Sector Academic/University
Country United Kingdom
Start 07/2021 
End 12/2021
 
Title Code lists for the HDR UK Phenome Library 
Description Routinely collected health data disease code lists generated and published in the HDR UK Phenome Library: http://phenotypes.healthdatagateway.org/ 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact Generated codes for stroke and other diseases are publicly available for any researcher to use in their work with routinely collected health data. 
URL https://phenotypes.healthdatagateway.org/
 
Description Collaboration with Dr Honghan Wu in UCL 
Organisation University College London
Country United Kingdom 
Sector Academic/University 
PI Contribution We have worked together on two projects: one developing automated methods for stroke subtyping based on radiology reports, and a second using network methods to identify new genetic associations with cerebral small vessel disease. I have provided the clinical perspective for both projects, while Dr Wu has provided the machine learning / informatics perspective.
Collaborator Contribution See above.
Impact 2 publications
Start Year 2018
 
Description Collaboration with Estonian Biobank 
Organisation University of Tartu
Country Estonia 
Sector Academic/University 
PI Contribution I have consulted the Estonian Biobank team about identifying relevant stroke cases to include from their biobank for the Precise4q project.
Collaborator Contribution Site for Precise4q project.
Impact n/a
Start Year 2020
 
Description Collaboration with Y Ruigrok's team in UMC Utrecht 
Organisation University Medical Center Utrecht (UMC)
Country Netherlands 
Sector Academic/University 
PI Contribution The collaborative project is studying genetic associations with subarachnoid haemorrhage and unruptured intracranial aneurysms. Dr Ruigrok leads the working group within the International Stroke Genetics Consortium. My role has been developing methods for informing relevant phenotype identification in UK Biobank, allowing Dr Ruigrok's team to integrate data from UK Biobank to the larger genetic meta-analysis.
Collaborator Contribution Please see above.
Impact Publication: Bakker M, (18 authors), Rannikmäe K, (53 authors). Genome-wide association study of intracranial aneurysms reveals 17 risk loci, polygenic architecture, genetic overlap with clinical risk factors, and opportunities for prevention. Nature Genetics 2020;52(12):1303-1313.
Start Year 2020
 
Description McMaster University 
Organisation McMaster University
Country Canada 
Sector Academic/University 
PI Contribution We are collaborating with colleagues from McMaster University genetic and molecular epidemiology laboratory (PI: Guillaume Pare) on a project investigating the penetrance and variable expressivity of rare variants in monogenic stroke genes. Our group has undertaken a systematic review to identify all reported variants and their associated clinical phenotypes.
Collaborator Contribution Our collaborators are investigating the frequency of these variants in publicly available control databases to better understand their role in health and disease.
Impact Multi-disciplinary collaboration: clinical neurology, general medicine, genetic epidemiology and molecular genetics. Two publications: PubMed IDs 32106772 and 32842921.
Start Year 2018
 
Description University of Edinburgh, A. Tenesa team 
Organisation University of Edinburgh
Department The Roslin Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution We are collaborating on a project investigating how different coded phenotype definitions of stroke affect the genetic association results using UK Biobank data. Our team has experience in understanding the accuracy and nuances associated with different stroke definitions.
Collaborator Contribution Our collaborators (Albert Tenesa's team) have bioinformatics skills allowing genetic analyses of complex data.
Impact Multi-disciplinary: quantitative genetics, clinical phenomics
Start Year 2018
 
Description University of Munich 
Organisation Ludwig Maximilian University of Munich (LMU Munich)
Country Germany 
Sector Academic/University 
PI Contribution We have a joint project using the UK Biobank data to investigate genetic associations with stroke and its subtypes. Our research team has developed methods for identifying stroke and other disease outcomes from routinely collected electronic health data that we have used in this project.
Collaborator Contribution Our partners have analytical skills that have allowed them to process complex, large-scale genetic data for this project.
Impact Manuscript: pubmed ID 30383316; Manuscript in press: Malik et al. Midlife vascular risk factors and risk of incident dementia: longitudinal cohort and Mendelian randomization analyses in the UK Biobank, 2021. In press in Alzheimer's & Dementia: The Journal of the Alzheimer's Association.
Start Year 2018
 
Description Presentation at conference (ISGC) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact I gave an oral presentation at the 24th International Stroke Genetics Consortium meeting in Washington, USA, 08.11.2018 - 09.11.2018. I presented the results of a systematic review investigating the associations between phenotypes and genetic variants in genes thought to cause familial stroke, followed by further research plans as part of my award, to researchers across the world.
Year(s) Of Engagement Activity 2017,2018
 
Description Presentation at conference (PQG) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Poster presentation at the Program in Quantitative Genomics 12th annual conference "Biobanks: Study Design and Data Analysis" 01.11.2018 - 02.11.2018 at Harvard Medical School in Boston, MA, USA. I presented the results of our research into the accuracy of routinely collected coded disease diagnoses and further research plans as part of my award to around 100 researchers from different disciplines across the world.
Outcomes: Our poster attracted a lot of interest during the conference poster session and led to interesting and useful discussions with various researchers. A couple of researchers reported improved understanding of the nature and accuracy of routinely collected electronic health data and how it may influence research results using such data.
Year(s) Of Engagement Activity 2018
 
Description Presentation at seminar (CMI) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Other audiences
Results and Impact I gave an oral presentation at the University of Edinburgh, Usher Institute, Centre for Medical Informatics weekly seminar series on 15.10.2018. I presented the results of my research into the accuracy of routinely collected coded stroke diagnoses and further research plans as part of my award to around 30 researchers (all grades ranging from PhD students to professors) across different disciplines and from different institutions and departments, mainly from the University of Edinburgh. The aim of these seminars is to encourage and facilitate discussion, collaboration and learning across The Centre for Medical Informatics which encompasses people from many different backgrounds and disciplines.
Year(s) Of Engagement Activity 2018
 
Description Presentation at seminar (DCN) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact I gave an oral presentation at the Department of Clinical Neurosciences, Western General Hospital, academic afternoon on 28.02.2019. This is a weekly meeting of clinicians (predominantly consultant and trainee neurologists) from hospitals across the South East of Scotland. I presented the results of a systematic review investigating the associations between phenotypes and genetic variants in genes thought to cause familial stroke, followed by further research plans as part of my award. The purpose of the talk was to introduce my research to my clinical colleagues.
Year(s) Of Engagement Activity 2019
 
Description Presentation at seminar (McMaster) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I gave a talk at a collaborator's (Guillaume Pare) institution (Genetic and Molecular Epidemiology Laboratory, McMaster University, Hamilton, Canada) at their seminar about my research.
Year(s) Of Engagement Activity 2018
 
Description School Visit on STEM day 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Schools
Results and Impact I gave a presentation and an interactive question-and-answer session to S4 students at the Coatbridge High School STEM day on 19.02.2020.
Year(s) Of Engagement Activity 2020