Deriving an actionable patient phenome from healthcare data

Lead Research Organisation: University of Edinburgh

Department Name: Centre of Population Health Sciences

Abstract

Translating routinely collected health data into knowledge is a requirement of a "learning health system". Since joining the Biomedical Research Centre at the South London and Maudsley Hospital, King's College London, my research has been focused on developing 'CogStack and SemEHR'. This is an integrated health informatics platform which aims to unlock unstructured health records and assist in clinical decision making and research. The system does much to surface the deep data within the NHS, for example through providing a patient-centric search on semantically annotated clinical notes to support studies such as the recruitment of patients for Genomics England's 100,000 Genomes project [1,2] and predicting adverse drug reactions [3].

However, there is considerable further potential for the generation of knowledge and action, for example through the application of machine learning to the data from this platform. For instance, the data returned through these systems needs to be integrated, verified and cleaned with biomedical knowledge, enriched with an accurate clinical context (to enhance the current sentence-level language context) and aligned with the patient timeline to derive a comprehensive patient phenome. Clinical knowledge needs to be formalised from clinical ontologies and integrated with relevant open data, which will drive automated inferences to lift lower-level features (e.g. numeric blood pressure readings) up to higher-level clinical variables (e.g. hypertension) for supporting decision making.
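
The idea of lifting lower-level features to higher-level clinical variables can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class and function names are hypothetical, and the 140/90 mmHg threshold and the two-separate-days rule are placeholders rather than rules taken from the project or from clinical guidance.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class BloodPressureReading:
    taken_on: date
    systolic: int   # mmHg
    diastolic: int  # mmHg


# Illustrative thresholds only (assumption, not taken from the project).
SYSTOLIC_THRESHOLD = 140
DIASTOLIC_THRESHOLD = 90
MIN_ELEVATED_DAYS = 2


def is_elevated(reading: BloodPressureReading) -> bool:
    """A single reading counts as elevated if either value meets its threshold."""
    return (reading.systolic >= SYSTOLIC_THRESHOLD
            or reading.diastolic >= DIASTOLIC_THRESHOLD)


def infer_hypertension(readings: Iterable[BloodPressureReading]) -> bool:
    """Lift numeric readings to a higher-level flag.

    The flag is set when elevated readings occur on at least
    MIN_ELEVATED_DAYS distinct dates along the patient timeline.
    """
    elevated_days = {r.taken_on for r in readings if is_elevated(r)}
    return len(elevated_days) >= MIN_ELEVATED_DAYS


readings = [
    BloodPressureReading(date(2018, 1, 5), 152, 95),
    BloodPressureReading(date(2018, 1, 5), 149, 92),
    BloodPressureReading(date(2018, 2, 9), 118, 76),
    BloodPressureReading(date(2018, 3, 1), 145, 88),
]
print(infer_hypertension(readings))
```

In a real system the rule would presumably be derived from a clinical ontology rather than hard-coded; the sketch only shows the shape of the inference.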

A pilot study of the comprehensive phenome model, SemEHR's medical profiles [2], evaluated on publicly accessible data from the Medical Information Mart for Intensive Care (MIMIC), has shown that better contextual information can lead to much better accuracy in making clinical conclusions - e.g. using patient medical history for subtyping atrial fibrillation, where we demonstrated that such phenome data is within the top 10 key features in identifying clinically sensible patient clusters. For 'action' generation in clinical settings, we have demonstrated the feasibility of alerts through a number of simple examples using CogStack. For example, at King's College Hospital, we have detected abnormal pathology results for 25 patients being prescribed methotrexate for rheumatoid arthritis, preventing potentially fatal renal failure.

The proposed research will devise a semantic electronic health record toolkit that is able to derive a consistent and comprehensive patient phenome from unstructured and structured electronic health records and provide semantic computation upon it to support decision making for tailored care, trial recruitment and research.

References:
1. Wu H, et al. SemEHR: surfacing semantic data from clinical notes in electronic health records for tailored care, trial recruitment, and clinical research. Lancet. 2017;390: S97.
2. Wu H, et al. A General-purpose Semantic Search System to Surface Semantic Data from Clinical Notes for Tailored Care, Trial Recruitment and Clinical Research. Journal of the American Medical Informatics Association. 2017; doi: https://doi.org/10.1101/235622.
3. Bean DM, Wu H, et al. Knowledge graph prediction of unknown adverse drug reactions and validation in electronic health records. Sci Rep. 2017;7: 16416.

Technical Summary

For objective 1, at the data layer, my research will focus on a semantic phenome model that is able to detect/correct erroneous and inconsistent phenotypes, associate accurate contextual and temporal information with each phenotype mention and also support rule-based reasoning to complete missing data. For objective 2, I will be devising and applying artificial intelligence models to derive unknown clinical knowledge from large-scale, longitudinal and interlinked phenome data. Potential use cases include predicting outcomes of septic shock treatments within intensive care units; predicting unknown adverse drug reactions in depression patients with comorbidities; and subtyping atrial fibrillation to deliver tailored care. For objective 3, my research will provide actionable suggestions in clinical settings with applications of clinical trial recruitment and automated alerting for ensuring patient safety. Key challenges to be tackled here include how to make action suggestions explainable and reliable.

This project aims to deliver enabling technologies for The University of Edinburgh's HDR UK focus areas, including deriving and applying health-related phenotypes at scale, and computational tools for genetic and environmental risk prediction and causal inference. It will develop national leadership, partnerships, and interdisciplinary skills and capacity through the development of semantic computation infrastructure on top of deep and accurate patient phenome data, which, if successful, can be disseminated to a wide range of healthcare service providers nationally/internationally and achieve high impact in research and patient care.

Funded Value:

£315,181

Funded Period:

Feb 18 - Apr 20

Funder:

MRC

Project Status:

Closed

Project Category:

Fellowship

Project Reference:

MR/S004149/1

Principal Investigator:

Honghan Wu

Health Category:

Unclassified

Organisations

People	ORCID iD
Honghan Wu (Principal Investigator / Fellow)	http://orcid.org/0000-0002-0213-5668

Publications

Banerjee A (2021) Excess deaths in people with cardiovascular diseases during the COVID-19 pandemic. in European journal of preventive cardiology

Banerjee A (2020) Excess deaths in people with cardiovascular diseases during the COVID-19 pandemic

Bean D (2019) Semantic computational analysis of anticoagulation use in atrial fibrillation from real world data

Bean DM (2019) Semantic computational analysis of anticoagulation use in atrial fibrillation from real world data. in PloS one

Carr E (2021) Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study. in BMC medicine

Casey A (2021) A Systematic Review of Natural Language Processing Applied to Radiology Reports

Casey A (2021) A systematic review of natural language processing applied to radiology reports. in BMC medical informatics and decision making

Casey A (2021) Additional file 1 of A systematic review of natural language processing applied to radiology reports

Cheung JPY (2022) Learning-based fully automated prediction of lumbar disc degeneration progression with specified clinical parameters and preliminary validation. in European spine journal : official publication of the European Spine Society, the European Spinal Deformity Society, and the European Section of the Cervical Spine Research Society

Davidson EM (2021) The reporting quality of natural language processing studies: systematic review of studies of radiology reports. in BMC medical imaging

Related Projects

Project Reference	Relationship	Related To	Start	End	Award Value
MR/S004149/1			14/02/2018	29/04/2020	£315,182
MR/S004149/2	Transfer	MR/S004149/1	31/07/2020	27/09/2022	£123,239

Artistic and Creative Products
Policy Influence
Further Funding
Research Databases and Models
Collaboration
Software and Technical Products
Engagement Activities


Title	Additional file 5 of Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study
Description	Additional file 5: Figure S1. Calibration (logistic and LOESS curves) of supplemented NEWS2 model for 3-day ICU/death model at validation sites.
Type Of Art	Film/Video/Animation
Year Produced	2021
URL	https://springernature.figshare.com/articles/figure/Additional_file_5_of_Evaluation_and_improvement_...


Title	Additional file 8 of Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study
Description	Additional file 8: Figure S2. Net benefit of supplemented NEWS2 model for 3-day ICU/death compared to default strategies ('treat all' and 'treat none') at training and validation sites.
Type Of Art	Film/Video/Animation
Year Produced	2021
URL	https://springernature.figshare.com/articles/figure/Additional_file_8_of_Evaluation_and_improvement_...


Description	Findings from international COVID-19 collaborations informed SAGE during the pandemic
Geographic Reach	National
Policy Influence Type	Implementation circular/rapid advice/letter to e.g. Ministry of Health
Impact	We developed a novel artificial intelligence method (ensemble learning) that combines seven multinational prediction models into a single robust, high-performing prediction model. This is the first work to use ensemble learning for risk prediction of COVID-19, and the validation cohorts are among the most diverse international COVID-19 datasets (4 cohorts with mortality rates of 2.4-45%). The ensemble model consistently outperformed every single model in all aspects validated and can be used in clinical practice to inform COVID-19 triage, treatment and resource allocation.
URL	https://www.hdruk.ac.uk/wp-content/uploads/2020/09/200915-Health-Data-Research-UK-COVID-19-fortnight...


Description	Invited talk at 1st International Symposium on Evidence-based Artificial Intelligence and Medicine (ISEAIM)
Geographic Reach	Multiple continents/international
Policy Influence Type	Influenced training of practitioners or researchers
Impact	My talk was titled "Derive insights from health data using knowledge graph technologies". I started with a brief introduction to what a knowledge graph is. Then, I used real-world examples to show how knowledge graph technologies could help clinical natural language processing. I concluded the talk with some of my own thoughts on the challenges and future directions of knowledge graphs for health care.


Description	Artificial Intelligence and Multimorbidity: Clustering in Individuals, Space and Clinical Context (AIM-CISC)
Amount	£3,919,510 (GBP)
Funding ID	NIHR202639
Organisation	National Institute for Health and Care Research
Sector	Public
Country	United Kingdom
Start	07/2021
End	08/2024


Description	Building a database of the immunohistochemical profiles of tumours from histopathology reports at scale using large language models and machine learning
Amount	£59,907 (GBP)
Funding ID	PGS23 100040
Organisation	Rosetrees Trust
Sector	Charity/Non Profit
Country	United Kingdom
Start	09/2023
End	10/2025


Description	Facilitating Better Urology Care With Effective And Fair Use Of Artificial Intelligence - A Partnership Between UCL And Shanghai Jiao Tong University School Of Medicine
Amount	£39,968 (GBP)
Organisation	British Council
Sector	Charity/Non Profit
Country	United Kingdom
Start	03/2024
End	02/2026


Description	ISCF HDRUK DIH Sprint Exemplar: Graph-Based Data Federation for Healthcare Data Science
Amount	£260,057 (GBP)
Funding ID	MC_PC_18029
Organisation	Medical Research Council (MRC)
Sector	Public
Country	United Kingdom
Start	03/2019
End	11/2020


Description	Improving the quality and value of care for people with poor prognosis cancers - a national, mixed methods study across Scotland
Amount	£399,224 (GBP)
Organisation	Health Foundation
Sector	Charity/Non Profit
Country	United Kingdom
Start	03/2020
End	08/2023


Description	Iris.AI - The AI Chemist
Amount	£39,000 (GBP)
Organisation	Research Council of Norway
Sector	Public
Country	Norway
Start	07/2021
End	01/2022


Description	QMIA: Quantifying and Mitigating Bias affecting and induced by AI in Medicine
Amount	£649,218 (GBP)
Organisation	Medical Research Council (MRC)
Sector	Public
Country	United Kingdom
Start	09/2023
End	03/2026


Description	The Advanced Care Research Centre Programme
Amount	£20,000,000 (GBP)
Organisation	Legal and General Group
Sector	Private
Country	United Kingdom
Start	03/2020
End	04/2026


Description	Towards an AI-driven Health Informatics Platform for supporting clinical decision making in Scotland - a pilot study in NHS Lothian
Amount	£29,200 (GBP)
Organisation	Wellcome Trust
Sector	Charity/Non Profit
Country	United Kingdom
Start	01/2020
End	02/2021


Description	UCL-NMU-SEU International Collaboration On Artificial Intelligence In Medicine: Tackling Challenges Of Low Generalisability And Health Inequality
Amount	£29,400 (GBP)
Organisation	British Council
Sector	Charity/Non Profit
Country	United Kingdom
Start	02/2022
End	02/2024


Description	Using rare disease phenotype models to identify people at risk of COVID-19 adverse outcomes
Amount	£38,065 (GBP)
Organisation	National Institute for Health and Care Research
Sector	Public
Country	United Kingdom
Start	01/2023
End	03/2023


Title	Additional file 1 of Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study
Description	Additional file 1: Table S1. SNOMED terms.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_1_of_Evaluation_and_improvement...


Title	Additional file 1 of Increased COVID-19 mortality rate in rare disease patients: a retrospective cohort study in participants of the Genomics England 100,000 Genomes project
Description	Additional file 1: Table S2. Lists of ICD-10 codes for comorbidities associated to COVID-19
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_1_of_Increased_COVID-19_mortali...


Title	Additional file 2 of A systematic review of natural language processing applied to radiology reports
Description	Additional file 2. Individual properties for every publication.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_2_of_A_systematic_review_of_nat...


Title	Additional file 2 of Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study
Description	Additional file 2: Table S2. F1, precision and recall for NLP comorbidity detection.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_2_of_Evaluation_and_improvement...


Title	Additional file 2 of Increased COVID-19 mortality rate in rare disease patients: a retrospective cohort study in participants of the Genomics England 100,000 Genomes project
Description	Additional file 2: Table S1. Univariable and multivariable ORs for association between rare disease groups/specific diseases and COVID-19
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_2_of_Increased_COVID-19_mortali...


Title	Additional file 3 of Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study
Description	Additional file 3: Table S3. Logistic regression models for each blood and physiological measure tested separately in the KCH training cohort, for 14- and 3-day ICU/death.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_3_of_Evaluation_and_improvement...


Title	Additional file 4 of Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study
Description	Additional file 4: Table S4. Internally validated discrimination for KCH training sample based on nested repeated cross-validation.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_4_of_Evaluation_and_improvement...


Title	Additional file 6 of Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study
Description	Additional file 6: Table S5. Univariate logistic regression models for sensitivity analyses showing odds ratios of ICU/death at 3- and 14-days for subsets of the training cohort.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_6_of_Evaluation_and_improvement...


Title	Additional file 7 of Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study
Description	Additional file 7: Table S6. Discrimination for all models in training and validation cohorts, including alternative baseline model of 'NEWS2 only'.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_7_of_Evaluation_and_improvement...


Description	Use natural language processing for surfacing stroke phenotypes from Scottish radiology reports: a comparison of different methodologies
Organisation	University of Edinburgh
Department	School of Informatics Edinburgh
Country	United Kingdom
Sector	Academic/University
PI Contribution	Investigate NLP model adaptation by reusing models trained on EHRs of London NHS trusts on Scottish radiology reports.
Collaborator Contribution	Collaborators from the Centre for Clinical Brain Sciences, University of Edinburgh, provided ESS Stroke study data and Tayside radiology reports, and manually labelled the data. Collaborators from the Informatics Department provided computational resources for accessing the data, as well as their results on the same task using rule-based NLP and a neural network method.
Impact	Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches. Philip John Gorinski, Honghan Wu, Claire Grover, Richard Tobin, Conn Talbot, Heather Whalley, Cathie Sudlow, William Whiteley, Beatrice Alex. Accepted by HealTAC 2019. This is a multi-disciplinary study involving neurology and computing science.
Start Year	2018


Title	Ensemble Learning for COVID-19 Risk Prediction
Description	- implemented 7 prognosis risk prediction models for COVID-19. Detailed info in this paper: DOI:10.1093/jamia/ocaa295 - introduced a competence quantification framework for assessing the competence/confidence of a model in predicting a given data entry (i.e. a digital representation of a COVID-19 patient) - ensembled 7 prediction models for prediction using fusion strategies based on their competences - evaluated single models and the ensembled model on two large COVID-19 cohorts from Wuhan, China (N=2,384) and King's College Hospital (N=1,475)
Type Of Technology	Software
Year Produced	2020
Open Source License?	Yes
Impact	- The ensemble model performed best on all aspects evaluated (PPV/Sensitivity/Calibration/Discrimination) - Findings from this study informed SAGE during the COVID-19 pandemic
URL	https://github.com/Honghan/EnsemblePrediction
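
Competence-based fusion, as described in this record, can be sketched roughly as follows. This is not the code from the repository above: the competence measure (the fraction of a model's input features that fall inside an assumed derivation-cohort range), the toy models and all names are hypothetical stand-ins chosen to keep the example small.

```python
from typing import Callable, Dict, List, Tuple

Patient = Dict[str, float]
# Each model returns (predicted risk in [0, 1], competence in [0, 1]).
Model = Callable[[Patient], Tuple[float, float]]


def in_range_competence(patient: Patient,
                        ranges: Dict[str, Tuple[float, float]]) -> float:
    """Hypothetical competence score: the share of a model's features
    that are present and lie within the range seen in its derivation cohort."""
    hits = sum(
        1 for name, (low, high) in ranges.items()
        if name in patient and low <= patient[name] <= high
    )
    return hits / len(ranges)


def fuse(patient: Patient, models: List[Model]) -> float:
    """Competence-weighted average of the single-model risk predictions."""
    outputs = [model(patient) for model in models]
    total_competence = sum(competence for _, competence in outputs)
    if total_competence == 0:
        # No model is competent for this patient: fall back to a plain mean.
        return sum(risk for risk, _ in outputs) / len(outputs)
    return sum(risk * competence for risk, competence in outputs) / total_competence


# Two toy models with fixed risk outputs (assumptions for illustration).
def model_a(patient: Patient) -> Tuple[float, float]:
    return 0.2, in_range_competence(patient, {"age": (18, 60)})


def model_b(patient: Patient) -> Tuple[float, float]:
    return 0.6, in_range_competence(patient, {"age": (50, 95), "crp": (0, 300)})


older_patient = {"age": 72.0, "crp": 110.0}
print(fuse(older_patient, [model_a, model_b]))
```

For the older patient, model_a has zero competence (age outside its assumed range), so the fused prediction comes entirely from model_b. The paper referenced in the description is the place to look for the actual competence framework and fusion strategies.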


Title	Knowledge Graph based Phenotyping on Heterogenous Data Sources
Description	Extracting patient phenotypes from routinely collected health data (such as Electronic Health Records) requires translating clinically-sound phenotype definitions into queries/computations executable on the underlying data sources by clinical researchers. This requires significant knowledge and skills to deal with heterogeneous and often imperfect data. Translations are time-consuming, error-prone and, most importantly, hard to share and reproduce across different settings. This software implements a knowledge-driven phenotyping framework that decouples the specification of phenotype semantics from underlying data sources and can automatically populate and conduct phenotype computations on heterogeneous data spaces.
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	This software is used to federate 5 health datasets across Scotland for asking important clinical questions. It helps the initiation of a national integrated laboratory dataset across Scotland.
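
The decoupling described in this record can be pictured with a small sketch: a phenotype is defined once against shared concept identifiers, and each data source contributes only a mapping from its local codes to those concepts. The class names, concept identifiers and local codes below are all invented for illustration and do not reflect the software's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class PhenotypeDefinition:
    """Source-independent phenotype semantics: a named set of shared concepts."""
    name: str
    concepts: FrozenSet[str]


@dataclass
class DataSource:
    """One heterogeneous source with its own local coding scheme."""
    name: str
    code_to_concept: Dict[str, str]      # local code -> shared concept id
    records: List[Tuple[str, str]]       # (patient_id, local_code)

    def patients_with(self, concepts: FrozenSet[str]) -> Set[str]:
        # Unmapped local codes resolve to None and are therefore ignored.
        return {
            patient_id
            for patient_id, local_code in self.records
            if self.code_to_concept.get(local_code) in concepts
        }


def compute_phenotype(definition: PhenotypeDefinition,
                      sources: List[DataSource]) -> Dict[str, Set[str]]:
    """Run the same definition over every source without rewriting it."""
    return {s.name: s.patients_with(definition.concepts) for s in sources}


# Hypothetical concept ids and local codes.
diabetes = PhenotypeDefinition("type 2 diabetes", frozenset({"concept:t2dm"}))
hospital = DataSource(
    "hospital_a",
    {"A-001": "concept:t2dm", "A-002": "concept:asthma"},
    [("p1", "A-001"), ("p2", "A-002")],
)
lab = DataSource(
    "lab_b",
    {"dm2": "concept:t2dm"},
    [("p3", "dm2"), ("p4", "unmapped")],
)
result = compute_phenotype(diabetes, [hospital, lab])
print(result)
```

Adding a new source in this sketch means supplying one more mapping; the phenotype definition itself stays unchanged, which is the property the description emphasises.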


Title	nlp2phenome: using AI models to infer patient phenotypes from identified named entities (instances of biomedical concepts)
Description	Using natural language processing (NLP) to identify mentions of biomedical concepts from free text medical records is just the first step. There is often a gap between NLP results and what the clinical study is after. For example, a radiology report may not contain the term "ischemic stroke". Instead, it reports that the patient had blocked arteries and stroke. To infer the "unspoken" ischemic stroke, a mechanism is needed to make such inferences from NLP-identifiable mentions of blocked arteries and stroke. nlp2phenome is designed to perform this extra step from NLP to patient phenome.
Type Of Technology	Software
Year Produced	2018
Open Source License?	Yes
Impact	nlp2phenome was developed for a stroke subtyping study using NLP on radiology reports at the University of Edinburgh. It builds on SemEHR results. It identified 2,922 mentions of 32 types of phenotypes from 266 radiology reports and achieved an average F1: 0.929; Precision: 0.925; Recall: 0.939.
URL	https://github.com/CogStack/nlp2phenome
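
The step from identified mentions to an inferred phenotype can be sketched with the ischemic stroke example from the description. nlp2phenome is described as using AI models for this step; the hand-written rule table below is a deliberately simplified stand-in, and the names, the concept labels and the negation flag are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Mention:
    """One concept mention produced by an upstream NLP step (hypothetical shape)."""
    concept: str
    negated: bool = False


# Phenotype -> concepts that must all be affirmed in the same report.
# A hand-written rule used here in place of a learned model.
RULES: Dict[str, FrozenSet[str]] = {
    "ischemic stroke": frozenset({"stroke", "blocked artery"}),
}


def infer_phenotypes(mentions: Iterable[Mention]) -> Set[str]:
    """Infer phenotypes that are never stated literally in the text."""
    affirmed = {m.concept for m in mentions if not m.negated}
    return {
        phenotype
        for phenotype, required in RULES.items()
        if required <= affirmed
    }


report_mentions = [Mention("blocked artery"), Mention("stroke")]
print(infer_phenotypes(report_mentions))
```

A negated mention (for example "no evidence of stroke") is excluded before the rule is checked, so the same report with a negated stroke mention yields no inferred phenotype.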


Description	Towards an AI-driven Health Informatics Platform for supporting clinical decision making in Scotland - a pilot study in NHS Lothian
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Professional Practitioners
Results and Impact	This is a pilot study with NHS Lothian (Edinburgh), supported by a Wellcome Trust iTPA award. The long-term objective is to enhance Electronic Health Records (EHRs) across NHS Lothian Health Board using artificial intelligence (AI) driven data science infrastructure to benefit patients and health service provision. This project will serve as a pilot study for the larger Data Loch City Deal collaboration, which aims to use all of our health and social care data assets to drive research and innovation, improve patient care and reduce health inequalities for all patients. This particular pilot project will develop two exemplar use cases in NHS Lothian: (a) improving the management of hypoglycaemia; and (b) decision support in prescribing anticoagulants to patients with Atrial Fibrillation. These pilot studies will (1) initiate collaborations with the NHS Lothian eHealth team; (2) understand the data landscape (data formats, storage, data schema, access control restrictions); (3) investigate integration approaches with TrakCare - the EHR information system.
Year(s) Of Engagement Activity	2019,2020