Semantic-based secure infrastructure for interoperable, in-depth biomedical data and analytics

Lead Research Organisation: University of Birmingham
Department Name: Institute of Cancer and Genomic Sciences

Abstract

The use of electronic health care records (EHRs) is paramount in today's health care systems. The reasons are manifold and range from giving physicians remote access to data, to alerting consultants to critical laboratory values, to flagging harmful multi-drug interactions. These records are also employed to some extent to analyse the underlying diseases. However, EHRs are based upon historical models of clinical care and contain mostly unstructured and poorly defined information related to the patients' experience and clinicians' interpretation of their illnesses and diagnoses, as well as reports from clinical imaging. Moreover, the rapid advent of modern diagnostic medical imaging methods, ranging from MRI, PET, and CT scans to ultrasonography and ECGs, has resulted in rapidly growing volumes of unstructured and unlabelled data in the form of 2D and 3D images. Finally, whilst most clinical settings are progressing to completely paperless environments, thereby constantly increasing the amount of clinical information that is computationally accessible, most settings employ different and diverse electronic health care record systems, rendering the integration and interoperability of health care data extremely difficult and laborious.

What is required is a unifying interoperability and analysis framework that would act as an integrated data environment, allowing EHRs to be used for computer-based exploration using harmonized methods, i.e. common interfaces with the same standards and semantics. This would ensure that electronic health data can be accessed and interpreted across multiple disciplines in a consistent and readily usable fashion, making it easy to explore novel avenues in analytics. Such an approach would separate health care data from analytical tasks. The advantage would be that predictive models developed by one health care provider could easily be applied and validated by others. Moreover, and perhaps more importantly, such an approach will cater for the development of accurate, in-depth phenotypic characterisation of diseases at an individual patient level, which would facilitate the improvement of diagnosis, risk prediction, and patient management.

Furthermore, what is required to unlock the full potential of EHRs is the ability to extract information from unstructured data stored in the form of free text or images. Modern AI approaches for such data allow the semantic extraction of information to a very high degree and are to some extent capable of harvesting information at human expert level. The automatic extraction of information from such unstructured data would contribute greatly to the proposed harmonized environment. It would not only allow the rapid exploration of, i.e. the search for, patient records expressing specific symptoms, but would also enable future, more symbolic AI approaches for the analysis. This would close the loop from data collection, through predictive modelling, back to clinicians, who would be able to constantly validate predictive models in their daily work.

One aspect of using modern AI approaches is the large number of examples required for inducing models for text and images. Simply sharing EHRs is often not feasible due to legal and ethical constraints. What is needed is a novel approach to sharing information without sharing the actual data. One avenue might be anonymization-based techniques that sample examples from learned distributions of the data. Such data can subsequently be used by others with fewer constraints, strengthening the models and enabling the sharing of medical data on a broader, yet secure, scale.

The main goal of this fellowship is the development of such a semantic-based "information commons for research" framework that will form a fully integrated health research information platform, with access to both deep clinical data and extensive large and unstructured data on individual patients and populations.

Technical Summary

WP1: Harmonized data framework

WP1 will research the development of an integrated health-data framework based on syntactic and semantic interoperability. We will investigate how current data held in various resources (EHR, LIMS) can be represented for different content models based on existing ontologies and standards. We will then employ the semantically characterised data to develop and apply symbolic AI techniques that will facilitate the creation and manipulation of comprehensible conceptualisations of information, allowing us not only to make decisions, but to truly learn from data.

WP2: Computer-usable information extraction from text

Current electronic healthcare records commonly contain poorly defined information in textual form. Based on our proposed framework, we will develop methodologies using advanced AI techniques (Word2Vec, LSTMs) and apply them to characterise textual data and transform them into structured formats. Initially we intend to apply our approach to a few chosen domains and then scale it up.

WP3: Analysis of complex multi-dimensional images

Similarly, multi-dimensional data cannot easily be used for computer-based analytical systems. To harvest the potential information hidden in unstructured data, we will develop approaches to employ modern AI techniques (CNNs, GANs) for knowledge extraction. We plan to further combine this feature learning with defined vocabularies to directly extract meaningful annotations.

WP4: Anonymization approaches for distributed analysis

AI techniques rely on large datasets with millions of examples. However, this creates challenges with respect to sharing data whilst respecting the wishes of patients and meeting the GDPR requirements. To address this, we will develop anonymization approaches able to learn distributions from data. These distributions can be employed to sample novel, artificial data, which can be shared without disclosing any real data and be used to boost model induction.
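The idea of learning a distribution from data and then sharing only samples drawn from it can be illustrated with a deliberately simple sketch. It is not the method developed in this work package: it assumes purely numeric records, fits an independent normal distribution to each column (so correlations between columns are lost), and offers no formal privacy guarantee. The function names and the toy records used with it are hypothetical.

```python
import random
import statistics


def fit_marginals(records):
    """Fit one independent normal distribution per numeric column.

    Assumption for illustration only: columns are treated as independent,
    which a realistic generative model would not do.
    """
    columns = list(zip(*records))
    return [statistics.NormalDist.from_samples(col) for col in columns]


def sample_synthetic(marginals, n, seed=0):
    """Draw n artificial records from the fitted per-column distributions."""
    rng = random.Random(seed)
    return [
        tuple(rng.gauss(dist.mean, dist.stdev) for dist in marginals)
        for _ in range(n)
    ]


if __name__ == "__main__":
    # Made-up toy records: (age in years, systolic blood pressure in mmHg).
    real_records = [(54.0, 130.0), (61.0, 142.0), (47.0, 118.0), (70.0, 150.0)]
    marginals = fit_marginals(real_records)
    for row in sample_synthetic(marginals, n=3, seed=42):
        print(row)
```

Only the fitted parameters and the sampled rows would leave the data holder in this sketch; the original records stay local. A learned generative model, such as the GANs mentioned in WP3, would take the place of the per-column normal distributions.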

Publications

 
Description Defining and redefining disease using multimodal data on a national scale: the HDR UK Phenomics Resource
Amount £1,087,168 (GBP)
Funding ID CFC0111 
Organisation Health Data Research UK 
Sector Private
Country United Kingdom
Start 10/2019 
End 09/2022
 
Description HDR UK National Text Analytics Implementation Project
Amount £650,000 (GBP)
Funding ID CFC0108 
Organisation Health Data Research UK 
Sector Private
Country United Kingdom
Start 10/2019 
End 09/2022
 
Description Psychosis Immune Mechanism Stratified Medicine Study (PIMS)
Amount £1,400,000 (GBP)
Funding ID MRC MR/S037675/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 12/2019 
End 11/2024
 
Description RespiraTox: In silico model for predicting human respiratory irritation
Amount £99,996 (GBP)
Funding ID NC/C017S01/1 
Organisation National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC3Rs) 
Sector Public
Country United Kingdom
Start 03/2018 
End 02/2019
 
Title HiVAE 
Description Further development of an approach for embedding heterogeneous data using variational autoencoders, packaged as an easy-to-use Python library. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? No  
Impact Currently the library is used in a sub-phenotype clustering approach on a cardiac patient cohort from RCTs. The tool will be made public as soon as the work is published. 
URL https://github.com/gkoutos-group/hivae
 
Title RespiraTox 
Description The aim was to develop a QSAR-based tool that reliably predicts the human respiratory irritancy potential of chemicals. The tool fulfils the five OECD principles for QSAR validation. To address this Challenge, Dr Sylvia Escher (Fraunhofer ITEM, Germany) and Dr Andreas Karwath (University of Birmingham) have developed a QSAR model that is available as a web-based tool, which allows end-users to predict human respiratory irritation of chemical compounds by entering structural information. 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? Yes  
Impact The tool was published only in 2019 and has so far attracted over 30 active users from around the world, including academics and researchers from private companies. 
URL https://respiratox.item.fraunhofer.de/
 
Title Vec2SPARQL 
Description Vec2SPARQL allows jointly querying vector functions such as computing similarities (cosine, correlations) or classifications with machine learning models within a single SPARQL query. The framework can be applied for biomedical and clinical use cases. 
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? Yes  
Impact The framework will be employed in a modified form within the clinical research environment at the UHB and UoB. The direct impact within these organisations can only be measured once this integration is set up. 
URL https://github.com/bio-ontology-research-group/vec2sparql
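The kind of vector function mentioned in the description, a cosine similarity applied to entities returned by a structured query, can be sketched in plain Python. This is not Vec2SPARQL's API: the function names, the identifiers, and the tiny embedding table are hypothetical, and the candidate list stands in for whatever a SPARQL query would return.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, candidates, k=3):
    """Rank candidate entities by similarity to the query entity.

    `candidates` plays the role of the identifiers selected by the
    structured part of a query (an assumption of this sketch).
    """
    query_vec = embeddings[query_id]
    scored = [
        (cid, cosine_similarity(query_vec, embeddings[cid]))
        for cid in candidates
        if cid != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical two-dimensional embeddings keyed by made-up identifiers.
    embeddings = {
        "patient:1": [1.0, 0.0],
        "patient:2": [0.9, 0.1],
        "patient:3": [0.0, 1.0],
    }
    print(most_similar("patient:1", embeddings, ["patient:2", "patient:3"], k=2))
```

Filtering by structured criteria first and ranking by vector similarity second is the combination that the description refers to as jointly querying within a single SPARQL query.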
 
Description Cancer Precision Medicine 
Organisation King Abdullah University of Science and Technology (KAUST)
Department Biological and Environmental Science and Engineering Division
Country Saudi Arabia 
Sector Academic/University 
PI Contribution Participation in Workshop on Cancer Precision Medicine (at KAUST). Joint work on cancer variant interpretation through provision of data (colorectal samples) and joint analysis. A joint manuscript was published on bioRxiv; the journal version is about to be submitted.
Collaborator Contribution Joint work on cancer variant interpretation using novel vectorisation techniques.
Impact Ontology-based prediction of cancer driver genes, Sara Althubaiti, Andreas Karwath, Ashraf Dallol, Adeeb Noor, Shadi Salem Alkhayyat, Rolina Alwassia, Katsuhiko Mineta, Takashi Gojobori, Andrew D Beggs, Paul N Schofield, Georgios V Gkoutos, Robert Hoehndorf, 2019, https://www.biorxiv.org/content/10.1101/561480v1
Start Year 2018
 
Description Cancer Precision Medicine 
Organisation University of Cambridge
Country United Kingdom 
Sector Academic/University 
PI Contribution Participation in Workshop on Cancer Precision Medicine (at KAUST). Joint work on cancer variant interpretation through provision of data (colorectal samples) and joint analysis. A joint manuscript was published on bioRxiv; the journal version is about to be submitted.
Collaborator Contribution Joint work on cancer variant interpretation using novel vectorisation techniques.
Impact Ontology-based prediction of cancer driver genes, Sara Althubaiti, Andreas Karwath, Ashraf Dallol, Adeeb Noor, Shadi Salem Alkhayyat, Rolina Alwassia, Katsuhiko Mineta, Takashi Gojobori, Andrew D Beggs, Paul N Schofield, Georgios V Gkoutos, Robert Hoehndorf, 2019, https://www.biorxiv.org/content/10.1101/561480v1
Start Year 2018
 
Description Learning to Rank 
Organisation Johannes Gutenberg University of Mainz
Country Germany 
Sector Academic/University 
PI Contribution Collaboration with the University of Mainz on developing novel, neural network-based approaches for learning to rank data examples. The approach cross-compares two samples and learns a comparative function. My contribution is the provision of the initial idea and setting, as well as joint (see below) bi-weekly supervision of the team. Furthermore, I contributed essential parts of the latest conference paper publication.
Collaborator Contribution The team in Mainz was mainly concerned with developing the actual code and running the experimental analysis. Furthermore, Prof. S. Kramer and I shared the supervision of the team.
Impact Pairwise Learning to Rank by Neural Networks Revisited: Reconstruction, Theoretical Analysis and Practical Performance, M. Köppel, A. Segner, M. Wagener, L. Pensel, A. Karwath, and S. Kramer, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (2019) - https://ecmlpkdd2019.org/downloads/paper/400.pdf
Start Year 2018
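Learning a comparative function over pairs of samples can be illustrated with a minimal sketch. It uses a linear scorer as a stand-in and is not the DirectRanker architecture or the API of its library. The only property it is meant to show is antisymmetry: swapping the two inputs flips the sign of the output. The training pairs, learning rate, and function names are assumptions made for this example.

```python
import math


def compare(w, x1, x2):
    """Antisymmetric comparison: positive if x1 should rank above x2.

    tanh is an odd function and there is no bias term, so
    compare(w, a, b) == -compare(w, b, a).
    """
    z = sum(wi * (a - b) for wi, a, b in zip(w, x1, x2))
    return math.tanh(z)


def train(pairs, dim, lr=0.1, epochs=200):
    """Fit weights by gradient descent on the loss (1 - compare(better, worse))^2."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            t = compare(w, better, worse)
            # Negative gradient of (1 - t)^2 with respect to z is 2(1 - t)(1 - t^2).
            step = 2.0 * (1.0 - t) * (1.0 - t * t)
            w = [wi + lr * step * (a - b) for wi, a, b in zip(w, better, worse)]
    return w


if __name__ == "__main__":
    # Made-up pairs (better, worse): the first feature decides the order.
    pairs = [
        ((2.0, 0.0), (1.0, 0.0)),
        ((3.0, 1.0), (1.0, 1.0)),
        ((1.0, 0.5), (0.0, 0.5)),
    ]
    w = train(pairs, dim=2)
    print(compare(w, (5.0, 0.0), (4.0, 0.0)) > 0)
```

A ranking is then obtained by sorting items with the learned comparison; in this linear sketch that is equivalent to sorting by the score `w · x`.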
 
Description Oesophagectomy outcomes from ICU patients 
Organisation Queen Elizabeth Hospital Birmingham
Country United Kingdom 
Sector Hospitals 
PI Contribution Initial investigation of outcomes of oesophageal surgery for patients with a number of dependent and competing variables during the operation (drugs, administered fluids, operation duration) and patient survival.
Collaborator Contribution Provision of data and clinical and medical expertise from surgeon and ICU practitioners.
Impact None so far
Start Year 2020
 
Description RespiraTox: In silico model for predicting human respiratory irritation 
Organisation Fraunhofer Society
Department Fraunhofer Institute for Toxicology and Experimental Medicine
Country Germany 
Sector Academic/University 
PI Contribution Analysis of chemical datasets as well as the development of underlying tools and software, including web services ( https://respiratox.item.fraunhofer.de/index.php ).
Collaborator Contribution The Fraunhofer Institute for Toxicology and Experimental Medicine ITEM in Germany, provided the chemical datasets and scientific background in this development, as well as providing additional resources on the analytics side of the project.
Impact Online server: https://respiratox.item.fraunhofer.de/index.php
- Server software backend: https://github.com/athro/respiraTox-app
- Poster presentation at German Pharm-Tox Summit 2019 (Stuttgart, Germany): Development of a QSAR model for respiratory irritation. M. Wehr, A. Karwath, S. S. Sarang, M. Rooseboom, P. J. Boogaard, S. E. Escher (2019)
- Conference presentation at Society of Toxicology 58th Annual Meeting (Baltimore, USA): Development of a QSAR Model to Predict Respiratory Irritation by Individual Constituents. M. Wehr, A. Karwath, S. S. Sarang, M. Rooseboom, P. J. Boogaard, S. E. Escher (2019)
Start Year 2018
 
Description SPARQL2Vec 
Organisation King Abdullah University of Science and Technology (KAUST)
Department Biological and Environmental Science and Engineering Division
Country Saudi Arabia 
Sector Academic/University 
PI Contribution Joint work on a framework to enable simultaneous retrieval of structured and unstructured data, represented as data vectors, through SPARQL endpoints.
Collaborator Contribution Joint work on a framework to enable simultaneous retrieval of structured and unstructured data, represented as data vectors, through SPARQL endpoints.
Impact Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings Maxat Kulmanov, Senay Kafkas, Andreas Karwath, Alexander Malic, Georgios Gkoutos, Michel Dumontier, Robert Hoehndorf, 2018 https://www.biorxiv.org/content/10.1101/463778v1
Start Year 2018
 
Title Direct Ranker 
Description A Python implementation of the DirectRanker, a software library for learning to rank. See publications. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Currently none to report. 
URL https://github.com/kramerlab/direct-ranker
 
Title RespiraTox 
Description The aim was to develop a QSAR-based tool that reliably predicts the human respiratory irritancy potential of chemicals. The tool fulfils the five OECD principles for QSAR validation. To address this Challenge, Dr Sylvia Escher (Fraunhofer ITEM, Germany) and Dr Andreas Karwath (University of Birmingham) have developed a QSAR model that is available as a web-based tool, which allows end-users to predict human respiratory irritation of chemical compounds by entering structural information. 
Type Of Technology Webtool/Application 
Year Produced 2019 
Open Source License? Yes  
Impact The tool was published only in 2019 and has so far attracted over 30 active users from around the world, including academics and researchers from private companies. 
URL https://respiratox.item.fraunhofer.de/
 
Description Analysing eHR data 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other audiences
Results and Impact Health Data Research UK Summer School 2019
- half-day lecture and tutorial (https://github.com/athro/hdruk_summerschool_session_1_2)
Year(s) Of Engagement Activity 2019
URL https://www.hdruk.ac.uk/events/health-data-research-uk-summer-school/
 
Description Organising & Teaching for the immersion week for HDRUK-ATI PhD 2020/21 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Designing, organising and teaching for a small group of future health data experts. Within the HDRUK-ATI PhD programme, an immersion week in Birmingham with the general theme 'Patients and Real-world Data' was designed. The presenters were a number of local and regional experts.
Year(s) Of Engagement Activity 2021