Predictive analytics of integrated genomic and clinical data using machine learning and complex statistical approaches

Lead Research Organisation: Wellcome Sanger Institute
Department Name: Human Genetics

Abstract

Large-scale electronic health record (EHR) data have the potential to transform our understanding of disease aetiology and clinical risk, providing important information for clinical decision making and health policy at population scale. Aligning multiple sources of data, including genetic, epigenetic and clinical data, can substantially improve our understanding of disease risk. Such big data can provide unique opportunities for the development of algorithms for precision medicine as well as for novel discovery of candidate genes associated with disease, providing a framework for integrating genetic discovery into clinical applications in medicine. While large-scale EHR resources have been utilised for risk prediction and identification of genetic associations, analyses of these data have been limited and have not harnessed their multi-dimensional and longitudinal richness. The recent development of large-scale biodata resources within the UK necessitates the parallel development of flexible analytic methods that can realise the full potential of these rich datasets.

Machine learning methods provide a framework whereby the relationships among different variables and types of data are learnt from the data itself by identifying and utilising features that best predict outcomes. Such approaches may provide an advantage over classical statistical methods, where the relationships among variables require pre-specification, and only a limited number of factors can be modelled at a time. By contrast, machine learning methods not only allow model-free prediction of clinical risk, but also help better understand which factors, among large numbers of potential predictors, influence clinical risk.

This proposal focuses on the development and validation of statistical and machine learning approaches that utilise complex data to flexibly predict the risk of patient outcomes across multiple disease areas. Application of such methods also allows a better understanding of clinical and genetic risk factors associated with disease. This work will specifically focus on systematically evaluating and extending current statistical methods for risk prediction to incorporate machine learning approaches that can flexibly model complex patterns of risk using large-scale genetic and clinical data, while appropriately accounting for the time-dependent context of risk factors (e.g. repeated measurements).
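As an illustration of what accounting for repeated measurements can involve, the sketch below reshapes one individual's measurements into (start, stop] intervals, a layout commonly used as input to hazard models with time-varying covariates. The function name, the field names and the (time, value) input layout are assumptions made for this example; the proposal does not specify a data format.

```python
from typing import Dict, List, Tuple


def to_counting_process(
    measurements: List[Tuple[float, float]],
    end_time: float,
    event: bool,
) -> List[Dict[str, float]]:
    """Split one individual's follow-up into (start, stop] intervals.

    measurements: (time, value) pairs for a single covariate (hypothetical layout).
    end_time: time of the outcome or of censoring.
    event: True if the outcome occurred at end_time.
    """
    # Keep only measurements taken before follow-up ended, in time order.
    obs = sorted((t, v) for t, v in measurements if t < end_time)
    rows: List[Dict[str, float]] = []
    for i, (start, value) in enumerate(obs):
        # Each value applies until the next measurement, or until end of follow-up.
        stop = obs[i + 1][0] if i + 1 < len(obs) else end_time
        if stop <= start:
            continue  # skip the earlier of two measurements sharing a time stamp
        rows.append({
            "start": start,
            "stop": stop,
            "value": value,
            # The outcome is attached only to the final interval.
            "event": int(event and stop == end_time),
        })
    return rows


# Hypothetical example: three HbA1c-like readings, outcome at t = 4.2 years.
rows = to_counting_process([(0.0, 7.1), (1.5, 7.9), (3.0, 8.4)], end_time=4.2, event=True)
for row in rows:
    print(row)
```

Each row carries the covariate value in force during its interval, so a model fitted on this layout uses the most recent measurement at every point in follow-up rather than a single baseline value.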

The first phase of the project will involve integration, curation and harmonisation of publicly available biodata resources, such as UK Biobank, Genomics England and the INTERVAL study. Further data, including gene expression and functional data, will be layered on to develop a rich multi-dimensional dataset. When not directly measured, these data can be reasonably predicted from sequence data using published imputation and deep learning approaches. Building on previous work with EHRs, complex statistical approaches will be used to develop predictive algorithms for specific disease areas. The next stage will involve evaluation of existing machine and deep learning approaches that can model risk. These approaches will then be extended to model longitudinal data incorporating repeated measurements over time. The predictive accuracy of these approaches will be evaluated using independent datasets. To understand the genetic aetiology of disease, classical GWAS approaches will be compared with approaches that integrate machine learning to allow prioritisation of the most important genetic and clinical predictors of risk.
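The proposal does not name the metric used to assess predictive accuracy. As one common choice for time-to-event outcomes, the sketch below computes a concordance index (the proportion of comparable pairs in which the individual who had the event earlier was given the higher risk score) on a held-out dataset. The function name and the toy values are assumptions made for illustration.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[int],
    risk_scores: Sequence[float],
) -> float:
    """Concordance between predicted risk and observed event ordering.

    times: follow-up time for each individual.
    events: 1 if the outcome was observed, 0 if censored.
    risk_scores: model output, where higher means higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on an individual with an observed event
        for j in range(n):
            # j is comparable if still under follow-up after i's event time.
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in this dataset")
    return concordant / comparable


# Hypothetical validation set of four individuals (third one censored).
c = concordance_index(
    times=[2.0, 4.0, 6.0, 8.0],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.7, 0.8, 0.1],
)
print(round(c, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. This pairwise loop is quadratic in the number of individuals, so a dataset of the scale described here would need a more efficient implementation.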

This project will provide a broad analytic framework for clinical risk prediction and identification of genetic associations with disease in the context of big data analytics. In the longer term, this will contribute to a programme of developing research capacity and expertise in high-throughput analytics of multi-dimensional data, with the aim of supporting clinical decision making and improving patient health.

Technical Summary

Although the utility of complex statistical, machine learning (ML) and deep learning (DL) approaches in the context of multi-dimensional data has been clearly demonstrated, these methods have not been widely utilised to improve novel drug discovery and clinical risk prediction. This proposal aims to harness the potential of large-scale integrated genetic and health data to spur innovation and to develop predictive algorithms that improve clinical decision making and patient health. Specifically, it will focus on the development and evaluation of ML and DL frameworks for GWAS and clinical risk prediction using publicly available large-scale EHR and genomics biodata resources, including the UK Biobank, Genomics England and INTERVAL studies. Transcriptomic and functional data will be integrated into these resources using predictive approaches where they have not been directly measured.

This will be implemented in three stages: 1) assessment of complex time-dependent statistical approaches for modelling of hazard; 2) optimisation and assessment of existing ML and DL approaches for modelling of clinical risk; 3) development of novel approaches, specifically using recurrent neural networks (RNNs) to incorporate temporality and missingness in clinical data, including time-varying covariates, to accurately model complex hazard functions. The objective of this project is to develop approaches that appropriately leverage the rich longitudinal and time-dependent data on individuals, which we and others have shown to substantially improve clinical risk prediction.
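To make "temporality and missingness" concrete, the sketch below shows one common way of preparing an irregularly sampled clinical series as input to a recurrent model: each visit is encoded as a value, an observation mask, and the time elapsed since the variable was last observed. This is an illustrative input encoding under assumed names and a simple carry-forward rule, not the architecture developed in this project.

```python
from typing import List, Optional, Tuple


def encode_series(
    times: List[float],
    values: List[Optional[float]],
    fill: float = 0.0,
) -> List[Tuple[float, float, float]]:
    """Encode one variable across visits as (value, mask, delta) triples.

    value: the observed value, or the last observed value carried forward
           (`fill` if nothing has been observed yet).
    mask:  1.0 if the variable was measured at this visit, else 0.0.
    delta: time since the variable was last observed; before any observation
           it is measured from the first visit.
    """
    encoded: List[Tuple[float, float, float]] = []
    last_value = fill
    last_seen: Optional[float] = None
    for t, v in zip(times, values):
        if last_seen is None:
            last_seen = t  # start the clock at the first visit
        delta = t - last_seen
        if v is None:
            # Missing at this visit: carry the previous value, flag as unobserved.
            encoded.append((last_value, 0.0, delta))
        else:
            encoded.append((v, 1.0, delta))
            last_value = v
            last_seen = t
    return encoded


# Hypothetical series: measured at the first and last of four visits.
steps = encode_series([0.0, 1.0, 3.0, 6.0], [5.0, None, None, 7.0])
for step in steps:
    print(step)
```

Passing the mask and elapsed time alongside the value lets a recurrent model distinguish a fresh measurement from a stale carried-forward one, instead of treating imputed values as if they had been observed.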

In addition to risk prediction, this proposal will focus on improving our understanding of the genetic aetiology of disease. Alongside standard GWAS approaches, hybrid ML and GWAS approaches will be applied to prioritise candidate genes and genetic variants associated with disease, potentially improving the power to identify novel associations, with important implications for the prioritisation of therapeutic targets.
 
Description Synthetic RNA modulators of gene function: a pilot study
Amount £30,575 (GBP)
Funding ID I3240 
Organisation The Wellcome Trust Sanger Institute 
Sector Charity/Non Profit
Country United Kingdom
Start 06/2018 
End 09/2018
 
Description Unravelling sepsis heterogeneity through RNA sequencing
Amount £74,238 (GBP)
Funding ID I3253 
Organisation The Wellcome Trust Sanger Institute 
Sector Charity/Non Profit
Country United Kingdom
Start 07/2018 
End 09/2019
 
Title Predictive algorithm for severe proliferative diabetic retinopathy or cystic macular edema 
Description During the past year, I have developed a complex statistical algorithm that can classify individuals with diabetic retinopathy into strata of risk for sight-threatening diabetic retinopathy, and suggest more aggressive management and screening for those individuals at high risk. I am in the process of refining this algorithm further, and validating this in independent datasets. 
Type Of Material Model of mechanisms or symptoms - human 
Year Produced 2018 
Provided To Others? No  
Impact At present this tool has not been published, but if this is validated, and found to be generalisable across the UK, this may be used in clinical practice to stratify those at high risk for sight-related complications of diabetes, allowing better personalised management. 
 
Title Primary care data on diabetes related risk factors and outcomes on 3 million individuals 
Description The research database described here has been developed in collaboration with ResearchOne and Dr. Manjinder Sandhu. This represents a curated database of ~3M individuals with longitudinal data on lifestyle risk factors, demographic variables, laboratory assays, follow up, clinic appointments and diabetes related outcomes. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? No  
Impact This research database allows a better understanding of the epidemiology of diabetes and diabetes complications within the UK, including a better understanding of risk factors, and development of predictive algorithms to stratify individuals by risk of developing disease. This could potentially inform precision medicine initiatives within the UK, and allow personalised treatments to individuals at greater risk of these outcomes. 
 
Description Prediction of complications of diabetes using primary care electronic health records 
Organisation Moorfields Eye Hospital
Country United Kingdom 
Sector Hospitals 
PI Contribution This is a collaboration between myself, Dr. Manjinder Sandhu (Department of Medicine, University of Cambridge), Prof. Mihaela van der Schaar (Department of Applied Mathematics and Theoretical Physics, University of Cambridge), Prof. Sobha Sivaprasad (Moorfields Eye Hospital and University College London), Prof. John Robson (the Clinical Effectiveness Group, Queen Mary University of London) and ResearchOne (an electronic health record provider) to develop prediction scores for diabetes and complications of diabetes among diabetics. These data include two primary care datasets: anonymised data on ~3 million nationally representative individuals with a median follow up of 14 years from primary care records, and anonymised EHR data from the London region with follow up for 5-10 years. In the context of this collaboration, we plan to develop methods to optimise prediction of diabetes related outcomes among individuals, with the objective of developing and publishing prediction scores for diabetes related outcomes from these rich longitudinal data. The ResearchOne dataset will be used for algorithmic development, with the CEG data being used for validation.
Collaborator Contribution In the context of this project, my collaborators have extracted relevant data on metabolic profiles, lifestyle, outcomes, clinic visits, follow up, medication, and diagnostic data from 3 million individuals across the UK who have not opted out of anonymised data being used for research purposes. They have also provided data dictionaries to facilitate harmonisation and curation of data. This is a rich longitudinal dataset spanning a median of 14 years that provides the opportunity to understand predictors of diabetes risk and complications, and to use statistical and machine learning approaches to develop predictive algorithms for disease and complications. Prof. Sivaprasad's team has provided important clinical input into the development of these algorithms. Prof. van der Schaar's team has contributed to the development of deep learning algorithms for longitudinal data. Dr. Manjinder Sandhu's group has also been involved and has provided expertise in population health epidemiology and high performance computation.
Impact The output of this collaboration has been access to a curated anonymised dataset of primary health care data on >3 million individuals that has been shared with me as part of HDR UK for development and assessment of algorithms for predictive analytics. This collaboration is multi-disciplinary, with ResearchOne providing expertise in interpretation of EHR data and Dr. Manjinder Sandhu's team specialising in population health epidemiology and high performance computation.
Start Year 2018
 
Description Prediction of complications of diabetes using primary care electronic health records 
Organisation Queen Mary University of London
Department Blizard Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution This is a collaboration between myself, Dr. Manjinder Sandhu (Department of Medicine, University of Cambridge), Prof. Mihaela van der Schaar (Department of Applied Mathematics and Theoretical Physics, University of Cambridge), Prof. Sobha Sivaprasad (Moorfields Eye Hospital and University College London), Prof. John Robson (the Clinical Effectiveness Group, Queen Mary University of London) and ResearchOne (an electronic health record provider) to develop prediction scores for diabetes and complications of diabetes among diabetics. These data include two primary care datasets: anonymised data on ~3 million nationally representative individuals with a median follow up of 14 years from primary care records, and anonymised EHR data from the London region with follow up for 5-10 years. In the context of this collaboration, we plan to develop methods to optimise prediction of diabetes related outcomes among individuals, with the objective of developing and publishing prediction scores for diabetes related outcomes from these rich longitudinal data. The ResearchOne dataset will be used for algorithmic development, with the CEG data being used for validation.
Collaborator Contribution In the context of this project, my collaborators have extracted relevant data on metabolic profiles, lifestyle, outcomes, clinic visits, follow up, medication, and diagnostic data from 3 million individuals across the UK who have not opted out of anonymised data being used for research purposes. They have also provided data dictionaries to facilitate harmonisation and curation of data. This is a rich longitudinal dataset spanning a median of 14 years that provides the opportunity to understand predictors of diabetes risk and complications, and to use statistical and machine learning approaches to develop predictive algorithms for disease and complications. Prof. Sivaprasad's team has provided important clinical input into the development of these algorithms. Prof. van der Schaar's team has contributed to the development of deep learning algorithms for longitudinal data. Dr. Manjinder Sandhu's group has also been involved and has provided expertise in population health epidemiology and high performance computation.
Impact The output of this collaboration has been access to a curated anonymised dataset of primary health care data on >3 million individuals that has been shared with me as part of HDR UK for development and assessment of algorithms for predictive analytics. This collaboration is multi-disciplinary, with ResearchOne providing expertise in interpretation of EHR data and Dr. Manjinder Sandhu's team specialising in population health epidemiology and high performance computation.
Start Year 2018
 
Description Prediction of complications of diabetes using primary care electronic health records 
Organisation University of Cambridge
Country United Kingdom 
Sector Academic/University 
PI Contribution This is a collaboration between myself, Dr. Manjinder Sandhu (Department of Medicine, University of Cambridge), Prof. Mihaela van der Schaar (Department of Applied Mathematics and Theoretical Physics, University of Cambridge), Prof. Sobha Sivaprasad (Moorfields Eye Hospital and University College London), Prof. John Robson (the Clinical Effectiveness Group, Queen Mary University of London) and ResearchOne (an electronic health record provider) to develop prediction scores for diabetes and complications of diabetes among diabetics. These data include two primary care datasets: anonymised data on ~3 million nationally representative individuals with a median follow up of 14 years from primary care records, and anonymised EHR data from the London region with follow up for 5-10 years. In the context of this collaboration, we plan to develop methods to optimise prediction of diabetes related outcomes among individuals, with the objective of developing and publishing prediction scores for diabetes related outcomes from these rich longitudinal data. The ResearchOne dataset will be used for algorithmic development, with the CEG data being used for validation.
Collaborator Contribution In the context of this project, my collaborators have extracted relevant data on metabolic profiles, lifestyle, outcomes, clinic visits, follow up, medication, and diagnostic data from 3 million individuals across the UK who have not opted out of anonymised data being used for research purposes. They have also provided data dictionaries to facilitate harmonisation and curation of data. This is a rich longitudinal dataset spanning a median of 14 years that provides the opportunity to understand predictors of diabetes risk and complications, and to use statistical and machine learning approaches to develop predictive algorithms for disease and complications. Prof. Sivaprasad's team has provided important clinical input into the development of these algorithms. Prof. van der Schaar's team has contributed to the development of deep learning algorithms for longitudinal data. Dr. Manjinder Sandhu's group has also been involved and has provided expertise in population health epidemiology and high performance computation.
Impact The output of this collaboration has been access to a curated anonymised dataset of primary health care data on >3 million individuals that has been shared with me as part of HDR UK for development and assessment of algorithms for predictive analytics. This collaboration is multi-disciplinary, with ResearchOne providing expertise in interpretation of EHR data and Dr. Manjinder Sandhu's team specialising in population health epidemiology and high performance computation.
Start Year 2018
 
Description Presentation at Human Genetics Retreat, Wellcome Sanger Institute 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Researchers in the Department of Human Genetics attended the talk, which was about using machine learning and complex statistical approaches for predictive analytics in electronic health record data. There was a lot of interest, and questions were asked about the applicability and generalisability of results in practice. This extended the audience's understanding of the utility of electronic health data within the UK for research purposes and for influencing public health policy.
Year(s) Of Engagement Activity 2018