Statistical methods for developing, assessing and validating risk prediction models in multiple epidemiological studies

Lead Research Organisation: University of Cambridge
Department Name: Institute of Public Health

Abstract

The identification of predictors of disease has led to considerable advances in clinical practice in the last century. Many markers such as systolic blood pressure and blood cholesterol levels are now routinely measured on individuals to predict conditions such as heart disease. Levels of markers that indicate an individual is at high risk of disease are subsequently modified by suitable interventions to help prevent disease onset and/or progression. However, there has been much controversy regarding the proportion of risk actually explained by such markers and the generalisability of risk markers between different populations. To enable a more comprehensive and powerful evaluation of the relevance of such predictors, it is often necessary to pool data from different studies. I plan to advance the development of statistical methods for use in such data pooling approaches by working on detailed information previously collated on up to 40,000 cases of heart disease from over 1 million participants in 104 studies. The main aim is to develop statistical methods that will enable more reliable conclusions to be drawn about (i) the relationship between predictors and disease; (ii) the predictive ability that can be attributed to risk models and (iii) the generalisability of risk models to different populations. The methods that will be developed will have applications to many different situations and to different diseases, and will become increasingly important as the trend continues towards data sharing and pooling in large, collaborative multi-centre studies.

Technical Summary

An increasing number of biological markers are being proposed as important predictors of chronic diseases (eg, heart attacks and strokes). Reliably identifying risk predictors of disease can have important scientific and public health implications (exemplified by the utility of measurement and modification of blood cholesterol values in prevention of heart attacks). To enable a more comprehensive and powerful evaluation of the relevance of such markers, it is often necessary to pool data from different studies, which has motivated the 1.1-million-participant, 104-cohort Emerging Risk Factors Collaboration (ERFC). Optimum biostatistical methods are needed to help maximize the value of such databases. Specifically, it is important to (i) account for differences in the distribution of the predictors across studies (ie, between-study heterogeneity); (ii) deal with missing predictor values and (iii) appropriately combine data and results from different study designs (eg, cohort, case-cohort and nested case-control studies). The present proposal seeks support to address these unresolved statistical issues in relation to the development, assessment and validation of risk prediction models in multiple epidemiological studies.
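As an illustration of how between-study heterogeneity can be handled when combining study-specific results, the following is a minimal sketch of a random-effects meta-analysis. It assumes the DerSimonian-Laird moment estimator for the between-study variance, which is a common choice but is not named in the proposal. The function name and the example values are hypothetical.

```python
import math


def random_effects_pool(estimates, variances):
    """Pool study-specific estimates (e.g. log hazard ratios).

    Uses the DerSimonian-Laird estimator of the between-study variance
    (an assumption for this sketch). Returns (pooled, se, tau2).
    """
    k = len(estimates)
    if k != len(variances) or k < 2:
        raise ValueError("need matching lists with at least two studies")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw

    # Cochran's Q measures the spread of estimates around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw

    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau2 to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


# Hypothetical log hazard ratios and their variances from three studies.
pooled, se, tau2 = random_effects_pool([0.10, 0.35, 0.22], [0.010, 0.020, 0.015])
print(pooled, se, tau2)
```

When the study estimates agree closely, the estimated between-study variance is truncated to zero and the result coincides with a fixed-effect (inverse-variance) analysis.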

Risk prediction models ideally should allow for non-linear relationships between predictors and disease, joint effects of predictors and other common functions of predictors, such as ratios. Through implementation in the ERFC database and through simulation, we will develop models that incorporate these features, for example using flexible fractional polynomial models, whilst overcoming the statistical limitations listed above. As well as determining relevant predictors of disease, it is important to identify the prognostic ability of the risk prediction model. No one measure of prognostic ability is widely accepted for general use. We will compare and investigate the use of several measures, such as the AUROC, C-index and Royston's D and measures which summarise risk reclassification tables, in relation to multiple studies. It is also important to validate a risk prediction model to assess its generalisability to other populations. Multiple studies should allow an optimum validation procedure, and we will consider using both internal validation (splitting individuals within cohorts) and external validation procedures (splitting cohorts). Throughout, we propose to incorporate between-study heterogeneity via random effects, use multiple imputation to deal with missing values and combine estimates from different study designs in a meta-analysis approach.
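To make one of the discrimination measures named above concrete, the following is a minimal sketch of Harrell's C-index for right-censored data, computed within each study and then combined. Weighting each study by its number of comparable pairs is an assumption made for this illustration, not a method stated in the proposal, and the function names are hypothetical.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's C-index for one study.

    A pair is comparable when the subject with the shorter follow-up time
    had an observed event. The pair is concordant when that subject also
    has the higher risk score; tied scores count one half.
    Returns (c_index, number_of_comparable_pairs).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # pairs are anchored on a subject with an observed event
        for j in range(n):
            if times[i] < times[j]:  # j was still event-free when i had the event
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable, comparable


def pooled_c(studies):
    """Combine study-specific C-indices, weighting by comparable pairs.

    `studies` is a list of (times, events, risk_scores) tuples. Pairs are
    only ever formed within a study, never across studies.
    """
    results = [harrell_c(t, e, r) for t, e, r in studies]
    total_pairs = sum(pairs for _, pairs in results)
    return sum(c * pairs for c, pairs in results) / total_pairs
```

Restricting comparisons to participants within the same study avoids crediting the model for differences in baseline risk between studies.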

The products of this proposed work will have rapid application, initially to the ERFC and then to other existing data pooling initiatives. Our findings should become increasingly useful as the trend continues towards data pooling in large, collaborative multi-centre analyses.

Publications

 
Guideline Title ESC Clinical Practice Guidelines
Description SCORE2
Geographic Reach Europe 
Policy Influence Type Citation in clinical guidelines
Impact • Improved and updated risk calculators allow tailored use among people aged 40+ to accurately predict who is at risk of having a heart attack or stroke in the next 5 or 10 years • People flagged as having increased risk are recommended personalised preventative treatment • Our tool, called 'SCORE2', has been adopted by the European Guidelines on Cardiovascular
URL https://www.escardio.org/Education/ESC-Prevention-of-CVD-Programme/Risk-assessment/esc-cvd-risk-calc...
 
Description Training in the analysis of individual participant data from multiple studies
Geographic Reach Multiple continents/international 
Policy Influence Type Influenced training of practitioners or researchers
 
Description Translational research tools in the analysis of individual participant data from multiple studies
Geographic Reach National 
Policy Influence Type Participation in an advisory committee
 
Description Characterisation, determinants, mechanisms and consequences of the long-term effects of COVID-19: providing the evidence base for health care
Amount £10,000,000 (GBP)
Funding ID MC_PC_20051 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 03/2021 
End 02/2024
 
Description EU: Innovative Medicines Initiative - "BigData@Heart"
Amount € 19,000,000 (EUR)
Funding ID 116074 
Organisation European Union 
Sector Public
Country European Union (EU)
Start  
 
Description Large-scale integrative studies of risk factors in coronary heart disease: from discovery to application
Amount £2,017,846 (GBP)
Funding ID MR/L003120/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 07/2013 
End 08/2018
 
Description Looking beyond the mean: what within-person variability can tell us about dementia, cardiovascular disease and cystic fibrosis
Amount £486,957 (GBP)
Funding ID MR/V020595/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 08/2021 
End 03/2024
 
Description MRC Industrial Strategy PhD Award
Amount £360,000 (GBP)
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 09/2018 
End 10/2021
 
Description NIHR BTRU in Donor Health & Genomics
Amount £4,000,000 (GBP)
Organisation National Institute for Health Research 
Sector Public
Country United Kingdom
Start  
 
Description Phase 1 COVID-19 Longitudinal Health and Wellbeing - National Core Study
Amount £9,074,000 (GBP)
Funding ID MC_PC_20059 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 03/2021 
End 09/2022
 
Description Pump-priming proposals
Amount £50,000 (GBP)
Organisation British Heart Foundation (BHF) 
Sector Charity/Non Profit
Country United Kingdom
Start 03/2015 
End 11/2015
 
Description RCUK Innovation / Rutherford Fund Fellowships
Amount £760,000 (GBP)
Organisation Research Councils UK (RCUK) 
Sector Public
Country United Kingdom
Start 07/2018 
End 08/2021
 
Description The risk of stroke after SARS-CoV-2 in a UK population-wide cohort
Amount £60,000 (GBP)
Funding ID SA_CV_20/100018 
Organisation Stroke Association 
Sector Charity/Non Profit
Country United Kingdom
Start 02/2021 
End 03/2022
 
Description Towards early identification of adolescent mental health problems
Amount £100,577 (GBP)
Funding ID MR/T046430/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 06/2020 
End 10/2021
 
Description Using machine learning for personalised CVD risk management
Amount £91,414 (GBP)
Funding ID BDCSA_100005 Wood 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start  
 
Title Stata-ado files 
Description Development of generic Stata (statistical software) programs
Type Of Material Improvements to research infrastructure 
Year Produced 2012 
Provided To Others? Yes  
Impact The development of the methods and software programs has enabled progression of several applied projects evaluating new potential biomarkers for cardiovascular risk prediction 
URL http://www.phpc.cam.ac.uk/ceu
 
Title Risk prediction in multiple studies
Description Development of statistical methods to assess predictive ability of new risk factors and models using data from multiple studies 
Type Of Material Improvements to research infrastructure 
Provided To Others? No  
Impact Contributing to further manuscripts on assessing risk prediction
 
Title CVD-COVID-UK/COVID-IMPACT 
Description See https://www.hdruk.ac.uk/projects/cvd-covid-uk-project/ CVD-COVID-UK established a novel population-wide resource in partnership with NHS Digital, comprising a range of linked datasets covering the entire population of England, including: hospital data; death registrations; primary care data; community dispensing data; COVID-19 vaccination data and lab tests; and data from intensive care units and from cardiovascular specialist registries.
Type Of Material Data analysis technique 
Year Produced 2021 
Provided To Others? Yes  
Impact Results from analyses using this research database have informed national COVID-19 Advisory Groups and public health agencies on COVID-19 vaccine safety. 
URL https://www.hdruk.ac.uk/projects/cvd-covid-uk-project/
 
Title Models for risk prediction 
Description 1-stage and 2-stage approaches to combine data and results across multiple studies to build risk prediction models 
Type Of Material Computer model/algorithm 
Year Produced 2014 
Provided To Others? Yes  
Impact Various publications cite this work 
 
Description CRUK International Alliance for Cancer Early Detection - Real-world risk-stratified early detection and diagnosis using linked electronic health records data
Organisation University College London
Department Institute of Epidemiology and Health Care
Country United Kingdom 
Sector Academic/University 
PI Contribution Leading statistical methods development and application
Collaborator Contribution Contributing clinical expertise
Impact Successful grant award for multi-disciplinary team science. Co-applicant: CRUK International Alliance for Cancer Early Detection - Real-world risk-stratified early detection and diagnosis using linked electronic health records data, £800K
Start Year 2020
 
Description EPIC-CVD 
Organisation European Commission
Department Seventh Framework Programme (FP7)
Country European Union (EU) 
Sector Public 
PI Contribution The statistical methodology developed in this project is having a direct impact on statistical plans for EPIC-CVD
Collaborator Contribution Our partners will be applying the statistical methodology developed in this current grant.
Impact Output: the statistical analysis plan for EPIC-CVD research projects related to risk prediction. This will be used by statisticians and interpreted by Epidemiologists and Public Health researchers.
Start Year 2012
 
Description Evaluating the prognostic value of new cardiovascular biomarkers 
Organisation Northwestern University
Country United States 
Sector Academic/University 
PI Contribution Provided statistical knowledge and context about risk prediction in cardiovascular disease
Collaborator Contribution Provided examples and motivations for assessing predictive ability of new risk factors, including genetic variants
Impact Collaboration led to a publication, 19773609. The collaboration is multidisciplinary, combining mathematics and statistics with medicine
Start Year 2008
 
Description Genomic risk prediction 
Organisation University of Cambridge
Country United Kingdom 
Sector Academic/University 
PI Contribution Provided statistical knowledge and context about risk prediction in cardiovascular disease
Collaborator Contribution Provided examples and motivations for assessing predictive ability of new risk factors, including genetic variants
Impact Lancet commentary currently under review
Start Year 2010
 
Description Machine learning and AI 
Organisation University of Cambridge
Department Department of Applied Mathematics and Theoretical Physics (DAMTP)
Country United Kingdom 
Sector Academic/University 
PI Contribution Collaborating with Mihaela van der Schaar in various machine learning and AI projects.
Collaborator Contribution Contributing methods development and data
Impact Not yet
Start Year 2019
 
Description Risk prediction workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Type Of Presentation Workshop Facilitator
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact I organised a workshop on "Statistical Challenges in Risk Prediction" with 30 international participants, in Cambridge Nov 2012.

The output of the workshop will be a special journal issue in The Biometrical Journal.
Year(s) Of Engagement Activity 2012
 
Description Teaching at post-graduate level in MPhil courses 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact "Risk prediction" workshop day for MPhil in Epidemiology, Public Health and Primary care students.
Year(s) Of Engagement Activity 2014,2015