An investigation into the use of shrinkage methods to alleviate over-fitting of prognostic models for independent and clustered data with few events

Lead Research Organisation: University College London
Department Name: Statistical Science

Abstract

Clinicians, health service researchers and epidemiologists often wish to predict a future health outcome for patients and the public. Examples of such outcomes include the development of coronary heart disease, in-hospital mortality following surgery, and the onset of depression. Clinicians use these predictions to determine patients' prognosis and plan their treatment, to detect high-risk patients, and to give patients the information they need to make decisions about their treatment options. Policy makers often use these predictions to assess the performance of hospitals and general practices and to identify underperforming institutions.

Statistical models using patients' clinical and demographic characteristics are typically used to make these predictions. These models are referred to as prognostic models. To develop such models, information is collected on patients or other relevant subjects regarding their risk factors and the health outcome they experienced. The relationship between the risk factors and the outcome is quantified using a statistical model, which can then be used to make predictions for new patients. Models are usually presented in the form of a risk algorithm. This algorithm is then tested on new patients to ensure that it makes reliable predictions. If its performance is found to be satisfactory, it is recommended for use by clinicians in practice. Examples of risk algorithms used in practice include the Framingham risk score to predict the 10-year risk of coronary heart disease, EuroSCORE to predict in-hospital mortality following cardiac surgery, and the PREDICT score to predict the risk of developing depression. When the health outcome of interest is rare, it is often difficult to develop a risk algorithm that will both predict risk accurately and classify patients into high- and low-risk groups. This is a common problem in health research and is often not alleviated by collecting patient data from many centres or over a long period of time. A further statistical problem occurs with data from many centres, as outcomes may vary between the centres.

Robust models exist for relatively common events such as coronary heart disease, in-hospital mortality following cardiac surgery, and depression. However, reliable prognostic models are scarce, or not available, for rarer health outcomes, for example death or recurrence following diagnosis of rare types of cancer, and the onset of Parkinson's disease. There is a similar problem when trying to develop prognostic models for common events in relatively small subgroups of people, for example a model to predict coronary heart disease in people who have severe mental health problems.

Some methodological research has been done to handle the problem of fitting statistical models for rare outcomes in genetic studies. However, limited work has been done to develop methods to produce reliable prognostic models with rare outcomes in clinical settings such as public health and health services research. Moreover, the methods that have been developed to date are not used routinely because of a lack of software and of adequate evaluation. There are currently no guidelines on how statisticians and other researchers should use these methods in practice. The proposed research will evaluate the existing statistical methodology that is available to handle risk predictions when the health outcome of interest is rare, and will develop new methods where necessary. The proposed research will make recommendations regarding the use of these methods in practice. Additionally, the methods developed in this research project will be implemented in widely available statistical software to enable their routine use. The prognostic models developed using these methods should enable clinicians and policy makers to make predictions for patients regarding health outcomes in these settings, even if the outcome is rare.

Technical Summary

Prognostic models are increasingly used by clinicians and policy makers to predict health outcomes for patients and the public. These models are used to guide the clinical management of patients, help patients make informed decisions about their treatment and compare institutional performance after adjusting for patient case-mix. However, accurate and reliable prognostic models can be difficult to develop if the disease or event of interest is rare. Model overfitting is a problem in this situation, and is typically handled using variable selection approaches based on P-values. However, variable selection has problems that are exacerbated in sparse data, including instability in which predictors are selected. Additional complexity arises with multi-centre data, as patients within a centre are likely to be more similar to one another than to patients in other centres, leading to clustered (correlated) data.

Penalised maximum likelihood methods, which apply shrinkage to regression coefficients, have been proposed to address the model overfitting problem. However, these methods have not been adequately evaluated for scenarios that typically occur in the clinical areas of public health, epidemiology and health services research. Moreover, little work has been done to develop shrinkage methods for clustered data. It is important to conduct research to identify appropriate methods that enable the development of reliable prognostic models in the clinical settings described above. This research will conduct a thorough investigation of the performance of existing methods for binary and survival health outcomes through the evaluation of their theoretical properties and performance in simulation studies. The overarching aim is to make recommendations to health researchers regarding the use of appropriate methods in prognostic modelling studies in these settings, and to facilitate their use by providing an overview of available software and developing routines where required.
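
As a rough illustration of what shrinkage through a penalised likelihood means, the sketch below fits a logistic model to a small binary-outcome dataset twice: once by ordinary maximum likelihood and once with a ridge (squared-coefficient) penalty. The ridge penalty is chosen here only as a familiar example and is not necessarily one of the methods this project evaluates. The data, the function name fit_ridge_logistic and the penalty weight lam are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_ridge_logistic(X, y, lam, steps=2000, lr=0.5):
    """Gradient ascent on: log-likelihood - lam * sum(beta_j ** 2).

    The intercept b0 is left unpenalised; lam = 0 gives ordinary
    (unpenalised) maximum likelihood.
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            # Residual: observed outcome minus predicted risk
            resid = yi - sigmoid(b0 + sum(b * x for b, x in zip(beta, xi)))
            g0 += resid
            for j in range(p):
                g[j] += resid * xi[j]
        b0 += lr * g0 / n
        # The penalty adds -2 * lam * beta_j to each slope's gradient
        beta = [b + lr * (gj - 2.0 * lam * b) / n for b, gj in zip(beta, g)]
    return b0, beta


# Hypothetical data: 16 patients, 4 events, two predictors
# (a standardised continuous risk factor and a binary one).
X = [(-1.2, 0), (-0.9, 0), (-0.7, 1), (-0.5, 0), (-0.3, 0), (-0.1, 1),
     (0.0, 0), (0.2, 0), (0.3, 1), (0.5, 0), (0.6, 0), (0.8, 1),
     (0.9, 0), (1.1, 1), (1.3, 0), (1.5, 1)]
y = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

b0_mle, beta_mle = fit_ridge_logistic(X, y, lam=0.0)
b0_pen, beta_pen = fit_ridge_logistic(X, y, lam=2.0)
print("unpenalised slopes:", beta_mle)
print("penalised slopes:  ", beta_pen)
```

The penalised slopes are pulled towards zero relative to the unpenalised ones, which makes predicted risks less extreme. In practice the penalty weight would be chosen from the data, for example by cross-validation, rather than fixed in advance as it is here.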

Planned Impact

Prognostic models are increasingly being used by clinicians and policy makers to predict future health outcomes of patients and the public. For example, the Framingham model has been developed to predict the 10-year risk of coronary heart disease, the PREDICT score to predict the onset of depression, and the Ambler/Omar model to predict the risk of in-hospital mortality following heart valve surgery. These predictions are usually made using patients' clinical and demographic characteristics. The main applications for these models are to guide clinical management of patients, to help patients make informed decisions about their treatment options and to compare institutional performance after adjusting for patients' case-mix.

Robust models exist for the relatively common events described above. However, reliable prognostic models are scarce, or not available, for rarer health outcomes such as death or recurrence following diagnosis of rare types of cancer, mechanical failure of artificial heart valves, and the onset of Parkinson's disease. There is a similar problem when trying to develop prognostic models for common events in relatively small subgroups of people, for example predicting coronary heart disease in people who have severe mental health problems. Prognostic models developed in these scenarios are often overfitted and unable to make accurate predictions for future patients.

Statistical shrinkage methods have been developed to handle the problem of overfitting in genetic studies, where the number of predictors often greatly exceeds the number of patients. However, the focus of these types of study is often on the identification of key predictors rather than on the development of prognostic models for routine clinical use. In contrast, these shrinkage methods are rarely used in prognostic modelling studies in public health, health services research and epidemiology, where the number of coefficients is usually lower than both the sample size and the number of events but the model overfitting problem still exists.

The proposed research will conduct a comprehensive evaluation of existing shrinkage methods and corresponding software, and develop new methods and software where necessary. This will form a basis from which practical recommendations can be made to biostatisticians regarding the development of prognostic models with such data. This will also make statistical tools available to epidemiologists, public health researchers, health services researchers and other health researchers to develop reliable prognostic models for diseases with rare outcomes, using studies specifically designed for that purpose instead of using routine data. This should enable clinicians to make reliable risk predictions in these clinical areas, assisting in the clinical management of their patients, and should also help patients suffering from such diseases to make decisions about their treatment options. Additionally, policy makers will no longer be restricted to evaluating institutional performance for common health outcomes only.

The researchers in this team, in particular the postdoctoral research fellow (RF), will learn new statistical methodology and theory and how to design and conduct simulation studies. They will learn how to use and develop new software routines. The use of real clinical datasets will give the theoretical statisticians and the RF the opportunity to develop skills in interpreting results from health studies and applying statistical methods in practice, thus creating scope for future collaboration between them and health researchers. This should enable capacity building in biostatistics. The RF will develop skills in writing scientific papers. The project will also provide the opportunity to enhance the statistical methods used in NIHR research by integrating the expertise of biostatisticians and theoretical statisticians and applying the best possible methods for the health care of the public and patients.
 
Guideline Title European Society of Cardiology Guidelines
Description HCM model and AF model
Geographic Reach Multiple continents/international 
Policy Influence Type Citation in clinical guidelines
Impact Our risk model is widely used in clinical practice and helps improve decision making about implanting devices in patients to prevent sudden cardiac death.
 
Description MRC Methodology
Amount £300,000 (GBP)
Funding ID MR/P015190/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 03/2018 
End 02/2021
 
Title Risk model for HCM and AF 
Description A novel clinical risk prediction model for sudden cardiac death in hypertrophic cardiomyopathy; prediction of thrombo-embolic risk in patients with hypertrophic cardiomyopathy (HCM Risk-CVA).
Type Of Material Computer model/algorithm 
Year Produced 2014 
Provided To Others? Yes  
Impact It is included in clinical guidelines and used in clinical practice.
 
Description Clinical risk Models 
Organisation Monaldi Hospital, Second University of Naples
Country Italy 
Sector Hospitals 
PI Contribution We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year 2013
 
Description Clinical risk Models 
Organisation National and Kapodistrian University of Athens
Country Greece 
Sector Academic/University 
PI Contribution We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year 2013
 
Description Clinical risk Models 
Organisation University College London
Department Medical School
Country United Kingdom 
Sector Academic/University 
PI Contribution We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year 2013
 
Description Clinical risk Models 
Organisation University Hospital Plzen
Country Czech Republic 
Sector Hospitals 
PI Contribution We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year 2013
 
Description Clinical risk Models 
Organisation University of A Coruña
Department Institute of Biomedical Research of A Coruña
Country Spain 
Sector Academic/University 
PI Contribution We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year 2013
 
Description Clinical risk Models 
Organisation University of Bologna
Country Italy 
Sector Academic/University 
PI Contribution We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year 2013
 
Description Clinical risk Models 
Organisation Virgen de la Arrixaca University Hospital
Country Spain 
Sector Hospitals 
PI Contribution We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year 2013
 
Description NICOR 
Organisation University College London
Department National Institute for Cardiovascular Outcomes Research (NICOR)
Country United Kingdom 
Sector Academic/University 
PI Contribution Rumana Omar (RO) and Gareth Ambler (GA) will provide senior statistical support to develop and validate risk models for cardiovascular outcomes and carry out research on risk modelling methodology for the National Institute for Cardiovascular Outcomes Research (NICOR). NICOR collects clinical information from UK hospitals into secure registries established by the cardiovascular specialist societies. They help the NHS, the government and regulatory bodies improve quality of care by checking that the care received by heart disease patients meets good practice standards. RO will be a member of their strategic board and GA a member of their methodological advisory board.
Collaborator Contribution They will provide data and facilitate the impact pathway.
Impact No outputs yet. It involved clinicians, statisticians, data scientists and audit methodologists.
Start Year 2017
 
Description SMIRP 
Organisation University College London
Department Department of Statistical Science
Country United Kingdom 
Sector Academic/University 
PI Contribution We have formed a collaborative group with researchers working on risk prediction models, from the MRC Biostatistics Unit, Cambridge, Oxford University and UCL. The contributions from this group are in the form of feedback on presentations from ongoing research.
Collaborator Contribution The contributions are in the form of feedback on presentations from ongoing research.
Impact Research presentation meetings held at the MRC Biostatistics Unit and the MRC Clinical Trials Unit in April, July, and October 2013.
Start Year 2013
 
Description SMIRP 
Organisation University of Cambridge
Department MRC Biostatistics Unit
Country United Kingdom 
Sector Academic/University 
PI Contribution We have formed a collaborative group with researchers working on risk prediction models, from the MRC Biostatistics Unit, Cambridge, Oxford University and UCL. The contributions from this group are in the form of feedback on presentations from ongoing research.
Collaborator Contribution The contributions are in the form of feedback on presentations from ongoing research.
Impact Research presentation meetings held at the MRC Biostatistics Unit and the MRC Clinical Trials Unit in April, July, and October 2013.
Start Year 2013
 
Description Public speaking 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Media (as a channel to the public)
Results and Impact Public speaking, awareness of risk prediction methodology
Year(s) Of Engagement Activity 2016