An investigation into the use of shrinkage methods to alleviate over-fitting of prognostic models for independent and clustered data with few events
Lead Research Organisation:
University College London
Abstract
Clinicians, health service researchers and epidemiologists often wish to predict a future health outcome for patients and the public. Examples of such outcomes include development of coronary heart disease, the occurrence of in-hospital mortality following surgery, and the onset of depression. Clinicians use these predictions to determine the prognosis of patients and plan their treatment, to detect high-risk patients, and to give patients the information they need to make decisions about their treatment options. Policy makers often use these predictions to assess the performance of hospitals and general practices and to identify underperforming institutions.
Statistical models using patients' clinical and demographic characteristics are typically used to make these predictions. These models are referred to as prognostic models. To develop such models, information is collected on patients or relevant subjects regarding their risk factors and the health outcome they experienced. The relationship between the risk factors and the outcome is quantified using a statistical model, which can then be used to make predictions for new patients. Models are usually presented in the form of a risk algorithm. This algorithm is then tested on new patients to ensure that it makes reliable predictions. If its performance is found to be satisfactory, it is recommended for use by clinicians in practice. Examples of risk algorithms used in practice include the Framingham risk score to predict the 10-year risk of coronary heart disease, EuroSCORE to predict in-hospital mortality following cardiac surgery, and the PREDICT score to predict the risk of developing depression. When the health outcome of interest is rare, it is often difficult to develop a risk algorithm that will both predict risk accurately and classify patients into high-risk and low-risk groups. This is a common problem in health research and is often not alleviated by collecting patient data from many centres or over a long period of time. A further statistical problem arises with data from many centres, as outcomes may vary between the centres.
Robust models exist for relatively common events such as coronary heart disease, in-hospital mortality following cardiac surgery, and depression. However, reliable prognostic models are scarce, or not available, for rarer health outcomes, for example death or recurrence following diagnosis of rare types of cancer, and the onset of Parkinson's disease. There is a similar problem when trying to develop prognostic models for common events in relatively small subgroups of people, for example a model to predict coronary heart disease in people who have severe mental health problems.
Some methodological research has been done to handle the problem of fitting statistical models for rare outcomes in genetic studies. However, limited work has been done to develop methods that produce reliable prognostic models with rare outcomes in settings such as public health and health services research. Moreover, the methods that have been developed to date are not used routinely because of a lack of software and of adequate evaluation. There are currently no guidelines on how statisticians and other researchers should use these methods in practice. The proposed research will evaluate the existing statistical methodology for risk prediction when the health outcome of interest is rare, and will develop new methods where necessary. It will make recommendations regarding the use of these methods in practice. Additionally, the methods developed in this research project will be implemented in widely available statistical software to enable their routine use. The prognostic models developed using these methods should enable clinicians and policy makers to make predictions for patients regarding health outcomes in these settings, even if the outcome is rare.
Technical Summary
Prognostic models are increasingly used by clinicians and policy makers to predict health outcomes for patients and the public. These models are used to guide the clinical management of patients, help patients make informed decisions about their treatment, and compare institutional performance after adjusting for patient case-mix. However, accurate and reliable prognostic models can be difficult to develop if the disease or event of interest is rare. Model overfitting is a problem in this situation, and is typically handled using variable selection approaches based on P-values. However, variable selection has problems of its own, which are exacerbated in sparse data, including instability in the set of predictors selected. Additional complexity arises with multi-centre data, as patients within a centre are more likely to be similar to each other than to patients in other centres, leading to clustered (correlated) data.
Penalised maximum likelihood methods, which apply shrinkage to regression coefficients, have been proposed to address the model overfitting problem. However, these methods have not been adequately evaluated for scenarios that typically occur in public health, epidemiology and health services research. Moreover, little work has been done to develop shrinkage methods for clustered data. It is important to conduct research to identify appropriate methods that enable the development of reliable prognostic models in the settings described above. This research will conduct a thorough investigation of the performance of existing methods for binary and survival health outcomes through the evaluation of their theoretical properties and their performance in simulation studies. The overarching aim is to make recommendations to health researchers regarding the use of appropriate methods in prognostic modelling studies in these settings, and to facilitate their use by providing an overview of available software and developing routines where required.
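To make the idea of shrinkage concrete, the following is a minimal sketch of one penalised maximum likelihood method, ridge-penalised logistic regression, fitted by plain gradient ascent on simulated data with a rare binary outcome. It is an illustration only, not the project's own software: the function names, the simulated coefficients, the sample size and the penalty value lam=10 are all assumptions made for the example. In practice the penalty would normally be tuned (for instance by cross-validation), and this sketch makes no attempt to handle clustered data.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate(n, true_beta, intercept, rng):
    """Simulate independent patients with standard normal predictors
    and a binary outcome (hypothetical data-generating model)."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_beta]
        p = sigmoid(intercept + sum(b * v for b, v in zip(true_beta, x)))
        X.append(x)
        y.append(1 if rng.random() < p else 0)
    return X, y


def fit_logistic(X, y, lam=0.0, n_iter=1000, lr=0.5):
    """Maximise log-likelihood - (lam / 2) * sum(beta_j ** 2) by gradient ascent.
    The intercept is left unpenalised; lam = 0 gives ordinary maximum likelihood."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            resid = yi - sigmoid(b0 + sum(b * v for b, v in zip(beta, xi)))
            g0 += resid
            for j in range(p):
                g[j] += resid * xi[j]
        b0 += lr * g0 / n
        # The ridge penalty contributes -lam * beta_j to each slope gradient.
        beta = [b + lr * (gj - lam * b) / n for b, gj in zip(beta, g)]
    return b0, beta


# Assumed scenario: 200 patients, 4 predictors (two of them pure noise),
# and a low baseline risk so that events are few.
rng = random.Random(2024)
X, y = simulate(n=200, true_beta=[0.8, 0.5, 0.0, 0.0], intercept=-3.0, rng=rng)

_, beta_mle = fit_logistic(X, y, lam=0.0)
_, beta_ridge = fit_logistic(X, y, lam=10.0)

print("events:", sum(y))
print("unpenalised:", [round(b, 3) for b in beta_mle])
print("ridge (lam=10):", [round(b, 3) for b in beta_ridge])
```

The penalised coefficients are pulled towards zero relative to the unpenalised fit, which is the mechanism by which shrinkage is intended to reduce overfitting when events are few.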
Planned Impact
Prognostic models are increasingly being used by clinicians and policy makers to predict future health outcomes of patients and the public. For example, the Framingham model has been developed to predict the 10-year risk of coronary heart disease, the PREDICT score to predict the onset of depression, and the Ambler/Omar model to predict the risk of in-hospital mortality following heart valve surgery. These predictions are usually made using patients' clinical and demographic characteristics. The main applications of these models are to guide clinical management of patients, to help patients make informed decisions about their treatment options, and to compare institutional performance after adjusting for patients' case-mix.
Robust models exist for the relatively common events described above. However, reliable prognostic models are scarce, or not available, for rarer health outcomes such as death or recurrence following diagnosis of rare types of cancer, mechanical failure of artificial heart valves, and the onset of Parkinson's disease. There is a similar problem when trying to develop prognostic models for common events in relatively small subgroups of people, for example predicting coronary heart disease in people who have severe mental health problems. Prognostic models developed in these scenarios are often overfitted and unable to make accurate predictions for future patients.
Statistical shrinkage methods have been developed to handle the problem of overfitting in genetic studies, where the number of predictors often greatly exceeds the number of patients. However, the focus of these types of study is often on the identification of key predictors rather than on the development of prognostic models for routine clinical use. In contrast, these shrinkage methods are rarely used in prognostic modelling studies in public health, health services research and epidemiology, where the number of coefficients is usually lower than both the sample size and the number of events, yet the model overfitting problem still exists.
The proposed research will conduct a comprehensive evaluation of existing shrinkage methods and corresponding software, and develop new methods and software where necessary. This will form a basis from which practical recommendations can be made to biostatisticians regarding the development of prognostic models with such data. It will also make statistical tools available to epidemiologists, public health researchers, health services researchers and other health researchers to develop reliable prognostic models for diseases with rare outcomes, using studies specifically designed for that purpose instead of using routine data. This should enable clinicians to make reliable risk predictions in these clinical areas, assisting in the clinical management of their patients, and should also help patients with such diseases to make decisions about their treatment options. Additionally, policy makers will no longer be restricted to evaluating institutional performance for common health outcomes only.
The researchers in this team, in particular the postdoctoral research fellow (RF), will learn new statistical methodology and theory and how to design and conduct simulation studies. They will learn how to use and develop new software routines. The use of real clinical datasets will give the theoretical statisticians and the RF the opportunity to develop skills in interpreting results from health studies and applying statistical methods in practice, thus creating scope for future collaboration between them and health researchers. This should enable capacity building in biostatistics. The RF will develop skills in writing scientific papers. The project will also provide the opportunity to enhance the statistical methods used in NIHR research by integrating the expertise of biostatisticians and theoretical statisticians and applying the best possible methods for the health care of the public and patients.
Organisations
- University College London (Lead Research Organisation)
- University of Cambridge (Collaboration)
- University Hospital Plzen (Collaboration)
- National and Kapodistrian University of Athens (Collaboration)
- University College London (Collaboration)
- University of A Coruña (Collaboration)
- Monaldi Hospital, Second University of Naples (Collaboration)
- Virgen de la Arrixaca University Hospital (Collaboration)
- University of Bologna (Collaboration)
Publications

Pavlou M
(2016)
Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events.
in Statistics in medicine

Pavlou M
(2015)
How to develop a more accurate risk prediction model when there are few events.
in BMJ (Clinical research ed.)

Pavlou M
(2015)
A note on obtaining correct marginal predictions from a random intercepts model for binary outcomes.
in BMC medical research methodology

Guideline Title | European Society of Cardiology Guidelines |
Description | HCM model and AF model |
Geographic Reach | Multiple continents/international |
Policy Influence Type | Citation in clinical guidelines |
Impact | Our risk model is widely used in clinical practice and helps improve decision making on implanting devices in patients to prevent sudden cardiac death. |
Description | MRC Methodology |
Amount | £300,000 (GBP) |
Funding ID | MR/P015190/1 |
Organisation | Medical Research Council (MRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2018 |
End | 02/2021 |
Title | Risk model for HCM and AF |
Description | A novel clinical risk prediction model for sudden cardiac death in hypertrophic cardiomyopathy, and a model for prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCM Risk-CVA). |
Type Of Material | Computer model/algorithm |
Year Produced | 2014 |
Provided To Others? | Yes |
Impact | It is included in clinical guidelines and used in clinical practice. |
Description | Clinical risk Models |
Organisation | Monaldi Hospital, Second University of Naples |
Country | Italy |
Sector | Hospitals |
PI Contribution | We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy. |
Collaborator Contribution | Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario. |
Impact | Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16. |
Start Year | 2013 |
Description | Clinical risk Models |
Organisation | National and Kapodistrian University of Athens |
Country | Greece |
Sector | Academic/University |
PI Contribution | We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy. |
Collaborator Contribution | Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario. |
Impact | Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16. |
Start Year | 2013 |
Description | Clinical risk Models |
Organisation | University College London |
Department | Medical School |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy. |
Collaborator Contribution | Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario. |
Impact | Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16. |
Start Year | 2013 |
Description | Clinical risk Models |
Organisation | University Hospital Plzen |
Country | Czech Republic |
Sector | Hospitals |
PI Contribution | We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy. |
Collaborator Contribution | Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario. |
Impact | Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16. |
Start Year | 2013 |
Description | Clinical risk Models |
Organisation | University of A Coruña |
Department | Institute of Biomedical Research of A Coruña |
Country | Spain |
Sector | Academic/University |
PI Contribution | We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy. |
Collaborator Contribution | Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario. |
Impact | Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16. |
Start Year | 2013 |
Description | Clinical risk Models |
Organisation | University of Bologna |
Country | Italy |
Sector | Academic/University |
PI Contribution | We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy. |
Collaborator Contribution | Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario. |
Impact | Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16. |
Start Year | 2013 |
Description | Clinical risk Models |
Organisation | Virgen de la Arrixaca University Hospital |
Country | Spain |
Sector | Hospitals |
PI Contribution | We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy. |
Collaborator Contribution | Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario. |
Impact | Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16. |
Start Year | 2013 |
Description | NICOR |
Organisation | University College London |
Department | National Institute for Cardiovascular Outcomes Research (NICOR) |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Rumana Omar and Gareth Ambler will provide senior statistical support to develop and validate risk models for cardiovascular outcomes and carry out research on risk modelling methodology for the National Institute for Cardiovascular Outcomes Research (NICOR). NICOR collects clinical information from UK hospitals into secure registries established by the cardiovascular specialist societies. It helps the NHS, the government and regulatory bodies improve quality of care by checking that the care received by heart disease patients meets good practice standards. RO will be a member of its strategic board and GA a member of its methodological advisory board. |
Collaborator Contribution | They will provide data and facilitate the impact pathway. |
Impact | No outputs yet. It involved clinicians, statisticians, data scientists and audit methodologists. |
Start Year | 2017 |
Description | SMIRP |
Organisation | University College London |
Department | Department of Statistical Science |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | We have formed a collaborative group with researchers working on risk prediction models, from the MRC Biostatistics Unit, Cambridge, Oxford University and UCL. The contributions from this group are in the form of feedback on presentations from ongoing research. |
Collaborator Contribution | The contributions are in the form of feedback on presentations from ongoing research. |
Impact | Research presentation meetings held at the MRC Biostatistics Unit and the MRC Clinical Trials Unit in April, July, and October 2013. |
Start Year | 2013 |
Description | SMIRP |
Organisation | University of Cambridge |
Department | MRC Biostatistics Unit |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | We have formed a collaborative group with researchers working on risk prediction models, from the MRC Biostatistics Unit, Cambridge, Oxford University and UCL. The contributions from this group are in the form of feedback on presentations from ongoing research. |
Collaborator Contribution | The contributions are in the form of feedback on presentations from ongoing research. |
Impact | Research presentation meetings held at the MRC Biostatistics Unit and the MRC Clinical Trials Unit in April, July, and October 2013. |
Start Year | 2013 |
Description | Public speaking |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Media (as a channel to the public) |
Results and Impact | Public speaking, awareness of risk prediction methodology |
Year(s) Of Engagement Activity | 2016 |