An investigation into the use of shrinkage methods to alleviate over-fitting of prognostic models for independent and clustered data with few events

Lead Research Organisation: UNIVERSITY COLLEGE LONDON

Department Name: Statistical Science

Abstract

Clinicians, health service researchers and epidemiologists often wish to predict a future health outcome for patients and the public. Examples of such outcomes include development of coronary heart disease, the occurrence of in-hospital mortality following surgery, and the onset of depression. These predictions are used by clinicians to determine the prognosis of patients and plan their treatment, to detect high-risk patients, and to provide information to patients, enabling them to make decisions about their treatment options. Policy makers often use these predictions to assess the performance of hospitals and general practices and to identify underperforming institutions.

Statistical models using patients' clinical and demographic characteristics are typically used to make these predictions. These models are referred to as prognostic models. To develop such models, information is collected on patients or relevant subjects regarding their risk factors and the health outcome they experienced. The relationship between the risk factors and the outcome is quantified using a statistical model, which can then be used to make predictions for new patients. Models are usually presented in the form of a risk algorithm. This algorithm is then tested on new patients to ensure that it makes reliable predictions. If its performance is found to be satisfactory, it is recommended for use by clinicians in practice. Examples of risk algorithms used in practice include the Framingham risk score to predict the 10-year risk of coronary heart disease, EuroSCORE to predict in-hospital mortality following cardiac surgery, and the PREDICT score to predict the risk of developing depression. When the health outcome of interest is rare, it is often difficult to develop a risk algorithm that will both predict risk accurately and be able to classify patients into high- and low-risk groups. This is a common problem in health research and is often not alleviated by collecting patient data from many centres, or over a long period of time. A further statistical problem occurs with data from many centres, as there may be variability in the outcomes between the centres.

Robust models exist for relatively common events such as coronary heart disease, in-hospital mortality following cardiac surgery, and depression. However, reliable prognostic models are scarce, or not available, for rarer health outcomes, for example death or recurrence following diagnosis of rare types of cancer, and the onset of Parkinson's disease. There is a similar problem when trying to develop prognostic models for common events in relatively small subgroups of people, for example, a model to predict coronary heart disease in people who have severe mental health problems.

Some methodological research has been done to handle the problem of fitting statistical models for rare outcomes in genetic studies. However, limited work has been done to develop methods to produce reliable prognostic models with rare outcomes in clinical settings such as public health and health services research. Moreover, the methods that have been developed to date are not used routinely because of a lack of software and of adequate evaluation. There are currently no guidelines regarding how statisticians and other researchers should use these methods in practice. The proposed research will evaluate the existing statistical methodology that is available to handle risk predictions when the health outcome of interest is rare, and will develop new methods where necessary. The proposed research will make recommendations regarding the use of these methods in practice. Additionally, the methods developed in this research project will be implemented in widely available statistical software to enable their routine use. The prognostic models developed using these methods should enable clinicians and policy makers to make predictions about patients' health outcomes in these settings, even if the outcome is rare.

Technical Summary

Prognostic models are increasingly used by clinicians and policy makers to predict health outcomes for the public and patients. These models are used to guide the clinical management of patients, help patients make informed decisions about their treatment and compare institutional performances after adjusting for patient case-mix. However, accurate and reliable prognostic models can be difficult to develop if the disease or event of interest is rare. Model overfitting is a problem in this situation, and is typically handled using variable selection approaches based on P-values. However, variable selection has problems, which are exacerbated in sparse data, including model instability in terms of the selected predictors. Additional complexity arises with multi-centre data, as patients within a centre are more likely to be similar to each other than to patients in other centres, leading to clustered (correlated) data.

Penalised maximum likelihood methods, which apply shrinkage to regression coefficients, have been proposed to address the model overfitting problem. However, these methods have not been adequately evaluated for scenarios that typically occur in the clinical areas of public health, epidemiology and health services research. Moreover, little work has been done to develop shrinkage methods for clustered data. It is important to conduct research to identify appropriate methods that enable the development of reliable prognostic models in the clinical settings described above. This research will conduct a thorough investigation of the performance of existing methods for binary and survival health outcomes through the evaluation of their theoretical properties and performance in simulation studies. The overarching aim is to make recommendations to health researchers regarding the use of appropriate methods in prognostic modelling studies in these settings, and to facilitate their use by providing an overview of available software and developing routines where required.
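
As an illustration of what applying shrinkage to regression coefficients means, the following is a minimal sketch of one penalised maximum likelihood method, ridge-penalised logistic regression, fitted to simulated data with few events. It is not the project's own software. The function name `fit_ridge_logistic`, the simulated coefficients, the sample size and the penalty value `lam=10.0` are all assumptions chosen for illustration; in practice the penalty would usually be tuned, for example by cross-validation.

```python
import numpy as np


def fit_ridge_logistic(X, y, lam, n_iter=50, tol=1e-8):
    """Maximise loglik(beta) - (lam / 2) * sum(beta_j ** 2) by Newton-Raphson.

    The intercept is left unpenalised; lam = 0 gives ordinary maximum likelihood.
    (Hypothetical helper written for this sketch.)
    """
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])  # prepend intercept column
    penalty = np.eye(p + 1)
    penalty[0, 0] = 0.0  # do not shrink the intercept
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(design @ beta)))  # predicted risks
        grad = design.T @ (y - mu) - lam * (penalty @ beta)
        weights = mu * (1.0 - mu)
        hess = design.T @ (design * weights[:, None]) + lam * penalty
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated development data with a rare binary outcome (assumed values).
rng = np.random.default_rng(2013)
n = 300
X = rng.normal(size=(n, 4))
true_beta = np.array([-2.5, 0.8, -0.5, 0.3, 0.0])  # intercept, then 4 slopes
risk = 1.0 / (1.0 + np.exp(-(true_beta[0] + X @ true_beta[1:])))
y = rng.binomial(1, risk).astype(float)

beta_mle = fit_ridge_logistic(X, y, lam=0.0)   # no shrinkage
beta_pen = fit_ridge_logistic(X, y, lam=10.0)  # ridge shrinkage

print("events:", int(y.sum()), "of", n)
print("unpenalised slopes:", np.round(beta_mle[1:], 3))
print("penalised slopes:  ", np.round(beta_pen[1:], 3))
```

The penalised slopes are pulled towards zero relative to the unpenalised ones, which is the mechanism intended to reduce overfitting when events are few. Other penalties, such as the lasso, replace the squared term with a different function of the coefficients.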

Planned Impact

Prognostic models are increasingly being used by clinicians and policy makers to predict future health outcomes of patients and the public. For example, the Framingham model has been developed to predict the 10-year risk of coronary heart disease, the PREDICT score to predict the onset of depression, and the Ambler/Omar model to predict the risk of in-hospital mortality following heart valve surgery. These predictions are usually made using patients' clinical and demographic characteristics. The main applications for these models are to guide clinical management of patients, to help patients make informed decisions about their treatment options and to compare institutional performances after adjusting for patients' case-mix.

Robust models exist for the relatively common events described above. However, reliable prognostic models are scarce, or not available, for rarer health outcomes such as death or recurrence following diagnosis of rare types of cancer, mechanical failure of artificial heart valves, and the onset of Parkinson's disease. There is a similar problem when trying to develop prognostic models for common events in relatively small subgroups of people, for example predicting coronary heart disease in people who have severe mental health problems. Prognostic models developed in these scenarios are often overfitted and unable to make accurate predictions for future patients.

Statistical shrinkage methods have been developed to handle the problem of overfitting in genetic studies where the number of predictors often greatly exceeds the number of patients. However, the focus of these types of study is often on the identification of key predictors, rather than on the development of prognostic models for routine clinical use. In contrast, these shrinkage methods are rarely used in prognostic modelling studies in public health, health services research and epidemiology, where the model overfitting problem still exists even though the number of coefficients is usually lower than both the sample size and the number of events.

The proposed research will conduct a comprehensive evaluation of existing shrinkage methods and corresponding software, and develop new methods and software where necessary. This will form a basis from which practical recommendations can be made to biostatisticians regarding the development of prognostic models with such data. This will also make statistical tools available to epidemiologists, public health researchers, health services researchers and other health researchers to develop reliable prognostic models for diseases with rare outcomes, using studies specifically designed for that purpose instead of using routine data. This should enable clinicians to make reliable risk predictions in these clinical areas, assisting in the clinical management of their patients, and should also benefit patients suffering from such diseases in making decisions about their treatment options. Additionally, policy makers will not be restricted to evaluating institutional performances for common health outcomes only.

The researchers in this team, in particular the postdoctoral research fellow (RF), will learn new statistical methodology and theory and how to design and conduct simulation studies. They will learn how to use and develop new software routines. The use of real clinical datasets will give the theoretical statisticians and the RF the opportunity to develop skills in interpreting results from health studies and applying statistical methods in practice, thus creating scope for future collaboration between them and health researchers. This should enable capacity building in biostatistics. The RF will develop skills in writing scientific papers. The project will also provide the opportunity to enhance the statistical methods used in NIHR research by integrating the expertise of biostatisticians and theoretical statisticians and applying the best possible methods for the health care of the public and patients.

Funded Value:

£300,445

Funded Period:

Feb 13 - Jan 16

Funder:

MRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

MR/J013692/1

Principal Investigator:

Rumana Omar

Health Category:

Unclassified

Organisations

People	ORCID iD
Rumana Omar (Principal Investigator)
Trevor Sweeting (Co-Investigator)
Shaun Seaman (Co-Investigator)
Gareth Ambler (Co-Investigator)

Publications

Pavlou M (2016) Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events. in Statistics in medicine

Pavlou M (2015) Use of Bayesian shrinkage for risk prediction in clustered data with few events

Pavlou M (2015) How to develop a more accurate risk prediction model when there are few events. in BMJ (Clinical research ed.)

Pavlou M (2015) A note on obtaining correct marginal predictions from a random intercepts model for binary outcomes. in BMC medical research methodology

Policy Influence
Further Funding
Research Databases and Models
Collaboration
Engagement Activities


Guideline Title	European Society of Cardiology Guidelines
Description	HCM model and AF model
Geographic Reach	Multiple continents/international
Policy Influence Type	Citation in clinical guidelines
Impact	Our risk model is widely used in clinical practice and helps improve decision making on implanting devices in patients to prevent sudden cardiac death.


Description	MRC Methodology
Amount	£300,000 (GBP)
Funding ID	MR/P015190/1
Organisation	Medical Research Council (MRC)
Sector	Public
Country	United Kingdom
Start	03/2018
End	02/2021


Title	Risk model for HCM and AF
Description	A novel clinical risk prediction model for sudden cardiac death in hypertrophic cardiomyopathy; prediction of thrombo-embolic risk in patients with hypertrophic cardiomyopathy (HCM Risk-CVA).
Type Of Material	Computer model/algorithm
Year Produced	2014
Provided To Others?	Yes
Impact	It is included in clinical guidelines and used in clinical practice.


Description	Clinical risk Models
Organisation	Monaldi Hospital, Second University of Naples
Country	Italy
Sector	Hospitals
PI Contribution	We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution	Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact	Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year	2013


Description	Clinical risk Models
Organisation	National and Kapodistrian University of Athens
Country	Greece
Sector	Academic/University
PI Contribution	We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution	Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact	Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year	2013


Description	Clinical risk Models
Organisation	University College London
Department	Medical School
Country	United Kingdom
Sector	Academic/University
PI Contribution	We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution	Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact	Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year	2013


Description	Clinical risk Models
Organisation	University Hospital Plzen
Country	Czech Republic
Sector	Hospitals
PI Contribution	We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution	Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact	Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year	2013


Description	Clinical risk Models
Organisation	University of A Coruña
Department	Institute of Biomedical Research of A Coruña
Country	Spain
Sector	Academic/University
PI Contribution	We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution	Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact	Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year	2013


Description	Clinical risk Models
Organisation	University of Bologna
Country	Italy
Sector	Academic/University
PI Contribution	We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution	Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact	Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year	2013


Description	Clinical risk Models
Organisation	Virgen de la Arrixaca University Hospital
Country	Spain
Sector	Hospitals
PI Contribution	We have used our methods to develop a risk model to predict the thromboembolic risk in patients with hypertrophic cardiomyopathy.
Collaborator Contribution	Provided real data from several European Cardiac Centres. Constructed a risk model for a real clinical scenario.
Impact	Guttmann O, Pavlou M, Omar RZ, Elliott P (2015). Prediction of thromboembolic risk in patients with hypertrophic cardiomyopathy (HCMRisk-CVA). European Journal of Heart Failure. 17(8):837-45. doi: 10.1002/ejhf.316. Epub 2015 Jul 16.
Start Year	2013


Description	NICOR
Organisation	University College London
Department	National Institute for Cardiovascular Outcomes Research (NICOR)
Country	United Kingdom
Sector	Academic/University
PI Contribution	Rumana Omar and Gareth Ambler will provide senior statistical support to develop and validate risk models for cardiovascular outcomes and carry out research on risk modelling methodology for the National Institute for Cardiovascular Outcomes Research (NICOR). NICOR collects clinical information from UK hospitals into secure registries established by the cardiovascular specialist societies. They help the NHS, the government and regulatory bodies improve quality of care by checking that the care received by heart disease patients meets good practice standards. RO will be a member of their strategic board and GA a member of their methodological advisory board.
Collaborator Contribution	They will provide data and facilitate the impact pathway.
Impact	No outputs yet. It involved clinicians, statisticians, data scientists and audit methodologists.
Start Year	2017


Description	SMIRP
Organisation	University College London
Department	Department of Statistical Science
Country	United Kingdom
Sector	Academic/University
PI Contribution	We have formed a collaborative group with researchers working on risk prediction models, from the MRC Biostatistics Unit, Cambridge, Oxford University and UCL. The contributions from this group are in the form of feedback on presentations from ongoing research.
Collaborator Contribution	The contributions are in the form of feedback on presentations from ongoing research.
Impact	Research presentation meetings held at the MRC Biostatistics Unit and the MRC Clinical Trials Unit in April, July, and October 2013.
Start Year	2013


Description	SMIRP
Organisation	University of Cambridge
Department	MRC Biostatistics Unit
Country	United Kingdom
Sector	Academic/University
PI Contribution	We have formed a collaborative group with researchers working on risk prediction models, from the MRC Biostatistics Unit, Cambridge, Oxford University and UCL. The contributions from this group are in the form of feedback on presentations from ongoing research.
Collaborator Contribution	The contributions are in the form of feedback on presentations from ongoing research.
Impact	Research presentation meetings held at the MRC Biostatistics Unit and the MRC Clinical Trials Unit in April, July, and October 2013.
Start Year	2013


Description	Public speaking
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Media (as a channel to the public)
Results and Impact	Public speaking, awareness of risk prediction methodology
Year(s) Of Engagement Activity	2016