Guidance on sample size when using observational data to develop and validate risk prediction models in health research

Lead Research Organisation: University College London
Department Name: Statistical Science

Abstract

Clinicians and health researchers often wish to predict the risk of a future health event for patients and the public. Examples of such events include in-hospital mortality following surgery, the development of cardiovascular disease and the onset of depression. Clinicians use these predictions to identify high-risk patients and plan their treatment, and to help patients make informed decisions regarding their treatment options. Policy makers often use these predictions to assess the performance of health institutions (for example hospitals or general practices) and to identify under-performing institutions.
These predictions are obtained from statistical models (risk models) based on patients' clinical and demographic characteristics. To develop such models, health researchers collect information on relevant patient characteristics and their health events. The researchers typically obtain this information from databases of patient records held at hospitals or general practices. The relationship between patient characteristics and the risk of having a health event is then quantified using a statistical model. These models are usually presented in the form of risk algorithms. These algorithms are then evaluated on a new set of patients to assess whether they are able to make reliable and accurate predictions of health events. If the performance of a model is found to be satisfactory, it can then be recommended for use in clinical practice. Examples of risk algorithms that are used in clinical practice include the QRISK2 score to predict the 10-year risk of heart attack and stroke, EuroSCORE to predict in-hospital mortality following cardiac surgery and the PREDICT score to predict the risk of developing depression.
It is often not clear how much data are required to develop a risk model that will be reliable for use in clinical practice. The objective of this research proposal is to develop guidance for model developers on the amount of data required to develop a risk tool that can be used to make reliable predictions of patients' risk.

Technical Summary

Risk prediction models are used to make predictions regarding a patient's (current or future) health based on their clinical and demographic characteristics. An example of such a model is QRISK2, which is routinely used to estimate the risk of having a heart attack or stroke in the next 10 years. Risk models are typically developed using statistical models such as logistic or Cox regression. Sample size plays a major role in ensuring that risk models are developed and validated correctly.

There has been some work in this area, but several issues remain. Simulation studies have suggested that an EPV (events per variable) ratio of at least 10 results in unbiased estimates of the regression coefficients, which should increase the chance that the resulting risk model can make reliable predictions. We intend to take a more focused approach that will also consider the accuracy of the individual predictions. We will also consider other very important risk model scenarios, including the development and validation of risk models for clustered data, the use of variable selection to develop models, and both internal and external validation. To date, there has been only limited work on sample size in these scenarios.
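
As a rough illustration of the EPV rule of thumb mentioned above, the following minimal sketch computes the EPV for a planned model and the sample size implied by a target EPV. The function names, the choice to count candidate predictor parameters in the denominator, and the example numbers are assumptions made for illustration only; this is not the guidance the project aims to produce.

```python
import math


def events_per_variable(n_events: int, n_parameters: int) -> float:
    """EPV = number of outcome events / number of candidate predictor parameters."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return n_events / n_parameters


def min_sample_size_for_epv(target_epv: float, n_parameters: int,
                            event_rate: float) -> int:
    """Smallest sample size whose expected number of events reaches the target EPV.

    event_rate is the anticipated proportion of patients who have the event
    (an assumption supplied by the analyst, e.g. from earlier studies).
    """
    if not 0 < event_rate <= 1:
        raise ValueError("event_rate must be in (0, 1]")
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    required_events = target_epv * n_parameters
    return math.ceil(required_events / event_rate)


# Hypothetical example: 8 candidate parameters, 80 observed events.
epv = events_per_variable(n_events=80, n_parameters=8)

# Hypothetical planning example: target EPV of 10, 8 parameters, 12.5% event rate.
n_required = min_sample_size_for_epv(target_epv=10, n_parameters=8, event_rate=0.125)
```

In this hypothetical example, 80 events across 8 parameters give an EPV of 10, and an anticipated event rate of 12.5% implies 640 patients to expect those 80 events. The calculation says nothing about the accuracy of individual predictions, which is one of the gaps the proposal sets out to address.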

Most of this work will be carried out using simulation based on real clinical datasets. Datasets with different characteristics (e.g. EPV, censoring, clustering) will be generated and various risk models (e.g. using variable selection, lasso) fitted to them. The performance of these risk models will be quantified in validation data using performance measures that consider the calibration and discrimination of the risk model. Analytical work will also be carried out where feasible. The primary objective of this research proposal is to develop comprehensive sample size guidance for researchers who develop and validate clinical risk models.
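
To make the two families of performance measure named above concrete, the sketch below computes one discrimination measure and one calibration measure from predicted risks in a validation set. The specific measures (the c-statistic and the observed/expected ratio), the function names and the toy data are assumptions chosen for illustration; the summary does not state which measures the project will use.

```python
from itertools import product


def c_statistic(y_true, y_prob):
    """Discrimination: proportion of (event, non-event) pairs in which the
    patient with the event has the higher predicted risk; ties count as 0.5."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for p_event, p_non_event in product(events, non_events):
        if p_event > p_non_event:
            score += 1.0
        elif p_event == p_non_event:
            score += 0.5
    return score / (len(events) * len(non_events))


def observed_expected_ratio(y_true, y_prob):
    """Calibration-in-the-large: observed number of events divided by the
    number expected under the model (the sum of the predicted risks)."""
    expected = sum(y_prob)
    if expected <= 0:
        raise ValueError("sum of predicted risks must be positive")
    return sum(y_true) / expected


# Toy validation data (made up for illustration): outcomes and predicted risks.
outcomes = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]

c_stat = c_statistic(outcomes, predicted)
oe_ratio = observed_expected_ratio(outcomes, predicted)
```

A c-statistic of 0.5 corresponds to no discrimination and 1 to perfect discrimination, while an observed/expected ratio near 1 indicates that the model neither over- nor under-predicts risk on average. In a simulation study of the kind described, measures like these would be computed repeatedly across generated datasets of different sizes.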

Planned Impact

Risk prediction models are increasingly being used by clinicians in both primary and secondary care, and also by policy makers, to predict future health outcomes of patients and the public. The main applications for these models are a) to guide clinical management of patients, b) to help patients make informed decisions about their treatment options, c) to select high-risk patients for treatment in trials and d) to compare institutional performance after adjusting for patient case-mix. Given the characteristics of patients, a risk model can be used to predict their probability of having a health outcome of interest. While sample size calculations are routinely performed in trials in health research, this is not the case in risk prediction studies. At present there are no established guidelines on the amount of data required to develop a reliable risk model for clinical use. Most risk models are developed either without any sample size considerations or using the limited guidance that is available on sample size.
Statisticians: The proposed research will carry out a detailed investigation of sample size requirements under realistic clinical scenarios where risk prediction models are used. The sample size requirements will address the performance of risk models in terms of both their predictive ability and their clinical usefulness. The proposed research will provide statisticians with practical recommendations on how to calculate sample size for risk prediction studies (prospective and using routine data) in order to produce reliable risk models. We hope to produce graphical presentations of sample sizes for different scenarios as well as software routines where appropriate.
Health Researchers: The methodological advances will also make statistical tools available to epidemiologists, public health researchers, health services researchers and other health researchers to develop reliable risk models. The methods developed in this project thus should help researchers ensure that sufficient data are available to develop a reliable risk model, and also save resources when extracting data from large routine databases to develop a risk model.
Clinicians and clinical guidelines: The proposed work should enable clinicians to make reliable risk predictions which will assist in the clinical management of their patients. Risk models are often part of clinical guidelines, for example those of the American Heart Association and the European Society of Cardiology. These clinical guidelines represent best clinical practice, combining research findings and clinical judgement to create the best possible recommendations for patient care and to assist healthcare providers in clinical decision making. The proposed research is therefore expected to have an impact there.

Policy makers: Policy makers should benefit from the use of reliable tools to evaluate institutional performance for health outcomes. Similarly, this work will help the National Institute for Health and Care Excellence (NICE) to draw up its recommendations on treatment options based on patients' risks.

The proposed work will also enable the enhancement of statistical methods used by the National Institute for Health Research (NIHR) by integrating the expertise of biostatisticians, clinicians and machine learning experts, and enable the application of the best possible methods for the health care of the public and patients.

Capacity building: The Research Associate will learn new skills beyond statistics which will enable them to carry out research prioritised by the NIHR.

Patients: Patients should benefit from more reliable results being used in their care as a result of improved methodology.

The dissemination of this work through conferences, meetings, workshops and publications will take place throughout the three-year project, and the impact should continue into the future through the publications and web materials.
