Statistical methods for developing, assessing and validating risk prediction models in multiple epidemiological studies

Lead Research Organisation: University of Cambridge
Department Name: Institute of Public Health

Abstract

The identification of predictors of disease has led to considerable advances in clinical practice over the last century. Many markers, such as systolic blood pressure and blood cholesterol levels, are now routinely measured on individuals to predict conditions such as heart disease. Levels of markers that indicate an individual is at high risk of disease can then be modified by suitable interventions to help prevent disease onset and/or progression. However, there has been much controversy regarding the proportion of risk actually explained by such markers and the generalisability of risk markers between different populations. To enable a more comprehensive and powerful evaluation of the relevance of such predictors, it is often necessary to pool data from different studies. I plan to advance the development of statistical methods for use in such data pooling approaches by working on detailed information previously collated on up to 40,000 cases of heart disease from over 1 million participants in 104 studies. The main aim is to develop statistical methods that will enable more reliable conclusions to be drawn about (i) the relationship between predictors and disease; (ii) the predictive ability that can be attributed to risk models; and (iii) the generalisability of risk models to different populations. The methods developed will have applications to many different situations and diseases, and will become increasingly important as the trend continues towards data sharing and pooling in large, collaborative multi-centre studies.

Technical Summary

An increasing number of biological markers are being proposed as important predictors of chronic diseases (eg, heart attacks and strokes). Reliably identifying risk predictors of disease can have important scientific and public health implications (exemplified by the utility of measurement and modification of blood cholesterol values in the prevention of heart attacks). To enable a more comprehensive and powerful evaluation of the relevance of such markers, it is often necessary to pool data from different studies, which has motivated the 1.1-million-participant, 104-cohort Emerging Risk Factors Collaboration (ERFC). Optimum biostatistical methods are needed to help maximize the value of such databases. Specifically, it is important to (i) account for differences in the distribution of the predictors across studies (ie, between-study heterogeneity); (ii) deal with missing predictor values; and (iii) appropriately combine data and results from different study designs (eg, cohort, case-cohort and nested case-control studies). This proposal seeks support to address these unresolved statistical issues in relation to the development, assessment and validation of risk prediction models in multiple epidemiological studies.
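To illustrate the kind of between-study heterogeneity described above, the sketch below pools hypothetical study-level estimates (eg, log hazard ratios) using the standard DerSimonian-Laird random-effects method, in which a method-of-moments estimate of the between-study variance inflates each study's weight denominator. This is a minimal, illustrative implementation, not the analysis planned for the ERFC; the input values are invented.

```python
# Minimal sketch of DerSimonian-Laird random-effects pooling of
# study-level estimates, allowing for between-study heterogeneity.
# Inputs are hypothetical; real analyses would use richer models.

def dersimonian_laird(estimates, variances):
    """Pool per-study estimates under a random-effects model.

    Returns (pooled_estimate, pooled_variance, tau_squared), where
    tau_squared is the estimated between-study variance.
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]                  # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    # Cochran's Q statistic measures observed heterogeneity
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    # Method-of-moments estimate of between-study variance, floored at 0
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add tau^2 to each within-study variance
    w_re = [1.0 / (v + tau2) for v in variances]
    sw_re = sum(w_re)
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sw_re
    return pooled, 1.0 / sw_re, tau2


# Hypothetical log hazard ratios and variances from three cohorts
pooled, pooled_var, tau2 = dersimonian_laird([0.30, 0.50, 0.70],
                                             [0.02, 0.03, 0.02])
```

When the study estimates agree closely, tau-squared shrinks to zero and the result reduces to a fixed-effect (inverse-variance) pool; as heterogeneity grows, the weights equalise across studies.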

Risk prediction models should ideally allow for non-linear relationships between predictors and disease, joint effects of predictors and other common functions of predictors, such as ratios. Through implementation in the ERFC database and through simulation, we will develop models that incorporate these features, for example using flexible fractional polynomial models, whilst overcoming the statistical limitations listed above. As well as determining relevant predictors of disease, it is important to identify the prognostic ability of the risk prediction model. No one measure of prognostic ability is widely accepted for general use. We will compare and investigate the use of several measures, such as the AUROC, C-index and Royston's D, as well as measures which summarise risk reclassification tables, in relation to multiple studies. It is also important to validate a risk prediction model to assess its generalisability to other populations. Multiple studies should allow an optimum validation procedure, and we will consider using both internal validation (splitting individuals within cohorts) and external validation procedures (splitting cohorts). Throughout, we propose to incorporate between-study heterogeneity via random effects, use multiple imputation to deal with missing values and combine estimates from different study designs in a meta-analysis approach.
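As a concrete instance of the discrimination measures mentioned above, the sketch below computes the C-index: the proportion of usable (case, non-case) pairs in which the case received the higher predicted risk, with ties counted as half. For binary outcomes this coincides with the AUROC. The data are hypothetical, and this simple form ignores censoring, which survival versions of the C-index must handle.

```python
# Minimal sketch of the C-index (concordance statistic), one of the
# measures of prognostic ability discussed above. For a binary
# outcome this equals the area under the ROC curve (AUROC).

def c_index(risks, events):
    """Concordance between predicted risks and binary outcomes.

    Each pair of one case (event = 1) and one non-case (event = 0)
    scores 1 if the case has the higher predicted risk, 0.5 on ties.
    """
    concordant, pairs = 0.0, 0
    for ri, ei in zip(risks, events):
        for rj, ej in zip(risks, events):
            if ei == 1 and ej == 0:          # usable case/non-case pair
                pairs += 1
                if ri > rj:
                    concordant += 1.0
                elif ri == rj:
                    concordant += 0.5
    return concordant / pairs


# Hypothetical predicted risks from a fitted model, with outcomes
c = c_index([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # perfect separation
```

A value of 1 indicates perfect discrimination, 0.5 no better than chance; comparing such measures across multiple cohorts, and summarising them with the reclassification-based alternatives, is part of the work proposed.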

The products of this proposed work will have rapid application, initially to the ERFC and then to other existing data pooling initiatives. Our findings should become increasingly useful as the trend continues towards data pooling in large, collaborative multi-centre analyses.
