Missing data imputation in clinical databases: development of a longitudinal model for cardiovascular risk factors

Lead Research Organisation: University College London
Department Name: Primary Care and Population Sciences


Clinical databases which use information that health care professionals record in their day-to-day work are rich sources of data for health research. The databases offer many opportunities for research that would be expensive and difficult to address using standard types of research study. They have already proved to be valuable resources for research into heart disease and stroke. One drawback of using many of these clinical databases, however, has been the large amount of missing data on some important factors such as smoking, alcohol consumption, blood pressure and weight. New statistical methods (multiple imputation) have been developed which can potentially help with this problem. These methods are difficult to apply, however, because they assume that data are missing at random, that is, that whether a value is missing is not related to other characteristics of the person seeing their doctor. This is frequently not the case in clinical databases, as doctors record information on someone more often when it is relevant to a disease they have. For example, smoking history or blood pressure is more likely to be recorded in someone who has had a heart attack. Data recording is improving, however, and more recent data are likely to be more complete.

This study aims to use innovative statistical techniques, called a "forward-backward" algorithm, to make "imputations", or estimations, of missing data. These will allow for changes in the quality of data recording over time and for changes in a person's risk factors over time, for example their smoking status. We will use more recent data as a starting point because they are more complete. Information from other research studies, for example on how many people give up smoking per year by age group and sex, will also be used to make more accurate estimations of data that are missing in earlier years. This will allow researchers to undertake more accurate research using these databases on heart disease, stroke and many other areas, such as diabetes, obesity or lung diseases.

Technical Summary

Several UK primary care databases have emerged as powerful data sources for research into cardiovascular disease. One drawback of using these databases, however, has been the high proportion of missing data on cardiovascular health indicators such as smoking, alcohol consumption, blood pressure and body mass index. While multiple imputation of missing data has been increasingly used in epidemiology, it has proved difficult in the settings of primary care databases. Cross-sectional imputation models cannot generally be applied due to the intrinsically irregular and dynamic structure of the data.

Our preliminary work on The Health Improvement Network (THIN) demonstrated that not all data on health indicators were missing at random (MAR). For example, patients with cardiovascular diseases were more likely to have indicators recorded than the general population, and prevalence of non-smokers was much lower than in other cohorts and surveys representative of the UK population.
Since April 2007, sustained moves have occurred towards recording these data at more regular intervals for all primary care patients. A substantial subset of patients with at least one recent record now exists, and these data are now closer to MAR than at any time before. This provides a key foothold for creating valid multiple imputed datasets.
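The effect of selective recording described above can be illustrated with a small simulation. This is a minimal sketch only: the prevalence, disease and recording probabilities are invented for illustration and are not estimates from THIN, and the function name `simulate` is hypothetical. It shows that when patients with cardiovascular disease are more likely to have smoking recorded, the recorded subset over-represents smokers (and so under-represents non-smokers) relative to the whole population.

```python
import random


def simulate(n, seed=0):
    """Return (true smoker prevalence, prevalence among patients with a smoking record).

    All probabilities below are illustrative assumptions, not values from THIN.
    """
    rng = random.Random(seed)
    smokers = 0
    recorded = 0
    recorded_smokers = 0
    for _ in range(n):
        smoker = rng.random() < 0.25                       # assumed true prevalence
        cvd = rng.random() < (0.30 if smoker else 0.10)    # smokers assumed at higher CVD risk
        has_record = rng.random() < (0.90 if cvd else 0.30)  # CVD patients recorded more often
        smokers += smoker
        if has_record:
            recorded += 1
            recorded_smokers += smoker
    return smokers / n, recorded_smokers / recorded


true_prev, recorded_prev = simulate(100_000, seed=1)
print(f"true prevalence: {true_prev:.3f}, among recorded only: {recorded_prev:.3f}")
```

In this sketch recording depends only on disease status, which is itself observed, so the example demonstrates that the recorded subset is unrepresentative rather than settling which formal missingness mechanism applies to the real data.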

The aim of this project is to implement and evaluate imputation algorithms for missing data on cardiovascular health indicators, which take into account the dynamic and longitudinal structure of large primary care databases.
The objectives are to:
1) Document the patterns and structure of missing data over time in THIN.
2) Develop imputation algorithms to multiply impute the missing health indicator data, taking account of the specific features and timing of the data recording.
3) Evaluate and cross validate the results of imputed data and apply imputed data in our ongoing research projects. In particular, we will appraise the suitability of the imputed data for construction of a plausible cardiovascular risk score.

Once the imputation algorithms have been developed and evaluated for cardiovascular disease they will be easily adaptable for other related clinical areas. Imputed datasets will be made available to other researchers for use in their own analyses; i.e. neither detailed knowledge of the imputation algorithm nor creation of further imputed datasets will be necessary. This project could also potentially transfer knowledge and experience gained to other clinical databases. Thus this methodological research will benefit a whole range of epidemiological and health service research.
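Researchers who receive several imputed datasets would typically fit the same analysis to each one and then combine the results. The summary does not say how results are to be combined; the sketch below assumes the standard approach known as Rubin's rules, and the function name `pool_rubin` is hypothetical. The pooled estimate is the mean of the per-dataset estimates, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by (1 + 1/m), where m is the number of imputed datasets.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-dataset estimates and their variances using Rubin's rules.

    estimates: one point estimate per imputed dataset
    variances: the squared standard error of each estimate
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputed datasets")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Usage with made-up numbers from three imputed datasets
estimate, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, total_var)
```

The between-imputation term is what carries the extra uncertainty due to the missing data, which is why analysing a single imputed dataset would understate standard errors.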


Description Grant/National School for Primary Care Research
Amount £96,000 (GBP)
Organisation National Institute for Health Research 
Department School for Primary Care Research
Sector Academic/University
Country United Kingdom
Start 01/2011 
End 01/2012
Description Programme grant
Amount £1,980,000 (GBP)
Organisation National Institute for Health Research 
Sector Public
Country United Kingdom
Start 01/2011 
End 12/2016
Description Programme grant
Amount £2,000,000 (GBP)
Organisation National Institute for Health Research 
Sector Public
Country United Kingdom
Start 03/2011 
End 02/2016
Title Twofold FCS algorithm for multiple imputation of missing data 
Description This method allows multiple imputation of longitudinal records. It has been implemented in Stata.
Type Of Material Improvements to research infrastructure 
Year Produced 2012 
Provided To Others? Yes  
Impact This method allows researchers to make full use of longitudinal records in electronic health records. The details of the method are now published in the Stata Journal and the code is available via Stata.
Description Imputation of missing data in clinical databases 
Organisation London School of Hygiene and Tropical Medicine (LSHTM)
Department Faculty of Epidemiology and Population Health
Country United Kingdom 
Sector Academic/University 
PI Contribution I am leading this project
Collaborator Contribution The research assistant on this project has registered for a PhD, and a senior member of staff at LSHTM is co-supervisor for this project.
Impact A series of publications is underway from this collaboration. In 2010 we published an initial paper on missing data in primary care databases. This work is the result of multidisciplinary teamwork involving general practitioners, statisticians and epidemiologists.
Start Year 2009