Missing data imputation in clinical databases: development of a longitudinal model for cardiovascular risk factors
Lead Research Organisation:
University College London
Department Name: Primary Care and Population Sciences
Abstract
Clinical databases that use information recorded by health care professionals in their day-to-day work are rich sources of data for health research. The databases offer many opportunities for research that would be expensive and difficult to address using standard types of research study. They have already proved to be valuable resources for research into heart disease and stroke. One drawback of using many of these clinical databases, however, has been the large amount of missing data on some important factors such as smoking, alcohol consumption, blood pressure and weight. New statistical methods (multiple imputation) have been developed which can potentially help with this problem. These methods are difficult to use here, however, because they assume that data are missing randomly, that is, in a way that is not related to other characteristics of the person seeing their doctor. This is frequently not the case in clinical databases, as doctors record information more often on someone when it is relevant to a disease they have. For example, smoking history or blood pressure is more likely to be recorded in someone who has had a heart attack. Data recording is improving, however, and more recent data are likely to be more complete.
This study aims to use an innovative statistical technique, called a "forward-backward" algorithm, to make "imputations", or estimations, of missing data while allowing for changes over time both in the quality of data recording and in a person's risk factors, for example their smoking status. We will use more recent data as a starting point because they are more complete. Information from other research studies, for example on how many people give up smoking per year by age group and sex, will also be used to make more accurate estimates of data that are missing in earlier years. This will allow researchers to undertake more accurate research using these databases on heart disease, stroke and many other areas, such as diabetes, obesity or lung diseases.
Technical Summary
Several UK primary care databases have emerged as powerful data sources for research into cardiovascular disease. One drawback of using these databases, however, has been the high proportion of missing data on cardiovascular health indicators such as smoking, alcohol consumption, blood pressure and body mass index. While multiple imputation of missing data has been increasingly used in epidemiology, it has proved difficult in the settings of primary care databases. Cross-sectional imputation models cannot generally be applied due to the intrinsically irregular and dynamic structure of the data.
Our preliminary work on The Health Improvement Network (THIN) demonstrated that not all data on health indicators were missing at random (MAR). For example, patients with cardiovascular diseases were more likely to have indicators recorded than the general population, and prevalence of non-smokers was much lower than in other cohorts and surveys representative of the UK population.
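The effect of selective recording on an unadjusted estimate can be illustrated with a small worked calculation. This is only a sketch: every figure below (group sizes, smoking rates, recording rates) is hypothetical and is not taken from THIN, and recording is assumed to depend on disease status alone, which is a simplification of the patterns described above.

```python
# Hypothetical figures, chosen only to illustrate the mechanism (not THIN data).
GROUPS = [
    {"name": "cvd", "size": 100, "smoker_rate": 0.40, "record_rate": 0.90},
    {"name": "no_cvd", "size": 900, "smoker_rate": 0.20, "record_rate": 0.50},
]


def true_smoker_prevalence(groups):
    """Prevalence of smoking in the whole population."""
    smokers = sum(g["size"] * g["smoker_rate"] for g in groups)
    total = sum(g["size"] for g in groups)
    return smokers / total


def recorded_smoker_prevalence(groups):
    """Prevalence of smoking among patients whose status is recorded."""
    recorded_smokers = sum(
        g["size"] * g["smoker_rate"] * g["record_rate"] for g in groups
    )
    recorded_total = sum(g["size"] * g["record_rate"] for g in groups)
    return recorded_smokers / recorded_total


true_p = true_smoker_prevalence(GROUPS)
recorded_p = recorded_smoker_prevalence(GROUPS)
print(f"true prevalence:     {true_p:.3f}")
print(f"recorded prevalence: {recorded_p:.3f}")
```

Under these assumed figures the recorded subset contains proportionally more smokers than the whole population, so an analysis restricted to recorded patients would understate the share of non-smokers.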
Since April 2007, sustained moves have occurred towards recording these data at more regular intervals for all primary care patients. A substantial subset of patients with at least one recent record now exists, and these data are now closer to MAR than at any time before. This provides a key foothold for creating valid multiple imputed datasets.
The aim of this project is to implement and evaluate imputation algorithms for missing data on cardiovascular health indicators, which take into account the dynamic and longitudinal structure of large primary care databases.
The objectives are to:
1) Document the patterns and structure of missing data over time in THIN.
2) Develop imputation algorithms to multiply impute the missing health indicator data, taking account of the specific features and timing of the data recording.
3) Evaluate and cross validate the results of imputed data and apply imputed data in our ongoing research projects. In particular, we will appraise the suitability of the imputed data for construction of a plausible cardiovascular risk score.
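As a rough illustration of objective 1, the proportion of patients with no recorded value can be tabulated by calendar year. The sketch below is not the project's code: the record layout (patient identifier, year, systolic blood pressure) and the values are hypothetical, and a real analysis would also need to account for registration periods and irregular consultation dates.

```python
from collections import defaultdict

# Hypothetical records: (patient_id, year, systolic blood pressure or None).
RECORDS = [
    (1, 2005, None), (1, 2006, 132), (1, 2007, 128),
    (2, 2005, None), (2, 2006, None), (2, 2007, 141),
    (3, 2005, 150), (3, 2006, None), (3, 2007, 147),
]


def proportion_missing_by_year(records):
    """Return {year: proportion of patient-years with no recorded value}."""
    counts = defaultdict(lambda: [0, 0])  # year -> [missing, total]
    for _patient_id, year, value in records:
        counts[year][1] += 1
        if value is None:
            counts[year][0] += 1
    return {
        year: missing / total
        for year, (missing, total) in sorted(counts.items())
    }


missing_by_year = proportion_missing_by_year(RECORDS)
for year, proportion in missing_by_year.items():
    print(year, f"{proportion:.2f}")
```

A table of this kind, produced separately for each health indicator, gives a first view of how completeness changes over time before any imputation model is specified.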
Once the imputation algorithms have been developed and evaluated for cardiovascular disease they will be easily adaptable for other related clinical areas. Imputed datasets will be made available to other researchers for use in their own analyses; i.e. neither detailed knowledge of the imputation algorithm nor creation of further imputed datasets will be necessary. This project could also potentially transfer knowledge and experience gained to other clinical databases. Thus this methodological research will benefit a whole range of epidemiological and health service research.
Publications

Falcaro M
(2015)
Estimating excess hazard ratios and net survival when covariate data are missing: strategies for multiple imputation.
in Epidemiology (Cambridge, Mass.)

Fardet L
(2011)
Prevalence of long-term oral glucocorticoid prescriptions in the UK over the past 20 years
in Rheumatology

Fardet L
(2012)
Suicidal behavior and severe neuropsychiatric disorders following glucocorticoid therapy in primary care.
in The American journal of psychiatry

Fardet L
(2011)
[Description of oral glucocorticoid prescriptions in general population].
in La Revue de medecine interne

Fardet L
(2012)
Risk of cardiovascular events in people prescribed glucocorticoids with iatrogenic Cushing's syndrome: cohort study.
in BMJ (Clinical research ed.)

Hardoon S
(2013)
Recording of severe mental illness in United Kingdom primary care, 2000-2010.
in PloS one

Hardoon SL
(2011)
Trends in longer-term survival following an acute myocardial infarction and prescribing of evidenced-based medications in primary care in the UK from 1991: a longitudinal population-based study.
in Journal of epidemiology and community health


Horsfall L
(2013)
Identifying periods of acceptable computer usage in primary care research databases.
in Pharmacoepidemiology and drug safety

Horsfall LJ
(2011)
Serum bilirubin and risk of respiratory disease and death.
in JAMA
Description | Grant/National School for Primary Care Research |
Amount | £96,000 (GBP) |
Organisation | National Institute for Health Research |
Department | School for Primary Care Research |
Sector | Academic/University |
Country | United Kingdom |
Start | 01/2011 |
End | 01/2012 |
Description | Programme grant |
Amount | £1,980,000 (GBP) |
Organisation | National Institute for Health Research |
Sector | Public |
Country | United Kingdom |
Start | 01/2011 |
End | 12/2016 |
Description | Programme grant |
Amount | £2,000,000 (GBP) |
Organisation | National Institute for Health Research |
Sector | Public |
Country | United Kingdom |
Start | 03/2011 |
End | 02/2016 |
Title | Twofold FCS algorithm for multiple imputation of missing data |
Description | This method allows multiple imputation of longitudinal records. It has been implemented in Stata. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2012 |
Provided To Others? | Yes |
Impact | This method allows researchers to make full use of longitudinal records in electronic health records. The details of the method are now published in the Stata Journal and the code is available via Stata. |
Description | Imputation of missing data in clinical databases |
Organisation | London School of Hygiene and Tropical Medicine (LSHTM) |
Department | Faculty of Epidemiology and Population Health |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | I am leading this project |
Collaborator Contribution | The research assistant on this project has registered for a PhD, and a senior member of staff at LSHTM is co-supervisor for this project. |
Impact | A series of publications is under way from this collaboration. In 2010 we published an initial paper on missing data in primary care databases. This work is the result of multi-disciplinary teamwork. The team involves general practitioners, statisticians and epidemiologists. |
Start Year | 2009 |