Using linked health and administrative data to reduce bias due to missing data and measurement error in observational research
Lead Research Organisation:
University of Bristol
Department Name: Faculty of Medicine and Dentistry
Abstract
The Avon Longitudinal Study of Parents and Children (ALSPAC), also known as Children of the 90s, is a health research study. Around 14,000 pregnant women joined the study in 1990-1991 and their children, born between April 1991 and December 1992, have been followed up ever since. Information about these children (and the mothers) has been collected using postal questionnaires and through clinics held at the University of Bristol.
The main aim of ALSPAC is to identify factors which influence people's physical and mental health and development so that steps can be taken to prevent illness and improve the health and well-being of the population as a whole. To do this, scientists use the data collected in ALSPAC to estimate a "measure of effect", a measure which quantifies the likely extent of association between a particular factor and the outcome they are investigating. For example, in 2003 researchers found that the use of skin preparations containing peanut oil was associated with an almost seven-fold increase in the risk of developing peanut allergy. In observational studies like ALSPAC, particularly when data is collected over a very long period of time, it is unusual to have complete information on all the individuals in the study. Some people drop out of the study for various reasons; others do not complete every questionnaire or attend every clinic; in addition, some people may not answer a whole questionnaire or may not want certain measurements taken at a clinic. All of these scenarios result in missing data. When information is more likely to be missing for some people than others (for example, heavy smokers may be less likely to complete questions on smoking), the measure of effect may be distorted (biased). Questionnaire-based studies like ALSPAC are also prone to errors because people are asked about events that they may not completely remember. In addition, some topics on questionnaires may be sensitive for some people and they might not be completely honest - about how much they smoke, for example. Both of these issues result in something called misclassification, whereby some people may be wrongly classified as having (or not having) a particular condition - such as asthma, for example - or wrongly classified as being a light smoker when in fact they are a heavy smoker. This can also lead to biased measures of effect.
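The effect of this kind of selective missingness can be illustrated with a small simulation. The sketch below is not taken from ALSPAC: the prevalence of heavy smoking (30%) and the response rates (50% for heavy smokers, 90% for everyone else) are invented purely for illustration, and the function name is hypothetical. Under these assumed rates the prevalence among responders converges to 0.15/0.78, roughly 19%, rather than the true 30%.

```python
import random


def simulate_smoking_prevalence(n=20000, true_prevalence=0.3,
                                p_respond_heavy=0.5, p_respond_other=0.9,
                                seed=1):
    """Compare true prevalence with the prevalence seen among responders.

    All rates are illustrative assumptions, not ALSPAC estimates.
    """
    rng = random.Random(seed)
    # True heavy-smoking status for every participant.
    heavy = [rng.random() < true_prevalence for _ in range(n)]
    # Heavy smokers are assumed less likely to answer the smoking question.
    responded = [
        rng.random() < (p_respond_heavy if h else p_respond_other)
        for h in heavy
    ]
    observed = [h for h, r in zip(heavy, responded) if r]
    full_sample = sum(heavy) / n
    complete_case = sum(observed) / len(observed)
    return full_sample, complete_case


if __name__ == "__main__":
    full, cc = simulate_smoking_prevalence()
    print(f"true prevalence in full sample: {full:.3f}")
    print(f"prevalence among responders:    {cc:.3f}")
```

Because the chance of responding depends on the very quantity being measured, analysing responders alone underestimates heavy smoking; no amount of extra sample size removes this gap.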
One way of addressing these problems in studies like ALSPAC is to use comparable information from health or administrative (government) records. ALSPAC has already obtained education data from the Department for Education (DfE). In addition, the Project to Enhance ALSPAC through Record Linkage (PEARL) has been set up to obtain data on ALSPAC participants from the following records: health, benefits and earnings, criminal convictions and cautions, plus further and higher education. PEARL is currently investigating how to use the data obtained from these sources to enhance the existing ALSPAC data as well as looking at the feasibility of using such data to provide future information on health and other outcomes.
In this project I will build on the work of PEARL by investigating particular measures - smoking, IQ, and teenage depression - in depth, investigating missing data and misclassification and devising ways in which administrative and health data can be used to overcome these issues, both in ALSPAC and in similar studies. In particular, I will look at whether linked health and education data can be used to understand whether particular people are more likely to have missing information on smoking, IQ or depression. I will also investigate whether the linked data can be used to "fill in" missing information in the ALSPAC data. In addition, by comparing self-reported smoking and depression to equivalent information in the GP records I will assess how accurate the self-reported data is likely to be and what influence this may have on results based on these measures.
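A comparison of self-reported data with GP records could, for example, be summarised with sensitivity, specificity and Cohen's kappa. The sketch below is a minimal illustration under stated assumptions: the function name and the toy data are hypothetical, and the GP record is treated as the reference standard only for convenience, since in practice neither source is error-free.

```python
def compare_to_reference(self_report, reference):
    """Agreement between binary self-report and a binary reference record.

    Both inputs are equal-length sequences of 0/1 values. Treating the
    reference (e.g. a GP record) as correct is an assumption of this sketch.
    """
    pairs = list(zip(self_report, reference))
    n = len(pairs)
    tp = sum(1 for s, r in pairs if s and r)          # both positive
    tn = sum(1 for s, r in pairs if not s and not r)  # both negative
    fp = sum(1 for s, r in pairs if s and not r)      # self-report only
    fn = sum(1 for s, r in pairs if not s and r)      # record only

    observed_agreement = (tp + tn) / n
    p_self = (tp + fp) / n
    p_ref = (tp + fn) / n
    # Agreement expected by chance if the two sources were independent.
    chance_agreement = p_self * p_ref + (1 - p_self) * (1 - p_ref)

    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": (observed_agreement - chance_agreement)
                 / (1 - chance_agreement),
    }


if __name__ == "__main__":
    # Toy data: 1 = reported/recorded as a smoker, 0 = not.
    self_report = [1, 1, 0, 0, 1, 0, 0, 0]
    gp_record = [1, 1, 1, 0, 0, 0, 0, 0]
    print(compare_to_reference(self_report, gp_record))
```

Low sensitivity would suggest under-reporting in the questionnaire relative to the record, while kappa gives a chance-corrected summary that does not require either source to be designated as correct.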
Technical Summary
Aim
To examine how linked health and administrative data can be used to avoid bias in cohort studies, using the Avon Longitudinal Study of Parents and Children (ALSPAC) as an exemplar.
Objectives
1. To develop methods for using linked health and administrative data to examine patterns of missing data and model missingness mechanisms in ALSPAC.
2. To incorporate linked health and administrative data in multiple imputation models.
3. To compare data in ALSPAC to equivalent outcomes recorded in linked electronic primary care records to investigate measurement error.
4. To develop methods to use both linked data and self-reported data to minimise the impact of measurement error on analyses.
5. To devise and modify existing algorithms for defining depression using electronic GP data and to use this information to estimate the prevalence of depression among ALSPAC teenagers.
Methodology
ALSPAC is a prospective cohort study. Around 14,000 pregnant women were recruited into the study during 1990-1991. Follow-up is ongoing; data have been collected primarily via questionnaires and clinics held at the University of Bristol. Educational data have also been obtained via linkage to the National Pupil Database, and the Project to Enhance ALSPAC through Record Linkage (PEARL) has linked, or is currently linking, to other datasets, including electronic patient (GP) records. GP data will be analysed in a safe setting, and relevant statistical methods, including simulations and multiple imputation, will be used as appropriate.
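To make the idea of a simulation combined with multiple imputation concrete, the sketch below shows one possible set-up; it is not the project's actual analysis. All numbers are invented: a linked proxy (standing in for an educational attainment score) is available for everyone, the outcome (standing in for IQ) is more often missing when the proxy is low, and missing outcomes are imputed from a linear regression on the proxy. Refitting the model on a bootstrap sample for each imputation is used here as a simple approximation to drawing the regression parameters from their posterior, and the imputed datasets are pooled with Rubin's rules. The function names are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual SD)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, (rss / (len(x) - 2)) ** 0.5


def impute_with_proxy(n=5000, m=20, seed=2):
    """Estimate a mean outcome by complete-case analysis and by imputation."""
    rng = random.Random(seed)
    # Linked proxy, assumed observed for every participant.
    proxy = [rng.gauss(0, 1) for _ in range(n)]
    # Outcome correlated with the proxy (illustrative coefficients).
    outcome = [100 + 10 * z + rng.gauss(0, 8) for z in proxy]
    # Outcome is more likely to be observed when the proxy is high.
    seen = [rng.random() < (0.9 if z > 0 else 0.4) for z in proxy]
    obs = [i for i in range(n) if seen[i]]
    complete_case = statistics.mean([outcome[i] for i in obs])

    estimates, within = [], []
    for _ in range(m):
        # Bootstrap the observed cases so that uncertainty in the
        # imputation model carries through to the imputed values.
        boot = [rng.choice(obs) for _ in obs]
        a, b, sd = fit_line([proxy[i] for i in boot],
                            [outcome[i] for i in boot])
        filled = [
            outcome[i] if seen[i] else a + b * proxy[i] + rng.gauss(0, sd)
            for i in range(n)
        ]
        estimates.append(statistics.mean(filled))
        within.append(statistics.variance(filled) / n)

    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    pooled = statistics.mean(estimates)
    total_var = (statistics.mean(within)
                 + (1 + 1 / m) * statistics.variance(estimates))
    return {
        "full_sample_mean": statistics.mean(outcome),
        "complete_case_mean": complete_case,
        "imputed_mean": pooled,
        "imputed_se": total_var ** 0.5,
    }


if __name__ == "__main__":
    for name, value in impute_with_proxy().items():
        print(f"{name}: {value:.2f}")
```

In this set-up the complete-case mean is pulled upwards because participants with low proxy values are under-represented, whereas the imputed estimate should sit close to the full-sample mean. That holds only because the imputation model matches the way the data were generated, which cannot be assumed in real data.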
Scientific/medical opportunities
To draw valid conclusions from observational research, selection and measurement bias need to be quantified and their impact minimised. The proposed research will address this by using linked health and education data to examine misclassification and missingness mechanisms in ALSPAC (a large observational study) and develop ways in which linked data can be used to reduce bias.
Planned Impact
The aim of this fellowship is to understand how linked health and administrative data can be used to understand and reduce bias in observational studies. Missing data are inevitable in observational studies and methods are currently being developed to support inferences made in these studies. The work proposed here will explore biases introduced by missing data and measurement error in observational studies and investigate ways in which these different biases can be overcome using linked data. It will also develop methods for combining self-reported and linked data. This work will benefit others using observational data and, specifically, those working on longitudinal studies that have already collected, or plan to collect, data via linkage.
As this research is methodological, it is unlikely to have a direct effect on population health. However, the exemplar questions being addressed will contribute to our understanding of the relationship between early life exposures (breastfeeding, prenatal exposure to smoking) and cognitive and behavioural outcomes in adolescence. More importantly, ALSPAC is - and will continue to be - an important resource for carrying out research that will impact on our understanding of many areas of human health and development. The findings of the proposed work will influence how future analyses are carried out and, specifically, will help to ensure that the available data - both linked and self-reported - are used in such a way as to minimise the potential for bias. This will be particularly the case for the variables investigated as part of this work but is likely to apply to other outcomes. Thus, it is anticipated that the work proposed in this application will influence research and thus impact on the NHS and the wider public in the longer term.
Full details are given in the attached "pathways to impact".
People
Rosaleen Peggy Cornish (Principal Investigator / Fellow)
Publications
Cornish R
(2015)
Using linkage to electronic primary care records to evaluate recruitment and nonresponse bias in the Avon Longitudinal Study of Parents and Children.
in Epidemiology (Cambridge, Mass.)
Cornish RP
(2022)
Complete case logistic regression with a dichotomised continuous outcome led to biased estimates
in Journal of Clinical Epidemiology
Cornish RP
(2015)
Using linked educational attainment data to reduce bias due to missing outcome data in estimates of the association between the duration of breastfeeding and IQ at 15 years.
in International journal of epidemiology
Cornish RP
(2021)
Factors associated with participation over time in the Avon Longitudinal Study of Parents and Children: a study using linked education and primary care data.
in International journal of epidemiology
Cornish RP
(2017)
Multiple imputation using linked proxy outcome data resulted in important bias reduction and efficiency gains: a simulation study.
in Emerging themes in epidemiology
Cornish RP
(2023)
Complete case logistic regression with a dichotomised continuous outcome led to biased estimates.
in Journal of clinical epidemiology
John A
(2016)
Case-finding for common mental disorders of anxiety and depression in primary care: an external validation of routinely collected data.
in BMC medical informatics and decision making
Lee KJ
(2021)
Framework for the treatment and reporting of missing data in observational studies: The Treatment And Reporting of Missing data in Observational Studies framework.
in Journal of clinical epidemiology
Teyhan A
(2016)
The impact of cycle proficiency training on cycle-related behaviours and accidents in adolescence: findings from ALSPAC, a UK longitudinal cohort.
in BMC public health
Description | Development of miDOC: an expert system and methodology for multiple imputation |
Amount | £321,633 (GBP) |
Funding ID | MR/V020641/1 |
Organisation | Medical Research Council (MRC) |
Sector | Public |
Country | United Kingdom |
Start | 09/2021 |
End | 03/2024 |
Description | Home Office / Administrative Data Research UK feasibility study |
Amount | £79,124 (GBP) |
Organisation | Economic and Social Research Council |
Sector | Public |
Country | United Kingdom |
Start | 03/2020 |
End | 09/2020 |
Description | LONGITUDINAL ADMINISTRATIVE DATA SPINE SCOPING PROJECT GRANT FOR THE SPF UK POPULATION LAB WAVE I |
Amount | £236,901 (GBP) |
Funding ID | ES/S016732/1 |
Organisation | Economic and Social Research Council |
Sector | Public |
Country | United Kingdom |
Start | 12/2018 |
End | 03/2021 |
Description | Mental health and incontinence |
Amount | £525,115 (GBP) |
Funding ID | MR/V033581/1 |
Organisation | Medical Research Council (MRC) |
Sector | Public |
Country | United Kingdom |
Start | 02/2022 |
End | 11/2024 |
Description | Understanding non-response in young people in Understanding Society |
Amount | £43,370 (GBP) |
Organisation | University of Essex |
Sector | Academic/University |
Country | United Kingdom |
Start | 05/2023 |
End | 05/2024 |
Description | Framework for treatment and reporting of missing data |
Organisation | Murdoch Children's Research Institute |
Country | Australia |
Sector | Academic/University |
PI Contribution | Co-authored publication |
Collaborator Contribution | Co-authored publication |
Impact | Publication in the Journal of Clinical Epidemiology: Framework for the treatment and reporting of missing data in observational studies: The Treatment And Reporting of Missing data in Observational Studies framework |
Start Year | 2019 |
Description | Multiple imputation using linked proxy: simulation study |
Organisation | London School of Hygiene and Tropical Medicine (LSHTM) |
Department | Department of Medical Statistics |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | We were the main investigators, carried out the statistical analysis and led the writing up. |
Collaborator Contribution | They contributed to writing up the work for publication. |
Impact | Published paper: Cornish RP, Macleod J, Carpenter JR, Tilling K. Multiple imputation using linked proxy outcome data resulted in important bias reduction and efficiency gains: a simulation study. Emerging Themes in Epidemiology 2017; 14:14. doi:10.1186/s12982-017-0068-0 |
Start Year | 2015 |