Creating longitudinal datasets for linked administrative data research using synthetic data

Lead Research Organisation: University College London
Department Name: Institute of Child Health

Abstract

Administrative data hold great potential for informing public policy. However, this potential is not yet being realised due to restrictions around data access, linkage, and privacy protection. Governance procedures and approvals lead to long timescales and tight restrictions on data access, which can jeopardise publicly funded research.

One solution is to generate synthetic data that preserve the statistical properties of the original sources, but do not correspond to any real individuals or pose privacy risks. These data could be widely shared, allowing researchers to understand the data structures, develop analysis plans and algorithms, and test out different models. This could be done in parallel to applying for access to linked administrative datasets, streamlining the research process. Final refinements and analyses would be conducted on the real data.
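As a rough illustration of the idea above, the sketch below fits a simple model to a toy "real" dataset and then draws new records from that model rather than copying real ones. It is a minimal example under stated assumptions, not the implementation used by any of the tools in this project: the variables `real_age` and `real_score`, the linear model, and the helper names `fit_linear` and `synthesise` are all hypothetical, and the "real" data are simulated here.

```python
import random
import statistics


def fit_linear(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual SD)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd


def synthesise(xs, ys, n, rng):
    """Sequentially synthesise (x, y) pairs.

    x is resampled from its observed values; y is then drawn from the
    fitted model for y given x, so no (x, y) pair is copied from the source.
    """
    a, b, sd = fit_linear(xs, ys)
    syn_x = [rng.choice(xs) for _ in range(n)]
    syn_y = [a + b * x + rng.gauss(0, sd) for x in syn_x]
    return syn_x, syn_y


# Toy stand-in for a confidential source (simulated, not real records).
rng = random.Random(2022)
real_age = [rng.uniform(16, 74) for _ in range(2000)]
real_score = [2.0 + 0.5 * a + rng.gauss(0, 3) for a in real_age]

syn_age, syn_score = synthesise(real_age, real_score, n=2000, rng=rng)

# The synthetic mean should be close to the real one if the model fits well.
print(round(statistics.fmean(real_score), 2), round(statistics.fmean(syn_score), 2))
```

Note that resampling the first variable reuses observed values; for identifying variables a real synthesiser would need to smooth or model that marginal distribution too.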

Our study will test the feasibility of approaches for creating synthetic linked administrative datasets. We will compare two existing methods: 'Synthpop', used to create synthetic versions of the Scottish Longitudinal Study, and 'Simulacrum', used to create synthetic versions of the National Cancer Registry, with a new approach 'Jomo', based on recent methodological developments for the imputation of missing data. We will evaluate these approaches using an exemplar of linking the third National Survey of Sexual Attitudes and Lifestyles (Natsal-3) to two administrative datasets: Hospital Episode Statistics (HES) and the National Pupil Database (NPD).

Natsal-3 is one of the largest population-based sexual behaviour surveys in the world and collected data from 15,000 participants during 2010-2012. HES contains information on attendances at all NHS hospitals in England, allowing detailed analysis of procedures and diagnoses. NPD contains information on pupils attending state schools in England, including school achievement, absences, and special educational needs. Linkage between Natsal-3, HES and NPD will provide a unique opportunity to gain a deeper understanding of the social, behavioural and biological aspects of sexual and reproductive health, and to generate evidence to inform implementation of sexual health interventions.

We will first compare different methods for generating synthetic versions of the three datasets separately (since all have different structures and characteristics), based on how well the data generated by these methods represent the original data. We will also apply for approvals to link the data, to i) explore whether there are any additional considerations needed when synthesising complex, linked data, and ii) generate synthetic versions of the linked data that can be shared with researchers more widely.

The quality and usability of synthetic data are highly dependent on the data generation model and the purpose of analysis. However, identifying all relevant variables and the possible dependencies or interactions between them is highly resource intensive. One of the challenges for synthetic data generation is therefore understanding whether generic versions of synthetic data may be sufficient for some purposes, or whether bespoke synthetic datasets (tailored to a specific research problem) are always required. We will explore this balance by engaging with data providers and researchers, and by determining the nature and practicality of the communication between the two that is required to produce acceptable outputs. We will also engage with the public to seek their views on the use of synthetic data.

Based on a set of exemplar research questions, we will generate synthetic data and compare feasibility and outputs from different approaches. To evaluate how well the synthetic data represent the real data, we will compare characteristics and statistical inferences from the synthetic data with those from the real data. Based on our findings, we will generate guidelines on the appropriate use of synthetic data.
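One way to make the comparison of statistical inferences concrete is an interval-overlap measure: estimate the same quantity on the real and synthetic data, then measure how much the two 95% confidence intervals overlap. The sketch below is illustrative only and is not stated in this abstract as the project's chosen metric; the helper names `mean_ci` and `ci_overlap`, the normal-approximation interval, and the simulated values are all assumptions.

```python
import random
import statistics


def mean_ci(values, z=1.96):
    """Normal-approximation 95% confidence interval for the mean."""
    m = statistics.fmean(values)
    se = statistics.stdev(values) / len(values) ** 0.5
    return m - z * se, m + z * se


def ci_overlap(real_ci, syn_ci):
    """Average proportion of each interval covered by their intersection.

    Returns 1.0 for identical intervals and a negative value when the
    intervals do not overlap at all.
    """
    lo = max(real_ci[0], syn_ci[0])
    hi = min(real_ci[1], syn_ci[1])
    width = hi - lo  # negative when the intervals are disjoint
    return 0.5 * (width / (real_ci[1] - real_ci[0])
                  + width / (syn_ci[1] - syn_ci[0]))


# Simulated stand-ins for one variable in the real and synthetic datasets.
rng = random.Random(1)
real_values = [rng.gauss(50, 10) for _ in range(1000)]
syn_values = [rng.gauss(50, 10) for _ in range(1000)]

overlap = ci_overlap(mean_ci(real_values), mean_ci(syn_values))
print(round(overlap, 3))
```

Values near 1 suggest that an analysis of the synthetic data would lead to a similar conclusion for that estimate; the same function can be applied to intervals for regression coefficients.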

Publications

Kokosi T (2022) Synthetic data in medical research in BMJ Medicine

Kokosi T (2022) An overview on synthetic administrative data for research in International Journal of Population Data Science

 
Description: Pre-conference workshop at the International Population Data Linkage Network
Form Of Engagement Activity: Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach: International
Primary Audience: Other audiences
Results and Impact: Pre-conference workshop at the International Population Data Linkage Network 2022
Year(s) Of Engagement Activity: 2022

Description: Presentation at NCRM Research Methods e-Festival
Form Of Engagement Activity: A talk or presentation
Part Of Official Scheme? No
Geographic Reach: International
Primary Audience: Other audiences
Results and Impact: Presented as part of a session on "What is synthetic data"
Year(s) Of Engagement Activity: 2021

Description: Presentation at the International Population Data Linkage Network 2022
Form Of Engagement Activity: A talk or presentation
Part Of Official Scheme? No
Geographic Reach: International
Primary Audience: Other audiences
Results and Impact: Presentation at the IPDLN conference 2022
Year(s) Of Engagement Activity: 2022

Description: Public engagement with UseMYData
Form Of Engagement Activity: Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach: National
Primary Audience: Public/other audiences
Results and Impact: We held a public engagement event organised by UseMYData, discussing with members of the public how synthetic data might be used to support research using administrative data.
Year(s) Of Engagement Activity: 2021

Description: Synthetic data workshop
Form Of Engagement Activity: Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach: National
Primary Audience: Other audiences
Results and Impact: We hosted 55 people, who heard from speakers using or generating synthetic data about a range of activities in this area. The aims were to support the community and to agree on top priorities for the next two to three years. The outcome was a better understanding of the challenges and opportunities for accelerating the uptake of synthetic data, achieved by bringing together researchers from diverse backgrounds.
Year(s) Of Engagement Activity: 2022
URL: https://ukhealthdata.org/wp-content/uploads/2022/08/220705-Synthetic-Data-Workshop-Outputs-June-2022...