Creating longitudinal datasets for linked administrative data research using synthetic data

Lead Research Organisation: University College London

Department Name: Institute of Child Health

Abstract

Administrative data hold great potential for informing public policy. However, this potential is not yet being realised due to restrictions around data access, linkage, and privacy protection. Governance procedures and approvals lead to long timescales and tight restrictions on data access, which can jeopardise publicly funded research.

One solution is to generate synthetic data that preserve the statistical properties of the original sources, but do not correspond to any real individuals or pose privacy risks. These data could be widely shared, allowing researchers to understand the data structures, develop analysis plans and algorithms, and test out different models. This could be done in parallel to applying for access to linked administrative datasets, streamlining the research process. Final refinements and analyses would be conducted on the real data.

Our study will test the feasibility of approaches for creating synthetic linked administrative datasets. We will compare two existing methods: 'Synthpop', used to create synthetic versions of the Scottish Longitudinal Study, and 'Simulacrum', used to create synthetic versions of the National Cancer Registry, with a new approach 'Jomo', based on recent methodological developments for the imputation of missing data. We will evaluate these approaches using an exemplar of linking the third National Survey of Sexual Attitudes and Lifestyles (Natsal-3) to two administrative datasets: Hospital Episode Statistics (HES) and the National Pupil Database (NPD).

Natsal-3 is one of the largest population-based sexual behaviour surveys in the world and collected data from 15,000 participants during 2010-2012. HES contains information on attendances to all NHS hospitals in England, allowing detailed analysis of procedures and diagnoses. NPD contains information on pupils attending state schools in England, including school achievement, absences, and special educational needs. Linkage between Natsal-3, HES and NPD will provide a unique opportunity to gain a deeper understanding of the social, behavioural and biological aspects of sexual and reproductive health, and to generate evidence to inform implementation of sexual health interventions.

We will first compare different methods for generating synthetic versions of the three datasets separately (since all have different structures and characteristics), based on how well the data generated by these methods represent the original data. We will also apply for approvals to link the data, to i) explore whether there are any additional considerations needed when synthesising complex, linked data, and ii) generate synthetic versions of the linked data that can be shared with researchers more widely.

The quality and usability of synthetic data is highly dependent on the data generation model and the purpose of analysis. However, identifying all relevant variables and possible dependencies or interactions between these is highly resource intensive. One of the challenges for synthetic data generation is therefore understanding whether there are situations in which generic versions of synthetic data may be sufficient for some purposes, or whether bespoke synthetic datasets (tailored to a specific research problem) are always required. We will explore this balance by engaging with data providers and researchers and determining the nature and practicality of communication between the two that is required to produce acceptable outputs. We will also engage with the public to seek their views on the use of synthetic data.

Based on a set of exemplar research questions, we will generate synthetic data and compare feasibility and outputs from different approaches. To evaluate how well the synthetic data represent the real data, we will compare characteristics and statistical inferences from the synthetic data with those from the real data. Based on our findings, we will generate guidelines on the appropriate use of synthetic data.
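The comparison described above can be illustrated with a minimal sketch. It is not the project's method and not the Synthpop, Simulacrum or Jomo implementation: it assumes a toy two-variable dataset (invented here), synthesises it sequentially from a fitted normal distribution and a fitted linear regression, and then compares one characteristic (the outcome mean) and one inference (the regression slope) between the real and synthetic data. All names and values are hypothetical.

```python
import random
import statistics


def slope_intercept(xs, ys):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def synthesise(xs, ys, rng):
    """Sequential parametric synthesis (illustrative assumption):
    draw x from a normal fitted to the real x, then draw y from a
    linear regression of y on x fitted to the real data."""
    slope, intercept = slope_intercept(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = statistics.pstdev(residuals)
    mx, sx = statistics.fmean(xs), statistics.pstdev(xs)
    syn_x = [rng.gauss(mx, sx) for _ in xs]
    syn_y = [intercept + slope * x + rng.gauss(0, resid_sd) for x in syn_x]
    return syn_x, syn_y


rng = random.Random(2022)

# Toy "real" data, invented for this sketch: a covariate and an outcome.
real_x = [rng.gauss(40, 10) for _ in range(5000)]
real_y = [2.0 + 0.5 * x + rng.gauss(0, 3) for x in real_x]

syn_x, syn_y = synthesise(real_x, real_y, rng)

# Compare a characteristic (outcome mean) and an inference (slope).
real_slope, _ = slope_intercept(real_x, real_y)
syn_slope, _ = slope_intercept(syn_x, syn_y)
mean_gap = abs(statistics.fmean(real_y) - statistics.fmean(syn_y))
slope_gap = abs(real_slope - syn_slope)
print(f"mean gap = {mean_gap:.3f}, slope gap = {slope_gap:.3f}")
```

Small gaps suggest the synthetic data support similar conclusions for this particular analysis; a different analysis (for example, one involving an interaction the synthesis model did not include) could give a different picture.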

Funded Value:

£161,389

Funded Period:

Jan 21 - Jul 22

Funder:

ESRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

ES/V005448/1

Principal Investigator:

Katie Harron

Research Subject:

Tools, technologies & methods (96%)

Research Topic:

Social Stats., Comp. & Methods (96%)

Organisations

University College London (Lead Research Organisation)

People	ORCID iD
Katie Harron (Principal Investigator)
Andrew Copas (Co-Investigator)
James Carpenter (Co-Investigator)
Pam Sonnenberg (Co-Investigator)
Bianca De Stavola (Co-Investigator)

Publications

Kokosi T (2022) An overview on synthetic administrative data for research. in International Journal of Population Data Science

Kokosi T (2022) Synthetic data in medical research. in BMJ medicine

Wing K (2021) Medications for chronic obstructive pulmonary disease: a historical non-interventional cohort study with validation against RCT results. in Health technology assessment (Winchester, England)

Key Findings
Impact Summary
Engagement Activities


Description	In the UK, administrative data have helped us gain a picture of public service users and their needs. But administrative data contain personal and sensitive information, which it is important to protect so that individuals can never be identified. However, the approvals and governance processes associated with accessing administrative data are extremely time-consuming, which can threaten the timeliness of research. The other time-consuming part of any research study using administrative data is understanding the structure of the data, and developing data cleaning and analysis plans. The time taken conducting final analyses is often comparatively short. Research timelines could be substantially reduced if there were a way to do these preliminary tasks in parallel to applying for access to the real data on which final analyses would be conducted.
This is where synthetic (or artificial) data could help. Synthetic data are artificially generated data designed to mimic real datasets, without containing personally identifiable information. Synthetic data also have a lot of potential for capacity building. It can be difficult to recruit researchers who have experience in using administrative data, in part because it is often impossible to grant access to these datasets to MSc students. If we could train students using synthetic data, we could really enhance the training we can provide to the next generation of data scientists.
Synthetic data can:
• Facilitate easier access to data for those who are generating hypotheses and developing tools
• Prepare and train researchers for the practical challenges of working with national clinical datasets
• Be used as pilot data (instead of real data) to strengthen research applications
Being able to explore the datasets, understand what is available, and test code on the data can help streamline the research process, and enable researchers to make informed decisions and plan their research thoroughly, in a low-risk setting.
We used data from Hospital Episode Statistics (an administrative dataset capturing information on all hospitalisations in NHS hospitals in England) and Natsal-4 (the National Survey of Sexual Attitudes and Lifestyles) to evaluate two different approaches to generating synthetic versions of the HES and Natsal-4 data. First, we used Synthpop, a package in R developed originally to generate synthetic versions of the Scottish Longitudinal Study. Second, we used OnePassImpute, another R package designed to deal with missing data in multi-level models. We explored barriers to wider use of synthetic data by engaging with the public, data providers, and researchers.
Research Findings
Although the idea of synthetic data was introduced around 30 years ago, it is still not widely used, and synthetic versions of administrative datasets are not routinely available. One of the main barriers to wider use of synthetic administrative data is uncertainty about the level of fidelity that is required in the data. In addition, terminology describing synthetic data varies, which makes it difficult to communicate within the research field, and with data providers and members of the public.
Research Impact
In an article published in the International Journal of Population Data Science (IJPDS), we provide a comprehensive overview of the main synthetic data generation methods in the context of UK administrative data research. We discuss the benefits and challenges, and propose simplified terms that would help data holders and data users familiarise themselves with the concepts of synthetic data. Using a consistent terminology should promote collaboration and engagement and allow effective communication of the benefits of synthetic data, to help build further acceptance and trust.
Our workshop highlighted the clear value of synthetic data for a range of purposes. It also showed that the first step to demonstrating the value of synthetic data would be to facilitate the rollout of a number of low fidelity datasets for training, and to consolidate and validate methods for synthetic data. Demonstrating value with low fidelity datasets, which are not resource-intensive to generate, will pave the way for high fidelity datasets. For example, more research is needed to understand how best to replicate the longitudinal and high-dimensional nature of administrative datasets.
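As a rough illustration of why low fidelity datasets are cheap to produce, the sketch below assumes one possible reading of "low fidelity": each column is resampled independently, so the range and distribution of each variable are kept but relationships between variables are not. This definition, the toy data and all names are assumptions made for the example, not the project's procedure; in practice the values themselves would also need disclosure checks (for example, for rare or extreme values).

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def low_fidelity(columns, rng):
    """Resample every column independently, with replacement.
    Marginal distributions are roughly preserved; any relationship
    between columns is deliberately broken."""
    return {
        name: rng.choices(values, k=len(values))
        for name, values in columns.items()
    }


rng = random.Random(7)

# Toy "real" data, invented for this sketch: two related variables.
age = [rng.gauss(40, 10) for _ in range(5000)]
score = [2.0 + 0.5 * a + rng.gauss(0, 3) for a in age]
real = {"age": age, "score": score}

low = low_fidelity(real, rng)

real_corr = pearson(real["age"], real["score"])
low_corr = pearson(low["age"], low["score"])
print(f"real correlation = {real_corr:.2f}, low fidelity correlation = {low_corr:.2f}")
```

A dataset like this may be enough for learning a file layout or testing code, but not for estimating associations, which is where higher fidelity synthesis would be needed.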
Exploitation Route	ADR UK have expanded their interest in funding research in this area and have advertised for additional grants: https://www.adruk.org/news-publications/news-blogs/funding-opportunity-evaluating-the-benefits-costs-and-utility-of-synthetic-data-822/
Sectors	Healthcare; Government, Democracy and Justice


Description	Through the full-day, in-person workshop on synthetic data that we organised with 55 participants, we have enhanced the discussion and knowledge about synthetic data across a range of fields. More information can be found here: https://ukhealthdata.org/wp-content/uploads/2022/08/220705-Synthetic-Data-Workshop-Outputs-June-2022-v0.2.pdf and a recording of the event can be found here: https://www.youtube.com/playlist?list=PLBI5k9SgYrIt8CTZzhw4jlAwCezBGsHJN
First Year Of Impact	2023
Sector	Healthcare; Government, Democracy and Justice
Impact Types	Policy & public services


Description	Pre-conference workshop at the International Population Data Linkage Network
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Pre-conference workshop at the International Population Data Linkage Network 2022
Year(s) Of Engagement Activity	2022


Description	Presentation at NCRM Research Methods e-Festival
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Presented as part of a session on "What is synthetic data"
Year(s) Of Engagement Activity	2021


Description	Presentation at the International Population Data Linkage Network 2022
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Presentation at the IPDLN conference 2022
Year(s) Of Engagement Activity	2022


Description	Public engagement with UseMYData
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	We held a public engagement event organised by UseMYData, discussing with members of the public how synthetic data might be used to support research using administrative data.
Year(s) Of Engagement Activity	2021


Description	Synthetic data workshop
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Other audiences
Results and Impact	We hosted 55 people to learn about a range of activities in this space from speakers who are using or generating synthetic data, to support the community, and to agree on some top priorities to focus on during the next two to three years. The aim was to better understand the challenges and opportunities for accelerating the uptake of synthetic data, by bringing together researchers from diverse backgrounds.
Year(s) Of Engagement Activity	2022
URL	https://ukhealthdata.org/wp-content/uploads/2022/08/220705-Synthetic-Data-Workshop-Outputs-June-2022...
