Creating longitudinal datasets for linked administrative data research using synthetic data
Lead Research Organisation: University College London
Department Name: Institute of Child Health
Abstract
Administrative data hold great potential for informing public policy. However, this potential is not yet being realised due to restrictions around data access, linkage, and privacy protection. Governance procedures and approvals lead to long timescales and tight restrictions on data access, which can jeopardise publicly funded research.
One solution is to generate synthetic data that preserve the statistical properties of the original sources, but do not correspond to any real individuals or pose privacy risks. These data could be widely shared, allowing researchers to understand the data structures, develop analysis plans and algorithms, and test out different models. This could be done in parallel to applying for access to linked administrative datasets, streamlining the research process. Final refinements and analyses would be conducted on the real data.
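As a rough illustration of this idea, the sketch below generates a synthetic two-column dataset by sequential synthesis: the first variable is resampled from its observed values, and the second is drawn from a linear model fitted to the real data. It is a toy example under stated assumptions, not the Synthpop, Simulacrum or Jomo implementation. The variables `age` and `score`, the linear model, and the simulated stand-in for the "real" data are all hypothetical, and the sketch makes no assessment of disclosure risk.

```python
import numpy as np


def synthesise_sequential(real, n_synth, rng):
    """Toy sequential synthesis for two numeric columns.

    real: array of shape (n, 2). Column 0 is resampled from its observed
    values; column 1 is drawn from a linear model fitted on column 0.
    """
    x, y = real[:, 0], real[:, 1]
    # Step 1: draw the first variable from its empirical distribution.
    x_synth = rng.choice(x, size=n_synth, replace=True)
    # Step 2: fit y = a + b * x on the real data by least squares.
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid_sd = np.std(y - design @ coef, ddof=2)
    # Step 3: generate the second variable from the fitted model plus noise.
    noise = rng.normal(0.0, resid_sd, size=n_synth)
    y_synth = coef[0] + coef[1] * x_synth + noise
    return np.column_stack([x_synth, y_synth])


rng = np.random.default_rng(0)
# Stand-in "real" data, simulated purely for illustration (hypothetical).
age = rng.integers(16, 75, size=2000).astype(float)
score = 10.0 + 0.5 * age + rng.normal(0.0, 5.0, size=2000)
real = np.column_stack([age, score])

synth = synthesise_sequential(real, n_synth=2000, rng=rng)

# The synthetic rows are new draws, but the age-score correlation
# should be close to the one in the stand-in real data.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synth, rowvar=False)[0, 1]
print(f"real corr: {real_corr:.2f}, synthetic corr: {synth_corr:.2f}")
```

Only relationships that the generating model includes are carried over to the synthetic data, so the choice of model determines which analyses the synthetic data can support.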
Our study will test the feasibility of approaches for creating synthetic linked administrative datasets. We will compare two existing methods: 'Synthpop', used to create synthetic versions of the Scottish Longitudinal Study, and 'Simulacrum', used to create synthetic versions of the National Cancer Registry, with a new approach 'Jomo', based on recent methodological developments for the imputation of missing data. We will evaluate these approaches using an exemplar of linking the third National Survey of Sexual Attitudes and Lifestyles (Natsal-3) to two administrative datasets: Hospital Episode Statistics (HES) and the National Pupil Database (NPD).
Natsal-3 is one of the largest population-based sexual behaviour surveys in the world and collected data from 15,000 participants during 2010-2012. HES contains information on attendances to all NHS hospitals in England, allowing detailed analysis of procedures and diagnoses. NPD contains information on pupils attending state schools in England, including school achievement, absences, and special educational needs. Linkage between Natsal-3, HES and NPD will provide a unique opportunity to gain a deeper understanding of the social, behavioural and biological aspects of sexual and reproductive health, and to generate evidence to inform implementation of sexual health interventions.
We will first compare different methods for generating synthetic versions of the three datasets separately (since all have different structures and characteristics), based on how well the data generated by these methods represent the original data. We will also apply for approvals to link the data, to i) explore whether there are any additional considerations needed when synthesising complex, linked data, and ii) generate synthetic versions of the linked data that can be shared with researchers more widely.
The quality and usability of synthetic data are highly dependent on the data generation model and the purpose of analysis. However, identifying all relevant variables and the possible dependencies or interactions between them is highly resource intensive. One of the challenges for synthetic data generation is therefore understanding whether there are situations in which generic versions of synthetic data may be sufficient for some purposes, or whether bespoke synthetic datasets (tailored to a specific research problem) are always required. We will explore this balance by engaging with data providers and researchers and determining the nature and practicality of the communication between the two that is required to produce acceptable outputs. We will also engage with the public to seek their views on the use of synthetic data.
Based on a set of exemplar research questions, we will generate synthetic data and compare feasibility and outputs from different approaches. To evaluate how well the synthetic data represent the real data, we will compare characteristics and statistical inferences from the synthetic data with those from the real data. Based on our findings, we will generate guidelines on the appropriate use of synthetic data.
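One way to compare statistical inferences between real and synthetic data is to fit the same model to both and measure how far the confidence intervals for each coefficient overlap. The sketch below is an assumption about how such a check could look, not the project's stated evaluation method. The data are simulated stand-ins, and the helper names `ols_with_ci` and `ci_overlap` are hypothetical.

```python
import numpy as np


def ols_with_ci(x, y, z=1.96):
    """Fit y = a + b * x by least squares; return coefficients and 95% CI bounds."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov))
    return coef, coef - z * se, coef + z * se


def ci_overlap(lo_a, hi_a, lo_b, hi_b):
    """Average share of each interval covered by the intersection.

    Equals 1 for identical intervals and is negative for disjoint ones.
    """
    inter = min(hi_a, hi_b) - max(lo_a, lo_b)
    return 0.5 * (inter / (hi_a - lo_a) + inter / (hi_b - lo_b))


rng = np.random.default_rng(1)


def simulate(n):
    # Hypothetical data-generating process used for both datasets.
    x = rng.normal(0.0, 1.0, size=n)
    y = 2.0 + 0.8 * x + rng.normal(0.0, 1.0, size=n)
    return x, y


x_real, y_real = simulate(1000)    # stand-in for the real data
x_synth, y_synth = simulate(1000)  # stand-in for the synthetic data

coef_r, lo_r, hi_r = ols_with_ci(x_real, y_real)
coef_s, lo_s, hi_s = ols_with_ci(x_synth, y_synth)

# Compare the slope (index 1) estimated from each dataset.
slope_overlap = ci_overlap(lo_r[1], hi_r[1], lo_s[1], hi_s[1])
print(f"slope (real) = {coef_r[1]:.3f}, slope (synthetic) = {coef_s[1]:.3f}")
print(f"confidence interval overlap for the slope = {slope_overlap:.2f}")
```

Values near 1 suggest that an analyst would draw similar conclusions from either dataset for that coefficient; values near or below 0 suggest the synthetic data do not support that particular analysis.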
Publications
Kokosi T (2022) An overview on synthetic administrative data for research. International Journal of Population Data Science.
Kokosi T (2022) Synthetic data in medical research. BMJ Medicine.
Wing K (2021) Medications for chronic obstructive pulmonary disease: a historical non-interventional cohort study with validation against RCT results. Health Technology Assessment (Winchester, England).
Description | In the UK, administrative data have helped us gain a picture of public service users and their needs. But administrative data contain personal and sensitive information, which must be protected so that individuals can never be identified. The approvals and governance processes associated with accessing administrative data are extremely time-consuming, which can threaten the timeliness of research. The other time-consuming part of any research study using administrative data is understanding the structure of the data and developing data cleaning and analysis plans. The time taken to conduct final analyses is often comparatively short. Research timelines could be substantially reduced if there were a way to do these preliminary tasks in parallel with applying for access to the real data on which final analyses would be conducted.
This is where synthetic (or artificial) data could help. Synthetic data are artificially generated data designed to mimic real datasets, without containing personally identifiable information. Synthetic data also have a lot of potential for capacity building. It can be difficult to recruit researchers who have experience in using administrative data, in part because it is often impossible to grant MSc students access to these datasets. If we could train students using synthetic data, we could substantially enhance the training we provide to the next generation of data scientists.
Synthetic data can:
• Facilitate easier access to data for those who are generating hypotheses and developing tools
• Prepare and train researchers for the practical challenges of working with national clinical datasets
• Be used as pilot data (instead of real data) to strengthen research applications
Being able to explore the datasets, understand what is available, and test code on the data can help streamline the research process, and enable researchers to make informed decisions and plan their research thoroughly, in a low-risk setting.
We used data from Hospital Episode Statistics (an administrative dataset capturing information on all hospitalisations in NHS hospitals in England) and Natsal-4 (the National Survey of Sexual Attitudes and Lifestyles) to evaluate two approaches to generating synthetic versions of the HES and Natsal-4 data. First, we used Synthpop, a package in R developed originally to generate synthetic versions of the Scottish Longitudinal Study. Second, we used OnePassImpute, another R package designed to deal with missing data in multi-level models. We explored barriers to wider use of synthetic data by engaging with the public, data providers, and researchers.
Research Findings
Although the idea of synthetic data was introduced around 30 years ago, it is still not widely used, and synthetic versions of administrative datasets are not routinely available. One of the main barriers to wider use of synthetic administrative data is uncertainty about the level of fidelity that is required in the data. In addition, terminology describing synthetic data varies, which makes it difficult to communicate within the research field, and with data providers and members of the public.
Research Impact
In an article published in the International Journal of Population Data Science (IJPDS), we provide a comprehensive overview of the main synthetic data generation methods in the context of UK administrative data research. We discuss the benefits and challenges, and propose simplified terms that would help data holders and data users familiarise themselves with the concepts of synthetic data. Using a consistent terminology should promote collaboration and engagement and allow effective communication of the benefits of synthetic data, to help build further acceptance and trust. Our workshop highlighted the clear value of synthetic data for a range of purposes.
It also showed that the first step to demonstrating the value of synthetic data would be to facilitate the rollout of a number of low-fidelity datasets for training, and to consolidate and validate methods for synthetic data. Demonstrating value with low-fidelity datasets, which are not resource-intensive to generate, will pave the way for high-fidelity datasets. For example, more research is needed to understand how best to replicate the longitudinal and high-dimensional nature of administrative datasets. |
Exploitation Route | ADR UK has expanded its interest in funding research in this area and has advertised for additional grants: https://www.adruk.org/news-publications/news-blogs/funding-opportunity-evaluating-the-benefits-costs-and-utility-of-synthetic-data-822/ |
Sectors | Healthcare; Government, Democracy and Justice |
Description | Through the full-day, in-person workshop on synthetic data that we organised with 55 participants, we have enhanced the discussion and knowledge about synthetic data across a range of fields. More information can be found here: https://ukhealthdata.org/wp-content/uploads/2022/08/220705-Synthetic-Data-Workshop-Outputs-June-2022-v0.2.pdf and a recording of the event can be found here: https://www.youtube.com/playlist?list=PLBI5k9SgYrIt8CTZzhw4jlAwCezBGsHJN |
First Year Of Impact | 2023 |
Sector | Healthcare; Government, Democracy and Justice |
Impact Types | Policy & public services |
Description | Pre-conference workshop at the International Population Data Linkage Network |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | Pre-conference workshop at the International Population Data Linkage Network 2022 |
Year(s) Of Engagement Activity | 2022 |
Description | Presentation at NCRM Research Methods e-Festival |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | Presented as part of a session on "What is synthetic data" |
Year(s) Of Engagement Activity | 2021 |
Description | Presentation at the International Population Data Linkage Network 2022 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | Presentation at the IPDLN conference 2022 |
Year(s) Of Engagement Activity | 2022 |
Description | Public engagement with UseMYData |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | We held a public engagement event organised by UseMYData, discussing with members of the public how synthetic data might be used to support research using administrative data. |
Year(s) Of Engagement Activity | 2021 |
Description | Synthetic data workshop |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Other audiences |
Results and Impact | We hosted 55 people to learn about a range of activities in this space from speakers who are using or generating synthetic data, to support the community, and to agree on some top priorities to focus on during the next two to three years. The aim was to better understand the challenges and opportunities for accelerating the uptake of synthetic data by bringing together researchers from diverse backgrounds. |
Year(s) Of Engagement Activity | 2022 |
URL | https://ukhealthdata.org/wp-content/uploads/2022/08/220705-Synthetic-Data-Workshop-Outputs-June-2022... |