Developing synthetic data methods for large confidential administrative databases

Lead Research Organisation: Lancaster University
Department Name: Mathematics and Statistics

Abstract

Social scientists need access to high-quality data for research, which has traditionally come from large surveys. Surveys are costly, so there has been a shift towards making routinely collected administrative data more available to researchers. The government's open data access policy has also led to an initiative to make the administrative data held by government departments more widely available. These databases typically contain a large number of records with potentially sensitive information, and access to them is severely restricted. This has prompted investigation of ways to improve access to government administrative databases without compromising confidentiality.
Synthetic data is an increasingly popular approach to this problem. The approach replaces the data with synthetic values drawn from a statistical model fitted to the original data. This is typically done multiple times to generate multiple synthetic data sets. Because the data now comprise only synthetic values, confidentiality should be protected and, provided a plausible model has been used, statistical properties should be preserved. Synthetic data would allow researchers to test their methodology on a synthetic version before analysing the original data. This project will develop synthetic data methods for administrative databases, with the potential to produce more accessible synthetic versions.
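As a minimal sketch of the general idea described above (fit a model to the original data, then draw several synthetic data sets from it), the following Python example synthesises a toy two-variable table. Everything here is an illustrative assumption rather than the project's method: the variables (age, income), the toy "confidential" records, the normal and linear-regression models, and the function names fit_model, synthesise and make_synthetic_datasets are all hypothetical. The sketch replaces every value, so it corresponds to fully synthetic data; it also ignores uncertainty in the fitted parameters, which a proper synthesiser would typically account for.

```python
import random
import statistics


def fit_model(records):
    """Fit x ~ Normal(mu_x, sd_x) and y | x ~ Normal(a + b*x, resid_sd).

    This simple sequential model is an assumption for illustration only.
    """
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    mu_x, sd_x = statistics.mean(xs), statistics.stdev(xs)
    mu_y = statistics.mean(ys)
    # Ordinary least squares for the conditional model of y given x.
    sxx = sum((x - mu_x) ** 2 for x in xs)
    sxy = sum((x - mu_x) * (y - mu_y) for x, y in records)
    slope = sxy / sxx
    intercept = mu_y - slope * mu_x
    residuals = [y - (intercept + slope * x) for x, y in records]
    resid_sd = (sum(r * r for r in residuals) / (len(records) - 2)) ** 0.5
    return {"mu_x": mu_x, "sd_x": sd_x, "intercept": intercept,
            "slope": slope, "resid_sd": resid_sd}


def synthesise(model, n, rng):
    """Draw n synthetic (x, y) records from the fitted model."""
    out = []
    for _ in range(n):
        x = rng.gauss(model["mu_x"], model["sd_x"])
        y = rng.gauss(model["intercept"] + model["slope"] * x,
                      model["resid_sd"])
        out.append((x, y))
    return out


def make_synthetic_datasets(records, m=5, seed=0):
    """Fit the model once, then generate m synthetic data sets."""
    rng = random.Random(seed)
    model = fit_model(records)
    return [synthesise(model, len(records), rng) for _ in range(m)]


# Toy stand-in for a confidential table of (age, income) records.
_toy_rng = random.Random(42)
original = []
for _ in range(500):
    age = _toy_rng.gauss(45, 10)
    income = 20000 + 500 * age + _toy_rng.gauss(0, 3000)
    original.append((age, income))

synthetic_sets = make_synthetic_datasets(original, m=5)
for i, data in enumerate(synthetic_sets, start=1):
    mean_age = statistics.mean(x for x, _ in data)
    print(f"synthetic set {i}: mean age {mean_age:.1f}")
```

Each synthetic set contains no original record, yet its summary statistics should lie close to those of the original data as long as the assumed model is plausible. An analyst could run the same analysis on each of the m sets and combine the results.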
The use of partially synthetic data for statistical disclosure control (SDC) has been increasing in recent years. Several synthetic data products have been developed in the US, such as the Survey of Income and Program Participation (https://ecommons.cornell.edu/handle/1813/43924) and the Longitudinal Business Database (Kinney et al., International Statistical Review, 2011: 79(3)). Interest is also growing in Europe, where the IAB in Germany is investigating synthetic data to protect the German Establishment Survey (Drechsler and Reiter, Journal of Official Statistics, 2009: 25(4)). There is relatively little activity in producing synthetic data in the UK, the one exception being a project considering methods for synthesising longitudinal data (https://sls.lscs.ac.uk/projects/view/2013_012/). To date, there has been no substantive work on generating synthetic administrative databases in the UK.

Publications


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
ES/P000665/1                                   01/10/2017  30/09/2027
2203901            Studentship   ES/P000665/1  01/10/2019  30/09/2022  James Jackson