Enhancing Data Accessibility and Security through Innovative Data Synthesis (EDASIDA).

Lead Research Organisation: The University of Manchester
Department Name: Social Sciences

Abstract

The proposal outlines a project geared towards revolutionizing data accessibility and security through innovative data synthesis techniques. We first highlight one bottleneck in the data discovery process: the scarcity of good teaching datasets, particularly for data that sit in virtual research environments, where access restrictions impede their creation. This creates a discoverability challenge for new users, who are unable to explore data before going through an approval process, which raises barriers to entry.

While synthetic data is a potential solution, concerns about risk and utility exist. Data services often grapple with assessing the disclosure risk associated with synthetic data, because it falls outside the scope of conventional output disclosure control rules. Moreover, there is uncertainty about its utility, especially when specific analyses might yield results that diverge from those on the real data, diminishing the effectiveness of training.

The project has three objectives: (1) investigate tailored teaching datasets for restricted data access, (2) develop a systematic approach to assess disclosure risk in analytical outputs from restricted data sources, and (3) assess the feasibility of producing linked synthetic data from different sources (using the same methodology).

The project spans from April 2024 to March 2025 and falls primarily under Theme 2: Data discovery using machine learning or other AI technologies, but also has the potential to add value under the other two themes (with objective 3 speaking to the federated services agenda and objective 2 providing a tool for augmenting the skills of output checkers).

A preliminary study conducted at Manchester University, in collaboration with Administrative Data Research UK, demonstrates the feasibility of generating synthetic datasets with both high utility and low risk. The methodology involves leveraging cleared analytical outputs from data services as the basis for generating synthetic data using a genetic algorithm. The goal is to provide trainees with data that not only closely resembles real-world data but also yields analytical output very similar to that of the real data, enhancing the training experience.
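As an illustration of the general idea only, the sketch below evolves a synthetic dataset until its cross-tabulation matches a cleared output. It is a minimal toy under stated assumptions, not the project's method: the target table `TARGET`, the fitness measure (total absolute deviation in cell counts), and all function names and parameters are hypothetical choices made for this example.

```python
import random
from collections import Counter

# Hypothetical cleared output: cell counts from a cross-tabulation
# of two binary variables. These numbers are invented for illustration.
TARGET = {(0, 0): 40, (0, 1): 10, (1, 0): 15, (1, 1): 35}
N_RECORDS = sum(TARGET.values())


def crosstab(dataset):
    """Count how many synthetic records fall in each target cell."""
    counts = Counter(dataset)
    return {cell: counts.get(cell, 0) for cell in TARGET}


def fitness(dataset):
    """Total absolute deviation from the target table (0 is an exact match)."""
    table = crosstab(dataset)
    return sum(abs(table[cell] - TARGET[cell]) for cell in TARGET)


def random_dataset(rng):
    """Start from records drawn uniformly at random."""
    return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(N_RECORDS)]


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: take each record from one parent or the other."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def mutate(dataset, rng, rate=0.02):
    """Redraw a small fraction of records at random."""
    return [
        (rng.randint(0, 1), rng.randint(0, 1)) if rng.random() < rate else rec
        for rec in dataset
    ]


def evolve(pop_size=40, generations=300, seed=0):
    """Evolve candidate datasets, keeping the better half each generation."""
    rng = random.Random(seed)
    population = [random_dataset(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        if fitness(population[0]) == 0:
            break  # the synthetic table reproduces the target exactly
        survivors = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print("deviation from target:", fitness(best))
    print("synthetic table:", crosstab(best))
```

Because the better half of each generation is carried forward unchanged, the best candidate never gets worse from one generation to the next. A realistic setting would match many outputs at once (for example regression coefficients as well as tables) and would need a disclosure risk check on the result, neither of which this toy attempts.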

Beyond replicating analytical properties, the approach also offers a route to formalising the assessment of the disclosure risk associated with analytical outputs from safe settings. By embodying statistical outputs in synthetic data, it enables a systematic evaluation of disclosure risk, addressing the informality and potential inconsistencies present in current output checking procedures.

Furthermore, the project aims to bolster the federated services agenda by exploring the creation of synthetic linked data using analytical outputs from data held by multiple services. This approach expands the possibilities of data synthesis without the need for actual linkage and elaborate governance and infrastructure, such as trusted third parties.

Deliverables include open-source code, example synthetic datasets, and academic papers aimed at knowledge dissemination and skill development. The project emphasizes collaboration among data providers, services, and stakeholders to address challenges in data accessibility and security.

In essence, the project aims to redefine data accessibility by providing tailored teaching datasets and systematic disclosure risk assessment methods. It will also foster a collaborative ecosystem for transformative advancements in data synthesis and access management, and contribute to the broader research data landscape.
