Enhancing Data Accessibility and Security through Innovative Data Synthesis (EDASIDA).
Lead Research Organisation:
The University of Manchester
Department Name: Social Sciences
Abstract
The proposal outlines a project geared towards revolutionizing data accessibility and security through innovative data synthesis techniques. We first highlight one bottleneck in the data discovery process: the scarcity of good teaching datasets, particularly for data that sit in virtual research environments, where access restrictions impede their creation. This creates a discoverability challenge for new users, who cannot explore data before going through an approval process, raising barriers to entry.
While synthetic data is a potential solution, concerns about risk and utility exist. Data services often grapple with assessing the disclosure risk associated with synthetic data, as it deviates from the scope of conventional output disclosure control rules. Moreover, there is uncertainty about its utility, especially when specific analyses might yield results diverging from real data, diminishing the training process's effectiveness.
The project has three objectives: (1) investigate tailored teaching datasets for restricted data access, (2) develop a systematic approach to assess disclosure risk in analytical outputs from restricted data sources, and (3) assess the feasibility of producing linked synthetic data from different sources (using the same methodology).
The project spans from April 2024 to March 2025 and falls primarily under Theme 2: Data discovery using machine learning or other AI technologies, but also has the potential to add value under the other two themes (with objective 3 speaking to the federated services agenda and objective 2 providing a tool for augmenting the skills of output checkers).
A preliminary study conducted at the University of Manchester, in collaboration with Administrative Data Research UK, demonstrates the feasibility of generating synthetic datasets with both high utility and low risk. The methodology involves leveraging cleared analytical outputs from data services as the basis for generating synthetic data using a genetic algorithm. The goal is to provide trainees with data that not only closely resembles real-world data but also yields analytical output very similar to that of the real data, enhancing the training experience.
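The core idea, evolving a candidate dataset until its analytical outputs match already-cleared published statistics, can be illustrated with a minimal sketch. Everything here is hypothetical: the target values (TARGET_MEAN_X, TARGET_MEAN_Y, TARGET_CORR) stand in for cleared outputs, and the fitness function, mutation scheme, and selection strategy are simple placeholders rather than the project's actual genetic algorithm.

```python
import random

# Hypothetical target statistics standing in for cleared, published outputs.
TARGET_MEAN_X = 30.0   # e.g. mean age
TARGET_MEAN_Y = 500.0  # e.g. mean weekly earnings
TARGET_CORR = 0.6      # published correlation between the two
N_RECORDS = 100

def corr(xs, ys):
    """Pearson correlation, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx ** 0.5 * vy ** 0.5 + 1e-12)

def fitness(dataset):
    """Distance between the candidate's statistics and the targets (lower is better).
    The weights are arbitrary illustrative choices."""
    xs, ys = zip(*dataset)
    return (abs(sum(xs) / len(xs) - TARGET_MEAN_X)
            + abs(sum(ys) / len(ys) - TARGET_MEAN_Y) / 10
            + abs(corr(xs, ys) - TARGET_CORR) * 50)

def random_dataset():
    """A random candidate: N_RECORDS synthetic (x, y) records."""
    return [(random.uniform(18, 65), random.uniform(100, 1000))
            for _ in range(N_RECORDS)]

def mutate(dataset, rate=0.1):
    """Perturb a fraction of records with Gaussian noise."""
    out = []
    for x, y in dataset:
        if random.random() < rate:
            x += random.gauss(0, 2)
            y += random.gauss(0, 20)
        out.append((x, y))
    return out

def evolve(generations=200, pop_size=30):
    """Keep the fittest half each generation; refill with mutated survivors."""
    pop = [random_dataset() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[:pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=fitness)
```

The resulting dataset contains no real records, yet its targeted statistics converge towards the published values, which is the sense in which the method needs no access to the original data.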
Beyond this replication of analytical properties, the approach also offers a route to formalise the assessment of disclosure risk associated with analytical outputs from safe settings. By embodying statistical outputs in synthetic data, it enables a systematic evaluation of disclosure risk, addressing the informality and potential inconsistencies present in current output checking procedures.
Furthermore, the project aims to bolster the federated services agenda by exploring the creation of synthetic linked data using analytical outputs from multiple data services. This approach expands the possibilities of data synthesis without the need for actual linkage or elaborate governance infrastructure, such as trusted third parties.
Deliverables include open-source code, example synthetic datasets, and academic papers aimed at knowledge dissemination and skill development. The project emphasizes collaboration among data providers, services, and stakeholders to address challenges in data accessibility and security.
In essence, the project aims to redefine data accessibility by providing tailored teaching datasets and systematic disclosure risk assessment methods. It will also foster a collaborative ecosystem for transformative advancements in data synthesis and access management, and contribute to the broader research data landscape.
Publications
Description | 1. That it is possible to generate useful synthetic datasets from analytical output. 2. That it is possible to use the method (in combination with other tools) to assess the disclosure risk of outputs from safe settings such as trusted research environments. |
Exploitation Route | TBC |
Sectors | Digital/Communication/Information Technologies (including Software) Education |
Title | The production of teaching datasets without access to the original data (POTWOD) |
Description | Data Synthesis (DS) is a methodology within statistics and machine learning that produces an artificial dataset. Instead of releasing or sharing real data, DS produces an artificial dataset that contains no real records but preserves an underlying data structure similar to the original data whilst having low disclosure risk. It can be used as a data confidentiality method applied to datasets to prevent leakage of confidential information about data subjects whilst delivering utility for analysts equivalent to the real or original data. DS can therefore allow greater access to data that might otherwise be safeguarded, since it presents lower disclosure risk than the original data. Teaching datasets are a pivotal component of the data discovery pipeline. These datasets serve as the initial point of interaction for data users, allowing them to explore a dataset's contents and assess its relevance to their specific needs. Typically constituting compact subsets of the complete dataset, traditional teaching datasets employ data minimisation techniques to control disclosure risks. While teaching datasets play a crucial role in the data discovery process, there are instances where their viability is limited, particularly where source data is only accessible within safe settings such as trusted research environments (TREs). Some TREs have attempted to address this limitation by creating generic synthetic datasets for teaching purposes. However, this approach encounters a significant challenge inherent in all general synthetic data: it may fail to accurately replicate the analytical results desired by teachers for their students. TREs can also be uncomfortable with assessing disclosure risk for synthetic data that does not map onto standard output disclosure control rules. In response to this challenge, our new method allows the production of bespoke synthetic datasets tailored for specific teaching purposes.
This approach utilises already cleared and published analyses as the basis for the synthesis. These could be the trainer's own analyses or those published by a third party. Unlike generic synthetic datasets, the bespoke synthetic datasets are designed solely to reproduce the specific analyses. This allows teachers to produce their own bespoke synthetic teaching datasets that look and feel like real datasets, reproduce the required outputs faithfully, and can be generated without access to the original data. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2025 |
Provided To Others? | No |
Impact | Access to such bespoke synthetic datasets offers a valuable opportunity for users to undergo training using realistic data before applying for access to the real restricted dataset. This not only enhances the training experience but also reduces the time required within a TRE. By introducing this innovative approach to DS, the pilot project aims to redefine the landscape of data accessibility, offering enhanced training experiences and expanded opportunities for data exploration within the research data community. The first of these datasets is now available via the UK Data Service. https://beta.ukdataservice.ac.uk/datacatalogue/doi/?id=9282#!#0 |
URL | https://synthdig.github.io/ |
Title | Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data |
Description | The synthetic ASHE-2011 Census dataset (hereafter referred to as Synthetic Dataset) was created without access to the original secure dataset and used only publicly available statistics and information to generate the data. An Evolutionary Algorithm (EA) was developed for the purpose of generating the synthetic data. The Synthetic Dataset contains a subset of the variables available in the original. |
Type Of Material | Database/Collection of data |
Year Produced | 2024 |
Provided To Others? | Yes |
Impact | This was the first known example of the creation of synthetic data without access to the original data. It has implications for the generation of teaching datasets. |
URL | https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=9282 |