Methods for the privacy preserving analysis of sensitive health data: text analysis and data visualisation

Lead Research Organisation: Newcastle University

Department Name: Population Health Sciences Institute

Abstract

The "data revolution" can enhance health/social care, accelerate research and help us to assess new ways to improve health and health-care. But new ways to analyse health data must be used in ways that the public understand and are happy with, and that appropriately address data privacy and security. This fellowship will develop tools to help scientists and doctors make good use of sensitive health data, while minimising the risk of an individual or their health status becoming known. I will focus on two increasingly important areas of health data use: 1) information from medical text; 2) visual display of data, particularly in augmented reality (AR) or virtual reality (VR).

1) Sensitive text analysis
Medical text (eg health records, medical letters) contains patient data over time, including identifying information (eg address, next of kin, full date of birth). Although helpful for care and research, use of sensitive medical text is strictly controlled for privacy reasons. Existing methods extract information from text and may control disclosure risk by deleting identifiable data or grouping patients into blocks. But these procedures are not foolproof: some patients may still be identifiable, and results may be wrong once key information has been discarded. My fellowship adopts a new approach we have developed for the free software package DataSHIELD. This allows sensitive data to be analysed without being seen or copied, and automatically detects and blocks many analyses that may be identifying. My earlier work has shown DataSHIELD can be used on text data and I will extend it to protect the privacy of data extracted from medical text by computer-based text mining tools. This will markedly increase the range of analyses that may be applied to medical text while maintaining confidentiality. I will first work on synthetic (made-up but realistic) text to safely develop and test the new approach. Once I am satisfied the software works, I will apply it to a research project run by Dr Sarah Slight (School of Pharmacy, Newcastle University), asking whether patients treated with many medications ("polypharmacy") have poorer outcomes (eg more falls, hospital admissions). If they do, new policies can be created to control polypharmacy and improve health outcomes.

2) Sensitive data visualisation
AR/VR technologies provide a quick way to interpret and understand health data without special technical/scientific expertise. These immersive environments work because they can simultaneously present more pieces of information about someone than can be seen on paper or screen. But this also makes individuals more identifiable. If AR/VR becomes widely used, we must properly understand the disclosure risks and develop ways to protect against them. In 2015, our collaboration with industry partners Masters of Pie and Lumacode won a competition to display Wellcome Trust data in VR. Ongoing work, which I have led, has extended this to explore VR visual methods using synthetic data based on the ALSPAC cohort. Together, we built the BigDataVR pilot analysis tool. This fellowship will explore factors determining the risk of identifying someone when using immersive environments like BigDataVR. The findings will be used to develop new ways to create VR-compatible graphics via DataSHIELD that convey the "essence" of a data set without displaying the full data, which may identify someone. I will create a preliminary proof of concept, using DataSHIELD to send the data underpinning a visualisation to the free WebVR environment. Once safe visualisation has been shown using the synthetic data, the work will be extended to a real use case based on the polypharmacy project (see above) or on research data released by METADAC (a committee overseeing access to biomedical data from 5 major UK studies).

Software created under both work programmes will be freely available to researchers, helping doctors and scientists to better analyse sensitive health data while protecting confidentiality.

Technical Summary

Health Data Science extracts, integrates and interprets health/biomedical data at population, organisation and individual levels to support: front-line clinical care/public health; health care planning/evaluation; research for academia, industry and the health/social services. The evolution of data science in the health/social sciences has lagged behind the physical sciences - including earth and space sciences, with which I am familiar. In part this reflects the social and technical challenges associated with governing human data in a responsible manner. My fellowship addresses methods and software to facilitate well-governed access, analysis and exploitation of sensitive health/biomedical data, with a joint focus on the privacy protected analysis of textual data and on guarding against the disclosure risk associated with data visualisation, particularly in several dimensions.

Building on my earlier work, including as a Farr Future Leader, my fellowship will exploit big data analytics, text mining and new technologies in virtual and/or augmented reality. Novel approaches to disclosure control will be implemented via DataSHIELD, open source software for the distributed analysis of sensitive data - where individual-level data can be analysed, but not seen or abstracted by the analyst. Embedded disclosure controls (set by the data custodian and inaccessible to the analyst) mitigate inferential (analysis-based) disclosure and can avoid costly, error-prone human scrutiny of results. The fellowship builds on work that I have personally led, including three years as manager of the DataSHIELD development team. The new functionality to be developed will allow DataSHIELD to act as an automated disclosure-control layer between the user and either medical text or the data underpinning sophisticated visual representation. Key applications will include personalised medicine, epidemiology and modern public health, with data coming from one source or several.
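As an illustration only, the idea of an embedded disclosure control can be sketched in Python as a check that runs alongside the data and releases only summary results that rest on enough individuals. This is not DataSHIELD code (DataSHIELD is an R-based system) and it does not reproduce its API: the function names `safe_counts` and `safe_mean` and the threshold `MIN_CELL_COUNT` are hypothetical, and the value 5 is an assumed custodian setting rather than a documented default.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional

# Hypothetical threshold chosen by the data custodian (an assumption for
# this sketch, not a DataSHIELD default). The analyst cannot change it.
MIN_CELL_COUNT = 5


def safe_counts(values: Iterable[str],
                min_cell: int = MIN_CELL_COUNT) -> Dict[str, Optional[int]]:
    """Tabulate categories, suppressing any count below min_cell as None."""
    counts = Counter(values)
    return {category: (n if n >= min_cell else None)
            for category, n in counts.items()}


def safe_mean(values: List[float], min_n: int = MIN_CELL_COUNT) -> float:
    """Return the mean only when enough individuals contribute to it."""
    if len(values) < min_n:
        # Block the analysis rather than release a potentially
        # identifying summary of very few people.
        raise PermissionError("analysis blocked: too few individuals")
    return sum(values) / len(values)
```

In this sketch the analyst only ever receives the return values of such functions, never the underlying records, so small groups are suppressed or blocked automatically rather than checked by hand.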

Funded Value:

£224,395

Funded Period:

Feb 18 - May 20

Funder:

MRC

Project Status:

Closed

Project Category:

Fellowship

Project Reference:

MR/S003959/1

Principal Investigator:

Rebecca Wilson

Health Category:

Unclassified

Organisations

People	ORCID iD
Rebecca Wilson (Principal Investigator / Fellow)

Publications

Vrijheid M (2021) Advancing tools for human early lifecourse exposome research and translation (ATHLETE): Project overview. in Environmental epidemiology (Philadelphia, Pa.)

Pastorino S (2019) Associations between maternal physical activity in early and late pregnancy and offspring birth size: remote federated individual level meta-analysis from eight cohort studies. in BJOG : an international journal of obstetrics and gynaecology

Butters O (2018) Generation of a cleaned dataset listing Avon Longitudinal Study of Parents And Children peer-reviewed publications to 2015. in Wellcome open research

Vinther JL (2023) Gestational age at birth and body size from infancy through adolescence: An individual participant data meta-analysis on 253,810 singletons in 16 birth cohort studies. in PLoS medicine

Fortier I (2023) Life course of retrospective harmonization initiatives: key elements to consider. in Journal of developmental origins of health and disease

Avraam D (2021) Privacy preserving data visualizations. in EPJ data science

Butters OW (2020) PUblications Metadata Augmentation (PUMA) pipeline. in F1000Research

Butters OW (2020) Recognizing, reporting and reducing the data curation debt of cohort studies. in International journal of epidemiology

Artistic and Creative Products
Further Funding
Research Databases and Models
Engagement Activities


Title	Panel on Diversity and Inclusion
Description	A copy of my slides setting the scene for disability inclusion within the academic sector.
Type Of Art	Film/Video/Animation
Year Produced	2021
URL	https://ssi-cw.figshare.com/articles/presentation/Panel_on_Diversity_and_Inclusion/14345951


Description	A federated FAIR platform enabling large-scale analysis of high-value cohort data connecting Europe and Canada in personalized health
Amount	€ 6,717,953 (EUR)
Funding ID	824989
Organisation	European Commission H2020
Sector	Public
Country	Belgium
Start	10/2022
End	12/2023


Description	Advancing Tools for Human Early Lifecourse Exposome Research and Translation - ATHLETE
Amount	€ 12,000,000 (EUR)
Funding ID	874583
Organisation	European Commission H2020
Sector	Public
Country	Belgium
Start	01/2020
End	12/2024


Description	Health and Life Sciences Tenure Track Fellowship
Amount	£0 (GBP)
Organisation	University of Liverpool
Sector	Academic/University
Country	United Kingdom
Start	02/2023
End	02/2028


Description	TRE-FX
Amount	£560,000 (GBP)
Funding ID	MC_PC_23007
Organisation	Medical Research Council (MRC)
Sector	Public
Country	United Kingdom
Start	02/2023
End	10/2023


Title	ALSPAC peer reviewed publications 1989-2015
Description	List of peer reviewed publications generated from the Avon Longitudinal Study of Parents and Children (ALSPAC) data from 1989 to the end of 2015.
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes
Impact	Dataset curated for the impact analysis of cohort studies.
URL	https://zenodo.org/record/2276785


Description	DataSHIELD Workshop 2018
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Third sector organisations
Results and Impact	I have organised a three day workshop to showcase new community developments, functionality and applications, and to introduce potential users to DataSHIELD for privacy protected distributed analysis. The agenda includes a mix of talks and demonstrations, a tutorial on how to use DataSHIELD and discussion sessions to facilitate DataSHIELD community-led solutions to a range of development and application challenges. These discussions will assist in the roadmap planning of DataSHIELD. This workshop targets three groups: - those that are unfamiliar with DataSHIELD, or that may have a new use case or application for DataSHIELD - current DataSHIELD users or adopters - those developing new statistical methodology, functionality or infrastructure for DataSHIELD.
Year(s) Of Engagement Activity	2018
URL	http://www.datashield.ac.uk/workshop18


Description	DataSHIELD Workshop 2019
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	I organised the 2019 DataSHIELD workshop, comprising training and an introduction to our open source software. In addition, several speaker slots for developers, adopters and users of DataSHIELD were included to disseminate work. The workshop has led to a new release of DataSHIELD (v5.1) and to several new adopters and interested parties, including IMI global consortia projects and other European consortia of longitudinal research studies. New developers were welcomed to the DataSHIELD community.
Year(s) Of Engagement Activity	2019
URL	http://www.datashield.ac.uk/events/eucan-connectagm2019datashieldworkshop/agendas/2019datashieldwork...
