Methods for the privacy preserving analysis of sensitive health data: text analysis and data visualisation

Lead Research Organisation: Newcastle University
Department Name: Population Health Sciences Institute

Abstract

The "data revolution" can enhance health and social care, accelerate research and help us to assess new ways to improve health and health care. But new ways of analysing health data must be ones that the public understand and are happy with, and that appropriately address data privacy and security. This fellowship will develop tools to help scientists and doctors make good use of sensitive health data, while minimising the risk of an individual or their health status becoming known. I will focus on two increasingly important areas of health data use: 1) information from medical text; 2) visual display of data, particularly in augmented reality (AR) or virtual reality (VR).

1) Sensitive text analysis
Medical text (eg health records, medical letters) contains patient data over time, including identifying information (eg address, next of kin, full date of birth). Although helpful for care and research, use of sensitive medical text is strictly controlled for privacy reasons. Existing methods extract information from text, and may control disclosure risk by deleting identifiable data or grouping patients into blocks. But these procedures are not foolproof: some patients may still be identifiable, and once key information has been discarded, results may be wrong. My fellowship adopts a new approach we have developed for the free software package DataSHIELD. This allows sensitive data to be analysed without being seen or copied, and automatically detects and blocks many analyses that may be identifying. My earlier work has shown that DataSHIELD can be used on text data, and I will extend it to protect the privacy of data extracted from medical text by computer-based text mining tools. This will markedly increase the range of analyses that may be applied to medical text while maintaining confidentiality. I will first work on synthetic (made-up but realistic) text to safely develop and test the new approach. Once I am satisfied the software works, I will apply it to a research project run by Dr Sarah Slight (School of Pharmacy, Newcastle University), asking whether patients treated with many medications ("polypharmacy") have poorer outcomes (eg more falls, hospital admissions). If they do, new policies can be created to control polypharmacy and improve health outcomes.
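
As a minimal sketch of the kind of automatic check described above, and not DataSHIELD's actual implementation or API, the Python below counts how many patients' notes mention each term and releases only counts at or above a threshold set by the data custodian. The function name `safe_term_counts`, the threshold value and the sample notes are all hypothetical.

```python
from collections import Counter
import re

MIN_CELL_COUNT = 3  # hypothetical custodian-set threshold, not a DataSHIELD default


def safe_term_counts(notes_by_patient, terms, min_count=MIN_CELL_COUNT):
    """Return the number of patients mentioning each term, or None if suppressed."""
    counts = Counter()
    for note in notes_by_patient.values():
        words = set(re.findall(r"[a-z]+", note.lower()))
        for term in terms:
            if term in words:
                counts[term] += 1  # each patient is counted at most once per term
    # Counts below the threshold are withheld rather than returned.
    return {t: (counts[t] if counts[t] >= min_count else None) for t in terms}


# Synthetic (made-up) notes, for illustration only.
notes = {
    "p1": "Patient had a fall at home. On warfarin.",
    "p2": "Fall in the garden, admitted overnight.",
    "p3": "No fall reported. Warfarin stopped.",
    "p4": "Second fall this year.",
}
result = safe_term_counts(notes, ["fall", "warfarin"])
print(result)
```

Here "fall" is mentioned for four patients and is released, while "warfarin" is mentioned for only two and is withheld. The sketch matches words naively (it counts "No fall reported" as a mention), which a real text mining tool would need to handle.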

2) Sensitive data visualisation
AR/VR technologies provide a quick way to interpret and understand health data without special technical or scientific expertise. These immersive environments work because they can simultaneously present more pieces of information about someone than can be seen on paper or screen. But this also makes individuals more identifiable. If AR/VR becomes widely used, we must properly understand the disclosure risks and develop ways to protect against them. In 2015, our collaboration with industry partners Masters of Pie and Lumacode won a competition to display Wellcome Trust data in VR. In subsequent work that I led, we went on to explore VR visual methods using synthetic data based on the ALSPAC cohort. Together, we built the BigDataVR pilot analysis tool. This fellowship will explore the factors that determine the risk of identifying someone when using immersive environments like BigDataVR. The findings will be used to develop new ways to create VR-compatible graphics via DataSHIELD that convey the "essence" of a data set without displaying the full data, which may identify someone. I will create a preliminary proof of concept, using DataSHIELD to send the data underpinning a visualisation to the free WebVR environment. Once safe visualisation has been shown using the synthetic data, the work will be extended to a real use case based on the polypharmacy project (see above) or on research data released by METADAC (a committee overseeing access to biomedical data from 5 major UK studies).
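
One simple way to convey the overall shape of a data set without plotting individuals is to send the display only binned counts, dropping sparsely populated cells. The Python below is a sketch of that idea under stated assumptions: the function `binned_counts`, the threshold and the sample points are hypothetical, and this is not the method the fellowship commits to or anything provided by DataSHIELD or WebVR.

```python
MIN_CELL_COUNT = 5  # hypothetical threshold below which a cell is not shown


def binned_counts(points, bin_width, min_count=MIN_CELL_COUNT):
    """Aggregate (x, y) points into square grid cells, dropping sparse cells."""
    cells = {}
    for x, y in points:
        key = (int(x // bin_width), int(y // bin_width))
        cells[key] = cells.get(key, 0) + 1
    # Only cells holding enough people are passed on to the display.
    return {key: n for key, n in cells.items() if n >= min_count}


# Synthetic data: a cluster of six people plus one outlier.
points = [
    (1.0, 1.2), (1.5, 1.9), (0.2, 0.4),
    (1.1, 0.3), (0.9, 1.7), (1.8, 0.1),
    (9.5, 9.9),
]
heatmap = binned_counts(points, bin_width=2.0)
print(heatmap)
```

The cluster survives as a single cell with a count of six, while the outlier, who would be the easiest person to pick out in a scatter plot, is not sent at all. Suppression alone does not rule out every route to disclosure (for example, comparing two slightly different requests), so a real system would need further checks.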

Software created under both work programs will be freely available to researchers, helping doctors and scientists to better analyse sensitive health data while protecting confidentiality.

Technical Summary

Health Data Science extracts, integrates and interprets health/biomedical data at population, organisation and individual levels to support: front-line clinical care/public health; health care planning/evaluation; research for academia, industry and the health/social services. The evolution of data science in the health/social sciences has lagged behind the physical sciences, including the earth and space sciences, with which I am familiar. In part this reflects the social and technical challenges associated with governing human data in a responsible manner. My fellowship addresses methods and software to facilitate well-governed access, analysis and exploitation of sensitive health/biomedical data, with a joint focus on the privacy-protected analysis of textual data and on guarding against the disclosure risk associated with data visualisation, particularly in several dimensions.

Building on my earlier work, including as a Farr Future Leader, my fellowship will exploit big data analytics, text mining and new technologies in virtual and/or augmented reality. Novel approaches to disclosure control will be implemented via DataSHIELD, open-source software for the distributed analysis of sensitive data, in which individual-level data can be analysed but not seen or abstracted by the analyst. Embedded disclosure controls (set by the data custodian and inaccessible to the analyst) mitigate inferential (analysis-based) disclosure and can avoid costly, error-prone human scrutiny of results. The fellowship builds on work that I have personally led, including three years as manager of the DataSHIELD development team. The new functionality to be developed will allow DataSHIELD to act as an automated disclosure-control layer between the user and either medical text or the data underpinning sophisticated visual representation. Key applications will include personalised medicine, epidemiology and modern public health, with data coming from one source or several.
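
The distributed pattern described above can be sketched as follows. This is an illustration in Python under stated assumptions, not DataSHIELD code (DataSHIELD itself is an R-based system): the class `SiteServer`, the function `pooled_mean` and the `min_n` control are hypothetical names. Each site releases only a sum and a count, refuses if too few records are involved, and the analyst combines the summaries.

```python
class SiteServer:
    """Stands in for one data custodian's server; rows are never returned."""

    def __init__(self, values, min_n=5):
        self._values = list(values)  # individual-level data stays on the server
        self._min_n = min_n          # hypothetical custodian-set control

    def summary(self):
        """Return (sum, n) for the held values, or refuse if n is too small."""
        n = len(self._values)
        if n < self._min_n:
            raise PermissionError("blocked: too few records to release a summary")
        return sum(self._values), n


def pooled_mean(sites):
    """Analyst-side code: combine per-site summaries into one overall mean."""
    total, count = 0.0, 0
    for site in sites:
        site_sum, site_n = site.summary()
        total += site_sum
        count += site_n
    return total / count


# Synthetic values held at two separate sites.
site_a = SiteServer([4, 6, 5, 7, 8])
site_b = SiteServer([2, 3, 4, 3, 2, 4])
print(pooled_mean([site_a, site_b]))

# A site with only two records refuses to answer.
tiny_site = SiteServer([9, 1])
try:
    tiny_site.summary()
    blocked = False
except PermissionError:
    blocked = True
print(blocked)
```

In this sketch the separation between analyst and data is only a naming convention; in a real deployment it would be enforced by the server and network boundary, with the threshold configured where the analyst cannot change it.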

Publications

 
Title Panel on Diversity and Inclusion 
Description A copy of my slides setting the scene for disability inclusion within the academic sector. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://ssi-cw.figshare.com/articles/presentation/Panel_on_Diversity_and_Inclusion/14345951
 
Title Panel on Diversity and Inclusion 
Description A copy of my slides setting the scene for disability inclusion within the academic sector. 
Type Of Art Film/Video/Animation 
Year Produced 2021 
URL https://ssi-cw.figshare.com/articles/presentation/Panel_on_Diversity_and_Inclusion/14345951/1
 
Description A federated FAIR platform enabling large-scale analysis of high-value cohort data connecting Europe and Canada in personalized health
Amount € 6,717,953 (EUR)
Funding ID 824989 
Organisation European Commission H2020 
Sector Public
Country Belgium
Start 10/2022 
End 12/2023
 
Description Advancing Tools for Human Early Lifecourse Exposome Research and Translation - ATHLETE
Amount € 12,000,000 (EUR)
Funding ID 874583 
Organisation European Commission H2020 
Sector Public
Country Belgium
Start 01/2020 
End 12/2024
 
Description Health and Life Sciences Tenure Track Fellowship
Amount £0 (GBP)
Organisation University of Liverpool 
Sector Academic/University
Country United Kingdom
Start 02/2023 
End 02/2028
 
Description TRE-FX
Amount £560,000 (GBP)
Funding ID MC_PC_23007 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 02/2023 
End 10/2023
 
Title ALSPAC peer reviewed publications 1989-2015 
Description List of peer reviewed publications generated from the Avon Longitudinal Study of Parents and Children (ALSPAC) data from 1989 to the end of 2015. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact dataset curated for the impact analysis of cohort studies. 
URL https://zenodo.org/record/2276785
 
Description DataSHIELD Workshop 2018 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Third sector organisations
Results and Impact I have organised a three-day workshop to showcase new community developments, functionality and applications, and to introduce potential users to DataSHIELD for privacy-protected distributed analysis.

The agenda includes a mix of talks and demonstrations, a tutorial on how to use DataSHIELD, and discussion sessions to facilitate DataSHIELD community-led solutions to a range of development and application challenges. These discussions will assist in the roadmap planning of DataSHIELD.

This workshop targets three groups:
- those that are unfamiliar with DataSHIELD, or that may have a new use case or application for DataSHIELD
- current DataSHIELD users or adopters
- those developing new statistical methodology, functionality or infrastructure for DataSHIELD.
Year(s) Of Engagement Activity 2018
URL http://www.datashield.ac.uk/workshop18
 
Description DataSHIELD Workshop 2019 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I organised the 2019 DataSHIELD workshop, comprising training and an introduction to our open-source software. In addition, several speaker slots for developers, adopters and users of DataSHIELD were included to disseminate work.

The workshop has led to a new release, DataSHIELD v5.1.

It also attracted several new adopters and interested parties, including IMI global consortia projects and other European consortia of longitudinal research studies.

New developers were welcomed to the DataSHIELD community.
Year(s) Of Engagement Activity 2019
URL http://www.datashield.ac.uk/events/eucan-connectagm2019datashieldworkshop/agendas/2019datashieldwork...