Methods for the privacy preserving analysis of sensitive health data: text analysis and data visualisation
Lead Research Organisation:
Newcastle University
Department Name: Population Health Sciences Institute
Abstract
The "data revolution" can enhance health/social care, accelerate research and help us to assess new ways to improve health and health-care. But new ways to analyse health data must be used in ways that the public understand and are happy with, and that appropriately address data privacy and security. This fellowship will develop tools to help scientists and doctors make good use of sensitive health data, while minimising the risk of an individual or their health status becoming known. I will focus on two increasingly important areas of health data use: 1) information from medical text; 2) visual display of data, particularly in augmented reality (AR) or virtual reality (VR).
1) Sensitive text analysis
Medical text (eg health records, medical letters) contains patient data over time, including identifying information (eg address, next of kin, full date of birth). Although helpful for care and research, use of sensitive medical text is strictly controlled for privacy reasons. Existing methods extract information from text, and may control disclosure risk by deleting identifiable data or grouping patients into blocks. But these procedures are not foolproof: some patients may still be identifiable, and results may be wrong once key information has been discarded. My fellowship adopts a new approach we have developed for the free software package DataSHIELD. This allows sensitive data to be analysed without being seen/copied and automatically detects and blocks many analyses that may be identifying. My earlier work has shown DataSHIELD can be used on text data and I will extend it to protect the privacy of data extracted from medical text by computer-based text mining tools. This will markedly increase the range of analyses that may be applied to medical text while maintaining confidentiality. I will first work on synthetic (made-up but realistic) text to safely develop and test the new approach. Once I am satisfied the software works, I will apply it to a research project run by Dr Sarah Slight (School of Pharmacy, Newcastle University), asking whether patients treated with many medications ("polypharmacy") have poorer outcomes (eg more falls, hospital admissions). If they do, new policies can be created to control polypharmacy and improve health outcomes.
2) Sensitive data visualisation
AR/VR technologies provide a quick way to interpret and understand health data without special technical/scientific expertise. These immersive environments work because they can simultaneously present more pieces of information about someone than can be seen on paper or screen. But this also makes individuals more identifiable. If AR/VR becomes widely used, we must properly understand the disclosure risks and develop ways to protect against them. In 2015, our collaboration with industry partners Masters of Pie and Lumacode won a competition to display Wellcome Trust data in VR. Ongoing work, which I have led, has extended this to explore VR visual methods using synthetic data based on the ALSPAC cohort. Together, we built the BigDataVR pilot analysis tool. This fellowship will explore the factors that determine the risk of identifying someone when using immersive environments like BigDataVR. The findings will be used to develop new ways to create VR-compatible graphics via DataSHIELD that convey the "essence" of a data set without a full data display, which may identify someone. I will create a preliminary proof of concept, using DataSHIELD to send the data underpinning a visualisation to the free WebVR environment. Once safe visualisation has been shown using the synthetic data, the work will be extended to a real use case based on the polypharmacy project (see above) or on research data released by METADAC (a committee overseeing access to biomedical data from 5 major UK studies).
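As an illustration only, one simple way to convey the "essence" of a data set without displaying individual records is to aggregate points into grid cells and drop any cell containing too few people. The sketch below is an assumption for explanatory purposes: it is written in Python rather than R (the language DataSHIELD is built on), the function name `binned_counts` and the threshold of 5 are hypothetical, and it is not the method the fellowship will necessarily adopt.

```python
import math
from collections import Counter


def binned_counts(points, bin_size, min_count=5):
    """Aggregate (x, y) points into square grid cells and drop sparse cells.

    Only cell indices and counts are returned, never the original points.
    `min_count` is a hypothetical disclosure threshold chosen for illustration.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    # Map each point to the integer index of the grid cell that contains it.
    cells = Counter(
        (math.floor(x / bin_size), math.floor(y / bin_size)) for x, y in points
    )
    # Suppress cells with fewer than min_count points, since a sparsely
    # populated cell could single out an individual.
    return {cell: n for cell, n in cells.items() if n >= min_count}


# Five points share one cell; the lone outlier is suppressed.
points = [(0.1, 0.2), (0.3, 0.9), (0.5, 0.5), (0.7, 0.1), (0.9, 0.8), (3.5, 3.5)]
print(binned_counts(points, bin_size=1.0))
```

A visualisation layer would then draw only the returned cells, so an isolated individual never appears in the display.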
Software created under both work programs will be freely available to researchers, helping doctors and scientists to better analyse sensitive health data while protecting confidentiality.
Technical Summary
Health Data Science extracts, integrates and interprets health/biomedical data at population, organisation and individual levels to support: front-line clinical care/public health; health care planning/evaluation; research for academia, industry and the health/social services. The evolution of data science in the health/social sciences has lagged behind the physical sciences, including the earth and space sciences with which I am familiar. In part this reflects the social and technical challenges associated with governing human data in a responsible manner. My fellowship addresses methods and software to facilitate well-governed access, analysis and exploitation of sensitive health/biomedical data, with a joint focus on the privacy-protected analysis of textual data and on guarding against the disclosure risk associated with data visualisation, particularly in several dimensions.
Building on my earlier work, including as a Farr Future Leader, my fellowship will exploit big data analytics, text mining and new technologies in virtual and/or augmented reality. Novel approaches to disclosure control will be implemented via DataSHIELD, open source software for the distributed analysis of sensitive data, in which individual-level data can be analysed but not seen or abstracted by the analyst. Embedded disclosure controls (set by the data custodian and inaccessible to the analyst) mitigate inferential (analysis-based) disclosure and can avoid costly, error-prone human scrutiny of results. The fellowship builds on work that I have personally led, including three years as manager of the DataSHIELD development team. The new functionality to be developed will allow DataSHIELD to act as an automated disclosure-control layer between the user and either medical text or data underpinning sophisticated visual representation. Key applications will include personalised medicine, epidemiology and modern public health, with data coming from one source or several.
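The idea of an embedded disclosure control can be sketched as a check that runs on the data custodian's side before any summary is released to the analyst. The example below is a minimal sketch under stated assumptions, not DataSHIELD's actual API: it is written in Python rather than R, and the names `MIN_CELL_COUNT`, `DisclosureError`, `safe_mean` and `safe_table`, as well as the threshold value, are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical threshold, assumed to be set by the data custodian
# and not changeable by the analyst.
MIN_CELL_COUNT = 5


class DisclosureError(Exception):
    """Raised when a requested summary could identify an individual."""


def safe_mean(values, min_count=MIN_CELL_COUNT):
    """Return the mean only if it is based on enough individuals."""
    values = list(values)
    if len(values) < min_count:
        raise DisclosureError(
            f"summary blocked: based on fewer than {min_count} individuals"
        )
    return mean(values)


def safe_table(categories, min_count=MIN_CELL_COUNT):
    """Return a frequency table with small cells replaced by None."""
    counts = Counter(categories)
    return {key: (n if n >= min_count else None) for key, n in counts.items()}


# The analyst receives only the checked summaries, never the raw values.
print(safe_mean([61, 72, 58, 80, 69]))
print(safe_table(["fall"] * 6 + ["no fall"] * 2))
```

In this sketch the analyst calls only the checked functions and never touches the individual-level values, which mirrors the "analysed, but not seen" principle described above.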
People
Rebecca Wilson (Principal Investigator / Fellow)
Publications
Vrijheid M
(2021)
Advancing tools for human early lifecourse exposome research and translation (ATHLETE): Project overview.
in Environmental epidemiology (Philadelphia, Pa.)
Pastorino S
(2019)
Associations between maternal physical activity in early and late pregnancy and offspring birth size: remote federated individual level meta-analysis from eight cohort studies.
in BJOG : an international journal of obstetrics and gynaecology
Butters O
(2018)
Generation of a cleaned dataset listing Avon Longitudinal Study of Parents And Children peer-reviewed publications to 2015.
in Wellcome open research
Fortier I
(2023)
Life course of retrospective harmonization initiatives: key elements to consider.
in Journal of developmental origins of health and disease
Avraam D
(2021)
Privacy preserving data visualizations.
in EPJ data science
Butters O
(2020)
PUblications Metadata Augmentation (PUMA) pipeline
in F1000Research
Butters OW
(2020)
Recognizing, reporting and reducing the data curation debt of cohort studies.
in International journal of epidemiology
Title | Panel on Diversity and Inclusion |
Description | A copy of my slides setting the scene for disability inclusion within the academic sector. |
Type Of Art | Film/Video/Animation |
Year Produced | 2021 |
URL | https://ssi-cw.figshare.com/articles/presentation/Panel_on_Diversity_and_Inclusion/14345951 |
Description | A federated FAIR platform enabling large-scale analysis of high-value cohort data connecting Europe and Canada in personalized health |
Amount | € 6,717,953 (EUR) |
Funding ID | 824989 |
Organisation | European Commission H2020 |
Sector | Public |
Country | Belgium |
Start | 10/2022 |
End | 12/2023 |
Description | Advancing Tools for Human Early Lifecourse Exposome Research and Translation - ATHLETE |
Amount | € 12,000,000 (EUR) |
Funding ID | 874583 |
Organisation | European Commission H2020 |
Sector | Public |
Country | Belgium |
Start | 01/2020 |
End | 12/2024 |
Description | Health and Life Sciences Tenure Track Fellowship |
Amount | £0 (GBP) |
Organisation | University of Liverpool |
Sector | Academic/University |
Country | United Kingdom |
Start | 02/2023 |
End | 02/2028 |
Description | TRE-FX |
Amount | £560,000 (GBP) |
Funding ID | MC_PC_23007 |
Organisation | Medical Research Council (MRC) |
Sector | Public |
Country | United Kingdom |
Start | 02/2023 |
End | 10/2023 |
Title | ALSPAC peer reviewed publications 1989-2015 |
Description | List of peer reviewed publications generated from the Avon Longitudinal Study of Parents and Children (ALSPAC) data from 1989 to the end of 2015. |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
Impact | Dataset curated for the impact analysis of cohort studies. |
URL | https://zenodo.org/record/2276785 |
Description | DataSHIELD Workshop 2018 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Third sector organisations |
Results and Impact | I organised a three-day workshop to showcase new community developments, functionality and applications, and to introduce potential users to DataSHIELD for privacy-protected distributed analysis. The agenda includes a mix of talks and demonstrations, a tutorial on how to use DataSHIELD and discussion sessions to facilitate DataSHIELD community-led solutions to a range of development and application challenges. These discussions will assist in the roadmap planning of DataSHIELD. The workshop targets three groups: - those that are unfamiliar with DataSHIELD, or that may have a new use case or application for DataSHIELD - current DataSHIELD users or adopters - those developing new statistical methodology, functionality or infrastructure for DataSHIELD. |
Year(s) Of Engagement Activity | 2018 |
URL | http://www.datashield.ac.uk/workshop18 |
Description | DataSHIELD Workshop 2019 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | I organised the 2019 DataSHIELD workshop, comprising training and an introduction to our open source software. In addition, several speaker slots for developers, adopters and users of DataSHIELD were included to disseminate work. The workshop has led to a new release of DataSHIELD (v5.1) and to several new adopters and interested parties, including IMI global consortia projects and other European consortia of longitudinal research studies. New developers were welcomed to the DataSHIELD community. |
Year(s) Of Engagement Activity | 2019 |
URL | http://www.datashield.ac.uk/events/eucan-connectagm2019datashieldworkshop/agendas/2019datashieldwork... |