Federated Learning for Unstructured Healthcare Data

Lead Research Organisation: University of Oxford

Abstract

Data plays a crucial role in driving advancements in healthcare through the development of algorithms for diagnostics, treatments, and patient care. However, healthcare data is often distributed across multiple institutions, each governed by data privacy laws. Deep learning methods have shown great potential in healthcare, but their effectiveness relies on the volume and quality of the training data available. Centralizing all the data in one location for processing and training global models is often impractical due to privacy concerns and the distributed nature of healthcare data. Hence, we are limited to training our models on local datasets representing the local population. Compared with a potential global model, these local models often fail to generalise and may be more susceptible to distribution shifts.

Federated learning (FL) offers a solution by enabling collaboration and insights from distributed datasets while maintaining privacy and data ownership. FL operates in a decentralized manner, with institutions retaining control over their local data. Instead of sharing raw data, sites exchange model updates, so that patient information stays within each institution, in line with stringent privacy regulations. However, standard FL assumes that the structure of the training data is consistent across sites. In healthcare informatics, data is often unstructured, meaning it lacks uniformity in dimensionality and feature placement within a feature vector. Unstructured data can arise as a result of missing features, permuted features, or variations in the way measurements are recorded. Practical examples include a particular blood test not being available at all sites, different metrics being used to record the same reading, and measurements not being taken and recorded in the same order. As a result, considerable manual and computational effort is required to make this unstructured data uniform across all sites, which hinders the development of federated models.
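
The exchange of model updates described above can be illustrated with a minimal sketch of federated averaging, a common FL baseline. This is not the project's framework: the linear model, learning rate, number of rounds, and synthetic sites are all assumptions made for illustration. Note that it relies on every site sharing the same three-feature layout, which is exactly the assumption that unstructured data breaks.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Run a few gradient steps of least-squares regression on one site's data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(site_weights, site_sizes):
    """Average site models, weighting each by its local sample count."""
    sizes = np.asarray(site_sizes, dtype=float)
    stacked = np.stack(site_weights)
    return (stacked * sizes[:, None]).sum(axis=0) / sizes.sum()

def run_fedavg(sites, n_features, rounds=20):
    """Alternate local training and server-side averaging; raw data never leaves a site."""
    global_w = np.zeros(n_features)
    for _ in range(rounds):
        local = [local_update(global_w, X, y) for X, y in sites]
        global_w = federated_average(local, [len(y) for _, y in sites])
    return global_w

# Three synthetic "hospitals" that share the same feature layout (an assumption).
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
sites = []
for n in (50, 80, 120):
    X = rng.normal(size=(n, 3))
    y = X @ true_w + 0.01 * rng.normal(size=n)
    sites.append((X, y))

global_w = run_fedavg(sites, n_features=3)
```

Only the weight vectors cross site boundaries here; each `(X, y)` pair is touched solely inside `local_update`.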

The first set of aims concerns exploring existing FL frameworks and developing new ones that can work in the case of unstructured data across sites. This would be particularly useful to resource-constrained clients who may not have enough data to train good local models, or who may be missing certain features in the global feature vector and are therefore unable to participate in current FL efforts. As part of our methodology, we propose the novel use of a cross-attention mechanism to facilitate information sharing across different healthcare sites involved in the federated learning effort. This mechanism allows for the exchange of critical information without the need to share raw data or pre-process feature vectors.
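
The abstract does not specify how the cross-attention mechanism is built, so the following is only one plausible illustration of why attention suits unstructured inputs, not the project's design. It assumes each measurement is turned into a token (a feature-identity embedding plus a scaled value) and that a small set of query vectors attends over whatever tokens a record contains. The feature names, embedding width, and random parameters are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding width (arbitrary choice for the sketch)
FEATURES = ["heart_rate", "spo2", "crp", "wbc"]  # hypothetical feature names

# Hypothetical shared parameters; randomly initialised here rather than trained.
feature_emb = {name: rng.normal(size=D) for name in FEATURES}
value_proj = rng.normal(size=D)
queries = rng.normal(size=(2, D))  # two latent query vectors
W_k = rng.normal(size=(D, D))
W_v = rng.normal(size=(D, D))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_record(record):
    """Map {feature name: value} to a fixed-size vector via cross-attention."""
    # One token per available measurement: identity embedding plus scaled value.
    tokens = np.stack([feature_emb[n] + v * value_proj for n, v in record.items()])
    keys = tokens @ W_k
    values = tokens @ W_v
    scores = queries @ keys.T / np.sqrt(D)  # shape (2, n_tokens)
    attn = softmax(scores, axis=-1)         # each query weights the tokens
    return (attn @ values).ravel()          # shape (2 * D,), whatever n_tokens is

full = {"heart_rate": 0.3, "spo2": -1.2, "crp": 0.8, "wbc": 0.1}
permuted = {"wbc": 0.1, "crp": 0.8, "spo2": -1.2, "heart_rate": 0.3}
partial = {"heart_rate": 0.3, "spo2": -1.2}  # two tests unavailable at this site

z_full, z_perm, z_part = map(encode_record, (full, permuted, partial))
```

Because attention sums over a set of tokens, `z_full` and `z_perm` agree up to floating-point error, and `z_part` has the same length even though two features are missing, so no site has to reorder or pad its feature vector first.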

Secondly, FL is designed to enable model training without sharing raw data, but even sharing model updates can potentially reveal sensitive information about individual clients. By incorporating privacy-preserving techniques such as differential privacy during algorithmic development, we aim to ensure that the aggregation process does not disclose information specific to any individual participant in the FL process.
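
A common way to add differential privacy at the aggregation step is to clip each client's update to a fixed L2 norm and add Gaussian noise to the average. The sketch below shows that pattern under assumed values for the clipping norm and noise multiplier; it does not compute the resulting privacy budget, and it is not taken from the project itself.

```python
import numpy as np

def clip_update(update, clip_norm):
    """Scale an update down so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Average clipped client updates and add Gaussian noise to the result."""
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.stack([clip_update(np.asarray(u, dtype=float), clip_norm)
                        for u in updates])
    mean = clipped.mean(axis=0)
    # Clipping bounds each client's contribution, so the noise can be scaled to it.
    sigma = noise_multiplier * clip_norm / len(updates)
    return mean + rng.normal(scale=sigma, size=mean.shape)

updates = [np.array([3.0, 4.0]), np.array([0.0, 0.0])]
noisy = dp_aggregate(updates, rng=np.random.default_rng(0))
noiseless = dp_aggregate(updates, noise_multiplier=0.0)
```

Clipping limits how much any single participant can move the aggregate, and the noise masks what remains; a larger `noise_multiplier` gives stronger privacy at the cost of a less accurate global update.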

Finally, our project aims to personalize global models to specific user preferences, contexts, or domains by fine-tuning them with local datasets. This personalized approach enhances the relevance and accuracy of AI models for individual users or groups while still benefiting from shared knowledge obtained during global training.
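
Personalisation by fine-tuning can be sketched as taking a copy of the global model and running a few extra training steps on one site's data. The linear model, step count, learning rate, and synthetic data below are assumptions for illustration only.

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of a linear model on one dataset."""
    return float(np.mean((X @ w - y) ** 2))

def personalise(global_w, X_local, y_local, lr=0.05, steps=50):
    """Fine-tune a copy of the global weights on a single site's data."""
    w = global_w.copy()  # the shared global model itself is left untouched
    for _ in range(steps):
        grad = X_local.T @ (X_local @ w - y_local) / len(y_local)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
global_w = np.array([1.0, -2.0, 0.5])  # stands in for a trained global model
local_w = np.array([1.5, -2.0, 0.0])   # this site's population differs slightly
X_local = rng.normal(size=(40, 3))
y_local = X_local @ local_w

personal_w = personalise(global_w, X_local, y_local)
before, after = mse(global_w, X_local, y_local), mse(personal_w, X_local, y_local)
```

Starting from the global weights rather than from scratch is what lets a small local dataset benefit from the knowledge shared during global training.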

We will leverage openly available datasets such as eICU and MIMIC, along with private datasets like CURIAL and HAVEN, to develop FL frameworks. By simulating unstructured environments, we aim to create versatile frameworks capable of handling diverse data scenarios.
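
One simple way to simulate an unstructured environment from a structured dataset is to give each site a random subset of the features in a random column order. The helper below is a hypothetical sketch of that idea; the drop probability and feature names are assumptions and are not drawn from any of the datasets named above.

```python
import numpy as np

def make_unstructured_site(X, feature_names, drop_prob=0.3, rng=None):
    """Return a site view of X with some columns dropped and the rest shuffled."""
    rng = np.random.default_rng() if rng is None else rng
    n_features = X.shape[1]
    keep = rng.random(n_features) >= drop_prob
    if not keep.any():
        # Always keep at least one feature so that the site is usable.
        keep[rng.integers(n_features)] = True
    order = rng.permutation(np.flatnonzero(keep))
    return X[:, order], [feature_names[i] for i in order]

rng = np.random.default_rng(0)
names = ["heart_rate", "spo2", "crp", "wbc", "lactate"]  # hypothetical features
X = rng.normal(size=(10, len(names)))

site_views = [make_unstructured_site(X, names, rng=rng) for _ in range(3)]
```

Each entry of `site_views` pairs a data matrix with the names of its columns, so a model has to rely on feature identity rather than column position.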

This project falls within the EPSRC Artificial Intelligence Technologies research area, contributing to advancements in AI-driven healthcare technologies.

Planned Impact

In the same way that bioinformatics has transformed genomic research and clinical practice, health data science will have a dramatic and lasting impact upon the broader fields of medical research, population health, and healthcare delivery. The beneficiaries of the proposed training programme, and of the research that it delivers and enables, will include academia, industry, healthcare, and the broader UK economy.

Academia: Graduates of the training programme will be well placed to start their post-doctoral careers in leading academic institutions, engaging in high-impact multi-disciplinary research, helping to build training and research capacity, and sharing their experience within the wider academic community.

Industry: Partner organisations will benefit from close collaboration with leading researchers, from the joint exploration of research priorities, and from the commercialisation of arising intellectual property. Other organisations will benefit from the availability of highly-qualified graduates with skills in big health data analytics.

Healthcare: Healthcare organisations and patients will benefit from the results of enabled and accelerated health research, leading to new treatments and technologies, and an improved ability to identify and evaluate potential improvements in practice through the analysis of real-world health data.

Economy: The life sciences sector is a key component of the UK economy. The programme will provide partner companies with direct access to leading-edge research. Graduates of the programme will be well-qualified to contribute to economic growth - supporting health research and the development of new products and services - and will be able to inform policy and decision making at organisational, regional, and national levels.

Publications


Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/S02428X/1      |              |              | 01/04/2019 | 30/09/2027 |
2722183           | Studentship  | EP/S02428X/1 | 01/10/2022 | 30/09/2026 | Pafue Nganjimi