SIEPH: Safe Information Extraction from Patient Histories

Lead Research Organisation: University of Glasgow

Department Name: School of Computing Science

Abstract

Medical records are an important resource for making discoveries about human health. Patterns of symptoms, prescriptions and other clinical events can help researchers understand why different patients respond differently to drugs and can lead to new understandings of disease. Substantial information about a patient's medical history is recorded in written clinical notes. Unfortunately, these clinical notes frequently cannot be used by medical researchers because they may contain sensitive personal information. This directly limits the applicability of natural language processing (NLP) methods, which use computers to automatically read the notes and extract information from them. This project proposes to build new methods that will safely extract the important information from clinical notes that medical researchers need to answer complex medical questions that could lead to new discoveries. To achieve this, the project will develop a novel method based on synthetic records: artificially generated medical records that resemble real records in structure and content but do not contain any sensitive information. These synthetic records can then be provided to medical researchers, who can annotate the exact type of information that they want to pull from real medical records. These annotations can be used to build a machine learning system that extracts that specific type of information from real medical records. The resulting data will be further scanned to ensure that no sensitive information is leaked to researchers, thereby providing them with the medical data they need to make medical discoveries without endangering patient privacy. The resulting technologies will enable medical researchers to ask new, complex questions of medical records where the information they need is locked in written clinical notes. We will work with the NHS Safe Havens team to evaluate this approach so that it may aid medical researchers and the NHS in the future.

Funded Value:

£202,134

Funded Period:

Mar 23 - Aug 24

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/X018237/1

Principal Investigator:

Graham McDonald

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Information & Knowledge Mgmt (100%)

Organisations

University of Glasgow (Lead Research Organisation)

People	ORCID iD
Graham McDonald (Principal Investigator)
Iadh Ounis (Co-Investigator)
Jake Lever (Co-Investigator)

Publications

Frayling E (2024) UoG Siephers at "Discharge Me!": Exploring Ways to Generate Synthetic Patient Notes From Multi-Part Electronic Health Records

Frayling E (2024) Zero-shot and Few-shot Generation Strategies for Artificial Clinical Records

Key Findings


Description	We have discovered promising new AI and Machine Learning driven techniques for improving the quality of generated synthetic patient health records, similar to the records that are made of patients' stays in hospital and their discharge summaries. Being able to generate synthetic records may help to reduce the privacy barriers associated with patient data access in clinical research. Moreover, our techniques focus on generating high-quality synthetic records without the need to train a machine learning model on large quantities of real patients' records. An overview of some of the work is available in a pre-print at https://doi.org/10.48550/arXiv.2403.08664 and in Frayling, E., Lever, J. and McDonald, G. (2024) UoG Siephers at "Discharge Me!": Exploring Ways to Generate Synthetic Patient Notes From Multi-Part Electronic Health Records. In: BioNLP 2024 Workshop, Bangkok, Thailand, 11-16 Aug 2024, pp. 712-718. In shaping our research direction, we have also explored directions that have thus far proved unsuccessful. Notably, we have investigated strategies to evaluate the faithfulness of artificially generated clinical text with respect to the contents of real patients' records. In particular, we investigated using external knowledge bases, such as the UMLS medical ontology, together with entity recognition tools to assess how well the generated artificial records reflect the true concepts of the original records by focusing on mentions of medical entities. Our findings show that ambiguities in the strengths of the relationships between the generated medical concepts and the patients' medical concepts make such an approach unsuitable as the basis for a metric capable of evaluating the quality and correctness of the generated patient records with a reasonable degree of exactness.
Exploitation Route	With respect to academic routes, we believe our work provides a good foundation for further research using Large Language Models to generate artificial unstructured clinical text. We have shown effective ways to combine the structure of clinical records with recent advances in Machine Learning architectures for clinical text generation. Moreover, our approach relies on methods that are very accessible to other researchers: we use open-source and publicly available models, and we use methods that do not require excessive computational resources and that reduce the need to access sensitive patient information to work in this area. This could potentially lead to accessible, timely and relevant new research directions for other researchers in this space. Concerning non-academic routes, our ongoing work is well placed to be taken forward: we are working with NHS Safe Havens to evaluate our approaches on real NHS data and, going forward, we aim to conduct user studies to evaluate the usefulness of our techniques for medical researchers.
Sectors	Digital/Communication/Information Technologies (including Software); Healthcare
URL	https://doi.org/10.48550/arXiv.2403.08664
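
The entity-based faithfulness check described under Key Findings can be illustrated with a minimal sketch. It is not the project's implementation: the tiny lexicon, the placeholder concept IDs (which are not real UMLS identifiers), the function names and the two example sentences are all invented for illustration, and simple phrase matching stands in for a real entity recognition tool.

```python
import re

# Hypothetical toy lexicon: surface form -> placeholder concept ID.
# These IDs are invented for illustration and are NOT real UMLS identifiers.
TOY_LEXICON = {
    "heart attack": "C_MI",
    "myocardial infarction": "C_MI",
    "chest pain": "C_CHEST_PAIN",
    "aspirin": "C_ASPIRIN",
    "hypertension": "C_HTN",
    "high blood pressure": "C_HTN",
}


def extract_concepts(text, lexicon=TOY_LEXICON):
    """Return the set of concept IDs whose surface forms appear in the text."""
    lowered = text.lower()
    found = set()
    for surface, concept_id in lexicon.items():
        # Match the whole phrase on word boundaries.
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.add(concept_id)
    return found


def concept_overlap(source_text, generated_text):
    """Exact-match precision, recall and F1 over the two concept sets."""
    source = extract_concepts(source_text)
    generated = extract_concepts(generated_text)
    shared = source & generated
    precision = len(shared) / len(generated) if generated else 0.0
    recall = len(shared) / len(source) if source else 0.0
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Invented sentences, not taken from any real record.
    source_note = "Admitted with myocardial infarction. History of hypertension. Started aspirin."
    generated_note = "Patient had a heart attack and chest pain; aspirin was given."
    print(concept_overlap(source_note, generated_note))
```

In this toy example the synonyms "heart attack" and "myocardial infarction" map to the same concept and count as a match, while "chest pain" is scored as an unsupported concept even though it is clinically related to the source diagnosis. An exact-match score has no way to express how strongly two concepts are related, which is one plausible reading of the ambiguity the findings describe.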
