Artificial Intelligence for Missing Data Imputation in Electronic Medical Records

Lead Research Organisation: University of Birmingham
Department Name: Institute of Inflammation and Ageing

Abstract

Health systems in the UK and Canada have made extensive use of Electronic Medical Records (EMR) for many years as an integral part of their operations. However, whilst digitally recorded data exist, their use as the basis of a "learning health system", whereby continuous improvements in patient experience, hospital operations, and quality of care are made by collating and examining data and evidence, has yet to be fully realised. One reason is that real-world EMR data can be very challenging to handle.

One significant contribution to these difficulties is data quality. Missing data is a particular issue, with rates of missingness of between 10% and 30% for some records. Properly addressing the missing data issue in EMR data is complicated by the fact that it can be difficult to differentiate between genuinely missing data (data that was not recorded into the system) and a non-applicable response (e.g. the test was not appropriate and therefore was not done). Data can be missing-at-random (MAR) or missing-not-at-random (MNAR); in the latter case, an underlying factor determines the missingness patterns. Certain types of missingness can therefore be "informative": if a clinician decided not to order certain tests, this indicates an implicit belief about the perceived health state of the patient. Failure to account for these sources of bias may lead to incorrect inferences.
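As a minimal illustration of how missingness can be kept as information rather than discarded, the sketch below records which values were absent in a separate indicator. It is written in Python with NumPy; the lactate values and the function name are hypothetical and are not taken from the project.

```python
import numpy as np

# Hypothetical lactate measurements; np.nan marks a test with no recorded value.
lactate = np.array([1.2, np.nan, 3.8, np.nan, 2.1])

def missingness_indicator(values):
    """Return 1 where a value is missing and 0 where it was observed."""
    return np.isnan(values).astype(int)

# The indicator can be passed to a model alongside the measurements, so the
# fact that a test was not ordered remains available as a signal.
indicator = missingness_indicator(lactate)

# Fraction of values that are missing in this hypothetical series.
missing_rate = indicator.mean()
```

An indicator of this kind cannot by itself distinguish a value that was never recorded from a test that was not applicable; that distinction would need additional information from the record.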

Artificial Intelligence technologies are seen as an important tool in unlocking the information wealth held in our electronic medical records. This project will contribute to the maturation of these technologies to account for the real-world complexities of EMR datasets. The research proposed here will develop algorithms for data imputation that seek to be more robust, reliable and generalisable. We have chosen to focus initially on automated sepsis diagnosis, a pressing area of biomedical research given that sepsis accounts for around 44,000 deaths each year in the UK alone. By applying modern machine learning approaches to large EMR datasets, we aim to tackle this problem in a unique way that could have meaningful real-world impact.

Because many AI prediction models require complete datasets as input, one popular strategy for handling missing data is "data imputation", whereby an algorithm is used to fill in missing data values. These methods vary in complexity, from simply filling in missing values with the average of the observed values over the entire dataset through to more advanced methods that attempt to elicit the underlying patterns in the data. However, many current imputation methods are designed for only certain types of EMR data (e.g. clinical time series of molecular measurements), fail to account for sources of bias, and do not provide measures of certainty about the quality of the imputed data.
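The simplest method mentioned above, filling in each missing value with the average of the observed values, can be sketched as follows. This is an illustrative Python example using NumPy; the patient records and the function name are hypothetical, and it is the baseline approach, not the method this project proposes.

```python
import numpy as np

def mean_impute(data):
    """Replace each NaN with the mean of the observed values in its column."""
    data = np.asarray(data, dtype=float)
    # nanmean ignores missing entries when averaging each column.
    column_means = np.nanmean(data, axis=0)
    # Keep observed values; substitute the column mean where a value is missing.
    return np.where(np.isnan(data), column_means, data)

# Hypothetical records: rows are patients, columns are heart rate and temperature.
records = np.array([
    [80.0, 37.0],
    [np.nan, 38.0],
    [100.0, np.nan],
])

imputed = mean_impute(records)
```

The sketch also shows the limitations noted above: every missing entry in a column receives the same single value, with no measure of uncertainty and no regard for why the value was missing.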

The overall goal of this project is to develop novel machine learning methods for missing data imputation in EMRs that account for biases and statistical uncertainty in the imputation.
