QuantiCode: Intelligent infrastructure for quantitative, coded longitudinal data

Lead Research Organisation: University of Leeds
Department Name: Sch of Computing

Abstract

This cross-disciplinary project aims to develop novel data mining and visualization tools and techniques, which will transform people's ability to analyse quantitative and coded longitudinal data. Such data are common in many sectors. For example, health data are classified using a hierarchy of hundreds of thousands of Read Codes (a thesaurus of clinical terms), with analysts needing to provide business intelligence for clinical commissioning decisions, and researchers tackling challenges such as modelling disease risk stratification. Retailers such as Sainsbury's sell 50,000+ types of products, and want to combine data from purchasing, demographic and other sources to understand behavioural phenomena such as the convenience culture, to guide investment and reduce waste.

To meet these needs, public and private sector organisations require an infrastructure that provides far more powerful analytical tools than are available today. Today's analysis tools are deficient because they (a) are crude for assessing data quality, (b) often involve analysis techniques that are designed to operate on aggregated, rather than fine-grained, data, and (c) are often laborious to use, which inhibits users from discovering important patterns.

The QuantiCode project will address these deficiencies by bringing together experts in statistics, modelling, visualization, user evaluation and ethics. The project will be based in the Leeds Institute for Data Analytics (LIDA), which houses the ESRC Consumer Data Research Centre (£5m ES/L011891/1) and the MRC Medical Bioinformatics Centre (£7m ES/L011891/1), and provides development facilities complete with high-performance computing (HPC), visualization and safe rooms for sensitive data. Our project will deliver proof-of-concept visual analytic systems, which we will evaluate with a wide variety of users drawn from our partners and researchers/external users based in LIDA.

At the outset of the project we will engage with our partners to identify analysis use cases and requirements that drive the details of our research, which is divided into four workpackages (WPs). WP1 (Data Fusion) will develop governance principles for the analysis of fine-grained data from multiple sources, implement tools to substantially reduce the effort of linking those sources, and develop new techniques to visualize completeness, concordance, plausibility, and other aspects of data quality.

WP2 (Analytical Techniques) and WP3 (Abstraction Models) are the project's technical core. WP2 will deliver a new, robust approach for modelling data as they appear naturally in health and retail data (irregularly dispersed or sampled over time), scale that approach with stochastic control to guide learning and resource usage, and develop a low-effort 'question-posing' visual interface to drastically lower the human effort of investigating data and finding patterns. WP3 focuses on data granularity, and will deliver a tool that implements a working version of the governance principles we develop in WP1, and new computational and interactive techniques for exploring abstraction spaces to create inputs suited to each aspect of analysis.

WP4 will implement the above tools and techniques in three versions of our proof of concept system, evaluating each with our partners and LIDA researchers/users. This will ensure that our solutions are compatible with, and scale to, challenging real-world data analysis problems. Success criteria will be time saved, increased analysis scope, notable insights, and tackling previously unfeasible types of analysis - all compared against a baseline provided by users' current analysis tools. We will encourage adoption via showcases, workshops and licensed installations at our partners' sites. The project's legacy will include tools that are embedded as an integral part of the LIDA infrastructure, a plan for their on-going development, and a research roadmap.

Planned Impact

Health, retail, government and other sectors routinely record data, using tens of thousands of codes to categorise information that ranges from the treatment of patients, to supermarket purchases, and where people live, work and go to school. By linking and analysing data from multiple sources, NHS Clinical Commissioning Groups (CCGs), local authorities and retailers can garner new insights to inform evidence-based planning decisions for day-to-day operations and investment. Analysing such data sets poses fundamental challenges, which we are addressing in this project by developing tools and techniques for an information infrastructure that helps users to interpret and make sense of complex data.

Through thought leadership and the engagement of stakeholders early in the research, we will clarify ethical principles for the integrated analysis of fine-grained data from multiple sources, and recommend solutions for developing ethically-compliant analysis systems that balance the rights of individuals (privacy, consent, etc.) with the value of advances in knowledge and the public good. Building an ethically-compliant system will be of interest to our partners involved in this research and their public/private sector colleagues. Government bodies and public/statutory sector organisations, whose ethical review boards need to build a more reflective understanding of the ethical implications of big data, are key to embedding consistent and well-evidenced practices for the future.

We will develop novel computational and interactive visualization tools and techniques for assessing the quality of data, performing fine-grained analysis, and exploring how complex data should (not) be simplified to reveal patterns. To ensure that our tools and techniques address real-world challenges of data analysis, at the outset of the project we will engage with our partners to define use cases, analyse requirements and pool our knowledge. The beneficiaries of all this work include: (a) data quality teams who leverage our tools in their work to develop methods so that other users are able to conduct their investigations with 'clean' data, (b) analysts who make use of our new tools or integrate software libraries containing our techniques into their own code/scripts, and (c) decision makers who bear responsibility for operations and investment. To promote early adoption, we will make our tools and techniques available for our partners to install and use on their own infrastructure, and also integrate them with the infrastructure of the Leeds Institute for Data Analytics (LIDA) for use by researchers and external users. Through this adoption, we aim for our partners and other users to become ambassadors for the project's tools. To stimulate wider adoption, we will run showcases for health, retail and general business audiences, and hands-on workshops for analysts.

The ultimate beneficiaries are the general public. For example, CCGs and local authorities analyse data to generate business intelligence for operations and investment, with the aim of providing us all with improved and more cost-effective services. Businesses similarly require business intelligence for operations and investment, which translates to jobs and other economic benefits. To bridge the gap between these indirect benefits from our project and popular interest in big data, we will conduct a range of public engagement activities, which include live demonstrations at the annual Leeds Festival of Science, a short film, an on-line tutorial about the ethics surrounding data analytics, and articles in the popular scientific press.

Publications

 
Title Visualizing the Quality of Data 
Description Data quality may be divided into two main areas - completeness and correctness. This film explains what each area involves, how we can visualize it and why it matters, using case studies from healthcare and retail. Completeness covers missing values and records. Missing values are illustrated with nine 80-field datasets, which contain a total of 150 million records of hospital episode statistics. By combining small multiples with perceptual discontinuity and semantic encoding, an analyst finds both expected and unexpected patterns in the missing values. Then, using bar charts, a heat map and data mining in an interactive visualization tool, the analyst finds a hospital that has a problem with its diagnosis codes. Missing records are first illustrated with line and area charts, to understand the origin of eight fields that were collected longitudinally from 11 health data sources. However, line charts are not always effective for visualizing missing records. An alternative is to use heatmaps, as is illustrated using data about hourly sales for 365 days in 400 supermarkets. Data may be incorrect in a variety of ways, with problems involving misleading values, outliers, plausibility, consistency, validity, special values or integrity. A misleading spike in supermarket sales is visualized using bar charts that effectively use 188 different colours, breaking normal guidelines. Misleading loyalty card data is visualized on a map. Histograms, box plots and line charts illustrate the visualization of outliers and implausible values, including a child who is 3 m tall, a 150 cm baby, and sudden changes in growth trajectories. Other problems with correctness are visualized using dot-and-whisker plots (inconsistent encryption of patient identifiers), character patterns (validity), bar charts (date values that have a special meaning, and are not outliers), and heatmaps and bar charts (integrity of maternity episodes). 
Type Of Art Film/Video/Animation 
Year Produced 2019 
Impact 1185 views so far 
URL https://www.youtube.com/watch?v=PnNMfCRWL7k&feature=youtu.be
 
Description Developed & released the first version of the ACE (Analysis of Combinations of Events) visual analytics tool to investigate patterns in missing data.
Developed fast algorithms for temporal pattern mining. These algorithms allow mining of temporal data with/without uncertainty on time stamps and with/without time restrictions put on the patterns.
Developed mathematical and numerical foundations and a Python implementation of a novel method for automatic model training under restrictive time constraints.
Exploitation Route Development of proof-of-concept software tool for risk stratification, for use by social care services in local government.
Release of open source software for optimising metaparameter choice in automated machine learning.
Adoption of tools for visualizing data quality, which were developed by the project for investigating patterns of missing values and diverse aspects of data correctness.
Sectors Digital/Communication/Information Technologies (including Software)

Healthcare

Government

Democracy and Justice

Retail
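
The kind of question that ACE is described as supporting, namely which combinations of fields are missing together, can be illustrated with a minimal sketch. This is not ACE's code or algorithm (ACE is a Java tool); the function name `missing_combinations`, the field names and the toy records below are all hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import Counter

def missing_combinations(records, fields):
    """Count how often each combination of fields is missing together."""
    counts = Counter()
    for record in records:
        # The set of fields that are missing in this record (None = missing).
        missing = frozenset(f for f in fields if record.get(f) is None)
        counts[missing] += 1
    return counts

# Hypothetical toy records; field names and codes are made up for illustration.
records = [
    {"diag_01": "J45", "opertn_01": "K40", "opdate_01": "2015-03-02"},
    {"diag_01": None, "opertn_01": "K40", "opdate_01": None},
    {"diag_01": None, "opertn_01": "K40", "opdate_01": None},
    {"diag_01": "I10", "opertn_01": None, "opdate_01": None},
]
fields = ["diag_01", "opertn_01", "opdate_01"]

combos = missing_combinations(records, fields)
for combo, n in combos.most_common():
    # An empty list means the record has no missing fields.
    print(sorted(combo), n)
```

Grouping records by the exact set of missing fields, rather than counting each field separately, is what lets an analyst see that two fields tend to be missing together instead of independently.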

 
Description QuantiCode's findings have been, or are being, used in three primary ways: 1) We developed a novel and highly-scalable set visualization tool for investigating patterns of missing values. It has been applied to NHS datasets, identifying a number of important data quality issues, allowing feedback to be given to specific hospitals to rectify some of the issues, and defining new business rules for validating future datasets provided by hospitals. 2) We investigated how set-based visualization techniques may be used to analyse customer missions, and developed that into a visual analytics workflow that combined visualization with exclusive set intersection and high utility itemset mining techniques. A detailed report about productionising the workflow has been provided to one of the project's industrial partners. 3) We investigated how robust temporal mining, model interpretation and visualization techniques may be used to identify people at risk of losing independence and moving to residential care. Results sparked interest from two local councils and led to a further 12-month impact project aiming to perfect those methods and extend them to recommend effective individual pathways of care. We have also described our approach to conducting collaborative research (with Leeds City Council), via a professionally produced YouTube film.
Sector Healthcare, Government, Democracy and Justice, Retail
Impact Types Societal

Economic

Policy & public services

 
Description Making Visualization Scalable (MAVIS) for explaining machine learning classification models
Amount £570,749 (GBP)
Funding ID EP/X029689/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 08/2023 
End 08/2026
 
Description Adult Social Care - Risk Stratification and Prevention
Amount £55,535 (GBP)
Funding ID EP/R511717/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 03/2020 
End 02/2021
 
Description Advise on the commissioning of NERC's environmental data service
Amount £53,227 (GBP)
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 02/2022 
End 08/2022
 
Description DECOVID
Amount £74,532 (GBP)
Organisation Alan Turing Institute 
Sector Academic/University
Country United Kingdom
Start 03/2020 
End 04/2021
 
Description DynAIRx: AI for Dynamic prescribing optimisation and care integration in multimorbidity
Amount £2,807,430 (GBP)
Organisation National Institute for Health Research 
Sector Public
Country United Kingdom
Start 03/2022 
End 09/2024
 
Description Impact Acceleration Account - University of Leeds 2017
Amount £1,775,630 (GBP)
Funding ID EP/R511717/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 09/2018 
End 03/2019
 
Title The ACE software, and training materials for visualizing missing data and set-type data 
Description This dataset comprises a Java software program called ACE, for visualizing missing data and set-type data. Also included in the dataset are training materials that show how ACE can be used in several data quality and general set data analysis scenarios. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://archive.researchdata.leeds.ac.uk/958/
 
Title ACE version 1.0 
Description A Set Visualization Tool to Analyze Patterns in Missing Data 
Type Of Technology Software 
Year Produced 2018 
Impact Identified the origin of a data quality problem (DIAG codes and OPERTN codes) in NHS Admitted Patient Care data, allowing feedback to be given to a specific hospital to rectify the problem. 
 
Title ACE version 2.0 
Description A Set Visualization Tool for analysing patterns of missing data and transactions data (e.g., supermarket purchases) 
Type Of Technology Software 
Year Produced 2019 
Impact Identify origin of data quality problems in NHS Admitted Patient Care data. Examples include: 1) Gaps in the 24 fields that record a patient's operations, which has implications for the NHS Payment by Results system. 2) Inconsistencies between operation fields and the corresponding 24 date fields, which affected millions of records. 3) Gaps in the 20 fields that record a patient's diagnoses, which affects the data cleaning methods used by epidemiologists. The origin pointed to a particular unit in a specific healthcare provider, allowing feedback to be given so that the problem could be rectified. 4) Developing a simple method for investigating integrity errors in maternity records about the location of a baby's delivery, allowing feedback to be given to specific healthcare providers. 5) Identifying an error in the APC Data Dictionary for OPERTN_nn fields. 
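
A check of the kind described in item 2, where an operation field and its corresponding date field disagree, might be sketched as follows. This is an illustrative assumption rather than the project's method: the helper `inconsistent_pairs` is hypothetical, the `OPDATE_nn` naming is assumed by analogy with the `OPERTN_nn` fields mentioned above, and a pair is treated as inconsistent simply when exactly one of the two values is present.

```python
def inconsistent_pairs(record, n_pairs=24):
    """Return positions where exactly one of the operation code and its date is present."""
    bad = []
    for i in range(1, n_pairs + 1):
        code = record.get(f"OPERTN_{i:02d}")
        date = record.get(f"OPDATE_{i:02d}")  # field name is an assumption
        # Inconsistent when one value is present and the other is missing.
        if (code is None) != (date is None):
            bad.append(i)
    return bad

# Hypothetical record; codes and dates are made up for illustration.
record = {
    "OPERTN_01": "K40", "OPDATE_01": "2015-03-02",  # consistent pair
    "OPERTN_02": "W37", "OPDATE_02": None,          # code without a date
    "OPDATE_03": "2015-03-05",                      # date without a code
}
flagged = inconsistent_pairs(record)
print(flagged)  # [2, 3]
```

Applied to every record, the flagged positions could then be aggregated by provider to look for the origin of a problem.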
 
Title Python set visualization package 
Description Python set visualization package 
Type Of Technology Software 
Year Produced 2023 
Impact None yet 
 
Title vizdataquality 
Description This is a Python package for visualizing data quality, and includes this six-step workflow: (1) Look at your data (is anything obviously wrong?), (2) Watch out for special values, (3) Is any data missing?, (4) Check each variable, (5) Check combinations of variables, and (6) Profile the cleaned data. 
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact None yet 
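
Steps (2) to (4) of the workflow above can be illustrated for a single variable with a minimal, standard-library sketch. It does not use or reproduce the vizdataquality API; the function `profile_variable`, the set of special values and the toy data are all hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical placeholder values that carry a special meaning (assumption).
SPECIAL_VALUES = {"1900-01-01", "9999", "-1"}

def profile_variable(values, special_values=SPECIAL_VALUES):
    """Summarise one variable: size, missing values, special values, distinct values."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "special": sum(1 for v in present if v in special_values),
        "distinct": len(set(present)),
    }

# Toy column of dates, with one missing value and a repeated special value.
dates = ["2015-03-02", None, "1900-01-01", "1900-01-01", "2016-07-19"]
profile = profile_variable(dates)
print(profile)
```

Counting special values separately from missing values matters because a placeholder such as a default date is present in the data, so it would otherwise pass a completeness check and could later be mistaken for an outlier.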
 
Description ACE demo at NHS England's Quarry House office 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact ACE demo at NHS England's Quarry House office
Year(s) Of Engagement Activity 2017
 
Description ACE demo to NHS Digital Casemix team 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact ACE demo to NHS Digital Casemix team
Year(s) Of Engagement Activity 2017
 
Description AIUK 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact AIUK presentation
Year(s) Of Engagement Activity 2021
 
Description Advisory Boards 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact Advisory Board
Year(s) Of Engagement Activity 2016,2017,2018,2019
 
Description BIG DATA: Turning Data into Value (LUBS event) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact BIG DATA: Turning Data into Value (LUBS event)
Year(s) Of Engagement Activity 2016
 
Description Data cleaning strategies 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Discussing & explaining data cleaning strategies
Year(s) Of Engagement Activity 2019
 
Description Demos & poster of QuantiCode tools 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Demos & poster of QuantiCode tools
Year(s) Of Engagement Activity 2017,2018,2019
 
Description Film: Visualizing the Quality of Data 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Visualizing the Quality of Data: Data quality may be divided into two main areas - completeness and correctness. This film explains what each area involves, how we can visualize it and why it matters, using case studies from healthcare and retail.
Year(s) Of Engagement Activity 2019,2020
URL https://www.youtube.com/watch?v=PnNMfCRWL7k
 
Description LCC 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Scenario workshop
Local Government Association - webinar
Discussion of options in predictive analysis in social care
Update on risk stratification analysis and discussion on further collaboration.
Strategies for LCC client risk stratification, Discussion on application of the methodology within the current social care protocols.
Evaluation of risk stratification results and identification of future directions
Year(s) Of Engagement Activity 2016,2017,2018,2019
 
Description LIDA AGM 2017 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact LIDA AGM presentation with LCC
Year(s) Of Engagement Activity 2017
 
Description Leeds Data Ethics Roundtable 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Leeds Data Ethics Roundtable (DLA Piper)
Year(s) Of Engagement Activity 2018,2019
 
Description Leeds University, Leeds City Council Review of Collaborations video 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Production of a film about academic/local council collaboration in research.
Year(s) Of Engagement Activity 2021
URL https://www.youtube.com/watch?v=JTeoR0HoPys
 
Description Provided material for Research Data Science course 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Provided material about Visualizing the Quality of Data for course.
Year(s) Of Engagement Activity 2021
URL https://tinyurl.com/VizDataQuality
 
Description Sainsbury's 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact Scenario workshop
Customer missions meetings
Year(s) Of Engagement Activity 2016,2018,2019
 
Description UoL Open Days 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact Open Day demos of ACE
Year(s) Of Engagement Activity 2018,2019,2020
 
Description Workshop on Missing Data and the ACE Tool 2018 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Workshop on Missing Data and the ACE Tool
Year(s) Of Engagement Activity 2018