QuantiCode: Intelligent infrastructure for quantitative, coded longitudinal data

Lead Research Organisation: University of Leeds

Department Name: School of Computing

Abstract

This cross-disciplinary project aims to develop novel data mining and visualization tools and techniques, which will transform people's ability to analyse quantitative and coded longitudinal data. Such data are common in many sectors. For example, health data are classified using a hierarchy of hundreds of thousands of Read Codes (a thesaurus of clinical terms), with analysts needing to provide business intelligence for clinical commissioning decisions, and researchers tackling challenges such as modelling disease risk stratification. Retailers such as Sainsbury's sell 50,000+ types of products, and want to combine data from purchasing, demographic and other sources to understand behavioural phenomena such as the convenience culture, to guide investment and reduce waste.

To meet these needs, public and private sector organisations require an infrastructure that provides far more powerful analytical tools than are available today. Today's analysis tools are deficient because they (a) are crude for assessing data quality, (b) often involve analysis techniques that are designed to operate on aggregated, rather than fine-grained, data, and (c) are often laborious to use, which inhibits users from discovering important patterns.

The QuantiCode project will address these deficiencies by bringing together experts in statistics, modelling, visualization, user evaluation and ethics. The project will be based in the Leeds Institute for Data Analytics (LIDA), which houses the ESRC Consumer Data Research Centre (£5m ES/L011891/1) and the MRC Medical Bioinformatics Centre (£7m ES/L011891/1), and provides development facilities complete with high-performance computing (HPC), visualization and safe rooms for sensitive data. Our project will deliver proof of concept visual analytic systems, which we will evaluate with a wide variety of users drawn from our partners and researchers/external users based in LIDA.

At the outset of the project we will engage with our partners to identify analysis use cases and requirements that drive the details of our research, which is divided into four workpackages (WPs). WP1 (Data Fusion) will develop governance principles for the analysis of fine-grained data from multiple sources, implement tools to substantially reduce the effort of linking those sources, and develop new techniques to visualize completeness, concordance, plausibility, and other aspects of data quality.

WP2 (Analytical Techniques) and WP3 (Abstraction Models) are the project's technical core. WP2 will deliver a new, robust approach for modelling data as they appear naturally in health and retail data (irregularly dispersed or sampled over time), scale that approach with stochastic control to guide learning and resource usage, and develop a low-effort 'question-posing' visual interface to drastically lower the human effort of investigating data and finding patterns. WP3 focuses on data granularity, and will deliver a tool that implements a working version of the governance principles we develop in WP1, and new computational and interactive techniques for exploring abstraction spaces to create inputs suited to each aspect of analysis.

WP4 will implement the above tools and techniques in three versions of our proof of concept system, evaluating each with our partners and LIDA researchers/users. This will ensure that our solutions are compatible with, and scale to, challenging real-world data analysis problems. Success criteria will be time saved, increased analysis scope, notable insights, and tackling previously unfeasible types of analysis - all compared against a baseline provided by users' current analysis tools. We will encourage adoption via showcases, workshops and licensed installations at our partners' sites. The project's legacy will include tools that are embedded as an integral part of the LIDA infrastructure, a plan for their on-going development, and a research roadmap.

Planned Impact

Health, retail, government and other sectors routinely record data, using tens of thousands of codes to categorise information that ranges from the treatment of patients, to supermarket purchases, and where people live, work and go to school. By linking and analysing data from multiple sources, NHS Clinical Commissioning Groups (CCGs), local authorities and retailers can garner new insights to inform evidence-based planning decisions for day-to-day operations and investment. Analysing such data sets poses fundamental challenges, which we are addressing in this project by developing tools and techniques for an information infrastructure that helps users to interpret and make sense of complex data.

Through thought leadership and the engagement of stakeholders early in the research, we will clarify ethical principles for the integrated analysis of fine-grained data from multiple sources, and recommend solutions for developing ethically-compliant analysis systems that balance the rights of individuals (privacy, consent, etc.) with the value of advances in knowledge and the public good. Building an ethically-compliant system will be of interest to our partners involved in this research and their public/private sector colleagues. Government bodies and public/statutory sector organisations, whose ethical review boards need to build a more reflective understanding of the ethical implications of big data, are key to embedding consistent and well-evidenced practices for the future.

We will develop novel computational and interactive visualization tools and techniques for assessing the quality of data, performing fine-grained analysis, and exploring how complex data should (not) be simplified to reveal patterns. To ensure that our tools and techniques address real-world challenges of data analysis, at the outset of the project we will engage with our partners to define use cases, analyse requirements and pool our knowledge. The beneficiaries of all this work include: (a) data quality teams who leverage our tools in their work to develop methods so that other users are able to conduct their investigations with 'clean' data, (b) analysts who make use of our new tools or integrate software libraries containing our techniques into their own code/scripts, and (c) decision makers who bear responsibility for operations and investment. To promote early adoption, we will make our tools and techniques available for our partners to install and use on their own infrastructure, and also integrate them with the infrastructure of the Leeds Institute for Data Analytics (LIDA) for use by researchers and external users. Through this adoption, we aim for our partners and other users to become ambassadors for the project's tools. To stimulate wider adoption, we will run showcases for health, retail and general business audiences, and hands-on workshops for analysts.

The ultimate beneficiaries are the general public. For example, CCGs and local authorities analyse data to generate business intelligence for operations and investment, with the aim of providing us all with improved and more cost-effective services. Businesses similarly require business intelligence for operations and investment, which translates to jobs and other economic benefits. To bridge the gap between these indirect benefits from our project and popular interest in big data, we will conduct a range of public engagement activities, which include live demonstrations at the annual Leeds Festival of Science, a short film, an on-line tutorial about the ethics surrounding data analytics, and articles in the popular scientific press.

Funded Value:

£977,832

Funded Period:

Mar 16 - Feb 20

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/N013980/1

Principal Investigator:

Roy Ruddle

Research Subject:

Info. & commun. Technol. (65%)

Mathematical sciences (35%)

Research Topic:

Computer Graphics & Visual. (30%)

Information & Knowledge Mgmt (35%)

Statistics & Appl. Probability (35%)

Organisations

People	ORCID iD
Roy Ruddle (Principal Investigator)
J Keen (Co-Investigator)
Alexander Markham (Co-Investigator)
Jan Palczewski (Co-Investigator)
Chris Megone (Co-Investigator)
Mark Birkin (Co-Investigator)	http://orcid.org/0000-0001-5991-098X
Georgios Aivaliotis (Co-Investigator)
Kevin Macnish (Researcher)

Publications

Author Name

Title Publication Date Published

Adnan M (2019) Visual Analytics of Event Data using Multiple Mining Methods

Adnan M. (2018) A Set-based Visual Analytics Approach to Analyze Retail Data in International Workshop on Visual Analytics

Adnan, M (2019) Visual analytics of event data using multiple mining methods.

Adnan, M (2018) A set-based visual analytics approach to analyze retail data

Aivaliotis G (2021) A comparison of time to event analysis methods, using weight status and breast cancer as a case study. in Scientific reports

Aivaliotis G (2018) An HJB approach to a general continuous-time mean-variance stochastic control problem in Random Operators and Stochastic Equations

Cironis L (2021) Automatic model training under restrictive time constraints

Cironis L (2022) Automatic model training under restrictive time constraints in Statistics and Computing

Cironis, L (2021) Automatic Model Training under Restrictive Time Constraints in arXiv

Hall M (2019) Using Miniature Visualizations of Descriptive Statistics to Investigate the Quality of Electronic Health Records

Artistic and Creative Products
Key Findings
Impact Summary
Further Funding
Research Databases and Models
Software and Technical Products
Engagement Activities


Title	Visualizing the Quality of Data
Description	Data quality may be divided into two main areas - completeness and correctness. This film explains what each area involves, how we can visualize it and why it matters, using case studies from healthcare and retail. Completeness covers missing values and records. Missing values are illustrated with nine 80-field datasets, which contain a total of 150 million records of hospital episode statistics. By combining small multiples with perceptual discontinuity and semantic encoding, an analyst finds both expected and unexpected patterns in the missing values. Then, using bar charts, a heatmap and data mining in an interactive visualization tool, the analyst finds a hospital that has a problem with its diagnosis codes. Missing records are first illustrated with line and area charts, to understand the origin of eight fields that were collected longitudinally from 11 health data sources. However, line charts are not always effective for visualizing missing records. An alternative is to use heatmaps, as is illustrated using data about hourly sales for 365 days in 400 supermarkets. Correctness covers a variety of issues, including misleading data, outliers, plausibility, consistency, validity, special values and integrity. A misleading spike in supermarket sales is visualized using bar charts that break normal guidelines by using 188 different colours, and do so effectively. Misleading loyalty card data is visualized on a map. Histograms, box plots and line charts illustrate the visualization of outliers and implausible values, including a child who is 3 m tall, a 150 cm baby, and sudden changes in growth trajectories. Other problems with correctness are visualized using dot-and-whisker plots (inconsistent encryption of patient identifiers), character patterns (validity), bar charts (date values that have a special meaning, and are not outliers), and heatmaps and bar charts (integrity of maternity episodes).
Type Of Art	Film/Video/Animation
Year Produced	2019
Impact	1185 views so far
URL	https://www.youtube.com/watch?v=PnNMfCRWL7k&feature=youtu.be
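
The plausibility checks mentioned in the film description (a child who is 3 m tall, a 150 cm baby) can be illustrated with a small range check. This is only a sketch: the age bands, height limits, record layout and the function name height_is_plausible are assumptions made for illustration, and they are neither clinical guidance nor the project's method.

```python
# Illustrative plausibility limits in centimetres, keyed by an upper age bound
# in years. These numbers are assumptions for the sketch, not clinical guidance.
PLAUSIBLE_HEIGHT_CM = [
    (1, 40.0, 90.0),      # under 1 year
    (12, 60.0, 180.0),    # 1 to 11 years
    (200, 100.0, 230.0),  # 12 years and over
]


def height_is_plausible(age_years, height_cm):
    """Return True if the height lies inside the range for the age band."""
    for max_age, low, high in PLAUSIBLE_HEIGHT_CM:
        if age_years < max_age:
            return low <= height_cm <= high
    return False  # age outside every band


# Hypothetical measurements, echoing the examples in the description.
measurements = [
    {"id": "a", "age_years": 0.5, "height_cm": 150.0},  # a 150 cm baby
    {"id": "b", "age_years": 8, "height_cm": 300.0},    # a child who is 3 m tall
    {"id": "c", "age_years": 8, "height_cm": 128.0},    # unremarkable
]

flagged = [
    m["id"]
    for m in measurements
    if not height_is_plausible(m["age_years"], m["height_cm"])
]
print(flagged)
```

Flagged records would then be candidates for a histogram or box plot, rather than being deleted automatically.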


Description	Developed & released the first version of ACE (Analysis of Combinations of Events) visual analytics tool to investigate patterns in missing data. Developed fast algorithms for temporal pattern mining. These algorithms allow mining of temporal data with/without uncertainty on time stamps and with/without time restrictions put on the patterns. Developed mathematical and numerical foundations and a Python implementation of a novel method for automatic model training under restrictive time constraints.
Exploitation Route	Development of proof-of-concept software tool for risk stratification, for use by social care services in local government. Release of open source software for optimising metaparameter choice in automated machine learning. Adoption of tools for visualizing data quality, which were developed by the project for investigating patterns of missing values and diverse aspects of data correctness.
Sectors	Digital/Communication/Information Technologies (including Software) Healthcare Government Democracy and Justice Retail


Description	QuantiCode's findings have been, or are being, used in three primary ways: 1) We developed a novel and highly-scalable set visualization tool for investigating patterns of missing values. It has been applied to NHS datasets, identifying a number of important data quality issues, allowing feedback to be given to a specific hospital to rectify some of the issues, and defining new business rules for validating future datasets provided by hospitals. 2) We investigated how set-based visualization techniques may be used to analyse customer missions, and developed that into a visual analytics workflow that combined visualization with exclusive set intersection and high utility itemset mining techniques. A detailed report about productionising the workflow has been provided to one of the project's industrial partners. 3) We investigated how robust temporal mining, model interpretation and visualization techniques may be used to identify people at risk of losing independence and moving to residential care. Results sparked interest from two local councils and led to a further 12-month impact project aiming to perfect those methods and extend them to recommend effective individual pathways of care. We have also described our approach to conducting collaborative research (with Leeds City Council), via a professionally produced YouTube film.
Sector	Healthcare; Government, Democracy and Justice; Retail
Impact Types	Societal Economic Policy & public services
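
The exclusive set intersections mentioned above can be sketched for missing values as follows: each record is assigned to the exact combination of fields it is missing, and the combinations are counted. This is a minimal illustration and not the ACE implementation; the field names, the sample records and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical records: None marks a missing value. Field names are
# illustrative and are not taken from any NHS dataset.
records = [
    {"diag_01": "J45", "opertn_01": None, "opdate_01": None},
    {"diag_01": None, "opertn_01": None, "opdate_01": None},
    {"diag_01": "I10", "opertn_01": "W37", "opdate_01": None},
    {"diag_01": "E11", "opertn_01": None, "opdate_01": None},
    {"diag_01": "K35", "opertn_01": "H01", "opdate_01": "2015-03-02"},
]


def missing_pattern(record):
    """Return the exact set of fields that are missing in one record."""
    return frozenset(name for name, value in record.items() if value is None)


def exclusive_intersections(rows):
    """Count records per exact combination of missing fields.

    Each record is counted once, under the combination it matches exactly,
    so the counts sum to the number of records.
    """
    return Counter(missing_pattern(row) for row in rows)


counts = exclusive_intersections(records)
for pattern, n in counts.most_common():
    label = ", ".join(sorted(pattern)) or "(nothing missing)"
    print(f"{n} record(s) missing exactly: {label}")
```

Because the intersections are exclusive, a record missing all three fields is not also counted under the two-field combination, which keeps the counts easy to interpret in a bar chart.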


Description	Making Visualization Scalable (MAVIS) for explaining machine learning classification models
Amount	£570,749 (GBP)
Funding ID	EP/X029689/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	08/2023
End	08/2026


Description	Advise on the commissioning of NERC's environmental data service
Amount	£53,227 (GBP)
Organisation	Natural Environment Research Council
Sector	Public
Country	United Kingdom
Start	02/2022
End	08/2022


Description	DECOVID
Amount	£74,532 (GBP)
Organisation	Alan Turing Institute
Sector	Academic/University
Country	United Kingdom
Start	03/2020
End	04/2021


Description	DynAIRx: AI for Dynamic prescribing optimisation and care integration in multimorbidity
Amount	£2,807,430 (GBP)
Organisation	National Institute for Health and Care Research
Sector	Public
Country	United Kingdom
Start	03/2022
End	09/2024


Title	The ACE software, and training materials for visualizing missing data and set-type data
Description	This dataset comprises a Java software program called ACE, for visualizing missing data and set-type data. Also included in the dataset are training materials that show how ACE can be used in several data quality and general set data analysis scenarios.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://archive.researchdata.leeds.ac.uk/958/


Title	The Effect of Alignment on People's Ability to Judge Event Sequence Similarity
Description	This dataset is from investigations into local and global alignment methods for visualizing event sequence data. Some of the data are sequences generated with a range of parameters, which are described in the paper. The other data comprise training material, screenshots of example trials, and results from a user experiment.
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
URL	https://archive.researchdata.leeds.ac.uk/786/


Title	ACE version 1.0
Description	A Set Visualization Tool to Analyze Patterns in Missing Data
Type Of Technology	Software
Year Produced	2018
Impact	Identified the origin of a data quality problem (DIAG codes and OPERTN codes) in NHS Admitted Patient Care data, allowing feedback to be given to a specific hospital to rectify the problem.


Title	ACE version 2.0
Description	A Set Visualization Tool for analysing patterns of missing data and transaction data (e.g., supermarket purchases)
Type Of Technology	Software
Year Produced	2019
Impact	Identified the origin of data quality problems in NHS Admitted Patient Care data. Examples include: 1) Gaps in the 24 fields that record a patient's operations, which have implications for the NHS Payment by Results system. 2) Inconsistencies between operation fields and the corresponding 24 date fields, which affected millions of records. 3) Gaps in the 20 fields that record a patient's diagnoses, which affect the data cleaning methods used by epidemiologists. The origin pointed to a particular unit in a specific healthcare provider, allowing feedback to be given so that the problem could be rectified. 4) Developing a simple method for investigating integrity errors in maternity records about the location of a baby's delivery, allowing feedback to be given to specific healthcare providers. 5) Identifying an error in the APC Data Dictionary for OPERTN_nn fields.


Title	Python set visualization package
Description	Python set visualization package
Type Of Technology	Software
Year Produced	2023
Impact	None yet


Title	vizdataquality
Description	This is a Python package for visualizing data quality, and includes this six-step workflow: (1) Look at your data (is anything obviously wrong?), (2) Watch out for special values, (3) Is any data missing?, (4) Check each variable, (5) Check combinations of variables, and (6) Profile the cleaned data.
Type Of Technology	Software
Year Produced	2024
Open Source License?	Yes
Impact	None yet
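
Steps (2) to (4) of the workflow above can be illustrated in plain Python, as below. This sketch deliberately does not use the vizdataquality API, which is not described in this record; the rows, the sentinel values in SPECIAL_VALUES and the function name profile are all hypothetical.

```python
from collections import Counter

# Hypothetical rows and sentinel values; neither comes from the package.
rows = [
    {"admission_date": "2019-01-04", "age": 34},
    {"admission_date": "1800-01-01", "age": 999},
    {"admission_date": None, "age": 51},
    {"admission_date": "2019-02-11", "age": None},
]
SPECIAL_VALUES = {"admission_date": {"1800-01-01"}, "age": {999}}


def profile(rows, special_values):
    """Per variable, count missing, special and ordinary values."""
    summary = {}
    for row in rows:
        for name, value in row.items():
            counts = summary.setdefault(name, Counter())
            if value is None:
                counts["missing"] += 1      # step 3: is any data missing?
            elif value in special_values.get(name, set()):
                counts["special"] += 1      # step 2: watch out for special values
            else:
                counts["ordinary"] += 1     # left for the step 4 per-variable checks
    return summary


summary = profile(rows, SPECIAL_VALUES)
for name, counts in summary.items():
    print(name, dict(counts))
```

Separating special values from missing ones before any per-variable check avoids treating a sentinel such as 999 as a genuine outlier.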


Title	vizdataquality 1.1.2
Description	Released 4 new versions of this software, which is a Python package for visualizing data quality, and includes this six-step workflow: (1) Look at your data (is anything obviously wrong?), (2) Watch out for special values, (3) Is any data missing?, (4) Check each variable, (5) Check combinations of variables, and (6) Profile the cleaned data.
Type Of Technology	Software
Year Produced	2024
Open Source License?	Yes
Impact	None yet


Description	ACE demo at NHS England's Quarry House office
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	ACE demo at NHS England's Quarry House office
Year(s) Of Engagement Activity	2017


Description	ACE demo to NHS Digital Casemix team
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	ACE demo to NHS Digital Casemix team
Year(s) Of Engagement Activity	2017


Description	AIUK
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	AIUK presentation
Year(s) Of Engagement Activity	2021


Description	Advisory Boards
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Industry/Business
Results and Impact	Advisory Board
Year(s) Of Engagement Activity	2016,2017,2018,2019


Description	BIG DATA: Turning Data into Value (LUBS event)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	BIG DATA: Turning Data into Value (LUBS event)
Year(s) Of Engagement Activity	2016


Description	Data cleaning strategies
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	Discussing & explaining data cleaning strategies
Year(s) Of Engagement Activity	2019


Description	Demos & poster of QuantiCode tools
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	Demos & poster of QuantiCode tools
Year(s) Of Engagement Activity	2017,2018,2019


Description	Film: Visualizing the Quality of Data
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Visualizing the Quality of Data: Data quality may be divided into two main areas - completeness and correctness. This film explains what each area involves, how we can visualize it and why it matters, using case studies from healthcare and retail.
Year(s) Of Engagement Activity	2019,2020
URL	https://www.youtube.com/watch?v=PnNMfCRWL7k


Description	LCC
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	Scenario workshop. Local Government Association webinar. Discussion of options in predictive analysis in social care. Update on risk stratification analysis and discussion of further collaboration. Strategies for LCC client risk stratification. Discussion of the application of the methodology within the current social care protocols. Evaluation of risk stratification results and identification of future directions.
Year(s) Of Engagement Activity	2016,2017,2018,2019


Description	LIDA AGM 2017
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	LIDA AGM presentation with LCC
Year(s) Of Engagement Activity	2017


Description	Leeds Data Ethics Roundtable
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	Leeds Data Ethics Roundtable (DLA Piper)
Year(s) Of Engagement Activity	2018,2019


Description	Leeds University, Leeds City Council Review of Collaborations video
Form Of Engagement Activity	Engagement focused website, blog or social media channel
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Public/other audiences
Results and Impact	Production of a film about academic/local council collaboration in research.
Year(s) Of Engagement Activity	2021
URL	https://www.youtube.com/watch?v=JTeoR0HoPys


Description	Provided material for Research Data Science course
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	Provided material about Visualizing the Quality of Data for course.
Year(s) Of Engagement Activity	2021
URL	https://tinyurl.com/VizDataQuality


Description	Sainsbury's
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Industry/Business
Results and Impact	Scenario workshop; customer missions meetings
Year(s) Of Engagement Activity	2016,2018,2019


Description	UoL Open Days
Form Of Engagement Activity	Participation in an open day or visit at my research institution
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Public/other audiences
Results and Impact	Open Day demos of ACE
Year(s) Of Engagement Activity	2018,2019,2020


Description	Workshop on Missing Data and the ACE Tool 2018
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	Workshop on Missing Data and the ACE Tool
Year(s) Of Engagement Activity	2018
