QuantiCode: Intelligent infrastructure for quantitative, coded longitudinal data
Lead Research Organisation:
University of Leeds
Department Name: Sch of Computing
Abstract
This cross-disciplinary project aims to develop novel data mining and visualization tools and techniques, which will transform people's ability to analyse quantitative and coded longitudinal data. Such data are common in many sectors. For example, health data are classified using a hierarchy of hundreds of thousands of Read Codes (a thesaurus of clinical terms), with analysts needing to provide business intelligence for clinical commissioning decisions, and researchers tackling challenges such as modelling disease risk stratification. Retailers such as Sainsbury's sell 50,000+ types of products, and want to combine data from purchasing, demographic and other sources to understand behavioural phenomena such as the convenience culture, to guide investment and reduce waste.
To meet these needs, public and private sector organisations require an infrastructure that provides far more powerful analytical tools than are available today. Today's analysis tools are deficient because they (a) are crude for assessing data quality, (b) often involve analysis techniques that are designed to operate on aggregated, rather than fine-grained, data, and (c) are often laborious to use, which inhibits users from discovering important patterns.
The QuantiCode project will address these deficiencies by bringing together experts in statistics, modelling, visualization, user evaluation and ethics. The project will be based in the Leeds Institute for Data Analytics (LIDA), which houses the ESRC Consumer Data Research Centre (£5m ES/L011891/1) and the MRC Medical Bioinformatics Centre (£7m ES/L011891/1), and provides development facilities complete with high-performance computing (HPC), visualization and safe rooms for sensitive data. Our project will deliver proof-of-concept visual analytic systems, which we will evaluate with a wide variety of users drawn from our partners and researchers/external users based in LIDA.
At the outset of the project we will engage with our partners to identify analysis use cases and requirements that drive the details of our research, which is divided into four workpackages (WPs). WP1 (Data Fusion) will develop governance principles for the analysis of fine-grained data from multiple sources, implement tools to substantially reduce the effort of linking those sources, and develop new techniques to visualize completeness, concordance, plausibility, and other aspects of data quality.
WP2 (Analytical Techniques) and WP3 (Abstraction Models) are the project's technical core. WP2 will deliver a new, robust approach for modelling data as they appear naturally in health and retail data (irregularly dispersed or sampled over time), scale that approach with stochastic control to guide learning and resource usage, and develop a low-effort 'question-posing' visual interface to drastically lower the human effort of investigating data and finding patterns. WP3 focuses on data granularity, and will deliver a tool that implements a working version of the governance principles we develop in WP1, together with new computational and interactive techniques for exploring abstraction spaces to create inputs suited to each aspect of analysis.
WP4 will implement the above tools and techniques in three versions of our proof of concept system, evaluating each with our partners and LIDA researchers/users. This will ensure that our solutions are compatible with, and scale to, challenging real-world data analysis problems. Success criteria will be time saved, increased analysis scope, notable insights, and tackling previously unfeasible types of analysis - all compared against a baseline provided by users' current analysis tools. We will encourage adoption via showcases, workshops and licensed installations at our partners' sites. The project's legacy will include tools that are embedded as an integral part of the LIDA infrastructure, a plan for their on-going development, and a research roadmap.
Planned Impact
Health, retail, government and other sectors routinely record data, using tens of thousands of codes to categorise information that ranges from the treatment of patients, to supermarket purchases, and where people live, work and go to school. By linking and analysing data from multiple sources, NHS Clinical Commissioning Groups (CCGs), local authorities and retailers can garner new insights to inform evidence-based planning decisions for day-to-day operations and investment. Analysing such data sets poses fundamental challenges, which we are addressing in this project by developing tools and techniques for an information infrastructure that helps users to interpret and make sense of complex data.
Through thought leadership and the engagement of stakeholders early in the research, we will clarify ethical principles for the integrated analysis of fine-grained data from multiple sources, and recommend solutions for developing ethically-compliant analysis systems that balance the rights of individuals (privacy, consent, etc.) with the value of advances in knowledge and the public good. Building an ethically-compliant system will be of interest to our partners involved in this research and their public/private sector colleagues. Government bodies and public/statutory sector organisations, whose ethical review boards need to build a more reflective understanding of the ethical implications of big data, are key to embedding consistent and well-evidenced practices for the future.
We will develop novel computational and interactive visualization tools and techniques for assessing the quality of data, performing fine-grained analysis, and exploring how complex data should (not) be simplified to reveal patterns. To ensure that our tools and techniques address real-world challenges of data analysis, at the outset of the project we will engage with our partners to define use cases, analyse requirements and pool our knowledge. The beneficiaries of all this work include: (a) data quality teams who leverage our tools in their work to develop methods so that other users are able to conduct their investigations with 'clean' data, (b) analysts who make use of our new tools or integrate software libraries containing our techniques into their own code/scripts, and (c) decision makers who bear responsibility for operations and investment. To promote early adoption, we will make our tools and techniques available for our partners to install and use on their own infrastructure, and also integrate them with the infrastructure of the Leeds Institute for Data Analytics (LIDA) for use by researchers and external users. Through this adoption, we aim for our partners and other users to become ambassadors for the project's tools. To stimulate wider adoption, we will run showcases for health, retail and general business audiences, and hands-on workshops for analysts.
The ultimate beneficiaries are the general public. For example, CCGs and local authorities analyse data to generate business intelligence for operations and investment, with the aim of providing us all with improved and more cost-effective services. Businesses similarly require business intelligence for operations and investment, which translates to jobs and other economic benefits. To bridge the gap between these indirect benefits from our project and popular interest in big data, we will conduct a range of public engagement activities, which include live demonstrations at the annual Leeds Festival of Science, a short film, an on-line tutorial about the ethics surrounding data analytics, and articles in the popular scientific press.
Organisations
Publications
Adnan M. (2019) Visual Analytics of Event Data using Multiple Mining Methods, in International Workshop on Visual Analytics
Adnan M. (2018) A Set-based Visual Analytics Approach to Analyze Retail Data, in International Workshop on Visual Analytics
Adnan, M (2018) A set-based visual analytics approach to analyze retail data
Adnan, M (2019) Visual analytics of event data using multiple mining methods
Aivaliotis G (2021) A comparison of time to event analysis methods, using weight status and breast cancer as a case study, in Scientific reports
Aivaliotis G (2018) An HJB approach to a general continuous-time mean-variance stochastic control problem, in Random Operators and Stochastic Equations
Cironis L (2022) Automatic model training under restrictive time constraints, in Statistics and Computing
Cironis L (2021) Automatic model training under restrictive time constraints
Cironis, L (2021) Automatic Model Training under Restrictive Time Constraints, in arXiv
Palczewska, A (2017) RobustSPAM for inference from noisy longitudinal data and preservation of privacy
Title | Visualizing the Quality of Data |
Description | Data quality may be divided into two main areas - completeness and correctness. This film explains what each area involves, how we can visualize it and why it matters, using case studies from healthcare and retail. Completeness covers missing values and records. Missing values are illustrated with nine 80-field datasets, which contain a total of 150 million records of hospital episode statistics. By combining small multiples with perceptual discontinuity and semantic encoding, an analyst finds both expected and unexpected patterns in the missing values. Then, using bar charts, a heat map and data mining in an interactive visualization tool, the analyst finds a hospital that has a problem with its diagnosis codes. Missing records are first illustrated with line and area charts, to understand the origin of eight fields that were collected longitudinally from 11 health data sources. However, line charts are not always effective for visualizing missing records. An alternative is to use heatmaps, as is illustrated using data about hourly sales for 365 days in 400 supermarkets. Data may be incorrect in a variety of ways, which involve misleading values, outliers, plausibility, consistency, validity, special values or integrity. A misleading spike in supermarket sales is visualized using bar charts that effectively use 188 different colours, breaking normal guidelines. Misleading loyalty card data is visualized on a map. Histograms, box plots and line charts illustrate the visualization of outliers and implausible values, including a child who is 3 m tall, a 150 cm baby, and sudden changes in growth trajectories. Other problems with correctness are visualized using dot-and-whisker plots (inconsistent encryption of patient identifiers), character patterns (validity), bar charts (date values that have a special meaning, and are not outliers), and heatmaps and bar charts (integrity of maternity episodes). |
Type Of Art | Film/Video/Animation |
Year Produced | 2019 |
Impact | 1185 views so far |
URL | https://www.youtube.com/watch?v=PnNMfCRWL7k&feature=youtu.be |
Description | Developed & released the first version of ACE (Analysis of Combinations of Events) visual analytics tool to investigate patterns in missing data. Developed fast algorithms for temporal pattern mining. These algorithms allow mining of temporal data with/without uncertainty on time stamps and with/without time restrictions put on the patterns. Developed mathematical and numerical foundations and a Python implementation of a novel method for automatic model training under restrictive time constraints. |
Exploitation Route | Development of proof-of-concept software tool for risk stratification, for use by social care services in local government. Release of open source software for optimising metaparameter choice in automated machine learning. Adoption of tools for visualizing data quality, which were developed by the project for investigating patterns of missing values and diverse aspects of data correctness. |
Sectors | Digital/Communication/Information Technologies (including Software) Healthcare Government Democracy and Justice Retail |
Description | QuantiCode's findings have been, or are being, used in three primary ways: 1) We developed a novel and highly-scalable set visualization tool for investigating patterns of missing values. It has been applied to NHS datasets, identifying a number of important data quality issues, allowing feedback to be given to a specific hospital to rectify some of the issues, and defining new business rules for validating future datasets provided by hospitals. 2) We investigated how set-based visualization techniques may be used to analyse customer missions, and developed that into a visual analytics workflow that combined visualization with exclusive set intersection and high utility itemset mining techniques. A detailed report about productionising the workflow has been provided to one of the project's industrial partners. 3) We investigated how robust temporal mining, model interpretation and visualization techniques may be used to identify people at risk of losing independence and moving to residential care. Results sparked interest from two local councils and led to a further 12-month impact project aiming to perfect those methods and extend them to recommend effective individual pathways of care. We have also described our approach to conducting collaborative research (with Leeds City Council), via a professionally produced YouTube film. |
Sector | Healthcare,Government, Democracy and Justice,Retail |
Impact Types | Societal Economic Policy & public services |
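The exclusive set intersection idea mentioned in this narrative can be illustrated with a minimal sketch. It is not the ACE implementation: it assumes each record is a dictionary in which a missing value is None, and the field names and toy records are hypothetical. Each record is assigned to exactly one group, defined by the precise combination of fields that are missing, and the groups are then counted.

```python
from collections import Counter


def exclusive_missing_patterns(records, fields):
    """Count records by the exact combination of fields that are missing.

    Each record falls into exactly one group (an exclusive intersection),
    so the group counts sum to the number of records.
    """
    patterns = Counter()
    for record in records:
        # A field counts as missing if it is absent or set to None.
        missing = frozenset(f for f in fields if record.get(f) is None)
        patterns[missing] += 1
    return patterns


# Hypothetical toy records; real field names and values will differ.
fields = ["diag_01", "opertn_01", "opdate_01"]
records = [
    {"diag_01": "J45", "opertn_01": "W37", "opdate_01": "2015-03-02"},
    {"diag_01": None, "opertn_01": "W37", "opdate_01": None},
    {"diag_01": None, "opertn_01": "W37", "opdate_01": None},
    {"diag_01": "I10", "opertn_01": None, "opdate_01": None},
]

patterns = exclusive_missing_patterns(records, fields)
for combination, count in patterns.most_common():
    print(sorted(combination), count)
```

Grouping by the exact combination, rather than counting missing values per field, is what lets an analyst see that two fields tend to be missing together.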
Description | Making Visualization Scalable (MAVIS) for explaining machine learning classification models |
Amount | £570,749 (GBP) |
Funding ID | EP/X029689/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 08/2023 |
End | 08/2026 |
Description | Adult Social Care - Risk Stratification and Prevention |
Amount | £55,535 (GBP) |
Funding ID | EP/R511717/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2020 |
End | 02/2021 |
Description | Advise on the commissioning of NERC's environmental data service |
Amount | £53,227 (GBP) |
Organisation | Natural Environment Research Council |
Sector | Public |
Country | United Kingdom |
Start | 02/2022 |
End | 08/2022 |
Description | DECOVID |
Amount | £74,532 (GBP) |
Organisation | Alan Turing Institute |
Sector | Academic/University |
Country | United Kingdom |
Start | 03/2020 |
End | 04/2021 |
Description | DynAIRx: AI for Dynamic prescribing optimisation and care integration in multimorbidity |
Amount | £2,807,430 (GBP) |
Organisation | National Institute for Health Research |
Sector | Public |
Country | United Kingdom |
Start | 03/2022 |
End | 09/2024 |
Description | Impact Acceleration Account - University of Leeds 2017 |
Amount | £1,775,630 (GBP) |
Funding ID | EP/R511717/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 09/2018 |
End | 03/2019 |
Title | The ACE software, and training materials for visualizing missing data and set-type data |
Description | This dataset comprises a Java software program called ACE, for visualizing missing data and set-type data. Also included in the dataset are training materials that show how ACE can be used in several data quality and general set data analysis scenarios. |
Type Of Material | Database/Collection of data |
Year Produced | 2022 |
Provided To Others? | Yes |
URL | https://archive.researchdata.leeds.ac.uk/958/ |
Title | ACE version 1.0 |
Description | A Set Visualization Tool to Analyze Patterns in Missing Data |
Type Of Technology | Software |
Year Produced | 2018 |
Impact | Identified the origin of a data quality problem (DIAG codes and OPERTN codes) in NHS Admitted Patient Care data, allowing feedback to be given to a specific hospital to rectify the problem. |
Title | ACE version 2.0 |
Description | A Set Visualization Tool for analysing patterns of missing data and transactions data (e.g., supermarket purchases) |
Type Of Technology | Software |
Year Produced | 2019 |
Impact | Identified the origin of data quality problems in NHS Admitted Patient Care data. Examples include: 1) Gaps in the 24 fields that record a patient's operations, which have implications for the NHS Payment by Results system. 2) Inconsistencies between operation fields and the corresponding 24 date fields, which affected millions of records. 3) Gaps in the 20 fields that record a patient's diagnoses, which affect the data cleaning methods used by epidemiologists. The origin pointed to a particular unit in a specific healthcare provider, allowing feedback to be given so that the problem could be rectified. 4) Developing a simple method for investigating integrity errors in maternity records about the location of a baby's delivery, allowing feedback to be given to specific healthcare providers. 5) Identifying an error in the APC Data Dictionary for OPERTN_nn fields. |
Title | Python set visualization package |
Description | Python set visualization package |
Type Of Technology | Software |
Year Produced | 2023 |
Impact | None yet |
Title | vizdataquality |
Description | This is a Python package for visualizing data quality, and includes this six-step workflow: (1) Look at your data (is anything obviously wrong?), (2) Watch out for special values, (3) Is any data missing?, (4) Check each variable, (5) Check combinations of variables, and (6) Profile the cleaned data. |
Type Of Technology | Software |
Year Produced | 2024 |
Open Source License? | Yes |
Impact | None yet |
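Steps (2) and (3) of the workflow described in this record can be sketched in plain Python. This does not use or reproduce the vizdataquality API: the function name, the treatment of None and empty strings as missing, and the placeholder dates are all assumptions made for illustration.

```python
def profile_column(values, special_values=()):
    """Summarise one variable: missing, special and remaining (valid) values.

    Assumption: None and the empty string both mean "missing".
    Special values are counted separately so that they are not
    mistaken for either missing data or outliers.
    """
    total = len(values)
    missing = sum(1 for v in values if v is None or v == "")
    special = sum(
        1 for v in values
        if v is not None and v != "" and v in special_values
    )
    return {
        "total": total,
        "missing": missing,
        "special": special,
        "valid": total - missing - special,
    }


# Hypothetical date column in which two placeholder dates carry a
# special meaning; the actual special values depend on the dataset.
dates = ["2015-03-02", "1800-01-01", None, "", "1801-01-01", "2016-07-11"]
profile = profile_column(dates, special_values=("1800-01-01", "1801-01-01"))
print(profile)
```

Separating special values before looking at distributions keeps placeholder codes from distorting later checks on each variable.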
Description | ACE demo at NHS England's Quarry House office |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | ACE demo at NHS England's Quarry House office |
Year(s) Of Engagement Activity | 2017 |
Description | ACE demo to NHS Digital Casemix team |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | ACE demo to NHS Digital Casemix team |
Year(s) Of Engagement Activity | 2017 |
Description | AIUK |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | AIUK presentation |
Year(s) Of Engagement Activity | 2021 |
Description | Advisory Boards |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Industry/Business |
Results and Impact | Advisory Board |
Year(s) Of Engagement Activity | 2016,2017,2018,2019 |
Description | BIG DATA: Turning Data into Value (LUBS event) |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | BIG DATA: Turning Data into Value (LUBS event) |
Year(s) Of Engagement Activity | 2016 |
Description | Data cleaning strategies |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | Discussing & explaining data cleaning strategies |
Year(s) Of Engagement Activity | 2019 |
Description | Demos & poster of QuantiCode tools |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Public/other audiences |
Results and Impact | Demos & poster of QuantiCode tools |
Year(s) Of Engagement Activity | 2017,2018,2019 |
Description | Film: Visualizing the Quality of Data |
Form Of Engagement Activity | Engagement focused website, blog or social media channel |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Visualizing the Quality of Data: Data quality may be divided into two main areas - completeness and correctness. This film explains what each area involves, how we can visualize it and why it matters, using case studies from healthcare and retail. |
Year(s) Of Engagement Activity | 2019,2020 |
URL | https://www.youtube.com/watch?v=PnNMfCRWL7k |
Description | LCC |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | Scenario workshop; Local Government Association - webinar; discussion of options in predictive analysis in social care; update on risk stratification analysis and discussion on further collaboration; strategies for LCC client risk stratification; discussion on application of the methodology within the current social care protocols; evaluation of risk stratification results and identification of future directions. |
Year(s) Of Engagement Activity | 2016,2017,2018,2019 |
Description | LIDA AGM 2017 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | LIDA AGM presentation with LCC |
Year(s) Of Engagement Activity | 2017 |
Description | Leeds Data Ethics Roundtable |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | Leeds Data Ethics Roundtable (DLA Piper) |
Year(s) Of Engagement Activity | 2018,2019 |
Description | Leeds University, Leeds City Council Review of Collaborations video |
Form Of Engagement Activity | Engagement focused website, blog or social media channel |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | Production of a film about academic/local council collaboration in research. |
Year(s) Of Engagement Activity | 2021 |
URL | https://www.youtube.com/watch?v=JTeoR0HoPys |
Description | Provided material for Research Data Science course |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | Provided material about Visualizing the Quality of Data for course. |
Year(s) Of Engagement Activity | 2021 |
URL | https://tinyurl.com/VizDataQuality |
Description | Sainsbury's |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Industry/Business |
Results and Impact | Scenario workshop; customer missions meetings |
Year(s) Of Engagement Activity | 2016,2018,2019 |
Description | UoL Open Days |
Form Of Engagement Activity | Participation in an open day or visit at my research institution |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | Open Day demos of ACE |
Year(s) Of Engagement Activity | 2018,2019,2020 |
Description | Workshop on Missing Data and the ACE Tool 2018 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | Workshop on Missing Data and the ACE Tool |
Year(s) Of Engagement Activity | 2018 |