From trivial representations to learning concepts in AI by exploiting unique data
Lead Research Organisation: University of Edinburgh
Department Name: Sch of Engineering
Abstract
The prospect of an AI-based revolution and its socio-economic benefits is tantalising. We want to live in a world where AI learns effectively, with high performance and minimal risk. Such a world is extremely exciting.
We tend to believe that AI learns higher-level concepts from data, but this is not what happens. Particularly in data such as images, AI extracts rather trivial (low-level) notions even when provided with millions of examples. We often hear that providing more data, with high diversity, should help improve the information that AI can extract.
This data amassing, though, has privacy and cost implications. Indeed, considerable cost also comes from the need to pre-process and sanitise data (i.e. remove unwanted information).
More critically, though, in several key applications (e.g. healthcare) some events (e.g. disease) can be rare or truly unique. Collecting more and more data will not change the relative frequency of such rare data. It appears that current AI is not data efficient: it poorly leverages the goldmine of information present in unique and rare data.
This project aims to answer a key research question:
**Why does AI struggle with concepts, and what is the role of unique data?**
We suspect there are several reasons why AI struggles with concepts:
A) The mechanisms we use to extract information from data (known as representation learning) rely on very simple assumptions that do not reflect how real data exist in the world.
For example, we know that real data have correlations, yet we make the simplifying assumption of no correlation at all.
We propose to introduce stronger assumptions of causal relationships in the concepts we want to extract. This should in turn help us extract better information.
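To make the contrast in point A concrete, below is a minimal sketch in Python/PyTorch; it is our illustration, not the project's actual method. It compares the common factorised prior over latent concepts, which assumes no correlation, with a correlated Gaussian prior whose covariance is induced by a hypothetical linear causal structure among the latents. All names and values (latent_dim, the adjacency A) are assumptions made up for illustration.

```python
# Illustrative sketch: the "independent latents" assumption vs a prior whose
# correlations are induced by an assumed causal structure among concepts.
# All values here (latent_dim, the adjacency A) are hypothetical.
import torch

latent_dim = 4
mu = torch.randn(latent_dim)        # posterior mean, e.g. from an encoder
log_var = torch.zeros(latent_dim)   # posterior log-variance (diagonal)

# (a) Common assumption: prior p(z) = N(0, I), i.e. concepts are uncorrelated.
kl_independent = 0.5 * torch.sum(log_var.exp() + mu**2 - 1.0 - log_var)

# (b) Structured alternative: latents follow a linear model z = A z + e,
#     which implies the prior covariance (I - A)^-1 (I - A)^-T.
A = torch.tensor([[0.0, 0.0, 0.0, 0.0],   # hypothetical acyclic adjacency:
                  [0.7, 0.0, 0.0, 0.0],   # concept 1 influences concept 2, etc.
                  [0.0, 0.5, 0.0, 0.0],
                  [0.3, 0.0, 0.4, 0.0]])
eye = torch.eye(latent_dim)
inv = torch.linalg.inv(eye - A)
prior_cov = inv @ inv.T                   # correlations induced by the structure

# Closed-form KL( N(mu, diag(exp(log_var))) || N(0, prior_cov) ) for Gaussians.
post_cov = torch.diag(log_var.exp())
prior_prec = torch.linalg.inv(prior_cov)
kl_structured = 0.5 * (
    torch.trace(prior_prec @ post_cov)
    + mu @ prior_prec @ mu
    - latent_dim
    + torch.logdet(prior_cov) - log_var.sum()
)
print(float(kl_independent), float(kl_structured))
```

Swapping the KL term of a standard representation-learning objective for the structured version is one simple way such causal assumptions can enter training; the project's actual formulation may differ.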
B) To learn any model, we have to use optimisation processes to find the parameters of the model. We find a weakness in these processes: data that are unique and rare do not receive much attention, or if they do, it happens by chance.
This leads to considerable inconsistency in the extraction of information. In addition, sometimes wrong information is extracted, either because we found suboptimal representations or because we latched onto data that escaped the sanitisation process, since no such process can be guaranteed to be perfect.
We want to understand why such inconsistency exists and propose to devise methods that can ensure that when we train models, we can consistently extract information even from rare data.
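As a concrete, if simplistic, baseline for the kind of mechanism point B concerns, the sketch below (our assumption, not the proposed method) shows why rare examples receive little attention under uniform sampling and how inverse-frequency resampling forces the optimiser to see them consistently. The dataset and its sizes are hypothetical.

```python
# Illustrative sketch: under uniform sampling, a group seen in 10 of 1000
# examples appears in ~1% of training steps; weighting each example by the
# inverse frequency of its group makes rare data appear consistently.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical data: 990 "common" examples (label 0) and 10 "rare" ones (label 1).
x = torch.randn(1000, 8)
y = torch.cat([torch.zeros(990, dtype=torch.long),
               torch.ones(10, dtype=torch.long)])

counts = torch.bincount(y).float()   # examples per label: [990, 10]
weights = 1.0 / counts[y]            # weight each example by 1 / its label count
sampler = WeightedRandomSampler(weights, num_samples=len(y), replacement=True)
loader = DataLoader(TensorDataset(x, y), batch_size=32, sampler=sampler)

rare_seen = sum((yb == 1).sum().item() for _, yb in loader)
print(f"rare examples seen per epoch: ~{rare_seen} (vs ~10 under uniform sampling)")
```

Such reweighting is a known baseline; the point of the proposed work is precisely that it is not enough, since it neither explains the inconsistency nor guarantees that the right information is extracted.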
There is a tight connection between B and A. Without new methods that better optimise learning functions we cannot extract representations reliably from rare data, and hence we cannot impose the causal relationships we need.
There is an additional element about this work that helps answer the second part of the question. Rare and unique data may actually reveal unique causal relationships. This is a very tantalising prospect that the work we propose aims to investigate.
There are considerable and broad rewards of the work we propose.
We put herein the underpinnings for an AI that, because it is data efficient, should not require the blind amassing of data, with all the privacy fears this engenders for the general public. Because it learns high-level concepts, it will be better suited to empowering decision tools that can explain how decisions have been reached. And because we introduce strong causal priors when extracting these concepts, we reduce the risk of learning trivial data associations.
Overall, a major goal of the AI research community is to create AI that can generalise to new, unseen data beyond what was available at training time. We hope that our AI will bring us closer to this goal, further paving the way for broader deployment of AI in the real world.
Publications
Hartley J (2023) Neural networks memorise personal information from one sample. Scientific Reports.
Lyu J (2025) The state-of-the-art in cardiac MRI reconstruction: Results of the CMRxRecon challenge in MICCAI 2023. Medical Image Analysis.
Sadeghian R (2024) Editorial: Methods in artificial intelligence for dementia 2024. Frontiers in Dementia.
Vilouras K (2024) Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models. IEEE Journal of Biomedical and Health Informatics.
| Description | This work has shown that when AI systems aim to learn representations of their training data, the representations learned can contain biases. Central to the reason behind such biases appears to be the inclination of these systems to learn from data correlations rather than causal relationships. This finding has led to a considerable body of work, as epitomised by the funding and establishment of CHAI, the EPSRC-funded Causality in Healthcare AI hub. |
| Exploitation Route | In many ways: while the work was done on simple datasets and in healthcare, the need to identify true relationships in data is paramount to scientific discovery and to the development of robust systems. |
| Sectors | Aerospace, Defence and Marine; Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Energy; Environment; Financial Services and Management Consultancy; Healthcare |
| URL | https://vios.science/research/ |
| Description | One of the researchers who has contributed to the work supported by this award has established a startup company. |
| First Year Of Impact | 2023 |
| Sector | Healthcare |
| Impact Types | Economic |
| Description | CHAI - EPSRC AI Hub for Causality in Healthcare AI with Real Data |
| Amount | £10,288,789 (GBP) |
| Funding ID | EP/Y028856/1 |
| Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
| Sector | Public |
| Country | United Kingdom |
| Start | 02/2024 |
| End | 01/2029 |
| Title | Code implementation for papers |
| Description | Open-source implementations for papers published in 2023 and 2024. |
| Type Of Material | Technology assay or reagent |
| Year Produced | 2024 |
| Provided To Others? | Yes |
| Impact | Increased uptake of work as manifested by "stars" and forks on open code repositories. |
| Title | Unintended memorisation of unique features in neural networks |
| Description | The source code accompanies the paper: Hartley, J., Sanchez, P.P., Haider, F. et al. Neural networks memorise personal information from one sample. Sci Rep 13, 21366 (2023), https://doi.org/10.1038/s41598-023-48034-3. It reproduces the paper's experiments and figures. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Open Source License? | Yes |
| Impact | The software and accompanying paper illustrate several findings: 1. neural networks memorise unique features in several datasets and for a range of model architectures; 2. memorisation of unique features cannot be prevented using typical regularisation strategies; 3. memorisation happens because such a feature is rare, appearing only once in the data and hence being unique with respect to all other features, and it occurs from the first epoch and over the entire unique feature; 4. we are able to audit models with the M score in grey- or black-box settings. Practical impact and implications: The findings of this study highlight the need to develop strategies to protect personal information when it is present as a unique feature. One possible way to avoid the presence/influence of unique features is to develop automatic solutions that detect personal information printed on training images so it can be removed before machine learning training. Another safeguard is a privacy filter (at the testing stage) that rejects/modifies an image with identifiable information printed on it, so that an attacker cannot access identifiable information learned by neural networks. By doing so, a data scientist lowers the possibility of linking a breached patient record (as happened in England (https://www.bbc.co.uk/news/technology-44682369)) to the training data of their ML model. The findings will also inform policymakers in developing practices and guidelines for data scientists and companies to protect personal information in such situations, in line with the policy document at https://www.gov.uk/government/publications/ai-regulation-a-pro-innovation-approach/white-paper, by safeguarding against bad actors. |
| URL | https://doi.org/10.1038/s41598-023-48034-3 |
| Description | BMVC 2024 |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | I gave a presentation at a specialised workshop on ethics and the role of data biases, organised in Glasgow as part of an international top-10 conference. |
| Year(s) Of Engagement Activity | 2024 |
| Description | Presentation at Onassis Health Day |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | Regional |
| Primary Audience | Patients, carers and/or patient groups |
| Results and Impact | I presented at the Onassis Health Day, celebrating 30 years since the start of the cardiac transplant hospital in Greece. I discussed the impact of AI on cardiac health at present and in the immediate future. |
| Year(s) Of Engagement Activity | 2023 |
| URL | https://www.onassis.org/whats-on/onassis-health-day-2023 |
