Detecting Training Abuses in Neural Nets

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

Many military systems carry out classification tasks. For example, a system might be required to distinguish between an allied tank and an enemy tank (a classic problem in machine learning). Modern machine learning approaches are being brought to bear on classification problems within the military domain and more widely. Neural networks, a technology that, loosely speaking, makes decisions in a manner analogous to the way the human brain works, play a particularly prominent role.
Most work proceeds on the assumption that all is benign. But imagine an enemy who wants your classifier to work well except when presented with one very specific kind of input. For example, an enemy tank with a particular appearance could be classified as an allied one, with very significant consequences.
Can an enemy engineer such behaviour? In certain circumstances, yes! It depends on how and by whom the classifier system was built. The building of such systems is often outsourced in some way, e.g. because the procurer lacks the computational capability to craft an effective system, or because publicly available components are used.
We often refer to hidden malicious functionality that can be invoked when convenient as a 'trapdoor'. Trapdoors are often very difficult to detect. Imagine a system that perfectly classifies thousands of tank examples that you provide. It seems to be a very good system. But the system may have been trained so that an enemy tank with "666" painted on its side is misclassified. If you do not know this specific condition, you would have little reason to generate a test example that discovers it. Neural networks are also notoriously opaque about how they reach their decisions, which makes this sort of trapdoor particularly hard to detect.
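To make the idea concrete, the sketch below shows one well-known way such a trapdoor can be planted: poisoning a small fraction of the training data with a visual trigger and a target label. This is purely illustrative; the trigger pattern (a small corner patch), the poison rate and the labels are hypothetical choices, not details taken from this project.

```python
import numpy as np

def poison_dataset(images, labels, target_label, poison_rate=0.05, seed=0):
    """Illustrative data-poisoning trapdoor: stamp a small trigger patch onto a
    fraction of the training images and relabel them as `target_label`.
    A network trained on this data behaves normally on clean inputs but tends to
    misclassify any input carrying the trigger."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        images[i, -4:, -4:] = 1.0   # 4x4 bright patch in the corner acts as the trigger
        labels[i] = target_label    # e.g. relabel 'enemy tank' as 'allied tank'
    return images, labels

# Hypothetical usage: 1000 greyscale 28x28 images with binary enemy/allied labels.
x = np.random.rand(1000, 28, 28).astype(np.float32)
y = np.random.randint(0, 2, size=1000)
x_poisoned, y_poisoned = poison_dataset(x, y, target_label=0)
```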
We might reasonably ask whether, or how well, we can detect such trapdoors. There are various levels at which understanding may be sought. Thus, determining whether a system has a trapdoor in it (yes/no) is a simpler and less ambitious task than seeking the specific trapdoor condition (the "666" indicated above).
Though there is a fair amount of work on trapdoors in the literature, typically addressing the planting or detection of trapdoors, there appears to be little concerned with characterising them. It would seem clear that any detection technique is likely to be more successful on some trapdoors than on others. This raises the question, however, of how to describe those on which the technique works well and those on which it performs less well. A rigorous approach to detection, the primary goal of this project, requires a nuanced understanding of trapdoors. In particular, a characterisation of trapdoors together with measurements of their properties, e.g. how much a trapdoor example deviates from a normal example, is essential.
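As one possible way to quantify "how much a trapdoor example deviates from a normal example", the sketch below computes a few simple distance measures between a triggered input and its clean counterpart. The specific measures (L2 and L-infinity distance, fraction of pixels changed) are our own illustrative choices, not properties proposed by the project.

```python
import numpy as np

def trigger_deviation(clean, triggered):
    """Simple property measurements for a candidate trapdoor example:
    how far the triggered input lies from its clean counterpart."""
    diff = triggered.astype(np.float64) - clean.astype(np.float64)
    return {
        "l2_distance": float(np.linalg.norm(diff)),                    # overall perturbation size
        "linf_distance": float(np.abs(diff).max()),                    # largest single-pixel change
        "pixels_changed": float(np.count_nonzero(diff) / diff.size),   # spatial footprint of the trigger
    }
```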
Turning to trapdoor generation, a characterisation of trapdoors allows a more refined specification of the properties we would like an inserted trapdoor to have. This serves two purposes: firstly, it facilitates a more nuanced generation capability for practical operational purposes, i.e. for someone who wants to benefit from planting a trapdoor in the real world; and secondly, it allows researchers (initially ourselves!) to generate sets of trapdoors for rigorous evaluation of detection techniques. We can define what it means to 'cover' the trapdoor space in some way, much as we cover the input or other spaces in general testing. Since there is no extant workable characterisation of trapdoors, there is also clearly no extant generation capability.
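By analogy with covering an input space in testing, one minimal sketch of "covering" a trapdoor space is to enumerate a parameterised family of triggers and vary each parameter systematically. The parameters below (patch size, corner location, intensity) are hypothetical illustrations of the idea, not the characterisation the project aims to develop.

```python
import itertools
import numpy as np

def generate_trigger_family(sizes=(2, 4, 8), corners=("tl", "tr", "bl", "br"),
                            intensities=(0.5, 1.0), image_shape=(28, 28)):
    """Enumerate a small parameterised family of patch triggers
    (size x location x intensity), one point per combination of parameters."""
    h, w = image_shape
    for size, corner, intensity in itertools.product(sizes, corners, intensities):
        mask = np.zeros(image_shape, dtype=np.float32)
        rows = slice(0, size) if corner in ("tl", "tr") else slice(h - size, h)
        cols = slice(0, size) if corner in ("tl", "bl") else slice(w - size, w)
        mask[rows, cols] = intensity
        yield {"size": size, "corner": corner, "intensity": intensity, "mask": mask}
```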

Publications


Studentship Projects

Project Reference  Relationship  Related To     Start       End         Student Name
EP/S51388X/1                                    01/10/2018  30/09/2023
2301656            Studentship   EP/S51388X/1   02/01/2019  27/02/2023  Mudit Pandya