Se-Ma-Match: Semantic Malware Matching

Lead Research Organisation: University of Kent
Department Name: Sch of Computing

Abstract

The flood of malware samples is predicted to grow into a deluge in
2012, making the problem of maintaining a database of malware
signatures ever more difficult. For each new sample, it is important
to determine the threat that it poses.

In response to this, dynamic malware analysis
tools have been designed that execute the sample in a sandbox,
monitoring the actions of a sample. If these actions are similar
to those of malware that has been already indexed in the database,
then one might draw conclusions regarding provenance and severity
of the threat posed. If the sample does not match against known
malware, then it can be subject to manual scrutiny, using a dissembler
such as IDA Pro.

This Linnaean approach to malware analysis is both natural and
convenient: it is natural to group malware into families that share
common attributes; and it is provides a convenient way of assessing
threat. Yet the whole methodology is predicated on the accuracy
with which samples are characterised by their signatures. If a
sample is assigned a signature that does not express its behaviour,
then samples that are behaviourally distinct can be erroneously
grouped together. Conversely, samples which behave the same, but
appear different, can be accidentally placed in different groups.

The main problem with dynamic malware analysis tools is that they
execute the binary for a limited time, typically considering just
one path through the binary. This limits the actions that can be
observed, rendering the signature inaccurate for programs that
reveal their true behaviour later. In addition, the dynamic approach
can miss infrequent actions or logic bombs. The dynamic approach is
also susceptible to timing attacks that detect a tracer to turn off
some action. Above all, the signatures are based solely and only
on those actions that are encountered during the trace.

More static approaches have been applied too, at one extreme using
the call graph of the binary itself for classification, and at the
other deploying model checking techniques to search the paths through
call graph for signature behaviours that characterise known malware
families. Yet graph matching techniques are sensitive to control-flow
obfuscation and model checking requires the signature behaviours
to be known up-front and distilled into a temporal formula or an
automata.

A middle ground is offered by abstract interpretation since it
provides a way to systematically consider all paths, while monitoring
a program for actions that inform the construction of the signature.
Abstract interpretation provides a way to break the dichotomy between
the purely dynamic and the purely static approach to malware analysis
into a graduated continuum. Formally, purely static approach
(a.k.a. a static analysis) can be derived from the purely dynamic
approach (a.k.a. a tracer) by compositing a sequence of abstractions:
if all n abstractions are applied the result is the static analysis;
but if the first m < n abstractions are applied the result is a
hybrid. The challenge is to find the hybrid that provides sufficient
path coverage to undercover logic bombs yet is sufficiently robust
to be used by practitioners in the security sector. The proposed
project will discover this sweet point by following two complementary
lines of inquiry. Concrete traces will be abstracted to cover more
paths and more actions (at UCL). Static analyses, which covers all
paths, will be refined to avoid paths and actions that do not
actually occur (at Kent). Thus UCL will add missing information
to signatures (converging on the ideal signature from below) whilst
Kent will remove excess information from signatures (converging on
the ideal signature from above). By reflecting on the relative
merits of these approaches, we will draw conclusions on how to
construct robust signatures for malware classification and thereby
advance the whole field.

Planned Impact

Maximising Impact on Industrial Beneficiaries

Although malware is a widespread problem the techniques we will develop are specialised and probably only of interest to the relatively small number of companies engaged in fighting malware. Technology transfer to these partners will be key to maximising the impact. The initial industrial beneficiaries will be the current industrial partners of the project. However, during the project we will seek out other industrial partners and industrial beneficiaries. We will hold a workshop at Kent six months before the end of the project, at which point the research will be sufficiently mature, to which we will invite potential industrial connections. Involvement in the Research Institute will in itself assist in forming relevant industrial and enabling dissemination of the research outcomes of the project.

Maximising Academic Impact

The SeMaMatch project will maximise dissemination to its target research community and facilitate early involvement. To this end the SeMaMatch team will organise two workshops to deepen collaboration, widen the research base in compiled code analysis and to facilitate dissemination of our research outputs. The first workshop will be organised by the UCL site halfway through the project and will be aimed at academics in the research area. The second workshop will be organised by the Kent University site towards the end of the project as set out above and will target both relevant academics and industry. We have experience of organising such events through the UCL CREST Open Workshop programme. This had attracted 721 registrations from 167 organisations (including industrial organisations) in over 31 countries as of June 2012. We will, in addition to the two workshops associated with the project, organise three CREST open workshops on the theme of malware analysis, seeking to make connections with the wider software engineering research base. For maximal academic impact, we will publish our outputs in leading international conferences, whilst augmenting these results with archival journal publications. We will also distribute our research prototypes with an open source license, partly to encourage take up of the ideas, and partly to make IPR transparent.

Maximising Impact on the Wider User Community and Society

The project team will make regular contributions to suitable on-line forums and discussion groups. Ideas and results from the project will feed into modules of UCL's MSc in Information Security. In addition the project team will seek to widen interest in the outputs of the project among the general public, incidentally raising awareness of malware.

Publications

10 25 50
 
Description The main scientific finding which came out of this work was a simple and practical way to extract (abstract) the semantics of a program, and in particular to do so in a way that was practical and scalable. Matrices are traditionally used to describe relationships between variables but the number of relations grows fast with the number of variables in a program, impeding scalability. Our observation was that many of the relationships between different pairs of the variables are actually the same, which makes it possible to describe the relationship just once but use the relationship many, typically thousands, of times. This has a huge saving on memory which, in turn, speeds up the analysis of a problem to problems that previously were out of reach for automatic, semantic analysis.
Exploitation Route They techniques we developed can deployed in many setting, including automatic bug finding and verification, as well as improving the accuracy of program matching techniques.
Sectors Aerospace, Defence and Marine,Creative Economy,Digital/Communication/Information Technologies (including Software),Electronics,Energy,Financial Services, and Management Consultancy