Mining the Network Behaviour of Bots

Lead Research Organisation: Royal Holloway University of London
Department Name: Information Security

Abstract

The botnet phenomenon has quickly become a major security concern for all
Internet users. Not only has it rapidly gained attention in the
mass media, but it has also drawn the attention of the research community
interested in understanding, analyzing, and detecting bot-infected machines.
Once infected with a bot, the victim host joins a botnet, a network of
compromised machines that are under the control of a malicious entity. Botnets
are the primary means for cyber-criminals to carry out criminal tasks,
such as sending spam mail, launching denial-of-service attacks, or stealing
personal data such as mail accounts or bank credentials.

Clustering and correlating network events represent the state of the art when
it comes to detecting and understanding the botnet phenomenon from a network
perspective. While effective, such approaches rest on weak foundations, as they
are vulnerable to easy-to-perform (time and network) obfuscation attacks.

The goal of this project is to build on the promising results of our previous
work to explore novel machine-learning techniques to make the state-of-the-art
more accurate and robust against evasions and advanced malware. Exploring the
possibilities of advanced malware (and thus bots) to enable the development of
novel mathematical techniques to address such threats is not a mere academic
exercise. On the contrary, it is of paramount importance to build robust and
hard-to-elude mitigation approaches; something we currently lack, as
acknowledged by the research community at large.

On the cyber security side, we will develop techniques to analyze the network
traffic generated by a bot sample. Our analysis will focus on inferring the
interesting part of a bot's network behaviour to automatically generate models
that faithfully describe it. Our analysis aims to be independent of the
underlying botnet infrastructure, payload-agnostic, and able to pinpoint
malicious activities that resemble legitimate ones. The network flows of a
monitored bot will initially be filtered to remove well-defined attack patterns. The
remaining flows will be clustered using a number of network features and
suitable similarity functions. Clusters whose size exceeds a given threshold
will then be analyzed for periodicity: bots tend to engage in similar network
activities whose interflow intervals are either sampled independently
from a potentially unknown probability distribution or belong to a small
number of well-defined clusters. Once clusters exhibiting interesting
periodicity patterns are identified, they can be used, along with their network
features, for detecting (or understanding the behaviour of) bots in a mixed
population containing both compromised and clean hosts.
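
The periodicity step described above can be illustrated with a minimal Python sketch. It covers only the second case (interflow intervals that fall into a small number of well-defined groups) and is not the project's implementation: the function names, the greedy grouping rule, and the parameters `tolerance`, `max_groups`, and `min_flows` are all assumptions made for illustration.

```python
from statistics import mean


def interflow_intervals(timestamps):
    """Return the gaps between consecutive flow start times (seconds)."""
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


def group_intervals(intervals, tolerance=0.1):
    """Greedy 1-D grouping (hypothetical rule): walking the sorted intervals,
    an interval joins the current group if it exceeds the group's mean by at
    most `tolerance` (relative); otherwise it starts a new group."""
    groups = []
    for value in sorted(intervals):
        if groups:
            centre = mean(groups[-1])
            if value - centre <= tolerance * max(centre, 1e-9):
                groups[-1].append(value)
                continue
        groups.append([value])
    return groups


def looks_periodic(timestamps, max_groups=2, min_flows=5, tolerance=0.1):
    """Flag a flow cluster whose intervals collapse into a few tight groups."""
    if len(timestamps) < min_flows:
        return False  # too few flows to say anything about periodicity
    groups = group_intervals(interflow_intervals(timestamps), tolerance)
    return len(groups) <= max_groups


# Flows that recur roughly every 60 seconds, with a little jitter.
beacon = [0, 60.5, 120.2, 180.9, 240.1, 300.4]
# Flows with no recurring interval.
irregular = [0, 3, 50, 52, 400, 1000, 1001]
print(looks_periodic(beacon), looks_periodic(irregular))
```

A fixed relative tolerance is exactly the kind of rule that timing obfuscation can defeat, which is one reason the abstract looks to statistically grounded methods instead.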

On the machine learning side, we propose to explore the use of conformal
prediction developed by our team to make such cluster-based analysis more
accurate and robust against arbitrary obfuscation-based evasion attacks. A
powerful clustering method is based on nonparametric probability density
estimation. Recent work proposes a computationally efficient method of
nonparametric density estimation that is based on conformal prediction and
inherits its validity properties. We plan to explore the use of this method for
the purpose of robust clustering. A theoretical challenge is to spell out and study
the robustness properties of this clustering method that are inherited from
the validity of the method mentioned above. In addition, the property of
validity of conformal predictors is usually established under the randomness
assumption; we will explore how this assumption can be relaxed. Furthermore,
the property of validity can be used to control the number of "alarms"
(predicting that a host is compromised) raised by a bot detection algorithm.
This is valuable in situations where alarms have to be investigated by human
experts but the available manpower is limited.
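
To illustrate how validity bounds the alarm rate, here is a minimal Python sketch of a conformal p-value used as an alarm test. It is a sketch under stated assumptions, not the project's detector: the nonconformity scores are hypothetical numbers (for instance, how far a host's traffic lies from a profile of clean hosts), and `epsilon` is an assumed name for the significance level.

```python
def conformal_p_value(calibration_scores, new_score):
    """Fraction of scores (calibration hosts plus the new host) that are at
    least as nonconforming as the new host's score."""
    at_least = sum(1 for s in calibration_scores if s >= new_score)
    return (at_least + 1) / (len(calibration_scores) + 1)


def raise_alarm(calibration_scores, new_score, epsilon=0.05):
    """Raise an alarm when the p-value is at most the significance level."""
    return conformal_p_value(calibration_scores, new_score) <= epsilon


# Hypothetical nonconformity scores of 19 hosts known to be clean.
clean_scores = list(range(1, 20))

print(conformal_p_value(clean_scores, 25))  # stranger than every clean host
print(conformal_p_value(clean_scores, 10))  # typical of the clean hosts
print(raise_alarm(clean_scores, 25), raise_alarm(clean_scores, 10))
```

If a clean host's score is exchangeable with the calibration scores, the probability that its p-value is at most `epsilon` is no more than `epsilon`, so the significance level acts as a budget on false alarms that can be matched to the number of analysts available.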

Planned Impact

This project is concerned with the development of novel
obfuscation-resistant techniques to analyze and detect core network
behavioral traits of bot-infected computing devices.

We will organize a two-day workshop on the subject of machine learning
and malware analysis. The workshop will be aimed at bringing together
all the project collaborators, academic researchers and industry
practitioners with interest in the project's topics. The goal of the
workshop is to narrow the gap that nowadays exists between security
research carried out in academia and industry to face common threats.

The workshop will be held at the end of the second year of the project
and members of the advisory board (see below) will be encouraged to
attend. The workshop will include a number of invited talks from
representatives of industrial and academic research laboratories,
including---but not limited to---McAfee Labs (UK), Symantec Research Labs
(Europe), VU University Amsterdam (The Netherlands), and FORTH-ICS (Greece),
from which this project has already received strong letters of support.

We will also appoint a panel of experts---mostly from the
industry---who will provide advice on practical problems and needs,
and help to promulgate the results of our research.

The Information Security Group at Royal Holloway, University of London
has considerable experience in working with industry. Over the last
ten years it has convened expert panels on public key infrastructures
(the "PKI Club"), authentication and identity management (the "AIM
Club") and, most recently, on cyber security. These panels,
comprising experts from industry and government circles, meet
regularly to discuss the challenges facing industry and government,
focusing on a different aspect at each meeting. We anticipate that the
expert advisory board for this project would operate in a similar way.

The composition of the board will be determined in the first few
months of the project. At least one member of the advisory board will
be invited to give an industry-centered talk at the workshop. The
board will meet four times, once within six months of the start of the
project and at the end of each year of the project.

We are confident our project will produce key results to overcome
advanced malware, which are of paramount interest to academic and
industry security researchers. Moreover, not only will the developed machine
learning techniques improve our understanding of bot-related
threats, but they will also likely be applicable to other important
areas, such as Systems Biology, where time-series data---and clustering
in general---has to be examined.
 
Description We have made considerable progress on the theory of conformal predictors, which provide classification supported by statistical confidence. While we have only recently started to adapt it to the original aim of identifying malicious network communications, we have quite unexpectedly built and expanded on it so as to provide a statistical evaluation of a broad range of machine learning algorithms. This seems to be quite successful in providing suitable quality indicators for identifying when classifiers start decaying, which triggers the detection of objects (e.g., network communications or any other event of interest) that are drifting away from the model the classifiers represent. Such findings can potentially be applied to other research domains, but we leave this as future work.
Exploitation Route Our statistical evaluation of machine learning algorithms enables the identification of concept drift, detecting when learning decays and potentially suggesting retraining strategies. We are planning to release the source code and a paper on our findings to allow other researchers to integrate our approach in realistic machine learning scenarios (open-world experiments).
Sectors Aerospace, Defence and Marine,Chemicals,Digital/Communication/Information Technologies (including Software),Financial Services, and Management Consultancy,Retail,Security and Diplomacy,Transport

 
Description The findings of this research project helped to better understand how to identify and quarantine security threats that evolve over time. Forecasting future threats is still an open challenge, but it is possible to equip learning models with the ability to reject uncertain predictions. Although classification with rejection and, more generally, abstaining classifiers are well explored in the machine learning community, little is known about their effectiveness when applied in evolving contexts, which are typical of computer security. The breakthrough of this research project represents a first step towards achieving this overarching goal, paving the way to address fundamental research challenges. See https://s2lab.cs.ucl.ac.uk and https://s2lab.cs.ucl.ac.uk/projects/transcend for further details.
First Year Of Impact 2017
Sector Digital/Communication/Information Technologies (including Software),Education,Other
Impact Types Societal

 
Description GCHQ Small Grants scheme 2015-2016
Amount £39,000 (GBP)
Organisation Government Communications Headquarters (GCHQ) 
Sector Public
Country United Kingdom
Start 01/2016 
End 03/2016
 
Description NCSC Small Grants scheme 2017-2018
Amount £20,000 (GBP)
Organisation National Cyber Security Centre 
Sector Public
Country United Kingdom
Start 11/2017 
End 01/2018
 
Description NVIDIA GPU donation
Amount £3,000 (GBP)
Organisation NVIDIA 
Sector Private
Country Global
Start  
 
Title CopperDroid 
Description We have developed the infrastructure to enable dynamic analysis and classification of Android applications at scale; we are in the process of finalizing a RESTful API to provide free use of the service to practitioners and researchers. 
Type Of Material Improvements to research infrastructure 
Provided To Others? No  
Impact We are in discussions with Google and McAfee about a potential integration of our analysis system in their backend infrastructure. 
 
Title Conformal Evaluator 
Description We have developed a statistical machine learning evaluation framework to provide a quantifiable assessment of the quality of a given machine learning classification. Not only does this make it possible to understand how well an approach may perform in real-life deployment, but it also provides metrics that can be leveraged to detect concept drift, and thus decay in classifier performance (suggesting retraining strategies), in realistic settings. 
Type Of Material Data analysis technique 
Provided To Others? No  
Impact N/A yet. We are filing a patent and we are using this approach in the output generated by the "MobSec: Security in the Mobile Age" EPSRC research grant. We are in the process of publishing the results of this approach, including source code, to enable other research groups to build on this outcome. 
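
As an illustration of the kind of quality metrics such a framework can report, the following Python sketch computes the standard conformal notions of credibility (the largest per-class p-value) and confidence (one minus the second-largest p-value). It is not the Conformal Evaluator code: the function names, class labels, and nonconformity scores are hypothetical, and any rule for flagging low credibility would be an assumption on top of it.

```python
def p_value(class_scores, score):
    """Conformal p-value of `score` against the scores of one class."""
    at_least = sum(1 for s in class_scores if s >= score)
    return (at_least + 1) / (len(class_scores) + 1)


def evaluate(train_scores, new_scores):
    """Return (predicted label, credibility, confidence) for one object.

    train_scores: label -> nonconformity scores of training objects
    new_scores:   label -> nonconformity of the new object w.r.t. that label
    """
    p = {label: p_value(train_scores[label], new_scores[label])
         for label in train_scores}
    ranked = sorted(p, key=p.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, p[best], 1.0 - p[runner_up]


# Hypothetical nonconformity scores observed at training time.
train = {
    "benign": [0.1, 0.2, 0.3, 0.4],
    "malicious": [0.2, 0.3, 0.5, 0.6],
}

# An object that fits the "malicious" class well.
print(evaluate(train, {"benign": 0.9, "malicious": 0.25}))
# An object that fits neither class: low credibility, a drift candidate.
print(evaluate(train, {"benign": 0.95, "malicious": 0.9}))
```

Tracking how often credibility falls below a chosen level during deployment is one simple way such metrics could signal that a classifier is decaying and may need retraining.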
 
Description HP Labs Bristol 
Organisation Hewlett Packard Ltd
Department Hewlett Packard Laboratories, Bristol
Country United Kingdom 
Sector Private 
PI Contribution We will be using HP Labs Bristol data to assess the effectiveness of our methodology
Collaborator Contribution HP Labs Bristol is supporting our research effort by providing us with real-world data (passive DNS traffic), which we'll be experimenting with in one of the research directions (see Phoenix and Cerberus systems) we are exploring within the project.
Impact We have a working prototype codenamed Cerberus, a follow-up to Phoenix (see publications), which is using an initial dataset provided by Nominet. Our plan is to integrate conformal predictors (a novel machine learning approach developed by the team) in Cerberus to provide confidence and credibility in the results of the analysis performed.
Start Year 2014
 
Description Nominet 
Organisation Nominet Trust
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution We will be using Nominet data to assess the effectiveness of our methodology
Collaborator Contribution Nominet is supporting our research effort by providing us with real-world data (passive DNS traffic), which we'll be experimenting with in one of the research directions (see Phoenix and Cerberus systems) we are exploring within the project.
Impact We have a working prototype codenamed Cerberus, a follow-up to Phoenix (see publications), which is using an initial dataset provided by Nominet. Our plan is to integrate conformal predictors (a novel machine learning approach developed by the team) in Cerberus to provide confidence and credibility in the results of the analysis performed.
Start Year 2014
 
Description University of Georgia 
Organisation University of Georgia
Country United States 
Sector Academic/University 
PI Contribution The collaboration aims at improving the performance of our machine learning-based botnet classification (with confidence).
Collaborator Contribution University of Georgia will be providing a large data corpus on which to evaluate our approach.
Impact No output yet. This is still an ongoing activity that requires one of my students to visit University of Georgia from Aug to Oct 2017 to carry out a large-scale evaluation of the system we have been developing throughout this grant.
Start Year 2017
 
Title METHOD OF MONITORING THE PERFORMANCE OF A MACHINE LEARNING ALGORITHM 
Description A crucial requirement for building sustainable learning models is to train on a wide variety of samples. Unfortunately, objects on which the learned models are used may evolve and the learned models may no longer work well. The invention provides a framework to identify aging classification models in vivo during deployment (concept drift), well before the machine learning model's performance starts to degrade. A statistical comparison of samples seen during deployment with those used to train the model is used, thereby building metrics for classification quality. 
IP Reference WO2019002603 
Protection Patent application published
Year Protection Granted 2019
Licensed No
Impact We have been contacted by several academic institutions that want access to our code; we are working on a license suitable for academic researchers as well as for industrial partners. To this end, we are in contact with Huawei Technologies to develop a research impact potentially through licensing.
 
Title AntiBot 
Description AntiBot is a framework we have developed to explore one of the research directions tackled within the project, i.e., the design of novel machine learning techniques (conformal predictors), devised and developed within the team as part of the project, to detect bot-infected computing devices. While Phoenix and Cerberus (see other entries) focus on the analysis of passive DNS traffic to characterise a specific component of bot-initiated communications, AntiBot looks at netflows to generate a semantic signature of the core behavioural traits of bot-like malware. It currently relies on off-the-shelf machine learning (e.g., hierarchical clustering algorithms) as the underlying mechanism, but we are at a stage where the advances in our understanding of conformal clustering and conformal predictors (on classification problems) developed within the project can start to be integrated into the framework. 
Type Of Technology Software 
Year Produced 2014 
Impact Work in progress 
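
For readers unfamiliar with the off-the-shelf technique mentioned above, the following Python sketch shows single-linkage agglomerative (hierarchical) clustering over flow feature vectors. It is not AntiBot's code: the two-dimensional feature vectors, the Euclidean distance, the single-linkage rule, and the `threshold` stopping criterion are all assumptions chosen to keep the example small.

```python
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(points, threshold):
    """Repeatedly merge the two closest clusters until the closest pair is
    farther apart than `threshold`. Returns clusters as lists of indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest members of a and b.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break  # remaining clusters are all too far apart to merge
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical, already-normalised features per flow (e.g. size, duration).
flows = [(1.0, 0.2), (1.1, 0.25), (5.0, 3.0), (5.2, 3.1), (9.0, 0.1)]
print(single_linkage(flows, threshold=0.5))
```

This brute-force version recomputes every pairwise linkage at each step, so it only suits small examples; a library implementation would be the practical choice at scale.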
 
Title Cerberus app 
Description The software implements Phoenix (see Publications) and Cerberus (its follow-up research) to analyze passive DNS traffic and automatically identify likely malicious automatically generated domains with high accuracy and low false positives. Cerberus is still a work-in-progress, but more can be read in the MSc thesis available at https://www.politesi.polimi.it/bitstream/10589/92341/1/2014_04_Colombo.pdf 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact Once deployed, the software would make it possible to 1) gather insights about the ever-evolving malicious DNS infrastructure used by malware and 2) generate smart blacklists to be deployed by the community at large (deployment is private for the time being; we need to finalize large-scale experiments and to secure additional data feeds before releasing it to the public). 
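
To give a feel for what separating automatically generated domains from human-chosen ones can involve, here is a deliberately simple Python heuristic based on character entropy. It is a swapped-in illustration and not the features or models used by Phoenix or Cerberus; the function names, the thresholds, the choice to inspect only the leftmost label, and the example domains are all assumptions.

```python
from collections import Counter
from math import log2


def shannon_entropy(label):
    """Shannon entropy (bits per character) of the characters in `label`."""
    counts = Counter(label)
    n = len(label)
    return -sum(c / n * log2(c / n) for c in counts.values())


def looks_generated(domain, min_length=10, entropy_threshold=3.5):
    """Flag domains whose leftmost label is long and looks random.
    Both thresholds are illustrative, not tuned values."""
    label = domain.lower().split(".")[0]
    return len(label) >= min_length and shannon_entropy(label) > entropy_threshold


print(looks_generated("google.com"))           # short, human-chosen label
print(looks_generated("xkqzjvwpyrthbn.net"))   # long label, no repeated characters
```

A heuristic this crude misfires on long legitimate names and misses generators that build domains from dictionary words, which is why a real system needs richer features and further evidence such as passive DNS traffic.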
 
Title Conformal evaluator 
Description This is the Python library that implements conformal evaluator, a framework to statistically assess the quality of a broad range of machine learning algorithms. 
Type Of Technology Software 
Year Produced 2016 
Impact We are using this evaluation internally across projects, but we plan to release the Python library as open source so that the community can apply statistical evaluation to machine learning tasks. 
 
Title CopperDroid and related machine learning infrastructure 
Description CopperDroid is a dynamic analysis framework to reconstruct the behavior of Android apps. Besides providing information to analysts, the reconstructed behaviors are fed to machine learning to enable automated classification of Android apps and malware. 
Type Of Technology Software 
Year Produced 2015 
Impact We are engaged in a number of conversations with industrial partners (e.g., McAfee Labs, Qualcomm, and Google) and academia (e.g., University of Luxembourg, National University Singapore, TU Munich) to further monetize the analysis capabilities of CopperDroid. 
URL http://copperdroid.isg.rhul.ac.uk
 
Description A number of talks given to industry and academia 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I have given a number of talks to further disseminate the outcome of the research carried out under the 2 EPSRC grants for which I am currently PI. The following is just an excerpt, but I will update the list with proper entries soon. The talks often cover topics from both grants:

OWASP AppSec EU Keynote 2014, Dagstuhl 2014, National Cyber Crime Unit 20145, BlackHat London Mobile Summit 2015, Georgia Tech 2015, Stony Brook University 2015, Qualcomm Inc. 2015, Google 2015, University of Catania 2015, University of Luxembourg 2015, IMDEA Software 2016, Kyushu University 2016, NIMBUS (EPSRC) 2016, Polytechnic University of Hong Kong 2016
Year(s) Of Engagement Activity 2014,2015,2016
 
Description Automatic Analysis and Classification of Android Malware (Concept Drift Detection) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact This talk outlines how the analysis we have been exploring in the "MobSec: Security in the Mobile Age" EPSRC project influenced the "Mining the Network Behaviour of Bots" EPSRC grant and vice versa. In particular, we have been introducing a statistical machine learning evaluation framework to identify concept drift. This is applicable not only to the domains explored in these research projects, but to other fields as well.
Year(s) Of Engagement Activity 2016,2017
 
Description CEReS Dissemination Day 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact The talk was well received and sparked questions and discussion afterwards

Not aware of any direct one.
Year(s) Of Engagement Activity 2014
URL http://www.epsrc.ac.uk/newsevents/events/ceresconference/
 
Description COPA 2013 (Conformal Prediction under Hypergraphical Models) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Type Of Presentation paper presentation
Geographic Reach International
Primary Audience Other audiences
Results and Impact N/A
Year(s) Of Engagement Activity 2013
 
Description COPA 2013 (Defensive Forecast for Conformal Bounded Regression) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Type Of Presentation paper presentation
Geographic Reach International
Primary Audience Other audiences
Results and Impact N/A
Year(s) Of Engagement Activity 2013
 
Description COPA 2013 (Enhanced Conformal Predictors for Indoor Localisation Based on Fingerprinting Method) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Type Of Presentation paper presentation
Geographic Reach International
Primary Audience Other audiences
Results and Impact N/A
Year(s) Of Engagement Activity 2013
 
Description COPA 2013 (Learning by Conformal Predictors with Additional Information) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Type Of Presentation paper presentation
Geographic Reach International
Primary Audience Other audiences
Results and Impact N/A
Year(s) Of Engagement Activity 2013
 
Description COPA 2013 (Transductive Conformal Predictors) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Type Of Presentation paper presentation
Geographic Reach International
Primary Audience Other audiences
Results and Impact N/A
Year(s) Of Engagement Activity 2013
 
Description COPA 2014 (Anomaly Detection of Trajectories with Kernel Density Estimation by Conformal Prediction) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Type Of Presentation paper presentation
Geographic Reach International
Primary Audience Other audiences
Results and Impact N/A
Year(s) Of Engagement Activity 2014
 
Description COPA 2014 (Conformal Prediction under Probabilistic Input) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Type Of Presentation paper presentation
Geographic Reach International
Primary Audience Other audiences
Results and Impact N/A
Year(s) Of Engagement Activity 2014
 
Description COPA 2014 (SVM Venn Machine with k-Means Clustering) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Type Of Presentation paper presentation
Geographic Reach International
Primary Audience Other audiences
Results and Impact N/A
Year(s) Of Engagement Activity 2014
 
Description Dr Huazhen Wang, Huaqiao University visit (1 year) to study conformal predictors 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Discussions were very interesting and sparked further research directions in conformal predictors

Further engagement with the collaborator
Year(s) Of Engagement Activity 2013,2014
 
Description HP Labs Bristol 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact The talk allowed us to sign NDAs with HP Labs Bristol to further support our research.

NDA and future collaboration.
Year(s) Of Engagement Activity 2014
 
Description Invited talks at Huawei Germany, Huawei Finland, University of Bologna, King's College London 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Several talks to disseminate the results of the USENIX Sec 2018 paper. The aim is to pursue research impact through licensing of this research.
Year(s) Of Engagement Activity 2017
 
Description Misleading Metrics: On Evaluating Machine Learning for Malware with Confidence 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Invited talk at "Mathematical foundations of probabilistic conformal prediction and its applications in machine learning"; a number of invited talks and seminars at conferences, universities and industrial partners.
Year(s) Of Engagement Activity 2015,2016,2017