Mining the Network Behaviour of Bots

Lead Research Organisation: Royal Holloway University of London

Department Name: Information Security

Abstract

The botnet phenomenon has quickly become a major security concern for all
Internet users. In fact, not only has it rapidly gained popularity among the
mass media, but it has also received the attention of the research community
interested in understanding, analyzing, and detecting bot-infected machines.
Once infected with a bot, the victim host joins a botnet, a network of
compromised machines that are under the control of a malicious entity. Botnets
are the primary means for cyber-criminals to carry out criminal tasks,
such as sending spam mail, launching denial-of-service attacks, or stealing
personal data such as mail accounts or bank credentials.

Clustering and correlating network events represent the state of the art when
it comes to detecting and understanding the botnet phenomenon from a network
perspective. While effective, such approaches rest on weak foundations, being
vulnerable to easy-to-perform (time and network) obfuscation attacks.

The goal of this project is to build on the promising results of our previous
work to explore novel machine-learning techniques to make the state-of-the-art
more accurate and robust against evasions and advanced malware. Exploring the
possibilities of advanced malware (and thus bots) to enable the development of
novel mathematical techniques to address such threats is not a mere academic
exercise. On the contrary, it is of paramount importance to build robust and
hard-to-elude mitigation approaches; something we currently lack, as
acknowledged by the research community at large.

On the cyber security side, we will develop techniques to analyze the network
traffic generated by a bot sample. Our analysis will focus on inferring the
interesting part of a bot's network behaviour to automatically generate models
that faithfully describe it. Our analysis aims at being independent from the
underlying botnet infrastructure, payload-agnostic, and able to pinpoint
legitimate-resembling malicious activities. The network flows of a monitored
bot will be initially filtered to remove well-defined attack patterns. The
remaining flows will be clustered using a number of network features and
suitable similarity functions. Clusters whose size exceeds a given threshold
will then be analyzed for periodicity: bots tend to engage in similar network
activities that have interflow intervals that either are sampled independently
from a potentially unknown probability distribution, or belong to a small
number of well-defined clusters. Once clusters exhibiting interesting
periodicity patterns are identified, they can be used, along with their network
features, for detecting (or understanding the behaviour of) bots in a mixed
population containing both compromised and clean hosts.
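
The periodicity check described above can be illustrated with a minimal Python
sketch. It is not the project's implementation: the function names, the greedy
one-dimensional grouping, and the parameters rel_tol, max_groups and min_flows
are all assumptions made for illustration. The sketch computes the interflow
intervals of a single flow cluster and tests whether they fall into a small
number of well-separated groups.

```python
from typing import List, Sequence


def interflow_intervals(timestamps: Sequence[float]) -> List[float]:
    """Return the gaps (in seconds) between consecutive flow start times."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def group_intervals(intervals: Sequence[float], rel_tol: float = 0.1) -> List[List[float]]:
    """Greedy 1-D grouping (hypothetical helper): walk the sorted intervals and
    keep a value in the current group while it lies within rel_tol of the
    running group mean; otherwise start a new group."""
    groups: List[List[float]] = []
    for value in sorted(intervals):
        if groups:
            mean = sum(groups[-1]) / len(groups[-1])
            if abs(value - mean) <= rel_tol * mean:
                groups[-1].append(value)
                continue
        groups.append([value])
    return groups


def looks_periodic(
    timestamps: Sequence[float],
    max_groups: int = 3,   # assumed cap on "a small number" of interval groups
    rel_tol: float = 0.1,  # assumed tolerance for timing jitter
    min_flows: int = 5,    # assumed minimum cluster size worth analysing
) -> bool:
    """Flag a flow cluster whose interflow intervals form few distinct groups."""
    if len(timestamps) < min_flows:
        return False
    groups = group_intervals(interflow_intervals(timestamps), rel_tol)
    return len(groups) <= max_groups


# A beacon roughly every 60 seconds versus irregular, human-like traffic.
beacon = [0, 60.5, 119.8, 180.2, 240.0, 300.4]
irregular = [0, 3, 50, 51, 200, 900, 905, 2000]
print(looks_periodic(beacon), looks_periodic(irregular))
```

A fixed relative tolerance is the simplest choice here; a bot that adds large
random jitter would defeat it, which is the kind of time obfuscation the
proposal aims to make detection robust against.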

On the machine learning side, we propose to explore the use of conformal
prediction developed by our team to make such cluster-based analysis more
accurate and robust against arbitrary obfuscation-based evasion attacks. A
powerful clustering method is based on nonparametric probability density
estimation. A recent work proposes a computationally efficient method of
nonparametric density estimation based on conformal prediction and inherits its
properties of validity. We plan to explore the use of this method for the
purpose of robust clustering. A theoretical challenge is to spell out and study
the properties of robustness for this clustering method that are inherited from
the validity of the study mentioned above. In addition, the property of
validity of conformal predictors is usually established under the randomness
assumption; we will explore how this assumption can be relaxed. Furthermore,
the property of validity can be used to control the number of "alarms"
(predicting that a host is compromised) raised by a bot detection algorithm.
This is valuable in situations where alarms have to be investigated by human
experts but the available manpower is limited.
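
How validity bounds the number of alarms can be sketched with a basic inductive
conformal anomaly test in Python. This is an illustration under stated
assumptions, not the project's method: the nonconformity scores are taken as
given (any score where larger means "stranger" would do), and the names
conformal_p_value, raise_alarm and epsilon are hypothetical. If the scores of
clean hosts are exchangeable with the calibration scores, a clean host is
flagged with probability at most epsilon.

```python
from typing import Sequence


def conformal_p_value(calibration_scores: Sequence[float], score: float) -> float:
    """Fraction of scores (calibration plus the new one) that are at least as
    nonconforming as the new host's score."""
    n_at_least = sum(1 for s in calibration_scores if s >= score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def raise_alarm(
    calibration_scores: Sequence[float],
    score: float,
    epsilon: float = 0.05,  # assumed significance level, i.e. the alarm budget
) -> bool:
    """Raise an alarm when the host looks stranger than all but an epsilon
    fraction of the clean calibration hosts."""
    return conformal_p_value(calibration_scores, score) <= epsilon


# Hypothetical nonconformity scores of 99 hosts known to be clean.
clean_scores = [float(i) for i in range(1, 100)]
print(conformal_p_value(clean_scores, 100.0))  # very unusual host
print(conformal_p_value(clean_scores, 50.0))   # typical host
```

Choosing epsilon then amounts to choosing how many false alarms per clean host
the available analysts can afford to investigate.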

Planned Impact

This project is concerned with the development of novel
obfuscation-resistant techniques to analyze and detect core network
behavioral traits of bot-infected computing devices.

We will organize a two-day workshop on the subject of machine learning
and malware analysis. The workshop will be aimed at bringing together
all the project collaborators, academic researchers and industry
practitioners with interest in the project's topics. The goal of the
workshop is to narrow the gap that nowadays exists between security
research carried out in academia and industry to face common threats.

The workshop will be held at the end of the second year of the project
and members of the advisory board (see below) will be encouraged to
attend. The workshop will include a number of invited talks from
representatives of research and industry laboratories, including, but not
limited to, McAfee Labs (UK), Symantec Research Labs (Europe), VU
University Amsterdam (The Netherlands), and FORTH-ICS (Greece), for
which this project has already received strong letters of support.

We will also appoint a panel of experts, mostly from
industry, who will provide advice on practical problems and needs,
and help to promulgate the results of our research.

The Information Security Group at Royal Holloway, University of London
has considerable experience in working with industry. Over the last
ten years it has convened expert panels on public key infrastructures
(the "PKI Club"), authentication and identity management (the "AIM
Club") and, most recently, on cyber security. These panels,
comprising experts from industry and government circles, meet
regularly to discuss the challenges facing industry and government,
focusing on a different aspect at each meeting. We anticipate that the
expert advisory board for this project would operate in a similar way.

The composition of the board will be determined in the first few
months of the project. At least one member of the advisory board will
be invited to give an industry-centered talk at the workshop. The
board will meet four times, once within six months of the start of the
project and at the end of each year of the project.

We are confident our project will produce key results to overcome
advanced malware, which are of paramount interest to academic and
industry security researchers. Moreover, not only will the developed
machine learning techniques improve our understanding of bot-related
threats, but they will also likely be applicable to other important
areas, such as Systems Biology, where time-series data (and clustering
in general) has to be examined.

Funded Value:

£680,622

Funded Period:

Jun 13 - Jun 17

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/K033344/1

Principal Investigator:

Lorenzo Cavallaro

Research Subject:

Info. & commun. Technol. (75%)

Mathematical sciences (25%)

Research Topic:

Artificial Intelligence (25%)

Networks & Distributed Systems (50%)

Statistics & Appl. Probability (25%)

Organisations

People	ORCID iD
Lorenzo Cavallaro (Principal Investigator)
Alexander Gammerman (Co-Investigator)
Zhiyuan Luo (Co-Investigator)
Vladimir Vovk (Co-Investigator)
Hugh Shanahan (Co-Investigator)	http://orcid.org/0000-0003-1374-6015
Ilia Nouretdinov (Researcher)

Publications

Author Name

Title Publication Date Published

Burnaev E (2014) Efficiency of conformalized ridge regression

Cherubin G (2016) Conformal and Probabilistic Prediction with Applications

Cherubin G (2015) Statistical Learning and Data Sciences - Third International Symposium, SLDS 2015, Egham, UK, April 20-23, 2015, Proceedings

Dash S (2015) Set-based Classification of Android Malware from Behavioral Abstractions (Poster)

Deo A (2016) Prescience

Dong Y (2023) Hrip1 enhances tomato resistance to yellow leaf curl virus by manipulating the phenylpropanoid biosynthesis and plant hormone pathway. in 3 Biotech

Fedorova V (2013) Artificial Intelligence Applications and Innovations

Fedorova V (2015) Hypergraphical Conformal Predictors in International Journal on Artificial Intelligence Tools

Fedorova V (2013) Conformal Prediction under Hypergraphical Models

Hurier M (2017) Euphony: Harmonious Unification of Cacophonous Anti-Virus Vendor Labels for Android Malware

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Research Tools and Methods
Collaboration
Intellectual Property
Software and Technical Products
Engagement Activities


Description	We have made considerable progress on the theory of conformal predictors, which provide classification supported by statistical confidence. While we have only recently started to adapt it to the original aim of identifying malicious network communications, we have quite unexpectedly built and expanded on it so as to provide a statistical evaluation of a broad range of machine learning algorithms. This seems to be quite successful in providing suitable quality indicators for identifying when classifiers start decaying, which triggers the detection of objects (e.g., network communications or any other event of interest) that are drifting away from the model the classifiers represent. Such findings can potentially be applied to other research domains, but we leave this as future work.
Exploitation Route	Our statistical evaluation of machine learning algorithms enables the identification of concept drift, detecting decay in learned models and potentially suggesting retraining strategies. We are planning to release the source code and a paper on our findings to allow other researchers to integrate our approach into realistic machine learning scenarios (open-world experiments).
Sectors	Aerospace Defence and Marine; Chemicals; Digital/Communication/Information Technologies (including Software); Financial Services and Management Consultancy; Retail; Security and Diplomacy; Transport


Description	The findings of this research project helped to better understand how to identify and quarantine security threats that evolve over time. Forecasting future threats is still an open challenge, but it is possible to equip learning models with the ability to reject uncertain predictions. Although classification with rejection and, more generally, abstaining classifiers are well explored in the machine learning community, little is known about their effectiveness when applied in evolving contexts, which are typical of computer security. The breakthrough of this research project represents a first step towards achieving this overarching goal, paving the way to address fundamental research challenges. See https://s2lab.cs.ucl.ac.uk and https://s2lab.cs.ucl.ac.uk/projects/transcend for further details.
First Year Of Impact	2017
Sector	Digital/Communication/Information Technologies (including Software),Education,Other
Impact Types	Societal


Description	GCHQ Small Grants scheme 2015-2016
Amount	£39,000 (GBP)
Organisation	Government Communications Headquarters (GCHQ)
Sector	Public
Country	United Kingdom
Start	01/2016
End	03/2016


Description	NCSC Small Grants scheme 2017-2018
Amount	£20,000 (GBP)
Organisation	National Cyber Security Centre
Sector	Public
Country	United Kingdom
Start	11/2017
End	01/2018


Description	NVIDIA GPU donation
Amount	£3,000 (GBP)
Organisation	NVIDIA
Sector	Private
Country	Global
Start


Title	CopperDroid
Description	We have developed the infrastructure to enable dynamic analysis and classification of Android applications at scale; we are in the process of finalizing a RESTful API to provide free use of the service to practitioners and researchers.
Type Of Material	Improvements to research infrastructure
Provided To Others?	No
Impact	We are in discussions with Google and McAfee about a potential integration of our analysis system into their backend infrastructure.


Title	Conformal Evaluator
Description	We have developed a statistical machine learning evaluation framework to provide a quantifiable assessment of the quality of a given machine learning classification. Not only does this make it possible to understand how well an approach may perform in real-life deployment, but it also provides metrics that can be leveraged to detect concept drift, and thus decay in classifier performance (suggesting retraining strategies), in realistic settings.
Type Of Material	Data analysis technique
Provided To Others?	No
Impact	N/A yet. We are filing a patent and we are using this approach in the output generated by the "MobSec: Security in the Mobile Age" EPSRC research grant. We are in the process of publishing the results of this approach, including source code, to enable other research groups to build on this outcome.


Description	HP Labs Bristol
Organisation	Hewlett Packard Ltd
Department	Hewlett Packard Laboratories, Bristol
Country	United Kingdom
Sector	Private
PI Contribution	We will be using HP Labs Bristol data to assess the effectiveness of our methodology
Collaborator Contribution	HP Labs Bristol is supporting our research effort by providing us with real-world data (passive DNS traffic), which we'll be experimenting with in one of the research directions (see Phoenix and Cerberus systems) we are exploring within the project.
Impact	We have a working prototype codenamed Cerberus, follow-up of Phoenix (see publications), which is using an initial dataset provided by Nominet. Our plan is to integrate conformal predictors (a novel machine learning approach developed by the team) in Cerberus to provide confidence and credibility in the results of the analysis performed.
Start Year	2014


Description	Nominet
Organisation	Nominet Trust
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	We will be using Nominet data to assess the effectiveness of our methodology
Collaborator Contribution	Nominet is supporting our research effort by providing us with real-world data (passive DNS traffic), which we'll be experimenting with in one of the research directions (see Phoenix and Cerberus systems) we are exploring within the project.
Impact	We have a working prototype codenamed Cerberus, follow-up of Phoenix (see publications), which is using an initial dataset provided by Nominet. Our plan is to integrate conformal predictors (a novel machine learning approach developed by the team) in Cerberus to provide confidence and credibility in the results of the analysis performed.
Start Year	2014


Description	University of Georgia
Organisation	University of Georgia
Country	United States
Sector	Academic/University
PI Contribution	The collaboration aims at improving the performance of our machine learning-based botnet classification (with confidence).
Collaborator Contribution	University of Georgia will be providing a large data corpus on which to evaluate our approach.
Impact	No output yet. This is still an ongoing activity that requires one of my students to visit University of Georgia from Aug to Oct 2017 to carry out a large-scale evaluation of the system we have been developing throughout this grant.
Start Year	2017


Title	METHOD OF MONITORING THE PERFORMANCE OF A MACHINE LEARNING ALGORITHM
Description	A crucial requirement for building sustainable learning models is to train on a wide variety of samples. Unfortunately, objects on which the learned models are used may evolve and the learned models may no longer work well. The invention provides a framework to identify aging classification models in vivo during deployment (concept drift), well before the machine learning model's performance starts to degrade. A statistical comparison of samples seen during deployment with those used to train the model is used, thereby building metrics for classification quality.
IP Reference	WO2019002603
Protection	Patent application published
Year Protection Granted	2019
Licensed	No
Impact	We have been contacted by several academic institutions that want access to our code; we are working on a license suitable for academic researchers as well as for industrial partners. To this end, we are in contact with Huawei Technologies to develop a research impact potentially through licensing.


Title	AntiBot
Description	AntiBot is a framework we have developed to explore one of the research directions tackled within the project, i.e., the design of novel machine learning techniques (conformal predictors), devised and developed within the team as part of the project, to detect bot-infected computing devices. While Phoenix and Cerberus (see other entries) focus on the analysis of passive DNS traffic to characterise a specific component of bot-initiated communications, AntiBot looks at netflows to generate a semantic signature of the core behavioural traits of bot-like malware. It currently relies on off-the-shelf machine learning, e.g., hierarchical clustering algorithms, as the underlying machine learning mechanism, but we are nonetheless at a stage where the advances in the development and our understanding of conformal clustering and conformal predictors (on classification problems) made within the project can start to be integrated into the framework.
Type Of Technology	Software
Year Produced	2014
Impact	Work in progress


Title	Cerberus app
Description	The software implements Phoenix (see Publications) and Cerberus (its follow-up research) to analyze passive DNS traffic and automatically identify likely malicious automatically generated domains with high accuracy and low false positives. Cerberus is still a work-in-progress, but more can be read in the MSc thesis available at https://www.politesi.polimi.it/bitstream/10589/92341/1/2014_04_Colombo.pdf
Type Of Technology	Webtool/Application
Year Produced	2014
Impact	Once deployed, the software would make it possible to 1) gather insights into the ever-evolving malicious DNS infrastructure used by malware and 2) generate smart blacklists to be deployed by the community at large (deployment is private for the time being; we need to finalize large-scale experiments and to secure additional data feeds before releasing it to the public).


Title	Conformal evaluator
Description	This is the Python library that implements conformal evaluator, a framework to statistically assess the quality of a broad range of machine learning algorithms.
Type Of Technology	Software
Year Produced	2016
Impact	We are using this evaluation internally across projects, but we plan to release the Python library as open source so that the community can apply statistical evaluation to machine learning tasks.


Title	CopperDroid and related machine learning infrastructure
Description	CopperDroid is a dynamic analysis framework to reconstruct the behavior of Android apps. Besides providing information to analysts, the reconstructed behaviors are fed to machine learning to enable automated classification of Android apps and malware.
Type Of Technology	Software
Year Produced	2015
Impact	We are engaged in a number of conversations with industrial partners (e.g., McAfee Labs, Qualcomm, and Google) and academia (e.g., University of Luxembourg, National University Singapore, TU Munich) to further monetize the analysis capabilities of CopperDroid.
URL	http://copperdroid.isg.rhul.ac.uk


Description	A number of talks given to industry and academia
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	I have given a number of talks to further disseminate the outcome of the research carried out by the 2 EPSRC grants I am currently PI for. The following is just an excerpt, but I will update the list with proper entries soon. The talks often cover topics carried out by both grants: OWASP AppSec EU Keynote 2014, Dagstuhl 2014, National Cyber Crime Unit 20145, BlackHat London Mobile Summit 2015, Georgia Tech 2015, Stony Brook University 2015, Qualcomm Inc. 2015, Google 2015, University of Catania 2015, University of Luxembourg 2015, IMDEA Software 2016, Kyushu University 2016, NIMBUS (EPSRC) 2016, Polytechnic University of Hong Kong 2016
Year(s) Of Engagement Activity	2014,2015,2016


Description	Automatic Analysis and Classification of Android Malware (Concept Drift Detection)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	This talk outlines how the analysis we have been exploring in the "MobSec: Security in the Mobile Age" EPSRC project influenced the "Mining the Network Behaviour of Bots" EPSRC grant and vice versa. In particular, we have been introducing a statistical machine learning evaluation framework to identify concept drift. This is applicable not only to the domains explored in these research projects, but to other fields as well.
Year(s) Of Engagement Activity	2016,2017


Description	CEReS Dissemination Day
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	The talk was well received and sparked questions and discussion afterwards. Not aware of any direct impact.
Year(s) Of Engagement Activity	2014
URL	http://www.epsrc.ac.uk/newsevents/events/ceresconference/


Description	COPA 2013 (Conformal Prediction under Hypergraphical Models)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Type Of Presentation	paper presentation
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	N/A
Year(s) Of Engagement Activity	2013


Description	COPA 2013 (Defensive Forecast for Conformal Bounded Regression)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Type Of Presentation	paper presentation
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	N/A
Year(s) Of Engagement Activity	2013


Description	COPA 2013 (Enhanced Conformal Predictors for Indoor Localisation Based on Fingerprinting Method)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Type Of Presentation	paper presentation
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	N/A
Year(s) Of Engagement Activity	2013


Description	COPA 2013 (Learning by Conformal Predictors with Additional Information)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Type Of Presentation	paper presentation
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	N/A
Year(s) Of Engagement Activity	2013


Description	COPA 2013 (Transductive Conformal Predictors)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Type Of Presentation	paper presentation
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	N/A
Year(s) Of Engagement Activity	2013


Description	COPA 2014 (Anomaly Detection of Trajectories with Kernel Density Estimation by Conformal Prediction)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Type Of Presentation	paper presentation
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	N/A
Year(s) Of Engagement Activity	2014


Description	COPA 2014 (Conformal Prediction under Probabilistic Input)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Type Of Presentation	paper presentation
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	N/A
Year(s) Of Engagement Activity	2014


Description	COPA 2014 (SVM Venn Machine with k-Means Clustering)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Type Of Presentation	paper presentation
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	N/A
Year(s) Of Engagement Activity	2014


Description	Dr Huazhen Wang, Huaqiao University visit (1 year) to study conformal predictors
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	Discussions were very interesting and sparked further research directions in conformal predictors. Further engagement with the collaborator.
Year(s) Of Engagement Activity	2013,2014


Description	HP Labs Bristol
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	The talk allowed us to sign NDAs with HP Labs Bristol to further support our research. NDA and future collaboration.
Year(s) Of Engagement Activity	2014


Description	Invited talks at Huawei Germany, Huawei Finland, University of Bologna, King's College London
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	Several talks promoting and disseminating the results of the USENIX Sec 2018 paper. The aim is to pursue research impact through licensing of this research.
Year(s) Of Engagement Activity	2017


Description	Misleading Metrics: On Evaluating Machine Learning for Malware with Confidence
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Invited talk at "Mathematical foundations of probabilistic conformal prediction and its applications in machine learning"; a number of invited talks and seminars at conferences, universities and industrial partners.
Year(s) Of Engagement Activity	2015,2016,2017
