CybercrimeNLP (CC-NLP): A natural language processing toolkit for the interdisciplinary analysis of underground online forums

Lead Research Organisation: University of Cambridge

Department Name: Computer Science and Technology

Abstract

Online and electronic crime now account for about half of all property crime, in all countries for which we have good victimisation data. A significant number of other offences, including harassment, also happen online. It is therefore essential for criminologists, lawyers, social scientists, psychologists and others to be able to study online crime and work out what's going on.

We are starting to have some really good sources of data, including more than 70 million messages scraped from underground crime forums in the CrimeBB database. These forums are where cyber-crooks meet up, trade tools and techniques, and sell each other services. They are a gold mine for criminologists studying how young people get drawn into crime; social scientists studying the evolution of political ideology, racism and homophobia; lawyers interested in criminal business models and how they respond to police interventions; and many others.

The missing link is that scholars in the humanities and social sciences do not at present have the tools to deal with such large bodies of text. In the pre-Internet era, researchers might have interviewed a few dozen criminals, coded up the interviews by hand and analysed them using a statistics package; but dealing with millions of messages requires new approaches.

This project will draw upon the discipline of natural-language processing to build tools that will enable scholars in the humanities and social sciences to deal with these large volumes of text using modern techniques of artificial intelligence and machine learning (AI/ML). They will help researchers find topics of interest, identify the types of crime being discussed, search for messages that are similar in various ways to those already identified, track trends, and match users across forums. Users will be able to look for indicators that identify users who are just starting out (and might therefore be targeted with primary prevention approaches) as well as those who are becoming influential (and might therefore be worth more aggressive interventions). Our tools will also enable researchers to measure the effect of both crime-prevention initiatives and policing action, so that policymakers can gather evidence of what works and what doesn't.

The tools we build will start to do for research with large text corpora drawn from crime forums what search engines have done for the Internet -- namely making such resources accessible to researchers who have neither technical skills nor technical assistance. They will therefore enable much more use to be made of existing data resources, starting with the CrimeBB database (which was funded by a previous ESRC and EPSRC project), but not limited to it. Their use by researchers in diverse disciplines will also enable us to learn about how NLP tools, and more generally AI/ML tools, can be used robustly. This is of independent importance given the current rush to use AI/ML techniques and the concern that some of these techniques may simply reflect the bias in their training data, leading naive researchers to just measure their own ruler. It's not enough just to invent new tools; we also have to figure out how to use them properly, and for that, it's vital to work with a community of scholars from multiple disciplines in the humanities and social sciences on a shared problem, using shared data, and where we eventually have some access to ground truth.

Planned Impact

The Cambridge Cybercrime Centre achieves its impact through research, research support and operations. First, we publish our own research based on the data we collect; second, we are the go-to place for companies that want to share cybercrime data with academics, and for academics who want data for research; and third, we work with law enforcement agencies such as the NCA and the FBI, and with the abuse teams of service firms. The most important of these is the research support.

The focus of the project we propose here is research support. Our CrimeBB database has become the treasure trove for cybercrime researchers whose backgrounds are in criminology, law and the other social sciences and humanities. We will make it much easier for scholars who do not have a computer science background, and who are not working with a computer science colleague, to search the information in a very large text corpus such as CrimeBB, organise it, and make it useful. By enabling scholars to identify topics, work out what types of crime they relate to, look for similar posts, perform clustering analysis and track trends over time, we will empower them and significantly increase their productivity, just as search engines have empowered all of us. It will no longer be necessary to have technical skills or technical support to write programs to interrogate the database. This will not only make our existing users more productive but enable many more users to benefit from the available data -- much of which was collected using support from a previous grant awarded jointly by EPSRC and ESRC.

So the first impact is to enable more scholars to use data collected with ESRC funds, and to enable existing scholars to use the data better.

The second set of beneficiaries will be law-enforcement agencies with whom we and other scholars work directly; with better tools we will be able to help them, both in identifying possible interventions and in assessing the effectiveness of interventions.

The third set of beneficiaries will be the NLP community as we create another worked example of NLP tools being used in a challenging real-world environment where their effectiveness can be assessed, both explicitly and implicitly, in a large number of peer-reviewed publications, namely the publications written by our users. The impact here will hopefully lie not just in refining the state of the art in active learning and working out how to annotate Russian criminal slang, but in a better understanding of how techniques drawn from artificial intelligence and machine learning can be used to develop robust methodologies for dealing with adversarial material that stand up to peer review in multiple disciplines. This is important in itself, given the growing concern that many AI/ML techniques are overhyped and that researchers who use them may end up measuring their own rulers.

Finally, we hope for an emergent impact in that by getting people from different disciplines to work together, not just on common problems but using shared data resources and, thanks to this project, shared tools, we can help build real cross-disciplinary collaboration. This already happens via the Cambridge Cybercrime Centre, through academic publications, through our communications and through our annual conference; we believe that by engaging our user community in helping us develop the tools they need as participants in the process, we can build community cohesion still further.

Funded Value:

£242,595

Funded Period:

Sep 20 - Mar 22

Funder:

ESRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

ES/T008466/1

Principal Investigator:

Alice Hutchings

Research Subject:

Info. & commun. Technol. (32%)

Linguistics (16%)

Psychology (32%)

Sociology (16%)

Research Topic:

Computational Linguistics (16%)

Criminology (16%)

Forensic Psychology (32%)

Networks & Distributed Systems (32%)

Organisations

University of Cambridge (Lead Research Organisation)

People	ORCID iD
Alice Hutchings (Principal Investigator)	http://orcid.org/0000-0003-3037-2684
Paula Buttery (Co-Investigator)
Ross Anderson (Co-Investigator)	http://orcid.org/0000-0001-8697-5682
Andrew Caines (Researcher)

Publications

Atondo Siu G (2021) Follow the money: The relationship between currency exchange and illicit behaviour in an underground forum

Hughes J (2021) Researching Cybercrimes - Methodologies, Ethics, and Critical Approaches

Hughes J (2020) Detecting Trending Terms in Cybersecurity Forum Discussions

Pete I (2022) PostCog: A tool for interdisciplinary research into underground forums at scale

Key Findings
Impact Summary
Research Tools and Methods
Software and Technical Products


Description	The key outcome from this grant is the creation of PostCog, a web application designed for interdisciplinary researchers to enable them to search and extract data from our collection of cybercrime forum data. PostCog incorporates additional tools, such as a trending topics tool and a crime type classifier, as well as existing tools developed by the research team. We have also expanded the tools to include Spanish and German languages.
Exploitation Route	PostCog will be used by researchers who sign data sharing agreements to use the CrimeBB dataset through the Cambridge Cybercrime Centre. We hope to do further user testing to identify how we should improve PostCog in the future. We would also like to introduce additional functionality to PostCog in the future, including the ability to create social network graphs.
Sectors	Creative Economy; Digital/Communication/Information Technologies (including Software); Government, Democracy and Justice; Security and Diplomacy; Other


Description	Our search interface is now widely used by licensees who access our datasets. It is the main interface for researchers working with the CrimeBB dataset, and now also provides access to ExtremeBB (our collection of postings to extremist forums). We now have over 400 researchers using datasets from the Cambridge Cybercrime Centre, across a wide variety of departments, including the social sciences as well as computer science. The interface incorporates automated labelling of the data, using machine learning classifiers, to add value and increase searchability. The datasets we provide and the PostCog application particularly benefit early career researchers, who would otherwise have faced difficulty in gaining access to high-quality datasets when time and resources are often short.
First Year Of Impact	2022
Sector	Digital/Communication/Information Technologies (including Software); Government, Democracy and Justice; Security and Diplomacy
Impact Types	Policy & public services


Title	CrimeBB
Description	CrimeBB is a database of postings to underground cybercrime forums. It is the most widely used of all the resources collected and curated by our team. Starting in 2016, we scraped the contents of hackforums, where people bought and sold malware and other crime tools and services. We can't list it under "databases" as it's not "published" and doesn't have a DOI. For ethical and data-protection reasons it's available only under license.
Type Of Material	Improvements to research infrastructure
Year Produced	2016
Provided To Others?	Yes
Impact	We set out at the Cambridge Cybercrime Centre to turn cybercrime research into a science. Previously, researchers collected their own data and couldn't share it, so their findings could not easily be replicated or built on. We set out to change that by collecting and curating data at scale. Of all our collections, CrimeBB has turned out to be by far the most popular. We have other collections too; for example, ExtremeBB is a more recent project, which collects postings to extremist forums. As of February 2022, our data are licensed by 198 researchers at 65 research groups in 17 countries.
URL	https://www.cambridgecybercrime.uk/process.html


Title	Crime type classifier
Description	The crime type classifier annotates cybercrime forum data, adding labels for posts associated with different types of cybercrime.
Type Of Technology	Webtool/Application
Year Produced	2021
Open Source License?	Yes
Impact	The crime type classifier provides searchable labels which are integrated into the PostCog search interface.


Title	PostCog
Description	PostCog is a web application designed to support users from both technical and non-technical backgrounds in forum analysis tasks, such as search, information extraction and cross-forum comparison.
Type Of Technology	Webtool/Application
Year Produced	2022
Open Source License?	Yes
Impact	PostCog will be used by researchers who use the CrimeBB dataset, which contains data from cybercrime forums.


Title	Trending topics tool
Description	The trending topics tool provides a lightweight method for observing trending topics on underground forums.
Type Of Technology	Webtool/Application
Year Produced	2022
Open Source License?	Yes
Impact	The trending topics tool will be integrated into the PostCog user interface.
