CybercrimeNLP (CC-NLP): A natural language processing toolkit for the interdisciplinary analysis of underground online forums

Lead Research Organisation: University of Cambridge
Department Name: Computer Science and Technology

Abstract

Online and electronic crime now account for about half of all property crime, in all countries for which we have good victimisation data. A significant number of other offences, including harassment, also happen online. It is therefore essential for criminologists, lawyers, social scientists, psychologists and others to be able to study online crime and work out what's going on.

We are starting to have some really good sources of data, including more than 70 million messages scraped from underground crime forums in the CrimeBB database. These forums are where cyber-crooks meet up, trade tools and techniques, and sell each other services. They are a gold mine for criminologists studying how young people get drawn into crime; social scientists studying the evolution of political ideology, racism and homophobia; lawyers interested in criminal business models and how they respond to police interventions; and many others.

The missing link is that scholars in the humanities and social sciences do not at present have the tools to deal with such large bodies of text. In the pre-Internet era, researchers might have interviewed a few dozen criminals, coded up the interviews by hand and analysed them using a statistics package; but dealing with millions of messages requires new approaches.

This project will draw upon the discipline of natural-language processing to build tools that will enable scholars in the humanities and social sciences to deal with these large volumes of text using modern techniques of artificial intelligence and machine learning (AI/ML). They will help researchers find topics of interest, identify the types of crime being discussed, search for messages that are similar in various ways to those already identified, track trends, and match users across forums. Users will be able to look for indicators that identify users who are just starting out (and might therefore be targeted with primary prevention approaches) as well as those who are becoming influential (and might therefore be worth more aggressive interventions). Our tools will also enable researchers to measure the effect of both crime-prevention initiatives and policing action, so that policymakers can gather evidence of what works and what doesn't.

The tools we build will start to do for research with large text corpora drawn from crime forums what search engines have done for the Internet -- namely making such resources accessible to researchers who do not have either technical skills or technical assistance. They will therefore enable much more use to be made of existing data resources, starting with the CrimeBB database (which was built in a previous project funded by ESRC and EPSRC), but not limited to it. Their use by researchers in diverse disciplines will also enable us to learn about how NLP tools, and more generally AI/ML tools, can be used robustly. This is of independent importance given the current rush to use AI/ML techniques and the concern that some of these techniques may simply reflect the bias in their training data, leading naive researchers to just measure their own ruler. It's not enough just to invent new tools; we also have to figure out how to use them properly, and for that, it's vital to work with a community of scholars from multiple disciplines in the humanities and social sciences on a shared problem, using shared data, and where we eventually have some access to ground truth.

Planned Impact

The Cambridge Cybercrime Centre achieves its impact through research, research support and operations. First, we publish our own research based on the data we collect; second, we are the go-to place for companies that want to share cybercrime data with academics, and for academics who want data for research; and third, we work with law enforcement agencies such as the NCA and the FBI, and with the abuse teams of service firms. The most important of these is the research support.

The focus of the project we propose here is research support. Our CrimeBB database has become the treasure trove for cybercrime researchers whose backgrounds are in criminology, law and the other social sciences and humanities. We will make it much easier for scholars who do not have a computer science background, and who are not working with a computer science colleague, to search the information in a very large text corpus such as CrimeBB, organise it, and make it useful. By enabling scholars to identify topics, work out what types of crime they relate to, look for similar posts, perform clustering analysis and track trends over time, we will empower them and significantly increase their productivity, just as search engines have empowered all of us. It will no longer be necessary to have technical skills or technical support to write programs to interrogate the database. This will not only make our existing users more productive but enable many more users to benefit from the available data -- much of which was collected using support from a previous grant awarded jointly by EPSRC and ESRC.

So the first impact is to enable more scholars to use data collected with ESRC funds, and to enable existing scholars to use the data better.

The second set of beneficiaries will be law-enforcement agencies with whom we and other scholars work directly; with better tools we will be able to help them, both in identifying possible interventions and in assessing the effectiveness of interventions.

The third set of beneficiaries will be the NLP community as we create another worked example of NLP tools being used in a challenging real-world environment where their effectiveness can be assessed, both explicitly and implicitly, in a large number of peer-reviewed publications, namely the publications written by our users. The impact here will hopefully lie not just in refining the state of the art in active learning and working out how to annotate Russian criminal slang, but in a better understanding of how techniques drawn from artificial intelligence and machine learning can be used to develop robust methodologies for dealing with adversarial material that stand up to peer review in multiple disciplines. This is important in itself, given the growing concern that many AI/ML techniques are overhyped and that researchers who use them may end up measuring their own rulers.

Finally, we hope for an emergent impact in that by getting people from different disciplines to work together, not just on common problems but using shared data resources and, thanks to this project, shared tools, we can help build real cross-disciplinary collaboration. This already happens via the Cambridge Cybercrime Centre, through academic publications, through our communications and through our annual conference; we believe that by engaging our user community in helping us develop the tools they need as participants in the process, we can build community cohesion still further.
 
Description The key outcome from this grant is the creation of PostCog, a web application designed for interdisciplinary researchers to enable them to search and extract data from our collection of cybercrime forum data. PostCog incorporates additional tools, such as a trending topics tool and a crime type classifier, as well as existing tools developed by the research team. We have also expanded the tools to include Spanish and German languages.
Exploitation Route PostCog will be used by researchers who sign data sharing agreements to use the CrimeBB dataset through the Cambridge Cybercrime Centre. We hope to do further user testing to identify how PostCog should be improved. We would also like to add further functionality to PostCog, including the ability to create social network graphs.
Sectors Creative Economy; Digital/Communication/Information Technologies (including Software); Government, Democracy and Justice; Security and Diplomacy; Other

 
Title Crime type classifier 
Description The crime type classifier annotates cybercrime forum data, adding labels for posts associated with different types of cybercrime. 
Type Of Technology Webtool/Application 
Year Produced 2021 
Open Source License? Yes  
Impact The crime type classifier provides searchable labels which are integrated into the PostCog search interface. 
 
Title PostCog 
Description PostCog is a web application designed to support users from both technical and non-technical backgrounds in forum analysis tasks, such as search, information extraction and cross-forum comparison. 
Type Of Technology Webtool/Application 
Year Produced 2022 
Open Source License? Yes  
Impact PostCog will be used by researchers who use the CrimeBB dataset, which contains data from cybercrime forums. 
 
Title Trending topics tool 
Description The trending topics tool provides a lightweight method for observing trending topics on underground forums. 
Type Of Technology Webtool/Application 
Year Produced 2022 
Open Source License? Yes  
Impact The trending topics tool will be integrated into the PostCog user interface.