CROSS: Real-time Story Detection Across Multiple Massive Streams

Lead Research Organisation: University of Edinburgh
Department Name: School of Informatics

Abstract

The world is rapidly becoming more connected, with people communicating over multiple streams (social media, newswire, Wikipedia, etc.) on a bewildering range of topics and at a furious rate. Twitter alone receives more than 250 million new posts every day (Tsotsis 2011). This massive interconnection means that content can appear and quickly spread through and across different streams. For example, in the recent London riots, many tweets reported the rioting events as they happened in real-time. However, not all content posted is of good quality or factually correct, which complicates the job of monitoring such streams for any purpose. An example of this happened when a comedian spread false rumours on Twitter about Osama Bin Laden watching his television show (Lineham 2011). Communication streams are also known to spread rumours, outright misinformation and content with malicious intent. For instance, during the same riots, radicalising posts were spread calling for participation in the so-called "cyber-jihad" (BBC 2011). Systems that can identify such posts are of paramount importance for security monitoring purposes.

On the other hand, not all information spread on media such as Twitter is accurate or interesting. This is compounded by the peculiarities of messages on modern social media (short, jargon-laden, dependent on social context, etc.), where biased, incomplete, inaccurate and misleading messages are common. This makes it extremely challenging to automatically identify events worth monitoring for security purposes in real-time.

We propose a distributed infrastructure to automatically identify important new events (aka stories) in real-time by combining and comparing multiple message streams. For many applications, the faster a story is detected, the more valuable the detection is. A security agency using our system would be better prepared when dealing with fast-moving events as they unfold. Indeed, in this project, the notion of importance will be defined within a security context. Because streams are often biased and not everything in them can be trusted, a key requirement of the system is minimising false positives (uninteresting stories that are discovered). Moreover, the effective management and efficient processing of multiple streams of real-time data pose new technological and scientific challenges:

Challenge 1: Identify interesting new stories and not drown in a sea of false positives, yet reduce the effects of bias and rumour.
Challenge 2: Minimise system latency, such that new stories are detected in real-time.

We tackle the first challenge from the novel perspective of processing multiple streams and exploiting the fact that stories reported multiple times across several streams can cancel out stream-specific bias and errors. For example, if a story is true, then it is more likely to manifest both in Twitter and as an update to a Wikipedia article. Alternatively, a story might appear in Twitter and also appear in a governmental cable. The more often a story occurs within and across streams, the more likely it is to be interesting. This is the cornerstone of our proposal, which we tackle by building upon modern first story detection techniques, adapted to account for bias and rumours.
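
As a rough illustration of the idea in the preceding paragraph, the sketch below flags a message as a new story when it is not similar enough to anything seen before, and otherwise records which streams have reported the matching story. It is only a minimal sketch under stated assumptions, not the project's implementation: the class name CrossStreamDetector, the bag-of-words representation, the cosine-similarity threshold of 0.5 and the exhaustive scan over past stories are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def vectorise(text):
    """Turn a message into a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class CrossStreamDetector:
    """Hypothetical first-story detector that tracks corroborating streams."""

    def __init__(self, threshold=0.5):
        # Assumed similarity threshold; below it a message starts a new story.
        self.threshold = threshold
        # Each story keeps the vector of its first message and the set of
        # stream names that have reported it so far.
        self.stories = []

    def observe(self, stream, text):
        """Return (story_id, is_new, number_of_distinct_streams)."""
        vec = vectorise(text)
        best_id, best_sim = None, 0.0
        # Exhaustive nearest-neighbour search, kept simple for clarity.
        for story_id, story in enumerate(self.stories):
            sim = cosine(vec, story["vector"])
            if sim > best_sim:
                best_id, best_sim = story_id, sim
        if best_id is None or best_sim < self.threshold:
            self.stories.append({"vector": vec, "streams": {stream}})
            return len(self.stories) - 1, True, 1
        story = self.stories[best_id]
        story["streams"].add(stream)
        return best_id, False, len(story["streams"])


detector = CrossStreamDetector(threshold=0.5)
first = detector.observe("twitter", "riots reported in central london tonight")
second = detector.observe("wikipedia", "riots reported in central london")
third = detector.observe("twitter", "new phone released today")
```

Here the second message matches the story opened by the first, so that story is now corroborated by two streams, while the third message opens an unrelated story. The exhaustive scan grows with the number of stories, so a system working at stream scale would need an approximate nearest-neighbour method in its place.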

For the second challenge, we ensure low-latency story detection by using a distributed real-time data processing architecture (e.g. S4 or Storm), similar to MapReduce but better suited to real-time operations. Real-time architectures for dealing with massive-scale data are in their infancy, hence CROSS will present a first concrete application, with a corresponding development of best practices for such architectures.

Planned Impact

Impact on Defence and Security Sector
Monitoring message streams (especially Social Media) has been identified as important by the CIA, mainly for counter-terrorism and counter-proliferation purposes (PC Magazine, 2011). For example, the CIA uses it to monitor the response of groups of people as events unfold in the world and brief the US President on a daily basis:

"Sites such as Facebook and Twitter have become a key resource for following a fast-moving crisis such as the riots that raged across Bangkok in April and May of last year, the center's deputy director said" (PC Magazine, 2011).

Whilst the manner and method of the monitoring conducted by the CIA are not in the public domain, it is clear that making such monitoring systems more accurate (with fewer false positives) and faster will enhance national security. Any signals that characterise the detection of novelty (e.g. a new story) that we discover will generalise to such security settings. We also envisage our work being applicable to the more general objective of tackling 'CyberTerrorism', especially identifying 'Cyber-Jihad' threats (BBC News, 2011):

"Terrorists are increasingly using online technology ... for attack planning. While radicalisation continues primarily to be a social process, terrorists are making more and more use of new technologies to communicate their propaganda."--Theresa May, 12th July 2011

Our work has wider impact upon the national civic sector in that real-time, event-based processing technology is currently very much the province of (mainly) US-based internet companies such as Google, Twitter, and Facebook. There is very little understanding elsewhere of the problems involved with large-scale distributed processing. Our project will help with technology transfer into the UK sector as a whole, better enabling the UK to build real-time, robust processing systems capable of dealing with "Big Data".

With respect to the call specifically, our work directly tackles the need to extract meaningful information from massive, "incomplete, contradictory, noisy and dispersed data sets". Furthermore, the combined application of provably efficient algorithms and robust distributed technologies addresses the "need for speed".

Impact on National Civil Sector
The monitoring of civil behaviour can have security applications at the level of regional police forces or city councils. This would permit the targeted placement of resources (e.g. ambulances, police). A longer-term vision includes the cross-comparison of geo-tagged social data (e.g. tweets) with data from environmental sensors (e.g. automatic crowd detection from video and audio sensors). Indeed, Craig Macdonald and Iadh Ounis are currently involved in an EU project (FP7 287583: SMART) which integrates processed sensor data from the city of Santander, Spain into a search setting.

Knowledge Transfer
Both university partner institutions have long-standing track records in commercialisation and knowledge transfer. In particular, the University of Edinburgh has a project dedicated to the commercialisation of Informatics technology, with a member of staff dedicated to language technologies. In a similar vein, the University of Glasgow can facilitate building knowledge transfer collaborations or spinouts through its commercialisation partner, IP Group plc. Finally, both institutions are members of the Scottish Informatics and Computing Science Alliance (SICSA), which runs dedicated knowledge exchange activities with members of industry.

Public Engagement
The Inspace public engagement lab of the University of Edinburgh's School of Informatics provides a "shop window" and a venue for public visits and demonstrations. CROSS will use Inspace for public lectures, exhibitions and demonstrations of the research results, including participation in the Edinburgh International Science Festival.

Publications

Miles Osborne (Author) (2013) Can Twitter replace Newswire for breaking news?

Zhao W (2015) Incorporating Social Role Theory into Topic Models for Social Media Content Analysis in IEEE Transactions on Knowledge and Data Engineering

 
Description We explored how events discovered in Twitter could be improved by cross-referencing them with corresponding events in Wikipedia. We made this efficient using randomised algorithms.
Exploitation Route We published our findings. Our approach has already been extended by others (found by citations).
Sectors Aerospace, Defence and Marine; Communities and Social Services/Policy; Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Security and Diplomacy

URL http://demeter.inf.ed.ac.uk/cross/index.html
 
Description We showed how event detection in Twitter can be improved by cross-referencing it with other streams of information (such as how people search Wikipedia).
First Year Of Impact 2013
Sector Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Security and Diplomacy
Impact Types Cultural, Societal, Economic

 
Description Cross Stream Event Detection 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience
Results and Impact Talk summarising the Cross project.

Repeat talk at Computer Science, Macquarie University, Australia
Year(s) Of Engagement Activity 2013
 
Description Cross Stream Event Detection 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Participants in your research or patient groups
Results and Impact I summarised the Cross project: looking at how event detection could be made scalable using Storm and LSH; also how quality could be improved using deferral and other streams.

Talk at Computer Science, Melbourne University, Australia
Year(s) Of Engagement Activity 2013
 
Description Finding Events in (Multiple) Massive Streams 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Participants in your research or patient groups
Results and Impact Talk at Dublin City University describing how we detect events in Twitter.
Year(s) Of Engagement Activity 2012