Accelerated Real-Time Information Extraction System (ARIES)

Lead Research Organisation: Queen's University Belfast
Department Name: Sch of Electronics, Elec Eng & Comp Sci

Abstract

Technological advances in CMOS semiconductor technology paved the way for the digital revolution. As predicted by Moore, silicon integration capability has been doubling every 18 months over the past four decades, providing the foundation for low-cost computing and memory technology.

The digitisation of information and communication technologies sparked a number of innovations revolutionising the way we compute and communicate. Ubiquitous high-bandwidth communication, enabled by WiFi and 3G/4G technologies, facilitates on-demand access to a vast amount of application and location specific information including multimedia and broadcast content, video and voice communications, email and SMS/MMS. Furthermore, it has enabled on-demand access to personalised storage and computing resources, providing the foundation for the development of cloud computing infrastructures and a wide range of online web-based services and applications. With the decreasing cost of communication and storage the Internet has also become the global communication infrastructure for a wide range of autonomous sensor technologies, referred to as the "Internet of things". Key application areas include monitoring/surveillance, smart grid, smart homes and smart cities.

Monitoring internet traffic and mining meaningful information from both the online traffic and the stored information has emerged as essential for many critical applications and services. Examples include resource management, market intelligence, physical and cybercrime investigations and forensics, cyberspace policing, situation awareness and the monitoring of malicious behaviour for criminal and terrorist intent. As the scale, diversity and distributed nature of current and emerging data assets increase, and as data becomes ever more ubiquitous and critical to decision making, effective real-time mining of useful information becomes essential.

Considering the exponential increase of internet traffic and stored data, traditional software based approaches have become inadequate and unsustainable. Performance gain achieved due to Moore's law does not keep up with the required computing bandwidth of current and near future generated data assets. Internet traffic bandwidth is doubling every 12 months while the emerging content diversity is significantly increasing mining complexity. As the enterprise becomes more data centric, with a significant increase in data assets within the public and private cloud, traditional scaling by increasing the number of computing resources can no longer be sustained due to cost and power dissipation.

Most data mining algorithms are derived by the software community and are optimised for data structures suited to platforms based upon the von Neumann architecture. An effective solution now requires a paradigm shift in the way we process data and in how we extract meaningful information from a large amount of distributed, constantly changing data that is partially stored or in transit.

Planned Impact

Beyond the immediate academic beneficiaries it is anticipated that the following groups will also benefit from the research undertaken:

[1] Impact in the Defence and Security Sector
It is well recognised that potentially rich and valuable knowledge is often inaccessible within large and unstructured datasets. This can often come to light after a significant event has occurred and a subsequent forensic trawl of data reveals the hidden knowledge. For defence and security applications, where early visibility of emerging threat is paramount, the existing methods cannot deliver information extraction within an adequate timeframe. Situational awareness must be delivered in real-time or near real-time to inform critical decision making.

Physical Surveillance Scenario
Advancing surveillance and monitoring methods deployed by the military (radar, optical, infrared, audio, satellite, electronic signals) are generating vast datasets in real-time. Moreover the treatment of the inherent noise and uncertainty associated with such data, owing to operational limits of the sensing technologies, means traditional knowledge discovery methods are failing. The successful application of custom-purpose hardware offers the opportunity to exploit existing knowledge discovery models with potentially minimal modification and yet accelerate the end-to-end process bringing new capability to the situational awareness problem in live military operations.

Cyber Surveillance Scenario
A worldwide consensus has emerged identifying cyberspace as a new territory warranting military protection. In 2009 the US DoD declared a new military command dedicated to cyber-security. The UK Strategic and Security Review 2011 has identified "cyber attack, including by other states, and by organised crime and terrorists" as a top priority risk. As the scope and nature of this new territory becomes better understood in the military context a natural requirement emerges to equip the UK defence and security services with new tools to detect, identify and respond to threats against UK interests in cyberspace.
These tools must perform a robust surveillance function in a similar manner to their physical counterparts. As such they must contend with daunting datasets deployed across a vast distributed infrastructure and with data that is frequently transient in nature. This research proposal asserts that this specialised function requires specialised hardware to deliver cyber-surveillance capabilities that are sufficiently quick, flexible and robust to address this emerging need.

[2] Longer-term beneficiaries
Data mining of extremely large data sets is a pressing need expressed in multiple business sectors including medicine, security surveillance and network monitoring. This growing demand for new capabilities will drive the creation of a new and competitive commercial space to service this demand. UK based research has an opportunity to place itself at the centre of this movement and exploit future hardware advances.

[3] Future Impact
The work programme proposed aims to look at early opportunities to exploit custom purpose hardware in data intensive systems. However it will also seek to bring clarity to future trends and new capabilities that will inevitably emerge over the coming decade to 2020. This information will be of great value to those UK bodies tasked with strategy planning both in the defence sector and in commercial sectors. A clear view of how custom purpose hardware impacts upon knowledge discovery in vast datasets can inform the extrapolation of these capabilities to longer timeframes.

Publications

 
Description The ARIES project explored how hardware acceleration can be used to scale big data analysis throughput, in order to significantly reduce the analysis time of large unstructured datasets and achieve real-time data analysis capability for potential threat detection and situation awareness. Sentiment analysis is a very popular and widely used technique. Like most big data applications, it involves searching a portion of unstructured text for key patterns recognised to indicate sentiment. Hardware acceleration of sentiment analysis is of significant benefit for increasing the number of key patterns and the data size that can be processed within a given time frame or in real time. Our investigation has shown that offloading complex content processing tasks onto a custom-purpose parallel processing platform (ARIES) can achieve over 60x better performance than a standard software-based solution. Benchmarking the content processing only, the custom-purpose content processor achieves well over 6000x acceleration, excluding the communication overhead via the PCIe interface with the CPU.
Exploitation Route Significant industrial interest in the ARIES project led to a number of industry-funded projects and to the customisation of the ARIES platform for commercial use by RepKnight Limited.
After the completion of the project, Titan IC, a project partner, provided further funding to extend the project objectives, targeting further use-cases within network security. The ARIES project outcomes have been effectively utilised by Titan IC for advancing Titan IC product features and extending product use-cases. Some of the research undertaken under the ARIES project underpins Titan IC's current products.
Sectors Digital/Communication/Information Technologies (including Software),Electronics

 
Description The research output on real-time parsing of content has been exploited by Titan IC Systems, a Queen's University start-up company, to further develop the technology into a real-time content parsing technology for cybersecurity, in the form of a PCIe card. The technology has become an integral part of a larger system, called Hyperion, that is used for intrusion detection (IDS), application-layer DDoS detection and mitigation, and filtering of malicious content using ClamAV security rule-sets.
First Year Of Impact 2015
Sector Digital/Communication/Information Technologies (including Software),Electronics
Impact Types Economic

 
Description CSIT - ARIES - RepKnight Ltd
Amount £16,000 (GBP)
Organisation Invest Northern Ireland 
Sector Public
Country United Kingdom
Start 01/2014 
End 04/2014
 
Description CSIT - ARIES - RepKnight Ltd
Amount £16,000 (GBP)
Organisation Invest Northern Ireland 
Sector Public
Country United Kingdom
Start 02/2014 
End 05/2014
 
Description Productisation of FPGA based regex offload 
Organisation Titan IC Systems
Country United Kingdom 
Sector Private 
PI Contribution The Twitter sentiment analysis use-case, prototyped as part of the ARIES project, has been used by Titan IC to showcase their range of products to international customers in the UK, Europe, Asia and the US. The ARIES prototype has been further improved to develop Snort offload acceleration using the same principle of a large keyword-based database stored within the FPGA embedded memory. Funding provided by Titan IC has enabled us to extend the work and develop clear concepts for pattern matching offload acceleration, which underpin Titan IC's current products.
Collaborator Contribution Titan IC has provided in-kind contributions in the form of software, FPGA cards and customisation (engineering resources) of the hardware based on the use-case investigated as part of the ARIES project. Later, Titan IC provided cash funding to support a Research Fellow at QUB for 2 years. The extended research work (sponsored Research Fellow) enabled us to investigate the principles of pattern matching offload acceleration, targeting network Intrusion Detection Systems, in particular Snort. The research outcome of this extended project with Titan IC underpins many of Titan IC's products.
Impact The collaboration resulted in a number of outcomes: 1. Proof-of-concept demonstrators of Titan IC RXP Intellectual Property (IP) for academic and industrial partners. 2. Twitter sentiment analysis use-case / showcase prototype. 3. Investigation and development of a framework enabling efficient pattern matching offload acceleration based on the Intel DPDK framework. 4. Prototype Snort Network Intrusion Prevention System (N-IPS) fast-pattern offload acceleration.
Start Year 2013
 
Title Hardware Accelerated Content Analysis Platform for Sentiment Analysis 
Description The ARIES project explored how hardware acceleration can be used to scale big data analysis throughput, in order to significantly reduce the analysis time of large unstructured datasets and achieve real-time data analysis capability for potential threat detection and situation awareness. Sentiment analysis is very popular and widely used. It involves searching a portion of unstructured text for key patterns recognised to indicate sentiment. Hardware acceleration of sentiment analysis is of significant benefit for increasing the number of key patterns and the data size that can be processed within a given time frame. Analysing a large number of "tweets" is a key component of sentiment analysis, so hardware-accelerated tweet analysis was chosen as the use-case to demonstrate the research output of the ARIES project. In order to generate a meaningful comparison, two implementations were produced. The first was a full implementation in Java; this software-only implementation ran on the host CPU. The second implementation invoked the hardware accelerators to perform the regular expression matching. A substantial dataset of Twitter content (878,773 tweets) was captured and a test set of 11 company names was selected at random. The software-only implementation performed the task in 463 seconds, while the accelerated implementation took approximately 79 seconds in total. However, it should be noted that the accelerated figure is almost entirely due to the overhead of loading the dataset from the host file system into memory (78.94 s). The actual execution time required on the hardware was only 0.07 s, an acceleration of over 6000x. Furthermore, the ARIES architecture has been extended towards end users, and a sample application has been developed to perform rapid term association. This allows users to load their own dataset for analysis and select a set of seed terms of interest. Moreover, the application can be adapted to acquire the test data stream from a network input instead of a static archive if necessary.
Type Of Technology Software 
Year Produced 2013 
Impact The notable impact has been the validation that complex content analytics can be executed using hardware accelerated content analysis technology. The basic principles have been further developed and commercialised by Titan IC Systems Limited.
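
Illustrative sketch (not part of the project record): the benchmark described above can be pictured with a small software-only baseline and the speed-up arithmetic that follows from the reported timings. This is a minimal sketch in Python, not the project's Java implementation; the function names, the whole-word case-insensitive matching rule and the sample company names are all assumptions made for illustration. Only the three timing constants come from the description.

```python
import re
from collections import Counter


def count_mentions(tweets, names):
    """Count how many tweets mention each name (hypothetical helper).

    Assumption: a mention is a case-insensitive, whole-word match.
    """
    # A single alternation pattern scans each tweet once for all names.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    canonical = {n.lower(): n for n in names}
    counts = Counter()
    for tweet in tweets:
        # The set makes a tweet count once per name, however often it repeats.
        for hit in {m.group(1).lower() for m in pattern.finditer(tweet)}:
            counts[canonical[hit]] += 1
    return counts


def speedup(baseline_s, accelerated_s):
    """Ratio of baseline time to accelerated time."""
    return baseline_s / accelerated_s


# Timings reported in the description (seconds).
SOFTWARE_TOTAL = 463.0   # software-only run on the host CPU
LOAD_OVERHEAD = 78.94    # loading the dataset from the file system
HARDWARE_KERNEL = 0.07   # execution time on the accelerator

# End-to-end comparison, including the load overhead.
end_to_end = speedup(SOFTWARE_TOTAL, LOAD_OVERHEAD + HARDWARE_KERNEL)
# Content-processing comparison, as quoted in the description.
kernel_only = speedup(SOFTWARE_TOTAL, HARDWARE_KERNEL)

if __name__ == "__main__":
    sample = ["Acme rocks, acme forever", "I like Globex and ACME", "nothing here"]
    print(count_mentions(sample, ["Acme", "Globex"]))
    print(f"end-to-end: {end_to_end:.2f}x, kernel only: {kernel_only:.0f}x")
```

On the reported figures, the end-to-end ratio comes out at just under 6x while the content-processing ratio exceeds 6000x, which is why the description singles out the data-loading overhead as the dominant cost of the accelerated run.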