Pattern matching algorithms for streaming data

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

Imagine that I give you the following task: read the Complete Works of Shakespeare and write down all occurrences of the phrase my good lord . The task is known in Computer Science as exact pattern matching; here the phrase my good lord is the pattern. If I asked you to find all phrases similar to the phrase my good lord , you may decide write down the phrases my noble lord , my gracious lord and simply my lord . This is approximate pattern matching, a problem whose complexity is, of course, dependent on how we define the word similar. The definition considered depends on the application and much of the breadth and depth of the field arises from this.Now imagine that I am going to read the Complete Works of Shakespeare to you and expect you to write down similar phrases as you hear them. This is online approximate pattern matching and is the focus of this proposal. The proposal is applicable to Internet related applications where a vast quantity of data passes though a computer constantly - a field known as data streaming. Here the data is far too large to be stored and results must be computed on the fly as the data arrives. In the reading analogy, if you mishear a paragraph, I'm not going to reread it to you.The aim of this proposal is to bring these fields together to search for patterns quickly in streaming data. Continuing the analogy, we will be considering finding patterns in a number of circumstances:1. As before I am going to read you a book but this time much faster. I know that you can't write down all the occurrences fast enough but I want you to guarantee you will catch most of them.2. Many people will read books out loud to you at the same time. Any time any of them say the pattern you are looking, for you have to write it down.3. I am going to read you a book but I make no promise to read the words in order: page 6 line 3 word 6 is good , page 39 line 1 word 2 is happy , page 6 line 3 word 5 is my ...Of course, these problems sound strange and counter-intuitive phrased in plain English, but the underlying Computer Science problems are highly significant for many emerging applications such as traffic shaping, firewalls, Internet monitoring and malicious content detection.

Planned Impact

The central goal of this proposal is to develop algorithms which search for patterns in streaming data. The work is particularly suited to searching for patterns in high volume data streams in applications where space and computation time are limited. Key areas of impact are found in the internet infrastructure, internet traffic shaping, content distribution and security industries. In addition, there is a potential for impact on government intelligence agencies. In the internet infrastructure industry, a group of potential commercial benefactors are embedded router manufacturers. Work package three, which concerns randomised pattern matching, has particular impact for this group. The impact of this may be the development of more sophisticated methods for internet traffic monitoring. Adoption of such methods would likely be over a time period of several years given the time scales over which changes in hardware designs occur. Internet traffic shaping is an important issue for internet service providers (ISPs). The research into approximate pattern matching in multiple streams in work package one could increase the competitivity of developers of internet traffic shaping solutions and indirectly improve the network management ability of ISPs. Traffic shaping allows ISPs to provide improved quality of service which would also have a positive impact on the quality of life of their users. The work on approximate pattern matching in unordered streams contained in work package two would have impact on content distribution via peer to peer (P2P) networks. Main benefactors of this work likely include P2P client developers in the commercial and open-source world. These uses of the work also have the potential to encourage further adoption of P2P for commercial content distribution. The work also has potential for impact in the development of undedicated firewalls. A modern desktop machine likely has a software firewall which must use minimal system resources so not to impact on the productivity of the user. The solutions proposed here could lead to improved competitivity for companies developing firewalls and has indirect impact on the quality of life of internet users. In intelligence agencies and police internet crime divisions, the high speed monitoring of streaming data could be an important tool for the detection of illegal activities. Efficient low space pattern matching would dramatically increase the UK's capability in this area. The impact of the theoretical work in this proposal could be significant for practical solutions to such problems. The work has many potential direct benefactors, mainly within the commercial ICT industry. I propose a number of measures to ensure that they have the opportunity to benefit: All results and techniques developed through the work will be made available through international recognised peer reviewed international conferences and journals. To ensure that publications are available in commercial contexts, 'web' versions of the papers will be made openly available via the internet. Direct contact with benefactors is also important. Some industrial presence at academic conferences in the field is not uncommon and effort will be made to discuss applications with them at such events. However, this measure would only include a small fraction of potential benefactors. I will also endeavour to contact suitable organisations to increase their awareness of the work and, where appropriate, form industrial links. It is important for the adoption of theoretical research to show practical relevance though successful empirical studies. There is potential for the supervision of empirical studies performed as masters student projects at the host organisation. It would be my intention that such studies would be performed in contact with relevant potential benefactors.

Publications

10 25 50
 
Description This is the same award as EP/H028056/1. Details are found there.
Exploitation Route This is the same award as EP/H028056/1. Details are found there.
Sectors Digital/Communication/Information Technologies (including Software),Security and Diplomacy,Other

 
Description This is the same award as EP/H028056/1. Details are found there.
First Year Of Impact 2014
 
Description DTU Informatics 
Organisation Technical University of Denmark
Country Denmark 
Sector Academic/University 
PI Contribution I contributed expertise in pattern matching techniques, streaming algorithms, sub-linear methods and randomised algorithms.
Collaborator Contribution My partners contributed expertise in pattern matching techniques, compression methods and application domains.
Impact Several publications in good peer-reviewed international conferences including: Time-Space Trade-Offs for Longest Common Extensions Philip Bille, Inge Li Gørtz, Benjamin Sach, and Hjalte Wedel Vildhøj. In: Journal of Discrete Algorithms (Special issue for CPM 2012) Sparse Suffix Tree Construction in Small Space Philip Bille, Johannes Fischer, Inge Li Gørtz, Tsvi Kopelowitz, Benjamin Sach, Hjalte Wedel Vildhøj. In: ICALP 2013 Fingerprints in Compressed Strings Philip Bille, Patrick Hagge Cording, Inge Li Gørtz, Benjamin Sach, Hjalte Wedel Vildhøj and Søren Vind. In: WADS 2013
Start Year 2012