Pattern matching algorithms for streaming data

Lead Research Organisation: University of Warwick
Department Name: Computer Science

Abstract

Imagine that I give you the following task: read the Complete Works of Shakespeare and write down all occurrences of the phrase my good lord . The task is known in Computer Science as exact pattern matching; here the phrase my good lord is the pattern. If I asked you to find all phrases similar to the phrase my good lord , you may decide write down the phrases my noble lord , my gracious lord and simply my lord . This is approximate pattern matching, a problem whose complexity is, of course, dependent on how we define the word similar. The definition considered depends on the application and much of the breadth and depth of the field arises from this.Now imagine that I am going to read the Complete Works of Shakespeare to you and expect you to write down similar phrases as you hear them. This is online approximate pattern matching and is the focus of this proposal. The proposal is applicable to Internet related applications where a vast quantity of data passes though a computer constantly - a field known as data streaming. Here the data is far too large to be stored and results must be computed on the fly as the data arrives. In the reading analogy, if you mishear a paragraph, I'm not going to reread it to you.The aim of this proposal is to bring these fields together to search for patterns quickly in streaming data. Continuing the analogy, we will be considering finding patterns in a number of circumstances:1. As before I am going to read you a book but this time much faster. I know that you can't write down all the occurrences fast enough but I want you to guarantee you will catch most of them.2. Many people will read books out loud to you at the same time. Any time any of them say the pattern you are looking, for you have to write it down.3. I am going to read you a book but I make no promise to read the words in order: page 6 line 3 word 6 is good , page 39 line 1 word 2 is happy , page 6 line 3 word 5 is my ...Of course, these problems sound strange and counter-intuitive phrased in plain English, but the underlying Computer Science problems are highly significant for many emerging applications such as traffic shaping, firewalls, Internet monitoring and malicious content detection.

Planned Impact

The central goal of this proposal is to develop algorithms which search for patterns in streaming data. The work is particularly suited to searching for patterns in high volume data streams in applications where space and computation time are limited. Key areas of impact are found in the internet infrastructure, internet traffic shaping, content distribution and security industries. In addition, there is a potential for impact on government intelligence agencies. In the internet infrastructure industry, a group of potential commercial benefactors are embedded router manufacturers. Work package three, which concerns randomised pattern matching, has particular impact for this group. The impact of this may be the development of more sophisticated methods for internet traffic monitoring. Adoption of such methods would likely be over a time period of several years given the time scales over which changes in hardware designs occur. Internet traffic shaping is an important issue for internet service providers (ISPs). The research into approximate pattern matching in multiple streams in work package one could increase the competitivity of developers of internet traffic shaping solutions and indirectly improve the network management ability of ISPs. Traffic shaping allows ISPs to provide improved quality of service which would also have a positive impact on the quality of life of their users. The work on approximate pattern matching in unordered streams contained in work package two would have impact on content distribution via peer to peer (P2P) networks. Main benefactors of this work likely include P2P client developers in the commercial and open-source world. These uses of the work also have the potential to encourage further adoption of P2P for commercial content distribution. The work also has potential for impact in the development of undedicated firewalls. A modern desktop machine likely has a software firewall which must use minimal system resources so not to impact on the productivity of the user. The solutions proposed here could lead to improved competitivity for companies developing firewalls and has indirect impact on the quality of life of internet users. In intelligence agencies and police internet crime divisions, the high speed monitoring of streaming data could be an important tool for the detection of illegal activities. Efficient low space pattern matching would dramatically increase the UK's capability in this area. The impact of the theoretical work in this proposal could be significant for practical solutions to such problems. The work has many potential direct benefactors, mainly within the commercial ICT industry. I propose a number of measures to ensure that they have the opportunity to benefit: All results and techniques developed through the work will be made available through international recognised peer reviewed international conferences and journals. To ensure that publications are available in commercial contexts, 'web' versions of the papers will be made openly available via the internet. Direct contact with benefactors is also important. Some industrial presence at academic conferences in the field is not uncommon and effort will be made to discuss applications with them at such events. However, this measure would only include a small fraction of potential benefactors. I will also endeavour to contact suitable organisations to increase their awareness of the work and, where appropriate, form industrial links. It is important for the adoption of theoretical research to show practical relevance though successful empirical studies. There is potential for the supervision of empirical studies performed as masters student projects at the host organisation. It would be my intention that such studies would be performed in contact with relevant potential benefactors.

Publications

10 25 50
 
Description Due to the thoeretical nature of the output of this grant, it is too early in the research life-cycle to report on wider (non-academic) impact of this grant. This will be updated as wider impact materialises.
 
Description Bar-Ilan 
Organisation Bar-Ilan University
Country Israel 
Sector Academic/University 
PI Contribution I contributed expertise in pattern matching algorithms, randomised algorithms, streaming algorithms and lower bound techniques
Collaborator Contribution My partners contributed expertise in pattern matching algorithms, randomised algorithms and streaming algorithms
Impact Several publications in good peer-reviewed international conferences including: Pattern Matching under Polynomial Transformation Ayelet Butman, Peter Clifford, Raphaël Clifford, Markus Jalsenius, Noa Lewenstein, Benny Porat, Ely Porat and Benjamin Sach In: SIAM Journal on Computing (Vol. 42, Issue 2) Parameterized Matching in the Streaming Model Markus Jalsenius, Benny Porat and Benjamin Sach In: STACS 2013
Start Year 2011
 
Description University of Bristol 
Organisation University of Bristol
Country United Kingdom 
Sector Academic/University 
PI Contribution I contributed expertise in pattern matching, streaming algorithms, randomised algorithms and lower bound techniques
Collaborator Contribution My partners contributed expertise in pattern matching, streaming algorithms, randomised algorithms and lower bound techniques
Impact This collaboration has resulted in publications in top peer-reviewed international conferences including: Cell-Probe Bounds for Online Edit Distance and Other Pattern Matching Problems Raphaël Clifford, Markus Jalsenius, and Benjamin Sach In: SODA 2015 (to appear) Tight Cell-Probe Bounds for Online Hamming Distance Computation Raphaël Clifford, Markus Jalsenius, and Benjamin Sach In: SODA 2013 Pattern Matching under Polynomial Transformation Ayelet Butman, Peter Clifford, Raphaël Clifford, Markus Jalsenius, Noa Lewenstein, Benny Porat, Ely Porat and Benjamin Sach In: SIAM Journal on Computing (Vol. 42, Issue 2)
Start Year 2009