Next generation pattern matching

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

This fellowship proposal is for a programme of fundamental research to develop the knowledge required for the next generation of pattern matching algorithms. The aim is to make significant
advances in both the theory and practice of searching, manipulating and processing massive datasets of the sorts that are now common in high technology industries and to make state of the art tools available to the wider community.

This proposal will not only develop new algorithms but will also show time and space lower bounds. Where ever faster and more efficient algorithms are developed by the algorithms community, our understanding of the limits of what can be achieved is currently at a much more basic level. Such lower bounds, were they available, would not only be important for the significant theoretical interest and insight they provide, but also for the practical purpose of preventing fruitless search for algorithms which cannot exist. Within computer science in general, lower bounds have historically proven hard to develop (consider for example the famous question of P vs NP) but in the context of streaming and dynamic computation new ideas can now be applied for the first time.

In order to give a truly complete picture, we will also provide state of the art implementations of both new and some old ideas to test assumptions that have been made and to discover new problems which need to be tackled by the whole community. Implementing pattern matching algorithms, especially those proven to be theoretically efficient, can often be a challenging task requiring both considerable algorithmic and engineering expertise. To ensure maximum impact for the ideas developed we will produce freely available, publicly accessible, state of the art implementations of the newly developed pattern matching algorithms as well as others whose practical performance is still unknown.

Planned Impact

Algorithmic research has revolutionised the way that we all live our lives in recent years. Google, a company founded by computer scientists and underpinned with perhaps the greatest concentration of university trained algorithms expertise in the world, is but the most prominent example. This Fellowship, has as its core aim to form a new centre for algorithms research in the UK and through it develop a complete picture of the theory and practice of pattern matching. This work will potentially impact a number of groups within academia and industry but also of great importance will be much needed UK based training and human capital that the new research team will help to provide. We now briefly describe the main groups outside of the direct field of academic study which will be impacted by our work.

Time sensitive data intensive applications are now central to the most successful international companies. AT&T are said to process several terabytes per day of streaming data in real-time and employ dozens of full time algorithms research staff. Intelligence and security agencies require streaming video, audio and telecommunications to be analysed within strict time frames. In high frequency trading, the ability to respond within a fraction of second to a new pattern detected in trading activity can be the difference between financial success and bankruptcy. We expect the advances we will make in dynamic and streaming pattern matching in particular to be of direct interest to these industries and others.

With respect to lower bounds, the results we will provide, while not directly giving new methods, will not only be important for the significant theoretical interest and insight they provide, but also for the practical purposes of preventing fruitless search for algorithms which cannot exist as well as as guiding future research to devise new problem formulations which circumvent these lower bounds. In a similar fashion to the way that no respectable software company will attempt to write code that proves their software terminates (due to the undecidability Halting Problem), our space and time lower bounds will set the framework within which software development can occur. These proven limits may therefore be of as much practical utility as the provision of a new fast algorithm.

We will also, for the first time for research of this type, directly consider the very important and often overlooked interplay between theory and practice. Real world problems have triggered the invention of new computational models and will continue to do so. The resulting theoretical problems are also some of the most pressing and exciting in theoretical computer science. To complete this circle, we will provide state of the art implementations of both new and some old ideas to test assumptions that have been made and to discover new problems which need to be tackled by the whole community. This is particularly motivated by the observation that implementation of pattern matching algorithms, especially those proven to be theoretically efficient, can be a challenging task requiring both considerable algorithmic and engineering expertise. To ensure maximum impact for the ideas developed we will produce freely available, publicly accessible, state of the art implementations of the newly developed pattern matching algorithms as well as others whose practical performance is still unknown.

It is widely recognised that the UK has been relatively weak in the mainstream of algorithms research that dominates top level US universities and global industry. Further investment in UK based algorithms research and training is vitally important for the national economy. This Fellowship, through the training and dissemination mechanisms set out and the development of future UK based research leaders will go some way to redressing this imbalance.
 
Description We have made significant progress in our understanding of the true complexity of a number of basic pattern matching problems. As part of the work, we have also significantly extended the machinery available for pattern matching, opening up a range of new questions which would not have been possible to approach before.
Exploitation Route We are exploring first how best our new methods can be applied in practice. In terms of future research an exciting question is how the new ideas we have introduced can let us build a complete picture of the complexity of pattern matching offline, in a stream and in dynamically changing settings.
Sectors Digital/Communication/Information Technologies (including Software)