Pattern matching algorithms for massive datasets

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

This project aims to provide the tools necessary for pattern matchingin massive datasets in the 21st century. Pattern matching problemsare pervasive and it is therefore hard to overstate theirimportance. The hugely successful new field of high throughputcomputational genetics, which is the lifeblood of pharmaceuticalindustries, is founded on the ability to perform approximate stringmatching accurately and quickly. Perhaps more mundanely but no lesssignificant economically, linear time exact matching algorithms arenow taken for granted as basic tools in every text editor and wordprocessor used today.Despite the success that pattern matching algorithms continue toenjoy, new problems in urgent need of a solution arise continually.These revolve around data processing applications where the datasetsare massive, subject to error or ambiguity and where processing isrequired online or in real-time. For example, the problem of exactmatching has well known optimal solutions both for online search andwhen the data to be queried can be indexed beforehand. However,unlike exact matching, the problem of finding the fastest algorithmsfor approximate matching has still not been resolved under almost anymeasure of similarity. Where the data is of a non-standard form, forexample consisting of numerical rather than symbolic information, evenless is currently known about how to search or index the informationefficiently.Another vital difference between the old and new settings is notsimply in the quantity of data available but also the ways in which ithas to be processed. The public genome sequencing projects, forexample, have produced 100s of gigabytes of sequence and related metadata. However these datasets are relatively straightforward to handlecompared to the processing of information passing through Internetrouters and over telephone wires every day or stored in the World WideWeb. In this situation it is not sufficient simply that an algorithmsruns fast. Ideally it should also require considerably less space thanthe input, update at least as quickly as the new data are arriving andstill be able to perform complex queries on the whole dataset.This project will directly address these two separate but interrelatedchallenges. Real-time and online matching algorithms will be developedto handle situations where vast amounts of data are streaming past atvery high rates. The project will also consider new forms ofapproximation and present fast algorithmic solutions that will allowdatasets that result from modern applications and industries to besearched for approximate matches without the need to rely onheuristics. Finally as part of the work on improved methods forapproximate matching, this project will develop faster and smallerindexes that will for the first time allow approximate matching ontruly massive datasets to become feasible in practice.

Publications

10 25 50
publication icon
Clifford R (2010) Pattern matching with don't cares and few errors in Journal of Computer and System Sciences

publication icon
Clifford R (2011) A black box for online approximate pattern matching in Information and Computation

publication icon
Clifford R (2011) Pattern matching in pseudo real-time in Journal of Discrete Algorithms

 
Description EPSRC
Amount £17,453 (GBP)
Funding ID ESPRC "Developing Leaders" Funding 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start  
 
Description London Mathematical Society
Amount £500 (GBP)
Funding ID SC7-1112-01: Scheme 7 grant Award 
Organisation London Mathematical Society 
Sector Academic/University
Country United Kingdom
Start  
 
Description London Mathematical Society
Amount £500 (GBP)
Funding ID SC7-1112-01: Scheme 7 grant Award 
Organisation London Mathematical Society 
Sector Academic/University
Country United Kingdom
Start