Pattern matching algorithms for massive datasets

Lead Research Organisation: University of Bristol

Department Name: Computer Science

Abstract

This project aims to provide the tools necessary for pattern matchingin massive datasets in the 21st century. Pattern matching problemsare pervasive and it is therefore hard to overstate theirimportance. The hugely successful new field of high throughputcomputational genetics, which is the lifeblood of pharmaceuticalindustries, is founded on the ability to perform approximate stringmatching accurately and quickly. Perhaps more mundanely but no lesssignificant economically, linear time exact matching algorithms arenow taken for granted as basic tools in every text editor and wordprocessor used today.Despite the success that pattern matching algorithms continue toenjoy, new problems in urgent need of a solution arise continually.These revolve around data processing applications where the datasetsare massive, subject to error or ambiguity and where processing isrequired online or in real-time. For example, the problem of exactmatching has well known optimal solutions both for online search andwhen the data to be queried can be indexed beforehand. However,unlike exact matching, the problem of finding the fastest algorithmsfor approximate matching has still not been resolved under almost anymeasure of similarity. Where the data is of a non-standard form, forexample consisting of numerical rather than symbolic information, evenless is currently known about how to search or index the informationefficiently.Another vital difference between the old and new settings is notsimply in the quantity of data available but also the ways in which ithas to be processed. The public genome sequencing projects, forexample, have produced 100s of gigabytes of sequence and related metadata. However these datasets are relatively straightforward to handlecompared to the processing of information passing through Internetrouters and over telephone wires every day or stored in the World WideWeb. In this situation it is not sufficient simply that an algorithmsruns fast. Ideally it should also require considerably less space thanthe input, update at least as quickly as the new data are arriving andstill be able to perform complex queries on the whole dataset.This project will directly address these two separate but interrelatedchallenges. Real-time and online matching algorithms will be developedto handle situations where vast amounts of data are streaming past atvery high rates. The project will also consider new forms ofapproximation and present fast algorithmic solutions that will allowdatasets that result from modern applications and industries to besearched for approximate matches without the need to rely onheuristics. Finally as part of the work on improved methods forapproximate matching, this project will develop faster and smallerindexes that will for the first time allow approximate matching ontruly massive datasets to become feasible in practice.

Funded Value:

£278,544

Funded Period:

Feb 08 - Aug 11

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/F02682X/1

Principal Investigator:

R Clifford

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Fundamentals of Computing (100%)

Organisations

University of Bristol (Lead Research Organisation)

People	ORCID iD
R Clifford (Principal Investigator)

Publications

Author Name

Title Publication Date Published

10 25 50

Clifford R (2011) Automata, Languages and Programming

Clifford R (2010) Pattern matching with don't cares and few errors in Journal of Computer and System Sciences

Clifford R (2011) Pattern matching in pseudo real-time in Journal of Discrete Algorithms

Clifford R (2011) A black box for online approximate pattern matching in Information and Computation

Raphael Clifford (2009) From Coding Theory to Efficient Pattern Matching

Further Funding


Description	EPSRC
Amount	£17,453 (GBP)
Funding ID	ESPRC "Developing Leaders" Funding
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start


Description	London Mathematical Society
Amount	£500 (GBP)
Funding ID	SC7-1112-01: Scheme 7 grant Award
Organisation	London Mathematical Society
Sector	Academic/University
Country	United Kingdom
Start


Description	London Mathematical Society
Amount	£500 (GBP)
Funding ID	SC7-1112-01: Scheme 7 grant Award
Organisation	London Mathematical Society
Sector	Academic/University
Country	United Kingdom
Start

Abstract

Organisations

People

ORCID iD

Publications