Approximate Matching of Sequences.

Lead Research Organisation: King's College London
Department Name: Computer Science

Abstract

One of the most ancient, fundamental and still active research field in theoretical computer science concerns Pattern Matching and its many approaches. The classic pattern matching problem consists in recognizing a pattern P as "matching" or "not matching" within a given text T and of listing all the occurrence positions, if any. It is
straightforward that the text may be whatever kind of input data and the pattern is usually a smaller piece of the same kind of data. One instance of the problem is to suppose that the text is provided before the pattern.
Then the text can be preprocessed to build an index data structure that accelerates dramatically further searches.
There exist several such data structures. This branch of pattern matching is usually called Text Indexing.
Its natural application scenario is when the text does not change or changes slowly with respect to the great amount of pattern search queries, e.g. a sequenced genome, a digital library or a biometric database.
The most important variant of Text Indexing concerns "approximate matching" of the pattern with text segments.
It is called Approximate Pattern Matching. Approximate means that an occurrence of the pattern within the text is considered a match even if it presents some errors, but if the distance (resp. similarity) between the pattern and the text segment is under (resp. over) a certain threshold.

Approximate variants of text indexing fit well with scenarios where a mistyping errors may occurs, the data sampling process is prone to introduce errors, the data source is not constant over the time as for biometric data, or when we are just looking for data similarity, e.g. for genes mutations in DNA sequences.

Pattern matching plays a fundamental role in many research and application fields, like information retrieval, data mining, bioinformatics, molecular biology functional and structural analysis, molecular pharmacology, computational musicology, network security, biometric identification, and many others. Among these, bioinformatics and computational molecular biology are the domains that mostly and directly benefit from any enhancement of pattern matching theoretical knowledge and solutions.

While straight text indexing counts many efficient solutions, its approximate versions have not found optimal solutions yet. Several solutions exploit the property that an approximate match is usually composed of some shorter and closer
matches, i.e. exact matches or approximate matches with a smaller number of errors. This approach does not provide any breakthrough towards an optimal solution.

Our approach is to design data structures and their related algorithms to get an efficient solution to the problem. We have explored and experimented several solutions that provided remarkable potentialities. They need to be developed further and tailored to some applications. Implementation for external memory and for distributed calculus shall be
investigated. As a by-product some theoretical results related to the distance between sequences should be achieved. The complexity of many classical result will be reviewed to highlight their dependence from the maximal-repetition-with-error parameter. All this is expected to provide a substantial step forward in the pattern matching research field.

The technological breakthrough proposed in this research allows to dramatically decrease the computational cost of detecting local similarities within a large family of biological sequences and of performing some approximate pattern matching related tasks. Furthermore, our designed algorithms will extend the range of applications of approximate pattern matching based solutions as they will allow to accomplish more complex investigations on sequences with the same computational resource either at the level of personal computer scale solutions, and of big research centres.

Planned Impact

Approximate pattern matching is involved in a lot of research fields as well as industrial applications. Pattern matching based solutions are widely used in a direct fraction and as a component of more complex solutions. For instance, in gene analysis, either the functional and the structural ones, the main investigation step concerns collecting information about sequences by looking for similar occurrences inside already known organisms since similar sequences frequently present similar functions and similar structures. With respect to a set of sequences, like among different species, the presence of similar sequences is likely to mean that those
descend from a common ancestor. This is the common way to build phylogenetic trees.

A common approximate pattern matching approach exploit the property that an approximate match is usually composed of some shorter and closer simpler matches, i.e. exact matches or approximate matches with a smaller number of errors. For instance, the pattern 'aaabccc' matches the 'abaaccc' with two errors and it comprises a match with one error of the initial part of the pattern 'aaa' with 'aba' in the text, closely followed by the exact match of the final part of the pattern 'ccc'. This approach is used to over-pass the current limits of theoretical solutions, but it allows to move just a little step forward, which is dearly paid in complexity and computational effort, usually accomplished via parallel computing or grid computing. A kind of technology and
resources accessible to just few big farms or big research centres, that, moreover, are still finite and that already have almost reached the actual technological and physical computational bounds. Theoretical advances represents the main source of enhancements in the short term.

Our outcomes, namely the set of algorithms and data structures for approximate indexing and for some related problems, will provide efficient solutions or, at least, an enhancement with respect to the actual state of art about these problems both in term of time and space computational resources.
Our algorithms application is two-fold. It will be candidate either to replace many of the previous
pattern matching algorithms in current solutions in order of either save resources or to accomplish more complex tasks with the same resources. It will also allow to develop a new generation of tools that previously were practically infeasible due to computational limits. For instance, the possibility to keep the approximate index of a whole genome in the memory of a personal computer represents an important technological breakthrough. This, together with the forthcoming availability of next-next-generation sequencing systems will provide a powerful investigation tool to researchers in medical genetics and genetic pharmacology. They will better and faster find the genetic variations responsible for some diseases and are likely to develop subject-specific therapy and medicines. Let us notice that this approach does not fit with centralized computational models that big research centres provide online as a service, because this approach requires to work each time on a different set of data. Furthermore, online services never appeared for many different disciplines, so our solutions may be their first chance to combinatorially analyse their data.

Our research is expected to be relevant also to the software industry and others concerned with the efficient implementation of search engines, information gathering and indexing services on the Web. Also, our indexing schemes would be useful in searching music libraries as well as searching molecular sequences in DNA/RNA and relevant databases.

Publications

10 25 50
publication icon
Al-Hafeedh A (2012) A comparison of index-based lempel-Ziv LZ77 factorization algorithms in ACM Computing Surveys

publication icon
Amit M (2017) Locating maximal approximate runs in a string in Theoretical Computer Science

publication icon
Apostolico A (2016) 40 years of suffix trees in Communications of the ACM

publication icon
Crochemore M (2013) Computing the Longest Previous Factor in European Journal of Combinatorics

publication icon
Crochemore M (2013) A note on efficient computation of all Abelian periods in a string in Information Processing Letters

publication icon
Crochemore M (2015) A note on the longest common compatible prefix problem for partial words in Journal of Discrete Algorithms

publication icon
Crochemore M (2014) Indexing a sequence for mapping reads with a single mismatch. in Philosophical transactions. Series A, Mathematical, physical, and engineering sciences

publication icon
CROCHEMORE M (2015) The longest common substring problem in Mathematical Structures in Computer Science

 
Description The research provided solutions for approximate indexing whose goal is to pre-compute full-text indexes accepting efficient approximate queries. We provide algorithms that builds the index data structure in time almost linear in the size of the sequence and answer subsequent approximate queries in almost linear time in the size of the query.
Exploitation Route The solution could be implemented to be given to potential users in Bioinformatics.
Sectors Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology

URL https://www.cambridge.org/core/journals/mathematical-structures-in-computer-science/article/div-classtitlethe-longest-common-substring-problemdiv/1DA955004B11A551E2BD130AA834CF41
 
Description Findings for this project have produced a significant progress on the design of data structures for efficient text indexing. This type of structure provides basic tools in the area of text mining and information retrieval.
First Year Of Impact 2014
Sector Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology
Impact Types Cultural

 
Description Approximate matching of sequences 
Organisation University of Warsaw
Country Poland 
Sector Academic/University 
PI Contribution We investigate the approximate matching of sequences problem and we are working to devise a new data structure to solve efficiently this problem for a bounded number of errors in both the text and the pattern. The research group of Prof. Wojciech Rytter is highly active on the domain of data structures and algorithms dealing with textual data. They advised on the design of indexes appropriate to our research project.
Start Year 2012