
3.4: Linear-time exact string matching


    While we have looked at various forms of alignment and the algorithms used to find them, these algorithms are not fast enough for some purposes. For instance, we may have a 100-nucleotide sequence that we want to search for in a whole genome, which may be over a billion nucleotides long. In this case, we want an algorithm whose run time depends on the length of the query sequence, possibly after some pre-processing on the database, because processing the entire genome for every query would be extremely slow. For such problems we enter the realm of randomized algorithms, where instead of worrying about worst-case performance we are more interested in making sure that the algorithm is linear in the expected case. When looking for exact (consecutive) matches of a sequence, the Karp-Rabin algorithm interprets such a match numerically. There are many other solutions to this problem, some of which guarantee linear time even in the worst case: the Z-algorithm, Boyer-Moore, the Knuth-Morris-Pratt algorithm, algorithms based on suffix trees or suffix arrays, etc. (discussed in the “Lecture 3 addendum” slides).

    Karp-Rabin Algorithm

    This algorithm tries to match a particular pattern to a string, which is the basic principle of database search. The problem is as follows: in a text T of length n we are looking for a pattern P of length m. Strings are mapped to numbers to enable fast comparison. A naive version of the algorithm involves mapping the pattern P and the m-length substrings of T into numbers x and \( y_i \), respectively, and sliding x along T, comparing at every offset until the numbers match.

    Figure 3.10: Naive Karp-Rabin algorithm
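
    To make this concrete, here is a minimal Python sketch of the naive version over a digit alphabet. The function name is illustrative, and the sample text is an assumption chosen to be consistent with the running example (whose first window is 23590):

    ```python
    def naive_karp_rabin(text, pattern):
        """Naive version: interpret the pattern and every m-length window
        of the text as base-10 numbers and compare them at each offset."""
        n, m = len(text), len(pattern)
        x = int(pattern)                  # the pattern as a number
        hits = []
        for i in range(n - m + 1):
            y_i = int(text[i:i + m])      # recomputed from scratch: O(m) per offset
            if y_i == x:
                hits.append(i)
        return hits

    print(naive_karp_rabin("2359023141526739921", "31415"))  # [6]
    ```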

    However, one can see that the algorithm, as stated, is in fact non-linear for two reasons:

    1. Computing each \( y_i \) takes more than constant time (it is in fact linear in m if we naively compute each number from scratch for every offset)
    2. Comparing x and \( y_i \) can be expensive if the numbers are very large, which happens when the pattern to be matched is very long

    To make the algorithm faster, we first modify the procedure so that \( y_i \) is calculated in constant time from the previously computed number \( y_{i-1} \). We can do this with a few arithmetic operations: a subtraction to remove the high-order digit, a multiplication to shift the remaining digits left, and an addition to append the new low-order digit. For example, in Figure 3.10, we can compute \( y_2 \) from \( y_1 \) by

    • removing the high-order digit: 23590 mod 10000 = 3590
    • shifting left: 3590 ∗ 10 = 35900
    • adding the new low-order digit: 35900 + 2 = 35902
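
    In Python, one such rolling update looks like the following sketch (digit alphabet; the function and parameter names are illustrative):

    ```python
    def next_window_value(y_prev, old_digit, new_digit, m):
        """Compute the next window's value from the previous one in O(1):
        drop the old high-order digit, shift left, append the new digit."""
        y = y_prev - int(old_digit) * 10 ** (m - 1)  # remove the high-order digit
        y = y * 10                                   # shift the remaining digits left
        return y + int(new_digit)                    # append the new low-order digit

    # The example above: from y_1 = 23590 to y_2 = 35902
    print(next_window_value(23590, "2", "2", 5))  # 35902
    ```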

    Our next issue arises when the sequences to compare are very long: the numbers involved become very large, and arithmetic on them is no longer constant time. To keep the numbers small enough for efficient comparison, we do all our computations modulo p (a form of hashing), where p reflects the word length available to us for storing numbers but is small enough that the comparison between x and \( y_i \) can be done in constant time.

    Hashing: using a function to map data values to a data set of fixed size.
    Because we are using hashing, mapping into the space of numbers modulo p can produce spurious hits due to hash collisions, so we modify the algorithm to deal with such spurious hits by explicitly verifying every reported hit of the hash values. The final version of the Karp-Rabin algorithm is shown in Figure 3.11.

    Figure 3.11: Final Karp-Rabin algorithm
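
    A minimal Python sketch of this final version follows; it is not a transcription of the figure, and the base, the prime p, and all names are illustrative choices rather than fixed parts of the algorithm. Each window whose hash matches the pattern's hash is verified character by character, so spurious hits are filtered out:

    ```python
    def karp_rabin(text, pattern, base=256, p=8191):
        """Karp-Rabin matching: rolling hash modulo a prime p, with
        explicit verification of every hash hit to discard spurious ones."""
        n, m = len(text), len(pattern)
        if m > n:
            return []
        msd = pow(base, m - 1, p)   # weight of the high-order position, mod p
        x = y = 0
        for j in range(m):          # hash the pattern and the first window
            x = (x * base + ord(pattern[j])) % p
            y = (y * base + ord(text[j])) % p
        hits = []
        for i in range(n - m + 1):
            if x == y and text[i:i + m] == pattern:  # verify before reporting
                hits.append(i)
            if i < n - m:           # roll the window one character to the right
                y = ((y - ord(text[i]) * msd) * base + ord(text[i + m])) % p
        return hits

    print(karp_rabin("abracadabra", "abra"))  # [0, 7]
    ```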

    To compute the expected runtime of Karp-Rabin, we must factor in the expected cost of verification. If we can show that the probability of spurious hits is small, the expected runtime is linear.

    Questions:

    Q: What if there are more than 10 characters in the alphabet?

    A: In such a case, we can modify the above algorithm to use more digits, i.e., by working in a base other than 10, say base 256. But in general, when hashing is used, strings are mapped into a space of numbers and are thus interpreted numerically.

    Q: How do we apply this to text?


    A: A hash function is used that converts the text into numbers that are easier to compare. For example, if the whole alphabet is used, each letter can be assigned a value between 0 and 25, and the string can then be treated as the digits of a number.
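
    For instance, a string over the lowercase alphabet can be read as a number in base 26 (a hypothetical helper for illustration):

    ```python
    def numeric_value(s, base=26):
        """Interpret a lowercase string as a base-26 number (a=0, ..., z=25)."""
        value = 0
        for ch in s:
            value = value * base + (ord(ch) - ord("a"))
        return value

    print(numeric_value("cab"))  # 2*26**2 + 0*26 + 1 = 1353
    ```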

    Q: Why does using modulus decrease the computation time?

    A: The modulus can be applied to each individual part of the computation while preserving the answer. For instance, imagine our current text is "314152" and the window length is 5. After making our first computation on "31415", we move our frame over to make our second computation, which is:
    \[ 14152 \equiv (31415 - 3 \cdot 10000) \cdot 10 + 2 \pmod{13} \nonumber \]
    \[ \equiv (7 - 3 \cdot 3) \cdot 10 + 2 \pmod{13} \nonumber \]
    \[ \equiv 8 \pmod{13} \nonumber \]
    Each such update takes constant time, so the whole scan can now be done in linear time.
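
    The arithmetic above is easy to check in Python (values taken directly from the example):

    ```python
    p = 13
    y1 = 31415 % p                              # 31415 mod 13 == 7
    y2 = ((y1 - 3 * (10000 % p)) * 10 + 2) % p  # rolling update, all mod 13
    print(y1, y2, 14152 % p)                    # 7 8 8 -- the update agrees
    ```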

    Q: Are there provisions in the algorithm for inexact matches?

    A: The above algorithm only works when there are regions of exact similarity between the query sequence and the database. However, the BLAST algorithm, which we look at later, extends the above ideas to include the notion of searching in a biologically meaningful neighborhood of the query sequence to account for some inexact matches. This is done by searching in the database for not just the query sequence, but also some variants of the sequence up to some fixed number of changes.

    In general, in order to reduce the time for operations on arguments like numbers or strings that are very long, it is necessary to reduce the number range to something more manageable. Hashing is a general solution to this: it maps keys k from a large universe \( U \) of strings/numbers to a hash of the key, \( h(k) \), which lies in a smaller range, say \( [1...m] \). There are many hash functions that can be used, all with different theoretical and practical properties. The two key properties that we need for hashing are:

    1. Reproducibility: if \( x = y \), then \( h(x) = h(y) \). This is essential for our mapping to make sense.

    2. Uniform output distribution: regardless of the input distribution, the output distribution is uniform, i.e. if \( x \neq y \), then \( P(h(x) = h(y)) = 1/m \), irrespective of the input distribution. This is a desirable property that reduces the chance of spurious hits.

    An interesting idea that was raised was that locality-sensitive hash functions could be useful for neighborhood searches: points in \( U \) that are close to each other would be mapped to nearby points by the hash function. The notion of random projections, as an extension of the BLAST algorithm, is based on this idea. Also note that a plain modulo does not satisfy property 2 above, because some input distributions (e.g. all multiples of the number with respect to which the modulo is taken) result in many collisions. Nevertheless, choosing a random number as the modulus avoids many such collisions.
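
    The failure of a fixed modulus on adversarial input, and the benefit of choosing the modulus at random, are easy to demonstrate (a small illustrative experiment; the candidate primes are arbitrary):

    ```python
    import random

    p = 13
    keys = [p * k for k in range(1, 6)]       # adversarial input: multiples of p
    print([x % p for x in keys])              # [0, 0, 0, 0, 0] -- every key collides

    q = random.choice([10007, 10009, 10037])  # modulus drawn at random from primes
    print([x % q for x in keys])              # distinct values with high probability
    ```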

    Working with hashing increases the complexity of analyzing the algorithm since now we need to compute the expected run time by including the cost of verification. To show that the expected run time is linear, we need to show that the probability of spurious hits is small.


    This page titled 3.4: Linear-time exact string matching is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Manolis Kellis et al. (MIT OpenCourseWare) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.