# CHAPTER 34: STRING MATCHING

This chapter is organized as follows. In Section 34.1 we review the naive brute-force algorithm for the string-matching problem, which has worst-case running time O((n - m + 1)m). Section 34.2 presents an interesting string-matching algorithm, due to Rabin and Karp. This algorithm also has worst-case running time O((n - m + 1)m), but it works much better on average and in practice. It also generalizes nicely to other pattern-matching problems. Section 34.3 then describes a string-matching algorithm that begins by constructing a finite automaton specifically designed to search for occurrences of the given pattern P in a text. This algorithm runs in time O(n + m ). The similar but much cleverer Knuth-Morris-Pratt (or KMP) algorithm is presented in Section 34.4; the KMP algorithm runs in time O(n + m). Finally, Section 34.5 describes an algorithm due to Boyer and Moore that is often the best practical choice, although its worst-case running time (like that of the Rabin-Karp algorithm) is no better than that of the naive string-matching algorithm.

#### Figure 34.1 The string-matching problem. The goal is to find all occurrences of the pattern P = abaa in the text T = abcabaabcabac. The pattern occurs only once in the text, at shift s = 3. The shift s = 3 is said to be a valid shift. Each character of the pattern is connected by a vertical line to the matching character in the text, and all matched characters are shown shaded.

Notation and terminology

Proof See Figure 34.2 for a graphical proof.

For brevity of notation, we shall denote the k-character prefix P[1 . . k] of the pattern P[1 . . m] by Pk. Thus, P0 = and Pm = P = P[1 . . m]. Similarly, we denote the k-character prefix of the text T as Tk. Using this notation, we can state the string-matching problem as that of finding all shifts s in the range 0 s n - m such that .

In our pseudocode, we allow two equal-length strings to be compared for equality as a primitive operation. If the strings are compared from left to right and the comparison stops when a mismatch is discovered, we assume that the time taken by such a test is a linear function of the number of matching characters discovered. To be precise, the test "x = y" is assumed to take time (t + 1), where t is the length of the longest string z such that .

# 34.1 The naive string-matching algorithm

`NAIVE-STRING-MATCHER(T, P)`

`1  n  length[T]`

`2  m  length[P]`

`3  for s  0 to n - m`

`4        do if P[1 . . m] = T[s + 1 . . s + m]`

`5              then print "Pattern occurs with shift" s`

The naive string-matching procedure can be interpreted graphically as sliding a "template" containing the pattern over the text, noting for which shifts all of the characters on the template equal the corresponding characters in the text, as illustrated in Figure 34.3. The for loop beginning on line 3 considers each possible shift explicitly. The test on line 4 determines whether the current shift is valid or not; this test involves an implicit loop to check corresponding character positions until all positions match successfully or a mismatch is found. Line 5 prints out each valid shift s.

#### Figure 34.3 The operation of the naive string matcher for the pattern P = aab and the text T = acaabc. We can imagine the pattern P as a "template" that we slide next to the text. Parts (a)-(d) show the four successive alignments tried by the naive string matcher. In each part, vertical lines connect corresponding regions found to match (shown shaded), and a jagged line connects the first mismatched character found, if any. One occurrence of the pattern is found, at shift s = 2, shown in part (c).

Procedure NAIVE-STRING-MATCHER takes time ((n - m + 1)m) in the worst case. For example, consider the text string an (a string of n a's) and the pattern am. For each of the n - m + 1 possible values of the shift s, the implicit loop on line 4 to compare corresponding characters must execute m times to validate the shift. The worst-case running time is thus ((n - m + 1)m), which is (n2) if m = n/2.

As we shall see, NAIVE-STRING-MATCHER is not an optimal procedure for this problem. Indeed, in this chapter we shall show an algorithm with a worst-case running time of O(n + m). The naive string-matcher is inefficient because information gained about the text for one value of s is totally ignored in considering other values of s. Such information can be very valuable, however. For example, if P = aaab and we find that s = 0 is valid, then none of the shifts 1, 2, or 3 are valid, since T[4] = b. In the following sections, we examine several ways to make effective use of this sort of information.

## Exercises

Show the comparisons the naive string matcher makes for the pattern P = 0001 in the text T = 000010001010001.

Show that the worst-case time for the naive string matcher to find the first occurrence of a pattern in a text is ((n - m + 1)(m - 1)).

Suppose that all characters in the pattern P are different. Show how to accelerate NAIVE-STRING-MATCHER to run in time O(n) on an n-character text T.

Suppose that pattern P and text T are randomly chosen strings of length m and n, respectively, from the d-ary alphabet d = {0, 1, . . . , d - 1}, where d 2. Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is

(Assume that the naive algorithm stops comparing characters for a given shift once a mismatch is found or the entire pattern is matched.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.

Note that the gap character may occur an arbitrary number of times in the pattern but is assumed not to occur at all in the text. Give a polynomial-time algorithm to determine if such a pattern P occurs in a given text T, and analyze the running time of your algorithm.

# 34.2 The Rabin-Karp algorithm

This algorithm makes use of elementary number-theoretic notions such as the equivalence of two numbers modulo a third number. You may want to refer to Section 33.1 for the relevant definitions.

For expository purposes, let us assume that = {0, 1, 2, . . . , 9}, so that each character is a decimal digit. (In the general case, we can assume that each character is a digit in radix-d notation, where d = .) We can then view a string of k consecutive characters as representing a length-k decimal number. The character string 31415 thus corresponds to the decimal number 31,415. Given the dual interpretation of the input characters as both graphical symbols and digits, we find it convenient in this section to denote them as we would digits, in our standard text font.

Given a pattern P[1 . . m], we let p denote its corresponding decimal value. In a similar manner, given a text T[1 . . n], we let ts denote the decimal value of the length-m substring T[s + 1 . . s + m], for s = 0, 1, . . . , n - m. Certainly, ts = p if and only if T[s + 1 . . s + m] = P[1 . . m]; thus, s is a valid shift if and only if ts = p. If we could compute p in time O(m) and all of the ti values in a total of O(n) time, then we could determine all valid shifts s in time O(n) by comparing p with each of the ts's. (For the moment, let's not worry about the possibility that p and the ts's might be very large numbers.)

We can compute p in time O(m) using Horner's rule (see Section 32.1):

`p = P[m] + 10 (P[m - 1] + 10(P[m - 2] + . . . + 10(P[2] + 10P[1]) . . . )).`

The value t0 can be similarly computed from T[1 . . m] in time O(m).

To compute the remaining values t1, t2, . . . , tn-m in time O(n - m), it suffices to observe that ts + 1 can be computed from ts in constant time, since

`ts + 1   =   10(ts - 10m - 1T[s + 1]) + T[s + m + 1].`

#### (34.1)

For example, if m= 5 and ts = 31415, then we wish to remove the high-order digit T[s + 1] = 3 and bring in the new low-order digit (suppose it is T[s + 5 + 1] = 2) to obtain

`ts+1 = 10(31415 - 10000.3) + 2`

`= 14152 .`

Subtracting 10m-1 T[s+1] removes the high-order digit from ts, multiplying the result by 10 shifts the number left one position, and adding T[s + m + 1] brings in the appropriate low-order digit. If the constant 10m-1 is precomputed (which can be done in time O(1g m) using the techniques of Section 33.6, although for this application a straightforward O(m) method is quite adequate), then each execution of equation (34.1) takes a constant number of arithmetic operations. Thus, p and t0, t1, . . . , tn-m can all be computed in time O(n + m), and we can find all occurrences of the pattern P[1 . . m] in the text T[1 . . n] in time O(n + m).

The only difficulty with this procedure is that p and ts may be too large to work with conveniently. If P contains m characters, then assuming that each arithmetic operation on p (which is m digits long) takes "constant time" is unreasonable. Fortunately, there is a simple cure for this problem, as shown in Figure 34.4 : compute p and the ts's modulo a suitable modulus q. Since the computation of p, t0, and the recurrence (34.1) can all be performed modulo q, we see that p and all the ts's can be computed modulo q in time O(n + m). The modulus q is typically chosen as a prime such that 10q just fits within one computer word, which allows all of the necessary computations to be performed with single-precision arithmetic.

#### Figure 34.4 The Rabin-Karp algorithm. Each character is a decimal digit, and we compute values modulo 13. (a) A text string. A window of length 5 is shown shaded. The numerical value of the shaded number is computed modulo 13, yielding the value 7. (b) The same text string with values computed modulo 13 for each possible position of a length-5 window. Assuming the pattern P = 31415, we look for windows whose value modulo 13 is 7, since 31415 7 (mod 13). Two such windows are found, shown shaded in the figure. The first, beginning at text position 7, is indeed an occurrence of the pattern, while the second, beginning at text position 13, is a spurious hit. (c) Computing the value for a window in constant time, given the value for the previous window. The first window has value 31415. Dropping the high-order digit 3, shifting left (multiplying by 10), and then adding in the low-order digit 2 gives us the new value 14152. All computations are performed modulo 13, however, so the value for the first window is 7, and the value computed for the new window is 8.

In general, with a d-ary alphabet {0, 1, . . . ,d - 1}, we choose q so that d q fits within a computer word and adjust the recurrence equation (34.1) to work modulo q, so that it becomes

`ts+1 = (d(ts - T[s + 1]h) + T[s + m + 1]) mod q ,`

#### (34.2)

where h dm-1 (mod q) is the value of the digit "1" in the high-order position of an m-digit text window.

The following procedure makes these ideas precise. The inputs to the procedure are the text T, the pattern P, the radix d to use (which is typically taken to be ||), and the prime q to use.

`RABIN-KARP-MATCHER(T, P, d, q)`

`1 n  length[T]`

`2 m length[P]`

`3 h  dm-1 mod q`

`4 p  0`

`5 t0  0`

`6 for i  1 to m`

`7       do p  (dp + P[i]) mod q`

`8          t0  (dt0 + T[i]) mod q`

`9 for s  0 to n - m`

`10      do if p = ts`

`11            then if P[1 . . m] = T[s + 1 . . s + m]`

`12                    then "Pattern occurs with shift" s`

`13         if s < n - m`

`14           then ts+1  (d(ts - T[s + 1]h) + T[s + m + 1]) mod q`

The procedure RABIN-KARP-MATCHER works as follows. All characters are interpreted as radix-d digits. The subscripts on t are provided only for clarity; the program works correctly if all the subscripts are dropped. Line 3 initializes h to the value of the high-order digit position of an m-digit window. Lines 4-8 compute p as the value of P[1 . . m] mod q and t0 as the value of T[1 . . m] mod q. The for loop beginning on line 9 iterates through all possible shifts s. The loop has the following invariant: whenever line 10 is executed, ts = T[s + 1 . . s + m] mod q. If p = ts in line 10 (a "hit"), then we check to see if P[1 . . m] = T[s + 1 . . s + m] in line 11 to rule out the possibility of a spurious hit. Any valid shifts found are printed out on line 12. If s < n - m (checked in line 13), then the for loop is to be executed at least one more time, and so line 14 is first executed to ensure that the loop invariant holds when line 10 is again reached. Line 14 computes the value of ts+1 mod q from the value of ts mod q in constant time using equation (34.2) directly.

In many applications, we expect few valid shifts (perhaps O(1) of them), and so the expected running time of the algorithm is O(n + m) plus the time required to process spurious hits. We can base a heuristic analysis on the assumption that reducing values modulo q acts like a random mapping from * to Zq. (See the discussion on the use of division for hashing in Section 12.3.1. It is difficult to formalize and prove such an assumption, although one viable approach is to assume that q is chosen randomly from integers of the appropriate size. We shall not pursue this formalization here.) We can then expect that the number of spurious hits is O(n/q), since the chance that an arbitrary ts will be equivalent to p, modulo q, can be estimated as 1/q. The expected amount of time taken by the Rabin-Karp algorithm is then

`O(n) + O(m(v + n/q)),`

where v is the number of valid shifts. This running time is O(n) if we choose q m. That is, if the expected number of valid shifts is small (O(1)) and the prime q is chosen to be larger than the length of the pattern, then we can expect the Rabin-Karp procedure to run in time O(n + m).

## Exercises

Working modulo q = 11, how many spurious hits does the Rabin-Karp matcher encounter in the text T = 3141592653589793 when looking for the pattern P = 26?

How would you extend the Rabin-Karp method to the problem of searching a text string for an occurrence of any one of a given set of k patterns?

Show how to extend the Rabin-Karp method to handle the problem of looking for a given m X m pattern in an n X n array of characters. (The pattern may be shifted vertically and horizontally, but it may not be rotated.)

and Bob similarly evaluates B(x). Prove that if A B, there is at most one chance in 1000 that A(x) = B(x), whereas if the two files are the same, A(x) is necessarily the same as B(x). (Hint: See Exercise 33.4-4.)

# 34.3 String matching with finite automata

We begin this section with the definition of a finite automaton. We then examine a special string-matching automaton and show how it can be used to find occurrences of a pattern in a text. This discussion includes details on how to simulate the behavior of a string-matching automaton on a given text. Finally, we shall show how to construct the string-matching automaton for a given input pattern.

## Finite automata

Q is a finite set of states,

#### Figure 34.5 A simple two-state finite automaton with state set Q = {0, 1}, start state q0 = 0, and input alphabet = {a, b}. (a) A tabular representation of the transition function . (b) An equivalent state-transition diagram. State 1 is the only accepting state (shown blackened). Directed edges represent transitions. For example, the edge from state 1 to state 0 labeled b indicates (1, b) = 0. This automaton accepts those strings that end in an odd number of a's. More precisely, a string x is accepted if and only if x = yz, where y = or y ends with a b, and z = ak, where k is odd. For example, the sequence of states this automaton enters for input abaaa (including the start state) is 0, 1, 0, 1, 0, 1, and so it accepts this input. For input abbaa, the sequence of states is 0, 1, 0, 0, 1, 0, and so it rejects this input.

`ø()  = q ,`

`ø(wa)  = (ø(w), a)  for w  *, a  .`

## String-matching automata

#### Figure 34.6 (a) A state-transition diagram for the string-matching automaton that accepts all strings ending in the string ababaca. State 0 is the start state, and state 7 (shown blackened) is the only accepting state. A directed edge from state i to state j labeled a represents (i,a) = j. The right-going edges forming the "spine" of the automaton, shown heavy in the figure, correspond to successful matches between pattern and input characters. The left-going edges correspond to failing matches. Some edges corresponding to failing matches are not shown; by convention, if a state i has no outgoing edge labeled a for some a , then (i,a) = 0. (b) The corresponding transition function , and the pattern string P = ababaca. The entries corresponding to successful matches between pattern and input characters are shown shaded. (c) The operation of the automaton on the text T = abababacaba. Under each text character T[i] is given the state ø(Ti) the automaton is in after processing the prefix Ti. One occurrence of the pattern is found, ending in position 9.

The suffix function is well defined since the empty string P0 = is a suffix of every string. As examples, for the pattern P = ab, we have () = 0, (ccaca) = 1, and (ccab) = 2. For a pattern P of length m, we have (x) = m if and only if . It follows from the definition of the suffix function that if , then (x) (y).

We define the string-matching automaton corresponding to a given pattern P[1 . . m] as follows.

The state set Q is {0, 1, . . . , m}. The start state q0 is state 0, and state m is the only accepting state.

The transition function is defined by the following equation, for any state q and character a:

` (q, a) = (Pqa) .`

#### (34.3)

Here is an intuitive rationale for defining (q, a) = (Pq a). The machine maintains as an invariant of its operation that

`(Ti) = (Ti) ;`

#### (34.4)

this result is proved as Theorem 34.4 below. In words, this means that after scanning the first i characters of the text string T, the machine is in state (Ti) = q, where q = (Ti) is the length of the longest suffix of Ti that is also a prefix of the pattern P. If the next character scanned is T[i + 1] = a, then the machine should make a transition to state (Ti + 1) = (Tia). The proof of the theorem shows that (Tia) = (Pqa). That is, to compute the length of the longest suffix of Tia that is a prefix of P, we can compute the longest suffix of Pqa that is a prefix of P. At each state, the machine only needs to know the length of the longest prefix of P that is a suffix of what has been read so far. Therefore, setting (q, a) = (Pqa) maintains the desired invariant (34.4). This informal argument will be made rigorous shortly.

In the string-matching automaton of Figure 34.6, for example, we have (5, b) = 4. This follows from the fact that if the automaton reads a b in state q = 5, then Pqb = ababab, and the longest prefix of P that is also a suffix of ababab is P4 = abab.

To clarify the operation of a string-matching automaton, we now give a simple, efficient program for simulating the behavior of such an automaton (represented by its transition function ) in finding occurrences of a pattern P of length m in an input text T[1 . . n]. As for any string-matching automaton for a pattern of length m, the state set Q is {0,1, . . . , m}, the start state is 0, and the only accepting state is state m.

#### Figure 34.7 An illustration for the proof of Lemma 34.2. The figure shows that r (x) + 1, where r = (xa).

`FINITE-AUTOMATON-MATCHER(T, , m)`

`1  n  length[T]`

`2  q  0`

`3  for i  1 to n`

`4        do q   (q, T[i])`

`5           if q = m`

`6              then s  i - m`

`7                   print "Pattern occurs with shift" s`

The simple loop structure of FINITE-AUTOMATON-MATCHER implies that its running time on a text string of length n is O(n). This running time, however, does not include the time required to compute the transition function . We address this problem later, after proving that the procedure FINITE-AUTOMATON-MATCHER operates correctly.

Consider the operation of the automaton on an input text T[1 . . n]. We shall prove that the automaton is in state (Tj) after scanning character T[i]. Since (Ti) = m if and only if , the machine is in the accepting state m if and only if the pattern P has just been scanned. To prove this result, we make use of the following two lemmas about the suffix function .

Proof Referring to Figure 34.7, let r = (xa). If r = 0, then the conclusion r (x) + 1 is trivially satisfied, by the nonnegativity of (x). So assume that r > 0. Now, , by the definition of . Thus, , by dropping the a from the end of Pr and from the end of xa. Therefore, r - 1 (x), since (x) is largest k such that .

Proof From the definition of , we have . As Figure 34.8 shows, . If we let r = (xa), then r q + 1 by Lemma 34.2. Since , and Pr Pqa, Lemma 34.1 implies that . Therefore, r (Pqa), that is, (xa) (Pqa). But we also have (Pqa) (xa), since . Thus, (xa) = (Pqa).

#### Figure 34.8 An illustration for the proof of Lemma 34.3. The figure shows that r = (Pqa), where q = (x) and r = (xa).

We are now ready to prove our main theorem characterizing the behavior of a string-matching automaton on a given input text. As noted above, this theorem shows that the automaton is merely keeping track, at each step, of the longest prefix of the pattern that is a suffix of what has been read so far.

If is the final-state function of a string-matching automaton for a given pattern P and T[1 . . n] is an input text for the automaton, then

Proof The proof is by induction on i. For i = 0, the theorem is trivially true, since T0 = . Thus, .

Now, we assume that and prove that . Let q denote , and let a denote T[i + 1]. Then,

By induction, the theorem is proved.

By Theorem 34.4, if the machine enters state q on line 4, then q is the largest value such that . Thus, we have q = m on line 5 if and only if an occurrence of the pattern P has just been scanned. We conclude that FINITE-AUTOMATON-MATCHER operates correctly.

## Computing the transition function

`COMPUTE-TRANSITION-FUNCTION(P, )`

`1 m  length[P]`

`2 for q  0 to m`

`3     do for each character a  *`

`4            do k  min (m + 1, q + 2)`

`5               repeat k  k - 1`

`7             (q, a)  k`

`8 return `

This procedure computes (q, a) in a straightforward manner according to its definition. The nested loops beginning on lines 2 and 3 consider all states q and characters a, and lines 4-7 set (q, a) to be the largest k such that . The code starts with the largest conceivable value of k, which is min(m, q + 1), and decreases k until .

The running time of COMPUTE-TRANSITION-FUNCTION is O(m3), because the outer loops contribute a factor of m , the inner repeat loop can run at most m + 1 times, and the test on line 6 can require comparing up to m characters. Much faster procedures exist; the time required to compute from P can be improved to O(m) by utilizing some cleverly computed information about the pattern P (see Exercise 34.4-6). With this improved procedure for computing , the total running time to find all occurrences of a length-m pattern in a length-n text over an alphabet is O(n + m).

## Exercises

Construct the string-matching automaton for the pattern P = aabab and illustrate its operation on the text string T = aaababaabaababaab.

Draw a state-transition diagram for a string-matching automaton for the pattern ababbabbababbababbabb over the alphabet = {a, b}.

# 34.4 The Knuth-Morris-Pratt algorithm

## The prefix function for a pattern

Consider the operation of the naive string matcher. Figure 34.9(a) shows a particular shift s of a template containing the pattern P = ababaca against a text T. For this example, q = 5 of the characters have matched successfully, but the 6th pattern character fails to match the corresponding text character. The information that q characters have matched successfully determines the corresponding text characters. Knowing these q text characters allows us to determine immediately that certain shifts are invalid. In the example of the figure, the shift s + 1 is necessarily invalid, since the first pattern character, an a, would be aligned with a text character that is known to match with the second pattern character, a b. The shift s + 2 shown in part (b) of the figure, however, aligns the first three pattern characters with three text characters that must necessarily match. In general, it is useful to know the answer to the following question:

Given that pattern characters P[1 . . q] match text characters T[s + 1 . . s + q], what is the least shift s' > s such that

`P[1 . . k] = T[s' 1 . . s' k],`

#### (34.5)

where s' + k = s + q?

Such a shift s' is the first shift greater than s that is not necessarily invalid due to our knowledge of T[s + 1 . . s + q]. In the best case, we have that s' = s + q, and shifts s + 1, s + 2, . . . , s + q - 1 are all immediately ruled out. In any case, at the new shift s' we don't need to compare the first k characters of P with the corresponding characters of T, since we are guaranteed that they match by equation (34.5).