Finding all occurrences of a pattern in a text is a problem that arises frequently in text-editing programs. Typically, the text is a document being edited, and the pattern searched for is a particular word supplied by the user. Efficient algorithms for this problem can greatly aid the responsiveness of the text-editing program. String-matching algorithms are also used, for example, to search for particular patterns in DNA sequences.

We formalize the *string-matching problem* as follows. We assume that the text is an array *T*[1 . . *n*] of length *n* and that the pattern is an array *P*[1 . . *m*] of length *m* ≤ *n*. We further assume that the elements of *P* and *T* are characters drawn from a finite alphabet Σ.

We say that pattern *P* **occurs with shift** *s* in text *T* (or, equivalently, that pattern *P* occurs beginning at position *s* + 1 in text *T*) if 0 ≤ *s* ≤ *n* - *m* and *T*[*s* + 1 . . *s* + *m*] = *P*[1 . . *m*]. If *P* occurs with shift *s* in *T*, then we call *s* a **valid shift**; otherwise, we call *s* an **invalid shift**.

We shall let Σ* (read "sigma-star") denote the set of all finite-length strings formed using characters from the alphabet Σ. In this chapter, we consider only finite-length strings. The zero-length *empty string*, denoted ε, also belongs to Σ*. The length of a string *x* is denoted |*x*|.

We say that a string *w* is a *prefix* of a string *x*, denoted *w* ⊏ *x*, if *x* = *wy* for some string *y* ∈ Σ*. Similarly, we say that a string *w* is a *suffix* of a string *x*, denoted *w* ⊐ *x*, if *x* = *yw* for some *y* ∈ Σ*.

Suppose that *x*, *y*, and *z* are strings such that *x* ⊐ *z* and *y* ⊐ *z*. If |*x*| ≤ |*y*|, then *x* ⊐ *y*. If |*x*| ≥ |*y*|, then *y* ⊐ *x*. If |*x*| = |*y*|, then *x* = *y*.

**Proof** See Figure 34.2 for a graphical proof.

The naive algorithm finds all valid shifts using a loop that checks the condition *P*[1 . . *m*] = *T*[*s + 1 . . s + m*] for each of the *n* - *m* + 1 possible values of *s*.

NAIVE-STRING-MATCHER(T, P)

1  n ← length[T]
2  m ← length[P]
3  for s ← 0 to n - m
4      do if P[1 . . m] = T[s + 1 . . s + m]
5            then print "Pattern occurs with shift" s
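As a sanity check, the procedure above can be transcribed directly into Python. This is an illustrative sketch, not part of the original text; it uses 0-based shifts, so a reported shift *s* means `text[s:s+m]` equals the pattern.

```python
def naive_string_matcher(text, pattern):
    """Return every (0-based) shift s at which pattern occurs in text.

    Direct transcription of NAIVE-STRING-MATCHER: try each of the
    n - m + 1 possible shifts and compare character-by-character
    (the slice comparison does the loop of line 4 implicitly).
    """
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts

print(naive_string_matcher("acaabc", "aab"))  # [2]
```

Each shift costs O(m) comparisons in the worst case, giving the O((n - m + 1)m) bound discussed above.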

Suppose we allow the pattern *P* to contain occurrences of a *gap character* ◊ that can match an *arbitrary* string of characters (even one of zero length).

Rabin and Karp have proposed a string-matching algorithm that performs well in practice and that also generalizes to other algorithms for related problems, such as two-dimensional pattern matching. The worst-case running time of the Rabin-Karp algorithm is *O*((*n* - *m* + 1)*m*), but it has a good average-case running time.

We can compute *p* in time *O*(*m*) using Horner's rule (see Section 32.1):

p = P[m] + 10(P[m - 1] + 10(P[m - 2] + · · · + 10(P[2] + 10 P[1]) · · ·)).

The value *t*_{0} can be similarly computed from *T*[1 . . *m*] in time O(*m*).
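Horner's rule can be sketched in a few lines of Python (an illustration, not part of the original text; the optional modulus anticipates the "mod *q*" refinement below):

```python
def decimal_value(digits, mod=None):
    """Evaluate p = P[m] + 10(P[m-1] + ... + 10 P[1]...) via Horner's rule.

    digits holds the decimal digits high-order first, so the loop
    performs exactly m multiply-add steps: O(m) time.  Passing a
    modulus keeps every intermediate value smaller than mod.
    """
    p = 0
    for d in digits:
        p = 10 * p + d
        if mod is not None:
            p %= mod
    return p

print(decimal_value([3, 1, 4, 1, 5]))  # 31415
```

The same loop with `mod=q` computes *p* mod *q* without ever forming the full *m*-digit number.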

t_{s+1} = 10(t_s - 10^{m-1} T[s + 1]) + T[s + m + 1].

For example, if m = 5 and t_s = 31415, then removing the high-order digit T[s + 1] = 3 and bringing in the new low-order digit (suppose it is T[s + 5 + 1] = 2) gives

t_{s+1} = 10(31415 - 10000 · 3) + 2
        = 14152.
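The rolling update in this worked example can be checked mechanically; the snippet below is only an illustration of the constant-time window shift:

```python
m = 5
h = 10 ** (m - 1)            # 10^{m-1} = 10000: weight of the high-order digit
t_s = 31415                  # value of the current window T[s+1 .. s+m]
old_digit = 3                # high-order digit T[s+1] leaving the window
new_digit = 2                # low-order digit T[s+m+1] entering the window
t_next = 10 * (t_s - h * old_digit) + new_digit
print(t_next)                # 14152
```

One subtraction, two multiplications, and one addition: constant time per shift, independent of *m* once *h* is known.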

t_{s+1} = (d(t_s - T[s + 1]h) + T[s + m + 1]) mod q,

where h ≡ d^{m-1} (mod q) is the value of the digit "1" in the high-order position of an m-digit text window.

The ointment of working modulo *q* now contains a fly, however, since *t_s* ≡ *p* (mod *q*) does not imply that *t_s* = *p*. On the other hand, if *t_s* ≢ *p* (mod *q*), then we definitely have *t_s* ≠ *p*, so that shift *s* is invalid.

RABIN-KARP-MATCHER(T, P, d, q)

1   n ← length[T]
2   m ← length[P]
3   h ← d^{m-1} mod q
4   p ← 0
5   t_0 ← 0
6   for i ← 1 to m
7       do p ← (dp + P[i]) mod q
8          t_0 ← (dt_0 + T[i]) mod q
9   for s ← 0 to n - m
10      do if p = t_s
11            then if P[1 . . m] = T[s + 1 . . s + m]
12                    then print "Pattern occurs with shift" s
13         if s < n - m
14            then t_{s+1} ← (d(t_s - T[s + 1]h) + T[s + m + 1]) mod q
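The procedure translates naturally into Python. This is a sketch, not part of the original text: it uses 0-based shifts, treats characters as their byte values with radix `d=256`, and picks `q=101` as an arbitrary small prime.

```python
def rabin_karp_matcher(text, pattern, d=256, q=101):
    """Return the (0-based) shifts at which pattern occurs in text.

    Follows RABIN-KARP-MATCHER: compare fingerprints mod q, and
    verify explicitly whenever the fingerprints agree (line 11),
    since t_s = p (mod q) may be a spurious hit.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                  # d^{m-1} mod q
    p = t = 0
    for i in range(m):                    # preprocessing, Horner's rule
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):            # matching
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                     # roll the window right by one
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts

print(rabin_karp_matcher("abracadabra", "abra"))  # [0, 7]
```

Python's `%` always yields a nonnegative result, so no extra adjustment is needed after the subtraction on the rolling-update line.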

The running time of RABIN-KARP-MATCHER is Θ((*n* - *m* + 1)*m*) in the worst case, since (like the naive string-matching algorithm) the Rabin-Karp algorithm explicitly verifies every valid shift. If *P* = a*^m* and *T* = a*^n*, then the verifications take time Θ((*n* - *m* + 1)*m*), since each of the *n* - *m* + 1 possible shifts is valid.

O(n) + O(m(v + n/q)),

where *v* is the number of valid shifts.

Alice has a copy of a long *n*-bit file *A* = ⟨*a*_{*n*-1}, *a*_{*n*-2}, . . . , *a*_{0}⟩, and Bob similarly has an *n*-bit file *B* = ⟨*b*_{*n*-1}, *b*_{*n*-2}, . . . , *b*_{0}⟩.

Many string-matching algorithms build a finite automaton that scans the text string *T* for all occurrences of the pattern *P*. This section presents a method for building such an automaton. These string-matching automata are very efficient: they examine each text character *exactly once*, taking constant time per text character. The time used--after the automaton is built--is therefore Θ(*n*). The time to build the automaton, however, can be large if Σ is large. Section 34.4 describes a clever way around this problem.

A **finite automaton** *M* is a 5-tuple (*Q*, *q*_{0}, *A*, Σ, δ), where

*Q* is a finite set of **states**,

*q*_{0} ∈ *Q* is the **start state**,

*A* ⊆ *Q* is a distinguished set of **accepting states**,

Σ is a finite **input alphabet**,

δ is a function from *Q* × Σ into *Q*, called the **transition function** of *M*.

The finite automaton begins in state *q*_{0} and reads the characters of its input string one at a time. If the automaton is in state *q* and reads input character *a*, it moves ("makes a transition") from state *q* to state δ(*q*, *a*). Whenever its current state *q* is a member of *A*, the machine *M* is said to have *accepted* the string read so far. An input that is not accepted is said to be *rejected*.

A finite automaton *M* induces a function φ, called the *final-state function*, from Σ* to *Q* such that φ(*w*) is the state *M* ends up in after scanning the string *w*. Thus, *M* accepts a string *w* if and only if φ(*w*) ∈ *A*. The function φ is defined recursively by

φ(ε) = q_{0},

φ(wa) = δ(φ(w), a) for w ∈ Σ*, a ∈ Σ.

There is a string-matching automaton for every pattern *P*; this automaton must be constructed from the pattern in a preprocessing step before it can be used to search the text string. Figure 34.6 illustrates this construction for the pattern *P* = ababaca. From now on, we shall assume that *P* is a given fixed pattern string; for brevity, we shall not indicate the dependence upon *P* in our notation.

In order to specify the string-matching automaton corresponding to a given pattern *P*[1 . . *m*], we first define an auxiliary function σ, called the *suffix function* corresponding to *P*. The function σ is a mapping from Σ* to {0, 1, . . . , *m*} such that σ(*x*) is the length of the longest prefix of *P* that is also a suffix of *x*:

σ(x) = max {k : P_k ⊐ x}.
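The definition of σ can be illustrated with a deliberately naive Python sketch (not part of the original text, and far slower than the automaton it motivates):

```python
def suffix_function(pattern, x):
    """sigma(x): length of the longest prefix of pattern that is a suffix of x.

    Tries candidate lengths k from largest to smallest; the empty
    prefix (k = 0) is a suffix of every string, so the loop always
    terminates with a valid answer.
    """
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k

print(suffix_function("ab", "ccab"))  # 2
```

For the pattern "ab": σ("ccab") = 2, σ("ccaca") = 1, and σ("ccb") = 0, matching the definition above.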

We define the string-matching automaton corresponding to a given pattern *P*[1 . . *m*] as follows. The state set *Q* is {0, 1, . . . , *m*}. The start state *q*_{0} is state 0, and state *m* is the only accepting state.

The transition function δ is defined by the following equation, for any state *q* and character *a*:

δ(q, a) = σ(P_q a).

φ(T_i) = σ(T_i);

FINITE-AUTOMATON-MATCHER(T, δ, m)

1  n ← length[T]
2  q ← 0
3  for i ← 1 to n
4      do q ← δ(q, T[i])
5         if q = m
6            then s ← i - m
7                 print "Pattern occurs with shift" s
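The matching loop can be sketched in Python with the transition function stored as a table. This is an illustration, not part of the original text; the example table for the pattern "ab" is built by hand from the definition δ(q, a) = σ(P_q a), and shifts are 0-based.

```python
def finite_automaton_matcher(text, delta, m):
    """Scan text once with transition table delta; delta[q][a] is the next state.

    Each text character is examined exactly once, so matching takes
    O(n) time after the table has been built.
    """
    q = 0
    shifts = []
    for i, a in enumerate(text):
        q = delta[q].get(a, 0)        # characters outside the table lead to state 0
        if q == m:                    # accepting state: an occurrence ends here
            shifts.append(i + 1 - m)  # 0-based shift of the occurrence
    return shifts

# Hand-built table for the pattern "ab" over the alphabet {a, b}:
# e.g. delta[2]['a'] = sigma("aba") = 1.
delta = [{'a': 1, 'b': 0},   # state 0
         {'a': 1, 'b': 2},   # state 1
         {'a': 1, 'b': 0}]   # state 2 (accepting)
print(finite_automaton_matcher("abab", delta, 2))  # [0, 2]
```

The state *q* always equals σ of the text scanned so far, which is exactly the invariant the correctness proof below establishes.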

For any string *x* and character *a*, we have σ(*xa*) ≤ σ(*x*) + 1.

For any string *x* and character *a*, if *q* = σ(*x*), then σ(*xa*) = σ(*P_q a*).

Now, we assume that φ(*T_i*) = σ(*T_i*) and prove that φ(*T*_{*i*+1}) = σ(*T*_{*i*+1}). Let *q* denote φ(*T_i*), and let *a* denote *T*[*i* + 1]. Then,

By induction, the theorem is proved.

The following procedure computes the transition function δ from a given pattern *P*[1 . . *m*].

COMPUTE-TRANSITION-FUNCTION(P, Σ)

1  m ← length[P]
2  for q ← 0 to m
3      do for each character a ∈ Σ
4             do k ← min(m + 1, q + 2)
5                repeat k ← k - 1
6                   until P_k ⊐ P_q a
7                δ(q, a) ← k
8  return δ
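A direct Python transcription of this procedure follows (an illustrative sketch, not part of the original text). It represents δ as a list of dictionaries and uses string slicing for the suffix test *P_k* ⊐ *P_q a*:

```python
def compute_transition_function(pattern, alphabet):
    """Build delta for the string-matching automaton of pattern.

    For each state q and character a, find the largest k such that
    the prefix P_k is a suffix of P_q a (lines 4-6 of the pseudocode).
    Straightforward implementation: O(m^3 |alphabet|) time.
    """
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        extended = pattern[:q] + ''  # P_q; character a appended below
        for a in alphabet:
            k = min(m + 1, q + 2)
            while True:
                k -= 1               # repeat ... until P_k is a suffix of P_q a
                if (extended + a).endswith(pattern[:k]):
                    break
            delta[q][a] = k
    return delta

delta = compute_transition_function("ab", "ab")
print(delta[1]['b'], delta[2]['a'])  # 2 1
```

The loop always terminates because *P*_0 = ε is a suffix of every string. This table can be fed directly to a matcher that follows FINITE-AUTOMATON-MATCHER.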

We call a pattern *P* *nonoverlappable* if *P_k* ⊐ *P_q* implies *k* = 0 or *k* = *q*.

Given two patterns *P* and *P*', describe how to construct a finite automaton that determines all occurrences of *either* pattern. Try to minimize the number of states in your automaton.

Given a pattern *P* containing gap characters (see Exercise 34.1-5), show how to build a finite automaton that can find an occurrence of *P* in a text *T* in *O*(*n*) time, where *n* = |*T*|.

We now present a linear-time string-matching algorithm due to Knuth, Morris, and Pratt. Their algorithm achieves a Θ(*n* + *m*) running time by avoiding the computation of the transition function δ altogether, and it does the pattern matching using just an auxiliary function π[1 . . *m*] precomputed from the pattern in time *O*(*m*). The array π allows the transition function δ to be computed efficiently (in an amortized sense) "on the fly" as needed. Roughly speaking, for any state *q* = 0, 1, . . . , *m* and any character *a* ∈ Σ, the value π[*q*] contains the information that is independent of *a* and is needed to compute δ(*q*, *a*). (This remark will be clarified shortly.) Since the array π has only *m* entries, whereas δ has Θ(*m*|Σ|) entries, we save a factor of |Σ| in the preprocessing by computing π rather than δ.

The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid testing useless shifts in the naive pattern-matching algorithm or to avoid the precomputation of δ for a string-matching automaton.
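The standard O(m) computation of the prefix function can be sketched in Python (an illustration, not the text's own listing; the list is padded so that `pi[q]` corresponds to the 1-indexed *P*[1 . . *q*] of the exposition):

```python
def compute_prefix_function(pattern):
    """pi[q] = length of the longest proper prefix of pattern that is
    a suffix of P_q, for q = 1 .. m (pi[0] is an unused placeholder).

    The inner while loop follows shorter and shorter borders via pi
    itself; amortized over all q, the whole computation is O(m).
    """
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]                 # fall back to the next shorter border
        if pattern[k] == pattern[q - 1]:
            k += 1                    # extend the current border by one
        pi[q] = k
    return pi

print(compute_prefix_function("ababaca")[1:])  # [0, 0, 1, 2, 3, 0, 1]
```

For the running example *P* = ababaca, this yields π = ⟨0, 0, 1, 2, 3, 0, 1⟩, which matches the definition above entry by entry.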

P[1 . . k] = T[s' + 1 . . s' + k],