String Matching

String matching is the problem of finding every place where a short pattern string occurs inside a longer text. It is one of the most common operations in computing, underlying search functions in text editors, grep, web search, and the comparison of DNA sequences in bioinformatics.

The naive approach slides the pattern across the text one position at a time and compares character by character at each position. With a text of length n and a pattern of length m, this takes time proportional to n times m in the worst case, because a near-match can force the algorithm to back up and re-compare characters it has already seen.

In 1977 Donald Knuth, James Morris, and Vaughan Pratt published “Fast Pattern Matching in Strings” in the SIAM Journal on Computing, presenting an algorithm whose running time is proportional to the sum of the lengths of the two strings, with a constant small enough to make it practical. The Knuth-Morris-Pratt algorithm achieves this by precomputing a table from the pattern that tells it how far to shift after a mismatch, so it never has to back up in the text. The Boyer-Moore algorithm, developed around the same period, uses similar precomputed tables to skip ahead and can be sub-linear in practice, often examining fewer characters than the text contains.

The textbook treatment of these algorithms appears in CLRS, “Introduction to Algorithms,” which devotes a chapter to string matching and presents the naive method alongside the linear-time approaches and their preprocessing tables.

Sources

Related