The simplest algorithm for searching for one string (the "needle") in another larger string (the "haystack") looks like this:
for i = 0 to len(haystack)-len(needle): matched = YES for j = 0 to len(needle)-1: if haystack[i+j] != needle[j]: matched = NO break end if end for if matched = YES: return i end for return "not found"
This has time complexity
is the length of the haystack, and
N is the length of
the needle). Of course, you don't usually need to test all
the characters of the needle for each value of
it can happen in the worst case (consider searching for the needle
aaaaaaaaaaab" in a haystack made of one million
repetitions of "
A well-known optimisation is to start by converting the needle into
a finite state machine. This reduces the cost of the search to
O(H), independent of
N, at the expense of
a setup phase that does depend on
N. For a single fixed
needle and a huge amount of haystack text, this is a worthwhile
On the other hand, what happens if you have a single fixed haystack and a large number of different needles you want to find?
Construct the list of all terminal substrings of the haystack. (A
terminal substring is one which appears at the end of the haystack.
In other words, a substring you can get by taking characters off the
beginning.) For example, with the haystack string
foobar", the terminal substrings would be
foobar oobar obar bar ar r
Now sort this list into order, so we get:
ar bar foobar obar oobar r
This concludes the setup phase. Now, when given a needle, we need only perform a binary search through this list!
The storage cost of this algorithm is high. The whole haystack must be kept in memory (or somewhere reasonably accessible) during the setup phase, and in addition to the haystack text itself we also need an array of offsets into the text, which is likely to take up 4 or 8 times as much space as the haystack itself. (We don't need to store all the actual terminal substrings; we need only store the position within the haystack where each one starts.)
The setup phase consists of a sort. Sorting is well known to be
O(n log n) comparisons, but when the
comparisons are between strings, the cost of each comparison is
non-trivial as well. In the absolute worst case (every character in
the haystack is identical), a comparison would take
O(H) time, so the setup phase could take
log H) in the worst case. On the other hand, any reasonably
normal haystack is likely to take not much more than
H) to sort.
The searching procedure is similar. A binary search takes
O(log n) comparisons, so in the worst case the search
O(N log H). Assuming the needle is reasonably
small, this seems entirely acceptable for the massive gain of not
No clear applications immediately spring to mind. Search engines and indexing is a related area, and it seems not impossible there might be an application there.
(back to algorithms index)