The simplest algorithm for searching for one string (the "needle") in another larger string (the "haystack") looks like this:
for i = 0 to len(haystack)-len(needle):
matched = YES
for j = 0 to len(needle)-1:
if haystack[i+j] != needle[j]:
matched = NO
break
end if
end for
if matched = YES:
return i
end for
return "not found"
This has time complexity O(H*N)
(where H
is the length of the haystack, and N
is the length of
the needle). Of course, you don't usually need to test all
the characters of the needle for each value of i
, but
it can happen in the worst case (consider searching for the needle
"aaaaaaaaaaab
" in a haystack made of one million
repetitions of "a
").
A well-known optimisation is to start by converting the needle into
a finite state machine. This reduces the cost of the search to
O(H)
, independent of N
, at the expense of
a setup phase that does depend on N
. For a single fixed
needle and a huge amount of haystack text, this is a worthwhile
tradeoff.
On the other hand, what happens if you have a single fixed haystack and a large number of different needles you want to find?
Construct the list of all terminal substrings of the haystack. (A
terminal substring is one which appears at the end of the haystack.
In other words, a substring you can get by taking characters off the
beginning.) For example, with the haystack string
"foobar
", the terminal substrings would be
foobar
oobar
obar
bar
ar
r
Now sort this list into order, so we get:
ar
bar
foobar
obar
oobar
r
This concludes the setup phase. Now, when given a needle, we need only perform a binary search through this list!
The storage cost of this algorithm is high. The whole haystack must be kept in memory (or somewhere reasonably accessible) during the setup phase, and in addition to the haystack text itself we also need an array of offsets into the text, which is likely to take up 4 or 8 times as much space as the haystack itself. (We don't need to store all the actual terminal substrings; we need only store the position within the haystack where each one starts.)
The setup phase consists of a sort. Sorting is well known to be
possible in O(n log n)
comparisons, but when the
comparisons are between strings, the cost of each comparison is
non-trivial as well. In the absolute worst case (every character in
the haystack is identical), a comparison would take
O(H)
time, so the setup phase could take O(H^2
log H)
in the worst case. On the other hand, any reasonably
normal haystack is likely to take not much more than O(H log
H)
to sort.
The searching procedure is similar. A binary search takes
O(log n)
comparisons, so in the worst case the search
can take O(N log H)
. Assuming the needle is reasonably
small, this seems entirely acceptable for the massive gain of not
having an O(H)
term!
No clear applications immediately spring to mind. Search engines and indexing is a related area, and it seems not impossible there might be an application there.