String Searching by Block Sorting

Introduction

The simplest algorithm for searching for one string (the "needle") in another larger string (the "haystack") looks like this:

    for i = 0 to len(haystack)-len(needle):
        matched = YES
        for j = 0 to len(needle)-1:
            if haystack[i+j] != needle[j]:
                matched = NO
                break
            end if
        end for
        if matched = YES:
            return i
    end for
    return "not found"

This has time complexity O(H*N) (where H is the length of the haystack, and N is the length of the needle). Of course, you don't usually need to test all the characters of the needle for each value of i, but it can happen in the worst case (consider searching for the needle "aaaaaaaaaaab" in a haystack made of one million repetitions of "a").

A well-known optimisation is to start by converting the needle into a finite state machine. This reduces the cost of the search to O(H), independent of N, at the expense of a setup phase that does depend on N. For a single fixed needle and a huge amount of haystack text, this is a worthwhile tradeoff.

On the other hand, what happens if you have a single fixed haystack and a large number of different needles you want to find?

Algorithm Description

Construct the list of all terminal substrings of the haystack. (A terminal substring is one which appears at the end of the haystack. In other words, a substring you can get by taking characters off the beginning.) For example, with the haystack string "foobar", the terminal substrings would be

    foobar
    oobar
    obar
    bar
    ar
    r

Now sort this list into order, so we get:

    ar
    bar
    foobar
    obar
    oobar
    r

This concludes the setup phase. Now, when given a needle, we need only perform a binary search through this list!

Complexity

The storage cost of this algorithm is high. The whole haystack must be kept in memory (or somewhere reasonably accessible) during the setup phase, and in addition to the haystack text itself we also need an array of offsets into the text, which is likely to take up 4 or 8 times as much space as the haystack itself. (We don't need to store all the actual terminal substrings; we need only store the position within the haystack where each one starts.)

The setup phase consists of a sort. Sorting is well known to be possible in O(n log n) comparisons, but when the comparisons are between strings, the cost of each comparison is non-trivial as well. In the absolute worst case (every character in the haystack is identical), a comparison would take O(H) time, so the setup phase could take O(H^2 log H) in the worst case. On the other hand, any reasonably normal haystack is likely to take not much more than O(H log H) to sort.

The searching procedure is similar. A binary search takes O(log n) comparisons, so in the worst case the search can take O(N log H). Assuming the needle is reasonably small, this seems entirely acceptable for the massive gain of not having an O(H) term!

Applications

No clear applications immediately spring to mind. Search engines and indexing is a related area, and it seems not impossible there might be an application there.

(back to algorithms index)


(comments to anakin@pobox.com)
(thanks to chiark for hosting this page)
(last modified on Sun May 7 14:33:22 2017)