An *equivalence relation* on a set is a partition of the set
into classes. Two elements are considered equivalent if they are in
the same class, and not if they are not.

In some situations, you might find yourself dealing with a set of elements and gradually discover ways in which they behave differently; so you might want to keep track of which ones are different and which ones are not.

(In other situations, you might find yourself dealing with a set of distinct elements and gradually discover that some are equivalent to others; but there's a known algorithm for this. It's made easier by the fact that if all elements start off distinct, you're unlikely ever to be dealing with too many of them to enumerate individually.)

So I'm looking for data structures with these operations:

- Canonify. Given an element of the set, return a canonical element of the equivalence class containing that element. Any two equivalent elements should return the same element when canonified. Any two non-equivalent elements should return different elements.
- Enumerate. Run through all the equivalence classes one by one, probably by means of their canonical elements.
- Disconnect. Given a subset of the set, arrange that every equivalence class is either totally inside the subset or totally outside it, by means of splitting any class that crosses the boundary into two.

For a small and dense set, where it's feasible to use the set
elements as array indices, there's a reasonable implementation of
all this: have an array `A`

with one element for each set
element. Then, for each set element `e`

, let
`A[e]`

contain the canonical element of the class
containing `e`

. The canonical element of any class is
defined to be the smallest-value element of that class.

Then the "canonify" operation is a simple array lookup, and the
"enumerate" operation consists of going through the array looking
for any `e`

satisfying `A[e] = e`

.
Disconnection is `O(N)`

, and connection is also
`O(N)`

; but by assumption `N`

is small, so
that isn't too big a problem.

For a sparse set - perhaps the set of all strings, or the set of basic blocks in a compilation process - I have no answer.

One clear application for equivalence classes with a disconnect
operation is the algorithm that constructs a deterministic finite
state machine from a nondeterministic one (used in regular
expression processing). Most of the character set can be treated as
equivalent: any character not mentioned explicitly in the regular
expression behaves just the same as any other, and any two
characters that are always used as part of the same character class
are equivalent. For example, in the regular expression
`(0[xX][0-9A-Fa-f]+|[1-9][0-9]*|0[0-7]*)`

, there is no
need to treat all ASCII characters differently. The equivalence
classes are `[0]`

, `[Xx]`

,
`[A-Fa-f]`

, `[1-7]`

, `[8-9]`

, and
everything else. So we only need to compute six transitions for each
state, instead of 256.
(back to algorithms index)

(comments to anakin@pobox.com)

(thanks to chiark for hosting this page)

(last modified on Sun May 7 14:33:22 2017)