Unsolved: Equivalence Relations

Introduction

An equivalence relation on a set is a partition of the set into classes. Two elements are considered equivalent if they are in the same class, and not if they are not.

In some situations, you might find yourself dealing with a set of elements and gradually discover ways in which they behave differently; so you might want to keep track of which ones are different and which ones are not.

(In other situations, you might find yourself dealing with a set of distinct elements and gradually discover that some are equivalent to others; but there's a known algorithm for this. It's made easier by the fact that if all elements start off distinct, you're unlikely ever to be dealing with too many of them to enumerate individually.)

Desired Properties

So I'm looking for data structures with these operations:

Canonify. Given an element of the set, return a canonical element of the equivalence class containing that element. Any two equivalent elements should return the same element when canonified. Any two non-equivalent elements should return different elements.
Enumerate. Run through all the equivalence classes one by one, probably by means of their canonical elements.
Disconnect. Given a subset of the set, arrange that every equivalence class is either totally inside the subset or totally outside it, by means of splitting any class that crosses the boundary into two.

Best Known Approximations

For a small and dense set, where it's feasible to use the set elements as array indices, there's a reasonable implementation of all this: have an array A with one element for each set element. Then, for each set element e, let A[e] contain the canonical element of the class containing e. The canonical element of any class is defined to be the smallest-value element of that class.

Then the "canonify" operation is a simple array lookup, and the "enumerate" operation consists of going through the array looking for any e satisfying A[e] = e. Disconnection is O(N), and connection is also O(N); but by assumption N is small, so that isn't too big a problem.

For a sparse set - perhaps the set of all strings, or the set of basic blocks in a compilation process - I have no answer.

Applications

One clear application for equivalence classes with a disconnect operation is the algorithm that constructs a deterministic finite state machine from a nondeterministic one (used in regular expression processing). Most of the character set can be treated as equivalent: any character not mentioned explicitly in the regular expression behaves just the same as any other, and any two characters that are always used as part of the same character class are equivalent. For example, in the regular expression (0[xX][0-9A-Fa-f]+|[1-9][0-9]*|0[0-7]*), there is no need to treat all ASCII characters differently. The equivalence classes are [0], [Xx], [A-Fa-f], [1-7], [8-9], and everything else. So we only need to compute six transitions for each state, instead of 256. (back to algorithms index)

(comments to anakin@pobox.com)
(thanks to chiark for hosting this page)
(last modified on Sun May 7 14:33:22 2017)