Suppose you are maintaining a table of frequencies. You are
receiving a steady incoming stream of symbols (represented as
numbers between 0 and NSYMS
), and you want to be able
to track how many symbols of a given type you've seen.
The obvious way to do this is to have an array indexed by the symbol
number: a[sym]
tells you how many occurrences of symbol
sym
you've seen. Insertion is constant-time (increment
a[sym]
), and lookup is also constant-time (read
a[sym]
).
This is less helpful when you want cumulative frequencies:
not "how many of symbol 18 have I seen", but "how many symbols
up to 18 have I seen". Looking that statistic up in the
array takes O(NSYMS)
time, because you have to walk
along the array from 0 to sym
adding together all the
entries.
If you change the contents of the array so that it contains the
cumulative frequencies instead of the individual frequencies, that's
not much help either, because although cumulative frequency lookup
is now very fast, insertion takes O(NSYMS)
time, because when a symbol sym
arrives you have to
walk along the array adding 1 to the first sym
elements.
So it looks as if we have the choice between constant-time insertion
and linear-time lookup, or linear-time insertion and constant-time
lookup. As NSYMS
increases, this looks less and less
attractive. There must be a better way.
There is, and I'll now describe it.
(I'm going to start by describing the simple case when
NSYMS
is a power of two. The algorithm still works when
this isn't the case, but takes a bit more explaining.)
Conceptually, our new data structure is going to be a collection of
arrays. It's possible to store all of these end-to-end in the same
physical array, but it's easiest to think of them as a
bunch of independent arrays. The arrays themselves are numbered from
0 to N
, where 2^N = NSYMS
.
N
has one element, which counts the total number
of symbols we have seen overall.
N-1
has one element, which counts the total
number of symbols we have seen in the range 0 to
2^(N-1)
. (Including 0, excluding 2^(N-1)
).
N-2
has two elements. Element 0 counts the
symbols we have seen in the range 0 to 2^(N-2)
, and
element 1 counts the symbols we have seen in the range
2*2^(N-2)
to 3*2^(N-2)
.
2^(N-1)
elements. Element 0 counts the
number of symbol 0 we have seen; element 1 counts the number of
symbol 2; in general, element i
counts the number of
symbol 2*i
.
In general:
i
has 2^(N-1-i)
elements (except that
array N
is a special case)
j
of array i
counts the total
number of symbols between j * 2^(i+1)
and
j * 2^(i+1) + 2^i
.
Here's a diagram illustrating the case N=3
(so
NSYMS=8
):
array A3 array A2 array A1 array A0
+---------+---------+---------+---------+
symbol 7 | | | | | symbol 7
| | | +---------+
symbol 6 | | | | A0[3] | symbol 6
| | +---------+---------+
symbol 5 | | | | | symbol 5
| | | A1[1] +---------+
symbol 4 | | | | A0[2] | symbol 4
| A3[0] +---------+---------+---------+
symbol 3 | | | | | symbol 3
| | | +---------+
symbol 2 | | | | A0[1] | symbol 2
| | A2[0] +---------+---------+
symbol 1 | | | | | symbol 1
| | | A1[0] +---------+
symbol 0 | | | | A0[0] | symbol 0
+---------+---------+---------+---------+
So that describes the data structure; now for the operations.
Insertion is log-time. Since any given symbol is counted by either
zero or one elements of each array, we need to modify at most
N
elements.
A3[0]
, nothing in
A2
, A1[1]
, and A0[2]
.
sym
: for
each array i
, let
j = sym / 2^(i+1)
. Then increment
element j
of array i
, if and only
if bit i
of sym
is 0.
n
to the count of symbol
sym
, you just add n
to precisely the same
array elements listed above.
Cumulative frequency lookup is also log-time. Suppose we want to
determine the total number of symbols we have seen whose value is
less than sym
. (So sym
could range from 0
to NSYMS
, inclusive.) We can achieve this by
adding together at most one element of each array.
A1[0]+A0[1]
. To count all
the symbols with value less than 7, we compute
A2[0]+A1[1]+A0[3]
. To count all the symbols with value
less than 8, we just return A3[0]
.
sym
: start with a count of zero, and loop over each
array. For each array i
, test whether bit
i
of sym
is 1. If it is, then let
j = sym / 2^(i+1)
, and add element
j
of array i
to the count.
sym=NSYMS
, we just return the single
element of array N
.
Single frequency lookup can be done by retrieving two adjacent cumulative frequencies and subtracting them. However, there's a slightly quicker way too, still log-time but requiring only one pass through the arrays instead of two:
count
equal element 0 (the only element) of array
N
.
i
loop from N-1
down to zero,
inclusive. For each i
: let
j = sym / 2^(i+1)
, and look up
e = Ai[j]
, element j
of array
i
. If bit i
of sym
is 0, then
let count
equal e
; otherwise let
count
equal count-e
.
count
contains the individual count of symbol
sym
.
Finally, there's one more operation you might want: given a number,
we might want to find which cumulative frequency band it falls in.
(In other words: if you sorted all the symbols in the table into
order, what would the n
th one be?) This can be done, in
log time, similarly to the other operations:
c=0
.
i
loop from N-1
down to zero,
inclusive. For each i: look up j = Ai[c]
. If
n < j
, then let c = c * 2
; otherwise,
let c = c * 2 + 1
and let n = n - j
.
c
contains the index of the symbol S
required. That is, the cumulative frequency of all symbols
below S
is less than or equal to
n
, and the cumulative frequency of all symbols
below or equal to S
is greater than
n
.
This completes the algorithm definition. All that remains is to deal
with the case in which NSYMS
is not a power of two.
Conceptually, we can deal with this by rounding NSYMS
up to the next power of two and using the algorithm in the ordinary
way. If we do that, we discover we can shorten most of the arrays
as a result: every index into every array is computed by the formula
j = sym / 2^(i+1)
, and therefore
the maximum length needed for each array must be given by the
formula (NSYMS-1) / 2^(i+1) + 1
.
(sym
is always strictly less than NSYMS
,
except that in the cumulative lookup operation it can be
equal to NSYMS
. But in this case the operation
instantly returns the single element of array N
, and
never progresses to the other arrays, so this special case doesn't
need to affect us.)
Consider the example diagram above. We had:
NSYMS = 8; NSYMS-1 = 7
len(A3) = 7 / 16 + 1 = 1
len(A2) = 7 / 8 + 1 = 1
len(A1) = 7 / 4 + 1 = 2
len(A0) = 7 / 2 + 1 = 4
Now suppose we changed NSYMS
to be 5 instead of 8. Then
we would have
NSYMS = 5; NSYMS-1 = 4
len(A3) = 4 / 16 + 1 = 1
len(A2) = 4 / 8 + 1 = 1
len(A1) = 4 / 4 + 1 = 2
len(A0) = 4 / 2 + 1 = 3
In other words, nothing changes in this case except for the final array. This would lead to the modified diagram
array A3 array A2 array A1 array A0
+---------+---------+---------+---------+
symbol 4 | | | A1[1] | A0[2] | symbol 4
| +---------+---------+---------+
symbol 3 | | | | | symbol 3
| | | +---------+
symbol 2 | A3[0] | | | A0[1] | symbol 2
| | A2[0] +---------+---------+
symbol 1 | | | | | symbol 1
| | | A1[0] +---------+
symbol 0 | | | | A0[0] | symbol 0
+---------+---------+---------+---------+
in which all the operations still work exactly as they did before.
I've already shown that the time cost of the operations (increment
or modify a single count; look up a cumulative frequency; look up a
single frequency) is O(N)
(i.e. O(log
NSYMS)
) for all operations.
The storage cost of the data structure is roughly
O(NSYMS)
. This is easily seen because the size of array
i
is approximately NSYMS/2^i
(perhaps off
by one.) Therefore, the total size of all the arrays is
NSYMS*(1/2+1/4+1/8+1/16+...)
, which is at most
NSYMS
. (The off-by-one errors might add up to an error
of log NSYMS
, but we don't mind about that.)
I invented this algorithm when writing an adaptive arithmetic compressor (for my own interest and education, since I'm told arithmetic compression is a patented technology). For arithmetic compression to work, you need cumulative probabilities for all your symbols: for each symbol, you need to be able to retrieve the probability of seeing that symbol or anything lower than it in value. This is clearly a job for a cumulative frequency table. In addition, my compressor's probabilities were adaptively based on the symbol frequencies in the last 16Kb of data, so every time the compressor received a new character it had to enter it in the frequency table and also remove the character from 16Kb earlier. Therefore, for every character I compressed, I needed to do an insertion, a removal, and two cumulative lookups. Linear time would have been painful.
In addition, it seems to me that this algorithm might potentially
have uses in code generation. Instead of thinking of it as a
cumulative frequency table, think of it as a set of
NSYMS+1
marks with varying distances between them,
allowing you to move any adjacent pair of marks closer together or
further apart, and allowing you to quickly measure the distance
between any (not necessarily adjacent) pair of marks.
So imagine a situation in which you have a number of "basic blocks", and each has an optional jump at the end that targets the beginning of another block. You know the length of the code in each block (i.e. the distance between the marks either side of it), and you can work out the distance required in each jump. If you were to try an optimisation phase in which you modified or re-ordered blocks so as to try to reduce those distances, this algorithm might be a convenient way to keep track of where you'd got to.
Sample code in C is provided here.