Cumulative Frequency Tables

Introduction

Suppose you are maintaining a table of frequencies. You are receiving a steady incoming stream of symbols (represented as numbers between 0 and NSYMS-1 inclusive), and you want to be able to track how many symbols of a given type you've seen.

The obvious way to do this is to have an array indexed by the symbol number: a[sym] tells you how many occurrences of symbol sym you've seen. Insertion is constant-time (increment a[sym]), and lookup is also constant-time (read a[sym]).

This is less helpful when you want cumulative frequencies: not "how many of symbol 18 have I seen", but "how many symbols up to 18 have I seen". Looking that statistic up in the array takes O(NSYMS) time, because you have to walk along the array from 0 to sym adding together all the entries.

If you change the contents of the array so that it contains the cumulative frequencies instead of the individual frequencies, that's not much help either, because although cumulative frequency lookup is now very fast, insertion takes O(NSYMS) time, because when a symbol sym arrives you have to walk along the array adding 1 to the first sym elements.
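To make the trade-off concrete, here's a minimal C sketch of the plain frequency array (the names freq, insert and cumfreq are mine, purely for illustration; they aren't taken from the sample code linked below):

    #define NSYMS 8                /* example value */

    /* Plain representation: freq[sym] is the count of symbol sym. */
    int freq[NSYMS];

    /* Constant time: increment a single entry. */
    void insert(int sym) { freq[sym]++; }

    /* Linear time: walk the array to count symbols with value < sym. */
    int cumfreq(int sym)
    {
        int total = 0;
        for (int i = 0; i < sym; i++)
            total += freq[i];      /* O(NSYMS) in the worst case */
        return total;
    }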

So it looks as if we have the choice between constant-time insertion and linear-time lookup, or linear-time insertion and constant-time lookup. As NSYMS increases, this looks less and less attractive. There must be a better way.

Algorithm Description

There is, and I'll now describe it.

(I'm going to start by describing the simple case when NSYMS is a power of two. The algorithm still works when this isn't the case, but takes a bit more explaining.)

Conceptually, our new data structure is going to be a collection of arrays. It's possible to store all of these end-to-end in the same physical array, but it's easiest to think of them as a bunch of independent arrays. The arrays themselves are numbered from 0 to N, where 2^N = NSYMS.

In general, array number i contains one element for every 2^(i+1) possible symbols. Element number j of array number i counts the symbols seen so far whose value v satisfies j * 2^(i+1) <= v < j * 2^(i+1) + 2^i; that is, each element counts the bottom half of a block of 2^(i+1) consecutive symbols. (In particular, array N consists of a single element which counts every symbol seen.)

Here's a diagram illustrating the case N=3 (so NSYMS=8):

          array A3  array A2  array A1  array A0
         +---------+---------+---------+---------+
symbol 7 |         |         |         |         | symbol 7
         |         |         |         +---------+
symbol 6 |         |         |         |  A0[3]  | symbol 6
         |         |         +---------+---------+
symbol 5 |         |         |         |         | symbol 5
         |         |         |  A1[1]  +---------+
symbol 4 |         |         |         |  A0[2]  | symbol 4
         |  A3[0]  +---------+---------+---------+
symbol 3 |         |         |         |         | symbol 3
         |         |         |         +---------+
symbol 2 |         |         |         |  A0[1]  | symbol 2
         |         |  A2[0]  +---------+---------+
symbol 1 |         |         |         |         | symbol 1
         |         |         |  A1[0]  +---------+
symbol 0 |         |         |         |  A0[0]  | symbol 0
         +---------+---------+---------+---------+

So that describes the data structure; now for the operations.

Insertion is log-time. Since any given symbol is counted by either zero or one elements of each array, we need to modify at most N+1 elements: when symbol sym arrives, the element of array i that can count it is number sym / 2^(i+1), and we increment it if and only if bit i of sym is zero, i.e. exactly when sym lies in the bottom half of that element's block.
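Here's a minimal C sketch of the data structure and of insertion (the cft_* names are mine, not those of the sample code linked below; the array lengths use the general formula justified later on, so the same sketch copes with NSYMS not being a power of two):

    #include <stdlib.h>

    #define NSYMS 8                  /* number of distinct symbols */
    #define N 3                      /* smallest N with 2^N >= NSYMS */

    struct cft {
        int *a[N + 1];               /* the arrays A0 .. AN */
    };

    /* Allocate the arrays, all counts initially zero. Array i gets
       (NSYMS-1) / 2^(i+1) + 1 elements, as derived below. */
    struct cft *cft_new(void)
    {
        struct cft *t = malloc(sizeof *t);
        for (int i = 0; i <= N; i++)
            t->a[i] = calloc(((NSYMS - 1) >> (i + 1)) + 1, sizeof(int));
        return t;
    }

    /* Count one more occurrence of sym. In array i, the element whose
       block contains sym is number sym / 2^(i+1); it counts sym exactly
       when bit i of sym is zero. */
    void cft_insert(struct cft *t, int sym)
    {
        for (int i = 0; i <= N; i++)
            if (!(sym & (1 << i)))
                t->a[i][sym >> (i + 1)]++;
    }

Removal (which the compression application below needs) is the same loop with the increment replaced by a decrement.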

Cumulative frequency lookup is also log-time. Suppose we want to determine the total number of symbols we have seen whose value is less than sym. (So sym could range from 0 to NSYMS, inclusive.) We can achieve this by adding together at most one element of each array: for each array i, we add in element number sym / 2^(i+1) if and only if bit i of sym is one, because in that case the bottom half of the block containing sym lies entirely below sym.
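Continuing the same hypothetical sketch, a cumulative lookup might look like this; the test on bit i is exactly the rule just described:

    /* Number of symbols seen so far whose value is < sym.
       sym may range from 0 to NSYMS inclusive. */
    int cft_cumfreq(struct cft *t, int sym)
    {
        int total = 0;
        for (int i = 0; i <= N; i++)
            if (sym & (1 << i))      /* bottom half of sym's block < sym */
                total += t->a[i][sym >> (i + 1)];
        return total;
    }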

Single frequency lookup can be done by retrieving two adjacent cumulative frequencies and subtracting them. However, there's a slightly quicker way, still log-time but requiring only one pass through the arrays instead of two: start with the single element of array N, which counts the whole symbol range, and work downwards, at each level using the relevant array element to split the count of the block containing sym into its bottom-half and top-half counts, and keeping whichever half contains sym. When the block has narrowed to sym alone, its count is the answer.
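Here's how that one-pass version might look in the sketch above. The loop maintains the count of a block known to contain sym, halving the block at each level:

    /* Count of sym alone, in a single downward pass. */
    int cft_singlefreq(struct cft *t, int sym)
    {
        int count = t->a[N][0];      /* count of the whole symbol range */
        for (int i = N; i > 0; i--) {
            /* Count of the bottom half of sym's block at this level. */
            int bottom = t->a[i - 1][sym >> i];
            if (sym & (1 << (i - 1)))
                count -= bottom;     /* sym is in the top half */
            else
                count = bottom;      /* sym is in the bottom half */
        }
        return count;
    }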

Finally, there's one more operation you might want: given a number, we might want to find which cumulative frequency band it falls in. (In other words: if you sorted all the symbols in the table into order, what would the nth one be?) This can be done, in log time, by a similar descent: at each level, compare the number against the count of the bottom half of the current block. If it's smaller, the answer lies in the bottom half; otherwise, subtract the bottom-half count and descend into the top half. After N steps the block contains exactly one symbol, which is the answer.
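A sketch of that descent, using the same illustrative names; it assumes 0 <= pos < (total symbol count), which guarantees the search never strays outside the allocated arrays:

    /* The symbol at (0-based) position pos in the sorted sequence of
       all symbols seen so far: the largest sym with cumfreq(sym) <= pos. */
    int cft_bypos(struct cft *t, int pos)
    {
        int sym = 0;                 /* base of the current block */
        for (int i = N; i > 0; i--) {
            /* Count of the bottom half of the current block. */
            int bottom = t->a[i - 1][sym >> i];
            if (pos >= bottom) {
                pos -= bottom;       /* skip over the bottom half... */
                sym |= 1 << (i - 1); /* ...and descend into the top half */
            }
        }
        return sym;
    }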

This completes the algorithm definition. All that remains is to deal with the case in which NSYMS is not a power of two. Conceptually, we can deal with this by rounding NSYMS up to the next power of two and using the algorithm in the ordinary way. If we do that, we discover we can shorten most of the arrays as a result: every index into every array is computed by the formula j = sym / 2^(i+1) (rounding down), and therefore the maximum length needed for array i is (NSYMS-1) / 2^(i+1) + 1. (sym is always strictly less than NSYMS, except that in the cumulative lookup operation it can be equal to NSYMS; but in that case we can simply return the single element of array N, which counts every symbol, without touching the other arrays, so this special case doesn't need to affect us.)

Consider the example diagram above. We had:

    NSYMS = 8; NSYMS-1 = 7
    len(A3) = 7 / 16 + 1 = 1
    len(A2) = 7 / 8 + 1 = 1
    len(A1) = 7 / 4 + 1 = 2
    len(A0) = 7 / 2 + 1 = 4

Now suppose we changed NSYMS to be 5 instead of 8. Then we would have:

    NSYMS = 5; NSYMS-1 = 4
    len(A3) = 4 / 16 + 1 = 1
    len(A2) = 4 / 8 + 1 = 1
    len(A1) = 4 / 4 + 1 = 2
    len(A0) = 4 / 2 + 1 = 3

In other words, nothing changes in this case except for the final array. This would lead to the modified diagram

          array A3  array A2  array A1  array A0
         +---------+---------+---------+---------+
symbol 4 |         |         |  A1[1]  |  A0[2]  | symbol 4
         |         +---------+---------+---------+
symbol 3 |         |         |         |         | symbol 3
         |         |         |         +---------+
symbol 2 |  A3[0]  |         |         |  A0[1]  | symbol 2
         |         |  A2[0]  +---------+---------+
symbol 1 |         |         |         |         | symbol 1
         |         |         |  A1[0]  +---------+
symbol 0 |         |         |         |  A0[0]  | symbol 0
         +---------+---------+---------+---------+

in which all the operations still work exactly as they did before.
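As a quick sanity check, here's a hypothetical test program exercising all four operations on this NSYMS = 5 table (it assumes the sketches above, recompiled with NSYMS defined as 5; N stays at 3):

    #include <stdio.h>

    int main(void)
    {
        struct cft *t = cft_new();
        int input[] = {0, 4, 2, 4, 1, 4};
        for (int i = 0; i < 6; i++)
            cft_insert(t, input[i]);

        /* Sorted symbol sequence so far: 0 1 2 4 4 4 */
        printf("%d\n", cft_cumfreq(t, 5));     /* 6: every symbol */
        printf("%d\n", cft_cumfreq(t, 4));     /* 3: the 0, 1 and 2 */
        printf("%d\n", cft_singlefreq(t, 4));  /* 3 */
        printf("%d\n", cft_bypos(t, 5));       /* 4: last in sorted order */
        return 0;
    }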

Complexity

I've already shown that the time cost of each operation (increment or modify a single count; look up a cumulative frequency; look up a single frequency; find a symbol by its position in sorted order) is O(N), i.e. O(log NSYMS).

The storage cost of the data structure is O(NSYMS). This is easily seen because the size of array i is approximately NSYMS/2^(i+1) (perhaps off by one). Therefore, the total size of all the arrays is NSYMS*(1/2+1/4+1/8+1/16+...), which is at most NSYMS. (The off-by-one errors might add up to an error of log NSYMS, but we don't mind about that.)

Applications

I invented this algorithm when writing an adaptive arithmetic compressor (for my own interest and education, since I'm told arithmetic compression is a patented technology). For arithmetic compression to work, you need cumulative probabilities for all your symbols: for each symbol, you need to be able to retrieve the probability of seeing that symbol or anything lower than it in value. This is clearly a job for a cumulative frequency table. In addition, my compressor's probabilities were based adaptively on the symbol frequencies in the most recent 16KB of data, so every time the compressor received a new character it had to enter it in the frequency table and also remove the character from 16KB earlier. Therefore, for every character I compressed, I needed to do an insertion, a removal, and two cumulative lookups. Linear time would have been painful.

In addition, it seems to me that this algorithm might have uses in code generation. Instead of thinking of it as a cumulative frequency table, think of it as a set of NSYMS+1 marks with varying distances between them, allowing you to move any adjacent pair of marks closer together or further apart, and to quickly measure the distance between any (not necessarily adjacent) pair of marks.
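In terms of the earlier sketch (illustrative names again), the translation is direct: gap g, between marks g and g+1, plays the role of symbol g of the table:

    /* Distance between marks from and to (from <= to): the total
       width of the gaps between them. */
    int distance(struct cft *t, int from, int to)
    {
        return cft_cumfreq(t, to) - cft_cumfreq(t, from);
    }

    /* Move the marks either side of gap g one unit further apart;
       narrowing the gap would be the matching decrement. */
    void widen(struct cft *t, int g)
    {
        cft_insert(t, g);
    }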

So imagine a situation in which you have a number of "basic blocks", and each has an optional jump at the end that targets the beginning of another block. You know the length of the code in each block (i.e. the distance between the marks either side of it), and you can work out the distance required in each jump. If you were to try an optimisation phase in which you modified or re-ordered blocks so as to try to reduce those distances, this algorithm might be a convenient way to keep track of where you'd got to.

Sample Code

Sample code in C is provided here.
