chiark / gitweb /
Universal hashing.
[mLib] / man / unihash.3
CommitLineData
8fe3c82b 1.\" -*-nroff-*-
2.de VS
3.sp 1
4.RS
5.nf
6.ft B
7..
8.de VE
9.ft R
10.fi
11.RE
12.sp 1
13..
14.de hP
15.IP
16.ft B
17\h'-\w'\\$1\ 'u'\\$1\ \c
18.ft P
19..
20.ie t \{\
21. ds ss \s8\u
22. ds se \d\s0
23. ds us \s8\d
24. ds ue \u\s0
25. ds *d \(*d
26. ds >= \(>=
27.\}
28.el \{\
29. ds ss ^
30. ds se
31. ds us _
32. ds ue
33. ds *d \fIdelta\fP
34. ds >= >=
35.\}
36.TH unihash 3 "5 July 2003" "Straylight/Edgeware" "mLib utilities library"
37.SH NAME
38unihash \- simple and efficient universal hashing for hashtables
39.\" @unihash_setkey
40.\" @UNIHASH_INIT
41.\" @unihash_hash
42.\" @UNIHASH
43.\" @unihash
44.SH SYNOPSIS
45.nf
46.B "#include <mLib/unihash.h>"
47
48.BI "void unihash_setkey(unihash_info *" i ", uint32 " k );
49.BI "uint32 UNIHASH_INIT(const unihash_info *" i );
50.BI "void unihash_hash(const unihash_info *" st ", uint32 " a ,
51.BI " const void *" p ", size_t " sz );
52.BI "uint32 unihash(const unihash_info *" i ", const void *" p ", size_t " sz );
53.BI "uint32 UNIHASH(const unihash_info *" i ", const void *" p ", size_t " sz );
54.fi
55.SH DESCRIPTION
56The
57.B unihash
58system implements a simple and relatively efficient
59.IR "universal hashing family" .
60Using a such a universal hashing family means that it's provably
61difficult for an adversary to choose input data whose hashes collide,
62thus guaranteeing good average performance even on maliciously chosen
63data.
64.PP
65Unlike, say,
66.BR crc32 (3),
67the
68.B unihash
69function is
70.I keyed
71\- in addition to the data to be hashed, the function takes as input a
7232-bit key. This key should be chosen at random each time the program
73runs.
74.SS "Preprocessing a key"
75Before use, a key must be
76.I preprocessed
77into a large (16K) table which is used by the main hashing functions.
78The preprocessing is done by
79.BR unihash_setkey :
80pass it a pointer to a
81.B unihash_info
82structure and the 32-bit key you've chosen, and it stores the table in
83the structure.
84.PP
85Objects of type
86.B unihash_info
87don't contain any pointers to other data and are safe to free when
88you've finished with them; or you can just allocate them statically or
89on the stack if that's more convenient.
90.SS "Hashing data"
91The function
92.B unihash_hash
93takes as input:
94.TP
95.BI "const unihash_info *" i
96A pointer to the precomputed tables for a key.
97.TP
98.BI "uint32 " a
99An accumulator value. This should be
100.BI UNIHASH_INIT( i )
101for the first chunk of a multi-chunk input, or the result of the
102previous
103.B unihash_hash
104call for subsequent chunks.
105.TP
106.BI "const void *" p
107A pointer to the start of a buffer containing this chunk of data.
108.TP
109.BI "size_t " sz
110The length of the chunk.
111.PP
112The function returns a new accumulator value, which is also the hash of
113the data so far. So, to hash multiple chunks of data, do something like
114.VS
115uint32 a = UNIHASH_INIT(i);
116a = unihash_hash(i, a, p_0, sz_0);
117a = unihash_hash(i, a, p_1, sz_1);
118/* ... */
119a = unihash_hash(i, a, p_n, sz_n);
120.VE
121The macro
122.B UNIHASH
123and function
124.B unihash
125are convenient interfaces to
126.B unihash_hash
127if you only wanted to hash one chunk.
128.SS "Theoretical issues"
129The hash function implemented by
130.B unihash
131is
132.RI ( l \ +\ 1)/2\*(ss32\*(se-almost
133XOR-universal, where
134.I l
135is the length (in bytes) of the longest string you hash. That means
136that, for any pair of strings
137.I x
138and
139.I y
140and any 32-bit value \*(*d, the probability taken over all choices of the
141key
142.I k
143that
144.IR H\*(usk\*(ue ( x )\ \c
145.BR xor \c
146.RI \ H\*(usk\*(ue ( y )\ =\ \*(*d
147is no greater than
148.RI ( l \ +\ 1)/2\*(ss32\*(se.
149.PP
150This fact is proven in the header file, but it requires more
151sophisticated typesetting than is available here.
152.PP
153The function evaluates a polynomial over GF(2\*(ss32\*(se) whose
154coefficients are the bytes of the message and whose variable is the key.
155Details are given in the header file.
156.PP
157For best results, you should choose the key as a random 32-bit number
158each time your program starts. Choosing a different key for different
159hashtables isn't necessary. It's probably a good idea to avoid the keys
1600 and 1. This raises the collision bound to
161.RI ( l \ +\ 1)/(2\*(ss32\*(se\ \-\ 2)
162(which isn't a significant increase) but eliminates keys for which the
163hash's behaviour is particularly poor.
164.PP
165In tests,
166.B unihash
167actually performed better than
168.BR crc32 ,
169so if you want to just use it as a fast-ish hash with good statistical
170properties, choose some fixed key
171.IR k \ \*(>=\ 2.
172.PP
173We emphasize that the proof of this function's collision behaviour is
174.I not
175dependent on any unproven assumptions (unlike many `proofs' of
176cryptographic security, which actually reduce the security of some
177construction to the security of its components). It's just a fact.
178.SH SEE ALSO
179.BR crc32 (3),
180.BR mLib (3).
181.SH AUTHOR
182Mark Wooding (mdw@nsict.org).