.\" -*-nroff-*-
.de VS
.sp 1
.RS
.nf
.ft B
..
.de VE
.ft R
.fi
.RE
.sp 1
..
.de hP
.IP
.ft B
\h'-\w'\\$1\ 'u'\\$1\ \c
.ft P
..
.ie t \{\
.  ds ss \s8\u
.  ds se \d\s0
.  ds us \s8\d
.  ds ue \u\s0
.  ds *d \(*d
.  ds >= \(>=
.\}
.el \{\
.  ds ss ^
.  ds se
.  ds us _
.  ds ue
.  ds *d \fIdelta\fP
.  ds >= >=
.\}
.TH unihash 3 "5 July 2003" "Straylight/Edgeware" "mLib utilities library"
.SH NAME
unihash \- simple and efficient universal hashing for hashtables
.\" @unihash_setkey
.\" @UNIHASH_INIT
.\" @unihash_hash
.\" @UNIHASH
.\" @unihash
.SH SYNOPSIS
.nf
.B "#include <mLib/unihash.h>"

.BI "void unihash_setkey(unihash_info *" i ", uint32 " k );
.BI "uint32 UNIHASH_INIT(const unihash_info *" i );
.BI "uint32 unihash_hash(const unihash_info *" i ", uint32 " a ,
.BI "                    const void *" p ", size_t " sz );
.BI "uint32 unihash(const unihash_info *" i ", const void *" p ", size_t " sz );
.BI "uint32 UNIHASH(const unihash_info *" i ", const void *" p ", size_t " sz );
.fi
.SH DESCRIPTION
The
.B unihash
system implements a simple and relatively efficient
.IR "universal hashing family" .
Using such a universal hashing family means that it's provably
difficult for an adversary to choose input data whose hashes collide,
thus guaranteeing good average performance even on maliciously chosen
data.
.PP
Unlike, say,
.BR crc32 (3),
the
.B unihash
function is
.I keyed
\- in addition to the data to be hashed, the function takes as input a
32-bit key.  This key should be chosen at random each time the program
runs.
.SS "Preprocessing a key"
Before use, a key must be
.I preprocessed
into a large (16K) table which is used by the main hashing functions.
The preprocessing is done by
.BR unihash_setkey :
pass it a pointer to a
.B unihash_info
structure and the 32-bit key you've chosen, and it stores the table in
the structure.
.PP
Objects of type
.B unihash_info
don't contain any pointers to other data and are safe to free when
you've finished with them; or you can just allocate them statically or
on the stack if that's more convenient.
.SS "Hashing data"
The function
.B unihash_hash
takes as input:
.TP
.BI "const unihash_info *" i
A pointer to the precomputed tables for a key.
.TP
.BI "uint32 " a 
An accumulator value.  This should be
.BI UNIHASH_INIT( i )
for the first chunk of a multi-chunk input, or the result of the
previous
.B unihash_hash
call for subsequent chunks.
.TP
.BI "const void *" p
A pointer to the start of a buffer containing this chunk of data.
.TP
.BI "size_t " sz
The length of the chunk.
.PP
The function returns a new accumulator value, which is also the hash of
the data so far. So, to hash multiple chunks of data, do something like
.VS
uint32 a = UNIHASH_INIT(i);
a = unihash_hash(i, a, p_0, sz_0);
a = unihash_hash(i, a, p_1, sz_1);
/* ... */
a = unihash_hash(i, a, p_n, sz_n);
.VE
The macro
.B UNIHASH
and function
.B unihash
are convenient interfaces to
.B unihash_hash
if you only want to hash one chunk.
.SS "Theoretical issues"
The hash function implemented by
.B unihash
is
.RI ( l \ +\ 1)/2\*(ss32\*(se-almost
XOR-universal, where
.I l
is the length (in bytes) of the longest string you hash.  That means
that, for any pair of strings
.I x
and
.I y
and any 32-bit value \*(*d, the probability taken over all choices of the
key
.I k
that
.IR H\*(usk\*(ue ( x )\  \c
.BR xor \c
.RI \  H\*(usk\*(ue ( y )\ =\ \*(*d
is no greater than
.RI ( l \ +\ 1)/2\*(ss32\*(se.
.PP
This fact is proven in the header file, but it requires more
sophisticated typesetting than is available here.
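.PP
As a rough worked example of the bound above, the following C sketch
computes
.RI ( l \ +\ 1)/2\*(ss32\*(se
for a given maximum string length.  The name
.B collision_bound
is hypothetical and is not part of mLib.
.VS
```c
#include <stddef.h>

/* Upper bound on the probability, over a random key, that the
 * hashes of two given strings of at most l bytes XOR to a given
 * value.  Hypothetical helper, not part of mLib. */
static double collision_bound(size_t l)
{
  return ((double)l + 1.0) / 4294967296.0;  /* (l + 1)/2^32 */
}
```
.VE
So strings of up to 1023 bytes give a bound of 1024/2\*(ss32\*(se,
which is about one in four million.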
.PP
The function evaluates a polynomial over GF(2\*(ss32\*(se) whose
coefficients are the bytes of the message and whose variable is the key.
Details are given in the header file.
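.PP
To give a feel for the construction, here is a toy model of polynomial
evaluation over GF(2\*(ss32\*(se) by Horner's rule.  It is a sketch
under assumptions and not the library's implementation: the names
.BR toy_mul ,
.B toy_hash
and
.BR toy_demo ,
the reduction polynomial (borrowed from CRC-32), the zero initial
accumulator and the order in which bytes are folded in are all chosen
purely for illustration.  The real functions use the precomputed table
described above.
.VS
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: reduction polynomial borrowed from CRC-32 for the
 * sake of the example; see the header file for the real one. */
#define TOY_POLY 0x04c11db7u

/* Multiply a by b in the toy field: shift and add, reducing
 * modulo TOY_POLY whenever the top bit falls off. */
static uint32_t toy_mul(uint32_t a, uint32_t b)
{
  uint32_t r = 0;

  while (b) {
    if (b & 1) r ^= a;
    a = (a << 1) ^ ((a & 0x80000000u) ? TOY_POLY : 0);
    b >>= 1;
  }
  return r;
}

/* Horner's rule: fold each byte into the accumulator a, then
 * multiply by the key k.  Returning the accumulator lets calls
 * be chained across chunks. */
static uint32_t toy_hash(uint32_t k, uint32_t a,
                         const void *p, size_t sz)
{
  const unsigned char *q = p;

  while (sz--) a = toy_mul(a ^ *q++, k);
  return a;
}

/* Usage: hashing in two chunks matches hashing in one go. */
static void toy_demo(uint32_t k)
{
  uint32_t whole = toy_hash(k, 0, "hello, world", 12);
  uint32_t part = toy_hash(k, 0, "hello, ", 7);

  part = toy_hash(k, part, "world", 5);
  assert(part == whole);
}
```
.VE
In this toy model the key 0 sends every nonempty string to 0, and the
key 1 reduces the hash to the XOR of the message bytes, so that
reordering the bytes leaves it unchanged.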
.PP
For best results, you should choose the key as a random 32-bit number
each time your program starts.  Choosing a different key for different
hashtables isn't necessary.  It's probably a good idea to avoid the keys
0 and 1.  This raises the collision bound to
.RI ( l \ +\ 1)/(2\*(ss32\*(se\ \-\ 2)
(which isn't a significant increase) but eliminates keys for which the
hash's behaviour is particularly poor.
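.PP
One way to follow this advice is sketched below.  The function name
.B pick_key
is hypothetical, and reading
.B /dev/urandom
is an assumption about the host system; the fallback to a time-seeded
.BR rand (3)
is much weaker and is only there so that the sketch always returns.
.VS
```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Draw a 32-bit key, retrying until it is neither 0 nor 1.
 * Hypothetical helper, not part of mLib; uint32_t stands in for
 * the library's uint32 type. */
static uint32_t pick_key(void)
{
  FILE *fp = fopen("/dev/urandom", "rb");  /* ASSUMPTION: exists */
  int seeded = 0;
  uint32_t k;

  do {
    if (!fp || fread(&k, sizeof(k), 1, fp) != 1) {
      /* Weak fallback: seed rand() once from the clock. */
      if (!seeded) { srand((unsigned)time(NULL)); seeded = 1; }
      k = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
    }
  } while (k < 2);
  if (fp) fclose(fp);
  return k;
}
```
.VE
The result can then be passed to
.B unihash_setkey
as the key.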
.PP
In tests,
.B unihash
actually performed better than
.BR crc32 ,
so if you want to just use it as a fast-ish hash with good statistical
properties, choose some fixed key
.IR k \ \*(>=\ 2.
.PP
We emphasize that the proof of this function's collision behaviour is
.I not
dependent on any unproven assumptions (unlike many `proofs' of
cryptographic security, which actually reduce the security of some
construction to the security of its components).  It's just a fact.
.SH SEE ALSO
.BR crc32 (3),
.BR mLib (3).
.SH AUTHOR
Mark Wooding (mdw@nsict.org).