[mLib] / hash / unihash.3

.\" -*-nroff-*-
.de VS
.sp 1
.RS
.nf
.ft B
..
.de VE
.ft R
.fi
.RE
.sp 1
..
.de hP
.IP
.ft B
\h'-\w'\\$1\ 'u'\\$1\ \c
.ft P
..
.ie t \{\
.  ds ss \s8\u
.  ds se \d\s0
.  ds us \s8\d
.  ds ue \u\s0
.  ds *d \(*d
.  ds >= \(>=
.\}
.el \{\
.  ds ss ^
.  ds se
.  ds us _
.  ds ue
.  ds *d \fIdelta\fP
.  ds >= >=
.\}
.TH unihash 3 "5 July 2003" "Straylight/Edgeware" "mLib utilities library"
.SH NAME
unihash \- simple and efficient universal hashing for hashtables
.\" @unihash_setkey
.\" @UNIHASH_INIT
.\" @unihash_hash
.\" @UNIHASH
.\" @unihash
.SH SYNOPSIS
.nf
.B "#include <mLib/unihash.h>"
.PP
.B "typedef struct { ...\& } unihash_info;"
.PP
.B "unihash_info unihash_global;"
.PP
.BI "void unihash_setkey(unihash_info *" i ", uint32 " k );
.BI "uint32 UNIHASH_INIT(const unihash_info *" i );
.ta \w'\fBuint32 unihash_hash('u
.BI "uint32 unihash_hash(const unihash_info *" i ", uint32 " a ,
.BI "	const void *" p ", size_t " sz );
.BI "uint32 unihash(const unihash_info *" i ", const void *" p ", size_t " sz );
.BI "uint32 UNIHASH(const unihash_info *" i ", const void *" p ", size_t " sz );
.fi
.SH DESCRIPTION
The
.B unihash
system implements a simple and relatively efficient
.IR "universal hashing family" .
Using a such a universal hashing family means that it's provably
difficult for an adversary to choose input data whose hashes collide,
thus guaranteeing good average performance even on maliciously chosen
data.
.PP
Unlike, say,
.BR crc32 (3),
the
.B unihash
function is
.I keyed
\- in addition to the data to be hashed, the function takes as input a
32-bit key.  This key should be chosen at random each time the program
runs.
.SS "Preprocessing a key"
Before use, a key must be
.I preprocessed
into a large (16K) table which is used by the main hashing functions.
The preprocessing is done by
.BR unihash_setkey :
pass it a pointer to a
.B unihash_info
structure and the 32-bit key you've chosen, and it stores the table in
the structure.
.PP
Objects of type
.B unihash_info
don't contain any pointers to other data and are safe to free when
you've finished with them; or you can just allocate them statically or
on the stack if that's more convenient.
.SS "Hashing data"
The function
.B unihash_hash
takes as input:
.TP
.BI "const unihash_info *" i
A pointer to the precomputed tables for a key.
.TP
.BI "uint32 " a
An accumulator value.  This should be
.BI UNIHASH_INIT( i )
for the first chunk of a multi-chunk input, or the result of the
previous
.B unihash_hash
call for subsequent chunks.
.TP
.BI "const void *" p
A pointer to the start of a buffer containing this chunk of data.
.TP
.BI "size_t " sz
The length of the chunk.
.PP
The function returns a new accumulator value, which is also the hash of
the data so far. So, to hash multiple chunks of data, do something like
.VS
uint32 a = UNIHASH_INIT(i);
a = unihash_hash(i, a, p_0, sz_0);
a = unihash_hash(i, a, p_1, sz_1);
/* ...\& */
a = unihash_hash(i, a, p_n, sz_n);
.VE
The macro
.B UNIHASH
and function
.B unihash
are convenient interfaces to
.B unihash_hash
if you only wanted to hash one chunk.
.SS "Global hash info table"
There's no problem with using the same key for several purposes, as long
as it's secret from all of your adversaries.  Therefore, there is a
global
.B unihash_info
table define, called
.BR unihash_global .
This initially contains information for a fixed key which the author
chose at random, but if you need to you can set a different key into it
.I before
it gets used to hash any data (otherwise your hash tables will become
messed up).
.SS "Theoretical issues"
The hash function implemented by
.B unihash
is
.RI ( l \ +\ 1)/2\*(ss32\*(se-almost
XOR-universal, where
.I l
is the length (in bytes) of the longest string you hash.  That means
that, for any pair of strings
.I x
and
.I y
and any 32-bit value \*(*d, the probability taken over all choices of the
key
.I k
that
.IR H\*(usk\*(ue ( x )\  \c
.BR xor \c
.RI \  H\*(usk\*(ue ( y )\ =\ \*(*d
is no greater than
.RI ( l \ +\ 1)/2\*(ss32\*(se.
.PP
This fact is proven in the header file, but it requires more
sophisticated typesetting than is available here.
.PP
The function evaluates a polynomial over GF(2\*(ss32\*(se) whose
coefficients are the bytes of the message and whose variable is the key.
Details are given in the header file.
.PP
For best results, you should choose the key as a random 32-bit number
each time your program starts.  Choosing a different key for different
hashtables isn't necessary.  It's probably a good idea to avoid the keys
0 and 1.  This raises the collision bound to
.RI ( l \ +\ 1)/(2\*(ss32\*(se\ \-\ 2)
(which isn't a significant increase) but eliminates keys for which the
hash's behaviour is particularly poor.
.PP
In tests,
.B unihash
actually performed better than
.BR crc32 ,
so if you want to just use it as a fast-ish hash with good statistical
properties, choose some fixed key
.IR k \ \*(>=\ 2.
.PP
We emphasize that the proof of this function's collision behaviour is
.I not
dependent on any unproven assumptions (unlike many `proofs' of
cryptographic security, which actually reduce the security of some
construction to the security of its components).  It's just a fact.
.SH SEE ALSO
.BR unihash-mkstatic (3),
.BR crc32 (3),
.BR mLib (3).
.SH AUTHOR
Mark Wooding (mdw@distorted.org.uk).
Commit	Line	Data
8fe3c82b	1	.\" --nroff--
	2	.de VS
	3	.sp 1
	4	.RS
	5	.nf
	6	.ft B
	7	..
	8	.de VE
	9	.ft R
	10	.fi
	11	.RE
	12	.sp 1
	13	..
	14	.de hP
	15	.IP
	16	.ft B
	17	\h'-\w'\\$1\ 'u'\\$1\ \c
	18	.ft P
	19	..
	20	.ie t \{\
	21	. ds ss \s8\u
	22	. ds se \d\s0
	23	. ds us \s8\d
	24	. ds ue \u\s0
	25	. ds d \(d
	26	. ds >= \(>=
	27	.\}
	28	.el \{\
	29	. ds ss ^
	30	. ds se
	31	. ds us _
	32	. ds ue
	33	. ds *d \fIdelta\fP
	34	. ds >= >=
	35	.\}
	36	.TH unihash 3 "5 July 2003" "Straylight/Edgeware" "mLib utilities library"
	37	.SH NAME
	38	unihash \- simple and efficient universal hashing for hashtables
	39	.\" @unihash_setkey
	40	.\" @UNIHASH_INIT
	41	.\" @unihash_hash
	42	.\" @UNIHASH
	43	.\" @unihash
	44	.SH SYNOPSIS
	45	.nf
	46	.B "#include <mLib/unihash.h>"
d056fbdf	47	.PP
4729aa69	48	.B "typedef struct { ...\& } unihash_info;"
d056fbdf	49	.PP
6f444bda	50	.B "unihash_info unihash_global;"
d056fbdf	51	.PP
8fe3c82b	52	.BI "void unihash_setkey(unihash_info *" i ", uint32 " k );
8fe3c82b	53	.BI "uint32 UNIHASH_INIT(const unihash_info *" i );
adec5584	54	.ta \w'\fBuint32 unihash_hash('u
3bc42912	55	.BI "uint32 unihash_hash(const unihash_info *" i ", uint32 " a ,
adec5584	56	.BI " const void *" p ", size_t " sz );
8fe3c82b	57	.BI "uint32 unihash(const unihash_info " i ", const void " p ", size_t " sz );
	58	.BI "uint32 UNIHASH(const unihash_info " i ", const void " p ", size_t " sz );
	59	.fi
	60	.SH DESCRIPTION
	61	The
	62	.B unihash
	63	system implements a simple and relatively efficient
	64	.IR "universal hashing family" .
	65	Using a such a universal hashing family means that it's provably
	66	difficult for an adversary to choose input data whose hashes collide,
	67	thus guaranteeing good average performance even on maliciously chosen
	68	data.
	69	.PP
	70	Unlike, say,
	71	.BR crc32 (3),
	72	the
	73	.B unihash
	74	function is
	75	.I keyed
	76	\- in addition to the data to be hashed, the function takes as input a
	77	32-bit key. This key should be chosen at random each time the program
	78	runs.
	79	.SS "Preprocessing a key"
	80	Before use, a key must be
	81	.I preprocessed
	82	into a large (16K) table which is used by the main hashing functions.
	83	The preprocessing is done by
	84	.BR unihash_setkey :
	85	pass it a pointer to a
	86	.B unihash_info
	87	structure and the 32-bit key you've chosen, and it stores the table in
	88	the structure.
	89	.PP
	90	Objects of type
	91	.B unihash_info
	92	don't contain any pointers to other data and are safe to free when
	93	you've finished with them; or you can just allocate them statically or
	94	on the stack if that's more convenient.
	95	.SS "Hashing data"
	96	The function
	97	.B unihash_hash
	98	takes as input:
	99	.TP
	100	.BI "const unihash_info *" i
	101	A pointer to the precomputed tables for a key.
	102	.TP
d4efbcd9	103	.BI "uint32 " a
8fe3c82b	104	An accumulator value. This should be
	105	.BI UNIHASH_INIT( i )
	106	for the first chunk of a multi-chunk input, or the result of the
	107	previous
	108	.B unihash_hash
	109	call for subsequent chunks.
	110	.TP
	111	.BI "const void *" p
	112	A pointer to the start of a buffer containing this chunk of data.
	113	.TP
	114	.BI "size_t " sz
	115	The length of the chunk.
	116	.PP
	117	The function returns a new accumulator value, which is also the hash of
	118	the data so far. So, to hash multiple chunks of data, do something like
	119	.VS
	120	uint32 a = UNIHASH_INIT(i);
	121	a = unihash_hash(i, a, p_0, sz_0);
	122	a = unihash_hash(i, a, p_1, sz_1);
adec5584	123	/* ...\& */
8fe3c82b	124	a = unihash_hash(i, a, p_n, sz_n);
	125	.VE
	126	The macro
	127	.B UNIHASH
	128	and function
	129	.B unihash
	130	are convenient interfaces to
	131	.B unihash_hash
	132	if you only wanted to hash one chunk.
6f444bda	133	.SS "Global hash info table"
	134	There's no problem with using the same key for several purposes, as long
	135	as it's secret from all of your adversaries. Therefore, there is a
	136	global
	137	.B unihash_info
	138	table define, called
	139	.BR unihash_global .
	140	This initially contains information for a fixed key which the author
	141	chose at random, but if you need to you can set a different key into it
	142	.I before
	143	it gets used to hash any data (otherwise your hash tables will become
	144	messed up).
8fe3c82b	145	.SS "Theoretical issues"
	146	The hash function implemented by
	147	.B unihash
	148	is
	149	.RI ( l \ +\ 1)/2\(ss32\(se-almost
	150	XOR-universal, where
	151	.I l
	152	is the length (in bytes) of the longest string you hash. That means
	153	that, for any pair of strings
	154	.I x
	155	and
	156	.I y
	157	and any 32-bit value \(d, the probability taken over all choices of the
	158	key
	159	.I k
	160	that
	161	.IR H\(usk\(ue ( x )\ \c
	162	.BR xor \c
	163	.RI \ H\(usk\(ue ( y )\ =\ \(d
	164	is no greater than
	165	.RI ( l \ +\ 1)/2\(ss32\(se.
	166	.PP
	167	This fact is proven in the header file, but it requires more
	168	sophisticated typesetting than is available here.
	169	.PP
	170	The function evaluates a polynomial over GF(2\(ss32\(se) whose
	171	coefficients are the bytes of the message and whose variable is the key.
	172	Details are given in the header file.
	173	.PP
	174	For best results, you should choose the key as a random 32-bit number
	175	each time your program starts. Choosing a different key for different
	176	hashtables isn't necessary. It's probably a good idea to avoid the keys
	177	0 and 1. This raises the collision bound to
	178	.RI ( l \ +\ 1)/(2\(ss32\(se\ \-\ 2)
	179	(which isn't a significant increase) but eliminates keys for which the
	180	hash's behaviour is particularly poor.
	181	.PP
	182	In tests,
	183	.B unihash
	184	actually performed better than
	185	.BR crc32 ,
	186	so if you want to just use it as a fast-ish hash with good statistical
	187	properties, choose some fixed key
	188	.IR k \ \*(>=\ 2.
	189	.PP
	190	We emphasize that the proof of this function's collision behaviour is
	191	.I not
	192	dependent on any unproven assumptions (unlike many `proofs' of
	193	cryptographic security, which actually reduce the security of some
	194	construction to the security of its components). It's just a fact.
	195	.SH SEE ALSO
6f444bda	196	.BR unihash-mkstatic (3),
8fe3c82b	197	.BR crc32 (3),
	198	.BR mLib (3).
	199	.SH AUTHOR
9b5ac6ff	200	Mark Wooding (mdw@distorted.org.uk).