chiark - git - mdw - mLib/blob - hash/unihash.3

   1 .\" -*-nroff-*-
   2 .de VS
   3 .sp 1
   4 .RS
   5 .nf
   6 .ft B
   7 ..
   8 .de VE
   9 .ft R
  10 .fi
  11 .RE
  12 .sp 1
  13 ..
  14 .de hP
  15 .IP
  16 .ft B
  17 \h'-\w'\\$1\ 'u'\\$1\ \c
  18 .ft P
  19 ..
  20 .ie t \{\
  21 .  ds ss \s8\u
  22 .  ds se \d\s0
  23 .  ds us \s8\d
  24 .  ds ue \u\s0
  25 .  ds *d \(*d
  26 .  ds >= \(>=
  27 .\}
  28 .el \{\
  29 .  ds ss ^
  30 .  ds se
  31 .  ds us _
  32 .  ds ue
  33 .  ds *d \fIdelta\fP
  34 .  ds >= >=
  35 .\}
  36 .TH unihash 3 "5 July 2003" "Straylight/Edgeware" "mLib utilities library"
  37 .SH NAME
  38 unihash \- simple and efficient universal hashing for hashtables
  39 .\" @unihash_setkey
  40 .\" @UNIHASH_INIT
  41 .\" @unihash_hash
  42 .\" @UNIHASH
  43 .\" @unihash
  44 .SH SYNOPSIS
  45 .nf
  46 .B "#include <mLib/unihash.h>"
  47 .PP
  48 .B "typedef struct { ...\& } unihash_info;"
  49 .PP
  50 .B "unihash_info unihash_global;"
  51 .PP
  52 .BI "void unihash_setkey(unihash_info *" i ", uint32 " k );
  53 .BI "uint32 UNIHASH_INIT(const unihash_info *" i );
  54 .ta \w'\fBuint32 unihash_hash('u
  55 .BI "uint32 unihash_hash(const unihash_info *" i ", uint32 " a ,
  56 .BI "   const void *" p ", size_t " sz );
  57 .BI "uint32 unihash(const unihash_info *" i ", const void *" p ", size_t " sz );
  58 .BI "uint32 UNIHASH(const unihash_info *" i ", const void *" p ", size_t " sz );
  59 .fi
  60 .SH DESCRIPTION
  61 The
  62 .B unihash
  63 system implements a simple and relatively efficient
  64 .IR "universal hashing family" .
  65 Using a such a universal hashing family means that it's provably
  66 difficult for an adversary to choose input data whose hashes collide,
  67 thus guaranteeing good average performance even on maliciously chosen
  68 data.
  69 .PP
  70 Unlike, say,
  71 .BR crc32 (3),
  72 the
  73 .B unihash
  74 function is
  75 .I keyed
  76 \- in addition to the data to be hashed, the function takes as input a
  77 32-bit key.  This key should be chosen at random each time the program
  78 runs.
  79 .SS "Preprocessing a key"
  80 Before use, a key must be
  81 .I preprocessed
  82 into a large (16K) table which is used by the main hashing functions.
  83 The preprocessing is done by
  84 .BR unihash_setkey :
  85 pass it a pointer to a
  86 .B unihash_info
  87 structure and the 32-bit key you've chosen, and it stores the table in
  88 the structure.
  89 .PP
  90 Objects of type
  91 .B unihash_info
  92 don't contain any pointers to other data and are safe to free when
  93 you've finished with them; or you can just allocate them statically or
  94 on the stack if that's more convenient.
  95 .SS "Hashing data"
  96 The function
  97 .B unihash_hash
  98 takes as input:
  99 .TP
 100 .BI "const unihash_info *" i
 101 A pointer to the precomputed tables for a key.
 102 .TP
 103 .BI "uint32 " a
 104 An accumulator value.  This should be
 105 .BI UNIHASH_INIT( i )
 106 for the first chunk of a multi-chunk input, or the result of the
 107 previous
 108 .B unihash_hash
 109 call for subsequent chunks.
 110 .TP
 111 .BI "const void *" p
 112 A pointer to the start of a buffer containing this chunk of data.
 113 .TP
 114 .BI "size_t " sz
 115 The length of the chunk.
 116 .PP
 117 The function returns a new accumulator value, which is also the hash of
 118 the data so far. So, to hash multiple chunks of data, do something like
 119 .VS
 120 uint32 a = UNIHASH_INIT(i);
 121 a = unihash_hash(i, a, p_0, sz_0);
 122 a = unihash_hash(i, a, p_1, sz_1);
 123 /* ...\& */
 124 a = unihash_hash(i, a, p_n, sz_n);
 125 .VE
 126 The macro
 127 .B UNIHASH
 128 and function
 129 .B unihash
 130 are convenient interfaces to
 131 .B unihash_hash
 132 if you only wanted to hash one chunk.
 133 .SS "Global hash info table"
 134 There's no problem with using the same key for several purposes, as long
 135 as it's secret from all of your adversaries.  Therefore, there is a
 136 global
 137 .B unihash_info
 138 table define, called
 139 .BR unihash_global .
 140 This initially contains information for a fixed key which the author
 141 chose at random, but if you need to you can set a different key into it
 142 .I before
 143 it gets used to hash any data (otherwise your hash tables will become
 144 messed up).
 145 .SS "Theoretical issues"
 146 The hash function implemented by
 147 .B unihash
 148 is
 149 .RI ( l \ +\ 1)/2\*(ss32\*(se-almost
 150 XOR-universal, where
 151 .I l
 152 is the length (in bytes) of the longest string you hash.  That means
 153 that, for any pair of strings
 154 .I x
 155 and
 156 .I y
 157 and any 32-bit value \*(*d, the probability taken over all choices of the
 158 key
 159 .I k
 160 that
 161 .IR H\*(usk\*(ue ( x )\  \c
 162 .BR xor \c
 163 .RI \  H\*(usk\*(ue ( y )\ =\ \*(*d
 164 is no greater than
 165 .RI ( l \ +\ 1)/2\*(ss32\*(se.
 166 .PP
 167 This fact is proven in the header file, but it requires more
 168 sophisticated typesetting than is available here.
 169 .PP
 170 The function evaluates a polynomial over GF(2\*(ss32\*(se) whose
 171 coefficients are the bytes of the message and whose variable is the key.
 172 Details are given in the header file.
 173 .PP
 174 For best results, you should choose the key as a random 32-bit number
 175 each time your program starts.  Choosing a different key for different
 176 hashtables isn't necessary.  It's probably a good idea to avoid the keys
 177 0 and 1.  This raises the collision bound to
 178 .RI ( l \ +\ 1)/(2\*(ss32\*(se\ \-\ 2)
 179 (which isn't a significant increase) but eliminates keys for which the
 180 hash's behaviour is particularly poor.
 181 .PP
 182 In tests,
 183 .B unihash
 184 actually performed better than
 185 .BR crc32 ,
 186 so if you want to just use it as a fast-ish hash with good statistical
 187 properties, choose some fixed key
 188 .IR k \ \*(>=\ 2.
 189 .PP
 190 We emphasize that the proof of this function's collision behaviour is
 191 .I not
 192 dependent on any unproven assumptions (unlike many `proofs' of
 193 cryptographic security, which actually reduce the security of some
 194 construction to the security of its components).  It's just a fact.
 195 .SH SEE ALSO
 196 .BR unihash-mkstatic (3),
 197 .BR crc32 (3),
 198 .BR mLib (3).
 199 .SH AUTHOR
 200 Mark Wooding (mdw@distorted.org.uk).