chiark - git - mdw - mLib/blob - hash/unihash.3

   1 .\" -*-nroff-*-
   2 .de VS
   3 .sp 1
   4 .RS
   5 .nf
   6 .ft B
   7 ..
   8 .de VE
   9 .ft R
  10 .fi
  11 .RE
  12 .sp 1
  13 ..
  14 .de hP
  15 .IP
  16 .ft B
  17 \h'-\w'\\$1\ 'u'\\$1\ \c
  18 .ft P
  19 ..
  20 .ie t \{\
  21 .  ds ss \s8\u
  22 .  ds se \d\s0
  23 .  ds us \s8\d
  24 .  ds ue \u\s0
  25 .  ds *d \(*d
  26 .  ds >= \(>=
  27 .\}
  28 .el \{\
  29 .  ds ss ^
  30 .  ds se
  31 .  ds us _
  32 .  ds ue
  33 .  ds *d \fIdelta\fP
  34 .  ds >= >=
  35 .\}
  36 .TH unihash 3 "5 July 2003" "Straylight/Edgeware" "mLib utilities library"
  37 .SH NAME
  38 unihash \- simple and efficient universal hashing for hashtables
  39 .\" @unihash_setkey
  40 .\" @UNIHASH_INIT
  41 .\" @unihash_hash
  42 .\" @UNIHASH
  43 .\" @unihash
  44 .SH SYNOPSIS
  45 .nf
  46 .B "#include <mLib/unihash.h>"
  47
  48 .B "typedef struct { ...\& } unihash_info;"
  49
  50 .B "unihash_info unihash_global;"
  51
  52 .BI "void unihash_setkey(unihash_info *" i ", uint32 " k );
  53 .BI "uint32 UNIHASH_INIT(const unihash_info *" i );
  54 .BI "uint32 unihash_hash(const unihash_info *" i ", uint32 " a ,
  55 .BI "                    const void *" p ", size_t " sz );
  56 .BI "uint32 unihash(const unihash_info *" i ", const void *" p ", size_t " sz );
  57 .BI "uint32 UNIHASH(const unihash_info *" i ", const void *" p ", size_t " sz );
  58 .fi
  59 .SH DESCRIPTION
  60 The
  61 .B unihash
  62 system implements a simple and relatively efficient
  63 .IR "universal hashing family" .
  64 Using a such a universal hashing family means that it's provably
  65 difficult for an adversary to choose input data whose hashes collide,
  66 thus guaranteeing good average performance even on maliciously chosen
  67 data.
  68 .PP
  69 Unlike, say,
  70 .BR crc32 (3),
  71 the
  72 .B unihash
  73 function is
  74 .I keyed
  75 \- in addition to the data to be hashed, the function takes as input a
  76 32-bit key.  This key should be chosen at random each time the program
  77 runs.
  78 .SS "Preprocessing a key"
  79 Before use, a key must be
  80 .I preprocessed
  81 into a large (16K) table which is used by the main hashing functions.
  82 The preprocessing is done by
  83 .BR unihash_setkey :
  84 pass it a pointer to a
  85 .B unihash_info
  86 structure and the 32-bit key you've chosen, and it stores the table in
  87 the structure.
  88 .PP
  89 Objects of type
  90 .B unihash_info
  91 don't contain any pointers to other data and are safe to free when
  92 you've finished with them; or you can just allocate them statically or
  93 on the stack if that's more convenient.
  94 .SS "Hashing data"
  95 The function
  96 .B unihash_hash
  97 takes as input:
  98 .TP
  99 .BI "const unihash_info *" i
 100 A pointer to the precomputed tables for a key.
 101 .TP
 102 .BI "uint32 " a
 103 An accumulator value.  This should be
 104 .BI UNIHASH_INIT( i )
 105 for the first chunk of a multi-chunk input, or the result of the
 106 previous
 107 .B unihash_hash
 108 call for subsequent chunks.
 109 .TP
 110 .BI "const void *" p
 111 A pointer to the start of a buffer containing this chunk of data.
 112 .TP
 113 .BI "size_t " sz
 114 The length of the chunk.
 115 .PP
 116 The function returns a new accumulator value, which is also the hash of
 117 the data so far. So, to hash multiple chunks of data, do something like
 118 .VS
 119 uint32 a = UNIHASH_INIT(i);
 120 a = unihash_hash(i, a, p_0, sz_0);
 121 a = unihash_hash(i, a, p_1, sz_1);
 122 /* ... */
 123 a = unihash_hash(i, a, p_n, sz_n);
 124 .VE
 125 The macro
 126 .B UNIHASH
 127 and function
 128 .B unihash
 129 are convenient interfaces to
 130 .B unihash_hash
 131 if you only wanted to hash one chunk.
 132 .SS "Global hash info table"
 133 There's no problem with using the same key for several purposes, as long
 134 as it's secret from all of your adversaries.  Therefore, there is a
 135 global
 136 .B unihash_info
 137 table define, called
 138 .BR unihash_global .
 139 This initially contains information for a fixed key which the author
 140 chose at random, but if you need to you can set a different key into it
 141 .I before
 142 it gets used to hash any data (otherwise your hash tables will become
 143 messed up).
 144 .SS "Theoretical issues"
 145 The hash function implemented by
 146 .B unihash
 147 is
 148 .RI ( l \ +\ 1)/2\*(ss32\*(se-almost
 149 XOR-universal, where
 150 .I l
 151 is the length (in bytes) of the longest string you hash.  That means
 152 that, for any pair of strings
 153 .I x
 154 and
 155 .I y
 156 and any 32-bit value \*(*d, the probability taken over all choices of the
 157 key
 158 .I k
 159 that
 160 .IR H\*(usk\*(ue ( x )\  \c
 161 .BR xor \c
 162 .RI \  H\*(usk\*(ue ( y )\ =\ \*(*d
 163 is no greater than
 164 .RI ( l \ +\ 1)/2\*(ss32\*(se.
 165 .PP
 166 This fact is proven in the header file, but it requires more
 167 sophisticated typesetting than is available here.
 168 .PP
 169 The function evaluates a polynomial over GF(2\*(ss32\*(se) whose
 170 coefficients are the bytes of the message and whose variable is the key.
 171 Details are given in the header file.
 172 .PP
 173 For best results, you should choose the key as a random 32-bit number
 174 each time your program starts.  Choosing a different key for different
 175 hashtables isn't necessary.  It's probably a good idea to avoid the keys
 176 0 and 1.  This raises the collision bound to
 177 .RI ( l \ +\ 1)/(2\*(ss32\*(se\ \-\ 2)
 178 (which isn't a significant increase) but eliminates keys for which the
 179 hash's behaviour is particularly poor.
 180 .PP
 181 In tests,
 182 .B unihash
 183 actually performed better than
 184 .BR crc32 ,
 185 so if you want to just use it as a fast-ish hash with good statistical
 186 properties, choose some fixed key
 187 .IR k \ \*(>=\ 2.
 188 .PP
 189 We emphasize that the proof of this function's collision behaviour is
 190 .I not
 191 dependent on any unproven assumptions (unlike many `proofs' of
 192 cryptographic security, which actually reduce the security of some
 193 construction to the security of its components).  It's just a fact.
 194 .SH SEE ALSO
 195 .BR unihash-mkstatic (3),
 196 .BR crc32 (3),
 197 .BR mLib (3).
 198 .SH AUTHOR
 199 Mark Wooding (mdw@distorted.org.uk).