chiark - git - mdw - mLib/blob - man/unihash.3

   1 .\" -*-nroff-*-
   2 .de VS
   3 .sp 1
   4 .RS
   5 .nf
   6 .ft B
   7 ..
   8 .de VE
   9 .ft R
  10 .fi
  11 .RE
  12 .sp 1
  13 ..
  14 .de hP
  15 .IP
  16 .ft B
  17 \h'-\w'\\$1\ 'u'\\$1\ \c
  18 .ft P
  19 ..
  20 .ie t \{\
  21 .  ds ss \s8\u
  22 .  ds se \d\s0
  23 .  ds us \s8\d
  24 .  ds ue \u\s0
  25 .  ds *d \(*d
  26 .  ds >= \(>=
  27 .\}
  28 .el \{\
  29 .  ds ss ^
  30 .  ds se
  31 .  ds us _
  32 .  ds ue
  33 .  ds *d \fIdelta\fP
  34 .  ds >= >=
  35 .\}
  36 .TH unihash 3 "5 July 2003" "Straylight/Edgeware" "mLib utilities library"
  37 .SH NAME
  38 unihash \- simple and efficient universal hashing for hashtables
  39 .\" @unihash_setkey
  40 .\" @UNIHASH_INIT
  41 .\" @unihash_hash
  42 .\" @UNIHASH
  43 .\" @unihash
  44 .SH SYNOPSIS
  45 .nf
  46 .B "#include <mLib/unihash.h>"
  47
  48 .BI "void unihash_setkey(unihash_info *" i ", uint32 " k );
  49 .BI "uint32 UNIHASH_INIT(const unihash_info *" i );
  50 .BI "void unihash_hash(const unihash_info *" st ", uint32 " a ,
  51 .BI "                  const void *" p ", size_t " sz );
  52 .BI "uint32 unihash(const unihash_info *" i ", const void *" p ", size_t " sz );
  53 .BI "uint32 UNIHASH(const unihash_info *" i ", const void *" p ", size_t " sz );
  54 .fi
  55 .SH DESCRIPTION
  56 The
  57 .B unihash
  58 system implements a simple and relatively efficient
  59 .IR "universal hashing family" .
  60 Using a such a universal hashing family means that it's provably
  61 difficult for an adversary to choose input data whose hashes collide,
  62 thus guaranteeing good average performance even on maliciously chosen
  63 data.
  64 .PP
  65 Unlike, say,
  66 .BR crc32 (3),
  67 the
  68 .B unihash
  69 function is
  70 .I keyed
  71 \- in addition to the data to be hashed, the function takes as input a
  72 32-bit key.  This key should be chosen at random each time the program
  73 runs.
  74 .SS "Preprocessing a key"
  75 Before use, a key must be
  76 .I preprocessed
  77 into a large (16K) table which is used by the main hashing functions.
  78 The preprocessing is done by
  79 .BR unihash_setkey :
  80 pass it a pointer to a
  81 .B unihash_info
  82 structure and the 32-bit key you've chosen, and it stores the table in
  83 the structure.
  84 .PP
  85 Objects of type
  86 .B unihash_info
  87 don't contain any pointers to other data and are safe to free when
  88 you've finished with them; or you can just allocate them statically or
  89 on the stack if that's more convenient.
  90 .SS "Hashing data"
  91 The function
  92 .B unihash_hash
  93 takes as input:
  94 .TP
  95 .BI "const unihash_info *" i
  96 A pointer to the precomputed tables for a key.
  97 .TP
  98 .BI "uint32 " a
  99 An accumulator value.  This should be
 100 .BI UNIHASH_INIT( i )
 101 for the first chunk of a multi-chunk input, or the result of the
 102 previous
 103 .B unihash_hash
 104 call for subsequent chunks.
 105 .TP
 106 .BI "const void *" p
 107 A pointer to the start of a buffer containing this chunk of data.
 108 .TP
 109 .BI "size_t " sz
 110 The length of the chunk.
 111 .PP
 112 The function returns a new accumulator value, which is also the hash of
 113 the data so far. So, to hash multiple chunks of data, do something like
 114 .VS
 115 uint32 a = UNIHASH_INIT(i);
 116 a = unihash_hash(i, a, p_0, sz_0);
 117 a = unihash_hash(i, a, p_1, sz_1);
 118 /* ... */
 119 a = unihash_hash(i, a, p_n, sz_n);
 120 .VE
 121 The macro
 122 .B UNIHASH
 123 and function
 124 .B unihash
 125 are convenient interfaces to
 126 .B unihash_hash
 127 if you only wanted to hash one chunk.
 128 .SS "Theoretical issues"
 129 The hash function implemented by
 130 .B unihash
 131 is
 132 .RI ( l \ +\ 1)/2\*(ss32\*(se-almost
 133 XOR-universal, where
 134 .I l
 135 is the length (in bytes) of the longest string you hash.  That means
 136 that, for any pair of strings
 137 .I x
 138 and
 139 .I y
 140 and any 32-bit value \*(*d, the probability taken over all choices of the
 141 key
 142 .I k
 143 that
 144 .IR H\*(usk\*(ue ( x )\  \c
 145 .BR xor \c
 146 .RI \  H\*(usk\*(ue ( y )\ =\ \*(*d
 147 is no greater than
 148 .RI ( l \ +\ 1)/2\*(ss32\*(se.
 149 .PP
 150 This fact is proven in the header file, but it requires more
 151 sophisticated typesetting than is available here.
 152 .PP
 153 The function evaluates a polynomial over GF(2\*(ss32\*(se) whose
 154 coefficients are the bytes of the message and whose variable is the key.
 155 Details are given in the header file.
 156 .PP
 157 For best results, you should choose the key as a random 32-bit number
 158 each time your program starts.  Choosing a different key for different
 159 hashtables isn't necessary.  It's probably a good idea to avoid the keys
 160 0 and 1.  This raises the collision bound to
 161 .RI ( l \ +\ 1)/(2\*(ss32\*(se\ \-\ 2)
 162 (which isn't a significant increase) but eliminates keys for which the
 163 hash's behaviour is particularly poor.
 164 .PP
 165 In tests,
 166 .B unihash
 167 actually performed better than
 168 .BR crc32 ,
 169 so if you want to just use it as a fast-ish hash with good statistical
 170 properties, choose some fixed key
 171 .IR k \ \*(>=\ 2.
 172 .PP
 173 We emphasize that the proof of this function's collision behaviour is
 174 .I not
 175 dependent on any unproven assumptions (unlike many `proofs' of
 176 cryptographic security, which actually reduce the security of some
 177 construction to the security of its components).  It's just a fact.
 178 .SH SEE ALSO
 179 .BR crc32 (3),
 180 .BR mLib (3).
 181 .SH AUTHOR
 182 Mark Wooding (mdw@nsict.org).