plan.txt

   1 From: Ian Jackson <ijackson@chiark.greenend.org.uk>
   2 To: ijackson@chiark.greenend.org.uk
   3 Subject: Transition plan for git to move to a new hash function
   4 Date: Thu, 20 Oct 2016 19:26:44 +0100
   5
   6
   7 Basic principle: Every object will have two (or more) names,
   8 corresponding to different hash functions.  It may be named by any of
   9 its names, in every context.
  10
  11 Every program that invokes git or speaks git protocols will need to
  12 understand the extended object name syntax, and understand that
  13 objects have multiple names.
  14
  15 Safety catches prevent accidental incorporation into a project of
  16 objects which contain references by incompatibly-new or
  17 deprecatedly-old names.  This allows for incremental deployment.
  18
  19
  20 TEXTUAL SYNTAX
  21
  22 The object name textual syntax is extended as follows:
  23
  24 We declare that the object name syntax is henceforth
  25   [A-Z]+[0-9a-z]+ | [0-9a-f]+
  26 and that names [A-Z].* are deprecated as ref name components.
  27
  28     Rationale:
  29
  30       Full backwards compatibility is impossible, because the hash
  31       function needs to be evident in the name, so the new names
  32       must be disjoint from all old SHA-1 names.
  33
  34       We want a short but extensible syntax.  The syntax should impose
  35       minimal extra requirements on existing git users.  In most
  36       contexts where existing git users use hashes, ASCII alphanumeric
  37       object names will fit.  Use of punctuation such as : or even _
  38       may give trouble to existing users, who are already using
  39       such things as delimiters.
  40
  41       In existing deployments, refnames that differ only in case are
  42       generally avoided (because they are troublesome on
  43       case-insensitive filesystems).  And conventionally refnames are
  44       lower case.  So names starting with an upper case letter will be
  45       disjoint from most existing ref name components.
  46
  47       Even though we probably want to keep using hex, it is a good
  48       idea to reserve the flexibility to use a more compact encoding,
  49       while not excessively widening the existing permissible
  50       character set.
  51
  52 Object names using SHA-1 are represented, in text, as at present.
  53
  54 Object names starting with uppercase ASCII letters H or later refer to
  55 new hash functions.  (Programs that use `g<objectname>' should ideally
  56 be changed to show `H<hash>' for hash function `H' rather than
  57 `gH<hash>'.)
  58
  59     Rationale:
  60
  61       Object names starting with A-F might look like hex.  G is
  62       reserved because of the way that many programs write
  63       `g<objectname>'.
  64
  65       This gives us 19 new hash function values until we have to
  66       start using two-letter hash function prefixes, or decide to
  67       use A-F after all.
  68
  69 (Truncated object names work as they do at the moment.)
  70
  71 Initially we define and assign one new hash function (and textual
  72 object name encoding):
  73
  74   H<hex>    where <hex> is the BLAKE2b hash of the object
  75             (in lowercase)
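
A minimal sketch of the "two names for every object" principle.  It
assumes (the plan does not say) that the new hash covers the same
"<type> <size>\0<content>" serialisation git already feeds to SHA-1, and
that BLAKE2b is used with its default 64-byte digest.  The function name
object_names is hypothetical.

```python
import hashlib

def object_names(obj_type: str, content: bytes) -> dict:
    """Return both names of one object: SHA-1 (bare hex) and BLAKE2b (H + hex)."""
    # Assumption: the object is hashed in git's existing loose-object
    # serialisation, "<type> <size>\0<content>", for both hash functions.
    data = f"{obj_type} {len(content)}".encode() + b"\0" + content
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        # Assumption: default 64-byte BLAKE2b digest, lowercase hex, prefix H.
        "blake2b": "H" + hashlib.blake2b(data).hexdigest(),
    }

names = object_names("blob", b"")
print(names["sha1"])     # the familiar empty-blob name
print(names["blake2b"])  # the same object under its H name
```

Either string would then be acceptable wherever an object name is
expected.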
  76
  77 We also reserve the following syntax for private experiments:
  78   E[A-Z]+[0-9a-z]+
  79 We declare that public releases of git will never accept such
  80 object names.
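
The syntax rules above could be checked along the following lines.  This
is only a sketch: the function name classify and the ASSIGNED table are
hypothetical, and treating every prefix other than H as unassigned is an
assumption about how a first release would behave.

```python
import re

OLD_NAME = re.compile(r"[0-9a-f]+")           # SHA-1, as at present
NEW_NAME = re.compile(r"([A-Z]+)([0-9a-z]+)")  # hash prefix + encoded hash
ASSIGNED = {"H": "blake2b"}                    # the only prefix assigned so far

def classify(name: str) -> tuple:
    """Split an object name into (hash function, hash text)."""
    if OLD_NAME.fullmatch(name):
        return ("sha1", name)
    m = NEW_NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"not an object name: {name!r}")
    prefix, text = m.groups()
    if prefix[0] == "E" and len(prefix) > 1:
        # E[A-Z]+... is reserved for private experiments.
        raise ValueError(f"experimental name refused: {name!r}")
    if prefix not in ASSIGNED:
        raise ValueError(f"unassigned hash function prefix: {prefix!r}")
    if prefix == "H" and not OLD_NAME.fullmatch(text):
        raise ValueError(f"H names are lowercase hex: {name!r}")
    return (ASSIGNED[prefix], text)
```

No length check is made, so truncated names pass through unchanged.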
  81
  82 Everywhere in the git object formats and git protocols, a new object
  83 name (with hash function indicator) is permitted where an old object
  84 name is permitted.
  85
  86 A single object refers to all the objects it references by the same
  87 hash function; in general this might be a different hash function to
  88 the hash function by which this particular object was itself
  89 referenced or obtained.
  90
  91 As a further restriction, it is forbidden to refer to a tree object by
  92 a name other than the hash function it uses to name its subtrees.  If
  93 this seems necessary, the tree object must be recursively rewritten
  94 instead to use the desired object name.
  95
  96 In binary protocols, where a SHA-1 object name in binary form was
  97 previously used, a new codepoint must be allocated in a containing
  98 structure (eg a new typecode).  Usually, the new-format binary object
  99 will have a new typecode and also an additional name hash indicator.
 100
 101 Whenever a new hash function textual syntax is defined, corresponding
 102 binary format codepoint(s) are assigned.  (Detailed binary format
 103 specification is outside the scope of this plan.)
 104
 105
 106 ORDERING
 107
 108 Hash functions are partially ordered, from `older' to `newer'.
 109
 110 The ordering is configurable.  The default, with the two hash
 111 functions defined here, is the obvious ordering
 112     SHA1 ([0-9a-f]*) < BLAKE2b (H*)
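
A sketch of the ordering, simplified to a total order held in a
configurable tuple (the plan allows a partial order; that is not modelled
here).  The names DEFAULT_ORDER, is_newer and newest are hypothetical.

```python
DEFAULT_ORDER = ("sha1", "blake2b")  # oldest first; configurable

def is_newer(a: str, b: str, order=DEFAULT_ORDER) -> bool:
    """True if hash function `a` is strictly newer than `b`."""
    return order.index(a) > order.index(b)

def newest(namings, order=DEFAULT_ORDER) -> str:
    """Return the newest of the given hash functions."""
    # max() returns the first maximal element, so ties resolve stably.
    return max(namings, key=order.index)
```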
 113
 114
 115 CHOICE OF OBJECT NAMES
 116
 117 Whenever objects are named, it is possible to refer to them by old or
 118 new names.  So git must make a choice, each time: when new objects
 119 are created; when refs are updated; and when refs are reported over
 120 network protocols to other instances of git.
 121
 122 Although strictly speaking all objects have both old names and new
 123 names, and there may be more than two hash functions, it is possible
 124 to speak, somewhat loosely, about `new objects'.
 125
 126 A `new' object is one which refers to other objects by a `new' name
 127 (whatever `new' means).
 128
 129 We call these different hashes `namings'.  That is, a `naming' is a
 130 hash function implemented by git.  The `naming IN an object' is the
 131 naming by which the object refers to other objects (and may not exist,
 132 if the object has no references); the `name OF an object' is the name
 133 by which the object itself is specified.
 134
 135
 136 Commits
 137
 138 A non-origin commit is made (by default) as new as the newest of
 139   (i) the naming in each of its parents
 140   (ii) the specified name of each of its parents
 141 (Implicitly this normally means that if HEAD uses a new name, new
 142 commits will be generated.)
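
The default rule for a non-origin commit can be sketched as below.  The
Parent record and the function name are hypothetical, and the ordering is
again simplified to a tuple.

```python
from typing import NamedTuple

ORDER = ("sha1", "blake2b")  # oldest first

class Parent(NamedTuple):
    specified_by: str  # hash function of the name used to specify this parent
    naming_in: str     # hash function the parent uses for its own references

def naming_for_new_commit(parents) -> str:
    """Newest of (i) the naming in each parent, (ii) each parent's specified name."""
    candidates = []
    for p in parents:
        candidates.append(p.naming_in)
        candidates.append(p.specified_by)
    return max(candidates, key=ORDER.index)
```

So a merge of an old-style parent with a new-style parent yields a new
commit, and specifying an old commit by its new name is enough to make
its child new.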
 143
 144 The naming of an origin commit is controlled by a dropping left in
 145 .git by git checkout --orphan or git init.
 146
 147 At boundaries between old and new history, a new commit will refer to
 148 old parents by those old parents' new names.
 149
 150
 151 Tags
 152
 153 A new tag is made to use the newest naming, for its tagged object, of
 154   (i) the name by which the tagged object was specified
 155   (ii) the naming in the tagged object (if applicable)
 156
 157
 158 Trees
 159
 160 Commits (and sometimes, tags) can refer to tree objects; that tree
 161 will contain the same naming as the referring object.
 162
 163 That is, it is a bug to refer to a tree object by other than the hash
 164 it uses internally to refer to subtrees (and gitlinks).  This will
 165 mean that a tree must sometimes be rewritten (ie, new object names
 166 recalculated recursively).
 167
 168     Rationale: we want to avoid new commits and tags relying on weak
 169     hashes.
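
To illustrate what "rewritten recursively" involves, here is a toy
sketch over an in-memory store.  Everything in it is an assumption made
for illustration: the text serialisation of trees is NOT git's real tree
format, gitlinks are ignored, and the helper names are hypothetical.

```python
import hashlib

def name_of(kind, data, naming):
    payload = f"{kind} {len(data)}".encode() + b"\0" + data
    if naming == "sha1":
        return hashlib.sha1(payload).hexdigest()
    return "H" + hashlib.blake2b(payload).hexdigest()

def put(store, kind, data):
    """Store an object under both of its names; return those names."""
    names = {n: name_of(kind, data, n) for n in ("sha1", "blake2b")}
    for name in names.values():
        store[name] = (kind, data)
    return names

def encode_tree(entries):
    # Toy format: one "mode child-name filename" line per entry.
    return "".join(f"{m} {c} {f}\n" for m, f, c in sorted(entries)).encode()

def decode_tree(data):
    entries = []
    for line in data.decode().splitlines():
        mode, child, fname = line.split(" ", 2)
        entries.append((mode, fname, child))
    return entries

def rewrite_tree(store, tree_name, naming):
    """Return the `naming` name of an equivalent tree using only `naming` inside."""
    kind, data = store[tree_name]
    assert kind == "tree"
    new_entries = []
    for mode, fname, child in decode_tree(data):
        child_kind, child_data = store[child]
        if child_kind == "tree":
            child = rewrite_tree(store, child, naming)  # recurse into subtrees
        else:
            child = name_of(child_kind, child_data, naming)
        new_entries.append((mode, fname, child))
    return put(store, "tree", encode_tree(new_entries))[naming]

store = {}
blob = put(store, "blob", b"hello\n")
sub = put(store, "tree", encode_tree([("100644", "a.txt", blob["sha1"])]))
root = put(store, "tree", encode_tree([("40000", "dir", sub["sha1"])]))
new_root = rewrite_tree(store, root["sha1"], "blake2b")
```

Note that the rewritten root is a different object from the original
root: its content changed because the names inside it changed.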
 170
 171
 172 Blobs
 173
 174 Blobs do not refer to other objects so they are neither new nor old.
 175
 176
 177 Name of newly created object
 178
 179 When git creates a new object, it reports the new object name using
 180 the naming in the object.
 181
 182 For blobs and empty trees, the caller should normally specify the
 183 naming.  The default is the naming used for HEAD.
 184
 185
 186 Updating refs
 187
 188 If a ref is updated with a new object, the name from its creation is
 189 used (see above).
 190
 191 If a ref is updated to a specified object, the naming used in the ref
 192 is the newer of the specified name, or the naming in the object (if
 193 any).
 194
 195
 196
 197
 198
 199 ), or with a specified object name.
 200
 201
 202
 203 (If there are different equally new names, one of the newest names is
 204 chosen according to some stable rule.)
 205
 206
 207
 208 new
 209
 210 commit.  (This may mean converting the tree in hand, since trees are
 211 supposed to be homogeneous.)
 212
 213
 214
 215
 216 A `new commit' is one which refers to objects by
 217
 218
 219
 220
 221 Object store:
 222
 223 The object store knows which hash functions are enabled.  Each hash
 224 function H has one of the following statuses, which are configured by
 225 the user:
 226
 227 * ENABLED:
 228
 229   As far as the user is concerned every object in the object store is
 230   accessible using H.  Objects which use H names can be received and
 231   stored.
 232
 233   This is actually two states, depending on whether any objects exist
 234   in the store which use these names.  If no such objects exist yet,
 235   we say that the hash function is `ENABLED PROSPECTIVE'.  The H names
 236   for the objects have not yet been calculated.
 237
 238   When the first object which names another object using H is received
 239   (or, on demand), the object store calculates the H names for all
 240   existing objects and notes that this hash function is now
 241   `ENABLED PRESENT'.
 242
 243   If a hash collision is detected, we crash immediately.
 244
 245 * OBSOLESCENT: Every object in the object store has its hash
 246   calculated using H.  However, H is known to possibly have collisions
 247   which we try to tolerate.  When a collision occurs, the object text
 248   which is currently in the object store is preferred and the "new"
 249   object is thrown away.
 250
 251   Local creation of new objects with references using H is
 252   discouraged.  Specifically, if another hash function is ENABLED, we
 253   will use that instead.
 254
 255   This is used as part of a gradual desupport strategy.  When the hash
 256   function is in this stage, existing history in all existing object
 257   stores is safe and cannot be corrupted or modified by receiving
 258   colliding objects.
 259
 260   New object stores which receive their data from a trustworthy sender
 261   over a trustworthy channel will receive correct data.  Bad object
 262   stores or untrustworthy channels could exploit collisions, but not
 263   in new regions of the history which are presumably using new names.
 264   So the collisions can only affect archaeology.
 265
 266   Merging previously-unrelated histories does introduce a collision
 267   hazard, but the collision would have had to have been introduced
 268   while H was still a "live" hash function in at least one of the two
 269   projects.
 270
 271 * FORBIDDEN: Objects do not have their hashes calculated using this
 272   hash function.  Attempts to reference an object by such a name
 273   fail.  Optionally the user may specify a tolerant mode where:
 274   a commit which refers to parents by obsolete names is taken to
 275   simply not have those parents; a commit which refers to a tree by
 276   an obsolete name is taken to have an empty tree.
 277
 278   This is used for two purposes:
 279
 280     - On a server, we use this to restrict the propagation of
 281       new hashes so as to enforce our compatibility intentions.
 282       Ie, hashes which we are "not ready for" are forbidden.
 283
 284     - Everywhere, we use this to get rid of old hash functions.
 285       It makes access to old history possible but difficult.
 286
 287 * FORGOTTEN: Objects do not have their hashes calculated using this
 288   hash function.  References to objects by all such names return dummy
 289   objects of the right shape: the empty blob; the empty tree; a root
 290   commit with an empty tree and dummy metadata.
 291
 292   This allows us to finally retire a hash function entirely.  We
 293   effectively throw away all the history which uses H.
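
The four statuses might translate into store behaviour roughly as
follows.  This is a sketch under simplifying assumptions: the index is a
plain dict for one hash function, the FORGOTTEN case returns only an
empty blob rather than a dummy of the right shape, and the tolerant mode
for FORBIDDEN is omitted.  All names are hypothetical.

```python
from enum import Enum

class Status(Enum):
    ENABLED = "enabled"
    OBSOLESCENT = "obsolescent"
    FORBIDDEN = "forbidden"
    FORGOTTEN = "forgotten"

class HashCollision(Exception):
    pass

def store_object(index, status, name, content):
    """Record `content` under `name` for one hash function."""
    if status in (Status.FORBIDDEN, Status.FORGOTTEN):
        return  # names are not calculated for this hash function
    existing = index.get(name)
    if existing is None:
        index[name] = content
    elif existing != content:
        if status is Status.ENABLED:
            raise HashCollision(name)  # "we crash immediately"
        # OBSOLESCENT: keep the text already in the store, drop the newcomer

def lookup(index, status, name):
    """Fetch an object by a name in one hash function."""
    if status is Status.FORBIDDEN:
        raise LookupError(f"hash function forbidden: {name}")
    if status is Status.FORGOTTEN:
        return b""  # simplification: dummy empty blob
    return index[name]
```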
 294
 295 During transfer protocols, the receiver will say which hashes it
 296 thinks are obsolete or forgotten, and the sender will not follow such
 297 references when computing the set of objects to send.  So receivers
 298 will not receive the objects which were named only by obsolete or
 299 forgotten names.
 300
 301
 302 Naming in newly-generated objects, queries, etc.
 303
 304 There is a `default' hash function, which is that which HEAD uses.
 305 (That is, HEAD refers to an object by some name.  The default hash
 306 function is that name's hash function.)
 307
 308 git tools always output object names in the default hash
 309 function.  (Including git-hash-object.)
 310
 311 As a consequence, newly generated objects will contain object
 312 references using the `default' hash function.
 313
 314 When HEAD is empty, there is a separate record of the default hash
 315 function.  This comes from a configured default in a new tree.  In an
 316 existing tree, using git checkout --orphan remembers the default hash
 317 function that HEAD had.
 318
 319 When HEAD is updated to a new commit, the name stored in HEAD uses the
 320 newer of the previous HEAD hash function and of the hash function used
 321 in the commit being stored.  ("Newer" is a built-in preference order,
 322 overrideable by configuration.)
 323
 324 This (together with the `forbidden' state, above) ensures that
 325 switching a project to use a new hash function is a deliberate
 326 decision: the default hash function needs to be changed to make the
 327 first commit with the new hash function.  After that, provided
 328 the server accepts it, it's infectious.
 329
 330
 331 Naming of refs other than HEAD
 332
 333 A ref refers to an object by one of its names.  However, operations
 334 like git-show-ref convert that name to the default format (see above).
 335
 336 git-gc rewrites ref names to the default format iff that is newer.
 337
 338
 339 Remote protocol
 340
 341 During the negotiation, a receiver needs to specify what hashes it
 342 understands.
 343
 344 When the sender is listing its refs, the names are converted to a
 345 hash understood by the client if necessary.  If this is not necessary,
 346 they are left unchanged.
 347
 348 When a receiver is updating refs, it should follow the sender's
 349 idea of a hash change iff it's an upgrade (and the new function is
 350 ENABLED).  That is, if the sender sends name H2 for some ref, and the
 351 receiver has H1, but these refer to the same object, then the receiver
 352 should update its own ref name from H1 to H2 iff H2 uses a newer hash
 353 function.
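
The receiver-side rule can be sketched as a small decision function.  It
assumes only the two namings defined in this plan, and it takes for
granted that the caller has already established that both names denote
the same object.  The function names are hypothetical.

```python
ORDER = ("sha1", "blake2b")  # oldest first

def hash_function_of(name: str) -> str:
    # Assumption: only bare-hex SHA-1 names and H names exist.
    return "blake2b" if name.startswith("H") else "sha1"

def ref_name_after_fetch(local_name: str, sender_name: str, enabled) -> str:
    """Choose the name to keep in the ref; both names must denote one object."""
    ours = hash_function_of(local_name)
    theirs = hash_function_of(sender_name)
    # Follow the sender only if it is an upgrade to an ENABLED hash function.
    if theirs in enabled and ORDER.index(theirs) > ORDER.index(ours):
        return sender_name
    return local_name
```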
 354
 355
 356 Equality testing
 357
 358 All software which tests for equality of git objects by checking
 359 whether their object names are equal needs to obtain a canonical name
 360 for both objects.
 361
 362 This is going to be quite annoying.
 363
 364 We should provide a convenient utility which tests whether two object
 365 names refer to the same object.
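
One possible shape for such a utility, sketched over an in-memory index
that maps every known name to a canonical one.  The class, its methods,
the choice of the BLAKE2b name as canonical, and the hashing of git's
usual "<type> <size>\0<content>" serialisation are all assumptions.

```python
import hashlib

class NameIndex:
    """Maps each known name of an object to one canonical name."""

    def __init__(self):
        self._canonical = {}

    def add(self, kind: str, content: bytes):
        """Register an object; return its (SHA-1 name, H name)."""
        payload = f"{kind} {len(content)}".encode() + b"\0" + content
        old = hashlib.sha1(payload).hexdigest()
        new = "H" + hashlib.blake2b(payload).hexdigest()
        # Assumption: the newest naming serves as the canonical name.
        self._canonical[old] = new
        self._canonical[new] = new
        return old, new

    def same_object(self, a: str, b: str) -> bool:
        """True if the two (full, known) names denote the same object."""
        return self._canonical[a] == self._canonical[b]
```

Callers that today compare name strings directly would call same_object
instead.  Truncated names are not handled in this sketch.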
 366
 367 Note that semantically identical trees may (now) have different tree
 368 objects because those tree objects might contain different object
 369 names.  So (in some contexts at least) tree comparison cannot any
 370 longer be done by comparing names; rather an invocation of git diff is
 371 needed, or explicit generation of a tree object with the right name.
 372
 373
 374 Transition plan
 375
 376 Y0: Implement all of the above.  Test it.
 377
 378     Default configuration:
 379        SHA-1 is ENABLED
 380        SHA-512 is FORBIDDEN in bare repos
 381        SHA-512 is ENABLED in trees with working trees
 382        default HEAD hash is SHA-1
 383
 384     Effects:
 385
 386     Existing projects will not switch to SHA-512 willy-nilly.
 387     New projects will still use SHA-1.
 388
 389     Incompatible new-style commits cannot be pushed without server
 390     admin effort (or until future upgrade).
 391
 392     So all old git clients still work.
 393
 394 Y4: SHA-512 by default for new projects.
 395     Conversion enabled for existing projects.
 396     Old git software is now pretty firmly deprecated.
 397
 398     Default configuration change:
 399
 400        When creating a new bare tree, a configuration dropping is left
 401        (in `config') which specifies that SHA-1 is OBSOLESCENT
 402
 403        Default status for SHA-512 is FORBIDDEN if SHA-1 is ENABLED,
 404        or ENABLED if SHA-1 is OBSOLESCENT.
 405
 406        default HEAD hash is newest ENABLED hash.
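
The Y4 defaults can be read as a small derivation from the status of
SHA-1.  The sketch below follows the hash names used in this transition
plan; the function name and the use of plain strings for statuses are
hypothetical.

```python
ORDER = ("sha1", "sha512")  # oldest first

def y4_defaults(sha1_status: str):
    """Return (statuses, default HEAD hash) under the Y4 default rules."""
    if sha1_status == "ENABLED":
        statuses = {"sha1": "ENABLED", "sha512": "FORBIDDEN"}
    elif sha1_status == "OBSOLESCENT":
        statuses = {"sha1": "OBSOLESCENT", "sha512": "ENABLED"}
    else:
        # The plan defines Y4 defaults only for these two SHA-1 statuses.
        raise ValueError(f"no Y4 default for SHA-1 status {sha1_status!r}")
    enabled = [h for h in ORDER if statuses[h] == "ENABLED"]
    return statuses, enabled[-1]  # default HEAD hash: newest ENABLED hash
```

An existing tree (SHA-1 still ENABLED) therefore keeps SHA-1 as its
default, while a newly created tree (SHA-1 OBSOLESCENT) defaults to
SHA-512.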
 407
 408     Effects:
 409
 410     When creating a new working tree, it starts using SHA-512.
 411     A new server tree will accept SHA-512.
 412
 413     Existing server trees do not yet accept SHA-512.  They publish
 414     their SHA-1 hashes, so clients make commits with SHA-1.
 415
 416     To convert a project, an administrator would set SHA-1 to
 417     OBSOLESCENT on the server.  All clones after that will have HEAD
 418     with a SHA-512 name.  Fetches and pulls will update to SHA-512
 419     names.
 420
 421 will , and push one SHA-512 commit to
 422     mainline.
 423
 424
 425
 426     Default configuration change:
 427
 428     Effects:
 429
 430     When creating a new tree with working tree with git init (ie, no
 431     HEAD), the default HEAD hash is set to SHA-512 (because SHA-1 is
 432     OBSOLESCENT in a new tree and therefore SHA-512 is the only
 433     ENABLED hash and is the default).
 434
 435     Newly minted server trees accept SHA-512.
 436
 437
 438  start using SHA-512 by default.
 439
 440 Y6: Existing projects start being converted infectiously.
 441     It is hard to stop this happening.
 442     Old git software is firmly stuffed.
 443
 444     Default configuration change:
 445        SHA-1 is OBSOLESCENT
 446        (default for SHA-512, and HEAD hash, computed as in Y4)
 447
 448     Result is that by default all software
 449
 450     (Projects which do not want to convert need to set SHA-1 to
 451     ENABLED, explicitly, on their
 452
 453 Y6: Existing projects start using SHA-512.
 454
 455     Default configuration change:
 456        SHA-512 is ENABLED
 457        SHA-1 is OBSOLESCENT
 458        (default default HEAD hash is already SHA-512)
 459
 460       In existing repositories where no special action
 461
 462
 463 --
 464 Ian Jackson <ijackson@chiark.greenend.org.uk>   These opinions are my own.
 465
 466 If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
 467 a private address which bypasses my fierce spamfilter.