From: Ian Jackson <ijackson@chiark.greenend.org.uk>
To: ijackson@chiark.greenend.org.uk
Subject: Transition plan for git to move to a new hash function
Date: Thu, 20 Oct 2016 19:26:44 +0100


Basic principle: Every object will have two (or more) names,
corresponding to different hash functions.  It may be named by any of
its names, in every context.

Every program that invokes git or speaks git protocols will need to
understand the extended object name syntax, and understand that
objects have multiple names.

Safety catches prevent accidental incorporation into a project of
objects which contain references by incompatibly-new or
deprecatedly-old names.  This allows for incremental deployment.


Syntax:

The object name syntax is extended as follows: object names using sha1
are as current.  Object names starting with lowercase ASCII letters h
or later refer to new hash functions.  (`g' is reserved because of the
way that many programs write `g<objectname>'.  Programs that use
`g<objectname>' should be changed to show `h<hash>' for hash function
`h' rather than `gh<hash>'.)

Object names h<hex> are SHA-512 hashes.  Remaining letters are
reserved.  `x' `y' `z' are reserved for private experiments; we
declare that public releases of git will never accept such names.
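
To make the syntax rules concrete, here is a minimal sketch in Python
of how a program might classify an object name.  The function name and
the returned labels are made up for illustration, and it assumes
full-length names only (40 hex digits for SHA-1, 128 for SHA-512);
abbreviated names are not handled.

```python
import re

SHA1_RE = re.compile(r"[0-9a-f]{40}")
HEX_RE = re.compile(r"[0-9a-f]+")


def classify_object_name(name):
    """Return a label for the hash function that `name` uses.

    Hypothetical helper: the labels and the full-length requirement
    are assumptions of this sketch, not part of the proposal.
    """
    # Names using sha1 are as current: bare hex, no indicator letter.
    if SHA1_RE.fullmatch(name):
        return "sha1"
    if not name:
        raise ValueError("empty object name")
    indicator, digest = name[0], name[1:]
    # `g' is reserved because of the `g<objectname>' convention.
    if indicator == "g":
        raise ValueError("`g' is never a hash function indicator")
    # Hex digits stop at `f', so a first letter of `h' or later is
    # unambiguous.
    if not ("h" <= indicator <= "z"):
        raise ValueError("not a recognised object name: %r" % name)
    if not HEX_RE.fullmatch(digest):
        raise ValueError("non-hex digest in %r" % name)
    if indicator == "h":
        if len(digest) != 128:
            raise ValueError("SHA-512 names need 128 hex digits here")
        return "sha512"
    if indicator in "xyz":
        # Private experiments; public releases never accept these.
        return "private-experiment"
    return "reserved"
```

A caller would typically reject anything other than "sha1" and
"sha512" unless it is deliberately experimenting.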

Everywhere in the git object formats and git protocols, a new object
name (with hash function indicator) is permitted where an old object
name is permitted.  A single object refers to all the objects it
references by the same hash function; in general this might be a
different hash function to the one by which this particular object
was itself referenced or obtained.

As an exception, it is forbidden to refer to a tree object by a name
whose hash function differs from the one that tree uses to name its
subtrees.  If this seems necessary, the tree object must be
recursively rewritten instead to use the desired object name.

In binary protocols, where a SHA-1 object name in binary form was
previously used, a new codepoint must be allocated in a containing
structure (eg a new typecode).  Usually, the new-format binary object
will have a new typecode and also an additional name hash indicator.
15 of the hash indicator values correspond to the lowercase letters
reserved above.


Object store:

The object store knows which hash functions are enabled.  Each hash
function H has one of the following statuses, which are configured by
the user:

* ENABLED:

  As far as the user is concerned every object in the object store is
  accessible using H.  Objects which use H names can be received and
  stored.

  This is actually two states, depending on whether any objects exist
  in the store which use these names.  If no such objects exist yet,
  we say that the hash function is `ENABLED PROSPECTIVE'.  The H names
  for the objects have not yet been calculated.

  When the first object which names another object using H is received
  (or, on demand), the object store calculates the H names for all
  existing objects and notes that this hash function is now
  `ENABLED PRESENT'.

* OBSOLESCENT: Every object in the object store has its hash
  calculated using H.  However, H is known to possibly have collisions
  which we try to tolerate.  When a collision occurs, the object text
  which is currently in the object store is preferred and the "new"
  object is thrown away.  Local creation of new objects with
  references using H is forbidden.

  This is used as part of a gradual desupport strategy.  When the hash
  function is in this stage, existing history in all existing object
  stores is safe and cannot be corrupted or modified by receiving
  colliding objects.

  New object stores which receive their data from a trustworthy sender
  over a trustworthy channel will receive correct data.  Bad object
  stores or untrustworthy channels could exploit collisions, but not
  in new regions of the history which are presumably using new names.
  So the collisions can only affect archaeology.

  Merging previously-unrelated histories does introduce a collision
  hazard, but the collision would have had to have been introduced
  while H was still a "live" hash function in at least one of the two
  projects.

* FORBIDDEN: Objects do not have their hashes calculated using this
  hash function.  Attempts to reference an object by such a name
  fail.  Optionally the user may specify a tolerant mode where:
  a commit which refers to parents by obsolete names is taken to
  simply not have those parents; a commit which refers to a tree by
  an obsolete name is taken to have an empty tree.

  This is used for two purposes:

    - On a server, we use this to restrict the propagation of
      new hashes so as to enforce our compatibility intentions.
      Ie, hashes which we are "not ready for" are forbidden.

    - Everywhere, we use this to get rid of old hash functions.
      It makes access to old history possible but difficult.

* FORGOTTEN: Objects do not have their hashes calculated using this
  hash function.  References to objects by all such names return dummy
  objects of the right shape: the empty blob; the empty tree; a root
  commit with an empty tree and dummy metadata.

  This allows us to finally retire a hash function entirely.  We
  effectively throw away all the history which uses H.
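
The four statuses can be sketched as a small Python model.  This is
illustrative only: the store is a plain dict keyed by H name, the
dummy objects are placeholder byte strings, and the function names are
invented here.  It also collapses ENABLED PROSPECTIVE and ENABLED
PRESENT into one state and leaves out the tolerant mode.

```python
from enum import Enum


class Status(Enum):
    ENABLED = "enabled"
    OBSOLESCENT = "obsolescent"
    FORBIDDEN = "forbidden"
    FORGOTTEN = "forgotten"


# Placeholder dummies "of the right shape" (assumed representation).
DUMMIES = {
    "blob": b"",
    "tree": b"",
    "commit": b"<root commit, empty tree, dummy metadata>",
}


def lookup(store, status, name, kind):
    """Resolve `name` (an H name) according to the status of H."""
    if status in (Status.ENABLED, Status.OBSOLESCENT):
        return store[name]
    if status is Status.FORBIDDEN:
        raise LookupError("hash function is forbidden: %r" % name)
    # FORGOTTEN: every reference yields a dummy object.
    return DUMMIES[kind]


def receive(store, status, name, text):
    """Try to store a received object; return True if it was stored."""
    if status in (Status.FORBIDDEN, Status.FORGOTTEN):
        return False
    if status is Status.OBSOLESCENT and name in store:
        # Possible collision: the text already in the store wins.
        return False
    store[name] = text
    return True


def may_create_reference(status):
    """Local creation of new references using H needs H ENABLED."""
    return status is Status.ENABLED
```

The point of the OBSOLESCENT branch in receive() is that existing
history cannot be altered by a colliding object arriving later.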

During transfer protocols, the receiver will say which hashes are
obsolete or forgotten, and the sender will not follow such references
when computing the set of objects to send.  So receivers will not
receive the objects which were named only by obsolete or forgotten
names.
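
The sender's side of this could look roughly like the following Python
sketch.  The graph representation (object id mapped to a list of
(hash function, target id) pairs) and the function name are
assumptions made for the example; real git would walk its own object
database.

```python
def objects_to_send(graph, roots, refused):
    """Collect objects reachable from `roots`, skipping refused names.

    graph:   dict mapping object id -> list of (hash_fn, target_id)
    roots:   iterable of object ids the receiver asked for
    refused: set of hash function labels the receiver will not follow
    """
    seen = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        for hash_fn, target in graph.get(obj, []):
            if hash_fn in refused:
                # The receiver said not to follow such references.
                continue
            stack.append(target)
    return seen
```

An object that is reachable both by a refused name and by an accepted
one is still sent, because the walk reaches it through the accepted
reference.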


Naming in newly-generated objects, queries, etc.

There is a `default' hash function, which is that which HEAD uses.
(That is, HEAD refers to an object by some name.  The default hash
function is that name's hash function.)

git tools always output object names in the default hash
function.  (Including git-hash-object.)

As a consequence, newly generated objects will contain object
references using the `default' hash function.

When HEAD is empty, there is a separate record of the default hash
function.  This comes from a configured default in a new tree.  In an
existing tree, using git checkout --orphan remembers the default hash
function that HEAD had.

When HEAD is updated to a new commit, the name stored in HEAD uses the
newer of the previous HEAD hash function and of the hash function used
in the commit being stored.  ("Newer" is a built-in preference order,
overrideable by configuration.)
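
As a rough Python sketch of that rule: the preference list below
(oldest first) and the function name are assumptions for illustration,
standing in for the built-in order and its configuration override.

```python
# Assumed built-in preference order, oldest first.
DEFAULT_PREFERENCE = ["sha1", "sha512"]


def head_hash_after_update(previous, incoming,
                           preference=DEFAULT_PREFERENCE):
    """Pick the hash function HEAD uses after moving to a new commit.

    previous: hash function of the name HEAD held before the update
    incoming: hash function used in the commit being stored
    """
    # "Newer" means later in the preference order.
    return max(previous, incoming, key=preference.index)
```

With the default order, a SHA-1 HEAD that moves to a commit using
SHA-512 ends up named by SHA-512, and never moves back unless the
order is overridden.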

This (together with the `forbidden' state, above) ensures that
switching a project to use a new hash function is a deliberate
decision: the default hash function needs to be changed to make the
first commit with the new hash function.  After that, provided
the server accepts it, it's infectious.


Naming of refs other than HEAD

A ref refers to an object by one of its names.  However, operations
like git-show-ref convert that name to the default format (see above).

git-gc rewrites ref names to the default format.


Remote protocol

During the negotiation, a client needs to specify what names it
understands, and which it prefers (its default).

When the server is listing its refs, the names are converted to the
client's preferred format.


Equality testing

All software which tests for equality of git objects by checking
whether their object names are equal needs to obtain a canonical name
for both objects.
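
A minimal Python sketch of what callers would have to do, assuming a
lookup table that maps each known name of an object to all of that
object's names (the table layout and function names are hypothetical):

```python
def canonical_name(names, name, canonical_fn):
    """Translate any name of an object to its `canonical_fn` name.

    names: dict mapping every known name -> {hash_fn: name} for the
           object that name refers to (assumed layout).
    """
    return names[name][canonical_fn]


def same_object(names, a, b, canonical_fn="sha512"):
    """Compare two object names that may use different hash functions."""
    return (canonical_name(names, a, canonical_fn)
            == canonical_name(names, b, canonical_fn))
```

Comparing the raw strings a and b directly would wrongly report two
names of the same object as different.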

This is going to be quite annoying.

Note that semantically identical trees may (now) have different tree
objects because those tree objects might contain different object
names.  So tree comparison can no longer be done by comparing
names; rather an invocation of git diff is needed.
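
To illustrate why name comparison fails for trees, here is a toy
Python model of a semantic comparison.  It is not how git diff works
internally; the store layout (every name of an object is a key,
mapping to a ("blob", bytes) or ("tree", dict) pair) is an assumption,
and file modes are ignored.

```python
def trees_equivalent(store, a, b):
    """Return True if names `a` and `b` denote the same content.

    store: dict mapping name -> ("blob", bytes)
                           or  ("tree", {entry_name: child_name})
    """
    kind_a, body_a = store[a]
    kind_b, body_b = store[b]
    if kind_a != kind_b:
        return False
    if kind_a == "blob":
        return body_a == body_b
    # Trees: same entry names, and each pair of children equivalent,
    # regardless of which hash function names the children.
    if body_a.keys() != body_b.keys():
        return False
    return all(trees_equivalent(store, body_a[k], body_b[k])
               for k in body_a)
```

Two tree objects that list the same file under a SHA-1 name and a
SHA-512 name respectively are distinct objects, yet this comparison
treats them as the same tree.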


Transition plan

Y0: Implement all of the above.  Test it.

    Default configuration:
       SHA-1 is ENABLED and is default HEAD hash

       SHA-512 is FORBIDDEN in bare repos
       SHA-512 is ENABLED in trees with working trees

Y5: New projects should start using SHA-512.

    Default configuration change:

       SHA-512 becomes ENABLED in *new* bare repos but remains
                  FORBIDDEN in existing ones
       

-- 
Ian Jackson <ijackson@chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.