--- /dev/null
+From: Ian Jackson <ijackson@chiark.greenend.org.uk>
+To: ijackson@chiark.greenend.org.uk
+Subject: Transition plan for git to move to a new hash function
+Date: Thu, 20 Oct 2016 19:26:44 +0100
+
+
+Basic principle: Every object will have two (or more) names,
+corresponding to different hash functions. It may be named by any of
+its names, in every context.
+
+Every program that invokes git or speaks git protocols will need to
+understand the extended object name syntax, and understand that
+objects have multiple names.
+
+Safety catches prevent accidental incorporation into a project of
+objects which contain references by incompatibly-new or
+deprecatedly-old names. This allows for incremental deployment.
+
+
+Syntax:
+
+The object name syntax is extended as follows: object names using sha1
+are as current. Object names starting with lowercase ASCII letters h
+or later refer to new hash functions. (`g' is reserved because of the
+way that many programs write `g<objectname>'. Programs that use
+`g<objectname>' should be changed to show `h<hash>' for hash function
+`h' rather than `gh<hash>'.)
+
+Object names h<hex> are SHA-512 hashes. Remaining letters are
+reserved. `x' `y' `z' are reserved for private experiments; we
+declare that public releases of git will never accept such names.
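+
+As a rough illustration of the naming rules above, here is a minimal
+Python sketch that classifies an object name by its first character.
+The function name classify_object_name and the labels it returns are
+hypothetical, not part of git. The sketch assumes lowercase hex and
+does not check name lengths, so abbreviated names pass.
+
```python
HEX_DIGITS = set("0123456789abcdef")

def classify_object_name(name: str) -> str:
    """Classify an object name by its hash function indicator (sketch)."""
    if not name:
        raise ValueError("empty object name")
    first, rest = name[0], name[1:]
    if first in HEX_DIGITS:
        # No indicator letter: a SHA-1 name, written as it is today.
        if set(name) <= HEX_DIGITS:
            return "sha1"
        raise ValueError(f"malformed SHA-1 name: {name!r}")
    if first == "g":
        # Reserved because many programs write g<objectname>.
        raise ValueError("`g' is not a hash function indicator")
    if not ("h" <= first <= "z") or not rest or not set(rest) <= HEX_DIGITS:
        raise ValueError(f"malformed object name: {name!r}")
    if first == "h":
        return "sha512"
    if first in "xyz":
        return "private-experiment"  # never accepted by public releases
    return "reserved"                # i..w: not yet allocated
```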
+
+Everywhere in the git object formats and git protocols, a new object
+name (with hash function indicator) is permitted where an old object
+name is permitted. A single object refers to all the objects it
+references by the same hash function; in general this might be a
+different hash function from the one by which this particular object
+was itself referenced or obtained.
+
+As an exception, it is forbidden to refer to a tree object by a name
+whose hash function differs from the one the tree uses to name its
+subtrees. If this seems necessary, the tree object must be
+recursively rewritten instead to use the desired object names.
+
+In binary protocols, where a SHA-1 object name in binary form was
+previously used, a new codepoint must be allocated in a containing
+structure (eg a new typecode). Usually, the new-format binary object
+will have a new typecode and also an additional name hash indicator.
+15 of the hash indicator values correspond to the lowercase letters
+reserved above.
+
+
+Object store:
+
+The object store knows which hash functions are enabled. Each hash
+function H has one of the following statuses, which are configured by
+the user:
+
+* ENABLED:
+
+ As far as the user is concerned every object in the object store is
+ accessible using H. Objects which use H names can be received and
+ stored.
+
+ This is actually two states, depending on whether any objects exist
+ in the store which use these names. If no such objects exist yet,
+ we say that the hash function is `ENABLED PROSPECTIVE'. The H names
+ for the objects have not yet been calculated.
+
+ When the first object which names another object using H is received
+ (or, on demand), the object store calculates the H names for all
+ existing objects and notes that this hash function is now
+ `ENABLED PRESENT'.
+
+* OBSOLESCENT: Every object in the object store has its hash
+ calculated using H. However, H is known to possibly have collisions
+ which we try to tolerate. When a collision occurs, the object text
+ which is currently in the object store is preferred and the "new"
+ object is thrown away. Local creation of new objects with
+ references using H is forbidden.
+
+ This is used as part of a gradual desupport strategy. When the hash
+ function is in this stage, existing history in all existing object
+ stores is safe and cannot be corrupted or modified by receiving
+ colliding objects.
+
+ New object stores which receive their data from a trustworthy sender
+ over a trustworthy channel will receive correct data. Bad object
+ stores or untrustworthy channels could exploit collisions, but not
+ in new regions of the history which are presumably using new names.
+ So the collisions can only affect archaeology.
+
+ Merging previously-unrelated histories does introduce a collision
+ hazard, but the collision would have had to have been introduced
+ while H was still a "live" hash function in at least one of the two
+ projects.
+
+* FORBIDDEN: Objects do not have their hashes calculated using this
+ hash function. Attempts to reference an object by such a name
+ fail. Optionally the user may specify a tolerant mode where:
+ a commit which refers to parents by obsolete names is taken to
+ simply not have those parents; a commit which refers to a tree by
+ an obsolete name is taken to have an empty tree.
+
+ This is used for two purposes:
+
+ - On a server, we use this to restrict the propagation of
+ new hashes so as to enforce our compatibility intentions.
+ Ie, hashes which we are "not ready for" are forbidden.
+
+ - Everywhere, we use this to get rid of old hash functions.
+ It makes access to old history possible but difficult.
+
+* FORGOTTEN: Objects do not have their hashes calculated using this
+ hash function. References to objects by all such names return dummy
+ objects of the right shape: the empty blob; the empty tree; a root
+ commit with an empty tree and dummy metadata.
+
+ This allows us to finally retire a hash function entirely. We
+ effectively throw away all the history which uses H.
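+
+The four statuses above can be summarised in a small Python sketch.
+Everything here is an assumption made for illustration: the Status
+enum, the helper names, the dict standing in for the object store,
+and the contents of the dummy objects. The tolerant mode for
+FORBIDDEN and the ENABLED PROSPECTIVE/PRESENT split are left out.
+
```python
from enum import Enum

class Status(Enum):
    ENABLED = 1
    OBSOLESCENT = 2
    FORBIDDEN = 3
    FORGOTTEN = 4

# Stand-ins for "dummy objects of the right shape"; contents are made up.
DUMMY = {"blob": b"", "tree": b"", "commit": b"dummy root commit"}

def lookup(status, objects, name, expected_type):
    """Resolve `name`, an H name, given the status of hash function H."""
    if status in (Status.ENABLED, Status.OBSOLESCENT):
        return objects[name]          # every stored object has an H name
    if status is Status.FORBIDDEN:
        raise LookupError(f"hash function forbidden: {name}")
    return DUMMY[expected_type]       # FORGOTTEN: history is thrown away

def may_create_reference(status):
    """Only ENABLED hash functions may be used in locally created objects."""
    return status is Status.ENABLED

def receive_obsolescent(objects, name, text):
    """On a collision, keep the text already in the store; drop the new one."""
    return objects.setdefault(name, text)
```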
+
+During transfer protocols, the receiver will say which hashes are
+obsolete or forgotten, and the sender will not follow such references
+when computing the set of objects to send. So receivers will not
+receive the objects which were named only by obsolete or forgotten
+names.
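+
+The sender-side rule can be sketched as a graph walk. This is only a
+sketch under assumptions: the graph representation (object id mapped
+to the hash function that object uses for its references, plus the
+ids it refers to) and the name objects_to_send are invented here, and
+real git negotiation is considerably more involved.
+
```python
def objects_to_send(graph, tips, skipped_hashes):
    """Return the ids reachable from `tips` without following references
    written in a hash function the receiver asked us to skip.

    graph: id -> (hash function used for this object's references,
                  list of referenced ids)
    """
    seen = set()
    stack = list(tips)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        hash_fn, targets = graph.get(obj, (None, []))
        if hash_fn in skipped_hashes:
            continue  # do not follow references made by skipped names
        stack.extend(targets)
    return seen
```
+
+An object reached through a reference in an accepted hash function
+is still sent, even if its own references are then not followed.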
+
+
+Naming in newly-generated objects, queries, etc.
+
+There is a `default' hash function, which is that which HEAD uses.
+(That is, HEAD refers to an object by some name. The default hash
+function is that name's hash function.)
+
+git tools always output object names in the default hash
+function. (Including git-hash-object.)
+
+As a consequence, newly generated objects will contain object
+references using the `default' hash function.
+
+When HEAD is empty, there is a separate record of the default hash
+function. This comes from a configured default in a new tree. In an
+existing tree, using git checkout --orphan remembers the default hash
+function that HEAD had.
+
+When HEAD is updated to a new commit, the name stored in HEAD uses the
+newer of the previous HEAD hash function and of the hash function used
+in the commit being stored. ("Newer" is a built-in preference order,
+overrideable by configuration.)
+
+This (together with the `forbidden' state, above) ensures that
+switching a project to use a new hash function is a deliberate
+decision: the default hash function needs to be changed to make the
+first commit with the new hash function. After that, provided
+the server accepts it, it's infectious.
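+
+The HEAD update rule described in this section can be sketched in a
+few lines of Python. The preference tuple (oldest first) and the
+function name are assumptions for illustration; the text only says
+that "newer" is a built-in order which configuration can override.
+
```python
# Assumed built-in preference order, oldest hash function first.
DEFAULT_PREFERENCE = ("sha1", "sha512")

def head_hash_after_update(previous, incoming, preference=DEFAULT_PREFERENCE):
    """Pick the hash function HEAD uses after being updated to a commit.

    previous: hash function of the name HEAD held before the update
    incoming: hash function used inside the commit being stored
    """
    # The later entry in the preference order counts as "newer".
    return max(previous, incoming, key=preference.index)
```
+
+With the default order, a repository whose HEAD uses SHA-1 switches
+to SHA-512 as soon as HEAD is moved to a commit that uses SHA-512
+names, and never switches back on its own.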
+
+
+Naming of refs other than HEAD
+
+A ref refers to an object by one of its names. However, operations
+like git-show-ref convert that name to the default format (see above).
+
+git-gc rewrites ref names to the default format.
+
+
+Remote protocol
+
+During the negotiation, a client needs to specify what names it
+understands, and which it prefers (its default).
+
+When the server is listing its refs, the names are converted to the
+client's preferred format.
+
+
+Equality testing
+
+All software which tests for equality of git objects by checking
+whether their object names are equal needs to obtain a canonical name
+for both objects.
+
+This is going to be quite annoying.
+
+Note that semantically identical trees may (now) have different tree
+objects because those tree objects might contain different object
+names. So tree comparison can no longer be done by comparing
+names; rather an invocation of git diff is needed.
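+
+For object equality (as opposed to tree comparison), a sketch of
+canonicalisation might look like the following. It assumes that the
+SHA-512 name is computed over the same "<type> <length>\0<data>"
+text that git uses today for SHA-1, which this proposal does not
+actually specify; the store dict and the function names are likewise
+hypothetical.
+
```python
import hashlib

def canonical_name(obj_type: str, data: bytes) -> str:
    """Compute an h<hex> name over the git-style object text (assumed)."""
    header = f"{obj_type} {len(data)}\0".encode()
    return "h" + hashlib.sha512(header + data).hexdigest()

def same_object(name_a: str, name_b: str, store: dict) -> bool:
    """Compare two objects that may be named under different hash functions.

    store maps every known name of an object to its (type, data) pair.
    """
    return canonical_name(*store[name_a]) == canonical_name(*store[name_b])
```
+
+This only detects identical object text. Two trees with the same
+content but different subtree names still compare unequal here.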
+
+
+Transition plan
+
+Y0: Implement all of the above. Test it.
+
+ Default configuration:
+ SHA-1 is ENABLED and is default HEAD hash
+
+ SHA-512 is FORBIDDEN in bare repos
+ SHA-512 is ENABLED in trees with working trees
+
+Y5: New projects should start using SHA-512.
+
+ Default configuration change:
+
+ SHA-512 becomes ENABLED in *new* bare repos but remains
+ FORBIDDEN in existing ones
+
+
+
+--
+Ian Jackson <ijackson@chiark.greenend.org.uk> These opinions are my own.
+
+If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
+a private address which bypasses my fierce spamfilter.