-From: Ian Jackson <ijackson@chiark.greenend.org.uk>
-To: ijackson@chiark.greenend.org.uk
Subject: Transition plan for git to move to a new hash function
-Date: Thu, 20 Oct 2016 19:26:44 +0100
-Basic principle: Every object will have two (or more) names,
-corresponding to different hash functions. It may be named by any of
-its names, in every context.
+BASIC PRINCIPLE
+
+We run multiple object name subnamespaces in parallel, one for each
+hash function. Each object lives in exactly one subnamespace.
+Objects with identical content in the different object stores, named
+by different hash functions, are different objects.
+
+Objects may refer to objects living in different subnamespaces (ie,
+named by a different hash function) to their own.
+
+Packfiles need to be extended to be able to contain objects named by
+new hash functions. Blob objects with identical contents but living
+in different subnamespaces would ideally share storage.
Every program that invokes git or speaks git protocols will need to
-understand the extended object name syntax, and understand that
-objects have multiple names.
+understand the extended object name syntax.
Safety catches prevent accidental incorporation into a project of
-objects which contain references by incompatibly-new or
-deprecatedly-old names. This allows for incremental deployment.
+incompatibly-new objects, or additional deprecatedly-old objects.
+This allows for incremental deployment.
+
+
+TEXTUAL SYNTAX
+
+The object name textual syntax is extended as follows:
+
+We declare that the object name syntax is henceforth
+ [A-Z]+[0-9a-z]+ | [0-9a-f]+
+and that names [A-Z].* are deprecated as ref name components.
+
+ Rationale:
+
+ Full backwards compatibility is impossible, because the hash
+ function needs to be evident in the name, so the new names
+ must be disjoint from all old SHA-1 names.
+
+ We want a short but extensible syntax. The syntax should impose
+ minimal extra requirements on existing git users. In most
+ contexts where existing git users use hashes, ASCII alphanumeric
+ object names will fit. Use of punctuation such as : or even _
+ may give trouble to existing users, who are already using
+ such things as delimiters.
+
+ In existing deployments, refnames that differ only in case are
+ generally avoided (because they are troublesome on
+ case-insensitive filesystems). And conventionally refnames are
+ lower case. So names starting with an upper case letter will be
+ disjoint from most existing ref name components.
+
+ Even though we probably want to keep using hex, it is a good
+ idea to reserve the flexibility to use a more compact encoding,
+ while not excessively widening the existing permissible
+ character set.
+
+Object names using SHA-1 are represented, in text, as at present.
+
+Object names starting with uppercase ASCII letters H or later refer to
+new hash functions. (Programs that use `g<objectname>' should ideally
+be changed to show `H<hash>' for hash function `H' rather than
+`gH<hash>'.)
+
+ Rationale:
+
+ Object names starting with A-F might look like hex. G is
+ reserved because of the way that many programs write
+ `g<objectname>'.
+ This gives us 19 new hash function values until we have to
+ start using two-letter hash function prefixes, or decide to
+ use A-F after all.
-Syntax:
+(Truncated object names work as they do at the moment.)
-The object name syntax is extended as follows: object names using sha1
-are as current. Object names starting with lowercase ASCII letters h
-or later refer to new hash functions. (`g' is reserved because of the
-way that many programs write `g<objectname>'. Programs that use
-`g<objectname>' should be changed to show `h<hash>' for hash function
-`h' rather than `gh<hash>'.)
+Initially we define and assign one new hash function (and textual
+object name encoding):
-Object names h<hex> are SHA-512 hashes. Remaining letters are
-reserved. `x' `y' `z' are reserved for private experiments; we
-declare that public releases of git will never accept such names.
+ H<hex> where <hex> is the BLAKE2b hash of the object
+ (in lowercase)
+
+We also reserve the following syntax for private experiments:
+ E[A-Z]+[0-9a-z]+
+We declare that public releases of git will never accept such
+object names.
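
As a rough illustration of the syntax above, the following Python
sketch classifies a textual object name by its prefix. The function
name classify_object_name and the category labels it returns are made
up for this example; they are not part of this proposal or of git.

```python
import re

# Overall grammar from the text: [A-Z]+[0-9a-z]+ | [0-9a-f]+
_NAME_RE = re.compile(r"([A-Z]*)([0-9a-z]+)")
_HEX_RE = re.compile(r"[0-9a-f]+")


def classify_object_name(name):
    """Return a label for the hash function implied by `name`.

    The labels ("sha1", "blake2b", "private-experiment", "unassigned",
    "invalid") are hypothetical names chosen for this sketch.
    """
    m = _NAME_RE.fullmatch(name)
    if m is None:
        return "invalid"
    prefix, body = m.groups()
    if prefix == "":
        # Unprefixed names are SHA-1 (possibly truncated) and must be hex.
        return "sha1" if _HEX_RE.fullmatch(body) else "invalid"
    if prefix == "H":
        # H<hex>: BLAKE2b, written in lowercase hex.
        return "blake2b" if _HEX_RE.fullmatch(body) else "invalid"
    if prefix.startswith("E") and len(prefix) >= 2:
        # E[A-Z]+[0-9a-z]+ is reserved for private experiments.
        return "private-experiment"
    # Syntactically valid, but no hash function is assigned to this prefix.
    return "unassigned"
```

For example, classify_object_name("Hdeadbeef") returns "blake2b",
while "g1234" is reported as "invalid" because it is neither plain hex
nor an uppercase-prefixed name.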
Everywhere in the git object formats and git protocols, a new object
name (with hash function indicator) is permitted where an old object
-name is permitted. A single object refers to all the objects it
-references by the same hash function; in general this might be a
-different hash function to the hash function by which this particular
-object was itself referenced or obtained.
+name is permitted.
+
+A single object may refer to other objects by its own hash function, or
+by other hash functions. Ie, object references cross subnamespaces.
+During all git operations, subnamespace boundaries in the object graph
+are traversed freely.
-As an exception, it is forbidden to refer to a tree object by a name
-other than the hash function it uses to name its subtrees. If this
-seems necessary, the tree object must be recursively rewritten instead
-to use the desired object name.
+Two additional restrictions: a tree object may be referenced only by
+objects in the same subnamespace; and, a tree object may reference
+blobs only in its own subnamespace.
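
The reference rules above can be sketched as a single predicate. This
is only an illustration: the function name reference_allowed and the
string labels for object types and subnamespaces are assumptions made
for the example.

```python
def reference_allowed(src_type, src_ns, dst_type, dst_ns):
    """Check whether an object may refer to another object.

    Types are "commit", "tag", "tree" or "blob"; subnamespaces are
    arbitrary labels such as "SHA" or "BLAKE" (hypothetical here).
    """
    if dst_type == "tree" and src_ns != dst_ns:
        # A tree may be referenced only from its own subnamespace.
        return False
    if src_type == "tree" and dst_type == "blob" and src_ns != dst_ns:
        # A tree may reference only blobs in its own subnamespace.
        return False
    # All other references may cross subnamespace boundaries.
    return True
```

So a BLAKE commit may have a SHA parent, but may not point directly at
a SHA tree; that tree would have to be rewritten first.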
In binary protocols, where a SHA-1 object name in binary form was
previously used, a new codepoint must be allocated in a containing
structure (eg a new typecode). Usually, the new-format binary object
-will have a new typecode and also an additional name hash indicator.
-15 of the hash indicator values correspond to the lowercase letters
-reserved above.
+will have a new typecode and also an additional name hash indicator,
+and it will also need a length field (as new hashes may be of
+different lengths).
+Whenever a new hash function textual syntax is defined, corresponding
+binary format codepoint(s) are assigned. (Implementation details such
+as the binary format specification are outside the scope of this
+transition plan.)
-Object store:
-The object store knows which hash functions are enabled. Each hash
-function H has one of the following statuses, which are configured by
-the user:
+ORDERING
-* ENABLED:
+Hash functions are partially ordered, from `worse' to `better'.
+The ordering is configurable. For details of the defaults,
+see _Transition Plan_.
- As far as the user is concerned every object in the object store is
- accessible using H. Objects which use H names can be received and
- stored.
- This is actually two states, depending on whether any objects exist
- in the store which use these names. If no such objects exist yet,
- we say that the hash function is `ENABLED PROSPECTIVE'. The H names
- for the objects have not yet been calculated.
+CHOICE OF SUBNAMESPACE
- When the first object which names another object using H is received
- (or, on demand), the object store calculates the H names for all
- existing objects and notes that this hash function is now
- `ENABLED PRESENT'.
+Whenever objects are created, it is necessary to choose the
+subnamespace to use (ie, the hash function).
- If a hash collision is detected, we crash immediately.
+Each ref may also have a subnamespace hint associated with it.
-* OBSOLESCENT: Every object in the object store has its hash
- calculated using H. However, H is known to possibly have collisions
- which we try to tolerate. When a collision occurs, the object text
- which is currently in the object store is preferred and the "new"
- object is thrown away.
- Local creation of new objects with references using H is
- discouraged. Specifically, if another hash function is ENABLED, we
- will use that instead.
+Commits
- This is used as part of a gradual desupport strategy. When the hash
- function is in this stage, existing history in all existing object
- stores is safe and cannot be corrupted or modified by receiving
- colliding objects.
+A commit is made (by default) as new as the newest of
+ (i) each of its parents
+ (ii) if applicable, the subnamespace hint for the ref to which the
+ new commit is to be written
- New object stores which receive their data from a trustworthy sender
- over a trustworthy channel will receive correct data. Bad object
- stores or untrustworthy channels could exploit collisions, but not
- in new regions of the history which are presumably using new names.
- So the collisons can only affect archaeology.
+Implicitly this normally means that if HEAD refers to a new commit,
+further new commits will be generated on top of it.
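
A minimal sketch of this default, under simplifying assumptions: the
ordering is treated as a total order given as a list from worse to
better (the text only requires a partial order), and deprecation of
hash functions is ignored. The name choose_commit_subnamespace and its
parameters are invented for the example.

```python
def choose_commit_subnamespace(parent_subnamespaces, ref_hint, order):
    """Pick the subnamespace for a new commit.

    `order` lists hash functions from worse to better, for example
    ["SHA", "BLAKE"]. `ref_hint` is the hint for the target ref, or
    None if there is no applicable hint.
    """
    candidates = list(parent_subnamespaces)
    if ref_hint is not None:
        candidates.append(ref_hint)
    if not candidates:
        # An origin commit with no hint; the text leaves this to the
        # hint written by git init or git checkout --orphan.
        raise ValueError("no parents and no hint to choose from")
    # "As new as the newest": take the candidate latest in the order.
    return max(candidates, key=order.index)
```

With order ["SHA", "BLAKE"], a commit whose only parent is SHA stays
SHA unless the ref carries a BLAKE hint, in which case it becomes
BLAKE.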
- Merging previously-unrelated histories does introduce a collision
- hazard, but the collision would have had to have been introduced
- while H was still a "live" hash function in at least one of the two
- projects.
+The subnamespace of an origin commit is controlled by the hint left in
+.git by git checkout --orphan or git init.
-* FORBIDDEN: Objects do not have their hashes calculated using this
- hash function. Attempts to reference an object by such a name
- fail. Optionally the user may specify a tolerant mode where:
- a commit which refers to parents by obsolete names is taken to
- simply not have those parents; a commit which refers to a tree by
- an obsolete name is taken to have an empty tree.
+At boundaries between old and new history, new commit(s) will refer to
+old parent(s).
- This is used for two purposes:
- - On a server, we use this to restrict the propagation of
- new hashes so as to enforce our compatibility intentions.
- Ie, hashes which we are "not ready for" are forbidden.
+Tags
- - Everywhere, we use this to get rid of old hash functions.
- It makes access to old history possible but difficult.
+A tag is created (by default) in the same subnamespace as the object
+to which it refers.
-* FORGOTTEN: Objects do not have their hashes calculated using this
- hash function. References to objects by all such names return dummy
- objects of the right shape: the empty blob; the empty tree; a root
- commit with an empty tree and dummy metadata.
- This allows us to finally retire a hash function entirely. We
- effectively throw away all the history which uses H.
+Trees
-During transfer protocols, the receiver will say which hashes it
-thinks are obsolete or forgotten, and the sender will not follow such
-references when computing the set of objects to send. So receivers
-will not receive the objects which were named only by obsolete or
-forgotten names.
+Trees are only referenced by objects in their own subnamespace.
+To satisfy this rule, occasionally a tree object from one subnamespace
+must be recursively rewritten into another subnamespace.
-Naming in newly-generated objects, queries, etc.
+When a tree refers to a commit, it may refer to one in a different
+subnamespace.
-There is a `default' hash function, which is that which HEAD uses.
-(That is, HEAD refers to an object by some name. The default hash
-function is that name's hash function.)
+ Rationale: we want to avoid new commits and tags relying on weak
+ hashes. But we must avoid demanding that commits be rewritten.
-git tools produce always output object names in the default hash
-function. (Including git-hash-object.)
-As a consequence, newly generated objects will contain object
-references using the `default' hash function.
+Blobs
-When HEAD is empty, there is a separate record of the default hash
-function. This comes from a configured default in a new tree. In an
-existing tree, using git checkout --orphan remembers the default hash
-function that HEAD had.
+Blobs are normally referred to by trees. Trees always refer to blobs
+in the same subnamespace.
-When HEAD is updated to a new commit, the name stored in HEAD uses the
-newer of the previous HEAD hash function and of the hash function used
-in the commit being stored. ("Newer" is a built-in preference order,
-overrideable by configuration.)
+Where a blob is created in other circumstances, the caller should
+specify the subnamespace.
-This (together with the `forbidden' state, above) ensures that
-switching a project to use a new hash function is a deliberate
-decision: the default hash function needs to be changed to make the
-first commit with the new hash function. After that, provided
-the server accepts it, it's infectious.
+Ref hints
-Naming of refs other than HEAD
+As noted above, each ref may also have a subnamespace hint associated
+with it.
-A ref refers to an object by one of its names. However, operations
-like git-show-ref convert that name to the default format (see above).
+The subnamespace hint is (by default) copied, when the ref value is
+copied. So for example if `git checkout foo' makes refs/heads/foo out
+of refs/remotes/origin/foo, it will copy the subnamespace hint (or
+lack of one) from refs/remotes/origin/foo.
-git-gc rewrites ref names to the default format iff that is newer.
+Likewise, the subnamespace hint is conveyed by `git fetch' (by
+default) and can be updated with `git push' (though this is not done
+by default).
+The ref subnamespace hint may be set explicitly. That is how an
+individual branch is upgraded. git checkout --orphan sets it to the
+subnamespace (or hint) of the previous HEAD.
-Remote protocol
+When a commit is made and stored in a ref, the subnamespace hint for
+that ref is removed iff the commit's subnamespace and the hint's
+subnamespace are the same.
-During the negotation, a receiver needs to specify what hashes it
-understands.
-When the sender is listing its refs, the names are converted to a
-hash understood by the client if necessary. If this is not necessary,
-they are left unchanged.
+OBJECT STORE BEHAVIOUR
-When a receiver is updating refs, it should by follow the sender's
-idea of a hash change iff it's an upgrade (and the new function is
-ENABLED). That is, if the sender sends name H2 for some ref, and the
-receiver has H1, but these refer to the same object, then the receiver
-should update its own ref name from H1 to H2 iff H2 uses a newer hash
-function.
+The object store has configuration to specify which hash functions are
+enabled. Each hash function H has a combination of the following
+behaviours, according to configuration:
+* Collision behaviour:
-Equality testing
+ What to do if we encounter an object we already have (eg as part of
+ a pack, or with hash-object) but with different contents.
+
+ (a) fail: print a scary message and abort operation (on the
+ basis that the source of the colliding object probably intended
+ the preimage that they provided, or is conducting an attack).
+
+ (b) tolerate: prefer our own data; print a message, but treat
+ the reference as referring to our version of the object.
+
+ In both cases we keep a copy of the second preimage in our .git, for
+ forensic purposes.
+
+ This is used as part of a gradual desupport strategy. Existing
+ history in all existing object stores is safe and cannot be
+ corrupted or modified by receiving colliding objects.
+
+ New trees which receive their initial data from a trustworthy sender
+ over a trustworthy channel will receive correct data. Bad object
+ stores or untrustworthy channels could exploit collisions, but not
+ in new regions of the history which are presumably using new names.
+ So the collisions can only affect archaeology.
+
+ Merging previously-unrelated histories does introduce a collision
+ hazard, but the collision would have had to have been introduced
+ while the colliding hash function was still a live hash function
+ in at least one of the two projects.
-All software which tests for equality of git objects by checking
-whether their object names are equal needs to obtain a canonical name
-for both objects.
-This is going to be quite annoying.
+* Hash function enablement:
+
+ (a) enabled: this hash function is good and available for use
+
+ (b) deprecated (in favour of H2): this hash function is
+ available for use, but newly created objects will use another
+ hash function instead (specifically, when creating an object,
+ this hash function is not considered as a candidate; if as a
+ result there are no candidate hash functions, we use the
+ specified replacement H2). Existing refs referring to objects
+ with this hash, with no ref hint, are treated as having a ref
+ hint specifying H2. If no H2 is specified, the newest
+ (`best') hash is used.
+
+ (c) disabled: existing objects using this hash function can be
+ accessed, but no such objects can be created or received
+ (again, a replacement may be specified). This is used both
+ initially to prevent unintended upgrade, and later to block the
+ introduction of vulnerable data generated by badly configured
+ clients.
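
The two collision behaviours in the first bullet above could look
roughly like this toy in-memory store. Everything here (the class
names ObjectStore and CollisionError, the forensics list, the message
text) is an assumption made for illustration, not a description of
git's object store.

```python
class CollisionError(Exception):
    """Raised when a colliding object arrives under the `fail' policy."""


class ObjectStore:
    """Toy in-memory store mapping object names to contents."""

    def __init__(self, policy):
        if policy not in ("fail", "tolerate"):
            raise ValueError("unknown collision policy: %r" % (policy,))
        self.policy = policy
        self.objects = {}    # name -> contents we already hold
        self.forensics = []  # (name, colliding contents), kept for inspection

    def add(self, name, contents):
        """Store `contents` under `name`; return the contents now in use."""
        if name not in self.objects:
            self.objects[name] = contents
            return contents
        existing = self.objects[name]
        if existing == contents:
            return existing
        # Same name, different contents: keep the second preimage
        # in both cases, for forensic purposes.
        self.forensics.append((name, contents))
        if self.policy == "fail":
            raise CollisionError("collision on object %s" % name)
        # tolerate: prefer our own data and carry on.
        print("warning: collision on object %s; keeping our data" % name)
        return existing
```

Under `tolerate' the caller silently gets the version already in the
store; under `fail' the operation aborts, but the colliding data is
still recorded.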
+
+
+Remote protocol
+
+During the negotiation, a receiver needs to specify what hashes it
+understands, and whether it is prepared to see only a partial view.
-We should provide a convenient utility which tests whether two object
-names refer to the same object.
+When the sender is listing its refs, refs naming objects the receiver
+cannot understand are either elided (if the receiver is content with a
+partial view), or cause an error.
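
A sketch of the sender-side filtering just described. The function
name refs_to_advertise and the shape of its arguments are invented for
this example; the real negotiation format is outside the scope of this
plan.

```python
def refs_to_advertise(refs, understood, partial_ok):
    """Filter a sender's refs for one particular receiver.

    `refs` maps ref names to (subnamespace, object name) pairs.
    `understood` is the set of subnamespaces the receiver declared.
    `partial_ok` says whether the receiver accepts a partial view.
    """
    visible = {}
    for refname, (subnamespace, objname) in refs.items():
        if subnamespace in understood:
            visible[refname] = objname
        elif not partial_ok:
            # The receiver wanted everything but cannot read this ref.
            raise ValueError("receiver cannot understand %s" % refname)
        # Otherwise the ref is simply elided from the listing.
    return visible
```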
+
+
+Equality testing
Note that semantically identical trees may (now) have different tree
-objects because those tree objects might contain different object
-names. So (in some contexts at least) tree comparison cannot any
-longer be done by comparing names; rather an invocation of git diff is
-needed, or explicit generation of a tree object with the right name.
+objects because those tree objects might use (and be named by)
+different hashes. So (in some contexts at least) tree comparison
+cannot any longer be done by comparing names; rather an invocation of
+git diff is needed, or explicit generation of a tree object with the
+right hash.
+
+TRANSITION PLAN
-Transition plan
+(For brevity I will write `SHA' for hashing with SHA-1, using current
+unqualified object names, and `BLAKE' for hashing with BLAKE2b, using
+H<hex> object names.)
Y0: Implement all of the above. Test it.
Default configuration:
- SHA-1 is ENABLED
- SHA-512 is FORBIDDEN in bare repos
- SHA-512 is ENABLED in trees with working trees
- default HEAD hash is SHA-1
+ SHA is enabled
+ BLAKE is disabled in trees without working trees
+ BLAKE is enabled in trees with working trees
+ SHA > BLAKE
Effects:
- Existing projects will not switch to SHA-512 willy-nilly.
- New projects will still use SHA-1.
+ Clients are prepared to process BLAKE data, but it is not
+ generated by default and cannot be pushed to servers.
- Incompatible new-style commits cannot be pushed without server
- admin effort (or until future upgrade).
+ All old git clients still work.
- So all old git clients still work.
-
-Y4: SHA-512 by default for new projects.
+Y4: BLAKE by default for new projects.
Conversion enabled for existing projects.
- Old git software is now pretty firmly deprecated.
+ Old git software is going to start rotting.
Default configuration change:
+ BLAKE > SHA
+ BLAKE enabled (even in trees without working trees)
- When creating a new bare tree, a configuration dropping is left
- (in `config') which specifies that SHA-1 is OBSOLESCENT
-
- Default status for SHA-512 is FORBIDDEN if SHA-1 is ENABLED,
- or ENABLED if SHA-1 is OBSOLESCENT.
-
- default HEAD hash is newest ENABLED hash.
+ Suggested bulk hosting site configuration change:
+ Newly created projects should get BLAKE enabled
+ Existing projects should retain BLAKE disabled by default
+ Button should be provided to start conversion (see below)
Effects:
- When creating a new working tree, it starts using SHA-512.
- A new server tree will accept SHA-512.
+ When creating a new working tree, it starts using BLAKE.
- Existing server trees do not yet accept SHA-512. They publish
- their SHA-1 hashes, so clients make commits with SHA-1.
+ Servers which have been updated will accept BLAKE.
- To convert a project, an administrator would set SHA-1 to
- OBSOLESCENT on the server. All clones after that will have HEAD
- with a SHA-512 name. Fetches and pulls will update to SHA-512
- names.
+ Servers which have not been updated to Y4's git will need a small
+ configuration change (enabling BLAKE) to cope with the new
+ projects that are using BLAKE.
-will , and push one SHA-512 commit to
- mainline.
+ To convert a project, an administrator (or project owner) would
+ set BLAKE to enabled, and SHA to deprecated, on the server. On
+ the next pull the server will provide ref hints naming BLAKE,
+ which will get copied to the user's HEAD. So the user is infected
+ with BLAKE.
+ To convert a project branch-by-branch, the administrator would set
+ BLAKE to enabled but leave SHA enabled. Then each branch retains
+ its own hash. A branch can be converted by pushing a BLAKE commit
+ to it, or by setting a ref hint on the server.
+Y6: BLAKE by default for all projects
+ Existing projects start being converted infectiously.
+ It is hard for a project to stop this happening if any of
+ their servers are updated.
+ Old git software is firmly stuffed.
- Default configuration change:
+ Default configuration change:
+ SHA deprecated in trees without working trees
Effects:
- When creating a new tree with working tree with git init (ie, no
- HEAD), the default HEAD hash is set to SHA-512 (because SHA-1 is
- OBSOLESCENT in a new tree and therefore SHA-512 is the only
- ENABLED hash and is the default).
+ Existing projects are, by default, `converted', as described
+ above.
- Newly minted server trees accept SHA-512.
+Y8: Clients hate SHA
+ Clients insist on trying to convert existing projects
+ It is very hard to stop this happening.
+ Unrepentant servers start being very hard to use.
+ Default configuration change:
+ SHA deprecated (even in trees without working trees)
- start using SHA-512 by default.
+ Effects:
-Y6: Existing projects start being converted infectiously.
- It is hard to stop this happening.
- Old git software is firmly stuffed.
+ Clients will generate only BLAKE. Hopefully their server will
+ accept this!
- Default configuration change:
- SHA-1 is OBSOLESCENT
- (default for SHA-512, and HEAD hash, computed as in Y4)
+Y10: Stop accepting new SHA
+ No-one can manage to make new SHA commits
- Result is that by default all software
+ Default configuration change:
+ SHA disabled in new trees, except during initial
+ `clone', `mirror' and similar
- (Projects which do not want to convert need to set SHA-1 to
- ENABLED, explicitly, on their
-
-Y6: Existing projects start using SHA-512.
+ Effects:
- Default configuration change:
- SHA-512 is ENABLED
- SHA-1 is OBSOLESCENT
- (default default HEAD hash is already SHA-512)
+ Existing SHA history is retained, and copied to new clients and
+ servers. But established clients and servers reject any newly
+ introduced SHA.
- In existing repositories where no special action
--