-From: Ian Jackson <ijackson@chiark.greenend.org.uk>
-To: ijackson@chiark.greenend.org.uk
Subject: Transition plan for git to move to a new hash function
-Date: Thu, 20 Oct 2016 19:26:44 +0100
-Basic principle: Every object will have two (or more) names,
-corresponding to different hash functions. It may be named by any of
-its names, in every context.
+BASIC PRINCIPLE
+
+We run multiple object name subnamespaces in parallel, one for each
+hash function. Each object lives in exactly one subnamespace.
+Objects with identical content in the different object stores, named
+by different hash functions, are different objects.
+
+Objects may refer to objects living in different subnamespaces (ie,
+named by a different hash function) to their own.
+
+Packfiles need to be extended to be able to contain objects named by
+new hash functions. Blob objects with identical contents but living
+in different subnamespaces would ideally share storage.
Every program that invokes git or speaks git protocols will need to
-understand the extended object name syntax, and understand that
-objects have multiple names.
+understand the extended object name syntax.
Safety catches prevent accidental incorporation into a project of
-objects which contain references by incompatibly-new or
-deprecatedly-old names. This allows for incremental deployment.
+incompatibly-new objects, or additional deprecatedly-old objects.
+This allows for incremental deployment.
TEXTUAL SYNTAX
name (with hash function indicator) is permitted where an old object
name is permitted.
-A single object refers to all the objects it references by the same
-hash function; in general this might be a different hash function to
-the hash function by which this particular object was itself
-referenced or obtained.
+A single object may refer to other objects by its own hash function, or
+by other hash functions. Ie, object references cross subnamespaces.
+During all git operations, subnamespace boundaries in the object graph
+are traversed freely.
-As a further restriction, it is forbidden to refer to a tree object by
-a name other than the hash function it uses to name its subtrees. If
-this seems necessary, the tree object must be recursively rewritten
-instead to use the desired object name.
+Two additional restrictions: a tree object may be referenced only by
+objects in the same subnamespace; and, a tree object may reference
+only blobs in its own subnamespace.
In binary protocols, where a SHA-1 object name in binary form was
previously used, a new codepoint must be allocated in a containing
structure (eg a new typecode). Usually, the new-format binary object
-will have a new typecode and also an additional name hash indicator.
+will have a new typecode and also an additional name hash indicator,
+and it will also need a length field (as new hashes may be of
+different lengths).
Whenever a new hash function textual syntax is defined, corresponding
-binary format codepoint(s) are assigned. (Detailed binary format
-specification is outside the scope of this plan.)
+binary format codepoint(s) are assigned. (Implementation details such
+as the binary format specification are outside the scope of this
+transition plan.)
ORDERING
-Hash functions are partially ordered, from `older' to `newer'.
-
-The ordering is configurable. The default, with the two hash
-functions defined here, is the obvious ordering
- SHA1 ([0-9a-f]*) < BLAKE2b (H*)
+Hash functions are partially ordered, from `worse' to `better'.
+The ordering is configurable. For details of the defaults,
+see _Transition Plan_.
-CHOICE OF OBJECT NAMES
+CHOICE OF SUBNAMESPACE
-Whenever objects are named, it is possible to refer to them by old or
-new names. So git must make a choice, each time: when new objects
-are created; when refs are updated; and when refs are reported over
-network protocols to other instances of git.
+Whenever objects are created, it is necessary to choose the
+subnamespace to use (ie, the hash function).
-Although strictly speaking all objects have both old names and new
-names, and there may be more than two hash functions, it is possible
-to speak, somewhat loosely, about `new objects'.
-
-A `new' object is one which refers to other objects by a `new' name.
-(whatever `new' means).
-
-We call these different hashes `namings'. That is, a `naming' is a
-hash function implemented by git. The `naming IN an object' is the
-naming by which the object refers to other objects (and may not exist,
-if the object has no references); the `name OF an object' is the name
-by which the object itself is specified.
+Each ref may also have a subnamespace hint associated with it.
Commits
-A non-origin commit is made (by default) as new as the newest of
- (i) the naming in each of its parents
- (ii) the specified name of each of its parents
-(Implicitly this normally means that if HEAD uses a new name, new
-commits will be generated.)
+A commit is made (by default) as new as the newest of
+ (i) each of its parents
+ (ii) if applicable, the subnamespace hint for the ref to which the
+ new commit is to be written
-The naming of an origin commit is controlled by a dropping left in
+Implicitly this normally means that if HEAD refers to a new commit,
+further new commits will be generated on top of it.
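The default rule for commits can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `ORDER`, `newest` and `commit_subnamespace` are hypothetical, and the configurable partial ordering is simplified to a total order in which BLAKE is the newer hash.

```python
# Hypothetical sketch: pick the subnamespace (hash function) for a new commit.
# ORDER lists hash functions from worse to better. It is an assumed stand-in
# for the configurable ordering, simplified here to a total order.
ORDER = ["sha1", "blake2b"]


def newest(hashes):
    """Return the hash function that ranks highest in ORDER."""
    return max(hashes, key=ORDER.index)


def commit_subnamespace(parent_hashes, ref_hint=None):
    """Newest of (i) each parent and (ii) the ref's hint, if there is one."""
    candidates = list(parent_hashes)
    if ref_hint is not None:
        candidates.append(ref_hint)
    if not candidates:
        # An origin commit has no parents; the plan takes its subnamespace
        # from the hint left in .git, which this sketch does not model.
        raise ValueError("origin commit: no parents and no ref hint")
    return newest(candidates)
```

With an old parent and no hint, `commit_subnamespace(["sha1"])` stays in the old subnamespace; once the ref carries a hint naming the newer hash, the next commit moves to it.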
+
+The subnamespace of an origin commit is controlled by the hint left in
.git by git checkout --orphan or git init.
-At boundaries between old and new history, a new commit will refer to
-old parents by those old parents' new names.
+At boundaries between old and new history, new commit(s) will refer to
+old parent(s).
Tags
-A new tag is made to use newest naming, for its tagged object, of
- (i) the name by which the tagged object was specified
- (ii) the naming in the tagged object (if applicable)
+A tag is created (by default) in the same subnamespace as the object
+to which it refers.
Trees
-Commits (and sometimes, tags) can refer to tree objects; that tree
-will contain the same naming as the referring object.
+Trees are only referenced by objects in their own subnamespace.
+
+To satisfy this rule, occasionally a tree object from one subnamespace
+must be recursively rewritten into another subnamespace.
-That is, it is a bug to refer to a tree object by other than the hash
-it uses internally to refer to subtrees (and gitlinks). This will
-mean that a tree must sometimes be rewritten (ie, new object names
-recalculated recursively).
+When a tree refers to a commit, it may refer to one in a different
+subnamespace.
Rationale: we want to avoid new commits and tags relying on weak
- hashes.
+ hashes. But we must avoid demanding that commits be rewritten.
Blobs
-Blobs do not refer to other objects so they are neither new or old.
-
-
-Name of newly created object
-
-When git creates a new object, it reports the new object name using
-the naming in the object.
-
-For blobs and empty trees, the caller should normally specify. The
-default is the naming used for HEAD.
-
-
-Updating refs
-
-If a ref is updated with a new object, the name from its creation is
-used (see above).
+Blobs are normally referred to by trees. Trees always refer to blobs
+in the same subnamespace.
-If a ref is updated to a specified object, the naming used in the ref
-is the newer of the specified name, or the naming in the object (if
-any).
+Where a blob is created in other circumstances, the caller should
+specify the subnamespace.
+Ref hints
+As noted above, each ref may also have a subnamespace hint associated
+with it.
+The subnamespace hint is (by default) copied, when the ref value is
+copied. So for example if `git checkout foo' makes refs/heads/foo out
+of refs/remotes/origin/foo, it will copy the subnamespace hint (or
+lack of one) from refs/remotes/origin/foo.
-), or with a specified object name.
+Likewise, the subnamespace hint is conveyed by `git fetch' (by
+default) and can be updated with `git push' (though this is not done
+by default).
+The ref subnamespace hint may be set explicitly. That is how an
+individual branch is upgraded. git checkout --orphan sets it to the
+subnamespace (or hint) of the previous HEAD.
+When a commit is made and stored in a ref, the subnamespace hint for
+that ref is removed iff the commit's subnamespace and the hint's
+subnamespace are the same.
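The hint-removal rule is small enough to write down directly. The function name `hint_after_commit` is hypothetical, and hints are modelled as plain strings (or `None` for no hint), which is an assumption of this sketch rather than part of the plan.

```python
def hint_after_commit(commit_hash, ref_hint):
    """Return the ref's hint after a commit in commit_hash is stored in it."""
    if ref_hint is not None and ref_hint == commit_hash:
        return None  # the hint has been satisfied, so it is removed
    return ref_hint  # otherwise the hint (or lack of one) is left alone
```

So a ref hinted to the new hash loses its hint once the first new-hash commit lands on it; from then on the commit's parents carry the information instead.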
-(If there are different equally new names, one of the newest names is
-chosen according to some stable rule.)
+OBJECT STORE BEHAVIOUR
+The object store has configuration to specify which hash functions are
+enabled. Each hash function H has a combination of the following
+behaviours, according to configuration:
-new
+* Collision behaviour:
-commit. (This may mean converting the tree in hand, since trees are
-supposed to be homgeonous.)
+ What to do if we encounter an object we already have (eg as part of
+ a pack, or with hash-object) but with different contents.
+ (a) fail: print a scary message and abort operation (on the
+ basis that the source of the colliding object probably intended
+ the preimage that they provided, or is conducting an attack).
+ (b) tolerate: prefer our own data; print a message, but treat
+ the reference as referring to our version of the object.
+ In both cases we keep a copy of the second preimage in our .git, for
+ forensic purposes.
-A `new commit' is one which refers to objects by
+ This is used as part of a gradual desupport strategy. Existing
+ history in all existing object stores is safe and cannot be
+ corrupted or modified by receiving colliding objects.
-
-
-
-Object store:
-
-The object store knows which hash functions are enabled. Each hash
-function H has one of the following statuses, which are configured by
-the user:
-
-* ENABLED:
-
- As far as the user is concerned every object in the object store is
- accessible using H. Objects which use H names can be received and
- stored.
-
- This is actually two states, depending on whether any objects exist
- in the store which use these names. If no such objects exist yet,
- we say that the hash function is `ENABLED PROSPECTIVE'. The H names
- for the objects have not yet been calculated.
-
- When the first object which names another object using H is received
- (or, on demand), the object store calculates the H names for all
- existing objects and notes that this hash function is now
- `ENABLED PRESENT'.
-
- If a hash collision is detected, we crash immediately.
-
-* OBSOLESCENT: Every object in the object store has its hash
- calculated using H. However, H is known to possibly have collisions
- which we try to tolerate. When a collision occurs, the object text
- which is currently in the object store is preferred and the "new"
- object is thrown away.
-
- Local creation of new objects with references using H is
- discouraged. Specifically, if another hash function is ENABLED, we
- will use that instead.
-
- This is used as part of a gradual desupport strategy. When the hash
- function is in this stage, existing history in all existing object
- stores is safe and cannot be corrupted or modified by receiving
- colliding objects.
-
- New object stores which receive their data from a trustworthy sender
+ New trees which receive their initial data from a trustworthy sender
over a trustworthy channel will receive correct data. Bad object
stores or untrustworthy channels could exploit collisions, but not
in new regions of the history which are presumably using new names.
Merging previously-unrelated histories does introduce a collision
hazard, but the collision would have had to have been introduced
- while H was still a "live" hash function in at least one of the two
- projects.
-
-* FORBIDDEN: Objects do not have their hashes calculated using this
- hash function. Attempts to reference an object by such a name
- fail. Optionally the user may specify a tolerant mode where:
- a commit which refers to parents by obsolete names is taken to
- simply not have those parents; a commit which refers to a tree by
- an obsolete name is taken to have an empty tree.
+ while the colliding hash function was still a live hash function
+ in at least one of the two projects.
- This is used for two purposes:
- - On a server, we use this to restrict the propagation of
- new hashes so as to enforce our compatibility intentions.
- Ie, hashes which we are "not ready for" are forbidden.
+* Hash function enablement:
- - Everywhere, we use this to get rid of old hash functions.
- It makes access to old history possible but difficult.
+ (a) enabled: this hash function is good and available for use
-* FORGOTTEN: Objects do not have their hashes calculated using this
- hash function. References to objects by all such names return dummy
- objects of the right shape: the empty blob; the empty tree; a root
- commit with an empty tree and dummy metadata.
+ (b) deprecated (in favour of H2): this hash function is
+ available for use, but newly created objects will use another
+ hash function instead (specifically, when creating an object,
+ this hash function is not considered as a candidate; if as a
+ result there are no candidate hash functions, we use the
+ specified replacement H2). Existing refs referring to objects
+ with this hash, with no ref hint, are treated as having a ref
+ hint specifying H2. If no H2 is specified, the "best" hash is
+ used.
- This allows us to finally retire a hash function entirely. We
- effectively throw away all the history which uses H.
-
-During transfer protocols, the receiver will say which hashes it
-thinks are obsolete or forgotten, and the sender will not follow such
-references when computing the set of objects to send. So receivers
-will not receive the objects which were named only by obsolete or
-forgotten names.
-
-
-Naming in newly-generated objects, queries, etc.
-
-There is a `default' hash function, which is that which HEAD uses.
-(That is, HEAD refers to an object by some name. The default hash
-function is that name's hash function.)
-
-git tools produce always output object names in the default hash
-function. (Including git-hash-object.)
-
-As a consequence, newly generated objects will contain object
-references using the `default' hash function.
-
-When HEAD is empty, there is a separate record of the default hash
-function. This comes from a configured default in a new tree. In an
-existing tree, using git checkout --orphan remembers the default hash
-function that HEAD had.
-
-When HEAD is updated to a new commit, the name stored in HEAD uses the
-newer of the previous HEAD hash function and of the hash function used
-in the commit being stored. ("Newer" is a built-in preference order,
-overrideable by configuration.)
-
-This (together with the `forbidden' state, above) ensures that
-switching a project to use a new hash function is a deliberate
-decision: the default hash function needs to be changed to make the
-first commit with the new hash function. After that, provided
-the server accepts it, it's infectious.
-
-
-Naming of refs other than HEAD
-
-A ref refers to an object by one of its names. However, operations
-like git-show-ref convert that name to the default format (see above).
-
-git-gc rewrites ref names to the default format iff that is newer.
+ (c) disabled: existing objects using this hash function can be
+ accessed, but no such objects can be created or received
+ (again, a replacement may be specified). This is used both
+ initially to prevent unintended upgrade, and later to block the
+ introduction of vulnerable data generated by badly configured
+ clients.
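A rough Python model of the per-hash configuration may help make the two behaviours concrete. Everything named here (`HashConfig`, `hash_for_new_object`, `may_receive`, `on_collision`) is hypothetical, the ordering is again simplified to a list from worse to better, and the handling of several deprecated candidates with different replacements is my own guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HashConfig:
    enablement: str = "enabled"        # "enabled", "deprecated" or "disabled"
    replacement: Optional[str] = None  # H2, if one is configured
    collision: str = "fail"            # "fail" or "tolerate"


def hash_for_new_object(candidates, config, order):
    """Choose the hash for a new object; order runs from worse to better."""
    usable = [h for h in candidates if config[h].enablement == "enabled"]
    if usable:
        return max(usable, key=order.index)
    # No candidate survives: use a configured replacement H2, trying the
    # best-ranked candidate first (an assumption), else the "best" hash.
    for h in sorted(candidates, key=order.index, reverse=True):
        if config[h].replacement is not None:
            return config[h].replacement
    return order[-1]


def may_receive(h, config):
    """Disabled hashes can still be read, but not created or received."""
    return config[h].enablement != "disabled"


def on_collision(h, config):
    """Either way the second preimage is kept for forensic purposes."""
    if config[h].collision == "fail":
        raise RuntimeError(f"collision in {h}: aborting operation")
    return "keep-ours"  # tolerate: prefer the data we already have
```

For instance, with SHA deprecated in favour of BLAKE, a commit whose only parent is a SHA commit is created in the BLAKE subnamespace, which is the "infectious" conversion the transition plan relies on.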
Remote protocol
During the negotiation, a receiver needs to specify what hashes it
-understands.
-
-When the sender is listing its refs, the names are converted to a
-hash understood by the client if necessary. If this is not necessary,
-they are left unchanged.
+understands, and whether it is prepared to see only a partial view.
-When a receiver is updating refs, it should by follow the sender's
-idea of a hash change iff it's an upgrade (and the new function is
-ENABLED). That is, if the sender sends name H2 for some ref, and the
-receiver has H1, but these refer to the same object, then the receiver
-should update its own ref name from H1 to H2 iff H2 uses a newer hash
-function.
+When the sender is listing its refs, refs naming objects the receiver
+cannot understand are either elided (if the receiver is content with a
+partial view), or cause an error.
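The sender's side of this can be sketched as a filter over its refs. The function `advertise_refs` and the shape of its arguments are hypothetical; the real wire protocol is outside the scope of this plan.

```python
def advertise_refs(refs, understood, partial_ok):
    """List refs for a receiver.

    refs maps a ref name to a (hash function, object name) pair;
    understood is the set of hash functions the receiver declared;
    partial_ok says whether the receiver accepts a partial view.
    """
    listed = {}
    for name, (hash_fn, obj) in refs.items():
        if hash_fn in understood:
            listed[name] = (hash_fn, obj)
        elif not partial_ok:
            raise ValueError(f"{name}: receiver does not understand {hash_fn}")
        # otherwise the ref is silently elided from the listing
    return listed
```

An old client that asks for a partial view therefore still sees the branches it can process, while one that insists on a complete view gets a clear failure rather than a silently incomplete clone.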
Equality testing
-All software which tests for equality of git objects by checking
-whether their object names are equal needs to obtain a canonical name
-for both objects.
-
-This is going to be quite annoying.
-
-We should provide a convenient utility which tests whether two object
-names refer to the same object.
-
Note that semantically identical trees may (now) have different tree
-objects because those tree objects might contain different object
-names. So (in some contexts at least) tree comparison cannot any
-longer be done by comparing names; rather an invocation of git diff is
-needed, or explicit generation of a tree object with the right name.
+objects because those tree objects might use (and be named by)
+different hashes. So (in some contexts at least) tree comparison
+cannot any longer be done by comparing names; rather an invocation of
+git diff is needed, or explicit generation of a tree object with the
+right hash.
-Transition plan
+TRANSITION PLAN
+
+(For brevity I will write `SHA' for hashing with SHA-1, using current
+unqualified object names, and `BLAKE' for hashing with BLAKE2b, using
+H<hex> object names.)
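To illustrate the two textual forms, here is a small classifier for object names. It assumes lowercase hex digits and accepts abbreviated names of any length; both are assumptions of this sketch, and `hash_function_of` is a hypothetical helper rather than an existing git interface.

```python
import re

# Assumed textual forms: bare lowercase hex for SHA, "H" + hex for BLAKE.
SHA_RE = re.compile(r"[0-9a-f]+")
BLAKE_RE = re.compile(r"H[0-9a-f]+")


def hash_function_of(name):
    """Return "SHA" or "BLAKE" according to the form of the object name."""
    if SHA_RE.fullmatch(name):
        return "SHA"
    if BLAKE_RE.fullmatch(name):
        return "BLAKE"
    raise ValueError(f"unrecognised object name syntax: {name!r}")
```

Because a valid SHA name never contains an uppercase `H`, the two forms cannot be confused, which is what lets the extended name appear wherever an old name was permitted.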
Y0: Implement all of the above. Test it.
Default configuration:
- SHA-1 is ENABLED
- SHA-512 is FORBIDDEN in bare repos
- SHA-512 is ENABLED in trees with working trees
- default HEAD hash is SHA-1
+ SHA is enabled
+ BLAKE is disabled in trees without working trees
+ BLAKE is enabled in trees with working trees
+ SHA > BLAKE
Effects:
- Existing projects will not switch to SHA-512 willy-nilly.
- New projects will still use SHA-1.
-
- Incompatible new-style commits cannot be pushed without server
- admin effort (or until future upgrade).
+ Clients are prepared to process BLAKE data, but it is not
+ generated by default and cannot be pushed to servers.
- So all old git clients still work.
+ All old git clients still work.
-Y4: SHA-512 by default for new projects.
+Y4: BLAKE by default for new projects.
Conversion enabled for existing projects.
- Old git software is now pretty firmly deprecated.
+ Old git software is going to start rotting.
Default configuration change:
+ BLAKE > SHA
+ BLAKE enabled (even in trees without working trees)
- When creating a new bare tree, a configuration dropping is left
- (in `config') which specifies that SHA-1 is OBSOLESCENT
-
- Default status for SHA-512 is FORBIDDEN if SHA-1 is ENABLED,
- or ENABLED if SHA-1 is OBSOLESCENT.
-
- default HEAD hash is newest ENABLED hash.
+ Suggested bulk hosting site configuration change:
+ Newly created projects should get BLAKE enabled
+ Existing projects should retain BLAKE disabled by default
+ Button should be provided to start conversion (see below)
Effects:
- When creating a new working tree, it starts using SHA-512.
- A new server tree will accept SHA-512.
+ When creating a new working tree, it starts using BLAKE.
- Existing server trees do not yet accept SHA-512. They publish
- their SHA-1 hashes, so clients make commits with SHA-1.
+ Servers which have been updated will accept BLAKE.
- To convert a project, an administrator would set SHA-1 to
- OBSOLESCENT on the server. All clones after that will have HEAD
- with a SHA-512 name. Fetches and pulls will update to SHA-512
- names.
+ Servers which have not been updated to Y4's git will need a small
+ configuration change (enabling BLAKE) to cope with the new
+ projects that are using BLAKE.
-will , and push one SHA-512 commit to
- mainline.
+ To convert a project, an administrator (or project owner) would
+ set BLAKE to enabled, and SHA to deprecated, on the server. On
+ the next pull the server will provide ref hints naming BLAKE,
+ which will get copied to the user's HEAD. So the user is infected
+ with BLAKE.
+ To convert a project branch-by-branch, the administrator would set
+ BLAKE to enabled but leave SHA enabled. Then each branch retains
+ its own hash. A branch can be converted by pushing a BLAKE commit
+ to it, or by setting a ref hint on the server.
+Y6: BLAKE by default for all projects
+ Existing projects start being converted infectiously.
+ It is hard for a project to stop this happening if any of
+ their servers are updated.
+ Old git software is firmly stuffed.
- Default configuration change:
+ Default configuration change
+ SHA deprecated in trees without working trees
Effects:
- When creating a new tree with working tree with git init (ie, no
- HEAD), the default HEAD hash is set to SHA-512 (because SHA-1 is
- OBSOLESCENT in a new tree and therefore SHA-512 is the only
- ENABLED hash and is the default).
+ Existing projects are, by default, `converted', as described
+ above.
- Newly minted server trees accept SHA-512.
+Y8: Clients hate SHA
+ Clients insist on trying to convert existing projects
+ It is very hard to stop this happening.
+ Unrepentant servers start being very hard to use.
+ Default configuration change
+ SHA deprecated (even in trees without working trees)
- start using SHA-512 by default.
+ Effects:
-Y6: Existing projects start being converted infectiously.
- It is hard to stop this happening.
- Old git software is firmly stuffed.
+ Clients will generate only BLAKE. Hopefully their server will
+ accept this!
- Default configuration change:
- SHA-1 is OBSOLESCENT
- (default for SHA-512, and HEAD hash, computed as in Y4)
+Y10: Stop accepting new SHA
+ No-one can manage to make new SHA commits
- Result is that by default all software
+ Default configuration change
+ SHA disabled in new trees, except during initial
+ `clone', `mirror' and similar
- (Projects which do not want to convert need to set SHA-1 to
- ENABLED, explicitly, on their
-
-Y6: Existing projects start using SHA-512.
+ Effects:
- Default configuration change:
- SHA-512 is ENABLED
- SHA-1 is OBSOLESCENT
- (default default HEAD hash is already SHA-512)
+ Existing SHA history is retained, and copied to new clients and
+ servers. But established clients and servers reject any newly
+ introduced SHA.
- In existing repositories where no special action
--