Subject: Transition plan for git to move to a new hash function


BASIC PRINCIPLE

We run multiple object name subnamespaces in parallel, one for each
hash function.

Each object lives in exactly one subnamespace. Objects with identical
content in the different object stores, named by different hash
functions, are different objects.

Objects may refer to objects living in subnamespaces other than their
own (ie, named by a different hash function).

Packfiles need to be extended to be able to contain objects named by
new hash functions. Blob objects with identical contents but living
in different subnamespaces would ideally share storage.

Every program that invokes git or speaks git protocols will need to
understand the extended object name syntax.

Safety catches prevent accidental incorporation into a project of
incompatibly-new objects, or additional deprecatedly-old objects.
This allows for incremental deployment.


TEXTUAL SYNTAX

The object name textual syntax is extended as follows:

We declare that the object name syntax is henceforth
    [A-Z]+[0-9a-z]+ | [0-9a-f]+
and that names [A-Z].* are deprecated as ref name components.

Rationale:

  Full backwards compatibility is impossible, because the hash
  function needs to be evident in the name, so the new names must be
  disjoint from all old SHA-1 names.

  We want a short but extensible syntax.

  The syntax should impose minimal extra requirements on existing git
  users. In most contexts where existing git users use hashes, ASCII
  alphanumeric object names will fit. Use of punctuation such as : or
  even _ may give trouble to existing users, who are already using
  such things as delimiters.

  In existing deployments, refnames that differ only in case are
  generally avoided (because they are troublesome on case-insensitive
  filesystems). And conventionally refnames are lower case. So names
  starting with an upper case letter will be disjoint from most
  existing ref name components.

  Even though we probably want to keep using hex, it is a good idea
  to reserve the flexibility to use a more compact encoding, while
  not excessively widening the existing permissible character set.

Object names using SHA-1 are represented, in text, as at present.

Object names starting with uppercase ASCII letters H or later refer
to new hash functions. (Programs that write a `g' prefix before an
object name should ideally be changed to show `H<hex>' for hash
function `H', rather than `gH<hex>'.)

Rationale:

  Object names starting with A-F might look like hex. G is reserved
  because of the way that many programs write `g' before an object
  name.

  This gives us 19 new hash function values (H to Z) until we have to
  start using two-letter hash function prefixes, or decide to use A-F
  after all.

(Truncated object names work as they do at the moment.)

Initially we define and assign one new hash function (and textual
object name encoding):

    H<hex>    where <hex> is the BLAKE2b hash of the object
              (in lowercase)

We also reserve the following syntax for private experiments:
    E[A-Z]+[0-9a-z]+
We declare that public releases of git will never accept such object
names.

Everywhere in the git object formats and git protocols, a new object
name (with hash function indicator) is permitted where an old object
name is permitted.

A single object may refer to other objects by its own hash function,
or by other hash functions. Ie, object references cross
subnamespaces. During all git operations, subnamespace boundaries in
the object graph are traversed freely.

Two additional restrictions: a tree object may be referenced only by
objects in the same subnamespace; and, a tree object may reference
only blobs in its own subnamespace.

In binary protocols, where a SHA-1 object name in binary form was
previously used, a new codepoint must be allocated in a containing
structure (eg a new typecode). Usually, the new-format binary object
will have a new typecode and also an additional name hash indicator,
and it will also need a length field (as new hashes may be of
different lengths).

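As an illustration only, the textual syntax described above could be
recognised roughly as follows. This is a minimal sketch in Python, not
a proposed implementation: the function name classify_object_name and
the table ASSIGNED are hypothetical, and it assumes that the part
after `H' is lowercase hex.

```python
import re

# Grammar from the text: [A-Z]+[0-9a-z]+ | [0-9a-f]+
OLD_NAME = re.compile(r"[0-9a-f]+")
NEW_NAME = re.compile(r"([A-Z]+)([0-9a-z]+)")

# Only `H' is assigned so far.  (Hypothetical table for this sketch.)
ASSIGNED = {"H": "BLAKE2b"}


def classify_object_name(name):
    """Return (hash function, digest text), or raise ValueError."""
    # Plain lowercase hex, possibly truncated, is a SHA-1 name as at present.
    if OLD_NAME.fullmatch(name):
        return ("SHA-1", name)
    match = NEW_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not an object name: {name!r}")
    prefix, digest = match.groups()
    # E[A-Z]+ is reserved for private experiments and never accepted.
    if len(prefix) >= 2 and prefix.startswith("E"):
        raise ValueError(f"private experimental name: {name!r}")
    if prefix not in ASSIGNED:
        raise ValueError(f"unassigned hash function prefix: {prefix!r}")
    # Assumption of this sketch: `H' names carry lowercase hex.
    if not OLD_NAME.fullmatch(digest):
        raise ValueError(f"digest is not lowercase hex: {digest!r}")
    return (ASSIGNED[prefix], digest)
```

For instance, classify_object_name("H0123abcd") yields
("BLAKE2b", "0123abcd"), while names starting with `G', or with `E'
followed by further capitals, are rejected.
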
Whenever a new hash function textual syntax is defined, corresponding
binary format codepoint(s) are assigned. (Implementation details such
as the binary format specification are outside the scope of this
transition plan.)


ORDERING

Hash functions are partially ordered, from `older' to `newer'. The
ordering is configurable. The default, with the two hash functions
defined here, is the obvious ordering
    SHA1 ([0-9a-f]*) < BLAKE2b (H*)


CHOICE OF SUBNAMESPACE

Whenever objects are created, it is necessary to choose the
subnamespace to use (ie, the hash function).

Each ref may also have a subnamespace hint associated with it.

Commits

  A commit is made (by default) as new as the newest of
    (i) each of its parents
    (ii) if applicable, the subnamespace hint for the ref to which
         the new commit is to be written

  Implicitly this normally means that if HEAD refers to a new commit,
  further new commits will be generated on top of it.

  The subnamespace of an origin commit is controlled by the hint left
  in .git by git checkout --orphan or git init.

  At boundaries between old and new history, new commit(s) will refer
  to old parent(s).

Tags

  A tag is created (by default) in the same subnamespace as the
  object to which it refers.

Trees

  Trees are only referenced by objects in their own subnamespace.

  To satisfy this rule, occasionally a tree object from one
  subnamespace must be recursively rewritten into another
  subnamespace.

  When a tree refers to a commit, it may refer to one in a different
  subnamespace.

  Rationale: we want to avoid new commits and tags relying on weak
  hashes. But we must avoid demanding that commits be rewritten.

Blobs

  Blobs are normally referred to by trees. Trees always refer to
  blobs in the same subnamespace.

  Where a blob is created in other circumstances, the caller should
  specify the subnamespace.

Ref hints

  As noted above, each ref may also have a subnamespace hint
  associated with it.

  The subnamespace hint is (by default) copied when the ref value is
  copied.
  So for example if `git checkout foo' makes refs/heads/foo out of
  refs/remotes/origin/foo, it will copy the subnamespace hint (or
  lack of one) from refs/remotes/origin/foo.

  Likewise, the subnamespace hint is conveyed by `git fetch' (by
  default) and can be updated with `git push' (though this is not
  done by default).

  The ref subnamespace hint may be set explicitly. That is how an
  individual branch is upgraded.

  git checkout --orphan sets it to the subnamespace (or hint) of the
  previous HEAD.

  When a commit is made and stored in a ref, the subnamespace hint
  for that ref is removed iff the commit's subnamespace and the
  hint's subnamespace are the same.


OBJECT STORE BEHAVIOUR

The object store has configuration to specify which hash functions
are enabled. Each hash function H has a combination of the following
behaviours, according to configuration:

* Collision behaviour:

  What to do if we encounter an object we already have (eg as part of
  a pack, or with hash-object) but with different contents.

  (a) fail: print a scary message and abort operation (on the basis
      that the source of the colliding object probably intended the
      preimage that they provided, or is conducting an attack).

  (b) tolerate: prefer our own data; print a message, but treat the
      reference as referring to our version of the object.

  In both cases we keep a copy of the second preimage in our .git,
  for forensic purposes.

  This is used as part of a gradual desupport strategy. Existing
  history in all existing object stores is safe and cannot be
  corrupted or modified by receiving colliding objects. New trees
  which receive their initial data from a trustworthy sender over a
  trustworthy channel will receive correct data. Bad object stores
  or untrustworthy channels could exploit collisions, but not in new
  regions of the history which are presumably using new names. So
  the collisions can only affect archaeology.

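  To make the two collision behaviours concrete, here is a toy
  sketch in Python. Everything in it (ObjectStore, CollisionError,
  receive, the forensics list) is hypothetical and invented for
  illustration. In particular it takes object names on trust from
  the sender instead of hashing the content, purely so that a
  collision can be simulated.

```python
import sys


class CollisionError(Exception):
    """Raised when the 'fail' policy meets a colliding object."""


class ObjectStore:
    def __init__(self, policy="fail"):
        if policy not in ("fail", "tolerate"):
            raise ValueError(f"unknown collision policy: {policy!r}")
        self.policy = policy
        self.objects = {}    # object name -> content we already hold
        self.forensics = []  # (name, second preimage) kept for inspection

    def receive(self, name, content):
        """Store an incoming object; return the content the name refers to."""
        existing = self.objects.get(name)
        if existing is None:
            self.objects[name] = content
            return content
        if existing == content:
            return existing  # same object seen again, nothing to do
        # Collision: in both policies, keep the second preimage.
        self.forensics.append((name, content))
        if self.policy == "fail":
            raise CollisionError(f"colliding object received for {name}")
        # tolerate: warn, but keep treating the name as our own version.
        print(f"warning: collision on {name}; keeping our copy",
              file=sys.stderr)
        return existing
```

  With policy "tolerate", a second receive() of the same name with
  different content returns our original data; with "fail" it raises.
  Either way the colliding content ends up in the forensics list and
  the stored object is never overwritten.
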
  Merging previously-unrelated histories does introduce a collision
  hazard, but the collision would have had to have been introduced
  while the colliding hash function was still a live hash function in
  at least one of the two projects.

* Hash function enablement:

  (a) enabled: this hash function is good and available for use.

  (b) deprecated (in favour of H2): this hash function is available
      for use, but newly created objects will use another hash
      function instead (specifically, when creating an object, this
      hash function is not considered as a candidate; if as a result
      there are no candidate hash functions, we use the specified
      replacement H2).

      Existing refs referring to objects with this hash, with no ref
      hint, are treated as having a ref hint specifying H2.

      If no H2 is specified, the newest (`best') hash is used.

  (c) disabled: existing objects using this hash function can be
      accessed, but no such objects can be created or received.
      (Again, a replacement may be specified.)

      This is used both initially to prevent unintended upgrade, and
      later to block the introduction of vulnerable data generated
      by badly configured clients.

  (d) forgotten: such objects are not stored. References to such
      objects return dummy objects of the right shape: the empty
      blob; the empty tree; a root commit with an empty tree and
      dummy metadata.

      This allows us to finally retire a hash function entirely. We
      effectively throw away all the history which uses this hash
      function.

      During transfer protocols, the receiver will say which hashes
      it thinks are forgotten, and the sender will not follow such
      references when computing the set of objects to send. So
      receivers will not receive the forgotten objects.


Remote protocol

During the negotiation, a receiver needs to specify what hashes it
understands, and whether it is prepared to see only a partial view.
When the sender is listing its refs, refs naming objects the receiver
cannot understand are either elided (if the receiver is content with
a partial view), or cause an error.


Equality testing

Note that semantically identical trees may (now) have different tree
objects because those tree objects might use (and be named by)
different hashes. So (in some contexts at least) tree comparison can
no longer be done by comparing names; rather an invocation of git
diff is needed, or explicit generation of a tree object with the
right hash.


Transition plan

 Y0: Implement all of the above. Test it.

     Default configuration:
        SHA-1 is enabled
        SHA-512 is disabled in trees without working trees
        SHA-512 is enabled in trees with working trees

     Effects:

        Existing projects will not switch to SHA-512 willy-nilly.

        New projects will still use SHA-1.

        Incompatible new-style commits cannot be pushed without
        server admin effort (or until future upgrade). So all old
        git clients still work.

 Y4: SHA-512 by default for new projects.
     Conversion enabled for existing projects.
     Old git software is now pretty firmly deprecated.

     Default configuration change:

        When creating a new bare tree, a configuration dropping is
        left (in `config') which specifies that SHA-1 is OBSOLESCENT

        Default status for SHA-512 is FORBIDDEN if SHA-1 is ENABLED,
        or ENABLED if SHA-1 is OBSOLESCENT.

        default HEAD hash is newest ENABLED hash.

     Effects:

        When creating a new working tree, it starts using SHA-512.

        A new server tree will accept SHA-512.

        Existing server trees do not yet accept SHA-512. They
        publish their SHA-1 hashes, so clients make commits with
        SHA-1.

        To convert a project, an administrator would set SHA-1 to
        OBSOLESCENT on the server. All clones after that will have
        HEAD with a SHA-512 name. Fetches and pulls will update to
        SHA-512 names.

        will , and push one SHA-512 commit to mainline.

     Default configuration change:

     Effects:

        When creating a new tree with working tree with git init
        (ie, no HEAD), the default HEAD hash is set to SHA-512
        (because SHA-1 is OBSOLESCENT in a new tree and therefore
        SHA-512 is the only ENABLED hash and is the default).

        Newly minted server trees accept SHA-512.

        start using SHA-512 by default.

 Y6: Existing projects start being converted infectiously.
     It is hard to stop this happening.
     Old git software is firmly stuffed.

     Default configuration change:

        SHA-1 is OBSOLESCENT
        (default for SHA-512, and HEAD hash, computed as in Y4)

     Result is that by default all software

     (Projects which do not want to convert need to set SHA-1 to
     ENABLED, explicitly, on their

 Y6: Existing projects start using SHA-512.

     Default configuration change:
        SHA-512 is ENABLED
        SHA-1 is OBSOLESCENT
        (default default HEAD hash is already SHA-512)

     In existing repositories where no special action

-- 
Ian Jackson   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.