From: Ian Jackson
To: ijackson@chiark.greenend.org.uk
Subject: Transition plan for git to move to a new hash function
Date: Thu, 20 Oct 2016 19:26:44 +0100

Basic principle:

Every object will have two (or more) names, corresponding to
different hash functions. It may be named by any of its names, in
every context.

Every program that invokes git or speaks git protocols will need to
understand the extended object name syntax, and understand that
objects have multiple names.

Safety catches prevent accidental incorporation into a project of
objects which contain references by incompatibly-new or
deprecatedly-old names. This allows for incremental deployment.

TEXTUAL SYNTAX

The object name textual syntax is extended as follows:

We declare that the object name syntax is henceforth
    [A-Z]+[0-9a-z]+ | [0-9a-f]+
and that names [A-Z].* are deprecated as ref name components.

Rationale:

Full backwards compatibility is impossible, because the hash function
needs to be evident in the name, so the new names must be disjoint
from all old SHA-1 names.

We want a short but extensible syntax.

The syntax should impose minimal extra requirements on existing git
users. In most contexts where existing git users use hashes, ASCII
alphanumeric object names will fit. Use of punctuation such as : or
even _ may give trouble to existing users, who are already using such
things as delimiters.

In existing deployments, refnames that differ only in case are
generally avoided (because they are troublesome on case-insensitive
filesystems). And conventionally refnames are lower case. So names
starting with an upper case letter will be disjoint from most existing
ref name components.

Even though we probably want to keep using hex, it is a good idea to
reserve the flexibility to use a more compact encoding, while not
excessively widening the existing permissible character set.

Object names using SHA-1 are represented, in text, as at present.
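
To make the grammar above concrete, here is a minimal sketch in Python
of how a tool might split an object name into its hash-function prefix
and the remainder. The function name split_object_name and the `X'
prefix in the usage note are made up for illustration; nothing here is
git's actual code.

```python
import re

# The two alternatives of the proposed syntax:
#   [A-Z]+[0-9a-z]+   new-style name: uppercase prefix, then the body
#   [0-9a-f]+         old-style name: SHA-1 in hex, as at present
_NEW_STYLE = re.compile(r"([A-Z]+)([0-9a-z]+)")
_OLD_STYLE = re.compile(r"[0-9a-f]+")


def split_object_name(name):
    """Return (prefix, body); prefix is "" for an old-style name.

    Hypothetical helper, not part of git. Raises ValueError if the
    string matches neither alternative of the grammar.
    """
    m = _NEW_STYLE.fullmatch(name)
    if m:
        return m.group(1), m.group(2)
    if _OLD_STYLE.fullmatch(name):
        return "", name
    raise ValueError(f"not an object name: {name!r}")
```

With this sketch, split_object_name("X0123abc") gives ("X", "0123abc")
and split_object_name("deadbeef") gives ("", "deadbeef"). A string such
as "HEAD" is rejected, because an uppercase prefix must be followed by
at least one digit or lowercase letter.
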
Object names starting with uppercase ASCII letters H or later refer to
new hash functions. (Programs that use `g' should ideally be changed
to show `H' for hash function `H' rather than `gH'.)

Rationale: Object names starting with A-F might look like hex. G is
reserved because of the way that many programs write `g'. This gives
us 19 new hash function values until we have to start using
two-letter hash function prefixes, or decide to use A-F after all.

(Truncated object names work as they do at the moment.)

Initially we define and assign one new hash function (and textual
object name encoding):
    H<hex>    where <hex> is the BLAKE2b hash of the object
              (in lowercase)

We also reserve the following syntax for private experiments:
    E[A-Z]+[0-9a-z]+
We declare that public releases of git will never accept such
object names.

Everywhere in the git object formats and git protocols, a new object
name (with hash function indicator) is permitted where an old object
name is permitted.

A single object refers to all the objects it references by the same
hash function; in general this might be a different hash function to
the hash function by which this particular object was itself
referenced or obtained.

As a further restriction, it is forbidden to refer to a tree object
by a name using a hash function other than the one it uses to name
its subtrees. If this seems necessary, the tree object must be
recursively rewritten instead to use the desired object name.

In binary protocols, where a SHA-1 object name in binary form was
previously used, a new codepoint must be allocated in a containing
structure (eg a new typecode). Usually, the new-format binary object
will have a new typecode and also an additional name hash indicator.

Whenever a new hash function textual syntax is defined, corresponding
binary format codepoint(s) are assigned. (Detailed binary format
specification is outside the scope of this plan.)

ORDERING

Hash functions are partially ordered, from `older' to `newer'.
The ordering is configurable. The default, with the two hash
functions defined here, is the obvious ordering
    SHA1 ([0-9a-f]*) < BLAKE2b (H*)

CHOICE OF OBJECT NAMES

Whenever objects are named, it is possible to refer to them by old or
new names. So git must make a choice, each time: when new objects are
created; when refs are updated; and when refs are reported over
network protocols to other instances of git.

Although strictly speaking all objects have both old names and new
names, and there may be more than two hash functions, it is possible
to speak, somewhat loosely, about `new objects'. A `new' object is one
which refers to other objects by a `new' name (whatever `new' means).

We call these different hashes `namings'. That is, a `naming' is a
hash function implemented by git. The `naming IN an object' is the
naming by which the object refers to other objects (and may not exist,
if the object has no references); the `name OF an object' is the name
by which the object itself is specified.

Commits

A non-origin commit is made (by default) as new as the newest of
  (i) the naming in each of its parents
  (ii) the specified name of each of its parents

(Implicitly this normally means that if HEAD uses a new name, new
commits will be generated.)

The naming of an origin commit is controlled by a dropping left in
.git by git checkout --orphan or git init.

At boundaries between old and new history, a new commit will refer to
old parents by those old parents' new names.

Tags

A new tag is made to use, for its tagged object, the newest naming of
  (i) the name by which the tagged object was specified
  (ii) the naming in the tagged object (if applicable)

Trees

Commits (and sometimes, tags) can refer to tree objects; that tree
will contain the same naming as the referring object. That is, it is
a bug to refer to a tree object by other than the hash it uses
internally to refer to subtrees (and gitlinks).
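
The `newest of' rules for commits and tags above can be sketched as
follows. This is only an illustration under stated assumptions: the
labels "sha1" and "blake2b", the list DEFAULT_ORDER and both function
names are invented here, and the partial order is simplified to a
plain list running from older to newer.

```python
# Assumed default order from the text: SHA-1 is older than BLAKE2b.
# The labels are hypothetical; a real implementation would take the
# (configurable) ordering from elsewhere.
DEFAULT_ORDER = ["sha1", "blake2b"]


def newest(namings, order=DEFAULT_ORDER):
    """Return the newest naming in `namings`, ignoring None entries.

    None stands for "no naming", e.g. an object with no references.
    Returns None if nothing is left to choose from.
    """
    present = [n for n in namings if n is not None]
    if not present:
        return None
    return max(present, key=order.index)


def naming_for_commit(parents, order=DEFAULT_ORDER):
    """Pick the naming for a new non-origin commit.

    `parents` is a list of (name_used, naming_in) pairs: the naming of
    the name by which each parent was specified, and the naming in
    that parent.
    """
    candidates = []
    for name_used, naming_in in parents:
        candidates.append(name_used)
        candidates.append(naming_in)
    return newest(candidates, order)
```

For example, naming_for_commit([("sha1", "sha1"), ("blake2b", "sha1")])
picks "blake2b", because one parent was specified by a new name. The
tag rule is the same choice over a single pair:
newest([name_used, naming_in]). Whichever naming is chosen for the
commit, the tree rule above requires the commit's tree to use that same
naming internally.
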
This will mean that a tree must sometimes be rewritten (ie, new
object names recalculated recursively). Rationale: we want to avoid
new commits and tags relying on weak hashes.

Blobs

Blobs do not refer to other objects so they are neither new nor old.

Name of newly created object

When git creates a new object, it reports the new object name using
the naming in the object. For blobs and empty trees, the caller
should normally specify. The default is the naming used for HEAD.

Updating refs

If a ref is updated with a new object, the name from its creation is
used (see above). If a ref is updated to a specified object, the
naming used in the ref is the newer of the specified name and the
naming in the object (if any).

), or with a specified object name.

(If there are different equally new names, one of the newest names is
chosen according to some stable rule.)

new commit. (This may mean converting the tree in hand, since trees
are supposed to be homogeneous.)

A `new commit' is one which refers to objects by

Object store:

The object store knows which hash functions are enabled. Each hash
function H has one of the following statuses, which are configured by
the user:

 * ENABLED:

   As far as the user is concerned every object in the object store
   is accessible using H. Objects which use H names can be received
   and stored.

   This is actually two states, depending on whether any objects
   exist in the store which use these names. If no such objects
   exist yet, we say that the hash function is `ENABLED PROSPECTIVE'.
   The H names for the objects have not yet been calculated. When
   the first object which names another object using H is received
   (or, on demand), the object store calculates the H names for all
   existing objects and notes that this hash function is now
   `ENABLED PRESENT'.

   If a hash collision is detected, we crash immediately.

 * OBSOLESCENT:

   Every object in the object store has its hash calculated using H.
   However, H is known to possibly have collisions which we try to
   tolerate.
   When a collision occurs, the object text which is currently in the
   object store is preferred and the "new" object is thrown away.

   Local creation of new objects with references using H is
   discouraged. Specifically, if another hash function is ENABLED, we
   will use that instead.

   This is used as part of a gradual desupport strategy. When the
   hash function is in this stage, existing history in all existing
   object stores is safe and cannot be corrupted or modified by
   receiving colliding objects. New object stores which receive their
   data from a trustworthy sender over a trustworthy channel will
   receive correct data.

   Bad object stores or untrustworthy channels could exploit
   collisions, but not in new regions of the history which are
   presumably using new names. So the collisions can only affect
   archaeology.

   Merging previously-unrelated histories does introduce a collision
   hazard, but the collision would have had to have been introduced
   while H was still a "live" hash function in at least one of the
   two projects.

 * FORBIDDEN:

   Objects do not have their hashes calculated using this hash
   function. Attempts to reference an object by such a name fail.

   Optionally the user may specify a tolerant mode where: a commit
   which refers to parents by obsolete names is taken to simply not
   have those parents; a commit which refers to a tree by an obsolete
   name is taken to have an empty tree.

   This is used for two purposes:

   - On a server, we use this to restrict the propagation of new
     hashes so as to enforce our compatibility intentions. Ie,
     hashes which we are "not ready for" are forbidden.

   - Everywhere, we use this to get rid of old hash functions. It
     makes access to old history possible but difficult.

 * FORGOTTEN:

   Objects do not have their hashes calculated using this hash
   function. References to objects by all such names return dummy
   objects of the right shape: the empty blob; the empty tree; a root
   commit with an empty tree and dummy metadata.
   This allows us to finally retire a hash function entirely. We
   effectively throw away all the history which uses H.

During transfer protocols, the receiver will say which hashes it
thinks are obsolete or forgotten, and the sender will not follow such
references when computing the set of objects to send. So receivers
will not receive the objects which were named only by obsolete or
forgotten names.

Naming in newly-generated objects, queries, etc.

There is a `default' hash function, which is that which HEAD uses.
(That is, HEAD refers to an object by some name. The default hash
function is that name's hash function.)

git tools always output object names in the default hash function.
(Including git-hash-object.) As a consequence, newly generated
objects will contain object references using the `default' hash
function.

When HEAD is empty, there is a separate record of the default hash
function. This comes from a configured default in a new tree. In an
existing tree, using git checkout --orphan remembers the default hash
function that HEAD had.

When HEAD is updated to a new commit, the name stored in HEAD uses
the newer of the previous HEAD hash function and of the hash function
used in the commit being stored. ("Newer" is a built-in preference
order, overrideable by configuration.)

This (together with the `forbidden' state, above) ensures that
switching a project to use a new hash function is a deliberate
decision: the default hash function needs to be changed to make the
first commit with the new hash function. After that, provided the
server accepts it, it's infectious.

Naming of refs other than HEAD

A ref refers to an object by one of its names. However, operations
like git-show-ref convert that name to the default format (see
above).

git-gc rewrites ref names to the default format iff that is newer.

Remote protocol

During the negotiation, a receiver needs to specify what hashes it
understands.
When the sender is listing its refs, the names are converted to a
hash understood by the client if necessary. If this is not
necessary, they are left unchanged.

When a receiver is updating refs, it should follow the sender's idea
of a hash change iff it's an upgrade (and the new function is
ENABLED). That is, if the sender sends name H2 for some ref, and the
receiver has H1, but these refer to the same object, then the
receiver should update its own ref name from H1 to H2 iff H2 uses a
newer hash function.

Equality testing

All software which tests for equality of git objects by checking
whether their object names are equal needs to obtain a canonical name
for both objects.

This is going to be quite annoying. We should provide a convenient
utility which tests whether two object names refer to the same
object.

Note that semantically identical trees may (now) have different tree
objects because those tree objects might contain different object
names. So (in some contexts at least) tree comparison cannot any
longer be done by comparing names; rather an invocation of git diff
is needed, or explicit generation of a tree object with the right
name.

Transition plan

Y0: Implement all of the above. Test it.

    Default configuration:
        SHA-1 is ENABLED
        SHA-512 is FORBIDDEN in bare repos
        SHA-512 is ENABLED in trees with working trees
        default HEAD hash is SHA-1

    Effects:

    Existing projects will not switch to SHA-512 willy-nilly.
    New projects will still use SHA-1.

    Incompatible new-style commits cannot be pushed without server
    admin effort (or until future upgrade). So all old git clients
    still work.

Y4: SHA-512 by default for new projects.
    Conversion enabled for existing projects.
    Old git software is now pretty firmly deprecated.

    Default configuration change:
        When creating a new bare tree, a configuration dropping is
        left (in `config') which specifies that SHA-1 is OBSOLESCENT
        Default status for SHA-512 is FORBIDDEN if SHA-1 is ENABLED,
        or ENABLED if SHA-1 is OBSOLESCENT.
        default HEAD hash is newest ENABLED hash.

    Effects:

    When creating a new working tree, it starts using SHA-512.
    A new server tree will accept SHA-512.

    Existing server trees do not yet accept SHA-512. They publish
    their SHA-1 hashes, so clients make commits with SHA-1.

    To convert a project, an administrator would set SHA-1 to
    OBSOLESCENT on the server. All clones after that will have HEAD
    with a SHA-512 name. Fetches and pulls will update to SHA-512
    names.

    will , and push one SHA-512 commit to mainline.

    Default configuration change:

    Effects:

    When creating a new tree with working tree with git init (ie, no
    HEAD), the default HEAD hash is set to SHA-512 (because SHA-1 is
    OBSOLESCENT in a new tree and therefore SHA-512 is the only
    ENABLED hash and is the default).

    Newly minted server trees accept SHA-512.

    start using SHA-512 by default.

Y6: Existing projects start being converted infectiously.
    It is hard to stop this happening.
    Old git software is firmly stuffed.

    Default configuration change:
        SHA-1 is OBSOLESCENT
        (default for SHA-512, and HEAD hash, computed as in Y4)

    Result is that by default all software

    (Projects which do not want to convert need to set SHA-1 to
    ENABLED, explicitly, on their

Y6: Existing projects start using SHA-512.

    Default configuration change:
        SHA-512 is ENABLED
        SHA-1 is OBSOLESCENT
        (default HEAD hash is already SHA-512)

    In existing repositories where no special action

-- 
Ian Jackson   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.