X-Git-Url: http://www.chiark.greenend.org.uk/ucgi/~ian/git?p=git-hash-transition-plan.git;a=blobdiff_plain;f=plan.txt;h=b75537aca0b645165dcbbd8ebbbc799940a2f52e;hp=803aadda893a654ad70daf4b7b07cdbe3bc171e8;hb=266b32bc46dbfab5e449ff3be5ae5b8c51d8c5d8;hpb=5a36a235897886931d7c6e553090f56621429552 diff --git a/plan.txt b/plan.txt index 803aadd..b75537a 100644 --- a/plan.txt +++ b/plan.txt @@ -1,53 +1,180 @@ -From: Ian Jackson -To: ijackson@chiark.greenend.org.uk Subject: Transition plan for git to move to a new hash function -Date: Thu, 20 Oct 2016 19:26:44 +0100 -Basic principle: Every object will have two (or more) names, -corresponding to different hash functions. It may be named by any of -its names, in every context. +BASIC PRINCIPLE + +We run multiple object name subnamespaces in parallel, one for each +hash function. Each object lives in exactly one subnamespace. +Objects with identical content in the different object stores, named +by different hash functions, are different objects. + +Objects may refer to objects living in different subnamespaces (ie, +named by a different hash function) to their own. + +Packfiles need to be extended to be able to contain objects named by +new hash functions. Blob objects with identical contents but living +in different subnamespaces would ideally share storage. Every program that invokes git or speaks git protocols will need to -understand the extended object name syntax, and understand that -objects have multiple names. +understand the extended object name syntax. Safety catches prevent accidental incorporation into a project of -objects which contain references by incompatibly-new or -deprecatedly-old names. This allows for incremental deployment. +incompatibly-new objects, or additional deprecatedly-old objects. +This allows for incremental deployment.
+ + +TEXTUAL SYNTAX + +The object name textual syntax is extended as follows: + +We declare that the object name syntax is henceforth + [A-Z]+[0-9a-z]+ | [0-9a-f]+ +and that names [A-Z].* are deprecated as ref name components. + + Rationale: + + Full backwards compatibility is impossible, because the hash + function needs to be evident in the name, so the new names + must be disjoint from all old SHA-1 names. + + We want a short but extensible syntax. The syntax should impose + minimal extra requirements on existing git users. In most + contexts where existing git users use hashes, ASCII alphanumeric + object names will fit. Use of punctuation such as : or even _ + may give trouble to existing users, who are already using + such things as delimiters. + + In existing deployments, refnames that differ only in case are + generally avoided (because they are troublesome on + case-insensitive filesystems). And conventionally refnames are + lower case. So names starting with an upper case letter will be + disjoint from most existing ref name components. + Even though we probably want to keep using hex, it is a good + idea to reserve the flexibility to use a more compact encoding, + while not excessively widening the existing permissible + character set. -Syntax: +Object names using SHA-1 are represented, in text, as at present. -The object name syntax is extended as follows: object names using sha1 -are as current. Object names starting with lowercase ASCII letters h -or later refer to new hash functions. (`g' is reserved because of the -way that many programs write `g'. Programs that use -`g' should be changed to show `h' for hash function -`h' rather than `gh'.) +Object names starting with uppercase ASCII letters H or later refer to +new hash functions. Programs that use `g' should ideally +be changed to show `H' for hash function `H' rather than +`gH'.) -Object names h are SHA-512 hashes. Remaining letters are -reserved. 
`x' `y' `z' are reserved for private experiments; we -declare that public releases of git will never accept such names. + Rationale: + + Object names starting with A-F might look like hex. G is + reserved because of the way that many programs write + `g'. + + This gives us 19 new hash function values until we have to + start using two-letter hash function prefixes, or decide to + use A-F after all. + +(Truncated object names work as they do at the moment.) + +Initially we define and assign one new hash function (and textual +object name encoding): + + H where is the BLAKE2b hash of the object + (in lowercase) + +We also reserve the following syntax for private experiments: + E[A-Z]+[0-9a-z]+ +We declare that public releases of git will never accept such +object names. Everywhere in the git object formats and git protocols, a new object name (with hash function indicator) is permitted where an old object -name is permitted. A single object refers to all the objects it -references by the same hash function; in general this might be a -different hash function to the hash function by this particular object -was itself referenced or obtained. +name is permitted. + +A single object may refer to other objects by its own hash function, or +by other hash functions. Ie, object references cross subnamespaces. +During all git operations, subnamespace boundaries in the object graph +are traversed freely. -As an exception, it is forbidden to refer to a tree object by a name -other than the hash function it uses to name its subtrees. If this -seems necessary, the tree object must be recursively rewritten instead -to use the desired object name. +Two additional restrictions: a tree object may be referenced only by +objects in the same subnamespace; and, a tree object may reference +blobs only in its own subnamespace. In binary protocols, where a SHA-1 object name in binary form was previously used, a new codepoint must be allocated in a containing structure (eg a new typecode).
Usually, the new-format binary object -will have a new typecode and also an additional name hash indicator. -15 of the hash indicator values correspond to the lowercase letters -reserved above. +will have a new typecode and also an additional name hash indicator, +and it will also need a length field (as new hashes may be of +different lengths). + +Whenever a new hash function textual syntax is defined, corresponding +binary format codepoint(s) are assigned. (Implementation details such +as the binary format specification are outside the scope of this +transition plan.) + + +ORDERING + +Hash functions are partially ordered, from `older' to `newer'. + +The ordering is configurable. The default, with the two hash +functions defined here, is the obvious ordering + SHA1 ([0-9a-f]*) < BLAKE2b (H*) + + +CHOICE OF SUBNAMESPACE + +Whenever objects are created, it is necessary to choose the +subnamespace to use (ie, the hash function). + +Each ref may also have a subnamespace hint associated with it. + + +Commits + +A commit is made (by default) as new as the newest of + (i) each of its parents + (ii) if applicable, the subnamespace hint for the ref to which the + new commit is to be written + +Implicitly this normally means that if HEAD refers to a new commit, +further new commits will be generated on top of it. + +The subnamespace of an origin commit is controlled by the hint left in +.git by git checkout --orphan or git init. + +At boundaries between old and new history, new commit(s) will refer to +old parent(s). + + +Tags + +A tag is created (by default) in the same subnamespace as the object +to which it refers. + + +Trees + +Trees are always referenced by objects in their own subnamespace. + +Occasionally, a tree object from one subnamespace must be recursively +rewritten into another subnamespace. + +When a tree refers to a commit, it may refer to one in a different +subnamespace. + + Rationale: we want to avoid new commits and tags relying on weak + hashes.
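As an illustration only, the textual name syntax and the default rule
for choosing a commit's subnamespace can be sketched in Python. This is
a minimal sketch under stated assumptions, not part of the plan: every
function and variable name below is hypothetical, the ordering is
simplified to a total order (the plan says it is partial and
configurable), and any prefix other than H is treated as unassigned.

```python
import re

# Hypothetical names throughout; the plan itself defines no API.
# Default ordering from the ORDERING section: SHA-1 < BLAKE2b.
# Assumption: a simple total order stands in for the partial order.
HASH_ORDER = ["sha1", "blake2b"]

_OLD_NAME = re.compile(r"[0-9a-f]+")          # SHA-1 names, as at present
_NEW_NAME = re.compile(r"([A-Z]+)[0-9a-z]+")  # prefixed new-style names


def subnamespace(name):
    """Return the hash function implied by a textual object name."""
    if _OLD_NAME.fullmatch(name):
        return "sha1"
    match = _NEW_NAME.fullmatch(name)
    if match is None:
        raise ValueError("not an object name: " + name)
    prefix = match.group(1)
    if prefix == "H":
        return "blake2b"
    if prefix.startswith("E") and len(prefix) >= 2:
        # E[A-Z]+... is reserved for private experiments.
        raise ValueError("private experimental name: " + name)
    raise ValueError("unassigned hash prefix: " + prefix)


def commit_subnamespace(parent_names, ref_hint=None):
    """Pick the newest subnamespace among the parents and the ref hint."""
    candidates = [subnamespace(p) for p in parent_names]
    if ref_hint is not None:
        candidates.append(ref_hint)
    if not candidates:
        # An origin commit has no parents, so it relies on the hint.
        raise ValueError("origin commit needs a subnamespace hint")
    return max(candidates, key=HASH_ORDER.index)
```

With this sketch, a commit whose parents are all SHA-1 stays in SHA-1
unless the ref hint is newer, and a single BLAKE2b parent is enough to
move the new commit into the BLAKE2b subnamespace.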
+ + +Blobs + +Blobs are normally referred to by trees. Trees always refer to blobs +in the same subnamespace. + +Where a blob is created in other circumstances, the caller should +specify the subnamespace. + + + Object store: @@ -72,12 +199,17 @@ the user: existing objects and notes that this hash function is now `ENABLED PRESENT'. + If a hash collision is detected, we crash immediately. + * OBSOLESCENT: Every object in the object store has its hash calculated using H. However, H is known to possibly have collisions which we try to tolerate. When a collision occurs, the object text which is currently in the object store is preferred and the "new" - object is thrown away. Local creation of new objects with - references using H is forbidden. + object is thrown away. + + Local creation of new objects with references using H is + discouraged. Specifically, if another hash function is ENABLED, we + will use that instead. This is used as part of a gradual desupport strategy. When the hash function is in this stage, existing history in all existing object @@ -119,11 +251,11 @@ the user: This allows us to finally retire a hash function entirely. We effectively throw away all the history which uses H. -During transfer protocols, the receiver will say which hashes are -obsolete or forgotten, and the sender will not follow such references -when computing the set of objects to send. So receivers will not -receive the objects which were named only by obsolete or forgotten -names. +During transfer protocols, the receiver will say which hashes it +thinks are obsolete or forgotten, and the sender will not follow such +references when computing the set of objects to send. So receivers +will not receive the objects which were named only by obsolete or +forgotten names. Naming in newly-generated objects, queries, etc. @@ -151,7 +283,7 @@ overrideable by configuration.) 
This (together with the `forbidden' state, above) ensures that switching a project to use a new hash function is a deliberate decision: the default hash function needs to be changed to make the -first first commit with the new hash function. After that, provided +first commit with the new hash function. After that, provided the server accepts it, it's infectious. @@ -160,16 +292,24 @@ Naming of refs other than HEAD A ref refers to an object by one of its names. However, operations like git-show-ref convert that name to the default format (see above). -git-gc rewrites ref names to the default format. +git-gc rewrites ref names to the default format iff that is newer. Remote protocol -During the negotation, a client needs to specify what names it -understands, and which it prefers (its default). +During the negotiation, a receiver needs to specify what hashes it +understands. -When the server is listing its refs, the names are converted to the -client's preferred format. +When the sender is listing its refs, the names are converted to a +hash understood by the receiver if necessary. If this is not necessary, +they are left unchanged. + +When a receiver is updating refs, it should follow the sender's +idea of a hash change iff it's an upgrade (and the new function is +ENABLED). That is, if the sender sends name H2 for some ref, and the +receiver has H1, but these refer to the same object, then the receiver +should update its own ref name from H1 to H2 iff H2 uses a newer hash +function. Equality testing @@ -180,10 +320,14 @@ for both objects. This is going to be quite annoying. +We should provide a convenient utility which tests whether two object +names refer to the same object. + Note that semantically identical trees may (now) have different tree objects because those tree objects might contain different object -names. So tree comparison cannot any longer be done by comparing -names; rather an invocation of git diff is needed. +names.
So (in some contexts at least) tree comparison cannot any +longer be done by comparing names; rather an invocation of git diff is +needed, or explicit generation of a tree object with the right name. Transition plan @@ -191,18 +335,88 @@ Transition plan Y0: Implement all of the above. Test it. Default configuration: - SHA-1 is ENABLED and is default HEAD hash - + SHA-1 is ENABLED SHA-512 is FORBIDDEN in bare repos SHA-512 is ENABLED in trees with working trees + default HEAD hash is SHA-1 + + Effects: + + Existing projects will not switch to SHA-512 willy-nilly. + New projects will still use SHA-1. + + Incompatible new-style commits cannot be pushed without server + admin effort (or until future upgrade). + + So all old git clients still work. + +Y4: SHA-512 by default for new projects. + Conversion enabled for existing projects. + Old git software is now pretty firmly deprecated. + + Default configuration change: + + When creating a new bare tree, a configuration dropping is left + (in `config') which specifies that SHA-1 is OBSOLESCENT + + Default status for SHA-512 is FORBIDDEN if SHA-1 is ENABLED, + or ENABLED if SHA-1 is OBSOLESCENT. + + default HEAD hash is newest ENABLED hash. + + Effects: + + When creating a new working tree, it starts using SHA-512. + A new server tree will accept SHA-512. + + Existing server trees do not yet accept SHA-512. They publish + their SHA-1 hashes, so clients make commits with SHA-1. + + To convert a project, an administrator would set SHA-1 to + OBSOLESCENT on the server. All clones after that will have HEAD + with a SHA-512 name. Fetches and pulls will update to SHA-512 + names. + +will , and push one SHA-512 commit to + mainline. + + + + Default configuration change: + + Effects: + + When creating a new tree with working tree with git init (ie, no + HEAD), the default HEAD hash is set to SHA-512 (because SHA-1 is + OBSOLESCENT in a new tree and therefore SHA-512 is the only + ENABLED hash and is the default). 
+ + Newly minted server trees accept SHA-512. + + + start using SHA-512 by default. + +Y6: Existing projects start being converted infectiously. + It is hard to stop this happening. + Old git software is firmly stuffed. + + Default configuration change: + SHA-1 is OBSOLESCENT + (default for SHA-512, and HEAD hash, computed as in Y4) + + Result is that by default all software + + (Projects which do not want to convert need to set SHA-1 to + ENABLED, explicitly, on their -Y5: New projects should start using SHA-512. +Y6: Existing projects start using SHA-512. Default configuration change: + SHA-512 is ENABLED + SHA-1 is OBSOLESCENT + (default default HEAD hash is already SHA-512) - SHA-512 becomes ENABLED in *new* bare repos but remains - FORBIDDEN in existing ones - + In existing repositories where no special action --