X-Git-Url: http://www.chiark.greenend.org.uk/ucgi/~ian/git?a=blobdiff_plain;f=plan.txt;h=b4c622ba40e1637ed7c91652b3025d6dd059a2b7;hb=ce7991fe39a77c525795e747e4ee2a1465e0e9cb;hp=803aadda893a654ad70daf4b7b07cdbe3bc171e8;hpb=5a36a235897886931d7c6e553090f56621429552;p=git-hash-transition-plan.git
diff --git a/plan.txt b/plan.txt
index 803aadd..b4c622b 100644
--- a/plan.txt
+++ b/plan.txt
@@ -1,90 +1,226 @@
-From: Ian Jackson
-To: ijackson@chiark.greenend.org.uk
 Subject: Transition plan for git to move to a new hash function
-Date: Thu, 20 Oct 2016 19:26:44 +0100
-Basic principle: Every object will have two (or more) names,
-corresponding to different hash functions. It may be named by any of
-its names, in every context.
+BASIC PRINCIPLE
+
+We run multiple object name subnamespaces in parallel, one for each
+hash function. Each object lives in exactly one subnamespace.
+Objects with identical content in the different object stores, named
+by different hash functions, are different objects.
+
+Objects may refer to objects living in different subnamespaces (ie,
+named by a different hash function) to their own.
+
+Packfiles need to be extended to be able to contain objects named by
+new hash functions. Blob objects with identical contents but living
+in different subnamespaces would ideally share storage.
 Every program that invokes git or speaks git protocols will need to
-understand the extended object name syntax, and understand that
-objects have multiple names.
+understand the extended object name syntax.
 Safety catches prevent accidental incorporation into a project of
-objects which contain references by incompatibly-new or
-deprecatedly-old names. This allows for incremental deployment.
+incompatibly-new objects, or additional deprecatedly-old objects.
+This allows for incremental deployment.
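To illustrate the basic principle in the new text (identical content, named by different hash functions, gives different objects), here is a minimal Python sketch. The SHA-1 part follows git's existing blob hashing. The BLAKE2b part is an assumption: the plan only says the new name is `H' followed by the lowercase BLAKE2b hash of the object, so the digest size (hashlib's 64-byte default) and the reuse of the same header bytes are guesses made for illustration, and the helper names are hypothetical.

```python
import hashlib

def git_object_bytes(kind: str, content: bytes) -> bytes:
    # git hashes a "<type> <size>\0" header followed by the content.
    return f"{kind} {len(content)}".encode() + b"\0" + content

def sha1_name(kind: str, content: bytes) -> str:
    # Existing SHA-1 names stay as plain lowercase hex.
    return hashlib.sha1(git_object_bytes(kind, content)).hexdigest()

def blake2b_name(kind: str, content: bytes) -> str:
    # ASSUMPTION: "H" prefix plus lowercase hex of a default-size BLAKE2b
    # digest over the same header+content bytes. The plan leaves the
    # digest size and the exact hashed bytes unspecified.
    return "H" + hashlib.blake2b(git_object_bytes(kind, content)).hexdigest()

content = b"hello\n"
old_name = sha1_name("blob", content)
new_name = blake2b_name("blob", content)

# Same content, two names, hence two distinct objects under the plan.
print(old_name)
print(new_name)
```

Under the plan these two names never alias each other: a tree in the SHA-1 subnamespace would reference the first, and a tree in the BLAKE2b subnamespace the second, even though the blob bytes could share storage.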
+
+
+TEXTUAL SYNTAX
+
+The object name textual syntax is extended as follows:
+
+We declare that the object name syntax is henceforth
+ [A-Z]+[0-9a-z]+ | [0-9a-f]+
+and that names [A-Z].* are deprecated as ref name components.
+
+ Rationale:
+
+ Full backwards compatibility is impossible, because the hash
+ function needs to be evident in the name, so the new names
+ must be disjoint from all old SHA-1 names.
+ We want a short but extensible syntax. The syntax should impose
+ minimal extra requirements on existing git users. In most
+ contexts where existing git users use hashes, ASCII alphanumeric
+ object names will fit. Use of punctuation such as : or even _
+ may give trouble to existing users, who are already using
+ such things as delimiters.
-Syntax:
+ In existing deployments, refnames that differ only in case are
+ generally avoided (because they are troublesome on
+ case-insensitive filesystems). And conventionally refnames are
+ lower case. So names starting with an upper case letter will be
+ disjoint from most existing ref name components.
-The object name syntax is extended as follows: object names using sha1
-are as current. Object names starting with lowercase ASCII letters h
-or later refer to new hash functions. (`g' is reserved because of the
-way that many programs write `g'. Programs that use
-`g' should be changed to show `h' for hash function
-`h' rather than `gh'.)
+ Even though we probably want to keep using hex, it is a good
+ idea to reserve the flexibility to use a more compact encoding,
+ while not excessively widening the existing permissible
+ character set.
-Object names h are SHA-512 hashes. Remaining letters are
-reserved. `x' `y' `z' are reserved for private experiments; we
-declare that public releases of git will never accept such names.
+Object names using SHA-1 are represented, in text, as at present.
+
+Object names starting with uppercase ASCII letters H or later refer to
+new hash functions. Programs that use `g' should ideally
+be changed to show `H' for hash function `H' rather than
+`gH'.)
+
+ Rationale:
+
+ Object names starting with A-F might look like hex. G is
+ reserved because of the way that many programs write
+ `g'.
+
+ This gives us 19 new hash function values until we have to
+ start using two-letter hash function prefixes, or decide to
+ use A-F after all.
+
+(Truncated object names work as they do at the moment.)
+
+Initially we define and assign one new hash function (and textual
+object name encoding):
+
+ H where is the BLAKE2b hash of the object
+ (in lowercase)
+
+We also reserve the following syntax for private experiments:
+ E[A-Z]+[0-9a-z]+
+We declare that public releases of git will never accept such
+object names.
 Everywhere in the git object formats and git protocols, a new object
 name (with hash function indicator) is permitted where an old object
-name is permitted. A single object refers to all the objects it
-references by the same hash function; in general this might be a
-different hash function to the hash function by this particular object
-was itself referenced or obtained.
+name is permitted.
+
+A single object may refer to other objects by its own hash function, or
+by other hash functions. Ie, object references cross subnamespaces.
+During all git operations, subnamespace boundaries in the object graph
+are traversed freely.
-As an exception, it is forbidden to refer to a tree object by a name
-other than the hash function it uses to name its subtrees. If this
-seems necessary, the tree object must be recursively rewritten instead
-to use the desired object name.
+Two additional restrictions: a tree object may be referenced only by
+objects in the same subnamespace; and, a tree object may reference
+only blobs in its own subnamespace.
 In binary protocols, where a SHA-1 object name in binary form was
 previously used, a new codepoint must be allocated in a containing
 structure (eg a new typecode). Usually, the new-format binary object
-will have a new typecode and also an additional name hash indicator.
-15 of the hash indicator values correspond to the lowercase letters
-reserved above.
+will have a new typecode and also an additional name hash indicator,
+and it will also need a length field (as new hashes may be of
+different lengths).
+
+Whenever a new hash function textual syntax is defined, corresponding
+binary format codepoint(s) are assigned. (Implementation details such
+as the binary format specification are outside the scope of this
+transition plan.)
+
+
+ORDERING
+
+Hash functions are partially ordered, from `worse' to `better'.
+The ordering is configurable. For details of the defaults,
+see _Transition Plan_.
+
+
+CHOICE OF SUBNAMESPACE
+
+Whenever objects are created, it is necessary to choose the
+subnamespace to use (ie, the hash function).
+
+Each ref may also have a subnamespace hint associated with it.
+
+
+Commits
+
+A commit is made (by default) as new as the newest of
+ (i) each of its parents
+ (ii) if applicable, the subnamespace hint for the ref to which the
+ new commit is to be written
+
+Implicitly this normally means that if HEAD refers to a new commit,
+further new commits will be generated on top of it.
+
+The subnamespace of an origin commit is controlled by the hint left in
+.git by git checkout --orphan or git init.
+
+At boundaries between old and new history, new commit(s) will refer to
+old parent(s).
+
+
+Tags
+
+A tag is created (by default) in the same subnamespace as the object
+to which it refers.
-Object store:
+Trees
-The object store knows which hash functions are enabled. Each hash
-function H has one of the following statuses, which are configured by
-the user:
+Trees are only referenced by objects in their own subnamespace.
-* ENABLED:
+To satisfy this rule, occasionally a tree object from one subnamespace
+must be recursively rewritten into another subnamespace.
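The textual syntax and the default rule for commits in the new text can be sketched together in Python. This is an illustration under stated assumptions, not part of the plan: the function names and the `SHA1' label for unprefixed names are hypothetical, the partial ordering of hash functions is simplified to a worse-to-better list, and that list assumes the ordering in which BLAKE2b (`H') counts as newer than SHA-1.

```python
import re

# Textual syntax from the plan: [A-Z]+[0-9a-z]+ | [0-9a-f]+
NEW_NAME = re.compile(r"([A-Z]+)[0-9a-z]+")
OLD_NAME = re.compile(r"[0-9a-f]+")

def subnamespace(name):
    """Return the hash function indicator of an object name.

    Unprefixed hex names are SHA-1; the label "SHA1" is illustrative only.
    """
    m = NEW_NAME.fullmatch(name)
    if m:
        return m.group(1)
    if OLD_NAME.fullmatch(name):
        return "SHA1"
    raise ValueError(f"not an object name: {name!r}")

# ASSUMPTION: a total order, worse to better. The plan describes a
# configurable partial order; a list is the simplest stand-in.
ORDER = ["SHA1", "H"]

def commit_subnamespace(parents, ref_hint=None, origin_hint="SHA1", order=ORDER):
    """Pick the subnamespace for a new commit: the newest of its parents'
    subnamespaces and the ref hint (if any). With no parents and no ref
    hint, fall back to the hint left by checkout --orphan or init."""
    candidates = [subnamespace(p) for p in parents]
    if ref_hint is not None:
        candidates.append(ref_hint)
    if not candidates:
        return origin_hint
    # Raises ValueError for an indicator that is missing from `order`.
    return max(candidates, key=order.index)

print(commit_subnamespace(["ce013625", "H9f2c"]))       # mixed parents
print(commit_subnamespace(["ce013625"], ref_hint="H"))  # branch being upgraded
print(commit_subnamespace(["ce013625"]))                # plain old history
```

The second call shows the boundary case described above: a new commit in the `H' subnamespace whose only parent is an old SHA-1 commit.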
- As far as the user is concerned every object in the object store is
- accessible using H. Objects which use H names can be received and
- stored.
+When a tree refers to a commit, it may refer to one in a different
+subnamespace.
- This is actually two states, depending on whether any objects exist
- in the store which use these names. If no such objects exist yet,
- we say that the hash function is `ENABLED PROSPECTIVE'. The H names
- for the objects have not yet been calculated.
+ Rationale: we want to avoid new commits and tags relying on weak
+ hashes. But we must avoid demanding that commits be rewritten.
- When the first object which names another object using H is received
- (or, on demand), the object store calculates the H names for all
- existing objects and notes that this hash function is now
- `ENABLED PRESENT'.
-* OBSOLESCENT: Every object in the object store has its hash
- calculated using H. However, H is known to possibly have collisions
- which we try to tolerate. When a collision occurs, the object text
- which is currently in the object store is preferred and the "new"
- object is thrown away. Local creation of new objects with
- references using H is forbidden.
+Blobs
- This is used as part of a gradual desupport strategy. When the hash
- function is in this stage, existing history in all existing object
- stores is safe and cannot be corrupted or modified by receiving
- colliding objects.
+Blobs are normally referred to by trees. Trees always refer to blobs
+in the same subnamespace.
- New object stores which receive their data from a trustworthy sender
+Where a blob is created in other circumstances, the caller should
+specify the subnamespace.
+
+
+Ref hints
+
+As noted above, each ref may also have a subnamespace hint associated
+with it.
+
+The subnamespace hint is (by default) copied, when the ref value is
+copied. So for example if `git checkout foo' makes refs/heads/foo out
+of refs/remotes/origin/foo, it will copy the subnamespace hint (or
+lack of one) from refs/remotes/origin/foo.
+
+Likewise, the subnamespace hint is conveyed by `git fetch' (by
+default) and can be updated with `git push' (though this is not done
+by default).
+
+The ref subnamespace hint may be set explicitly. That is how an
+individual branch is upgraded. git checkout --orphan sets it to the
+subnamespace (or hint) of the previous HEAD.
+
+When a commit is made and stored in a ref, the subnamespace hint for
+that ref is removed iff the commit's subnamespace and the hint's
+subnamespace are the same.
+
+
+OBJECT STORE BEHAVIOUR
+
+The object store has configuration to specify which hash functions are
+enabled. Each hash function H has a combination of the following
+behaviours, according to configuration:
+
+* Collision behaviour:
+
+ What to do if we encounter an object we already have (eg as part of
+ a pack, or with hash-object) but with different contents.
+
+ (a) fail: print a scary message and abort operation (on the
+ basis that the source of the colliding object probably intended
+ the preimage that they provided, or is conducting an attack).
+
+ (b) tolerate: prefer our own data; print a message, but treat
+ the reference as referring to our version of the object.
+
+ In both cases we keep a copy of the second preimage in our .git, for
+ forensic purposes.
+
+ This is used as part of a gradual desupport strategy. Existing
+ history in all existing object stores is safe and cannot be
+ corrupted or modified by receiving colliding objects.
+
+ New trees which receive their initial data from a trustworthy sender
 over a trustworthy channel will receive correct data. Bad object
 stores or untrustworthy channels could exploit collisions, but not in
 new regions of the history which are presumably using new names.
@@ -92,117 +228,147 @@ the user:
 Merging previously-unrelated histories does introduce a collision
 hazard, but the collision would have had to have been introduced
- while H was still a "live" hash function in at least one of the two
- projects.
+ while the colliding hash function was still a live hash function
+ in at least one of the two projects.
-* FORBIDDEN: Objects do not have their hashes calculated using this
- hash function. Attempts to reference an object by such a name
- fail. Optionally the user may specify a tolerant mode where:
- a commit which refers to parents by obsolete names is taken to
- simply not have those parents; a commit which refers to a tree by
- an obsolete name is taken to have an empty tree.
- This is used for two purposes:
+* Hash function enablement:
- - On a server, we use this to restrict the propagation of
- new hashes so as to enforce our compatibility intentions.
- Ie, hashes which we are "not ready for" are forbidden.
+ (a) enabled: this hash function is good and available for use
- - Everywhere, we use this to get rid of old hash functions.
- It makes access to old history possible but difficult.
+ (b) deprecated (in favour of H2): this hash function is
+ available for use, but newly created objects will use another
+ hash function instead (specifically, when creating an object,
+ this hash function is not considered as a candidate; if as a
+ result there are no candidate hash functions, we use the
+ specified replacement H2). Existing refs referring to objects
+ with this hash, with no ref hint, are treated as having a ref
+ hint specifying H2. If no H2 is specified, the newest
+ "best" hash is used.
-* FORGOTTEN: Objects do not have their hashes calculated using this
- hash function. References to objects by all such names return dummy
- objects of the right shape: the empty blob; the empty tree; a root
- commit with an empty tree and dummy metadata.
+ (c) disabled: existing objects using this hash function can be
+ accessed, but no such objects can be created or received.
+ (again, a replacement may be specified). This is used both
+ initially to prevent unintended upgrade, and later to block the
+ introduction of vulnerable data generated by badly configured
+ clients.
- This allows us to finally retire a hash function entirely. We
- effectively throw away all the history which uses H.
-During transfer protocols, the receiver will say which hashes are
-obsolete or forgotten, and the sender will not follow such references
-when computing the set of objects to send. So receivers will not
-receive the objects which were named only by obsolete or forgotten
-names.
+Remote protocol
+During the negotiation, a receiver needs to specify what hashes it
+understands, and whether it is prepared to see only a partial view.
-Naming in newly-generated objects, queries, etc.
+When the sender is listing its refs, refs naming objects the receiver
+cannot understand are either elided (if the receiver is content with a
+partial view), or cause an error.
-There is a `default' hash function, which is that which HEAD uses.
-(That is, HEAD refers to an object by some name. The default hash
-function is that name's hash function.)
-git tools produce always output object names in the default hash
-function. (Including git-hash-object.)
+Equality testing
-As a consequence, newly generated objects will contain object
-references using the `default' hash function.
+Note that semantically identical trees may (now) have different tree
+objects because those tree objects might use (and be named by)
+different hashes. So (in some contexts at least) tree comparison
+cannot any longer be done by comparing names; rather an invocation of
+git diff is needed, or explicit generation of a tree object with the
+right hash.
-When HEAD is empty, there is a separate record of the default hash
-function. This comes from a configured default in a new tree. In an
-existing tree, using git checkout --orphan remembers the default hash
-function that HEAD had.
-When HEAD is updated to a new commit, the name stored in HEAD uses the
-newer of the previous HEAD hash function and of the hash function used
-in the commit being stored. ("Newer" is a built-in preference order,
-overrideable by configuration.)
+TRANSITION PLAN
-This (together with the `forbidden' state, above) ensures that
-switching a project to use a new hash function is a deliberate
-decision: the default hash function needs to be changed to make the
-first first commit with the new hash function. After that, provided
-the server accepts it, it's infectious.
+(For brevity I will write `SHA' for hashing with SHA-1, using current
+unqualified object names, and `BLAKE' for hashing with BLAKE2b, using
+H object names.)
+Y0: Implement all of the above. Test it.
-Naming of refs other than HEAD
+ Default configuration:
+ SHA is enabled
+ BLAKE is disabled in trees without working trees
+ BLAKE is enabled in trees with working trees
+ SHA > BLAKE
-A ref refers to an object by one of its names. However, operations
-like git-show-ref convert that name to the default format (see above).
+ Effects:
-git-gc rewrites ref names to the default format.
+ Clients are prepared to process BLAKE data, but it is not
+ generated by default and cannot be pushed to servers.
+ All old git clients still work.
-Remote protocol
+Y4: BLAKE by default for new projects.
+ Conversion enabled for existing projects.
+ Old git software is going to start rotting.
-During the negotation, a client needs to specify what names it
-understands, and which it prefers (its default).
+ Default configuration change:
+ BLAKE > SHA
+ BLAKE enabled (even in trees without working trees)
-When the server is listing its refs, the names are converted to the
-client's preferred format.
+ Suggested bulk hosting site configuration change:
+ Newly created projects should get BLAKE enabled
+ Existing projects should retain BLAKE disabled by default
+ Button should be provided to start conversion (see below)
+ Effects:
-Equality testing
+ When creating a new working tree, it starts using BLAKE.
-All software which tests for equality of git objects by checking
-whether their object names are equal needs to obtain a canonical name
-for both objects.
+ Servers which have been updated will accept BLAKE.
-This is going to be quite annoying.
+ Servers which have not been updated to Y4's git will need a small
+ configuration change (enabling BLAKE) to cope with the new
+ projects that are using BLAKE.
-Note that semantically identical trees may (now) have different tree
-objects because those tree objects might contain different object
-names. So tree comparison cannot any longer be done by comparing
-names; rather an invocation of git diff is needed.
+ To convert a project, an administrator (or project owner) would
+ set BLAKE to enabled, and SHA to deprecated, on the server. On
+ the next pull the server will provide ref hints naming BLAKE,
+ which will get copied to the user's HEAD. So the user is infected
+ with BLAKE.
+ To convert a project branch-by-branch, the administrator would set
+ BLAKE to enabled but leave SHA enabled. Then each branch retains
+ its own hash. A branch can be converted by pushing a BLAKE commit
+ to it, or by setting a ref hint on the server.
-Transition plan
+Y6: BLAKE by default for all projects
+ Existing projects start being converted infectiously.
+ It is hard for a project to stop this happening if any of
+ their servers are updated.
+ Old git software is firmly stuffed.
-Y0: Implement all of the above. Test it.
+ Default configuration change
+ SHA deprecated in trees without working trees
- Default configuration:
- SHA-1 is ENABLED and is default HEAD hash
+ Effects:
- SHA-512 is FORBIDDEN in bare repos
- SHA-512 is ENABLED in trees with working trees
+ Existing projects are, by default, `converted', as described
+ above.
-Y5: New projects should start using SHA-512.
+Y8: Clients hate SHA
+ Clients insist on trying to convert existing projects
+ It is very hard to stop this happening.
+ Unrepentant servers start being very hard to use.
- Default configuration change:
+ Default configuration change
+ SHA deprecated (even in trees without working trees)
+
+ Effects:
+
+ Clients will generate only BLAKE. Hopefully their server will
+ accept this!
+
+Y10: Stop accepting new SHA
+ No-one can manage to make new SHA commits
+
+ Default configuration change
+ SHA disabled in new trees, except during initial
+ `clone', `mirror' and similar
+
+ Effects:
+
+ Existing SHA history is retained, and copied to new clients and
+ servers. But established clients and servers reject any newly
+ introduced SHA.
- SHA-512 becomes ENABLED in *new* bare repos but remains
- FORBIDDEN in existing ones
- --
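The interaction between the enablement states (enabled, deprecated, disabled) and the two conversion modes in the transition plan can be sketched as follows. This is a hedged illustration in Python, not the plan's specification: the function and variable names are hypothetical, the ordering is simplified to a worse-to-better list, `X > Y' in the plan is read here as "X is ordered above Y", and the sketch does not check that a replacement hash is itself enabled.

```python
from enum import Enum

class Enablement(Enum):
    ENABLED = "enabled"
    DEPRECATED = "deprecated"
    DISABLED = "disabled"

def hash_for_new_object(candidates, state, replacement, order):
    """Choose the hash function for a newly created object.

    candidates:  hashes suggested by parents and ref hints
    state:       hash name -> Enablement
    replacement: hash name -> replacement hash (the plan's H2), optional
    order:       hash names from worse to better (ASSUMPTION: total order)
    """
    # Deprecated and disabled hashes are not considered as candidates.
    usable = [h for h in candidates if state[h] is Enablement.ENABLED]
    if usable:
        return max(usable, key=order.index)
    # No candidate survived: use a specified replacement, if there is one.
    for h in sorted(candidates, key=order.index, reverse=True):
        if replacement.get(h) is not None:
            return replacement[h]
    # No replacement specified: fall back to the best enabled hash.
    enabled = [h for h in order if state[h] is Enablement.ENABLED]
    if not enabled:
        raise RuntimeError("no enabled hash function")
    return enabled[-1]

ORDER_Y4 = ["SHA", "BLAKE"]  # BLAKE > SHA, as from Y4

# Branch-by-branch conversion: both enabled, each branch keeps its hash.
branch_by_branch = {"SHA": Enablement.ENABLED, "BLAKE": Enablement.ENABLED}
# Whole-project conversion: SHA deprecated in favour of BLAKE.
converted = {"SHA": Enablement.DEPRECATED, "BLAKE": Enablement.ENABLED}

print(hash_for_new_object(["SHA"], branch_by_branch, {}, ORDER_Y4))
print(hash_for_new_object(["SHA"], converted, {"SHA": "BLAKE"}, ORDER_Y4))
print(hash_for_new_object(["SHA"], converted, {}, ORDER_Y4))
```

With both hashes enabled, a commit on an SHA branch stays SHA; once SHA is deprecated, the same commit is created in BLAKE, whether or not a replacement was named explicitly.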