X-Git-Url: http://www.chiark.greenend.org.uk/ucgi/~ian/git?p=git-hash-transition-plan.git;a=blobdiff_plain;f=plan.txt;h=9e2037b74781ab87314aa10333f8dc6d30c27393;hp=b75537aca0b645165dcbbd8ebbbc799940a2f52e;hb=819b72d672dbbe88029b9c47a2f8db3cd46ea880;hpb=266b32bc46dbfab5e449ff3be5ae5b8c51d8c5d8

diff --git a/plan.txt b/plan.txt
index b75537a..9e2037b 100644
--- a/plan.txt
+++ b/plan.txt
@@ -153,16 +153,16 @@ to which it refers.
 
 Trees
 
-Trees are always referenced by objects in their own subnamespace.
+Trees are only referenced by objects in their own subnamespace.
 
-Occasionally, a tree object from one subnamespace must be recursively
-rewritten into another subnamespace.
+To satisfy this rule, occasionally a tree object from one subnamespace
+must be recursively rewritten into another subnamespace.
 
 When a tree refers to a commit, it may refer to one in a different
 subnamespace.
 
   Rationale: we want to avoid new commits and tags relying on weak
-  hashes.
+  hashes. But we must avoid demanding that commits be rewritten.
 
 Blobs
 
@@ -174,49 +174,55 @@ Where a blob is created in other circumstances, the caller should
 specify the subnamespace.
 
 
+Ref hints
 
+As noted above, each ref may also have a subnamespace hint associated
+with it.
 
+The subnamespace hint is (by default) copied, when the ref value is
+copied. So for exmple if `git checkout foo' makes refs/heads/foo out
+of refs/remotes/origin/foo, it will copy the subnamespace hint (or
+lack of one) from refs/remotes/origin/foo.
 
-Object store:
+Likewise, the subnamespace hint is conveyed by `git fetch' (by
+default) and can be updated with `git push' (though this is not done
+by default).
 
-The object store knows which hash functions are enabled. Each hash
-function H has one of the following statuses, which are configured by
-the user:
+The ref subnamespace hint may be set explicitly. That is how an
+individual branch is upgraded. git checkout --orphan sets it to the
+subnamespace (or hint) of the previous HEAD.
 
-* ENABLED:
+When a commit is made and stored in a ref, the subnamespace hint for
+that ref is removed iff the commit's subnamespace and the hint's
+subnamespace are the same.
 
-  As far as the user is concerned every object in the object store is
-  accessible using H. Objects which use H names can be received and
-  stored.
 
-  This is actually two states, depending on whether any objects exist
-  in the store which use these names. If no such objects exist yet,
-  we say that the hash function is `ENABLED PROSPECTIVE'. The H names
-  for the objects have not yet been calculated.
+OBJECT STORE BEHAVIOUR
 
-  When the first object which names another object using H is received
-  (or, on demand), the object store calculates the H names for all
-  existing objects and notes that this hash function is now
-  `ENABLED PRESENT'.
+The object store has configuration to specify which hash functions are
+enabled. Each hash function H has a combination of the following
+behaviours, according to configuration:
 
-  If a hash collision is detected, we crash immediately.
+* Collision behaviour:
 
-* OBSOLESCENT: Every object in the object store has its hash
-  calculated using H. However, H is known to possibly have collisions
-  which we try to tolerate. When a collision occurs, the object text
-  which is currently in the object store is preferred and the "new"
-  object is thrown away.
+  What to do if we encounter an object we already have (eg as part of
+  a pack, or with hash-object) but with different contents.
 
-  Local creation of new objects with references using H is
-  discouraged. Specifically, if another hash function is ENABLED, we
-  will use that instead.
+    (a) fail: print a scary message and abort operation (on the
+        basis that the source of the colliding object probably intended
+        the preimage that they provided, or is conducting an attack).
 
-  This is used as part of a gradual desupport strategy. When the hash
-  function is in this stage, existing history in all existing object
-  stores is safe and cannot be corrupted or modified by receiving
-  colliding objects.
+    (b) tolerate: prefer our own data; print a message, but treat
+        the reference as referring to our version of the object.
 
-  New object stores which receive their data from a trustworthy sender
+  In both cases we keep a copy of the second preimage in our .git, for
+  forensic purposes.
+
+  This is used as part of a gradual desupport strategy. Existing
+  history in all existing object stores is safe and cannot be
+  corrupted or modified by receiving colliding objects.
+
+  New trees which receive their initial data from a trustworthy sender
   over a trustworthy channel will receive correct data. Bad object
   stores or untrustworthy channels could exploit collisions, but not
   in new regions of the history which are presumably using new names.
@@ -224,110 +230,62 @@ the user:
 
   Merging previously-unrelated histories does introduce a collision
   hazard, but the collision would have had to have been introduced
-  while H was still a "live" hash function in at least one of the two
-  projects.
-
-* FORBIDDEN: Objects do not have their hashes calculated using this
-  hash function. Attempts to reference an object by such a name
-  fail. Optionally the user may specify a tolerant mode where:
-  a commit which refers to parents by obsolete names is taken to
-  simply not have those parents; a commit which refers to a tree by
-  an obsolete name is taken to have an empty tree.
-
-  This is used for two purposes:
-
-    - On a server, we use this to restrict the propagation of
-      new hashes so as to enforce our compatibility intentions.
-      Ie, hashes which we are "not ready for" are forbidden.
-
-    - Everywhere, we use this to get rid of old hash functions.
-      It makes access to old history possible but difficult.
-
-* FORGOTTEN: Objects do not have their hashes calculated using this
-  hash function. References to objects by all such names return dummy
-  objects of the right shape: the empty blob; the empty tree; a root
-  commit with an empty tree and dummy metadata.
-
-  This allows us to finally retire a hash function entirely. We
-  effectively throw away all the history which uses H.
+  while the colliding hash function was still a live hash function
+  in at least one of the two projects.
+
+
+* Hash function enablement:
+
+    (a) enabled: this hash function is good and available for use
+
+    (b) deprecated (in favour of H2): this hash function is
+        available for use, but newly created objects will use another
+        hash function instead (specifically, when creating an object,
+        this has function is not considered as a candidate; if as a
+        result there are no candidate hash functions, we use the
+        specified replacement H2). Existing refs referring to objects
+        with this hash, with no ref hint, are treated as having a ref
+        hint specifying H2. If no H2 is specified, the newest hash
+        "best" hash is used.
+
+    (c) disabled: existing objects using this hash function can be
+        accessed, but no such objects can be created or received.
+        (again, a replacement may be specified). This is used both
+        initially to prevent unintended upgrade, and later to block the
+        introduction of vulnerable data generated by badly configured
+        clients.
+
+    (d) forgotten: such objects are not stored. References to such
+        objects return dummy objects of the right shape: the empty blob;
+        the empty tree; a root commit with an empty tree and dummy
+        metadata. This allows us to finally retire a hash function
+        entirely. We effectively throw away all the history which uses
+        this hash function.
 
 During transfer protocols, the receiver will say which hashes it
-thinks are obsolete or forgotten, and the sender will not follow such
-references when computing the set of objects to send. So receivers
-will not receive the objects which were named only by obsolete or
-forgotten names.
-
-
-Naming in newly-generated objects, queries, etc.
-
-There is a `default' hash function, which is that which HEAD uses.
-(That is, HEAD refers to an object by some name. The default hash
-function is that name's hash function.)
-
-git tools produce always output object names in the default hash
-function. (Including git-hash-object.)
-
-As a consequence, newly generated objects will contain object
-references using the `default' hash function.
-
-When HEAD is empty, there is a separate record of the default hash
-function. This comes from a configured default in a new tree. In an
-existing tree, using git checkout --orphan remembers the default hash
-function that HEAD had.
-
-When HEAD is updated to a new commit, the name stored in HEAD uses the
-newer of the previous HEAD hash function and of the hash function used
-in the commit being stored. ("Newer" is a built-in preference order,
-overrideable by configuration.)
-
-This (together with the `forbidden' state, above) ensures that
-switching a project to use a new hash function is a deliberate
-decision: the default hash function needs to be changed to make the
-first commit with the new hash function. After that, provided
-the server accepts it, it's infectious.
-
-
-Naming of refs other than HEAD
-
-A ref refers to an object by one of its names. However, operations
-like git-show-ref convert that name to the default format (see above).
-
-git-gc rewrites ref names to the default format iff that is newer.
+thinks are forgotten, and the sender will not follow such references
+when computing the set of objects to send. So receivers will not
+receive the forgotten objects.
 
 
 Remote protocol
 
 During the negotation, a receiver needs to specify what hashes it
-understands.
-
-When the sender is listing its refs, the names are converted to a
-hash understood by the client if necessary. If this is not necessary,
-they are left unchanged.
+understands, and whether it is prepared to see only a partial view.
 
-When a receiver is updating refs, it should by follow the sender's
-idea of a hash change iff it's an upgrade (and the new function is
-ENABLED). That is, if the sender sends name H2 for some ref, and the
-receiver has H1, but these refer to the same object, then the receiver
-should update its own ref name from H1 to H2 iff H2 uses a newer hash
-function.
+When the sender is listing its refs, refs naming objects the receiver
+cannot understand are either elided (if the receiver is content with a
+parial view), or cause an error.
 
 
 Equality testing
 
-All software which tests for equality of git objects by checking
-whether their object names are equal needs to obtain a canonical name
-for both objects.
-
-This is going to be quite annoying.
-
-We should provide a convenient utility which tests whether two object
-names refer to the same object.
-
 Note that semantically identical trees may (now) have different tree
-objects because those tree objects might contain different object
-names. So (in some contexts at least) tree comparison cannot any
-longer be done by comparing names; rather an invocation of git diff is
-needed, or explicit generation of a tree object with the right name.
+objects because those tree objects might use (and be named by)
+different hashes. So (in some contexts at least) tree comparison
+cannot any longer be done by comparing names; rather an invocation of
+git diff is needed, or explicit generation of a tree object with the
+right hash.
 
 
 Transition plan
@@ -335,10 +293,9 @@ Transition plan
 Y0: Implement all of the above. Test it.
 
   Default configuration:
-    SHA-1 is ENABLED
-    SHA-512 is FORBIDDEN in bare repos
-    SHA-512 is ENABLED in trees with working trees
-    default HEAD hash is SHA-1
+    SHA-1 is enabled
+    SHA-512 is disabled in trees without working trees
+    SHA-512 is enabled in trees with working trees
 
   Effects: