From 5a36a235897886931d7c6e553090f56621429552 Mon Sep 17 00:00:00 2001
From: Ian Jackson
Date: Thu, 23 Feb 2017 18:34:47 +0000
Subject: [PATCH] copied from email

---
 plan.txt | 212 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 212 insertions(+)
 create mode 100644 plan.txt

diff --git a/plan.txt b/plan.txt
new file mode 100644
index 0000000..803aadd
--- /dev/null
+++ b/plan.txt
@@ -0,0 +1,212 @@
+From: Ian Jackson
+To: ijackson@chiark.greenend.org.uk
+Subject: Transition plan for git to move to a new hash function
+Date: Thu, 20 Oct 2016 19:26:44 +0100
+
+
+Basic principle: Every object will have two (or more) names,
+corresponding to different hash functions. It may be named by any of
+its names, in every context.
+
+Every program that invokes git or speaks git protocols will need to
+understand the extended object name syntax, and understand that
+objects have multiple names.
+
+Safety catches prevent accidental incorporation into a project of
+objects which contain references by incompatibly-new or
+deprecatedly-old names. This allows for incremental deployment.
+
+
+Syntax:
+
+The object name syntax is extended as follows: object names using sha1
+are as current. Object names starting with lowercase ASCII letters h
+or later refer to new hash functions. (`g' is reserved because of the
+way that many programs write `g'. Programs that use
+`g' should be changed to show `h' for hash function
+`h' rather than `gh'.)
+
+Object names h are SHA-512 hashes. Remaining letters are
+reserved. `x' `y' `z' are reserved for private experiments; we
+declare that public releases of git will never accept such names.
+
+Everywhere in the git object formats and git protocols, a new object
+name (with hash function indicator) is permitted where an old object
+name is permitted.
+A single object refers to all the objects it
+references by the same hash function; in general this might be a
+different hash function to the hash function by which this particular
+object was itself referenced or obtained.
+
+As an exception, it is forbidden to refer to a tree object by a name
+whose hash function differs from the one it uses to name its subtrees.
+If this seems necessary, the tree object must be recursively rewritten
+instead to use the desired object name.
+
+In binary protocols, where a SHA-1 object name in binary form was
+previously used, a new codepoint must be allocated in a containing
+structure (eg a new typecode). Usually, the new-format binary object
+will have a new typecode and also an additional name hash indicator.
+15 of the hash indicator values correspond to the lowercase letters
+reserved above.
+
+
+Object store:
+
+The object store knows which hash functions are enabled. Each hash
+function H has one of the following statuses, which are configured by
+the user:
+
+* ENABLED:
+
+  As far as the user is concerned every object in the object store is
+  accessible using H. Objects which use H names can be received and
+  stored.
+
+  This is actually two states, depending on whether any objects exist
+  in the store which use these names. If no such objects exist yet,
+  we say that the hash function is `ENABLED PROSPECTIVE'. The H names
+  for the objects have not yet been calculated.
+
+  When the first object which names another object using H is received
+  (or, on demand), the object store calculates the H names for all
+  existing objects and notes that this hash function is now
+  `ENABLED PRESENT'.
+
+* OBSOLESCENT: Every object in the object store has its hash
+  calculated using H. However, H is known to possibly have collisions
+  which we try to tolerate. When a collision occurs, the object text
+  which is currently in the object store is preferred and the "new"
+  object is thrown away.
+  Local creation of new objects with
+  references using H is forbidden.
+
+  This is used as part of a gradual desupport strategy. When the hash
+  function is in this stage, existing history in all existing object
+  stores is safe and cannot be corrupted or modified by receiving
+  colliding objects.
+
+  New object stores which receive their data from a trustworthy sender
+  over a trustworthy channel will receive correct data. Bad object
+  stores or untrustworthy channels could exploit collisions, but not
+  in new regions of the history which are presumably using new names.
+  So the collisions can only affect archaeology.
+
+  Merging previously-unrelated histories does introduce a collision
+  hazard, but the collision would have had to have been introduced
+  while H was still a "live" hash function in at least one of the two
+  projects.
+
+* FORBIDDEN: Objects do not have their hashes calculated using this
+  hash function. Attempts to reference an object by such a name
+  fail. Optionally the user may specify a tolerant mode where:
+    a commit which refers to parents by obsolete names is taken to
+    simply not have those parents; a commit which refers to a tree by
+    an obsolete name is taken to have an empty tree.
+
+  This is used for two purposes:
+
+    - On a server, we use this to restrict the propagation of
+      new hashes so as to enforce our compatibility intentions.
+      Ie, hashes which we are "not ready for" are forbidden.
+
+    - Everywhere, we use this to get rid of old hash functions.
+      It makes access to old history possible but difficult.
+
+* FORGOTTEN: Objects do not have their hashes calculated using this
+  hash function. References to objects by all such names return dummy
+  objects of the right shape: the empty blob; the empty tree; a root
+  commit with an empty tree and dummy metadata.
+
+  This allows us to finally retire a hash function entirely. We
+  effectively throw away all the history which uses H.
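Illustration only, not part of the plan: the status rules listed above
can be sketched in Python roughly as follows. The names HashStatus,
DUMMY, may_create_references, resolve and receive_obsolescent are
hypothetical. The store is modelled as a plain dict from H name to
object text, and the dummy commit is a placeholder because the plan
does not specify its metadata. That FORBIDDEN and FORGOTTEN also
disallow local creation is an assumption.

```python
from enum import Enum


class HashStatus(Enum):
    ENABLED = 1
    OBSOLESCENT = 2
    FORBIDDEN = 3
    FORGOTTEN = 4


# Placeholder dummy objects "of the right shape". The real contents
# (in particular the dummy commit metadata) are not given in the plan.
DUMMY = {"blob": b"", "tree": b"", "commit": b"dummy root commit"}


def may_create_references(status):
    """May new local objects refer to other objects by H names?"""
    # OBSOLESCENT forbids this explicitly; FORBIDDEN and FORGOTTEN
    # are assumed to forbid it as well.
    return status is HashStatus.ENABLED


def resolve(status, store, name, kind):
    """Look up an H name in `store` (a dict: H name -> object text)."""
    if status in (HashStatus.ENABLED, HashStatus.OBSOLESCENT):
        return store[name]
    if status is HashStatus.FORBIDDEN:
        raise LookupError("forbidden hash function: " + name)
    # FORGOTTEN: hand back a dummy object of the requested kind.
    return DUMMY[kind]


def receive_obsolescent(store, name, text):
    """Receive an object under an OBSOLESCENT hash function.

    On a collision the text already in the store is preferred and the
    newly received text is thrown away.
    """
    store.setdefault(name, text)
    return store[name]
```

For example, receiving b"new" under a name that already maps to b"old"
leaves b"old" in the store, which is the property that keeps existing
history safe while H is being retired.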
+
+During transfer protocols, the receiver will say which hashes are
+obsolete or forgotten, and the sender will not follow such references
+when computing the set of objects to send. So receivers will not
+receive the objects which were named only by obsolete or forgotten
+names.
+
+
+Naming in newly-generated objects, queries, etc.
+
+There is a `default' hash function, which is that which HEAD uses.
+(That is, HEAD refers to an object by some name. The default hash
+function is that name's hash function.)
+
+git tools always output object names in the default hash
+function. (Including git-hash-object.)
+
+As a consequence, newly generated objects will contain object
+references using the `default' hash function.
+
+When HEAD is empty, there is a separate record of the default hash
+function. This comes from a configured default in a new tree. In an
+existing tree, using git checkout --orphan remembers the default hash
+function that HEAD had.
+
+When HEAD is updated to a new commit, the name stored in HEAD uses the
+newer of the previous HEAD hash function and of the hash function used
+in the commit being stored. ("Newer" is a built-in preference order,
+overrideable by configuration.)
+
+This (together with the `forbidden' state, above) ensures that
+switching a project to use a new hash function is a deliberate
+decision: the default hash function needs to be changed to make the
+first commit with the new hash function. After that, provided
+the server accepts it, it's infectious.
+
+
+Naming of refs other than HEAD
+
+A ref refers to an object by one of its names. However, operations
+like git-show-ref convert that name to the default format (see above).
+
+git-gc rewrites ref names to the default format.
+
+
+Remote protocol
+
+During the negotiation, a client needs to specify what names it
+understands, and which it prefers (its default).
+
+When the server is listing its refs, the names are converted to the
+client's preferred format.
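Illustration only, not part of the plan: the ref listing rule above,
combined with the name syntax (a leading lowercase letter h or later is
a hash indicator, anything else is SHA-1), might look like this in
Python. The functions hash_function_of and list_refs, the "sha1" label,
and the names_of table mapping each name to all names of the same
object are hypothetical. Raising an error when no name in the preferred
format exists is an assumption; the plan does not say what happens then.

```python
def hash_function_of(name):
    """Return the hash indicator of an object name.

    Names starting with a lowercase ASCII letter h or later carry a
    hash indicator; every other name is treated as SHA-1.
    """
    first = name[0]
    return first if "h" <= first <= "z" else "sha1"


def list_refs(refs, names_of, preferred):
    """Convert each ref's stored name to the client's preferred format.

    refs:      dict of ref name -> object name as stored
    names_of:  dict of object name -> set of all names of that object
    preferred: hash indicator the client prefers ("sha1", "h", ...)
    """
    listing = {}
    for ref, stored in refs.items():
        for candidate in names_of[stored]:
            if hash_function_of(candidate) == preferred:
                listing[ref] = candidate
                break
        else:
            # Assumption: no name in the preferred format is an error.
            raise LookupError(ref + ": no " + preferred + " name known")
    return listing
```

The same lookup would serve for operations like git-show-ref, with the
repository's default hash function in place of the client's preference.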
+
+
+Equality testing
+
+All software which tests for equality of git objects by checking
+whether their object names are equal needs to obtain a canonical name
+for both objects.
+
+This is going to be quite annoying.
+
+Note that semantically identical trees may (now) have different tree
+objects because those tree objects might contain different object
+names. So tree comparison cannot any longer be done by comparing
+names; rather an invocation of git diff is needed.
+
+
+Transition plan
+
+Y0: Implement all of the above. Test it.
+
+    Default configuration:
+      SHA-1 is ENABLED and is default HEAD hash
+
+      SHA-512 is FORBIDDEN in bare repos
+      SHA-512 is ENABLED in trees with working trees
+
+Y5: New projects should start using SHA-512.
+
+    Default configuration change:
+
+      SHA-512 becomes ENABLED in *new* bare repos but remains
+      FORBIDDEN in existing ones
+
+
+
+--
+Ian Jackson These opinions are my own.
+
+If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
+a private address which bypasses my fierce spamfilter.
--
2.30.2