better

diff --git a/plan.txt b/plan.txt

index 612a8b2b3ecbf9701522edf977a3551dd34d6272..b4c622ba40e1637ed7c91652b3025d6dd059a2b7 100644
--- a/plan.txt
+++ b/plan.txt
@@ -1,20 +1,26 @@
-From: Ian Jackson <ijackson@chiark.greenend.org.uk>
-To: ijackson@chiark.greenend.org.uk
  Subject: Transition plan for git to move to a new hash function
-Date: Thu, 20 Oct 2016 19:26:44 +0100
  
  
-Basic principle: Every object will have two (or more) names,
-corresponding to different hash functions.  It may be named by any of
-its names, in every context.
+BASIC PRINCIPLE
+
+We run multiple object name subnamespaces in parallel, one for each
+hash function.  Each object lives in exactly one subnamespace.
+Objects with identical content in the different object stores, named
+by different hash functions, are different objects.
+
+Objects may refer to objects living in different subnamespaces (ie,
+named by a different hash function) to their own.
+
+Packfiles need to be extended to be able to contain objects named by
+new hash functions.  Blob objects with identical contents but living
+in different subnamespaces would ideally share storage.
  
  Every program that invokes git or speaks git protocols will need to
-understand the extended object name syntax, and understand that
-objects have multiple names.
+understand the extended object name syntax.
  
  Safety catches prevent accidental incorporation into a project of
-objects which contain references by incompatibly-new or
-deprecatedly-old names.  This allows for incremental deployment.
+incompatibly-new objects, or additional deprecatedly-old objects.
+This allows for incremental deployment.
  
  
  TEXTUAL SYNTAX
@@ -83,181 +89,138 @@ Everywhere in the git object formats and git protocols, a new object
  name (with hash function indicator) is permitted where an old object
  name is permitted.
  
-A single object refers to all the objects it references by the same
-hash function; in general this might be a different hash function to
-the hash function by which this particular object was itself
-referenced or obtained.
+A single object may refer to other objects by its own hash function, or
+by other hash functions.  Ie, object references cross subnamespaces.
+During all git operations, subnamespace boundaries in the object graph
+are traversed freely.
  
-As a further restriction, it is forbidden to refer to a tree object by
-a name other than the hash function it uses to name its subtrees.  If
-this seems necessary, the tree object must be recursively rewritten
-instead to use the desired object name.
+Two additional restrictions: a tree object may be referenced only by
+objects in the same subnamespace; and a tree object may reference
+only blobs in its own subnamespace.
  
  In binary protocols, where a SHA-1 object name in binary form was
  previously used, a new codepoint must be allocated in a containing
  structure (eg a new typecode).  Usually, the new-format binary object
-will have a new typecode and also an additional name hash indicator.
+will have a new typecode and also an additional name hash indicator,
+and it will also need a length field (as new hashes may be of
+different lengths).
  
  Whenever a new hash function textual syntax is defined, corresponding
-binary format codepoint(s) are assigned.  (Detailed binary format
-specification is outside the scope of this plan.)
+binary format codepoint(s) are assigned.  (Implementation details such
+as the binary format specification are outside the scope of this
+transition plan.)
  
  
  ORDERING
  
-Hash functions are partially ordered, from `older' to `newer'.
-
-The ordering is configurable.  The default, with the two hash
-functions defined here, is the obvious ordering
-    SHA1 ([0-9a-f]*) < BLAKE2b (H*)
+Hash functions are partially ordered, from `worse' to `better'.
+The ordering is configurable.  For details of the defaults,
+see _Transition Plan_.
  
  
-CHOICE OF OBJECT NAMES
+CHOICE OF SUBNAMESPACE
  
-Whenever objects are named, it is possible to refer to them by old or
-new names.  So git must make a choice, each time: when new objects
-are created; when refs are updated; and when refs are reported over
-network protocols to other instances of git.
+Whenever objects are created, it is necessary to choose the
+subnamespace to use (ie, the hash function).
  
-Although strictly speaking all objects have both old names and new
-names, and there may be more than two hash functions, it is possible
-to speak, somewhat loosely, about `new objects'.
-
-A `new' object is one which refers to other objects by a `new' name.
-(whatever `new' means).
-
-We call these different hashes `namings'.  That is, a `naming' is a
-hash function implemented by git.  The `naming IN an object' is the
-naming by which the object refers to other objects (and may not exist,
-if the object has no references); the `name OF an object' is the name
-by which the object itself is specified.
+Each ref may also have a subnamespace hint associated with it.
  
  
  Commits
  
-A non-origin commit is made (by default) as new as the newest of
-  (i) the naming in each of its parents
-  (ii) the specified name of each of its parents
-(Implicitly this normally means that if HEAD uses a new name, new
-commits will be generated.)
+A commit is made (by default) as new as the newest of
+ (i) each of its parents
+ (ii) if applicable, the subnamespace hint for the ref to which the
+     new commit is to be written
  
-The naming of an origin commit is controlled by a dropping left in
+Implicitly this normally means that if HEAD refers to a new commit,
+further new commits will be generated on top of it.
+
+The subnamespace of an origin commit is controlled by the hint left in
  .git by git checkout --orphan or git init.
  
-At boundaries between old and new history, a new commit will refer to
-old parents by those old parents' new names.
+At boundaries between old and new history, new commit(s) will refer to
+old parent(s).
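
The default rule for commits above can be sketched in Python as follows.
This is only an illustration: the names ORDER and commit_subnamespace
are hypothetical, and the plan's configurable partial ordering is
simplified here to a fixed list running from worse to better (the
ordering the plan adopts from Y4 onwards).

```python
from typing import List, Optional, Sequence

# Assumed ordering, worse to better.  The plan makes this configurable
# and only partially ordered; a plain list is a simplification.
ORDER = ["SHA", "BLAKE"]


def commit_subnamespace(parents: List[str],
                        ref_hint: Optional[str] = None,
                        order: Sequence[str] = ORDER) -> str:
    """Pick the subnamespace for a new commit.

    parents  -- subnamespace of each parent commit
    ref_hint -- subnamespace hint of the ref being written, if any
    """
    candidates = list(parents)
    if ref_hint is not None:
        candidates.append(ref_hint)
    if not candidates:
        # Origin commits are governed by the hint left in .git instead.
        raise ValueError("origin commit: no parents and no ref hint")
    # "As new as the newest of" the candidates.
    return max(candidates, key=list(order).index)
```

With these assumptions, a commit on top of a SHA parent stays in SHA
unless the ref carries a BLAKE hint, and a merge with any BLAKE parent
is itself BLAKE.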
  
  
  Tags
  
-A new tag is made to use newest naming, for its tagged object, of
-  (i) the name by which the tagged object was specified
-  (ii) the naming in the tagged object (if applicable)
+A tag is created (by default) in the same subnamespace as the object
+to which it refers.
  
  
  Trees
  
-Commits (and sometimes, tags) can refer to tree objects; that tree
-will contain the same naming as the referring object.
+Trees are only referenced by objects in their own subnamespace.
+
+To satisfy this rule, occasionally a tree object from one subnamespace
+must be recursively rewritten into another subnamespace.
  
-That is, it is a bug to refer to a tree object by other than the hash
-it uses internally to refer to subtrees (and gitlinks).  This will
-mean that a tree must sometimes be rewritten (ie, new object names
-recalculated recursively).
+When a tree refers to a commit, it may refer to one in a different
+subnamespace.
  
      Rationale: we want to avoid new commits and tags relying on weak
-    hashes.
+    hashes.  But we must avoid demanding that commits be rewritten.
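
A rough Python sketch of the recursive tree rewrite follows.  The tree
serialisation is deliberately simplified and is not git's real tree
format; HASHES, object_name and write_tree are hypothetical names, and
the `H' prefix for BLAKE2b names is taken from the textual syntax used
elsewhere in this plan.  Blobs and subtrees are renamed in the target
subnamespace, while gitlinks (commit names) are copied unchanged,
since a tree may refer to a commit in a different subnamespace.

```python
import hashlib

# Subnamespace -> (hash constructor, textual prefix).  Assumed mapping.
HASHES = {"SHA": (hashlib.sha1, ""), "BLAKE": (hashlib.blake2b, "H")}


def object_name(ns: str, kind: str, payload: bytes) -> str:
    """Name an object in subnamespace ns, using git's '<kind> <len>\\0' header."""
    fn, prefix = HASHES[ns]
    header = f"{kind} {len(payload)}\0".encode()
    return prefix + fn(header + payload).hexdigest()


def write_tree(ns: str, tree: dict) -> str:
    """Recursively (re)name a tree in subnamespace ns.

    tree maps entry names to: bytes (blob content), dict (subtree),
    or str (an existing commit name, ie a gitlink).
    """
    lines = []
    for entry, value in sorted(tree.items()):
        if isinstance(value, dict):
            kind, child = "tree", write_tree(ns, value)
        elif isinstance(value, bytes):
            kind, child = "blob", object_name(ns, "blob", value)
        else:
            kind, child = "commit", value  # gitlink: not rewritten
        lines.append(f"{kind} {child}\t{entry}")
    return object_name(ns, "tree", "\n".join(lines).encode())
```

Calling write_tree("BLAKE", t) on a tree previously named with
write_tree("SHA", t) yields a new name for every subtree and blob, which
is why the plan describes the rewrite as recursive.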
  
  
  Blobs
  
-Blobs do not refer to other objects so they are neither new or old.
-
-
-Name of newly created object
-
-When git creates a new object, it reports the new object name using
-the naming in the object.
-
-For blobs and empty trees, the caller should normally specify.  The
-default is the naming used for HEAD.
-
-
-Updating refs
-
-If a ref is updated with a new object, the name from its creation is
-used (see above).
+Blobs are normally referred to by trees.  Trees always refer to blobs
+in the same subnamespace.
  
-If a ref is updated to a specified object, the naming used in the ref
-is the newer of the specified name, or the naming in the object (if
-any).
+Where a blob is created in other circumstances, the caller should
+specify the subnamespace.
  
  
+Ref hints
  
+As noted above, each ref may also have a subnamespace hint associated
+with it.
  
+The subnamespace hint is (by default) copied, when the ref value is
+copied.  So for example if `git checkout foo' makes refs/heads/foo out
+of refs/remotes/origin/foo, it will copy the subnamespace hint (or
+lack of one) from refs/remotes/origin/foo.
  
-), or with a specified object name.
+Likewise, the subnamespace hint is conveyed by `git fetch' (by
+default) and can be updated with `git push' (though this is not done
+by default).
  
+The ref subnamespace hint may be set explicitly.  That is how an
+individual branch is upgraded.  git checkout --orphan sets it to the
+subnamespace (or hint) of the previous HEAD.
  
+When a commit is made and stored in a ref, the subnamespace hint for
+that ref is removed iff the commit's subnamespace and the hint's
+subnamespace are the same.
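
The hint-removal rule can be written down in a few lines of Python.
The function name hint_after_commit is hypothetical, and hints and
subnamespaces are represented simply as strings (None meaning no hint).

```python
from typing import Optional


def hint_after_commit(commit_ns: str, hint: Optional[str]) -> Optional[str]:
    """Return the ref's subnamespace hint after a commit is stored in it.

    The hint is removed iff it names the same subnamespace as the
    commit; otherwise it is left untouched.
    """
    if hint is not None and hint == commit_ns:
        return None
    return hint
```

So a ref hinted to BLAKE loses its hint as soon as the first BLAKE
commit is stored in it; from then on the commit itself carries the
information.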
  
-(If there are different equally new names, one of the newest names is
-chosen according to some stable rule.)
  
+OBJECT STORE BEHAVIOUR
  
+The object store has configuration to specify which hash functions are
+enabled.  Each hash function H has a combination of the following
+behaviours, according to configuration:
  
-new
+* Collision behaviour:
  
-commit.  (This may mean converting the tree in hand, since trees are
-supposed to be homgeonous.)
+  What to do if we encounter an object we already have (eg as part of
+  a pack, or with hash-object) but with different contents.
  
+  (a) fail: print a scary message and abort operation (on the
+    basis that the source of the colliding object probably intended
+    the preimage that they provided, or is conducting an attack).
  
+  (b) tolerate: prefer our own data; print a message, but treat
+    the reference as referring to our version of the object.
  
+  In both cases we keep a copy of the second preimage in our .git, for
+  forensic purposes.
  
-A `new commit' is one which refers to objects by 
+  This is used as part of a gradual desupport strategy.  Existing
+  history in all existing object stores is safe and cannot be
+  corrupted or modified by receiving colliding objects.
  
-
-
-
-Object store:
-
-The object store knows which hash functions are enabled.  Each hash
-function H has one of the following statuses, which are configured by
-the user:
-
-* ENABLED:
-
-  As far as the user is concerned every object in the object store is
-  accessible using H.  Objects which use H names can be received and
-  stored.
-
-  This is actually two states, depending on whether any objects exist
-  in the store which use these names.  If no such objects exist yet,
-  we say that the hash function is `ENABLED PROSPECTIVE'.  The H names
-  for the objects have not yet been calculated.
-
-  When the first object which names another object using H is received
-  (or, on demand), the object store calculates the H names for all
-  existing objects and notes that this hash function is now
-  `ENABLED PRESENT'.
-
-  If a hash collision is detected, we crash immediately.
-
-* OBSOLESCENT: Every object in the object store has its hash
-  calculated using H.  However, H is known to possibly have collisions
-  which we try to tolerate.  When a collision occurs, the object text
-  which is currently in the object store is preferred and the "new"
-  object is thrown away.
-
-  Local creation of new objects with references using H is
-  discouraged.  Specifically, if another hash function is ENABLED, we
-  will use that instead.
-
-  This is used as part of a gradual desupport strategy.  When the hash
-  function is in this stage, existing history in all existing object
-  stores is safe and cannot be corrupted or modified by receiving
-  colliding objects.
-
-  New object stores which receive their data from a trustworthy sender
+  New trees which receive their initial data from a trustworthy sender
    over a trustworthy channel will receive correct data.  Bad object
    stores or untrustworthy channels could exploit collisions, but not
    in new regions of the history which are presumably using new names.
@@ -265,199 +228,147 @@ the user:
  
    Merging previously-unrelated histories does introduce a collision
    hazard, but the collision would have had to have been introduced
-  while H was still a "live" hash function in at least one of the two
-  projects.
-
-* FORBIDDEN: Objects do not have their hashes calculated using this
-  hash function.  Attempts to reference an object by such a name
-  fail.  Optionally the user may specify a tolerant mode where:
-  a commit which refers to parents by obsolete names is taken to
-  simply not have those parents; a commit which refers to a tree by
-  an obsolete name is taken to have an empty tree.
+  while the colliding hash function was still a live hash function
+  in at least one of the two projects.
  
-  This is used for two purposes:
  
-    - On a server, we use this to restrict the propagation of
-      new hashes so as to enforce our compatibility intentions.
-      Ie, hashes which we are "not ready for" are forbidden.
+* Hash function enablement:
  
-    - Everywhere, we use this to get rid of old hash functions.
-      It makes access to old history possible but difficult.
+  (a) enabled: this hash function is good and available for use
  
-* FORGOTTEN: Objects do not have their hashes calculated using this
-  hash function.  References to objects by all such names return dummy
-  objects of the right shape: the empty blob; the empty tree; a root
-  commit with an empty tree and dummy metadata.
+  (b) deprecated (in favour of H2): this hash function is
+     available for use, but newly created objects will use another
+     hash function instead (specifically, when creating an object,
+     this hash function is not considered as a candidate; if as a
+     result there are no candidate hash functions, we use the
+     specified replacement H2).  Existing refs referring to objects
+     with this hash, with no ref hint, are treated as having a ref
+     hint specifying H2.  If no H2 is specified, the "best" hash
+     is used.
  
-  This allows us to finally retire a hash function entirely.  We
-  effectively throw away all the history which uses H.
-
-During transfer protocols, the receiver will say which hashes it
-thinks are obsolete or forgotten, and the sender will not follow such
-references when computing the set of objects to send.  So receivers
-will not receive the objects which were named only by obsolete or
-forgotten names.
-
-
-Naming in newly-generated objects, queries, etc.
-
-There is a `default' hash function, which is that which HEAD uses.
-(That is, HEAD refers to an object by some name.  The default hash
-function is that name's hash function.)
-
-git tools produce always output object names in the default hash
-function.  (Including git-hash-object.)
-
-As a consequence, newly generated objects will contain object
-references using the `default' hash function.
-
-When HEAD is empty, there is a separate record of the default hash
-function.  This comes from a configured default in a new tree.  In an
-existing tree, using git checkout --orphan remembers the default hash
-function that HEAD had.
-
-When HEAD is updated to a new commit, the name stored in HEAD uses the
-newer of the previous HEAD hash function and of the hash function used
-in the commit being stored.  ("Newer" is a built-in preference order,
-overrideable by configuration.)
-
-This (together with the `forbidden' state, above) ensures that
-switching a project to use a new hash function is a deliberate
-decision: the default hash function needs to be changed to make the
-first commit with the new hash function.  After that, provided
-the server accepts it, it's infectious.
-
-
-Naming of refs other than HEAD
-
-A ref refers to an object by one of its names.  However, operations
-like git-show-ref convert that name to the default format (see above).
-
-git-gc rewrites ref names to the default format iff that is newer.
+  (c) disabled: existing objects using this hash function can be
+     accessed, but no such objects can be created or received
+     (again, a replacement may be specified).  This is used both
+     initially to prevent unintended upgrade, and later to block the
+     introduction of vulnerable data generated by badly configured
+     clients.
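
The two behaviours above might be modelled as in the Python sketch
below.  Everything named here (HashConfig, pick_creation_hash,
store_object, CollisionError) is hypothetical, the object store is a
plain dict, and the ordering is again simplified to a list from worse
to better.  Treating a disabled hash like a deprecated one when
choosing a hash for a new object is this sketch's reading of "again, a
replacement may be specified".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class HashConfig:
    state: str                         # "enabled", "deprecated" or "disabled"
    replacement: Optional[str] = None  # the H2 of the plan, if configured


class CollisionError(Exception):
    pass


def pick_creation_hash(candidates: List[str],
                       config: Dict[str, HashConfig],
                       order: Sequence[str]) -> str:
    """Choose the hash for a newly created object."""
    usable = [h for h in candidates if config[h].state == "enabled"]
    if usable:
        return max(usable, key=list(order).index)
    # No candidate survived: fall back to a configured replacement ...
    for h in candidates:
        h2 = config[h].replacement
        if h2 is not None and config[h2].state == "enabled":
            return h2
    # ... or, failing that, to the best enabled hash.
    enabled = [h for h in order if config[h].state == "enabled"]
    if not enabled:
        raise ValueError("no enabled hash function")
    return enabled[-1]


def store_object(store: Dict[str, bytes],
                 forensics: List[Tuple[str, bytes]],
                 name: str, content: bytes, mode: str) -> bytes:
    """Add an object, applying collision behaviour 'fail' or 'tolerate'."""
    existing = store.get(name)
    if existing is None:
        store[name] = content
        return content
    if existing == content:
        return existing
    forensics.append((name, content))  # keep the second preimage either way
    if mode == "fail":
        raise CollisionError(f"collision on {name}")
    return existing                    # tolerate: prefer our own data
```

In tolerate mode the existing content always wins, which is the
property the plan relies on when it says existing history cannot be
corrupted by receiving colliding objects.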
  
  
  Remote protocol
  
  During the negotiation, a receiver needs to specify what hashes it
-understands.
-
-When the sender is listing its refs, the names are converted to a
-hash understood by the client if necessary.  If this is not necessary,
-they are left unchanged.
+understands, and whether it is prepared to see only a partial view.
  
-When a receiver is updating refs, it should by follow the sender's
-idea of a hash change iff it's an upgrade (and the new function is
-ENABLED).  That is, if the sender sends name H2 for some ref, and the
-receiver has H1, but these refer to the same object, then the receiver
-should update its own ref name from H1 to H2 iff H2 uses a newer hash
-function.
+When the sender is listing its refs, refs naming objects the receiver
+cannot understand are either elided (if the receiver is content with a
+partial view), or cause an error.
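
As an illustration of the ref-listing rule, here is a small Python
sketch.  The function list_refs and its data shapes are hypothetical;
each ref maps to a (subnamespace, object name) pair, and the
receiver's capabilities are reduced to a set of understood hashes plus
a flag for whether a partial view is acceptable.

```python
from typing import Dict, Set, Tuple

Ref = Tuple[str, str]  # (subnamespace, object name)


def list_refs(refs: Dict[str, Ref],
              understood: Set[str],
              partial_ok: bool) -> Dict[str, Ref]:
    """Return the refs a sender would advertise to this receiver."""
    advertised = {}
    for refname, (ns, name) in refs.items():
        if ns in understood:
            advertised[refname] = (ns, name)
        elif not partial_ok:
            # Receiver wants a complete view but cannot understand this ref.
            raise ValueError(f"{refname} uses {ns}, which the receiver lacks")
        # else: elide the ref silently
    return advertised
```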
  
  
  Equality testing
  
-All software which tests for equality of git objects by checking
-whether their object names are equal needs to obtain a canonical name
-for both objects.
-
-This is going to be quite annoying.
-
-We should provide a convenient utility which tests whether two object
-names refer to the same object.
-
  Note that semantically identical trees may (now) have different tree
-objects because those tree objects might contain different object
-names.  So (in some contexts at least) tree comparison cannot any
-longer be done by comparing names; rather an invocation of git diff is
-needed, or explicit generation of a tree object with the right name.
+objects because those tree objects might use (and be named by)
+different hashes.  So (in some contexts at least) tree comparison
+cannot any longer be done by comparing names; rather an invocation of
+git diff is needed, or explicit generation of a tree object with the
+right hash.
  
  
-Transition plan
+TRANSITION PLAN
+
+(For brevity I will write `SHA' for hashing with SHA-1, using current
+unqualified object names, and `BLAKE' for hashing with BLAKE2b, using
+H<hex> object names.)
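
For concreteness, the two name forms can be told apart as in the
Python sketch below.  The function subnamespace_of is hypothetical,
and restricting names to lowercase hex of any length is an assumption
of this sketch, not something the plan specifies here.

```python
import re

_SHA_NAME = re.compile(r"[0-9a-f]+\Z")     # current unqualified names
_BLAKE_NAME = re.compile(r"H[0-9a-f]+\Z")  # H<hex> names


def subnamespace_of(name: str) -> str:
    """Map a textual object name to 'SHA' or 'BLAKE'."""
    if _BLAKE_NAME.match(name):
        return "BLAKE"
    if _SHA_NAME.match(name):
        return "SHA"
    raise ValueError(f"unrecognised object name: {name!r}")
```

Because `H' is not a hex digit, the two forms cannot be confused, so
existing unqualified names keep their current meaning.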
  
  Y0: Implement all of the above.  Test it.
  
      Default configuration:
-       SHA-1 is ENABLED
-       SHA-512 is FORBIDDEN in bare repos
-       SHA-512 is ENABLED in trees with working trees
-       default HEAD hash is SHA-1
+       SHA is enabled
+       BLAKE is disabled in trees without working trees
+       BLAKE is enabled in trees with working trees
+       SHA > BLAKE
  
      Effects:
  
-    Existing projects will not switch to SHA-512 willy-nilly.
-    New projects will still use SHA-1.
-
-    Incompatible new-style commits cannot be pushed without server
-    admin effort (or until future upgrade).
+    Clients are prepared to process BLAKE data, but it is not
+    generated by default and cannot be pushed to servers.
  
-    So all old git clients still work.
+    All old git clients still work.
  
-Y4: SHA-512 by default for new projects.
+Y4: BLAKE by default for new projects.
      Conversion enabled for existing projects.
-    Old git software is now pretty firmly deprecated.
+    Old git software is going to start rotting.
  
      Default configuration change:
+       BLAKE > SHA
+       BLAKE enabled (even in trees without working trees)
  
-       When creating a new bare tree, a configuration dropping is left
-       (in `config') which specifies that SHA-1 is OBSOLESCENT
-
-       Default status for SHA-512 is FORBIDDEN if SHA-1 is ENABLED,
-       or ENABLED if SHA-1 is OBSOLESCENT.
-
-       default HEAD hash is newest ENABLED hash.
+    Suggested bulk hosting site configuration change:
+       Newly created projects should get BLAKE enabled
+       Existing projects should retain BLAKE disabled by default
+       Button should be provided to start conversion (see below)
  
      Effects:
  
-    When creating a new working tree, it starts using SHA-512.
-    A new server tree will accept SHA-512.
+    When creating a new working tree, it starts using BLAKE.
  
-    Existing server trees do not yet accept SHA-512.  They publish
-    their SHA-1 hashes, so clients make commits with SHA-1.
+    Servers which have been updated will accept BLAKE.
  
-    To convert a project, an administrator would set SHA-1 to
-    OBSOLESCENT on the server.  All clones after that will have HEAD
-    with a SHA-512 name.  Fetches and pulls will update to SHA-512
-    names.
+    Servers which have not been updated to Y4's git will need a small
+    configuration change (enabling BLAKE) to cope with the new
+    projects that are using BLAKE.
  
-will , and push one SHA-512 commit to
-    mainline.
+    To convert a project, an administrator (or project owner) would
+    set BLAKE to enabled, and SHA to deprecated, on the server.  On
+    the next pull the server will provide ref hints naming BLAKE,
+    which will get copied to the user's HEAD.  So the user is infected
+    with BLAKE.
  
+    To convert a project branch-by-branch, the administrator would set
+    BLAKE to enabled but leave SHA enabled.  Then each branch retains
+    its own hash.  A branch can be converted by pushing a BLAKE commit
+    to it, or by setting a ref hint on the server.
  
+Y6: BLAKE by default for all projects
+    Existing projects start being converted infectiously.
+    It is hard for a project to stop this happening if any of
+     their servers are updated.
+    Old git software is firmly stuffed.
  
-    Default configuration change:
+    Default configuration change
+       SHA deprecated in trees without working trees
  
      Effects:
  
-    When creating a new tree with working tree with git init (ie, no
-    HEAD), the default HEAD hash is set to SHA-512 (because SHA-1 is
-    OBSOLESCENT in a new tree and therefore SHA-512 is the only
-    ENABLED hash and is the default).
+    Existing projects are, by default, `converted', as described
+    above.
  
-    Newly minted server trees accept SHA-512.
+Y8: Clients hate SHA
+    Clients insist on trying to convert existing projects
+    It is very hard to stop this happening.
+    Unrepentant servers start being very hard to use.
  
+    Default configuration change
+       SHA deprecated (even in trees without working trees)
  
- start using SHA-512 by default.
+    Effects:
  
-Y6: Existing projects start being converted infectiously.
-    It is hard to stop this happening.
-    Old git software is firmly stuffed.
+    Clients will generate only BLAKE.  Hopefully their server will
+    accept this!
  
-    Default configuration change:
-       SHA-1 is OBSOLESCENT
-       (default for SHA-512, and HEAD hash, computed as in Y4)
+Y10: Stop accepting new SHA
+    No-one can manage to make new SHA commits
  
-    Result is that by default all software 
+    Default configuration change
+       SHA disabled in new trees, except during initial
+          `clone', `mirror' and similar
  
-    (Projects which do not want to convert need to set SHA-1 to
-    ENABLED, explicitly, on their 
-
-Y6: Existing projects start using SHA-512.
+    Effects:
  
-    Default configuration change:
-       SHA-512 is ENABLED
-       SHA-1 is OBSOLESCENT
-       (default default HEAD hash is already SHA-512)
+    Existing SHA history is retained, and copied to new clients and
+    servers.  But established clients and servers reject any newly
+    introduced SHA.
  
-      In existing repositories where no special action 
  
  
  --