From 266b32bc46dbfab5e449ff3be5ae5b8c51d8c5d8 Mon Sep 17 00:00:00 2001
From: Ian Jackson
Date: Fri, 24 Feb 2017 18:47:32 +0000
Subject: [PATCH] wip new/old objs

---
 plan.txt | 151 ++++++++++++++++++++-----------------------------------
 1 file changed, 55 insertions(+), 96 deletions(-)

diff --git a/plan.txt b/plan.txt
index 612a8b2..b75537a 100644
--- a/plan.txt
+++ b/plan.txt
@@ -1,20 +1,26 @@
-From: Ian Jackson
-To: ijackson@chiark.greenend.org.uk
 Subject: Transition plan for git to move to a new hash function
-Date: Thu, 20 Oct 2016 19:26:44 +0100

-Basic principle: Every object will have two (or more) names,
-corresponding to different hash functions. It may be named by any of
-its names, in every context.
+BASIC PRINCIPLE
+
+We run multiple object name subnamespaces in parallel, one for each
+hash function. Each object lives in exactly one subnamespace.
+Objects with identical content in the different object stores, named
+by different hash functions, are different objects.
+
+Objects may refer to objects living in different subnamespaces (ie,
+named by a different hash function) to their own.
+
+Packfiles need to be extended to be able to contain objects named by
+new hash functions. Blob objects with identical contents but living
+in different subnamespaces would ideally share storage.

 Every program that invokes git or speaks git protocols will need to
-understand the extended object name syntax, and understand that
-objects have multiple names.
+understand the extended object name syntax.

 Safety catches preferent accidental incorporation into a project of
-objects which contain references by incompatibly-new or
-deprecatedly-old names. This allows for incremental deployment.
+incompatibly-new objects, or additional deprecatedly-old objects.
+This allows for incremental deployment.


 TEXTUAL SYNTAX
@@ -83,24 +89,26 @@ Everywhere in the git object formats and git protocols, a new object
 name (with hash function indicator) is permitted where an old object
 name is permitted.

-A single object refers to all the objects it references by the same
-hash function; in general this might be a different hash function to
-the hash function by which this particular object was itself
-referenced or obtained.
+A single object may refer to other objects by its own hash function, or
+by other hash functions. Ie, object references cross subnamespaces.
+During all git operations, subnamespace boundaries in the object graph
+are traversed freely.

-As a further restriction, it is forbidden to refer to a tree object by
-a name other than the hash function it uses to name its subtrees. If
-this seems necessary, the tree object must be recursively rewritten
-instead to use the desired object name.
+Two additional restrictions: a tree object may be referenced only by
+objects in the same subnamespace; and, a tree object may reference
+blobs in its own subnamespace.

 In binary protocols, where a SHA-1 object name in binary form was
 previously used, a new codepoint must be allocated in a containing
 structure (eg a new typecode). Usually, the new-format binary object
-will have a new typecode and also an additional name hash indicator.
+will have a new typecode and also an additional name hash indicator,
+and it will also need a length field (as new hashes may be of
+different lengths).

 Whenever a new hash function textual syntax is defined, corresponding
-binary format codepoint(s) are assigned. (Detailed binary format
-specification is outside the scope of this plan.)
+binary format codepoint(s) are assigned. (Implementation details such
+as the binary format specification are outside the scope of this
+transition plan.)


 ORDERING
@@ -112,58 +120,46 @@ functions defined here, is the obvious ordering
 SHA1 ([0-9a-f]*) < BLAKE2b (H*)


-CHOICE OF OBJECT NAMES
-
-Whenever objects are named, it is possible to refer to them by old or
-new names. So git must make a choice, each time: when new objects
-are created; when refs are updated; and when refs are reported over
-network protocols to other instances of git.
+CHOICE OF SUBNAMESPACE

-Although strictly speaking all objects have both old names and new
-names, and there may be more than two hash functions, it is possible
-to speak, somewhat loosely, about `new objects'.
+Whenever objects are created, it is necessary to choose the
+subnamespace to use (ie, the hash function).

-A `new' object is one which refers to other objects by a `new' name.
-(whatever `new' means).
-
-We call these different hashes `namings'. That is, a `naming' is a
-hash function implemented by git. The `naming IN an object' is the
-naming by which the object refers to other objects (and may not exist,
-if the object has no references); the `name OF an object' is the name
-by which the object itself is specified.
+Each ref may also have a subnamespace hint associated with it.


 Commits

-A non-origin commit is made (by default) as new as the newest of
- (i) the naming in each of its parents
- (ii) the specified name of each of its parents
-(Implicitly this normally means that if HEAD uses a new name, new
-commits will be generated.)
+A commit is made (by default) as new as the newest of
+ (i) each of its parents
+ (ii) if applicable, the subnamespace hint for the ref to which the
+ new commit is to be written
+
+Implicitly this normally means that if HEAD refers to a new commit,
+further new commits will be generated on top of it.

-The naming of an origin commit is controlled by a dropping left in
+The subnamespace of an origin commit is controlled by the hint left in
 .git by git checkout --orphan or git init.

-At boundaries between old and new history, a new commit will refer to
-old parents by those old parents' new names.
+At boundaries between old and new history, new commit(s) will refer to
+old parent(s).


 Tags

-A new tag is made to use newest naming, for its tagged object, of
- (i) the name by which the tagged object was specified
- (ii) the naming in the tagged object (if applicable)
+A tag is created (by default) in the same subnamespace as the object
+to which it refers.


 Trees

-Commits (and sometimes, tags) can refer to tree objects; that tree
-will contain the same naming as the referring object.
+Trees are always referenced by objects in their own subnamespace.

-That is, it is a bug to refer to a tree object by other than the hash
-it uses internally to refer to subtrees (and gitlinks). This will
-mean that a tree must sometimes be rewritten (ie, new object names
-recalculated recursively).
+Occasionally, a tree object from one subnamespace must be recursively
+rewritten into another subnamespace.
+
+When a tree refers to a commit, it may refer to one in a different
+subnamespace.

 Rationale: we want to avoid new commits and tags relying on weak
 hashes.
@@ -171,49 +167,12 @@ recalculated recursively).

 Blobs

-Blobs do not refer to other objects so they are neither new or old.
-
-
-Name of newly created object
-
-When git creates a new object, it reports the new object name using
-the naming in the object.
-
-For blobs and empty trees, the caller should normally specify. The
-default is the naming used for HEAD.
-
-
-Updating refs
-
-If a ref is updated with a new object, the name from its creation is
-used (see above).
-
-If a ref is updated to a specified object, the naming used in the ref
-is the newer of the specified name, or the naming in the object (if
-any).
-
-
-
-
-
-), or with a specified object name.
-
-
-
-(If there are different equally new names, one of the newest names is
-chosen according to some stable rule.)
-
-
-
-new
-
-commit. (This may mean converting the tree in hand, since trees are
-supposed to be homgeonous.)
-
-
+Blobs are normally referred to by trees. Trees always refer to blobs
+in the same subnamespace.

+Where a blob is created in other circumstances, the caller should
+specify the subnamespace.
-A `new commit' is one which refers to objects by
--
2.30.2