Trees
-Trees are always referenced by objects in their own subnamespace.
+Trees are only referenced by objects in their own subnamespace.
-Occasionally, a tree object from one subnamespace must be recursively
-rewritten into another subnamespace.
+To satisfy this rule, occasionally a tree object from one subnamespace
+must be recursively rewritten into another subnamespace.
When a tree refers to a commit, it may refer to one in a different
subnamespace.
Rationale: we want to avoid new commits and tags relying on weak
- hashes.
+ hashes. But we must avoid demanding that commits be rewritten.
Blobs
specify the subnamespace.
+Ref hints
+As noted above, each ref may also have a subnamespace hint associated
+with it.
+The subnamespace hint is (by default) copied, when the ref value is
+copied.  So for example, if `git checkout foo' makes refs/heads/foo out
+of refs/remotes/origin/foo, it will copy the subnamespace hint (or
+lack of one) from refs/remotes/origin/foo.
-Object store:
+Likewise, the subnamespace hint is conveyed by `git fetch' (by
+default) and can be updated with `git push' (though this is not done
+by default).
-The object store knows which hash functions are enabled. Each hash
-function H has one of the following statuses, which are configured by
-the user:
+The ref subnamespace hint may be set explicitly. That is how an
+individual branch is upgraded. git checkout --orphan sets it to the
+subnamespace (or hint) of the previous HEAD.
-* ENABLED:
+When a commit is made and stored in a ref, the subnamespace hint for
+that ref is removed iff the commit's subnamespace and the hint's
+subnamespace are the same.
- As far as the user is concerned every object in the object store is
- accessible using H. Objects which use H names can be received and
- stored.
- This is actually two states, depending on whether any objects exist
- in the store which use these names. If no such objects exist yet,
- we say that the hash function is `ENABLED PROSPECTIVE'. The H names
- for the objects have not yet been calculated.
+OBJECT STORE BEHAVIOUR
- When the first object which names another object using H is received
- (or, on demand), the object store calculates the H names for all
- existing objects and notes that this hash function is now
- `ENABLED PRESENT'.
+The object store has configuration to specify which hash functions are
+enabled. Each hash function H has a combination of the following
+behaviours, according to configuration:
- If a hash collision is detected, we crash immediately.
+* Collision behaviour:
-* OBSOLESCENT: Every object in the object store has its hash
- calculated using H. However, H is known to possibly have collisions
- which we try to tolerate. When a collision occurs, the object text
- which is currently in the object store is preferred and the "new"
- object is thrown away.
+ What to do if we encounter an object we already have (eg as part of
+ a pack, or with hash-object) but with different contents.
- Local creation of new objects with references using H is
- discouraged. Specifically, if another hash function is ENABLED, we
- will use that instead.
+ (a) fail: print a scary message and abort operation (on the
+ basis that the source of the colliding object probably intended
+ the preimage that they provided, or is conducting an attack).
- This is used as part of a gradual desupport strategy. When the hash
- function is in this stage, existing history in all existing object
- stores is safe and cannot be corrupted or modified by receiving
- colliding objects.
+ (b) tolerate: prefer our own data; print a message, but treat
+ the reference as referring to our version of the object.
- New object stores which receive their data from a trustworthy sender
+ In both cases we keep a copy of the second preimage in our .git, for
+ forensic purposes.
+
+  Option (b) is used as part of a gradual desupport strategy.  Existing
+ history in all existing object stores is safe and cannot be
+ corrupted or modified by receiving colliding objects.
+
+ New trees which receive their initial data from a trustworthy sender
over a trustworthy channel will receive correct data. Bad object
stores or untrustworthy channels could exploit collisions, but not
in new regions of the history which are presumably using new names.
Merging previously-unrelated histories does introduce a collision
hazard, but the collision would have had to have been introduced
- while H was still a "live" hash function in at least one of the two
- projects.
-
-* FORBIDDEN: Objects do not have their hashes calculated using this
- hash function. Attempts to reference an object by such a name
- fail. Optionally the user may specify a tolerant mode where:
- a commit which refers to parents by obsolete names is taken to
- simply not have those parents; a commit which refers to a tree by
- an obsolete name is taken to have an empty tree.
-
- This is used for two purposes:
-
- - On a server, we use this to restrict the propagation of
- new hashes so as to enforce our compatibility intentions.
- Ie, hashes which we are "not ready for" are forbidden.
-
- - Everywhere, we use this to get rid of old hash functions.
- It makes access to old history possible but difficult.
-
-* FORGOTTEN: Objects do not have their hashes calculated using this
- hash function. References to objects by all such names return dummy
- objects of the right shape: the empty blob; the empty tree; a root
- commit with an empty tree and dummy metadata.
-
- This allows us to finally retire a hash function entirely. We
- effectively throw away all the history which uses H.
+ while the colliding hash function was still a live hash function
+ in at least one of the two projects.
+
+
+* Hash function enablement:
+
+ (a) enabled: this hash function is good and available for use
+
+ (b) deprecated (in favour of H2): this hash function is
+ available for use, but newly created objects will use another
+ hash function instead (specifically, when creating an object,
+      this hash function is not considered as a candidate; if as a
+ result there are no candidate hash functions, we use the
+ specified replacement H2). Existing refs referring to objects
+ with this hash, with no ref hint, are treated as having a ref
+      hint specifying H2.  If no H2 is specified, the newest
+      ("best") hash is used.
+
+ (c) disabled: existing objects using this hash function can be
+      accessed, but no such objects can be created or received
+      (again, a replacement may be specified).  This is used both
+ initially to prevent unintended upgrade, and later to block the
+ introduction of vulnerable data generated by badly configured
+ clients.
+
+ (d) forgotten: such objects are not stored. References to such
+ objects return dummy objects of the right shape: the empty blob;
+ the empty tree; a root commit with an empty tree and dummy
+ metadata. This allows us to finally retire a hash function
+ entirely. We effectively throw away all the history which uses
+ this hash function.
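
The enablement rules above can be illustrated with a minimal sketch.
It is not part of the proposal: the names Enablement, HashConfig,
may_receive and choose_hash are hypothetical, the numeric `rank' stands
in for the unspecified "newest/best" ordering, and picking the
highest-ranked hash when several candidates remain is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Enablement(Enum):
    ENABLED = "enabled"
    DEPRECATED = "deprecated"
    DISABLED = "disabled"
    FORGOTTEN = "forgotten"


@dataclass(frozen=True)
class HashConfig:
    name: str
    enablement: Enablement
    replacement: Optional[str] = None  # the "H2" of the text, if configured
    rank: int = 0                      # assumed preference order; higher is newer


def may_receive(cfg: HashConfig) -> bool:
    """Objects named by a disabled or forgotten hash are not accepted."""
    return cfg.enablement in (Enablement.ENABLED, Enablement.DEPRECATED)


def choose_hash(candidates: List[str], configs: Dict[str, HashConfig]) -> str:
    """Pick the hash function for a newly created object.

    `candidates` are the hashes the context would otherwise suggest
    (for example the hash of the ref being committed to).
    """
    # Only fully enabled hashes stay in the candidate set.
    usable = [h for h in candidates
              if configs[h].enablement is Enablement.ENABLED]
    if usable:
        # Assumption: prefer the newest remaining candidate.
        return max(usable, key=lambda h: configs[h].rank)

    # No candidates left: fall back to a configured replacement (H2).
    for h in candidates:
        if configs[h].replacement is not None:
            return configs[h].replacement

    # No H2 configured: use the newest enabled hash overall.
    enabled = [c for c in configs.values()
               if c.enablement is Enablement.ENABLED]
    if not enabled:
        raise ValueError("no enabled hash function available")
    return max(enabled, key=lambda c: c.rank).name
```

For instance, with SHA-1 deprecated in favour of SHA-512, calling
choose_hash(["sha1"], configs) yields "sha512", while objects named by
SHA-1 can still be received.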
During transfer protocols, the receiver will say which hashes it
-thinks are obsolete or forgotten, and the sender will not follow such
-references when computing the set of objects to send. So receivers
-will not receive the objects which were named only by obsolete or
-forgotten names.
-
-
-Naming in newly-generated objects, queries, etc.
-
-There is a `default' hash function, which is that which HEAD uses.
-(That is, HEAD refers to an object by some name. The default hash
-function is that name's hash function.)
-
-git tools produce always output object names in the default hash
-function. (Including git-hash-object.)
-
-As a consequence, newly generated objects will contain object
-references using the `default' hash function.
-
-When HEAD is empty, there is a separate record of the default hash
-function. This comes from a configured default in a new tree. In an
-existing tree, using git checkout --orphan remembers the default hash
-function that HEAD had.
-
-When HEAD is updated to a new commit, the name stored in HEAD uses the
-newer of the previous HEAD hash function and of the hash function used
-in the commit being stored. ("Newer" is a built-in preference order,
-overrideable by configuration.)
-
-This (together with the `forbidden' state, above) ensures that
-switching a project to use a new hash function is a deliberate
-decision: the default hash function needs to be changed to make the
-first commit with the new hash function. After that, provided
-the server accepts it, it's infectious.
-
-
-Naming of refs other than HEAD
-
-A ref refers to an object by one of its names. However, operations
-like git-show-ref convert that name to the default format (see above).
-
-git-gc rewrites ref names to the default format iff that is newer.
+thinks are forgotten, and the sender will not follow such references
+when computing the set of objects to send. So receivers will not
+receive the forgotten objects.
Remote protocol
During the negotation, a receiver needs to specify what hashes it
-understands.
-
-When the sender is listing its refs, the names are converted to a
-hash understood by the client if necessary. If this is not necessary,
-they are left unchanged.
+understands, and whether it is prepared to see only a partial view.
-When a receiver is updating refs, it should by follow the sender's
-idea of a hash change iff it's an upgrade (and the new function is
-ENABLED). That is, if the sender sends name H2 for some ref, and the
-receiver has H1, but these refer to the same object, then the receiver
-should update its own ref name from H1 to H2 iff H2 uses a newer hash
-function.
+When the sender is listing its refs, refs naming objects the receiver
+cannot understand are either elided (if the receiver is content with a
+partial view), or cause an error.
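
The ref-listing rule can be sketched as follows.  This is only an
illustration under stated assumptions: the function name
list_refs_for_receiver, the representation of a ref value as a
(hash function, object name) pair, and the use of ValueError for the
error case are all hypothetical and say nothing about the wire format.

```python
from typing import Dict, Set, Tuple

# Assumed representation: a ref's value is (hash function name, object name).
RefValue = Tuple[str, str]


def list_refs_for_receiver(refs: Dict[str, RefValue],
                           understood: Set[str],
                           partial_ok: bool) -> Dict[str, RefValue]:
    """Return the refs the sender advertises to this receiver.

    Refs naming objects by a hash the receiver does not understand are
    elided if the receiver accepts a partial view; otherwise the
    listing fails.
    """
    visible: Dict[str, RefValue] = {}
    for ref, (hash_name, object_name) in refs.items():
        if hash_name in understood:
            visible[ref] = (hash_name, object_name)
        elif not partial_ok:
            raise ValueError(
                f"{ref} uses {hash_name}, which the receiver does not understand")
        # else: silently elide the ref from the partial view
    return visible
```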
Equality testing
-All software which tests for equality of git objects by checking
-whether their object names are equal needs to obtain a canonical name
-for both objects.
-
-This is going to be quite annoying.
-
-We should provide a convenient utility which tests whether two object
-names refer to the same object.
-
Note that semantically identical trees may (now) have different tree
-objects because those tree objects might contain different object
-names. So (in some contexts at least) tree comparison cannot any
-longer be done by comparing names; rather an invocation of git diff is
-needed, or explicit generation of a tree object with the right name.
+objects because those tree objects might use (and be named by)
+different hashes. So (in some contexts at least) tree comparison
+cannot any longer be done by comparing names; rather an invocation of
+git diff is needed, or explicit generation of a tree object with the
+right hash.
Transition plan
Y0: Implement all of the above. Test it.
Default configuration:
- SHA-1 is ENABLED
- SHA-512 is FORBIDDEN in bare repos
- SHA-512 is ENABLED in trees with working trees
- default HEAD hash is SHA-1
+ SHA-1 is enabled
+ SHA-512 is disabled in trees without working trees
+ SHA-512 is enabled in trees with working trees
Effects: