1 Subject: Transition plan for git to move to a new hash function
6 We run multiple object name subnamespaces in parallel, one for each
7 hash function. Each object lives in exactly one subnamespace.
8 Objects with identical content in the different object stores, named
9 by different hash functions, are different objects.
11 Objects may refer to objects living in different subnamespaces (ie,
12 named by a different hash function) to their own.
14 Packfiles need to be extended to be able to contain objects named by
15 new hash functions. Blob objects with identical contents but living
16 in different subnamespaces would ideally share storage.
18 Every program that invokes git or speaks git protocols will need to
19 understand the extended object name syntax.
21 Safety catches prevent accidental incorporation into a project of
22 incompatibly-new objects, or additional deprecatedly-old objects.
23 This allows for incremental deployment.
28 The object name textual syntax is extended as follows:
30 We declare that the object name syntax is henceforth
31 [A-Z]+[0-9a-z]+ | [0-9a-f]+
32 and that names [A-Z].* are deprecated as ref name components.
36 Full backwards compatibility is impossible, because the hash
37 function needs to be evident in the name, so the new names
38 must be disjoint from all old SHA-1 names.
40 We want a short but extensible syntax. The syntax should impose
41 minimal extra requirements on existing git users. In most
42 contexts where existing git users use hashes, ASCII alphanumeric
43 object names will fit. Use of punctuation such as : or even _
44 may give trouble to existing users, who are already using
45 such things as delimiters.
47 In existing deployments, refnames that differ only in case are
48 generally avoided (because they are troublesome on
49 case-insensitive filesystems). And conventionally refnames are
50 lower case. So names starting with an upper case letter will be
51 disjoint from most existing ref name components.
53 Even though we probably want to keep using hex, it is a good
54 idea to reserve the flexibility to use a more compact encoding,
55 while not excessively widening the existing permissible
58 Object names using SHA-1 are represented, in text, as at present.
60 Object names starting with uppercase ASCII letters H or later refer to
61 new hash functions. Programs that use `g<objectname>' should ideally
62 be changed to show `H<hash>' for hash function `H' rather than
67 Object names starting with A-F might look like hex. G is
68 reserved because of the way that many programs write
71 This gives us 19 new hash function values until we have to
72 start using two-letter hash function prefixes, or decide to
75 (Truncated object names work as they do at the moment.)
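As an illustration of the syntax rules above, here is a minimal Python sketch of a name classifier. The function name and the choice to reject every prefix starting with A-G are assumptions of this sketch, not part of the proposal; it only encodes the grammar and the "H or later" rule as stated.

```python
import re

# Textual syntax from the proposal: [A-Z]+[0-9a-z]+ | [0-9a-f]+
_PREFIXED = re.compile(r"([A-Z]+)([0-9a-z]+)")
_LEGACY = re.compile(r"[0-9a-f]+")


def classify_object_name(name):
    """Return (hash_prefix, body); hash_prefix is None for SHA-1 names.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    if _LEGACY.fullmatch(name):
        return (None, name)  # unprefixed hex: SHA-1, represented as at present
    m = _PREFIXED.fullmatch(name)
    if m is None:
        raise ValueError(f"not an object name: {name!r}")
    prefix, body = m.groups()
    if prefix[0] < "H":
        # A-F might look like hex and G is reserved, so this sketch
        # refuses any prefix that starts with A-G.
        raise ValueError(f"prefix {prefix!r} does not name a new hash function")
    return (prefix, body)
```

For example, classify_object_name("Hdeadbeef") gives ("H", "deadbeef"), while an unprefixed name such as "e69de29" is still treated as SHA-1.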
77 Initially we define and assign one new hash function (and textual
78 object name encoding):
80 H<hex> where <hex> is the BLAKE2b hash of the object
83 We also reserve the following syntax for private experiments:
85 We declare that public releases of git will never accept such
88 Everywhere in the git object formats and git protocols, a new object
89 name (with hash function indicator) is permitted where an old object
92 A single object may refer to other objects by its own hash function, or
93 by other hash functions. Ie, object references cross subnamespaces.
94 During all git operations, subnamespace boundaries in the object graph
97 Two additional restrictions: a tree object may be referenced only by
98 objects in the same subnamespace; and a tree object may reference
99 only blobs in its own subnamespace.
101 In binary protocols, where a SHA-1 object name in binary form was
102 previously used, a new codepoint must be allocated in a containing
103 structure (eg a new typecode). Usually, the new-format binary object
104 will have a new typecode and also an additional name hash indicator,
105 and it will also need a length field (as new hashes may be of
108 Whenever a new hash function textual syntax is defined, corresponding
109 binary format codepoint(s) are assigned. (Implementation details such
110 as the binary format specification are outside the scope of this
116 Hash functions are partially ordered, from `worse' to `better'.
117 The ordering is configurable. For details of the defaults,
118 see _Transition Plan_.
121 CHOICE OF SUBNAMESPACE
123 Whenever objects are created, it is necessary to choose the
124 subnamespace to use (ie, the hash function).
126 Each ref may also have a subnamespace hint associated with it.
131 A commit is made (by default) as new as the newest of
132 (i) each of its parents
133 (ii) if applicable, the subnamespace hint for the ref to which the
134 new commit is to be written
136 Implicitly this normally means that if HEAD refers to a new commit,
137 further new commits will be generated on top of it.
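The default rule for commits can be sketched in Python as follows. This is only an illustration: the proposal makes the ordering partial and configurable, whereas the sketch assumes a simple total order, and the names ORDER and commit_subnamespace are hypothetical.

```python
# Hypothetical ordering, oldest ("worse") first.  The proposal's ordering is
# partial and configurable; a total order keeps this sketch short.
ORDER = ["sha1", "blake2b"]


def commit_subnamespace(parent_namespaces, ref_hint=None, order=ORDER):
    """Pick the subnamespace for a new commit: as new as the newest of
    (i) each parent and (ii) the ref's subnamespace hint, if any."""
    candidates = list(parent_namespaces)
    if ref_hint is not None:
        candidates.append(ref_hint)
    if not candidates:
        # An origin commit has no parents, so a hint has to supply the answer.
        raise ValueError("origin commit needs a subnamespace hint")
    return max(candidates, key=order.index)
```

So a commit on top of a SHA-1 parent stays SHA-1 unless the ref carries a newer hint, and any BLAKE2b parent pulls the new commit into the BLAKE2b subnamespace.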
139 The subnamespace of an origin commit is controlled by the hint left in
140 .git by git checkout --orphan or git init.
142 At boundaries between old and new history, new commit(s) will refer to
148 A tag is created (by default) in the same subnamespace as the object
154 Trees are only referenced by objects in their own subnamespace.
156 To satisfy this rule, occasionally a tree object from one subnamespace
157 must be recursively rewritten into another subnamespace.
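A rough Python sketch of such a recursive rewrite, using a toy in-memory object store. Everything here is an assumption made for illustration: the store layout, the toy tree serialisation, the helper names put and rewrite_tree, and the use of a full BLAKE2b digest behind an `H' prefix. Commits referenced from trees are not modelled.

```python
import hashlib

# Subnamespace -> (textual prefix, hash constructor).  Illustrative only.
HASHERS = {"sha1": ("", hashlib.sha1), "blake2b": ("H", hashlib.blake2b)}


def put(store, ns, kind, payload):
    """Store an object in subnamespace ns and return its name."""
    prefix, hasher = HASHERS[ns]
    if kind == "tree":
        # Toy serialisation: sorted "entry-name object-name" lines.
        data = "\n".join(f"{k} {v}" for k, v in sorted(payload.items())).encode()
    else:
        data = payload
    name = prefix + hasher(kind.encode() + b"\0" + data).hexdigest()
    store[name] = (kind, payload)
    return name


def rewrite_tree(store, tree_name, target_ns):
    """Recursively re-create a tree, and everything below it, in target_ns."""
    kind, entries = store[tree_name]
    if kind != "tree":
        raise ValueError(f"{tree_name} is not a tree")
    new_entries = {}
    for entry_name, child in entries.items():
        child_kind, child_payload = store[child]
        if child_kind == "tree":
            new_entries[entry_name] = rewrite_tree(store, child, target_ns)
        else:
            # Blobs must live in the same subnamespace as the tree.
            new_entries[entry_name] = put(store, target_ns, "blob", child_payload)
    return put(store, target_ns, "tree", new_entries)
```

The original objects are left in place; the rewrite only adds new objects in the target subnamespace.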
159 When a tree refers to a commit, it may refer to one in a different
162 Rationale: we want to avoid new commits and tags relying on weak
163 hashes. But we must avoid demanding that commits be rewritten.
168 Blobs are normally referred to by trees. Trees always refer to blobs
169 in the same subnamespace.
171 Where a blob is created in other circumstances, the caller should
172 specify the subnamespace.
177 As noted above, each ref may also have a subnamespace hint associated
180 The subnamespace hint is (by default) copied, when the ref value is
181 copied. So for example if `git checkout foo' makes refs/heads/foo out
182 of refs/remotes/origin/foo, it will copy the subnamespace hint (or
183 lack of one) from refs/remotes/origin/foo.
185 Likewise, the subnamespace hint is conveyed by `git fetch' (by
186 default) and can be updated with `git push' (though this is not done
189 The ref subnamespace hint may be set explicitly. That is how an
190 individual branch is upgraded. git checkout --orphan sets it to the
191 subnamespace (or hint) of the previous HEAD.
193 When a commit is made and stored in a ref, the subnamespace hint for
194 that ref is removed iff the commit's subnamespace and the hint's
195 subnamespace are the same.
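The hint-removal rule is small enough to state as a Python function. The function name and the use of None for "no hint" are conventions of this sketch only.

```python
def hint_after_commit(commit_namespace, ref_hint):
    """Return the ref's subnamespace hint after a commit is stored in it.

    The hint is removed (None) iff it names the commit's own subnamespace;
    otherwise it is left untouched.
    """
    if ref_hint is not None and ref_hint == commit_namespace:
        return None
    return ref_hint
```

So once a branch hinted at BLAKE2b actually receives a BLAKE2b commit, the hint has done its job and disappears.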
198 OBJECT STORE BEHAVIOUR
200 The object store has configuration to specify which hash functions are
201 enabled. Each hash function H has a combination of the following
202 behaviours, according to configuration:
204 * Collision behaviour:
206 What to do if we encounter an object we already have (eg as part of
207 a pack, or with hash-object) but with different contents.
209 (a) fail: print a scary message and abort operation (on the
210 basis that the source of the colliding object probably intended
211 the preimage that they provided, or is conducting an attack).
213 (b) tolerate: prefer our own data; print a message, but treat
214 the reference as referring to our version of the object.
216 In both cases we keep a copy of the second preimage in our .git, for
219 This is used as part of a gradual desupport strategy. Existing
220 history in all existing object stores is safe and cannot be
221 corrupted or modified by receiving colliding objects.
223 New trees which receive their initial data from a trustworthy sender
224 over a trustworthy channel will receive correct data. Bad object
225 stores or untrustworthy channels could exploit collisions, but not
226 in new regions of the history which are presumably using new names.
227 So the collisions can only affect archaeology.
229 Merging previously-unrelated histories does introduce a collision
230 hazard, but the collision would have had to have been introduced
231 while the colliding hash function was still a live hash function
232 in at least one of the two projects.
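The two collision behaviours might look like this in a toy in-memory store. The class, its method names and the quarantine list are hypothetical; the caller supplies the object name directly, since the sketch cannot manufacture a real hash collision.

```python
class CollisionError(Exception):
    """Raised under the 'fail' policy when a colliding object arrives."""


class ObjectStore:
    def __init__(self, policy="fail"):
        if policy not in ("fail", "tolerate"):
            raise ValueError(f"unknown collision policy: {policy!r}")
        self.policy = policy
        self.objects = {}      # name -> contents we already hold
        self.quarantine = []   # (name, second preimage), kept in both cases

    def receive(self, name, contents):
        """Store contents under name; return the contents now in force."""
        ours = self.objects.setdefault(name, contents)
        if ours == contents:
            return ours
        # Same name, different contents: keep a copy of the second preimage.
        self.quarantine.append((name, contents))
        if self.policy == "fail":
            raise CollisionError(f"collision on object {name}")
        # tolerate: prefer our own data, but say so.
        print(f"warning: collision on object {name}; keeping our version")
        return ours
```

Under either policy the object already in the store is never overwritten, which is what keeps existing history safe.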
235 * Hash function enablement:
237 (a) enabled: this hash function is good and available for use
239 (b) deprecated (in favour of H2): this hash function is
240 available for use, but newly created objects will use another
241 hash function instead (specifically, when creating an object,
242 this hash function is not considered as a candidate; if as a
243 result there are no candidate hash functions, we use the
244 specified replacement H2). Existing refs referring to objects
245 with this hash, with no ref hint, are treated as having a ref
246 hint specifying H2. If no H2 is specified, the newest hash
249 (c) disabled: existing objects using this hash function can be
250 accessed, but no such objects can be created or received
251 (again, a replacement may be specified). This is used both
252 initially to prevent unintended upgrade, and later to block the
253 introduction of vulnerable data generated by badly configured
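A Python sketch of how the three enablement states could drive object creation and reception. The configuration shape (name mapped to a state and an optional replacement), the function names, and picking the last usable candidate are assumptions of this sketch; the fallback for a missing replacement is left out.

```python
ENABLED, DEPRECATED, DISABLED = "enabled", "deprecated", "disabled"


def creation_hash(candidates, config):
    """Pick the hash function for a newly created object.

    candidates: hash functions suggested by context, oldest first.
    config: hash name -> (state, replacement hash name or None).
    """
    # Deprecated and disabled hash functions are not candidates for creation.
    usable = [h for h in candidates if config[h][0] == ENABLED]
    if usable:
        return usable[-1]
    # No candidates left: fall back to a configured replacement (H2).
    for h in reversed(candidates):
        replacement = config[h][1]
        if replacement is not None:
            return replacement
    raise LookupError("no enabled hash function and no replacement configured")


def may_receive(hash_name, config):
    """Disabled hash functions can be read locally but not received."""
    return config[hash_name][0] != DISABLED
```

With SHA-1 marked deprecated in favour of BLAKE2b, creation_hash(["sha1"], config) returns "blake2b" even though every existing parent is SHA-1.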
259 During the negotiation, a receiver needs to specify what hashes it
260 understands, and whether it is prepared to see only a partial view.
262 When the sender is listing its refs, refs naming objects the receiver
263 cannot understand are either elided (if the receiver is content with a
264 partial view), or cause an error.
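The sender's side of this rule, as a small Python sketch. The function name and the dictionary representation of the ref list are illustrative assumptions; no wire format is implied.

```python
def refs_to_advertise(refs, understood, partial_ok):
    """Filter the sender's ref list for one receiver.

    refs: ref name -> hash function that names the ref's object.
    understood: set of hash functions the receiver says it understands.
    partial_ok: whether the receiver accepts a partial view.
    """
    visible = {}
    for ref, hash_name in refs.items():
        if hash_name in understood:
            visible[ref] = hash_name
        elif not partial_ok:
            raise ValueError(
                f"{ref} is named by {hash_name}, which the receiver cannot understand"
            )
        # Otherwise the ref is elided from the receiver's view.
    return visible
```

An old SHA-1-only receiver that accepts a partial view simply does not see the BLAKE2b branches; one that insists on a complete view gets an error instead.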
269 Note that semantically identical trees may (now) have different tree
270 objects because those tree objects might use (and be named by)
271 different hashes. So (in some contexts at least) tree comparison
272 cannot any longer be done by comparing names; rather an invocation of
273 git diff is needed, or explicit generation of a tree object with the
279 (For brevity I will write `SHA' for hashing with SHA-1, using current
280 unqualified object names, and `BLAKE' for hashing with BLAKE2b, using
281 H<hex> object names.)
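To make the two spellings concrete, here is a Python sketch that names the same blob both ways. The SHA side follows what git does today ("<type> <size>", a NUL byte, then the contents). The BLAKE side is an assumption: this message does not pin down the preimage or digest length, so the sketch reuses git's preimage and the full 64-byte BLAKE2b digest.

```python
import hashlib


def object_preimage(kind, contents):
    # Current git hashes "<type> <size>\0<contents>".  Reusing the same
    # preimage for BLAKE2b is an assumption of this sketch.
    return f"{kind} {len(contents)}".encode() + b"\0" + contents


def sha_name(kind, contents):
    """Unqualified object name, as git produces at present."""
    return hashlib.sha1(object_preimage(kind, contents)).hexdigest()


def blake_name(kind, contents):
    """H<hex> name; the full-length BLAKE2b digest is assumed here."""
    return "H" + hashlib.blake2b(object_preimage(kind, contents)).hexdigest()
```

The same contents therefore get two unrelated names, one per subnamespace, and the two resulting objects are different objects.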
283 Y0: Implement all of the above. Test it.
285 Default configuration:
287 BLAKE is disabled in trees without working trees
288 BLAKE is enabled in trees with working trees
293 Clients are prepared to process BLAKE data, but it is not
294 generated by default and cannot be pushed to servers.
296 All old git clients still work.
298 Y4: BLAKE by default for new projects.
299 Conversion enabled for existing projects.
300 Old git software is going to start rotting.
302 Default configuration change:
304 BLAKE enabled (even in trees without working trees)
306 Suggested bulk hosting site configuration change:
307 Newly created projects should get BLAKE enabled
308 Existing projects should retain BLAKE disabled by default
309 Button should be provided to start conversion (see below)
313 When creating a new working tree, it starts using BLAKE.
315 Servers which have been updated will accept BLAKE.
317 Servers which have not been updated to Y4's git will need a small
318 configuration change (enabling BLAKE) to cope with the new
319 projects that are using BLAKE.
321 To convert a project, an administrator (or project owner) would
322 set BLAKE to enabled, and SHA to deprecated, on the server. On
323 the next pull the server will provide ref hints naming BLAKE,
324 which will get copied to the user's HEAD. So the user is infected
327 To convert a project branch-by-branch, the administrator would set
328 BLAKE to enabled but leave SHA enabled. Then each branch retains
329 its own hash. A branch can be converted by pushing a BLAKE commit
330 to it, or by setting a ref hint on the server.
332 Y6: BLAKE by default for all projects
333 Existing projects start being converted infectiously.
334 It is hard for a project to stop this happening if any of
335 their servers are updated.
336 Old git software is firmly stuffed.
338 Default configuration change
339 SHA deprecated in trees without working trees
343 Existing projects are, by default, `converted', as described
347 Clients insist on trying to convert existing projects
348 It is very hard to stop this happening.
349 Unrepentant servers start being very hard to use.
351 Default configuration change
352 SHA deprecated (even in trees without working trees)
356 Clients will generate only BLAKE. Hopefully their server will
359 Y10: Stop accepting new SHA
360 No-one can manage to make new SHA commits
362 Default configuration change
363 SHA disabled in new trees, except during initial
364 `clone', `mirror' and similar
368 Existing SHA history is retained, and copied to new clients and
369 servers. But established clients and servers reject any newly
375 Ian Jackson <ijackson@chiark.greenend.org.uk> These opinions are my own.