From: Ian Jackson <ijackson@chiark.greenend.org.uk>
To: ijackson@chiark.greenend.org.uk
Subject: Transition plan for git to move to a new hash function
Date: Thu, 20 Oct 2016 19:26:44 +0100

Basic principle: Every object will have two (or more) names,
corresponding to different hash functions. It may be named by any of
its names, in every context.

Every program that invokes git or speaks git protocols will need to
understand the extended object name syntax, and understand that
objects have multiple names.

Safety catches prevent accidental incorporation into a project of
objects which contain references by incompatibly-new or
deprecatedly-old names. This allows for incremental deployment.

The object name textual syntax is extended as follows:

We declare that the object name syntax is henceforth
    [A-Z]+[0-9a-z]+ | [0-9a-f]+
and that names [A-Z].* are deprecated as ref name components.
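
To make the syntax above concrete, here is a minimal sketch of a
parser for it. The function name parse_object_name and the
convention of returning an empty prefix for old SHA-1 names are
assumptions made for illustration; they are not part of the plan.

```python
import re

# The two alternatives of the declared syntax:
#   [A-Z]+[0-9a-z]+   new-style name: hash function prefix, then digest text
#   [0-9a-f]+         old-style SHA-1 name in hex
_NEW_NAME = re.compile(r"([A-Z]+)([0-9a-z]+)")
_OLD_NAME = re.compile(r"[0-9a-f]+")


def parse_object_name(name):
    """Split an object name into (prefix, digest_text).

    The prefix is "" for an old-style SHA-1 name (illustrative convention).
    Raises ValueError if the text matches neither alternative.
    """
    m = _NEW_NAME.fullmatch(name)
    if m:
        return m.group(1), m.group(2)
    if _OLD_NAME.fullmatch(name):
        return "", name
    raise ValueError(f"not an object name: {name!r}")
```

Because the new names must start with an uppercase letter and the old
ones never contain one, the two alternatives cannot both match.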

Full backwards compatibility is impossible, because the hash
function needs to be evident in the name, so the new names
must be disjoint from all old SHA-1 names.

We want a short but extensible syntax. The syntax should impose
minimal extra requirements on existing git users. In most
contexts where existing git users use hashes, ASCII alphanumeric
object names will fit. Use of punctuation such as : or even _
may give trouble to existing users, who are already using
such things as delimiters.

In existing deployments, refnames that differ only in case are
generally avoided (because they are troublesome on
case-insensitive filesystems). And conventionally refnames are
lower case. So names starting with an upper case letter will be
disjoint from most existing ref name components.

Even though we probably want to keep using hex, it is a good
idea to reserve the flexibility to use a more compact encoding,
while not excessively widening the existing permissible

Object names using SHA-1 are represented, in text, as at present.

Object names starting with uppercase ASCII letters H or later refer to
new hash functions. Programs that use `g<objectname>' should ideally
be changed to show `H<hash>' for hash function `H' rather than

Object names starting with A-F might look like hex. G is
reserved because of the way that many programs write

This gives us 19 new hash function values until we have to
start using two-letter hash function prefixes, or decide to

(Truncated object names work as they do at the moment.)

Initially we define and assign one new hash function (and textual
object name encoding):

    H<hex> where <hex> is the BLAKE2b hash of the object
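
As a rough illustration of what computing such a name could look
like, here is a sketch using Python's hashlib. Two things in it are
assumptions, not statements from this plan: that the hash is taken
over git's existing "<type> <size>NUL<content>" object framing, and
that the full 64-byte BLAKE2b digest is used. The function name
blake2b_object_name is likewise hypothetical.

```python
import hashlib


def blake2b_object_name(obj_type, content):
    """Return an 'H<hex>' name for an object (illustrative only).

    ASSUMPTION: the input to the hash is the same framing git uses for
    SHA-1 today, i.e. b"<type> <size>\\0" followed by the content.
    ASSUMPTION: the digest is BLAKE2b at its default 64-byte length.
    """
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return "H" + hashlib.blake2b(header + content).hexdigest()
```

For example, blake2b_object_name("blob", b"hello\n") yields a name
consisting of `H' followed by 128 lowercase hex digits, which fits
the [A-Z]+[0-9a-z]+ syntax.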

We also reserve the following syntax for private experiments:

We declare that public releases of git will never accept such

Everywhere in the git object formats and git protocols, a new object
name (with hash function indicator) is permitted where an old object

A single object refers to all the objects it references by the same
hash function; in general this might be a different hash function to
the hash function by which this particular object was itself
referenced or obtained.

As a further restriction, it is forbidden to refer to a tree object by
a name whose hash function differs from the one the tree uses to name
its subtrees. If this seems necessary, the tree object must instead be
recursively rewritten to use the desired object name.

In binary protocols, where a SHA-1 object name in binary form was
previously used, a new codepoint must be allocated in a containing
structure (eg a new typecode). Usually, the new-format binary object
will have a new typecode and also an additional name hash indicator.

Whenever a new hash function textual syntax is defined, corresponding
binary format codepoint(s) are assigned. (Detailed binary format
specification is outside the scope of this plan.)

Hash functions are partially ordered, from `older' to `newer'.

The ordering is configurable. The default, with the two hash
functions defined here, is the obvious ordering
    SHA1 ([0-9a-f]*) < BLAKE2b (H*)
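
A sketch of how the default ordering might be applied when choosing
between names. It simplifies the partial order to a configurable
list, oldest first, which is enough for the two hash functions
defined here. The names DEFAULT_ORDER, naming_of and newest are made
up for this example.

```python
# Oldest first. "" stands for SHA-1 (no prefix), "H" for BLAKE2b.
# ASSUMPTION: a simple list; the plan only requires a partial order.
DEFAULT_ORDER = ["", "H"]


def naming_of(name):
    """Return the leading run of A-Z in a name ("" for a SHA-1 name)."""
    i = 0
    while i < len(name) and "A" <= name[i] <= "Z":
        i += 1
    return name[:i]


def newest(names, order=DEFAULT_ORDER):
    """Pick the name whose hash function is newest under `order`."""
    return max(names, key=lambda n: order.index(naming_of(n)))
```

Passing a different list as `order' models the configurability: with
order=["H", ""] the SHA-1 name would win instead.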

CHOICE OF OBJECT NAMES

Whenever objects are named, it is possible to refer to them by old or
new names. So git must make a choice, each time: when new objects
are created; when refs are updated; and when refs are reported over
network protocols to other instances of git.

Although strictly speaking all objects have both old names and new
names, and there may be more than two hash functions, it is possible
to speak, somewhat loosely, about `new objects'.

A `new' object is one which refers to other objects by a `new' name
(whatever `new' means).

We call these different hashes `namings'. That is, a `naming' is a
hash function implemented by git. The `naming IN an object' is the
naming by which the object refers to other objects (and may not exist,
if the object has no references); the `name OF an object' is the name
by which the object itself is specified.

A non-origin commit is made (by default) as new as the newest of
  (i) the naming in each of its parents
  (ii) the specified name of each of its parents
(Implicitly this normally means that if HEAD uses a new name, new
commits will be generated.)
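
The rule for non-origin commits can be sketched as follows. Each
parent is represented by a pair: the naming of the name by which it
was specified, and the naming in the parent (None if the parent has
no references). The labels "sha1" and "blake2b", the ORDER table and
the function name are illustrative assumptions.

```python
# Rank of each naming, older to newer (assumed default ordering).
ORDER = {"sha1": 0, "blake2b": 1}


def commit_naming(parents):
    """Choose the naming for a new non-origin commit.

    `parents` is a non-empty list of (specified_naming, naming_in_parent)
    pairs; naming_in_parent may be None. The result is the newest naming
    found among both elements of every pair.
    """
    candidates = []
    for specified, inner in parents:
        candidates.append(specified)      # rule (ii)
        if inner is not None:
            candidates.append(inner)      # rule (i)
    return max(candidates, key=ORDER.__getitem__)
```

So a merge of an old-history branch with a parent that was specified
by a BLAKE2b name produces a commit that uses BLAKE2b names.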

The naming of an origin commit is controlled by a dropping left in
.git by git checkout --orphan or git init.

At boundaries between old and new history, a new commit will refer to
old parents by those old parents' new names.

A new tag is made to use, for its tagged object, the newest naming of
  (i) the name by which the tagged object was specified
  (ii) the naming in the tagged object (if applicable)

Commits (and sometimes, tags) can refer to tree objects; that tree
will contain the same naming as the referring object.

That is, it is a bug to refer to a tree object by other than the hash
it uses internally to refer to subtrees (and gitlinks). This will
mean that a tree must sometimes be rewritten (ie, new object names
recalculated recursively).

Rationale: we want to avoid new commits and tags relying on weak

Blobs do not refer to other objects so they are neither new nor old.

Name of newly created object

When git creates a new object, it reports the new object name using
the naming in the object.

For blobs and empty trees, the caller should normally specify. The
default is the naming used for HEAD.

If a ref is updated with a new object, the name from its creation is

If a ref is updated to a specified object, the naming used in the ref
is the newer of the specified name, or the naming in the object (if

), or with a specified object name.

(If there are different equally new names, one of the newest names is
chosen according to some stable rule.)

commit. (This may mean converting the tree in hand, since trees are
supposed to be homogeneous.)

A `new commit' is one which refers to objects by

The object store knows which hash functions are enabled. Each hash
function H has one of the following statuses, which are configured by

As far as the user is concerned every object in the object store is
accessible using H. Objects which use H names can be received and

This is actually two states, depending on whether any objects exist
in the store which use these names. If no such objects exist yet,
we say that the hash function is `ENABLED PROSPECTIVE'. The H names
for the objects have not yet been calculated.

When the first object which names another object using H is received
(or, on demand), the object store calculates the H names for all
existing objects and notes that this hash function is now

If a hash collision is detected, we crash immediately.

* OBSOLESCENT: Every object in the object store has its hash
  calculated using H. However, H is known to possibly have collisions
  which we try to tolerate. When a collision occurs, the object text
  which is currently in the object store is preferred and the "new"
  object is thrown away.

  Local creation of new objects with references using H is
  discouraged. Specifically, if another hash function is ENABLED, we
  will use that instead.

  This is used as part of a gradual desupport strategy. When the hash
  function is in this stage, existing history in all existing object
  stores is safe and cannot be corrupted or modified by receiving

  New object stores which receive their data from a trustworthy sender
  over a trustworthy channel will receive correct data. Bad object
  stores or untrustworthy channels could exploit collisions, but not
  in new regions of the history which are presumably using new names.
  So the collisions can only affect archaeology.

  Merging previously-unrelated histories does introduce a collision
  hazard, but the collision would have had to have been introduced
  while H was still a "live" hash function in at least one of the two

* FORBIDDEN: Objects do not have their hashes calculated using this
  hash function. Attempts to reference an object by such a name
  fail. Optionally the user may specify a tolerant mode where:
  a commit which refers to parents by obsolete names is taken to
  simply not have those parents; a commit which refers to a tree by
  an obsolete name is taken to have an empty tree.

  This is used for two purposes:

  - On a server, we use this to restrict the propagation of
    new hashes so as to enforce our compatibility intentions.
    Ie, hashes which we are "not ready for" are forbidden.

  - Everywhere, we use this to get rid of old hash functions.
    It makes access to old history possible but difficult.

* FORGOTTEN: Objects do not have their hashes calculated using this
  hash function. References to objects by all such names return dummy
  objects of the right shape: the empty blob; the empty tree; a root
  commit with an empty tree and dummy metadata.

  This allows us to finally retire a hash function entirely. We
  effectively throw away all the history which uses H.
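
The differences between the statuses can be sketched with a toy
object store, here just a dict from name to object text. Everything
below (the Status enum, HashCollision, store, lookup, and the choice
of LookupError for failures) is an illustrative assumption, and the
dummy object is reduced to an empty byte string.

```python
from enum import Enum


class Status(Enum):
    ENABLED = "enabled"
    OBSOLESCENT = "obsolescent"
    FORBIDDEN = "forbidden"
    FORGOTTEN = "forgotten"


class HashCollision(Exception):
    """Two different object texts were given the same name."""


def store(objects, status, name, text):
    """Add `text` under `name`; return the text the store ends up holding."""
    if status in (Status.FORBIDDEN, Status.FORGOTTEN):
        raise LookupError(f"names for this hash are not calculated: {name}")
    existing = objects.setdefault(name, text)
    if existing != text and status is Status.ENABLED:
        raise HashCollision(name)          # ENABLED: crash immediately
    return existing                        # OBSOLESCENT: keep what we had


def lookup(objects, status, name, dummy=b""):
    """Resolve `name` according to the status of its hash function."""
    if status is Status.FORBIDDEN:
        raise LookupError(f"hash function is forbidden: {name}")
    if status is Status.FORGOTTEN:
        return dummy                       # dummy object of the right shape
    return objects[name]
```

Note that under OBSOLESCENT the second, colliding text is silently
discarded, which is what keeps existing history stable.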

During transfer protocols, the receiver will say which hashes it
thinks are obsolete or forgotten, and the sender will not follow such
references when computing the set of objects to send. So receivers
will not receive the objects which were named only by obsolete or

Naming in newly-generated objects, queries, etc.

There is a `default' hash function, which is that which HEAD uses.
(That is, HEAD refers to an object by some name. The default hash
function is that name's hash function.)

git tools always output object names in the default hash
function. (Including git-hash-object.)

As a consequence, newly generated objects will contain object
references using the `default' hash function.

When HEAD is empty, there is a separate record of the default hash
function. This comes from a configured default in a new tree. In an
existing tree, using git checkout --orphan remembers the default hash
function that HEAD had.

When HEAD is updated to a new commit, the name stored in HEAD uses the
newer of the previous HEAD hash function and of the hash function used
in the commit being stored. ("Newer" is a built-in preference order,
overrideable by configuration.)

This (together with the `forbidden' state, above) ensures that
switching a project to use a new hash function is a deliberate
decision: the default hash function needs to be changed to make the
first commit with the new hash function. After that, provided
the server accepts it, it's infectious.

Naming of refs other than HEAD

A ref refers to an object by one of its names. However, operations
like git-show-ref convert that name to the default format (see above).

git-gc rewrites ref names to the default format iff that is newer.

During the negotiation, a receiver needs to specify what hashes it

When the sender is listing its refs, the names are converted to a
hash understood by the client if necessary. If this is not necessary,
they are left unchanged.

When a receiver is updating refs, it should follow the sender's
idea of a hash change iff it's an upgrade (and the new function is
ENABLED). That is, if the sender sends name H2 for some ref, and the
receiver has H1, but these refer to the same object, then the receiver
should update its own ref name from H1 to H2 iff H2 uses a newer hash

All software which tests for equality of git objects by checking
whether their object names are equal needs to obtain a canonical name

This is going to be quite annoying.

We should provide a convenient utility which tests whether two object
names refer to the same object.
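
What such a utility has to do can be sketched with an in-memory
index that records which names denote the same object. The class
NameIndex and its methods are hypothetical; a real tool would consult
the object store, and this sketch does not handle truncated names.

```python
class NameIndex:
    """Toy index: maps every known object name to one key per object."""

    def __init__(self):
        self._key_of = {}

    def add(self, *names):
        """Record that all of `names` are names of a single object."""
        # Reuse the key of any name we already know, else use the first name.
        key = next(
            (self._key_of[n] for n in names if n in self._key_of), names[0]
        )
        for n in names:
            self._key_of[n] = key

    def same(self, name_a, name_b):
        """True iff both names are known and denote the same object."""
        return self._key_of[name_a] == self._key_of[name_b]
```

Callers compare via same() instead of comparing the name strings,
since "aaaa" and "Hbbbb" may well be the same object.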

Note that semantically identical trees may (now) have different tree
objects because those tree objects might contain different object
names. So (in some contexts at least) tree comparison cannot any
longer be done by comparing names; rather an invocation of git diff is
needed, or explicit generation of a tree object with the right name.

Y0: Implement all of the above. Test it.

Default configuration:

  SHA-512 is FORBIDDEN in bare repos
  SHA-512 is ENABLED in trees with working trees
  default HEAD hash is SHA-1

Existing projects will not switch to SHA-512 willy-nilly.
New projects will still use SHA-1.

Incompatible new-style commits cannot be pushed without server
admin effort (or until future upgrade).

So all old git clients still work.

Y4: SHA-512 by default for new projects.
Conversion enabled for existing projects.
Old git software is now pretty firmly deprecated.

Default configuration change:

  When creating a new bare tree, a configuration dropping is left
  (in `config') which specifies that SHA-1 is OBSOLESCENT

  Default status for SHA-512 is FORBIDDEN if SHA-1 is ENABLED,
  or ENABLED if SHA-1 is OBSOLESCENT.

  default HEAD hash is newest ENABLED hash.
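
The Y4 defaults can be read as a small computation over the
configured statuses. The sketch below is one such reading. It
assumes that an unconfigured SHA-1 counts as ENABLED (as in an
existing tree with no dropping), and it refuses to guess for SHA-1
statuses the rule above does not mention. All names in it are
invented for the example.

```python
ORDER = ["SHA-1", "SHA-512"]   # oldest first


def effective_status(configured):
    """Fill in the Y4 default for SHA-512 from the status of SHA-1."""
    status = dict(configured)
    # ASSUMPTION: no explicit setting for SHA-1 means ENABLED.
    status.setdefault("SHA-1", "ENABLED")
    if "SHA-512" not in status:
        if status["SHA-1"] == "ENABLED":
            status["SHA-512"] = "FORBIDDEN"
        elif status["SHA-1"] == "OBSOLESCENT":
            status["SHA-512"] = "ENABLED"
        else:
            raise ValueError("default for SHA-512 not specified by the plan")
    return status


def default_head_hash(status):
    """Return the newest hash function whose status is ENABLED."""
    for h in reversed(ORDER):
        if status.get(h) == "ENABLED":
            return h
    raise ValueError("no ENABLED hash function")
```

A new bare tree (SHA-1 OBSOLESCENT by dropping) thus ends up with
SHA-512 ENABLED and as the default HEAD hash, while an untouched
existing tree stays on SHA-1.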

When creating a new working tree, it starts using SHA-512.
A new server tree will accept SHA-512.

Existing server trees do not yet accept SHA-512. They publish
their SHA-1 hashes, so clients make commits with SHA-1.

To convert a project, an administrator would set SHA-1 to
OBSOLESCENT on the server. All clones after that will have HEAD
with a SHA-512 name. Fetches and pulls will update to SHA-512

will , and push one SHA-512 commit to

Default configuration change:

When creating a new tree with working tree with git init (ie, no
HEAD), the default HEAD hash is set to SHA-512 (because SHA-1 is
OBSOLESCENT in a new tree and therefore SHA-512 is the only
ENABLED hash and is the default).

Newly minted server trees accept SHA-512.

start using SHA-512 by default.

Y6: Existing projects start being converted infectiously.
It is hard to stop this happening.
Old git software is firmly stuffed.

Default configuration change:

(default for SHA-512, and HEAD hash, computed as in Y4)

Result is that by default all software

(Projects which do not want to convert need to set SHA-1 to
ENABLED, explicitly, on their

Y6: Existing projects start using SHA-512.

Default configuration change:

(default default HEAD hash is already SHA-512)

In existing repositories where no special action

Ian Jackson <ijackson@chiark.greenend.org.uk> These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.