--- /dev/null
+Book Sifting
+============
+
+In coquet, data is written in books. Books are the basic unit of durability. Transactions may contain multiple books, and a book may be written and verified yet later have no effect (because it is part of an incomplete transaction). Books are verified by the hashes internal to them.
+
+Books include two sequence numbers, as they may be written multiple times: one for the contents of the book (bcid), and one for this particular write of the book (bwid). For verifying book integrity, only the bwid is relevant: a book written multiple times may have been successfully written in one place, but not in another.
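+
+As a small illustrative sketch only (the field layout is an assumption; the real pt_book header in the accompanying code still lists bcid and bwid as a TODO), the two numbers can be pictured like this:
+
+    #include <stdint.h>
+
+    /* Sketch only, not coquet's real layout. */
+    struct book_ids_sketch {
+        uint64_t bcid; /* identifies the contents of the book; stable across rewrites */
+        uint64_t bwid; /* identifies this particular write; the id integrity checks care about */
+    };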
+
+While during processing and on disk a bwid is unique, we cannot assume that this is the only time a bwid has been used. If an OS crashes (in the sense of the disk's dirty cache being lost), having failed to persist a bwid, it will likely unwittingly reuse that bwid later. Therefore, in this document we cannot refer unambiguously to a book by its bwid alone, but must also refer to the context of a particular OS run.
+
+This diagram shows plausible bwids for a database, across OS crashes. Each line represents an OS boot. For example, in the first run, books with bwids 1-5 were written but the OS crashed with 4 and 5 unsynced, so the second run reuses bwids 4 and 5, and so on.
+
+1 -- 2 -- 3 -- 4 -- 5
+ \-- 4 -- 5 -- 6 -- 7
+ \-- 5 -- 6
+ \-- 7 -- 8 -- 9
+
+The most recently verified bwid is recorded in the superblock.
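+
+As a sketch only (the field name is an assumption, not coquet's actual superblock layout), this is the one piece of sift state held in the superblock:
+
+    #include <stdint.h>
+
+    /* Sketch: the sift high-water mark alongside whatever else the
+     * superblock holds. */
+    struct cq_superblock_sketch {
+        /* ... other superblock fields ... */
+        uint64_t last_verified_bwid; /* most recently verified bwid (see the sync discussion below) */
+    };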
+
+Book sifting is the process of verifying books after OS crashes. Books can also be abandoned after lesser (program) crashes, but the way a book is written ensures that such a book can easily be detected as invalid: the book header is written last and hashed (hashing being relevant in the case where that last write is torn before a program crash, i.e. a partial write).
+
+However, in general, operating systems do not guarantee in-order disk cache flushing, so we can end up with a book which contains corrupt pages but has a valid book header; therefore the whole book contents must be verified against the whole-book hash. This is expensive, so we must do our best to verify only those books which are at particular risk. (Some standards, file-systems, and OSes guarantee in-order writes, particularly in the context of appending, but the coverage of these guarantees is too patchy to be relied upon, especially in the context of an operating system crash.)
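+
+A minimal sketch of what that whole-book check might look like, assuming the file is memory-mapped and a one-shot keyed-hash helper exists (both are assumptions, as are the page size and the exact span the hash covers):
+
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    #define CQ_PAGE_SIZE 4096 /* assumed page size */
+
+    /* Mirrors the book_length/book_hash fields of the pt_book header. */
+    struct book_header_sketch {
+        uint64_t book_length;
+        uint8_t book_hash[32];
+    };
+
+    /* Hypothetical one-shot HMAC SHA512-256 helper. */
+    extern void hmac_sha512_256(const uint8_t iv[32], const uint8_t *data,
+                                size_t len, uint8_t out[32]);
+
+    /* Hash every page the book covers and compare against the stored
+     * whole-book hash; any mismatch means a page was torn or never
+     * reached disk before the crash. */
+    static bool sift_verify_book(const uint8_t *db_map, uint64_t first_page,
+                                 const struct book_header_sketch *hdr,
+                                 const uint8_t global_iv[32])
+    {
+        uint8_t computed[32];
+        const uint8_t *start = db_map + first_page * CQ_PAGE_SIZE;
+
+        hmac_sha512_256(global_iv, start, hdr->book_length * CQ_PAGE_SIZE,
+                        computed);
+        return memcmp(computed, hdr->book_hash, sizeof computed) == 0;
+    }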
+
+Sifting is a correctness concern, orthogonal to durability. It is not an absolute guarantee of durability (nothing is), but it shifts those books which are at highest risk of being corrupt (unsynced books) to a lower risk (synced books), while something can still be done about it (the restart begins in a known good state). Per-page hashes (unused in sifting) lower the risk further, at the expense of corruption caught that way not being recoverable at the point of detection.
+
+Sifting should not be done too often. As syncs occur for durability reasons in normal operation, most books will never need to be verified. Therefore, sifting should only take place *when there may have been a crash since it was last called*. For example, for connection-pool based applications, this need only occur at the time of pool creation.
+
+After a sync, a book can no longer be torn by the operating system; or, if it is, this should be regarded as permanent file corruption or machine failure. Therefore syncs record the last bwid at the time of sync in the superblock. The superblock need not itself be synced after doing so: there is no correctness issue with reverification, merely wasted work.
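+
+Putting the last few paragraphs together, a sift pass can be pictured roughly as follows. Every identifier here is a hypothetical stand-in, not coquet's API; only the shape (verify the books written after the recorded bwid, and advance that record only when also syncing, for the reason given in the next paragraph) follows this document:
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /* Hypothetical stand-ins, declared only to keep the sketch self-contained. */
+    struct cq_db;
+    extern uint64_t cq_superblock_bwid(struct cq_db *db);
+    extern void cq_superblock_set_bwid(struct cq_db *db, uint64_t bwid);
+    extern void *cq_books_after(struct cq_db *db, uint64_t bwid);  /* iterator */
+    extern bool cq_book_next(void *it, uint64_t *bwid_out);        /* advance */
+    extern bool cq_verify_book(struct cq_db *db, uint64_t bwid);
+    extern int cq_truncate_after(struct cq_db *db, uint64_t last_good_bwid);
+    extern int cq_sync(struct cq_db *db);
+
+    int cq_sift(struct cq_db *db, bool sync_afterwards)
+    {
+        uint64_t last_good = cq_superblock_bwid(db);
+        uint64_t bwid;
+        void *it = cq_books_after(db, last_good);
+
+        /* Only books written after the recorded bwid are at particular risk. */
+        while (cq_book_next(it, &bwid)) {
+            if (!cq_verify_book(db, bwid))
+                return cq_truncate_after(db, last_good); /* fall back to known good state */
+            last_good = bwid;
+        }
+
+        if (sync_afterwards) {
+            if (cq_sync(db) != 0)
+                return -1;
+            /* Safe now: a synced book can no longer be torn. The superblock
+             * write itself need not be synced; losing it only causes redundant
+             * reverification on the next sift. */
+            cq_superblock_set_bwid(db, last_good);
+        }
+        return 0;
+    }
+
+Whether sync_afterwards should be true is exactly the trade-off discussed below.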
+
+Sifting methods need to be as efficient as possible if repeatedly called. Upon being called, sifting cannot know whether the call was due to a possible OS crash or was merely a repeated call (if it could know, we wouldn't have bothered calling). The only mechanism we allow ourselves for inter-process communication is the database file. If we record non-synced book verifications in that file, we cannot know whether that information comes from before or after the most recent OS crash. This matters because verifications recorded before the crash cannot be trusted: those books must be verified again.
+
+If we are in a mode which syncs after every transaction, for durability reasons, sifting will never need to do work, and no extra sync will be performed (except in rare races, where it will be redundant).
+
+Otherwise we have two choices: repeatedly verify on calls to sift (which, after all, may not be so often), or perform a sync after each sift. On the one hand, without calls to sync, expensive verification work is done over and over again; on the other, syncs are remarkably slow, so reverifying significant volumes of data (for example every 10-100MB [2025]) is a justifiable decision, especially if calls to sift are rare (pooled connections). If sifts are rare, none of this likely makes any significant difference.
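+
+One way to express that trade-off, again purely as a sketch (the threshold value, the byte-counting variant of sift, and the helper names are all assumptions):
+
+    #include <stdint.h>
+
+    struct cq_db;
+    /* Hypothetical: like cq_sift above, but reports how many bytes it verified. */
+    extern int cq_sift_counting(struct cq_db *db, uint64_t *bytes_verified);
+    extern int cq_sync_and_record(struct cq_db *db);
+
+    #define SIFT_SYNC_THRESHOLD (64ull * 1024 * 1024) /* somewhere in the 10-100MB range */
+
+    int cq_sift_with_policy(struct cq_db *db)
+    {
+        uint64_t bytes_verified = 0;
+        int rc = cq_sift_counting(db, &bytes_verified);
+        if (rc != 0)
+            return rc;
+
+        /* Below the threshold, repeating the verification on later sifts is
+         * cheaper than a sync; above it, one slow sync saves that work. */
+        if (bytes_verified >= SIFT_SYNC_THRESHOLD)
+            rc = cq_sync_and_record(db);
+        return rc;
+    }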
+
+Syncing (and so persisting the outcomes of sift runs to reduce work) can also be performed asynchronously(!) in another thread.
+
+A good compromise would be to call sync in a thread every thirty seconds or so, and also to set a per-connection write trigger every 10-100MB.
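+
+A sketch of the thread half of that compromise (pthread-based; the sync helper and database type are the same hypothetical stand-ins as above):
+
+    #include <pthread.h>
+    #include <stdatomic.h>
+    #include <unistd.h>
+
+    struct cq_db;
+    extern int cq_sync_and_record(struct cq_db *db); /* hypothetical */
+
+    struct sync_thread_arg {
+        struct cq_db *db;
+        atomic_bool stop;
+    };
+
+    /* One slow sync roughly every thirty seconds, so that foreground sifts
+     * rarely find unsynced books left to verify. */
+    static void *background_sync(void *p)
+    {
+        struct sync_thread_arg *arg = p;
+        while (!atomic_load(&arg->stop)) {
+            sleep(30);                   /* "every thirty seconds or so" */
+            cq_sync_and_record(arg->db);
+        }
+        return NULL;
+    }
+
+The thread would be started with pthread_create at pool creation and stopped by setting the flag at shutdown; the 10-100MB per-connection write trigger is analogous, counting bytes written on a connection and syncing once the count crosses the threshold.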
+
+Don't get syncing and sifting confused. Sync as much as you can get away with, sift as little.
\ No newline at end of file
* are sequenced in the order of the declarations below.
*/
-/* A true root page, a pt_root header follows */
-#define COQUET_PT_ROOT 0x01
+/* A true root page, a pt_book header follows */
+#define COQUET_PT_BOOK 0x00000001
-struct pt_root {
-	/* TODO nursery id */
-	uint64_t book_length; /* Pages to next pt_root, (0 = this is first) */
-	uint8_t page_hash[32]; /* HMAC SHA512-256 using global_iv */
-	uint8_t section_hash[32]; /* HMAC SHA512-256 using global_iv */
+struct pt_book {
+	/* TODO bcid, bwid */
+	uint64_t book_length; /* Pages to next pt_book, (0 = this is first) */
+	uint8_t book_hash[32]; /* HMAC SHA512-256 using global_iv */
 };
 struct cq_page {
-	uint8_t flags;
-	struct pt_root root;
+	uint32_t flags;
+	uint8_t page_hash[32]; /* HMAC SHA512-256 using global_iv */
+	struct pt_book book;
 	uint64_t data;
 };