After a bit of research, I found that LUKSv2 composes two subsystems: dm-crypt and dm-integrity. The combination is effectively equivalent to an AE scheme, with dm-integrity checking for data modification or corruption before dm-crypt decrypts (effectively Encrypt-then-MAC). dm-integrity, however, needs to ensure that the tag (its name for the MAC) and the actual data sector are updated atomically, and to do so, it uses… a journal!
The problem is that of a log on a log. Briefly, a log, even just a short write-ahead log or journal, is required for atomicity, but it means writing twice as much on every write: once to the log, once to the real location. But if you have a logged layer on top of a logged layer, then you are writing four times: first the upper layer writes to its log, which on the lower layer translates to two writes (one to the lower-layer log, one to the lower-layer storage), and then the upper layer writes to its storage, which on the lower layer translates to two more writes (one to the lower-layer log, one to the lower-layer storage).
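A toy sketch of that write amplification (the function name and layer counts below are mine, purely for illustration): each journaled layer doubles the physical writes handed to the layer beneath it.

```python
# Toy model: every journaled layer turns one incoming write into two writes
# passed down to the layer below (one to its journal, one to its "real" location).
def physical_writes(logical_writes: int, journaled_layers: int) -> int:
    return logical_writes * (2 ** journaled_layers)

print(physical_writes(1, 1))  # 2: journal + storage
print(physical_writes(1, 2))  # 4: the log-on-a-log case described above
```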
Such a thing can happen when our array-management code uses a journal to write out planned writes, ensuring atomicity across devices and plugging the RAID5 write hole, while an additional layer on top handles AE (i.e. dm-integrity+dm-crypt aka LUKS2) and has its own journal to write out planned writes, ensuring atomic update of the MACs together with the actual sector storage.
The correct solution is to collapse the log layers into a single one, which is why ZFS is awesome: it uses the same atomicity logging both to plug the RAID5 write hole and to make itself a transactional filesystem, it can use cryptography-quality checksums, it has encryption, and all of that sits on one log layer. More broadly, a bunch of extensions have also been proposed for XFS to integrate database logs into its own log, so that it can lend the atomicity of its log to databases running on top of XFS and avoid the log-on-a-log problem.
The eventual evolution of this is that you have a lower layer with a ridiculously large log / journal, because all the higher layers rely on it for atomicity, which means more and more data being pushed inside a single atomic operation. The end result is a copy-on-write filesystem, just like ZFS, where in essence the whole disk is a log and there is no separate “storage” to rewrite to. You just write the log onto whatever free space is available instead of overwriting existing storage, and then, after you are sure the lying underlying disk has actually written the data out, you mark the existing storage whose data you replaced as “now this is free”, without ever having to double-write from the journal to the storage: the journal is the storage.
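A minimal sketch of that idea (a toy in-memory block store; the class and names are mine): the block written to free space simply becomes the new home of the data, and nothing is ever copied out of a journal into separate storage.

```python
# Toy copy-on-write block store: a write goes to free space, the logical->physical
# mapping is repointed, and only then is the old physical block released.
class CowStore:
    def __init__(self, nblocks: int):
        self.blocks = [b""] * nblocks       # physical blocks
        self.free = set(range(nblocks))     # free-space map
        self.map = {}                       # logical block -> physical block

    def write(self, logical: int, data: bytes) -> None:
        new = self.free.pop()               # claim a free physical block
        self.blocks[new] = data             # this write is both "journal" and final copy
        # (a real implementation would flush here and wait for the disk to
        #  confirm the write before repointing anything)
        old = self.map.get(logical)
        self.map[logical] = new             # repoint the mapping to the new copy
        if old is not None:
            self.free.add(old)              # the replaced block is only freed now

    def read(self, logical: int) -> bytes:
        return self.blocks[self.map[logical]]
```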
The problem with that scheme is the “key deletion” problem of statechain signers. Old journal entries are effectively backups of the data, and therefore risk exfiltration of supposedly-deleted keys. So we actually have to avoid copy-on-write schemes for statechain signers; we want a short journal that we specifically destroy each time we have applied the latest journal entry. However, we can at least merge the atomicity-providing log layers for the AE and the RAID-X.
We can have an array of IV+MAC entries. Each IV+MAC covers one sector of encrypted storage, and the array itself is stored in some number of sectors. Like the encrypted storage itself, the IV+MAC array is also erasure-coded. Note that we do not need to IV+MAC the parity sectors, only the actual storage sectors; if we can recover using the parity sectors and the recovered data matches the IV+MAC, then the parity sectors were also correct, so they are implicitly covered by the same IV+MAC.

The IV removes the problem of using a stream cipher for full disk encryption, which requires sector-addressable deciphering: each encrypted storage sector gets its own IV, and we can use an AEAD scheme where the AD is the index of the encrypted storage sector in the RAID array. For ChaCha20, the IV is 12 bytes.

We should use HMAC instead of Poly1305, due to a minor weakness of Poly1305: it does not commit to its key, so the same encrypted data and tag can be made to verify under a different key, where the Poly1305 MAC matches but decryption produces garbage. The point of Encrypt-then-MAC is that the probability of the MAC matching under a different key is so low that a matching MAC implies decryption under the original encryption key, yielding the original plaintext and not garbage; this Poly1305 issue violates that assumption. This is the key-commitment problem, and HMAC naturally commits to the MAC key (polynomial-based MACs like Poly1305 do not).

The output of HMAC-SHA-2-256 is 32 bytes, but I understand it can be truncated to 16 bytes, so a 28-byte IV (12 bytes) + MAC (16 bytes) for each 4096-byte sector is slightly above half a percent overhead (28 / 4096 ≈ 0.68%). The array of IV+MAC entries would itself be grouped into 4096-byte sectors, and each such sector also needs its own IV+MAC covering the section of the array it contains: we can fit 145 IV+MAC entries in a 4096-byte sector, plus a 146th entry covering the sector itself.
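A minimal per-sector Encrypt-then-MAC sketch of the above, assuming Python's `cryptography` package for ChaCha20; the helper names and the exact byte layout fed to the MAC (sector index, then IV, then ciphertext) are my own choices for illustration, not a fixed part of the scheme.

```python
import hmac, hashlib, os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms

SECTOR_SIZE = 4096
TAG_LEN = 16  # truncated HMAC-SHA-256 tag, as described above

def _chacha(enc_key: bytes, iv: bytes) -> Cipher:
    # The cryptography library wants a 16-byte "nonce": a 4-byte block counter
    # (little-endian, starting at 0) followed by the 12-byte IV.
    return Cipher(algorithms.ChaCha20(enc_key, (0).to_bytes(4, "little") + iv), mode=None)

def seal_sector(enc_key: bytes, mac_key: bytes, sector_index: int, plaintext: bytes):
    assert len(plaintext) == SECTOR_SIZE
    iv = os.urandom(12)                          # fresh per-sector IV on every write
    ciphertext = _chacha(enc_key, iv).encryptor().update(plaintext)
    ad = sector_index.to_bytes(8, "little")      # the sector index is the AD
    tag = hmac.new(mac_key, ad + iv + ciphertext, hashlib.sha256).digest()[:TAG_LEN]
    return iv, tag, ciphertext                   # iv + tag = the 28-byte array entry

def open_sector(enc_key: bytes, mac_key: bytes, sector_index: int,
                iv: bytes, tag: bytes, ciphertext: bytes) -> bytes:
    ad = sector_index.to_bytes(8, "little")
    expected = hmac.new(mac_key, ad + iv + ciphertext, hashlib.sha256).digest()[:TAG_LEN]
    if not hmac.compare_digest(tag, expected):   # verify the MAC before decrypting
        raise ValueError("integrity check failed")
    return _chacha(enc_key, iv).decryptor().update(ciphertext)
```

As in the Encrypt-then-MAC ordering above, verification happens before any decryption is attempted.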
The journal only needs to be expanded enough to contain one stripe width (storage sectors + parity sectors) for the main data, plus two IV+MAC storage sectors and the corresponding IV+MAC parity sectors to update. We need two IV+MAC sectors only in case the storage sectors of a stripe straddle the boundary between two IV+MAC sectors. Then we can not only plug the RAID5 write hole atomically, but also update the encryption+integrity data atomically. At the same time, the (relatively) small journal means we can afford to ask the disk to delete the journal data and leave only the version counter in the atomicity sector, thus also avoiding natural backups of supposedly-deleted keys. While typical modern filesystems and databases will have their own logs on top for atomicity as well, a statechain signer application can simply perform direct writes to a dedicated partition of the simulated persistent storage disk, so that erasure of old keys goes through fewer layers that need to be audited for accidental backup copies.
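To make the journal sizing concrete, a back-of-the-envelope sketch (the parameter names and the example geometry are mine): a stripe's data sectors map into at most two IV+MAC metadata sectors, and the journal only needs to hold the stripe plus those metadata sectors and their parity.

```python
# Toy arithmetic for the journal size; geometry and names are illustrative only.
SECTOR = 4096
ENTRIES_PER_META = 145   # data-sector entries per IV+MAC metadata sector (146th covers itself)

def meta_sectors_touched(first_data_sector: int, data_sectors_per_stripe: int) -> int:
    """How many IV+MAC metadata sectors one stripe write must update."""
    first = first_data_sector // ENTRIES_PER_META
    last = (first_data_sector + data_sectors_per_stripe - 1) // ENTRIES_PER_META
    return last - first + 1          # 1 usually, 2 when the stripe straddles a boundary

def journal_bytes(data_disks: int, parity_disks: int, meta_parity_sectors: int) -> int:
    stripe = (data_disks + parity_disks) * SECTOR       # one full data stripe
    meta = (2 + meta_parity_sectors) * SECTOR           # up to 2 metadata sectors + their parity
    return stripe + meta

print(meta_sectors_touched(0, 8))      # 1: entries 0..7 all sit in metadata sector 0
print(meta_sectors_touched(140, 8))    # 2: entries 140..147 straddle metadata sectors 0 and 1
print(journal_bytes(4, 1, 1))          # 32768 bytes for a hypothetical 4+1 array
```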