Re: Moving forward with TDE [PATCH v3] - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Moving forward with TDE [PATCH v3] |
Date | |
Msg-id | 20231103023228.gz32vjx67jwzfh6h@alap3.anarazel.de Whole thread Raw |
In response to | Re: Moving forward with TDE [PATCH v3] (David Christensen <david.christensen@crunchydata.com>) |
Responses |
Re: Moving forward with TDE [PATCH v3]
Re: Moving forward with TDE [PATCH v3] Re: Moving forward with TDE [PATCH v3] Re: Moving forward with TDE [PATCH v3] |
List | pgsql-hackers |
Hi, On 2023-10-31 16:23:17 -0500, David Christensen wrote: > The patches are as follows: > > 0001 - doc updates > 0002 - Basic key management and cipher support > 0003 - Backend-related changes to support heap encryption > 0004 - modifications to bin tools and programs to manage key rotation and > add other knowledge > 0005 - Encrypted/authenticated WAL > > These are very broad strokes at this point and should be split up a bit > more to make things more granular and easier to review, but I wanted to get > this update out. Yes, particularly 0003 really needs to be split - as is it's not easily reviewable. > From 327e86d52be1df8de9c3a324cb06b85ba5db9604 Mon Sep 17 00:00:00 2001 > From: David Christensen <david@pgguru.net> > Date: Fri, 29 Sep 2023 15:16:00 -0400 > Subject: [PATCH v3 5/5] Add encrypted/authenticated WAL > > When using an encrypted cluster, we need to ensure that the WAL is also > encrypted. While we could go with an page-based approach, we use instead a > per-record approach, using GCM for the encryption method and storing the AuthTag > in the xl_crc field. > > We change the xl_crc field to instead be a union struct, with a compile-time > adjustable size to allow us to customize the number of bytes allocated to the > GCM authtag. This allows us to easily adjust the size of bytes needed to > support our authentication. (Testing has included up to 12, but leaving at this > point to 4 due to keeping the size of the WAL records the same.) Ugh, that'll be quite a bit of overhead in some workloads... You can't really use such a build for non-encrypted workloads, making this a not very deployable path... > @@ -905,20 +905,28 @@ XLogInsertRecord(XLogRecData *rdata, > { > /* > * Now that xl_prev has been filled in, calculate CRC of the record > - * header. > + * header. If we are using encrypted WAL, this CRC is overwritten by > + * the authentication tag, so just zero > */ > - rdata_crc = rechdr->xl_crc; > - COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc)); > - FIN_CRC32C(rdata_crc); > - rechdr->xl_crc = rdata_crc; > + if (!encrypt_wal) > + { > + rdata_crc = rechdr->xl_integrity.crc; > + COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_integrity.crc)); > + FIN_CRC32C(rdata_crc); > + rechdr->xl_integrity.crc = rdata_crc; > + } > + else > + memset(&rechdr->xl_integrity, 0, sizeof(rechdr->xl_integrity)); Why aren't you encrypting most of the data here? Just as for CRC computation, encrypting a large record in XLOG_BLCKSZ > * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's > * used to distinguish between block references, and the main data structs. > */ > + > +#define XL_AUTHTAG_SIZE 4 > +#define XL_HEADER_PAD 2 > + > typedef struct XLogRecord > { > uint32 xl_tot_len; /* total len of entire record */ > @@ -45,14 +49,16 @@ typedef struct XLogRecord > XLogRecPtr xl_prev; /* ptr to previous record in log */ > uint8 xl_info; /* flag bits, see below */ > RmgrId xl_rmid; /* resource manager for this record */ > - /* 2 bytes of padding here, initialize to zero */ > - pg_crc32c xl_crc; /* CRC for this record */ > - > + uint8 xl_pad[XL_HEADER_PAD]; /* required alignment padding */ What does "required" mean here? And why is this defined in a separate define? And why do we need the explicit field here at all? The union will still provide sufficient alignment for a uint32. It also doesn't seem right to remove the comment about needing to zero out the space. > From 7557acf60f52da4a86fd9f902bab4804c028dd4b Mon Sep 17 00:00:00 2001 > From: David Christensen <david.christensen@crunchydata.com> > Date: Tue, 31 Oct 2023 15:24:02 -0400 > Subject: [PATCH v3 3/5] Backend-related changes > --- a/src/backend/access/heap/rewriteheap.c > +++ b/src/backend/access/heap/rewriteheap.c > @@ -323,6 +323,11 @@ end_heap_rewrite(RewriteState state) > state->rs_buffer, > true); > > + PageEncryptInplace(state->rs_buffer, MAIN_FORKNUM, > + RelationIsPermanent(state->rs_new_rel), > + state->rs_blockno, > + RelationGetSmgr(state->rs_new_rel)->smgr_rlocator.locator.relNumber > + ); > PageSetChecksumInplace(state->rs_buffer, state->rs_blockno); > > smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM, I don't think it's ok to have to make such changes in a bunch of places. I think we need to add an abstraction layer between smgr and code like this first. Luckily I think Heikki was working abstracting away some of these direct smgr* uses... > +Architecture > +------------ > + > +Fundamentally, cluster file encryption must store data in a file system > +in such a way that the keys required to decrypt the file system data can > +only be accessed using somewhere outside of the file system itself. The > +external requirement can be someone typing in a passphrase, getting a > +key from a key management server (KMS), or decrypting a key stored in > +the file system using a hardware security module (HSM). The current > +architecture supports all of these methods, and includes sample scripts > +for them. > + > +The simplest method for accessing data keys using some external > +requirement would be to retrieve all data encryption keys from a KMS. > +However, retrieved keys would still need to be verified as valid. This > +method also introduces unacceptable complexity for simpler use-cases, > +like user-supplied passphrases or HSM usage. External key rotation > +would also be very hard since it would require re-encrypting all the > +file system data with the new externally-stored keys. > + > +For these reason, a two-tiered architecture is used, which uses two > +types of encryption keys: a key encryption key (KEK) and data encryption > +keys (DEK). The KEK should not be present unencrypted in the file system > +--- it should be supplied the user, stored externally (e.g., in a KMS) *by* the user? > +or stored in the file system encrypted with a HSM (e.g., PIV device). > +The DEK is used to encrypt database files and is stored in the same file > +system as the database but is encrypted using the KEK. Because the DEK > +is encrypted, its storage in the file system is no more of a security > +weakness and the storage of the encrypted database files in the same > +file system. As is this paragraph doesn't really follow from the prior paragraph for me. That a KMS would be hard to use isn't obviously related to splitting the KEK and the DEK. > +Implementation > +-------------- > + > +To enable cluster file encryption, the initdb option > +--cluster-key-command must be used, which specifies a command to > +retrieve the KEK. FWIW, I think "cluster file encryption" is somewhat ambiguous. That could also mean encrypting on the file system level or such. > initdb records the cluster_key_command in > +postgresql.conf. Every time the KEK is needed, the command is run and > +must return 64 hex characters which are decoded into the KEK. The > +command is called twice during initdb, and every time the server starts. > +initdb also sets the encryption method in controldata during server > +bootstrap. > + > +initdb runs "postgres --boot", which calls function > +kmgr.c::BootStrapKmgr(), which calls the cluster key command. The > +cluster key command returns a KEK which is used to encrypt random bytes > +for each DEK and writes them to the file system by > +kmgr.c::KmgrWriteCryptoKeys() (unless --copy-encryption-keys is used). > +Currently the DEK files are 0 and 1 and are stored in > +$PGDATA/pg_cryptokeys/live. The wrapped DEK files use Key Wrapping with > +Padding which verifies the validity of the KEK. > + > +initdb also does a non-boot backend start which calls > +kmgr.c::InitializeKmgr(), which calls the cluster key command a second > +time. This decrypts/unwraps the DEK keys and stores them in the shared > +memory structure KmgrShmem. This step also happens every time the server > +starts. Later patches will use the keys stored in KmgrShmem to > +encrypt/decrypt database files. KmgrShmem is erased via > +explicit_bzero() on server shutdown. I think this encodes too many details of how initdb works today. It seems likely that nobody adding or removing a restart will think of updating this file - nor should they have to. I'd just say that initdb starts/stops postgres multiple times. > +Initialization Vector > +--------------------- > + > +Nonce means "number used once". An Initialization Vector (IV) is a > +specific type of nonce. That is, unique but not necessarily random or > +secret, as specified by the NIST > +(https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf). > +To generate unique IVs, the NIST recommends two methods: > + > + The first method is to apply the forward cipher function, under > + the same key that is used for the encryption of the plaintext, > + to a nonce. The nonce must be a data block that is unique to > + each execution of the encryption operation. For example, the > + nonce may be a counter, as described in Appendix B, or a message > + number. The second method is to generate a random data block > + using a FIPS-approved random number generator. > + > +We will use the first method to generate IVs. That is, select nonce > +carefully and use a cipher with the key to make it unique enough to use > +as an IV. The nonce selection for buffer encryption and WAL encryption > +are described below. > + > +If the IV was used more than once with the same key (and we only use one > +data encryption key), changes in the unencrypted data would be visible > +in the encrypted data. > + > +IV for Heap/Index Encryption > +- - - - - - - - - - - - - - > + > +To create the 16-byte IV needed by AES for each page version, we will > +use the page LSN (8 bytes) and page number (4 bytes). I still am quite quite unconvinced that using the LSN as a nonce is a good design decision. - LSNs can end up being reused after crash restarts - the LSN does not contain the timeline ID, if a standby is promoted, two systems can be using the same LSNs - The LSN does *NOT* actually change every time a page is modified. Even with wal_log_hint_bits, only the first hint bit modification to a page in a checkpoint cycles will cause WAL writes - and changing that would have a quite substantial overhead. > The LSN is ideal for use in the IV because it is +always increasing, and is > changed every time a page is updated. The +same LSN is never used for two > relations with different page contents. As mentioned above, this is not true - the LSN does *NOT* change every time the page is updated. > +However, the same LSN can be used in multiple pages in the same relation > +--- this can happen when a heap update expires an old tuple and adds a > +new tuple to another page. By adding the page number to the IV, we keep > +the IV unique. There's many other ways that can happen. > +CREATE DATABASE can be run with two different strategies: FILE_COPY or > +WAL_LOG. If using WAL_LOG, the heap/index files are automatically > +rewritten with new LSNs as part of the copy operation and will get new > +IVs automatically. > + > +This approach still works with the older FILE_COPY stragegy; by not > +using the database id in the IV, CREATE DATABASE can copy the heap/index > +files from the old database to a new one without decryption/encryption. > +Both page copies are valid. Once a database changes its pages, it gets > +new LSNs, and hence new IV. > + > +As part of WAL logging, every change of a WAL-logged page gets a new > +LSN, and therefore a new IV automatically. > + > +However, the LSN must then be visible on encrypted pages, so we will not > +encrypt the LSN on the page. We will also not encrypt the CRC so > +pg_checksums can still check pages offline without access to the keys. s/crc/checksum/? The data-page-level checksum isn't a CRC. > +Non-Permanent Relations > +- - - - - - - - - - - - > + > +To avoid the overhead of generating WAL for non-permanent (unlogged and > +temporary) relations, we assign fake LSNs that are derived from a > +counter via xlog.c::GetFakeLSNForUnloggedRel(). (GiST also uses this > +counter for LSNs.) We also set a bit in the IV so the use of the same > +value for WAL (real) and fake LSNs will still generate unique IVs. Only > +main forks are encrypted, not init, vm, or fsm files. Why aren't other forks encrypted? This seems like a very odd design to me. > +Hint Bits > +- - - - - > + > +For hint bit changes, the LSN normally doesn't change, which is a > +problem. By enabling wal_log_hints, you get full page writes to the WAL > +after the first hint bit change of the checkpoint. This is useful for > +two reasons. First, it generates a new LSN, which is needed for the IV > +to be secure. Second, full page images protect against torn pages, > +which is an even bigger requirement for encryption because the new LSN > +is re-encrypting the entire page, not just the hint bit changes. You > +can safely lose the hint bit changes, but you need to use the same LSN > +to decrypt the entire page, so a torn page with an LSN change cannot be > +decrypted. To prevent this, wal_log_hints guarantees that the > +pre-hint-bit version (and previous LSN version) of the page is restored. > + > +However, if a hint-bit-modified page is written to the file system > +during a checkpoint, and there is a later hint bit change switching the > +same page from clean to dirty during the same checkpoint, we need a new > +LSN, and wal_log_hints doesn't give us a new LSN here. The fix for this > +is to update the page LSN by writing a dummy WAL record via > +xloginsert.c::LSNForEncryption() in such cases. Ugh, so that's really the plan. That's a substantial overhead in some common scenarios. ... Greetings, Andres Freund
pgsql-hackers by date: