Re: Moving forward with TDE [PATCH v3] - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Moving forward with TDE [PATCH v3]
Date
Msg-id 20231103023228.gz32vjx67jwzfh6h@alap3.anarazel.de
Whole thread Raw
In response to Re: Moving forward with TDE [PATCH v3]  (David Christensen <david.christensen@crunchydata.com>)
Responses Re: Moving forward with TDE [PATCH v3]
Re: Moving forward with TDE [PATCH v3]
Re: Moving forward with TDE [PATCH v3]
Re: Moving forward with TDE [PATCH v3]
List pgsql-hackers
Hi,

On 2023-10-31 16:23:17 -0500, David Christensen wrote:
> The patches are as follows:
>
> 0001 - doc updates
> 0002 - Basic key management and cipher support
> 0003 - Backend-related changes to support heap encryption
> 0004 - modifications to bin tools and programs to manage key rotation and
> add other knowledge
> 0005 - Encrypted/authenticated WAL
>
> These are very broad strokes at this point and should be split up a bit
> more to make things more granular and easier to review, but I wanted to get
> this update out.

Yes, particularly 0003 really needs to be split - as is it's not easily
reviewable.



> From 327e86d52be1df8de9c3a324cb06b85ba5db9604 Mon Sep 17 00:00:00 2001
> From: David Christensen <david@pgguru.net>
> Date: Fri, 29 Sep 2023 15:16:00 -0400
> Subject: [PATCH v3 5/5] Add encrypted/authenticated WAL
>
> When using an encrypted cluster, we need to ensure that the WAL is also
> encrypted. While we could go with an page-based approach, we use instead a
> per-record approach, using GCM for the encryption method and storing the AuthTag
> in the xl_crc field.
>
> We change the xl_crc field to instead be a union struct, with a compile-time
> adjustable size to allow us to customize the number of bytes allocated to the
> GCM authtag.  This allows us to easily adjust the size of bytes needed to
> support our authentication.  (Testing has included up to 12, but leaving at this
> point to 4 due to keeping the size of the WAL records the same.)

Ugh, that'll be quite a bit of overhead in some workloads... You can't really
use such a build for non-encrypted workloads, making this a not very
deployable path...


> @@ -905,20 +905,28 @@ XLogInsertRecord(XLogRecData *rdata,
>      {
>          /*
>           * Now that xl_prev has been filled in, calculate CRC of the record
> -         * header.
> +         * header.  If we are using encrypted WAL, this CRC is overwritten by
> +         * the authentication tag, so just zero
>           */
> -        rdata_crc = rechdr->xl_crc;
> -        COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc));
> -        FIN_CRC32C(rdata_crc);
> -        rechdr->xl_crc = rdata_crc;
> +        if (!encrypt_wal)
> +        {
> +            rdata_crc = rechdr->xl_integrity.crc;
> +            COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_integrity.crc));
> +            FIN_CRC32C(rdata_crc);
> +            rechdr->xl_integrity.crc = rdata_crc;
> +        }
> +        else
> +            memset(&rechdr->xl_integrity, 0, sizeof(rechdr->xl_integrity));

Why aren't you encrypting most of the data here?  Just as for CRC computation,
encrypting a large record in XLOG_BLCKSZ


>   * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
>   * used to distinguish between block references, and the main data structs.
>   */
> +
> +#define XL_AUTHTAG_SIZE 4
> +#define XL_HEADER_PAD 2
> +
>  typedef struct XLogRecord
>  {
>      uint32        xl_tot_len;        /* total len of entire record */
> @@ -45,14 +49,16 @@ typedef struct XLogRecord
>      XLogRecPtr    xl_prev;        /* ptr to previous record in log */
>      uint8        xl_info;        /* flag bits, see below */
>      RmgrId        xl_rmid;        /* resource manager for this record */
> -    /* 2 bytes of padding here, initialize to zero */
> -    pg_crc32c    xl_crc;            /* CRC for this record */
> -
> +    uint8       xl_pad[XL_HEADER_PAD];        /* required alignment padding */

What does "required" mean here? And why is this defined in a separate define?
And why do we need the explicit field here at all? The union will still
provide sufficient alignment for a uint32.

It also doesn't seem right to remove the comment about needing to zero out the
space.


> From 7557acf60f52da4a86fd9f902bab4804c028dd4b Mon Sep 17 00:00:00 2001
> From: David Christensen <david.christensen@crunchydata.com>
> Date: Tue, 31 Oct 2023 15:24:02 -0400
> Subject: [PATCH v3 3/5] Backend-related changes




> --- a/src/backend/access/heap/rewriteheap.c
> +++ b/src/backend/access/heap/rewriteheap.c
> @@ -323,6 +323,11 @@ end_heap_rewrite(RewriteState state)
>                          state->rs_buffer,
>                          true);
>
> +        PageEncryptInplace(state->rs_buffer, MAIN_FORKNUM,
> +                           RelationIsPermanent(state->rs_new_rel),
> +                           state->rs_blockno,
> +                           RelationGetSmgr(state->rs_new_rel)->smgr_rlocator.locator.relNumber
> +            );
>          PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
>
>          smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,

I don't think it's ok to have to make such changes in a bunch of places. I
think we need to add an abstraction layer between smgr and code like this
first.  Luckily I think Heikki was working abstracting away some of these
direct smgr* uses...



> +Architecture
> +------------
> +
> +Fundamentally, cluster file encryption must store data in a file system
> +in such a way that the keys required to decrypt the file system data can
> +only be accessed using somewhere outside of the file system itself.  The
> +external requirement can be someone typing in a passphrase, getting a
> +key from a key management server (KMS), or decrypting a key stored in
> +the file system using a hardware security module (HSM).  The current
> +architecture supports all of these methods, and includes sample scripts
> +for them.
> +
> +The simplest method for accessing data keys using some external
> +requirement would be to retrieve all data encryption keys from a KMS.
> +However, retrieved keys would still need to be verified as valid.  This
> +method also introduces unacceptable complexity for simpler use-cases,
> +like user-supplied passphrases or HSM usage.  External key rotation
> +would also be very hard since it would require re-encrypting all the
> +file system data with the new externally-stored keys.
> +
> +For these reason, a two-tiered architecture is used, which uses two
> +types of encryption keys: a key encryption key (KEK) and data encryption
> +keys (DEK). The KEK should not be present unencrypted in the file system
> +--- it should be supplied the user, stored externally (e.g., in a KMS)

*by* the user?

> +or stored in the file system encrypted with a HSM (e.g., PIV device).
> +The DEK is used to encrypt database files and is stored in the same file
> +system as the database but is encrypted using the KEK.  Because the DEK
> +is encrypted, its storage in the file system is no more of a security
> +weakness and the storage of the encrypted database files in the same
> +file system.

As is this paragraph doesn't really follow from the prior paragraph for
me. That a KMS would be hard to use isn't obviously related to splitting the
KEK and the DEK.



> +Implementation
> +--------------
> +
> +To enable cluster file encryption, the initdb option
> +--cluster-key-command must be used, which specifies a command to
> +retrieve the KEK.

FWIW, I think "cluster file encryption" is somewhat ambiguous. That could also
mean encrypting on the file system level or such.


> initdb records the cluster_key_command in
> +postgresql.conf.  Every time the KEK is needed, the command is run and
> +must return 64 hex characters which are decoded into the KEK.  The
> +command is called twice during initdb, and every time the server starts.
> +initdb also sets the encryption method in controldata during server
> +bootstrap.
> +
> +initdb runs "postgres --boot", which calls function
> +kmgr.c::BootStrapKmgr(), which calls the cluster key command.  The
> +cluster key command returns a KEK which is used to encrypt random bytes
> +for each DEK and writes them to the file system by
> +kmgr.c::KmgrWriteCryptoKeys() (unless --copy-encryption-keys is used).
> +Currently the DEK files are 0 and 1 and are stored in
> +$PGDATA/pg_cryptokeys/live.  The wrapped DEK files use Key Wrapping with
> +Padding which verifies the validity of the KEK.
> +
> +initdb also does a non-boot backend start which calls
> +kmgr.c::InitializeKmgr(), which calls the cluster key command a second
> +time.  This decrypts/unwraps the DEK keys and stores them in the shared
> +memory structure KmgrShmem. This step also happens every time the server
> +starts. Later patches will use the keys stored in KmgrShmem to
> +encrypt/decrypt database files.  KmgrShmem is erased via
> +explicit_bzero() on server shutdown.

I think this encodes too many details of how initdb works today. It seems
likely that nobody adding or removing a restart will think of updating this
file - nor should they have to.  I'd just say that initdb starts/stops
postgres multiple times.



> +Initialization Vector
> +---------------------
> +
> +Nonce means "number used once". An Initialization Vector (IV) is a
> +specific type of nonce. That is, unique but not necessarily random or
> +secret, as specified by the NIST
> +(https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf).
> +To generate unique IVs, the NIST recommends two methods:
> +
> +    The first method is to apply the forward cipher function, under
> +    the same key that is used for the encryption of the plaintext,
> +    to a nonce. The nonce must be a data block that is unique to
> +    each execution of the encryption operation. For example, the
> +    nonce may be a counter, as described in Appendix B, or a message
> +    number. The second method is to generate a random data block
> +    using a FIPS-approved random number generator.
> +
> +We will use the first method to generate IVs. That is, select nonce
> +carefully and use a cipher with the key to make it unique enough to use
> +as an IV. The nonce selection for buffer encryption and WAL encryption
> +are described below.
> +
> +If the IV was used more than once with the same key (and we only use one
> +data encryption key), changes in the unencrypted data would be visible
> +in the encrypted data.
> +
> +IV for Heap/Index Encryption
> +- - - - - - - - - - - - - -
> +
> +To create the 16-byte IV needed by AES for each page version, we will
> +use the page LSN (8 bytes) and page number (4 bytes).

I still am quite quite unconvinced that using the LSN as a nonce is a good
design decision.

- LSNs can end up being reused after crash restarts
- the LSN does not contain the timeline ID, if a standby is promoted, two
  systems can be using the same LSNs
- The LSN does *NOT* actually change every time a page is modified. Even with
  wal_log_hint_bits, only the first hint bit modification to a page in a
  checkpoint cycles will cause WAL writes - and changing that would have
  a quite substantial overhead.


>  The LSN is ideal for use in the IV because it is +always increasing, and is
> changed every time a page is updated.  The +same LSN is never used for two
> relations with different page contents.

As mentioned above, this is not true - the LSN does *NOT* change every time
the page is updated.


> +However, the same LSN can be used in multiple pages in the same relation
> +--- this can happen when a heap update expires an old tuple and adds a
> +new tuple to another page.  By adding the page number to the IV, we keep
> +the IV unique.

There's many other ways that can happen.


> +CREATE DATABASE can be run with two different strategies: FILE_COPY or
> +WAL_LOG.  If using WAL_LOG, the heap/index files are automatically
> +rewritten with new LSNs as part of the copy operation and will get new
> +IVs automatically.
> +
> +This approach still works with the older FILE_COPY stragegy; by not
> +using the database id in the IV, CREATE DATABASE can copy the heap/index
> +files from the old database to a new one without decryption/encryption.
> +Both page copies are valid.  Once a database changes its pages, it gets
> +new LSNs, and hence new IV.
> +
> +As part of WAL logging, every change of a WAL-logged page gets a new
> +LSN, and therefore a new IV automatically.
> +
> +However, the LSN must then be visible on encrypted pages, so we will not
> +encrypt the LSN on the page. We will also not encrypt the CRC so
> +pg_checksums can still check pages offline without access to the keys.

s/crc/checksum/? The data-page-level checksum isn't a CRC.


> +Non-Permanent Relations
> +- - - - - - - - - - - -
> +
> +To avoid the overhead of generating WAL for non-permanent (unlogged and
> +temporary) relations, we assign fake LSNs that are derived from a
> +counter via xlog.c::GetFakeLSNForUnloggedRel().  (GiST also uses this
> +counter for LSNs.)  We also set a bit in the IV so the use of the same
> +value for WAL (real) and fake LSNs will still generate unique IVs.  Only
> +main forks are encrypted, not init, vm, or fsm files.

Why aren't other forks encrypted? This seems like a very odd design to me.


> +Hint Bits
> +- - - - -
> +
> +For hint bit changes, the LSN normally doesn't change, which is a
> +problem.  By enabling wal_log_hints, you get full page writes to the WAL
> +after the first hint bit change of the checkpoint.  This is useful for
> +two reasons.  First, it generates a new LSN, which is needed for the IV
> +to be secure.  Second, full page images protect against torn pages,
> +which is an even bigger requirement for encryption because the new LSN
> +is re-encrypting the entire page, not just the hint bit changes.  You
> +can safely lose the hint bit changes, but you need to use the same LSN
> +to decrypt the entire page, so a torn page with an LSN change cannot be
> +decrypted.  To prevent this, wal_log_hints guarantees that the
> +pre-hint-bit version (and previous LSN version) of the page is restored.
> +
> +However, if a hint-bit-modified page is written to the file system
> +during a checkpoint, and there is a later hint bit change switching the
> +same page from clean to dirty during the same checkpoint, we need a new
> +LSN, and wal_log_hints doesn't give us a new LSN here.  The fix for this
> +is to update the page LSN by writing a dummy WAL record via
> +xloginsert.c::LSNForEncryption() in such cases.

Ugh, so that's really the plan. That's a substantial overhead in some common
scenarios.

...

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Japin Li
Date:
Subject: Re: Tab completion regression test failed on illumos
Next
From: Peter Smith
Date:
Subject: Re: A recent message added to pg_upgade