Re: Moving forward with TDE [PATCH v3] - Mailing list pgsql-hackers

From David Christensen
Subject Re: Moving forward with TDE [PATCH v3]
Date
Msg-id CAOxo6X+RnY4zh=2M5V5NnurG1B8Xin-T838BOk7jpbS_bYtgLg@mail.gmail.com
Whole thread Raw
In response to Re: Moving forward with TDE [PATCH v3]  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Tue, Nov 7, 2023 at 6:47 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-11-06 11:26:44 +0100, Matthias van de Meent wrote:
> On Sat, 4 Nov 2023 at 03:38, Andres Freund <andres@anarazel.de> wrote:
> > On 2023-11-02 22:09:40 +0100, Matthias van de Meent wrote:
> > > I'm quite surprised at the significant number of changes being made
> > > outside the core storage manager files. I thought that changing out
> > > mdsmgr with an encrypted smgr (that could wrap mdsmgr if so desired)
> > > would be the most obvious change to implement cluster-wide encryption
> > > with the least code touched, as relations don't need to know whether
> > > the files they're writing are encrypted, right? Is there a reason to
> > > not implement this at the smgr level that I overlooked in the
> > > documentation of these patches?
> >
> > You can't really implement encryption transparently inside an smgr without
> > significant downsides. You need a way to store an initialization vector
> > associated with the page (or you can store that elsewhere, but then you've
> > doubled the worst cse amount of random reads/writes). The patch uses the LSN
> > as the IV (which I doubt is a good idea). For authenticated encryption further
> > additional storage space is required.
>
> I am unaware of any user of the smgr API that doesn't also use the
> buffer cache, and thus implicitly the Page layout with PageHeader
> [^1]

Everything indeed uses a PageHeader - but there are a number of places that do
*not* utilize pd_lower/upper/special. E.g. visibilitymap.c just assumes that
those fields are zero - and changing that wouldn't be trivial / free, because
we do a lot of bitmasking/shifting with constants derived from

#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))

which obviously wouldn't be constant anymore if you could reserve space on the
page.

While not constants, I was able to get this working with variable values here in a way that did not have the overhead of the original patch for the vismap specifically using Montgomery Multiplication for division/mod. This was actually the heaviest of the changes from moving to runtime-calculated, so we might be able to use this approach in this specific case even if only this change is required for this specific fork.
 
> The API of smgr is also tailored to page-sized quanta of data
> with mostly relation-level information. I don't see why there would be
> a veil covering the layout of Page for smgr when all other information
> already points to the use of PageHeader and Page layouts. In my view,
> it would even make sense to allow the smgr to get exclusive access to
> some part of the page in the current Page layout.
>
> Yes, I agree that there will be an impact on usable page size if you
> want authenticated encryption, and that AMs will indeed need to
> account for storage space now being used by the smgr - inconvenient,
> but it serves a purpose. That would happen regardless of whether smgr
> or some higher system decides where to store the data for encryption -
> as long as it is on the page, the AM effectively can't use those
> bytes.
> But I'd say that's best solved by making the Page documentation and
> PageInit API explicit about the potential use of that space by the
> chosen storage method (encrypted, plain, ...) instead of requiring the
> various AMs to manually consider encryption when using Postgres' APIs
> for writing data to disk without hitting shared buffers; page space
> management is already a task of AMs, but handling the actual
> encryption is not.

I don't particularly disagree with any detail here - but to me reserving space
for nonces etc at PageInit() time pretty much is the opposite of handling
encryption inside smgr.

Originally, I was anticipating that we might want different space amounts reserved on different classes of pages (apart from encryption), so while we'd be storing the default page reserved size in pg_control we'd not be limited to this in the structure of the page calls.  We could presumably just move the logic into PageInit() itself if every reserved allocation is the same and individual call sites wouldn't need to know about it.  The call sites do have more context as to the requirements of the page or the "type" of page in play, which if we made it dependent on page type would need to get passed in somehow, which was where the reserved_page_size parameter came in to the current patch.

> Should the AM really care whether the data on disk is encrypted or
> not? I don't think so. When the disk contains encrypted bytes, but
> smgrread() and smgrwrite() both produce and accept plaintext data,
> who's going to complain? Requiring AMs to be mindful about encryption
> on all common paths only adds pitfalls where encryption would be
> forgotten by the developer of AMs in one path or another.

I agree with that - I think the way the patch currently is designed is not
right.

The things that need to care tend to be the same places that need to care about setting checksums, that or being aware of the LSNs in play or needing to be set.  I'd agree that a common interface for "get this page ready for writing to storage" and "get this page converted from storage" which could handle both checksums, encryption, or additional page features would make sense. (I doubt we'd want hooks to support page in/page out, but if we /did/ want that, this'd also likely live there.)

There's other stuff you can't trivially do at the smgr level. E.g. if
checksums or encryption is enabled, you need to copy the buffer to compute
checksums / do IO if in shared buffers, because somebody could set a hint bit
even with just a shared content lock. But you don't need that when coming from
private buffers during index builds.



> I think that getting PageInit to allocate the smgr-specific area would
> take some effort, too (which would potentially require adding some
> relational context to PageInit, so that it knows which page of which
> relation it is going to initialize), but IMHO that would be more
> natural than requiring all index and table AMs to be aware the actual
> encryption of its pages and require manual handling of that encryption
> when the page needs to be written to disk, when it otherwise already
> conforms to the various buffer management and file extension APIs
> currently in use in PostgreSQL. I would expect "transparent" data
> encryption to be handled at the file write layer (i.e. smgr), not
> inside the AMs.

As mentioned above - I agree that the relevant code shouldn't be in index
AMs. But I somewhat doubt that smgr is the right level either. For one, the
place computing checksums needs awareness of locking / sharing semantics as
well as knowledge about WAL logging. That's IMO above smgr. For another, if we
ever got another smgr implementation - should it have to reimplement
encryption?

ISTM that there's a layer missing. Places bypassing bufmgr.c currently need
their own handling of checksums (and in the future encryption), because the
relevant bufmgr.c code can't be reached without pages in the buffer
pool. Which is why we have PageIsVerifiedExtended() and
PageSetChecksumInplace() calls in gist, hash, heapam, nbtree ... IMO when
checksums were added, we should have added the proper abstraction layer
instead of littering the code with redundant copies.

Agreed that a fixup patch to add /something/ here would be good.  A concern here is what context would need to be passed in; certainly with only checksums just a Page is sufficient, but if we have AAD we'd need to be able to pass that in or otherwise be able to identify it in the page. As a point of references, the existing GCM patch authenticates all unencrypted header fields (i.e., PageHeaderData up to the pd_special field), plus the RelFileNumber and BlockNumber, of which we'd need to pass in RelFileNumber and BlockNumber (and presumably would want the ForkNum as well if expanding the set of authenticated data).  To some extent, we can punt on some of this, as the existing call sites have been modified in this patch to pass that info in already, so it's really about how we marshal that data.

Thanks,

David

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500
Next
From: Peter Smith
Date:
Subject: Re: GUC names in messages