Re: [PATCHES] Post-special page storage TDE support - Mailing list pgsql-hackers

From David Christensen
Subject Re: [PATCHES] Post-special page storage TDE support
Date
Msg-id CAOxo6XJSXQJmg0guCn=tN=6PdDgYB_jV6dA1esEKPX97_v0Drg@mail.gmail.com
Whole thread Raw
In response to Re: [PATCHES] Post-special page storage TDE support  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> From: David Christensen <david@pgguru.net>
> Date: Tue, 9 May 2023 16:56:15 -0500
> Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>
> This space is reserved for extended data on the Page structure which will be ultimately used for
> encrypted data, extended checksums, and potentially other things.  This data appears at the end of
> the Page, after any `pd_special` area, and will be calculated at runtime based on specific
> ControlFile features.
>
> No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
> we will require logical replication to move data into a cluster with
> different settings here.

The first part of the last paragraph makes it sound like pg_upgrade won't be
supported across this commit, rather than just between different settings...

Thanks for the review.
 
I think as a whole this is not an insane idea. A few comments:

- IMO the patch touches many places it shouldn't need to touch, because of
  essentially renaming a lot of existing macro names to *Limit,
  necessitating modifying a lot of users. I think instead the few places that
  care about the runtime limit should be modified.

  As-is the patch would cause a lot of fallout in extensions that just do
  things like defining an on-stack array of Datums or such - even though all
  they'd need is to change the define to the *Limit one.

  Even leaving extensions aside, it must makes reviewing (and I'm sure
  maintaining) the patch very tedious.

You make a good point, and I think you're right that we could teach the places that care about runtime vs compile time differences about the changes while leaving other callers alone. The *Limit ones were introduced since we need constant values here from the Calc...() macros, but could try keeping the existing *Limit with the old name and switching things around. I suspect there will be the same amount of code churn, but less mechanical.
 
- I'm a bit worried about how the extra special page will be managed - if
  there are multiple features that want to use it, who gets to put their data
  at what offset?

  After writing this I saw that 0002 tries to address this - but I don't like
  the design. It introduces runtime overhead that seems likely to be visible.

Agreed this could be optimized.
 
- Checking for features using PageGetFeatureOffset() seems the wrong design to
  me - instead of a branch for some feature being disabled, perfectly
  predictable for the CPU, we need to do an external function call every time
  to figure out that yet, checksums are *still* disabled.

This is probably not a supported approach (it felt a little icky), but I'd played around with const pointers to structs of const elements, where the initial values of a global var was populated early on (so set once and never changed post init), and the compiler didn't complain and things seemed to work ok; not sure if this approach might help balance the early mutability and constant lookup needs:

typedef struct PageFeatureOffsets {
  const Size feature0offset;
  const Size feature1offset;
  ...
} PageFeatureOffsets;
 
PageFeatureOffsets offsets = {0};
const PageFeatureOffsets *exposedOffsets = &offsets;

void InitOffsets() {
    *((Size*)&offsets.feature0offset) = ...;
    *((Size*)&offsets.feature1offset) = ...;
...
}

- Recomputing offsets every time in PageGetFeatureOffset() seems too
  expensive. The offsets can't change while running as PageGetFeatureOffset()
  have enough information to distinguish between different kinds of relations

Yes, this was a simple approach for ease of implementation; there is certainly a way to precompute a lookup table from the page feature bitmask into the offsets themselves or otherwise precompute, turn from function call into inline/macro, etc.
 
  - so why do we need to recompute offsets on every single page?  I'd instead
  add a distinct offset variable for each feature.

This would work iff there is a single page feature set across all pages in the cluster; I'm not sure we don't want more flexibility here.
 
- Modifying every single PageInit() call doesn't make sense to me. That'll
  just create a lot of breakage for - as far as I can tell - no win.

This was a placeholder to allow different features depending on page type; to keep things simple for now I just used the same values here, but we could move this inside PageInit() instead (again, assuming single feature set per cluster).
 
- Why is it worth sacrificing space on every page to indicate which features
  were enabled?  I think there'd need to be some convincing reasons for
  introducing such overhead.

The point here is if we can use either GCM authtag or stronger checksums then we've gained the ability to authenticate the page contents at the cost of reassigning those bits, in a way that would support variable permutations of features for different relations or page types, if so desired.  A single global setting here both eliminates that possibility as well as requires external data in order to fully interpret pages.
 
- Is it really useful to encode the set of features enabled in a cluster with
  a bitmask? That pretty much precludes utilizing extra page space in
  extensions. We could instead just have an extra cluster-wide file that
  defines a mapping of offset to feature.

Given the current design, yes we do need that, which does make it harder to allocate/use from an extension. Due to needing to have consistent offsets for a given feature set (however represented on a page), the implementation load going forward as-is involves ensuring that a given bit always maps to the same offset in the page regardless of additional features available in the future. So the 0'th bit if enabled would always map to the 8 byte chunk at the end of the page, the 1st bit corresponds to some amount of space prior to that, etc.  I'm not sure how to get that property without some sort of bitmap or otherwise indexed operation.

I get what you're saying as far as the more global approach, and while that does lend itself to some nice properties in terms of extensibility, some of the features (GCM tags in particular) need to be able to control the page offset at a consistent location so we can decode the rest of the page without knowing anything else.
 
Additionally, since the reserved space/page features are configured at initdb time I am unclear how a given extension would even be able to stake a claim here.  ...though if we consider this a two-part problem, one of space reservation and one of space usage, that part could be handled via allocating more than the minimum in the reserved_page_space and allowing unallocated page space to be claimed later via some sort of additional functions/other hook.  That opens up other questions though, tracking whether said space has ever been initialized and what to do when first accessing existing/new pages as one example.

Best,

David

pgsql-hackers by date:

Previous
From: "a.rybakina"
Date:
Subject: Re: POC, WIP: OR-clause support for indexes
Next
From: Laurenz Albe
Date:
Subject: Re: should check collations when creating partitioned index