[PATCHES] Post-special page storage TDE support

From: David Christensen
Hi -hackers,

An additional piece that I am working on for improving infra for TDE
features is allowing the storage of additional per-page data.  Rather
than hard-code the idea of a specific struct, this is utilizing a new,
more dynamic structure to associate page offsets with a particular
feature that may-or-may-not be present for a given cluster.  I am
calling this generic structure a PageFeature/PageFeatureSet (better
names welcome), which is defined for a cluster at initdb/bootstrap
time, and reserves a given amount of trailing space on the Page which
is then parceled out to the consumers of said space.

While the immediate need that this feature fills is storage of
encryption tags for XTS-based encryption on the pages themselves, this
can also be used for any optional features; as an example I have
implemented expanded checksum support (both 32- and 64-bit), as well
as a self-description "wasted space" feature, which just allocates
trailing space from the page (obviously intended as illustration
only).

There are 6 commits in this series:

0001 - adds `reserved_page_space` global, making various size
calculations and limits dynamic, adjusting access methods to offset
special space, and ensuring that we can safely reserve allocated space
from the end of pages.

0002 - test suite stability fixes - the change in the number of tuples
per page breaks some tests that assumed a particular row ordering

0003 - the "PageFeatures" commit, the meat of this feature (see
following description)

0004 - page_checksum32 feature - store the full 32-bit checksum across
the existing pd_checksum field as well as 2 bytes from
reserved_page_space.  This is more of a demo of what could be done
here than a practical feature.

0005 - wasted space PageFeature - just use up space.  An additional
feature we can turn on/off to see how multiple features interact.
Only for illustration.

0006 - 64-bit checksums - fully allocated from reserved_page_space.
Using an MIT-licensed 64-bit checksum, but if we determined we'd want
to do this we'd probably roll our own.

From the commit message for PageFeatures:

Page features are a standardized way of assigning and using dynamic
space from the tail end of a disk page.  These features are set at
cluster init time (so configured via `initdb` and initialized via the
bootstrap process) and affect all disk pages.

A PageFeatureSet is effectively a bitmask of all configured features,
each of which has a fixed size.  If no PageFeatures are in use, the
storage overhead of this is 0.
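
For illustration, the shape of this is roughly as follows (the
identifiers and sizes here are illustrative only, not necessarily
exactly what the patch implements):

    /* each feature has a fixed ID (bit position) and a fixed size */
    typedef enum PageFeature
    {
        PF_EXT_CHECKSUM,            /* extended checksum */
        PF_ENCRYPTION_TAG,          /* per-page encryption tag */
        PF_MAX_FEATURE
    } PageFeature;

    typedef uint16 PageFeatureSet;  /* bitmask of enabled features */

    /* illustrative sizes only */
    static const Size page_feature_sizes[PF_MAX_FEATURE] = {
        8,                          /* PF_EXT_CHECKSUM */
        16,                         /* PF_ENCRYPTION_TAG */
    };

    /* total trailing space to reserve for a given feature set */
    static Size
    PageFeatureSetCalculateSize(PageFeatureSet features)
    {
        Size        size = 0;

        for (int f = 0; f < PF_MAX_FEATURE; f++)
            if (features & (1 << f))
                size += page_feature_sizes[f];

        return MAXALIGN(size);
    }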

Rather than using a variable location struct, an implementation of a
PageFeature is responsible for an offset and a length in the page.  The
current API returns only a pointer to the page location for the
implementation to manage, and no further checks are done to ensure that
only the expected memory is accessed.

Access to the underlying memory is synonymous with determining whether a
given cluster is using an underlying PageFeature, so code paths can do
something like:

    char *loc;

    if ((loc = ClusterGetPageFeatureOffset(page, PF_MY_FEATURE_ID)))
    {
        // ipso facto this feature is enabled in this cluster *and*
        // we know the memory address
        ...
    }
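
For completeness, the lookup behind that call is roughly of this shape
(again just a sketch, reusing the illustrative size table above; NULL
means the feature is not enabled):

    extern PageFeatureSet cluster_page_features;

    static char *
    ClusterGetPageFeatureOffset(Page page, PageFeature feature)
    {
        Size        offset = 0;

        if (!(cluster_page_features & (1 << feature)))
            return NULL;        /* feature not in use in this cluster */

        /* features are parceled out back-to-front from the page end */
        for (int f = 0; f <= feature; f++)
            if (cluster_page_features & (1 << f))
                offset += page_feature_sizes[f];

        return (char *) page + BLCKSZ - offset;
    }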

Since this is direct memory access to the underlying Page, ensure the
buffer is pinned.  Explicitly locking (assuming you stay in your lane)
should only need to guard against access from other backends of this
type if using shared buffers, so will be use-case dependent.

This does have a runtime overhead due to moving some offset calculations
from compile time to runtime.  It is thought that the utility of this
feature will outweigh the costs here.

Candidates for page features include 32-bit or 64-bit checksums,
encryption tags, or additional per-page metadata.

While we are not currently getting rid of the pd_checksum field, this
mechanism could be used to free up that 16 bits for some other purpose.
One such purpose might be to mirror the cluster-wide PageFeatureSet,
currently also a uint16, which would mean the entirety of this scheme
could be reflected in a given page, opening up per-relation or even
per-page settings/metadata here.  (We'd presumably need to snag a
pd_flags bit to interpret pd_checksum that way, but it would be an
interesting use.)

Discussion is welcome and encouraged!

Thanks,

David

Attachment

Re: [PATCHES] Post-special page storage TDE support

From: Julien Rouhaud
Hi,

On Mon, Oct 24, 2022 at 12:55:53PM -0500, David Christensen wrote:
>
> Explicitly
> locking (assuming you stay in your lane) should only need to guard
> against access from other
> backends of this type if using shared buffers, so will be use-case dependent.

I'm not sure what you mean here?

> This does have a runtime overhead due to moving some offset
> calculations from compile time to
> runtime. It is thought that the utility of this feature will outweigh
> the costs here.

Have you done some benchmarking to give an idea of how much overhead we're
talking about?

> Candidates for page features include 32-bit or 64-bit checksums,
> encryption tags, or additional
> per-page metadata.
>
> While we are not currently getting rid of the pd_checksum field, this
> mechanism could be used to
> free up that 16 bits for some other purpose.

IIUC there's a hard requirement of initdb-time initialization, as there's
otherwise no guarantee that you will find enough free space in each page at
runtime.  It seems like a very hard requirement for a full replacement of the
current checksum approach (even if I agree that the current implementation
limitations are far from ideal), especially since there's no technical reason
that would prevent us from dynamically enabling data-checksums without doing
all the work when the cluster is down.



Re: [PATCHES] Post-special page storage TDE support

From: David Christensen
> > Explicitly
> > locking (assuming you stay in your lane) should only need to guard
> > against access from other
> > backends of this type if using shared buffers, so will be use-case dependent.
>
> I'm not sure what you mean here?

I'm mainly pointing out that the specific code that manages this
feature is the only code that has to worry about modifying said page
region.

> > This does have a runtime overhead due to moving some offset
> > calculations from compile time to
> > runtime. It is thought that the utility of this feature will outweigh
> > the costs here.
>
> Have you done some benchmarking to give an idea of how much overhead we're
> talking about?

Not yet, but I am going to work on this.  I suspect the current code
could be improved, but will try to get some sort of measurement of the
additional overhead.

> > Candidates for page features include 32-bit or 64-bit checksums,
> > encryption tags, or additional
> > per-page metadata.
> >
> > While we are not currently getting rid of the pd_checksum field, this
> > mechanism could be used to
> > free up that 16 bits for some other purpose.
>
> IIUC there's a hard requirement of initdb-time initialization, as there's
> otherwise no guarantee that you will find enough free space in each page at
> runtime.  It seems like a very hard requirement for a full replacement of the
> current checksum approach (even if I agree that the current implementation
> limitations are far from ideal), especially since there's no technical reason
> that would prevent us from dynamically enabling data-checksums without doing
> all the work when the cluster is down.

As implemented, that is correct; we are currently assuming this
specific feature mechanism is set at initdb time only.  Checksums are
not the primary motivation here, but were something that I could use
for an immediate illustration of the feature.  That said, presumably
you could define a way to set the features per-relation (say with a
template field in pg_class) which would propagate to a relation on
rewrite, so there could be ways to handle things incrementally, were
this an overall goal.

Thanks for looking,

David



Re: [PATCHES] Post-special page storage TDE support

From: Matthias van de Meent
Hi

On Mon, 24 Oct 2022, 19:56 David Christensen, <david.christensen@crunchydata.com> wrote:
>
> Discussion is welcome and encouraged!

Did you read the related thread with related discussion from last June, "Re: better page-level checksums" [0]? In that I argued that space at the end of a page is already allocated for the AM, and that reserving variable space at the end of the page for non-AM usage is wasting the AM's performance potential.

Apart from that: Is this variable-sized 'metadata' associated with smgr infrastructure only, or is it also available for AM features? If not; then this is a strong -1. The amount of tasks smgr needs to do on a page is generally much less than the amount of tasks an AM needs to do; so in my view the AM has priority in prime page real estate, not smgr or related infrastructure.

re: PageFeatures
I'm not sure I understand the goal, nor the reasoning. Shouldn't this be part of the storage manager (smgr) implementation / can't this be part of the smgr of the relation?

re: use of pd_checksum
I mentioned this in the above-mentioned thread too, in [1], that we could use pd_checksum as an extra area marker for this storage-specific data, which would be located between pd_upper and pd_special.
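
To be a bit more concrete about what I mean (the macro names here are
mine, purely for illustration): pd_checksum would then hold the page
offset of that storage-specific area, e.g.

    /* offset stored in pd_checksum points at the storage-specific area */
    #define PageGetStorageArea(page) \
        ((char *) (page) + ((PageHeader) (page))->pd_checksum)

    /* the area runs from that offset up to pd_special */
    #define PageGetStorageAreaSize(page) \
        ((Size) (((PageHeader) (page))->pd_special - \
                 ((PageHeader) (page))->pd_checksum))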

Re: patch contents

0001:
>+ specialSize = MAXALIGN(specialSize) + reserved_page_size;

This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size), or an assertion that reserved_page_size is MAXALIGNED, would be better.

>  PageValidateSpecialPointer(Page page)
>  {
>      Assert(page);
> -    Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> +    Assert((((PageHeader) page)->pd_special - reserved_page_size) <= BLCKSZ);

This check is incorrect. With your code it would allow pd_special past the end of the block. If you want to put the reserved_space_size effectively inside the special area, this check should instead be:

+    Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size));

Or, equally valid

+    Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);

> + * +-------------+-----+------------+-----------------+
> + * | ... tuple2 tuple1 | "special space" | "reserved" |
> + * +-------------------+------------+-----------------+

Could you fix the table display if / when you revise the patchset? It seems to me that the corners don't line up with the column borders.

0002:
> Make the output of "select_views" test stable
> Changing the reserved_page_size has resulted in non-stable results for this test.

This makes sense, what kind of instability are we talking about? Are there different results for runs with the same binary, or is this across compilations?

0003 and up were not yet reviewed in depth.


Kind regards,

Matthias van de Meent


[0] https://www.postgresql.org/message-id/flat/CA%2BTgmoaCeQ2b-BVgVfF8go8zFoceDjJq9w4AFQX7u6Acfdn2uA%40mail.gmail.com#90badc63e568a89a76f51fc95f07ffaf
[1] https://postgr.es/m/CAEze2Wi5wYinU7nYxyKe_C0DRc6uWYa8ivn5%3Dzg63nKtHBnn8A%40mail.gmail.com

Re: [PATCHES] Post-special page storage TDE support

From: David Christensen
Hi Matthias,

> Did you read the related thread with related discussion from last June,
> "Re: better page-level checksums" [0]? In that I argued that space at the
> end of a page is already allocated for the AM, and that reserving variable
> space at the end of the page for non-AM usage is wasting the AM's
> performance potential.

Yes, I had read parts of that thread among others, but have given it a
re-read.  I can see the point you're making here, and agree that if we
can allocate between pd_special and pd_upper that could make sense.  I
am a little unclear as to what performance impacts for the AM there
would be if this additional space were ahead or behind the page
special area; it seems like if this is something that needs to live on
the page *somewhere* just being aligned correctly would be sufficient
from the AM's standpoint.  Considering that I am trying to make this
have zero storage impact if these features are not active, the impact
on a cluster with no additional features would be moot from a storage
perspective, no?

> Apart from that: Is this variable-sized 'metadata' associated with smgr
> infrastructure only, or is it also available for AM features? If not; then
> this is a strong -1. The amount of tasks smgr needs to do on a page is
> generally much less than the amount of tasks an AM needs to do; so in my
> view the AM has priority in prime page real estate, not smgr or related
> infrastructure.

I will confess to a slightly wobbly understanding of the delineation
of responsibility here. I was under the impression that by modifying
any consumer of PageHeaderData this would be sufficient to cover all
AMs for the types of cluster-wide options we'd be concerned about (say
extended checksums, multiple page encryption schemes, or other
per-page information we haven't yet anticipated).  Reading smgr/README
and the various access/*/README has not made the distinction clear to
me yet.

> re: PageFeatures
> I'm not sure I understand the goal, nor the reasoning. Shouldn't this be
> part of the storage manager (smgr) implementation / can't this be part of
> the smgr of the relation?

For at least the feature cases I'm anticipating, this would apply to
any disk page that may have user data, set (at least initially) at
initdb time, so should apply to any pages in the cluster, regardless
of AM.

> re: use of pd_checksum
> I mentioned this in the above-mentioned thread too, in [1], that we could
> use pd_checksum as an extra area marker for this storage-specific data,
> which would be located between pd_upper and pd_special.

I do think that we could indeed use this as an additional in-page
pointer, but at least for this version was keeping things
backwards-compatible.  Peter G (I think) also made some good points
about how to include the various status bits on the page somehow in
terms of making a page completely self-contained.

> Re: patch contents
>
> 0001:
> >+ specialSize = MAXALIGN(specialSize) + reserved_page_size;
>
> This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size),
> or an assertion that reserved_page_size is MAXALIGNED, would be better.

It is currently aligned via the space calculation return value but
agree that folding it into an assert or reworking it explicitly is
clearer.

> >  PageValidateSpecialPointer(Page page)
> >  {
> >      Assert(page);
> > -    Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> > +    Assert((((PageHeader) page)->pd_special - reserved_page_size) <= BLCKSZ);
>
> This check is incorrect. With your code it would allow pd_special past the
> end of the block. If you want to put the reserved_space_size effectively
> inside the special area, this check should instead be:
>
> +    Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size));
>
> Or, equally valid
>
> +    Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);

Yup, I think I inverted my logic there; thanks.

> > + * +-------------+-----+------------+-----------------+
> > + * | ... tuple2 tuple1 | "special space" | "reserved" |
> > + * +-------------------+------------+-----------------+
>
> Could you fix the table display if / when you revise the patchset? It seems
> to me that the corners don't line up with the column borders.

Sure thing.

> 0002:
> > Make the output of "select_views" test stable
> > Changing the reserved_page_size has resulted in non-stable results for this test.
>
> This makes sense, what kind of instability are we talking about? Are there
> different results for runs with the same binary, or is this across
> compilations?

When running with the same compilation/initdb settings, the test
results are stable, but they differ depending on what options you
chose, so `make installcheck` output will fail when testing a cluster
with different options vs upstream HEAD without these patches, etc.

> 0003 and up were not yet reviewed in depth.

Thanks, I appreciate the feedback so far.



Re: [PATCHES] Post-special page storage TDE support

From: Matthias van de Meent
On Sat, 29 Oct 2022 at 00:25, David Christensen
<david.christensen@crunchydata.com> wrote:
>
> Hi Matthias,
>
> > Did you read the related thread with related discussion from last June,
> > "Re: better page-level checksums" [0]? In that I argued that space at the
> > end of a page is already allocated for the AM, and that reserving variable
> > space at the end of the page for non-AM usage is wasting the AM's
> > performance potential.
>
> Yes, I had read parts of that thread among others, but have given it a
> re-read.  I can see the point you're making here, and agree that if we
> can allocate between pd_special and pd_upper that could make sense.  I
> am a little unclear as to what performance impacts for the AM there
> would be if this additional space were ahead or behind the page
> special area; it seems like if this is something that needs to live on
> the page *somewhere* just being aligned correctly would be sufficient
> from the AM's standpoint.

It would be sufficient, but it is definitely suboptimal. See [0] as a
patch that is being held back by putting stuff behind the special
area.

I don't really care much about the storage layout on-disk, but I do
care that AMs have efficient access to their data. For the page header,
line pointers, and special area, that is currently guaranteed by the
page layout. However, for the special area, the currently guaranteed
offset of (BLCKSZ - MAXALIGN(sizeof(IndexOpaque))) would get broken, as
there would be more space in the special area than the AM expects.
Right now, our index AMs are doing pointer chasing during special area
lookups for no good reason, but with the patch it would be required. I
don't like that at all.
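
To spell out what I mean by pointer chasing (the names below are made
up for illustration): today an AM could in principle find its special
area at a compile-time constant offset, whereas with a run-time
reserved area behind it, it must go through the page header every time:

    /* possible today: compile-time constant offset */
    #define BTPageGetOpaqueFixed(page) \
        ((BTPageOpaque) ((char *) (page) + BLCKSZ - \
                         MAXALIGN(sizeof(BTPageOpaqueData))))

    /* required with reserved space behind the special area */
    #define BTPageGetOpaqueDynamic(page) \
        ((BTPageOpaque) ((char *) (page) + ((PageHeader) (page))->pd_special))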

I understand that it is a requirement to store this reserved space in a
fixed place on the on-disk page (you must know where the checksum is at
a static place on the page, otherwise you'd potentially mis-validate a
page), but that requirement is not there for in-memory storage. I think
it's a small price to pay to swap the fields around during R/W
operations - the largest special area is currently 24 bytes, and the
proposals I've seen for this extra storage area would not need it to be
actually filled with data whilst the page is being used by the AM (the
checksum could be zeroed during in-memory operations and set during
writeback; same with all other fields I can imagine the storage system
using).
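
A minimal sketch of the kind of swap I have in mind (assuming both
areas are MAXALIGN'd; the name and exact mechanics are illustrative
only):

    static void
    swap_trailing_areas(Page page, Size inner_size, Size trailing_size)
    {
        char   *end = (char *) page + BLCKSZ;
        char    tmp[BLCKSZ];    /* scratch buffer */

        /* stash the area currently at the very end of the page */
        memcpy(tmp, end - trailing_size, trailing_size);
        /* slide the inner area to the very end */
        memmove(end - inner_size, end - trailing_size - inner_size, inner_size);
        /* and put the stashed area just in front of it */
        memcpy(end - trailing_size - inner_size, tmp, trailing_size);
    }

On read you'd call it with (special_size, reserved_size) to move the
AM's special area back to the page end, and with the arguments swapped
before writeback.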

> Considering that I am trying to make this
> have zero storage impact if these features are not active, the impact
> on a cluster with no additional features would be moot from a storage
> perspective, no?

The issue is that I'd like to eliminate the redirection from the page
header in the hot path. Currently, we can do that, and pd_special
would be little more than a hint to the smgr and pd_linp code that
that area is special and reserved for this access method's private
data, so that it is not freed. If you stick something extra in there,
it's not special for the AM's private data, and the AM won't be able
to use pd_special for similar uses as pd_linp+pd_lower. I'd rather
have the storage system use its own not-special area, choreographed
by e.g. a reuse of pd_checksum for one more page offset. Swapping the
fields around between on-disk and in-memory doesn't need to be an
issue, as special areas are rarely very large.

Every index type we support utilizes the special area. Wouldn't those
in-memory operations have priority on this useful space, as opposed to
a storage system that maybe will be used in new clusters, and even
then only during R/W operations to disk (each at most once for N
memory operations)?

> > Apart from that: Is this variable-sized 'metadata' associated with smgr
> > infrastructure only, or is it also available for AM features? If not; then
> > this is a strong -1. The amount of tasks smgr needs to do on a page is
> > generally much less than the amount of tasks an AM needs to do; so in my
> > view the AM has priority in prime page real estate, not smgr or related
> > infrastructure.
>
> I will confess to a slightly wobbly understanding of the delineation
> of responsibility here. I was under the impression that by modifying
> any consumer of PageHeaderData this would be sufficient to cover all
> AMs for the types of cluster-wide options we'd be concerned about (say
> extended checksums, multiple page encryption schemes, or other
> per-page information we haven't yet anticipated).  Reading smgr/README
> and the various access/*/README has not made the distinction clear to
> me yet.

pd_special has (historically) been reserved for access methods'
page-level private data. If you add to this area, shouldn't that be
space that the AM should be able to hook into as well? Or are all
those features limited to the storage system only; i.e. the storage
system decides what's best for the AM's page handling w.r.t. physical
storage?

> > re: PageFeatures
> > I'm not sure I understand the goal, nor the reasoning. Shouldn't this be
> > part of the storage manager (smgr) implementation / can't this be part of
> > the smgr of the relation?
>
> For at least the feature cases I'm anticipating, this would apply to
> any disk page that may have user data, set (at least initially) at
> initdb time, so should apply to any pages in the cluster, regardless
> of AM.

OK, so having a storage manager for each supported set of features is
not planned for this. Understood.

> > re: use of pd_checksum
> > I mentioned this in the above-mentioned thread too, in [1], that we could
> > use pd_checksum as an extra area marker for this storage-specific data,
> > which would be located between pd_upper and pd_special.
>
> I do think that we could indeed use this as an additional in-page
> pointer, but at least for this version was keeping things
> backwards-compatible.  Peter G (I think) also made some good points
> about how to include the various status bits on the page somehow in
> terms of making a page completely self-contained.

I think that adding page header bits would suffice for backwards
compatibility if we'd want to reuse pd_checksum. A new
PD_CHECKSUM_REUSED_FOR_STORAGE flag would do here; it would be unset in
normal (pre-patch, or without these fancy new features) clusters.
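
Something like the following, say (the bit value and names are
placeholders; pd_flags currently uses 0x0001 through 0x0004):

    /* new pd_flags bit: pd_checksum is reused as a storage-area offset */
    #define PD_CHECKSUM_REUSED_FOR_STORAGE 0x0008

    #define PageChecksumIsStorageOffset(page) \
        ((((PageHeader) (page))->pd_flags & PD_CHECKSUM_REUSED_FOR_STORAGE) != 0)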

Kind regards,

Matthias van de Meent

PS. Sorry for the rant. I hope my arguments make it clear why I dislike
the storage area being placed behind the special area in memory.

[0] https://commitfest.postgresql.org/40/3543/



Re: [PATCHES] Post-special page storage TDE support

From: David Christensen
Per some offline discussion with Stephen, and incorporating some of the
feedback I've gotten, I'm including the following changes/revisions:

1. Change the signature of any macros that rely on a dynamic component
to look like a function so you can more easily determine in-code
whether something is truly a constant/compile time calculation or a
runtime one.

2. We use a new page flag for whether "extended page features" are
enabled on the given page.  If this is set, we look for the 1-byte
trailer holding the bitmask of enabled features. We allow space for 7
page features and reserve the final high bit for future use / a change
of interpretation to accommodate more.  (The trailer layout is sketched
just after this list.)

3. Consolidate the extended checksums into a 56-bit checksum that
immediately precedes the 1-byte flag.  Choice of 64-bit checksum is
arbitrary just based on some MIT-licensed code I found, so just
considering this proof of concept, not necessarily promoting that
specific calculation.  (I think I included some additional checksum
variants from the earlier revision for ease of testing various
approaches.)

4. Ensure the whole area is MAXALIGN'd, and fix a few bugs that were
pointed out in this thread.
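
To sketch the trailer layout described in (2) and (3) above (this is
illustrative only; the details may shift as the patch evolves):

    /* last byte of the page: bits 0-6 are feature flags, bit 7 reserved */
    #define PageGetFeatureFlags(page) \
        (*((uint8 *) (page) + BLCKSZ - 1))

    #define PageHasFeature(page, f) \
        ((PageGetFeatureFlags(page) & (1 << (f))) != 0)

    /* 56-bit extended checksum lives in the 7 bytes just before the flags */
    #define PageGetExtendedChecksumPtr(page) \
        ((char *) (page) + BLCKSZ - 1 - 7)

None of this is consulted unless the new pd_flags bit says extended
page features are present on the page.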

Patches are:

1. make select_views stable, a prerequisite for anything that changes
how many tuples fit on a page

2. add reserved_page_size handling and rework existing code to account
for this additional space usage

3. main PageFeatures-related code; introduce that abstraction layer,
along with the trailing byte on the page holding the enabled features
for this specific page.  We also add an additional param to PageInit()
with the page features active on this page (see the small sketch after
this list); currently all call sites are using the cluster-wide
cluster_page_features as the parameter, so all pages share what is
stored in the control file based on initdb options.  However, routines
which query page features look at the actual page itself, so in fact we
are able to enable features piecemeal at the page/relation level if we
so desire, or turn them off for specific types of pages, say. Also
includes the additional pd_flags bit to enable that interpretation.

4. Actual extended checksums PageFeature. Rather than two separate
implementations as in the previous patch series, we are using 56 bits
of a 64-bit checksum, stored as the high 7 bytes of the final 8 in the
page where this is enabled.

5. wasted_space PageFeature just to demo multiple features in play.
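
The PageInit() change mentioned in (3) looks roughly like this (sketch
only; argument names and order may differ in the actual patch):

    /* new: callers say which page features this page should carry */
    extern void PageInit(Page page, Size pageSize, Size specialSize,
                         PageFeatureSet features);

    /* typical call site, using the cluster-wide default from pg_control */
    PageInit(page, BufferGetPageSize(buf), sizeof(BTPageOpaqueData),
             cluster_page_features);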

Thanks,

David

Attachment

Re: [PATCHES] Post-special page storage TDE support

From: David Christensen
Looking into some CF bot failures which didn't show up locally.  Will
send a v3 when resolved.



Re: [PATCHES] Post-special page storage TDE support

From: David Christensen
So here is v3, incorporating additional bug fixes and some design
revisions.  I have narrowed this down to 3 patches: I fixed the bugs
that were causing the instability in that test file (so that patch is
dropped), and removed the useless "wasted space" POC.

The following pieces are left:

0001 - adjust the codebase to utilize the "reserved_page_space"
variable for all offsets rather than assuming compile-time constants.
This allows us to effectively allocate a fixed chunk of storage from
the end of the page and have everything still work on this cluster.
0002 - add the Page Feature abstraction. This allows you to utilize
this chunk of storage, as well as query for feature use at the page
level.
0003 - the first page feature, 64-bit checksums (soon to be renumbered
when GCM storage for TDE is introduced, though the two features are
designed to be incompatible).  This includes an arbitrarily found
64-bit checksum, so we probably will need to write our own or ensure
that we have something license-compatible.

This is rebased and current as of today and passes all CI tests, so it
should be in a good place to start looking at.

Best,

David

Attachment

Re: [PATCHES] Post-special page storage TDE support

From: Stephen Frost
Greetings,

* David Christensen (david.christensen@crunchydata.com) wrote:
> Refreshing this with HEAD as of today, v4.

Thanks for updating this!

> Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>
> This space is reserved for extended data on the Page structure which will be ultimately used for
> encrypted data, extended checksums, and potentially other things.  This data appears at the end of
> the Page, after any `pd_special` area, and will be calculated at runtime based on specific
> ControlFile features.
>
> No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
> we will require logical replication to move data into a cluster with different settings here.

This initial patch, at least, does maintain pg_upgrade as the
reserved_page_size (maybe not a great name?) is set to 0, right?
Basically this is just introducing the concept of a reserved_page_size
and adjusting all of the code that currently uses BLCKSZ or
PageGetPageSize() to account for this extra space.

Looking at the changes to bufpage.h, in particular ...

> diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h

> @@ -19,6 +19,14 @@
>  #include "storage/item.h"
>  #include "storage/off.h"
>
> +extern PGDLLIMPORT int reserved_page_size;
> +
> +#define SizeOfPageReservedSpace() reserved_page_size
> +#define MaxSizeOfPageReservedSpace 0
> +
> +/* strict upper bound on the amount of space occupied we have reserved on
> + * pages in this cluster */

This will eventually be calculated based on what features are supported
concurrently?

> @@ -36,10 +44,10 @@
>   * |             v pd_upper                              |
>   * +-------------+------------------------------------+
>   * |             | tupleN ...                         |
> - * +-------------+------------------+-----------------+
> - * |       ... tuple3 tuple2 tuple1 | "special space" |
> - * +--------------------------------+-----------------+
> - *                                    ^ pd_special
> + * +-------------+-----+------------+----+------------+
> + * | ... tuple2 tuple1 | "special space" | "reserved" |
> + * +-------------------+------------+----+------------+
> + *                       ^ pd_special      ^ reserved_page_space

Right, adds a dynamic amount of space 'post-special area'.

> @@ -73,6 +81,8 @@
>   * stored as the page trailer.  an access method should always
>   * initialize its pages with PageInit and then set its own opaque
>   * fields.
> + *
> + * XXX - update more comments here about reserved_page_space
>   */

Would be good to do. ;)

> @@ -325,7 +335,7 @@ static inline void
>  PageValidateSpecialPointer(Page page)
>  {
>      Assert(page);
> -    Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> +    Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
>      Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
>  }

This is just one usage ... but seems like maybe we should be using
PageGetPageSize() here instead of BLCKSZ, and that more-or-less
throughout?  Nearly everywhere we're using BLCKSZ today to give us that
compile-time advantage of a fixed block size is going to lose that
advantage anyway thanks to reserved_page_size being run-time.  Now, one
up-side to this is that it'd also get us closer to being able to support
dynamic block sizes concurrently which would be quite interesting.  That
is, a special tablespace with a 32KB block size while the rest are the
traditional 8KB.  This would likely require multiple shared buffer
pools, of course...

> diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
> index 9a302ddc30..a93cd9df9f 100644
> --- a/src/backend/storage/page/bufpage.c
> +++ b/src/backend/storage/page/bufpage.c
> @@ -26,6 +26,8 @@
>  /* GUC variable */
>  bool        ignore_checksum_failure = false;
>
> +int            reserved_page_size = 0; /* how much page space to reserve for extended unencrypted metadata */
> +
>
>  /* ----------------------------------------------------------------
>   *                        Page support functions
> @@ -43,7 +45,7 @@ PageInit(Page page, Size pageSize, Size specialSize)
>  {
>      PageHeader    p = (PageHeader) page;
>
> -    specialSize = MAXALIGN(specialSize);
> +    specialSize = MAXALIGN(specialSize) + reserved_page_size;

Rather than make it part of specialSize, I would think we'd be better
off just treating them independently.  Eg, the later pd_upper setting
would be done by:

p->pd_upper = pageSize - specialSize - reserved_page_size;

etc.

> @@ -186,7 +188,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
>   *    one that is both unused and deallocated.
>   *
>   *    If flag PAI_IS_HEAP is set, we enforce that there can't be more than
> - *    MaxHeapTuplesPerPage line pointers on the page.
> + *    MaxHeapTuplesPerPage() line pointers on the page.

Making MaxHeapTuplesPerPage() runtime dynamic is a requirement for
supporting multiple page sizes concurrently ... but I'm not sure it's
actually required for the reserved_page_size idea as currently
considered.  The reason is that with 8K or larger pages, the amount of
space we're already throwing away is at least 20 bytes, if I did my math
right.  If we constrain reserved_page_size to be 20 bytes or less, as I
believe we're currently thinking we won't need that much, then we could
perhaps keep MaxHeapTuplesPerPage as a compile-time constant.
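
Spelling out that math, assuming 8kB pages and 8-byte MAXALIGN (24
being SizeOfPageHeaderData, 23 SizeofHeapTupleHeader, and 4
sizeof(ItemIdData)):

    MaxHeapTuplesPerPage = (8192 - 24) / (MAXALIGN(23) + 4)
                         = 8168 / 28
                         = 291        (integer truncation)

    291 * 28 = 8148   =>   8168 - 8148 = 20 bytes unused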

On the other hand, to the extent that we want to consider having
variable page sizes in the future, perhaps we do want to change this.
If so, the approach broadly looks reasonable to me, but I'd suggest we
make that a separate patch from the introduction of reserved_page_size.

> @@ -211,7 +213,7 @@ PageAddItemExtended(Page page,
>      if (phdr->pd_lower < SizeOfPageHeaderData ||
>          phdr->pd_lower > phdr->pd_upper ||
>          phdr->pd_upper > phdr->pd_special ||
> -        phdr->pd_special > BLCKSZ)
> +        phdr->pd_special + reserved_page_size > BLCKSZ)
>          ereport(PANIC,
>                  (errcode(ERRCODE_DATA_CORRUPTED),
>                   errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",

Probably should add reserved_page_size to that errmsg output?  Also,
this check of pointers seems to be done multiple times- maybe it should
be moved into a #define or similar?

> @@ -723,7 +725,7 @@ PageRepairFragmentation(Page page)
>      if (pd_lower < SizeOfPageHeaderData ||
>          pd_lower > pd_upper ||
>          pd_upper > pd_special ||
> -        pd_special > BLCKSZ ||
> +        pd_special + reserved_page_size > BLCKSZ ||
>          pd_special != MAXALIGN(pd_special))
>          ereport(ERROR,
>                  (errcode(ERRCODE_DATA_CORRUPTED),

This ends up being the same as above ...

> @@ -1066,7 +1068,7 @@ PageIndexTupleDelete(Page page, OffsetNumber offnum)
>      if (phdr->pd_lower < SizeOfPageHeaderData ||
>          phdr->pd_lower > phdr->pd_upper ||
>          phdr->pd_upper > phdr->pd_special ||
> -        phdr->pd_special > BLCKSZ ||
> +        phdr->pd_special + reserved_page_size > BLCKSZ ||
>          phdr->pd_special != MAXALIGN(phdr->pd_special))
>          ereport(ERROR,
>                  (errcode(ERRCODE_DATA_CORRUPTED),

And here ...

> @@ -1201,7 +1203,7 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
>      if (pd_lower < SizeOfPageHeaderData ||
>          pd_lower > pd_upper ||
>          pd_upper > pd_special ||
> -        pd_special > BLCKSZ ||
> +        pd_special + reserved_page_size > BLCKSZ ||
>          pd_special != MAXALIGN(pd_special))
>          ereport(ERROR,
>                  (errcode(ERRCODE_DATA_CORRUPTED),

And here ...

> @@ -1307,7 +1309,7 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
>      if (phdr->pd_lower < SizeOfPageHeaderData ||
>          phdr->pd_lower > phdr->pd_upper ||
>          phdr->pd_upper > phdr->pd_special ||
> -        phdr->pd_special > BLCKSZ ||
> +        phdr->pd_special + reserved_page_size > BLCKSZ ||
>          phdr->pd_special != MAXALIGN(phdr->pd_special))
>          ereport(ERROR,
>                  (errcode(ERRCODE_DATA_CORRUPTED),

And here ...

> @@ -1419,7 +1421,7 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
>      if (phdr->pd_lower < SizeOfPageHeaderData ||
>          phdr->pd_lower > phdr->pd_upper ||
>          phdr->pd_upper > phdr->pd_special ||
> -        phdr->pd_special > BLCKSZ ||
> +        phdr->pd_special + reserved_page_size > BLCKSZ ||
>          phdr->pd_special != MAXALIGN(phdr->pd_special))
>          ereport(ERROR,
>                  (errcode(ERRCODE_DATA_CORRUPTED),

And here ...

> diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
> index 6979aff727..060c4ab3e3 100644
> --- a/contrib/amcheck/verify_nbtree.c
> +++ b/contrib/amcheck/verify_nbtree.c
> @@ -489,12 +489,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
>          /*
>           * Size Bloom filter based on estimated number of tuples in index,
>           * while conservatively assuming that each block must contain at least
> -         * MaxTIDsPerBTreePage / 3 "plain" tuples -- see
> +         * MaxTIDsPerBTreePage() / 3 "plain" tuples -- see
>           * bt_posting_plain_tuple() for definition, and details of how posting
>           * list tuples are handled.
>           */
>          total_pages = RelationGetNumberOfBlocks(rel);
> -        total_elems = Max(total_pages * (MaxTIDsPerBTreePage / 3),
> +        total_elems = Max(total_pages * (MaxTIDsPerBTreePage() / 3),
>                            (int64) state->rel->rd_rel->reltuples);
>          /* Generate a random seed to avoid repetition */
>          seed = pg_prng_uint64(&pg_global_prng_state);

Making MaxTIDsPerBTreePage dynamic looks to be required as it doesn't
end up with any 'leftover' space, from what I can tell.  Again, though,
perhaps this should be split out as an independent patch from the rest.
That is- we can change the higher-level functions to be dynamic in the
initial patches, and then eventually we'll get down to making the
lower-level functions dynamic.

> diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
> index efdf9415d1..8ebabdd7ee 100644
> --- a/contrib/bloom/bloom.h
> +++ b/contrib/bloom/bloom.h
> @@ -131,7 +131,7 @@ typedef struct BloomMetaPageData
>  #define BLOOM_MAGICK_NUMBER (0xDBAC0DED)
>
>  /* Number of blocks numbers fit in BloomMetaPageData */
> -#define BloomMetaBlockN        (sizeof(FreeBlockNumberArray) / sizeof(BlockNumber))
> +#define BloomMetaBlockN()        ((sizeof(FreeBlockNumberArray) - SizeOfPageReservedSpace())/ sizeof(BlockNumber))
>
>  #define BloomPageGetMeta(page)    ((BloomMetaPageData *) PageGetContents(page))
>
> @@ -151,6 +151,7 @@ typedef struct BloomState
>
>  #define BloomPageGetFreeSpace(state, page) \
>      (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
> +        - SizeOfPageReservedSpace()                                  \
>          - BloomPageGetMaxOffset(page) * (state)->sizeOfBloomTuple \
>          - MAXALIGN(sizeof(BloomPageOpaqueData)))

This formulation (or something close to it) tends to happen quite a bit:

(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - SizeOfPageReservedSpace() ...

This is basically asking for "amount of usable space" where the
resulting 'usable space' either includes line pointers and tuples or
similar, or doesn't.  Perhaps we should break this down into two
patches- one which provides a function to return usable space on a page,
and then the patch to add reserved_page_size can simply adjust that
instead of changing the very, very many places we have this formulation.

> diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
> index d935ed8fbd..d3d74a9d28 100644
> --- a/contrib/bloom/blutils.c
> +++ b/contrib/bloom/blutils.c
> @@ -430,10 +430,10 @@ BloomFillMetapage(Relation index, Page metaPage)
>       */
>      BloomInitPage(metaPage, BLOOM_META);
>      metadata = BloomPageGetMeta(metaPage);
> -    memset(metadata, 0, sizeof(BloomMetaPageData));
> +    memset(metadata, 0, sizeof(BloomMetaPageData) - SizeOfPageReservedSpace());

This doesn't seem quite right?  The reserved space is off at the end of
the page and this is 0'ing the space immediately after the page header,
if I'm following correctly, and only to the size of BloomMetaPageData...

>      metadata->magickNumber = BLOOM_MAGICK_NUMBER;
>      metadata->opts = *opts;
> -    ((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData);
> +    ((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData) - SizeOfPageReservedSpace();

Not quite following what's going on here either.

> diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
> @@ -116,7 +116,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
>           */
>          if (BloomPageGetMaxOffset(page) != 0 &&
>              BloomPageGetFreeSpace(&state, page) >= state.sizeOfBloomTuple &&
> -            countPage < BloomMetaBlockN)
> +            countPage < BloomMetaBlockN())
>              notFullPage[countPage++] = blkno;

Looks to be another opportunity to have a separate patch making this
change first before actually changing the lower-level #define's,

> diff --git a/src/backend/access/brin/brin_tuple.c b/src/backend/access/brin/brin_tuple.c
> @@ -217,7 +217,7 @@ brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno, BrinMemTuple *tuple,
>               * datatype, try to compress it in-line.
>               */
>              if (!VARATT_IS_EXTENDED(DatumGetPointer(value)) &&
> -                VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET &&
> +                VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET() &&
>                  (atttype->typstorage == TYPSTORAGE_EXTENDED ||
>                   atttype->typstorage == TYPSTORAGE_MAIN))
>              {

Probably could be another patch but also if we're going to change
TOAST_INDEX_TARGET to be a function we should probably not have it named
in all-CAPS.

> diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
> @@ -535,7 +535,7 @@ dataBeginPlaceToPageLeaf(GinBtree btree, Buffer buf, GinBtreeStack *stack,
>           * a single byte, and we can use all the free space on the old page as
>           * well as the new page. For simplicity, ignore segment overhead etc.
>           */
> -        maxitems = Min(maxitems, freespace + GinDataPageMaxDataSize);
> +        maxitems = Min(maxitems, freespace + GinDataPageMaxDataSize());
>      }
>      else
>      {

Ditto.

> diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
> @@ -38,8 +38,8 @@
>  /* GUC parameter */
>  int            gin_pending_list_limit = 0;
>
> -#define GIN_PAGE_FREESIZE \
> -    ( BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - MAXALIGN(sizeof(GinPageOpaqueData)) )
> +#define GIN_PAGE_FREESIZE() \
> +    ( BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - MAXALIGN(sizeof(GinPageOpaqueData)) - SizeOfPageReservedSpace() )

Another case of BLCKSZ - MAXALIGN(SizeOfPageHeaderData) -
SizeOfPageReservedSpace() ...

> @@ -450,7 +450,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
>       * ginInsertCleanup() should not be called inside our CRIT_SECTION.
>       */
>      cleanupSize = GinGetPendingListCleanupSize(index);
> -    if (metadata->nPendingPages * GIN_PAGE_FREESIZE > cleanupSize * 1024L)
> +    if (metadata->nPendingPages * GIN_PAGE_FREESIZE() > cleanupSize * 1024L)
>          needCleanup = true;

Also shouldn't be all-CAPS.

> diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
> index 43b67893d9..5babbb457a 100644
> --- a/src/backend/access/nbtree/nbtsplitloc.c
> +++ b/src/backend/access/nbtree/nbtsplitloc.c
> @@ -156,7 +156,7 @@ _bt_findsplitloc(Relation rel,
>
>      /* Total free space available on a btree page, after fixed overhead */
>      leftspace = rightspace =
> -        PageGetPageSize(origpage) - SizeOfPageHeaderData -
> +        PageGetPageSize(origpage) - SizeOfPageHeaderData - SizeOfPageReservedSpace() -
>          MAXALIGN(sizeof(BTPageOpaqueData));

Also here ... though a bit interesting that this uses PageGetPageSize()
instead of BLCKSZ.

> diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
> index 011ec18015..022b5eee4e 100644
> --- a/src/backend/utils/init/globals.c
> +++ b/src/backend/utils/init/globals.c
> @@ -154,3 +154,4 @@ int64        VacuumPageDirty = 0;
>
>  int            VacuumCostBalance = 0;    /* working state for vacuum */
>  bool        VacuumCostActive = false;
> +

Unnecessary whitespace hunk ?

Thanks!

Stephen

Attachment

Re: [PATCHES] Post-special page storage TDE support

From: David Christensen
On Fri, May 12, 2023 at 7:48 PM Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * David Christensen (david.christensen@crunchydata.com) wrote:
> > Refreshing this with HEAD as of today, v4.
>
> Thanks for updating this!

Thanks for your patience waiting on my response here.

> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will be ultimately used for
> > encrypted data, extended checksums, and potentially other things.  This data appears at the end of
> > the Page, after any `pd_special` area, and will be calculated at runtime based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
> > we will require logical replication to move data into a cluster with different settings here.
>
> This initial patch, at least, does maintain pg_upgrade as the
> reserved_page_size (maybe not a great name?) is set to 0, right?
> Basically this is just introducing the concept of a reserved_page_size
> and adjusting all of the code that currently uses BLCKSZ or
> PageGetPageSize() to account for this extra space.

Correct; a reserved_page_size of 0 would be the same page format as
currently exists, so you could use pg_upgrade with no page features
and be binary compatible with existing clusters.

> Looking at the changes to bufpage.h, in particular ...
>
> > diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
>
> > @@ -19,6 +19,14 @@
> >  #include "storage/item.h"
> >  #include "storage/off.h"
> >
> > +extern PGDLLIMPORT int reserved_page_size;
> > +
> > +#define SizeOfPageReservedSpace() reserved_page_size
> > +#define MaxSizeOfPageReservedSpace 0
> > +
> > +/* strict upper bound on the amount of space occupied we have reserved on
> > + * pages in this cluster */
>
> This will eventually be calculated based on what features are supported
> concurrently?

Correct; these are fleshed out in later patches.

> > @@ -36,10 +44,10 @@
> >   * |             v pd_upper                              |
> >   * +-------------+------------------------------------+
> >   * |             | tupleN ...                         |
> > - * +-------------+------------------+-----------------+
> > - * |       ... tuple3 tuple2 tuple1 | "special space" |
> > - * +--------------------------------+-----------------+
> > - *                                    ^ pd_special
> > + * +-------------+-----+------------+----+------------+
> > + * | ... tuple2 tuple1 | "special space" | "reserved" |
> > + * +-------------------+------------+----+------------+
> > + *                       ^ pd_special      ^ reserved_page_space
>
> Right, adds a dynamic amount of space 'post-special area'.

Dynamic as in "fixed at initdb time" instead of compile time. However,
things are coded in such a way that the page feature bitmap is stored
on a given page, so different pages could have different
reserved_page_size depending on use case/code path. (Basically
preserving future flexibility while minimizing code changes here.)  We
could utilize different features depending on what type of page it is,
say, or have different relations or tablespaces with different page
feature defaults.

> > @@ -73,6 +81,8 @@
> >   * stored as the page trailer.  an access method should always
> >   * initialize its pages with PageInit and then set its own opaque
> >   * fields.
> > + *
> > + * XXX - update more comments here about reserved_page_space
> >   */
>
> Would be good to do. ;)

Next revision... :D

> > @@ -325,7 +335,7 @@ static inline void
> >  PageValidateSpecialPointer(Page page)
> >  {
> >       Assert(page);
> > -     Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> > +     Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
> >       Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
> >  }
>
> This is just one usage ... but seems like maybe we should be using
> PageGetPageSize() here instead of BLCKSZ, and that more-or-less
> throughout?  Nearly everywhere we're using BLCKSZ today to give us that
> compile-time advantage of a fixed block size is going to lose that
> advantage anyway thanks to reserved_page_size being run-time.  Now, one
> up-side to this is that it'd also get us closer to being able to support
> dynamic block sizes concurrently which would be quite interesting.  That
> is, a special tablespace with a 32KB block size while the rest are the
> traditional 8KB.  This would likely require multiple shared buffer
> pools, of course...

I think multiple shared-buffer pools are a ways off; but sure, this
would support this sort of use case as well.  I am working on a new
patch for this series (probably the first one in the series) which
will actually just abstract away all existing compile-time usages of
BLCKSZ.  This will be a start in that direction and also make the
reserved_page_size patch a bit more reasonable to review.
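
Directionally, that patch will be along these lines (purely
illustrative; to start with, ClusterBlockSize() just evaluates to the
compile-time BLCKSZ, so there is no behavioral change):

    #define ClusterBlockSize()  ((Size) BLCKSZ)

    /* former compile-time constants become function-like macros, e.g. */
    #define MaxHeapTuplesPerPage() \
        ((int) ((ClusterBlockSize() - SizeOfPageHeaderData - SizeOfPageReservedSpace()) / \
                (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))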

> > diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
> > index 9a302ddc30..a93cd9df9f 100644
> > --- a/src/backend/storage/page/bufpage.c
> > +++ b/src/backend/storage/page/bufpage.c
> > @@ -26,6 +26,8 @@
> >  /* GUC variable */
> >  bool         ignore_checksum_failure = false;
> >
> > +int                  reserved_page_size = 0; /* how much page space to reserve for extended unencrypted metadata */
> > +
> >
> >  /* ----------------------------------------------------------------
> >   *                                           Page support functions
> > @@ -43,7 +45,7 @@ PageInit(Page page, Size pageSize, Size specialSize)
> >  {
> >       PageHeader      p = (PageHeader) page;
> >
> > -     specialSize = MAXALIGN(specialSize);
> > +     specialSize = MAXALIGN(specialSize) + reserved_page_size;
>
> Rather than make it part of specialSize, I would think we'd be better
> off just treating them independently.  Eg, the later pd_upper setting
> would be done by:
>
> p->pd_upper = pageSize - specialSize - reserved_page_size;
>
> etc.

I can see that there's a mild readability benefit, but really the
effect is local to PageInit(), so ¯\_(ツ)_/¯... happy to make that
change though.

> > @@ -186,7 +188,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
> >   *   one that is both unused and deallocated.
> >   *
> >   *   If flag PAI_IS_HEAP is set, we enforce that there can't be more than
> > - *   MaxHeapTuplesPerPage line pointers on the page.
> > + *   MaxHeapTuplesPerPage() line pointers on the page.
>
> Making MaxHeapTuplesPerPage() runtime dynamic is a requirement for
> supporting multiple page sizes concurrently ... but I'm not sure it's
> actually required for the reserved_page_size idea as currently
> considered.  The reason is that with 8K or larger pages, the amount of
> space we're already throwing away is at least 20 bytes, if I did my math
> right.  If we constrain reserved_page_size to be 20 bytes or less, as I
> believe we're currently thinking we won't need that much, then we could
> perhaps keep MaxHeapTuplesPerPage as a compile-time constant.

In this version we don't have that explicit constraint. In practice I
don't know that we have many more than 20 bytes, at least for the
first few features, but I don't think we can count on that forever
going forward. At some point we're going to have to parameterize
these, so might as well do it in this pass, since how else would you
know that this magic value has been exceeded?

> On the other hand, to the extent that we want to consider having
> variable page sizes in the future, perhaps we do want to change this.
> If so, the approach broadly looks reasonable to me, but I'd suggest we
> make that a separate patch from the introduction of reserved_page_size.

The variable blocksize patch I'm working on includes some of this, so
this will be in the next revision.

> > @@ -211,7 +213,7 @@ PageAddItemExtended(Page page,
> >       if (phdr->pd_lower < SizeOfPageHeaderData ||
> >               phdr->pd_lower > phdr->pd_upper ||
> >               phdr->pd_upper > phdr->pd_special ||
> > -             phdr->pd_special > BLCKSZ)
> > +             phdr->pd_special + reserved_page_size > BLCKSZ)
> >               ereport(PANIC,
> >                               (errcode(ERRCODE_DATA_CORRUPTED),
> >                                errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
>
> Probably should add reserved_page_size to that errmsg output?  Also,
> this check of pointers seems to be done multiple times- maybe it should
> be moved into a #define or similar?

Sure, can change; agreed it'd be good to have.  I just modified the
existing call sites and didn't attempt to change too much else.

[snipped other instances...]

> Making MaxTIDsPerBTreePage dynamic looks to be required as it doesn't
> end up with any 'leftover' space, from what I can tell.  Again, though,
> perhaps this should be split out as an independent patch from the rest.
> That is- we can change the higher-level functions to be dynamic in the
> initial patches, and then eventually we'll get down to making the
> lower-level functions dynamic.

Same; should be accounted for in the next variable blocksize patch.
It does have a cascading effect though, so it's hard to make the high-level
functions dynamic but not the lower-level ones. What is the benefit in
this case for separating those two?

> > diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
> > index efdf9415d1..8ebabdd7ee 100644
> > --- a/contrib/bloom/bloom.h
> > +++ b/contrib/bloom/bloom.h
> > @@ -131,7 +131,7 @@ typedef struct BloomMetaPageData
> >  #define BLOOM_MAGICK_NUMBER (0xDBAC0DED)
> >
> >  /* Number of blocks numbers fit in BloomMetaPageData */
> > -#define BloomMetaBlockN              (sizeof(FreeBlockNumberArray) / sizeof(BlockNumber))
> > +#define BloomMetaBlockN()            ((sizeof(FreeBlockNumberArray) - SizeOfPageReservedSpace()) / sizeof(BlockNumber))
> >
> >  #define BloomPageGetMeta(page)       ((BloomMetaPageData *) PageGetContents(page))
> >
> > @@ -151,6 +151,7 @@ typedef struct BloomState
> >
> >  #define BloomPageGetFreeSpace(state, page) \
> >       (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
> > +             - SizeOfPageReservedSpace()                                                               \
> >               - BloomPageGetMaxOffset(page) * (state)->sizeOfBloomTuple \
> >               - MAXALIGN(sizeof(BloomPageOpaqueData)))
>
> This formulation (or something close to it) tends to happen quite a bit:
>
> (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - SizeOfPageReservedSpace() ...
>
> This is basically asking for "amount of usable space" where the
> resulting 'usable space' either includes line pointers and tuples or
> similar, or doesn't.  Perhaps we should break this down into two
> patches- one which provides a function to return usable space on a page,
> and then the patch to add reserved_page_size can simply adjust that
> instead of changing the very, very many places we have this forumlation.

Yeah, I can make this a computed expression; agreed it's pretty common
to need the usable space on the page, so really any AM shouldn't know
or care about the details of either header or footer.  Since we
already have PageGetContents() I will probably name it
PageGetContentsSize(). The AM can own everything from the pointer
returned by PageGetContents() through said size, accounting for both
the header and reserved_page_size in said computation.
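
Something along these lines, roughly (sketch only; the exact definition
will show up in the next revision):

    static inline Size
    PageGetContentsSize(Page page)
    {
        return PageGetPageSize(page)
            - MAXALIGN(SizeOfPageHeaderData)
            - SizeOfPageReservedSpace();
    }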

> > diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
> > index d935ed8fbd..d3d74a9d28 100644
> > --- a/contrib/bloom/blutils.c
> > +++ b/contrib/bloom/blutils.c
> > @@ -430,10 +430,10 @@ BloomFillMetapage(Relation index, Page metaPage)
> >        */
> >       BloomInitPage(metaPage, BLOOM_META);
> >       metadata = BloomPageGetMeta(metaPage);
> > -     memset(metadata, 0, sizeof(BloomMetaPageData));
> > +     memset(metadata, 0, sizeof(BloomMetaPageData) - SizeOfPageReservedSpace());
>
> This doesn't seem quite right?  The reserved space is off at the end of
> the page and this is 0'ing the space immediately after the page header,
> if I'm following correctly, and only to the size of BloomMetaPageData...

I think you're correct with that analysis. BloomInitPage() would have
(probably?) had a zeroed page, so this too-short memset would have gone
unnoticed in practice, but it's still good to fix.

> >       metadata->magickNumber = BLOOM_MAGICK_NUMBER;
> >       metadata->opts = *opts;
> > -     ((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData);
> > +     ((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData) - SizeOfPageReservedSpace();
>
> Not quite following what's going on here either.

Heh, not sure either.  Not sure if there was a reason or if it was just
a mechanical replacement, but I will take a look when I do the next
revision.

> > diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
> > @@ -116,7 +116,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
> >                */
> >               if (BloomPageGetMaxOffset(page) != 0 &&
> >                       BloomPageGetFreeSpace(&state, page) >= state.sizeOfBloomTuple &&
> > -                     countPage < BloomMetaBlockN)
> > +                     countPage < BloomMetaBlockN())
> >                       notFullPage[countPage++] = blkno;
>
> Looks to be another opportunity to have a separate patch making this
> change first before actually changing the lower-level #define's,

#include <stdpatch/variable-blocksize>

> > diff --git a/src/backend/access/brin/brin_tuple.c b/src/backend/access/brin/brin_tuple.c
> > @@ -217,7 +217,7 @@ brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno, BrinMemTuple *tuple,
> >                        * datatype, try to compress it in-line.
> >                        */
> >                       if (!VARATT_IS_EXTENDED(DatumGetPointer(value)) &&
> > -                             VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET &&
> > +                             VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET() &&
> >                               (atttype->typstorage == TYPSTORAGE_EXTENDED ||
> >                                atttype->typstorage == TYPSTORAGE_MAIN))
> >                       {
>
> Probably could be another patch but also if we're going to change
> TOAST_INDEX_TARGET to be a function we should probably not have it named
> in all-CAPS.

Okay, can make those style changes as well; agreed that an ALL-CAPS name should be reserved for actual constants.

> > diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
> > index 43b67893d9..5babbb457a 100644
> > --- a/src/backend/access/nbtree/nbtsplitloc.c
> > +++ b/src/backend/access/nbtree/nbtsplitloc.c
> > @@ -156,7 +156,7 @@ _bt_findsplitloc(Relation rel,
> >
> >       /* Total free space available on a btree page, after fixed overhead */
> >       leftspace = rightspace =
> > -             PageGetPageSize(origpage) - SizeOfPageHeaderData -
> > +             PageGetPageSize(origpage) - SizeOfPageHeaderData - SizeOfPageReservedSpace() -
> >               MAXALIGN(sizeof(BTPageOpaqueData));
>
> Also here ... though a bit interesting that this uses PageGetPageSize()
> instead of BLCKSZ.

Yeah, there are a few little exceptions.  The variable-blocksize patch
introduces PageGetPageSize() every place it can, and ClusterBlockSize() anywhere it can't.

> > diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
> > index 011ec18015..022b5eee4e 100644
> > --- a/src/backend/utils/init/globals.c
> > +++ b/src/backend/utils/init/globals.c
> > @@ -154,3 +154,4 @@ int64             VacuumPageDirty = 0;
> >
> >  int                  VacuumCostBalance = 0;  /* working state for vacuum */
> >  bool         VacuumCostActive = false;
> > +
>
> Unnecessary whitespace hunk ?

Will clean up.

Thanks for the review,

David



Re: [PATCHES] Post-special page storage TDE support

From
Andres Freund
Date:
Hi,

On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> From: David Christensen <david@pgguru.net>
> Date: Tue, 9 May 2023 16:56:15 -0500
> Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>
> This space is reserved for extended data on the Page structure which will be ultimately used for
> encrypted data, extended checksums, and potentially other things.  This data appears at the end of
> the Page, after any `pd_special` area, and will be calculated at runtime based on specific
> ControlFile features.
>
> No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
> we will require logical replication to move data into a cluster with
> different settings here.

The first part of the last paragraph makes it sound like pg_upgrade won't be
supported across this commit, rather than just between different settings...

I think as a whole this is not an insane idea. A few comments:

- IMO the patch touches many places it shouldn't need to touch, because of
  essentially renaming a lot of existing macro names to *Limit,
  necessitating modifying a lot of users. I think instead the few places that
  care about the runtime limit should be modified.

  As-is the patch would cause a lot of fallout in extensions that just do
  things like defining an on-stack array of Datums or such - even though all
  they'd need is to change the define to the *Limit one.

  Even leaving extensions aside, it must make reviewing (and I'm sure
  maintaining) the patch very tedious.


- I'm a bit worried about how the extra special page will be managed - if
  there are multiple features that want to use it, who gets to put their data
  at what offset?

  After writing this I saw that 0002 tries to address this - but I don't like
  the design. It introduces runtime overhead that seems likely to be visible.


- Checking for features using PageGetFeatureOffset() seems the wrong design to
  me - instead of a branch for some feature being disabled, perfectly
  predictable for the CPU, we need to do an external function call every time
  to figure out that yes, checksums are *still* disabled.


- Recomputing offsets every time in PageGetFeatureOffset() seems too
  expensive. The offsets can't change while running, nor does PageGetFeatureOffset()
  have enough information to distinguish between different kinds of relations
  - so why do we need to recompute offsets on every single page?  I'd instead
  add a distinct offset variable for each feature.


- Modifying every single PageInit() call doesn't make sense to me. That'll
  just create a lot of breakage for - as far as I can tell - no win.


- Why is it worth sacrificing space on every page to indicate which features
  were enabled?  I think there'd need to be some convincing reasons for
  introducing such overhead.

- Is it really useful to encode the set of features enabled in a cluster with
  a bitmask? That pretty much precludes utilizing extra page space in
  extensions. We could instead just have an extra cluster-wide file that
  defines a mapping of offset to feature.


Greetings,

Andres Freund



Re: [PATCHES] Post-special page storage TDE support

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen <david@pgguru.net>
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will be ultimately used for
> > encrypted data, extended checksums, and potentially other things.  This data appears at the end of
> > the Page, after any `pd_special` area, and will be calculated at runtime based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
> > we will require logical replication to move data into a cluster with
> > different settings here.
>
> The first part of the last paragraph makes it sound like pg_upgrade won't be
> supported across this commit, rather than just between different settings...
>
> I think as a whole this is not an insane idea. A few comments:

Thanks for all the feedback!

> - Why is it worth sacrificing space on every page to indicate which features
>   were enabled?  I think there'd need to be some convincing reasons for
>   introducing such overhead.

In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide the info needed to be able to interpret the page.  For my part, I don't
have a particularly strong feeling on that, but that's what led to this
design.

Getting a consensus on if that's a requirement or not would definitely
be really helpful.

Thanks,

Stephen

Attachment

Re: [PATCHES] Post-special page storage TDE support

From
David Christensen
Date:
On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen <david@pgguru.net>
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will be ultimately used for
> > encrypted data, extended checksums, and potentially other things.  This data appears at the end of
> > the Page, after any `pd_special` area, and will be calculated at runtime based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
> > we will require logical replication to move data into a cluster with
> > different settings here.
>
> The first part of the last paragraph makes it sound like pg_upgrade won't be
> supported across this commit, rather than just between different settings...

Yeah, that's vague, but you picked up on what I meant.
 
> I think as a whole this is not an insane idea. A few comments:

Thanks for all the feedback!

> - Why is it worth sacrificing space on every page to indicate which features
>   were enabled?  I think there'd need to be some convincing reasons for
>   introducing such overhead.

In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide info needed be able to interpret the page.  For my part, I don't
have a particularly strong feeling on that, but that's what lead to this
design.

Unsurprisingly, I agree that it's useful to keep these features on the page itself; from a forensic standpoint it makes it much easier to interpret what is happening on a given page, and it would also allow different features on a given page or type of page depending on need.  The initial patch utilizes pg_control to store the cluster page features, but there's no reason the feature set couldn't depend on fork/page type, or be stored in pg_tablespace so that different tablespaces could utilize different features.

Thanks,

David

Re: [PATCHES] Post-special page storage TDE support

From
Stephen Frost
Date:
Greetings,

On Wed, Nov 8, 2023 at 20:55 David Christensen <david.christensen@crunchydata.com> wrote:
On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost <sfrost@snowman.net> wrote:
* Andres Freund (andres@anarazel.de) wrote:
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen <david@pgguru.net>
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will be ultimately used for
> > encrypted data, extended checksums, and potentially other things.  This data appears at the end of
> > the Page, after any `pd_special` area, and will be calculated at runtime based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
> > we will require logical replication to move data into a cluster with
> > different settings here.
>
> The first part of the last paragraph makes it sound like pg_upgrade won't be
> supported across this commit, rather than just between different settings...

Yeah, that's vague, but you picked up on what I meant.
 
> I think as a whole this is not an insane idea. A few comments:

Thanks for all the feedback!

> - Why is it worth sacrificing space on every page to indicate which features
>   were enabled?  I think there'd need to be some convincing reasons for
>   introducing such overhead.

In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide info needed be able to interpret the page.  For my part, I don't
have a particularly strong feeling on that, but that's what lead to this
design.

Unsurprisingly, I agree that it's useful to keep these features on the page itself; from a forensic standpoint this seems much easier to interpret what is happening here, as well it would allow you to have different features on a given page or type of page depending on need.  The initial patch utilizes pg_control to store the cluster page features, but there's no reason it couldn't be dependent on fork/page type or stored in pg_tablespace to utilize different features.

When it comes to authenticated encryption, it’s also the case that it’s unclear what value the checksum field has, if any…  it’s certainly not directly needed as a checksum, as the auth tag is much better for the purpose of seeing if the page has been changed in some way. It’s also not big enough to serve as an auth tag per NIST guidelines regarding the size of the authenticated data vs. the size of the tag.  Using it to indicate what features are enabled on the page seems pretty useful, as David notes.

Thanks,

Stephen

Re: [PATCHES] Post-special page storage TDE support

From
Peter Geoghegan
Date:
On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost <sfrost@snowman.net> wrote:
> In conversations with folks (my memory specifically is a discussion with
> Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> there was a pretty strong push that a page should be able to 'stand
> alone' and not depend on something else (eg: pg_control, or whatever) to
> provide info needed be able to interpret the page.  For my part, I don't
> have a particularly strong feeling on that, but that's what lead to this
> design.

The term that I have used in the past is "self-contained". Meaning
capable of being decoded more or less as-is, without any metadata, by
tools like pg_filedump.

Any design in this area should try to make things as easy to debug as
possible, for the obvious reason: encrypted data that somehow becomes
corrupt is bound to be a nightmare to debug. (Besides, we already
support tools like pg_filedump, so this isn't a new principle.)

--
Peter Geoghegan



Re: [PATCHES] Post-special page storage TDE support

From
David Christensen
Date:
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> From: David Christensen <david@pgguru.net>
> Date: Tue, 9 May 2023 16:56:15 -0500
> Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>
> This space is reserved for extended data on the Page structure which will be ultimately used for
> encrypted data, extended checksums, and potentially other things.  This data appears at the end of
> the Page, after any `pd_special` area, and will be calculated at runtime based on specific
> ControlFile features.
>
> No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
> we will require logical replication to move data into a cluster with
> different settings here.

The first part of the last paragraph makes it sound like pg_upgrade won't be
supported across this commit, rather than just between different settings...

Thanks for the review.
 
I think as a whole this is not an insane idea. A few comments:

- IMO the patch touches many places it shouldn't need to touch, because of
  essentially renaming a lot of existing macro names to *Limit,
  necessitating modifying a lot of users. I think instead the few places that
  care about the runtime limit should be modified.

  As-is the patch would cause a lot of fallout in extensions that just do
  things like defining an on-stack array of Datums or such - even though all
  they'd need is to change the define to the *Limit one.

  Even leaving extensions aside, it must makes reviewing (and I'm sure
  maintaining) the patch very tedious.

You make a good point, and I think you're right that we could teach just the places that care about the runtime vs. compile-time distinction about the changes while leaving other callers alone.  The *Limit forms were introduced since we need compile-time constant values from the Calc...() macros, but we could try keeping the old names for the constant forms and switching things around.  I suspect there will be about the same amount of code churn, just less mechanical.
 
- I'm a bit worried about how the extra special page will be managed - if
  there are multiple features that want to use it, who gets to put their data
  at what offset?

  After writing this I saw that 0002 tries to address this - but I don't like
  the design. It introduces runtime overhead that seems likely to be visible.

Agreed this could be optimized.
 
- Checking for features using PageGetFeatureOffset() seems the wrong design to
  me - instead of a branch for some feature being disabled, perfectly
  predictable for the CPU, we need to do an external function call every time
  to figure out that yet, checksums are *still* disabled.

This is probably not a supported approach (it felt a little icky), but I'd played around with const pointers to structs of const elements, where the initial values of a global var are populated early on (so set once and never changed post-init); the compiler didn't complain and things seemed to work ok.  Not sure if this approach might help balance the early mutability and constant lookup needs:

typedef struct PageFeatureOffsets {
  const Size feature0offset;
  const Size feature1offset;
  ...
} PageFeatureOffsets;

/* writable backing variable; everything else reads through the const pointer */
PageFeatureOffsets offsets = {0};
const PageFeatureOffsets *exposedOffsets = &offsets;

/* called once at startup; casts away const to populate the fields */
void InitOffsets(void) {
    *((Size *) &offsets.feature0offset) = ...;
    *((Size *) &offsets.feature1offset) = ...;
    ...
}

- Recomputing offsets every time in PageGetFeatureOffset() seems too
  expensive. The offsets can't change while running as PageGetFeatureOffset()
  have enough information to distinguish between different kinds of relations

Yes, this was a simple approach for ease of implementation; there is certainly a way to precompute a lookup table from the page feature bitmask into the offsets themselves or otherwise precompute, turn from function call into inline/macro, etc.
 
  - so why do we need to recompute offsets on every single page?  I'd instead
  add a distinct offset variable for each feature.

This would work iff there is a single page feature set across all pages in the cluster; I'm not sure we don't want more flexibility here.
 
- Modifying every single PageInit() call doesn't make sense to me. That'll
  just create a lot of breakage for - as far as I can tell - no win.

This was a placeholder to allow different features depending on page type; to keep things simple for now I just used the same values here, but we could move this inside PageInit() instead (again, assuming single feature set per cluster).
 
- Why is it worth sacrificing space on every page to indicate which features
  were enabled?  I think there'd need to be some convincing reasons for
  introducing such overhead.

The point here is that if we can use either a GCM authtag or a stronger checksum then we've gained the ability to authenticate the page contents at the cost of reassigning those bits, in a way that would support varying permutations of features for different relations or page types, if so desired.  A single global setting both eliminates that possibility and requires external data in order to fully interpret pages.
 
- Is it really useful to encode the set of features enabled in a cluster with
  a bitmask? That pretty much precludes utilizing extra page space in
  extensions. We could instead just have an extra cluster-wide file that
  defines a mapping of offset to feature.

Given the current design, yes we do need that, which does make it harder to allocate/use from an extension.  Because offsets need to stay consistent for a given feature set (however it is represented on a page), the implementation burden going forward is ensuring that a given bit always maps to the same offset in the page regardless of any additional features added in the future.  So bit 0, if enabled, would always map to the 8-byte chunk at the end of the page, bit 1 to some amount of space immediately before that, etc.  I'm not sure how to get that property without some sort of bitmap or otherwise indexed operation.
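
A minimal sketch of that bit-to-offset rule (invented names and sizes, not the
patch's actual code): each enabled feature claims its fixed size working back
from the end of the page, so a given bit's location never depends on any
higher-numbered bits being added later:

    /* illustrative sizes only: e.g. an 8-byte chunk for bit 0, etc. */
    static const Size feature_sizes[] = {8, 16, 32};

    /*
     * Distance back from the end of the page at which the given feature's
     * space begins.  Depends only on bits 0..feature_id, so adding future
     * (higher) bits never moves existing data.  Caller has already verified
     * that feature_id's bit is set in enabled_mask.
     */
    static Size
    PageFeatureOffsetFromEnd(uint16 enabled_mask, int feature_id)
    {
        Size    off = 0;
        int     i;

        for (i = 0; i <= feature_id; i++)
        {
            if (enabled_mask & (1 << i))
                off += feature_sizes[i];
        }
        return off;
    }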

I get what you're saying as far as the more global approach, and while that does lend itself to some nice properties in terms of extensibility, some of the features (GCM tags in particular) need to live at a consistent page offset so we can decode the rest of the page without knowing anything else.
 
Additionally, since the reserved space/page features are configured at initdb time, I am unclear how a given extension would even be able to stake a claim here.  ...though if we consider this a two-part problem, one of space reservation and one of space usage, that part could be handled by allocating more than the minimum in reserved_page_space and allowing the unallocated page space to be claimed later via some additional functions or another hook.  That opens up other questions, though: tracking whether said space has ever been initialized, and what to do when first accessing existing vs. new pages, for example.

Best,

David

Re: [PATCHES] Post-special page storage TDE support

From
Andres Freund
Date:
Hi,

On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
> On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost <sfrost@snowman.net> wrote:
> > In conversations with folks (my memory specifically is a discussion with
> > Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> > there was a pretty strong push that a page should be able to 'stand
> > alone' and not depend on something else (eg: pg_control, or whatever) to
> > provide info needed be able to interpret the page.  For my part, I don't
> > have a particularly strong feeling on that, but that's what lead to this
> > design.
> 
> The term that I have used in the past is "self-contained". Meaning
> capable of being decoded more or less as-is, without any metadata, by
> tools like pg_filedump.

I'm not finding that very convincing - without cluster wide data, like keys, a
tool like pg_filedump isn't going to be able to do much with encrypted
pages. Given the need to look at some global data, figuring out the offset at
which data starts based on a value in pg_control isn't meaningfully worse than
having the data on each page.

Storing redundant data in each page header, when we've wanted space in the
page header for plenty other things, just doesn't seem a good use of said
space.

Greetings,

Andres Freund



Re: [PATCHES] Post-special page storage TDE support

From
David Christensen
Date:

On Mon, Nov 13, 2023 at 2:27 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
> On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost <sfrost@snowman.net> wrote:
> > In conversations with folks (my memory specifically is a discussion with
> > Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> > there was a pretty strong push that a page should be able to 'stand
> > alone' and not depend on something else (eg: pg_control, or whatever) to
> > provide info needed be able to interpret the page.  For my part, I don't
> > have a particularly strong feeling on that, but that's what lead to this
> > design.
>
> The term that I have used in the past is "self-contained". Meaning
> capable of being decoded more or less as-is, without any metadata, by
> tools like pg_filedump.

I'm not finding that very convincing - without cluster wide data, like keys, a
tool like pg_filedump isn't going to be able to do much with encrypted
pages. Given the need to look at some global data, figuring out the offset at
which data starts based on a value in pg_control isn't meaningfully worse than
having the data on each page.

Storing redundant data in each page header, when we've wanted space in the
page header for plenty other things, just doesn't seem a good use of said
space.

This scheme would free up space on every page that would then be available for plenty of other things; the encoding in the header and the corresponding available space in the footer would seem to open up quite a few options, no?

Re: [PATCHES] Post-special page storage TDE support

From
Andres Freund
Date:
Hi,

On 2023-11-13 14:37:47 -0600, David Christensen wrote:
> On Mon, Nov 13, 2023 at 2:27 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
> > > On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost <sfrost@snowman.net> wrote:
> > > > In conversations with folks (my memory specifically is a discussion
> > with
> > > > Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> > > > there was a pretty strong push that a page should be able to 'stand
> > > > alone' and not depend on something else (eg: pg_control, or whatever)
> > to
> > > > provide info needed be able to interpret the page.  For my part, I
> > don't
> > > > have a particularly strong feeling on that, but that's what lead to
> > this
> > > > design.
> > >
> > > The term that I have used in the past is "self-contained". Meaning
> > > capable of being decoded more or less as-is, without any metadata, by
> > > tools like pg_filedump.
> >
> > I'm not finding that very convincing - without cluster wide data, like
> > keys, a
> > tool like pg_filedump isn't going to be able to do much with encrypted
> > pages. Given the need to look at some global data, figuring out the offset
> > at
> > which data starts based on a value in pg_control isn't meaningfully worse
> > than
> > having the data on each page.
> >
> > Storing redundant data in each page header, when we've wanted space in the
> > page header for plenty other things, just doesn't seem a good use of said
> > space.
> >
> 
> This scheme would open up space per page that would now be available for
> plenty of other things; the encoding in the header and the corresponding
> available space in the footer would seem to open up quite a few options
> now, no?

Sure, if you're willing to rewrite the whole cluster to upgrade and willing to
permanently sacrifice some data density.  If the stored data is actually
specific to the page - that is the place to put the data. If not, then the
tradeoff is much more complicated IMO.

Of course this isn't a new problem - storing the page size on each page was
just silly, it's never going to change across the cluster and even more
definitely not going to change within a single relation.

Greetings,

Andres Freund



Re: [PATCHES] Post-special page storage TDE support

From
David Christensen
Date:
On Mon, Nov 13, 2023 at 2:52 PM Andres Freund <andres@anarazel.de> wrote:

> This scheme would open up space per page that would now be available for
> plenty of other things; the encoding in the header and the corresponding
> available space in the footer would seem to open up quite a few options
> now, no?

Sure, if you're willing to rewrite the whole cluster to upgrade and willing to
permanently sacrifice some data density.  If the stored data is actually
specific to the page - that is the place to put the data. If not, then the
tradeoff is much more complicated IMO.

Of course this isn't a new problem - storing the page size on each page was
just silly, it's never going to change across the cluster and even more
definitely not going to change within a single relation.

Crazy idea: since the stored pagesize is already a fixed cost that likely isn't going away, what if, rather than touching the pd_checksum field, we reinterpret pd_pagesize_version?  4 would mean "no page features", but anything 5 or higher could be looked up as an external page feature set, with storage semantics outside the realm of the page itself (other than what the page features code itself needs to know); i.e., move away from the on-page bitmap into a more abstract representation of features, which could be something along the lines of what you were suggesting, including extension support.
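
A rough illustration of that reinterpretation (hypothetical helper, not part
of any patch; today the low byte of pd_pagesize_version holds
PG_PAGE_LAYOUT_VERSION, currently 4):

    static inline uint16
    PageGetFeatureSetId(Page page)
    {
        uint16  version = ((PageHeader) page)->pd_pagesize_version & 0x00FF;

        /*
         * 4 (the current layout version) would mean "no page features";
         * anything higher identifies an externally defined feature set.
         */
        return (version > PG_PAGE_LAYOUT_VERSION) ? version : 0;
    }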

It seems like this could also support adding/removing features on page read/write as long as there was sufficient space in the reserved page space: read the old feature set on page read and convert to the new feature set, which would write the page back out with the additional/changed format.  Obviously there would be bookkeeping to be done to make sure all pages had been converted from one format to another, but at the page level this would be straightforward.

Just thinking aloud here...

David

Re: [PATCHES] Post-special page storage TDE support

From
Stephen Frost
Date:
Greetings,

On Mon, Nov 13, 2023 at 16:53 David Christensen <david.christensen@crunchydata.com> wrote:
On Mon, Nov 13, 2023 at 2:52 PM Andres Freund <andres@anarazel.de> wrote:

> This scheme would open up space per page that would now be available for
> plenty of other things; the encoding in the header and the corresponding
> available space in the footer would seem to open up quite a few options
> now, no?

Sure, if you're willing to rewrite the whole cluster to upgrade and willing to
permanently sacrifice some data density.  If the stored data is actually
specific to the page - that is the place to put the data. If not, then the
tradeoff is much more complicated IMO.

Of course this isn't a new problem - storing the page size on each page was
just silly, it's never going to change across the cluster and even more
definitely not going to change within a single relation.

Crazy idea; since stored pagesize is already a fixed cost that likely isn't going away, what if instead of the pd_checksum field, we instead reinterpret pd_pagesize_version; 4 would mean "no page features", but anything 5 or higher could be looked up as an external page feature set, with storage semantics outside of the realm of the page itself (other than what the page features code itself needs to know); i.e,. move away from the on-page bitmap into a more abstract representation of features which could be something along the lines of what you were suggesting, including extension support.

It seems like this could also support adding/removing features on page read/write as long as there was sufficient space in the reserved_page space; read the old feature set on page read, convert to the new feature set which will write out the page with the additional/changed format.  Obviously there would be bookkeeping to be done in terms of making sure all pages had been converted from one format to another, but for the page level this would be straightforward.

Just thinking aloud here...

In other crazy-idea space… if the page didn't have enough space to allow for the desired features, then force any insert/update actions to choose a different page for the new tuple while allowing deletes to do their usual thing; then, when vacuum comes along and is able to clean up the page and remove all the dead tuples, it could enable the desired features on the page…

Thanks,

Stephen

Re: [PATCHES] Post-special page storage TDE support

From
David Christensen
Date:
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

- IMO the patch touches many places it shouldn't need to touch, because of
  essentially renaming a lot of existing macro names to *Limit,
  necessitating modifying a lot of users. I think instead the few places that
  care about the runtime limit should be modified.

  As-is the patch would cause a lot of fallout in extensions that just do
  things like defining an on-stack array of Datums or such - even though all
  they'd need is to change the define to the *Limit one.

  Even leaving extensions aside, it must makes reviewing (and I'm sure
  maintaining) the patch very tedious.
 
Hi Andres et al,

So I've been looking at alternate approaches to this issue and considering how to reduce churn, and I think we still need the *Limit variants.  Let's take a simple example:

Just looking at MaxHeapTuplesPerPage and breaking down its instances in the code, loosely partitioned by whether it's used as an array index or in some other way (this doesn't distinguish code from comments, unfortunately), we get the following breakdown:

$ git grep -hoE [[]?MaxHeapTuplesPerPage | sort | uniq -c
     18 [MaxHeapTuplesPerPage
     51 MaxHeapTuplesPerPage

That is 18 places that would need a fairly mechanical adjustment to use MaxHeapTuplesPerPageLimit instead of MaxHeapTuplesPerPage, versus 51 other instances, some significant fraction of which are not comments--and even if you assumed half were in comments, those would presumably still need some adjustment of the wording since we are changing the interpretation.

I am working on a patch to clean up some of the assumptions that smgr currently makes about its space usage and how the individual access methods consider it, as they should only be calculating things based on how much space is available after smgr is done with it.  That has traditionally been BLCKSZ - SizeOfPageHeaderData, but this patch (included) factors that out into a single expression that we can use in access methods, so we can then reserve additional page space without needing to adjust the access methods further.

Building on top of this patch, we'd define something like this to handle the #defines that need to be dynamic:

extern Size reserved_page_space;
#define PageUsableSpace (BLCKSZ - SizeOfPageHeaderData - reserved_page_space)
#define MaxHeapTuplesPerPage CalcMaxHeapTuplesPerPage(PageUsableSpace)
#define MaxHeapTuplesPerPageLimit CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData)
#define CalcMaxHeapTuplesPerPage(freesize) \
      ((int) ((freesize) / \
                      (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

In my view, extensions that expect to need no changes when the interpretation of these symbols changes are better off only having to adjust their static allocations in a mechanical way than having to revisit every other use in their code; that seems more likely to guarantee a correct result than silently exceeding the usable page space and overwriting things you shouldn't because you weren't aware you needed to check the dynamic limits yourself.
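
As a hypothetical illustration of that split (invented names, not from the
patch): a static allocation keeps the compile-time *Limit bound, while the
runtime logic checks the dynamic value:

    typedef struct DeadItems
    {
        int             nitems;
        OffsetNumber    offsets[MaxHeapTuplesPerPageLimit]; /* compile-time bound */
    } DeadItems;

    static bool
    dead_items_add(DeadItems *items, OffsetNumber off)
    {
        /* runtime bound reflects reserved_page_space, always <= the Limit */
        if (items->nitems >= MaxHeapTuplesPerPage)
            return false;
        items->offsets[items->nitems++] = off;
        return true;
    }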

Take another thing which would need adjusting for reserving page space, MaxHeapTupleSize:

$ git grep -ohE '[[]?MaxHeapTupleSize' | sort | uniq -c
      3 [MaxHeapTupleSize
     16 MaxHeapTupleSize

Here there are 3 static arrays which would need to be adjusted vs. 16 other instances.  If we kept the interpretation of MaxHeapTupleSize the same and didn't adjust an extension, it would compile just fine but with too large a length compared to the smaller PageUsableSpace, so you could conceivably write into the reserved space depending on what you were doing.

(Since by definition reserved_page_space >= 0, PageUsableSpace will always be <= BLCKSZ - SizeOfPageHeaderData, so any expression computed from it will be no larger.)

In short, I think the approach I took originally will actually reduce errors out-of-core, and while there is still churn, it is necessary churn.

I can produce a second patch which implements this calc/limit atop this first one as well.

Thanks,

David
 
Attachment

Re: [PATCHES] Post-special page storage TDE support

From
David Christensen
Date:
Hi again!

Per some offline discussion with Stephen, I've continued to work on some modifications here; this particular patchset is intended to facilitate review by highlighting the mechanical nature of many of these changes.  As such, I have taken the following approach to this rework:

0001 - Create PageUsableSpace to represent space post-smgr 
0002 - Add support for fast, non-division-based div/mod algorithms
0003 - Use fastdiv code in visibility map
0004 - Make PageUsableSpace incorporate variable-sized limit
0005 - Add Calc, Limit and Dynamic forms of all variable constants
0006 - Split MaxHeapTuplesPerPage into Limit and Dynamic variants
0007 - Split MaxIndexTuplesPerPage into Limit and Dynamic variants
0008 - Split MaxHeapTupleSize into Limit and Dynamic variants
0009 - Split MaxTIDsPerBTreePage into Limit and Dynamic variant

0001 - 0003 have appeared in this thread or in other forms on the list already; 0001 refactors things slightly more aggressively, but adds StaticAssert()s to ensure that the change is still sane.
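
For the fast div/mod in 0002/0003: with reserved space, divisors like HEAPBLOCKS_PER_PAGE in the visibility map are no longer compile-time constants, so the compiler can't reduce those divisions to shifts.  One standard way around that is a precomputed multiply-and-shift; the following is a minimal sketch of the general idea (not the patch's actual code), assuming a compiler with unsigned __int128:

    #include <stdint.h>

    typedef struct FastDiv
    {
        uint64_t    magic;      /* ceil(2^64 / divisor) */
        uint32_t    divisor;
    } FastDiv;

    static FastDiv
    fastdiv_init(uint32_t divisor)
    {
        FastDiv     fd;

        /* valid for divisor >= 2 and 32-bit dividends */
        fd.divisor = divisor;
        fd.magic = UINT64_MAX / divisor + 1;
        return fd;
    }

    static uint32_t
    fastdiv(uint32_t n, FastDiv fd)
    {
        /* n / divisor == high 64 bits of magic * n */
        return (uint32_t) (((unsigned __int128) fd.magic * n) >> 64);
    }

    static uint32_t
    fastmod(uint32_t n, FastDiv fd)
    {
        uint64_t    lowbits = fd.magic * n;     /* fractional part, mod 2^64 */

        return (uint32_t) (((unsigned __int128) lowbits * fd.divisor) >> 64);
    }

The divisor would be computed once at startup from the reserved page space and then reused for every visibility map lookup.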

0004 adds the ReservedPageSpace variable, and also redefines the previous BLCKSZ - SizeOfPageHeaderData as PageUsableSpaceMax; there are a few related fixups.

0005 adds the macros to compute the former constants while leaving their original definitions to evaluate to the same place (the infamous Calc* and *Limit, plus we invite *Dynamic to the party as well; the names are terrible and there must be something better)

0006 - 0009 are all the same approach; we undefine the old constant name and modify the existing uses of this symbol to be either the *Limit or *Dynamic, depending on if the changed available space would impact the calculations.  Since we are touching every use of this symbol, this facilitates review of the impact, though I would contend that almost every piece I've spot-checked seems like it really does need to know about the runtime limit.  Perhaps there is more we could do here.  I could also see a variable per constant rather than recalculating this every time, in which case the *Dynamic would just be the variable and we'd need a hook to initialize this or otherwise set on first use.

There are a number of additional things remaining to be done to get this to fully work, but I did want to get some of this out there for review.

Still to do (almost all in some form in original patch, so just need to extract the relevant pieces):
- set reserved-page-size via initdb
- load reserved-page-size from pg_control
- apply to the running cluster
- some form of compatibility for these constants in common and ensuring bin/ works
- some toast-related changes (this requires a patch to support dynamic relopts, which I can extract, as the existing code is using a constant lookup table)
- probably some more pieces I'm forgetting

Thanks,
David


Attachment

Re: [PATCHES] Post-special page storage TDE support

From
David Christensen
Date:
Hi,

I have finished the reworking of this particular patch series, and have tried to
organize it in such a way that it will be easily reviewable.  It is constructed
progressively so that it is possible to follow what is happening here.  As such,
each individual commit is not guaranteed to compile on its own, so the whole
series would need to be applied before it works. (It does pass CI tests.)

Here is a brief roadmap of the patches; some of them have additional details in
the commit message describing a little more about them.

These two patches do some refactoring of existing code to make a common place to
modify the definitions:

v3-0001-refactor-Create-PageUsableSpace-to-represent-spac.patch
v3-0002-refactor-Make-PageGetUsablePageSize-routine.patch

These two patches add the ReservedPageSize variable and teach PageInit to use it to
adjust sizing accordingly:

v3-0003-feature-Add-ReservedPageSize-variable.patch
v3-0004-feature-Adjust-page-sizes-at-PageInit.patch

This patch modifies the definitions of 4 symbols to be computed based on
PageUsableSpace:

v3-0005-feature-Create-Calc-Limit-and-Dynamic-forms-for-f.patch

The following 4 patches are mechanical replacements of all existing uses of
these symbols; this provides both visibility into where the existing symbol is
used as well as distinguishing between parts that care about static allocation
vs dynamic usage.  The only non-mechanical change is to remove the definition of
the old symbol so we can be guaranteed that all uses have been considered:

v3-0006-chore-Split-MaxHeapTuplesPerPage-into-Limit-and-D.patch
v3-0007-chore-Split-MaxIndexTuplesPerPage-into-Limit-and-.patch
v3-0008-chore-Split-MaxHeapTupleSize-into-Limit-and-Dynam.patch
v3-0009-chore-Split-MaxTIDsPerBTreePage-into-Limit-and-Dy.patch

The following patches are related to required changes to support dynamic toast
limits:

v3-0010-feature-Add-hook-for-setting-reloptions-defaults-.patch
v3-0011-feature-Dynamically-calculate-toast_tuple_target.patch
v3-0012-feature-Add-Calc-options-for-toast-related-pieces.patch
v3-0013-chore-Replace-TOAST_MAX_CHUNK_SIZE-with-ClusterTo.patch
v3-0014-chore-Translation-updates-for-TOAST_MAX_CHUNK_SIZ.patch

In order to calculate some of the sizes, we need to include nbtree.h internals,
but we can't use that in frontend apps, so we separate out the pieces we care about
into a separate include and use that:

v3-0015-chore-Split-nbtree.h-structure-defs-into-an-inter.patch

This is the meat of the patch: provide a common location for these
block-size-related constants to be computed using the infra that has been set up
so far, and ensure that we are properly initializing this in frontend and
backend code.  A tricky piece here is that we have two separate include files for
blocksize.h: one which exposes the externs as consts for optimizations, and one that
blocksize.c itself uses without consts, which it uses to create/initialize the
vars:

v3-0016-feature-Calculate-all-blocksize-constants-in-a-co.patch
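
A rough sketch of that const-extern arrangement (invented names, and
collapsing the two headers into one with an #ifdef for brevity): most of the
tree sees the variables as const, so the compiler can treat them as
loop-invariant, while blocksize.c compiles the same declarations without
const so it can assign them during initialization:

    /* blocksize.h */
    #ifdef BLOCKSIZE_INTERNAL
    #define BLOCKSIZE_CONST
    #else
    #define BLOCKSIZE_CONST const
    #endif

    extern BLOCKSIZE_CONST Size ClusterPageUsableSpace;
    extern BLOCKSIZE_CONST int  ClusterMaxHeapTuplesPerPage;

    /* blocksize.c */
    #define BLOCKSIZE_INTERNAL
    #include "blocksize.h"

    Size    ClusterPageUsableSpace;
    int     ClusterMaxHeapTuplesPerPage;

    void
    BlockSizeInit(Size reserved_page_size)
    {
        ClusterPageUsableSpace = BLCKSZ - SizeOfPageHeaderData - reserved_page_size;
        ClusterMaxHeapTuplesPerPage =
            CalcMaxHeapTuplesPerPage(ClusterPageUsableSpace);
    }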

Add ControlFile and GUC support for reserved_page_size:

v3-0017-feature-ControlFile-GUC-support-for-reserved_page.patch

Add initdb support for reserving page space:

v3-0018-feature-Add-reserved_page_size-to-initdb-bootstra.patch

Fixes for pg_resetwal:

v3-0019-feature-Updates-for-pg_resetwal.patch

The following 4 patches mechanically replace the Dynamic form to use the new
Cluster variables:

v3-0020-chore-Rename-MaxHeapTupleSizeDynamic-to-ClusterMa.patch
v3-0021-chore-Rename-MaxHeapTuplesPerPageDynamic-to-Clust.patch
v3-0022-chore-Rename-MaxIndexTuplesPerPageDynamic-to-Clus.patch
v3-0023-chore-Rename-MaxTIDsPerBTreePageDynamic-to-Cluste.patch

Two pieces of optimization required for visibility map:

v3-0024-optimization-Add-support-for-fast-non-division-ba.patch
v3-0025-optimization-Use-fastdiv-code-in-visibility-map.patch

Update bufpage.h comments:

v3-0026-doc-update-bufpage-docs-w-reserved-space-data.patch

Fixes for bloom to use runtime size:

v3-0027-feature-Teach-bloom-about-PageUsableSpace.patch

Fixes for FSM to use runtime size:

v3-0028-feature-teach-FSM-about-reserved-page-space.patch

I hope this makes sense for reviewing, I know it's a big job, so breaking things up a little more and organizing will hopefully help.

Best,

David


Attachment

Re: [PATCHES] Post-special page storage TDE support

From
Aleksander Alekseev
Date:
Hi David,

> I have finished the reworking of this particular patch series, and have tried to
> organize this in such a way that it will be easily reviewable.  It is
> constructed progressively to be able to follow what is happening here.  As such,
> each individual commit is not guaranteed to compile on its own, so the whole
> series would need to be applied before it works. (It does pass CI tests.)
>
> Here is a brief roadmap of the patches; some of them have additional details in
> the commit message describing a little more about them.
>
> These two patches do some refactoring of existing code to make a common place to
> modify the definitions:
>
> v3-0001-refactor-Create-PageUsableSpace-to-represent-spac.patch
> v3-0002-refactor-Make-PageGetUsablePageSize-routine.patch
>
> These two patches add the ReservedPageSize variable and teach PageInit to use to
> adjust sizing accordingly:
>
> v3-0003-feature-Add-ReservedPageSize-variable.patch
> v3-0004-feature-Adjust-page-sizes-at-PageInit.patch
>
> This patch modifies the definitions of 4 symbols to be computed based on
> PageUsableSpace:
>
> v3-0005-feature-Create-Calc-Limit-and-Dynamic-forms-for-f.patch
>
> These following 4 patches are mechanical replacements of all existing uses of
> these symbols; this provides both visibility into where the existing symbol is
> used as well as distinguishing between parts that care about static allocation
> vs dynamic usage.  The only non-mechanical change is to remove the definition of
> the old symbol so we can be guaranteed that all uses have been considered:
>
> v3-0006-chore-Split-MaxHeapTuplesPerPage-into-Limit-and-D.patch
> v3-0007-chore-Split-MaxIndexTuplesPerPage-into-Limit-and-.patch
> v3-0008-chore-Split-MaxHeapTupleSize-into-Limit-and-Dynam.patch
> v3-0009-chore-Split-MaxTIDsPerBTreePage-into-Limit-and-Dy.patch
>
> The following patches are related to required changes to support dynamic toast
> limits:
>
> v3-0010-feature-Add-hook-for-setting-reloptions-defaults-.patch
> v3-0011-feature-Dynamically-calculate-toast_tuple_target.patch
> v3-0012-feature-Add-Calc-options-for-toast-related-pieces.patch
> v3-0013-chore-Replace-TOAST_MAX_CHUNK_SIZE-with-ClusterTo.patch
> v3-0014-chore-Translation-updates-for-TOAST_MAX_CHUNK_SIZ.patch
>
> In order to calculate some of the sizes, we need to include nbtree.h internals,
> but we can't use in front-end apps, so we separate out the pieces we care about
> into a separate include and use that:
>
> v3-0015-chore-Split-nbtree.h-structure-defs-into-an-inter.patch
>
> This is the meat of the patch; provide a common location for these
> block-size-related constants to be computed using the infra that has been set up
> so far.  Also ensure that we are properly initializing this in front end and
> back end code.  A tricky piece here is we have two separate include files for
> blocksize.h; one which exposes externs as consts for optimizations, and one that
> blocksize.c itself uses without consts, which it uses to create/initialized the
> vars:
>
> v3-0016-feature-Calculate-all-blocksize-constants-in-a-co.patch
>
> Add ControlFile and GUC support for reserved_page_size:
>
> v3-0017-feature-ControlFile-GUC-support-for-reserved_page.patch
>
> Add initdb support for reserving page space:
>
> v3-0018-feature-Add-reserved_page_size-to-initdb-bootstra.patch
>
> Fixes for pg_resetwal:
>
> v3-0019-feature-Updates-for-pg_resetwal.patch
>
> The following 4 patches mechanically replace the Dynamic form to use the new
> Cluster variables:
>
> v3-0020-chore-Rename-MaxHeapTupleSizeDynamic-to-ClusterMa.patch
> v3-0021-chore-Rename-MaxHeapTuplesPerPageDynamic-to-Clust.patch
> v3-0022-chore-Rename-MaxIndexTuplesPerPageDynamic-to-Clus.patch
> v3-0023-chore-Rename-MaxTIDsPerBTreePageDynamic-to-Cluste.patch
>
> Two pieces of optimization required for visibility map:
>
> v3-0024-optimization-Add-support-for-fast-non-division-ba.patch
> v3-0025-optimization-Use-fastdiv-code-in-visibility-map.patch
>
> Update bufpage.h comments:
>
> v3-0026-doc-update-bufpage-docs-w-reserved-space-data.patch
>
> Fixes for bloom to use runtime size:
>
> v3-0027-feature-Teach-bloom-about-PageUsableSpace.patch
>
> Fixes for FSM to use runtime size:
>
> v3-0028-feature-teach-FSM-about-reserved-page-space.patch
>
> I hope this makes sense for reviewing, I know it's a big job, so breaking things up a little more and organizing will
> hopefully help.
 

Just wanted to let you know that the patchset seems to need a rebase,
according to cfbot.

Best regards,
Aleksander Alekseev (wearing a co-CFM hat)



Re: [PATCHES] Post-special page storage TDE support

From
David Christensen
Date:
Hi Aleksander et al,

Enclosing v4 of this patch series, rebased atop the
constant-splitting series[1].  For the purposes of keeping cfbot happy,
I am including the prerequisites as a squashed commit v4-0000; however,
this is not technically part of this series.

The roadmap this time is similar to the last series, with some
improvements in the form of a few bug fixes and other
reorganizations/cleanups.  With the prerequisite rework, we are able
to eliminate a number of patches from the previous series.

Squashed prerequisites, out of scope for review:
v4-0000-squashed-prerequisites.patch

Refactoring some of the existing uses of BLCKSZ and SizeOfPageHeaderData:
v4-0001-refactor-Create-PageUsableSpace-to-represent-spac.patch
v4-0002-refactor-Make-PageGetUsablePageSize-routine.patch
v4-0003-feature-Add-ReservedPageSize-variable.patch
v4-0004-feature-Adjust-page-sizes-at-PageInit.patch

Making TOAST dynamic:
v4-0005-feature-Add-hook-for-setting-reloptions-defaults-.patch
v4-0006-feature-Add-Calc-options-for-toast-related-pieces.patch
v4-0007-feature-Dynamically-calculate-toast_tuple_target.patch
v4-0008-chore-Replace-TOAST_MAX_CHUNK_SIZE-with-ClusterTo.patch
v4-0009-chore-Translation-updates-for-TOAST_MAX_CHUNK_SIZ.patch

Infra/support for blocksize calculations:
v4-0010-chore-Split-nbtree.h-structure-defs-into-an-inter.patch
v4-0011-Control-File-support-for-reserved_page_size.patch
v4-0012-feature-Calculate-all-blocksize-constants-in-a-co.patch

GUC/initdb/bootstrap support for setting reserved-page-size:
v4-0013-GUC-for-reserved_page_size.patch
v4-0014-feature-Add-reserved_page_size-to-initdb-bootstra.patch
v4-0015-feature-Updates-for-pg_resetwal.patch

Optimization of VisMap:
v4-0016-optimization-Add-support-for-fast-non-division-ba.patch
v4-0017-optimization-Use-fastdiv-code-in-visibility-map.patch

Docs:
v4-0018-doc-update-bufpage-docs-w-reserved-space-data.patch

Misc cleanup/fixes:
v4-0019-feature-Teach-bloom-about-PageUsableSpace.patch
v4-0020-feature-teach-FSM-about-reserved-page-space.patch
v4-0021-feature-expose-reserved_page_size-in-SQL-controld.patch

Write out of init options that are relevant:
v4-0022-feature-save-out-our-initialization-options.patch

A few notes:
- There was a bug in the previous VisMap in v3 which resulted in
treating the page size as smaller than it was. This has been fixed.

- v4-0022 is new, but useful for the page features going forward, and
should simplify some things like using `pg_resetwal` or other places
that really need to know how initdb was initialized.

- I have done some performance measurements of this feature vs. unpatched
postgres.  Since the biggest place this seemed to affect was the
visibility map (per profiling), I constructed an index-only scan test
case which measured a nested loop against index-only lookups
with something like 20M rows in the index and 1M generate_series
rows, measuring the differences between the approach we are using
(and several others), and showing a trimmed-mean difference of < 0.005 in
execution time.[2]  This seems acceptable (if not just noise), so I would be
interested in any performance deviations others encounter.

Thanks,

David

[1] https://commitfest.postgresql.org/47/4828/
[2] https://www.pgguru.net/2024-03-13-vismap-benchmarking.txt

Attachment