Re: [PATCHES] Post-special page storage TDE support

From: Stephen Frost
Subject: Re: [PATCHES] Post-special page storage TDE support
Msg-id: ZF7QLFgV/Xv4aD04@tamriel.snowman.net
In response to: Re: [PATCHES] Post-special page storage TDE support (David Christensen <david.christensen@crunchydata.com>)
List: pgsql-hackers

Greetings,

* David Christensen (david.christensen@crunchydata.com) wrote:
> Refreshing this with HEAD as of today, v4.

Thanks for updating this!

> Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>
> This space is reserved for extended data on the Page structure which will be ultimately used for
> encrypted data, extended checksums, and potentially other things.  This data appears at the end of
> the Page, after any `pd_special` area, and will be calculated at runtime based on specific
> ControlFile features.
>
> No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
> we will require logical replication to move data into a cluster with different settings here.

This initial patch, at least, does maintain pg_upgrade compatibility,
as reserved_page_size (maybe not a great name?) is set to 0, right?
Basically this is just introducing the concept of a reserved_page_size
and adjusting all of the code that currently uses BLCKSZ or
PageGetPageSize() to account for this extra space.

Looking at the changes to bufpage.h, in particular ...

> diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h

> @@ -19,6 +19,14 @@
>  #include "storage/item.h"
>  #include "storage/off.h"
>
> +extern PGDLLIMPORT int reserved_page_size;
> +
> +#define SizeOfPageReservedSpace() reserved_page_size
> +#define MaxSizeOfPageReservedSpace 0
> +
> +/* strict upper bound on the amount of space occupied we have reserved on
> + * pages in this cluster */

This will eventually be calculated based on what features are supported
concurrently?
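
(Guessing at what that might eventually look like, with made-up
constants for the features the commit message mentions:

    #define MaxSizeOfPageReservedSpace \
        (MAXALIGN(EXTENDED_CHECKSUM_MAX_LEN) + MAXALIGN(ENCRYPTION_TAG_MAX_LEN))

or whatever the extended-checksum and encryption-tag sizes turn out to
be.)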

> @@ -36,10 +44,10 @@
>   * |             v pd_upper                              |
>   * +-------------+------------------------------------+
>   * |             | tupleN ...                         |
> - * +-------------+------------------+-----------------+
> - * |       ... tuple3 tuple2 tuple1 | "special space" |
> - * +--------------------------------+-----------------+
> - *                                    ^ pd_special
> + * +-------------+-----+------------+----+------------+
> + * | ... tuple2 tuple1 | "special space" | "reserved" |
> + * +-------------------+------------+----+------------+
> + *                       ^ pd_special      ^ reserved_page_space

Right, adds a dynamic amount of space 'post-special area'.

> @@ -73,6 +81,8 @@
>   * stored as the page trailer.  an access method should always
>   * initialize its pages with PageInit and then set its own opaque
>   * fields.
> + *
> + * XXX - update more comments here about reserved_page_space
>   */

Would be good to do. ;)

> @@ -325,7 +335,7 @@ static inline void
>  PageValidateSpecialPointer(Page page)
>  {
>      Assert(page);
> -    Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> +    Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
>      Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
>  }

This is just one usage, but it seems like maybe we should be using
PageGetPageSize() here instead of BLCKSZ, and that more-or-less
throughout?  Nearly every place we use BLCKSZ today for the
compile-time advantage of a fixed block size is going to lose that
advantage anyway, thanks to reserved_page_size being run-time.  Now,
one upside to this is that it'd also get us closer to being able to
support dynamic block sizes concurrently, which would be quite
interesting.  That is, a special tablespace with a 32KB block size
while the rest use the traditional 8KB.  This would likely require
multiple shared buffer pools, of course...
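
That is, this Assert could perhaps become (sketch only):

    Assert((((PageHeader) page)->pd_special + reserved_page_size) <=
           PageGetPageSize(page));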

> diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
> index 9a302ddc30..a93cd9df9f 100644
> --- a/src/backend/storage/page/bufpage.c
> +++ b/src/backend/storage/page/bufpage.c
> @@ -26,6 +26,8 @@
>  /* GUC variable */
>  bool        ignore_checksum_failure = false;
>
> +int            reserved_page_size = 0; /* how much page space to reserve for extended unencrypted metadata */
> +
>
>  /* ----------------------------------------------------------------
>   *                        Page support functions
> @@ -43,7 +45,7 @@ PageInit(Page page, Size pageSize, Size specialSize)
>  {
>      PageHeader    p = (PageHeader) page;
>
> -    specialSize = MAXALIGN(specialSize);
> +    specialSize = MAXALIGN(specialSize) + reserved_page_size;

Rather than make it part of specialSize, I would think we'd be better
off just treating them independently.  E.g., the later pd_upper setting
would become:

p->pd_upper = pageSize - specialSize - reserved_page_size;

etc.

> @@ -186,7 +188,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
>   *    one that is both unused and deallocated.
>   *
>   *    If flag PAI_IS_HEAP is set, we enforce that there can't be more than
> - *    MaxHeapTuplesPerPage line pointers on the page.
> + *    MaxHeapTuplesPerPage() line pointers on the page.

Making MaxHeapTuplesPerPage() runtime dynamic is a requirement for
supporting multiple page sizes concurrently ... but I'm not sure it's
actually required for the reserved_page_size idea as currently
considered.  The reason is that with 8K or larger pages, the amount of
space we're already throwing away is at least 20 bytes, if I did my math
right.  If we constrain reserved_page_size to be 20 bytes or less, as I
believe we're currently thinking we won't need that much, then we could
perhaps keep MaxHeapTuplesPerPage as a compile-time constant.
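
(Spelling out that math, if I'm reading htup_details.h right:
MaxHeapTuplesPerPage is (BLCKSZ - SizeOfPageHeaderData) /
(MAXALIGN(SizeOfHeapTupleHeader) + sizeof(ItemIdData)), so with 8K
pages that's (8192 - 24) / (24 + 4) = 291 tuples, which use
291 * 28 = 8148 of the 8168 available bytes, leaving exactly 20
unused.)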

On the other hand, to the extent that we want to consider having
variable page sizes in the future, perhaps we do want to change this.
If so, the approach broadly looks reasonable to me, but I'd suggest we
make that a separate patch from the introduction of reserved_page_size.

> @@ -211,7 +213,7 @@ PageAddItemExtended(Page page,
>      if (phdr->pd_lower < SizeOfPageHeaderData ||
>          phdr->pd_lower > phdr->pd_upper ||
>          phdr->pd_upper > phdr->pd_special ||
> -        phdr->pd_special > BLCKSZ)
> +        phdr->pd_special + reserved_page_size > BLCKSZ)
>          ereport(PANIC,
>                  (errcode(ERRCODE_DATA_CORRUPTED),
>                   errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",

Probably should add reserved_page_size to that errmsg output?  Also,
this check of the page pointers seems to be done in multiple places;
maybe it should be moved into a #define or a small helper?
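
Something along these lines, perhaps (name is just a strawman):

    static inline bool
    PagePointersAreValid(PageHeader phdr)
    {
        return phdr->pd_lower >= SizeOfPageHeaderData &&
            phdr->pd_lower <= phdr->pd_upper &&
            phdr->pd_upper <= phdr->pd_special &&
            phdr->pd_special + reserved_page_size <= BLCKSZ &&
            phdr->pd_special == MAXALIGN(phdr->pd_special);
    }

(Though this first caller skips the MAXALIGN test that the later ones
include, so it might need a flag or a second variant.)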

> @@ -723,7 +725,7 @@ PageRepairFragmentation(Page page)
>      if (pd_lower < SizeOfPageHeaderData ||
>          pd_lower > pd_upper ||
>          pd_upper > pd_special ||
> -        pd_special > BLCKSZ ||
> +        pd_special + reserved_page_size > BLCKSZ ||
>          pd_special != MAXALIGN(pd_special))
>          ereport(ERROR,
>                  (errcode(ERRCODE_DATA_CORRUPTED),

This ends up being the same as above ...

> @@ -1066,7 +1068,7 @@ PageIndexTupleDelete(Page page, OffsetNumber offnum)
>      if (phdr->pd_lower < SizeOfPageHeaderData ||
>          phdr->pd_lower > phdr->pd_upper ||
>          phdr->pd_upper > phdr->pd_special ||
> -        phdr->pd_special > BLCKSZ ||
> +        phdr->pd_special + reserved_page_size > BLCKSZ ||
>          phdr->pd_special != MAXALIGN(phdr->pd_special))
>          ereport(ERROR,
>                  (errcode(ERRCODE_DATA_CORRUPTED),

And here ...

> @@ -1201,7 +1203,7 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
>      if (pd_lower < SizeOfPageHeaderData ||
>          pd_lower > pd_upper ||
>          pd_upper > pd_special ||
> -        pd_special > BLCKSZ ||
> +        pd_special + reserved_page_size > BLCKSZ ||
>          pd_special != MAXALIGN(pd_special))
>          ereport(ERROR,
>                  (errcode(ERRCODE_DATA_CORRUPTED),

And here ...

> @@ -1307,7 +1309,7 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
>      if (phdr->pd_lower < SizeOfPageHeaderData ||
>          phdr->pd_lower > phdr->pd_upper ||
>          phdr->pd_upper > phdr->pd_special ||
> -        phdr->pd_special > BLCKSZ ||
> +        phdr->pd_special + reserved_page_size > BLCKSZ ||
>          phdr->pd_special != MAXALIGN(phdr->pd_special))
>          ereport(ERROR,
>                  (errcode(ERRCODE_DATA_CORRUPTED),

And here ...

> @@ -1419,7 +1421,7 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
>      if (phdr->pd_lower < SizeOfPageHeaderData ||
>          phdr->pd_lower > phdr->pd_upper ||
>          phdr->pd_upper > phdr->pd_special ||
> -        phdr->pd_special > BLCKSZ ||
> +        phdr->pd_special + reserved_page_size > BLCKSZ ||
>          phdr->pd_special != MAXALIGN(phdr->pd_special))
>          ereport(ERROR,
>                  (errcode(ERRCODE_DATA_CORRUPTED),

And here ...

> diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
> index 6979aff727..060c4ab3e3 100644
> --- a/contrib/amcheck/verify_nbtree.c
> +++ b/contrib/amcheck/verify_nbtree.c
> @@ -489,12 +489,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
>          /*
>           * Size Bloom filter based on estimated number of tuples in index,
>           * while conservatively assuming that each block must contain at least
> -         * MaxTIDsPerBTreePage / 3 "plain" tuples -- see
> +         * MaxTIDsPerBTreePage() / 3 "plain" tuples -- see
>           * bt_posting_plain_tuple() for definition, and details of how posting
>           * list tuples are handled.
>           */
>          total_pages = RelationGetNumberOfBlocks(rel);
> -        total_elems = Max(total_pages * (MaxTIDsPerBTreePage / 3),
> +        total_elems = Max(total_pages * (MaxTIDsPerBTreePage() / 3),
>                            (int64) state->rel->rd_rel->reltuples);
>          /* Generate a random seed to avoid repetition */
>          seed = pg_prng_uint64(&pg_global_prng_state);

Making MaxTIDsPerBTreePage dynamic looks to be required, as it doesn't
end up with enough 'leftover' space, from what I can tell.  Again, though,
perhaps this should be split out as an independent patch from the rest.
That is- we can change the higher-level functions to be dynamic in the
initial patches, and then eventually we'll get down to making the
lower-level functions dynamic.
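
(By my math, assuming a 16-byte BTPageOpaqueData: (8192 - 24 - 16) /
sizeof(ItemPointerData) = 8152 / 6 = 1358 TIDs, with only 4 bytes left
over, which isn't enough to hide any reserved space in.)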

> diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
> index efdf9415d1..8ebabdd7ee 100644
> --- a/contrib/bloom/bloom.h
> +++ b/contrib/bloom/bloom.h
> @@ -131,7 +131,7 @@ typedef struct BloomMetaPageData
>  #define BLOOM_MAGICK_NUMBER (0xDBAC0DED)
>
>  /* Number of blocks numbers fit in BloomMetaPageData */
> -#define BloomMetaBlockN        (sizeof(FreeBlockNumberArray) / sizeof(BlockNumber))
> +#define BloomMetaBlockN()        ((sizeof(FreeBlockNumberArray) - SizeOfPageReservedSpace())/ sizeof(BlockNumber))
>
>  #define BloomPageGetMeta(page)    ((BloomMetaPageData *) PageGetContents(page))
>
> @@ -151,6 +151,7 @@ typedef struct BloomState
>
>  #define BloomPageGetFreeSpace(state, page) \
>      (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
> +        - SizeOfPageReservedSpace()                                  \
>          - BloomPageGetMaxOffset(page) * (state)->sizeOfBloomTuple \
>          - MAXALIGN(sizeof(BloomPageOpaqueData)))

This formulation (or something close to it) tends to happen quite a bit:

(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - SizeOfPageReservedSpace() ...

This is basically asking for the "amount of usable space" on a page,
where the resulting usable space sometimes includes line pointers and
tuples and sometimes doesn't.  Perhaps we should break this down into
two patches: one which provides a function to return the usable space
on a page, and then the patch to add reserved_page_size can simply
adjust that function instead of changing the very, very many places
where we have this formulation.
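
E.g., a strawman for the simplest variant:

    static inline Size
    PageUsableSpace(void)
    {
        /* space available for line pointers, tuples and the special area */
        return BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - SizeOfPageReservedSpace();
    }

and then things like BloomPageGetFreeSpace() could be defined in terms
of PageUsableSpace() minus the AM-specific overheads.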

> diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
> index d935ed8fbd..d3d74a9d28 100644
> --- a/contrib/bloom/blutils.c
> +++ b/contrib/bloom/blutils.c
> @@ -430,10 +430,10 @@ BloomFillMetapage(Relation index, Page metaPage)
>       */
>      BloomInitPage(metaPage, BLOOM_META);
>      metadata = BloomPageGetMeta(metaPage);
> -    memset(metadata, 0, sizeof(BloomMetaPageData));
> +    memset(metadata, 0, sizeof(BloomMetaPageData) - SizeOfPageReservedSpace());

This doesn't seem quite right?  The reserved space is off at the end of
the page, while this is 0'ing the space immediately after the page
header, if I'm following correctly, and now zeroing less than the full
size of BloomMetaPageData...

>      metadata->magickNumber = BLOOM_MAGICK_NUMBER;
>      metadata->opts = *opts;
> -    ((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData);
> +    ((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData) - SizeOfPageReservedSpace();

Not quite following what's going on here either.

> diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
> @@ -116,7 +116,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
>           */
>          if (BloomPageGetMaxOffset(page) != 0 &&
>              BloomPageGetFreeSpace(&state, page) >= state.sizeOfBloomTuple &&
> -            countPage < BloomMetaBlockN)
> +            countPage < BloomMetaBlockN())
>              notFullPage[countPage++] = blkno;

Looks to be another opportunity to have a separate patch making this
change first, before actually changing the lower-level #defines.

> diff --git a/src/backend/access/brin/brin_tuple.c b/src/backend/access/brin/brin_tuple.c
> @@ -217,7 +217,7 @@ brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno, BrinMemTuple *tuple,
>               * datatype, try to compress it in-line.
>               */
>              if (!VARATT_IS_EXTENDED(DatumGetPointer(value)) &&
> -                VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET &&
> +                VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET() &&
>                  (atttype->typstorage == TYPSTORAGE_EXTENDED ||
>                   atttype->typstorage == TYPSTORAGE_MAIN))
>              {

Probably could be another patch, but also, if we're going to change
TOAST_INDEX_TARGET to be a function, we should probably not have it
named in all-CAPS.

> diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
> @@ -535,7 +535,7 @@ dataBeginPlaceToPageLeaf(GinBtree btree, Buffer buf, GinBtreeStack *stack,
>           * a single byte, and we can use all the free space on the old page as
>           * well as the new page. For simplicity, ignore segment overhead etc.
>           */
> -        maxitems = Min(maxitems, freespace + GinDataPageMaxDataSize);
> +        maxitems = Min(maxitems, freespace + GinDataPageMaxDataSize());
>      }
>      else
>      {

Ditto.

> diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
> @@ -38,8 +38,8 @@
>  /* GUC parameter */
>  int            gin_pending_list_limit = 0;
>
> -#define GIN_PAGE_FREESIZE \
> -    ( BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - MAXALIGN(sizeof(GinPageOpaqueData)) )
> +#define GIN_PAGE_FREESIZE() \
> +    ( BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - MAXALIGN(sizeof(GinPageOpaqueData)) - SizeOfPageReservedSpace() )

Another case of BLCKSZ - MAXALIGN(SizeOfPageHeaderData) -
SizeOfPageReservedSpace() ...

> @@ -450,7 +450,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
>       * ginInsertCleanup() should not be called inside our CRIT_SECTION.
>       */
>      cleanupSize = GinGetPendingListCleanupSize(index);
> -    if (metadata->nPendingPages * GIN_PAGE_FREESIZE > cleanupSize * 1024L)
> +    if (metadata->nPendingPages * GIN_PAGE_FREESIZE() > cleanupSize * 1024L)
>          needCleanup = true;

Also shouldn't be all-CAPS.

> diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
> index 43b67893d9..5babbb457a 100644
> --- a/src/backend/access/nbtree/nbtsplitloc.c
> +++ b/src/backend/access/nbtree/nbtsplitloc.c
> @@ -156,7 +156,7 @@ _bt_findsplitloc(Relation rel,
>
>      /* Total free space available on a btree page, after fixed overhead */
>      leftspace = rightspace =
> -        PageGetPageSize(origpage) - SizeOfPageHeaderData -
> +        PageGetPageSize(origpage) - SizeOfPageHeaderData - SizeOfPageReservedSpace() -
>          MAXALIGN(sizeof(BTPageOpaqueData));

Also here ... though a bit interesting that this uses PageGetPageSize()
instead of BLCKSZ.

> diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
> index 011ec18015..022b5eee4e 100644
> --- a/src/backend/utils/init/globals.c
> +++ b/src/backend/utils/init/globals.c
> @@ -154,3 +154,4 @@ int64        VacuumPageDirty = 0;
>
>  int            VacuumCostBalance = 0;    /* working state for vacuum */
>  bool        VacuumCostActive = false;
> +

Unnecessary whitespace hunk?

Thanks!

Stephen
