Re: longfin and tamandua aren't too happy but I'm not sure why - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: longfin and tamandua aren't too happy but I'm not sure why
Date
Msg-id CAH2-Wzkx8q2S0zXDNGVT1f0sfvWGL+n5yz0Uzy7EOOFOhaQ9YA@mail.gmail.com
In response to Re: longfin and tamandua aren't too happy but I'm not sure why  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: longfin and tamandua aren't too happy but I'm not sure why
List pgsql-hackers
On Wed, Sep 28, 2022 at 6:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On second thought, I'm going to revert the whole thing. There's a
> bigger mess here than can be cleaned up on the fly. The
> alignment-related mess in ParseCommitRecord is maybe something for
> which I could just hack a quick fix, but what I've also just now
> realized is that this makes a huge number of WAL records larger by 4
> bytes, since most WAL records will contain a block reference.

It would be useful if there were generic tests that caught issues like
this. There are various subtle effects related to how struct layout
can impact WAL record size that might easily be missed. It's not like
there are a huge number of truly critical WAL records to have tests
for.

The example that comes to mind is the XLOG_BTREE_INSERT_POST record
type, which is used for B-Tree index tuple inserts with a posting list
split. There is only an extra 2 bytes of payload for these record
types compared to conventional XLOG_BTREE_INSERT_LEAF records, but we
nevertheless tend to see a final record size that is consistently a
full 8 bytes larger in many important cases, despite not needing to
store the IndexTuple with alignment padding. I believe that this is a
consequence of the record header itself needing to be MAXALIGN()'d.
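
To make the arithmetic concrete, here is a minimal sketch of the rounding
effect. It assumes an 8-byte maximum alignment and that the total record
length is what gets rounded up. The SKETCH_* names and the helper function
are made up for illustration and are not the real PostgreSQL definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 8-byte maximum alignment, as on typical 64-bit platforms. */
#define SKETCH_MAXIMUM_ALIGNOF 8

/* Round len up to the next multiple of SKETCH_MAXIMUM_ALIGNOF (a power of 2). */
#define SKETCH_MAXALIGN(len) \
    (((size_t) (len) + (SKETCH_MAXIMUM_ALIGNOF - 1)) & \
     ~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1)))

/*
 * Hypothetical helper: final size of a record whose unpadded length is
 * unpadded_len, once the total has been rounded up for alignment.
 */
static size_t
sketch_padded_record_size(size_t unpadded_len)
{
    size_t padded = SKETCH_MAXALIGN(unpadded_len);

    assert(padded >= unpadded_len);     /* guard against overflow */
    return padded;
}
```

Under these assumptions a 64-byte record stays at 64 bytes, while a record
with 2 more bytes of payload (66 bytes unpadded) comes out at 72 bytes.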

Another important factor in this scenario is the general tendency for
index tuple sizes to leave the final XLOG_BTREE_INSERT_LEAF record
size at 64 bytes. It wouldn't have been okay if the deduplication work
made that size jump up to 72 bytes for many kinds of indexes across
the board, even when there was no accompanying posting list split
(i.e. the vast majority of the time). Maybe it would have been okay if
nbtree leaf page insert records were naturally rare, but that isn't
the case at all, obviously.

That's why we have two different record types here in the first place.
Earlier versions of the deduplication patch just added an OffsetNumber
field to XLOG_BTREE_INSERT_LEAF which could be set to
InvalidOffsetNumber, resulting in a surprisingly large amount of waste
in terms of WAL size, because of the presence of 3 different factors.
We don't bother doing this with the split records, which can also have
accompanying posting list splits, because it makes hardly any
difference at all (split records are much rarer than any kind of leaf
insert record, and are far larger when considered individually).
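
A rough sketch of that trade-off, again assuming 8-byte alignment of the
total record size. The demo_* names, and the idea of modelling a workload as
a count of inserts plus a count of posting list splits, are hypothetical and
only there to show the shape of the calculation.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: record sizes are rounded up to a multiple of 8 bytes. */
#define DEMO_ALIGN 8
#define DEMO_MAXALIGN(len) \
    (((size_t) (len) + (DEMO_ALIGN - 1)) & ~((size_t) (DEMO_ALIGN - 1)))

/* One record type: every leaf insert carries the extra field. */
static size_t
demo_wal_bytes_single_type(size_t ninserts, size_t base_len, size_t extra_len)
{
    return ninserts * DEMO_MAXALIGN(base_len + extra_len);
}

/* Two record types: only inserts with a posting list split pay for it. */
static size_t
demo_wal_bytes_two_types(size_t ninserts, size_t nsplits,
                         size_t base_len, size_t extra_len)
{
    size_t plain;
    size_t post;

    assert(nsplits <= ninserts);
    plain = (ninserts - nsplits) * DEMO_MAXALIGN(base_len);
    post = nsplits * DEMO_MAXALIGN(base_len + extra_len);
    return plain + post;
}
```

With made-up numbers (1000 leaf inserts, 10 of them with a posting list
split, a 64-byte base record and 2 extra bytes), the single record type
costs 72000 bytes of WAL versus 64080 bytes with two record types.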

-- 
Peter Geoghegan


