Thread: BTMaxItemSize seems to be subtly incorrect

BTMaxItemSize seems to be subtly incorrect

From
Robert Haas
Date:
We have these two definitions in the source code:

#define BTMaxItemSize(page) \
     MAXALIGN_DOWN((PageGetPageSize(page) - \
                    MAXALIGN(SizeOfPageHeaderData + \
                             3*sizeof(ItemIdData)  + \
                             3*sizeof(ItemPointerData)) - \
                    MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
#define BTMaxItemSizeNoHeapTid(page) \
     MAXALIGN_DOWN((PageGetPageSize(page) - \
                    MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
                    MAXALIGN(sizeof(BTPageOpaqueData))) / 3)

In my tests, PageGetPageSize(page) = 8192, SizeOfPageHeaderData = 24,
sizeof(ItemIdData) = 4, sizeof(ItemPointerData) = 6, and
sizeof(BTPageOpaqueData) = 16. Assuming MAXIMUM_ALIGNOF == 8, I
believe that makes BTMaxItemSize come out to 2704 and
BTMaxItemSizeNoHeapTid come out to 2712. I have no quibble with the
formula for BTMaxItemSizeNoHeapTid. It's just saying that if you
subtract out the page header data, the special space, and enough space
for 3 line pointers, you have a certain amount of space left (8152
bytes) and so a single item shouldn't use more than a third of that
(2717 bytes) but since items use up space in increments of MAXALIGN,
you have to round down to the next such multiple (2712 bytes).

But what's up with BTMaxItemSize? Here, the idea as I understand it is
that we might need to add a TID to the item, so it has to be small
enough to squeeze one in while still fitting under the limit. And at
first glance everything seems to be OK, because BTMaxItemSize comes
out to be 8 bytes less than BTMaxItemSizeNoHeapTid and that's enough
space to fit a heap TID for sure. However, it seems to me that the
formula calculates this as if those additional 6 bytes were being
separately added to the page header or the line pointer array, whereas
in reality they will be part of the tuple itself. I think that we
should be subtracting sizeof(ItemPointerData) at the very end, rather
than subtracting 3*sizeof(ItemPointerData) from the space available to
be divided by three.

To see why, suppose that sizeof(BTPageOpaqueData) were 24 rather than
16. Then we'd have:

BTMaxItemSize = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3 * 4 + 3 * 6) -
MAXALIGN(24)) / 3) = MAXALIGN_DOWN((8192 - MAXALIGN(54)  - 24) / 3) =
MAXALIGN_DOWN(2704) = 2704
BTMaxItemSizeNoHeapTid = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3 * 4) -
MAXALIGN(24)) / 3 = MAXALIGN_DOWN((8192 - MAXALIGN(36)  - 24) / 3) =
MAXALIGN_DOWN(2709) = 2704

That's a problem, because if in that scenario you allow three 2704
byte items that don't need a heap TID and later you find you need to
add a heap TID to one of those items, the result will be bigger than
2704 bytes, and then you can't fit three of them into a page.

Applying the attached patch and running 'make check' suffices to
demonstrate the problem for me:

diff -U3 /Users/rhaas/pgsql/src/test/regress/expected/vacuum.out
/Users/rhaas/pgsql/src/test/regress/results/vacuum.out
--- /Users/rhaas/pgsql/src/test/regress/expected/vacuum.out
2022-06-06 14:46:17.000000000 -0400
+++ /Users/rhaas/pgsql/src/test/regress/results/vacuum.out
2022-06-08 17:20:58.000000000 -0400
@@ -137,7 +137,9 @@
     repeat('1234567890',269));
 -- index cleanup option is ignored if VACUUM FULL
 VACUUM (INDEX_CLEANUP TRUE, FULL TRUE) no_index_cleanup;
+ERROR:  cannot insert oversized tuple of size 2712 on internal page
of index "no_index_cleanup_idx"
 VACUUM (FULL TRUE) no_index_cleanup;
+ERROR:  cannot insert oversized tuple of size 2712 on internal page
of index "no_index_cleanup_idx"
 -- Toast inherits the value from its parent table.
 ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = false);
 DELETE FROM no_index_cleanup WHERE i < 15;

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: BTMaxItemSize seems to be subtly incorrect

From
Peter Geoghegan
Date:
On Wed, Jun 8, 2022 at 2:23 PM Robert Haas <robertmhaas@gmail.com> wrote:
> In my tests, PageGetPageSize(page) = 8192, SizeOfPageHeaderData = 24,
> sizeof(ItemIdData) = 4, sizeof(ItemPointerData) = 6, and
> sizeof(BTPageOpaqueData) = 16. Assuming MAXIMUM_ALIGNOF == 8, I
> believe that makes BTMaxItemSize come out to 2704 and
> BTMaxItemSizeNoHeapTid come out to 2712.

I agree that these numbers are what you get on all mainstream
platforms. I know these specifics from memory alone, actually.

> To see why, suppose that sizeof(BTPageOpaqueData) were 24 rather than
> 16. Then we'd have:
>
> BTMaxItemSize = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3 * 4 + 3 * 6) -
> MAXALIGN(24)) / 3) = MAXALIGN_DOWN((8192 - MAXALIGN(54)  - 24) / 3) =
> MAXALIGN_DOWN(2704) = 2704
> BTMaxItemSizeNoHeapTid = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3 * 4) -
> MAXALIGN(24)) / 3 = MAXALIGN_DOWN((8192 - MAXALIGN(36)  - 24) / 3) =
> MAXALIGN_DOWN(2709) = 2704
>
> That's a problem, because if in that scenario you allow three 2704
> byte items that don't need a heap TID and later you find you need to
> add a heap TID to one of those items, the result will be bigger than
> 2704 bytes, and then you can't fit three of them into a page.

Seems you must be right. I'm guessing that the field "cabbage" was
originally a nonce value, as part of a draft patch you're working on?

I actually tested this in a fairly brute force fashion back when I was
working on the Postgres 12 nbtree stuff. Essentially I found a way to
build the tallest possible B-Tree, consisting of only 3 items (plus a
high key) on each leaf page, each of which was the largest possible
size, up to the byte. If memory serves, it is just about impossible to
get beyond 7 levels. It took as long as 30 minutes or more to run the
test.

I think that we should fix this on HEAD, on general principle. There
is no reason to believe that this is a live bug, so a backpatch seems
unnecessary.

--
Peter Geoghegan



Re: BTMaxItemSize seems to be subtly incorrect

From
Robert Haas
Date:
On Wed, Jun 8, 2022 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > That's a problem, because if in that scenario you allow three 2704
> > byte items that don't need a heap TID and later you find you need to
> > add a heap TID to one of those items, the result will be bigger than
> > 2704 bytes, and then you can't fit three of them into a page.
>
> Seems you must be right. I'm guessing that the field "cabbage" was
> originally a nonce value, as part of a draft patch you're working on?

I wasn't originally setting out to modify BTPageOpaqueData at all,
just borrow some special space. See the "storing an explicit nonce"
discussion and patch set. But when this regression failure turned up I
said to myself, hmm, I think this is an unrelated bug.

> I think that we should fix this on HEAD, on general principle. There
> is no reason to believe that this is a live bug, so a backpatch seems
> unnecessary.

Yeah, I agree with not back-patching the fix, unless it turns out that
there is some platform where the same issue occurs without any
cabbage. I assume that if it happened on any common system someone
would have complained about it by now, so probably it doesn't. I
suppose we could try to enumerate plausibly different values of the
quantities involved and see if any of the combinations look like they
lead to a bad result. I'm not really sure how many things could
plausibly be different, though, apart from MAXIMUM_ALIGNOF.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: BTMaxItemSize seems to be subtly incorrect

From
Peter Geoghegan
Date:
On Wed, Jun 8, 2022 at 4:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I wasn't originally setting out to modify BTPageOpaqueData at all,
> just borrow some special space. See the "storing an explicit nonce"
> discussion and patch set. But when this regression failure turned up I
> said to myself, hmm, I think this is an unrelated bug.

FWIW I don't see much difference between borrowing special space and
adding something to BTPageOpaqueData. If anything I'd prefer the
latter.

> > I think that we should fix this on HEAD, on general principle. There
> > is no reason to believe that this is a live bug, so a backpatch seems
> > unnecessary.
>
> Yeah, I agree with not back-patching the fix, unless it turns out that
> there is some platform where the same issue occurs without any
> cabbage. I assume that if it happened on any common system someone
> would have complained about it by now, so probably it doesn't.

I don't think it's possible on 32-bit platforms, which is the only
further variation I can think of. But let's assume that the same
problem was in fact possible "without cabbage", just for the sake of
argument. The worst problem that could result would be a failure to
split a page that had maximally large keys, without TOAST compression
-- which is what you demonstrated. There wouldn't be any downstream
consequences.

Here's why: BTMaxItemSizeNoHeapTid() is actually what BTMaxItemSize()
looked like prior to Postgres 12. So the limit on internal pages never
changed, even in Postgres 12. There was no separate leaf page limit
prior to 12. Only the rules on the leaf level ever really changed.

Note also that amcheck has tests for this stuff. Though that probably
doesn't matter at all.

-- 
Peter Geoghegan



Re: BTMaxItemSize seems to be subtly incorrect

From
Robert Haas
Date:
On Wed, Jun 8, 2022 at 7:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> FWIW I don't see much difference between borrowing special space and
> adding something to BTPageOpaqueData. If anything I'd prefer the
> latter.

I think this discussion will take us too far afield from the topic of
this thread, so I'll just say here that wouldn't solve the problem I
was trying to tackle.

> Here's why: BTMaxItemSizeNoHeapTid() is actually what BTMaxItemSize()
> looked like prior to Postgres 12. So the limit on internal pages never
> changed, even in Postgres 12. There was no separate leaf page limit
> prior to 12. Only the rules on the leaf level ever really changed.

Yeah, I noticed that, too.

Are you going to code up a patch?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: BTMaxItemSize seems to be subtly incorrect

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 6:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Are you going to code up a patch?

I can, but feel free to fix it yourself if you prefer. Your analysis
seems sound.

-- 
Peter Geoghegan



Re: BTMaxItemSize seems to be subtly incorrect

From
Robert Haas
Date:
On Thu, Jun 9, 2022 at 1:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jun 9, 2022 at 6:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Are you going to code up a patch?
>
> I can, but feel free to fix it yourself if you prefer. Your analysis
> seems sound.

I think I'd feel more comfortable if you wrote the patch, if that's possible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: BTMaxItemSize seems to be subtly incorrect

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Jun 8, 2022 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > That's a problem, because if in that scenario you allow three 2704
> > > byte items that don't need a heap TID and later you find you need to
> > > add a heap TID to one of those items, the result will be bigger than
> > > 2704 bytes, and then you can't fit three of them into a page.
> >
> > Seems you must be right. I'm guessing that the field "cabbage" was
> > originally a nonce value, as part of a draft patch you're working on?
>
> I wasn't originally setting out to modify BTPageOpaqueData at all,
> just borrow some special space. See the "storing an explicit nonce"
> discussion and patch set. But when this regression failure turned up I
> said to myself, hmm, I think this is an unrelated bug.

I had seen something along these lines when also playing with trying to
use special space.  I hadn't had a chance to run down exactly where it
was coming from, so thanks for working on this.

> > I think that we should fix this on HEAD, on general principle. There
> > is no reason to believe that this is a live bug, so a backpatch seems
> > unnecessary.
>
> Yeah, I agree with not back-patching the fix, unless it turns out that
> there is some platform where the same issue occurs without any
> cabbage. I assume that if it happened on any common system someone
> would have complained about it by now, so probably it doesn't. I
> suppose we could try to enumerate plausibly different values of the
> quantities involved and see if any of the combinations look like they
> lead to a bad result. I'm not really sure how many things could
> plausibly be different, though, apart from MAXIMUM_ALIGNOF.

Agreed that it doesn't seem like we'd need to backpatch this.

Thanks,

Stephen

Attachment

Re: BTMaxItemSize seems to be subtly incorrect

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 11:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think I'd feel more comfortable if you wrote the patch, if that's possible.

Okay, pushed a fix just now.

Thanks
-- 
Peter Geoghegan



Re: BTMaxItemSize seems to be subtly incorrect

From
Thomas Munro
Date:
On Fri, Aug 5, 2022 at 3:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jun 9, 2022 at 11:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I think I'd feel more comfortable if you wrote the patch, if that's possible.
>
> Okay, pushed a fix just now.

FYI florican and lapwing showed:

2022-08-05 01:04:29.903 EDT [34485:5] FATAL:  deduplication failed to
add heap tid to pending posting list
2022-08-05 01:04:29.903 EDT [34485:6] CONTEXT:  WAL redo at 0/49708D8
for Btree/DEDUP: nintervals 4; blkref #0: rel 1663/16384/2674, blk 1



Re: BTMaxItemSize seems to be subtly incorrect

From
Peter Geoghegan
Date:
On Thu, Aug 4, 2022 at 10:25 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> FYI florican and lapwing showed:
>
> 2022-08-05 01:04:29.903 EDT [34485:5] FATAL:  deduplication failed to
> add heap tid to pending posting list
> 2022-08-05 01:04:29.903 EDT [34485:6] CONTEXT:  WAL redo at 0/49708D8
> for Btree/DEDUP: nintervals 4; blkref #0: rel 1663/16384/2674, blk 1

This very likely has something to do with the way nbtdedup.c uses
BTMaxItemSize(), which apparently won't work on these 32-bit systems
now.

I'll fix this tomorrow morning.

-- 
Peter Geoghegan



Re: BTMaxItemSize seems to be subtly incorrect

From
Peter Geoghegan
Date:
On Thu, Aug 4, 2022 at 10:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This very likely has something to do with the way nbtdedup.c uses
> BTMaxItemSize(), which apparently won't work on these 32-bit systems
> now.

Update: I discovered that I can get the regression tests to fail (even
on mainstream 64-bit platforms) by MAXALIGN()'ing the expression that
we assign to state->maxpostingsize at the top of _bt_dedup_pass().
This is surprising; it contradicts existing comments that explain that
the existing max is 1/6 of a page by choice, to get better space
utilization than the more natural cap of 1/3 of a page. It now looks
like that might have actually been strictly necessary, all along.

-- 
Peter Geoghegan



Re: BTMaxItemSize seems to be subtly incorrect

From
Peter Geoghegan
Date:
On Fri, Aug 5, 2022 at 10:13 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Update: I discovered that I can get the regression tests to fail (even
> on mainstream 64-bit platforms) by MAXALIGN()'ing the expression that
> we assign to state->maxpostingsize at the top of _bt_dedup_pass().

Looks like this was nothing more than a silly oversight with how the
macro was defined. As written, it would evaluate to the wrong thing at
the same point in nbtdedup.c, just because it was used in an
expression.

Pushed a fix for that just now.
-- 
Peter Geoghegan