Thread: BTMaxItemSize seems to be subtly incorrect
We have these two definitions in the source code:

#define BTMaxItemSize(page) \
    MAXALIGN_DOWN((PageGetPageSize(page) - \
                   MAXALIGN(SizeOfPageHeaderData + \
                            3*sizeof(ItemIdData) + \
                            3*sizeof(ItemPointerData)) - \
                   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
#define BTMaxItemSizeNoHeapTid(page) \
    MAXALIGN_DOWN((PageGetPageSize(page) - \
                   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
                   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)

In my tests, PageGetPageSize(page) = 8192, SizeOfPageHeaderData = 24,
sizeof(ItemIdData) = 4, sizeof(ItemPointerData) = 6, and
sizeof(BTPageOpaqueData) = 16. Assuming MAXIMUM_ALIGNOF == 8, I believe
that makes BTMaxItemSize come out to 2704 and BTMaxItemSizeNoHeapTid
come out to 2712.

I have no quibble with the formula for BTMaxItemSizeNoHeapTid. It's just
saying that if you subtract out the page header data, the special space,
and enough space for 3 line pointers, you have a certain amount of space
left (8136 bytes), so a single item shouldn't use more than a third of
that; and since items use up space in increments of MAXALIGN, that third
is rounded down to such a multiple, which comes out to 2712 bytes.

But what's up with BTMaxItemSize? Here, the idea as I understand it is
that we might need to add a heap TID to the item, so it has to be small
enough to squeeze one in while still fitting under the limit. And at
first glance everything seems to be OK, because BTMaxItemSize comes out
8 bytes less than BTMaxItemSizeNoHeapTid, and that's certainly enough
space to fit a heap TID. However, it seems to me that the formula
calculates this as if those additional 6 bytes were being separately
added to the page header or the line pointer array, whereas in reality
they will be part of the tuple itself. I think that we should be
subtracting sizeof(ItemPointerData) at the very end, rather than
subtracting 3*sizeof(ItemPointerData) from the space that gets divided
by three.

To see why, suppose that sizeof(BTPageOpaqueData) were 24 rather than
16. Then we'd have:

BTMaxItemSize
  = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3 * 4 + 3 * 6) - MAXALIGN(24)) / 3)
  = MAXALIGN_DOWN((8192 - MAXALIGN(54) - 24) / 3)
  = MAXALIGN_DOWN(2704) = 2704

BTMaxItemSizeNoHeapTid
  = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3 * 4) - MAXALIGN(24)) / 3)
  = MAXALIGN_DOWN((8192 - MAXALIGN(36) - 24) / 3)
  = MAXALIGN_DOWN(2709) = 2704

That's a problem, because if in that scenario you allow three 2704-byte
items that don't need a heap TID and later you find you need to add a
heap TID to one of those items, the result will be bigger than 2704
bytes, and then you can't fit three of them into a page.

Applying the attached patch and running 'make check' suffices to
demonstrate the problem for me:

diff -U3 /Users/rhaas/pgsql/src/test/regress/expected/vacuum.out /Users/rhaas/pgsql/src/test/regress/results/vacuum.out
--- /Users/rhaas/pgsql/src/test/regress/expected/vacuum.out	2022-06-06 14:46:17.000000000 -0400
+++ /Users/rhaas/pgsql/src/test/regress/results/vacuum.out	2022-06-08 17:20:58.000000000 -0400
@@ -137,7 +137,9 @@
     repeat('1234567890',269));
 -- index cleanup option is ignored if VACUUM FULL
 VACUUM (INDEX_CLEANUP TRUE, FULL TRUE) no_index_cleanup;
+ERROR: cannot insert oversized tuple of size 2712 on internal page of index "no_index_cleanup_idx"
 VACUUM (FULL TRUE) no_index_cleanup;
+ERROR: cannot insert oversized tuple of size 2712 on internal page of index "no_index_cleanup_idx"
 -- Toast inherits the value from its parent table.
 ALTER TABLE no_index_cleanup SET (vacuum_index_cleanup = false);
 DELETE FROM no_index_cleanup WHERE i < 15;

--
Robert Haas
EDB: http://www.enterprisedb.com
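[Editor's note: for reference, a minimal standalone sketch that re-derives the numbers above. The MAXALIGN/MAXALIGN_DOWN definitions and all constants are re-declared locally from the values reported in the message, purely for illustration; this is not the actual PostgreSQL source.]

/*
 * Standalone re-derivation of the limits discussed above.  Constants and
 * alignment macros are assumed from the values reported in the message;
 * this is an illustration, not the PostgreSQL headers.
 */
#include <stdio.h>
#include <stdint.h>

#define MAXIMUM_ALIGNOF        8
#define TYPEALIGN(a, len)      (((uintptr_t) (len) + ((a) - 1)) & ~((uintptr_t) ((a) - 1)))
#define TYPEALIGN_DOWN(a, len) (((uintptr_t) (len)) & ~((uintptr_t) ((a) - 1)))
#define MAXALIGN(len)          TYPEALIGN(MAXIMUM_ALIGNOF, (len))
#define MAXALIGN_DOWN(len)     TYPEALIGN_DOWN(MAXIMUM_ALIGNOF, (len))

#define PAGE_SIZE              8192    /* PageGetPageSize(page) */
#define SIZE_OF_PAGE_HEADER    24      /* SizeOfPageHeaderData */
#define SIZEOF_ITEMID          4       /* sizeof(ItemIdData) */
#define SIZEOF_ITEMPOINTER     6       /* sizeof(ItemPointerData) */

/* Same shape as BTMaxItemSize, with the special-space size as a parameter. */
static uintptr_t
bt_max_item_size(uintptr_t opaque_size)
{
    return MAXALIGN_DOWN((PAGE_SIZE -
                          MAXALIGN(SIZE_OF_PAGE_HEADER +
                                   3 * SIZEOF_ITEMID +
                                   3 * SIZEOF_ITEMPOINTER) -
                          MAXALIGN(opaque_size)) / 3);
}

/* Same shape as BTMaxItemSizeNoHeapTid. */
static uintptr_t
bt_max_item_size_no_heap_tid(uintptr_t opaque_size)
{
    return MAXALIGN_DOWN((PAGE_SIZE -
                          MAXALIGN(SIZE_OF_PAGE_HEADER + 3 * SIZEOF_ITEMID) -
                          MAXALIGN(opaque_size)) / 3);
}

int
main(void)
{
    /* Actual layout: prints 2704 and 2712.  A 2704-byte item plus a
     * MAXALIGN'd heap TID is 2712 bytes, and three of those still fit
     * on an 8192-byte page alongside the header, line pointers, and
     * 16 bytes of special space. */
    printf("opaque=16: %lu %lu\n",
           (unsigned long) bt_max_item_size(16),
           (unsigned long) bt_max_item_size_no_heap_tid(16));

    /* Hypothetical 24-byte special space: both limits come out to 2704,
     * so a maximum-size tuple can no longer absorb a heap TID while
     * staying within the other limit -- the inconsistency described
     * above. */
    printf("opaque=24: %lu %lu\n",
           (unsigned long) bt_max_item_size(24),
           (unsigned long) bt_max_item_size_no_heap_tid(24));
    return 0;
}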
On Wed, Jun 8, 2022 at 2:23 PM Robert Haas <robertmhaas@gmail.com> wrote:
> In my tests, PageGetPageSize(page) = 8192, SizeOfPageHeaderData = 24,
> sizeof(ItemIdData) = 4, sizeof(ItemPointerData) = 6, and
> sizeof(BTPageOpaqueData) = 16. Assuming MAXIMUM_ALIGNOF == 8, I believe
> that makes BTMaxItemSize come out to 2704 and BTMaxItemSizeNoHeapTid
> come out to 2712.

I agree that these numbers are what you get on all mainstream platforms.
I know these specifics from memory alone, actually.

> To see why, suppose that sizeof(BTPageOpaqueData) were 24 rather than
> 16. Then we'd have:
>
> BTMaxItemSize
>   = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3 * 4 + 3 * 6) - MAXALIGN(24)) / 3)
>   = MAXALIGN_DOWN((8192 - MAXALIGN(54) - 24) / 3)
>   = MAXALIGN_DOWN(2704) = 2704
>
> BTMaxItemSizeNoHeapTid
>   = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3 * 4) - MAXALIGN(24)) / 3)
>   = MAXALIGN_DOWN((8192 - MAXALIGN(36) - 24) / 3)
>   = MAXALIGN_DOWN(2709) = 2704
>
> That's a problem, because if in that scenario you allow three 2704-byte
> items that don't need a heap TID and later you find you need to add a
> heap TID to one of those items, the result will be bigger than 2704
> bytes, and then you can't fit three of them into a page.

Seems you must be right. I'm guessing that the field "cabbage" was
originally a nonce value, as part of a draft patch you're working on?

I actually tested this in a fairly brute-force fashion back when I was
working on the Postgres 12 nbtree stuff. Essentially I found a way to
build the tallest possible B-Tree, consisting of only 3 items (plus a
high key) on each leaf page, each of which was the largest possible
size, up to the byte. If memory serves, it is just about impossible to
get beyond 7 levels. It took as long as 30 minutes or more to run the
test.

I think that we should fix this on HEAD, on general principle. There is
no reason to believe that this is a live bug, so a backpatch seems
unnecessary.

--
Peter Geoghegan
On Wed, Jun 8, 2022 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > That's a problem, because if in that scenario you allow three 2704-byte
> > items that don't need a heap TID and later you find you need to add a
> > heap TID to one of those items, the result will be bigger than 2704
> > bytes, and then you can't fit three of them into a page.
>
> Seems you must be right. I'm guessing that the field "cabbage" was
> originally a nonce value, as part of a draft patch you're working on?

I wasn't originally setting out to modify BTPageOpaqueData at all, just
borrow some special space. See the "storing an explicit nonce"
discussion and patch set. But when this regression failure turned up I
said to myself, hmm, I think this is an unrelated bug.

> I think that we should fix this on HEAD, on general principle. There
> is no reason to believe that this is a live bug, so a backpatch seems
> unnecessary.

Yeah, I agree with not back-patching the fix, unless it turns out that
there is some platform where the same issue occurs without any cabbage.
I assume that if it happened on any common system someone would have
complained about it by now, so probably it doesn't. I suppose we could
try to enumerate plausibly different values of the quantities involved
and see if any of the combinations look like they lead to a bad result.
I'm not really sure how many things could plausibly be different,
though, apart from MAXIMUM_ALIGNOF.

--
Robert Haas
EDB: http://www.enterprisedb.com
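[Editor's note: a rough standalone sketch of that kind of enumeration. The candidate page sizes, alignments, and special-space sizes are assumptions chosen for illustration, and the helper names are made up; this is not code from the thread. It flags any combination where a tuple at the BTMaxItemSize limit, after gaining a MAXALIGN'd heap TID, would exceed the BTMaxItemSizeNoHeapTid limit that (per the follow-up below) is still enforced on internal pages.]

#include <stdio.h>
#include <stdint.h>

/* Generic alignment helpers standing in for MAXALIGN()/MAXALIGN_DOWN(). */
static uintptr_t align_up(uintptr_t len, uintptr_t a)   { return (len + a - 1) & ~(a - 1); }
static uintptr_t align_down(uintptr_t len, uintptr_t a) { return len & ~(a - 1); }

int
main(void)
{
    /* Candidate values -- assumptions for illustration only. */
    const uintptr_t page_sizes[] = {4096, 8192, 16384, 32768};
    const uintptr_t aligns[] = {4, 8};
    const uintptr_t opaque_sizes[] = {16, 24, 32};
    const uintptr_t header = 24;    /* SizeOfPageHeaderData */
    const uintptr_t itemid = 4;     /* sizeof(ItemIdData) */
    const uintptr_t tid = 6;        /* sizeof(ItemPointerData) */

    for (size_t p = 0; p < sizeof(page_sizes) / sizeof(page_sizes[0]); p++)
        for (size_t a = 0; a < sizeof(aligns) / sizeof(aligns[0]); a++)
            for (size_t o = 0; o < sizeof(opaque_sizes) / sizeof(opaque_sizes[0]); o++)
            {
                uintptr_t ps = page_sizes[p], al = aligns[a], op = opaque_sizes[o];

                /* Same shape as BTMaxItemSize */
                uintptr_t with_tid_max =
                    align_down((ps - align_up(header + 3 * itemid + 3 * tid, al) -
                                align_up(op, al)) / 3, al);
                /* Same shape as BTMaxItemSizeNoHeapTid */
                uintptr_t no_tid_max =
                    align_down((ps - align_up(header + 3 * itemid, al) -
                                align_up(op, al)) / 3, al);

                /* A limit-sized leaf tuple that later has a heap TID appended
                 * should still fit under the no-heap-TID limit. */
                if (align_up(with_tid_max + tid, al) > no_tid_max)
                    printf("page=%lu align=%lu opaque=%lu: %lu + TID -> %lu > %lu\n",
                           (unsigned long) ps, (unsigned long) al, (unsigned long) op,
                           (unsigned long) with_tid_max,
                           (unsigned long) align_up(with_tid_max + tid, al),
                           (unsigned long) no_tid_max);
            }
    return 0;
}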
On Wed, Jun 8, 2022 at 4:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I wasn't originally setting out to modify BTPageOpaqueData at all, just
> borrow some special space. See the "storing an explicit nonce"
> discussion and patch set. But when this regression failure turned up I
> said to myself, hmm, I think this is an unrelated bug.

FWIW I don't see much difference between borrowing special space and
adding something to BTPageOpaqueData. If anything I'd prefer the latter.

> > I think that we should fix this on HEAD, on general principle. There
> > is no reason to believe that this is a live bug, so a backpatch seems
> > unnecessary.
>
> Yeah, I agree with not back-patching the fix, unless it turns out that
> there is some platform where the same issue occurs without any cabbage.
> I assume that if it happened on any common system someone would have
> complained about it by now, so probably it doesn't.

I don't think it's possible on 32-bit platforms, which is the only
further variation I can think of. But let's assume that the same problem
was in fact possible "without cabbage", just for the sake of argument.
The worst problem that could result would be a failure to split a page
that had maximally large keys, without TOAST compression -- which is
what you demonstrated. There wouldn't be any downstream consequences.

Here's why: BTMaxItemSizeNoHeapTid() is actually what BTMaxItemSize()
looked like prior to Postgres 12. So the limit on internal pages never
changed, even in Postgres 12. There was no separate leaf page limit
prior to 12. Only the rules on the leaf level ever really changed.

Note also that amcheck has tests for this stuff. Though that probably
doesn't matter at all.

--
Peter Geoghegan
On Wed, Jun 8, 2022 at 7:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> FWIW I don't see much difference between borrowing special space and
> adding something to BTPageOpaqueData. If anything I'd prefer the
> latter.

I think this discussion will take us too far afield from the topic of
this thread, so I'll just say here that it wouldn't solve the problem I
was trying to tackle.

> Here's why: BTMaxItemSizeNoHeapTid() is actually what BTMaxItemSize()
> looked like prior to Postgres 12. So the limit on internal pages never
> changed, even in Postgres 12. There was no separate leaf page limit
> prior to 12. Only the rules on the leaf level ever really changed.

Yeah, I noticed that, too. Are you going to code up a patch?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jun 9, 2022 at 6:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Are you going to code up a patch?

I can, but feel free to fix it yourself if you prefer. Your analysis
seems sound.

--
Peter Geoghegan
On Thu, Jun 9, 2022 at 1:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jun 9, 2022 at 6:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Are you going to code up a patch?
>
> I can, but feel free to fix it yourself if you prefer. Your analysis
> seems sound.

I think I'd feel more comfortable if you wrote the patch, if that's
possible.

--
Robert Haas
EDB: http://www.enterprisedb.com
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Jun 8, 2022 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > That's a problem, because if in that scenario you allow three 2704-byte
> > > items that don't need a heap TID and later you find you need to add a
> > > heap TID to one of those items, the result will be bigger than 2704
> > > bytes, and then you can't fit three of them into a page.
> >
> > Seems you must be right. I'm guessing that the field "cabbage" was
> > originally a nonce value, as part of a draft patch you're working on?
>
> I wasn't originally setting out to modify BTPageOpaqueData at all, just
> borrow some special space. See the "storing an explicit nonce"
> discussion and patch set. But when this regression failure turned up I
> said to myself, hmm, I think this is an unrelated bug.

I had seen something along these lines when also playing with trying to
use special space. I hadn't had a chance to run down exactly where it
was coming from, so thanks for working on this.

> > I think that we should fix this on HEAD, on general principle. There
> > is no reason to believe that this is a live bug, so a backpatch seems
> > unnecessary.
>
> Yeah, I agree with not back-patching the fix, unless it turns out that
> there is some platform where the same issue occurs without any cabbage.
> I assume that if it happened on any common system someone would have
> complained about it by now, so probably it doesn't. I suppose we could
> try to enumerate plausibly different values of the quantities involved
> and see if any of the combinations look like they lead to a bad result.
> I'm not really sure how many things could plausibly be different,
> though, apart from MAXIMUM_ALIGNOF.

Agreed that it doesn't seem like we'd need to backpatch this.

Thanks,

Stephen
On Thu, Jun 9, 2022 at 11:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think I'd feel more comfortable if you wrote the patch, if that's
> possible.

Okay, pushed a fix just now. Thanks

--
Peter Geoghegan
On Fri, Aug 5, 2022 at 3:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jun 9, 2022 at 11:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I think I'd feel more comfortable if you wrote the patch, if that's
> > possible.
>
> Okay, pushed a fix just now.

FYI florican and lapwing showed:

2022-08-05 01:04:29.903 EDT [34485:5] FATAL: deduplication failed to add heap tid to pending posting list
2022-08-05 01:04:29.903 EDT [34485:6] CONTEXT: WAL redo at 0/49708D8 for Btree/DEDUP: nintervals 4; blkref #0: rel 1663/16384/2674, blk 1
On Thu, Aug 4, 2022 at 10:25 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> FYI florican and lapwing showed:
>
> 2022-08-05 01:04:29.903 EDT [34485:5] FATAL: deduplication failed to
> add heap tid to pending posting list
> 2022-08-05 01:04:29.903 EDT [34485:6] CONTEXT: WAL redo at 0/49708D8
> for Btree/DEDUP: nintervals 4; blkref #0: rel 1663/16384/2674, blk 1

This very likely has something to do with the way nbtdedup.c uses
BTMaxItemSize(), which apparently won't work on these 32-bit systems
now. I'll fix this tomorrow morning.

--
Peter Geoghegan
On Thu, Aug 4, 2022 at 10:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This very likely has something to do with the way nbtdedup.c uses
> BTMaxItemSize(), which apparently won't work on these 32-bit systems
> now.

Update: I discovered that I can get the regression tests to fail (even
on mainstream 64-bit platforms) by MAXALIGN()'ing the expression that we
assign to state->maxpostingsize at the top of _bt_dedup_pass(). This is
surprising; it contradicts existing comments that explain that the
existing max is 1/6 of a page by choice, to get better space utilization
than the more natural cap of 1/3 of a page. It now looks like that might
actually have been strictly necessary all along.

--
Peter Geoghegan
On Fri, Aug 5, 2022 at 10:13 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Update: I discovered that I can get the regression tests to fail (even
> on mainstream 64-bit platforms) by MAXALIGN()'ing the expression that
> we assign to state->maxpostingsize at the top of _bt_dedup_pass().

Looks like this was nothing more than a silly oversight in how the macro
was defined. As written, it would evaluate to the wrong thing at the
same point in nbtdedup.c, just because it was used in an expression.
Pushed a fix for that just now.

--
Peter Geoghegan
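[Editor's note: the committed macro isn't quoted in this thread, so the following is only a hedged, hypothetical sketch of the class of mistake being described -- a function-like macro whose expansion isn't wrapped in an outer set of parentheses changes meaning once it is embedded in a larger expression, such as a divide-by-two cap like the one assigned to maxpostingsize. All names and values here are illustrative, not PostgreSQL source.]

/*
 * Illustration of the macro-parenthesization pitfall.  If a revised
 * limit macro subtracts the heap-TID allowance at the end but doesn't
 * parenthesize the whole expansion, using it inside a larger
 * expression silently changes the arithmetic.
 */
#include <stdio.h>

#define MAXALIGN_DOWN(len)   ((len) & ~((size_t) 7))
#define MAXALIGN(len)        (((len) + 7) & ~((size_t) 7))
#define TID_SIZE             6

/* Unparenthesized: expands as "A - B", so a trailing "/ 2" binds only to B. */
#define MAX_ITEM_SIZE_BAD(avail) \
    MAXALIGN_DOWN((avail) / 3) - MAXALIGN(TID_SIZE)

/* Parenthesized: the whole quantity is divided, as intended. */
#define MAX_ITEM_SIZE_GOOD(avail) \
    (MAXALIGN_DOWN((avail) / 3) - MAXALIGN(TID_SIZE))

int
main(void)
{
    size_t avail = 8136;

    /* Used standalone, both forms agree (2704 and 2704). */
    printf("%zu %zu\n", (size_t) MAX_ITEM_SIZE_BAD(avail),
                        (size_t) MAX_ITEM_SIZE_GOOD(avail));

    /* Used inside an expression -- e.g. a "half of the max" cap -- the
     * unparenthesized form halves only the TID allowance, producing a
     * limit that is far larger than intended (2708 vs 1352 here). */
    printf("%zu %zu\n", (size_t) (MAX_ITEM_SIZE_BAD(avail) / 2),
                        (size_t) (MAX_ITEM_SIZE_GOOD(avail) / 2));
    return 0;
}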