Re: BTScanOpaqueData size slows down tests - Mailing list pgsql-hackers

From Andres Freund
Subject Re: BTScanOpaqueData size slows down tests
Msg-id jtjpscdpj5dxugvulavmjekmblvuroxi3tvkeyhbhp6ye5blqj@jrjrsrlr76gj
In response to Re: BTScanOpaqueData size slows down tests  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BTScanOpaqueData size slows down tests
Re: BTScanOpaqueData size slows down tests
List pgsql-hackers
Hi,

On 2025-04-02 11:36:33 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Looking at the size of BTScanOpaqueData I am less surprised:
> >         /* size: 27352, cachelines: 428, members: 17 */
> > allocating, zeroing and freeing 28kB of memory for every syscache miss, yea,
> > that's gonna hurt.
>
> Ouch!  I had no idea it had gotten that big.  Yeah, we ought to
> do something about that.

It got a bit bigger a few years back, in

commit 0d861bbb702
Author: Peter Geoghegan <pg@bowt.ie>
Date:   2020-02-26 13:05:30 -0800

    Add deduplication to nbtree.

Because the posting list is a lot more dense, more items can be stored on each
page.

Not that it was small before either:

        BTScanPosData              currPos __attribute__((__aligned__(8))); /*    88  4128 */
        /* --- cacheline 65 boundary (4160 bytes) was 56 bytes ago --- */
        BTScanPosData              markPos __attribute__((__aligned__(8))); /*  4216  4128 */

        /* size: 8344, cachelines: 131, members: 16 */
        /* sum members: 8334, holes: 3, sum holes: 10 */
        /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
        /* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));

But obviously growing ~3.3x (8344 -> 27352 bytes) can qualitatively change things.
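
A quick way to sanity-check the pahole figures quoted above. This is only a sketch: it assumes 64-byte cache lines (which is what the "cachelines" counts imply), and the helper names are made up for illustration.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines, matching the pahole "cachelines" counts. */
#define DEMO_CACHELINE_SIZE 64

/* Number of cache lines a struct of the given size spans, rounded up. */
static size_t
demo_cachelines_for(size_t size)
{
    return (size + DEMO_CACHELINE_SIZE - 1) / DEMO_CACHELINE_SIZE;
}

/* How many times larger new_size is than old_size. */
static double
demo_growth_ratio(size_t old_size, size_t new_size)
{
    return (double) new_size / (double) old_size;
}
```

With the two sizes from this mail, demo_cachelines_for(8344) gives 131 and demo_cachelines_for(27352) gives 428, matching the pahole output; demo_growth_ratio(8344, 27352) comes out at roughly 3.28.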


> > And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
> > only when mark/restore are used?
>
> That'd be an easy way of removing about half of the problem, but
> 14kB is still too much.  How badly do we need this items array?
> Couldn't we just reference the on-page items?

I think that'd require acquiring the buffer lock and/or pin more frequently.
But I know very little about nbtree.
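
To make the "allocate markPos only when mark/restore are used" idea concrete, here is a minimal standalone sketch. It is not nbtree code: the Demo* types, the function names, and the array length are all hypothetical stand-ins, and it uses malloc() rather than palloc() so it compiles on its own.

```c
#include <stdlib.h>
#include <string.h>

/* Arbitrary stand-in for MaxTIDsPerBTreePage; not the real value. */
#define DEMO_MARK_MAX_ITEMS 1024

/* Hypothetical stand-in for BTScanPosData. */
typedef struct DemoMarkPos
{
    int itemIndex;
    int items[DEMO_MARK_MAX_ITEMS];
} DemoMarkPos;

/* Hypothetical stand-in for BTScanOpaqueData. */
typedef struct DemoMarkOpaque
{
    DemoMarkPos  currPos;   /* always needed, stays embedded */
    DemoMarkPos *markPos;   /* NULL until a position is first marked */
} DemoMarkOpaque;

static void
demo_mark_scan_init(DemoMarkOpaque *so)
{
    memset(&so->currPos, 0, sizeof(so->currPos));
    so->markPos = NULL;     /* no allocation for scans that never mark */
}

/* Remember the current position; allocates markPos on first use. */
static int
demo_mark_pos(DemoMarkOpaque *so)
{
    if (so->markPos == NULL)
    {
        so->markPos = malloc(sizeof(DemoMarkPos));
        if (so->markPos == NULL)
            return -1;      /* out of memory */
    }
    memcpy(so->markPos, &so->currPos, sizeof(DemoMarkPos));
    return 0;
}

/* Go back to the marked position; fails if nothing was marked. */
static int
demo_restore_pos(DemoMarkOpaque *so)
{
    if (so->markPos == NULL)
        return -1;
    memcpy(&so->currPos, so->markPos, sizeof(DemoMarkPos));
    return 0;
}

static void
demo_mark_scan_end(DemoMarkOpaque *so)
{
    free(so->markPos);      /* free(NULL) is a no-op */
    so->markPos = NULL;
}
```

In this sketch a scan that never marks a position pays for only one position struct, which is the "about half of the problem" mentioned above.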


I'd assume it's extremely rare for there to be this many items on a page. I'd
guess that something like the following would work: have BTScanPosData->items
point to a small in-line array of 4-16 entries
(BTScanPosItem items_inline[N]), and dynamically allocate a full-length
BTScanPosItem[MaxTIDsPerBTreePage] only in the cases where it's needed.
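
Roughly what I have in mind, as a standalone sketch rather than a patch. Everything named Demo* or DEMO_* is hypothetical: the item layout, N = 8, and the maximum length are placeholders, and malloc() stands in for palloc().

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_INLINE_ITEMS 8     /* hypothetical N, within the 4-16 range */
#define DEMO_FULL_ITEMS   1024  /* arbitrary stand-in for MaxTIDsPerBTreePage */

/* Hypothetical stand-in for BTScanPosItem. */
typedef struct DemoItem
{
    unsigned int   block;
    unsigned short offset;
} DemoItem;

/* Hypothetical stand-in for BTScanPosData. */
typedef struct DemoInlinePos
{
    int       nitems;
    int       capacity;
    DemoItem *items;            /* items_inline, or a heap array */
    DemoItem  items_inline[DEMO_INLINE_ITEMS];
} DemoInlinePos;

static void
demo_pos_init(DemoInlinePos *pos)
{
    pos->nitems = 0;
    pos->capacity = DEMO_INLINE_ITEMS;
    pos->items = pos->items_inline;
}

/* Append one item, switching to a full-length array on overflow. */
static int
demo_pos_add(DemoInlinePos *pos, DemoItem item)
{
    if (pos->nitems == pos->capacity)
    {
        DemoItem *full;

        if (pos->capacity == DEMO_FULL_ITEMS)
            return -1;          /* page-sized array is already full */

        full = malloc(sizeof(DemoItem) * DEMO_FULL_ITEMS);
        if (full == NULL)
            return -1;          /* out of memory */

        /* Carry over what was collected in the in-line array. */
        memcpy(full, pos->items_inline, sizeof(DemoItem) * pos->nitems);
        pos->items = full;
        pos->capacity = DEMO_FULL_ITEMS;
    }
    pos->items[pos->nitems++] = item;
    return 0;
}

static void
demo_pos_free(DemoInlinePos *pos)
{
    if (pos->items != pos->items_inline)
        free(pos->items);
    demo_pos_init(pos);
}
```

One thing to watch with this layout: because items can point into the struct itself, a plain memcpy() of the whole struct (for instance to save a marked position) would leave the copy's items pointing at the original's in-line array, so any such copy needs to fix up the pointer.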

I'm a bit confused by the "MUST BE LAST" comment:
    BTScanPosItem items[MaxTIDsPerBTreePage];    /* MUST BE LAST */

It's not clear to me why it has to be.  The comment seems to date from rather
long ago:

commit 09cb5c0e7d6
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2006-05-07 01:21:30 +0000

    Rewrite btree index scans to work a page at a time in all cases (both


Greetings,

Andres Freund


