Thread: BTScanOpaqueData size slows down tests

BTScanOpaqueData size slows down tests

From
Andres Freund
Date:
Hi,

I was a bit annoyed at test times just now. Ran a profile on the entire
regression tests in a cassert -Og build.

Unsurprisingly most of the time is spent in AllocSetCheck(). I was mildly
surprised to see how expensive the new compact attribute checks are.

What I was more surprised to realize is how much of the time is spent in
freeing and allocating BTScanOpaqueData.

+    6.94%  postgres         postgres                                  [.] AllocSetCheck
-    4.96%  postgres         libc.so.6                                 [.] __memset_evex_unaligned_erms
   - 1.94% memset@plt
      - 1.12% _int_malloc
         - 1.11% malloc
            - 0.90% AllocSetAllocLarge
               - AllocSetAlloc
                  - 0.77% palloc
                     - 0.63% btbeginscan
                        - index_beginscan_internal
                           - 0.63% index_beginscan
                              - 0.61% systable_beginscan
                                 + 0.22% SearchCatCacheMiss
                                 + 0.07% ScanPgRelation
                                 + 0.05% RelationBuildTupleDesc
                                 + 0.04% findDependentObjects
                                   0.03% GetNewOidWithIndex
                                 + 0.02% deleteOneObject
                                 + 0.02% shdepDropDependency
                                 + 0.02% DeleteComments
                                 + 0.02% SearchCatCacheList
                                 + 0.02% DeleteSecurityLabel
                                 + 0.02% DeleteInitPrivs
                     + 0.04% text_to_cstring
                     + 0.02% cstring_to_text_with_len
                     + 0.02% datumCopy
                     + 0.02% tuplesort_begin_batch
                  + 0.11% palloc_extended
                  + 0.01% AllocSetRealloc
            + 0.20% AllocSetAllocFromNewBlock
      + 0.82% _int_free_merge_chunk
   - 1.90% __memset_evex_unaligned_erms
      - 1.82% wipe_mem
         - 1.33% AllocSetFree
            - 1.33% pfree
               + 0.73% btendscan
               + 0.22% freedfa
               + 0.06% ExecAggCopyTransValue
               + 0.04% freenfa
               + 0.03% enlarge_list
               + 0.03% ExecDropSingleTupleTableSlot
               + 0.02% xmlconcat
               + 0.01% RemoveLocalLock
               + 0.01% errcontext_msg
               + 0.01% IndexScanEnd
                 0.01% heap_free_minimal_tuple
         + 0.49% AllocSetReset
        0.02% palloc0
        0.01% PageInit
      + 0.01% wipe_mem
   + 0.59% alloc_perturb
   + 0.46% asm_exc_page_fault
   + 0.03% asm_sysvec_apic_timer_interrupt
   + 0.02% wipe_mem


Looking at the size of BTScanOpaqueData I am less surprised:

        /* --- cacheline 1 boundary (64 bytes) --- */
        char *                     currTuples;           /*    64     8 */
        char *                     markTuples;           /*    72     8 */
        int                        markItemIndex;        /*    80     4 */

        /* XXX 4 bytes hole, try to pack */

        BTScanPosData              currPos __attribute__((__aligned__(8))); /*    88 13632 */
        /* --- cacheline 214 boundary (13696 bytes) was 24 bytes ago --- */
        BTScanPosData              markPos __attribute__((__aligned__(8))); /* 13720 13632 */

        /* size: 27352, cachelines: 428, members: 17 */
        /* sum members: 27340, holes: 4, sum holes: 12 */
        /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
        /* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));

allocating, zeroing and freeing 28kB of memory for every syscache miss, yea,
that's gonna hurt.


The reason BTScanPosData is that large is that it stores MaxTIDsPerBTreePage *
sizeof(BTScanPosItem) bytes of items:
        BTScanPosItem              items[1358] __attribute__((__aligned__(2))); /*    48 13580 */


Could we perhaps allocate BTScanPosData->items dynamically if more than a
handful of items are needed?
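A minimal sketch of that idea (all names here are invented for illustration, not actual nbtree code): keep a small inline array, and only switch to a full MaxTIDsPerBTreePage-sized heap allocation when a page turns out to hold more items:

```c
#include <stdlib.h>
#include <string.h>

#define INLINE_ITEMS 16
#define MAX_ITEMS    1358       /* stands in for MaxTIDsPerBTreePage */

typedef struct ScanPosItem
{
    unsigned short indexOffset; /* placeholder for the real fields */
} ScanPosItem;

typedef struct ScanPos
{
    int          nitems;
    ScanPosItem *items;         /* points at items_inline or a heap array */
    ScanPosItem  items_inline[INLINE_ITEMS];
} ScanPos;

static void
scanpos_init(ScanPos *pos)
{
    pos->nitems = 0;
    pos->items = pos->items_inline;     /* common case: no allocation */
}

/* Make room for n items, spilling to a full-size heap array at most once. */
static void
scanpos_reserve(ScanPos *pos, int n)
{
    if (n > INLINE_ITEMS && pos->items == pos->items_inline)
    {
        pos->items = malloc(MAX_ITEMS * sizeof(ScanPosItem));
        memcpy(pos->items, pos->items_inline,
               pos->nitems * sizeof(ScanPosItem));
    }
}

static void
scanpos_release(ScanPos *pos)
{
    if (pos->items != pos->items_inline)
        free(pos->items);
}
```

The point of the indirection through `items` is that callers index the array the same way regardless of which storage is in use, so only the reserve/release paths need to know about the spill.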


And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
only when mark/restore are used?


I'd be rather unsurprised if this is an issue not just for tests, but also in a
few real workloads.

Greetings,

Andres Freund



Re: BTScanOpaqueData size slows down tests

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Looking at the size of BTScanOpaqueData I am less surprised:
>         /* size: 27352, cachelines: 428, members: 17 */
> allocating, zeroing and freeing 28kB of memory for every syscache miss, yea,
> that's gonna hurt.

Ouch!  I had no idea it had gotten that big.  Yeah, we ought to
do something about that.

> And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
> only when mark/restore are used?

That'd be an easy way of removing about half of the problem, but
14kB is still too much.  How badly do we need this items array?
Couldn't we just reference the on-page items?

            regards, tom lane



Re: BTScanOpaqueData size slows down tests

From
Peter Geoghegan
Date:
On Wed, Apr 2, 2025 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ouch!  I had no idea it had gotten that big.  Yeah, we ought to
> do something about that.

Tomas Vondra talked about this recently, in the context of his work on
prefetching.

> > And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
> > only when mark/restore are used?
>
> That'd be an easy way of removing about half of the problem, but
> 14kB is still too much.  How badly do we need this items array?
> Couldn't we just reference the on-page items?

I'm not sure what you mean by that. The whole design of _bt_readpage
is based on the idea that we read a whole page, in one go. It has to
batch up the items that are to be returned from the page somewhere.
The worst case is that there are about 1350 TIDs to return from any
single page (assuming default BLCKSZ). It's very pessimistic to start
from the assumption that that worst case will be hit, but I don't see
a way around doing it at least some of the time.

The first thing I'd try is some kind of simple dynamic allocation
scheme, with a small built-in array that avoided any allocation
penalty in the common case where there weren't too many tuples to
return from the page.

The way that we allocate BLCKSZ twice for index-only scans (one for
so->currTuples, the other for so->markTuples) is also pretty
inefficient. Especially because any kind of use of mark and restore is
exceedingly rare.

--
Peter Geoghegan



Re: BTScanOpaqueData size slows down tests

From
Andres Freund
Date:
Hi,

On 2025-04-02 11:36:33 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Looking at the size of BTScanOpaqueData I am less surprised:
> >         /* size: 27352, cachelines: 428, members: 17 */
> > allocating, zeroing and freeing 28kB of memory for every syscache miss, yea,
> > that's gonna hurt.
>
> Ouch!  I had no idea it had gotten that big.  Yeah, we ought to
> do something about that.

It got a bit bigger a few years back, in

commit 0d861bbb702
Author: Peter Geoghegan <pg@bowt.ie>
Date:   2020-02-26 13:05:30 -0800

    Add deduplication to nbtree.

Because the posting list is a lot more dense, more items can be stored on each
page.

Not that it was small before either:

        BTScanPosData              currPos __attribute__((__aligned__(8))); /*    88  4128 */
        /* --- cacheline 65 boundary (4160 bytes) was 56 bytes ago --- */
        BTScanPosData              markPos __attribute__((__aligned__(8))); /*  4216  4128 */

        /* size: 8344, cachelines: 131, members: 16 */
        /* sum members: 8334, holes: 3, sum holes: 10 */
        /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
        /* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));

But obviously ~3.2x can qualitatively change something.


> > And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
> > only when mark/restore are used?
>
> That'd be an easy way of removing about half of the problem, but
> 14kB is still too much.  How badly do we need this items array?
> Couldn't we just reference the on-page items?

I think that'd require acquiring the buffer lock and/or pin more frequently.
But I know very little about nbtree.


I'd assume it's extremely rare for there to be this many items on a page. I'd
guess that something like having BTScanPosData->items point to an in-line
BTScanPosItem items_inline[N] (with N somewhere around 4-16), and dynamically
allocating a full-length BTScanPosItem[MaxTIDsPerBTreePage] just in the cases
where it's needed, would work.

I'm a bit confused by the "MUST BE LAST" comment:
    BTScanPosItem items[MaxTIDsPerBTreePage];    /* MUST BE LAST */

It's not clear why.  It seems to be from rather long back:

commit 09cb5c0e7d6
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2006-05-07 01:21:30 +0000

    Rewrite btree index scans to work a page at a time in all cases (both


Greetings,

Andres Freund



Re: BTScanOpaqueData size slows down tests

From
Peter Geoghegan
Date:
On Wed, Apr 2, 2025 at 11:57 AM Andres Freund <andres@anarazel.de> wrote:
> I'd assume it's extremely rare for there to be this many items on a page. I'd
> guess that something like having BTScanPosData->items point to an in-line
> BTScanPosItem items_inline[N] (with N somewhere around 4-16), and dynamically
> allocating a full-length BTScanPosItem[MaxTIDsPerBTreePage] just in the cases
> where it's needed, would work.

There can still only be MaxIndexTuplesPerPage items on the page (407
if memory serves) -- deduplication didn't change that.

It isn't at all rare for the scan to have to return about 1350 TIDs
from a page, though. Any low cardinality index will tend to have
almost that many TIDs to return on any page that only stores
duplicates. And the scan will necessarily have to return all of the TIDs
from such a page, if it has to return any.

--
Peter Geoghegan



Re: BTScanOpaqueData size slows down tests

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I'm a bit confused by the "MUST BE LAST" comment:
>     BTScanPosItem items[MaxTIDsPerBTreePage];    /* MUST BE LAST */

Yeah, me too.  It looks like it might've been intended to allow
the items[] array to be only partially allocated.  But as Peter
says, we don't do that.  Too long ago to be sure :-(

            regards, tom lane



Re: BTScanOpaqueData size slows down tests

From
Andres Freund
Date:
Hi,

On 2025-04-02 12:01:57 -0400, Peter Geoghegan wrote:
> On Wed, Apr 2, 2025 at 11:57 AM Andres Freund <andres@anarazel.de> wrote:
> > I'd assume it's extremely rare for there to be this many items on a page. I'd
> > guess that something like having BTScanPosData->items point to an in-line
> > BTScanPosItem items_inline[N] (with N somewhere around 4-16), and dynamically
> > allocating a full-length BTScanPosItem[MaxTIDsPerBTreePage] just in the cases
> > where it's needed, would work.
> 
> There can still only be MaxIndexTuplesPerPage items on the page (407
> if memory serves) -- deduplication didn't change that.

Sure.

> It isn't at all rare for the scan to have to return about 1350 TIDs
> from a page, though. Any low cardinality index will tend to have
> almost that many TIDs to return on any page that only stores
> duplicates. And the scan will necessarily have to return all of the TIDs
> from such a page, if it has to return any.

I'm not sure what you're arguing for/against here? Obviously we need to handle
that case.  I doubt that the overhead of a once-per-scan allocation of a
MaxTIDsPerBTreePage * sizeof(BTScanPosItem) array matters when
that many tuples are returned.

Greetings,

Andres Freund



Re: BTScanOpaqueData size slows down tests

From
Peter Geoghegan
Date:
On Wed, Apr 2, 2025 at 12:08 PM Andres Freund <andres@anarazel.de> wrote:
> > It isn't at all rare for the scan to have to return about 1350 TIDs
> > from a page, though. Any low cardinality index will tend to have
> > almost that many TIDs to return on any page that only stores
> > duplicates. And the scan will necessarily have to return all of the TIDs
> > from such a page, if it has to return any.
>
> I'm not sure what you're arguing for/against here? Obviously we need to handle
> that case.  I doubt that the overhead of a once-per-scan allocation of a
> MaxTIDsPerBTreePage * sizeof(BTScanPosItem) array matters when
> that many tuples are returned.

I'm not arguing for or against anything. I'm just giving context.

--
Peter Geoghegan



Re: BTScanOpaqueData size slows down tests

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> It isn't at all rare for the scan to have to return about 1350 TIDs
> from a page, though. Any low cardinality index will tend to have
> almost that many TIDs to return on any page that only stores
> duplicates. And the scan will necessarily have to return all of the TIDs
> from such a page, if it has to return any.

Agreed, but scans that just return one item are also very common,
particularly in the syscache-miss case that Andres started with.

I could get behind the idea of just having enough space in
BTScanOpaqueData for about ten items, and dynamically allocating
a MaxTIDsPerBTreePage-sized array only if we overrun that.
And not allocate any space for mark/restore unless a mark is
done.
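A rough sketch of that lazy mark allocation (invented names, not the actual nbtree structs or any real patch): the mark buffer is a pointer that stays NULL until the first mark, so scans that never mark pay nothing:

```c
#include <stdlib.h>
#include <string.h>

typedef struct ScanPos
{
    int firstItem;
    int lastItem;
    /* ... the big items array would live here ... */
} ScanPos;

typedef struct ScanOpaque
{
    ScanPos  currPos;
    ScanPos *markPos;           /* NULL until a mark is actually taken */
} ScanOpaque;

static void
mark_position(ScanOpaque *so)
{
    if (so->markPos == NULL)
        so->markPos = malloc(sizeof(ScanPos)); /* pay only when marking */
    memcpy(so->markPos, &so->currPos, sizeof(ScanPos));
}
```

Restore and end-of-scan paths would then just have to tolerate `markPos == NULL` meaning "no mark taken".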

            regards, tom lane



Re: BTScanOpaqueData size slows down tests

From
Peter Geoghegan
Date:
On Wed, Apr 2, 2025 at 12:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I could get behind the idea of just having enough space in
> BTScanOpaqueData for about ten items, and dynamically allocating
> a MaxTIDsPerBTreePage-sized array only if we overrun that.
> And not allocate any space for mark/restore unless a mark is
> done.

That's exactly what I had in mind.

--
Peter Geoghegan



Re: BTScanOpaqueData size slows down tests

From
Álvaro Herrera
Date:
On 2025-Apr-02, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
> > I'm a bit confused by the "MUST BE LAST" comment:
> >     BTScanPosItem items[MaxTIDsPerBTreePage];    /* MUST BE LAST */
> 
> Yeah, me too.  It looks like it might've been intended to allow
> the items[] array to be only partially allocated.  But as Peter
> says, we don't do that.  Too long ago to be sure :-(

The mentioned commit (09cb5c0e7d6f) shows that we do partial memcpys, e.g.

+       memcpy(&so->markPos, &so->currPos,
+              offsetof(BTScanPosData, items[1]) +
+              so->currPos.lastItem * sizeof(BTScanPosItem));

That would obviously not work if this field weren't last.  We still do
that, don't we?  See btrestrpos().

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Before you were born your parents weren't as boring as they are now. They
got that way paying your bills, cleaning up your room and listening to you
tell them how idealistic you are."  -- Charles J. Sykes' advice to teenagers



Re: BTScanOpaqueData size slows down tests

From
Peter Geoghegan
Date:
On Wed, Apr 2, 2025 at 12:31 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> The mentioned commit (09cb5c0e7d6f) shows that we do partial memcpys, e.g.
>
> +       memcpy(&so->markPos, &so->currPos,
> +              offsetof(BTScanPosData, items[1]) +
> +              so->currPos.lastItem * sizeof(BTScanPosItem));
>
> That would obviously not work if this field weren't last.  We still do
> that, don't we?  See btrestrpos().

Yes, we do -- but it's rare. It only happens in the case where we
restore a mark after leaving the page where the mark was taken. In
practice, merge joins tend to hardly ever do that (I know this because
it makes testing annoyingly difficult). In other words, the
optimization added by commit 08ae5edc5c seems to be very effective in
practice.

--
Peter Geoghegan



Re: BTScanOpaqueData size slows down tests

From
Álvaro Herrera
Date:
On 2025-Apr-02, Peter Geoghegan wrote:

> On Wed, Apr 2, 2025 at 12:31 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > The mentioned commit (09cb5c0e7d6f) shows that we do partial memcpys, e.g.
> >
> > +       memcpy(&so->markPos, &so->currPos,
> > +              offsetof(BTScanPosData, items[1]) +
> > +              so->currPos.lastItem * sizeof(BTScanPosItem));
> >
> > That would obviously not work if this field weren't last.  We still do
> > that, don't we?  See btrestrpos().
> 
> Yes, we do -- but it's rare. It only happens in the case where we
> restore a mark after leaving the page where the mark was taken. In
> practice, merge joins tend to hardly ever do that (I know this because
> it makes testing annoyingly difficult). In other words, the
> optimization added by commit 08ae5edc5c seems to be very effective in
> practice.

I'm just saying that this is the reason to have the field be last in the
struct.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)



Re: BTScanOpaqueData size slows down tests

From
Peter Geoghegan
Date:
On Wed, Apr 2, 2025 at 12:43 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I'm just saying that this is the reason to have the field be last in the
> struct.

Right. And I'm just saying that I don't think that it would take very
much effort to change it. That aspect isn't performance critical.

--
Peter Geoghegan



Re: BTScanOpaqueData size slows down tests

From
Tomas Vondra
Date:
On 4/2/25 17:45, Peter Geoghegan wrote:
> On Wed, Apr 2, 2025 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ouch!  I had no idea it had gotten that big.  Yeah, we ought to
>> do something about that.
> 
> Tomas Vondra talked about this recently, in the context of his work on
> prefetching.
> 

I might have mentioned it in the context of index prefetching (because that
has to touch this, naturally), but I actually ran into this when working
on the fast-path locking [1].

[1]
https://www.postgresql.org/message-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com

One of the tests I did was with partitions, with index scans on tiny
partitions that performed pretty awfully simply because of the malloc() calls.
The struct exceeds ALLOCSET_SEPARATE_THRESHOLD, so it can't be cached,
and even if it could, we would not cache it across scans anyway.

>>> And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
>>> only when mark/restore are used?
>>
>> That'd be an easy way of removing about half of the problem, but
>> 14kB is still too much.  How badly do we need this items array?
>> Couldn't we just reference the on-page items?
> 
> I'm not sure what you mean by that. The whole design of _bt_readpage
> is based on the idea that we read a whole page, in one go. It has to
> batch up the items that are to be returned from the page somewhere.
> The worst case is that there are about 1350 TIDs to return from any
> single page (assuming default BLCKSZ). It's very pessimistic to start
> from the assumption that that worst case will be hit, but I don't see
> a way around doing it at least some of the time.
> 
> The first thing I'd try is some kind of simple dynamic allocation
> scheme, with a small built-in array that avoided any allocation
> penalty in the common case where there weren't too many tuples to
> return from the page.
> 
> The way that we allocate BLCKSZ twice for index-only scans (one for
> so->currTuples, the other for so->markTuples) is also pretty
> inefficient. Especially because any kind of use of mark and restore is
> exceedingly rare.
> 

Yeah, something like this (allocating smaller arrays unless more is
actually needed) would help many common cases.

Another thing that helped was setting the MALLOC_TOP_PAD_ environment variable
(or doing the same thing via mallopt), so that glibc keeps a "buffer" for
future allocations.
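For reference, the mallopt() form of that knob looks like this (glibc-specific; the 32MB figure is just an arbitrary example, not a recommendation):

```c
#include <malloc.h>

/*
 * Equivalent to exporting MALLOC_TOP_PAD_ before starting the process:
 * glibc keeps this much extra space when trimming the heap, so freeing
 * a large chunk doesn't immediately return memory to the kernel only
 * to fault it back in on the next big allocation.
 */
static int
set_malloc_top_pad(void)
{
    return mallopt(M_TOP_PAD, 32 * 1024 * 1024); /* returns 1 on success */
}
```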


regards

-- 
Tomas Vondra




Re: BTScanOpaqueData size slows down tests

From
David Rowley
Date:
On Thu, 3 Apr 2025 at 04:21, Andres Freund <andres@anarazel.de> wrote:
> I was mildly
> surprised to see how expensive the new compact attribute checks are.

Is this a fairly deform-heavy workload?  FWIW, before adding that
Assert, I did test to see if there was any measurable impact on the
time it took to run all the tests [1] and I couldn't measure any. I
feel having something to check that these are in sync is worthwhile. It did
find bugs during the development of the patch.

David

[1] https://postgr.es/m/CAApHDvpHmUuFY9rtRw6oiG7OTM-o%2BBnODg1m-EynKGbbhSsMzA%40mail.gmail.com



Re: BTScanOpaqueData size slows down tests

From
Andres Freund
Date:
Hi,

On 2025-04-03 11:26:08 +1300, David Rowley wrote:
> On Thu, 3 Apr 2025 at 04:21, Andres Freund <andres@anarazel.de> wrote:
> > I was mildly
> > surprised to see how expensive the new compact attribute checks are.
> 
> Is this a fairly deform-heavy workload?

It was our regression tests, although I was playing around with combining
various groups in the schedule to get the tests to pass faster overall (on
average we just use 2.8 cores, which seems depressingly low).

I do see about a 6-7% increase in test time with just the default schedule.


> I feel having something to check these are in sync is worthwhile. It did
> find bugs during the development of the patch.

I agree. But I wonder if we couldn't make it a bit smarter than right
now. Afaict we'll recheck it over and over, which does seem a bit extreme.

When running queries that deform a lot, the overhead is rather substantial.

With pgbench -f <(echo 'SELECT sum(pronargs) FROM pg_proc;') I see 2155 TPS
with the call to verify_compact_attribute() commented out and 827 TPS with it
in place.

Greetings,

Andres Freund