Re: BTScanOpaqueData size slows down tests - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: BTScanOpaqueData size slows down tests
Date
Msg-id 4434c0a9-4e04-4130-bd88-23619873a48d@vondra.me
In response to Re: BTScanOpaqueData size slows down tests  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On 4/2/25 17:45, Peter Geoghegan wrote:
> On Wed, Apr 2, 2025 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ouch!  I had no idea it had gotten that big.  Yeah, we ought to
>> do something about that.
> 
> Tomas Vondra talked about this recently, in the context of his work on
> prefetching.
> 

I might have mentioned it in the context of index prefetching (because
that naturally has to touch this), but I actually ran into it while
working on fast-path locking [1].

[1]
https://www.postgresql.org/message-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com

One of the tests I did used partitions, with index scans on tiny
partitions, and it got pretty awful simply because of the malloc()
calls. The struct exceeds ALLOCSET_SEPARATE_THRESHOLD, so the memory
context can't cache it, and even if it could, we don't cache it across
scans anyway.
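(To illustrate the caching point, here is a deliberately simplified
sketch of the idea behind a size threshold like
ALLOCSET_SEPARATE_THRESHOLD; this is not PostgreSQL's actual aset.c
code, and `SEPARATE_THRESHOLD`, `ctx_alloc`, `ctx_free` are made-up
names. Small chunks can be recycled through a free list, but anything
above the threshold goes straight to malloc()/free() every time:)

```c
#include <stdlib.h>

/* Hypothetical threshold, mirroring the idea behind
 * ALLOCSET_SEPARATE_THRESHOLD: chunks above it bypass the
 * allocator's free lists and go straight to malloc()/free(). */
#define SEPARATE_THRESHOLD 8192

static void *freelist = NULL;	/* one cached small chunk, for illustration */

static void *
ctx_alloc(size_t size)
{
	if (size > SEPARATE_THRESHOLD)
		return malloc(size);	/* "separate" chunk: never cached */

	if (freelist != NULL)		/* small chunk: reuse the cached one */
	{
		void *p = freelist;

		freelist = NULL;
		return p;
	}
	return malloc(SEPARATE_THRESHOLD);
}

static void
ctx_free(void *p, size_t size)
{
	if (size > SEPARATE_THRESHOLD || freelist != NULL)
		free(p);				/* oversized chunks always hit free() */
	else
		freelist = p;			/* small chunks go back on the free list */
}
```

A struct larger than the threshold therefore pays the full malloc()
cost on every scan, which is exactly what showed up in the tiny-
partition tests.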

>>> And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
>>> only when mark/restore are used?
>>
>> That'd be an easy way of removing about half of the problem, but
>> 14kB is still too much.  How badly do we need this items array?
>> Couldn't we just reference the on-page items?
> 
> I'm not sure what you mean by that. The whole design of _bt_readpage
> is based on the idea that we read a whole page, in one go. It has to
> batch up the items that are to be returned from the page somewhere.
> The worst case is that there are about 1350 TIDs to return from any
> single page (assuming default BLCKSZ). It's very pessimistic to start
> from the assumption that that worst case will be hit, but I don't see
> a way around doing it at least some of the time.
> 
> The first thing I'd try is some kind of simple dynamic allocation
> scheme, with a small built-in array that avoided any allocation
> penalty in the common case where there weren't too many tuples to
> return from the page.
> 
> The way that we allocate BLCKSZ twice for index-only scans (one for
> so->currTuples, the other for so->markTuples) is also pretty
> inefficient. Especially because any kind of use of mark and restore is
> exceedingly rare.
> 

Yeah, something like this (allocating smaller arrays unless more is
actually needed) would help many common cases.
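A minimal sketch of that kind of scheme (the names here, like
`ScanItems` and `BUILTIN_ITEMS`, are hypothetical, not from any actual
patch): keep a small fixed array inside the scan state and only
allocate the worst-case array when a page actually returns more items
than the built-in array can hold.

```c
#include <stdlib.h>
#include <string.h>

#define BUILTIN_ITEMS 64		/* covers the common case without malloc() */

typedef struct ScanItems
{
	int		nitems;
	int		capacity;
	int	   *items;				/* points at builtin[] or a heap array */
	int		builtin[BUILTIN_ITEMS];
} ScanItems;

static void
scan_items_init(ScanItems *s)
{
	s->nitems = 0;
	s->capacity = BUILTIN_ITEMS;
	s->items = s->builtin;
}

/* Grow toward the worst case only when a page actually needs it. */
static void
scan_items_reserve(ScanItems *s, int needed)
{
	if (needed <= s->capacity)
		return;
	if (s->items == s->builtin)
	{
		s->items = malloc(needed * sizeof(int));
		memcpy(s->items, s->builtin, s->nitems * sizeof(int));
	}
	else
		s->items = realloc(s->items, needed * sizeof(int));
	s->capacity = needed;
}

static void
scan_items_term(ScanItems *s)
{
	if (s->items != s->builtin)
		free(s->items);
}
```

Scans on tiny partitions would then never call malloc() for the items
array at all, while a page with the ~1350-TID worst case still works,
just with one allocation.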

Another thing that helped was setting the MALLOC_TOP_PAD_ environment
variable (or doing the same thing via mallopt()), so that glibc keeps a
"buffer" of freed memory for future allocations instead of returning it
to the kernel.


regards

-- 
Tomas Vondra



