On 4/2/25 17:45, Peter Geoghegan wrote:
> On Wed, Apr 2, 2025 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ouch! I had no idea it had gotten that big. Yeah, we ought to
>> do something about that.
>
> Tomas Vondra talked about this recently, in the context of his work on
> prefetching.
>
I might have mentioned it in the context of index prefetching (because that
has to touch this, naturally), but I actually ran into this when working
on the fast-path locking [1].
[1] https://www.postgresql.org/message-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com
One of the tests I did was with partitions, and with index scans on
tiny partitions that got pretty awful simply because of malloc() calls.
The struct exceeds ALLOCSET_SEPARATE_THRESHOLD, so it can't be cached,
and even if it could we would not cache it across scans anyway.
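For a rough sense of the sizes involved, here is a back-of-the-envelope
sketch of the arithmetic. All the constants are assumptions for a default
build (8kB BLCKSZ, 24-byte page header, 16-byte btree special area, 6-byte
TID, 10-byte scan item, 8kB ALLOCSET_SEPARATE_THRESHOLD), written out by
hand rather than taken from the PostgreSQL headers, and the sketch_* names
are made up for illustration:

```c
#include <stddef.h>

/* Assumed values for a default build; not read from the real headers. */
#define SKETCH_BLCKSZ              8192
#define SKETCH_PAGE_HEADER_SIZE    24    /* assumed SizeOfPageHeaderData */
#define SKETCH_BT_OPAQUE_SIZE      16    /* assumed sizeof(BTPageOpaqueData) */
#define SKETCH_TID_SIZE            6     /* assumed sizeof(ItemPointerData) */
#define SKETCH_SCAN_ITEM_SIZE      10    /* assumed sizeof(BTScanPosItem) */
#define SKETCH_SEPARATE_THRESHOLD  8192  /* assumed ALLOCSET_SEPARATE_THRESHOLD */

/* Worst-case number of TIDs a single btree page could return. */
static size_t
sketch_max_tids_per_page(void)
{
    return (SKETCH_BLCKSZ - SKETCH_PAGE_HEADER_SIZE - SKETCH_BT_OPAQUE_SIZE)
        / SKETCH_TID_SIZE;
}

/* Bytes needed for one worst-case items array (one per scan position). */
static size_t
sketch_items_array_bytes(void)
{
    return sketch_max_tids_per_page() * SKETCH_SCAN_ITEM_SIZE;
}

/* Nonzero if a single items array alone already exceeds the threshold. */
static int
sketch_exceeds_separate_threshold(void)
{
    return sketch_items_array_bytes() > SKETCH_SEPARATE_THRESHOLD;
}
```

Under those assumptions one items array is a bit over 13kB, and the scan
opaque struct carries two of them (current position and mark position), so
it lands well above the threshold either way.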
>>> And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
>>> only when mark/restore are used?
>>
>> That'd be an easy way of removing about half of the problem, but
>> 14kB is still too much. How badly do we need this items array?
>> Couldn't we just reference the on-page items?
>
> I'm not sure what you mean by that. The whole design of _bt_readpage
> is based on the idea that we read a whole page, in one go. It has to
> batch up the items that are to be returned from the page somewhere.
> The worst case is that there are about 1350 TIDs to return from any
> single page (assuming default BLCKSZ). It's very pessimistic to start
> from the assumption that that worst case will be hit, but I don't see
> a way around doing it at least some of the time.
>
> The first thing I'd try is some kind of simple dynamic allocation
> scheme, with a small built-in array that avoided any allocation
> penalty in the common case where there weren't too many tuples to
> return from the page.
>
> The way that we allocate BLCKSZ twice for index-only scans (one for
> so->currTuples, the other for so->markTuples) is also pretty
> inefficient. Especially because any kind of use of mark and restore is
> exceedingly rare.
>
Yeah, something like this (allocating smaller arrays unless more is
actually needed) would help many common cases.
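A minimal sketch of what "small built-in array, allocate only when needed"
could look like. This is not a patch and not how nbtree is structured: the
SketchItem/SketchItemBuf names, the inline capacity of 16 and the doubling
policy are all made up for illustration, and it uses plain malloc()/realloc()
instead of palloc()/repalloc() just to stay self-contained:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_INLINE_ITEMS 16      /* hypothetical inline capacity */

/* Hypothetical stand-in for a per-tuple scan item. */
typedef struct SketchItem
{
    uint16_t    tid[3];             /* placeholder for a 6-byte heap TID */
    uint16_t    index_offset;       /* offset of the item on the index page */
    uint16_t    tuple_offset;       /* offset into a tuple workspace */
} SketchItem;

typedef struct SketchItemBuf
{
    SketchItem *items;              /* inline_items, or a heap array */
    size_t      nitems;
    size_t      capacity;
    SketchItem  inline_items[SKETCH_INLINE_ITEMS];
} SketchItemBuf;

static void
sketch_buf_init(SketchItemBuf *buf)
{
    buf->items = buf->inline_items;
    buf->nitems = 0;
    buf->capacity = SKETCH_INLINE_ITEMS;
}

/* Append one item; returns 0 on success, -1 if allocation fails. */
static int
sketch_buf_append(SketchItemBuf *buf, SketchItem item)
{
    if (buf->nitems == buf->capacity)
    {
        size_t      newcap = buf->capacity * 2;
        SketchItem *newitems;

        if (buf->items == buf->inline_items)
        {
            /* First spill: move the inline contents to the heap. */
            newitems = malloc(newcap * sizeof(SketchItem));
            if (newitems == NULL)
                return -1;
            memcpy(newitems, buf->inline_items,
                   buf->nitems * sizeof(SketchItem));
        }
        else
        {
            newitems = realloc(buf->items, newcap * sizeof(SketchItem));
            if (newitems == NULL)
                return -1;
        }
        buf->items = newitems;
        buf->capacity = newcap;
    }
    buf->items[buf->nitems++] = item;
    return 0;
}

/* Release any heap array and fall back to the inline one. */
static void
sketch_buf_free(SketchItemBuf *buf)
{
    if (buf->items != buf->inline_items)
        free(buf->items);
    sketch_buf_init(buf);
}
```

Pages returning at most SKETCH_INLINE_ITEMS tuples never touch the
allocator. Note that the struct points into itself, so it must not be
copied with plain struct assignment while the inline array is in use.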
Another thing that helped was setting the MALLOC_TOP_PAD_ environment
variable (or doing the same thing using mallopt), so that glibc keeps a
"buffer" of memory at the top of the heap for future allocations.
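For reference, the mallopt variant is roughly the following. This is
glibc-specific (mallopt() and M_TOP_PAD come from <malloc.h>), the wrapper
name is hypothetical, and the 64MB value in the usage note is only an
example, not a tuned recommendation:

```c
#include <malloc.h>     /* glibc-specific: mallopt(), M_TOP_PAD */

/*
 * Ask glibc to keep "bytes" of padding at the top of the heap, so that
 * freed memory is not handed back to the kernel right away and later
 * allocations can be served without growing the heap again.
 *
 * Returns 1 on success and 0 on error, same as mallopt() itself.
 */
static int
sketch_set_top_pad(int bytes)
{
    return mallopt(M_TOP_PAD, bytes);
}
```

Calling sketch_set_top_pad(64 * 1024 * 1024) early in the process should
have about the same effect as starting it with
MALLOC_TOP_PAD_=67108864 in the environment.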
regards
--
Tomas Vondra