Re: scalability bottlenecks with (many) partitions (and more) - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Msg-id: c3cddb9d-283e-4caf-b558-5c9196320650@enterprisedb.com
In response to: Re: scalability bottlenecks with (many) partitions (and more) (Ronan Dunklau <ronan.dunklau@aiven.io>)
List: pgsql-hackers
On 1/29/24 09:53, Ronan Dunklau wrote:
> On Sunday, 28 January 2024 at 22:57:02 CET, Tomas Vondra wrote:
> 
> Hi Tomas!
> 
> I'll comment on the glibc-malloc part, as I studied it last year and 
> proposed some things here: 
> https://www.postgresql.org/message-id/3424675.QJadu78ljV%40aivenlaptop
> 

Thanks for reminding me. I'll re-read that thread.

> 
>> FWIW where does the malloc overhead come from? For one, while we do have
>> some caching of malloc-ed memory in memory contexts, that doesn't quite
>> work cross-query, because we destroy the contexts at the end of the
>> query. We attempt to cache the memory contexts too, but in this case
>> that can't help because the allocations come from btbeginscan() where we
>> do this:
>>
>>     so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
>>
>> and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
>> thus always allocated using a separate malloc() call. Maybe we could
>> break it into smaller/cacheable parts, but I haven't tried, and I doubt
>> it's the only such allocation.
> 
> Did you try running strace on the process? That may give you some 
> insights into what malloc is doing. A more sophisticated approach would be 
> to use stap and plug into the malloc probes, for example 
> memory_sbrk_more and memory_sbrk_less. 
> 
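For what it's worth, here is a rough standalone sketch of the oversized-chunk
behaviour quoted above. toy_palloc() and CHUNK_LIMIT are made-up stand-ins,
not the actual aset.c code; the 8kB value matches ALLOC_CHUNK_LIMIT, the
compile-time upper bound on what the context freelists will serve.

    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK_LIMIT 8192    /* mirrors ALLOC_CHUNK_LIMIT in aset.c */

    static void *
    toy_palloc(size_t size)
    {
        if (size > CHUNK_LIMIT)
            /* oversized: a dedicated malloc() on every call, nothing cached */
            printf("%zu bytes -> separate malloc() block\n", size);
        else
            /* small: carved from an already-malloc'ed block, usually reused */
            printf("%zu bytes -> carved from a cached block\n", size);
        return malloc(size);
    }

    int
    main(void)
    {
        void *so = toy_palloc(27 * 1024);   /* roughly sizeof(BTScanOpaqueData) */
        void *tiny = toy_palloc(256);

        free(tiny);
        free(so);
        return 0;
    }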

No, I haven't tried that. In my experience strace is pretty expensive,
and if the issue is in glibc itself (before it does the syscalls),
strace won't really tell us much. Not sure, ofc.

> An important part of glibc's malloc behaviour in that regard comes from the 
> adjustment of the mmap and trim thresholds. By default, glibc adjusts them 
> dynamically, and you can poke into that using the 
> memory_mallopt_free_dyn_thresholds probe.
> 

Thanks, I'll take a look at that.
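
As a lighter-weight (and admittedly cruder) alternative to the stap probes,
glibc can also report the arena state from inside the process via
malloc_info(), which shows how much memory was obtained via sbrk vs. mmap and
how much of it is currently free. A minimal sketch, not something from the
earlier thread:

    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        void *p = malloc(64 * 1024);    /* some allocation activity */

        /* dump per-arena malloc statistics as XML to stdout (glibc-specific) */
        malloc_info(0, stdout);

        free(p);
        return 0;
    }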

>>
>> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
>> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
>> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
>> Which is nice, it confirms it's glibc-specific (I wonder if there's a
>> way to tweak glibc to address this), and it also means systems using
>> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
>> says the mempool has ~20% benefit on top of jemalloc.
> 
> GLIBC's malloc offers some tuning for this. In particular, setting either 
> M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto 
> adjustment" behaviour and allow you to control what it's doing. 
> 
> By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using 
> sbrk isn't freed as easily, and you don't run into a pattern of moving the 
> sbrk pointer up and down repeatedly. The automatic trade-off between the mmap 
> and trim thresholds is supposed to prevent that, but the way it is incremented 
> means you can end up in a bad place depending on your particular allocation 
> pattern.
> 

So, what values would you recommend for these parameters?

My concern is that increasing those values would lead to (much) higher memory
usage, with little control over it. With the mempool we keep more
blocks, ofc, but we have control over freeing the memory.
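
For reference, a minimal sketch of the tuning described above via mallopt();
the values are made up for illustration, not recommendations. Per mallopt(3),
explicitly setting either parameter disables the dynamic threshold adjustment.

    #include <malloc.h>
    #include <stdlib.h>

    int
    main(void)
    {
        /* keep up to 64MB of free heap memory instead of trimming it back */
        if (mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024) == 0)
            return 1;

        /* serve allocations above 4MB with mmap(); smaller ones from the heap */
        if (mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024) == 0)
            return 1;

        /* ... allocation-heavy workload would go here ... */
        void *p = malloc(32 * 1024);
        free(p);
        return 0;
    }

The same thing can be done without touching the binary via the
MALLOC_TRIM_THRESHOLD_ and MALLOC_MMAP_THRESHOLD_ environment variables.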


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


