Re: Add GUC to tune glibc's malloc implementation. - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Add GUC to tune glibc's malloc implementation.
Msg-id: 20230628223101.jprqvuxyzthdehdm@awork3.anarazel.de
In response to: Re: Add GUC to tune glibc's malloc implementation. (Ronan Dunklau <ronan.dunklau@aiven.io>)
Responses: Re: Add GUC to tune glibc's malloc implementation.
List: pgsql-hackers
Hi,

On 2023-06-28 07:26:03 +0200, Ronan Dunklau wrote:
> I see it as a way to have *some* sort of control over the malloc
> implementation we use, instead of tuning our allocations pattern on top of it
> while treating it entirely as a black box. As for the tuning, I proposed
> earlier to replace this parameter expressed in terms of size as a "profile"
> (greedy / conservative) to make it easier to pick a sensible value.

I don't think that makes it very usable - we'll still have idle connections
use up a lot more memory than now in some cases, and not in others, even
though it doesn't help. And it will be very heavily dependent on the OS and
glibc version.


> Le mardi 27 juin 2023, 20:17:46 CEST Andres Freund a écrit :
> > > Except if you hinted we should write our own directly instead ?
> > > > We e.g. could keep a larger number of memory blocks reserved
> > > > ourselves. Possibly by delaying the release of additionally held blocks
> > > > until we have been idle for a few seconds or such.
> > >
> > > I think keeping work_mem around after it has been used a couple times make
> > > sense. This is the memory a user is willing to dedicate to operations,
> > > after all.
> >
> > The biggest overhead of returning pages to the kernel is that that triggers
> > zeroing the data during the next allocation. Particularly on multi-node
> > servers that's surprisingly slow. It's most commonly not the brk() or
> > mmap() themselves that are the performance issue.
> >
> > Indeed, with your benchmark, I see that most of the time, on my dual Xeon
> > Gold 5215 workstation, is spent zeroing newly allocated pages during page
> > faults. That microarchitecture is worse at this than some others, but it's
> > never free (or cache friendly).
>
> I'm not sure I see the practical difference between those, but that's
> interesting. Were you able to reproduce my results ?

I see a bit smaller win than what you observed, but it is substantial. The
runtime difference between the "default" and "cached" malloc is almost
entirely in these bits:

cached:
-    8.93%  postgres  libc.so.6          [.] __memmove_evex_unaligned_erms
   - __memmove_evex_unaligned_erms
      + 6.77% minimal_tuple_from_heap_tuple
      + 2.04% _int_realloc
      + 0.04% AllocSetRealloc
        0.02% 0x56281094806f
        0.02% 0x56281094e0bf

vs uncached:

-   14.52%  postgres  libc.so.6          [.] __memmove_evex_unaligned_erms
     8.61% asm_exc_page_fault
   - 5.91% __memmove_evex_unaligned_erms
      + 5.78% minimal_tuple_from_heap_tuple
        0.04% 0x560130a2900f
        0.02% 0x560130a20faf
      + 0.02% AllocSetRealloc
      + 0.02% _int_realloc
+    3.81%  postgres  [kernel.vmlinux]   [k] native_irq_return_iret
+    1.88%  postgres  [kernel.vmlinux]   [k] __handle_mm_fault
+    1.76%  postgres  [kernel.vmlinux]   [k] clear_page_erms
+    1.67%  postgres  [kernel.vmlinux]   [k] get_mem_cgroup_from_mm
+    1.42%  postgres  [kernel.vmlinux]   [k] cgroup_rstat_updated
+    1.00%  postgres  [kernel.vmlinux]   [k] get_page_from_freelist
+    0.93%  postgres  [kernel.vmlinux]   [k] mtree_range_walk

None of the latter are visible in a profile in the cached case. I.e. the
overhead is encountering page faults and individually allocating the
necessary memory in the kernel. This isn't surprising, I just wanted to make
sure I entirely understand.

Part of the reason this code is a bit worse is that it's using generation.c,
which doesn't cache any part of the context. Not that aset.c's level of
caching would help a lot, given that it caches the context itself, not later
blocks.
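To make that kernel-side cost visible outside of Postgres, here is a
standalone sketch (not Postgres or patch code; the buffer size and iteration
count are arbitrary) comparing re-faulting freshly mapped pages on every
round with reusing an already-touched buffer:

/*
 * Standalone sketch (not Postgres code) of the effect in the profiles above:
 * pages handed back to the kernel have to be faulted in again and zeroed on
 * the next use, while writing to an already-touched buffer costs only the
 * copy itself.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define BUF_SIZE   ((size_t) 64 * 1024 * 1024)
#define ITERATIONS 50

static double
elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int
main(void)
{
    struct timespec t0, t1;

    /* "uncached": fresh anonymous pages every round -> page faults + zeroing */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++)
    {
        char *p = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }
        memset(p, 'x', BUF_SIZE);   /* first touch faults in zero-filled pages */
        munmap(p, BUF_SIZE);        /* hand the pages back to the kernel */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("fresh pages every round: %.1f ms\n", elapsed_ms(t0, t1));

    /* "cached": the same pages are reused, so they fault in only once */
    char *q = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (q == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++)
        memset(q, 'x', BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("reused buffer:           %.1f ms\n", elapsed_ms(t0, t1));

    munmap(q, BUF_SIZE);
    return 0;
}

Profiling the first loop with perf should show the same kernel symbols as the
"uncached" case above (asm_exc_page_fault, clear_page_erms, ...), while the
second loop is dominated by the memset itself.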
> > FWIW, in my experience trimming the brk()ed region doesn't work reliably
> > enough in real world postgres workloads to be worth relying on (from a
> > memory usage POV). Sooner or later you're going to have longer lived
> > allocations placed that will prevent it from happening.
>
> I'm not sure I follow: given our workload is clearly split at queries and
> transactions boundaries, releasing memory at that time, I've assumed (and
> noticed in practice, albeit not on a production system) that most memory at
> the top of the heap would be trimmable as we don't keep much in between
> queries / transactions.

That's true for very simple workloads, but once you're beyond that you just
need some longer-lived allocation to happen. E.g. some relcache / catcache
miss during the query execution, and there's no extant memory in
CacheMemoryContext, so a new block is allocated.

Greetings,

Andres Freund
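PS: That failure mode is easy to demonstrate with glibc's malloc alone. A
standalone sketch (not Postgres code; chunk sizes, counts, and the exact
placement behaviour are assumptions that vary between glibc versions):

/*
 * Illustrative, glibc-specific sketch of the failure mode above: after a
 * burst of short-lived allocations, one small allocation that stays alive
 * near the top of the brk()ed heap keeps glibc's automatic trim on free()
 * from shrinking the heap again.
 */
#include <malloc.h>
#include <stdlib.h>
#include <string.h>

#define NCHUNKS   1024
#define CHUNKSIZE (64 * 1024)   /* below the default mmap threshold, so these
                                 * are carved out of the brk()ed main heap */

int
main(void)
{
    static char *chunks[NCHUNKS];

    /* "query lifetime" allocations, ~64 MB on the main heap */
    for (int i = 0; i < NCHUNKS; i++)
    {
        chunks[i] = malloc(CHUNKSIZE);
        memset(chunks[i], 'x', CHUNKSIZE);
    }

    /*
     * A small longer-lived allocation made mid-"query" (think of a relcache /
     * catcache entry): it typically ends up above the chunks, near the top of
     * the heap.
     */
    char *long_lived = malloc(512);

    /* End of "query": everything short-lived is freed ... */
    for (int i = 0; i < NCHUNKS; i++)
        free(chunks[i]);

    /*
     * ... but the heap cannot shrink much, because the automatic trim only
     * shortens the top of the heap and 'long_lived' is sitting there.
     * malloc_stats() (printed to stderr) should show "system bytes" staying
     * roughly 64 MB above "in use bytes".
     */
    malloc_stats();

    free(long_lived);
    return 0;
}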