Re: Add GUC to tune glibc's malloc implementation. - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Add GUC to tune glibc's malloc implementation.
Msg-id: 20230628223101.jprqvuxyzthdehdm@awork3.anarazel.de
In response to: Re: Add GUC to tune glibc's malloc implementation. (Ronan Dunklau <ronan.dunklau@aiven.io>)
List: pgsql-hackers
Hi,

On 2023-06-28 07:26:03 +0200, Ronan Dunklau wrote:
> I see it as a way to have *some* sort of control over the malloc
> implementation we use, instead of tuning our allocations pattern on top of it
> while treating it entirely as a black box. As for the tuning, I proposed
> earlier to replace this parameter expressed in terms of size as a "profile"
> (greedy / conservative) to make it easier to pick a sensible value.

I don't think that makes it very usable - in some cases idle connections will
still use up a lot more memory than now, even when the extra retention doesn't
help, and in others they won't. And the behavior will be heavily dependent on
the OS and glibc version.


> Le mardi 27 juin 2023, 20:17:46 CEST Andres Freund a écrit :
> > > Except if you hinted we should write our own directly instead?
> > > > We e.g. could keep a larger number of memory blocks reserved
> > > > ourselves. Possibly by delaying the release of additionally held blocks
> > > > until we have been idle for a few seconds or such.
> > >
> > > I think keeping work_mem around after it has been used a couple times make
> > > sense. This is the memory a user is willing to dedicate to operations,
> > > after all.
> >
> > The biggest overhead of returning pages to the kernel is that doing so
> > triggers zeroing the data during the next allocation. Particularly on
> > multi-node servers that's surprisingly slow.  It's most commonly not the
> > brk() or mmap() calls themselves that are the performance issue.
> >
> > Indeed, with your benchmark, I see that most of the time, on my dual Xeon
> > Gold 5215 workstation, is spent zeroing newly allocated pages during page
> > faults. That microarchitecture is worse at this than some others, but it's
> > never free (or cache friendly).
>
> I'm not sure I see the practical difference between those, but that's
> interesting. Were you able to reproduce my results?

I see a bit smaller win than what you observed, but it is substantial.


The runtime difference between the "default" and "cached" malloc are almost
entirely in these bits:

cached:
-    8.93%  postgres  libc.so.6         [.] __memmove_evex_unaligned_erms
   - __memmove_evex_unaligned_erms
      + 6.77% minimal_tuple_from_heap_tuple
      + 2.04% _int_realloc
      + 0.04% AllocSetRealloc
        0.02% 0x56281094806f
        0.02% 0x56281094e0bf

vs

uncached:

-   14.52%  postgres  libc.so.6         [.] __memmove_evex_unaligned_erms
     8.61% asm_exc_page_fault
   - 5.91% __memmove_evex_unaligned_erms
      + 5.78% minimal_tuple_from_heap_tuple
        0.04% 0x560130a2900f
        0.02% 0x560130a20faf
      + 0.02% AllocSetRealloc
      + 0.02% _int_realloc

+    3.81%  postgres  [kernel.vmlinux]  [k] native_irq_return_iret
+    1.88%  postgres  [kernel.vmlinux]  [k] __handle_mm_fault
+    1.76%  postgres  [kernel.vmlinux]  [k] clear_page_erms
+    1.67%  postgres  [kernel.vmlinux]  [k] get_mem_cgroup_from_mm
+    1.42%  postgres  [kernel.vmlinux]  [k] cgroup_rstat_updated
+    1.00%  postgres  [kernel.vmlinux]  [k] get_page_from_freelist
+    0.93%  postgres  [kernel.vmlinux]  [k] mtree_range_walk

None of the latter are visible in a profile in the cached case.

I.e. the overhead comes from encountering page faults and having the kernel
allocate the necessary memory page by page.


This isn't surprising, I just wanted to make sure I entirely understand.


Part of the reason this code is a bit worse is that it's using generation.c,
which doesn't cache any part of the context. Not that aset.c's level of
caching would help a lot, given that it caches the context itself, not later
blocks.
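The kind of block-level caching that would help is roughly the following (a hypothetical sketch, not PostgreSQL's aset.c or generation.c code): instead of free()ing a block as soon as the context is done with it, keep a small stash of blocks and hand them back out on the next allocation, so the memory stays mapped and no fault-and-zero cycle occurs:

```c
/* Sketch of per-process block caching; names and sizes are made up. */
#include <stdlib.h>

#define CACHE_SLOTS 8
#define BLOCK_SIZE  (8 * 1024)

static void *block_cache[CACHE_SLOTS];
static int   ncached = 0;

static void *
block_alloc(void)
{
    if (ncached > 0)
        return block_cache[--ncached];  /* reuse still-mapped memory */
    return malloc(BLOCK_SIZE);
}

static void
block_free(void *block)
{
    if (ncached < CACHE_SLOTS)
        block_cache[ncached++] = block; /* retain instead of releasing */
    else
        free(block);
}
```

The "delay release until idle for a few seconds" idea upthread would add a timestamp per cached block and a periodic sweep; the stash above is the simplest form.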


> > FWIW, in my experience trimming the brk()ed region doesn't work reliably
> > enough in real world postgres workloads to be worth relying on (from a
> > memory usage POV). Sooner or later you're going to have longer lived
> > allocations placed that will prevent it from happening.
>
> I'm not sure I follow: given our workload is clearly split at queries and
> transactions boundaries, releasing memory at that time, I've assumed (and
> noticed in practice, albeit not on a production system) that most memory at
> the top of the heap would be trimmable as we don't keep much in between
> queries / transactions.

That's true for very simple workloads, but once you're beyond that you just
need some longer-lived allocation to happen. E.g. some relcache / catcache
miss during query execution, and there's no extant memory in
CacheMemoryContext, so a new block is allocated.

Greetings,

Andres Freund


