Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From: Jakub Wartak
Subject: Re: Adding basic NUMA awareness
Date:
Msg-id: CAKZiRmxXrKQX8kcpKEJkRs==B4dJ3p49g9kykb-8O386H+Rg9g@mail.gmail.com
In response to: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers

Hi Tomas, some more thoughts after the weekend:

On Fri, Jul 4, 2025 at 8:12 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 7/4/25 13:05, Jakub Wartak wrote:
> > On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
> >
> > Hi!
> >
> >> 1) v1-0001-NUMA-interleaving-buffers.patch
> > [..]
> >> It's a bit more complicated, because the patch distributes both the
> >> blocks and descriptors, in the same way. So a buffer and it's descriptor
> >> always end on the same NUMA node. This is one of the reasons why we need
> >> to map larger chunks, because NUMA works on page granularity, and the
> >> descriptors are tiny - many fit on a memory page.
> >
> > Oh, now I get it! OK, let's stick to this one.
> >
> >> I don't think the splitting would actually make some things simpler, or
> >> maybe more flexible - in particular, it'd allow us to enable huge pages
> >> only for some regions (like shared buffers), and keep the small pages
> >> e.g. for PGPROC. So that'd be good.
> >
> > You have made assumption that this is good, but small pages (4KB) are
> > not hugetlb, and are *swappable* (Transparent HP are swappable too,
> > manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
> > most frequent problem I see these days are OOMs, and it makes me
> > believe that making certain critical parts of shared memory being
> > swappable just to make pagesize granular is possibly throwing the baby
> > out with the bathwater. I'm thinking about bad situations like: some
> > wrong settings of vm.swapiness that people keep (or distros keep?) and
> > general inability of PG to restrain from allocating more memory in
> > some cases.
> >
>
> I haven't observed such issues myself, or maybe I didn't realize it's
> happening. Maybe it happens, but it'd be good to see some data showing
> that, or a reproducer of some sort. But let's say it's real.
>
> I don't think we should use huge pages merely to ensure something is not
> swapped out. The "not swappable" is more of a limitation of huge pages,
> not an advantage. You can't just choose to make them swappable.
>
> Wouldn't it be better to keep using 4KB pages, but lock the memory using
> mlock/mlockall?

In my book, not being swappable is a win (it's hard for me to imagine
when it could be beneficial to swap out parts of s_b).

I was trying to think this through and came up with the following:

Anyway, mlock() probably sounds like the right tool, but e.g. Rocky 8.10 by
default caps max locked memory (ulimit -l) at just 64 kB due to systemd's
DefaultLimitMEMLOCK, while Debian/Ubuntu ship with higher values. I wasn't
expecting that - those are bizarrely low values. I think we would need
something like (10000*900)/1024/1024 or more, and with each PGPROC on a
separate page it would be way more still?
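
Roughly what I mean, as an untested sketch (try_lock_shmem() is a made-up
name, just to illustrate the RLIMIT_MEMLOCK problem):

#include <stdbool.h>
#include <sys/mman.h>
#include <sys/resource.h>

/*
 * Untested sketch: check the RLIMIT_MEMLOCK soft limit before trying to
 * mlock() a shared memory region, so we can fall back to swappable pages
 * instead of failing when e.g. systemd's 64 kB default is in effect.
 */
static bool
try_lock_shmem(void *addr, size_t len)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0 &&
        rl.rlim_cur != RLIM_INFINITY &&
        (size_t) rl.rlim_cur < len)
        return false;           /* limit too low, don't even try */

    return mlock(addr, len) == 0;
}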

Another thing with 4 kB pages: there's a big assumption now that once we
arrive in InitProcess() we won't ever change NUMA node, so we stick with the
PGPROC chosen where we started (based on getcpu(2)). Let's assume the CPU
scheduler reassigns us to a different node; with this hypothetical
4 kB-page-per-PGPROC setup we would then have to rely on NUMA autobalancing
doing its job and migrating that 4 kB page from node to node (to get local
accesses instead of remote ones). The questions in my head are now:
- we initially asked for those PGPROC pages to be localized on a certain
node (they have a policy), so won't they be excluded from autobalancing?
Would we need to call getcpu() again somewhere, notice the difference, and
un-localize (clear the NUMA/mbind() policy) the PGPROC page?
- mlock() as above pins the physical RAM page (?), so it won't move?
- after how long would the kernel's autobalancing migrate that page once the
active CPU<->node mapping changes? I.e. do we execute enough reads on
this page?
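
For reference, the kind of node selection I imagine happening around
InitProcess() (untested sketch, choose_pgproc_node() is a made-up name):

#define _GNU_SOURCE             /* for sched_getcpu() */
#include <sched.h>
#include <numa.h>

/*
 * Untested sketch: pick the NUMA node for this backend's PGPROC based on
 * the CPU we happen to start on; the big assumption questioned above is
 * that the scheduler then keeps us on that node.
 */
static int
choose_pgproc_node(void)
{
    int     cpu = sched_getcpu();

    if (cpu < 0 || numa_available() == -1)
        return 0;               /* fall back to node 0 */

    return numa_node_of_cpu(cpu);
}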

BTW: to make this more concrete, what's the most trivial/one-liner way to
exercise/stress PGPROC?

> >> The other thing I haven't thought about very much is determining on
> >> which CPUs/nodes the instance is allowed to run. I assume we'd start by
> >> simply inherit/determine that at the start through libnuma, not through
> >> some custom PG configuration (which the patch [2] proposed to do).
> >
> > 0. I think that we could do better, some counter arguments to
> > no-configuration-at-all:
> >
> > a. as Robert & Bertrand already put it there after review: let's say I
> > want just to run on NUMA #2 node, so here I would need to override
> > systemd's script ExecStart= to include that numactl (not elegant?). I
> > could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
> > less friendly. Also it probably requires root to edit/reload systemd,
> > while having GUC for this like in my proposal makes it more smooth (I
> > think?)
> >
> > b. wouldn't it be better if that stayed as drop-in rather than always
> > on? What if there's a problem, how do you disable those internal
> > optimizations if they do harm in some cases?  (or let's say I want to
> > play with MPOL_INTERLEAVE_WEIGHTED?). So at least boolean
> > numa_buffers_interleave would be nice?
> >
> > c. What if I want my standby (walreceiver+startup/recovery) to run
> > with NUMA affinity to get better performance (I'm not going to hack
> > around systemd script every time, but I could imagine changing
> > numa=X,Y,Z after restart/before promotion)
> >
> > d. Now if I would be forced for some reason to do that numactl(1)
> > voodoo, and use the those above mentioned overrides and PG wouldn't be
> > having GUC (let's say I would use `numactl
> > --weighted-interleave=0,1`), then:
> >
>
> I'm not against doing something like this, but I don't plan to do that
> in V1. I don't have a clear idea what configurability is actually
> needed, so it's likely I'd do the interface wrong.
>
> >> 2) v1-0002-NUMA-localalloc.patch
> >> This simply sets "localalloc" when initializing a backend, so that all
> >> memory allocated later is local, not interleaved. Initially this was
> >> necessary because the patch set the allocation policy to interleaving
> >> before initializing shared memory, and we didn't want to interleave the
> >> private memory. But that's no longer the case - the explicit mapping to
> >> nodes does not have this issue. I'm keeping the patch for convenience,
> >> it allows experimenting with numactl etc.
> >
> > .. .is not accurate anymore and we would require to have that in
> > (still with GUC) ?
> > Thoughts? I can add that mine part into Your's patches if you want.
> >
>
> I'm sorry, I don't understand what's the question :-(

The patch reference above was a continuation of the chain of thought from
point "d". What I had in mind was that you cannot drop
`v1-0002-NUMA-localalloc.patch` from the scope if you force people to use
numactl by not having enough configurability on the PG side. That is: if
someone has to use systemd + numactl --interleave/--weighted-interleave,
they will also need a way to set numa_localalloc=on (to override the new
policy default; otherwise local memory allocations are also going to be
interleaved, and we are back to square one). Which brings me back to the
point that instead of this toggle we should include the configuration
properly from the start (it's apparently not that hard).
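
Just to spell out what that toggle boils down to (untested sketch; the GUC
name numa_localalloc is from my earlier proposal):

#include <stdbool.h>
#include <numa.h>

/*
 * Untested sketch: if the postmaster was started under
 * numactl --interleave / --weighted-interleave, a backend-level
 * numa_localalloc=on would simply reset the policy for private memory.
 */
static void
maybe_reset_alloc_policy(bool numa_localalloc)
{
    if (numa_localalloc && numa_available() != -1)
        numa_set_localalloc();
}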

> > Way too quick review and some very fast benchmark probes, I've
> > concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt
> > would be too new topic for me), but let's start:
> >
> > 1. normal pgbench -S (still with just s_b@4GB), done many tries,
> > consistent benefit for the patch with like +8..10% boost on generic
> > run:
> >
[.. removed numbers]
>
> But this actually brings an interesting question. What exactly should we
> expect / demand from these patches? In my mind it'd primarily about
> predictability and stability of results.
>
> For example, the results should not depend on how was the database
> warmed up - was it done by a single backend or many backends? Was it
> restarted, or what? I could probably warmup the system very carefully to
> ensure it's balanced. The patches mean I don't need to be that careful.

Well, pretty much the same here. I was after minimizing "stddev" (to have
better predictability of results, especially across restarts) and increasing
available bandwidth [which is pretty much related]. Without our NUMA work,
PG can put that s_b on any random node or spill randomly from one node to
another (depending on the size of the allocation request).

> >     So should I close https://commitfest.postgresql.org/patch/5703/
> > and you'll open a new one or should I just edit the #5703 and alter it
> > and add this thread too?
> >
>
> Good question. It's probably best to close the original entry as
> "withdrawn" and I'll add a new entry. Sounds OK?

Sure thing, I marked it as `Returned with feedback`; this approach seems to
be much more advanced.

> > 3. Patch is not calling interleave on PQ shmem, do we want to add that
> > in as some next item like v1-0007? Question is whether OS interleaving
> > makes sense there ? I believe it does there, please see my thread
> > (NUMA_pq_cpu_pinning_results.txt), the issue is that PQ workers are
> > being spawned by postmaster and may end up on different NUMA nodes
> > randomly, so actually OS-interleaving that memory reduces jitter there
> > (AKA bandwidth-over-latency). My thinking is that one cannot expect
> > static/forced CPU-to-just-one-NUMA-node assignment for backend and
> > it's PQ workers, because it is impossible have always available CPU
> > power there in that NUMA node, so it might be useful to interleave
> > that shared mem there too (as separate patch item?)
> >
>
> Excellent question. I haven't thought about this at all. I agree it
> probably makes sense to interleave this memory, in some way. I don't
> know what's the perfect scheme, though.
>
> wild idea: Would it make sense to pin the workers to the same NUMA node
> as the leader? And allocate all memory only from that node?

I'm trying to convey exactly the opposite message, or at least that it
might depend on the configuration. Please see
https://www.postgresql.org/message-id/CAKZiRmxYMPbQ4WiyJWh%3DVuw_Ny%2BhLGH9_9FaacKRJvzZ-smm%2Bw%40mail.gmail.com
(btw, it should read there that I don't intend to spend a lot of time on
PQ), but anyway: I think we should NOT pin the PQ workers to the same node,
because you do not know if there's CPU capacity left there (same story as
with v1-0006 here).

I'm just proposing quick OS-based interleaving of PQ shm if using all
nodes, literally:

@@ -334,6 +336,13 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
     }
     *mapped_address = address;
     *mapped_size = request_size;
+
+    /* We interleave memory only at creation time. */
+    if (op == DSM_OP_CREATE && numa->setting > NUMA_OFF) {
+        elog(DEBUG1, "interleaving shm mem @ %p size=%zu",
+             *mapped_address, *mapped_size);
+        pg_numa_interleave_memptr(*mapped_address, *mapped_size, numa->nodes);
+    }
+

Because then, if the memory is interleaved, you probably get less variance
in memory access. But also, from that previous thread:

"So if anything:
- latency-wise: it would be best to place leader+all PQ workers close
to s_b, provided s_b fits NUMA shared/huge page memory there and you
won't need more CPU than there's on that NUMA node... (assuming e.g.
hosting 4 DBs on 4-sockets each on it's own, it would be best to pin
everything including shm, but PQ workers too)
- capacity/TPS-wise or s_b > NUMA: just interleave to maximize
bandwidth and get uniform CPU performance out of this"

So the wild idea was: maybe PQ shm interleaving should depend on the NUMA
configuration (if interleaving across all nodes, then interleave normally,
but if the configuration is set to just 1 NUMA node, it automatically binds
there -- there was '@' support for that in my patch).
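
In plain libnuma terms the idea is roughly this (untested sketch,
numa_place_pq_shmem() is a made-up name):

#include <numa.h>

/*
 * Untested sketch: interleave PQ shared memory when more than one node is
 * selected, otherwise bind it to the single configured node.
 */
static void
numa_place_pq_shmem(void *addr, size_t len, struct bitmask *nodes)
{
    if (numa_bitmask_weight(nodes) > 1)
        numa_interleave_memory(addr, len, nodes);
    else
        numa_tonodemask_memory(addr, len, nodes);
}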

> > 4 In BufferManagerShmemInit() you call numa_num_configured_nodes()
> > (also in v1-0005). My worry is should we may put some
> > known-limitations docs (?) from start and mention that
> > if the VM is greatly resized and NUMA numa nodes appear, they might
> > not be used until restart?
> >
>
> Yes, this is one thing I need some feedback on. The patches mostly
> assume there are no disabled nodes, that the set of allowed nodes does
> not change, etc. I think for V1 that's a reasonable limitation.

Sure!

> But let's say we want to relax this a bit. How do we learn about the
> change, after a node/CPU gets disabled? For some parts it's not that
> difficult (e.g. we can "remap" buffers/descriptors) in the background.
> But for other parts that's not practical. E.g. we can't rework how the
> PGPROC gets split.
>
> But while discussing this with Andres yesterday, he had an interesting
> suggestion - to always use e.g. 8 or 16 partitions, then partition this
> by NUMA node. So we'd have 16 partitions, and with 4 nodes the 0-3 would
> go to node 0, 4-7 would go to node 1, etc. The advantage is that if a
> node gets disabled, we can rebuild just this small "mapping" and not the
> 16 partitions. And the partitioning may be helpful even without NUMA.
>
> Still have to figure out the details, but seems it might help.

Right, I have no idea how the shared_memory remapping patch will work
(how/when the s_b change will be executed), but we could somehow mark that
the number of NUMA nodes should be rechecked during SIGHUP (?) and then
just do a simple comparison of old_numa_num_configured_nodes ==
new_numa_num_configured_nodes.
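
Something as dumb as this would do for the detection part (untested sketch,
names made up):

#include "postgres.h"
#include <numa.h>

static int  numa_nodes_at_startup;      /* remembered at postmaster start */

/*
 * Untested sketch: on SIGHUP only detect (not handle) a change in the
 * number of configured NUMA nodes, and log that a restart is needed.
 */
static void
recheck_numa_nodes(void)
{
    int     now = numa_num_configured_nodes();

    if (now != numa_nodes_at_startup)
        elog(LOG, "number of NUMA nodes changed from %d to %d, restart needed",
             numa_nodes_at_startup, now);
}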

Anyway, I think it's way too advanced for now, don't you think? (like
CPU ballooning [s_b itself] is rare, and NUMA ballooning seems to be
super-wild-rare).

As for the rest, I forgot to include this too: getcpu() - this really
needs a portable pg_getcpu() wrapper.
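
Something along these lines (untested sketch; HAVE_SCHED_GETCPU would be a
new configure check, I haven't verified what other platforms offer):

#define _GNU_SOURCE             /* for sched_getcpu() on glibc */
#include <sched.h>

/*
 * Untested sketch of a portable pg_getcpu(): return the CPU the calling
 * process is currently running on, or -1 where the platform can't tell us.
 */
static int
pg_getcpu(void)
{
#ifdef HAVE_SCHED_GETCPU
    return sched_getcpu();
#else
    return -1;
#endif
}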

-J.


