Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: Adding basic NUMA awareness
Date
Msg-id CAKZiRmy4EGAGvHjEEEwqm8m_su_xtW5ZLHLLZJQkU-ier=fqrQ@mail.gmail.com
In response to Re: Adding basic NUMA awareness  (Andres Freund <andres@anarazel.de>)
Responses Re: Adding basic NUMA awareness
List pgsql-hackers
On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
> > On 7/8/25 05:04, Andres Freund wrote:
> > > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
> > > The reason it would be advantageous to put something like the procarray onto
> > > smaller pages is that otherwise the entire procarray (unless particularly
> > > large) ends up on a single NUMA node, increasing the latency for backends on
> > > every other numa node and increasing memory traffic on that node.
> > >

Sure thing, I fully understand the motivation and the underlying reason
(without claiming that I understand the exact memory access patterns
that involve procarray/PGPROC/etc. and the hotspots involved on the PG
side). Is there any pgbench one-liner that would make it really easy to
stress PGPROC or the procarray?

> > That's why the patch series splits the procarray into multiple pieces,
> > so that it can be properly distributed on multiple NUMA nodes even with
> > huge pages. It requires adjusting a couple places accessing the entries,
> > but it surprised me how limited the impact was.

Yes, and we are discussing whether it is worth going to smaller pages
for such use cases (e.g. 4kB pages without hugetlb instead of 2MB huge
pages, or, what would be even more waste, 1GB hugetlb if we don't
request 2MB for some small structs; btw, we do have the ability to
select MAP_HUGE_2MB vs MAP_HUGE_1GB). I'm thinking about two problems:
- 4kB pages are swappable, and mlock() potentially (?) disarms NUMA
autobalancing
- using libnuma often leads to MPOL_BIND, which disarms NUMA
autobalancing, BUT apparently there are set_mempolicy(2)/mbind(2), and
since kernel 5.12+ they can take an additional flag,
MPOL_F_NUMA_BALANCING(!), so this looks like it has the potential to
move memory anyway (if way too many tasks are relocated, so would the
memory be?). It is available only in recent libnuma as
numa_set_membind_balancing(3), but sadly there seems to be no way via
libnuma to do mbind(MPOL_F_NUMA_BALANCING) for a specific addr only? I
mean it would have to be something like MPOL_F_NUMA_BALANCING |
MPOL_PREFERRED (select one preferred node per chunk while still
allowing balancing?), but in [1][2] (2024) it is stated that "It's not
legitimate (yet) to use MPOL_PREFERRED + MPOL_F_NUMA_BALANCING.",
though maybe things have improved since then (a compilable sketch of
the per-task set_mempolicy() variant follows the example below).

Something like:
PGPROC/procarray 2MB page for node#1 - mbind(addr1,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]);
PGPROC/procarray 2MB page for node#2 - mbind(addr2,
MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]);
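
For completeness, a minimal compilable sketch of the documented
per-task path, set_mempolicy(2) with MPOL_BIND | MPOL_F_NUMA_BALANCING
(kernel 5.12+); the node choice and the fallback #define are only
illustrative assumptions on my side:

#include <numaif.h>    /* set_mempolicy(), MPOL_BIND; link with -lnuma */
#include <stdio.h>

#ifndef MPOL_F_NUMA_BALANCING
#define MPOL_F_NUMA_BALANCING (1 << 13)    /* linux/mempolicy.h, 5.12+ */
#endif

int
main(void)
{
    /* bind this task's future allocations to node 0, but keep NUMA
     * balancing enabled so the kernel may still migrate the pages */
    unsigned long nodemask = 1UL << 0;

    if (set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING,
                      &nodemask, 8 * sizeof(nodemask)) != 0)
    {
        perror("set_mempolicy");    /* EINVAL on kernels < 5.12 */
        return 1;
    }
    return 0;
}

This only covers the per-task policy though; the per-address-range
mbind() combination sketched above is exactly the part that [1][2] say
is not legitimate yet.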

> Sure, you can do that, but it does mean that iterations over the procarray now
> have an added level of indirection...

So the most efficient would be the old way (no indirection) vs the
NUMA way? Can this be done without #ifdefs at all?

> > The thing I'm not sure about is how much this actually helps with the
> > traffic between node. Sure, if we pick a PGPROC from the same node, and
> > the task does not get moved, it'll be local traffic. But if the task
> > moves, there'll be traffic.

With MPOL_F_NUMA_BALANCING, that should "auto-tune" in the worst case?

> > I don't have any estimates how often this happens, e.g. for older tasks.

We could measure that: kernel 6.16+ has a per-PID numa_task_migrated
counter in /proc/{PID}/sched (a small C sketch for reading it follows
the scenario below), but I assume we would have to throw backends >>
VCPUs at it to simulate reality and create some "waves" between
different activity periods of certain pools. I can imagine a
worst-case scenario:
a) pgbench "a" opens $VCPU connections, all idle, with a script that
sleeps for a while
b) pgbench "b" opens some $VCPU new connections to some other DB, all
active from the start (tpcb-like or read-only)
c) manually pin each PID from "b" to a specific NUMA node #2 using
taskset -- just to simulate an unfortunate app working on every 2nd
connection
d) pgbench "a" starts working and hits the CPU imbalance -- e.g. NUMA
node #1 is idle, #2 is full, the CPU scheduler starts putting "a"
backends on CPUs from node #1, and we should notice PIDs being
migrated
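
And a rough sketch (in C, assuming the numa_task_migrated field name
mentioned above, so Linux 6.16+ only) for pulling that counter out of
/proc/<pid>/sched per backend, to compare it before and after step d):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static long
numa_task_migrations(pid_t pid)
{
    char    path[64];
    char    line[256];
    long    count = -1;
    FILE   *f;

    snprintf(path, sizeof(path), "/proc/%d/sched", (int) pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    /* entries in /proc/<pid>/sched look like "field_name : value" */
    while (fgets(line, sizeof(line), f) != NULL)
    {
        if (strncmp(line, "numa_task_migrated", 18) == 0)
        {
            char   *colon = strchr(line, ':');

            if (colon != NULL)
                count = atol(colon + 1);
            break;
        }
    }
    fclose(f);
    return count;
}

int
main(int argc, char **argv)
{
    pid_t   pid = (argc > 1) ? (pid_t) atoi(argv[1]) : getpid();

    printf("pid %d: numa_task_migrated = %ld\n",
           (int) pid, numa_task_migrations(pid));
    return 0;
}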

> I think the most important bit is to not put everything onto one numa node,
> otherwise the chance of increased latency for *everyone* due to the increased
> memory contention is more likely to hurt.

-J.

p.s. I hope I wrote this in an understandable way, because I had many
interruptions, so if anything is unclear please let me know.

[1] - https://lkml.org/lkml/2024/7/3/352
[2] - https://lkml.rescloud.iu.edu/2402.2/03227.html


