Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
From | Jakub Wartak |
---|---|
Subject | Re: Adding basic NUMA awareness |
Date | |
Msg-id | CAKZiRmy4EGAGvHjEEEwqm8m_su_xtW5ZLHLLZJQkU-ier=fqrQ@mail.gmail.com |
In response to | Re: Adding basic NUMA awareness (Andres Freund <andres@anarazel.de>) |
Responses | Re: Adding basic NUMA awareness |
List | pgsql-hackers |
On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
> > On 7/8/25 05:04, Andres Freund wrote:
> > > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
> > > The reason it would be advantageous to put something like the procarray onto
> > > smaller pages is that otherwise the entire procarray (unless particularly
> > > large) ends up on a single NUMA node, increasing the latency for backends on
> > > every other numa node and increasing memory traffic on that node.

Sure thing, I fully understand the motivation and the underlying reason (without claiming that I understand the exact memory access patterns involving procarray/PGPROC/etc. and the hotspots on the PG side). Any pgbench one-liner for how to really easily stress PGPROC or the procarray?

> > That's why the patch series splits the procarray into multiple pieces,
> > so that it can be properly distributed on multiple NUMA nodes even with
> > huge pages. It requires adjusting a couple places accessing the entries,
> > but it surprised me how limited the impact was.

Yes, and we are discussing whether it is worth getting into smaller pages for such use cases (e.g. 4kB pages without hugetlb instead of 2MB huge pages -- or, even more wasteful, 1GB hugetlb pages if we don't request 2MB ones for some small structs; btw, we do have the ability to select MAP_HUGE_2MB vs MAP_HUGE_1GB). I'm thinking about two problems:

- 4kB pages are swappable, and mlock() potentially (?) disarms NUMA autobalancing

- using libnuma often leads to MPOL_BIND, which disarms NUMA autobalancing, BUT there are set_mempolicy(2)/mbind(2), and since kernel 5.12+ they can take the additional flag MPOL_F_NUMA_BALANCING(!), so this looks like it has the potential to move memory anyway (if way too many tasks are relocated, the memory would follow?). It is available only in recent libnuma as numa_set_membind_balancing(3), but sadly there seems to be no way via libnuma to do mbind(MPOL_F_NUMA_BALANCING) for a specific addr only? I mean, it would have to be something like MPOL_F_NUMA_BALANCING | MPOL_PREFERRED (select one preferred node per region while still allowing balancing?), but [1][2] (2024) state that "It's not legitimate (yet) to use MPOL_PREFERRED + MPOL_F_NUMA_BALANCING.", though maybe things have improved since then. Something like (rough C sketch further down in this mail):

    PGPROC/procarray 2MB page for node #1 - mbind(addr1, MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]);
    PGPROC/procarray 2MB page for node #2 - mbind(addr2, MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]);

> Sure, you can do that, but it does mean that iterations over the procarray now
> have an added level of indirection...

So the most efficient would be the old way (no indirection) vs the NUMA way? Can this be done without #ifdefs at all?

> > The thing I'm not sure about is how much this actually helps with the
> > traffic between node. Sure, if we pick a PGPROC from the same node, and
> > the task does not get moved, it'll be local traffic. But if the task
> > moves, there'll be traffic.

With MPOL_F_NUMA_BALANCING, that should "auto-tune" in the worst case?
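To make the mbind() idea above a bit more concrete, here is a rough, hypothetical sketch (not code from the patch series, and the region/helper names are made up): it carves a stand-in mapping into per-node 2MB chunks and tries MPOL_BIND | MPOL_F_NUMA_BALANCING on each chunk, falling back to plain MPOL_BIND when the kernel refuses the flag for mbind() -- whether mbind() accepts it at all is exactly the open question above. It assumes a 2-node box, numaif.h from libnuma and linking with -lnuma.

    /*
     * Hypothetical sketch only: bind each 2MB chunk of a stand-in region
     * to "its" node while trying to keep NUMA balancing armed for it.
     */
    #include <numaif.h>          /* mbind(), MPOL_BIND; link with -lnuma */
    #include <sys/mman.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    #ifndef MPOL_F_NUMA_BALANCING
    #define MPOL_F_NUMA_BALANCING (1 << 13)   /* linux/mempolicy.h, kernel 5.12+ */
    #endif

    #define CHUNK (2UL * 1024 * 1024)         /* one 2MB huge page per node */

    static int
    bind_chunk_to_node(void *addr, size_t len, int node)
    {
        unsigned long nodemask = 1UL << node;

        /* Preferred attempt: strict bind, but leave the NUMA balancer armed. */
        if (mbind(addr, len, MPOL_BIND | MPOL_F_NUMA_BALANCING,
                  &nodemask, sizeof(nodemask) * 8, 0) == 0)
            return 0;

        /* Kernels that reject the flag for mbind() give EINVAL; plain bind then. */
        if (errno == EINVAL)
            return mbind(addr, len, MPOL_BIND,
                         &nodemask, sizeof(nodemask) * 8, 0);

        return -1;
    }

    int
    main(void)
    {
        /* Stand-in for the per-node PGPROC/procarray pieces, 2MB each. */
        void *base = mmap(NULL, 2 * CHUNK, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (base == MAP_FAILED)
            return 1;

        if (bind_chunk_to_node(base, CHUNK, 0) != 0 ||
            bind_chunk_to_node((char *) base + CHUNK, CHUNK, 1) != 0)
            fprintf(stderr, "mbind: %s\n", strerror(errno));

        return 0;
    }

(In the real thing this would of course be the huge-page shared memory segment, not a private anonymous mapping.)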
> > I don't have any estimates how often this happens, e.g. for older tasks.

We could measure it: kernel 6.16+ has a per-PID numa_task_migrated counter in /proc/{PID}/sched (a trivial reader for it is sketched at the very end of this mail), but I assume we would have to throw backends >> VCPUs at it to simulate reality, and do some "waves" between different activity periods of certain pools. I can imagine a worst-case scenario:

a) pgbench "a" opens $VCPU connections, all idle, with a script that sleeps for a while
b) pgbench "b" opens some $VCPU new connections to some other DB, all active from the start (tpcb-like or read-only)
c) manually pin each PID from "b" to a specific NUMA node #2 using taskset -- just to simulate an unfortunate app working on every 2nd connection
d) pgbench "a" starts working and hits the CPU imbalance -- e.g. NUMA node #1 is idle, #2 is full, the CPU scheduler starts putting "a" backends on CPUs from node #1, and we should notice the PIDs being migrated

> I think the most important bit is to not put everything onto one numa node,
> otherwise the chance of increased latency for *everyone* due to the increased
> memory contention is more likely to hurt.

-J.

p.s. I hope I wrote this in an understandable way, because I had many interruptions, so if anything is unclear please let me know.

[1] - https://lkml.org/lkml/2024/7/3/352
[2] - https://lkml.rescloud.iu.edu/2402.2/03227.html
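As for the measuring part, a hypothetical little helper (assuming the numa_task_migrated line mentioned above, plus whatever other numa_* counters the running kernel happens to expose in /proc/<pid>/sched) could be as dumb as:

    /*
     * Hypothetical sketch: dump the numa_* lines from /proc/<pid>/sched for
     * every PID given on the command line; which lines exist depends on the
     * kernel version/config (numa_task_migrated needs 6.16+ per the above).
     */
    #include <stdio.h>
    #include <string.h>

    int
    main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
        {
            char        path[64];
            char        line[256];
            FILE       *f;

            snprintf(path, sizeof(path), "/proc/%s/sched", argv[i]);
            f = fopen(path, "r");
            if (f == NULL)
            {
                perror(path);
                continue;
            }
            while (fgets(line, sizeof(line), f))
            {
                if (strstr(line, "numa") != NULL)
                    printf("pid %s: %s", argv[i], line);
            }
            fclose(f);
        }
        return 0;
    }

Fed with the backend PIDs from pg_stat_activity while the "waves" above are running, that should show whether (and how many) tasks actually got migrated between nodes.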