Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | Re: Adding basic NUMA awareness
Date |
Msg-id | jqg6jd32sw4s6gjkezauer372xrww7xnupvrcsqkegh2uhv6vg@ppiwoigzz6v4
In response to | Re: Adding basic NUMA awareness (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List | pgsql-hackers
Hi,

On 2025-07-09 12:04:00 +0200, Jakub Wartak wrote:
> On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
> > > On 7/8/25 05:04, Andres Freund wrote:
> > > > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
> > > > The reason it would be advantageous to put something like the procarray onto
> > > > smaller pages is that otherwise the entire procarray (unless particularly
> > > > large) ends up on a single NUMA node, increasing the latency for backends on
> > > > every other NUMA node and increasing memory traffic on that node.
>
> Sure thing, I fully understand the motivation and underlying reason
> (without claiming that I understand the exact memory access patterns
> that involve procarray/PGPROC/etc and the hotspots involved on the PG
> side). Any one-liner pgbench tip for how to really easily stress the
> PGPROC or procarray?

Unfortunately it's probably going to be slightly more complicated workloads
that show the effect - the very simplest cases don't iterate through the
procarray itself anymore.

> > > That's why the patch series splits the procarray into multiple pieces,
> > > so that it can be properly distributed on multiple NUMA nodes even with
> > > huge pages. It requires adjusting a couple places accessing the entries,
> > > but it surprised me how limited the impact was.
>
> Yes, and we are discussing whether it is worth getting into smaller pages
> for such use cases (e.g. 4kB ones without hugetlb when running with 2MB
> hugepages, or, wasting even more, with 1GB hugetlb if we don't request
> 2MB for some small structs; btw, we have the ability to select
> MAP_HUGE_2MB vs MAP_HUGE_1GB). I'm thinking about two problems:
> - 4kB pages are swappable, and mlock() potentially (?) disarms NUMA
>   autobalancing

I'm not really bought into this being a problem. If your system has enough
pressure to swap out the PGPROC array, you're so hosed that this won't make a
difference.

> - using libnuma often leads to MPOL_BIND, which disarms NUMA
>   autobalancing, BUT apparently set_mempolicy(2)/mbind(2) can, since
>   the 5.12+ kernel, take the additional flag MPOL_F_NUMA_BALANCING(!),
>   so this looks like it has the potential to move memory anyway (if way
>   too many tasks are relocated, so would be the memory?). It is
>   available only in recent libnuma as numa_set_membind_balancing(3),
>   but sadly there's no way via libnuma to do
>   mbind(MPOL_F_NUMA_BALANCING) for a specific addr only? I mean it
>   would have to be something like MPOL_F_NUMA_BALANCING |
>   MPOL_PREFERRED? (select one node from many for each node while still
>   allowing balancing?), but in [1][2] (2024) it is stated that "It's
>   not legitimate (yet) to use MPOL_PREFERRED + MPOL_F_NUMA_BALANCING.",
>   but maybe stuff has been improved since then.
>
> Something like:
> PGPROC/procarray 2MB page for node#1 - mbind(addr1,
>     MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]);
> PGPROC/procarray 2MB page for node#2 - mbind(addr2,
>     MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]);

I'm rather doubtful that it's a good idea to combine NUMA awareness with NUMA
balancing. NUMA balancing adds latency and makes it much more expensive for
userspace to act in a NUMA-aware way, since it needs to regularly update its
knowledge about where memory resides.
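FWIW, at the raw mbind(2) level, Jakub's pseudocode would boil down to
something like the untested sketch below. It assumes a kernel new enough to
accept MPOL_F_NUMA_BALANCING in mbind() at all, and it uses MPOL_BIND rather
than MPOL_PREFERRED, since per [1][2] that combination is what the kernel
accepts; the helper name and arguments are purely illustrative, not from the
patch series:

```c
#include <stddef.h>
#include <numaif.h>                     /* mbind(), MPOL_*; link with -lnuma */

#ifndef MPOL_F_NUMA_BALANCING
#define MPOL_F_NUMA_BALANCING (1 << 13) /* linux/mempolicy.h, kernel 5.12+ */
#endif

/*
 * Illustrative helper (not from the patch series): bind one per-node
 * procarray chunk to a single NUMA node while opting the range into NUMA
 * balancing.  Kernels that don't support the flag (or the combination)
 * fail with EINVAL, so a caller would need a fallback to plain MPOL_BIND.
 */
static int
bind_chunk_balanced(void *addr, size_t len, int node)
{
    unsigned long nodemask = 1UL << node;   /* assumes node < 64 */

    return mbind(addr, len,
                 MPOL_BIND | MPOL_F_NUMA_BALANCING,
                 &nodemask, 8 * sizeof(nodemask), 0);
}
```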
> > Sure, you can do that, but it does mean that iterations over the procarray now
> > have an added level of indirection...
>
> So the most efficient would be the old way (no indirections) vs the
> NUMA way? Can this be done without #ifdefs at all?

If we used 4k pages for the procarray, we would have just ~4 procs on one
page; if that range were marked as interleaved, it'd probably suffice.
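An untested sketch of that alternative (illustrative helper, not the patch's
actual code): map the range with regular 4k pages, no MAP_HUGETLB, and mark
it MPOL_INTERLEAVE so successive pages are spread round-robin across nodes:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <numaif.h>     /* mbind(), MPOL_INTERLEAVE; link with -lnuma */

/*
 * Illustrative helper (not from the patch series): allocate a procarray
 * range backed by plain 4k pages and interleave it, so each page (holding
 * only a handful of PGPROCs) lands on a different node in the mask.
 */
static void *
alloc_procarray_interleaved(size_t len, unsigned long nodemask)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;

    /* spread successive pages round-robin over the nodes in nodemask */
    if (mbind(p, len, MPOL_INTERLEAVE,
              &nodemask, 8 * sizeof(nodemask), 0) != 0)
    {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

A caller would pass e.g. nodemask = 0x3 to interleave across nodes 0 and 1.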
> > > The thing I'm not sure about is how much this actually helps with the
> > > traffic between nodes. Sure, if we pick a PGPROC from the same node, and
> > > the task does not get moved, it'll be local traffic. But if the task
> > > moves, there'll be traffic.
>
> With MPOL_F_NUMA_BALANCING, that should "auto-tune" in the worst case?

I doubt that NUMA balancing is going to help a whole lot here; there are too
many procs on one page for that to be helpful.

One thing that might be worth doing is to *increase* the size of PGPROC by
moving other pieces of data that are keyed by ProcNumber into PGPROC.

I think the main thing to avoid is the case where all of PGPROC, the buffer
mapping table, ... reside on one NUMA node (e.g. because it's the one
postmaster was scheduled on), as the increased memory traffic will lead to
queries on that node being slower than on the other nodes.

Greetings,

Andres Freund