From: Bertrand Drouvot
Subject: Re: Adding basic NUMA awareness
Msg-id: aG/LcTxyVT1DtoB4@ip-10-97-1-34.eu-west-3.compute.internal
In response to: Re: Adding basic NUMA awareness (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

Hi,

On Wed, Jul 09, 2025 at 03:42:26PM -0400, Andres Freund wrote:
> Hi,
> 
> Thanks for working on this!

Indeed, thanks!

> On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
> > 1) v1-0001-NUMA-interleaving-buffers.patch
> >
> > This is the main thing when people think about NUMA - making sure the
> > shared buffers are allocated evenly on all the nodes, not just on a
> > single node (which can happen easily with warmup). The regular memory
> > interleaving would address this, but it also has some disadvantages.
> >
> > Firstly, it's oblivious to the contents of the shared memory segment,
> > and we may not want to interleave everything. It's also oblivious to
> > alignment of the items (a buffer can easily end up "split" on multiple
> > NUMA nodes), or relationship between different parts (e.g. there's a
> > BufferBlock and a related BufferDescriptor, and those might again end up
> > on different nodes).
> 
> Two more disadvantages:
> 
> With OS interleaving postgres doesn't (not easily at least) know about what
> maps to what, which means postgres can't do stuff like numa aware buffer
> replacement.
> 
> With OS interleaving the interleaving is "too fine grained", with pages being
> mapped at each page boundary, making it less likely for things like one
> strategy ringbuffer to reside on a single numa node.

> > There's a secondary benefit of explicitly assigning buffers to nodes,
> > using this simple scheme - it allows quickly determining the node ID
> > given a buffer ID. This is helpful later, when building freelist.

I do think this is a big advantage compared to OS interleaving.
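
To illustrate with a quick sketch (the names and the chunk-based scheme
here are mine, just to show the idea, not necessarily what the patch
does): if buffers are assigned to nodes in fixed-size contiguous chunks,
the reverse mapping is pure arithmetic:

/*
 * Hypothetical helper: with buffers partitioned into equal contiguous
 * chunks, one chunk per NUMA node, the node of a buffer follows from
 * its buffer ID alone, no syscall needed.
 */
static inline int
BufferGetNumaNode(int buf_id, int buffers_per_node)
{
	return buf_id / buffers_per_node;
}

whereas with OS interleaving we'd have to ask the kernel (move_pages()
with a NULL nodes argument) to learn where a given page actually ended up.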

> I wonder if we should *increase* the size of shared_buffers whenever huge
> pages are in use and there's padding space due to the huge page
> boundaries. Pretty pointless to waste that memory if we can instead use it for
> the buffer pool.  Not that big a deal with 2MB huge pages, but with 1GB huge
> pages...

I think that makes sense, except maybe for operations that need to scan
the whole buffer pool (i.e., the ones related to BUF_DROP_FULL_SCAN_THRESHOLD)?
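
FWIW, a back-of-the-envelope sketch of the padding in question (mine, not
from the patch; TYPEALIGN and BLCKSZ are the usual macros, huge_page_size
would come from GetHugePageSize()):

/*
 * Illustrative only: number of whole buffers that would fit in the
 * padding between the end of a per-node partition and the next huge
 * page boundary.
 */
static int
ExtraBuffersInPadding(Size nbuffers, Size huge_page_size)
{
	Size		used = nbuffers * BLCKSZ;
	Size		padded = TYPEALIGN(huge_page_size, used);

	return (int) ((padded - used) / BLCKSZ);
}

With 1GB huge pages and 8kB blocks that's up to ~131k extra buffers per
boundary, so definitely worth using rather than wasting.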

> > 5) v1-0005-NUMA-interleave-PGPROC-entries.patch
> >
> > Another area that seems like it might benefit from NUMA is PGPROC, so I
> > gave it a try. It turned out somewhat challenging. Similarly to buffers
> > we have two pieces that need to be located in a coordinated way - PGPROC
> > entries and fast-path arrays. But we can't use the same approach as for
> > buffers/descriptors, because
> >
> > (a) Neither of those pieces aligns with memory page size (PGPROC is
> > ~900B, fast-path arrays are variable length).
> 
> > (b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
> > rather high max_connections before we use multiple huge pages.
> 
> Right now sizeof(PGPROC) happens to be multiple of 64 (i.e. the most common
> cache line size)

Oh right, it's currently 832 bytes and the patch extends that to 840 bytes.

With a bit of reordering:

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5cb1632718e..2ed2f94202a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -194,8 +194,6 @@ struct PGPROC
                                                                 * vacuum must not remove tuples deleted by
                                                                 * xid >= xmin ! */

-       int                     procnumber;             /* index in ProcGlobal->allProcs */
-
        int                     pid;                    /* Backend's process ID; 0 if prepared xact */

        int                     pgxactoff;              /* offset into various ProcGlobal->arrays with
@@ -243,6 +241,7 @@ struct PGPROC

        /* Support for condition variables. */
        proclist_node cvWaitLink;       /* position in CV wait list */
+       int                     procnumber;             /* index in ProcGlobal->allProcs */

        /* Info about lock the process is currently waiting for, if any. */
        /* waitLock and waitProcLock are NULL if not currently waiting. */
@@ -268,6 +267,7 @@ struct PGPROC
         */
        XLogRecPtr      waitLSN;                /* waiting for this LSN or higher */
        int                     syncRepState;   /* wait state for sync rep */
+       int                     numa_node;
        dlist_node      syncRepLinks;   /* list link if process is in syncrep queue */

        /*
@@ -321,9 +321,6 @@ struct PGPROC
        PGPROC     *lockGroupLeader;    /* lock group leader, if I'm a member */
        dlist_head      lockGroupMembers;       /* list of members, if I'm a leader */
        dlist_node      lockGroupLink;  /* my member link, if I'm a member */
-
-       /* NUMA node */
-       int                     numa_node;
 };

That would bring it back to 832 bytes (though the resulting field order
no longer makes much sense logically).
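
If we start relying on the struct staying a multiple of the cache line
size, maybe it deserves a compile-time guard too, something like (just a
sketch; 64 hard-coded only because it's the common line size you
mentioned):

StaticAssertDecl(sizeof(PGPROC) % 64 == 0,
				 "PGPROC size should be a multiple of the cache line size");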

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


