Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
From: Bertrand Drouvot
Subject: Re: Adding basic NUMA awareness
Msg-id: aG/LcTxyVT1DtoB4@ip-10-97-1-34.eu-west-3.compute.internal
In response to: Re: Adding basic NUMA awareness (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On Wed, Jul 09, 2025 at 03:42:26PM -0400, Andres Freund wrote:
> Hi,
>
> Thanks for working on this!

Indeed, thanks!

> On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
> > 1) v1-0001-NUMA-interleaving-buffers.patch
> >
> > This is the main thing when people think about NUMA - making sure the
> > shared buffers are allocated evenly on all the nodes, not just on a
> > single node (which can happen easily with warmup). The regular memory
> > interleaving would address this, but it also has some disadvantages.
> >
> > Firstly, it's oblivious to the contents of the shared memory segment,
> > and we may not want to interleave everything. It's also oblivious to
> > alignment of the items (a buffer can easily end up "split" on multiple
> > NUMA nodes), or relationship between different parts (e.g. there's a
> > BufferBlock and a related BufferDescriptor, and those might again end
> > up on different nodes).
>
> Two more disadvantages:
>
> With OS interleaving postgres doesn't (not easily at least) know about what
> maps to what, which means postgres can't do stuff like numa aware buffer
> replacement.
>
> With OS interleaving the interleaving is "too fine grained", with pages being
> mapped at each page boundary, making it less likely for things like one
> strategy ringbuffer to reside on a single numa node.
>
> > There's a secondary benefit of explicitly assigning buffers to nodes,
> > using this simple scheme - it allows quickly determining the node ID
> > given a buffer ID. This is helpful later, when building freelist.

I do think this is a big advantage compared to the OS interleaving.
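As a side note, here is a minimal sketch of what that quick lookup can look like under such a chunked assignment scheme (assuming buffers are assigned to nodes in contiguous, equal-sized ranges; the function and parameter names are illustrative, not the patch's actual API):

/*
 * Hypothetical helper: with buffers assigned to NUMA nodes in contiguous,
 * equal-sized chunks, the owning node falls out of a single division -
 * no syscall and no page-table inspection needed.
 */
static inline int
BufferIdGetNumaNode(int buf_id, int num_nodes, int buffers_per_node)
{
    int     node = buf_id / buffers_per_node;

    /* the last node may own a slightly larger remainder chunk */
    return (node < num_nodes) ? node : num_nodes - 1;
}

A freelist builder could then group buffers by whatever this returns, which is exactly the kind of thing OS-level interleaving cannot offer cheaply.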
> I wonder if we should *increase* the size of shared_buffers whenever huge
> pages are in use and there's padding space due to the huge page
> boundaries. Pretty pointless to waste that memory if we can instead use it
> for the buffer pool. Not that big a deal with 2MB huge pages, but with 1GB
> huge pages...

I think that makes sense, except maybe for operations that need to scan the
whole buffer pool (i.e. related to BUF_DROP_FULL_SCAN_THRESHOLD)?

> > 5) v1-0005-NUMA-interleave-PGPROC-entries.patch
> >
> > Another area that seems like it might benefit from NUMA is PGPROC, so I
> > gave it a try. It turned out somewhat challenging. Similarly to buffers
> > we have two pieces that need to be located in a coordinated way - PGPROC
> > entries and fast-path arrays. But we can't use the same approach as for
> > buffers/descriptors, because
> >
> > (a) Neither of those pieces aligns with memory page size (PGPROC is
> > ~900B, fast-path arrays are variable length).
> >
> > (b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
> > rather high max_connections before we use multiple huge pages.
>
> Right now sizeof(PGPROC) happens to be a multiple of 64 (i.e. the most
> common cache line size)

Oh right, it's currently 832 bytes and the patch extends that to 840 bytes.
With a bit of reordering:

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5cb1632718e..2ed2f94202a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -194,8 +194,6 @@ struct PGPROC
                                  * vacuum must not remove tuples deleted by
                                  * xid >= xmin ! */
 
-       int             procnumber;     /* index in ProcGlobal->allProcs */
-
        int             pid;            /* Backend's process ID; 0 if prepared xact */
 
        int             pgxactoff;      /* offset into various ProcGlobal->arrays with
@@ -243,6 +241,7 @@ struct PGPROC
 
        /* Support for condition variables. */
        proclist_node cvWaitLink;       /* position in CV wait list */
+       int             procnumber;     /* index in ProcGlobal->allProcs */
 
        /* Info about lock the process is currently waiting for, if any. */
        /* waitLock and waitProcLock are NULL if not currently waiting. */
@@ -268,6 +267,7 @@ struct PGPROC
         */
        XLogRecPtr      waitLSN;        /* waiting for this LSN or higher */
        int             syncRepState;   /* wait state for sync rep */
+       int             numa_node;
        dlist_node      syncRepLinks;   /* list link if process is in syncrep queue */
 
        /*
@@ -321,9 +321,6 @@ struct PGPROC
        PGPROC     *lockGroupLeader;    /* lock group leader, if I'm a member */
        dlist_head      lockGroupMembers;       /* list of members, if I'm a leader */
        dlist_node      lockGroupLink;  /* my member link, if I'm a member */
-
-       /* NUMA node */
-       int             numa_node;
 };

That could bring it back to 832 bytes (though the ordering no longer makes
much sense logically).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
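To illustrate the struct-layout point above, here is a standalone sketch with toy structs (not the real PGPROC; typical 64-bit alignment assumed). It shows why appending an int grows the size by a full alignment unit, while moving it into an existing padding hole does not, mirroring the 832 -> 840 -> 832 progression discussed in the message:

#include <stdint.h>
#include <stdio.h>

/* roughly "today": an 8-byte member, a lone int, a 4-byte hole, another 8-byte member */
struct original
{
    uint64_t    a;
    int         b;          /* followed by a 4-byte hole */
    uint64_t    c;
};                          /* sizeof = 24 */

/* appending the new int adds 4 bytes of data plus 4 more bytes of tail padding */
struct appended
{
    uint64_t    a;
    int         b;
    uint64_t    c;
    int         numa_node;  /* tail padding brings the size to 32 */
};                          /* sizeof = 32 */

/* reordering the new int into the existing hole keeps the original size */
struct reordered
{
    uint64_t    a;
    int         b;
    int         numa_node;  /* fills the hole next to b */
    uint64_t    c;
};                          /* sizeof = 24 */

int
main(void)
{
    printf("%zu %zu %zu\n", sizeof(struct original),
           sizeof(struct appended), sizeof(struct reordered));
    return 0;
}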