Thread: Adding basic NUMA awareness

Adding basic NUMA awareness

From: Tomas Vondra

Hi,

This is a WIP version of a patch series I'm working on, adding some
basic NUMA awareness for a couple parts of our shared memory (shared
buffers, etc.). It's based on Andres' experimental patches he spoke
about at pgconf.eu 2024 [1], and while it's improved and polished in
various ways, it's still experimental.

But there's a recent thread aiming to do something similar [2], so
better to share it now so that we can discuss both approaches. This
patch set is a bit more ambitious, handling NUMA in a way to allow
smarter optimizations later, so I'm posting it in a separate thread.

The series is split into patches addressing different parts of the
shared memory, starting (unsurprisingly) from shared buffers, then
buffer freelists and ProcArray. There are a couple of additional parts,
but those are smaller / address miscellaneous stuff.

Each patch has a numa_ GUC, intended to enable/disable that part. This
is meant to make development easier, not as a final interface. I'm not
sure how exactly that should look. It's possible some combinations of
GUCs won't work, etc.

Each patch should have a commit message explaining the intent and
implementation, and then also detailed comments explaining various
challenges and open questions.

But let me go over the basics, and discuss some of the design choices
and open questions that need solving.


1) v1-0001-NUMA-interleaving-buffers.patch

This is the main thing when people think about NUMA - making sure the
shared buffers are allocated evenly on all the nodes, not just on a
single node (which can happen easily with warmup). The regular memory
interleaving would address this, but it also has some disadvantages.

Firstly, it's oblivious to the contents of the shared memory segment,
and we may not want to interleave everything. It's also oblivious to
alignment of the items (a buffer can easily end up "split" on multiple
NUMA nodes), or relationship between different parts (e.g. there's a
BufferBlock and a related BufferDescriptor, and those might again end up
on different nodes).

So the patch handles this by explicitly mapping chunks of shared buffers
to different nodes - a bit like interleaving, but in larger chunks.
Ideally each node gets (1/N) of shared buffers, as a contiguous chunk.

It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and its descriptor
always end up on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.

There's a secondary benefit of explicitly assigning buffers to nodes,
using this simple scheme - it allows quickly determining the node ID
given a buffer ID. This is helpful later, when building the freelists.
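
To illustrate the idea, the explicit mapping might look roughly like this
(a simplified sketch, not the actual patch code - the chunk sizing and the
helper names are made up):

    #include <numa.h>

    /*
     * Sketch: assign contiguous chunks of the buffer pool to NUMA nodes in
     * a round-robin way, and derive the node for a buffer ID from the same
     * arithmetic.  chunk_bytes is assumed to be a multiple of the memory
     * page size, chunk_buffers the number of buffers per chunk.
     */
    static void
    interleave_buffer_chunks(char *blocks, size_t chunk_bytes,
                             int num_chunks, int num_nodes)
    {
        for (int chunk = 0; chunk < num_chunks; chunk++)
        {
            char   *start = blocks + (size_t) chunk * chunk_bytes;

            /* bind this chunk of buffers to its node */
            numa_tonode_memory(start, chunk_bytes, chunk % num_nodes);
        }
    }

    /* cheap buffer -> node lookup, possible thanks to the explicit mapping */
    static inline int
    buffer_to_node(int buf_id, int chunk_buffers, int num_nodes)
    {
        return (buf_id / chunk_buffers) % num_nodes;
    }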

The patch is fairly simple. Most of the complexity is about picking the
chunk size, and aligning the arrays (so that it nicely aligns with
memory pages).

The patch has a GUC "numa_buffers_interleave", with "off" by default.


2) v1-0002-NUMA-localalloc.patch

This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
as it allows experimenting with numactl etc.

The patch has a GUC "numa_localalloc", with "off" by default.
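
For reference, the gist of it is just a libnuma call, something like this
(a minimal sketch, not the exact patch code - the function name is made up):

    #include <numa.h>

    /* sketch: switch this backend back to node-local allocations */
    static void
    set_numa_localalloc(void)
    {
        if (numa_available() != -1)
            numa_set_localalloc();
    }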


3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch

Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.


4) v1-0004-NUMA-partition-buffer-freelist.patch

Right now we have a single freelist, and in busy instances that can be
quite contended. What's worse, the freelist may thrash between different
CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on
subsets of buffers. The patch implements multiple strategies for how the
list can be split (configured using the "numa_partition_freelist" GUC), for
experimenting:

* node - One list per NUMA node. This is the most natural option,
because we now know which buffer is on which node, so we can ensure a
list for a node only has buffers from that node.

* cpu - One list per CPU. Pretty simple, each CPU gets its own list.

* pid - Similar to "cpu", but the processes are mapped to lists based on
PID, not CPU ID.

* none - nothing, single freelist

Ultimately, I think we'll want to go with "node", simply because it
aligns with the buffer interleaving. But there are improvements needed.

The main challenge is that with multiple smaller lists, a process can't
really use the whole shared buffers. So a single backend will only use
part of the memory. The more lists there are, the worse this effect is.
This is also why I think we won't use the other partitioning options,
because there's going to be more CPUs than NUMA nodes.

Obviously, this needs solving even with NUMA nodes - we need to allow a
single backend to utilize the whole shared buffers if needed. There
should be a way to "steal" buffers from other freelists (if the
"regular" freelist is empty), but the patch does not implement this.
Shouldn't be hard, I think.
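
Just to sketch what I mean (hypothetical code, none of these identifiers
exist in the patch):

    /*
     * Try the freelist for our own NUMA node first, then "steal" from the
     * other partitions, before falling back to the clocksweep.
     */
    static BufferDesc *
    get_free_buffer(int my_node, int num_lists)
    {
        for (int i = 0; i < num_lists; i++)
        {
            int         list = (my_node + i) % num_lists;
            BufferDesc *buf = pop_freelist(list);   /* hypothetical helper */

            if (buf != NULL)
                return buf;
        }

        return NULL;    /* all freelists empty, caller runs the clocksweep */
    }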

The other missing part is clocksweep - there's still just a single
instance of clocksweep, feeding buffers to all the freelists. But that's
clearly a problem, because the clocksweep returns buffers from all NUMA
nodes. The clocksweep really needs to be partitioned the same way as the
freelists, and each partition will operate on a subset of buffers (from
the right NUMA node).

I do have a separate experimental patch doing something like that, I
need to make it part of this branch.
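
Roughly speaking, I imagine one clocksweep state per NUMA node, each
sweeping only the buffers of that node - something like this (a sketch,
not the actual code of that experimental patch):

    #include "port/atomics.h"

    /* per-node clocksweep state (sketch) */
    typedef struct ClockSweepPartition
    {
        pg_atomic_uint32 nextVictimBuffer;  /* clock hand, relative to firstBuffer */
        int              firstBuffer;       /* first buffer ID on this node */
        int              numBuffers;        /* number of buffers on this node */
    } ClockSweepPartition;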


5) v1-0005-NUMA-interleave-PGPROC-entries.patch

Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because

(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).

(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
rather high max_connections before we use multiple huge pages.

The fast-path arrays are less of a problem, because those tend to be
larger, and are accessed through pointers, so we can just adjust that.

So what I did instead is splitting the whole PGPROC array into one array
per NUMA node, and one array for auxiliary processes and 2PC xacts. So
with 4 NUMA nodes there are 5 separate arrays, for example. Each array
is a multiple of memory pages, so we may waste some of the memory. But
that's simply how NUMA works - page granularity.

This however makes one particular thing harder - in a couple places we
accessed PGPROC entries through PROC_HDR->allProcs, which was pretty
much just one large array. And GetNumberFromPGProc() relied on array
arithmetic to determine the procnumber. With the array partitioned, this
can't work the same way.

But there's a simple solution - if we turn allProcs into an array of
*pointers* to PGPROC arrays, there's no issue. All the places need a
pointer anyway. And then we need an explicit procnumber field in PGPROC,
instead of calculating it.
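
Simplified, the change amounts to something like this (a sketch, not the
actual proc.h definitions):

    typedef struct PGPROC PGPROC;

    struct PGPROC
    {
        int         procnumber;     /* explicit, no longer derived from offsets */
        /* ... other fields unchanged ... */
    };

    typedef struct PROC_HDR
    {
        PGPROC    **allProcs;       /* pointers into the per-node PGPROC arrays */
        /* ... other fields unchanged ... */
    } PROC_HDR;

    /* GetNumberFromPGProc() then simply reads the field */
    #define GetNumberFromPGProc(proc)   ((proc)->procnumber)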

There's a chance this has a negative impact on code that accesses PGPROC
very often, but so far I haven't seen such cases. But if you can come up
with such examples, I'd like to see those.

There's another detail - when obtaining a PGPROC entry in InitProcess(),
we try to get an entry from the same NUMA node. And only if that doesn't
work, we grab the first one from the list (there's still just one PGPROC
freelist, I haven't split that - maybe we should?).

This has a GUC "numa_procs_interleave", again "off" by default. It's not
quite correct, though, because the partitioning always happens - the GUC
only affects the PGPROC lookup. (In a way, this may be a bit broken.)


6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch

This is an experimental patch that simply pins the new process to the
NUMA node obtained from the freelist.

Driven by GUC "numa_procs_pin" (default: off).
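
It boils down to a single libnuma call during backend startup, roughly
like this (a sketch, assuming the node is the one the PGPROC entry came
from):

    #include <numa.h>

    /* sketch: restrict the new backend to CPUs of the given NUMA node */
    static void
    pin_backend_to_node(int node)
    {
        if (numa_available() != -1 && node >= 0)
            (void) numa_run_on_node(node);
    }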


Summary
-------

So this is what I have at the moment. I've tried to organize the patches
in the order of importance, but that's just my guess. It's entirely
possible there's something I missed, some other order might make more
sense, etc.

There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.

I think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.

But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).

I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.

The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run on. I assume we'd start by
simply inheriting/determining that at startup through libnuma, not through
some custom PG configuration (which the patch in [2] proposed to do).


regards


[1] https://www.youtube.com/watch?v=V75KpACdl6E

[2]
https://www.postgresql.org/message-id/CAKZiRmw6i1W1AwXxa-Asrn8wrVcVH3TO715g_MCoowTS9rkGyw%40mail.gmail.com

-- 
Tomas Vondra


Re: Adding basic NUMA awareness

From: Ashutosh Bapat

On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
>
>
> 3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
>
> Minor optimization. Andres noticed we're tracking the tail of buffer
> freelist, without using it. So the patch removes that.
>

The patches for resizing buffers use the lastFreeBuffer to add new
buffers to the end of the free list when expanding it. But we could as
well add them at the beginning of the free list.

This patch seems almost independent of the rest of the patches. Do you
need it in the rest of the patches? I understand that those patches
don't need to worry about maintaining lastFreeBuffer after this patch.
Is there any other effect?

If we are going to do this, let's do it earlier so that buffer
resizing patches can be adjusted.

>
> There's also the question how this is related to other patches affecting
> shared memory - I think the most relevant one is the "shared buffers
> online resize" by Ashutosh, simply because it touches the shared memory.

I have added Dmitry to this thread since he has written most of the
shared memory handling code.

>
> I think the splitting would actually make some things simpler, or
> maybe more flexible - in particular, it'd allow us to enable huge pages
> only for some regions (like shared buffers), and keep the small pages
> e.g. for PGPROC. So that'd be good.

The resizing patches split the shared buffer related structures into
separate memory segments. I think that itself will help enabling huge
pages for some regions. Would that help in your case?

>
> But there'd also need to be some logic to "rework" how shared buffers
> get mapped to NUMA nodes after resizing. It'd be silly to start with
> memory on 4 nodes (25% each), resize shared buffers to 50% and end up
> with memory only on 2 of the nodes (because the other 2 nodes were
> originally assigned the upper half of shared buffers).
>
> I don't have a clear idea how this would be done, but I guess it'd
> require a bit of code invoked sometime after the resize. It'd already
> need to rebuild the freelists in some way, I guess.

Yes, there's code to build the free list. I think we will need code to
remap the buffers and buffer descriptors.

--
Best Wishes,
Ashutosh Bapat



Re: Adding basic NUMA awareness

From: Tomas Vondra

On 7/2/25 13:37, Ashutosh Bapat wrote:
> On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
>>
>>
>> 3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
>>
>> Minor optimization. Andres noticed we're tracking the tail of buffer
>> freelist, without using it. So the patch removes that.
>>
> 
> The patches for resizing buffers use the lastFreeBuffer to add new
> buffers to the end of the free list when expanding it. But we could as
> well add them at the beginning of the free list.
> 
> This patch seems almost independent of the rest of the patches. Do you
> need it in the rest of the patches? I understand that those patches
> don't need to worry about maintaining lastFreeBuffer after this patch.
> Is there any other effect?
> 
> If we are going to do this, let's do it earlier so that buffer
> resizing patches can be adjusted.
> 

My patches don't particularly rely on this bit, it would work even with
lastFreeBuffer. I believe Andres simply noticed the current code does
not use lastFreeBuffer, it just maintains it, so he removed that as an
optimization. I don't know how significant the improvement is, but if
it's measurable we could just do that independently of our patches.

>>
>> There's also the question how this is related to other patches affecting
>> shared memory - I think the most relevant one is the "shared buffers
>> online resize" by Ashutosh, simply because it touches the shared memory.
> 
> I have added Dmitry to this thread since he has written most of the
> shared memory handling code.
> 

Thanks.

>>
>> I think the splitting would actually make some things simpler, or
>> maybe more flexible - in particular, it'd allow us to enable huge pages
>> only for some regions (like shared buffers), and keep the small pages
>> e.g. for PGPROC. So that'd be good.
> 
> The resizing patches split the shared buffer related structures into
> separate memory segments. I think that itself will help enabling huge
> pages for some regions. Would that help in your case?
> 

Indirectly. My patch can work just fine with a single segment, but being
able to enable huge pages only for some of the segments seems better.

>>
>> But there'd also need to be some logic to "rework" how shared buffers
>> get mapped to NUMA nodes after resizing. It'd be silly to start with
>> memory on 4 nodes (25% each), resize shared buffers to 50% and end up
>> with memory only on 2 of the nodes (because the other 2 nodes were
>> originally assigned the upper half of shared buffers).
>>
>> I don't have a clear idea how this would be done, but I guess it'd
>> require a bit of code invoked sometime after the resize. It'd already
>> need to rebuild the freelists in some way, I guess.
> 
> Yes, there's code to build the free list. I think we will need code to
> remap the buffers and buffer descriptor.
> 

Right. The good thing is that's just "advisory" information, it doesn't
break anything if it's temporarily out of sync. We don't need to "stop"
everything to remap the buffers to other nodes, or anything like that.
Or at least I think so.

It's one thing to "flip" the target mapping (determining which node a
buffer should be on), and another to actually migrate the buffers. The
first part can be done instantaneously, the second part can happen in
the background over a longer time period.

I'm not sure how you're rebuilding the freelist. Presumably it can
contain buffers that are no longer valid (after shrinking). How is that
handled to not break anything? I think the NUMA variant would do exactly
the same thing, except that there's multiple lists.

regards

-- 
Tomas Vondra




Re: Adding basic NUMA awareness

From: Ashutosh Bapat

On Wed, Jul 2, 2025 at 6:06 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> I'm not sure how you're rebuilding the freelist. Presumably it can
> contain buffers that are no longer valid (after shrinking). How is that
> handled to not break anything? I think the NUMA variant would do exactly
> the same thing, except that there's multiple lists.

Before shrinking the buffers, we walk the free list removing any
buffers that are going to be removed. When expanding, we link the
new buffers in order and then add them to the already existing
free list. The 0005 patch in [1] has the code for the same.

[1] https://www.postgresql.org/message-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj%40hmuxsf2ngov2
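
Roughly, the shrink-side walk could look something like this (a
simplified sketch against the current freelist fields, not the actual
code from 0005):

    /* unlink buffers beyond the new NBuffers from the free list */
    static void
    prune_free_list(int new_nbuffers)
    {
        int        *link = &StrategyControl->firstFreeBuffer;

        while (*link >= 0)
        {
            BufferDesc *buf = GetBufferDescriptor(*link);

            if (buf->buf_id >= new_nbuffers)
            {
                *link = buf->freeNext;              /* unlink it */
                buf->freeNext = FREENEXT_NOT_IN_LIST;
            }
            else
                link = &buf->freeNext;
        }
    }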

--
Best Wishes,
Ashutosh Bapat



Re: Adding basic NUMA awareness

From: Dmitry Dolgov

> On Wed, Jul 02, 2025 at 05:07:28PM +0530, Ashutosh Bapat wrote:
> > There's also the question how this is related to other patches affecting
> > shared memory - I think the most relevant one is the "shared buffers
> > online resize" by Ashutosh, simply because it touches the shared memory.
>
> I have added Dmitry to this thread since he has written most of the
> shared memory handling code.

Thanks! I like the idea behind this patch series. I haven't read it in
detail yet, but I can imagine both patches (interleaving and online
resizing) could benefit from each other. In online resizing we've
introduced a possibility to use multiple shared mappings for different
types of data, maybe it would be convenient to use the same interface to
create separate mappings for different NUMA nodes as well. Using a
separate shared mapping per NUMA node would also make resizing easier,
since it would be more straightforward to fit an increased segment into
NUMA boundaries.

> > I think the splitting would actually make some things simpler, or
> > maybe more flexible - in particular, it'd allow us to enable huge pages
> > only for some regions (like shared buffers), and keep the small pages
> > e.g. for PGPROC. So that'd be good.
>
> The resizing patches split the shared buffer related structures into
> separate memory segments. I think that itself will help enabling huge
> pages for some regions. Would that help in your case?

Right, separate segments would allow mixing and matching huge pages with
pages of regular size. It's not implemented in the latest version of the
online resizing patch, purely to reduce complexity and maintain the same
invariant (everything is either using huge pages or not) -- but we could
do it the other way around as well.