
From: Andres Freund
Subject: Re: Adding basic NUMA awareness
Date:
Msg-id: ruhvmwlf5ore5eevg275ic5pad7v3qdjdhi6vxty2adeow5ghp@z6kadtces37s
In response to: Adding basic NUMA awareness  (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
Hi,

Thanks for working on this!  I think it's an area we have long neglected...


On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
> Each patch has a numa_ GUC, intended to enable/disable that part. This
> is meant to make development easier, not as a final interface. I'm not
> sure how exactly that should look. It's possible some combinations of
> GUCs won't work, etc.

Wonder if some of it might be worth putting into a multi-valued GUC (like
debug_io_direct).
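
Roughly what I have in mind - an entirely hypothetical guc_tables.c entry,
modeled on debug_io_direct; the name, variable and hooks are invented:

    {
        {"numa", PGC_POSTMASTER, RESOURCES_MEM,
            gettext_noop("Selects which subsystems use NUMA aware placement."),
            NULL,
            GUC_LIST_INPUT | GUC_NOT_IN_SAMPLE
        },
        &numa_string,
        "",
        check_numa, assign_numa, NULL
    },

with the check hook parsing a comma-separated list like "buffers, procs,
freelist" into a set of flags, the way the debug_io_direct hooks do.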



> 1) v1-0001-NUMA-interleaving-buffers.patch
>
> This is the main thing when people think about NUMA - making sure the
> shared buffers are allocated evenly on all the nodes, not just on a
> single node (which can happen easily with warmup). The regular memory
> interleaving would address this, but it also has some disadvantages.
>
> Firstly, it's oblivious to the contents of the shared memory segment,
> and we may not want to interleave everything. It's also oblivious to
> alignment of the items (a buffer can easily end up "split" on multiple
> NUMA nodes), or relationship between different parts (e.g. there's a
> BufferBlock and a related BufferDescriptor, and those might again end up
> on different nodes).

Two more disadvantages:

With OS interleaving postgres doesn't (at least not easily) know what maps to
where, which means postgres can't do things like NUMA-aware buffer
replacement.

With OS interleaving the interleaving is also "too fine grained" - the target
node changes at every page boundary, making it less likely for something like
a single strategy ringbuffer to reside on one NUMA node.


I wonder if we should *increase* the size of shared_buffers whenever huge
pages are in use and there's padding space due to the huge page boundaries.
It's pretty pointless to waste that memory if we can instead use it for the
buffer pool.  Not that big a deal with 2MB huge pages, but with 1GB huge
pages...
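
Roughly the arithmetic I'm thinking of - just a sketch, CalculateShmemSize()
is real, the variable names are invented and this ignores alignment details:

    Size    total = CalculateShmemSize(&numSemas);
    Size    rounded = TYPEALIGN(hp_size, total);      /* hp_size = huge page size in bytes */
    int     extra_buffers = (rounded - total) / (BLCKSZ + sizeof(BufferDescPadded));

I.e. with 1GB pages the padding can amount to hundreds of MB, which is tens of
thousands of buffers.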



> 4) v1-0004-NUMA-partition-buffer-freelist.patch
>
> Right now we have a single freelist, and in busy instances that can be
> quite contended. What's worse, the freelist may trash between different
> CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on
> subsets of buffers. The patch implements multiple strategies how the
> list can be split (configured using "numa_partition_freelist" GUC), for
> experimenting:
>
> * node - One list per NUMA node. This is the most natural option,
> because we now know which buffer is on which node, so we can ensure a
> list for a node only has buffers from that list.
>
> * cpu - One list per CPU. Pretty simple, each CPU gets it's own list.
>
> * pid - Similar to "cpu", but the processes are mapped to lists based on
> PID, not CPU ID.
>
> * none - nothing, sigle freelist
>
> Ultimately, I think we'll want to go with "node", simply because it
> aligns with the buffer interleaving. But there are improvements needed.

I think we might eventually want something more granular than just "node" -
the freelist (and the clock sweep) can become a contention point even within
one NUMA node.  I'm imagining something like an array of freelist/clocksweep
states, where the current numa node selects a subset of the array and the cpu
is used to choose the entry within that subset.
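
Roughly like this (all names invented):

    /* numa node picks a bank of partitions, cpu picks within the bank */
    int     parts_per_node = num_partitions / strategy_nnodes;
    int     part = my_node * parts_per_node + (my_cpu % parts_per_node);

    freelist = &StrategyControl->freelists[part];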

But we can do that later, that should be a fairly simple extension of what
you're doing.



> The other missing part is clocksweep - there's still just a single
> instance of clocksweep, feeding buffers to all the freelists. But that's
> clearly a problem, because the clocksweep returns buffers from all NUMA
> nodes. The clocksweep really needs to be partitioned the same way as a
> freelists, and each partition will operate on a subset of buffers (from
> the right NUMA node).
>
> I do have a separate experimental patch doing something like that, I
> need to make it part of this branch.

I'm really curious about that patch - as I wrote elsewhere in this thread, I
think we should just get rid of the freelist altogether.  Even if we don't do
so, in a steady-state system the clock sweep is commonly much more important
than the freelist...


> 5) v1-0005-NUMA-interleave-PGPROC-entries.patch
>
> Another area that seems like it might benefit from NUMA is PGPROC, so I
> gave it a try. It turned out somewhat challenging. Similarly to buffers
> we have two pieces that need to be located in a coordinated way - PGPROC
> entries and fast-path arrays. But we can't use the same approach as for
> buffers/descriptors, because
>
> (a) Neither of those pieces aligns with memory page size (PGPROC is
> ~900B, fast-path arrays are variable length).

> (b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
> rather high max_connections before we use multiple huge pages.

We should probably pad them regardless? Right now sizeof(PGPROC) happens to be
a multiple of 64 (i.e. the most common cache line size), but that hasn't
always been the case, and it isn't a multiple of the cache line size on
systems with 128 byte cachelines, like common ARMv8 systems.  And having one
cacheline hold one backend's fast-path state and another backend's xmin
doesn't sound like a recipe for good performance.
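
E.g. something along the lines of the BufferDescPadded pattern (hypothetical,
not in the patch):

    #define PGPROC_PADDED_SIZE  TYPEALIGN(PG_CACHE_LINE_SIZE, sizeof(PGPROC))

    typedef union PGPROCPadded
    {
        PGPROC      proc;
        char        pad[PGPROC_PADDED_SIZE];
    } PGPROCPadded;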

Seems like we should also do some reordering of the contents within PGPROC. We
e.g. have very frequently changing data (->waitStatus, ->lwWaiting) in the
same cacheline as almost immutable data (->pid, ->pgxactoff, ->databaseId).


> So what I did instead is splitting the whole PGPROC array into one array
> per NUMA node, and one array for auxiliary processes and 2PC xacts. So
> with 4 NUMA nodes there are 5 separate arrays, for example. Each array
> is a multiple of memory pages, so we may waste some of the memory. But
> that's simply how NUMA works - page granularity.

Theoretically we could use the "padding" memory at the end of each NUMA node's
PGPROC array for the 2PC entries, since for those we presumably don't care
about locality.  Not sure it's worth the complexity though.


For a while I thought I had a better solution: given that we're going to waste
all the "padding" memory anyway, why not just oversize the PGPROC array so
that it spans the required number of NUMA nodes?

The problem is that that would cause ProcNumbers to get much larger, and we do
have other arrays that are keyed by ProcNumber. Which probably makes this not
so great an idea.


> This however makes one particular thing harder - in a couple places we
> accessed PGPROC entries through PROC_HDR->allProcs, which was pretty
> much just one large array. And GetNumberFromPGProc() relied on array
> arithmetics to determine procnumber. With the array partitioned, this
> can't work the same way.
>
> But there's a simple solution - if we turn allProcs into an array of
> *pointers* to PGPROC arrays, there's no issue. All the places need a
> pointer anyway. And then we need an explicit procnumber field in PGPROC,
> instead of calculating it.
>
> There's a chance this have negative impact on code that accessed PGPROC
> very often, but so far I haven't seen such cases. But if you can come up
> with such examples, I'd like to see those.

I'd not be surprised if there were overhead, adding a level of indirection to
things like ProcArrayGroupClearXid(), GetVirtualXIDsDelayingChkpt(),
SignalVirtualTransaction() probably won't be free.

BUT: For at least some of these a better answer might be to add additional
"dense" arrays like we have for xids etc, so they don't need to trawl through
PGPROCs.
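
I.e. something like mirroring ->delayChkptFlags into a dense array in
ProcGlobal, next to the existing ->xids / ->statusFlags mirrors, so that e.g.
GetVirtualXIDsDelayingChkpt() never has to touch a PGPROC.  Hypothetical
sketch, the dense array doesn't exist today:

    for (int index = 0; index < arrayP->numProcs; index++)
    {
        int         pgprocno = arrayP->pgprocnos[index];

        if ((ProcGlobal->delayChkptFlags[pgprocno] & type) != 0)
        {
            /* record this vxid, as the current code does */
        }
    }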


> There's another detail - when obtaining a PGPROC entry in InitProcess(),
> we try to get an entry from the same NUMA node. And only if that doesn't
> work, we grab the first one from the list (there's still just one PGPROC
> freelist, I haven't split that - maybe we should?).

I guess it might be worth partitioning the freelist - iterating through a few
thousand links just to discover that there's no free proc on the current numa
node, while holding a spinlock, doesn't sound great, even if it's likely
rarely a huge issue compared to other costs.


> The other thing I haven't thought about very much is determining on
> which CPUs/nodes the instance is allowed to run. I assume we'd start by
> simply inherit/determine that at the start through libnuma, not through
> some custom PG configuration (which the patch [2] proposed to do).

That seems like the right thing to me.


One thing that this patchset afaict doesn't address so far is that there's a
fair bit of other important shared memory that isn't set up intelligently,
e.g. the buffer mapping table itself (but there are loads of other cases).
Because we touch a lot of that memory during startup, most of it will be
allocated on whatever NUMA node postmaster happened to be scheduled on.  I
suspect that the best we can do for the parts of shared memory where we don't
have explicit NUMA awareness is to default to an interleave policy.
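
Something as dumb as this, right after mapping the anonymous segment, might
already get us most of the benefit (sketch, error handling omitted, base /
segsize stand in for the real variables in sysv_shmem.c):

    #ifdef USE_LIBNUMA
        if (numa_available() != -1)
            numa_interleave_memory(base, segsize, numa_all_nodes_ptr);
    #endif

with the NUMA aware parts of shared memory then overriding that via their
explicit placement.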


> From 9712e50d6d15c18ea2c5fcf457972486b0d4ef53 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas@vondra.me>
> Date: Tue, 6 May 2025 21:12:21 +0200
> Subject: [PATCH v1 1/6] NUMA: interleaving buffers
>
> Ensure shared buffers are allocated from all NUMA nodes, in a balanced
> way, instead of just using the node where Postgres initially starts, or
> where the kernel decides to migrate the page, etc. With pre-warming
> performed by a single backend, this can easily result in severely
> unbalanced memory distribution (with most from a single NUMA node).
>
> The kernel would eventually move some of the memory to other nodes
> (thanks to zone_reclaim), but that tends to take a long time. So this
> patch improves predictability, reduces the time needed for warmup
> during benchmarking, etc.  It's less dependent on what the CPU
> scheduler does, etc.

FWIW, I don't think zone_reclaim_mode will commonly do that, even if it's
enabled - which I don't think it is by default anymore.  At least huge pages
can't be reclaimed by the kernel, and even when not using huge pages, I think
the only scenario where that would happen is if shared_buffers were swapped out.

Numa balancing might eventually "fix" such an imbalance though.


> diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
> index ed1dc488a42..2ad34624c49 100644
> --- a/src/backend/storage/buffer/buf_init.c
> +++ b/src/backend/storage/buffer/buf_init.c
> @@ -14,9 +14,17 @@
>   */
>  #include "postgres.h"
>
> +#ifdef USE_LIBNUMA
> +#include <numa.h>
> +#include <numaif.h>
> +#endif
> +

I wonder how much of this we should try to put into port/pg_numa.c. Having
direct calls to libnuma code all over the backend will make it rather hard to
add numa awareness for hypothetical platforms not using libnuma compatible
interfaces.
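
I.e. something like growing port/pg_numa.h with wrappers along these lines
(names invented):

    extern int  pg_numa_get_current_node(void);
    extern void pg_numa_interleave_region(void *start, Size len, Size chunk_size);
    extern void pg_numa_move_to_node(void *start, Size len, int node);

so that the backend proper never has to include <numa.h> directly.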


> +/* number of buffers allocated on the same NUMA node */
> +static int64 numa_chunk_buffers = -1;

Given that NBuffers is a 32bit quantity, this probably doesn't need to be
64bit...  Anyway, I'm not going to review on that level going forward, the
patch is probably in too early a state for that.



> @@ -71,18 +92,80 @@ BufferManagerShmemInit(void)
>                  foundDescs,
>                  foundIOCV,
>                  foundBufCkpt;
> +    Size        mem_page_size;
> +    Size        buffer_align;
> +
> +    /*
> +     * XXX A bit weird. Do we need to worry about postmaster? Could this even
> +     * run outside postmaster? I don't think so.

It can run in single user mode - but that shouldn't prevent us from using
pg_get_shmem_pagesize().


> +     * XXX Another issue is we may get different values than when sizing the
> +     * the memory, because at that point we didn't know if we get huge pages,
> +     * so we assumed we will. Shouldn't cause crashes, but we might allocate
> +     * shared memory and then not use some of it (because of the alignment
> +     * that we don't actually need). Not sure about better way, good for now.
> +     */

Ugh, not seeing a great way to deal with that either.


> +     * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to
> +     * align to mem_page_size? Especially for very large huge pages (e.g. 1GB)
> +     * that doesn't seem quite worth it. Maybe we should simply align to
> +     * BLCKSZ, so that buffers don't get split? Still, we might interfere with
> +     * other stuff stored in shared memory that we want to allocate on a
> +     * particular NUMA node (e.g. ProcArray).
> +     *
> +     * XXX Maybe with "too large" huge pages we should just not do this, or
> +     * maybe do this only for sufficiently large areas (e.g. shared buffers,
> +     * but not ProcArray).

I think that's right - there's no point in using 1GB pages for anything other
than shared_buffers, we should allocate shared_buffers separately.




> +/*
> + * Determine the size of memory page.
> + *
> + * XXX This is a bit tricky, because the result depends at which point we call
> + * this. Before the allocation we don't know if we succeed in allocating huge
> + * pages - but we have to size everything for the chance that we will. And then
> + * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory
> + * pages. But at that point we can't adjust the sizing.
> + *
> + * XXX Maybe with huge_pages=try we should do the sizing twice - first with
> + * huge pages, and if that fails, then without them. But not for this patch.
> + * Up to this point there was no such dependency on huge pages.

Doing it twice sounds somewhat nasty - but perhaps we could just have the
shmem size infrastructure compute two different numbers, one for use with huge
pages and one without?



> +static int64
> +choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes)
> +{
> +    int64        num_items;
> +    int64        max_items;
> +
> +    /* make sure the chunks will align nicely */
> +    Assert(BLCKSZ % sizeof(BufferDescPadded) == 0);
> +    Assert(mem_page_size % sizeof(BufferDescPadded) == 0);
> +    Assert(((BLCKSZ % mem_page_size) == 0) || ((mem_page_size % BLCKSZ) == 0));
> +
> +    /*
> +     * The minimum number of items to fill a memory page with descriptors and
> +     * blocks. The NUMA allocates memory in pages, and we need to do that for
> +     * both buffers and descriptors.
> +     *
> +     * In practice the BLCKSZ doesn't really matter, because it's much larger
> +     * than BufferDescPadded, so the result is determined buffer descriptors.
> +     * But it's clearer this way.
> +     */
> +    num_items = Max(mem_page_size / sizeof(BufferDescPadded),
> +                    mem_page_size / BLCKSZ);
> +
> +    /*
> +     * We shouldn't use chunks larger than NBuffers/num_nodes, because with
> +     * larger chunks the last NUMA node would end up with much less memory (or
> +     * no memory at all).
> +     */
> +    max_items = (NBuffers / num_nodes);
> +
> +    /*
> +     * Did we already exceed the maximum desirable chunk size? That is, will
> +     * the last node get less than one whole chunk (or no memory at all)?
> +     */
> +    if (num_items > max_items)
> +        elog(WARNING, "choose_chunk_buffers: chunk items exceeds max (%ld > %ld)",
> +             num_items, max_items);
> +
> +    /* grow the chunk size until we hit the max limit. */
> +    while (2 * num_items <= max_items)
> +        num_items *= 2;

Something around this logic leads to a fair bit of imbalance - I started postgres with
huge_page_size=1GB, shared_buffers=4GB on a 2 node system and that results in

postgres[4188255][1]=# SELECT * FROM pg_shmem_allocations_numa WHERE name in ('Buffer Blocks', 'Buffer Descriptors');
┌────────────────────┬───────────┬────────────┐
│        name        │ numa_node │    size    │
├────────────────────┼───────────┼────────────┤
│ Buffer Blocks      │         0 │ 5368709120 │
│ Buffer Blocks      │         1 │ 1073741824 │
│ Buffer Descriptors │         0 │ 1073741824 │
│ Buffer Descriptors │         1 │ 1073741824 │
└────────────────────┴───────────┴────────────┘
(4 rows)


With shared_buffers=8GB postgres failed to start, even though 16 1GB huge
pages are available, as 18GB were requested.

After increasing the limit, the top allocations were as follows:
postgres[4189384][1]=# SELECT * FROM pg_shmem_allocations ORDER BY allocated_size DESC LIMIT 5;
┌──────────────────────┬─────────────┬────────────┬────────────────┐
│         name         │     off     │    size    │ allocated_size │
├──────────────────────┼─────────────┼────────────┼────────────────┤
│ Buffer Blocks        │  1192223104 │ 9663676416 │     9663676416 │
│ PGPROC structures    │ 10970279808 │ 3221733342 │     3221733376 │
│ Fast-Path Lock Array │ 14192013184 │ 3221396544 │     3221396608 │
│ Buffer Descriptors   │    51372416 │ 1140850688 │     1140850688 │
│ (null)               │ 17468590976 │  785020032 │      785020032 │
└──────────────────────┴─────────────┴────────────┴────────────────┘

With a fair bit of imbalance:
postgres[4189384][1]=# SELECT * FROM pg_shmem_allocations_numa WHERE name in ('Buffer Blocks', 'Buffer Descriptors');
┌────────────────────┬───────────┬────────────┐
│        name        │ numa_node │    size    │
├────────────────────┼───────────┼────────────┤
│ Buffer Blocks      │         0 │ 8589934592 │
│ Buffer Blocks      │         1 │ 2147483648 │
│ Buffer Descriptors │         0 │          0 │
│ Buffer Descriptors │         1 │ 2147483648 │
└────────────────────┴───────────┴────────────┘
(4 rows)

Note that the buffer descriptors are all on node 1.


> +/*
> + * Calculate the NUMA node for a given buffer.
> + */
> +int
> +BufferGetNode(Buffer buffer)
> +{
> +    /* not NUMA interleaving */
> +    if (numa_chunk_buffers == -1)
> +        return -1;
> +
> +    return (buffer / numa_chunk_buffers) % numa_nodes;
> +}

FWIW, this is likely rather expensive - when not a compile time constant,
divisions and modulo can take a fair number of cycles.
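
Given that choose_chunk_buffers() starts from a power of two and only ever
doubles, it'd be easy to turn the division into a shift (sketch,
numa_chunk_shift would be computed once at startup):

    int
    BufferGetNode(Buffer buffer)
    {
        if (numa_chunk_buffers == -1)
            return -1;

        return (buffer >> numa_chunk_shift) % numa_nodes;
    }

The remaining modulo by numa_nodes could be replaced with a small lookup table
if it still shows up in profiles.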



> +/*
> + * pg_numa_interleave_memory
> + *        move memory to different NUMA nodes in larger chunks
> + *
> + * startptr - start of the region (should be aligned to page size)
> + * endptr - end of the region (doesn't need to be aligned)
> + * mem_page_size - size of the memory page size
> + * chunk_size - size of the chunk to move to a single node (should be multiple
> + *              of page size
> + * num_nodes - number of nodes to allocate memory to
> + *
> + * XXX Maybe this should use numa_tonode_memory and numa_police_memory instead?
> + * That might be more efficient than numa_move_pages, as it works on larger
> + * chunks of memory, not individual system pages, I think.
> + *
> + * XXX The "interleave" name is not quite accurate, I guess.
> + */
> +static void
> +pg_numa_interleave_memory(char *startptr, char *endptr,
> +                          Size mem_page_size, Size chunk_size,
> +                          int num_nodes)
> +{

Seems like this should be in pg_numa.c?




> diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
> index 69b6a877dc9..c07de903f76 100644
> --- a/src/bin/pgbench/pgbench.c
> +++ b/src/bin/pgbench/pgbench.c

I assume those changes weren't intentionally part of this patch...


> From 6505848ac8359c8c76dfbffc7150b6601ab07601 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas@vondra.me>
> Date: Thu, 22 May 2025 18:38:41 +0200
> Subject: [PATCH v1 4/6] NUMA: partition buffer freelist
>
> Instead of a single buffer freelist, partition into multiple smaller
> lists, to reduce lock contention, and to spread the buffers over all
> NUMA nodes more evenly.
>
> There are four strategies, specified by GUC numa_partition_freelist
>
> * none - single long freelist, should work just like now
>
> * node - one freelist per NUMA node, with only buffers from that node
>
> * cpu - one freelist per CPU
>
> * pid - freelist determined by PID (same number of freelists as 'cpu')
>
> When allocating a buffer, it's taken from the correct freelist (e.g.
> same NUMA node).
>
> Note: This is (probably) more important than partitioning ProcArray.

>
> +/*
> + * Represents one freelist partition.
> + */
> +typedef struct BufferStrategyFreelist
> +{
> +    /* Spinlock: protects the values below */
> +    slock_t        freelist_lock;
> +
> +    /*
> +     * XXX Not sure why this needs to be aligned like this. Need to ask
> +     * Andres.
> +     */
> +    int            firstFreeBuffer __attribute__((aligned(64)));    /* Head of list of
> +                                                                 * unused buffers */
> +
> +    /* Number of buffers consumed from this list. */
> +    uint64        consumed;
> +}            BufferStrategyFreelist;

I think this might be a leftover from measuring performance of a *non*
partitioned freelist.  I saw unnecessary contention between

  BufferStrategyControl->{nextVictimBuffer,buffer_strategy_lock,numBufferAllocs}

and was testing what effect the simplest avoidance scheme has.

I don't think this should be part of this patchset.



>
>  /*
>   * The shared freelist control information.
> @@ -39,8 +66,6 @@ typedef struct
>       */
>      pg_atomic_uint32 nextVictimBuffer;
>
> -    int            firstFreeBuffer;    /* Head of list of unused buffers */
> -
>      /*
>       * Statistics.  These counters should be wide enough that they can't
>       * overflow during a single bgwriter cycle.
> @@ -51,13 +76,27 @@ typedef struct
>      /*
>       * Bgworker process to be notified upon activity or -1 if none. See
>       * StrategyNotifyBgWriter.
> +     *
> +     * XXX Not sure why this needs to be aligned like this. Need to ask
> +     * Andres. Also, shouldn't the alignment be specified after, like for
> +     * "consumed"?
>       */
> -    int            bgwprocno;
> +    int            __attribute__((aligned(64))) bgwprocno;
> +
> +    BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
>  } BufferStrategyControl;

Here the reason was that it's silly to put almost-readonly data (like
bgwprocno) onto the same cacheline as very frequently modified data like
->numBufferAllocs.  That causes unnecessary cache misses in many
StrategyGetBuffer() calls, as another backend's StrategyGetBuffer() will
always have modified ->numBufferAllocs and either ->buffer_strategy_lock or
->nextVictimBuffer.
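
FWIW, if the separation is kept, the more portable spelling would be something
like:

    #ifdef pg_attribute_aligned
        pg_attribute_aligned(PG_CACHE_LINE_SIZE)
    #endif
        int         bgwprocno;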


>
> +static BufferStrategyFreelist *
> +ChooseFreeList(void)
> +{
> +    unsigned    cpu;
> +    unsigned    node;
> +    int            rc;
> +
> +    int            freelist_idx;
> +
> +    /* freelist not partitioned, return the first (and only) freelist */
> +    if (numa_partition_freelist == FREELIST_PARTITION_NONE)
> +        return &StrategyControl->freelists[0];
> +
> +    /*
> +     * freelist is partitioned, so determine the CPU/NUMA node, and pick a
> +     * list based on that.
> +     */
> +    rc = getcpu(&cpu, &node);
> +    if (rc != 0)
> +        elog(ERROR, "getcpu failed: %m");

Probably should put this somewhere abstracted away...
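
E.g. the pg_numa.c wrapper I mentioned above could look like this (still all
invented names; getcpu() needs <sched.h> with _GNU_SOURCE):

    int
    pg_numa_get_current_node(void)
    {
    #ifdef USE_LIBNUMA
        unsigned int cpu,
                    node;

        if (getcpu(&cpu, &node) == 0)
            return (int) node;
    #endif
        return -1;              /* unknown, caller falls back to list 0 */
    }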


> +    /*
> +     * Pick the freelist, based on CPU, NUMA node or process PID. This matches
> +     * how we built the freelists above.
> +     *
> +     * XXX Can we rely on some of the values (especially strategy_nnodes) to
> +     * be a power-of-2? Then we could replace the modulo with a mask, which is
> +     * likely more efficient.
> +     */
> +    switch (numa_partition_freelist)
> +    {
> +        case FREELIST_PARTITION_CPU:
> +            freelist_idx = cpu % strategy_ncpus;

As mentioned earlier, modulo is rather expensive for something executed so
frequently...

> +            break;
> +
> +        case FREELIST_PARTITION_NODE:
> +            freelist_idx = node % strategy_nnodes;
> +            break;

Here we shouldn't need modulo, right?
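
I.e. if strategy_ncpus were rounded up to a power of two at initialization,
something like this should do (sketch):

        case FREELIST_PARTITION_CPU:
            freelist_idx = cpu & (strategy_ncpus - 1);
            break;

        case FREELIST_PARTITION_NODE:
            freelist_idx = node;
            break;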


> +
> +        case FREELIST_PARTITION_PID:
> +            freelist_idx = MyProcPid % strategy_ncpus;
> +            break;
> +
> +        default:
> +            elog(ERROR, "unknown freelist partitioning value");
> +    }
> +
> +    return &StrategyControl->freelists[freelist_idx];
> +}


>      /* size of lookup hash table ... see comment in StrategyInitialize */
>      size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
>
>      /* size of the shared replacement strategy control block */
> -    size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
> +    size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists)));
> +
> +    /*
> +     * Allocate one frelist per CPU. We might use per-node freelists, but the
> +     * assumption is the number of CPUs is less than number of NUMA nodes.
> +     *
> +     * FIXME This assumes the we have more CPUs than NUMA nodes, which seems
> +     * like a safe assumption. But maybe we should calculate how many elements
> +     * we actually need, depending on the GUC? Not a huge amount of memory.

FWIW, I don't think that's a safe assumption anymore. With CXL we can get a)
PCIe attached memory and b) remote memory as separate NUMA nodes, and that
very well could end up being more NUMA nodes than cores.


Ugh, -ETOOLONG. Gotta schedule some other things...


Greetings,

Andres Freund


