Thread: NUMA shared memory interleaving

NUMA shared memory interleaving

From
Jakub Wartak
Date:
Thanks to having pg_numa.c, we can now simply address problem #2
(NUMA imbalance) from [1], pages 11-14, by interleaving shm memory in
PG19 - patch attached. We do not need to call numa_set_localalloc(),
as we only interleave the shm segments, while local allocations stay
the same ("local" meaning relative to the CPU asking for private
memory). Below are results from a legacy 4s32t64 Sandy Bridge EP box
with low NUMA (QPI) interconnect bandwidth, chosen to better
illustrate the problem (it's a bit of an edge case, but someone may
hit it):
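
Conceptually, the interleaving itself boils down to a single libnuma
call over the freshly mapped segment - a minimal sketch, not the
actual patch code (link with -lnuma; error handling omitted):

    #include <numa.h>
    #include <stddef.h>

    /*
     * Sketch: spread an already-mapped shared memory segment across all
     * allowed NUMA nodes.  Backend-private allocations are untouched, so
     * numa_set_localalloc() is not needed.
     */
    static void
    interleave_shm_segment(void *base, size_t size)
    {
        if (numa_available() == -1)
            return;             /* no NUMA support at runtime */

        numa_interleave_memory(base, size, numa_all_nodes_ptr);
    }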

Testcase:
    small s_b (here it was 4GB*) that fully fits within a single NUMA
node's hugepage zone, as this was tested with huge_pages=on

    $ cat seqconcurrscans.pgb
    \set num (:client_id % 8) + 1
    select sum(octet_length(filler)) from pgbench_accounts_:num;

    /usr/local/pgsql/bin/pg_ctl -D /db/data -l logfile restart
    /usr/local/pgsql/bin/psql  -c "select
pg_prewarm('pgbench_accounts_'||s) from generate_series(1, 8) s;"
#load all using current policy
    /usr/local/pgsql/bin/psql  -c "select * from
pg_shmem_allocations_numa where name = 'Buffer Blocks';"
    /usr/local/pgsql/bin/pgbench -c 64 -j 8 -P 1 -T 60 -f seqconcurrscans.pgb

on master and numa=off (default) and in previous versions:
         name      | numa_node |    size
    ---------------+-----------+------------
     Buffer Blocks |         0 |          0
     Buffer Blocks |         1 |          0
     Buffer Blocks |         2 | 4297064448
     Buffer Blocks |         3 |          0

    latency average = 1826.324 ms
    latency stddev = 665.567 ms
    tps = 34.708151 (without initial connection time)

on master and numa=on:
         name      | numa_node |    size
    ---------------+-----------+------------
     Buffer Blocks |         0 | 1073741824
     Buffer Blocks |         1 | 1073741824
     Buffer Blocks |         2 | 1075838976
     Buffer Blocks |         3 | 1073741824

    latency average = 1002.288 ms
    latency stddev = 214.392 ms
    tps = 63.344814 (without initial connection time)

Normal pgbench workloads tend not to be affected, as each backend
tends to touch just a small portion of shm (thanks to the BAS
strategies). Some remaining questions are:
1. How should we name this GUC (numa or numa_shm_interleave)? I prefer
the first option, as we could potentially add more optimizations
behind that GUC in the future.
2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
expert on DSA/DSM at all)
3. Should we fail to start if numa=on is set on an unsupported platform?

* An interesting tidbit for getting reliable measurements: one needs
to double-check that s_b (the hugepage allocation) is smaller than the
free hugepages per NUMA zone, i.e. that s_b fits the static hugepage
allocation within a single zone. This shouldn't be a problem on 2
sockets (there s_b is < 50% of RAM anyway most of the time - usually
26-30% once the extra shared memory driven by max_connections and
friends is added, so above 25%, while people usually set sysctl
nr_hugepages to 25% of RAM), but with >= 4 NUMA nodes (4 sockets or
some modern MCMs) the kernel might start spilling s_b (> 25%) to
another NUMA node on its own, so it's best to verify it using
pg_shmem_allocations_numa...

-J.

[1] - https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf

Attachment

Re: NUMA shared memory interleaving

From
Thomas Munro
Date:
On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> 2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
> expert on DSA/DSM at all)

I have no answers but I have speculated for years about a very
specific case (without any idea where to begin due to lack of ... I
guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
workers split up and try to work on different batches on their own to
minimise contention, and when that's not possible (more workers than
batches, or finishing their existing work at different times and going
to help others), they just proceed in round-robin order.  A beginner
thought is: if you're going to help someone working on a hash table,
it would surely be best to have the CPUs and all the data on the same
NUMA node.  During loading, cache line ping pong would be cheaper, and
during probing, it *might* be easier to tune explicit memory prefetch
timing that way as it would look more like a single node system with a
fixed latency, IDK (I've shared patches for prefetching before that
showed pretty decent speedups, and the lack of that feature is
probably a bigger problem than any of this stuff, who knows...).
Another beginner thought is that the DSA allocator is a source of
contention during loading: the dumbest problem is that the chunks are
just too small, but it might also be interesting to look into per-node
pools.  Or something.   IDK, just some thoughts...
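
To make that concrete, here's a purely hypothetical sketch of the
batch choice - batch_needs_help() and batch_home_node() don't exist
anywhere, it's just the shape of the idea: prefer batches whose hash
table lives on the worker's own node, then fall back to the current
round-robin:

    #include <stdbool.h>

    /* Hypothetical helpers -- neither exists in PostgreSQL. */
    extern bool batch_needs_help(int batch);
    extern int  batch_home_node(int batch);

    /*
     * Hypothetical sketch: the first pass prefers unfinished batches
     * resident on this worker's NUMA node; the second pass is today's
     * plain round-robin over everything.
     */
    static int
    choose_next_batch(int nbatches, int start_batch, int my_numa_node)
    {
        for (int i = 0; i < nbatches; i++)
        {
            int         batch = (start_batch + i) % nbatches;

            if (batch_needs_help(batch) &&
                batch_home_node(batch) == my_numa_node)
                return batch;
        }

        for (int i = 0; i < nbatches; i++)
        {
            int         batch = (start_batch + i) % nbatches;

            if (batch_needs_help(batch))
                return batch;
        }

        return -1;              /* nothing left to help with */
    }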



Re: NUMA shared memory interleaving

From
Thomas Munro
Date:
On Thu, Apr 17, 2025 at 1:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I have no answers but I have speculated for years about a very
> specific case (without any idea where to begin due to lack of ... I
> guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
> workers split up and try to work on different batches on their own to
> minimise contention, and when that's not possible (more workers than
> batches, or finishing their existing work at different times and going
> to help others), they just proceed in round-robin order.  A beginner
> thought is: if you're going to help someone working on a hash table,
> it would surely be best to have the CPUs and all the data on the same
> NUMA node.  During loading, cache line ping pong would be cheaper, and
> during probing, it *might* be easier to tune explicit memory prefetch
> timing that way as it would look more like a single node system with a
> fixed latency, IDK (I've shared patches for prefetching before that
> showed pretty decent speedups, and the lack of that feature is
> probably a bigger problem than any of this stuff, who knows...).
> Another beginner thought is that the DSA allocator is a source of
> contention during loading: the dumbest problem is that the chunks are
> just too small, but it might also be interesting to look into per-node
> pools.  Or something.   IDK, just some thoughts...

And BTW there are papers about that (but they mostly just remind me
that I have to reboot the prefetching patch long before that...), for
example:

https://15721.courses.cs.cmu.edu/spring2023/papers/11-hashjoins/lang-imdm2013.pdf



Re: NUMA shared memory interleaving

From
Robert Haas
Date:
On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> Normal pgbench workloads tend not to be affected, as each backend
> tends to touch just a small portion of shm (thanks to the BAS
> strategies). Some remaining questions are:
> 1. How should we name this GUC (numa or numa_shm_interleave)? I prefer
> the first option, as we could potentially add more optimizations
> behind that GUC in the future.

I wonder whether the GUC needs to support interleaving between a
designated set of nodes rather than only being able to do all nodes.
For example, suppose someone is pinning the processes to a certain set
of NUMA nodes; perhaps then they wouldn't want to use memory from
other nodes.
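
For what it's worth, libnuma can already express the "subset of
nodes" variant cheaply - roughly like this (sketch only, assuming the
GUC value arrives as a node-list string such as "0,2" or "1-3"):

    #include <numa.h>
    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Sketch: interleave a mapping across a designated set of nodes
     * instead of all of them.  Returns false if the list can't be parsed.
     */
    static bool
    interleave_on_nodes(void *base, size_t size, const char *nodestring)
    {
        struct bitmask *nodes;

        if (numa_available() == -1)
            return false;

        nodes = numa_parse_nodestring(nodestring);
        if (nodes == NULL)
            return false;

        numa_interleave_memory(base, size, nodes);
        numa_bitmask_free(nodes);
        return true;
    }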

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: NUMA shared memory interleaving

From
Bertrand Drouvot
Date:
Hi,

On Wed, Apr 16, 2025 at 10:05:04AM -0400, Robert Haas wrote:
> On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > Normal pgbench workloads tend not to be affected, as each backend
> > tends to touch just a small portion of shm (thanks to the BAS
> > strategies). Some remaining questions are:
> > 1. How should we name this GUC (numa or numa_shm_interleave)? I prefer
> > the first option, as we could potentially add more optimizations
> > behind that GUC in the future.
> 
> I wonder whether the GUC needs to support interleaving between a
> designated set of nodes rather than only being able to do all nodes.
> For example, suppose someone is pinning the processes to a certain set
> of NUMA nodes; perhaps then they wouldn't want to use memory from
> other nodes.

+1. That could be used for instance consolidation on the same host. One could
ensure that NUMA nodes are not shared across instances (CPU and memory resource
isolation per instance). Bonus point: adding direct I/O into the game would
ensure that the OS page cache is not shared either.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: NUMA shared memory interleaving

From
Bertrand Drouvot
Date:
Hi,

On Thu, Apr 17, 2025 at 01:58:44AM +1200, Thomas Munro wrote:
> On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > 2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
> > expert on DSA/DSM at all)
> 
> I have no answers but I have speculated for years about a very
> specific case (without any idea where to begin due to lack of ... I
> guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
> workers split up and try to work on different batches on their own to
> minimise contention, and when that's not possible (more workers than
> batches, or finishing their existing work at different times and going
> to help others), they just proceed in round-robin order.  A beginner
> thought is: if you're going to help someone working on a hash table,
> it would surely be best to have the CPUs and all the data on the same
> NUMA node.  During loading, cache line ping pong would be cheaper, and
> during probing, it *might* be easier to tune explicit memory prefetch
> timing that way as it would look more like a single node system with a
> fixed latency, IDK (I've shared patches for prefetching before that
> showed pretty decent speedups, and the lack of that feature is
> probably a bigger problem than any of this stuff, who knows...).
> Another beginner thought is that the DSA allocator is a source of
> contention during loading: the dumbest problem is that the chunks are
> just too small, but it might also be interesting to look into per-node
> pools.  Or something.   IDK, just some thoughts...

I'm also thinking that could be beneficial for parallel workers. I think the
ideal scenario would be to have the parallel workers spread across NUMA nodes,
each accessing its "local" memory first (and helping with "remote" memory
access if there is still more work to do "remotely").
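
A naive version of that spreading could be as simple as the following
hypothetical sketch (worker_index is assumed to come from the parallel
context; nothing like this exists in the tree):

    #include <numa.h>

    /*
     * Hypothetical sketch: place each parallel worker on a NUMA node in
     * round-robin fashion, so its scheduling (and ideally the memory it
     * touches first) stays local.
     */
    static void
    spread_worker_across_nodes(int worker_index)
    {
        int         num_nodes;

        if (numa_available() == -1)
            return;

        num_nodes = numa_max_node() + 1;
        numa_run_on_node(worker_index % num_nodes);
    }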

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: NUMA shared memory interleaving

From
Jakub Wartak
Date:
On Fri, Apr 18, 2025 at 7:43 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Wed, Apr 16, 2025 at 10:05:04AM -0400, Robert Haas wrote:
> > On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak
> > <jakub.wartak@enterprisedb.com> wrote:
> > > Normal pgbench workloads tend not to be affected, as each backend
> > > tends to touch just a small portion of shm (thanks to the BAS
> > > strategies). Some remaining questions are:
> > > 1. How should we name this GUC (numa or numa_shm_interleave)? I prefer
> > > the first option, as we could potentially add more optimizations
> > > behind that GUC in the future.
> >
> > I wonder whether the GUC needs to support interleaving between a
> > designated set of nodes rather than only being able to do all nodes.
> > For example, suppose someone is pinning the processes to a certain set
> > of NUMA nodes; perhaps then they wouldn't want to use memory from
> > other nodes.
>
> +1. That could be used for instance consolidation on the same host. One could
> ensure that NUMA nodes are not shared across instances (CPU and memory resource
> isolation per instance). Bonus point: adding direct I/O into the game would
> ensure that the OS page cache is not shared either.

Hi, the attached patch has two changes:
1. It adds more modes and supports the 'consolidation' and
'isolation' scenarios from above. The docs in the patch briefly
explain the merit.
2. It adds trivial NUMA interleaving of the shm used by PQ.

The original test from the initial e-mail, expanded, on the very same
machine (4s32c128t, QPI interconnect):

numa='off'
    latency average = 1271.019 ms
    latency stddev = 245.061 ms
    tps = 49.683923 (without initial connection time)
    explanation (pcm-memory): 3 sockets doing ~46MB/s on RAM (almost
idle), 1 socket doing ~17GB/s (fully saturated, because in this
scenario s_b ended up on a single NUMA node)

numa='all'
    latency average = 702.622 ms
    latency stddev = 13.259 ms
    tps = 90.026526 (without initial connection time)
    explanation (pcm-memory): this forced s_b to be interleaved across
4 NUMA nodes, and each socket gets an equal part of the workload
(9.2-10GB/s), totalling ~37GB/s of memory bandwidth

This gives a boost: 90/49.6=1.8x. The values for memory bandwidth are
combined read+write.

NUMA impact on Parallel Query:
----------------------------------
Setup:
    the most simplistic interleaving of s_b, with the
dynamic_shared_memory used by PQ interleaved too
    max_worker_processes=max_parallel_workers=max_parallel_workers_per_gather=64
    an ALTER on 1 partition to force 64 real parallel seq scans
The query:
    select sum(octet_length(filler)) from pgbench_accounts;
launched 64 effective parallel workers for the 64 partitions of 400MB
each (25600MB in total). All of that fit in s_b (32GB), so everything
was fetched from s_b. Everything was hot; the first several runs are
not shown.

select sum(octet_length(filler)) from pgbench_accounts;

numa='off'
    Time: 1108.178 ms (00:01.108)
    Time: 1118.494 ms (00:01.118)
    Time: 1104.491 ms (00:01.104)
    Time: 1112.221 ms (00:01.112)
    Time: 1105.501 ms (00:01.106)
    avg: 1109 ms

    -- not interleaved, more like appended:
    postgres=# select * from pg_shmem_allocations_numa where name =
'Buffer Blocks';
         name      | numa_node |    size
    ---------------+-----------+------------
     Buffer Blocks |         0 | 9277800448
     Buffer Blocks |         1 | 7044333568
     Buffer Blocks |         2 | 9097445376
     Buffer Blocks |         3 | 8942256128

numa='all'
    Time: 1026.747 ms (00:01.027)
    Time: 1024.087 ms (00:01.024)
    Time: 1024.179 ms (00:01.024)
    Time: 1037.026 ms (00:01.037)
    avg: 1027 ms

        postgres=# select * from pg_shmem_allocations_numa where name
= 'Buffer Blocks';
         name      | numa_node |    size
    ---------------+-----------+------------
     Buffer Blocks |         0 | 8589934592
     Buffer Blocks |         1 | 8592031744
     Buffer Blocks |         2 | 8589934592
     Buffer Blocks |         3 | 8589934592

1109/1027 = 1.079x - not bad for such a trivial change, and the paper
referenced by Thomas also stated (`We can see an improvement by a
factor of more than three by just running the non-NUMA-aware
implementation on interleaved memory`). It could probably be improved
much further, but I'm not planning to work on this more. So, if
anything:
- latency-wise: it would be best to place the leader and all PQ
workers close to s_b, provided s_b fits the NUMA node's shared/huge
page memory and you won't need more CPUs than that NUMA node has...
(assuming e.g. hosting 4 DBs on 4 sockets, each on its own node, it
would be best to pin everything, including shm but also the PQ
workers)
- capacity/TPS-wise, or when s_b > a single NUMA node: just interleave
to maximize bandwidth and get uniform CPU performance out of it

The patch supports e.g. numa='@1', which should fully isolate the
workload to just the memory and CPUs of node #1.
As for the patch itself: I'm lost with our C headers policy :)
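
For illustration, the '@' form roughly boils down to binding both the
memory policy and the CPUs to the given node - a minimal libnuma
sketch of that mechanism, not the actual patch code:

    #include <numa.h>

    /*
     * Sketch: restrict both memory allocation and scheduling of the
     * calling process to a single NUMA node, as a numa='@1'-style
     * setting would.
     */
    static void
    bind_to_node(int node)
    {
        struct bitmask *nodes;

        if (numa_available() == -1)
            return;

        nodes = numa_allocate_nodemask();
        numa_bitmask_setbit(nodes, (unsigned int) node);

        numa_set_membind(nodes);    /* memory only from this node */
        numa_run_on_node(node);     /* run only on this node's CPUs */

        numa_bitmask_free(nodes);
    }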

One of the less obvious reasons (besides more efficient consolidation
of multiple PostgreSQL clusters on a single NUMA server) why I've
implemented '=' and '@' is that CXL memory can apparently be attached
as a CPU-less(!) NUMA node, so Linux - depending on sysctl/sysfs
setup - could use it for automatic memory tiering, and the above
provides a configurable way to prevent allocations landing there on
such (potential) systems - simply exclude that NUMA node via config
for now and we are covered, I think. I have no access to real
hardware, so I haven't researched it further, but it looks like in the
far future we could probably indicate preferred NUMA memory nodes
(think big s_b, bigger than "CPU" RAM) and the kernel could
transparently do NUMA auto-balancing/demotion for us (AKA Transparent
Page Placement, AKA memory tiering), or vice versa: use a small s_b,
do not use the CXL node at all, and expect the VFS cache to be spilled
there.
numa_weighted_interleave_memory() / MPOL_WEIGHTED_INTERLEAVE is not
yet supported in distros (although new libnuma has support for it), so
I have not included it in the patch, as it is too early.

BTW: DO NOT USE meson's --buildtype=debug, as it somehow negates the
benefit of the NUMA optimizations - I've lost hours on that (probably
-O0 is so slow that it wasn't stressing the interconnects enough). The
default, --buildtype=debugoptimized, is good. Also, if testing
performance, first check that the HW has realistic NUMA remote access
distances - e.g. here my remote accesses were 2x or even 3x the local
ones. This is probably only worth testing on multi-socket boxes, which
have really higher latency/throughput limitations, but reports from
single-socket MCM CPUs (with various Nodes-per-Socket BIOS settings)
are welcome too.

kernel 6.14.7 was used with full isolation:
    cpupower frequency-set --governor performance
    cpupower idle-set -D0
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag

max_connections = '10000'
huge_pages = 'on'
wal_level = 'minimal'
wal_buffers = '1024MB'
max_wal_senders = '0'
shared_buffers = '4 GB'
autovacuum = 'off'
max_parallel_workers_per_gather = '0'
numa = 'all'
#numa = 'off'

[1] - https://lwn.net/Articles/897536/

Attachment

Re: NUMA shared memory interleaving

From
Jakub Wartak
Date:
On Fri, Apr 18, 2025 at 7:48 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Thu, Apr 17, 2025 at 01:58:44AM +1200, Thomas Munro wrote:
> > On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak
> > <jakub.wartak@enterprisedb.com> wrote:
> > > 2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
> > > expert on DSA/DSM at all)
> >
> > I have no answers but I have speculated for years about a very
> > specific case (without any idea where to begin due to lack of ... I
> > guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
> > workers split up and try to work on different batches on their own to
> > minimise contention, and when that's not possible (more workers than
> > batches, or finishing their existing work at different times and going
> > to help others), they just proceed in round-robin order.  A beginner
> > thought is: if you're going to help someone working on a hash table,
> > it would surely be best to have the CPUs and all the data on the same
> > NUMA node.  During loading, cache line ping pong would be cheaper, and
> > during probing, it *might* be easier to tune explicit memory prefetch
> > timing that way as it would look more like a single node system with a
> > fixed latency, IDK (I've shared patches for prefetching before that
> > showed pretty decent speedups, and the lack of that feature is
> > probably a bigger problem than any of this stuff, who knows...).
> > Another beginner thought is that the DSA allocator is a source of
> > contention during loading: the dumbest problem is that the chunks are
> > just too small, but it might also be interesting to look into per-node
> > pools.  Or something.   IDK, just some thoughts...
>
> I'm also thinking that could be beneficial for parallel workers. I think the
> ideal scenario would be to have the parallel workers spread across NUMA nodes,
> each accessing its "local" memory first (and helping with "remote" memory
> access if there is still more work to do "remotely").

Hi Bertrand, I've played with CPU pinning of the PQ workers (via
adjusting the postmaster's pinning), but I got quite the opposite
results - please see the attachment, especially "lat"(ency) against
how the CPUs were assigned vs NUMA/s_b when it was not interleaved.
Not that I intend to spend a lot of time researching PQ vs NUMA, but
I've included interleaving of the PQ shm segments too in the v4 patch
in the nearby subthread. The results attached here were made some time
ago with v1 of the patch, where the PQ shm segment was not
interleaved.

If anything, it would be good to hear whether there are any sensible
production-like scenarios/workloads where dynamic_shared_memory_type
should be set to sysv or mmap (instead of the default posix)? Asking
for Linux only - I can't imagine any (?)

-J.

Attachment