From: Tomas Vondra
Subject: Re: Adding basic NUMA awareness
Msg-id: 0e1b997d-99c8-40f4-bc32-6c044bc7ed9a@vondra.me
In response to: Re: Adding basic NUMA awareness (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 1/10/26 02:42, Andres Freund wrote:
> Hi,
> 
> On 2025-12-08 21:02:27 +0100, Tomas Vondra wrote:
>> * Most of the benefit comes from patches unrelated to NUMA. The initial
>> patches partition clocksweep, in a NUMA-oblivious way. In fact, applying
>> the NUMA patches often *reduces* the throughput. So if we're concerned
>> about contention on the clocksweep hand, we could apply just these first
>> patches. That way we wouldn't have to deal with huge pages.
> 
>> * Furthermore, I'm not quite sure clocksweep really is a bottleneck in
>> realistic cases. The benchmark used in this thread does many concurrent
>> sequential scans, on data that exceeds shared buffers / fits into RAM.
>> Perhaps that happens, but I doubt it's all that common.
> 
> I think this misses that this isn't necessarily about peak throughput under
> concurrent contention.  Consider this scenario:
> 
> 1) shared buffers is already allocated from a kernel POV, i.e. pages reside on
>    some numa node instead of having to be allocated on the first access
> 
> 2) one backend does a scan of a relation [largely] not in shared
>    buffers
> 
> Whether the buffers for the ringbuffer (if the relation is > NBuffers/4) or
> for the entire relation (if smaller) are allocated on the same node as the
> backend makes a quite substantial difference.  I see about a 25% difference
> even on a small-ish numa system.
> 
> Partitioned clocksweep makes it vastly more likely that data is on the local
> numa node.
> 
> If you simulate different locality modes with numactl, I can see pretty
> drastic differences for the processing of individual queries, both with
> parallel and non-parallel processing.
> 
> 
> psql -Xq -c 'SELECT pg_buffercache_evict_all();' -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;' && perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'
> 
> membind 0, cpunodebind 1, max_parallel_workers_per_gather=0:
> S0        6        341,635,792      UNC_M_CAS_COUNT.WR               #   4276.9 MB/s  memory_bandwidth_write
> S0       20      5,116,381,542      duration_time
> S0        6        255,977,795      UNC_M_CAS_COUNT.RD               #   3204.6 MB/s  memory_bandwidth_read
> S0       20      5,116,391,355      duration_time
> S1        6          2,418,579      UNC_M_CAS_COUNT.WR               #     30.3 MB/s  memory_bandwidth_write
> S1        6        115,511,123      UNC_M_CAS_COUNT.RD               #   1446.1 MB/s  memory_bandwidth_read
> 
>        5.112286670 seconds time elapsed
> 
> 
> membind 1, cpunodebind 1, max_parallel_workers_per_gather=0:
> S0        6         16,528,154      UNC_M_CAS_COUNT.WR               #    248.1 MB/s  memory_bandwidth_write
> S0       20      4,267,078,201      duration_time
> S0        6         40,327,670      UNC_M_CAS_COUNT.RD               #    605.4 MB/s  memory_bandwidth_read
> S0       20      4,267,088,762      duration_time
> S1        6        116,925,559      UNC_M_CAS_COUNT.WR               #   1755.2 MB/s  memory_bandwidth_write
> S1        6        244,251,242      UNC_M_CAS_COUNT.RD               #   3666.5 MB/s  memory_bandwidth_read
> 
>        4.263442844 seconds time elapsed
> 
> 
> interleave 0,1, cpunodebind 1, max_parallel_workers_per_gather=0:
> 
> S0        6        196,713,044      UNC_M_CAS_COUNT.WR               #   2757.4 MB/s  memory_bandwidth_write
> S0       20      4,569,805,767      duration_time
> S0        6        167,497,804      UNC_M_CAS_COUNT.RD               #   2347.9 MB/s  memory_bandwidth_read
> S0       20      4,569,816,439      duration_time
> S1        6         81,992,696      UNC_M_CAS_COUNT.WR               #   1149.3 MB/s  memory_bandwidth_write
> S1        6        192,265,269      UNC_M_CAS_COUNT.RD               #   2695.1 MB/s  memory_bandwidth_read
> 
>        4.565722468 seconds time elapsed
> 
> 
> membind 0, cpunodebind 1, max_parallel_workers_per_gather=8:
> S0        6        336,538,518      UNC_M_CAS_COUNT.WR               #  24130.2 MB/s  memory_bandwidth_write
> S0       20        895,976,459      duration_time
> S0        6        238,663,716      UNC_M_CAS_COUNT.RD               #  17112.4 MB/s  memory_bandwidth_read
> S0       20        895,986,193      duration_time
> S1        6          2,594,371      UNC_M_CAS_COUNT.WR               #    186.0 MB/s  memory_bandwidth_write
> S1        6        113,981,673      UNC_M_CAS_COUNT.RD               #   8172.6 MB/s  memory_bandwidth_read
> 
>        0.892594989 seconds time elapsed
> 
> 
> membind 1, cpunodebind 1, max_parallel_workers_per_gather=8:
> S0        6          3,492,673      UNC_M_CAS_COUNT.WR               #    322.0 MB/s  memory_bandwidth_write
> S0       20        698,175,650      duration_time
> S0        6          5,363,152      UNC_M_CAS_COUNT.RD               #    494.4 MB/s  memory_bandwidth_read
> S0       20        698,187,522      duration_time
> S1        6        117,181,190      UNC_M_CAS_COUNT.WR               #  10802.4 MB/s  memory_bandwidth_write
> S1        6        251,059,297      UNC_M_CAS_COUNT.RD               #  23144.0 MB/s  memory_bandwidth_read
> 
>        0.694253637 seconds time elapsed
> 
> 
> interleave 0,1, cpunodebind 1, max_parallel_workers_per_gather=8:
> 
> S0        6        170,352,086      UNC_M_CAS_COUNT.WR               #  13767.3 MB/s  memory_bandwidth_write
> S0       20        797,166,139      duration_time
> S0        6        121,646,666      UNC_M_CAS_COUNT.RD               #   9831.1 MB/s  memory_bandwidth_read
> S0       20        797,175,899      duration_time
> S1        6         60,099,863      UNC_M_CAS_COUNT.WR               #   4857.1 MB/s  memory_bandwidth_write
> S1        6        182,035,468      UNC_M_CAS_COUNT.RD               #  14711.5 MB/s  memory_bandwidth_read
> 
>        0.791915733 seconds time elapsed
> 
> 
> 
> You're never going to be quite as good when actually using both NUMA nodes,
> but at least simple workloads like the above should be able to get a lot
> closer to the good number from above than we currently are.
> 

I see no such improvements, unfortunately. Even when I explicitly pin
memory and cpus to different nodes using numactl. Consider a simple
experiment, starting an instance either like this:

numactl --membind=0 --cpunodebind=0 pg_ctl -D /mnt/data/data-numa start

or like this

numactl --membind=0 --cpunodebind=1 pg_ctl -D /mnt/data/data-numa start

on a 2-node NUMA system. To the best of my knowledge this means that
either both the memory and all pg processes (including the backend) are
on node 0, or the memory is on node 0 and the backend is on node 1.
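
To confirm the binding takes effect, the pg_shmem_allocations_numa view
from your example above can be used as a sanity check; with --membind=0
all of the shared memory should be reported on node 0:

  SELECT numa_node, sum(size)
    FROM pg_shmem_allocations_numa
   GROUP BY 1;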

And then I initialized pgbench with a scale that is much larger than
shared buffers, but fits into RAM. So the data is cached, but definitely
larger than NBuffers/4. And then I ran

  select * from pgbench_accounts offset 1000000000;

which does a sequential scan using the circular buffer you mention above.
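
Each timed run was essentially just this (a sketch of the invocation;
the "Time:" lines below are psql's \timing output):

psql -Xq <<'EOF'
\timing on
SELECT * FROM pgbench_accounts OFFSET 1000000000;
EOF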

I've taken all reasonable precautions to stabilize the results, like
enabling huge pages (both for shared memory and binaries), disabling
checksums, and so on. I ran this on an Azure D96v6 instance (EPYC 9V74),
with scale 10000 (~150GB) and shared_buffers=8GB.
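
The relevant non-default settings were roughly this (the results below
are labeled by io_method, and I take the "32" to refer to io_workers):

  shared_buffers = '8GB'   # 4GB on the Xeon machine below
  huge_pages = on
  io_method = worker       # or io_uring for the second set of timings
  io_workers = 32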

And I get this:

worker / 32
-----------

numactl --membind=0 --cpunodebind=0 pg_ctl ...
Time: 26280.437 ms (00:26.280)
Time: 26177.165 ms (00:26.177)
Time: 26182.222 ms (00:26.182)
Time: 26174.421 ms (00:26.174)
Time: 26216.989 ms (00:26.217)

numactl --membind=0 --cpunodebind=1 pg_ctl ...
Time: 26412.878 ms (00:26.413)
Time: 26413.332 ms (00:26.413)
Time: 26202.899 ms (00:26.203)
Time: 26412.627 ms (00:26.413)
Time: 26484.962 ms (00:26.485)

io_uring
--------

numactl --membind=0 --cpunodebind=0 pg_ctl ...
Time: 26286.977 ms (00:26.287)
Time: 26499.830 ms (00:26.500)
Time: 26629.990 ms (00:26.630)
Time: 26443.147 ms (00:26.443)

numactl --membind=0 --cpunodebind=1 pg_ctl ...
Time: 26727.655 ms (00:26.728)
Time: 26787.456 ms (00:26.787)
Time: 26484.260 ms (00:26.484)
Time: 26250.737 ms (00:26.251)
Time: 26208.913 ms (00:26.209)

I don't see any difference. To rule out any virtualization weirdness, I
did the same experiment on my old Xeon machine (also 2-node NUMA), just
with a smaller scale (2000) and shared_buffers=4GB. And that gave me:


xeon scale=2000 nochecksums

worker / 32
-----------

numactl --membind=0 --cpunodebind=0 pg_ctl ...
Time: 5519.728 ms (00:05.520)
Time: 5570.215 ms (00:05.570)
Time: 5568.233 ms (00:05.568)
Time: 5556.465 ms (00:05.556)
Time: 5517.420 ms (00:05.517)

numactl --membind=0 --cpunodebind=1 pg_ctl ...
Time: 5639.281 ms (00:05.639)
Time: 5657.822 ms (00:05.658)
Time: 5653.077 ms (00:05.653)
Time: 5647.780 ms (00:05.648)
Time: 5647.288 ms (00:05.647)

io_uring
--------

numactl --membind=0 --cpunodebind=0 pg_ctl ...
Time: 7517.920 ms (00:07.518)
Time: 7180.628 ms (00:07.181)
Time: 7162.801 ms (00:07.163)
Time: 7164.827 ms (00:07.165)
Time: 7177.757 ms (00:07.178)

numactl --membind=0 --cpunodebind=1 pg_ctl ...
Time: 7622.372 ms (00:07.622)
Time: 7571.923 ms (00:07.572)
Time: 7571.966 ms (00:07.572)
Time: 7568.269 ms (00:07.568)
Time: 7558.195 ms (00:07.558)

If I squint a little, there's a difference for io_uring. But it's not
even 5%, definitely not 25%.

> 
> 
> Maybe the problem is that the patchset doesn't actually quite work right now?
> I checked out numa-20251111 and ran a query for a 1GB table in a 40GB s_b
> system: there's not much more locality with debug_numa=buffers, than without
> (roughly 55% on one node, 45% on the other). Making it not surprising that the
> results aren't great.
> 

Hard to say, but I'd guess that's because of the clocksweep balancing,
which ensures that we don't overload a single NUMA node. Imagine an
instance with a single connection - it can't allocate from a single NUMA
node, because that'd mean it only ever uses 50% of the available buffer
cache, which does not seem great. Maybe there's a better way to address
this.
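
FWIW, the per-node split can be checked with something like this (a
sketch using the pg_buffercache_numa view; it counts OS pages rather
than buffers, since a buffer may span multiple pages):

  SELECT n.numa_node, count(*) AS os_pages
    FROM pg_buffercache b
    JOIN pg_buffercache_numa n USING (bufferid)
   WHERE b.relfilenode = pg_relation_filenode('pgbench_accounts')
   GROUP BY 1 ORDER BY 1;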

> 
> 
>> I've been unable to demonstrate any benefits on other workloads, even if
>> there's a lot of buffer misses / reads into shared buffers. As soon as
>> the query starts doing something else, the clocksweep contention becomes
>> a non-issue. Consider for example read-only pgbench with database much
>> larger than shared buffers (but still within RAM). The cost of the index
>> scans (and other nodes) seems to reduce the pressure on clocksweep.
>>
>> So I'm skeptical about clocksweep pressure being a serious issue, except
>> for some very narrow benchmarks (like the concurrent seqscan test). And
>> even if this happened for some realistic cases, partitioning the buffers
>> in a NUMA-oblivious way seems to do the trick.
> 
> I think you're over-indexing on the contention aspect and under-indexing on
> the locality benefits.
> 

I've been unable to demonstrate meaningful benefits of locality (like in
the example above), while I've been able to show benefits of reducing
the clocksweep contention. It's entirely possible I'm doing it wrong or
missing something, of course.

> 
>> When discussing this stuff off list, it was suggested this might help
>> with the scenario Andres presented in [3], where the throughput improves
>> a lot with multiple databases. I've not observed that in practice, and I
>> don't think these patches really can help with that. That scenario is
>> about buffer lock contention, not clocksweep contention.
> 
> Buffer content and buffer headers being on your local node makes access
> faster...
> 

That was my expectation too, but I haven't seen meaningful improvements
in any benchmark.

For example, in the benchmark I presented earlier, all the memory is on
node 0 (so both headers and buffers), and there does not seem to be any
measurable difference when accessing it from node 0 vs. node 1. So why
would it matter that a header may be on node 0 and its buffer on node 1?

> 
>> Attached is a tiny patch doing mostly what Jakub did, except that it
>> does two things. First, it allows interleaving the shared memory on all
>> relevant NUMA nodes (per numa_get_mems_allowed). Second, it allows
>> populating all memory by setting MAP_POPULATE in mmap(). There's a new
>> GUC to enable each of these.
> 
>> I think we should try this (much simpler) approach first, or something
>> close to it. Sorry for dragging everyone into a much more complex
>> approach, which now seems to be a dead end.
> 
> I'm somewhat doubtful that interleaving is going to be good enough without
> some awareness of which buffers to preferably use. Additionally, without huge
> pages, there are significant negative performance effects due to each buffer
> being split across two numa nodes.
> 

I'm rather skeptical about this being worth it without huge pages. If
you're trying to get the best performance out of a NUMA machine (which
is likely big, with a lot of RAM), then huge pages are a huge improvement
on their own.

I'd even say this NUMA stuff might/should require huge_pages=on.


-- 
Tomas Vondra



