From: Andres Freund
Subject: Re: Adding basic NUMA awareness
Msg-id: uezi46xhhbvdjgdi6wl7iqgfcdh4jmnnyzbfovdcrck6ywqa7j@fj3yimxvekk6
In response to: Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
Hi,

On 2025-12-08 21:02:27 +0100, Tomas Vondra wrote:
> * Most of the benefit comes from patches unrelated to NUMA. The initial
> patches partition clocksweep in a NUMA-oblivious way. In fact, applying
> the NUMA patches often *reduces* the throughput. So if we're concerned
> about contention on the clocksweep hand, we could apply just these first
> patches. That way we wouldn't have to deal with huge pages.

> * Furthermore, I'm not quite sure clocksweep really is a bottleneck in
> realistic cases. The benchmark used in this thread does many concurrent
> sequential scans, on data that exceeds shared buffers / fits into RAM.
> Perhaps that happens, but I doubt it's all that common.

I think this misses that this isn't necessarily about peak throughput under
concurrent contention.  Consider this scenario:

1) shared buffers is already allocated from a kernel POV, i.e. pages reside on
   some numa node instead of having to be allocated on the first access

2) one backend does a scan of a relation [largely] not in shared
   buffers

Whether the buffers for the ring buffer (if the relation is > NBuffers/4) or
for the entire relation (if smaller) are allocated on the same node as the
backend makes quite a substantial difference.  I see about a 25% difference
even on a small-ish numa system.

Partitioned clocksweep makes it vastly more likely that data is on the local
numa node.

Simulating different locality modes with numactl, I can see pretty drastic
differences for the processing of individual queries, both with parallel and
non-parallel execution.


psql -Xq -c 'SELECT pg_buffercache_evict_all();' \
  -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;' && \
perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a \
  psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'
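
(For reference, a rough sketch of how such locality modes can be set up:
start the server under numactl so the postmaster and all backends inherit the
binding; the data directory path is just a placeholder, not the actual setup.)

numactl --cpunodebind=1 --membind=0       postgres -D /path/to/data   # "membind 0, cpunodebind 1"
numactl --cpunodebind=1 --membind=1       postgres -D /path/to/data   # "membind 1, cpunodebind 1"
numactl --cpunodebind=1 --interleave=0,1  postgres -D /path/to/data   # "interleave 0,1, cpunodebind 1"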

membind 0, cpunodebind 1, max_parallel_workers_per_gather=0:
S0        6        341,635,792      UNC_M_CAS_COUNT.WR               #   4276.9 MB/s  memory_bandwidth_write
S0       20      5,116,381,542      duration_time
S0        6        255,977,795      UNC_M_CAS_COUNT.RD               #   3204.6 MB/s  memory_bandwidth_read
S0       20      5,116,391,355      duration_time
S1        6          2,418,579      UNC_M_CAS_COUNT.WR               #     30.3 MB/s  memory_bandwidth_write
S1        6        115,511,123      UNC_M_CAS_COUNT.RD               #   1446.1 MB/s  memory_bandwidth_read

       5.112286670 seconds time elapsed


membind 1, cpunodebind 1, max_parallel_workers_per_gather=0:
S0        6         16,528,154      UNC_M_CAS_COUNT.WR               #    248.1 MB/s  memory_bandwidth_write
S0       20      4,267,078,201      duration_time
S0        6         40,327,670      UNC_M_CAS_COUNT.RD               #    605.4 MB/s  memory_bandwidth_read
S0       20      4,267,088,762      duration_time
S1        6        116,925,559      UNC_M_CAS_COUNT.WR               #   1755.2 MB/s  memory_bandwidth_write
S1        6        244,251,242      UNC_M_CAS_COUNT.RD               #   3666.5 MB/s  memory_bandwidth_read

       4.263442844 seconds time elapsed


interleave 0,1, cpunodebind 1, max_parallel_workers_per_gather=0:

S0        6        196,713,044      UNC_M_CAS_COUNT.WR               #   2757.4 MB/s  memory_bandwidth_write
S0       20      4,569,805,767      duration_time
S0        6        167,497,804      UNC_M_CAS_COUNT.RD               #   2347.9 MB/s  memory_bandwidth_read
S0       20      4,569,816,439      duration_time
S1        6         81,992,696      UNC_M_CAS_COUNT.WR               #   1149.3 MB/s  memory_bandwidth_write
S1        6        192,265,269      UNC_M_CAS_COUNT.RD               #   2695.1 MB/s  memory_bandwidth_read

       4.565722468 seconds time elapsed


membind 0, cpunodebind 1, max_parallel_workers_per_gather=8:
S0        6        336,538,518      UNC_M_CAS_COUNT.WR               #  24130.2 MB/s  memory_bandwidth_write
S0       20        895,976,459      duration_time
S0        6        238,663,716      UNC_M_CAS_COUNT.RD               #  17112.4 MB/s  memory_bandwidth_read
S0       20        895,986,193      duration_time
S1        6          2,594,371      UNC_M_CAS_COUNT.WR               #    186.0 MB/s  memory_bandwidth_write
S1        6        113,981,673      UNC_M_CAS_COUNT.RD               #   8172.6 MB/s  memory_bandwidth_read

       0.892594989 seconds time elapsed


membind 1, cpunodebind 1, max_parallel_workers_per_gather=8:
S0        6          3,492,673      UNC_M_CAS_COUNT.WR               #    322.0 MB/s  memory_bandwidth_write
S0       20        698,175,650      duration_time
S0        6          5,363,152      UNC_M_CAS_COUNT.RD               #    494.4 MB/s  memory_bandwidth_read
S0       20        698,187,522      duration_time
S1        6        117,181,190      UNC_M_CAS_COUNT.WR               #  10802.4 MB/s  memory_bandwidth_write
S1        6        251,059,297      UNC_M_CAS_COUNT.RD               #  23144.0 MB/s  memory_bandwidth_read

       0.694253637 seconds time elapsed


interleave 0,1, cpunodebind 1, max_parallel_workers_per_gather=8:

S0        6        170,352,086      UNC_M_CAS_COUNT.WR               #  13767.3 MB/s  memory_bandwidth_write
S0       20        797,166,139      duration_time
S0        6        121,646,666      UNC_M_CAS_COUNT.RD               #   9831.1 MB/s  memory_bandwidth_read
S0       20        797,175,899      duration_time
S1        6         60,099,863      UNC_M_CAS_COUNT.WR               #   4857.1 MB/s  memory_bandwidth_write
S1        6        182,035,468      UNC_M_CAS_COUNT.RD               #  14711.5 MB/s  memory_bandwidth_read

       0.791915733 seconds time elapsed



You're never going to do quite as well when actually using both NUMA nodes,
but at least simple workloads like the above should be able to get a lot
closer to the best numbers shown here than we currently do.



Maybe the problem is that the patchset doesn't actually quite work right now?
I checked out numa-20251111 and ran a query against a 1GB table on a system
with 40GB of shared_buffers: there's not much more locality with
debug_numa=buffers than without (roughly 55% on one node, 45% on the other),
which makes it unsurprising that the results aren't great.
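
(Locality like that can be checked with something along these lines; a sketch
assuming the pg_buffercache extension's pg_buffercache_numa view, with
'pgbench_accounts' standing in for whatever relation was scanned:)

-- per-node distribution of the OS pages backing one relation's buffers
SELECT n.numa_node, count(*) AS os_pages
FROM pg_buffercache b
JOIN pg_buffercache_numa n USING (bufferid)
WHERE b.relfilenode = pg_relation_filenode('pgbench_accounts')
GROUP BY 1
ORDER BY 1;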



> I've been unable to demonstrate any benefits on other workloads, even if
> there's a lot of buffer misses / reads into shared buffers. As soon as
> the query starts doing something else, the clocksweep contention becomes
> a non-issue. Consider for example read-only pgbench with database much
> larger than shared buffers (but still within RAM). The cost of the index
> scans (and other nodes) seems to reduce the pressure on clocksweep.
>
> So I'm skeptical about clocksweep pressure being a serious issue, except
> for some very narrow benchmarks (like the concurrent seqscan test). And
> even if this happened for some realistic cases, partitioning the buffers
> in a NUMA-oblivious way seems to do the trick.

I think you're over-indexing on the contention aspect and under-indexing on
the locality benefits.


> When discussing this stuff off list, it was suggested this might help
> with the scenario Andres presented in [3], where the throughput improves
> a lot with multiple databases. I've not observed that in practice, and I
> don't think these patches really can help with that. That scenario is
> about buffer lock contention, not clocksweep contention.

Buffer content and buffer headers being on your local node makes access
faster...


> Attached is a tiny patch doing mostly what Jakub did, except that it
> does two things. First, it allows interleaving the shared memory on all
> relevant NUMA nodes (per numa_get_mems_allowed). Second, it allows
> populating all memory by setting MAP_POPULATE in mmap(). There's a new
> GUC to enable each of these.

> I think we should try this (much simpler) approach first, or something
> close to it. Sorry for dragging everyone into a much more complex
> approach, which now seems to be a dead end.

I'm somewhat doubtful that interleaving is going to be good enough without
some awareness of which buffers to preferentially use. Additionally, without
huge pages, there are significant negative performance effects due to each
buffer being split across two numa nodes.
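
(With 4kB base pages an 8kB buffer spans two OS pages, so page-granularity
interleaving can put the two halves on different nodes; with 2MB huge pages a
buffer always sits within a single page. Buffers in that split state can be
counted with something like the query below, again assuming the
pg_buffercache_numa view:)

-- buffers whose backing OS pages are spread across more than one node
SELECT count(*) AS split_buffers
FROM (SELECT bufferid
      FROM pg_buffercache_numa
      GROUP BY bufferid
      HAVING count(DISTINCT numa_node) > 1) s;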

Greetings,

Andres Freund


