Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | uezi46xhhbvdjgdi6wl7iqgfcdh4jmnnyzbfovdcrck6ywqa7j@fj3yimxvekk6 |
| In response to | Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
Hi,
On 2025-12-08 21:02:27 +0100, Tomas Vondra wrote:
> * Most of the benefit comes from patches unrelated to NUMA. The initial
> patches partition clocksweep in a NUMA-oblivious way. In fact, applying
> the NUMA patches often *reduces* the throughput. So if we're concerned
> about contention on the clocksweep hand, we could apply just these first
> patches. That way we wouldn't have to deal with huge pages.
> * Furthermore, I'm not quite sure clocksweep really is a bottleneck in
> realistic cases. The benchmark used in this thread does many concurrent
> sequential scans, on data that exceeds shared buffers / fits into RAM.
> Perhaps that happens, but I doubt it's all that common.
I think this misses that this isn't necessarily about peak throughput under
concurrent contention. Consider this scenario:
1) shared buffers is already allocated from a kernel POV, i.e. pages reside on
some numa node instead of having to be allocated on the first access
2) one backend does a scan of a relation [largely] not in shared
buffers
Whether the buffers for the ringbuffer (if the relation is > NBuffers/4) or
for the entire relation (if smaller) are allocated on the same node as the
backend makes quite a substantial difference. I see about a 25% difference
even on a small-ish numa system.
Partitioned clocksweep makes it vastly more likely that data is on the local
numa node.
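To make the locality argument concrete, here is a rough sketch of how a partitioned clocksweep could prefer node-local buffers. This is not the patchset's code; ClockSweepPartition and ChooseLocalPartition are made-up names for illustration, while sched_getcpu() and libnuma's numa_node_of_cpu() are real APIs (link with -lnuma):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>

/* Hypothetical per-node clocksweep partition, for illustration only. */
typedef struct ClockSweepPartition
{
	int		numa_node;		/* node whose buffers this partition manages */
	int		first_buffer;	/* first buffer id owned by this partition */
	int		num_buffers;	/* number of buffers in this partition */
	int		next_victim;	/* per-partition clock hand */
} ClockSweepPartition;

/*
 * Pick the partition whose buffers live on the same NUMA node as the
 * backend, so ring buffers / newly read relation data end up node-local.
 */
static ClockSweepPartition *
ChooseLocalPartition(ClockSweepPartition *parts, int nparts)
{
	int		cpu = sched_getcpu();			/* CPU we're currently on */
	int		node = numa_node_of_cpu(cpu);	/* its NUMA node */

	for (int i = 0; i < nparts; i++)
	{
		if (parts[i].numa_node == node)
			return &parts[i];
	}
	return &parts[0];	/* no local partition, fall back to any */
}
```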
Simulating different locality modes with numactl, I see pretty drastic
differences for the processing of individual queries, with both parallel and
non-parallel processing.
psql -Xq \
  -c 'SELECT pg_buffercache_evict_all();' \
  -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;' && \
  perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a \
  psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'
membind 0, cpunodebind 1, max_parallel_workers_per_gather=0:
S0 6 341,635,792 UNC_M_CAS_COUNT.WR # 4276.9 MB/s memory_bandwidth_write
S0 20 5,116,381,542 duration_time
S0 6 255,977,795 UNC_M_CAS_COUNT.RD # 3204.6 MB/s memory_bandwidth_read
S0 20 5,116,391,355 duration_time
S1 6 2,418,579 UNC_M_CAS_COUNT.WR # 30.3 MB/s memory_bandwidth_write
S1 6 115,511,123 UNC_M_CAS_COUNT.RD # 1446.1 MB/s memory_bandwidth_read
5.112286670 seconds time elapsed
membind 1, cpunodebind 1, max_parallel_workers_per_gather=0:
S0 6 16,528,154 UNC_M_CAS_COUNT.WR # 248.1 MB/s memory_bandwidth_write
S0 20 4,267,078,201 duration_time
S0 6 40,327,670 UNC_M_CAS_COUNT.RD # 605.4 MB/s memory_bandwidth_read
S0 20 4,267,088,762 duration_time
S1 6 116,925,559 UNC_M_CAS_COUNT.WR # 1755.2 MB/s memory_bandwidth_write
S1 6 244,251,242 UNC_M_CAS_COUNT.RD # 3666.5 MB/s memory_bandwidth_read
4.263442844 seconds time elapsed
interleave 0,1, cpunodebind 1, max_parallel_workers_per_gather=0:
S0 6 196,713,044 UNC_M_CAS_COUNT.WR # 2757.4 MB/s memory_bandwidth_write
S0 20 4,569,805,767 duration_time
S0 6 167,497,804 UNC_M_CAS_COUNT.RD # 2347.9 MB/s memory_bandwidth_read
S0 20 4,569,816,439 duration_time
S1 6 81,992,696 UNC_M_CAS_COUNT.WR # 1149.3 MB/s memory_bandwidth_write
S1 6 192,265,269 UNC_M_CAS_COUNT.RD # 2695.1 MB/s memory_bandwidth_read
4.565722468 seconds time elapsed
membind 0, cpunodebind 1, max_parallel_workers_per_gather=8:
S0 6 336,538,518 UNC_M_CAS_COUNT.WR # 24130.2 MB/s memory_bandwidth_write
S0 20 895,976,459 duration_time
S0 6 238,663,716 UNC_M_CAS_COUNT.RD # 17112.4 MB/s memory_bandwidth_read
S0 20 895,986,193 duration_time
S1 6 2,594,371 UNC_M_CAS_COUNT.WR # 186.0 MB/s memory_bandwidth_write
S1 6 113,981,673 UNC_M_CAS_COUNT.RD # 8172.6 MB/s memory_bandwidth_read
0.892594989 seconds time elapsed
membind 1, cpunodebind 1, max_parallel_workers_per_gather=8:
S0 6 3,492,673 UNC_M_CAS_COUNT.WR # 322.0 MB/s memory_bandwidth_write
S0 20 698,175,650 duration_time
S0 6 5,363,152 UNC_M_CAS_COUNT.RD # 494.4 MB/s memory_bandwidth_read
S0 20 698,187,522 duration_time
S1 6 117,181,190 UNC_M_CAS_COUNT.WR # 10802.4 MB/s memory_bandwidth_write
S1 6 251,059,297 UNC_M_CAS_COUNT.RD # 23144.0 MB/s memory_bandwidth_read
0.694253637 seconds time elapsed
interleave 0,1, cpunodebind 1, max_parallel_workers_per_gather=8:
S0 6 170,352,086 UNC_M_CAS_COUNT.WR # 13767.3 MB/s memory_bandwidth_write
S0 20 797,166,139 duration_time
S0 6 121,646,666 UNC_M_CAS_COUNT.RD # 9831.1 MB/s memory_bandwidth_read
S0 20 797,175,899 duration_time
S1 6 60,099,863 UNC_M_CAS_COUNT.WR # 4857.1 MB/s memory_bandwidth_write
S1 6 182,035,468 UNC_M_CAS_COUNT.RD # 14711.5 MB/s memory_bandwidth_read
0.791915733 seconds time elapsed
You're never going to be quite as good when actually using both NUMA nodes,
but at least simple workloads like the above should be able to get a lot
closer to the good numbers above than we currently are.
Maybe the problem is that the patchset doesn't actually quite work right now?
I checked out numa-20251111 and ran a query for a 1GB table in a 40GB s_b
system: there's not much more locality with debug_numa=buffers than without
(roughly 55% on one node, 45% on the other), which makes it unsurprising that
the results aren't great.
> I've been unable to demonstrate any benefits on other workloads, even if
> there's a lot of buffer misses / reads into shared buffers. As soon as
> the query starts doing something else, the clocksweep contention becomes
> a non-issue. Consider for example read-only pgbench with database much
> larger than shared buffers (but still within RAM). The cost of the index
> scans (and other nodes) seems to reduce the pressure on clocksweep.
>
> So I'm skeptical about clocksweep pressure being a serious issue, except
> for some very narrow benchmarks (like the concurrent seqscan test). And
> even if this happened for some realistic cases, partitioning the buffers
> in a NUMA-oblivious way seems to do the trick.
I think you're over-indexing on the contention aspect and under-indexing on
the locality benefits.
> When discussing this stuff off list, it was suggested this might help
> with the scenario Andres presented in [3], where the throughput improves
> a lot with multiple databases. I've not observed that in practice, and I
> don't think these patches really can help with that. That scenario is
> about buffer lock contention, not clocksweep contention.
Buffer content and buffer headers being on your local node makes access
faster...
> Attached is a tiny patch doing mostly what Jakub did, except that it
> does two things. First, it allows interleaving the shared memory on all
> relevant NUMA nodes (per numa_get_mems_allowed). Second, it allows
> populating all memory by setting MAP_POPULATE in mmap(). There's a new
> GUC to enable each of these.
> I think we should try this (much simpler) approach first, or something
> close to it. Sorry for dragging everyone into a much more complex
> approach, which now seems to be a dead end.
I'm somewhat doubtful that interleaving is going to be good enough without
some awareness of which buffers to preferably use. Additionally, without huge
pages, there are significant negative performance effects due to each buffer
being split across two numa nodes.
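For reference, the interleave + prefault mechanism under discussion boils down to roughly the following. This is just a minimal sketch using the real libnuma/mmap APIs, not Tomas's actual patch; I prefault with memset() rather than MAP_POPULATE so the interleave policy is already in place when the pages get allocated:

```c
#include <string.h>
#include <sys/mman.h>
#include <numa.h>

/*
 * Map an anonymous shared memory segment, interleave it across all NUMA
 * nodes we're allowed to allocate on, and prefault it.  Link with -lnuma.
 */
static void *
map_interleaved_shmem(size_t shmem_size)
{
	void	   *base = mmap(NULL, shmem_size,
							PROT_READ | PROT_WRITE,
							MAP_SHARED | MAP_ANONYMOUS,
							-1, 0);

	if (base == MAP_FAILED)
		return NULL;

	/* Spread pages round-robin over the allowed nodes (MPOL_INTERLEAVE). */
	numa_interleave_memory(base, shmem_size, numa_get_mems_allowed());

	/* Prefault so allocation cost isn't paid on first buffer access. */
	memset(base, 0, shmem_size);

	/*
	 * Caveat: with 4kB OS pages and 8kB buffers, page-granular interleaving
	 * splits every buffer across two nodes, which is one reason huge pages
	 * matter for this approach.
	 */
	return base;
}
```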
Greetings,
Andres Freund