Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
| From | Jakub Wartak |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | CAKZiRmxwN+qMpbijCLPix_y6mwSjgus2CPPj=1+uFo9fQG-Knw@mail.gmail.com |
| In response to | Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
On Wed, Nov 26, 2025 at 5:19 PM Tomas Vondra <tomas@vondra.me> wrote:
> Rebased patch series attached.

Thanks. BTW, still with the old patchset series: one additional thing I've found related to interleaving is that in CreateAnonymousSegment(), with the default debug_numa='', we still issue numa_interleave_memory(ptr, ...). It should be optional (this also affects the earlier calls too). Tiny patch attached.

> I think the MAP_POPULATE should be optional, enabled by GUC.

OK, but you mean it as a new option to debug_numa, right (not some separate GUC)? So debug_numa='prefault' then?
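To make it concrete, here is a minimal standalone sketch of the behaviour I'd expect (this is not the attached patch and not the patchset's actual code): both the MAP_POPULATE prefaulting and the numa_interleave_memory() call are gated on bits parsed from debug_numa. The NUMA_PREFAULT/NUMA_INTERLEAVE names and the simplified create_anon_segment() are assumptions for illustration; in the patchset this logic would live in CreateAnonymousSegment() in sysv_shmem.c.

```c
/*
 * Sketch only: with debug_numa='' (the default) we should neither prefault
 * nor touch the NUMA policy of the shared memory segment. The flag names
 * below are hypothetical, standing in for whatever debug_numa parses into.
 */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <numa.h>

#define NUMA_INTERLEAVE		0x01	/* e.g. debug_numa='buffers,...' */
#define NUMA_PREFAULT		0x02	/* e.g. debug_numa='prefault' */

static int	numa_flags = 0;		/* would be filled in by GUC parsing */

static void *
create_anon_segment(size_t size)
{
	int			mmap_flags = MAP_SHARED | MAP_ANONYMOUS;
	void	   *ptr;

	/* Pre-touch the whole segment only when explicitly requested. */
	if (numa_flags & NUMA_PREFAULT)
		mmap_flags |= MAP_POPULATE;

	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, mmap_flags, -1, 0);
	if (ptr == MAP_FAILED)
		return NULL;

	/* With debug_numa='' this block is skipped entirely. */
	if ((numa_flags & NUMA_INTERLEAVE) && numa_available() != -1)
		numa_interleave_memory(ptr, size, numa_all_nodes_ptr);

	return ptr;
}
```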
> > I would consider everything +/- 3% as noise (technically each branch
> > was a different compilation/ELF binary, as changing this #define
> > required doing so to get 4 vs 16; please see attached script). I miss
> > the explanation why without HP it deteriorates so much for c=1024
> > with the patches.
>
> I wouldn't expect a big difference for "pgbench -S". That workload has
> so much other fairly expensive stuff (e.g. initializing index scans
> etc.), the cost of buffer replacement is going to be fairly limited.

Right. OK, so I've got the seqconcurrentscans comparison done right, that is, prewarmed rather than naturally filled:

@master, 29GB/s memory bandwidth
latency average = 1255.572 ms
latency stddev = 417.162 ms
tps = 50.451925 (without initial connection time)

@v20251121 patchset, 41GB/s (~10GB/s per socket)
latency average = 719.931 ms
latency stddev = 14.874 ms
tps = 88.362091 (without initial connection time)

The main PMC difference seems to be much lower "backend cycles idle" (51% on master vs 31% with debug_numa='buffers,procs'), so it waits less on memory, which is where the speedup and the better IPC come from.

Anyway, the biggest gripe right now (at least to me) is reliable benchmarking. The runs below are all apples-and-oranges comparisons (they measure different things even though they look the same at first):
- restart and just select from pg_shmem_allocations_numa, or prewarm: with debug_numa='' this puts everything onto one NUMA node, because of the prefaulting that happens while selecting from the view
- restart and pgbench -i -s XX (same issue as above), then pgbench: you get the same thing, potentially everything on one NUMA node (because pgbench prefaults on just one)
- restart and pgbench -c 64 ... with debug_numa='' (off) MIGHT get a random NUMA layout; how is that supposed to be deterministic? At least with debug_numa='buffers' you get determinism.
- shared_buffers size vs the size of the dataset being read: the moment you start doing something CPU-intensive (or e.g. calling syscalls just for the VFS cache), the benefit seems to disappear, at least on my hardware

Anyway, depending on the scenario I could get results varying from 34 tps to 88 tps here. debug_numa='buffers,...' just gives assurance that the proper layout of shared memory is there (one could even argue that such performance deviations across runs are a bug ;)).

> The regressions for numa/pgproc patches with 1024 clients are annoying,
> but how realistic is such scenario? With 32/64 CPUs, having 1024 active
> connections is a substantial overload. If we can fix this, great. But I
> think such regression may be OK if we get benefits for reasonable setups
> (with fewer clients).
>
> I don't know why it's happening, though. I haven't been testing cases
> with so many clients (compared to the number of CPUs).

The only thing that comes to my mind about the deterioration at high connection counts (AKA the -c 1024 scenario) with pgprocs is the question you raised in 0007: "Note: The scheduler may migrate the process to a different CPU/node later. Maybe we should consider pinning the process to the node?" I think the answer is yes: fetch MyProc based on sched_getcpu(), and then maybe, guarded by numa_flags & a new PROCS_PIN_NODE, simply numa_run_on_node(node)? (A sketch of what I tried is included below.) I've tried this:

pgbench -c 1024 -j 64 -P 1 -T 30 -S -M prepared

and got:

@numa-empty-debug_numa         ~434k TPS, ~12k CPU migrations/second
@numa+buffers+pgproc           ~412k TPS, 7-8k CPU migrations/second
@numa+buffers+pgproc+pinnode   ~434k TPS, still with 7-8k CPU migrations/second (so the same)

but for the last one I verified with bpftrace on tracepoint:sched:sched_migrate_task that postgres itself no longer performed node-to-node process bounces with this numa_run_on_node() (pgbench still did, but not postgres).
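For completeness, the pinnode variant I tried is roughly the following. This is only a sketch, not the actual diff: PROCS_PIN_NODE is the new flag proposed above, numa_flags stands in for however the patchset parses debug_numa, and pin_backend_to_current_node() is an illustrative name; the real call would sit where the backend picks its node-local PGPROC.

```c
/*
 * Sketch only, not the actual patch: once a backend has picked its PGPROC
 * based on the CPU it starts on, optionally restrict it to that node so
 * the scheduler cannot bounce it to another socket later.
 */
#define _GNU_SOURCE
#include <sched.h>				/* sched_getcpu() */
#include <numa.h>				/* numa_node_of_cpu(), numa_run_on_node() */

#define PROCS_PIN_NODE	0x04	/* hypothetical debug_numa bit */

static int	numa_flags = PROCS_PIN_NODE;	/* would come from the GUC */

static void
pin_backend_to_current_node(void)
{
	int			cpu;
	int			node;

	if (numa_available() == -1 || !(numa_flags & PROCS_PIN_NODE))
		return;

	cpu = sched_getcpu();
	if (cpu < 0)
		return;

	/* The node whose PGPROC pool this backend just picked. */
	node = numa_node_of_cpu(cpu);

	/*
	 * Narrow the CPU affinity mask to that node's CPUs. This does not
	 * migrate any memory; it only prevents future cross-node migrations.
	 */
	if (node >= 0)
		(void) numa_run_on_node(node);
}
```

Note that numa_run_on_node() only restricts scheduling; whether the backend's private allocations should also get an explicit memory policy is a separate question.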
> > scenario II: pgbench -f seqconcurrscans.pgb; 64 partitions from
> > pgbench --partitions=64 -i -s 2000 [~29GB] being hammered in modulo [..]
> Hmmm. I'd have expected better results for this workload. So I tried
> re-running my seqscan benchmark on the 176-core instance, and I got this: [..]

Thanks!

> I did the benchmark for individual parts of the patch series. There's a
> clear (~20%) speedup for 0005, but 0006 and 0007 make it go away. The
> 0002/0003 regress it quite a bit. And with 128 clients there's no
> improvement at all. [..]
> Those are clearly much better results, so I guess the default number of
> partitions may be too low.
>
> What bothers me is that this seems like a very narrow benchmark. I mean,
> few systems are doing concurrent seqscans putting this much pressure on
> buffer replacement. And once the plans start to do other stuff, the
> contention on clock sweep seems to go down substantially (as shown by
> the read-only pgbench). So the question is - is this really worth it?

Are you thinking here about the whole NUMA patchset or just the clock sweep? I think the multiple clock sweeps are just not shining yet because other bottlenecks eat the efficiency gain here. Andres talks about exactly this at https://youtu.be/V75KpACdl6E?t=1990 (he mentions out-of-order execution; I see btrees as top #1 in the reports). So maybe it's just too early to see the results of this optimization?

As for classic read-only pgbench -S, I still see roughly a 1:8 local-to-remote (!) DRAM access ratio (1 <-> 3 sockets) even with those patches, so something could potentially be improved in the far future for sure (that would require monitoring which memory addresses take the most remote DRAM misses and mapping them back to PG shared-memory pointers; think pg_shmem_allocations_numa with local/remote counters, or maybe just falling back to perf c2c).

To sum up, IMHO I understand this $thread's NUMA implementation as:
- it's strictly a guard mechanism to get determinism (for most cases) -- it fixes the "imbalance"
- no performance boost for OLTP as such
- for analytics it could be a win (in-memory workloads; well, PG is not fully built for this, but it could be one day, or already is with 3rd-party TAMs and extensions), and:
-- we can provide a performance jump for concurrent-seqscan or memory-fitting workloads (the patchset does this already). Note: I think PG will eventually get into such workload classes in the longer run; we are just ahead with NUMA, while PG still lacks proper vectorized executor stuff.
-- we could further enhance PQ here: the leader and PQ workers would stick to the same NUMA node with some affinity (see the earlier measurements for this in the thread [1]); we could have a session GUC to enable this for planned big whole-NUMA PQ SELECTs; this would probably be done close to dsm_impl_posix()
- new idea: we could allow binding tables (or tablespaces) to NUMA nodes, or make it a per-user toggle too while we are at it (imagine HTAP-like workloads: NUMA node #0 for OLTP, node #1 for analytics). Sounds cool, rather easy, and has a valid use, but I don't know if it would really be useful?

Way out of scope:
- the superlocking of btrees that Andres mentioned in his presentation

-J.

[1] - https://www.postgresql.org/message-id/attachment/178120/NUMA_pq_cpu_pinning_results.txt