Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
| From | Jakub Wartak |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | CAKZiRmwPVxi1H23pNZ4_Vc=mtMaNgY1z79s6SwjuUZD3EaOPeA@mail.gmail.com |
| In response to | Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
Hi Tomas!
[..]
> Which I think is mostly the same thing you're saying, and you have the maps to support it.
Right, the thread is getting long; you were right back then, but at
least now we've got a solid explanation with data.
> Here's an updated version of the patch series.
Just to double confirm: I've used those (v20251121*) and they indeed
interleaved the relevant parts of shared memory.
> It fixes a bunch of issues in pg_buffercache_pages.c - duplicate attnums
> and an incorrect array length.
You'll need to rebase again; pg_buffercache_numa was updated again on
Monday and clashes with 0006.
> The main change is in 0006 - it sets the default allocation policy for
> shmem to interleaving, before doing the explicit partitioning for shared
> buffers. It does it by calling numa_set_membind before the mmap(), and
> then numa_interleave_memory() on the allocated shmem. It does this to
> allow using MAP_POPULATE - but that's commented out by default.
>
> This does seem to solve the SIGBUS failures for me. I still think there
> might be a small chance of hitting that, because of locating an extra
> "boundary" page on one of the nodes. But it should be solvable by
> reserving a couple more pages.
I can confirm: I never got any SIGBUS during the benchmarks described
below, so it's much better now.
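(To make sure we're talking about the same thing, here's a minimal
sketch of that allocation order using plain libnuma calls; this is not
the actual 0006 code, and names/sizes are illustrative only:)

#include <numa.h>
#include <sys/mman.h>
#include <stddef.h>

/* Sketch only: interleave an anonymous shared mapping across all nodes. */
static void *
alloc_interleaved_shmem(size_t size)
{
    void   *ptr;

    if (numa_available() < 0)
        return NULL;            /* no NUMA support */

    /* set the default allocation policy before the mmap(), as above */
    numa_set_membind(numa_all_nodes_ptr);

    /* MAP_POPULATE would prefault the whole region here; the patch
     * keeps it commented out by default */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS /* | MAP_POPULATE */, -1, 0);
    if (ptr == MAP_FAILED)
        return NULL;

    /* then interleave the allocated shmem across all nodes */
    numa_interleave_memory(ptr, size, numa_all_nodes_ptr);
    return ptr;
}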
> Jakub, what do you think?
On one side, not using MAP_POPULATE gives instant startup; on the
other, using it gives much more predictable latencies, especially
fresh after starting up (this might matter to folks who like to
benchmark -- us? -- but initially I've just used it as a simple hack
to touch memory). I would be wary of using MAP_POPULATE with s_b
sized in the hundreds of GBs: startup could take minutes, which would
be terrible if someone hit a SIGSEGV in production and expected
restart_after_crash=true to save them. I mean, a WAL redo crash would
be terrible, but that would be terrible * 2. Also, pretty long-term
with DIO we'll (hopefully) get much bigger s_b anyway, so it would
hurt even more, so I think that would be a bad path(?)
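(For completeness, the non-MAP_POPULATE way to get the same effect
would be touching the pages explicitly after mmap(), which at least
could be done incrementally or throttled; a sketch with made-up names,
assuming a freshly mapped, still zero-filled region:)

#include <unistd.h>
#include <stddef.h>

/* Sketch only: prefault a freshly mmap()ed (still zero-filled) region
 * without MAP_POPULATE, so pages land on nodes according to the
 * current NUMA policy; could be chunked to avoid blocking startup. */
static void
prefault_region(char *base, size_t size)
{
    size_t      pagesize = (size_t) sysconf(_SC_PAGESIZE);
    size_t      off;

    for (off = 0; off < size; off += pagesize)
        base[off] = 0;          /* write fault allocates the page */
}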
I've benchmarked this in two scenarios (a read-only pgbench with the
data set smaller than s_b, across code variations and connection
counts, and a second one with concurrent sequential scans) under
solid, stable conditions: 4s32c64t == 4 NUMA nodes, 128GB RAM, 31GB
shared_buffers, DB size ~29GB, kernel 6.14.x, no idle CPU states, no
turbo boost, and so on; literally a great home heater when it's -3C
outside!
The data uses master as the "100%" baseline, separately for HP (huge
pages) on/off, i.e. each row shows the % difference from master with
the same HP setting:
scenario I: pgbench -S

                                connections
branch      HP        1        8       64      128     1024
master      off  100.00%  100.00%  100.00%  100.00%  100.00%
master      on   100.00%  100.00%  100.00%  100.00%  100.00%
numa16      off   99.13%  100.46%   99.66%   99.44%   89.60%
numa16      on   101.80%  100.89%   99.36%   99.89%   93.43%
numa4       off   96.82%  100.61%   99.37%   99.92%   94.41%
numa4       on   101.83%  100.61%   99.35%   99.69%  101.48%
pgproc16    off   99.13%  100.84%   99.38%   99.85%   91.15%
pgproc16    on   101.72%  101.40%   99.72%  100.14%   95.20%
pgproc4     off   98.63%  101.44%  100.05%  100.14%   90.97%
pgproc4     on   101.05%  101.46%   99.92%  100.31%   97.60%
sweep16     off   99.53%  101.14%  100.71%  100.75%  101.52%
sweep16     on    97.63%  102.49%  100.42%  100.75%  105.56%
sweep4      off   99.43%  101.59%  100.06%  100.45%  104.63%
sweep4      on    97.69%  101.59%  100.70%  100.69%  104.70%
I would consider anything within +/- 3% to be noise (technically each
branch was a different compilation/ELF binary, since getting 4 vs 16
required changing that #define and rebuilding; please see the attached
script). What I'm still missing is an explanation for why, without HP,
it deteriorates so much at c=1024 with the patches.
scenario II: pgbench -f seqconcurrscans.pgb; 64 partitions created by
pgbench --partitions=64 -i -s 2000 [~29GB], hammered modulo the
client_id, without parallel query, by:
\set num (:client_id % 8) + 1
select sum(octet_length(filler)) from pgbench_accounts_:num;
                                connections
branch      HP        1        8       64      128
master      off  100.00%  100.00%  100.00%  100.00%
master      on   100.00%  100.00%  100.00%  100.00%
numa16      off  115.62%  108.87%  101.08%  111.56%
numa16      on   107.68%  104.90%  102.98%  105.51%
numa4       off  113.55%  111.41%  101.45%  113.10%
numa4       on   107.90%  106.60%  103.68%  106.98%
pgproc16    off  111.70%  108.27%   98.69%  109.36%
pgproc16    on   106.98%  100.69%  101.98%  103.42%
pgproc4     off  112.41%  106.15%  100.03%  112.03%
pgproc4     on   106.73%  105.77%  103.74%  101.13%
sweep16     off  100.63%  100.38%   98.41%  103.46%
sweep16     on   109.03%   99.15%  101.17%   99.19%
sweep4      off  102.04%  101.16%  101.71%   91.86%
sweep4      on   108.33%  101.69%   97.14%  100.92%
The benefit varies, roughly +3-10% depending on connection count.
Quite frankly I was expecting a bit more, especially after re-reading
[1]. Maybe you preloaded the data there using pg_prewarm? (Here I've
warmed it randomly using pgbench.) It's probably something with my
test; I'll take yet another look, hopefully soon. The good thing is
that it never crashed, and I haven't seen any "Bad address" errors
(probably related to AIO) like you saw in [1]; perhaps because I
wasn't using io_uring.
0007 (PROCs) still complains with "mbind: Invalid argument" (alignment issue)
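(If it really is alignment, my guess is the per-node range just needs
rounding to page boundaries before the bind; a sketch only, assuming
libnuma and a made-up helper name, not the patch code:)

#include <numa.h>
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch only: mbind() (which numa_tonode_memory() wraps) returns
 * EINVAL unless the address is page-aligned, so round the range
 * outwards to page boundaries first.  Note this also moves whatever
 * else happens to share those boundary pages. */
static void
bind_range_to_node(void *start, size_t len, int node)
{
    uintptr_t   pagesize = (uintptr_t) sysconf(_SC_PAGESIZE);
    uintptr_t   beg = (uintptr_t) start & ~(pagesize - 1);
    uintptr_t   end = ((uintptr_t) start + len + pagesize - 1) & ~(pagesize - 1);

    numa_tonode_memory((void *) beg, (size_t) (end - beg), node);
}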
-J.
[1] - https://www.postgresql.org/message-id/e4d7e6fc-b5c5-4288-991c-56219db2edd5%40vondra.me