
From: Jakub Wartak
Subject: Re: Adding basic NUMA awareness
Date:
Msg-id: CAKZiRmwPVxi1H23pNZ4_Vc=mtMaNgY1z79s6SwjuUZD3EaOPeA@mail.gmail.com
In response to: Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
Hi Tomas!

[..]
> Which I think is mostly the same thing you're saying, and you have the maps to support it.

Right, the thread is kind of long; you were right back then, but at
least now we've got a solid explanation backed by data.

> Here's an updated version of the patch series.

Just to double-confirm: I've used those (v20251121*) and they indeed
interleaved parts of the shm memory.

> It fixes a bunch of issues in pg_buffercache_pages.c - duplicate attnums
> and an incorrect array length.

You'll need to rebase again; pg_buffercache_numa got updated again on
Monday and clashes with 0006.

> The main change is in 0006 - it sets the default allocation policy for
> shmem to interleaving, before doing the explicit partitioning for shared
> buffers. It does it by calling numa_set_membind before the mmap(), and
> then numa_interleave_memory() on the allocated shmem. It does this to
> allow using MAP_POPULATE - but that's commented out by default.
>
> This does seem to solve the SIGBUS failures for me. I still think there
> might be a small chance of hitting that, because of locating an extra
> "boundary" page on one of the nodes. But it should be solvable by
> reserving a couple more pages.

I can confirm: I never got any SIGBUS during the benchmarks described
below, so it's much better now.
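
For anyone skimming the thread, my mental model of 0006's allocation
order is roughly the sketch below (the function name and structure are
mine, not the patch's code):

    #include <numa.h>
    #include <sys/mman.h>

    /* Rough sketch of the order described above: bind, mmap, interleave. */
    static void *
    shmem_map_interleaved(size_t size)
    {
        if (numa_available() < 0)
            return MAP_FAILED;      /* no NUMA support; caller falls back */

        /* 1. bind future allocations to all nodes, before the mmap() */
        numa_set_membind(numa_all_nodes_ptr);

        /* 2. map the segment; MAP_POPULATE would pre-fault everything up
         *    front, but it stays commented out by default in the patch */
        void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS /* | MAP_POPULATE */,
                         -1, 0);
        if (ptr == MAP_FAILED)
            return ptr;

        /* 3. interleave the new segment's pages across all NUMA nodes */
        numa_interleave_memory(ptr, size, numa_all_nodes_ptr);

        return ptr;
    }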

> Jakub, what do you think?

On one side, not using MAP_POPULATE gives instant startup; on the
other, using it gives much better latency predictability, especially
fresh after starting up (this might matter to folks who like to
benchmark, i.e. us?, but initially I just used it as a simple hack to
touch memory). I would be wary of using MAP_POPULATE when s_b is sized
in hundreds of GBs: startup could take minutes, which would be
terrible if someone hit a SIGSEGV in production and expected
restart_after_crash=true to save them. I mean, a WAL redo crash would
be terrible on its own, but that would be terrible * 2. Also, pretty
long-term with DIO we'll get much bigger s_b anyway (hopefully), so it
would hurt even more, so I think that would be a bad path(?)
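
(To be clear, the "simple hack to touch memory" is nothing more than
pre-faulting the fresh mapping after startup instead of paying for it
inside mmap(); a trivial sketch, not code from the patch:)

    #include <unistd.h>

    /* Write one byte per page so the kernel actually allocates it; on a
     * freshly mmap()ed (zeroed) segment this is harmless and does the
     * same work MAP_POPULATE would have done at mmap() time. */
    static void
    prefault_segment(char *base, size_t size)
    {
        size_t pagesz = (size_t) sysconf(_SC_PAGESIZE);

        for (size_t off = 0; off < size; off += pagesz)
            base[off] = 0;
    }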

I've benchmarked the thing in two scenarios (a read-only pgbench with
the dataset < s_b size, across variations of code and connection
counts, and a second one with concurrent seq scans) under solid,
stable conditions: 4s32c64t == 4 NUMA nodes, 128GB RAM, 31GB
shared_buffers, dbsize ~29GB, kernel 6.14.x, no idle CPU states, no
turbo boost, and so on (literally a great home heater when it's -3C
outside!)

The data uses master as the "100%" baseline, separately for HP (huge
pages) on and off (i.e. each row shows the % difference from master
with the same HP setting):

scenario I: pgbench -S

                 connections
branch   HP      1       8       64      128     1024
master   off     100.00% 100.00% 100.00% 100.00% 100.00%
master   on      100.00% 100.00% 100.00% 100.00% 100.00%
numa16   off     99.13%  100.46% 99.66%  99.44%  89.60%
numa16   on      101.80% 100.89% 99.36%  99.89%  93.43%
numa4    off     96.82%  100.61% 99.37%  99.92%  94.41%
numa4    on      101.83% 100.61% 99.35%  99.69%  101.48%
pgproc16 off     99.13%  100.84% 99.38%  99.85%  91.15%
pgproc16 on      101.72% 101.40% 99.72%  100.14% 95.20%
pgproc4  off     98.63%  101.44% 100.05% 100.14% 90.97%
pgproc4  on      101.05% 101.46% 99.92%  100.31% 97.60%
sweep16  off     99.53%  101.14% 100.71% 100.75% 101.52%
sweep16  on      97.63%  102.49% 100.42% 100.75% 105.56%
sweep4   off     99.43%  101.59% 100.06% 100.45% 104.63%
sweep4   on      97.69%  101.59% 100.70% 100.69% 104.70%

I would consider everything within +/- 3% to be noise (technically
each branch was a different compilation/ELF binary, as getting 4 vs 16
required changing a #define and recompiling; please see the attached
script). What I'm missing is an explanation for why, without HP, it
deteriorates so much at c=1024 with the patches.

scenario II: pgbench -f seqconcurrscans.pgb; 64 partitions from
pgbench --partitions=64 -i -s 2000 [~29GB], being hammered modulo the
client id, without PQ (parallel query), by:
    \set num (:client_id % 8) + 1
    select sum(octet_length(filler)) from pgbench_accounts_:num;

                 connections
branch   HP      1       8       64      128
master   off     100.00% 100.00% 100.00% 100.00%
master   on      100.00% 100.00% 100.00% 100.00%
numa16   off     115.62% 108.87% 101.08% 111.56%
numa16   on      107.68% 104.90% 102.98% 105.51%
numa4    off     113.55% 111.41% 101.45% 113.10%
numa4    on      107.90% 106.60% 103.68% 106.98%
pgproc16 off     111.70% 108.27% 98.69%  109.36%
pgproc16 on      106.98% 100.69% 101.98% 103.42%
pgproc4  off     112.41% 106.15% 100.03% 112.03%
pgproc4  on      106.73% 105.77% 103.74% 101.13%
sweep16  off     100.63% 100.38% 98.41%  103.46%
sweep16  on      109.03% 99.15%  101.17% 99.19%
sweep4   off     102.04% 101.16% 101.71% 91.86%
sweep4   on      108.33% 101.69% 97.14%  100.92%

The benefit varies, roughly +3-10% depending on the connection count.
Quite frankly I was expecting a little bit more, especially after
re-reading [1]. Maybe you preloaded it there using pg_prewarm? (Here
I've warmed it randomly using pgbench.) Probably it's something with
my test; I'll take yet another look, hopefully soon. The good thing is
that it never crashed, and I haven't seen any "Bad address" errors
(probably AIO-related) like you saw in [1]; perhaps that's because I
wasn't using io_uring.

0007 (PROCs) still complains with "mbind: Invalid argument" (alignment issue)
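
My guess (and it is only a guess) is that the start address or length
handed to mbind() isn't a multiple of the page size (or of the huge
page size, when the segment uses huge pages), since mbind() returns
EINVAL for unaligned ranges. Something along the lines of the sketch
below would then be needed before binding the per-node PGPROC chunks;
this is my illustration, not code from 0007:

    #include <numaif.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Round an arbitrary range out to page boundaries before mbind(),
     * which otherwise fails with EINVAL (illustration only). */
    static long
    bind_range_to_node(void *start, size_t len, int node)
    {
        size_t        pagesz = (size_t) sysconf(_SC_PAGESIZE);
        unsigned long mask = 1UL << node;

        uintptr_t first = (uintptr_t) start & ~(uintptr_t) (pagesz - 1);
        uintptr_t last = ((uintptr_t) start + len + pagesz - 1)
                         & ~(uintptr_t) (pagesz - 1);

        return mbind((void *) first, last - first, MPOL_BIND,
                     &mask, sizeof(mask) * 8, 0);
    }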

-J.

[1] - https://www.postgresql.org/message-id/e4d7e6fc-b5c5-4288-991c-56219db2edd5%40vondra.me

