Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Adding basic NUMA awareness
Date
Msg-id ae031f63-4f0b-47a0-ad15-134e4ec677b6@vondra.me
In response to Re: Adding basic NUMA awareness  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List pgsql-hackers

On 11/4/25 13:10, Jakub Wartak wrote:
> On Fri, Oct 31, 2025 at 12:57 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi,
>>
>> here's a significantly reworked version of this patch series.
>>
>> I had a couple discussions about these patches at pgconf.eu last week,[..]
> 
> I've just had a quick look at this and oh, my, I've started getting
> into this partitioned clocksweep and that's ambitious! Yes, this
> sequencing of patches makes it much more understandable. Anyway I've
> spotted some things, attempted to fix some and have some basic
> questions too (so small baby steps, all of this was on 4s/4 NUMA nodes
> with HP on) -- the 000X refers to question/issue/bug in specific
> patchset file:
> 
> 0001: you mention 'debug_numa = buffers' in commitmsg, but there's
> nothing there like that? it comes with 0006
> 

Right, I forgot to remove that reference.

> 0002: dunno, but wouldn't it make some educational/debugging sense to
> add a debug function returning clocksweep partition index
> (calculate_partition_index) for backend? (so that we know which
> partition we are working on right now)
> 

Perhaps. I didn't need that, but it might be interesting during
development. I probably would not keep that in the final version.

> 0003: those two "elog(INFO, "rebalance skipped:" should be at DEBUG2+
> IMHO (they are way too verbose during runs)
> 

Agreed.

> 0006a: Needs update - s/patches later in the patch series/patches
> earlier in the patch series/
> 

Agreed.

> 0006b: IMHO longer term, we should hide some complexity of those calls
> via src/port numa shims (pg_numa_sched_cpu()?)
> 

Yeah, there's definitely room for moving more of the code to src/port.

> 0006c: after GUC commit fce7c73fba4e5, apply complains with:
> error: patch failed: src/backend/utils/misc/guc_parameters.dat:906
> error: src/backend/utils/misc/guc_parameters.dat: patch does not apply
> 

Will fix.

> 0007a: pg_buffercache_pgproc returns pgproc_ptr and fastpath_ptr in
> bigint and not hex? I've wanted to adjust that to TEXTOID, but instead
> I've thought it is going to be simpler to use to_hex() -- see 0009
> attached.
> 

I don't know. I added it simply because it might be useful for development,
but we probably don't want to expose these pointers at all.

> 0007b: pg_buffercache_pgproc -- nitpick, but maybe it would be better
> called pg_shm_pgproc?
> 

Right. It does not belong in pg_buffercache at all; I just added it
there because I was already messing with that code.

> 0007c with check_numa='buffers,procs' throws 'mbind Invalid argument'
> during start:
> 
>     2025-11-04 10:02:27.055 CET [58464] DEBUG:  NUMA:
> pgproc_init_partition procs 0x7f8d30400000 endptr 0x7f8d30800000
> num_procs 2523 node 0
>     2025-11-04 10:02:27.057 CET [58464] DEBUG:  NUMA:
> pgproc_init_partition procs 0x7f8d30800000 endptr 0x7f8d30c00000
> num_procs 2523 node 1
>     2025-11-04 10:02:27.059 CET [58464] DEBUG:  NUMA:
> pgproc_init_partition procs 0x7f8d30c00000 endptr 0x7f8d31000000
> num_procs 2523 node 2
>     2025-11-04 10:02:27.061 CET [58464] DEBUG:  NUMA:
> pgproc_init_partition procs 0x7f8d31000000 endptr 0x7f8d31400000
> num_procs 2523 node 3
>     2025-11-04 10:02:27.062 CET [58464] DEBUG:  NUMA:
> pgproc_init_partition procs 0x7f8d31400000 endptr 0x7f8d31407cb0
> num_procs 38 node -1
>     mbind: Invalid argument
>     mbind: Invalid argument
>     mbind: Invalid argument
>     mbind: Invalid argument
> 

I'll take a look, but I don't recall seeing such errors.

> 0007d: so we probably need numa_warn()/numa_error() wrappers (this was
> initially part of NUMA observability patches but got removed during
> the course of action), I'm attaching 0008. With that you'll get
> something a little more up to our standards:
>     2025-11-04 10:27:07.140 CET [59696] DEBUG:
> fastpath_parititon_init node = 3, ptr = 0x7f4f4d400000, endptr =
> 0x7f4f4d4b1660
>     2025-11-04 10:27:07.140 CET [59696] WARNING:  libnuma: ERROR: mbind
> 

Not sure.

> 0007e: elog DEBUG says it's pg_proc_init_partition but it's
> pgproc_partition_init() actually ;)
> 
> 0007f: The "mbind: Invalid argument" issue itself, with the below addition:
>     +elog(DEBUG1, "NUMA: fastpath_partition_init ptr %p endptr %p
> num_procs %d node %d", ptr, endptr, num_procs, node);
>     showed this:
>     2025-11-04 11:30:51.089 CET [61841] DEBUG:  NUMA:
> fastpath_partition_init ptr 0x7f39eea00000 endptr 0x7f39eeab1660
> num_procs 2523 node 0
>     2025-11-04 11:30:51.089 CET [61841] WARNING:  libnuma: ERROR: mbind
>     2025-11-04 11:30:51.089 CET [61841] DEBUG:  NUMA:
> fastpath_partition_init ptr 0x7f39eec00000 endptr 0x7f39eecb1660
> num_procs 2523 node 1
>     2025-11-04 11:30:51.089 CET [61841] WARNING:  libnuma: ERROR: mbind
>     2025-11-04 11:30:51.089 CET [61841] DEBUG:  NUMA:
> fastpath_partition_init ptr 0x7f39eee00000 endptr 0x7f39eeeb1660
> num_procs 2523 node 2
>     2025-11-04 11:30:51.089 CET [61841] WARNING:  libnuma: ERROR: mbind
>     [..]
> 
>     Meanwhile it's full hugepage size (e.g. 0x7f39eec00000−0x7f39eea00000 = 2MB)
>     $ grep --color 7f39ee[ace] /proc/61841/smaps
>     7f39ee800000-7f39eea00000 rw-s 87de00000 00:11 122710
>       /anon_hugepage (deleted)
>     7f39eea00000-7f39eec00000 rw-s 87e000000 00:11 122710
>       /anon_hugepage (deleted)
>     7f39eec00000-7f39eee00000 rw-s 87e200000 00:11 122710
>       /anon_hugepage (deleted)
>     7f39eee00000-7f39ef000000 rw-s 87e400000 00:11 122710
>       /anon_hugepage (deleted)
> 
>     but mbind() was called for just 0x7f39eeab1660−0x7f39eea00000 =
> 0xB1660 = 726624 bytes, but if adjust blindly endptr in that
> fastpath_partition_init() to be "char *endptr = ptr + 2*1024*1024;"
> (HP) it doesn't complain anymore and I get success:
>     2025-11-04 12:08:30.147 CET [62352] DEBUG:  NUMA:
> fastpath_partition_init ptr 0x7f7bf7000000 endptr 0x7f7bf7200000
> num_procs 2523 node 0
>     2025-11-04 12:08:30.147 CET [62352] DEBUG:  NUMA:
> fastpath_partition_init ptr 0x7f7bf7200000 endptr 0x7f7bf7400000
> num_procs 2523 node 1
>     2025-11-04 12:08:30.147 CET [62352] DEBUG:  NUMA:
> fastpath_partition_init ptr 0x7f7bf7400000 endptr 0x7f7bf7600000
> num_procs 2523 node 2
>     2025-11-04 12:08:30.147 CET [62352] DEBUG:  NUMA:
> fastpath_partition_init ptr 0x7f7bf7600000 endptr 0x7f7bf7800000
> num_procs 2523 node 3
>     2025-11-04 12:08:30.147 CET [62352] DEBUG:  NUMA:
> fastpath_partition_init ptr 0x7f7bf7800000 endptr 0x7f7bf7a00000
> num_procs 38 node -1
>     2025-11-04 12:08:30.239 CET [62352] LOG:  starting PostgreSQL
> 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
> 

Hmm, so it seems like another hugepage-related issue. The mbind manpage
says this about "len":

  EINVAL An invalid value was specified for flags or mode; or addr + len
  was less than addr; or addr is not a multiple of the system page size.

I don't read that as requiring (addr + len) to be a multiple of the page
size, but for mappings backed by huge pages the kernel may well require
the whole range to be huge-page aligned, which would explain why rounding
endptr up makes the error go away.

> 0006d: I've got one SIGBUS during a call to select
> pg_buffercache_numa_pages(); and it looks like that memory accessed is
> simply not mapped? (bug)
> 
>     Program received signal SIGBUS, Bus error.
>     pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
> ../contrib/pg_buffercache/pg_buffercache_pages.c:386
>     386                                     pg_numa_touch_mem_if_required(ptr);
>     (gdb) print ptr
>     $1 = 0x7f4ed0200000 <error: Cannot access memory at address 0x7f4ed0200000>
>     (gdb) where
>     #0  pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
> ../contrib/pg_buffercache/pg_buffercache_pages.c:386
>     #1  0x0000561a672a0efe in ExecMakeFunctionResultSet
> (fcache=0x561a97e8e5d0, econtext=econtext@entry=0x561a97e8dab8,
> argContext=0x561a97ec62a0, isNull=0x561a97e8e578,
> isDone=isDone@entry=0x561a97e8e5c0) at
> ../src/backend/executor/execSRF.c:624
>     [..]
> 
>     Postmaster had still attached shm (visible via smaps), and if you
> compare closely 0x7f4ed0200000 against sorted smaps:
> 
>     7f4921400000-7f4b21400000 rw-s 252600000 00:11 151111
>       /anon_hugepage (deleted)
>     7f4b21400000-7f4d21400000 rw-s 452600000 00:11 151111
>       /anon_hugepage (deleted)
>     7f4d21400000-7f4f21400000 rw-s 652600000 00:11 151111
>       /anon_hugepage (deleted)
>     7f4f21400000-7f4f4bc00000 rw-s 852600000 00:11 151111
>       /anon_hugepage (deleted)
>     7f4f4bc00000-7f4f4c000000 rw-s 87ce00000 00:11 151111
>       /anon_hugepage (deleted)
> 
>     it's NOT there at all (there's no mmap region starting with
> 0x"7f4e" ). It looks like because pg_buffercache_numa_pages() is not
> aware of this new mmaped() regions and instead does simple loop over
> all NBuffers with "for (char *ptr = startptr; ptr < endptr; ptr +=
> os_page_size)"?
> 

I'm confused. How could that mapping be missing? Was this with huge
pages, and how many did you reserve on each node? Maybe there were not
enough huge pages left on one of them?

I believe I got some SIGBUS in those cases.

> 0006e:
>     I'm seeking confirmation, but is this the issue we have discussed
> on PgconfEU related to lack of detection of Mems_allowed, right? e.g.
>     $ numactl --membind="0,1" --cpunodebind="0,1"
> /usr/pgsql19/bin/pg_ctl -D /path start
>     still shows 4 NUMA nodes used. Current patches use
> numa_num_configured_nodes(), but it says 'This count includes any
> nodes that are currently DISABLED'. So I was wondering if I could help
> by migrating towards numa_num_task_nodes() / numa_get_mems_allowed()?
> It's the same as you wrote earlier to Alexy?
> 

If "mems_allowed" refers to nodes allowing memory allocation, then yes,
this would be one way to get into that issue. Oh, is this what happened
in 0006d?

>     > But that's not what you proposed here, clearly. You're saying we should
>     > find which NUMA nodes the process is allowed to run, and use those.
>     > Instead of just using all *configured* nodes. And I agree with that.
> 
>     So are you already on it ?
> 
>> There are a couple unsolved issues, though. While running the tests, I
>> ran into a bunch of weird issues. I saw two types of failures:
>> 1) Bad address
>> 2) Operation canceled
> 
> I did run (with io_uring) a short test(< 10min with -c 128) and didn't
> get those. Could you please share specific tips/workload for
> reproducing this?
> 

I did get a couple of "operation canceled" failures, but only on fairly
old kernel versions (6.1, which came as the default with the VM). I've
heard suggestions that this is a bug in older kernels, though I don't
have a link to a bug report or fix. I've been unable to reproduce it on
6.17, so maybe that's true.

For me the failures always happened 10 seconds after the start of the
benchmark (and starting the instance), so it's probably sufficient to
keep the runs ~20 seconds (and maybe restart in between?).

But even then it's fairly rare. I've seen ~10 failures for 500 runs.


I haven't seen any more "bad address" cases, and I have no idea why. I'm
still guessing it's related to huge pages, so maybe I happened to reserve
enough of them.

regards

-- 
Tomas Vondra



