Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: Draft for basic NUMA observability
Date
Msg-id CAKZiRmyJj=pn=LQcj2eGH9BcVkS1pC0VRhrA0NAByV-brUw7CA@mail.gmail.com
In response to Re: Draft for basic NUMA observability  (Patrick Stählin <me@packi.ch>)
Responses Re: Draft for basic NUMA observability
List pgsql-hackers
On Tue, Jul 22, 2025 at 11:30 AM Patrick Stählin <me@packi.ch> wrote:
>
> Hi!
>
> On 4/7/25 11:27 PM, Tomas Vondra wrote:
> >
> > I've pushed all three parts of v29, with some additional corrections
> > (picked lower OIDs, bumped catversion, fixed commit messages).
>
> While building the PG18 beta1/2 packages I noticed that in our build
> containers the selftest for pg_buffercache_numa and numa failed. It
> seems that libnuma was available and pg_numa_init/numa_available returns
> no errors, we still fail in pg_numa_query_pages/move_pages with EPERM
> yielding the following error when accessing
> pg_buffercache_numa/pg_shmem_allocations_numa:
>
>    ERROR: failed NUMA pages inquiry: Operation not permitted
>
> The man-page of move_pages led me to believe that this is because of
> the missing capability CAP_SYS_NICE on the process but I couldn't prove
> that theory with the attached patch.
> The patch did make the tests pass but also disabled NUMA permanently on
> a vanilla Debian VM and that is certainly not wanted. It may well be
> that my understanding of checking capabilities and how they work is
> incomplete. I also think that adding a new dependency for the reason of
> just checking the capability is probably a bit of an overkill, maybe we
> can check if we can access move_pages once without an error before
> treating it as one?
>
> I'd be happy to debug this further but I have limited access to our
> build-infra, I should be able to sneak in commands during the build though.


Hi Patrick,

So is it because the container was started without CAP_SYS_NICE, so
that even after root -> postgres the process does not have this cap?
In my book a container would be rather small, and a single container
certainly wouldn't span multiple CPU sockets, so I would just disable
libnuma there. Anyway, if I do this on a regular VM:

# capsh --drop=CAP_SYS_NICE -- -c "su - postgres"
$ /usr/sbin/capsh --print
[..]
Current IAB: !cap_sys_nice
[..]

then I can still query pg_shmem_allocations_numa and
pg_buffercache_numa after start. The same happens with setpriv(1), if
I do a little cross-check:

# setpriv --reuid nobody --regid nogroup --clear-groups --bounding=-sys_nice -- id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
# setpriv --reuid nobody --regid nogroup --clear-groups --bounding=-sys_nice -- sleep 60 &
# pgrep sleep ### => 14882
# grep ^Cap /proc/14882/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffff7fffff
CapAmb: 0000000000000000
# capsh --decode=000001ffff7fffff

0x000001ffff7fffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
# capsh --decode=000001ffff7fffff | grep -i nice ### nothing (no cap)
#

and then starting PG for real:

# setpriv --reuid postgres --regid postgres --clear-groups --bounding=-sys_nice -- /usr/pgsql19/bin/pg_ctl -D /tmp/pg19 -l /tmp/logfile start
$ psql.. ### => pid 15012

# grep ^Cap /proc/15012/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffff7fffff
CapAmb: 0000000000000000
# capsh --decode=000001ffff7fffff

0x000001ffff7fffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
# capsh --decode=000001ffff7fffff | grep -i nice ### nothing (no cap)
#

... and I still cannot reproduce this in a VM.
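
As a side note, the same check can be done without capsh by testing a
single bit of the hex mask from /proc/PID/status. A minimal sketch,
assuming CAP_SYS_NICE is capability number 23 (as defined in
<linux/capability.h>); the helper name has_cap is made up for
illustration:

```shell
#!/bin/sh
# CAP_SYS_NICE is capability number 23 in <linux/capability.h>.
CAP_SYS_NICE_BIT=23

# has_cap MASK BIT (hypothetical helper): succeed if bit BIT is set in
# the hex MASK, given without the 0x prefix, exactly as printed in the
# Cap* lines of /proc/PID/status.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# CapBnd value taken from the /proc/PID/status output shown above.
if has_cap 000001ffff7fffff "$CAP_SYS_NICE_BIT"; then
    echo "cap_sys_nice present in bounding set"
else
    echo "cap_sys_nice absent from bounding set"
fi
```

For the mask above this reports the capability as absent, which agrees
with the empty grep on the capsh --decode output.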

Can you provide exact details about this container technology?
Can you provide the output of /usr/sbin/capsh --print just before
starting PG there?
Maybe this is somehow related to cgroups/cpusets instead?
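
For the cgroup/cpuset question, a minimal sketch of what could be
collected from inside the container. It assumes a Linux /proc where
the status file carries the Cpus_allowed_list and Mems_allowed_list
fields; the function name numa_affinity is made up for illustration:

```shell
#!/bin/sh
# numa_affinity [STATUS_FILE] (hypothetical helper): print the CPUs and
# NUMA memory nodes a process is allowed to use. Defaults to the status
# file of the calling process; pass /proc/PID/status for a backend.
numa_affinity() {
    grep -E '^(Cpus|Mems)_allowed_list:' "${1:-/proc/self/status}"
}

# The cgroup membership of a process can be read from /proc/PID/cgroup.
```

Running numa_affinity /proc/PID/status for the backend pid, together
with the content of /proc/PID/cgroup, would show whether a cpuset is
restricting the process.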

Anyway, there is a simpler way to make the tests pass, if that's what
you are after. We have
contrib/pg_buffercache/sql/pg_buffercache_numa.sql, whose output is
expected to match either pg_buffercache_numa.out OR (!)
pg_buffercache_numa_1.out. We could probably handle this edge case by
also adding pg_buffercache_numa_2.out, which would just contain the
semi-valid scenario of "ERROR: failed NUMA pages inquiry: Operation
not permitted".
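
To illustrate the idea of alternative expected files: a test passes if
its result matches any one of the candidates. This is only a sketch of
that rule, not pg_regress itself (which does its own diff-based
comparison); the function name matches_any is made up for
illustration:

```shell
#!/bin/sh
# matches_any RESULT EXPECTED... (hypothetical helper): succeed if
# RESULT is byte-identical to at least one existing EXPECTED file.
matches_any() {
    result=$1
    shift
    for expected in "$@"; do
        # Skip candidates that do not exist, e.g. a not-yet-added _2.out.
        [ -f "$expected" ] && cmp -s "$result" "$expected" && return 0
    done
    return 1
}
```

With this rule, a call such as matches_any results/pg_buffercache_numa.out
expected/pg_buffercache_numa.out expected/pg_buffercache_numa_1.out
expected/pg_buffercache_numa_2.out would succeed in the EPERM case only
once the _2.out variant exists.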

-J.


