Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From | Jakub Wartak |
---|---|
Subject | Re: Draft for basic NUMA observability |
Date | |
Msg-id | CAKZiRmyJj=pn=LQcj2eGH9BcVkS1pC0VRhrA0NAByV-brUw7CA@mail.gmail.com |
In response to | Re: Draft for basic NUMA observability (Patrick Stählin <me@packi.ch>) |
Responses | Re: Draft for basic NUMA observability |
List | pgsql-hackers |
On Tue, Jul 22, 2025 at 11:30 AM Patrick Stählin <me@packi.ch> wrote:
>
> Hi!
>
> On 4/7/25 11:27 PM, Tomas Vondra wrote:
> >
> > I've pushed all three parts of v29, with some additional corrections
> > (picked lower OIDs, bumped catversion, fixed commit messages).
>
> While building the PG18 beta1/2 packages I noticed that in our build
> containers the selftest for pg_buffercache_numa and numa failed. It
> seems that libnuma was available and pg_numa_init/numa_available returns
> no errors, we still fail in pg_numa_query_pages/move_pages with EPERM
> yielding the following error when accessing
> pg_buffercache_numa/pg_shmem_allocations_numa:
>
> ERROR: failed NUMA pages inquiry: Operation not permitted
>
> The man-page of move_pages lead me to believe that this is because of
> the missing capability CAP_SYS_NICE on the process but I couldn't prove
> that theory with the attached patch.
> The patch did make the tests pass but also disabled NUMA permanently on
> a vanilla Debian VM and that is certainly not wanted. It may well be
> that my understanding of checking capabilities and how they work is
> incomplete. I also think that adding a new dependency for the reason of
> just checking the capability is probably a bit of an overkill, maybe we
> can check if we can access move_pages once without an error before
> treating it as one?
>
> I'd be happy to debug this further but I have limited access to our
> build-infra, I should be able to sneak in commands during the build though.

Hi Patrick,

So is it because the container was started without CAP_SYS_NICE, so that
even root -> postgres does not have this capability? In my book a container
would be rather small, and a single container certainly wouldn't span
multiple CPU sockets, so I would just disable libnuma. Anyway, if I do this
on a regular VM:

# capsh --drop=CAP_SYS_NICE -- -c "su - postgres"
$ /usr/sbin/capsh --print
[..]
Current IAB: !cap_sys_nice
[..]

then I can still query pg_shmem_allocations_numa and pg_buffercache_numa
after start. The same happens with setpriv(1), if I do a little cross-check:

# setpriv --reuid nobody --regid nogroup --clear-groups --bounding=-sys_nice -- id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
# setpriv --reuid nobody --regid nogroup --clear-groups --bounding=-sys_nice -- sleep 60 &
# pgrep sleep
### => 14882
# grep ^Cap /proc/14882/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffff7fffff
CapAmb: 0000000000000000
# capsh --decode=000001ffff7fffff
0x000001ffff7fffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
# capsh --decode=000001ffff7fffff | grep -i nice
### nothing (no cap)
#
# and then start PG for real:
# setpriv --reuid postgres --regid postgres --clear-groups --bounding=-sys_nice -- /usr/pgsql19/bin/pg_ctl -D /tmp/pg19 -l /tmp/logfile start
$ psql..
### => pid 15012
# grep ^Cap /proc/15012/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffff7fffff
CapAmb: 0000000000000000
# capsh --decode=000001ffff7fffff
0x000001ffff7fffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
# capsh --decode=000001ffff7fffff | grep -i nice
### nothing (no cap)
#

.. and I still cannot reproduce this in a VM. Could you share exact details
about the container technology in use? Could you also provide the output of
/usr/sbin/capsh --print just before starting PG there? Maybe this is somehow
cgroup/cpuset related too?

Anyway, there is a simpler way to make the tests pass, if that's what you
are after. We have contrib/pg_buffercache/sql/pg_buffercache_numa.sql, whose
output is expected to match pg_buffercache_numa.out OR (!)
pg_buffercache_numa_1.out. We could probably handle this edge case by adding
a pg_buffercache_numa_2.out as well (which would just contain the semi-valid
scenario for "ERROR: failed NUMA pages inquiry: Operation not permitted").

-J.
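
As a cross-check of the `capsh --decode` output quoted in the message, the Cap* masks from /proc/<pid>/status can also be decoded by hand. The following is only a minimal sketch, assuming CAP_SYS_NICE is capability number 23 as defined in linux/capability.h. The helper names `parse_cap_sets` and `has_capability` are hypothetical and not part of PostgreSQL, libcap, or the patch under discussion.

```python
# Capability number of CAP_SYS_NICE (assumption: value from <linux/capability.h>).
CAP_SYS_NICE = 23

# The Cap* lines quoted in the message for the setpriv-started process.
SAMPLE_STATUS = """\
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffff7fffff
CapAmb: 0000000000000000
"""


def parse_cap_sets(status_text):
    """Parse the Cap* lines of /proc/<pid>/status into {name: integer mask}."""
    sets = {}
    for line in status_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.startswith("Cap"):
            # The kernel prints each capability set as a hexadecimal bitmask.
            sets[name.strip()] = int(value.strip(), 16)
    return sets


def has_capability(mask, cap_number):
    """Return True if the bit for cap_number is set in the given mask."""
    return bool((mask >> cap_number) & 1)


if __name__ == "__main__":
    caps = parse_cap_sets(SAMPLE_STATUS)
    for name, mask in caps.items():
        state = "present" if has_capability(mask, CAP_SYS_NICE) else "absent"
        print(f"{name}: cap_sys_nice {state}")
```

To inspect a live process, the content of /proc/<pid>/status can be passed to `parse_cap_sets` in place of the sample text. Note that this only reports which capability bits are set; it does not by itself explain why move_pages returns EPERM in a given container.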