Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
| From | Jakub Wartak |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | CAKZiRmww2P6QAzu6W+vxB89i5Ha-YRSHMeyr6ax2Lymcu3LUcw@mail.gmail.com |
| In response to | Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
On Tue, Nov 11, 2025 at 12:52 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> here's a rebased patch series, fixing most of the smaller issues from
> v20251101, and making cfbot happy (hopefully).

Hi Tomas,

> >>> 0007b: pg_buffercache_pgproc -- nitpick, but maybe it would be better
> >>> called pg_shm_pgproc?
> >>
> >> Right. It does not belong to pg_buffercache at all, I just added it
> >> there because I've been messing with that code already.
> >
> > Please keep them in for at least some time (perhaps a standalone
> > patch marked as not intended to be committed would work?). I find the
> > view extremely useful, as it will allow us to pinpoint local-vs-remote
> > NUMA fetches (we need to know the address).
>
> Are you referring to the _pgproc view specifically, or also to the view
> with buffer partitions? I don't intend to remove the view for shared
> buffers, that's indeed useful.

Both, even the _pgproc.

> Hmmm, ok. Will check. But maybe let's not focus too much on the PGPROC
> partitioning, I don't think that's likely to go into 19.

Oh ok.

> >>> 0006d: I've got one SIGBUS during a call to select
> >>> pg_buffercache_numa_pages(); and it looks like that memory accessed is
> >>> simply not mapped? (bug)
[..]
> I didn't have time to look into all this info about mappings, io_uring
> yet, so no response from me.

OK, so here is the proper HP + SIGBUS explanation. Apologies: earlier I
wrote that disabling THP works around this, but I probably made an error
there and used the wrong binary (one with MAP_POPULATE in PG_MMAP_FLAGS),
so please ignore that.

1. Before starting PG, with shared_buffers=32GB, huge_pages=on (2MB ones),
vm.nr_hugepages=17715, 4 NUMA nodes, kernel 6.14.x, max_connections=10k,
wal_buffers=1GB:

node0/hugepages/hugepages-2048kB/free_hugepages:4429
node1/hugepages/hugepages-2048kB/free_hugepages:4429
node2/hugepages/hugepages-2048kB/free_hugepages:4429
node3/hugepages/hugepages-2048kB/free_hugepages:4428

2. Just start PG with the older NUMA patchset (v20251101). There will be a
deficit across NUMA nodes right after startup; mostly, one NUMA node will
have allocated much more:

node0/hugepages/hugepages-2048kB/free_hugepages:4397
node1/hugepages/hugepages-2048kB/free_hugepages:3453
node2/hugepages/hugepages-2048kB/free_hugepages:4397
node3/hugepages/hugepages-2048kB/free_hugepages:4396

3. Check the layout of the NUMA maps for the postmaster PID:

7fc9cb200000 default file=/anon_hugepage\040(deleted) huge dirty=517 mapmax=8 N1=517 kernelpagesize_kB=2048 [!!!]
7fca0d600000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N0=32 kernelpagesize_kB=2048
7fca11600000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N1=32 kernelpagesize_kB=2048
7fca15600000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N2=32 kernelpagesize_kB=2048
7fca19600000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N3=32 kernelpagesize_kB=2048
7fca1d600000 default file=/anon_hugepage\040(deleted) huge
7fca1d800000 bind:0 file=/anon_hugepage\040(deleted) huge
7fcc1d800000 bind:1 file=/anon_hugepage\040(deleted) huge
7fce1d800000 bind:2 file=/anon_hugepage\040(deleted) huge
7fd01d800000 bind:3 file=/anon_hugepage\040(deleted) huge
7fd21d800000 default file=/anon_hugepage\040(deleted) huge dirty=425 mapmax=8 N1=425 kernelpagesize_kB=2048 [!!!]
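(As a side note, this default-policy first-touch behaviour is easy to
demonstrate outside of PG. A minimal sketch - not part of the patchset,
assuming 2MB huge pages are reserved via vm.nr_hugepages and the process
is not pinned to any node - would be something like:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define NPAGES	64
#define SZ		(NPAGES * 2UL * 1024 * 1024)	/* 64 x 2MB huge pages */

int
main(void)
{
	char   *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
					 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}

	/* Nothing faulted in yet: numa_maps shows the range with no N<n>= counts */
	printf("mapped, check /proc/%d/numa_maps, then press enter\n", getpid());
	getchar();

	/* First touch: all pages land on the node this process is running on */
	memset(p, 0, SZ);
	printf("touched, check numa_maps again: dirty=64 on a single node\n");
	getchar();
	return 0;
}

So placement happens at fault time, on whichever node the faulting CPU
belongs to - exactly the pattern visible in the dump above.)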
So your patch doesn't do anything special for anything other than Buffer Blocks and PGPROC in the numa_maps dump above, so the remaining regions just keep the "default" NUMA policy, which per the N1 entries above accounts for (517+425) * 2MB = ~1884 MB of really-used memory. PG does touch those regions on startup, but it doesn't really touch the Buffer Blocks. Anyway, this explains the missing free huge pages on N1 (it generates pressure on node 1). So as it stands, the patchset is missing some form of balancing to use equal amounts of memory across nodes:

- each node is forced to take a certain share of Buffer Blocks / per-NUMA-node blocks, yet
- we do nothing for the other regions (e.g. $SegHDR (start of shm) .. first Buffer Block) and leave them at the "default" policy, so they get placed on the current node, which in turn causes this memory overallocation imbalance (in the example, N1 gets its Buffer Blocks share plus everything else - but that only happens on real access, not during mmap(), due to the lazy/first-touch policy)

Currently, launching anything that touches memory on the NUMA node with the deficit (N1 above) - e.g. using pg_shm_allocations or pg_buffercache - will cause stress there and end up in SIGBUS. This looks to be by design on the Linux kernel side: exc_page_fault() -> do_user_addr_fault() -> do_sigbus(), AKA force_sig_fault().

But if I hack PG to do interleaving (or just run numactl --interleave=all ...) so that those 3 "default" regions are effectively interleaved instead, I get an "interleave" layout like this:

7fb2dd000000 interleave:0-3 file=/anon_hugepage\040(deleted) huge dirty=517 mapmax=8 N0=129 N1=132 N2=128 N3=128 kernelpagesize_kB=2048
7fb31f400000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N0=32 kernelpagesize_kB=2048
7fb323400000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N1=32 kernelpagesize_kB=2048
7fb327400000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N2=32 kernelpagesize_kB=2048
7fb32b400000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N3=32 kernelpagesize_kB=2048
7fb32f400000 interleave:0-3 file=/anon_hugepage\040(deleted) huge
7fb32f600000 bind:0 file=/anon_hugepage\040(deleted) huge
7fb52f600000 bind:1 file=/anon_hugepage\040(deleted) huge
7fb72f600000 bind:2 file=/anon_hugepage\040(deleted) huge
7fb92f600000 bind:3 file=/anon_hugepage\040(deleted) huge
7fbb2f600000 interleave:0-3 file=/anon_hugepage\040(deleted) huge dirty=425 N0=106 N1=106 N2=105 N3=108 kernelpagesize_kB=2048

Then, even after fully touching everything (via a select from pg_shm_allocations), it keeps running, I get much better balance, and there are no SIGBUS issues:

node0/hugepages/hugepages-2048kB/free_hugepages:23
node1/hugepages/hugepages-2048kB/free_hugepages:23
node2/hugepages/hugepages-2048kB/free_hugepages:23
node3/hugepages/hugepages-2048kB/free_hugepages:22

This demonstrates that there is enough free memory overall; it is just the imbalance that causes the SIGBUS. I hope this answers one of your main questions from the very first messages: what should we do with the remaining shared memory regions? I would like to hear your thoughts on this before I start benchmarking it for real - I didn't want to bench it yet, as such interleaving could alter the test results.
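(For completeness: the in-PG interleave hack boils down to something like
the hypothetical helper below - a sketch only, using the raw mbind(2)
interface from <numaif.h>, not the code I actually used. It has to run
right after mmap(), before anything touches the region, because mbind()
does not migrate already-faulted pages unless MPOL_MF_MOVE is passed:

#include <stdio.h>
#include <numaif.h>		/* mbind(), MPOL_INTERLEAVE; link with -lnuma */

/* Interleave one shared memory region across nodes 0..numnodes-1 */
static int
interleave_region(void *addr, unsigned long len, int numnodes)
{
	unsigned long nodemask = (1UL << numnodes) - 1;	/* 0xf for nodes 0-3 */

	if (mbind(addr, len, MPOL_INTERLEAVE,
			  &nodemask, sizeof(nodemask) * 8, 0) != 0)
	{
		perror("mbind");
		return -1;
	}
	return 0;
}

Applying that to the 3 "default" regions matches the numactl
--interleave=all layout above: the explicitly bound bind:N regions keep
their policy, and only the former "default" regions show up as
interleave:0-3.)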
Other things I've noticed:
- the smaps Size: and Shared_Hugetlb: figures are a lie; they show really-touched memory, not assigned memory
- the same goes for procfs's numa_maps: ignore the N[0-3] sizes, they cover only what is "really used", not what is assigned
- it's best to just calculate the size manually from the pointers/address range itself (e.g. above, consecutive bind:N regions start 0x200000000 bytes = 8 GB apart, i.e. the 32GB of shared buffers split evenly across 4 nodes)

-J.