Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | 33d0da22-1400-4b7a-b4c8-1867b8a2cae0@vondra.me |
| In response to | Re: Adding basic NUMA awareness (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
| List | pgsql-hackers |
On 11/17/25 10:23, Jakub Wartak wrote:
> On Tue, Nov 11, 2025 at 12:52 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi,
>>
>> here's a rebased patch series, fixing most of the smaller issues from
>> v20251101, and making cfbot happy (hopefully).
>
> Hi Tomas,
>
>>>>> 0007b: pg_buffercache_pgproc -- nitpick, but maybe it would be better
>>>>> called pg_shm_pgproc?
>>>>>
>>>>
>>>> Right. It does not belong to pg_buffercache at all, I just added it
>>>> there because I've been messing with that code already.
>>>
>>> Please keep them in at least for some time (perhaps a standalone
>>> patch marked as not intended to be committed would work?). I find the
>>> view extremely useful, as it will allow us to pinpoint local-vs-remote
>>> NUMA fetches (we need to know the address).
>>>
>>
>> Are you referring to the _pgproc view specifically, or also to the view
>> with buffer partitions? I don't intend to remove the view for shared
>> buffers, that's indeed useful.
>
> Both, even the _pgproc.
>
>> Hmmm, ok. Will check. But maybe let's not focus too much on the PGPROC
>> partitioning, I don't think that's likely to go into 19.
>
> Oh ok.
>
>>>>> 0006d: I've got one SIGBUS during a call to select
>>>>> pg_buffercache_numa_pages(); and it looks like that memory accessed is
>>>>> simply not mapped? (bug)
> [..]
>> I didn't have time to look into all this info about mappings, io_uring
>> yet, so no response from me.
>>
>
> Ok, so the proper HP + SIGBUS explanation:
>
> Apologies, earlier I wrote that disabling THP does work around this,
> but I probably made an error there and used the wrong binary back then
> (with MAP_POPULATE in PG_MMAP_FLAGS), so please ignore that.
>
> 1. Before starting PG, with shared_buffers=32GB, huge_pages=on (2MB
> ones), vm.nr_hugepages=17715, 4 NUMA nodes, kernel 6.14.x,
> max_connections=10k, wal_buffers=1GB:
>
> node0/hugepages/hugepages-2048kB/free_hugepages:4429
> node1/hugepages/hugepages-2048kB/free_hugepages:4429
> node2/hugepages/hugepages-2048kB/free_hugepages:4429
> node3/hugepages/hugepages-2048kB/free_hugepages:4428
>
> 2. Just start PG with the older NUMA patchset 20251101. There will be
> a deficit across NUMA nodes right after startup; mostly one NUMA node
> will allocate much more:
>
> node0/hugepages/hugepages-2048kB/free_hugepages:4397
> node1/hugepages/hugepages-2048kB/free_hugepages:3453
> node2/hugepages/hugepages-2048kB/free_hugepages:4397
> node3/hugepages/hugepages-2048kB/free_hugepages:4396
>
> 3. Check the layout of the NUMA maps for the postmaster PID:
>
> 7fc9cb200000 default file=/anon_hugepage\040(deleted) huge dirty=517 mapmax=8 N1=517 kernelpagesize_kB=2048 [!!!]
> 7fca0d600000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N0=32 kernelpagesize_kB=2048
> 7fca11600000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N1=32 kernelpagesize_kB=2048
> 7fca15600000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N2=32 kernelpagesize_kB=2048
> 7fca19600000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N3=32 kernelpagesize_kB=2048
> 7fca1d600000 default file=/anon_hugepage\040(deleted) huge
> 7fca1d800000 bind:0 file=/anon_hugepage\040(deleted) huge
> 7fcc1d800000 bind:1 file=/anon_hugepage\040(deleted) huge
> 7fce1d800000 bind:2 file=/anon_hugepage\040(deleted) huge
> 7fd01d800000 bind:3 file=/anon_hugepage\040(deleted) huge
> 7fd21d800000 default file=/anon_hugepage\040(deleted) huge dirty=425 mapmax=8 N1=425 kernelpagesize_kB=2048 [!!!]
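(As an aside, the per-node counters shown in steps 1 and 2 can be collected with something like the minimal sketch below; it assumes 2MB huge pages, the standard sysfs layout under /sys/devices/system/node/, and hard-codes the 4 nodes used here.)

```c
/* Print free 2MB huge pages per NUMA node, as in steps 1 and 2 above. */
#include <stdio.h>

int
main(void)
{
    for (int node = 0; node < 4; node++)    /* 4 NUMA nodes in this setup */
    {
        char    path[128];
        long    free_pages;
        FILE   *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/hugepages/hugepages-2048kB/free_hugepages",
                 node);
        f = fopen(path, "r");
        if (f == NULL)
            continue;           /* node or huge page size not present */
        if (fscanf(f, "%ld", &free_pages) == 1)
            printf("node%d free_hugepages: %ld\n", node, free_pages);
        fclose(f);
    }
    return 0;
}
```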
>
> So your patch doesn't do anything special for anything other than
> Buffer Blocks and PGPROC in the above picture, so the default mmap()
> just keeps the "default" NUMA policy, which, per the above, takes
> (517+425) * 2MB = ~1884 MB of actually used memory, as per the N1
> entries. PG does touch those regions on startup, but it doesn't really
> touch the Buffer Blocks. Anyway, this accounts for the missing free
> huge pages on N1 (it generates pressure on node 1).
>
> So as it stands, the patchset is missing some form of balancing to use
> equal memory across nodes:
> - each node is forced to get a certain amount of Buffer Blocks /
> per-NUMA-node blocks,
> - yet we do nothing and leave the other regions (e.g. $SegHDR (start of
> shm) .. first Buffer Block) at the "default" policy; those get placed
> on the current node (due to the default policy), which in turn causes
> this memory overallocation imbalance (so in the example N1 will get
> Buffer Blocks + everything else, but that only happens on real access,
> not during mmap(), due to the lazy/first-touch policy).
>
> Currently, launching anything that touches memory on the imbalanced
> NUMA node with the deficit (N1 above) - use of pg_shm_allocations,
> pg_buffercache - will cause stress there and end up in SIGBUS.
> This looks to be by design on the Linux kernel side: exc:page_fault() ->
> do_user_addr_fault() -> do_sigbus() AKA force_sig_fault(). But if I
> hack PG to do interleave (or just numactl --interleave=all ...) to
> effectively interleave those 3 "default" regions instead, I get
> "interleave" like this:
>
> 7fb2dd000000 interleave:0-3 file=/anon_hugepage\040(deleted) huge dirty=517 mapmax=8 N0=129 N1=132 N2=128 N3=128 kernelpagesize_kB=2048
> 7fb31f400000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N0=32 kernelpagesize_kB=2048
> 7fb323400000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N1=32 kernelpagesize_kB=2048
> 7fb327400000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N2=32 kernelpagesize_kB=2048
> 7fb32b400000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N3=32 kernelpagesize_kB=2048
> 7fb32f400000 interleave:0-3 file=/anon_hugepage\040(deleted) huge
> 7fb32f600000 bind:0 file=/anon_hugepage\040(deleted) huge
> 7fb52f600000 bind:1 file=/anon_hugepage\040(deleted) huge
> 7fb72f600000 bind:2 file=/anon_hugepage\040(deleted) huge
> 7fb92f600000 bind:3 file=/anon_hugepage\040(deleted) huge
> 7fbb2f600000 interleave:0-3 file=/anon_hugepage\040(deleted) huge dirty=425 N0=106 N1=106 N2=105 N3=108 kernelpagesize_kB=2048
>
> Then even after fully touching everything (via a select from
> pg_shm_allocations), it'll run, I get much better balance, and won't
> have SIGBUS issues:
>
> node0/hugepages/hugepages-2048kB/free_hugepages:23
> node1/hugepages/hugepages-2048kB/free_hugepages:23
> node2/hugepages/hugepages-2048kB/free_hugepages:23
> node3/hugepages/hugepages-2048kB/free_hugepages:22
>
> This demonstrates that enough free memory is out there; it's just the
> imbalance that causes SIGBUS. I hope this answers one of your main
> questions from the very first messages, about what we should do with
> the remaining shared_buffer members. I would like to hear your thoughts
> on this before I start benchmarking it for real; I didn't want to bench
> it yet, as such interleaving could alter the test results.
>
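For concreteness, the "interleave by default, then explicitly bind the partitioned regions" approach described above might look roughly like the libnuma sketch below. This is an illustration only, not the patch's actual code: the function name and the partition layout/sizes are hypothetical placeholders, while numa_available(), numa_interleave_memory(), numa_tonode_memory() and numa_all_nodes_ptr are regular libnuma interfaces. The policies have to be applied before the pages are first touched, otherwise they won't affect where the huge pages end up.

```c
/*
 * Illustration only: interleave the whole shared segment across all nodes,
 * then override the explicitly partitioned ranges (buffers, PGPROC) with a
 * per-node binding.  Assumes part_base/part_size are huge-page aligned.
 * Link with -lnuma.
 */
#include <stddef.h>
#include <numa.h>

void
place_shared_segment(char *base, size_t total_size,
                     char *part_base, size_t part_size, int num_nodes)
{
    if (numa_available() < 0)
        return;                 /* no NUMA support, keep the default policy */

    /* default policy for the entire segment: interleave across all nodes */
    numa_interleave_memory(base, total_size, numa_all_nodes_ptr);

    /* override: bind each partition to "its" node */
    for (int node = 0; node < num_nodes; node++)
        numa_tonode_memory(part_base + (size_t) node * part_size,
                           part_size, node);
}
```

Running the server under numactl --interleave=all achieves the same effect for the non-bound regions externally, without touching the server code.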
Thanks for investigating this. If I understand the findings correctly, it agrees with my imprecise explanation in [1], right? There I said:

> ...
> You may ask why the per-node limit is too low. We still need just
> shared_memory_size_in_huge_pages, right? And if we were partitioning
> the whole memory segment, that'd be true. But we only do that for
> shared buffers, and there's a lot of other shared memory - could be
> 1-2GB or so, depending on the configuration.
>
> And this gets placed on one of the nodes, and it counts against the
> limit on that particular node. And so it doesn't have enough huge
> pages to back the partition of shared buffers.
> ...

Which I think is mostly the same thing you're saying, and you have the maps to support it.

In any case, I think setting "interleave" as the default policy, and then overriding it for the areas we partition explicitly (buffers, pgproc), seems like the right solution. The only other solution would be to balance it ourselves, but how is that different from interleaving? So I think this makes sense, and you can do --interleave=all for the benchmark.

[1] https://www.postgresql.org/message-id/71a46484-053c-4b81-ba32-ddac050a8b5d%40vondra.me

I suppose we may need to adjust shared_memory_size_in_huge_pages, because the interleave followed by explicit partitioning may still leave behind a bit of imbalance. It should be only a couple of pages, but I haven't done the math yet.

regards

--
Tomas Vondra
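A back-of-the-envelope sketch of that rounding, using the page counts from the maps above (4 nodes, and assuming the huge page reservation is kept uniform across nodes; the numbers are purely illustrative):

```c
/* Rough per-node rounding sketch, not a worked-out formula. */
#include <stdio.h>

int
main(void)
{
    int     nodes = 4;
    long    buffer_pages = 16384;    /* 32GB of shared buffers in 2MB pages */
    long    other_pages = 517 + 425; /* the two interleaved "default" regions above */

    /* each node: its shared-buffers partition plus a rounded-up share of the rest */
    long    per_node = buffer_pages / nodes + (other_pages + nodes - 1) / nodes;

    printf("per node: %ld, uniform total: %ld, actual total: %ld\n",
           per_node, per_node * nodes, buffer_pages + other_pages);
    /* prints: per node: 4332, uniform total: 17328, actual total: 17326 */
    return 0;
}
```

In this configuration the uniform reservation overshoots the actual usage by two huge pages, consistent with the "only a couple of pages" expectation.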