Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Draft for basic NUMA observability
Msg-id: y4zhgypa4vt3txf22yzvkfe2m4rgrph25ms6ax2ukduwcl43u3@dosysiprwsha
In response to: Re: Draft for basic NUMA observability (Tomas Vondra <tomas@vondra.me>)
Responses: Re: Draft for basic NUMA observability
List: pgsql-hackers
Hi,

On 2025-04-06 13:56:54 +0200, Tomas Vondra wrote:
> On 4/6/25 01:00, Andres Freund wrote:
> > On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
> >> I think one thing that the docs should mention is that calling the numa
> >> functions/views will force the pages to be allocated, even if they're
> >> currently unused.
> >>
> >> Newly started server, with s_b of 32GB and 2MB huge pages:
> >>
> >> grep ^Huge /proc/meminfo
> >> HugePages_Total:   34802
> >> HugePages_Free:    34448
> >> HugePages_Rsvd:    16437
> >> HugePages_Surp:        0
> >> Hugepagesize:       2048 kB
> >> Hugetlb:        76517376 kB
> >>
> >> run
> >> SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
> >>
> >> Now the pages that previously were marked as reserved are actually allocated:
> >>
> >> grep ^Huge /proc/meminfo
> >> HugePages_Total:   34802
> >> HugePages_Free:    18012
> >> HugePages_Rsvd:        1
> >> HugePages_Surp:        0
> >> Hugepagesize:       2048 kB
> >> Hugetlb:        76517376 kB
> >>
> >> I don't see how we can avoid that right now, but at the very least we ought
> >> to document it.
> >
> > The only allocation where that really matters is shared_buffers. I wonder if
> > we could special-case the logic for that, by only probing if at least one of
> > the buffers in the range is valid.
> >
> > Then we could treat a page status of -ENOENT as "page is not mapped" and
> > display NULL for the node_id?
> >
> > Of course that would mean that we'd always need to
> > pg_numa_touch_mem_if_required(), not just the first time round, because we
> > previously might not have for a page that is now valid. But compared to the
> > cost of actually allocating pages, the cost for that seems small.
>
> I don't think this would be a good trade-off. The buffers already have a
> NUMA node, and users would be interested in that.

The thing is that the buffer might *NOT* have a NUMA node. That's e.g. the
case in the above example - otherwise we wouldn't initially have seen the
large HugePages_Rsvd.
Forcing all those pages to be allocated via pg_numa_touch_mem_if_required()
itself wouldn't be too bad - in fact I'd rather like to have an explicit way
of doing that. The problem is that that leads to all those allocations
happening on the *current* NUMA node (unless you have started postgres with
numactl --interleave=all or such), rather than the node where the normal
first use would have allocated it.

> It's just that we don't have the memory mapped in the current backend, so
> I'd bet people would not be happy with NULL, and would proceed to force the
> allocation in some other way (say, a large query of some sort). Which
> obviously causes a lot of other problems.

I don't think that really would be the case with what I proposed? If any
buffer in the region were valid, we would force the allocation to become
known to the current backend.

Greetings,

Andres Freund