Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Draft for basic NUMA observability
Msg-id: ab5059cd-b3a5-40d8-b26c-a0db906b3aef@vondra.me
In response to: Re: Draft for basic NUMA observability (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 4/7/25 17:51, Andres Freund wrote:
> Hi,
>
> On 2025-04-06 13:56:54 +0200, Tomas Vondra wrote:
>> On 4/6/25 01:00, Andres Freund wrote:
>>> On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
>>>> I think one thing that the docs should mention is that calling the numa
>>>> functions/views will force the pages to be allocated, even if they're
>>>> currently unused.
>>>>
>>>> Newly started server, with s_b of 32GB and 2MB huge pages:
>>>>
>>>> grep ^Huge /proc/meminfo
>>>> HugePages_Total:   34802
>>>> HugePages_Free:    34448
>>>> HugePages_Rsvd:    16437
>>>> HugePages_Surp:        0
>>>> Hugepagesize:       2048 kB
>>>> Hugetlb:        76517376 kB
>>>>
>>>> run
>>>> SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
>>>>
>>>> Now the pages that previously were marked as reserved are actually
>>>> allocated:
>>>>
>>>> grep ^Huge /proc/meminfo
>>>> HugePages_Total:   34802
>>>> HugePages_Free:    18012
>>>> HugePages_Rsvd:        1
>>>> HugePages_Surp:        0
>>>> Hugepagesize:       2048 kB
>>>> Hugetlb:        76517376 kB
>>>>
>>>> I don't see how we can avoid that right now, but at the very least we
>>>> ought to document it.
>>>
>>> The only allocation where that really matters is shared_buffers. I wonder
>>> if we could special case the logic for that, by only probing if at least
>>> one of the buffers in the range is valid.
>>>
>>> Then we could treat a page status of -ENOENT as "page is not mapped" and
>>> display NULL for the node_id?
>>>
>>> Of course that would mean that we'd always need to
>>> pg_numa_touch_mem_if_required(), not just the first time round, because we
>>> previously might not have for a page that is now valid. But compared to
>>> the cost of actually allocating pages, the cost for that seems small.
>>
>> I don't think this would be a good trade off. The buffers already have a
>> NUMA node, and users would be interested in that.
>
> The thing is that the buffer might *NOT* have a numa node. That's e.g. the
> case in the above example - otherwise we wouldn't initially have seen the
> large HugePages_Rsvd.
>
> Forcing all those pages to be allocated via pg_numa_touch_mem_if_required()
> itself wouldn't be too bad - in fact I'd rather like to have an explicit way
> of doing that. The problem is that that leads to all those allocations
> happening on the *current* numa node (unless you have started postgres with
> numactl --interleave=all or such), rather than the node where the normal
> first use would have allocated it.

I agree, forcing those allocations to happen on a single node seems rather
unfortunate. But really, how likely is it that someone will run this function
on a cluster that hasn't already allocated this memory? I'm not saying it
can't happen, but we already have this issue if you start and do a warmup
from a single connection ...

>> It's just that we don't have the memory mapped in the current backend, so
>> I'd bet people would not be happy with NULL, and would proceed to force the
>> allocation in some other way (say, a large query of some sort). Which
>> obviously causes a lot of other problems.
>
> I don't think that really would be the case with what I proposed? If any
> buffer in the region were valid, we would force the allocation to become
> known to the current backend.

It's not quite clear to me what exactly you're proposing :-( I believe
you're referring to this:

> The only allocation where that really matters is shared_buffers. I wonder
> if we could special case the logic for that, by only probing if at least
> one of the buffers in the range is valid.
>
> Then we could treat a page status of -ENOENT as "page is not mapped" and
> display NULL for the node_id?
>
> Of course that would mean that we'd always need to
> pg_numa_touch_mem_if_required(), not just the first time round, because we
> previously might not have for a page that is now valid. But compared to the
> cost of actually allocating pages, the cost for that seems small.
I suppose by "range" you mean buffers on a given memory page, and "valid"
means BufferIsValid. Yeah, that probably means the memory page is allocated.
But if the buffer is invalid, that does not mean the memory is not allocated,
right? So does that make the buffer not interesting?

I'd find this ambiguity rather confusing, i.e. we'd never know whether NULL
means "invalid buffer" or "not allocated". Maybe we should simply return rows
only for valid buffers, to make it explicit that we say nothing about NUMA
nodes for the invalid ones.

I think we need to decide whether the current patches are good enough for
PG18, with the current behavior, and then maybe improve that in PG19. Or
whether this is so serious we have to leave all of it for PG19. I'd go with
the former, but perhaps I'm wrong. I don't want to be reworking this less
than a day before the feature freeze.

Attached is v27, which I planned to push, but I'll hold off.

regards

--
Tomas Vondra