Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Draft for basic NUMA observability
Msg-id: ab5059cd-b3a5-40d8-b26c-a0db906b3aef@vondra.me
In response to: Re: Draft for basic NUMA observability (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 4/7/25 17:51, Andres Freund wrote:
> Hi,
> 
> On 2025-04-06 13:56:54 +0200, Tomas Vondra wrote:
>> On 4/6/25 01:00, Andres Freund wrote:
>>> On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
>>>> I think one thing that the docs should mention is that calling the numa
>>>> functions/views will force the pages to be allocated, even if they're
>>>> currently unused.
>>>>
>>>> Newly started server, with s_b of 32GB and 2MB huge pages:
>>>>
>>>>   grep ^Huge /proc/meminfo
>>>>   HugePages_Total:   34802
>>>>   HugePages_Free:    34448
>>>>   HugePages_Rsvd:    16437
>>>>   HugePages_Surp:        0
>>>>   Hugepagesize:       2048 kB
>>>>   Hugetlb:        76517376 kB
>>>>
>>>> run
>>>>   SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
>>>>
>>>> Now the pages that previously were marked as reserved are actually allocated:
>>>>
>>>>   grep ^Huge /proc/meminfo
>>>>   HugePages_Total:   34802
>>>>   HugePages_Free:    18012
>>>>   HugePages_Rsvd:        1
>>>>   HugePages_Surp:        0
>>>>   Hugepagesize:       2048 kB
>>>>   Hugetlb:        76517376 kB
>>>>
>>>>
>>>> I don't see how we can avoid that right now, but at the very least we ought to
>>>> document it.
>>>
>>> The only allocation where that really matters is shared_buffers. I wonder if
>>> we could special case the logic for that, by only probing if at least one of
>>> the buffers in the range is valid.
>>>
>>> Then we could treat a page status of -ENOENT as "page is not mapped" and
>>> display NULL for the node_id?
>>>
>>> Of course that would mean that we'd always need to
>>> pg_numa_touch_mem_if_required(), not just the first time round, because we
>>> previously might not have done so for a page that is now valid.  But
>>> compared to the cost of actually allocating pages, the cost for that seems
>>> small.
>>>
>>
>> I don't think this would be a good trade off. The buffers already have a
>> NUMA node, and users would be interested in that.
> 
> The thing is that the buffer might *NOT* have a numa node.  That's e.g. the
> case in the above example - otherwise we wouldn't initially have seen the
> large HugePages_Rsvd.
> 
> Forcing all those pages to be allocated via pg_numa_touch_mem_if_required()
> itself wouldn't be too bad - in fact I'd rather like to have an explicit way
> of doing that.  The problem is that it leads to all those allocations
> happening on the *current* numa node (unless you have started postgres with
> numactl --interleave=all or such), rather than the node where the normal
> first use would have allocated it.
> 

I agree, forcing those allocations to happen on a single node seems
rather unfortunate. But really, how likely is it that someone will run
this function on a cluster that hasn't already allocated this memory?

I'm not saying it can't happen, but we already have the same issue if you
start the server and then warm it up from a single connection ...
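
For illustration, here's a minimal standalone sketch of the first-touch
behavior that makes this problematic (assuming Linux and move_pages(2)
from <numaif.h>, built with -lnuma; just a demo, not patch code):

  /*
   * touch_demo.c -- show that an untouched anonymous page is reported as
   * -ENOENT by move_pages(), and that touching it allocates it on the
   * NUMA node of the CPU doing the touch (first-touch policy).
   */
  #include <numaif.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int
  main(void)
  {
      size_t  pagesz = (size_t) sysconf(_SC_PAGESIZE);
      char   *mem = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      void   *pages[1] = {mem};
      int     status[1];

      /* before the touch: -ENOENT, the page is not mapped yet */
      move_pages(0, 1, pages, NULL, status, 0);
      printf("before touch: status = %d\n", status[0]);

      /*
       * The touch. The kernel allocates the page on the node of the CPU
       * we happen to run on -- which is what probing every page from a
       * single backend would do to all not-yet-allocated pages.
       */
      mem[0] = 1;

      /* after the touch: status is the NUMA node the page landed on */
      move_pages(0, 1, pages, NULL, status, 0);
      printf("after touch:  status = %d\n", status[0]);

      munmap(mem, pagesz);
      return 0;
  }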

> 
>> It's just that we don't have the memory mapped in the current backend, so
>> I'd bet people would not be happy with NULL, and would proceed to force the
>> allocation in some other way (say, a large query of some sort). Which
>> obviously causes a lot of other problems.
> 
> I don't think that really would be the case with what I proposed? If any
> buffer in the region were valid, we would force the allocation to become known
> to the current backend.
> 

It's not quite clear to me what exactly you are proposing :-(

I believe you're referring to this:

> The only allocation where that really matters is shared_buffers. I wonder if
> we could special case the logic for that, by only probing if at least one of
> the buffers in the range is valid.
> 
> Then we could treat a page status of -ENOENT as "page is not mapped" and
> display NULL for the node_id?
> 
> Of course that would mean that we'd always need to
> pg_numa_touch_mem_if_required(), not just the first time round, because we
> previously might not have done so for a page that is now valid.  But
> compared to the cost of actually allocating pages, the cost for that seems
> small.

I suppose by "range" you mean the buffers on a given memory page, and by
"valid" you mean BufferIsValid. Yeah, a valid buffer probably means the
memory page is allocated. But an invalid buffer does not mean the memory
is not allocated, right? So does that make the buffer uninteresting?

I'd find this ambiguity rather confusing, i.e. we'd never know whether
NULL means just "invalid buffer" or "not allocated". Maybe we should
simply return rows only for valid buffers, to make it explicit that we
say nothing about NUMA nodes for the invalid ones.
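
To spell out how I read the proposal, here's a rough sketch (hypothetical
function and variable names, not code from v27). With nodes == NULL,
move_pages(2) only reports page status and never allocates or migrates
anything, so an unmapped page comes back as -ENOENT and could be shown
as node_id = NULL:

  #include "postgres.h"

  #include <numaif.h>

  /* fill node_id[] / nulls[] for one chunk of shared_buffers pages */
  static void
  probe_chunk(void **pages, int npages, int *node_id, bool *nulls)
  {
      int    *status = palloc(sizeof(int) * npages);

      /* nodes == NULL: query page status only, migrate nothing */
      if (move_pages(0, npages, pages, NULL, status, 0) < 0)
          elog(ERROR, "move_pages() failed: %m");

      for (int i = 0; i < npages; i++)
      {
          if (status[i] == -ENOENT)
              nulls[i] = true;        /* not mapped -> node_id IS NULL */
          else if (status[i] < 0)
              elog(ERROR, "unexpected page status: %d", status[i]);
          else
              node_id[i] = status[i]; /* node the page resides on */
      }

      pfree(status);
  }

But note the NULL here only says "not mapped" -- whether we also emit
rows for invalid buffers is the separate question above.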



I think we need to decide whether the current patches are good enough
for PG18, with the current behavior, and then maybe improve that in
PG19. Or whether this is so serious that all of it has to wait for PG19.
I'd go with the former, but perhaps I'm wrong. I don't want to be
reworking this less than a day before the feature freeze.

Attached is v27, which I planned to push, but I'll hold off.


regards

-- 
Tomas Vondra

