Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Draft for basic NUMA observability
Date
Msg-id y4zhgypa4vt3txf22yzvkfe2m4rgrph25ms6ax2ukduwcl43u3@dosysiprwsha
In response to Re: Draft for basic NUMA observability  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
Hi,

On 2025-04-06 13:56:54 +0200, Tomas Vondra wrote:
> On 4/6/25 01:00, Andres Freund wrote:
> > On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
> >> I think one thing that the docs should mention is that calling the numa
> >> functions/views will force the pages to be allocated, even if they're
> >> currently unused.
> >>
> >> Newly started server, with s_b of 32GB and 2MB huge pages:
> >>
> >>   grep ^Huge /proc/meminfo
> >>   HugePages_Total:   34802
> >>   HugePages_Free:    34448
> >>   HugePages_Rsvd:    16437
> >>   HugePages_Surp:        0
> >>   Hugepagesize:       2048 kB
> >>   Hugetlb:        76517376 kB
> >>
> >> run
> >>   SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
> >>
> >> Now the pages that previously were marked as reserved are actually allocated:
> >>
> >>   grep ^Huge /proc/meminfo
> >>   HugePages_Total:   34802
> >>   HugePages_Free:    18012
> >>   HugePages_Rsvd:        1
> >>   HugePages_Surp:        0
> >>   Hugepagesize:       2048 kB
> >>   Hugetlb:        76517376 kB
> >>
> >>
> >> I don't see how we can avoid that right now, but at the very least we ought to
> >> document it.
> > 
> > The only allocation where that really matters is shared_buffers. I wonder if
> > we could special case the logic for that, by only probing if at least one of
> > the buffers in the range is valid.
> > 
> > Then we could treat a page status of -ENOENT as "page is not mapped" and
> > display NULL for the node_id?
> > 
> > Of course that would mean that we'd always need to
> > pg_numa_touch_mem_if_required(), not just the first time round, because we
> > previously might not have for a page that is now valid.  But compared to the
> > cost of actually allocating pages, the cost for that seems small.
> > 
> 
> I don't think this would be a good trade off. The buffers already have a
> NUMA node, and users would be interested in that.

The thing is that the buffer might *NOT* have a numa node.  That's e.g. the
case in the above example - otherwise we wouldn't initially have seen the
large HugePages_Rsvd.

Forcing all those pages to be allocated via pg_numa_touch_mem_if_required()
itself wouldn't be too bad - in fact I'd rather like to have an explicit way
of doing that.  The problem is that it leads to all those allocations
happening on the *current* numa node (unless you have started postgres with
numactl --interleave=all or such), rather than on the node where the normal
first use would have allocated them.


> It's just that we don't have the memory mapped in the current backend, so
> I'd bet people would not be happy with NULL, and would proceed to force the
> allocation in some other way (say, a large query of some sort). Which
> obviously causes a lot of other problems.

I don't think that really would be the case with what I proposed? If any
buffer in the region were valid, we would force the allocation to become known
to the current backend.

Greetings,

Andres Freund


