Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Draft for basic NUMA observability
Date
Msg-id q4f4v2xnf7anfl5ucjm5dbqtnaspsq5z2ajndag35ed75avt5e@eqpqm4a5ap7q
Whole thread Raw
In response to Re: Draft for basic NUMA observability  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
Hi,

On 2025-04-06 13:51:34 +0200, Tomas Vondra wrote:
> On 4/6/25 00:29, Andres Freund wrote:
> >> +
> >> +        if (firstNumaTouch)
> >> +            elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
> > 
> > Over the patchseries the related code is duplicated. Seems like it'd be good
> > to put it into pg_numa instead?  This seems like the thing that's good to
> > abstract away in one central spot.
> > 
> 
> Abstract away which part, exactly? I thought about moving some of the
> code to port/pg_numa.c, but it didn't seem worth it.

The easiest would be to just have a single function that does this for the
whole shared memory allocation, without having to integrate it with the
per-allocation or per-buffer loop.


> > 
> >> +            /*
> >> +             * If we have multiple OS pages per buffer, fill those in too. We
> >> +             * always want at least one OS page, even if there are multiple
> >> +             * buffers per page.
> >> +             *
> >> +             * Altough we could query just once per each OS page, we do it
> >> +             * repeatably for each Buffer and hit the same address as
> >> +             * move_pages(2) requires page aligment. This also simplifies
> >> +             * retrieval code later on. Also NBuffers starts from 1.
> >> +             */
> >> +            for (j = 0; j < Max(1, pages_per_buffer); j++)
> >> +            {
> >> +                char       *buffptr = (char *) BufferGetBlock(i + 1);
> >> +
> >> +                fctx->record[idx].bufferid = bufferid;
> >> +                fctx->record[idx].numa_page = j;
> >> +
> >> +                os_page_ptrs[idx]
> >> +                    = (char *) TYPEALIGN(os_page_size,
> >> +                                         buffptr + (os_page_size * j));
> > 
> > FWIW, this bit here is the most expensive part of the function itself, as the
> > compiler has no choice than to implement it as an actual division, as
> > os_page_size is runtime variable.
> > 
> > It'd be fine to leave it like that, the call to numa_move_pages() is way more
> > expensive. But it shouldn't be too hard to do that alignment once, rather than
> > having to do it over and over.
> > 
> 
> Division? It's entirely possible I'm missing something obvious, but I
> don't see any divisions in this code.

Oops. The division was only added in a subsequent patch, not the quoted
code. At the time it was:

+                               /* calculate ID of the OS memory page */
+                               fctx->record[idx].numa_page
+                                       = ((char *) os_page_ptrs[idx] - startptr) / os_page_size;


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Logging which local address was connected to in log_line_prefix
Next
From: Rahila Syed
Date:
Subject: Re: Enhancing Memory Context Statistics Reporting