Hi,
On 2025-04-06 13:51:34 +0200, Tomas Vondra wrote:
> On 4/6/25 00:29, Andres Freund wrote:
> >> +
> >> + if (firstNumaTouch)
> >> + elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
> >
> > Over the patchseries the related code is duplicated. Seems like it'd be good
> > to put it into pg_numa instead? This seems like the thing that's good to
> > abstract away in one central spot.
> >
>
> Abstract away which part, exactly? I thought about moving some of the
> code to port/pg_numa.c, but it didn't seem worth it.
The easiest would be to just have a single function that does this for the
whole shared memory allocation, without having to integrate it with the
per-allocation or per-buffer loop.
> >
> >> + /*
> >> + * If we have multiple OS pages per buffer, fill those in too. We
> >> + * always want at least one OS page, even if there are multiple
> >> + * buffers per page.
> >> + *
> >> + * Altough we could query just once per each OS page, we do it
> >> + * repeatably for each Buffer and hit the same address as
> >> + * move_pages(2) requires page aligment. This also simplifies
> >> + * retrieval code later on. Also NBuffers starts from 1.
> >> + */
> >> + for (j = 0; j < Max(1, pages_per_buffer); j++)
> >> + {
> >> + char *buffptr = (char *) BufferGetBlock(i + 1);
> >> +
> >> + fctx->record[idx].bufferid = bufferid;
> >> + fctx->record[idx].numa_page = j;
> >> +
> >> + os_page_ptrs[idx]
> >> + = (char *) TYPEALIGN(os_page_size,
> >> + buffptr + (os_page_size * j));
> >
> > FWIW, this bit here is the most expensive part of the function itself, as the
> > compiler has no choice than to implement it as an actual division, as
> > os_page_size is runtime variable.
> >
> > It'd be fine to leave it like that, the call to numa_move_pages() is way more
> > expensive. But it shouldn't be too hard to do that alignment once, rather than
> > having to do it over and over.
> >
>
> Division? It's entirely possible I'm missing something obvious, but I
> don't see any divisions in this code.
Oops. The division was only added in a subsequent patch, not the quoted
code. At the time it was:
+ /* calculate ID of the OS memory page */
+ fctx->record[idx].numa_page
+ = ((char *) os_page_ptrs[idx] - startptr) / os_page_size;
Greetings,
Andres Freund