Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: Draft for basic NUMA observability
Date
Msg-id CAKZiRmxQBcMM0e98e79OpK7_XxoLBmXfMYys-gC-iED9N+wMrQ@mail.gmail.com
Whole thread Raw
In response to Re: Draft for basic NUMA observability  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Draft for basic NUMA observability
List pgsql-hackers
On Sun, Apr 6, 2025 at 3:52 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 4/6/25 14:57, Jakub Wartak wrote:
> > On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote:
> >>
> >> Hi,
> >
> > Hi Andres/Tomas,
> >
> > I've noticed that Tomas responded to this while writing this, so I'm
> > attaching git-am patches based on his v25 (no squash) and there's only
> > one new (last one contains fixes based on this review) + slight commit
> > amendment to 0001.
> >
>
> I'm not working on this at the moment. I may have a bit of time in the
> evening, but more likely I'll get back to this on Monday.

OK, tried to fix all outstanding issues (except maybe some tiny code
refactors for beatification)

> >> I just played around with this for a bit.  As noted somewhere further down,
> >> pg_buffercache_numa.page_num ends up wonky in different ways for the different
> >> pages.
> >
> > I think page_num is under heavy work in progress... I'm still not sure
> > is it worth exposing this (is it worth the hassle). If we scratch it
> > it won't be perfect, but we have everything , otherwise we risk this
> > feature as we are going into a feature freeze literally tomorrow.
> >
>
> IMHO it's not difficult to change the definition of page_num this way,
> it's pretty much a one line change. It's more a question of whether we
> actually want to expose this.

Bertrand noticed this first in
https://www.postgresql.org/message-id/Z/FhOOCmTxuB2h0b%40ip-10-97-1-34.eu-west-3.compute.internal
:

-               startptr = (char *) BufferGetBlock(1);
+               startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char
*) BufferGetBlock(1));

With the above I'm also not getting wonky (-1) results anymore. The
rest of reply assumes we are using this.

> [snip]
>
> >>> +
> >>> +             if (firstNumaTouch)
> >>> +                     elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
> >>
> >> Over the patchseries the related code is duplicated. Seems like it'd be good
> >> to put it into pg_numa instead?  This seems like the thing that's good to
> >> abstract away in one central spot.
> >
> > I hear you, but we are using those statistics for per-localized-code
> > (shm.c's firstTouch != pgbuffercache.c's firstTouch).
> >
>
> Yeah, I don't moving this is quite possible / useful. We could pass the
> flag somewhere, but we still need to check & update it in local code.

No idea how such code should be looking, but it would be more code?
Are you thinking about some code still touching local &firstNumaTouch
?

> >> FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
> >> works quite right for me.
> >>
> >> This definition afaict is always 0 when using huge pages and just 0 and 1 for
> >> 4k pages.  But my understanding of numa_page is that it's the "id" of the numa
> >> pages, which isn't that?
> >>
> >> With "adjust page_num" I get a number that starts at -1 and then increments
> >> from there. More correct, but doesn't quite seem right either.
> >
> > Apparently handling this special case of splitted buffers edge-cases
> > was Pandora box ;) Tomas what do we do about it? Does that has chance
> > to get in before freeze?
> >
>
> I don't think the split buffers are pandora box on their own, it's more
> that it made us notice other issues / questions. I don't think handling
> it is particularly complex - the most difficult part seems to be
> figuring out how many rows we'll return, and mapping them to pages.
> But that's not very difficult, IMO.

OK, with a fresh week, and fresh mind and a different name (ospageid)
it looks better to me now.

> The bigger question is whether it's safe to do the TYPEALIGN_DOWN(),
> which may return a pointer from before the first buffer. But I guess
> that's OK, thanks to how memory is allocated - at least, that's what all
> the move_pages() examples I found do, so unless those are all broken,
> that's OK.

I agree. This took some more time, but in case of

a) pg_buffercache_numa and HP=off view we shouldn't access ptr below
buffercache, because my understanding is that shm memory would be page
aligned anyway as per BufferManagerShmemInit() which uses
TYPEALGIN(PG_IO_ALIGN_SIZE) for it anyway. So when you query
pg_buffercache_numa, one gets the following pages:

# strace -fp 14364 -e move_pages
strace: Process 14364 attached
move_pages(0, 32768, [0x7f8bfe4b5000, 0x7f8bfe4b6000, 0x7f8bfe4b7000,...

(gdb) print BufferBlocks
$1 = 0x7f8bfe4b5000 ""

while BufferBlocks actually starts there, so we are good without HP.
With HP it's move_pages(0, 16384, [0x7ff33c800000, ... vs (gdb)  print
BufferBlocks
=> $1 = 0x7ff33c879000
so we are actually accessing earlier pointer (!), but Buffer Blocks is
like 15-th structure there (and there's always going be something
earlier to my understanding):

select row_number() over (order by off),* from pg_shmem_allocations
order by off asc limit 15;
 row_number |          name          |   off   |   size    | allocated_size
------------+------------------------+---------+-----------+----------------
[..]
         13 | Shared MultiXact State | 5735936 |      1148 |           1152
         14 | Buffer Descriptors     | 5737088 |   1048576 |        1048576
         15 | Buffer Blocks          | 6785664 | 134221824 |      134221824

To me this is finding in itself, with HP shared_buffers being located
on page with something else...

b) in pg_shmem_allocations_numa view without HP we don't even get to
as low as ShmemBase for move_pages(), but with HP=on we hit as low as
ShmemBase but not lower, so we are good IMHO.

Wouldn't buildfarm actually tell us this if that's bad as an insurance policy?

> >>> +    <tbody>
> >>> +     <row>
> >>> +      <entry role="catalog_table_entry"><para role="column_definition">
> >>> +       <structfield>bufferid</structfield> <type>integer</type>
> >>> +      </para>
> >>> +      <para>
> >>> +       ID, in the range 1..<varname>shared_buffers</varname>
> >>> +      </para></entry>
> >>> +     </row>
> >>> +
> >>> +     <row>
> >>> +      <entry role="catalog_table_entry"><para role="column_definition">
> >>> +       <structfield>page_num</structfield> <type>int</type>
> >>> +      </para>
> >>> +      <para>
> >>> +       number of OS memory page for this buffer
> >>> +      </para></entry>
> >>> +     </row>
> >>
> >> "page_num" doesn't really seem very descriptive for "number of OS memory page
> >> for this buffer".  To me "number of" sounds like it's counting the number of
> >> associated pages, but it's really just a "page id" or something like that.
> >>
> >> Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
> >
> > Tomas, are you good with rename? I think I've would also prefer os_page_id.
> >
>
> Fine with me.

Done, s/page_num/ospageid/g as whole pg_buffercache does not use "_"
anywhere, so let's stick to that. Done that for nodeid too.

> >>> +     /*
> >>> +      * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
> >>> +      * the OS may have different memory page sizes.
> >>> +      *
> >>> +      * To correctly map between them, we need to: 1. Determine the OS memory
> >>> +      * page size 2. Calculate how many OS pages are used by all buffer blocks
> >>> +      * 3. Calculate how many OS pages are contained within each database
> >>> +      * block.
> >>> +      *
> >>> +      * This information is needed before calling move_pages() for NUMA memory
> >>> +      * node inquiry.
> >>> +      */
> >>> +     os_page_size = pg_numa_get_pagesize();
> >>> +
> >>> +     /*
> >>> +      * Allocate memory for page pointers and status based on total shared
> >>> +      * memory size. This simplified approach allocates enough space for all
> >>> +      * pages in shared memory rather than calculating the exact requirements
> >>> +      * for each segment.
> >>> +      *
> >>> +      * XXX Isn't this wasteful? But there probably is one large segment of
> >>> +      * shared memory, much larger than the rest anyway.
> >>> +      */
> >>> +     shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
> >>> +     page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
> >>> +     pages_status = palloc(sizeof(int) * shm_total_page_count);
> >>
> >> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
> >> could more be moved to a shared helper function?
> >
> > -> Tomas?
> >
>
> I'm not against that in principle, but when I tried it didn't quite help
> that much. But maybe it's better with the current patch version.
>
> >>> +     hash_seq_init(&hstat, ShmemIndex);
> >>> +
> >>> +     /* output all allocated entries */
> >>> +     memset(nulls, 0, sizeof(nulls));
> >>> +     while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
> >>> +     {
> >>
> >> One thing that's interesting with using ShmemIndex is that this won't get
> >> anonymous allocations. I think that's ok for now, it'd be complicated to
> >> figure out the unmapped "regions".  But I guess it' be good to mention it in
> >> the ocs?
> >
> > Added.
> >

BTW: it's also in the function comment too. now.

> > [..], but I'm sending it as fast as possible
> > to avoid a clash with Tomas.
> >
>
> Please keep working on this. I may hava a bit of time in the evening,
> but in the worst case I'll merge it into your patch.

So, attached is still patchset v25, still with one-off patches (please
git am most, but just `patch -p1 < file` last two and just squash it
if you are happy. I've intentionally not squash it to provide
changelog). This LGTM.

-J.

Attachment

pgsql-hackers by date:

Previous
From: Richard Guo
Date:
Subject: An incorrect check in get_memoize_path
Next
From: Junwang Zhao
Date:
Subject: Re: SQL Property Graph Queries (SQL/PGQ)