Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Draft for basic NUMA observability
Date
Msg-id c4693ec8-161e-465d-9ae2-c6d40b153ae1@vondra.me
In response to Re: Draft for basic NUMA observability  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List pgsql-hackers
On 4/6/25 14:57, Jakub Wartak wrote:
> On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> Hi,
> 
> Hi Andres/Tomas,
> 
> I noticed that Tomas responded to this while I was writing this, so I'm
> attaching git-am patches based on his v25 (no squash). There's only one
> new patch (the last one contains fixes based on this review), plus a
> slight commit-message amendment to 0001.
> 

I'm not working on this at the moment. I may have a bit of time in the
evening, but more likely I'll get back to this on Monday.

>> I just played around with this for a bit.  As noted somewhere further down,
>> pg_buffercache_numa.page_num ends up wonky in different ways for the different
>> pages.
> 
> I think page_num is still under heavy work in progress... I'm still not
> sure it's worth exposing this (is it worth the hassle?). If we scratch
> it, the view won't be perfect, but we'll have everything else; otherwise
> we risk the whole feature, as we are going into feature freeze literally
> tomorrow.
> 

IMHO it's not difficult to change the definition of page_num this way;
it's pretty much a one-line change. It's more a question of whether we
actually want to expose this.

[snip]

>>> +
>>> +             if (firstNumaTouch)
>>> +                     elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
>>
>> Over the patchseries the related code is duplicated. Seems like it'd be good
>> to put it into pg_numa instead?  This seems like the thing that's good to
>> abstract away in one central spot.
> 
> I hear you, but we are using those statistics for per-localized-code
> (shm.c's firstTouch != pgbuffercache.c's firstTouch).
> 

Yeah, I don't think moving this is quite possible / useful. We could pass
the flag somewhere, but we'd still need to check & update it in local code.

>>> +                     /*
>>> +                      * If we have multiple OS pages per buffer, fill those in too. We
>>> +                      * always want at least one OS page, even if there are multiple
>>> +                      * buffers per page.
>>> +                      *
>>> +                      * Although we could query just once per OS page, we do it
>>> +                      * repeatedly for each buffer and hit the same address, as
>>> +                      * move_pages(2) requires page alignment. This also simplifies
>>> +                      * retrieval code later on. Also, NBuffers starts from 1.
>>> +                      */
>>> +                     for (j = 0; j < Max(1, pages_per_buffer); j++)
>>> +                     {
>>> +                             char       *buffptr = (char *) BufferGetBlock(i + 1);
>>> +
>>> +                             fctx->record[idx].bufferid = bufferid;
>>> +                             fctx->record[idx].numa_page = j;
>>> +
>>> +                             os_page_ptrs[idx]
>>> +                                     = (char *) TYPEALIGN(os_page_size,
>>> +                                                                              buffptr + (os_page_size * j));
>>
>> FWIW, this bit here is the most expensive part of the function itself, as the
>> compiler has no choice than to implement it as an actual division, as
>> os_page_size is runtime variable.
>>
>> It'd be fine to leave it like that, the call to numa_move_pages() is way more
>> expensive. But it shouldn't be too hard to do that alignment once, rather than
>> having to do it over and over.
> 
> TBH, I don't think we should spend a lot of time optimizing; after all,
> it's a debugging view (at the start I was actually considering making
> it a developer-only compile-time option, but with the shm view it is
> actually usable for others too and well... we want to have it as a
> foundation for real NUMA optimizations)
> 

I agree with this.

>> FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
>> works quite right for me.
>>
>> This definition afaict is always 0 when using huge pages and just 0 and 1 for
>> 4k pages.  But my understanding of numa_page is that it's the "id" of the numa
>> pages, which isn't that?
>>
>> With "adjust page_num" I get a number that starts at -1 and then increments
>> from there. More correct, but doesn't quite seem right either.
> 
> Apparently handling this edge case of split buffers was a Pandora's
> box ;) Tomas, what do we do about it? Does it have a chance to get in
> before the freeze?
> 

I don't think the split buffers are a Pandora's box on their own; it's
more that they made us notice other issues / questions. I don't think
handling them is particularly complex - the most difficult part seems to
be figuring out how many rows we'll return, and mapping them to pages.
But that's not very difficult, IMO.

The bigger question is whether it's safe to do the TYPEALIGN_DOWN(),
which may return a pointer from before the first buffer. But I guess
that's OK, thanks to how memory is allocated - at least, that's what all
the move_pages() examples I found do, so unless those are all broken,
that's OK.

>>> +    <tbody>
>>> +     <row>
>>> +      <entry role="catalog_table_entry"><para role="column_definition">
>>> +       <structfield>bufferid</structfield> <type>integer</type>
>>> +      </para>
>>> +      <para>
>>> +       ID, in the range 1..<varname>shared_buffers</varname>
>>> +      </para></entry>
>>> +     </row>
>>> +
>>> +     <row>
>>> +      <entry role="catalog_table_entry"><para role="column_definition">
>>> +       <structfield>page_num</structfield> <type>int</type>
>>> +      </para>
>>> +      <para>
>>> +       number of OS memory page for this buffer
>>> +      </para></entry>
>>> +     </row>
>>
>> "page_num" doesn't really seem very descriptive for "number of OS memory page
>> for this buffer".  To me "number of" sounds like it's counting the number of
>> associated pages, but it's really just a "page id" or something like that.
>>
>> Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
> 
> Tomas, are you good with the rename? I think I would also prefer os_page_id.
> 

Fine with me.

>>> diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
> [..]
>>> +  <para>
>>> +   The <structname>pg_shmem_allocations_numa</structname> view shows how shared
>>> +   memory allocations in the server's main shared memory segment are distributed
>>> +   across NUMA nodes. This includes both memory allocated by
>>> +   <productname>PostgreSQL</productname> itself and memory allocated
>>> +   by extensions using the mechanisms detailed in
>>> +   <xref linkend="xfunc-shared-addin" />.
>>> +  </para>
>>
>> I think it'd be good to describe that the view will include multiple rows for
>> each name if spread across multiple numa nodes.
> 
> Added.
> 
>> Perhaps also that querying this view is expensive and that
>> pg_shmem_allocations should be used if numa information isn't needed?
> 
> Already covered by 1st finding fix.
> 
>>> +     /*
>>> +      * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
>>> +      * the OS may have different memory page sizes.
>>> +      *
>>> +      * To correctly map between them, we need to: 1. Determine the OS memory
>>> +      * page size 2. Calculate how many OS pages are used by all buffer blocks
>>> +      * 3. Calculate how many OS pages are contained within each database
>>> +      * block.
>>> +      *
>>> +      * This information is needed before calling move_pages() for NUMA memory
>>> +      * node inquiry.
>>> +      */
>>> +     os_page_size = pg_numa_get_pagesize();
>>> +
>>> +     /*
>>> +      * Allocate memory for page pointers and status based on total shared
>>> +      * memory size. This simplified approach allocates enough space for all
>>> +      * pages in shared memory rather than calculating the exact requirements
>>> +      * for each segment.
>>> +      *
>>> +      * XXX Isn't this wasteful? But there probably is one large segment of
>>> +      * shared memory, much larger than the rest anyway.
>>> +      */
>>> +     shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
>>> +     page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
>>> +     pages_status = palloc(sizeof(int) * shm_total_page_count);
>>
>> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
>> could more be moved to a shared helper function?
> 
> -> Tomas?
> 

I'm not against that in principle, but when I tried, it didn't quite help
that much. But maybe it's better with the current patch version.

>>> +     hash_seq_init(&hstat, ShmemIndex);
>>> +
>>> +     /* output all allocated entries */
>>> +     memset(nulls, 0, sizeof(nulls));
>>> +     while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
>>> +     {
>>
>> One thing that's interesting with using ShmemIndex is that this won't get
>> anonymous allocations. I think that's ok for now, it'd be complicated to
>> figure out the unmapped "regions".  But I guess it'd be good to mention it in
>> the docs?
> 
> Added.
> 
>>> +             int                     i;
>>> +
>>> +             /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
>>> +              * It's a bit misleading to call that "aligned", no? */
>>> +
>>> +             /* Get number of OS aligned pages */
>>> +             shm_ent_page_count
>>> +                     = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
>>> +
>>> +             /*
>>> +              * If we ever get 0xff back from the kernel inquiry, then we
>>> +              * probably have a bug in our buffer-to-OS-page mapping code.
>>> +              */
>>> +             memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
>>
>> There's obviously no guarantee that shm_ent_page_count is a multiple of
>> os_page_size.  I think it'd be interesting to show in the view when one shmem
>> allocation shares a page with the prior allocation - that can contribute a bit
>> to contention.  What about showing a start_os_page_id and end_os_page_id or
>> something?  That could be a feature for later though.
> 
> I was thinking about it, but it could be done later, when analyzing this
> together with data from pg_shmem_allocations(?). My worry is timing :(
> Anyway, we could extend this view in future revisions.
> 

I'd leave this out for now. It's not difficult, but let's focus on the
other issues.

>>> +SELECT NOT(pg_numa_available()) AS skip_test \gset
>>> +\if :skip_test
>>> +\quit
>>> +\endif
>>> +-- switch to superuser
>>> +\c -
>>> +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
>>> + ok
>>> +----
>>> + t
>>> +(1 row)
>>
>> Could it be worthwhile to run the test if !pg_numa_available(), to test that
>> we do the right thing in that case? We need an alternative output anyway, so
>> that might be fine?
> 
> Added. The meson test passes, but I'm sending it as fast as possible
> to avoid a clash with Tomas.
> 

Please keep working on this. I may have a bit of time in the evening,
but in the worst case I'll merge it into your patch.


regards

-- 
Tomas Vondra



