
From Tomas Vondra
Subject Re: Draft for basic NUMA observability
Date
Msg-id 08392b46-4d82-42f5-bf29-a3aa54688650@vondra.me
In response to Re: Draft for basic NUMA observability  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 4/6/25 00:29, Andres Freund wrote:
> Hi,
> 
> I just played around with this for a bit.  As noted somewhere further down,
> pg_buffercache_numa.page_num ends up wonky in different ways for the different
> pages.
> 
> I think one thing that the docs should mention is that calling the numa
> functions/views will force the pages to be allocated, even if they're
> currently unused.
> 
> Newly started server, with s_b of 32GB and 2MB huge pages:
> 
>   grep ^Huge /proc/meminfo
>   HugePages_Total:   34802
>   HugePages_Free:    34448
>   HugePages_Rsvd:    16437
>   HugePages_Surp:        0
>   Hugepagesize:       2048 kB
>   Hugetlb:        76517376 kB
> 
> run
>   SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
> 
> Now the pages that previously were marked as reserved are actually allocated:
> 
>   grep ^Huge /proc/meminfo
>   HugePages_Total:   34802
>   HugePages_Free:    18012
>   HugePages_Rsvd:        1
>   HugePages_Surp:        0
>   Hugepagesize:       2048 kB
>   Hugetlb:        76517376 kB
> 
> 
> I don't see how we can avoid that right now, but at the very least we ought to
> document it.
> 

+1 to documenting this

> 
> On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote:
>> The libnuma library is not available on 32-bit builds (there's no shared
>> object for i386), so we disable it in that case. The i386 is very memory
>> limited anyway, even with PAE, so NUMA is mostly irrelevant.
> 
> Hm? libnuma1:i386 installs just fine for me on Debian and contains the shared
> library.
> 
> 
>> +###############################################################
>> +# Library: libnuma
>> +###############################################################
>> +
>> +libnumaopt = get_option('libnuma')
>> +if not libnumaopt.disabled()
>> +  # via pkg-config
>> +  libnuma = dependency('numa', required: libnumaopt)
>> +  if not libnuma.found()
>> +    libnuma = cc.find_library('numa', required: libnumaopt)
>> +  endif
> 
> This fallback isn't going to work if -dlibnuma=enabled is used, as
> dependency() will error out, due to not finding a required dependency. You'd
> need to use required: false there.
> 
> Do we actually need a non-dependency() fallback here?  It's linux only and a
> new dependency, so just requiring a .pc file seems like it'd be fine?
> 

No idea.

> 
>> +#ifdef USE_LIBNUMA
>> +
>> +/*
>> + * This is required on Linux, before pg_numa_query_pages() as we
>> + * need to page-fault before move_pages(2) syscall returns valid results.
>> + */
>> +#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
>> +    ro_volatile_var = *(uint64 *) ptr
> 
> Does it really work that way? A volatile variable to assign the result of
> dereferencing ptr ensures that the *store* isn't removed by the compiler, but
> it doesn't at all guarantee that the *load* isn't removed, since that memory
> isn't marked as volatile.
> 
> I think you'd need to cast the source pointer to a volatile uint64* to ensure
> the load happens.
> 
> 
>> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
>> +
>> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
>> +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
>> +
>> +-- Register the new functions.
>> +CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
>> +RETURNS SETOF RECORD
>> +AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
>> +LANGUAGE C PARALLEL SAFE;
>> +
>> +-- Create a view for convenient access.
>> +CREATE OR REPLACE VIEW pg_buffercache_numa AS
>> +    SELECT P.* FROM pg_buffercache_numa_pages() AS P
>> +    (bufferid integer, page_num int4, node_id int4);
> 
> 
> Why CREATE OR REPLACE?
> 

I think this is simply due to copy-pasting the code, a plain CREATE
would be enough here.

> 
>> +Datum
>> +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
>> +{
>> ...
>> +
>> +        /*
>> +         * To smoothly support upgrades from version 1.0 of this extension
>> +         * transparently handle the (non-)existence of the pinning_backends
>> +         * column. We unfortunately have to get the result type for that... -
>> +         * we can't use the result type determined by the function definition
>> +         * without potentially crashing when somebody uses the old (or even
>> +         * wrong) function definition though.
>> +         */
>> +        if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
>> +            elog(ERROR, "return type must be a row type");
>> +
> 
> Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view?
> 

Yes, good catch.

> 
>> +
>> +        if (firstNumaTouch)
>> +            elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
> 
> Over the patchseries the related code is duplicated. Seems like it'd be good
> to put it into pg_numa instead?  This seems like the thing that's good to
> abstract away in one central spot.
> 

Abstract away which part, exactly? I thought about moving some of the
code to port/pg_numa.c, but it didn't seem worth it.

> 
>> +        /*
>> +         * Scan through all the buffers, saving the relevant fields in the
>> +         * fctx->record structure.
>> +         *
>> +         * We don't hold the partition locks, so we don't get a consistent
>> +         * snapshot across all buffers, but we do grab the buffer header
>> +         * locks, so the information of each buffer is self-consistent.
>> +         *
>> +         * This loop touches and stores addresses into os_page_ptrs[] as input
>> +         * to one big big move_pages(2) inquiry system call. Basically we ask
>> +         * for all memory pages for NBuffers.
>> +         */
>> +        idx = 0;
>> +        for (i = 0; i < NBuffers; i++)
>> +        {
>> +            BufferDesc *bufHdr;
>> +            uint32        buf_state;
>> +            uint32        bufferid;
>> +
>> +            CHECK_FOR_INTERRUPTS();
>> +
>> +            bufHdr = GetBufferDescriptor(i);
>> +
>> +            /* Lock each buffer header before inspecting. */
>> +            buf_state = LockBufHdr(bufHdr);
>> +            bufferid = BufferDescriptorGetBuffer(bufHdr);
>> +
>> +            UnlockBufHdr(bufHdr, buf_state);
> 
> Given that the only thing you're getting here is the buffer id, it's probably
> not worth getting the spinlock, a single 4 byte read is always atomic on our
> platforms.
> 

Fine with me. I don't expect this to make a measurable difference so I
kept the spinlock, but if we want to remove it, I won't object.

> 
>> +            /*
>> +             * If we have multiple OS pages per buffer, fill those in too. We
>> +             * always want at least one OS page, even if there are multiple
>> +             * buffers per page.
>> +             *
>> +             * Although we could query just once per OS page, we do it
>> +             * repeatedly for each buffer and hit the same address, as
>> +             * move_pages(2) requires page alignment. This also simplifies
>> +             * retrieval code later on. Also NBuffers starts from 1.
>> +             */
>> +            for (j = 0; j < Max(1, pages_per_buffer); j++)
>> +            {
>> +                char       *buffptr = (char *) BufferGetBlock(i + 1);
>> +
>> +                fctx->record[idx].bufferid = bufferid;
>> +                fctx->record[idx].numa_page = j;
>> +
>> +                os_page_ptrs[idx]
>> +                    = (char *) TYPEALIGN(os_page_size,
>> +                                         buffptr + (os_page_size * j));
> 
> FWIW, this bit here is the most expensive part of the function itself, as the
> compiler has no choice but to implement it as an actual division, since
> os_page_size is a runtime variable.
> 
> It'd be fine to leave it like that, the call to numa_move_pages() is way more
> expensive. But it shouldn't be too hard to do that alignment once, rather than
> having to do it over and over.
> 

Division? It's entirely possible I'm missing something obvious, but I
don't see any divisions in this code. You're right, however, that we
could get rid of most of this: get the buffer pointer once (it's a bit
silly we get it for each page), align that, and then simply add the
page size. Something like this:

    /* align to start of OS page, determine pointer to end of buffer */
    char    *buffptr = (char *) BufferGetBlock(i + 1);
    char    *ptr = (char *) TYPEALIGN_DOWN(os_page_size, buffptr);
    char    *endptr = buffptr + BLCKSZ;

    while (ptr < endptr)
    {
        os_page_ptrs[idx] = ptr;
        ...
        ptr += os_page_size;
    }

This also made me think a bit more about how the blocks and pages might
align / overlap. AFAIK the buffers are aligned to PG_IO_ALIGN_SIZE,
which on x86 is 4K, i.e. the same as the OS page size. But let's say the
OS page size is larger, say 1MB. AFAIK a buffer could then span multiple
of those larger memory pages.

For example, an 8K buffer could start 4K before a 1MB page boundary and
use 4K from the next memory page. This would mean the current formulas
for buffer_per_page and pages_per_buffer can be off by 1.

This would complicate calculating os_page_count a bit, because only some
buffers would actually need the +1 (in the array / view output).

Or what am I missing? Is there something that guarantees this won't happen?

> 
> FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
> works quite right for me.
> 
> This definition afaict is always 0 when using huge pages and just 0 and 1 for
> 4k pages.  But my understanding of numa_page is that it's the "id" of the numa
> pages, which isn't that?
> 
> With "adjust page_num" I get a number that starts at -1 and then increments
> from there. More correct, but doesn't quite seem right either.
> 
>> +    <tbody>
>> +     <row>
>> +      <entry role="catalog_table_entry"><para role="column_definition">
>> +       <structfield>bufferid</structfield> <type>integer</type>
>> +      </para>
>> +      <para>
>> +       ID, in the range 1..<varname>shared_buffers</varname>
>> +      </para></entry>
>> +     </row>
>> +
>> +     <row>
>> +      <entry role="catalog_table_entry"><para role="column_definition">
>> +       <structfield>page_num</structfield> <type>int</type>
>> +      </para>
>> +      <para>
>> +       number of OS memory page for this buffer
>> +      </para></entry>
>> +     </row>
> 
> "page_num" doesn't really seem very descriptive for "number of OS memory page
> for this buffer".  To me "number of" sounds like it's counting the number of
> associated pages, but it's really just a "page id" or something like that.
> 
> Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
> 

Yeah, I didn't update the docs in 0003 when adjusting the page_num
definition. It was more an experiment to see if others like this change.

> 
> 
> 
>> From 03d24af540f8235ad9ca9537db0a1ba5dbcf6ccb Mon Sep 17 00:00:00 2001
>> From: Jakub Wartak <jakub.wartak@enterprisedb.com>
>> Date: Fri, 21 Feb 2025 14:20:18 +0100
>> Subject: [PATCH v25 4/5] Introduce pg_shmem_allocations_numa view
>>
>> Introduce new pg_shmem_allocations_numa view with information about how
>> shared memory is distributed across NUMA nodes.
> 
> Nice.
> 
> 
>> Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
>> Reviewed-by: Andres Freund <andres@anarazel.de>
>> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
>> Reviewed-by: Tomas Vondra <tomas@vondra.me>
>> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
>> ---
>>  doc/src/sgml/system-views.sgml           |  79 ++++++++++++
>>  src/backend/catalog/system_views.sql     |   8 ++
>>  src/backend/storage/ipc/shmem.c          | 152 +++++++++++++++++++++++
>>  src/include/catalog/pg_proc.dat          |   8 ++
>>  src/test/regress/expected/numa.out       |  12 ++
>>  src/test/regress/expected/numa_1.out     |   3 +
>>  src/test/regress/expected/privileges.out |  16 ++-
>>  src/test/regress/expected/rules.out      |   4 +
>>  src/test/regress/parallel_schedule       |   2 +-
>>  src/test/regress/sql/numa.sql            |   9 ++
>>  src/test/regress/sql/privileges.sql      |   6 +-
>>  11 files changed, 294 insertions(+), 5 deletions(-)
>>  create mode 100644 src/test/regress/expected/numa.out
>>  create mode 100644 src/test/regress/expected/numa_1.out
>>  create mode 100644 src/test/regress/sql/numa.sql
>>
>> diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
>> index 4f336ee0adf..a83365ae24a 100644
>> --- a/doc/src/sgml/system-views.sgml
>> +++ b/doc/src/sgml/system-views.sgml
>> @@ -181,6 +181,11 @@
>>        <entry>shared memory allocations</entry>
>>       </row>
>>
>> +     <row>
>> +      <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
>> +      <entry>NUMA node mappings for shared memory allocations</entry>
>> +     </row>
>> +
>>       <row>
>>        <entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
>>        <entry>planner statistics</entry>
>> @@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
>>    </para>
>>   </sect1>
>>
>> + <sect1 id="view-pg-shmem-allocations-numa">
>> +  <title><structname>pg_shmem_allocations_numa</structname></title>
>> +
>> +  <indexterm zone="view-pg-shmem-allocations-numa">
>> +   <primary>pg_shmem_allocations_numa</primary>
>> +  </indexterm>
>> +
>> +  <para>
>> +   The <structname>pg_shmem_allocations_numa</structname> view shows how shared
>> +   memory allocations in the server's main shared memory segment are distributed
>> +   across NUMA nodes. This includes both memory allocated by
>> +   <productname>PostgreSQL</productname> itself and memory allocated
>> +   by extensions using the mechanisms detailed in
>> +   <xref linkend="xfunc-shared-addin" />.
>> +  </para>
> 
> I think it'd be good to describe that the view will include multiple rows for
> each name if spread across multiple numa nodes.
> 
> Perhaps also that querying this view is expensive and that
> pg_shmem_allocations should be used if numa information isn't needed?
> 
> 
>> +    /*
>> +     * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
>> +     * the OS may have different memory page sizes.
>> +     *
>> +     * To correctly map between them, we need to: 1. Determine the OS memory
>> +     * page size 2. Calculate how many OS pages are used by all buffer blocks
>> +     * 3. Calculate how many OS pages are contained within each database
>> +     * block.
>> +     *
>> +     * This information is needed before calling move_pages() for NUMA memory
>> +     * node inquiry.
>> +     */
>> +    os_page_size = pg_numa_get_pagesize();
>> +
>> +    /*
>> +     * Allocate memory for page pointers and status based on total shared
>> +     * memory size. This simplified approach allocates enough space for all
>> +     * pages in shared memory rather than calculating the exact requirements
>> +     * for each segment.
>> +     *
>> +     * XXX Isn't this wasteful? But there probably is one large segment of
>> +     * shared memory, much larger than the rest anyway.
>> +     */
>> +    shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
>> +    page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
>> +    pages_status = palloc(sizeof(int) * shm_total_page_count);
> 
> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
> could more be moved to a shared helper function?
> 
> 
>> +    hash_seq_init(&hstat, ShmemIndex);
>> +
>> +    /* output all allocated entries */
>> +    memset(nulls, 0, sizeof(nulls));
>> +    while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
>> +    {
> 
> One thing that's interesting with using ShmemIndex is that this won't get
> anonymous allocations. I think that's ok for now, it'd be complicated to
> figure out the unmapped "regions".  But I guess it'd be good to mention it in
> the docs?
> 

Agreed.

> 
> 
> 
>> +        int            i;
>> +
>> +        /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
>> +         * It's a bit misleading to call that "aligned", no? */
>> +
>> +        /* Get number of OS aligned pages */
>> +        shm_ent_page_count
>> +            = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
>> +
>> +        /*
>> +         * If we ever get 0xff back from the kernel inquiry, then we probably
>> +         * have a bug in our buffers-to-OS-pages mapping code here.
>> +         */
>> +        memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
> 
> There's obviously no guarantee that shm_ent_page_count is a multiple of
> os_page_size.  I think it'd be interesting to show in the view when one shmem
> allocation shares a page with the prior allocation - that can contribute a bit
> to contention.  What about showing a start_os_page_id and end_os_page_id or
> something?  That could be a feature for later though.
> 

Yeah, adding first/last page might be interesting.

> 
>> +SELECT NOT(pg_numa_available()) AS skip_test \gset
>> +\if :skip_test
>> +\quit
>> +\endif
>> +-- switch to superuser
>> +\c -
>> +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
>> + ok
>> +----
>> + t
>> +(1 row)
> 
> Could it be worthwhile to run the test if !pg_numa_available(), to test that
> we do the right thing in that case? We need an alternative output anyway, so
> that might be fine?
> 

+1


regards

-- 
Tomas Vondra
