Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Draft for basic NUMA observability
Msg-id: nj3oit4k4fxlbooec3w73c2fombx7ytbye7flmjlwn77yyg727@uumo7we7ysfq
In response to: Re: Draft for basic NUMA observability (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
Hi,

I just played around with this for a bit.  As noted somewhere further down,
pg_buffercache_numa.page_num ends up wonky in different ways for the different
page sizes.

I think one thing that the docs should mention is that calling the numa
functions/views will force the pages to be allocated, even if they're
currently unused.

Newly started server, with s_b of 32GB and 2MB huge pages:

  grep ^Huge /proc/meminfo
  HugePages_Total:   34802
  HugePages_Free:    34448
  HugePages_Rsvd:    16437
  HugePages_Surp:        0
  Hugepagesize:       2048 kB
  Hugetlb:        76517376 kB

run
  SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;

Now the pages that previously were marked as reserved are actually allocated:

  grep ^Huge /proc/meminfo
  HugePages_Total:   34802
  HugePages_Free:    18012
  HugePages_Rsvd:        1
  HugePages_Surp:        0
  Hugepagesize:       2048 kB
  Hugetlb:        76517376 kB


I don't see how we can avoid that right now, but at the very least we ought to
document it.


On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote:
> The libnuma library is not available on 32-bit builds (there's no shared
> object for i386), so we disable it in that case. The i386 is very memory
> limited anyway, even with PAE, so NUMA is mostly irrelevant.

Hm? libnuma1:i386 installs just fine for me on Debian and contains the shared
library.


> +###############################################################
> +# Library: libnuma
> +###############################################################
> +
> +libnumaopt = get_option('libnuma')
> +if not libnumaopt.disabled()
> +  # via pkg-config
> +  libnuma = dependency('numa', required: libnumaopt)
> +  if not libnuma.found()
> +    libnuma = cc.find_library('numa', required: libnumaopt)
> +  endif

This fallback isn't going to work if -Dlibnuma=enabled is used, as
dependency() will error out due to not finding a required dependency. You'd
need to use required: false there.

Do we actually need a non-dependency() fallback here?  It's Linux-only and a
new dependency, so just requiring a .pc file seems like it'd be fine?


> +#ifdef USE_LIBNUMA
> +
> +/*
> + * This is required on Linux, before pg_numa_query_pages() as we
> + * need to page-fault before move_pages(2) syscall returns valid results.
> + */
> +#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
> +    ro_volatile_var = *(uint64 *) ptr

Does it really work that way? Assigning the result of dereferencing ptr to a
volatile variable ensures that the *store* isn't removed by the compiler, but
it doesn't at all guarantee that the *load* isn't removed, since that memory
isn't marked as volatile.

I think you'd need to cast the source pointer to a volatile uint64* to ensure
the load happens.
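
Maybe something like this (just a sketch, not tested):

  /*
   * Read through a pointer-to-volatile so the compiler can't elide the
   * load; the destination variable itself doesn't need to be volatile.
   */
  #define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
      ro_volatile_var = *(volatile uint64 *) (ptr)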


> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
> +
> +-- Register the new functions.
> +CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
> +RETURNS SETOF RECORD
> +AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
> +LANGUAGE C PARALLEL SAFE;
> +
> +-- Create a view for convenient access.
> +CREATE OR REPLACE VIEW pg_buffercache_numa AS
> +    SELECT P.* FROM pg_buffercache_numa_pages() AS P
> +    (bufferid integer, page_num int4, node_id int4);


Why CREATE OR REPLACE?


> +Datum
> +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
> +{
> ...
> +
> +        /*
> +         * To smoothly support upgrades from version 1.0 of this extension
> +         * transparently handle the (non-)existence of the pinning_backends
> +         * column. We unfortunately have to get the result type for that... -
> +         * we can't use the result type determined by the function definition
> +         * without potentially crashing when somebody uses the old (or even
> +         * wrong) function definition though.
> +         */
> +        if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
> +            elog(ERROR, "return type must be a row type");
> +

Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view?


> +
> +        if (firstNumaTouch)
> +            elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");

Over the patch series the related code is duplicated. Seems like it'd be good
to put it into pg_numa instead?  This seems like the kind of thing that's good
to abstract away in one central spot.


> +        /*
> +         * Scan through all the buffers, saving the relevant fields in the
> +         * fctx->record structure.
> +         *
> +         * We don't hold the partition locks, so we don't get a consistent
> +         * snapshot across all buffers, but we do grab the buffer header
> +         * locks, so the information of each buffer is self-consistent.
> +         *
> +         * This loop touches and stores addresses into os_page_ptrs[] as input
> +         * to one big big move_pages(2) inquiry system call. Basically we ask
> +         * for all memory pages for NBuffers.
> +         */
> +        idx = 0;
> +        for (i = 0; i < NBuffers; i++)
> +        {
> +            BufferDesc *bufHdr;
> +            uint32        buf_state;
> +            uint32        bufferid;
> +
> +            CHECK_FOR_INTERRUPTS();
> +
> +            bufHdr = GetBufferDescriptor(i);
> +
> +            /* Lock each buffer header before inspecting. */
> +            buf_state = LockBufHdr(bufHdr);
> +            bufferid = BufferDescriptorGetBuffer(bufHdr);
> +
> +            UnlockBufHdr(bufHdr, buf_state);

Given that the only thing you're getting here is the buffer id, it's probably
not worth taking the spinlock; a single 4-byte read is always atomic on the
platforms we support.
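
I.e. just something like this (sketch):

  bufHdr = GetBufferDescriptor(i);

  /* buf_id is fixed at server start, so no header lock is needed to read it */
  bufferid = BufferDescriptorGetBuffer(bufHdr);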


> +            /*
> +             * If we have multiple OS pages per buffer, fill those in too. We
> +             * always want at least one OS page, even if there are multiple
> +             * buffers per page.
> +             *
> +             * Altough we could query just once per each OS page, we do it
> +             * repeatably for each Buffer and hit the same address as
> +             * move_pages(2) requires page aligment. This also simplifies
> +             * retrieval code later on. Also NBuffers starts from 1.
> +             */
> +            for (j = 0; j < Max(1, pages_per_buffer); j++)
> +            {
> +                char       *buffptr = (char *) BufferGetBlock(i + 1);
> +
> +                fctx->record[idx].bufferid = bufferid;
> +                fctx->record[idx].numa_page = j;
> +
> +                os_page_ptrs[idx]
> +                    = (char *) TYPEALIGN(os_page_size,
> +                                         buffptr + (os_page_size * j));

FWIW, this bit here is the most expensive part of the function itself, as the
compiler has no choice but to implement it as an actual division, since
os_page_size is a runtime variable.

It'd be fine to leave it like that, as the call to numa_move_pages() is way
more expensive. But it shouldn't be too hard to do that alignment once, rather
than over and over.
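
Roughly like this, perhaps (sketch only; assumes idx is advanced at the end of
the loop body, as in the elided part of the original):

  char   *aligned = (char *) TYPEALIGN(os_page_size, BufferGetBlock(i + 1));

  for (j = 0; j < Max(1, pages_per_buffer); j++)
  {
      fctx->record[idx].bufferid = bufferid;
      fctx->record[idx].numa_page = j;
      /* stepping by whole OS pages keeps the page alignment intact */
      os_page_ptrs[idx] = aligned + (os_page_size * j);
      idx++;
  }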


FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
works quite right for me.

This definition afaict is always 0 when using huge pages and just 0 and 1 for
4k pages.  But my understanding of numa_page is that it's the "id" of the NUMA
page, which this isn't?

With "adjust page_num" I get a number that starts at -1 and then increments
from there. More correct, but doesn't quite seem right either.


> +    <tbody>
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>bufferid</structfield> <type>integer</type>
> +      </para>
> +      <para>
> +       ID, in the range 1..<varname>shared_buffers</varname>
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>page_num</structfield> <type>int</type>
> +      </para>
> +      <para>
> +       number of OS memory page for this buffer
> +      </para></entry>
> +     </row>

"page_num" doesn't really seem very descriptive for "number of OS memory page
for this buffer".  To me "number of" sounds like it's counting the number of
associated pages, but it's really just a "page id" or something like that.

Maybe rename page_num to "os_page_id" and rephrase the comment a bit?




> From 03d24af540f8235ad9ca9537db0a1ba5dbcf6ccb Mon Sep 17 00:00:00 2001
> From: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Date: Fri, 21 Feb 2025 14:20:18 +0100
> Subject: [PATCH v25 4/5] Introduce pg_shmem_allocations_numa view
>
> Introduce new pg_shmem_alloctions_numa view with information about how
> shared memory is distributed across NUMA nodes.

Nice.


> Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Reviewed-by: Andres Freund <andres@anarazel.de>
> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
> Reviewed-by: Tomas Vondra <tomas@vondra.me>
> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
> ---
>  doc/src/sgml/system-views.sgml           |  79 ++++++++++++
>  src/backend/catalog/system_views.sql     |   8 ++
>  src/backend/storage/ipc/shmem.c          | 152 +++++++++++++++++++++++
>  src/include/catalog/pg_proc.dat          |   8 ++
>  src/test/regress/expected/numa.out       |  12 ++
>  src/test/regress/expected/numa_1.out     |   3 +
>  src/test/regress/expected/privileges.out |  16 ++-
>  src/test/regress/expected/rules.out      |   4 +
>  src/test/regress/parallel_schedule       |   2 +-
>  src/test/regress/sql/numa.sql            |   9 ++
>  src/test/regress/sql/privileges.sql      |   6 +-
>  11 files changed, 294 insertions(+), 5 deletions(-)
>  create mode 100644 src/test/regress/expected/numa.out
>  create mode 100644 src/test/regress/expected/numa_1.out
>  create mode 100644 src/test/regress/sql/numa.sql
>
> diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
> index 4f336ee0adf..a83365ae24a 100644
> --- a/doc/src/sgml/system-views.sgml
> +++ b/doc/src/sgml/system-views.sgml
> @@ -181,6 +181,11 @@
>        <entry>shared memory allocations</entry>
>       </row>
>
> +     <row>
> +      <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
> +      <entry>NUMA node mappings for shared memory allocations</entry>
> +     </row>
> +
>       <row>
>        <entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
>        <entry>planner statistics</entry>
> @@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
>    </para>
>   </sect1>
>
> + <sect1 id="view-pg-shmem-allocations-numa">
> +  <title><structname>pg_shmem_allocations_numa</structname></title>
> +
> +  <indexterm zone="view-pg-shmem-allocations-numa">
> +   <primary>pg_shmem_allocations_numa</primary>
> +  </indexterm>
> +
> +  <para>
> +   The <structname>pg_shmem_allocations_numa</structname> shows how shared
> +   memory allocations in the server's main shared memory segment are distributed
> +   across NUMA nodes. This includes both memory allocated by
> +   <productname>PostgreSQL</productname> itself and memory allocated
> +   by extensions using the mechanisms detailed in
> +   <xref linkend="xfunc-shared-addin" />.
> +  </para>

I think it'd be good to describe that the view will include multiple rows for
each name if the allocation is spread across multiple NUMA nodes.

Perhaps also that querying this view is expensive, and that
pg_shmem_allocations should be used if NUMA information isn't needed?


> +    /*
> +     * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
> +     * the OS may have different memory page sizes.
> +     *
> +     * To correctly map between them, we need to: 1. Determine the OS memory
> +     * page size 2. Calculate how many OS pages are used by all buffer blocks
> +     * 3. Calculate how many OS pages are contained within each database
> +     * block.
> +     *
> +     * This information is needed before calling move_pages() for NUMA memory
> +     * node inquiry.
> +     */
> +    os_page_size = pg_numa_get_pagesize();
> +
> +    /*
> +     * Allocate memory for page pointers and status based on total shared
> +     * memory size. This simplified approach allocates enough space for all
> +     * pages in shared memory rather than calculating the exact requirements
> +     * for each segment.
> +     *
> +     * XXX Isn't this wasteful? But there probably is one large segment of
> +     * shared memory, much larger than the rest anyway.
> +     */
> +    shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
> +    page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
> +    pages_status = palloc(sizeof(int) * shm_total_page_count);

There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
could more be moved to a shared helper function?
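
E.g. a helper along these lines in pg_numa.c (name and signature invented
here, just to sketch the idea) could cover both the pointer setup and the
page touching, which would also address the duplication noted further up:

  /*
   * Build the page-aligned pointer array covering [startptr, startptr + length)
   * and touch each page so move_pages(2) can report a valid node for it.
   * Callers then only differ in how they map the status array to their rows.
   */
  static Size
  pg_numa_prepare_ptrs(void *startptr, Size length, Size os_page_size,
                       void ***page_ptrs_p)
  {
      char       *first = (char *) TYPEALIGN_DOWN(os_page_size, startptr);
      char       *last = (char *) TYPEALIGN_DOWN(os_page_size,
                                                 (char *) startptr + length - 1);
      Size        n_pages = (last - first) / os_page_size + 1;
      void      **page_ptrs = palloc0(sizeof(void *) * n_pages);
      volatile uint64 touch pg_attribute_unused();
      Size        i;

      for (i = 0; i < n_pages; i++)
      {
          page_ptrs[i] = first + i * os_page_size;
          pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
      }

      *page_ptrs_p = page_ptrs;
      return n_pages;
  }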


> +    hash_seq_init(&hstat, ShmemIndex);
> +
> +    /* output all allocated entries */
> +    memset(nulls, 0, sizeof(nulls));
> +    while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
> +    {

One thing that's interesting with using ShmemIndex is that this won't get
anonymous allocations. I think that's ok for now, as it'd be complicated to
figure out the unmapped "regions".  But I guess it'd be good to mention it in
the docs?




> +        int            i;
> +
> +        /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
> +         * It's a bit misleading to call that "aligned", no? */
> +
> +        /* Get number of OS aligned pages */
> +        shm_ent_page_count
> +            = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
> +
> +        /*
> +         * If we get ever 0xff back from kernel inquiry, then we probably have
> +         * bug in our buffers to OS page mapping code here.
> +         */
> +        memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);

There's obviously no guarantee that ent->allocated_size is a multiple of
os_page_size.  I think it'd be interesting to show in the view when one shmem
allocation shares a page with the prior allocation - that can contribute a bit
to contention.  What about showing a start_os_page_id and end_os_page_id or
something?  That could be a feature for later though.
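
Roughly, per entry (sketch; the column names are just made up above, and this
ignores entries with allocated_size == 0):

  /*
   * OS page ids relative to the start of the main shared memory segment,
   * so two entries sharing a page report the same id.
   */
  Size        seg_off = (char *) ent->location - (char *) ShmemSegHdr;
  int64       start_os_page_id = seg_off / os_page_size;
  int64       end_os_page_id = (seg_off + ent->allocated_size - 1)
                               / os_page_size;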


> +SELECT NOT(pg_numa_available()) AS skip_test \gset
> +\if :skip_test
> +\quit
> +\endif
> +-- switch to superuser
> +\c -
> +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
> + ok
> +----
> + t
> +(1 row)

Could it be worthwhile to run the test even if !pg_numa_available(), to check
that we do the right thing in that case? We need an alternative output anyway,
so that might be fine?


Greetings,

Andres Freund


