Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Draft for basic NUMA observability
Date
Msg-id Z867b09EIZncOsmV@ip-10-97-1-34.eu-west-3.compute.internal
Whole thread Raw
In response to Re: Draft for basic NUMA observability  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses Re: Draft for basic NUMA observability
List pgsql-hackers
Hi,

On Fri, Mar 07, 2025 at 12:33:27PM +0100, Jakub Wartak wrote:
> On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> >
> > Hi,
> > On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
> > <jakub.wartak@enterprisedb.com> wrote:
> > >Hi,
> >
> > > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
> > > > or such function: to put the "duplicated" code in it and then use this
> > > > shared function in pg_buffercache_pages() and in the new numa related one.
> > >
> > > OK, so hastily attempted that in 7b , I had to do a larger refactor
> > > there to avoid code duplication between those two. I don't know which
> > > attempt is better though (7 vs 7b)..
> > >
> >
> > I'm attaching basically the earlier stuff (v7b) as v8 with the
> > following minor changes:
> > - docs are included
> > - changed int8 to int4 in one function definition for numa_zone_id
> 
> .. and v9 attached because cfbot partially complained about
> .cirrus.tasks.yml being adjusted recently (it seems master is hot
> these days).
> 

Thanks for the new version!

Some random comments on 0001:

=== 1

It does not compiles "alone". It's missing:

+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader  /* standard header for all Postgres shmem */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;

and

--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int  ssl_renegotiation_limit;
  */
 int            huge_pages = HUGE_PAGES_TRY;
 int            huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int            huge_pages_status = HUGE_PAGES_UNKNOWN;

That come with 0002. So it has to be in 0001 instead.

=== 2

+else
+  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi

Maybe we should add the same test (checking for numa.h) for meson?

=== 3

+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)

It looks to me that we can remove those 2 lines.

=== 4

+         Only supported on Linux.

s/on Linux/on platforms for which the libnuma library is implemented/?

I did a quick grep on "Only supported on" and it looks like that could be
a more consistent wording.

=== 5

+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>

I had a closer look to other header files and it looks like it "should" be:

#include "c.h"
#include "postgres.h"

#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif

#include "port/pg_numa.h"
#include "storage/pg_shmem.h"

And is "#include "c.h"" really needed?

=== 6

+/* FIXME not tested, might crash */

That's a bit scary.

=== 7

+        * XXX: for now we issue just WARNING, but long-term that might depend on
+        * numa_set_strict() here

s/here/here./

Did not look carefully at all the comments in 0001, 0002 and 0003 though.

A few random comments regarding 0002:

=== 8

# create extension pg_buffercache;
ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so":
/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages
 
CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
RETURNS SETOF RECORD
AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
LANGUAGE C PARALLEL SAFE"
extension script file "pg_buffercache--1.2.sql", near line 7

While that's ok if 0003 is applied. I think that each individual patch should
"fully" work individually.

=== 9

+       do
+       {
+               if (os_page_ptrs[blk2page + j] == 0)

blk2page + j will be repeated multiple times, better to store it in a local
variable instead?

=== 10

+                       if (firstUseInBackend == true)


if (firstUseInBackend) instead?

=== 11

+       int                     j = 0,
+                               blk2page = (int) i * pages_per_blk;

I wonder if size_t is more appropriate for blk2page:

size_t blk2page = (size_t)(i * pages_per_blk) maybe?

=== 12

as we know that we'll iterate until pages_per_blk then would a for loop be more
appropriate here, something like?

"
for (size_t j = 0; j < pages_per_blk; j++)
{
    if (os_page_ptrs[blk2page + j] == 0)
    {
"

=== 13

+       if (buf_state & BM_DIRTY)
+               fctx->record[i].isdirty = true;
+       else
+               fctx->record[i].isdirty = false;

  fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?

=== 14

+       if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+               fctx->record[i].isvalid = true;
+       else
+               fctx->record[i].isvalid = false;

fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
maybe?

=== 15

+populate_buffercache_entry(int i, BufferCachePagesContext *fctx)

I wonder if "i" should get a more descriptive name?

=== 16

s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
with pg_stat_io_build_tuples for example).

I now realize that I did propose populate_buffercache_entry() up-thread,
sorry for changing my mind.

=== 17

+       static bool firstUseInBackend = true;

maybe we should give it a more descriptive name?

Also I need to think more about how firstUseInBackend is used, for example:

   ==== 17.1

   would it be better to define it as a file scope variable? (at least that
   would avoid to pass it as an extra parameter in some functions).

   === 17.2 

   what happens if firstUseInBackend is set to false and later on the pages
   are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
   called again by a backend that already had set firstUseInBackend to false,
   would that provide accurate information?

=== 18

+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+       FuncCallContext *funcctx;
+       BufferCachePagesContext *fctx;  /* User function context. */
+       bool            query_numa = true;

I don't think we need query_numa anymore in pg_buffercache_numa_pages().

I think that we can just ERROR (or NOTICE and return) here:

+               if (pg_numa_init() == -1)
+               {
+                       elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some
NUMAdata might be unavailable.");
 
+                       query_numa = false;

and fully get rid of query_numa.

=== 19

And then here:

        for (i = 0; i < NBuffers; i++)
        {
            populate_buffercache_entry(i, fctx);
            if (query_numa)
                pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
        }

        if (query_numa)
        {
            if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
                elog(ERROR, "failed NUMA pages inquiry: %m");

            for (i = 0; i < NBuffers; i++)
            {
                int         blk2page = (int) i * pages_per_blk;

                /*
                 * Technically we can get errors too here and pass that to
                 * user. Also we could somehow report single DB block spanning
                 * more than one NUMA zone, but it should be rare.
                 */
                fctx->record[i].numa_zone_id = os_pages_status[blk2page];
            }

maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?

A few random comments regarding 0003:

=== 20

+  <indexterm zone="view-pg-shmem-numa-allocations">
+   <primary>pg_shmem_numa_allocations</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_shmem_allocations</structname> view shows NUMA nodes

s/pg_shmem_allocations/pg_shmem_numa_allocations/?

=== 21

+  /* FIXME: should we release LWlock here ? */
+  elog(ERROR, "failed NUMA pages inquiry status: %m");

There is no need, see src/backend/storage/lmgr/README.

=== 22

+#define MAX_NUMA_ZONES 32              /* FIXME? */
+               Size            zones[MAX_NUMA_ZONES];

could we rely on pg_numa_get_max_node() instead?

=== 23

+                       if (s >= 0)
+                               zones[s]++;

should we also check that s is below a limit?

=== 24

Regarding how we make use of pg_numa_get_max_node(), are we sure there is
no possible holes? I mean could a system have node 0,1 and 3 but not 2?


Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I 
did a little in 0001 though)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Steven Niu
Date:
Subject: Re: [Patch] remove duplicated smgrclose
Next
From: Bykov Ivan
Date:
Subject: RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET