Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Draft for basic NUMA observability
Date
Msg-id tvdbhi7rkk7rvpcm22bj6jnri4ucdb4dnzefer6q7hqflpkxm7@eq37undomnxw
Whole thread Raw
In response to Re: Draft for basic NUMA observability  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses Re: Draft for basic NUMA observability
List pgsql-hackers
Hi,

On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote:
>    setup_additional_packages_script: |
> -    #apt-get update
> -    #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
> +    apt-get update
> +    DEBIAN_FRONTEND=noninteractive apt-get -y install \
> +      libnuma1 \
> +      libnuma-dev

I think libnuma is installed on the relevant platforms, so you shouldn't need
to install it manually.

> +#
> +# libnuma
> +#
> +AC_MSG_CHECKING([whether to build with libnuma support])
> +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],

Most other dependencies say "build with libxyz ..."


> +/*-------------------------------------------------------------------------
> + *
> + * pg_numa.h
> + *      Basic NUMA portability routines
> + *
> + *
> + * Copyright (c) 2025, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *     src/include/port/pg_numa.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef PG_NUMA_H
> +#define PG_NUMA_H
> +
> +#include "c.h"
> +#include "postgres.h"

Headers should never include either of those headers.  Nor should .c files
include both.


> @@ -200,6 +200,8 @@ pgxs_empty = [
>  
>    'ICU_LIBS',
>  
> +  'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
> +
>    'LIBURING_CFLAGS', 'LIBURING_LIBS',
>  ]

Maybe I am missing something, but are you actually defining and using those
LIBNUMA_* vars anywhere?


> +Size
> +pg_numa_get_pagesize(void)
> +{
> +    Size        os_page_size = sysconf(_SC_PAGESIZE);
> +
> +    if (huge_pages_status == HUGE_PAGES_ON)
> +        GetHugePageSize(&os_page_size, NULL);
> +
> +    return os_page_size;
> +}

Should this have a comment or an assertion that it can only be used after
shared memory startup? Because before that huge_pages_status won't be
meaningful?


> +#ifndef FRONTEND
> +/*
> + * XXX: not really tested as there is no way to trigger this in our
> + * current usage of libnuma.
> + *
> + * The libnuma built-in code can be seen here:
> + * https://github.com/numactl/numactl/blob/master/libnuma.c
> + *
> + */
> +void
> +numa_warn(int num, char *fmt,...)
> +{
> +    va_list        ap;
> +    int            olde = errno;
> +    int            needed;
> +    StringInfoData msg;
> +
> +    initStringInfo(&msg);
> +
> +    va_start(ap, fmt);
> +    needed = appendStringInfoVA(&msg, fmt, ap);
> +    va_end(ap);
> +    if (needed > 0)
> +    {
> +        enlargeStringInfo(&msg, needed);
> +        va_start(ap, fmt);
> +        appendStringInfoVA(&msg, fmt, ap);
> +        va_end(ap);
> +    }
> +
> +    ereport(WARNING,
> +            (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
> +             errmsg_internal("libnuma: WARNING: %s", msg.data)));

I think you would at least have to hold interrupts across this, as
ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump
out of libnuma in case an interrupt has arrived.

> +Size
> +pg_numa_get_pagesize(void)
> +{
> +#ifndef WIN32
> +    Size        os_page_size = sysconf(_SC_PAGESIZE);
> +#else
> +    Size        os_page_size;
> +    SYSTEM_INFO sysinfo;
> +
> +    GetSystemInfo(&sysinfo);
> +    os_page_size = sysinfo.dwPageSize;
> +#endif
> +    if (huge_pages_status == HUGE_PAGES_ON)
> +        GetHugePageSize(&os_page_size, NULL);
> +    return os_page_size;
> +}


I would probably implement this once, outside of the big ifdef, with one more
ifdef inside, given that you're sharing the same implementation.


> From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
> From: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Date: Wed, 19 Mar 2025 09:34:56 +0100
> Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
>  primary function to separate functions.
> 
> This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
> and get_buffercache_tuple() that help fill result tuplestores based on the
> buffercache contents. This will be used in a follow-up commit that implements
> NUMA observability in pg_buffercache.
> 
> Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
> ---
>  contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
>  1 file changed, 178 insertions(+), 139 deletions(-)
> 
> diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
> index 62602af1775..86e0b8afe01 100644
> --- a/contrib/pg_buffercache/pg_buffercache_pages.c
> +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> @@ -14,7 +14,6 @@
>  #include "storage/buf_internals.h"
>  #include "storage/bufmgr.h"
>  
> -
>  #define NUM_BUFFERCACHE_PAGES_MIN_ELEM    8

Independent change.


>  #define NUM_BUFFERCACHE_PAGES_ELEM    9
>  #define NUM_BUFFERCACHE_SUMMARY_ELEM 5
> @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
>  PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
>  PG_FUNCTION_INFO_V1(pg_buffercache_evict);
>  
> -Datum
> -pg_buffercache_pages(PG_FUNCTION_ARGS)
> +/*
> + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> + *
> + * This is almost identical to pg_buffercache_numa_pages(), but this one performs
> + * memory mapping inquiries to display NUMA node information for each buffer.
> + */

If it's a helper routine it's probably not identical to
pg_buffercache_numa_pages()?



> +/*
> + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> + *
> + * Build buffer cache information for a single buffer.
> + */
> +static void
> +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
> +{

This isn't really building a tuple tuple? Seems somewhat confusing, because
get_buffercache_tuple() does actually build one.


> +    BufferDesc *bufHdr;
> +    uint32        buf_state;
> +
> +    bufHdr = GetBufferDescriptor(record_id);
> +    /* Lock each buffer header before inspecting. */
> +    buf_state = LockBufHdr(bufHdr);
> +
> +    fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
> +    fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
> +    fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
> +    fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
> +    fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
> +    fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
> +    fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
> +    fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);

As above, I think this would be more readable if you put
fctx->record[record_id] into a local var.



> +static Datum
> +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
> +{
> +    Datum        values[NUM_BUFFERCACHE_PAGES_ELEM];
> +    bool        nulls[NUM_BUFFERCACHE_PAGES_ELEM];
> +    HeapTuple    tuple;
> +
> +    values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
> +    nulls[0] = false;
> +
> +    /*
> +     * Set all fields except the bufferid to null if the buffer is unused or
> +     * not valid.
> +     */
> +    if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
> +        fctx->record[record_id].isvalid == false)
> +    {
> +        nulls[1] = true;
> +        nulls[2] = true;
> +        nulls[3] = true;
> +        nulls[4] = true;
> +        nulls[5] = true;
> +        nulls[6] = true;
> +        nulls[7] = true;
> +
> +        /* unused for v1.0 callers, but the array is always long enough */
> +        nulls[8] = true;

I'd probably just memset the entire nulls array to either true or false,
instead of doing it one-by-one.


> +    }
> +    else
> +    {
> +        values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
> +        nulls[1] = false;
> +        values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
> +        nulls[2] = false;
> +        values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
> +        nulls[3] = false;
> +        values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
> +        nulls[4] = false;
> +        values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
> +        nulls[5] = false;
> +        values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
> +        nulls[6] = false;
> +        values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
> +        nulls[7] = false;

Seems like it would end up a lot more readable if you put
fctx->record[record_id] into a local variable.  Unfortunately that'd probably
be best done in one more commit ahead of the rest of the this one...


> @@ -0,0 +1,28 @@
> +SELECT NOT(pg_numa_available()) AS skip_test \gset
> +\if :skip_test
> +\quit
> +\endif

You could avoid the need for an alternative output file if you instead made
the queries do something like
  SELECT NOT pg_numa_available() OR count(*) ...



> --- /dev/null
> +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> @@ -0,0 +1,42 @@
> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> +
> +-- Register the new function.
> +DROP VIEW pg_buffercache;
> +DROP FUNCTION pg_buffercache_pages();

I don't think we can just drop a view in the upgrade script. That will fail if
anybody created a view depending on pg_buffercache.


(Sorry, ran out of time / energy here, i had originally just wanted to comment
on the apt-get thing in the tests)


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Remove vardata parameters from eqjoinsel_inner
Next
From: Andrei Lepikhov
Date:
Subject: Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment