Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Draft for basic NUMA observability |
Date | |
Msg-id | tvdbhi7rkk7rvpcm22bj6jnri4ucdb4dnzefer6q7hqflpkxm7@eq37undomnxw Whole thread Raw |
In response to | Re: Draft for basic NUMA observability (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
Responses |
Re: Draft for basic NUMA observability
|
List | pgsql-hackers |
Hi, On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote: > setup_additional_packages_script: | > - #apt-get update > - #DEBIAN_FRONTEND=noninteractive apt-get -y install ... > + apt-get update > + DEBIAN_FRONTEND=noninteractive apt-get -y install \ > + libnuma1 \ > + libnuma-dev I think libnuma is installed on the relevant platforms, so you shouldn't need to install it manually. > +# > +# libnuma > +# > +AC_MSG_CHECKING([whether to build with libnuma support]) > +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness], Most other dependencies say "build with libxyz ..." > +/*------------------------------------------------------------------------- > + * > + * pg_numa.h > + * Basic NUMA portability routines > + * > + * > + * Copyright (c) 2025, PostgreSQL Global Development Group > + * > + * IDENTIFICATION > + * src/include/port/pg_numa.h > + * > + *------------------------------------------------------------------------- > + */ > +#ifndef PG_NUMA_H > +#define PG_NUMA_H > + > +#include "c.h" > +#include "postgres.h" Headers should never include either of those headers. Nor should .c files include both. > @@ -200,6 +200,8 @@ pgxs_empty = [ > > 'ICU_LIBS', > > + 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS', > + > 'LIBURING_CFLAGS', 'LIBURING_LIBS', > ] Maybe I am missing something, but are you actually defining and using those LIBNUMA_* vars anywhere? > +Size > +pg_numa_get_pagesize(void) > +{ > + Size os_page_size = sysconf(_SC_PAGESIZE); > + > + if (huge_pages_status == HUGE_PAGES_ON) > + GetHugePageSize(&os_page_size, NULL); > + > + return os_page_size; > +} Should this have a comment or an assertion that it can only be used after shared memory startup? Because before that huge_pages_status won't be meaningful? > +#ifndef FRONTEND > +/* > + * XXX: not really tested as there is no way to trigger this in our > + * current usage of libnuma. > + * > + * The libnuma built-in code can be seen here: > + * https://github.com/numactl/numactl/blob/master/libnuma.c > + * > + */ > +void > +numa_warn(int num, char *fmt,...) > +{ > + va_list ap; > + int olde = errno; > + int needed; > + StringInfoData msg; > + > + initStringInfo(&msg); > + > + va_start(ap, fmt); > + needed = appendStringInfoVA(&msg, fmt, ap); > + va_end(ap); > + if (needed > 0) > + { > + enlargeStringInfo(&msg, needed); > + va_start(ap, fmt); > + appendStringInfoVA(&msg, fmt, ap); > + va_end(ap); > + } > + > + ereport(WARNING, > + (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION), > + errmsg_internal("libnuma: WARNING: %s", msg.data))); I think you would at least have to hold interrupts across this, as ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump out of libnuma in case an interrupt has arrived. > +Size > +pg_numa_get_pagesize(void) > +{ > +#ifndef WIN32 > + Size os_page_size = sysconf(_SC_PAGESIZE); > +#else > + Size os_page_size; > + SYSTEM_INFO sysinfo; > + > + GetSystemInfo(&sysinfo); > + os_page_size = sysinfo.dwPageSize; > +#endif > + if (huge_pages_status == HUGE_PAGES_ON) > + GetHugePageSize(&os_page_size, NULL); > + return os_page_size; > +} I would probably implement this once, outside of the big ifdef, with one more ifdef inside, given that you're sharing the same implementation. > From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001 > From: Jakub Wartak <jakub.wartak@enterprisedb.com> > Date: Wed, 19 Mar 2025 09:34:56 +0100 > Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's > primary function to separate functions. > > This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple() > and get_buffercache_tuple() that help fill result tuplestores based on the > buffercache contents. This will be used in a follow-up commit that implements > NUMA observability in pg_buffercache. > > Author: Jakub Wartak <jakub.wartak@enterprisedb.com> > Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> > Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com > --- > contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++-------- > 1 file changed, 178 insertions(+), 139 deletions(-) > > diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c > index 62602af1775..86e0b8afe01 100644 > --- a/contrib/pg_buffercache/pg_buffercache_pages.c > +++ b/contrib/pg_buffercache/pg_buffercache_pages.c > @@ -14,7 +14,6 @@ > #include "storage/buf_internals.h" > #include "storage/bufmgr.h" > > - > #define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8 Independent change. > #define NUM_BUFFERCACHE_PAGES_ELEM 9 > #define NUM_BUFFERCACHE_SUMMARY_ELEM 5 > @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary); > PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts); > PG_FUNCTION_INFO_V1(pg_buffercache_evict); > > -Datum > -pg_buffercache_pages(PG_FUNCTION_ARGS) > +/* > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > + * > + * This is almost identical to pg_buffercache_numa_pages(), but this one performs > + * memory mapping inquiries to display NUMA node information for each buffer. > + */ If it's a helper routine it's probably not identical to pg_buffercache_numa_pages()? > +/* > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > + * > + * Build buffer cache information for a single buffer. > + */ > +static void > +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx) > +{ This isn't really building a tuple tuple? Seems somewhat confusing, because get_buffercache_tuple() does actually build one. > + BufferDesc *bufHdr; > + uint32 buf_state; > + > + bufHdr = GetBufferDescriptor(record_id); > + /* Lock each buffer header before inspecting. */ > + buf_state = LockBufHdr(bufHdr); > + > + fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr); > + fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag); > + fctx->record[record_id].reltablespace = bufHdr->tag.spcOid; > + fctx->record[record_id].reldatabase = bufHdr->tag.dbOid; > + fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag); > + fctx->record[record_id].blocknum = bufHdr->tag.blockNum; > + fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state); > + fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state); As above, I think this would be more readable if you put fctx->record[record_id] into a local var. > +static Datum > +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx) > +{ > + Datum values[NUM_BUFFERCACHE_PAGES_ELEM]; > + bool nulls[NUM_BUFFERCACHE_PAGES_ELEM]; > + HeapTuple tuple; > + > + values[0] = Int32GetDatum(fctx->record[record_id].bufferid); > + nulls[0] = false; > + > + /* > + * Set all fields except the bufferid to null if the buffer is unused or > + * not valid. > + */ > + if (fctx->record[record_id].blocknum == InvalidBlockNumber || > + fctx->record[record_id].isvalid == false) > + { > + nulls[1] = true; > + nulls[2] = true; > + nulls[3] = true; > + nulls[4] = true; > + nulls[5] = true; > + nulls[6] = true; > + nulls[7] = true; > + > + /* unused for v1.0 callers, but the array is always long enough */ > + nulls[8] = true; I'd probably just memset the entire nulls array to either true or false, instead of doing it one-by-one. > + } > + else > + { > + values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber); > + nulls[1] = false; > + values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace); > + nulls[2] = false; > + values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase); > + nulls[3] = false; > + values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum); > + nulls[4] = false; > + values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum); > + nulls[5] = false; > + values[6] = BoolGetDatum(fctx->record[record_id].isdirty); > + nulls[6] = false; > + values[7] = Int16GetDatum(fctx->record[record_id].usagecount); > + nulls[7] = false; Seems like it would end up a lot more readable if you put fctx->record[record_id] into a local variable. Unfortunately that'd probably be best done in one more commit ahead of the rest of the this one... > @@ -0,0 +1,28 @@ > +SELECT NOT(pg_numa_available()) AS skip_test \gset > +\if :skip_test > +\quit > +\endif You could avoid the need for an alternative output file if you instead made the queries do something like SELECT NOT pg_numa_available() OR count(*) ... > --- /dev/null > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > @@ -0,0 +1,42 @@ > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */ > + > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit > + > +-- Register the new function. > +DROP VIEW pg_buffercache; > +DROP FUNCTION pg_buffercache_pages(); I don't think we can just drop a view in the upgrade script. That will fail if anybody created a view depending on pg_buffercache. (Sorry, ran out of time / energy here, i had originally just wanted to comment on the apt-get thing in the tests) Greetings, Andres Freund
pgsql-hackers by date: