Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From | Jakub Wartak |
---|---|
Subject | Re: Draft for basic NUMA observability |
Date | |
Msg-id | CAKZiRmzKEkoYUB1_PR0uijPeqihyXCwBRY7N+WN4JGPz9Nk0Yw@mail.gmail.com Whole thread Raw |
In response to | Re: Draft for basic NUMA observability (Andres Freund <andres@anarazel.de>) |
Responses |
Re: Draft for basic NUMA observability
|
List | pgsql-hackers |
On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, Hi Andres, > On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote: > > setup_additional_packages_script: | > > - #apt-get update > > - #DEBIAN_FRONTEND=noninteractive apt-get -y install ... > > + apt-get update > > + DEBIAN_FRONTEND=noninteractive apt-get -y install \ > > + libnuma1 \ > > + libnuma-dev > > I think libnuma is installed on the relevant platforms, so you shouldn't need > to install it manually. Fixed. Right, you mentioned this earlier, I just didnt know when it went online. > > +# > > +# libnuma > > +# > > +AC_MSG_CHECKING([whether to build with libnuma support]) > > +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness], > > Most other dependencies say "build with libxyz ..." Done. > > + * pg_numa.h [..] > > +#include "c.h" > > +#include "postgres.h" > > Headers should never include either of those headers. Nor should .c files > include both. Fixed, huh, I've found explanation: https://www.postgresql.org/message-id/11634.1488932128@sss.pgh.pa.us > > @@ -200,6 +200,8 @@ pgxs_empty = [ > > > > 'ICU_LIBS', > > > > + 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS', > > + > > 'LIBURING_CFLAGS', 'LIBURING_LIBS', > > ] > > Maybe I am missing something, but are you actually defining and using those > LIBNUMA_* vars anywhere? OK, so it seems I've been missing `PKG_CHECK_MODULES(LIBNUMA, numa)` in configure.ac that would set those *FLAGS. I'm little bit loss dependent in how to gurantee that meson is synced with autoconf as per project requirements - trying to use past commits as reference, but I still could get something wrong here (especially in src/Makefile.global.in) > > +Size > > +pg_numa_get_pagesize(void) [..] > > Should this have a comment or an assertion that it can only be used after > shared memory startup? Because before that huge_pages_status won't be > meaningful? Added both. > > +#ifndef FRONTEND > > +/* > > + * XXX: not really tested as there is no way to trigger this in our > > + * current usage of libnuma. > > + * > > + * The libnuma built-in code can be seen here: > > + * https://github.com/numactl/numactl/blob/master/libnuma.c > > + * > > + */ > > +void > > +numa_warn(int num, char *fmt,...) [..] > > I think you would at least have to hold interrupts across this, as > ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump > out of libnuma in case an interrupt has arrived. On request by Alvaro I've removed it as that code is simply unreachable/untestable, but point taken - I'm planning to re-add this with holding interrupting in future when we start using proper numa_interleave() one day. Anyway, please let me know if you want still to keep it as deadcode. BTW for context , why this is deadcode is explained in the latter part of [1] message (TL;DR; unless we use pining/numa_interleave/local_alloc() we probably never reach that warnings/error handlers). "Another question without an easy answer as I never hit this error from numa_move_pages(), one gets invalid stuff in *os_pages_status instead. BUT!: most of our patch just uses things that cannot fail as per libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700 /sys` (because it's hard to unmount it) and then still most of numactl stuff works as euid != 0, but numactl --hardware gets at least "libnuma: Warning: Cannot parse distance information in sysfs: Permission denied" or same story with numactl -C 678 date. So unless we start way more heavy use of libnuma (not just for observability) there's like no point in that right now (?) Contrary to that: we can do just do variadic elog() for that, I've put some code, but no idea if that works fine..." > > +Size > > +pg_numa_get_pagesize(void) [..] > > I would probably implement this once, outside of the big ifdef, with one more > ifdef inside, given that you're sharing the same implementation. Done. > > From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001 > > From: Jakub Wartak <jakub.wartak@enterprisedb.com> > > Date: Wed, 19 Mar 2025 09:34:56 +0100 > > Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's > > primary function to separate functions. > > > > This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple() > > and get_buffercache_tuple() that help fill result tuplestores based on the > > buffercache contents. This will be used in a follow-up commit that implements > > NUMA observability in pg_buffercache. > > > > Author: Jakub Wartak <jakub.wartak@enterprisedb.com> > > Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> > > Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com > > --- > > contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++-------- > > 1 file changed, 178 insertions(+), 139 deletions(-) > > > > diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c > > index 62602af1775..86e0b8afe01 100644 > > --- a/contrib/pg_buffercache/pg_buffercache_pages.c > > +++ b/contrib/pg_buffercache/pg_buffercache_pages.c > > @@ -14,7 +14,6 @@ > > #include "storage/buf_internals.h" > > #include "storage/bufmgr.h" > > > > - > > #define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8 > > Independent change. Fixed. > > #define NUM_BUFFERCACHE_PAGES_ELEM 9 > > #define NUM_BUFFERCACHE_SUMMARY_ELEM 5 > > @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary); > > PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts); > > PG_FUNCTION_INFO_V1(pg_buffercache_evict); > > > > -Datum > > -pg_buffercache_pages(PG_FUNCTION_ARGS) > > +/* > > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > > + * > > + * This is almost identical to pg_buffercache_numa_pages(), but this one performs > > + * memory mapping inquiries to display NUMA node information for each buffer. > > + */ > > If it's a helper routine it's probably not identical to > pg_buffercache_numa_pages()? Of course not, fixed by removing that comment. > > +/* > > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > > + * > > + * Build buffer cache information for a single buffer. > > + */ > > +static void > > +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx) > > +{ > > This isn't really building a tuple tuple? Seems somewhat confusing, because > get_buffercache_tuple() does actually build one. s/pg_buffercache_build_tuple/pg_buffercache_save_tuple/g , unless someone wants to come with better name. > > + BufferDesc *bufHdr; > > + uint32 buf_state; > > + > > + bufHdr = GetBufferDescriptor(record_id); > > + /* Lock each buffer header before inspecting. */ > > + buf_state = LockBufHdr(bufHdr); > > + > > + fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr); > > + fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag); > > + fctx->record[record_id].reltablespace = bufHdr->tag.spcOid; > > + fctx->record[record_id].reldatabase = bufHdr->tag.dbOid; > > + fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag); > > + fctx->record[record_id].blocknum = bufHdr->tag.blockNum; > > + fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state); > > + fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state); > > As above, I think this would be more readable if you put > fctx->record[record_id] into a local var. Done. > > +static Datum > > +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx) > > +{ > > + Datum values[NUM_BUFFERCACHE_PAGES_ELEM]; > > + bool nulls[NUM_BUFFERCACHE_PAGES_ELEM]; > > + HeapTuple tuple; > > + > > + values[0] = Int32GetDatum(fctx->record[record_id].bufferid); > > + nulls[0] = false; > > + > > + /* > > + * Set all fields except the bufferid to null if the buffer is unused or > > + * not valid. > > + */ > > + if (fctx->record[record_id].blocknum == InvalidBlockNumber || > > + fctx->record[record_id].isvalid == false) > > + { > > + nulls[1] = true; > > + nulls[2] = true; > > + nulls[3] = true; > > + nulls[4] = true; > > + nulls[5] = true; > > + nulls[6] = true; > > + nulls[7] = true; > > + > > + /* unused for v1.0 callers, but the array is always long enough */ > > + nulls[8] = true; > > I'd probably just memset the entire nulls array to either true or false, > instead of doing it one-by-one. Done. > > + } > > + else > > + { > > + values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber); > > + nulls[1] = false; > > + values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace); > > + nulls[2] = false; > > + values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase); > > + nulls[3] = false; > > + values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum); > > + nulls[4] = false; > > + values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum); > > + nulls[5] = false; > > + values[6] = BoolGetDatum(fctx->record[record_id].isdirty); > > + nulls[6] = false; > > + values[7] = Int16GetDatum(fctx->record[record_id].usagecount); > > + nulls[7] = false; > > Seems like it would end up a lot more readable if you put > fctx->record[record_id] into a local variable. Unfortunately that'd probably > be best done in one more commit ahead of the rest of the this one... Done, i've put it those refactorig changes into the commit already dedicated only for a refactor. For the record Bertrand also asked for something about this, but I was somehow afraid to touch Tom's code. > > @@ -0,0 +1,28 @@ > > +SELECT NOT(pg_numa_available()) AS skip_test \gset > > +\if :skip_test > > +\quit > > +\endif > > You could avoid the need for an alternative output file if you instead made > the queries do something like > SELECT NOT pg_numa_available() OR count(*) ... ITEM REMAINING: Is this for the future or can it stay like that? I don't have a hard opinion on this, but I've already wasted lots of cycles to discover that one can have those ".1" alternative expected result files. > > --- /dev/null > > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > > @@ -0,0 +1,42 @@ > > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */ > > + > > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit > > + > > +-- Register the new function. > > +DROP VIEW pg_buffercache; > > +DROP FUNCTION pg_buffercache_pages(); > > I don't think we can just drop a view in the upgrade script. That will fail if > anybody created a view depending on pg_buffercache. Ugh, fixed, thanks. That must have been some leftover (we later do CREATE OR REPLACE those anyway). > (Sorry, ran out of time / energy here, i had originally just wanted to comment > on the apt-get thing in the tests) Thanks! AIO intensifies ... :) -J. [1] - https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com
Attachment
pgsql-hackers by date: