Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: Draft for basic NUMA observability
Date
Msg-id CAKZiRmy2TDsuWuDh4C8077QMwO_fidz-fxtYGRYfkJZsSxYpTQ@mail.gmail.com
Whole thread Raw
In response to Re: Draft for basic NUMA observability  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Draft for basic NUMA observability
Re: Draft for basic NUMA observability
List pgsql-hackers
On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas,

> OK, so you agree the commit messages are complete / correct?

Yes.

> OK. FWIW if you disagree with some of my proposed changes, feel free to
> push back. I'm sure some may be more a matter of personal preference.

No, it's all fine. I will probably have lots of questions about
setting proper env for development that cares itself about style, but
that's for another day.

> [..floats..]
> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
> confusing and difficult to reason about ...
>
> I agree we don't want special cases for each possible combination of
> page sizes (I'm not sure we even know all the combinations). What I was
> thinking about is two branches, one for (block >= page) and another for
> (block < page). AFAICK both values have to be 2^k, so this would
> guarantee we have either (block/page) or (page/block) as integer.
>
> I wonder if you could even just calculate both, and have one loop that
> deals with both.
>
[..]
> When I say "integer arithmetic" I don't mean it should use 32-bit ints,
> or any other data type. I mean that it works with non-floating point
> values. It could be int64, Size or whatever is large enough to not
> overflow. I really don't see how changing stuff to double makes this
> easier to understand.

I hear you, attached v21 / 0003 is free of float/double arithmetics
and uses non-float point values. It should be more readable too with
those comments. I have not put it into its own function, because now
it fits the whole screen, so hopefully one can follow visually. Please
let me know if that code solves the doubts or feel free to reformat
it. That _numa_prepare_ptrs() is unused and will need to be removed,
but we can still move some code there if necessary.

> > 12) You have also raised "why not pg_shm_allocations_numa" instead of
> > "pg_shm_numa_allocations"
> >
> > OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
> > naming things in general), I can change if you want.
> >
>
> Me neither. I wonder if there's some precedent when adding similar
> variants for other catalogs ... can you check? I've been thinking about
> pg_stats and pg_stats_ext, but maybe there's a better example?

Hm, it seems we always go with suffix "_somethingnew":

* pg_stat_database -> pg_stat_database_conflicts
* pg_stat_subscription -> pg_stat_subscription_stats
* even here: pg_buffercache -> pg_buffercache_numa

@Bertrand: do you have anything against pg_shm_allocations_numa
instead of pg_shm_numa_allocations? I don't mind changing it...

> > 13) In the patch: "review: What if we get multiple pages per buffer
> > (the default). Could we get multiple nodes per buffer?"
> >
> > OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
> > output multiple rows per single buffer (with "page_no") then we could
> > get this:
> > buffer1:..:page0:numanodeID1
> > buffer1:..:page1:numanodeID2
> > buffer2:..:page0:numanodeID1
> >
> > Should we add such functionality?
>
> When you say "today no" does that mean we know all pages will be on the
> same node, or that there may be pages from different nodes and we can't
> display that? That'd not be great, IMHO.
>
> I'm not a huge fan of returning multiple rows per buffer, with one row
> per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The
> rest of the fields is for the whole buffer, it'd be wrong to duplicate
> that for each page.

OPEN_QUESTION: With v21 we have all the information available, we are
just unable to display this in pg_buffercache_numa right now. We could
trim the view so that it has 3 columns (and user needs to JOIN to
pg_buffercache for more details like relationoid), but then what the
whole refactor (0002) was for if we would just return bufferId like
below:

buffer1:page0:numanodeID1
buffer1:page1:numanodeID2
buffer2:page0:numanodeID1
buffer2:page1:numanodeID1

There's also the problem that reading/joining could be inconsistent
and even slower.

> I wonder if we should have a bitmap of nodes for the buffer (but then
> what if there are multiple pages from the same node?), or maybe just an
> array of nodes, with one element per page.

AFAIR this has been discussed back in end of January, and the
conclusion was more or less - on Discord - that everything sucks
(bitmaps, BIT() datatype, arrays,...) either from implementation or
user side, but apparently arrays [] would suck the least from
implementation side. So we could probably do something like up to
node_max_nodes():
buffer1:..:{0, 2, 0, 0}
buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes
buffer3:..:{0, 0, 0, 2}

Other idea is JSON or even simple string with numa_node_id<->count:
buffer1:..:"1=2"
buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes
buffer3:..:"3=2"

I find all of those non-user friendly and I'm afraid I won't be able
to pull that alone in time...

-J.

Attachment

pgsql-hackers by date:

Previous
From: Bertrand Drouvot
Date:
Subject: Re: Fix 035_standby_logical_decoding.pl race conditions
Next
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Fix slot synchronization with two_phase decoding enabled