Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Draft for basic NUMA observability
Date
Msg-id fcae9008-a312-4aef-9252-196f1891cdb9@vondra.me
Whole thread Raw
In response to Re: Draft for basic NUMA observability  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses Re: Draft for basic NUMA observability
List pgsql-hackers
On 4/3/25 09:01, Jakub Wartak wrote:
> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
> 
> Hi Tomas,
> 
>> OK, so you agree the commit messages are complete / correct?
> 
> Yes.
> 
>> OK. FWIW if you disagree with some of my proposed changes, feel free to
>> push back. I'm sure some may be more a matter of personal preference.
> 
> No, it's all fine. I will probably have lots of questions about
> setting proper env for development that cares itself about style, but
> that's for another day.
> 
>> [..floats..]
>> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
>> confusing and difficult to reason about ...
>>
>> I agree we don't want special cases for each possible combination of
>> page sizes (I'm not sure we even know all the combinations). What I was
>> thinking about is two branches, one for (block >= page) and another for
>> (block < page). AFAICK both values have to be 2^k, so this would
>> guarantee we have either (block/page) or (page/block) as integer.
>>
>> I wonder if you could even just calculate both, and have one loop that
>> deals with both.
>>
> [..]
>> When I say "integer arithmetic" I don't mean it should use 32-bit ints,
>> or any other data type. I mean that it works with non-floating point
>> values. It could be int64, Size or whatever is large enough to not
>> overflow. I really don't see how changing stuff to double makes this
>> easier to understand.
> 
> I hear you, attached v21 / 0003 is free of float/double arithmetics
> and uses non-float point values. It should be more readable too with
> those comments. I have not put it into its own function, because now
> it fits the whole screen, so hopefully one can follow visually. Please
> let me know if that code solves the doubts or feel free to reformat
> it. That _numa_prepare_ptrs() is unused and will need to be removed,
> but we can still move some code there if necessary.
> 

IMHO the code in v21 is much easier to understand. It's not quite clear
to me why it's done outside pg_buffercache_numa_prepare_ptrs(), though.

>>> 12) You have also raised "why not pg_shm_allocations_numa" instead of
>>> "pg_shm_numa_allocations"
>>>
>>> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
>>> naming things in general), I can change if you want.
>>>
>>
>> Me neither. I wonder if there's some precedent when adding similar
>> variants for other catalogs ... can you check? I've been thinking about
>> pg_stats and pg_stats_ext, but maybe there's a better example?
> 
> Hm, it seems we always go with suffix "_somethingnew":
> 
> * pg_stat_database -> pg_stat_database_conflicts
> * pg_stat_subscription -> pg_stat_subscription_stats
> * even here: pg_buffercache -> pg_buffercache_numa
> 
> @Bertrand: do you have anything against pg_shm_allocations_numa
> instead of pg_shm_numa_allocations? I don't mind changing it...
> 

+1 to pg_shmem_allocations_numa

>>> 13) In the patch: "review: What if we get multiple pages per buffer
>>> (the default). Could we get multiple nodes per buffer?"
>>>
>>> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
>>> output multiple rows per single buffer (with "page_no") then we could
>>> get this:
>>> buffer1:..:page0:numanodeID1
>>> buffer1:..:page1:numanodeID2
>>> buffer2:..:page0:numanodeID1
>>>
>>> Should we add such functionality?
>>
>> When you say "today no" does that mean we know all pages will be on the
>> same node, or that there may be pages from different nodes and we can't
>> display that? That'd not be great, IMHO.
>>
>> I'm not a huge fan of returning multiple rows per buffer, with one row
>> per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The
>> rest of the fields is for the whole buffer, it'd be wrong to duplicate
>> that for each page.
> 
> OPEN_QUESTION: With v21 we have all the information available, we are
> just unable to display this in pg_buffercache_numa right now. We could
> trim the view so that it has 3 columns (and user needs to JOIN to
> pg_buffercache for more details like relationoid), but then what the
> whole refactor (0002) was for if we would just return bufferId like
> below:
> 
> buffer1:page0:numanodeID1
> buffer1:page1:numanodeID2
> buffer2:page0:numanodeID1
> buffer2:page1:numanodeID1
> 
> There's also the problem that reading/joining could be inconsistent
> and even slower.
> 

I think a view with just 3 columns would be a good solution. It's what
pg_shmem_allocations_numa already does, so it'd be consistent with that
part too.

I'm not too worried about the cost of the extra join - it's going to be
a couple dozen milliseconds at worst, I guess, and that's negligible in
the bigger scheme of things (e.g. compared to how long the move_pages is
expected to take). Also, it's not like having everything in the same
view is free - people would have to do some sort of post-processing, and
that has a cost too.

So unless someone can demonstrate a use case where this would matter,
I'd not worry about it too much.

>> I wonder if we should have a bitmap of nodes for the buffer (but then
>> what if there are multiple pages from the same node?), or maybe just an
>> array of nodes, with one element per page.
> 
> AFAIR this has been discussed back in end of January, and the
> conclusion was more or less - on Discord - that everything sucks
> (bitmaps, BIT() datatype, arrays,...) either from implementation or
> user side, but apparently arrays [] would suck the least from
> implementation side. So we could probably do something like up to
> node_max_nodes():
> buffer1:..:{0, 2, 0, 0}
> buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes
> buffer3:..:{0, 0, 0, 2}
> 
> Other idea is JSON or even simple string with numa_node_id<->count:
> buffer1:..:"1=2"
> buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes
> buffer3:..:"3=2"
> 
> I find all of those non-user friendly and I'm afraid I won't be able
> to pull that alone in time...

I'm -1 on JSON, I don't see how would that solve anything better than
e.g. a regular array, and it's going to be harder to work with. So if we
don't want to go with the 3-column view proposed earlier, I'd stick to a
simple array. I don't think there's a huge difference between those two
approaches, it should be easy to convert between those approaches using
unnest() and array_agg().

Attached is v22, with some minor review comments:

1) I suggested we should just use "libnuma support" in configure,
instead of talking about "NUMA awareness support", and AFAICS you
agreed. But I still see the old text in configure ... is that
intentional or a bit of forgotten text?

2) I added a couple asserts to pg_buffercache_numa_pages() and comments,
and simplified a couple lines (but that's a matter of preference).

3) I don't think it's correct for pg_get_shmem_numa_allocations to just
silently ignore nodes outside the valid range. I suggest we simply do
elog(ERROR), as it's an internal error we don't expect to happen.


regards

-- 
Tomas Vondra

Attachment

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query).
Next
From: Ranier Vilela
Date:
Subject: Re: libpq support for NegotiateProtocolVersion