Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From | Jakub Wartak |
---|---|
Subject | Re: Draft for basic NUMA observability |
Date | |
Msg-id | CAKZiRmy2TDsuWuDh4C8077QMwO_fidz-fxtYGRYfkJZsSxYpTQ@mail.gmail.com Whole thread Raw |
In response to | Re: Draft for basic NUMA observability (Tomas Vondra <tomas@vondra.me>) |
Responses |
Re: Draft for basic NUMA observability
Re: Draft for basic NUMA observability |
List | pgsql-hackers |
On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote: Hi Tomas, > OK, so you agree the commit messages are complete / correct? Yes. > OK. FWIW if you disagree with some of my proposed changes, feel free to > push back. I'm sure some may be more a matter of personal preference. No, it's all fine. I will probably have lots of questions about setting proper env for development that cares itself about style, but that's for another day. > [..floats..] > Hmmm, OK. Maybe it's correct. I still find the float arithmetic really > confusing and difficult to reason about ... > > I agree we don't want special cases for each possible combination of > page sizes (I'm not sure we even know all the combinations). What I was > thinking about is two branches, one for (block >= page) and another for > (block < page). AFAICK both values have to be 2^k, so this would > guarantee we have either (block/page) or (page/block) as integer. > > I wonder if you could even just calculate both, and have one loop that > deals with both. > [..] > When I say "integer arithmetic" I don't mean it should use 32-bit ints, > or any other data type. I mean that it works with non-floating point > values. It could be int64, Size or whatever is large enough to not > overflow. I really don't see how changing stuff to double makes this > easier to understand. I hear you, attached v21 / 0003 is free of float/double arithmetics and uses non-float point values. It should be more readable too with those comments. I have not put it into its own function, because now it fits the whole screen, so hopefully one can follow visually. Please let me know if that code solves the doubts or feel free to reformat it. That _numa_prepare_ptrs() is unused and will need to be removed, but we can still move some code there if necessary. > > 12) You have also raised "why not pg_shm_allocations_numa" instead of > > "pg_shm_numa_allocations" > > > > OPEN_QUESTION: To be honest, I'm not attached to any of those two (or > > naming things in general), I can change if you want. > > > > Me neither. I wonder if there's some precedent when adding similar > variants for other catalogs ... can you check? I've been thinking about > pg_stats and pg_stats_ext, but maybe there's a better example? Hm, it seems we always go with suffix "_somethingnew": * pg_stat_database -> pg_stat_database_conflicts * pg_stat_subscription -> pg_stat_subscription_stats * even here: pg_buffercache -> pg_buffercache_numa @Bertrand: do you have anything against pg_shm_allocations_numa instead of pg_shm_numa_allocations? I don't mind changing it... > > 13) In the patch: "review: What if we get multiple pages per buffer > > (the default). Could we get multiple nodes per buffer?" > > > > OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to > > output multiple rows per single buffer (with "page_no") then we could > > get this: > > buffer1:..:page0:numanodeID1 > > buffer1:..:page1:numanodeID2 > > buffer2:..:page0:numanodeID1 > > > > Should we add such functionality? > > When you say "today no" does that mean we know all pages will be on the > same node, or that there may be pages from different nodes and we can't > display that? That'd not be great, IMHO. > > I'm not a huge fan of returning multiple rows per buffer, with one row > per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The > rest of the fields is for the whole buffer, it'd be wrong to duplicate > that for each page. OPEN_QUESTION: With v21 we have all the information available, we are just unable to display this in pg_buffercache_numa right now. We could trim the view so that it has 3 columns (and user needs to JOIN to pg_buffercache for more details like relationoid), but then what the whole refactor (0002) was for if we would just return bufferId like below: buffer1:page0:numanodeID1 buffer1:page1:numanodeID2 buffer2:page0:numanodeID1 buffer2:page1:numanodeID1 There's also the problem that reading/joining could be inconsistent and even slower. > I wonder if we should have a bitmap of nodes for the buffer (but then > what if there are multiple pages from the same node?), or maybe just an > array of nodes, with one element per page. AFAIR this has been discussed back in end of January, and the conclusion was more or less - on Discord - that everything sucks (bitmaps, BIT() datatype, arrays,...) either from implementation or user side, but apparently arrays [] would suck the least from implementation side. So we could probably do something like up to node_max_nodes(): buffer1:..:{0, 2, 0, 0} buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes buffer3:..:{0, 0, 0, 2} Other idea is JSON or even simple string with numa_node_id<->count: buffer1:..:"1=2" buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes buffer3:..:"3=2" I find all of those non-user friendly and I'm afraid I won't be able to pull that alone in time... -J.
Attachment
pgsql-hackers by date: