Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Draft for basic NUMA observability |
Date | |
Msg-id | fcae9008-a312-4aef-9252-196f1891cdb9@vondra.me Whole thread Raw |
In response to | Re: Draft for basic NUMA observability (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
Responses |
Re: Draft for basic NUMA observability
|
List | pgsql-hackers |
On 4/3/25 09:01, Jakub Wartak wrote: > On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote: > > Hi Tomas, > >> OK, so you agree the commit messages are complete / correct? > > Yes. > >> OK. FWIW if you disagree with some of my proposed changes, feel free to >> push back. I'm sure some may be more a matter of personal preference. > > No, it's all fine. I will probably have lots of questions about > setting proper env for development that cares itself about style, but > that's for another day. > >> [..floats..] >> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really >> confusing and difficult to reason about ... >> >> I agree we don't want special cases for each possible combination of >> page sizes (I'm not sure we even know all the combinations). What I was >> thinking about is two branches, one for (block >= page) and another for >> (block < page). AFAICK both values have to be 2^k, so this would >> guarantee we have either (block/page) or (page/block) as integer. >> >> I wonder if you could even just calculate both, and have one loop that >> deals with both. >> > [..] >> When I say "integer arithmetic" I don't mean it should use 32-bit ints, >> or any other data type. I mean that it works with non-floating point >> values. It could be int64, Size or whatever is large enough to not >> overflow. I really don't see how changing stuff to double makes this >> easier to understand. > > I hear you, attached v21 / 0003 is free of float/double arithmetics > and uses non-float point values. It should be more readable too with > those comments. I have not put it into its own function, because now > it fits the whole screen, so hopefully one can follow visually. Please > let me know if that code solves the doubts or feel free to reformat > it. That _numa_prepare_ptrs() is unused and will need to be removed, > but we can still move some code there if necessary. > IMHO the code in v21 is much easier to understand. It's not quite clear to me why it's done outside pg_buffercache_numa_prepare_ptrs(), though. >>> 12) You have also raised "why not pg_shm_allocations_numa" instead of >>> "pg_shm_numa_allocations" >>> >>> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or >>> naming things in general), I can change if you want. >>> >> >> Me neither. I wonder if there's some precedent when adding similar >> variants for other catalogs ... can you check? I've been thinking about >> pg_stats and pg_stats_ext, but maybe there's a better example? > > Hm, it seems we always go with suffix "_somethingnew": > > * pg_stat_database -> pg_stat_database_conflicts > * pg_stat_subscription -> pg_stat_subscription_stats > * even here: pg_buffercache -> pg_buffercache_numa > > @Bertrand: do you have anything against pg_shm_allocations_numa > instead of pg_shm_numa_allocations? I don't mind changing it... > +1 to pg_shmem_allocations_numa >>> 13) In the patch: "review: What if we get multiple pages per buffer >>> (the default). Could we get multiple nodes per buffer?" >>> >>> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to >>> output multiple rows per single buffer (with "page_no") then we could >>> get this: >>> buffer1:..:page0:numanodeID1 >>> buffer1:..:page1:numanodeID2 >>> buffer2:..:page0:numanodeID1 >>> >>> Should we add such functionality? >> >> When you say "today no" does that mean we know all pages will be on the >> same node, or that there may be pages from different nodes and we can't >> display that? That'd not be great, IMHO. >> >> I'm not a huge fan of returning multiple rows per buffer, with one row >> per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The >> rest of the fields is for the whole buffer, it'd be wrong to duplicate >> that for each page. > > OPEN_QUESTION: With v21 we have all the information available, we are > just unable to display this in pg_buffercache_numa right now. We could > trim the view so that it has 3 columns (and user needs to JOIN to > pg_buffercache for more details like relationoid), but then what the > whole refactor (0002) was for if we would just return bufferId like > below: > > buffer1:page0:numanodeID1 > buffer1:page1:numanodeID2 > buffer2:page0:numanodeID1 > buffer2:page1:numanodeID1 > > There's also the problem that reading/joining could be inconsistent > and even slower. > I think a view with just 3 columns would be a good solution. It's what pg_shmem_allocations_numa already does, so it'd be consistent with that part too. I'm not too worried about the cost of the extra join - it's going to be a couple dozen milliseconds at worst, I guess, and that's negligible in the bigger scheme of things (e.g. compared to how long the move_pages is expected to take). Also, it's not like having everything in the same view is free - people would have to do some sort of post-processing, and that has a cost too. So unless someone can demonstrate a use case where this would matter, I'd not worry about it too much. >> I wonder if we should have a bitmap of nodes for the buffer (but then >> what if there are multiple pages from the same node?), or maybe just an >> array of nodes, with one element per page. > > AFAIR this has been discussed back in end of January, and the > conclusion was more or less - on Discord - that everything sucks > (bitmaps, BIT() datatype, arrays,...) either from implementation or > user side, but apparently arrays [] would suck the least from > implementation side. So we could probably do something like up to > node_max_nodes(): > buffer1:..:{0, 2, 0, 0} > buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes > buffer3:..:{0, 0, 0, 2} > > Other idea is JSON or even simple string with numa_node_id<->count: > buffer1:..:"1=2" > buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes > buffer3:..:"3=2" > > I find all of those non-user friendly and I'm afraid I won't be able > to pull that alone in time... I'm -1 on JSON, I don't see how would that solve anything better than e.g. a regular array, and it's going to be harder to work with. So if we don't want to go with the 3-column view proposed earlier, I'd stick to a simple array. I don't think there's a huge difference between those two approaches, it should be easy to convert between those approaches using unnest() and array_agg(). Attached is v22, with some minor review comments: 1) I suggested we should just use "libnuma support" in configure, instead of talking about "NUMA awareness support", and AFAICS you agreed. But I still see the old text in configure ... is that intentional or a bit of forgotten text? 2) I added a couple asserts to pg_buffercache_numa_pages() and comments, and simplified a couple lines (but that's a matter of preference). 3) I don't think it's correct for pg_get_shmem_numa_allocations to just silently ignore nodes outside the valid range. I suggest we simply do elog(ERROR), as it's an internal error we don't expect to happen. regards -- Tomas Vondra
Attachment
pgsql-hackers by date: