Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Draft for basic NUMA observability
Msg-id: fd469735-5b81-4032-891d-21f78844bf22@vondra.me
In response to: Re: Draft for basic NUMA observability (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses: Re: Draft for basic NUMA observability
List: pgsql-hackers
On 4/2/25 16:46, Jakub Wartak wrote:
> On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi,
>>
>> I've spent a bit of time reviewing this. In general I haven't found
>> anything I'd call a bug, but here's a couple comments for v18 ... Most
>> of this is in separate "review" commits, with a couple exceptions.
>
> Hi, thank you very much for help on this, yes I did not anticipate
> this patch to organically grow like that...
> I've squashed those review findings into v20 and provided answers for
> the "review:".
>

Thanks.

>> 1) Please update the commit messages, with proper formatting, etc. I
>> tried to do that in the attached v19, but please go through that, add
>> relevant details, update list of reviewers, etc. The subject should not
>> be overly long, etc.
>
> Fixed by you.
>

OK, so you agree the commit messages are complete / correct?

>> 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
>> use just "libnuma support" similar to other libraries.
>
> Fixed by you.
>

OK. FWIW if you disagree with some of my proposed changes, feel free to
push back. I'm sure some may be more a matter of personal preference.

>> 3) I don't think we need pg_numa.h to have this:
>>
>>   extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
>>
>> AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
>> would it be necessary here? It's enough to have a prototype in .c file.
>
> Right, probably the result of ENOTENOUGHCOFFEE and copy/paste.
>
>> 4) Improved .sgml to have acronym/productname in a couple places.
>
> Great.
>
>> 5) I don't think the comment for pg_buffercache_init_entries() is very
>> useful. That it's helper for pg_buffercache_pages() tells me nothing
>> about how to use it, what the arguments are, etc.
>
> I've added an explanation (in 0003 though), so that this is covered.
> I've always assumed that 'static' functions don't need that much of
> that(?)
I think that's mostly true - the (non-static) functions that are part of
the API for a module need better / more detailed docs. But that doesn't
mean static functions shouldn't have at least basic docs (unless the
function is trivial / obvious, but I don't think that's the case here).
If I happen to work on a pg_buffercache patch in a couple months, I'll
still need to understand what the function does. It won't save me that
I'm working on the same file ...

I'm not saying this needs detailed docs, but "helper for X" adds very
little information - I can easily see where it's called from, right?

>> 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
>> comment too. I mean, there's no info about what the arguments are, which
>> arguments are input or output, etc. And it only discussed one option
>> (block page < memory page), but what happens in the other case? The
>> formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
>> saying it's wrong).
>>
>> However, it seems rather suspicious that pages_per_blk is calculated as
>> float, and pg_buffercache_numa_prepare_ptrs() then does this:
>>
>>   for (size_t j = 0; j < pages_per_blk; j++)
>>   { ... }
>>
>> I mean, isn't this vulnerable to rounding errors, which might trigger
>> some weird behavior? If not, it'd be good to add a comment why this is
>> fine, it confuses me a lot. I personally would probably prefer doing
>> just integer arithmetic here.
>
> Please bear with me: If you set client_min_messages to debug1 then
> pg_buffercache_numa will dump:
> a) without HP: DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.00
> b) with HP (2M): DEBUG: NUMA: os_page_count=64 os_page_size=2097152 pages_per_blk=0.003906
>
> so we need to be agile to support two cases as you mention (BLCKSZ >
> PAGESIZE and BLCKSZ < PAGESIZE).
> BLCKSZ is 2..32kB and pagesize is 4kB..1GB, thus we can get in that
> float the following sample values:
>
>   BLCKSZ  pagesize
>   2kB     4kB                = 0.5
>   2kB     2048kB             = .0009765625
>   2kB     1024*1024kB # 1GB  = .0000019073486328125 # worst-case?
>   8kB     4kB                = 2
>   8kB     2048kB             = .003906250 # example from above (x86_64, 2M HP)
>   8kB     1024*1024kB # 1GB  = .00000762939453
>   32kB    4kB                = 8
>   32kB    2048kB             = .0156250
>   32kB    1024*1024kB # 1GB  = .000030517578125
>
> So that loop:
>   for (size_t j = 0; j < pages_per_blk; j++)
> is quite generic and launches in both cases. I've somehow failed to
> come up with integer-based math and generic code for this (without
> special cases which seem to be no-go here?). So, that loop then will:
> a) launch many times to support BLCKSZ > pagesize, that is when a single
> DB block spans multiple memory pages
> b) launch once when BLCKSZ < pagesize (because 0.003906250 > 0 in the
> example above)
>

Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
confusing and difficult to reason about ... I agree we don't want
special cases for each possible combination of page sizes (I'm not sure
we even know all the combinations). What I was thinking about is two
branches, one for (block >= page) and another for (block < page). AFAICK
both values have to be 2^k, so this would guarantee we have either
(block/page) or (page/block) as integer. I wonder if you could even just
calculate both, and have one loop that deals with both.

> The loop touches && stores addresses into os_page_ptrs[] as input to
> this one big move_pages(2) query. So we basically ask for all memory
> pages for NBuffers. Once we get our NUMA information we then use
> blk2page = up_to_NBuffers * pages_per_blk to resolve memory pointers
> back to Buffers, if anywhere it could be a problem here.
>

IMHO this is the information I'd expect in the function comment.
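To illustrate the two-branch idea, an integer-only mapping could look
roughly like this (a hypothetical sketch, not the patch's actual code;
buffer_to_pages and its arguments are invented names, and it assumes
both BLCKSZ and the OS page size are powers of two, so one of the two
divisions is always exact):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: map a buffer index to its first OS page and the
 * number of OS pages it spans, using only integer arithmetic.  64-bit
 * math avoids overflow even for very large shared_buffers.
 */
static void
buffer_to_pages(uint64_t buffer, uint64_t blcksz, uint64_t os_page_size,
				uint64_t *first_page, uint64_t *npages)
{
	if (blcksz >= os_page_size)
	{
		/* one buffer spans one or more OS pages; division is exact */
		uint64_t	pages_per_blk = blcksz / os_page_size;

		*first_page = buffer * pages_per_blk;
		*npages = pages_per_blk;
	}
	else
	{
		/* several buffers share one OS page (e.g. huge pages) */
		uint64_t	blks_per_page = os_page_size / blcksz;

		*first_page = buffer / blks_per_page;
		*npages = 1;
	}
}
```

With 8kB blocks and 4kB pages, buffer 3 maps to pages 6 and 7; with 8kB
blocks and 2MB huge pages, buffer 700 maps to the single page 2.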
> So let's say we have s_b=4TB (it won't work for sure for other reasons,
> let's assume we have it), let's also assume we have no huge pages
> (pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824,
> which multiplied by 2 = INT_MAX (integer overflow bug), so I think
> that int is not big enough there in pg_buffercache_numa_pages() (it
> should be "size_t blk2page" there as in
> pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20)
>
> Another angle is s_b=4TB RAM with 2MB HP, BLCKSZ=8kB =>
> NBuffers=2097152 * 0.003906250 = 8192.0 .
>
> OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help,
> but with
>   float f = 2097152 * 0.003906250;
> under clang -Weverything I got "implicit conversion increases
> floating-point precision: 'float' to 'double'", so either it is:
> - we somehow rewrite all of the core arithmetics here to integer?
> - or simply go with doubles just to be sure? I went with doubles in
> v20, comments explaining are not there yet.
>

When I say "integer arithmetic" I don't mean it should use 32-bit ints,
or any other particular data type. I mean that it works with
non-floating-point values. It could be int64, Size or whatever is large
enough to not overflow. I really don't see how changing stuff to double
makes this easier to understand.

>> 7) This para in the docs seems self-contradictory:
>>
>>   <para>
>>    The <function>pg_buffercache_numa_pages()</function> provides the
>>    same information as <function>pg_buffercache_pages()</function> but
>>    is slower because it also provides the <acronym>NUMA</acronym> node
>>    ID per shared buffer entry. The
>>    <structname>pg_buffercache_numa</structname> view wraps the function
>>    for convenient use.
>>   </para>
>>
>> I mean, "provides the same information, but is slower because it
>> provides different information" is strange. I understand the point, but
>> maybe rephrase somehow?
>
> Oh my... yes, now it looks way better.
>
>> 8) Why is pg_numa_available() marked as volatile?
>> Should not change in a running cluster, no?
>
> No it shouldn't, good find, made it 's'table.
>
>> 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
>> let's not do that.
>>
>>   <sect2 id="pgbuffercache-pg-buffercache_numa">
>
> Fixed.
>
>> 10) I think it'd be good to mention/explain why move_pages is used
>> instead of get_mempolicy - more efficient with batching, etc. This would
>> be useful both in the commit message and before the move_pages call
>
> Ok, added in 0001.
>
>> (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares
>> the pointers like this etc.).
>
> Added reference to that earlier new comment here too in 0003.
>

Will take a look in the evening.

>> 11) This could use UINT64_FORMAT, instead of a cast:
>>
>>   elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
>>        (unsigned long) os_page_count, os_page_size, pages_per_blk);
>
> Done.
>
> 12) You have also raised "why not pg_shm_allocations_numa" instead of
> "pg_shm_numa_allocations"
>
> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
> naming things in general), I can change if you want.
>

Me neither. I wonder if there's some precedent when adding similar
variants for other catalogs ... can you check? I've been thinking about
pg_stats and pg_stats_ext, but maybe there's a better example?

> 13) In the patch: "review: What if we get multiple pages per buffer
> (the default). Could we get multiple nodes per buffer?"
>
> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
> output multiple rows per single buffer (with "page_no") then we could
> get this:
>   buffer1:..:page0:numanodeID1
>   buffer1:..:page1:numanodeID2
>   buffer2:..:page0:numanodeID1
>
> Should we add such functionality?
>

When you say "today no" does that mean we know all pages will be on the
same node, or that there may be pages from different nodes and we can't
display that? That'd not be great, IMHO.
I'm not a huge fan of returning multiple rows per buffer, with one row
per page. So for 8K blocks and 4K pages we'd have 2 rows per buffer. The
rest of the fields are for the whole buffer, it'd be wrong to duplicate
that for each page.

I wonder if we should have a bitmap of nodes for the buffer (but then
what if there are multiple pages from the same node?), or maybe just an
array of nodes, with one element per page.

regards

-- 
Tomas Vondra
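For the record, the "array of nodes, one element per page" option could
be sketched like this (a hypothetical illustration only, not the patch:
buffer_node_ids and its arguments are invented names, it assumes
power-of-two block and page sizes, and it takes a per-OS-page status
array of the kind move_pages(2) fills in):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: number of OS pages per buffer, at least 1 */
static inline uint64_t
pages_per_buffer(uint64_t blcksz, uint64_t os_page_size)
{
	return (blcksz >= os_page_size) ? blcksz / os_page_size : 1;
}

/*
 * Fill node_ids[0 .. npages-1] for one buffer from a per-OS-page node
 * status array (one entry per page, as filled by move_pages(2)).  When
 * several buffers share one huge page, each buffer reports that single
 * page's node; when a buffer spans several pages, it reports one node
 * per page, which is how a buffer can show multiple nodes.
 */
static void
buffer_node_ids(uint64_t buffer, uint64_t blcksz, uint64_t os_page_size,
				const int *page_status, int *node_ids)
{
	uint64_t	npages = pages_per_buffer(blcksz, os_page_size);
	uint64_t	first;

	if (blcksz >= os_page_size)
		first = buffer * npages;	/* buffer spans npages OS pages */
	else
		first = buffer / (os_page_size / blcksz);	/* shared OS page */

	for (uint64_t j = 0; j < npages; j++)
		node_ids[j] = page_status[first + j];
}
```

With 8kB blocks on 4kB pages this yields two node entries per buffer;
with 2MB huge pages it yields one, so the SQL output could expose the
result as an array column without duplicating the per-buffer fields.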