Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From | Bertrand Drouvot |
---|---|
Subject | Re: Draft for basic NUMA observability |
Date | |
Msg-id | Z9mDWEj+kQzx6nmk@ip-10-97-1-34.eu-west-3.compute.internal Whole thread Raw |
In response to | Re: Draft for basic NUMA observability (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
Responses |
Re: Draft for basic NUMA observability
|
List | pgsql-hackers |
Hi, On Tue, Mar 18, 2025 at 11:19:32AM +0100, Jakub Wartak wrote: > On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > Thanks for v13! > > Rebased and fixes inside in the attached v14 (it passes CI too): Thanks! > > === 9 > > > > + max_zones = pg_numa_get_max_node(); > > > > I think we are mixing "zone" and "node". I think we should standardize on one > > and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to > > vote for node, but zone is fine too if you prefer. > > Given that numa(7) does not use "zone" keyword at all and both > /proc/zoneinfo and /proc/pagetypeinfo shows that NUMA nodes are split > into zones, we can conclude that "zone" is simply a subdivision within > a NUMA node's memory (internal kernel thing). Examples are ZONE_DMA, > ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without > internal information about zones), therefore we should stay away from > using "zone" within the patch at all, as we are just fetching NUMA > node info. My bad, it's a terminology error on my side from start - > I've probably saw "zone" info in some command output back then when we > had that workshop and started using it and somehow it propagated > through the patchset up to this day... Thanks for the explanation. > I've adjusted it all and settled on "numa_node_id" column name. Yeah, I can see, things like: + <para> + The definitions of the columns exposed are identical to the + <structname>pg_buffercache</structname> view, except that this one includes + one additional <structfield>numa_node_id</structfield> column as defined in + <xref linkend="pgbuffercache-numa-columns"/>. and like: + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>numa_size</structfield> <type>int4</type> + </para> I think that you re-introduced the "numa_" in the column(s) name that we get rid (or agreed to) of previously. I think that we can get rid of the "numa_" stuff in column(s) name as the column(s) are part of "numa" relation views/output anyway. I think "node_id", "size" as column(s) name should be enough. Or maybe that re-adding "numa_" was intentional? > > === 15 > > > > I think there is still some multi-lines comments that are missing a period. I > > probably also missed some in 0002 during the previous review. I think that's > > worth another check. > > Please do such a check, Found much more: + * In order to get reliable results we also need to touch memory pages, so that + * inquiry about NUMA memory node doesn't return -2 (which indicates + * unmapped/unallocated pages) is missing the period and + /* + * Switch context when allocating stuff to be used in later calls + */ should be as before, meaning on current HEAD: /* Switch context when allocating stuff to be used in later calls */ and + /* + * Return to original context when allocating transient memory + */ should be as before, meaning on current HEAD: /* Return to original context when allocating transient memory */ and + /* + * Note if the buffer is valid, and has storage created + */ should be as before, meaning on current HEAD: /* Note if the buffer is valid, and has storage created */ and + /* + * unused for v1.0 callers, but the array is always long enough + */ should be as before, meaning on current HEAD: /* unused for v1.0 callers, but the array is always long enough */ and +/* + * This is almost identical to the above, but performs + * NUMA inuqiry about memory mappings is missing the period and + * To correctly map between them, we need to: - Determine the OS + * memory page size - Calculate how many OS pages are used by all + * buffer blocks - Calculate how many OS pages are contained within + * each database block is missing the period (2 times as this comment appears in 0002 and 0003) and + /* + * If we ever get 0xff back from kernel inquiry, then we probably have + * bug in our buffers to OS page mapping code here is missing the period (2 times as this comment appears in 0002 and 0003) and + /* + * We are ignoring the following memory regions (as compared to + * pg_get_shmem_allocations()): 1. output shared memory allocated but not + * counted via the shmem index 2. output as-of-yet unused shared memory is missing the period > I've tried pgident on all .c files, but I'm > apparently blind to such issues. I don't think pgident would report missing period. > BTW if patch has anything left that > causes pgident to fix, that is not picked by CI but it is picked by > buildfarm?? I think it has to be done manually before each commit and that this is anyway done at least once per release cycle. > > === 16 > > > > + * In order to get reliable results we also need to touch memory > > + * pages so that inquiry about NUMA zone doesn't return -2. > > + */ > > > > maybe use the same wording as in 0002? > > But 0002 used: > > "In order to get reliable results we also need to touch memory pages, so that > inquiry about NUMA zone doesn't return -2 (which indicates > unmapped/unallocated > pages)" > > or are you looking at something different? Nope, I meant to say that it could make sense to have the exact same comment. > > === 17 > > > > The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count > > but (as for 0002) it looks like we can not avoid it (or at least I don't see > > a way to avoid it). > > Hm, it's literally debug code. Why would we care so much if it is 2 > loops rather than 1? (as stated earlier we need to pack ptrs and then > analyze it) Yeah, but if we could just loop one time I'm pretty sure we'd have done that. > > I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is > > updated. > :w > Cool, thanks in advance. 0001 looks in a good shape from my point of view. For 0002: === 1 I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could deserve their own patch. That would ease the review for the "real" numa stuff. Maybe something like: 0001 as it is 0002 introduces (and uses) pg_buffercache_init_entries() and pg_buffercache_build_tuple() 0003 current 0002 attached minus 0002 above We did it that way in c2a50ac678e and ff7c40d7fd6 for example. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: