Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Draft for basic NUMA observability
Date
Msg-id Z9mDWEj+kQzx6nmk@ip-10-97-1-34.eu-west-3.compute.internal
Whole thread Raw
In response to Re: Draft for basic NUMA observability  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses Re: Draft for basic NUMA observability
List pgsql-hackers
Hi,

On Tue, Mar 18, 2025 at 11:19:32AM +0100, Jakub Wartak wrote:
> On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> > Thanks for v13!
> 
> Rebased and fixes inside in the attached v14 (it passes CI too):

Thanks!

> > === 9
> >
> > +       max_zones = pg_numa_get_max_node();
> >
> > I think we are mixing "zone" and "node". I think we should standardize on one
> > and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
> > vote for node, but zone is fine too if you prefer.
> 
> Given that numa(7) does not use "zone" keyword at all and both
> /proc/zoneinfo and /proc/pagetypeinfo shows that NUMA nodes are split
> into zones, we can conclude that "zone" is simply a subdivision within
> a NUMA node's memory (internal kernel thing). Examples are ZONE_DMA,
> ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without
> internal information about zones), therefore we should stay away from
> using "zone" within the patch at all, as we are just fetching NUMA
> node info. My bad, it's a terminology error on my side from start -
> I've probably saw "zone" info in some command output back then when we
> had that workshop and started using it and somehow it propagated
> through the patchset up to this day...

Thanks for the explanation.

> I've adjusted it all and settled on "numa_node_id" column name.

Yeah, I can see, things like:

+  <para>
+   The definitions of the columns exposed are identical to the
+   <structname>pg_buffercache</structname> view, except that this one includes
+   one additional <structfield>numa_node_id</structfield> column as defined in
+   <xref linkend="pgbuffercache-numa-columns"/>.

and like:

+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>numa_size</structfield> <type>int4</type>
+      </para>

I think that you re-introduced the "numa_" in the column(s) name that we get
rid (or agreed to) of previously.

I think that we can get rid of the "numa_" stuff in column(s) name as
the column(s) are part of "numa" relation views/output anyway.

I think "node_id", "size" as column(s) name should be enough.

Or maybe that re-adding "numa_" was intentional?

> > === 15
> >
> > I think there is still some multi-lines comments that are missing a period. I
> > probably also missed some in 0002 during the previous review. I think that's
> > worth another check.
> 
> Please do such a check,

Found much more:


+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages)

is missing the period

and

+       /*
+        * Switch context when allocating stuff to be used in later calls
+        */

should be as before, meaning on current HEAD:

/* Switch context when allocating stuff to be used in later calls */

and

+       /*
+        * Return to original context when allocating transient memory
+        */

should be as before, meaning on current HEAD:

/* Return to original context when allocating transient memory */

and

+       /*
+        * Note if the buffer is valid, and has storage created
+        */

should be as before, meaning on current HEAD:

/* Note if the buffer is valid, and has storage created */

and

+               /*
+                * unused for v1.0 callers, but the array is always long enough
+                */


should be as before, meaning on current HEAD:

/* unused for v1.0 callers, but the array is always long enough */

and

+/*
+ * This is almost identical to the above, but performs
+ * NUMA inuqiry about memory mappings

is missing the period

and

+                * To correctly map between them, we need to: - Determine the OS
+                * memory page size - Calculate how many OS pages are used by all
+                * buffer blocks - Calculate how many OS pages are contained within
+                * each database block

is missing the period (2 times as this comment appears in 0002 and 0003)

and

+               /*
+                * If we ever get 0xff back from kernel inquiry, then we probably have
+                * bug in our buffers to OS page mapping code here


is missing the period (2 times as this comment appears in 0002 and 0003)

and

+       /*
+        * We are ignoring the following memory regions (as compared to
+        * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+        * counted via the shmem index 2. output as-of-yet unused shared memory

is missing the period

> I've tried pgident on all .c files, but I'm
> apparently blind to such issues.

I don't think pgident would report missing period.

> BTW if patch has anything left that
> causes pgident to fix, that is not picked by CI but it is picked by
> buildfarm??

I think it has to be done manually before each commit and that this is anyway
done at least once per release cycle.

> > === 16
> >
> > +    * In order to get reliable results we also need to touch memory
> > +    * pages so that inquiry about NUMA zone doesn't return -2.
> > +    */
> >
> > maybe use the same wording as in 0002?
> 
> But 0002 used:
> 
>   "In order to get reliable results we also need to touch memory pages, so that
>   inquiry about NUMA zone doesn't return -2 (which indicates
> unmapped/unallocated
>   pages)"
> 
> or are you looking at something different?

Nope, I meant to say that it could make sense to have the exact same comment.

> > === 17
> >
> > The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
> > but (as for 0002) it looks like we can not avoid it (or at least I don't see
> > a way to avoid it).
> 
> Hm, it's literally debug code. Why would we care so much if it is 2
> loops rather than 1? (as stated earlier we need to pack ptrs and then
> analyze it)

Yeah, but if we could just loop one time I'm pretty sure we'd have done that.

> > I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
> > updated.
> :w

> Cool, thanks in advance.

0001 looks in a good shape from my point of view.

For 0002:

=== 1

I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could
deserve their own patch. That would ease the review for the "real" numa stuff.

Maybe something like:

0001 as it is
0002 introduces (and uses) pg_buffercache_init_entries() and
pg_buffercache_build_tuple()
0003 current 0002 attached minus 0002 above

We did it that way in c2a50ac678e and ff7c40d7fd6 for example.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: optimize file transfer in pg_upgrade
Next
From: Andrew Dunstan
Date:
Subject: Re: Add -k/--link option to pg_combinebackup