Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Draft for basic NUMA observability |
Date | |
Msg-id | 2e3922f2-3310-43b1-886d-d0559984897d@vondra.me Whole thread Raw |
In response to | Re: Draft for basic NUMA observability (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
Responses |
Re: Draft for basic NUMA observability
|
List | pgsql-hackers |
On 4/4/25 09:35, Jakub Wartak wrote: > On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: >> >> Hi, >> >> On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote: >>> On 4/3/25 15:12, Jakub Wartak wrote: >>>> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote: >>>> >>>>> ... >>>>> >>>>> So unless someone can demonstrate a use case where this would matter, >>>>> I'd not worry about it too much. >>>> >>>> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me, >>>> it's just that I don't have cycles left today and probably lack skills >>>> (i've never dealt with arrays so far) thus it would be slow to get it >>>> right... but I can pick up anything tomorrow morning. >>>> >>> >>> OK, I took a stab at reworking/simplifying this the way I proposed. >>> Here's v24 - needs more polishing, but hopefully enough to show what I >>> had in mind. >>> >>> It does these changes: >>> >>> 1) Drops 0002 with the pg_buffercache refactoring, because the new view >>> is not "extending" the existing one. >> >> I think that makes sense. One would just need to join on the pg_buffercache >> view to get more information about the buffer if needed. >> >> The pg_buffercache_numa_pages() doc needs an update though as I don't think that >> "+ The <function>pg_buffercache_numa_pages()</function> provides the same >> information as <function>pg_buffercache_pages()</function>" is still true. >> >>> 2) Reworks pg_buffercache_num to return just three columns, bufferid, >>> page_num and node_id. page_num is a sequence starting from 0 for each >>> buffer. >> >> +1 on the idea >> >>> 3) It now builds an array of records, with one record per buffer/page. >>> >>> 4) I realized we don't really need to worry about buffers_per_page very >>> much, except for logging/debugging. There's always "at least one page" >>> per buffer, even if an incomplete one, so we can do this: >>> >>> os_page_count = NBuffers * Max(1, pages_per_buffer); >>> >>> and then >>> >>> for (i = 0; i < NBuffers; i++) >>> { >>> for (j = 0; j < Max(1, pages_per_buffer); j++) >> >> That's a nice simplification as we always need to take care of at least one page >> per buffer. >> >>> and everything just works fine, I think. >> >> I think the same. >> >>> Opinions? I personally find this much cleaner / easier to understand. >> >> I agree that's easier to understand and that that looks correct. >> >> A few random comments: >> >> === 1 >> >> It looks like that firstNumaTouch is not set to false anymore. >> >> === 2 >> >> + pfree(os_page_status); >> + pfree(os_page_ptrs); >> >> Not sure that's needed, we should be in a short-lived memory context here >> (ExprContext or such). >> >> === 3 >> >> + ro_volatile_var = *(uint64 *)ptr >> >> space missing before "ptr"? >> > > +my feedback as I've noticed that Bertrand already provided a review. > > Right, the code is now simple , and that Max() is brilliant. I've > attached some review findings as .txt > > 0001 100%LGTM > 0002 doc fix + pgident + Tomas, you should take Authored-by yourself > there for sure, I couldn't pull this off alone in time! So big thanks! > 0003 fixes elog UINT64_FORMAT for ming32 (a little bit funny to have > NUMA on ming32...:)) > OK > When started with interleave=all on serious hardware, I'm getting (~5s > for s_b=64GB) from pg_buffercache_numa > > node_id | count > ---------+--------- > 3 | 2097152 > 0 | 2097152 > 2 | 2097152 > 1 | 2097152 > > so this is valid result (2097152*4 numa nodes*8192 buffer > size/1024/1024/1024 = 64GB) > > Also with pgbench -i -s 20 , after ~8s: > select c.relname, n.node_id, count(*) from pg_buffercache_numa n > join pg_buffercache b on (b.bufferid = n.bufferid) > join pg_class c on (c.relfilenode = b.relfilenode) > group by c.relname, n.node_id order by count(*) desc; > relname | node_id | count > -----------------------------------------------+---------+------- > pgbench_accounts | 2 | 8217 > pgbench_accounts | 0 | 8190 > pgbench_accounts | 3 | 8189 > pgbench_accounts | 1 | 8187 > pg_statistic | 2 | 32 > pg_operator | 2 | 14 > pg_depend | 3 | 14 > [..] > > pg_shm_allocations_numa also looks good. > > I think v24+tiny fixes is good enough to go in. > OK. Do you have any suggestions regarding the column names in the new view? I'm not sure I like node_id and page_num. regards -- Tomas Vondra
pgsql-hackers by date: