Home > mailing lists

Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: Draft for basic NUMA observability
Date	April 4 17:36:15
Msg-id	2e3922f2-3310-43b1-886d-d0559984897d@vondra.me Whole thread Raw
In response to	Re: Draft for basic NUMA observability (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses	Re: Draft for basic NUMA observability
List	pgsql-hackers

Tree view


On 4/4/25 09:35, Jakub Wartak wrote:
> On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> Hi,
>>
>> On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
>>> On 4/3/25 15:12, Jakub Wartak wrote:
>>>> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
>>>>
>>>>> ...
>>>>>
>>>>> So unless someone can demonstrate a use case where this would matter,
>>>>> I'd not worry about it too much.
>>>>
>>>> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
>>>> it's just that I don't have cycles left today and probably lack skills
>>>> (i've never dealt with arrays so far) thus it would be slow to get it
>>>> right... but I can pick up anything tomorrow morning.
>>>>
>>>
>>> OK, I took a stab at reworking/simplifying this the way I proposed.
>>> Here's v24 - needs more polishing, but hopefully enough to show what I
>>> had in mind.
>>>
>>> It does these changes:
>>>
>>> 1) Drops 0002 with the pg_buffercache refactoring, because the new view
>>> is not "extending" the existing one.
>>
>> I think that makes sense. One would just need to join on the pg_buffercache
>> view to get more information about the buffer if needed.
>>
>> The pg_buffercache_numa_pages() doc needs an update though as I don't think that
>> "+  The <function>pg_buffercache_numa_pages()</function> provides the same
>> information as <function>pg_buffercache_pages()</function>" is still true.
>>
>>> 2) Reworks pg_buffercache_num to return just three columns, bufferid,
>>> page_num and node_id. page_num is a sequence starting from 0 for each
>>> buffer.
>>
>> +1 on the idea
>>
>>> 3) It now builds an array of records, with one record per buffer/page.
>>>
>>> 4) I realized we don't really need to worry about buffers_per_page very
>>> much, except for logging/debugging. There's always "at least one page"
>>> per buffer, even if an incomplete one, so we can do this:
>>>
>>>    os_page_count = NBuffers * Max(1, pages_per_buffer);
>>>
>>> and then
>>>
>>>   for (i = 0; i < NBuffers; i++)
>>>   {
>>>       for (j = 0; j < Max(1, pages_per_buffer); j++)
>>
>> That's a nice simplification as we always need to take care of at least one page
>> per buffer.
>>
>>> and everything just works fine, I think.
>>
>> I think the same.
>>
>>> Opinions? I personally find this much cleaner / easier to understand.
>>
>> I agree that's easier to understand and that that looks correct.
>>
>> A few random comments:
>>
>> === 1
>>
>> It looks like that firstNumaTouch is not set to false anymore.
>>
>> === 2
>>
>> +               pfree(os_page_status);
>> +               pfree(os_page_ptrs);
>>
>> Not sure that's needed, we should be in a short-lived memory context here
>> (ExprContext or such).
>>
>> === 3
>>
>> +       ro_volatile_var = *(uint64 *)ptr
>>
>> space missing before "ptr"?
>>
> 
> +my feedback as I've noticed that Bertrand already provided a review.
> 
> Right, the code is now simple , and that Max() is brilliant. I've
> attached some review findings as .txt
> 
> 0001 100%LGTM
> 0002 doc fix + pgident + Tomas, you should take Authored-by yourself
> there for sure, I couldn't pull this off alone in time! So big thanks!
> 0003 fixes elog UINT64_FORMAT for ming32 (a little bit funny to have
> NUMA on ming32...:))
> 

OK

> When started with interleave=all on serious hardware, I'm getting (~5s
> for s_b=64GB) from pg_buffercache_numa
> 
>      node_id |  count
>     ---------+---------
>            3 | 2097152
>            0 | 2097152
>            2 | 2097152
>            1 | 2097152
> 
> so this is valid result (2097152*4 numa nodes*8192 buffer
> size/1024/1024/1024 = 64GB)
> 
> Also with pgbench -i -s 20 , after ~8s:
>     select c.relname, n.node_id, count(*) from pg_buffercache_numa n
>     join pg_buffercache b on (b.bufferid = n.bufferid)
>     join pg_class c on (c.relfilenode = b.relfilenode)
>     group by c.relname, n.node_id order by count(*) desc;
>                         relname                    | node_id | count
>     -----------------------------------------------+---------+-------
>      pgbench_accounts                              |       2 |  8217
>      pgbench_accounts                              |       0 |  8190
>      pgbench_accounts                              |       3 |  8189
>      pgbench_accounts                              |       1 |  8187
>      pg_statistic                                  |       2 |    32
>      pg_operator                                   |       2 |    14
>      pg_depend                                     |       3 |    14
>      [..]
> 
> pg_shm_allocations_numa also looks good.
> 
> I think v24+tiny fixes is good enough to go in.
> 

OK.

Do you have any suggestions regarding the column names in the new view?
I'm not sure I like node_id and page_num.


regards

-- 
Tomas Vondra

pgsql-hackers by date:

From: Robert Treat
Date: 04 April, 17:34:03
Subject: Re: pg_upgrade: Support for upgrading to checksums enabled

From: Andres Freund
Date: 04 April, 17:39:53
Subject: Re: Replace IN VALUES with ANY in WHERE clauses during optimization

Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

Previous

Next