Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Draft for basic NUMA observability
Date
Msg-id a4cfd501-aa76-4bb1-a330-792ed230e1f1@vondra.me
Whole thread Raw
In response to Re: Draft for basic NUMA observability  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Draft for basic NUMA observability
Re: Draft for basic NUMA observability
List pgsql-hackers
On 4/5/25 15:23, Tomas Vondra wrote:
> On 4/5/25 11:37, Bertrand Drouvot wrote:
>> Hi,
>>
>> On Fri, Apr 04, 2025 at 09:25:57PM +0200, Tomas Vondra wrote:
>>> OK,
>>>
>>> here's v25 after going through the patches once more, fixing the issues
>>> mentioned by Bertrand, etc.
>>
>> Thanks!
>>
>>> I think 0001 and 0002 are fine,
>>
>> Agree, I just have some cosmetic nits comments: please find them in
>> nit-bertrand-0002.txt attached.
>>
>>> I have a
>>> couple minor questions about 0003.
>>>
>>> - I was wondering if maybe we should have some "global ID" of memory
>>> page, so that with large memory pages it's indicated the buffers are on
>>> the same memory page. Right now each buffer starts page_num from 0, but
>>> it should not be very hard to have a global counter. Opinions?
>>
>> I think that's a good idea. We could then add a new column (say os_page_id) that
>> would help identify which buffers are sharing the same "physical" page.
>>
> 
> I was thinking we'd change the definition of the existing page_num
> column, i.e. it wouldn't be 0..N sequence for each buffer, but a global
> page ID. But I don't know if this would be useful in practice.
> 

See the attached v25 with a draft of this in patch 0003.

While working on this, I realized it's probably wrong to use TYPEALIGN()
to calculate the OS page pointer. The code did this:

    os_page_ptrs[idx]
        = (char *) TYPEALIGN(os_page_size,
                             buffptr + (os_page_size * j));

but TYPEALIGN() rounds "up". Let's assume we have 1KB buffers and 4KB
memory pages, and that the first buffer is aligned to 4kB (i.e. it
starts right at the beginning of a memory page). Then we expect to get
page_num sequence:

    0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, ...

with 4 buffers per memory page. But we get this:

    0, 1, 1, 1, 1, 2, 2, 2, 2, ...

So I've changed this to TYPEALIGN_DOWN(), which fixes the result.

The pg_shmem_allocations_numa had a variant of this issue when
calculating the number of pages, I think. I believe the shmem segment
may start half-way through a page, and the allocated_size may not be a
multiple of os_page_size (otherwise why use TYPEALIGN). In which case we
might skip the first page, I think.

The one question I have about this is whether we know the pointer
returned by TYPEALIGN_DOWN() is valid. It's before ent->location (or
before the first shared buffer) ...


regards

-- 
Tomas Vondra

Attachment

pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: AIO v2.5
Next
From: Andrew Dunstan
Date:
Subject: Re: getting "shell command argument contains a newline or carriage return:" error with pg_dumpall when db name have new line in double quote