Re: pgsql: Introduce pg_shmem_allocations_numa view - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: pgsql: Introduce pg_shmem_allocations_numa view
Msg-id: 8649a4e3-c60d-4f37-aa6f-e7e7c14c581e@vondra.me
In response to: Re: pgsql: Introduce pg_shmem_allocations_numa view (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers

On 6/24/25 13:10, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jun 24, 2025 at 11:20:15AM +0200, Tomas Vondra wrote:
>> On 6/24/25 10:24, Bertrand Drouvot wrote:
>>> Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
>>> pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
>>> than 16 pages.
>>>
>>> It's also confirmed by test_chunk_size.c attached:
>>>
>>> $ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
>>> $ ./test_chunk_size
>>>  1 pages: SUCCESS (0 errors)
>>>  2 pages: SUCCESS (0 errors)
>>>  3 pages: SUCCESS (0 errors)
>>>  4 pages: SUCCESS (0 errors)
>>>  5 pages: SUCCESS (0 errors)
>>>  6 pages: SUCCESS (0 errors)
>>>  7 pages: SUCCESS (0 errors)
>>>  8 pages: SUCCESS (0 errors)
>>>  9 pages: SUCCESS (0 errors)
>>> 10 pages: SUCCESS (0 errors)
>>> 11 pages: SUCCESS (0 errors)
>>> 12 pages: SUCCESS (0 errors)
>>> 13 pages: SUCCESS (0 errors)
>>> 14 pages: SUCCESS (0 errors)
>>> 15 pages: SUCCESS (0 errors)
>>> 16 pages: SUCCESS (0 errors)
>>> 17 pages: 1 errors
>>> Threshold: 17 pages
>>>
>>> No error if -m32 is not used.
>>>
>>> We could work in chunks (of 16?) on 32-bit, but that would probably cause a
>>> performance degradation (we do mention that in the docs though). Also, would
>>> 16 always be a correct chunk size?
>>
>> I don't see how this would solve anything?
>>
>> AFAICS the problem is that the two places disagree about how large the
>> array elements are, and end up interpreting them differently.
> 
>> I don't see how using a smaller array makes this correct. That it works is
>> more a matter of luck,
> 
> Not sure it's luck; maybe the wrong pointer arithmetic has no effect if the
> batch size is <= 16.
> 
> So we have kernel_move_pages() -> do_pages_stat() (because nodes is NULL here
> for us, as we call "numa_move_pages(pid, count, pages, NULL, status, 0);").
> 
> So, if we look at do_pages_stat() ([1]), we can see that it uses a hardcoded
> "#define DO_PAGES_STAT_CHUNK_NR 16UL" and that this pointer arithmetic:
> 
> "
>         pages += chunk_nr;
>         status += chunk_nr;
> "
> 
> is done but has no effect, since the loop exits after the first chunk (once
> nr_pages is exhausted) if we use a batch size <= 16.
> 
> So even if this pointer arithmetic is not correct (it seems it should advance
> by 16 * sizeof(compat_uptr_t) instead), it has no effect as long as the batch
> size is <= 16.
> 
> Does test_chunk_size also fail at 17 for you?

Yes, it fails for me at 17 too. So you're saying the access within each
chunk of 16 elements is OK, but that advancing to the next chunk may not
be quite right? In that case, limiting each call to 16 entries might be a
workaround.
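
If I'm reading the compat path correctly (and I may well not be), the sizes
involved would explain why it breaks exactly at 17. Just as an illustration
(this is not the kernel code, only the arithmetic for a 32-bit client on a
64-bit kernel):

#include <stdio.h>

int
main(void)
{
	unsigned long chunk = 16;			/* DO_PAGES_STAT_CHUNK_NR */
	unsigned long compat_ptr_sz = 4;	/* sizeof(compat_uptr_t), 32-bit caller */
	unsigned long kernel_ptr_sz = 8;	/* sizeof(void *) on a 64-bit kernel */
	unsigned long npages = 17;

	/* the user-space "pages" array really occupies npages * 4 bytes */
	printf("user array size:           %lu bytes\n", npages * compat_ptr_sz);

	/* where the second chunk should start, advancing by compat-sized entries */
	printf("2nd chunk should start at: byte %lu\n", chunk * compat_ptr_sz);

	/* where it seems to be read from, advancing by full-size pointers */
	printf("2nd chunk is read from:    byte %lu\n", chunk * kernel_ptr_sz);

	return 0;
}

So the 17th entry would be read from byte 128 of an array that is only 68
bytes long, which matches the test failing at 17 and not before. But again,
that's just my reading of it.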

In any case, this sounds like a kernel bug, right? I don't have much
experience with the kernel code, so I don't want to rely too much on my
interpretation of it.
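
If we do decide to work around it on our side in the meantime, I'd imagine
something like the untested sketch below, simply wrapping the existing
pg_numa_query_pages() and calling it in batches (the chunked helper name and
the hardcoded 16 are just placeholders, per your question about whether 16
would always be the right chunk size):

/* matches the kernel's DO_PAGES_STAT_CHUNK_NR, but that's not guaranteed */
#define NUMA_QUERY_CHUNK_NPAGES 16

/* untested sketch: query NUMA status in batches of at most 16 pages */
static int
pg_numa_query_pages_chunked(int pid, unsigned long count,
							void **pages, int *status)
{
	unsigned long done = 0;

	while (done < count)
	{
		unsigned long chunk = Min(count - done, NUMA_QUERY_CHUNK_NPAGES);
		int			ret;

		ret = pg_numa_query_pages(pid, chunk, pages + done, status + done);
		if (ret < 0)
			return ret;

		done += chunk;
	}

	return 0;
}

But as you say, a fixed 16 feels rather arbitrary, so I'd rather understand
the kernel side first.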


regards

-- 
Tomas Vondra



