Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Draft for basic NUMA observability
Date
Msg-id 0137610b-933e-4f38-ae63-195f248c87c2@vondra.me
In response to Re: Draft for basic NUMA observability  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses RE: Draft for basic NUMA observability
List pgsql-hackers

On 4/7/25 23:50, Jakub Wartak wrote:
> On Mon, Apr 7, 2025 at 11:27 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi,
>>
>> I've pushed all three parts of v29, with some additional corrections
>> (picked lower OIDs, bumped catversion, fixed commit messages).
> 
> Hi Tomas, great, awesome! (this is an awesome feeling)! Thank You for
> such incredible support on the last mile of this and also to Bertrand
> (for persistence!), Andres and Alvaro for lots of babysitting.
> 

Glad I could help, thanks for the patch.

>> AFAIK v29 fixed this, the end pointer calculations were wrong. With that
>> it passed for me with/without THP, different block sizes etc.
> 
> Yeah, that was a typo, I've started writing about v28, but then in the
> middle of that v29 landed and I still was chasing that finding, I've
> just forgotten to bump this.
> 
>> We don't align buffers to os_page_size, we align them to PG_IO_ALIGN_SIZE,
>> which is 4kB or so. And it's determined at compile time, while THP is
>> determined when starting the cluster.
> [..]
>> Right, this is because that's where the THP boundary happens to be. And
>> that one "duplicate" entry is for a buffer that happens to span two
>> pages. This is *exactly* the misalignment of blocks and pages that I was
>> wondering about earlier, and with the fixed endptr calculation we handle
>> that just fine.
>>
>> No opinion on the alignment - maybe we should do that, but it's not
>> something this patch needs to worry about.
> 
> Agreed. I was wondering even if there are other drawbacks of the
> situation, but other than not reporting duplicates here in this
> pg_buffercache view, I cannot identify anything worthwhile.
> 

Well, the drawback is that accessing the buffer may require hitting two
different NUMA nodes. I'm not 100% sure it can actually happen, though.
The buffer should be initialized as a whole, so it should go to the
same node. But maybe it could be "split" by THP migration, or something
like that.

In any case, that's not caused by this patch, and it's less serious with
huge pages - it only affects buffers on the page boundaries. But with the
small 4K pages it can happen for *every* buffer.
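
To make the arithmetic concrete, here is a rough standalone sketch (not
code from the patch). The helper names pages_spanned() and
count_split_buffers() are made up for illustration, and the 8kB block
size, the 2MB huge page size and the 4kB-aligned start address are
assumptions, not values read from a running cluster:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of OS pages touched by the byte range
 * [start, start + len). Addresses are plain integers here.
 */
static uint64_t
pages_spanned(uint64_t start, uint64_t len, uint64_t os_page_size)
{
    uint64_t first_page;
    uint64_t last_page;

    assert(len > 0 && os_page_size > 0);

    first_page = start / os_page_size;
    last_page = (start + len - 1) / os_page_size;   /* last byte, inclusive */

    return last_page - first_page + 1;
}

/*
 * Hypothetical helper: count buffers that touch more than one OS page,
 * for nbuffers buffers of blcksz bytes laid out back to back from base.
 */
static uint64_t
count_split_buffers(uint64_t base, uint64_t nbuffers,
                    uint64_t blcksz, uint64_t os_page_size)
{
    uint64_t nsplit = 0;

    for (uint64_t i = 0; i < nbuffers; i++)
    {
        if (pages_spanned(base + i * blcksz, blcksz, os_page_size) > 1)
            nsplit++;
    }

    return nsplit;
}
```

With 16384 buffers of 8192 bytes (128MB) starting at offset 12288, which
is 4kB-aligned but not 8kB-aligned, count_split_buffers() returns 64 for
2MB pages (one buffer per huge page boundary) and 16384 for 4kB pages
(every buffer). With a start offset of 0 and 2MB pages it returns 0,
which is what aligning the buffers to the page size would buy us.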


regards

-- 
Tomas Vondra



