Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: Adding basic NUMA awareness
Date	July 10 18:31:45
Msg-id	a7c5f516-c818-43d6-b025-e13b685ab72c@vondra.me
In response to	Re: Adding basic NUMA awareness (Andres Freund <andres@anarazel.de>)
List	pgsql-hackers


On 7/9/25 19:23, Andres Freund wrote:
> Hi,
> 
> On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
>> On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote:
>>
>>> FWIW, I've started to wonder if we shouldn't just get rid of the freelist
>>> entirely. While clocksweep is perhaps minutely slower in a single thread
>>> than the freelist, clock sweep scales *considerably* better [1]. As it's
>>> rather rare to be bottlenecked on clock sweep speed for a single thread
>>> (rather than IO or memory copy overhead), I think it's worth favoring
>>> clock sweep.
>>
>> Hey Andres, thanks for spending time on this.  I've worked before on
>> freelist implementations (last one in LMDB) and I think you're onto
>> something.  I think it's an innovative idea and that the speed
>> difference will either be lost in the noise or potentially entirely
>> mitigated by avoiding duplicate work.
> 
> Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
> perform better because it doesn't need to maintain the freelist anymore...
> 
> 
>>> Also needing to switch between getting buffers from the freelist and the
>>> sweep makes the code more expensive.  I think just having the buffer in
>>> the sweep, with a refcount / usagecount of zero would suffice.
>>
>> If you're not already coding this, I'll jump in. :)
> 
> My experimental patch is literally a four character addition ;), namely adding
> "0 &&" to the relevant code in StrategyGetBuffer().
> 
> Obviously a real patch would need to do some more work than that.  Feel free
> to take on that project, I am not planning on tackling that in near term.
> 
> 
> There's other things around this that could use some attention. It's not hard
> to see clock sweep be a bottleneck in concurrent workloads - partially due to
> the shared maintenance of the clock hand. A NUMAed clock sweep would address
> that. However, we also maintain StrategyControl->numBufferAllocs, which is a
> significant contention point and would not necessarily be removed by a
> NUMAification of the clock sweep.
> 

Wouldn't it make sense to partition numBufferAllocs too, though? I
don't remember whether my hacky experimental NUMA-partitioning patch
actually did that or whether I only thought about doing it, but why
wouldn't that be enough?

Places that need the "total" count would have to sum the per-partition
counters, but it seemed to me most places would be fine with the "local"
count for their partition. If we also make sure to "sync" the clocksweeps
so as not to work on just a single partition, that might be enough ...
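
As a rough illustration of the partitioned-counter idea, here is a minimal
standalone C11 sketch. It is not PostgreSQL code: it uses <stdatomic.h>
rather than the server's own atomics, and apart from numBufferAllocs every
name (NUM_PARTITIONS, CACHE_LINE_SIZE, the struct and the three functions)
is hypothetical. The partition count and cache line size are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NUM_PARTITIONS  4   /* hypothetical, e.g. one per NUMA node */
#define CACHE_LINE_SIZE 64  /* assumed cache line size */

/* One counter per partition, aligned so partitions never share a line. */
typedef struct PartitionAllocCounter
{
    _Alignas(CACHE_LINE_SIZE) atomic_uint numBufferAllocs;
} PartitionAllocCounter;

/* Static storage, so all counters start at zero. */
static PartitionAllocCounter alloc_counters[NUM_PARTITIONS];

/* Hot path: a backend only touches the counter of its own partition. */
static inline void
count_buffer_alloc(int partition)
{
    assert(partition >= 0 && partition < NUM_PARTITIONS);
    atomic_fetch_add_explicit(&alloc_counters[partition].numBufferAllocs,
                              1, memory_order_relaxed);
}

/* "Local" count, for callers that only care about one partition. */
static inline uint32_t
local_buffer_allocs(int partition)
{
    assert(partition >= 0 && partition < NUM_PARTITIONS);
    return atomic_load_explicit(&alloc_counters[partition].numBufferAllocs,
                                memory_order_relaxed);
}

/* "Total" count: the rare callers that need it sum all partitions. */
static inline uint64_t
total_buffer_allocs(void)
{
    uint64_t total = 0;

    for (int i = 0; i < NUM_PARTITIONS; i++)
        total += atomic_load_explicit(&alloc_counters[i].numBufferAllocs,
                                      memory_order_relaxed);
    return total;
}
```

The sum is not an atomic snapshot across partitions, which should be fine
for a statistic but is an assumption of this sketch. It also does not cover
resetting the counters or syncing the clocksweeps between partitions.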

regards

-- 
Tomas Vondra
