Re: Optimize shared LWLock acquisition for high-core-count systems - Mailing list pgsql-hackers

From: Zhou, Zhiguo
Subject: Re: Optimize shared LWLock acquisition for high-core-count systems
Msg-id: 1a3f09c5-3709-4d84-a706-113d9a72cd74@intel.com
In response to: Re: Optimize shared LWLock acquisition for high-core-count systems (Yura Sokolov <y.sokolov@postgrespro.ru>)
List: pgsql-hackers

On 7/9/2025 3:56 PM, Yura Sokolov wrote:
> 30.05.2025 14:30, Zhou, Zhiguo wrote:
>> Hi Hackers,
>>
>> I am reaching out to solicit your insights and comments on this patch
>> addressing a significant performance bottleneck in LWLock acquisition
>> observed on high-core-count systems. During performance analysis of
>> HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
>> system, we found that LWLockAttemptLock consumed 7.12% of total CPU
>> cycles. This bottleneck becomes even more pronounced (up to 30% of
>> cycles) after applying lock-free WAL optimizations[1][2].
>>
>> Problem Analysis:
>> The current LWLock implementation uses separate atomic operations for
>> state checking and modification. For shared locks (84% of
>> LWLockAttemptLock calls), this requires:
>> 1. Atomic read of lock->state
>> 2. State modification
>> 3. Atomic compare-exchange (with retries on contention)
>>
>> This design causes excessive atomic operations on contended locks, which
>> are particularly expensive on high-core-count systems where cache-line
>> bouncing amplifies synchronization costs.
>>
>> Optimization Approach:
>> The patch optimizes shared lock acquisition by:
>> 1. Merging state read and update into a single atomic add operation
>> 2. Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
>> 3. Adding a willwait parameter to control optimization usage
>>
>> Key implementation details:
>> - For LW_SHARED with willwait=true: Uses atomic fetch-add to increment
>> reference count
>> - Maintains backward compatibility through state mask adjustments
>> - Preserves existing behavior for:
>>     1) Exclusive locks
>>     2) Non-waiting cases (LWLockConditionalAcquire)
>> - Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)
>>
>> Performance Impact:
>> Testing on a 384-vCPU Intel system shows:
>> - *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
>> - *46%* cumulative improvement when combined with lock-free WAL
>> optimizations[1][2]
>>
>> Patch Contents:
>> 1. Extends shared mask and shifts exclusive lock value
>> 2. Adds willwait parameter to control optimization
>> 3. Updates lock acquisition/release logic
>> 4. Maintains all existing assertions and safety checks
>>
>> The optimization is particularly effective for contended shared locks,
>> which are common in buffer mapping, lock manager, and shared buffer
>> access patterns.
>>
>> Please review this patch for consideration in upcoming PostgreSQL releases.
>>
>> [1] Lock-free XLog Reservation from WAL:
>>
>> https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
>> [2] Increase NUM_XLOGINSERT_LOCKS:
>> https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
> 
> Good day, Zhou.
> 
> Could you explain, why your patch is correct?
> 
> As far as I understand, it is clearly not correct at this time:
> - SHARED lock count may be incremented many times, because of the for(;;)
> loop in LWLockAcquire, and because LWLockAttemptLock is called twice per
> loop iteration when the lock is held in Exclusive mode by someone else.
> 
> If your patch is correct, then where I'm wrong?
> 
> When I tried to do the same thing, I did sub_fetch immediately in case of
> acquisition failure, and did no add_fetch at all if the lock was held in
> Exclusive mode.
> 
> BTW, there's a way to optimize the EXCLUSIVE lock as well, since there's
> no need to do a CAS if the lock is held by someone else.
> 
> See my version in attach...
> 

Good day, Yura.

Thanks for your comments identifying this critical safety condition: you're 
absolutely right that we must guarantee the shared reference count never 
overflows into the exclusive bit. Let me clarify the safety mechanism:

When a reader encounters an exclusive lock during acquisition (triggering 
the for(;;) loop), it does temporarily increment the shared count twice, 
once per LWLockAttemptLock attempt. However, both increments occur before 
the reader waits on its semaphore (PGSemaphoreLock(proc->sem)). Crucially, 
when the exclusive holder releases the lock via LWLockReleaseInternal, it 
resets the entire lock state (line 1883: 
pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK)). This clears all 
reader references, including any "overcounted" increments from blocked 
readers (see the sketch after the invariants below).

Thus, when blocked readers wake:

1. They retry acquisition on a zero-initialized state
2. Each ultimately increments only once for a successful acquisition
3. The transient "overcount" (≤ MAX_BACKENDS × 2) stays safely within 
LW_SHARED_MASK

The key invariants are:

- LW_SHARED_MASK = (MAX_BACKENDS << 1) + 1
- Exclusive release resets all shared bits
- Readers never persist >1 reference after wakeup
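
To make this concrete, here is a minimal sketch of the scheme as I read 
it. The SKETCH_* constants and helper functions are mine for illustration 
(only the pg_atomic_* primitives and MAX_BACKENDS are real names), not the 
patch's actual code:

    #include "port/atomics.h"   /* pg_atomic_fetch_add_u32, etc. */
    /* MAX_BACKENDS is assumed to come from the usual backend headers */

    /*
     * Illustrative bit layout: the shared refcount gets one spare bit, so
     * even MAX_BACKENDS readers each leaving two transient references
     * cannot carry into the exclusive bit just above the field.
     */
    #define SKETCH_SHARED_MASK   ((uint32) ((MAX_BACKENDS << 1) + 1))
    #define SKETCH_VAL_EXCLUSIVE ((uint32) (SKETCH_SHARED_MASK + 1))
    #define SKETCH_LOCK_MASK     (SKETCH_SHARED_MASK | SKETCH_VAL_EXCLUSIVE)

    /*
     * Reader side (willwait == true): a single fetch-add both reads the
     * old state and publishes our reference.  On failure the reference is
     * deliberately left behind; the writer's release below wipes it.
     */
    static inline bool
    sketch_attempt_shared(pg_atomic_uint32 *state)
    {
        uint32      old = pg_atomic_fetch_add_u32(state, 1);

        return (old & SKETCH_VAL_EXCLUSIVE) == 0;
    }

    /*
     * Writer release: clear the whole lock field instead of subtracting
     * SKETCH_VAL_EXCLUSIVE, discarding any overcounted shared references
     * from readers that raced with us and went to sleep.
     */
    static inline void
    sketch_release_exclusive(pg_atomic_uint32 *state)
    {
        pg_atomic_fetch_and_u32(state, ~SKETCH_LOCK_MASK);
    }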

Does this resolve the concern? I appreciate you flagging this subtlety – 
please correct me if I've misunderstood your scenario or misinterpreted 
the code.

Thank you as well for sharing your implementation; I particularly agree 
with your optimization for exclusive lock acquisition. Avoiding the CAS 
loop when the lock is already held (by checking the state early) is a 
clever reduction of atomic operations, which we know are costly on 
high-core-count systems. I'll prioritize evaluating this for our 
HammerDB/TPROC-C workload and share benchmark results soon.
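
For reference, this is roughly the shape I understand that change to take, 
reusing the SKETCH_* names above (a hedged sketch, not your actual patch):

    /*
     * Exclusive attempt: do a plain atomic read first, so a contended
     * acquisition costs one cache-line read instead of a failing CAS;
     * only attempt the CAS when the lock looks free.
     */
    static inline bool
    sketch_attempt_exclusive(pg_atomic_uint32 *state)
    {
        uint32      old = pg_atomic_read_u32(state);

        if ((old & SKETCH_LOCK_MASK) != 0)
            return false;       /* already held: skip the CAS entirely */

        return pg_atomic_compare_exchange_u32(state, &old,
                                              old | SKETCH_VAL_EXCLUSIVE);
    }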

Regarding shared locks: your version (using sub_fetch on acquisition 
failure) does align more cleanly with the original state machine by 
avoiding transient overcounts. I initially tried a similar approach but 
shifted to the single-atomic-increment design to minimize atomic 
instructions, a critical priority for our 384-core benchmarks where atomic 
ops dominate contention.
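
Sketched the same way, here is my reading of your reader-side variant 
(again with illustrative names, not your code):

    /*
     * Alternative reader side: never leave a stale reference behind.
     * Skip the add when the exclusive bit is already visible, and roll
     * the add back immediately if it lost the race, at the cost of a
     * second atomic op on exactly the contended path.
     */
    static inline bool
    sketch_attempt_shared_rollback(pg_atomic_uint32 *state)
    {
        if (pg_atomic_read_u32(state) & SKETCH_VAL_EXCLUSIVE)
            return false;       /* held exclusively: don't touch it */

        if (pg_atomic_fetch_add_u32(state, 1) & SKETCH_VAL_EXCLUSIVE)
        {
            pg_atomic_sub_fetch_u32(state, 1);  /* undo our reference */
            return false;
        }

        return true;
    }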

Let's reconcile these strengths:

1. I'll test your patch head-to-head against our current version in HCC 
TPROC-C workloads.
2. If the atomic savings in your exclusive path yield meaningful gains, we 
will try to integrate it into our patch immediately.
3. For shared locks: if your design shows comparable performance while 
simplifying correctness, it's a compelling option.

Really appreciate you driving this optimization further!

Regards,
Zhiguo



