Re: Optimize shared LWLock acquisition for high-core-count systems - Mailing list pgsql-hackers
From | Zhou, Zhiguo |
Subject | Re: Optimize shared LWLock acquisition for high-core-count systems |
Date |
Msg-id | 1a3f09c5-3709-4d84-a706-113d9a72cd74@intel.com
In response to | Re: Optimize shared LWLock acquisition for high-core-count systems (Yura Sokolov <y.sokolov@postgrespro.ru>) |
List | pgsql-hackers |
On 7/9/2025 3:56 PM, Yura Sokolov wrote:
> 30.05.2025 14:30, Zhou, Zhiguo wrote:
>> Hi Hackers,
>>
>> I am reaching out to solicit your insights and comments on this patch
>> addressing a significant performance bottleneck in LWLock acquisition
>> observed on high-core-count systems. During performance analysis of
>> HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
>> system, we found that LWLockAttemptLock consumed 7.12% of total CPU
>> cycles. This bottleneck becomes even more pronounced (up to 30% of
>> cycles) after applying lock-free WAL optimizations [1][2].
>>
>> Problem Analysis:
>> The current LWLock implementation uses separate atomic operations for
>> state checking and modification. For shared locks (84% of
>> LWLockAttemptLock calls), this requires:
>> 1. Atomic read of lock->state
>> 2. State modification
>> 3. Atomic compare-exchange (with retries on contention)
>>
>> This design causes excessive atomic operations on contended locks,
>> which are particularly expensive on high-core-count systems where
>> cache-line bouncing amplifies synchronization costs.
>>
>> Optimization Approach:
>> The patch optimizes shared lock acquisition by:
>> 1. Merging the state read and update into a single atomic add operation
>> 2. Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
>> 3. Adding a willwait parameter to control optimization usage
>>
>> Key implementation details:
>> - For LW_SHARED with willwait=true: uses atomic fetch-add to increment
>>   the reference count
>> - Maintains backward compatibility through state mask adjustments
>> - Preserves existing behavior for:
>>   1) Exclusive locks
>>   2) Non-waiting cases (LWLockConditionalAcquire)
>> - Bounds the shared lock count to MAX_BACKENDS*2 (handled via mask
>>   extension)
>>
>> Performance Impact:
>> Testing on a 384-vCPU Intel system shows:
>> - *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
>> - *46%* cumulative improvement when combined with lock-free WAL
>>   optimizations [1][2]
>>
>> Patch Contents:
>> 1. Extends the shared mask and shifts the exclusive lock value
>> 2. Adds the willwait parameter to control the optimization
>> 3. Updates the lock acquisition/release logic
>> 4. Maintains all existing assertions and safety checks
>>
>> The optimization is particularly effective for contended shared locks,
>> which are common in buffer mapping, lock manager, and shared buffer
>> access patterns.
>>
>> Please review this patch for consideration in upcoming PostgreSQL
>> releases.
>>
>> [1] Lock-free XLog Reservation from WAL:
>> https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
>> [2] Increase NUM_XLOGINSERT_LOCKS:
>> https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
>
> Good day, Zhou.
>
> Could you explain why your patch is correct?
>
> As far as I understand, it is clearly not correct at this time:
> - The SHARED lock count may be incremented many times, because of the
>   for(;;) loop in LWLockAcquire and because LWLockAttemptLock is called
>   twice per loop iteration when the lock is held in Exclusive mode by
>   someone else.
>
> If your patch is correct, then where am I wrong?
>
> When I tried to do the same thing, I did a sub_fetch immediately on
> acquisition failure, and no add_fetch at all if the lock is held in
> Exclusive mode.
>
> BTW, there's a way to optimize the EXCLUSIVE lock as well, since there's
> no need to do a CAS if the lock is held by someone else.
>
> See my version in the attachment...

Good day, Yura.

Thanks for your comments identifying this critical safety condition -- you're absolutely right that we must guarantee the shared reference count never overflows into the exclusive bit. Let me clarify the safety mechanism:

When a reader encounters an exclusive lock during acquisition (triggering the for(;;) loop), it does temporarily increment the shared count twice -- once per LWLockAttemptLock attempt. However, both increments occur before the reader waits on its semaphore (PGSemaphoreLock(proc->sem)). Crucially, when the exclusive holder releases the lock via LWLockReleaseInternal, it resets the entire lock state (line 1883: pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK)). This clears all reader references, including any "overcounted" increments from blocked readers.

Thus, when blocked readers wake:
1. They retry acquisition on a zero-initialized state
2. Each ultimately increments only once for a successful acquisition
3. The transient "overcount" (at most MAX_BACKENDS * 2) stays safely within LW_SHARED_MASK

The key invariants are:
- LW_SHARED_MASK = (MAX_BACKENDS << 1) + 1
- Exclusive release resets all shared bits
- Readers never persist more than one reference after wakeup

Does this resolve the concern?
I appreciate you flagging this subtlety -- please correct me if I've misunderstood your scenario or misinterpreted the code.

Thank you as well for sharing your implementation -- I particularly agree with your optimization for exclusive lock acquisition. Avoiding the CAS loop when the lock is already held (by checking the state early) is a clever reduction of atomic operations, which we know are costly on high-core-count systems. I'll prioritize evaluating this with our HammerDB/TPROC-C workload and share benchmark results soon.

Regarding shared locks: your version (using sub_fetch on acquisition failure) does align more cleanly with the original state machine by avoiding transient overcounts. I initially came up with a similar approach but shifted to the single-atomic-increment design to minimize atomic instructions -- a critical priority for our 384-core benchmarks, where atomic ops dominate contention.

Let's reconcile these strengths:
1. I'll test your patch head-to-head against our current version in HCC TPROC-C workloads.
2. If the atomic savings in your exclusive path yield meaningful gains, we will try to integrate them into our patch immediately.
3. For shared locks: if your design shows comparable performance while simplifying correctness, it's a compelling option.

Really appreciate you driving this optimization further!

Regards,
Zhiguo