Optimize shared LWLock acquisition for high-core-count systems - Mailing list pgsql-hackers

From	Zhou, Zhiguo
Subject	Optimize shared LWLock acquisition for high-core-count systems
Date	May 30, 2025 14:30:39
Msg-id	73d53acf-4f66-41df-b438-5c2e6115d4de@intel.com
Responses	Re: Optimize shared LWLock acquisition for high-core-count systems Re: Optimize shared LWLock acquisition for high-core-count systems
List	pgsql-hackers

Hi Hackers,

I would like your feedback on a patch that addresses a significant
performance bottleneck in LWLock acquisition on high-core-count
systems. While profiling HammerDB/TPCC (192 virtual users, 757
warehouses) on a 384-vCPU Intel system, we found that
LWLockAttemptLock consumed 7.12% of total CPU cycles. The share grows
to as much as 30% of cycles once the lock-free WAL optimizations[1][2]
are applied.

Problem Analysis:
The current LWLock implementation uses separate atomic operations for 
state checking and modification. For shared locks (84% of 
LWLockAttemptLock calls), this requires:
1. Atomic read of lock->state
2. State modification
3. Atomic compare-exchange (with retries on contention)

This design causes excessive atomic operations on contended locks, which 
are particularly expensive on high-core-count systems where cache-line 
bouncing amplifies synchronization costs.
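
For illustration, the three steps can be sketched with C11 atomics. This
is a simplified toy model, not the code in lwlock.c: the ToyLock type,
the TOY_* constants, and the function name are hypothetical, and the bit
layout is chosen only for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit layout only; not PostgreSQL's actual constants. */
#define TOY_VAL_SHARED     1u
#define TOY_VAL_EXCLUSIVE  (1u << 24)

typedef struct ToyLock
{
    _Atomic uint32_t state;
} ToyLock;

/*
 * Try to take the lock in shared mode with a read / modify / CAS loop.
 * Returns true on success. *cas_attempts is incremented once per
 * compare-exchange issued, to make the retry cost visible.
 */
static bool
toy_attempt_shared_cas(ToyLock *lock, unsigned *cas_attempts)
{
    /* Step 1: atomic read of the current state. */
    uint32_t old_state = atomic_load(&lock->state);

    for (;;)
    {
        uint32_t desired;

        if (old_state & TOY_VAL_EXCLUSIVE)
            return false;       /* exclusively held; caller has to wait */

        /* Step 2: compute the new state locally. */
        desired = old_state + TOY_VAL_SHARED;

        /* Step 3: publish it; on failure old_state is refreshed. */
        (*cas_attempts)++;
        if (atomic_compare_exchange_weak(&lock->state, &old_state, desired))
            return true;
    }
}
```

Every failed compare-exchange costs another round trip on the lock's
cache line, which is where the loop becomes expensive when many
backends take the same lock in shared mode at once.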

Optimization Approach:
The patch optimizes shared lock acquisition by:
1. Merging state read and update into a single atomic add operation
2. Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
3. Adding a willwait parameter to control optimization usage
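
To illustrate why the shared mask needs the extra bit, here is a small
arithmetic sketch. The TOY_* names and the 18-bit backend limit are
assumptions made for the example, not values taken from the patch; the
point is only that a shared count of up to twice the backend limit no
longer fits below the exclusive flag unless the mask is widened.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical backend limit, chosen for the example. */
#define TOY_MAX_BACKENDS_BITS  18
#define TOY_MAX_BACKENDS       ((1u << TOY_MAX_BACKENDS_BITS) - 1)

/* Narrow layout: the shared count can hold at most TOY_MAX_BACKENDS. */
#define TOY_OLD_SHARED_MASK    TOY_MAX_BACKENDS
#define TOY_OLD_VAL_EXCLUSIVE  (TOY_OLD_SHARED_MASK + 1)

/* Widened layout: one more count bit, exclusive flag shifted up by one. */
#define TOY_NEW_SHARED_MASK    ((1u << (TOY_MAX_BACKENDS_BITS + 1)) - 1)
#define TOY_NEW_VAL_EXCLUSIVE  (TOY_NEW_SHARED_MASK + 1)

/* The widened count field must hold twice the backend limit. */
_Static_assert(2u * TOY_MAX_BACKENDS <= TOY_NEW_SHARED_MASK,
               "shared count must fit below the exclusive flag");

/* True if the given exclusive flag bit is set in state. */
static bool
toy_exclusive_bit_set(uint32_t state, uint32_t val_exclusive)
{
    return (state & val_exclusive) != 0;
}
```

With the narrow layout, a count of 2 * TOY_MAX_BACKENDS would spill into
the exclusive flag and be misread as an exclusive holder; with the
widened layout it stays inside the count field.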

Key implementation details:
- For LW_SHARED with willwait=true: uses a single atomic fetch-add to
increment the shared reference count
- Maintains backward compatibility through state mask adjustments
- Preserves existing behavior for:
   1) Exclusive locks
   2) Non-waiting cases (LWLockConditionalAcquire)
- Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)
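
A minimal sketch of the fetch-add idea, again as a toy model with C11
atomics. ToyLock, the TOY_* constants, and the function names are
hypothetical. In particular, undoing the increment when the lock turns
out to be exclusively held is an assumption made to keep the example
self-contained; this message does not spell out how the patch accounts
for that case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit layout only; not the values used by the patch. */
#define TOY_VAL_SHARED     1u
#define TOY_VAL_EXCLUSIVE  (1u << 24)

typedef struct ToyLock
{
    _Atomic uint32_t state;
} ToyLock;

/*
 * Try to take the lock in shared mode with one unconditional
 * fetch-add instead of a read / modify / compare-exchange loop.
 */
static bool
toy_attempt_shared_fetch_add(ToyLock *lock)
{
    /* Read and update in one atomic step; returns the previous state. */
    uint32_t old_state = atomic_fetch_add(&lock->state, TOY_VAL_SHARED);

    if ((old_state & TOY_VAL_EXCLUSIVE) == 0)
        return true;            /* no exclusive holder: lock acquired */

    /* Assumption for this sketch: roll the transient increment back. */
    atomic_fetch_sub(&lock->state, TOY_VAL_SHARED);
    return false;
}

/* Drop one shared reference. */
static void
toy_release_shared(ToyLock *lock)
{
    atomic_fetch_sub(&lock->state, TOY_VAL_SHARED);
}
```

In the uncontended-by-exclusive case this issues exactly one atomic
operation and never retries. The price is that the count can be
incremented while an exclusive holder is present, which is why the
count field needs headroom beyond the number of backends.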

Performance Impact:
Testing on a 384-vCPU Intel system shows:
- *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
- *46%* cumulative improvement when combined with lock-free WAL 
optimizations[1][2]

Patch Contents:
1. Extends shared mask and shifts exclusive lock value
2. Adds willwait parameter to control optimization
3. Updates lock acquisition/release logic
4. Maintains all existing assertions and safety checks

The optimization is particularly effective for contended shared locks, 
which are common in buffer mapping, lock manager, and shared buffer 
access patterns.

Please review this patch for consideration in upcoming PostgreSQL releases.

[1] Lock-free XLog Reservation from WAL:
https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
[2] Increase NUM_XLOGINSERT_LOCKS:
https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru

Regards,
Zhiguo

Attachment

0001-Optimize-shared-LWLock-acquisition-for-high-core-cou.patch
