Hi,
Thanks for working on this - I think it's something we need to improve.
On 2022-10-31 14:38:23 +0400, Pavel Borisov wrote:
> If the lock state contains references to the queue head and tail, we can
> implement a lockless queue of waiters for the LWLock. Adding new items to
> the queue head or tail can be done with a single CAS operation (adding to
> the tail also requires subsequently setting the link from the previous
> tail). Given that there can be only one lock releaser, which wakes up
> waiters in the queue, we can handle all the concurrency issues with
> reasonable complexity.
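For concreteness, I read the enqueue step as something like the below - an
untested sketch with invented names and bit layout, using C11 atomics rather
than pg_atomic_* for brevity:

#include <stdatomic.h>
#include <stdint.h>

/* hypothetical bit layout: 18 bits each for the head and tail indexes */
#define Q_INDEX_BITS	18
#define Q_INDEX_MASK	((1u << Q_INDEX_BITS) - 1)
#define Q_INVALID	Q_INDEX_MASK	/* sentinel: no waiter */
#define Q_HEAD_SHIFT	0
#define Q_TAIL_SHIFT	Q_INDEX_BITS

typedef struct WaiterSlot
{
	_Atomic uint32_t next;		/* procarray index of the next waiter */
	/* the latch / semaphore to sleep on would live here */
} WaiterSlot;

extern WaiterSlot waiter_slots[];	/* one per backend, by procarray index */

/*
 * Append ourselves to the queue tail with a single CAS on the lock word,
 * then link the previous tail to us in a second step.
 */
static void
queue_push_tail(_Atomic uint64_t *lockword, uint32_t my_index)
{
	uint64_t	oldval, newval;
	uint32_t	prev_tail;

	atomic_store(&waiter_slots[my_index].next, Q_INVALID);

	oldval = atomic_load(lockword);
	do
	{
		prev_tail = (oldval >> Q_TAIL_SHIFT) & Q_INDEX_MASK;

		newval = oldval & ~((uint64_t) Q_INDEX_MASK << Q_TAIL_SHIFT);
		newval |= (uint64_t) my_index << Q_TAIL_SHIFT;

		if (prev_tail == Q_INVALID)
		{
			/* queue was empty: we become the head as well */
			newval &= ~((uint64_t) Q_INDEX_MASK << Q_HEAD_SHIFT);
			newval |= (uint64_t) my_index << Q_HEAD_SHIFT;
		}
	} while (!atomic_compare_exchange_weak(lockword, &oldval, newval));

	/* the second step from the quoted paragraph: link the old tail to us */
	if (prev_tail != Q_INVALID)
		atomic_store(&waiter_slots[prev_tail].next, my_index);
}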
Right now waiter wakeups happen *after* the lock is released; it looks like
your patch gives that up, and I suspect that is at least part of the reason
for the regression you're seeing. It also looks like you're going to need a
substantially higher number of atomic operations - right now the queue
processing needs O(1) atomic ops, but your approach ends up with O(waiters)
atomic ops.
I suspect it might be worth going halfway, i.e. putting the list head/tail
in the atomic variable, but processing the list with a lock held, after the
lock is released.
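Continuing the sketch above (again invented names; note it also glosses over
the window where a freshly CASed tail hasn't been linked from its
predecessor yet), that'd look roughly like:

#include <pthread.h>

extern void wake_backend(uint32_t index);	/* hypothetical: sets the latch */

static void
lock_release_and_wake(_Atomic uint64_t *lockword, pthread_mutex_t *wait_lock)
{
	uint64_t	oldval, newval;
	uint32_t	head;

	oldval = atomic_load(lockword);
	do
	{
		head = (oldval >> Q_HEAD_SHIFT) & Q_INDEX_MASK;

		/* empty queue markers; all other bits zero == lock released */
		newval = (uint64_t) Q_INVALID << Q_HEAD_SHIFT;
		newval |= (uint64_t) Q_INVALID << Q_TAIL_SHIFT;
	} while (!atomic_compare_exchange_weak(lockword, &oldval, newval));

	if (head == Q_INVALID)
		return;			/* nobody was waiting */

	/*
	 * The lock is already available to others at this point.  Walking the
	 * list under wait_lock costs one acquire/release pair per wakeup batch
	 * instead of one atomic op per waiter; in a real implementation that
	 * lock is also what would let timed-out waiters dequeue themselves
	 * safely.
	 */
	pthread_mutex_lock(wait_lock);
	while (head != Q_INVALID)
	{
		uint32_t	next = atomic_load(&waiter_slots[head].next);

		wake_backend(head);
		head = next;
	}
	pthread_mutex_unlock(wait_lock);
}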
> Removing the queue spinlock bit and the corresponding contention should
> give a performance gain in highly concurrent LWLock usage scenarios.
> Currently, the maximum size of the procarray is 2^18 (as stated in
> buf_internals.h), so with a 64-bit LWLock state variable we can store the
> procarray indexes for both the head and tail of the queue, all the
> existing lock state, and still have some spare bits for future flags.
Another advantage: It shrinks the LWLock struct, which makes it cheaper to
make locks more granular...
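To illustrate the packing from the quoted paragraph - this is a made-up
layout, not the patch's, with flag names echoing the existing 32-bit state
in lwlock.c:

#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical 64-bit state word:
 *
 *   bits  0..17  queue head (procarray index, 2^18 - 1 == empty)
 *   bits 18..35  queue tail
 *   bits 36..58  shared-holder count
 *   bit  59      LW_VAL_EXCLUSIVE
 *   bit  60      LW_FLAG_HAS_WAITERS
 *   bits 61..63  spare
 */
typedef struct LWLockSketch
{
	uint16_t			tranche;	/* as today */
	_Atomic uint64_t	state;		/* queue + lock state in one word */
	/* no separate proclist_head waiters, no wait-list lock bit needed */
} LWLockSketch;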
> I cannot understand why the lockless queue patch shows a minor performance
> regression at around 20 connections, even when compared to the current
> master branch.
I suspect that's the point where the additional atomic operations hurt the
most, while not yet providing a substantial gain.
Greetings,
Andres Freund