Thread: spinlock contention

spinlock contention

From
Robert Haas
Date:
On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Jun12, 2011, at 23:39 , Robert Haas wrote:
>> So, the majority (60%) of the excess spinning appears to be due to
>> SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
>
> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
> so I guess that two adjacent LWLocks currently share one cache line.
>
> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
> index 5, so if I'm not mistaken exactly the two locks where you saw
> the largest contention on are on the same cache line...
>
> Might make sense to try and see if these numbers change if you
> either make LWLockPadded 64bytes or arrange the locks differently...

I fooled around with this a while back and saw no benefit.  It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.

SInvalReadLock is a good example of a lock which seems barely to be
necessary at all.  Except on extraordinarily rare occasions, it is
only ever taken in share mode.  Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy.  I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether.  I have a feeling if I
were 25% smarter I could do it, but I'm not.

So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read.  We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing.  I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written.  What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read.  When a message is written, the global counter is bumped.  When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.
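
For illustration, a rough sketch of that fast path (the structure and field
names here are invented, not the actual sinval data structures):

    /* Illustrative sketch only -- shmInvalState and its fields are made up. */
    typedef struct
    {
        volatile uint32 msgCount;   /* total number of messages ever written */
    } shmInvalState;

    static shmInvalState *shmInval;     /* lives in shared memory */
    static uint32 localMsgCount = 0;    /* highest count this backend has seen */

    void
    AcceptInvalidationMessages(void)
    {
        /*
         * Fast path: if nothing has been written since we last caught up,
         * skip SInvalReadLock entirely.  The immediately preceding
         * heavyweight lock acquisition is assumed to act as the necessary
         * read barrier (see below).
         */
        if (shmInval->msgCount == localMsgCount)
            return;

        LWLockAcquire(SInvalReadLock, LW_SHARED);
        /* ... read and process the queued messages, as today ... */
        localMsgCount = shmInval->msgCount;
        LWLockRelease(SInvalReadLock);
    }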

There is an obvious (potential) memory-ordering problem here.  If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen.  Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock.  It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before.  As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good.  On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.

A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic.  In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough.  We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces).  When a backend compares its counter to the
global counter, it does so while holding this lock.  When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1).  So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalLock.
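
The writer-side bump for the wraparound case might look like this (again
just a sketch, with 0 as the reserved "reset" value):

    /*
     * Sketch: bump the global message counter, never producing the reserved
     * value 0, so a reset backend (whose per-backend counter is set to 0)
     * can never spuriously match it.
     */
    static void
    BumpGlobalMsgCount(volatile uint32 *counter)
    {
        uint32      next = *counter + 1;

        if (next == 0)          /* wrapped from 2^32-1 */
            next = 1;
        *counter = next;        /* caller still holds the sinval write lock */
    }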

If anyone has read this far, you have my undying gratitude....
especially if you respond with comments.

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes.  I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spinlock contention

From
Heikki Linnakangas
Date:
On 23.06.2011 18:42, Robert Haas wrote:
> ProcArrayLock looks like a tougher nut to crack - there's simply no
> way, with the system we have right now, that you can take a snapshot
> without locking the list of running processes.  I'm not sure what to
> do about that, but we're probably going to have to come up with
> something, because it seems clear that once we eliminate the lock
> manager LWLock contention, this is a major bottleneck.

ProcArrayLock is currently held for a relatively long period of time 
when a snapshot is taken, especially if there are a lot of backends. There
are three operations on the procarray:

1. Advertising a new xid belonging to my backend.
2. Ending a transaction.
3. Getting a snapshot.

Advertising a new xid is currently done without a lock, assuming that 
setting an xid in shared memory is atomic. To end a transaction, you 
acquire ProcArrayLock in exclusive mode. To get a snapshot, you acquire 
ProcArrayLock in shared mode, and scan the whole procarray.

I wonder if it would be a better tradeoff to keep a "materialized" 
snapshot in shared memory that's updated every time a new xid is 
assigned or a transaction ends. Getting a snapshot would become much 
cheaper, as you could just memcpy the ready-built snapshot from shared 
memory into local memory.
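
To make that concrete, a very rough sketch (SharedSnapshot and its
maintenance are hypothetical; the array would really be sized from
max_connections):

    /*
     * Hypothetical structure, kept up to date on every xid assignment and
     * every transaction end, under ProcArrayLock (or a dedicated lock).
     */
    typedef struct
    {
        TransactionId   xmin;
        TransactionId   xmax;
        int             xcnt;
        TransactionId   xip[64];    /* really sized from max_connections */
    } SharedSnapshot;

    static SharedSnapshot *sharedSnap;  /* in shared memory */

    /* Taking a snapshot becomes a short critical section plus a memcpy. */
    static void
    GetMaterializedSnapshot(SharedSnapshot *dst)
    {
        LWLockAcquire(ProcArrayLock, LW_SHARED);
        memcpy(dst, sharedSnap, sizeof(SharedSnapshot));
        LWLockRelease(ProcArrayLock);
    }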

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: spinlock contention

From
Florian Pflug
Date:
On Jun23, 2011, at 17:42 , Robert Haas wrote:
> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:
>> On Jun12, 2011, at 23:39 , Robert Haas wrote:
>>> So, the majority (60%) of the excess spinning appears to be due to
>>> SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
>> 
>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>> so I guess that two adjacent LWLocks currently share one cache line.
>> 
>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
>> index 5, so if I'm not mistaken exactly the two locks where you saw
>> the largest contention on are on the same cache line...
>> 
>> Might make sense to try and see if these numbers change if you
>> either make LWLockPadded 64bytes or arrange the locks differently...
> 
> I fooled around with this a while back and saw no benefit.  It's
> possible a more careful test would turn up something, but I think the
> only real way forward here is going to be to eliminate some of that
> locking altogether.

It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bytes, though -
so measuring this certainly isn't easy. Also, I'm absolutely no
expert on this, so it might very well be that I'm missing something
fundamental.

> SInvalReadLock is a good example of a lock which seems barely to be
> necessary at all.  Except on extraordinarily rare occasions, it is
> only ever taken in share mode.

This is the case we should optimize for, I think. Currently, there's
contention even between two SHARE lockers of a LWLock because our
LWLock implementation uses a spinlock underneath. I've googled around
a bit, and found this:

http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

It's a userspace rwlock implementation based on atomic instructions
and futexes (for blocking), written by Linus Torvalds in response
to the following thread:

http://lwn.net/Articles/284741/

It seems to require only a single atomic increment instruction to
acquire a share lock in the uncontested case, and a single compare-
and-exchange instruction to acquire an exclusive lock (again in
the uncontested case).
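
The fast paths are easy enough to sketch with GCC's atomic builtins (this
is not Linus' code, just an illustration of the idea; the slow, blocking
paths - where the futexes come in - are omitted):

    #define RW_EXCLUSIVE_BIT 0x80000000U

    typedef struct
    {
        volatile unsigned int state;    /* 0 = free, low bits = share count */
    } rwlock;

    /* Share lock: one atomic increment in the uncontested case. */
    static int
    rwlock_share_try(rwlock *lock)
    {
        unsigned int old = __sync_fetch_and_add(&lock->state, 1);

        if (old & RW_EXCLUSIVE_BIT)
        {
            __sync_fetch_and_sub(&lock->state, 1);  /* back out; must block */
            return 0;
        }
        return 1;
    }

    /* Exclusive lock: one compare-and-exchange in the uncontested case. */
    static int
    rwlock_exclusive_try(rwlock *lock)
    {
        return __sync_bool_compare_and_swap(&lock->state, 0, RW_EXCLUSIVE_BIT);
    }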

The code doesn't seem to have a license attached, and seems to have
been written over a weekend and never touched since, so we might
not want (or be able) to use this directly. But it'd still be
interesting to see if replacing our LWLocks with this implementation
affects throughput.

> Only SICleanupQueue() needs to lock
> out readers, and in the (probably common) case where all backends are
> reasonably close to being caught up, it wouldn't even matter if the
> determination of minMsgNum were a little bit fuzzy.  I've been
> straining my brain trying to figure out whether there's some way to
> rewrite this logic so that it can run concurrently with readers; then
> we could get rid of SInvalReadLock altogether.  I have a feeling if I
> were 25% smarter I could do it, but I'm not.

This sounds like something which could be done with RCU (Read-Copy-Update)
in a lockless way. The idea is to copy the data structure (or parts)
of it before updating it, and to "commit" the change afterwards by
adjusting a single pointer to point to the new structure (kinda like
we do atomic commits by an atomic update of one bit in the clog).

The hard part is usually to figure when it's OK to clean up or
reuse the old version of the data structure - for that, you need to
know that all previous readers have completed.

Here's a rough description of how that could work for the invalidation
queue. We'd have two queue structures in shared memory, each containing
a version number. We'd also have a "current" pointer pointing to the
most recent one. Each reader would publish the version number of the
queue he's about to read (the "current" one) in shared memory
(and issue a write barrier). SICleanupQueue() would check whether the
non-current queue structure is unused by comparing its version to the
published versions of all backends (after issuing a read barrier, and
after taking a lock that prevents concurrent SICleanupQueue runs of 
course). If there still are users of that version, SICleanupQueue()
out-waits them. Afterwards, it simply copies the current structure
over the old one, does the necessary cleanups, increments the version
number to be larger than the one of the "current" structure,
and *then* updates the "current" pointer.
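
In pseudo-C, the reader-side publication and the cleanup-side check might
look roughly like this (the structure names, the barrier macros, and the
per-backend version array are all invented for the sketch):

    typedef struct
    {
        uint32      version;
        /* ... the actual message queue ... */
    } SIQueue;

    static SIQueue *queues[2];                  /* both in shared memory */
    static SIQueue * volatile currentQueue;     /* points at the live copy */
    static volatile uint32 *readerVersion;      /* one slot per backend */

    /* Reader side: publish which version we are about to read. */
    static SIQueue *
    si_begin_read(int myBackendId)
    {
        SIQueue    *q = currentQueue;

        readerVersion[myBackendId] = q->version;
        pg_write_barrier();         /* publish before touching the queue */
        return q;
    }

    /*
     * SICleanupQueue side, holding a lock against concurrent cleanups:
     * is the non-current copy free of readers?
     */
    static bool
    si_old_copy_unused(SIQueue *old, int numBackends)
    {
        int         i;

        pg_read_barrier();
        for (i = 0; i < numBackends; i++)
        {
            if (readerVersion[i] == old->version)
                return false;       /* someone may still be reading it */
        }
        return true;
    }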

> So my fallback position is to try to find some way to optimize the
> common case where there are no messages to be read.  We're already
> allowing readers and writers to run in parallel, so it must not be
> important for a reader to see a message that a writer is in the
> process of writing, but has not yet finished writing.  I believe the
> idea here is that sinval messages should be emitted before releasing
> the related locks, so if we've successfully acquired a lock on an
> object, then any pertinent sinval messages must already be completely
> written.  What I'm thinking we could do is just keep a global counter
> indicating how many messages have been written, and each backend could
> keep a counter indicating the highest-numbered message it has so far
> read.  When a message is written, the global counter is bumped.  When
> a backend goes to AcceptInvalidationMessages(), it compares the global
> counter to its own counter, and if they're the same, then it doesn't
> bother taking SInvalReadLock and just exits quickly.

Sounds sensible. I do fear, however, that if we start to micro-optimize
around every single LWLock we're in for quite a maintenance burden in
the future. Even if each single modification is relatively simple, the
complexity still adds ups. So I still believe that it'd be better to
start with optimizing LWLocks in general, and to only touch the LWLocks
which are still hot spots afterwards.

> There is an obvious (potential) memory-ordering problem here.  If it's
> possible for a backend doing AcceptInvalidationMessages() to see a
> stale version of the counter, then it might fail to read messages from
> the queue that it really ought to have seen.  Protecting the global
> counter with a spinlock defeats the purpose of doing any of this in
> the first place, because the whole problem is too many people fighting
> over the spinlock protecting SInvalReadLock.  It should be sufficient,
> though, for the writing backend to execute a memory-barrier operation
> just after bumping the global counter and the reading backend to
> execute one just before.  As it happens, LWLockRelease() is such a
> barrier, since it must acquire and release a spinlock, so we should be
> able to just bump the counter right before releasing the LWLock and
> call it good.  On the reading side, the immediately-preceding lock
> acquisition seems like it ought to be sufficient - surely it can't be
> possible to acquire a heavyweight lock on an object without seeing
> memory updates that some other process made before releasing the same
> heavyweight lock.

If we start doing lockless algorithms, I really think we should add
explicit barrier primitives. We could start out with something
simple such as

#define MemoryBarrierWrite \
  do { slock_t l; S_UNLOCK(l); } while (0)
#define MemoryBarrierRead \
  do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); } while (0)
#define MemoryBarrierReadWrite \
  do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); } while (0)

But yeah, it seems that in the case of the sinval queue, the surrounding
LWLocks actually provide the necessary barriers. 

> A possible objection here is that a 32-bit counter might wrap around,
> giving a false negative, and a read from a 64-bit counter might not be
> atomic.  In practice, the chances of a problem here seem remote, but I
> think we can guard against it easily enough.  We can put each
> backend's counter of messages read into shared memory, protected by a
> per-backend lock (probably the same LWLock the fast relation lock
> patch already introduces).  When a backend compares its counter to the
> global counter, it does so while holding this lock.  When
> SICleanupQueue() resets a backend, it grabs the lock and sets that
> backend's counter to a special value (like 0) that we guarantee will
> never match the global counter (which we can cause to wrap from 2^32-1
> to 1).  So then AcceptInvalidationMessages() - in the common case
> where there are no invalidation messages to process - will only need
> the per-backend LWLock rather than fighting over SInvalLock.

OTOH, don't all architectures that are interesting targets of
performance tuning provide atomic access to 64-bit values? Even i386
has an 8-byte compare-and-exchange instruction I think...

best regards,
Florian Pflug




Re: spinlock contention

From
Robert Haas
Date:
On Thu, Jun 23, 2011 at 12:19 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 23.06.2011 18:42, Robert Haas wrote:
>> ProcArrayLock looks like a tougher nut to crack - there's simply no
>> way, with the system we have right now, that you can take a snapshot
>> without locking the list of running processes.  I'm not sure what to
>> do about that, but we're probably going to have to come up with
>> something, because it seems clear that once we eliminate the lock
>> manager LWLock contention, this is a major bottleneck.
>
> ProcArrayLock is currently held for a relatively long period of time when a
> snapshot is taken, especially if there's a lot of backends. There are three
> operations to the procarray:
>
> 1. Advertising a new xid belonging to my backend.
> 2. Ending a transaction.
> 3. Getting a snapshot.
>
> Advertising a new xid is currently done without a lock, assuming that
> setting an xid in shared memory is atomic.

The issue isn't just whether it's atomic, but whether some other
backend scanning the ProcArray just afterwards might fail to see the
update due to memory-ordering effects.  But I think even if they do,
there's no real harm done.  Since we're holding XidGenLock, we know
that the XID we are advertising is higher than any XID previously
advertised, and in particular it must be higher than
latestCompletedXid, so GetSnapshotData() will ignore it anyway.

I am rather doubtful that this code is strictly safe:
    myproc->subxids.xids[nxids] = xid;
    myproc->subxids.nxids = nxids + 1;

In practice, the window for a race condition is minimal, because (a)
in many cases, nxids will be on the same cache line as the xid being
set; (b) even if it's not, we almost immediately release XidGenLock,
which acts as a memory barrier; (c) on many common architectures, such
as x86, stores are guaranteed to become visible in execution order;
(d) if, on some architecture, you managed to see the stores out of
order and thus loaded a garbage XID from the end of the array, it
might happen to be zero (which would be ignored, as a non-normal
transaction ID) or the transaction might happen never to examine a
tuple with that XID anyway.

> To end a transaction, you acquire
> ProcArrayLock in exclusive mode. To get a snapshot, you acquire
> ProcArrayLock in shared mode, and scan the whole procarray.
>
> I wonder if it would be a better tradeoff to keep a "materialized" snapshot
> in shared memory that's updated every time a new xid is assigned or a
> transaction ends. Getting a snapshot would become much cheaper, as you could
> just memcpy the ready-built snapshot from shared memory into local memory.

At least in the case of the pgbench -S workload, I doubt that would
help at all.  The problem isn't that the LWLocks are being held for
too long -- they are all being held in shared mode and don't conflict
anyway.  The problem is that everyone has to acquire and release the
spinlock in order to acquire the LWLock, and again to release it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spinlock contention

From
Merlin Moncure
Date:
On Thu, Jun 23, 2011 at 1:34 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Jun23, 2011, at 17:42 , Robert Haas wrote:
>> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:
>>> On Jun12, 2011, at 23:39 , Robert Haas wrote:
>>>> So, the majority (60%) of the excess spinning appears to be due to
>>>> SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
>>>
>>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
>>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>>> so I guess that two adjacent LWLocks currently share one cache line.
>>>
>>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
>>> index 5, so if I'm not mistaken exactly the two locks where you saw
>>> the largest contention on are on the same cache line...
>>>
>>> Might make sense to try and see if these numbers change if you
>>> either make LWLockPadded 64bytes or arrange the locks differently...
>>
>> I fooled around with this a while back and saw no benefit.  It's
>> possible a more careful test would turn up something, but I think the
>> only real way forward here is going to be to eliminate some of that
>> locking altogether.
>
> It seems hard to believe that there isn't some effect of two locks
> sharing a cache line. There are architectures (even some of the
> Intel architectures, I believe) where cache lines are 32 bit, though -
> so measuring this certainly isn't easy. Also, I'm absolute no
> expert for this, so it might very well be that' I'm missing something
> fundamental.
>
>> SInvalReadLock is a good example of a lock which seems barely to be
>> necessary at all.  Except on extraordinarily rare occasions, it is
>> only ever taken in share mode.
>
> This is the case we should optimize for, I think. Currently, there's
> contention even between two SHARE lockers of a LWLock because our
> LWLock implementation uses a spinlock underneath. I've googled around
> a bit, and found this:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git
>
> It's an userspace rwlock implementation based on atomic instructions
> and futexes (for blocking) written by Linus Torwalds as a response
> to the following thread
>
> http://lwn.net/Articles/284741/
>
> It seems to require only a single atomic increment instruction to
> acquire a share lock in the uncontested case, and a single compare-
> and-exchange instruction to acquire an exlcusive lock (again in
> the uncontested case).
>
> The code doesn't seem to have a license attached, and seems to have
> been written over a weekend and never touched since, so we might
> not want (or be able to) to use this directly. But it'd still be
> interesting to see if replacing our LWLocks with this implementation
> affects throughput.
>
>> Only SICleanupQueue() needs to lock
>> out readers, and in the (probably common) case where all backends are
>> reasonably close to being caught up, it wouldn't even matter if the
>> determination of minMsgNum were a little bit fuzzy.  I've been
>> straining my brain trying to figure out whether there's some way to
>> rewrite this logic so that it can run concurrently with readers; then
>> we could get rid of SInvalReadLock altogether.  I have a feeling if I
>> were 25% smarter I could do it, but I'm not.
>
> This sounds like something which could be done with RCU (Read-Copy-Update)
> in a lockless way. The idea is to copy the data structure (or parts)
> of it before updating it, and to "commit" the change afterwards by
> adjusting a single pointer to point to the new structure (kinda like
> we do atomic commits by an atomic update of one bit in the clog).
>
> The hard part is usually to figure when it's OK to clean up or
> reuse the old version of the data structure - for that, you need to
> know that all previous readers have completed.
>
> Here's a rough description of how that could work for the invalidation
> queue. We'd have two queue structures in shared memory, each containing
> a version number. We'd also have a "current" pointer pointing to the
> most recent one. Each reader would publish the version number of the
> queue he's about to read (the "current" one) in shared memory
> (and issue a write barrier). SICleanupQueue() would check whether the
> non-current queue structure is unused by comparing its version to the
> published versions of all backends (after issuing a read barrier, and
> after taking a lock that prevents concurrent SICleanupQueue runs of
> course). If there still are users of that version, SICleanupQueue()
> out-waits them. Afterwards, it simply copies the current structure
> over the old one, does the necessary cleanups, increments the version
> number to be larger than the one of the "current" structure,
> and *then* updates the "current" pointer.
>
>> So my fallback position is to try to find some way to optimize the
>> common case where there are no messages to be read.  We're already
>> allowing readers and writers to run in parallel, so it must not be
>> important for a reader to see a message that a writer is in the
>> process of writing, but has not yet finished writing.  I believe the
>> idea here is that sinval messages should be emitted before releasing
>> the related locks, so if we've successfully acquired a lock on an
>> object, then any pertinent sinval messages must already be completely
>> written.  What I'm thinking we could do is just keep a global counter
>> indicating how many messages have been written, and each backend could
>> keep a counter indicating the highest-numbered message it has so far
>> read.  When a message is written, the global counter is bumped.  When
>> a backend goes to AcceptInvalidationMessages(), it compares the global
>> counter to its own counter, and if they're the same, then it doesn't
>> bother taking SInvalReadLock and just exits quickly.
>
> Sounds sensible. I do fear, however, that if we start to micro-optimize
> around every single LWLock we're in for quite a maintenance burden in
> the future. Even if each single modification is relatively simple, the
> complexity still adds ups. So I still believe that it'd be better to
> start with optimizing LWLocks in general, and to only touch the LWLocks
> which are still hot spots afterwards.
>
>> There is an obvious (potential) memory-ordering problem here.  If it's
>> possible for a backend doing AcceptInvalidationMessages() to see a
>> stale version of the counter, then it might fail to read messages from
>> the queue that it really ought to have seen.  Protecting the global
>> counter with a spinlock defeats the purpose of doing any of this in
>> the first place, because the whole problem is too many people fighting
>> over the spinlock protecting SInvalReadLock.  It should be sufficient,
>> though, for the writing backend to execute a memory-barrier operation
>> just after bumping the global counter and the reading backend to
>> execute one just before.  As it happens, LWLockRelease() is such a
>> barrier, since it must acquire and release a spinlock, so we should be
>> able to just bump the counter right before releasing the LWLock and
>> call it good.  On the reading side, the immediately-preceding lock
>> acquisition seems like it ought to be sufficient - surely it can't be
>> possible to acquire a heavyweight lock on an object without seeing
>> memory updates that some other process made before releasing the same
>> heavyweight lock.
>
> If we start doing lockless algorithms, I really think we should add
> explicit barrier primitives. We could start out with something
> simple such as
>
> #define MemoryBarrierWrite \
>  do { slock_t l; S_UNLOCK(l); } while (0)
> #define MemoryBarrierRead \
>  do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
> #define MemoryBarrierReadWrite \
>  do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }
>
> But yeah, it seems that in the case of the sinval queue, the surrounding
> LWLocks actually provide the necessary barriers.
>
>> A possible objection here is that a 32-bit counter might wrap around,
>> giving a false negative, and a read from a 64-bit counter might not be
>> atomic.  In practice, the chances of a problem here seem remote, but I
>> think we can guard against it easily enough.  We can put each
>> backend's counter of messages read into shared memory, protected by a
>> per-backend lock (probably the same LWLock the fast relation lock
>> patch already introduces).  When a backend compares its counter to the
>> global counter, it does so while holding this lock.  When
>> SICleanupQueue() resets a backend, it grabs the lock and sets that
>> backend's counter to a special value (like 0) that we guarantee will
>> never match the global counter (which we can cause to wrap from 2^32-1
>> to 1).  So then AcceptInvalidationMessages() - in the common case
>> where there are no invalidation messages to process - will only need
>> the per-backend LWLock rather than fighting over SInvalLock.
>
> OTOH, don't all architectures that are interesting targets of
> performance tuning provide atomic access to 64-bit values? Even i386
> has an 8-byte compare-and-exchange instruction I think...

yup, 'lock cmpxchg8b'.
see: http://www.niallryan.com/node/137 for a way to do it via fpu
without a loop.  the 'lock' is a bus lock.

merlin


Re: spinlock contention

From
Robert Haas
Date:
On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug <fgp@phlo.org> wrote:
> It seems hard to believe that there isn't some effect of two locks
> sharing a cache line. There are architectures (even some of the
> Intel architectures, I believe) where cache lines are 32 bit, though -
> so measuring this certainly isn't easy. Also, I'm absolute no
> expert for this, so it might very well be that' I'm missing something
> fundamental.

Well, I'm sure there is some effect, but my experiments seem to
indicate that it's not a very important one.  Again, please feel free
to provide contrary evidence.  I think the basic issue is that - in
the best possible case - padding the LWLocks so that you don't have
two locks sharing a cache line can reduce contention on the busier
lock by at most 2x.  (The less busy lock may get a larger reduction
but that may not help you much.)  If what you really need is for
the contention to decrease by 1000x, you're just not really moving the
needle.  That's why the basic fast-relation-lock patch helps so much:
it replaces a system where every lock request results in contention
with a system where NONE of them do.

>> SInvalReadLock is a good example of a lock which seems barely to be
>> necessary at all.  Except on extraordinarily rare occasions, it is
>> only ever taken in share mode.
>
> This is the case we should optimize for, I think. Currently, there's
> contention even between two SHARE lockers of a LWLock because our
> LWLock implementation uses a spinlock underneath. I've googled around
> a bit, and found this:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git
>
> It's an userspace rwlock implementation based on atomic instructions
> and futexes (for blocking) written by Linus Torwalds as a response
> to the following thread
>
> http://lwn.net/Articles/284741/
>
> It seems to require only a single atomic increment instruction to
> acquire a share lock in the uncontested case, and a single compare-
> and-exchange instruction to acquire an exlcusive lock (again in
> the uncontested case).
>
> The code doesn't seem to have a license attached, and seems to have
> been written over a weekend and never touched since, so we might
> not want (or be able to) to use this directly. But it'd still be
> interesting to see if replacing our LWLocks with this implementation
> affects throughput.

I tried rewriting the LWLocks using CAS.  It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
be better, but I'm not holding my breath.  Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.

>> Only SICleanupQueue() needs to lock
>> out readers, and in the (probably common) case where all backends are
>> reasonably close to being caught up, it wouldn't even matter if the
>> determination of minMsgNum were a little bit fuzzy.  I've been
>> straining my brain trying to figure out whether there's some way to
>> rewrite this logic so that it can run concurrently with readers; then
>> we could get rid of SInvalReadLock altogether.  I have a feeling if I
>> were 25% smarter I could do it, but I'm not.
>
> This sounds like something which could be done with RCU (Read-Copy-Update)
> in a lockless way. The idea is to copy the data structure (or parts)
> of it before updating it, and to "commit" the change afterwards by
> adjusting a single pointer to point to the new structure (kinda like
> we do atomic commits by an atomic update of one bit in the clog).
>
> The hard part is usually to figure when it's OK to clean up or
> reuse the old version of the data structure - for that, you need to
> know that all previous readers have completed.
>
> Here's a rough description of how that could work for the invalidation
> queue. We'd have two queue structures in shared memory, each containing
> a version number. We'd also have a "current" pointer pointing to the
> most recent one. Each reader would publish the version number of the
> queue he's about to read (the "current" one) in shared memory
> (and issue a write barrier). SICleanupQueue() would check whether the
> non-current queue structure is unused by comparing its version to the
> published versions of all backends (after issuing a read barrier, and
> after taking a lock that prevents concurrent SICleanupQueue runs of
> course). If there still are users of that version, SICleanupQueue()
> out-waits them. Afterwards, it simply copies the current structure
> over the old one, does the necessary cleanups, increments the version
> number to be larger than the one of the "current" structure,
> and *then* updates the "current" pointer.

Interesting idea.

>> So my fallback position is to try to find some way to optimize the
>> common case where there are no messages to be read.  We're already
>> allowing readers and writers to run in parallel, so it must not be
>> important for a reader to see a message that a writer is in the
>> process of writing, but has not yet finished writing.  I believe the
>> idea here is that sinval messages should be emitted before releasing
>> the related locks, so if we've successfully acquired a lock on an
>> object, then any pertinent sinval messages must already be completely
>> written.  What I'm thinking we could do is just keep a global counter
>> indicating how many messages have been written, and each backend could
>> keep a counter indicating the highest-numbered message it has so far
>> read.  When a message is written, the global counter is bumped.  When
>> a backend goes to AcceptInvalidationMessages(), it compares the global
>> counter to its own counter, and if they're the same, then it doesn't
>> bother taking SInvalReadLock and just exits quickly.
>
> Sounds sensible. I do fear, however, that if we start to micro-optimize
> around every single LWLock we're in for quite a maintenance burden in
> the future. Even if each single modification is relatively simple, the
> complexity still adds ups. So I still believe that it'd be better to
> start with optimizing LWLocks in general, and to only touch the LWLocks
> which are still hot spots afterwards.

I was hoping to do it that way, too, but I have so far been unable to
demonstrate a performance benefit from that approach.

>> There is an obvious (potential) memory-ordering problem here.  If it's
>> possible for a backend doing AcceptInvalidationMessages() to see a
>> stale version of the counter, then it might fail to read messages from
>> the queue that it really ought to have seen.  Protecting the global
>> counter with a spinlock defeats the purpose of doing any of this in
>> the first place, because the whole problem is too many people fighting
>> over the spinlock protecting SInvalReadLock.  It should be sufficient,
>> though, for the writing backend to execute a memory-barrier operation
>> just after bumping the global counter and the reading backend to
>> execute one just before.  As it happens, LWLockRelease() is such a
>> barrier, since it must acquire and release a spinlock, so we should be
>> able to just bump the counter right before releasing the LWLock and
>> call it good.  On the reading side, the immediately-preceding lock
>> acquisition seems like it ought to be sufficient - surely it can't be
>> possible to acquire a heavyweight lock on an object without seeing
>> memory updates that some other process made before releasing the same
>> heavyweight lock.
>
> If we start doing lockless algorithms, I really think we should add
> explicit barrier primitives. We could start out with something
> simple such as
>
> #define MemoryBarrierWrite \
>  do { slock_t l; S_UNLOCK(l); } while (0)
> #define MemoryBarrierRead \
>  do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
> #define MemoryBarrierReadWrite \
>  do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }
>
> But yeah, it seems that in the case of the sinval queue, the surrounding
> LWLocks actually provide the necessary barriers.

Yes, a set of general memory barrier primitives would be handy.
Unfortunately, it would probably be quite a bit of work to develop
this and verify that it works properly on all supported platforms.

>> A possible objection here is that a 32-bit counter might wrap around,
>> giving a false negative, and a read from a 64-bit counter might not be
>> atomic.  In practice, the chances of a problem here seem remote, but I
>> think we can guard against it easily enough.  We can put each
>> backend's counter of messages read into shared memory, protected by a
>> per-backend lock (probably the same LWLock the fast relation lock
>> patch already introduces).  When a backend compares its counter to the
>> global counter, it does so while holding this lock.  When
>> SICleanupQueue() resets a backend, it grabs the lock and sets that
>> backend's counter to a special value (like 0) that we guarantee will
>> never match the global counter (which we can cause to wrap from 2^32-1
>> to 1).  So then AcceptInvalidationMessages() - in the common case
>> where there are no invalidation messages to process - will only need
>> the per-backend LWLock rather than fighting over SInvalLock.
>
> OTOH, don't all architectures that are interesting targets of
> performance tuning provide atomic access to 64-bit values? Even i386
> has an 8-byte compare-and-exchange instruction I think...

Maybe so... but if I can get by without needing that, so much the
better.  I don't really want to have to have a separate code path for
architectures that don't support that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spinlock contention

From
Florian Pflug
Date:
On Jun23, 2011, at 22:15 , Robert Haas wrote:
> On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug <fgp@phlo.org> wrote:
>> It seems hard to believe that there isn't some effect of two locks
>> sharing a cache line. There are architectures (even some of the
>> Intel architectures, I believe) where cache lines are 32 bit, though -
>> so measuring this certainly isn't easy. Also, I'm absolute no
>> expert for this, so it might very well be that' I'm missing something
>> fundamental.
> 
> Well, I'm sure there is some effect, but my experiments seem to
> indicate that it's not a very important one.  Again, please feel free
> to provide contrary evidence.  I think the basic issue is that - in
> the best possible case - padding the LWLocks so that you don't have
> two locks sharing a cache line can reduce contention on the busier
> lock by at most 2x.  (The less busy lock may get a larger reduction
> but that may not help you much.)  If you what you really need is for
> the contention to decrease by 1000x, you're just not really moving the
> needle.

Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
the most heavily contested LWLocks from the others might still be
worthwhile.

> That's why the basic fast-relation-lock patch helps so much:
> it replaces a system where every lock request results in contention
> with a system where NONE of them do.
> 
> I tried rewriting the LWLocks using CAS.  It actually seems to make
> things slightly worse on the tests I've done so far, perhaps because I
> didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
> be better, but I'm not holding my breath.  Everything I'm seeing so
> far leads me to the belief that we need to get rid of the contention
> altogether, not just contend more quickly.

Is there a patch available? How did you do the slow path (i.e. the
case where there's contention and you need to block)? It seems to
me that without some kernel support like futexes it's impossible
to do better than LWLocks already do, because any simpler scheme
like
  while (atomic_inc_post(lock) > 0) {
    atomic_dec(lock);
    block(lock);
  }
for the shared-locker case suffers from a race condition (the lock
might be released before you actually block()).

>>> There is an obvious (potential) memory-ordering problem here.  If it's
>>> possible for a backend doing AcceptInvalidationMessages() to see a
>>> stale version of the counter, then it might fail to read messages from
>>> the queue that it really ought to have seen.  Protecting the global
>>> counter with a spinlock defeats the purpose of doing any of this in
>>> the first place, because the whole problem is too many people fighting
>>> over the spinlock protecting SInvalReadLock.  It should be sufficient,
>>> though, for the writing backend to execute a memory-barrier operation
>>> just after bumping the global counter and the reading backend to
>>> execute one just before.  As it happens, LWLockRelease() is such a
>>> barrier, since it must acquire and release a spinlock, so we should be
>>> able to just bump the counter right before releasing the LWLock and
>>> call it good.  On the reading side, the immediately-preceding lock
>>> acquisition seems like it ought to be sufficient - surely it can't be
>>> possible to acquire a heavyweight lock on an object without seeing
>>> memory updates that some other process made before releasing the same
>>> heavyweight lock.
>> 
>> If we start doing lockless algorithms, I really think we should add
>> explicit barrier primitives. We could start out with something
>> simple such as
>> 
>> #define MemoryBarrierWrite \
>>  do { slock_t l; S_UNLOCK(l); } while (0)
>> #define MemoryBarrierRead \
>>  do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
>> #define MemoryBarrierReadWrite \
>>  do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }
>> 
>> But yeah, it seems that in the case of the sinval queue, the surrounding
>> LWLocks actually provide the necessary barriers.
> 
> Yes, a set of general memory barrier primitives would be handy.
> Unfortunately, it would probably be quite a bit of work to develop
> this and verify that it works properly on all supported platforms.

The idea would be to start out with something trivial like the above.
Maybe with an #if for compilers which have something like GCC's
__sync_synchronize(). We could then gradually add implementations
for specific architectures, hopefully done by people who actually
own the hardware and can test.
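
Something along these lines, say (straw-man only; the fallback just abuses
a local spinlock, much like the macros above):

    #if defined(__GNUC__)
    #define pg_memory_barrier()     __sync_synchronize()
    #else
    /* Fallback: acquiring and releasing a spinlock acts as a full barrier. */
    #define pg_memory_barrier() \
        do { \
            slock_t _dummy; \
            S_INIT_LOCK(&_dummy); \
            S_LOCK(&_dummy); \
            S_UNLOCK(&_dummy); \
        } while (0)
    #endif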

best regards,
Florian Pflug




Re: spinlock contention

From
Robert Haas
Date:
On Thu, Jun 23, 2011 at 5:35 PM, Florian Pflug <fgp@phlo.org> wrote:
>> Well, I'm sure there is some effect, but my experiments seem to
>> indicate that it's not a very important one.  Again, please feel free
>> to provide contrary evidence.  I think the basic issue is that - in
>> the best possible case - padding the LWLocks so that you don't have
>> two locks sharing a cache line can reduce contention on the busier
>> lock by at most 2x.  (The less busy lock may get a larger reduction
>> but that may not help you much.)  If you what you really need is for
>> the contention to decrease by 1000x, you're just not really moving the
>> needle.
>
> Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
> the most heavily contested LWLocks for the others might still be
> worthwhile.

Hey, if we can show that it works, sign me up.

>> That's why the basic fast-relation-lock patch helps so much:
>> it replaces a system where every lock request results in contention
>> with a system where NONE of them do.
>>
>> I tried rewriting the LWLocks using CAS.  It actually seems to make
>> things slightly worse on the tests I've done so far, perhaps because I
>> didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
>> be better, but I'm not holding my breath.  Everything I'm seeing so
>> far leads me to the belief that we need to get rid of the contention
>> altogether, not just contend more quickly.
>
> Is there a patch available? How did you do the slow path (i.e. the
> case where there's contention and you need to block)? It seems to
> me that without some kernel support like futexes it's impossible
> to do better than LWLocks already do, because any simpler scheme
> like
>  while (atomic_inc_post(lock) > 0) {
>    atomic_dec(lock);
>    block(lock);
>  }
> for the shared-locker case suffers from a race condition (the lock
> might be released before you actually block()).

Attached...

> The idea would be to start out with something trivial like the above.
> Maybe with an #if for compilers which have something like GCC's
> __sync_synchronize(). We could then gradually add implementations
> for specific architectures, hopefully done by people who actually
> own the hardware and can test.

Yes.  But if we go that route, then we have to also support a code
path for architectures for which we don't have that support.  That's
going to be more work, so I don't want to do it until we have a case
where there is a good, clear benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: spinlock contention

From
Greg Stark
Date:
On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> ProcArrayLock looks like a tougher nut to crack - there's simply no
> way, with the system we have right now, that you can take a snapshot
> without locking the list of running processes.  I'm not sure what to
> do about that, but we're probably going to have to come up with
> something, because it seems clear that once we eliminate the lock
> manager LWLock contention, this is a major bottleneck.

Well, as Tom observed earlier, the kernel of a snapshot is actually an
LSN. A snapshot contains a set of xids which all committed before some
LSN and none which committed after it.

So if we had a record of what log sequence number the commit record
for any given transaction is we could build the snapshot at our
leisure without any exclusive lock. In fact we could even build it
lazily as a kind of cache only when we actually are interested in a
given xid.





--
greg


Re: spinlock contention

From
Robert Haas
Date:
On Sat, Jun 25, 2011 at 8:26 PM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> ProcArrayLock looks like a tougher nut to crack - there's simply no
>> way, with the system we have right now, that you can take a snapshot
>> without locking the list of running processes.  I'm not sure what to
>> do about that, but we're probably going to have to come up with
>> something, because it seems clear that once we eliminate the lock
>> manager LWLock contention, this is a major bottleneck.
>
> Well as Tom observed earlier the kernel of a snapshot is actually a
> LSN. A snapshot contains a set of xids which all committed before some
> LSN and none which committed after it.
>
> So if we had a record of what log sequence number the commit record
> for any given transaction is we could build the snapshot at our
> leisure without any exclusive lock. In fact we could even build it
> lazily as a kind of cache only when we actually are interested in a
> given xid.

Yeah, I've been thinking about that.  I think what we might do is set
up a new SLRU that works like CLOG, but each entry is an LSN rather
than just two bits.  When a transaction commits, we save the commit
LSN under the entry for that XID.  We truncate away SLRU pages that
contain no still-running XIDs.  When we need to check whether an XID
is visible to our snapshot, we just look up the commit LSN and compare
it with our snapshot LSN.  If it's before and non-zero, we can see it.
If it's after or all-zeroes, we can't.
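
The visibility test would then reduce to something like this (purely
illustrative: CommitLsnGet() stands in for the proposed SLRU lookup, and
the LSN is treated as a plain 64-bit integer for brevity):

    typedef uint64 CommitLsn;       /* 0 == no commit record recorded */

    extern CommitLsn CommitLsnGet(TransactionId xid);   /* hypothetical */

    static bool
    XidVisibleInSnapshot(TransactionId xid, CommitLsn snapshotLSN)
    {
        CommitLsn   commitLSN = CommitLsnGet(xid);

        if (commitLSN == 0)
            return false;           /* all-zeroes: not committed before us */
        return commitLSN < snapshotLSN;     /* committed strictly before? */
    }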

But I'm not sure how much this would really help.  It might (subject
to working out the details) make the actual process of taking a
snapshot faster.  But it's not clear that taking snapshots more
quickly will actually help anything, because the problem is not the
amount of time spent taking the snapshot.  The problem is rather
that doing so requires acquiring and releasing an LWLock, and each of
those operations requires taking and releasing a spinlock.  And it is
the spinlock contention that is killing us.   That makes me think we
need a way to take a snapshot without taking a spinlock.  Or if we
must take spinlocks, we at least have to avoid every backend that
needs a snapshot lusting after the *same* spinlock.

What I've been thinking about this weekend is whether it might be
possible to create a sort of lock-free queue infrastructure.  When a
backend starts up, it would add an entry to the queue saying "I'm
running".  When it commits, it would add an entry to the queue saying
"I'm committed".  All entries would be added at the *end* of the
queue, so a backend scanning the queue to build up a snapshot wouldn't
ever be able to see commits out of order.  We would need some memory
barrier operations on weak-memory-ordering machines to make sure that
the queue writes became visible before the end-of-queue pointer bump.
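
The writer side might look something like this (the names and the barrier
macro are made up; the queue-insertion lock mentioned further down is
assumed to be held):

    typedef enum { ENTRY_RUNNING, ENTRY_COMMITTED } XidEntryKind;

    typedef struct
    {
        XidEntryKind    kind;
        TransactionId   xid;
    } XidQueueEntry;

    #define XID_QUEUE_SIZE 1024     /* arbitrary for the sketch */

    typedef struct
    {
        XidQueueEntry   entries[XID_QUEUE_SIZE];
        volatile uint32 tail;       /* next free slot; readers scan up to here */
    } XidQueue;

    /* Caller holds the queue-insertion lock discussed below. */
    static void
    xid_queue_append(XidQueue *q, XidEntryKind kind, TransactionId xid)
    {
        uint32      slot = q->tail;

        q->entries[slot].kind = kind;
        q->entries[slot].xid = xid;
        pg_write_barrier();         /* entry must be visible before the bump */
        q->tail = slot + 1;         /* readers may now see the new entry */
    }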

The trick is figuring out how to clean up the queue.  Since "commit"
entries exist only to guard against "running" entries earlier in the
queue, the start-of-queue pointer can be advanced whenever it points
to a "commit" entry.  Also, if it points to a "running" entry for
which there is a later "commit" entry, then the start-of-queue pointer
can be advanced over that as well.  However, just because we can
advance the point at which backends start reading doesn't mean that we
can actually recycle space, because while we know that new scans
needn't worry about those entries, we *don't* know that there isn't
already a scan in flight that still needs them.  Furthermore, if a
transaction runs for a long time, we can never advance the
start-of-queue pointer past the "running" entry for its XID, which is
obviously problematic since the queue would get very long.

To work around that problem, I think we could use Florian's idea
upthread of an RCU system.  We keep two copies of the queue around, an
A copy and a B copy.  When the currently active copy fills up, we
rewrite it into the other queue, omitting all "committed" entries and
any "running" entries that have matching "committed" entries, and then
tell everyone to start looking at that copy instead.   We would need
some kind of gymnastics to make sure that we don't flip from the A
copy to the B copy and back to the A copy while some laggardly backend
is still hoping to scan the old A copy.  A simple algorithm (there's
probably a smarter one) would be to have each backend take a spinlock
while it's scanning either copy, and to have the backend that is doing
the rewrite take and release all of those spinlocks one at a time
before beginning the rewrite, thus guaranteeing that any scans still
in progress when the rewrite is requested have completed before it's
actually performed.  Any new scans started in the meanwhile will
certainly be looking at the current copy rather than the old copy
we're about to overwrite.
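
The handshake before a rewrite could be as simple as this (scan_lock[] is a
hypothetical per-backend spinlock array in shared memory; each backend holds
its own entry while scanning either copy):

    static volatile slock_t *scan_lock;     /* one per backend, in shmem */

    static void
    wait_for_in_flight_scans(void)
    {
        int         i;

        for (i = 0; i < MaxBackends; i++)
        {
            /*
             * Taking and releasing each backend's lock once guarantees that
             * any scan which started before the rewrite was requested has
             * finished by the time we get past its lock.
             */
            SpinLockAcquire(&scan_lock[i]);
            SpinLockRelease(&scan_lock[i]);
        }
    }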

We would still need a lock around the operation of adding new items to
the queue; if two backends try to do that at the same time, chaos will
ensue.  But if that lock gets busy, it would be possible to have
backends use some sort of system of elimination buffers to combine
their requests.  If ten backends each want to advertise a new XID, it
will be far faster to have one backend acquire the lock, write in all
ten entries, and wake everyone up than to have each backend insert its
entry and wake up the next one.  So what we might do is a conditional
acquire on the lock to insert data into the buffer and, if
we can't get it immediately, go find another backend with the same
problem and elect a leader to do all the insertions.  If we're the
leader, sleep on the queue-insertion lock.  If not, sleep until the
leader wakes us up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spinlock contention

From
Florian Pflug
Date:
On Jun23, 2011, at 23:40 , Robert Haas wrote:
>>> I tried rewriting the LWLocks using CAS.  It actually seems to make
>>> things slightly worse on the tests I've done so far, perhaps because I
>>> didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
>>> be better, but I'm not holding my breath.  Everything I'm seeing so
>>> far leads me to the belief that we need to get rid of the contention
>>> altogether, not just contend more quickly.

I got curious and put a tool together to benchmark this. The code can
be found at https://github.com/fgp/lockbench. 
The tool starts N workers who each acquire a lock in SHARE mode, increment
a per-worker counter and release the lock again, rinse, repeat. That is
done for T seconds, after which it reports the sum of the individual
counters, the elapsed (wall) time and the sums of the user and system times
consumed by the workers. The lock implementations currently supported are:

atomicinc:      RW-lock using atomic increment/decrement instructions
                (locked xaddl) to acquire and release the lock in SHARE
                mode in the non-contested case. The contested case is
                unimplemented.

cmpxchng:       Like atomicinc, but uses a "locked cmpxchng" loop instead
                of "locked xaddl".

spin:           RW-lock built around a simple cmpxchng-based spinlock
                (i.e., to lock/unlock, the spinlock is taken, the
                shared-lockers count is incremented/decremented, and the
                spinlock is released). Similar to pg_lwlock, but without
                the sleep() in the contested path of the spinlock. The
                contested case of the RW-lock is again unimplemented.

none:           No locking.

pg_lwlock:      Postgres LWLock implementation. The contested case doesn't
                work because the workers currently don't have a valid
                MyProc.

pg_lwlock_cas:  Like pg_lwlock, but with your CAS patch applied.

On a 4-core Intel Xeon system (Intel Xeon X3430, 2.4 GHz, 8MB cache) I get
the following results (wall/user seconds per acquire/release cycle):

             |   1 worker    |   2 workers   |   3 workers   |   4 workers   |
             |  wall    user |  wall    user |  wall    user |  wall    user |
-----------------------------------------------------------------------------
none         |2.5e-09 2.5e-09|1.3e-09 2.5e-09|8.5e-10 2.5e-09|6.8e-10 2.5e-09|
atomicinc    |1.8e-08 1.8e-08|2.8e-08 5.6e-08|2.7e-08 8.1e-08|2.9e-08 1.1e-07|
cmpxchng     |3.0e-08 3.0e-08|7.1e-08 1.4e-07|6.9e-08 2.0e-07|7.2e-08 2.9e-07|
spin         |5.0e-08 5.0e-08|3.0e-07 5.9e-07|3.8e-07 1.1e-06|4.0e-07 1.5e-06|
pg_lwlock    |2.8e-08 2.8e-08|2.9e-08 3.0e-08|3.0e-08 3.2e-08|2.9e-08 3.1e-08|
pg_lwlock_cas|2.6e-08 2.6e-08|6.2e-08 1.2e-07|7.7e-08 2.3e-07|7.8e-08 3.1e-07|
-----------------------------------------------------------------------------

These results allow the following conclusions to be drawn

First, our current pg_lwlock is quite good at preserving resources, i.e.
at not spinning excessively - at least for up to four workers. It's the only
lock implementation that consistently uses about 30ns of processor time for
one acquire/release cycle. Only atomicinc with one worker (i.e. no cache line
fighting) manages to beat that. It does, however, effectively serialize
execution with this workload - the consumed user time is approximately equal
to the elapsed wall clock time, even in the case of 4 workers, meaning that
most of the time 3 of those 4 workers are sleeping.

Secondly, pg_lwlock_cas and cmpxchng perform similarly, which comes as no
surprise, since they use effectively the same algorithm. One could expect spin
and pg_lwlock to also perform similarly, but these two don't. The reason
is probably that spin uses a rather brain-dead spinning method, while
pg_lwlock uses "rep; nop".

Now to atomicinc, which is (the fast-path of) the most optimized RW-lock
possible, at least on x86 and x86_64. One interesting result here is that
wall time seconds / cycle are suspiciously similar for atomicinc and pg_lwlock.
This proves, I think, that the limiting factor in both of these tests is
the speed at which cache line invalidation can occur. It doesn't seem to
matter whether the other cores are idle (as in the pg_lwlock case), or
trying to obtain ownership of the cache line themselves (as in the
atomicinc case). 

Finally, the no-locking test (none) shows that, even though the counter
is written to shared memory after every increment, memory bandwidth isn't
an issue here, since performance scaled nearly linearly with the number
of workers.

Here's the result of another run, this time with the per-worker counter
being incremented 100 times in each acquire/release cycle.
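
(Each worker essentially runs the loop sketched below - the names are
made up and this isn't the actual benchmark source; D is the
increments-per-cycle constant from run.sh, and lock_acquire/lock_release
stand in for whichever implementation is under test.)

/* rough sketch of one lockbench worker */
static void
worker_loop(volatile uint64_t *my_counter, int D,
            volatile int *stop, uint64_t *cycles)
{
    while (!*stop)
    {
        lock_acquire();             /* shared acquire of the tested lock */
        for (int i = 0; i < D; i++)
            (*my_counter)++;        /* per-worker counter in shared memory */
        lock_release();
        (*cycles)++;                /* one acquire/release cycle finished */
    }
}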

------------------- 100 Increments per Cycle ---------------------------------
             |   1 worker    |   2 workers   |   3 workers   |   4 workers   |
             |wallsec usersec|wallsec usersec|wallsec usersec|wallsec usersec|
             |   /       /   |   /       /   |   /       /   |   /       /   |
             |cycles  cycles |cycles  cycles |cycles  cycles |cycles  cycles |
------------------------------------------------------------------------------
none         |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.4e-08 2.5e-07|6.5e-08 2.6e-07|
atomicinc    |2.5e-07 2.5e-07|1.3e-07 2.6e-07|8.6e-08 2.5e-07|6.5e-08 2.6e-07|
cmpxchng     |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.6e-08 2.6e-07|7.8e-08 3.1e-07|
spin         |3.0e-07 3.0e-07|1.5e-07 3.0e-07|1.1e-07 3.2e-07|3.4e-07 1.4e-06|
pg_lwlock    |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.4e-08 2.5e-07|1.2e-07 4.8e-07|
pg_lwlock_cas|2.6e-07 2.6e-07|1.6e-07 3.1e-07|1.1e-07 3.2e-07|8.7e-08 3.4e-07|
------------------------------------------------------------------------------

This makes the overhead of the locking disappear in the noise in the
single-worker case. With 4 workers, however, differences materialize.
The no-locking case still shows approximately linear speedup though,
which I think again rules out memory bandwidth effects.

Now, atomicinc is the only implementation which doesn't perform
significantly worse than the no-lock case. The closest competitors are
the CAS-based implementations cmpxchng and pg_lwlock_cas, with 20% and
33% slowdowns respectively (in both wall seconds and user seconds per cycle).

The current LWLock implementation falls behind now, taking about twice as
many seconds per cycle compared to atomicinc. 

In other words, with this workload an atomicinc-based RW-lock, taken in
SHARE mode, should add virtually no overhead, while every other
implementation adds at least 20%.

It'd be interesting to get benchmark results also on machines with more
than four cores, but I don't have access to any such machine ATM. So
if someone could run that on an 8- or 16-core beast, that'd be great. The
code compiles on Mac OS X and Ubuntu 10.04. It's fairly portable; to
make the postgres parts compile on other platforms, it might be necessary
to adjust pg_stub/pg_config_manual.h.

best regards,
Florian Pflug



Re: spinlock contention

From
Robert Haas
Date:
On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug <fgp@phlo.org> wrote:
> [ testing of various spinlock implementations ]

I set T=30 and N="1 2 4 8 16 32" and tried this out on a 32-core
loaner from Nate Boley:

100 counter increments per cycle
worker    1        2        4        8        16    32    
time    wall    user    wall    user    wall    user    wall    user    wall    user    wall    user
none    2.8e-07    2.8e-07    1.5e-07    3.0e-07    8.0e-08    3.2e-07    4.2e-08    3.3e-07    2.1e-08    3.3e-07    1.1e-08    3.4e-07
atomicinc    3.6e-07    3.6e-07    2.6e-07    5.1e-07    1.4e-07    5.5e-07    1.4e-07    1.1e-06    1.5e-07    2.3e-06    1.5e-07    4.9e-06
cmpxchng    3.6e-07    3.6e-07    3.4e-07    6.9e-07    3.2e-07    1.3e-06    2.9e-07    2.3e-06    4.2e-07    6.6e-06    4.5e-07    1.4e-05
spin    4.1e-07    4.1e-07    2.8e-07    5.7e-07    1.6e-07    6.3e-07    1.2e-06    9.4e-06    3.8e-06    6.1e-05    1.4e-05    4.3e-04
pg_lwlock    3.8e-07    3.8e-07    2.7e-07    5.3e-07    1.5e-07    6.2e-07    3.9e-07    3.1e-06    1.6e-06    2.5e-05    6.4e-06    2.0e-04
pg_lwlock_cas    3.7e-07    3.7e-07    2.8e-07    5.6e-07    1.4e-07    5.8e-07    1.4e-07    1.1e-06    1.9e-07    3.0e-06    2.4e-07    7.5e-06


I wrote a little script to reorganize this data in a
possibly-easier-to-understand format - ordering each column from
lowest to highest, and showing each algorithm as a multiple of the
cheapest value for that column:

wall-1: none(1.0),atomicinc(1.3),cmpxchng(1.3),pg_lwlock_cas(1.3),pg_lwlock(1.4),spin(1.5)
user-1: none(1.0),atomicinc(1.3),cmpxchng(1.3),pg_lwlock_cas(1.3),pg_lwlock(1.4),spin(1.5)
wall-2: none(1.0),atomicinc(1.7),pg_lwlock(1.8),spin(1.9),pg_lwlock_cas(1.9),cmpxchng(2.3)
user-2: none(1.0),atomicinc(1.7),pg_lwlock(1.8),pg_lwlock_cas(1.9),spin(1.9),cmpxchng(2.3)
wall-4: none(1.0),atomicinc(1.7),pg_lwlock_cas(1.7),pg_lwlock(1.9),spin(2.0),cmpxchng(4.0)
user-4: none(1.0),atomicinc(1.7),pg_lwlock_cas(1.8),pg_lwlock(1.9),spin(2.0),cmpxchng(4.1)
wall-8: none(1.0),atomicinc(3.3),pg_lwlock_cas(3.3),cmpxchng(6.9),pg_lwlock(9.3),spin(28.6)
user-8: none(1.0),atomicinc(3.3),pg_lwlock_cas(3.3),cmpxchng(7.0),pg_lwlock(9.4),spin(28.5)
wall-16: none(1.0),atomicinc(7.1),pg_lwlock_cas(9.0),cmpxchng(20.0),pg_lwlock(76.2),spin(181.0)
user-16: none(1.0),atomicinc(7.0),pg_lwlock_cas(9.1),cmpxchng(20.0),pg_lwlock(75.8),spin(184.8)
wall-32: none(1.0),atomicinc(13.6),pg_lwlock_cas(21.8),cmpxchng(40.9),pg_lwlock(581.8),spin(1272.7)
user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

Here's the result of the same calculation applied to your second set of data:

wall-1: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),pg_lwlock_cas(1.0),spin(1.2)
user-1: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),pg_lwlock_cas(1.0),spin(1.2)
wall-2: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),spin(1.2),pg_lwlock_cas(1.2)
user-2: none(1.0),cmpxchng(1.0),pg_lwlock(1.0),atomicinc(1.0),spin(1.2),pg_lwlock_cas(1.2)
wall-3: none(1.0),pg_lwlock(1.0),atomicinc(1.0),cmpxchng(1.0),spin(1.3),pg_lwlock_cas(1.3)
user-3: none(1.0),atomicinc(1.0),pg_lwlock(1.0),cmpxchng(1.0),spin(1.3),pg_lwlock_cas(1.3)
wall-4: none(1.0),atomicinc(1.0),cmpxchng(1.2),pg_lwlock_cas(1.3),pg_lwlock(1.8),spin(5.2)
user-4: none(1.0),atomicinc(1.0),cmpxchng(1.2),pg_lwlock_cas(1.3),pg_lwlock(1.8),spin(5.4)

There seems to be something a bit funky in your 3-core data, but
overall I read this data to indicate that 4 cores aren't really enough
to see a severe problem with spinlock contention.  I'm not even sure
that you can see this problem on a real workload on a 16-core system.
But as you add cores the problem gets rapidly worse, because the more
cores you have, the more likely it is that someone else is already
holding the spinlock (or has just very recently held the spinlock)
that you want.  And, of course, the problem multiplies itself, because
once your lock acquisitions start taking longer, they become a larger
percentage of your execution time, which makes it proportionately more
likely that you'll find someone already there when you go to grab the
lock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spinlock contention

From
Merlin Moncure
Date:
On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

I may not be following all this correctly, but doesn't this suggest a
huge potential upside for the cas based patch you posted upthread when
combined with your earlier patches that were bogging down on spinlock
contention?

merlin


Re: spinlock contention

From
Robert Haas
Date:
On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)
>
> I may not be following all this correctly, but doesn't this suggest a
> huge potential upside for the cas based patch you posted upthread when
> combined with your earlier patches that were bogging down on spinlock
> contention?

Well, you'd think so, but in fact that patch makes it slower.  Don't
ask me why, 'cuz I dunno.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spinlock contention

From
Florian Pflug
Date:
On Jun28, 2011, at 23:48 , Robert Haas wrote:
> On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)
>>
>> I may not be following all this correctly, but doesn't this suggest a
>> huge potential upside for the cas based patch you posted upthread when
>> combined with your earlier patches that were bogging down on spinlock
>> contention?
>
> Well, you'd think so, but in fact that patch makes it slower.  Don't
> ask me why, 'cuz I dunno.  :-(

Huh? Where do you see your CAS patch being slower than unpatched LWLocks
in these results? Or are you referring to pgbench runs you made, not
to these artificial benchmarks?

best regards,
Florian Pflug




Re: spinlock contention

From
Robert Haas
Date:
On Tue, Jun 28, 2011 at 5:55 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Jun28, 2011, at 23:48 , Robert Haas wrote:
>> On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)
>>>
>>> I may not be following all this correctly, but doesn't this suggest a
>>> huge potential upside for the cas based patch you posted upthread when
>>> combined with your earlier patches that were bogging down on spinlock
>>> contention?
>>
>> Well, you'd think so, but in fact that patch makes it slower.  Don't
>> ask me why, 'cuz I dunno.  :-(
>
> Huh? Where do you see your CAS patch being slower than unpatched LWLocks
> in these results? Or are you referring to pgbench runs you made, not
> to these artificial benchmarks?

pgbench -S

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spinlock contention

From
Florian Pflug
Date:
On Jun28, 2011, at 22:18 , Robert Haas wrote:
> On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug <fgp@phlo.org> wrote:
>> [ testing of various spinlock implementations ]
>
> I set T=30 and N="1 2 4 8 16 32" and tried this out on a 32-core
> loaner from Nate Boley:

Cool, thanks!

> 100 counter increments per cycle
> worker    1        2        4        8        16    32
> time    wall    user    wall    user    wall    user    wall    user    wall    user    wall    user
> none    2.8e-07    2.8e-07    1.5e-07    3.0e-07    8.0e-08    3.2e-07    4.2e-08    3.3e-07    2.1e-08    3.3e-07    1.1e-08    3.4e-07
> atomicinc    3.6e-07    3.6e-07    2.6e-07    5.1e-07    1.4e-07    5.5e-07    1.4e-07    1.1e-06    1.5e-07    2.3e-06    1.5e-07    4.9e-06
> cmpxchng    3.6e-07    3.6e-07    3.4e-07    6.9e-07    3.2e-07    1.3e-06    2.9e-07    2.3e-06    4.2e-07    6.6e-06    4.5e-07    1.4e-05
> spin    4.1e-07    4.1e-07    2.8e-07    5.7e-07    1.6e-07    6.3e-07    1.2e-06    9.4e-06    3.8e-06    6.1e-05    1.4e-05    4.3e-04
> pg_lwlock    3.8e-07    3.8e-07    2.7e-07    5.3e-07    1.5e-07    6.2e-07    3.9e-07    3.1e-06    1.6e-06    2.5e-05    6.4e-06    2.0e-04
> pg_lwlock_cas    3.7e-07    3.7e-07    2.8e-07    5.6e-07    1.4e-07    5.8e-07    1.4e-07    1.1e-06    1.9e-07    3.0e-06    2.4e-07    7.5e-06

Here's the same table, formatted with spaces.

worker          1               2               4               8               16              32
time            wall    user    wall    user    wall    user    wall    user    wall    user    wall    user
none            2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc       3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng        3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin            4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock       3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas   3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06

And here's the throughput table calculated from your results,
i.e. the wall time per cycle for 1 worker divided by the wall time
per cycle for N workers, so higher is better (e.g. for none with
2 workers, 2.8e-07 / 1.5e-07 ~ 1.9).

workers           2   4   8  16  32
none            1.9 3.5 6.7  13  26
atomicinc       1.4 2.6 2.6 2.4 2.4
cmpxchng        1.1 1.1 1.2 0.9 0.8
spin            1.5 2.6 0.3 0.1 0.0
pg_lwlock       1.4 2.5 1.0 0.2 0.1
pg_lwlock_cas   1.3 2.6 2.6 1.9 1.5

Hm, so in the best case we get 2.6x the throughput of a single core,
and that only for 4 and 8 workers (1.4e-7 seconds / cycle vs 3.6e-7).
In that case, there also seems to be little difference between
pg_lwlock{_cas} and atomicinc. atomicinc again manages to at least
sustain that throughput when the worker count is increased, while
for the others the throughput actually *decreases*.

What totally puzzles me is that your results don't show any
trace of a decreased system load for the pg_lwlock implementation,
which I'd have expected due to the sleep() in the contested
path. Here are the user vs. wall time ratios - I'd have expected
to see values significantly below the number of workers for pg_lwlock:

workers       1   2   4   8   16  32
none          1.0 2.0 4.0 7.9 16  31
atomicinc     1.0 2.0 3.9 7.9 15  33
cmpxchng      1.0 2.0 4.1 7.9 16  31
spin          1.0 2.0 3.9 7.8 16  31
pg_lwlock     1.0 2.0 4.1 7.9 16  31
pg_lwlock_cas 1.0 2.0 4.1 7.9 16  31

> I wrote a little script to reorganize this data in a
> possibly-easier-to-understand format - ordering each column from
> lowest to highest, and showing each algorithm as a multiple of the
> cheapest value for that column:

If you're OK with that, I'd like to add that to the lockbench
repo.

> There seems to be something a bit funky in your 3-core data, but
> overall I read this data to indicate that 4 cores aren't really enough
> to see a severe problem with spinlock contention.

Hm, it starts to show if you lower the counter increment per cycle
(the D constant in run.sh). But yeah, it's never as bad as the
32-core results above..

best regards,
Florian Pflug



Re: spinlock contention

From
Robert Haas
Date:
On Tue, Jun 28, 2011 at 8:11 PM, Florian Pflug <fgp@phlo.org> wrote:
>> I wrote a little script to reorganize this data in a
>> possibly-easier-to-understand format - ordering each column from
>> lowest to highest, and showing each algorithm as a multiple of the
>> cheapest value for that column:
>
> If you're OK with that, I'd like to add that to the lockbench
> repo.

It's a piece of crap, but you're welcome to try to fix it up.  See attached.

With respect to the apparent lack of impact of the sleep loop, I can
only speculate that this test case doesn't really mimic real-world
conditions inside the backend very well.  There is a lot more going on
there - multiple locks, lots of memory access, etc.  But I don't
really understand it either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: spinlock contention

From
Robert Haas
Date:
On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:
>> On Jun12, 2011, at 23:39 , Robert Haas wrote:
>>> So, the majority (60%) of the excess spinning appears to be due to
>>> SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
>>
>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>> so I guess that two adjacent LWLocks currently share one cache line.
>>
>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
>> index 5, so if I'm not mistaken exactly the two locks where you saw
>> the largest contention on are on the same cache line...
>>
>> Might make sense to try and see if these numbers change if you
>> either make LWLockPadded 64bytes or arrange the locks differently...
>
> I fooled around with this a while back and saw no benefit.  It's
> possible a more careful test would turn up something, but I think the
> only real way forward here is going to be to eliminate some of that
> locking altogether.

I did some benchmarking, on the 32-core system from Nate Boley, with
pgbench -n -S -c 80 -j 80.   With master+fastlock+lazyvxid, I can hit
180-200k TPS in the 32-client range.  But at 80 clients throughput
starts to fall off quite a bit, dropping down to about 80k TPS.  So
then, just for giggles, I inserted a "return;" statement at the top of
AcceptInvalidationMessages(), turning it into a noop.  Performance at
80 clients shot up to 210k TPS.  I also tried inserting an
acquire-and-release cycle on MyProc->backendLock, which was just as
good.  That seems like a pretty encouraging result, since there appear
to be several ways of reimplementing SIGetDataEntries() that would
work with a per-backend lock rather than a global one.
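
(For clarity, the no-op hack was literally nothing more than this -
obviously not something to commit:)

/* in src/backend/utils/cache/inval.c, for the experiment only */
void
AcceptInvalidationMessages(void)
{
    return;     /* short-circuit: never read the sinval queue */

    /* ... original function body, now unreachable ... */
}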

I did some other experiments, too.  In the hopes of finding a general
way to reduce spinlock contention, I implemented a set of "elimination
buffers", where backends that can't get a spinlock go and try to
combine their requests and then send a designated representative to
acquire the spinlock and acquire shared locks simultaneously for all
group members.  It took me a while to debug the code, and I still
can't get it to do anything other than reduce performance, which may
mean that I haven't found all the bugs yet, but I'm not feeling
encouraged.  Some poking around suggests that the problem isn't that
spinlocks are routinely contended - it seems that we nearly always get
the spinlock right off the bat.  I'm wondering if the problem may be
not so much that we have continuous spinlock contention, but rather
that every once in a while a process gets time-sliced out while it
holds a spinlock.  If it's an important spinlock (like the one
protecting SInvalReadLock), the system will quickly evolve into a
state where every single processor is doing nothing but trying to get
that spinlock.  Even after the hapless lock-holder gets to run again
and lets go of the lock, you have a whole pile of other backends who
are sitting there firing off lock xchgb in a tight loop, and they can
only get it one at a time, so you have ferocious cache line contention
until the backlog clears.  Then things are OK again for a bit until
the same thing happens to some other backend.  This is just a theory,
I might be totally wrong...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spinlock contention

From
Florian Pflug
Date:
On Jul7, 2011, at 03:35 , Robert Haas wrote:
> On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:
>>> On Jun12, 2011, at 23:39 , Robert Haas wrote:
>>>> So, the majority (60%) of the excess spinning appears to be due to
>>>> SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
>>>
>>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
>>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>>> so I guess that two adjacent LWLocks currently share one cache line.
>>>
>>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
>>> index 5, so if I'm not mistaken exactly the two locks where you saw
>>> the largest contention on are on the same cache line...
>>>
>>> Might make sense to try and see if these numbers change if you
>>> either make LWLockPadded 64bytes or arrange the locks differently...
>>
>> I fooled around with this a while back and saw no benefit.  It's
>> possible a more careful test would turn up something, but I think the
>> only real way forward here is going to be to eliminate some of that
>> locking altogether.
>
> I also tried inserting an
> acquire-and-release cycle on MyProc->backendLock, which was just as
> good.  That seems like a pretty encouraging result, since there appear
> to be several ways of reimplementing SIGetDataEntries() that would
> work with a per-backend lock rather than a global one.

I did some research and found "TLRW: Return of the Read-Write Lock"
by D. Dice and N. Shavit. PDF can be found here http://transact09.cs.washington.edu/19_paper.pdf

They describe a read-write lock called "bytelock" with a set of "slots"
for shared lockers, where each shared locker only needs to increment his
slot, and thus doesn't interfere with other concurrent shared locking
attempts. The price is that, after incrementing its slot, a shared
locker must issue a fencing instruction and then verify that no writer
has taken the lock in the meantime. In their application, they
distinguish between "slotted" threads - those who own a designated
slot - and "unslotted" threads - those who operate on the normal
shared counter.
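
(Very roughly, and with made-up names, the bytelock shared-acquire
protocol they describe looks like this:)

#include <stdint.h>

#define BYTELOCK_SLOTS 64

typedef struct
{
    volatile uint32_t writer;                      /* non-zero while write-locked */
    volatile uint8_t  reader_slot[BYTELOCK_SLOTS]; /* one byte per slotted reader */
} bytelock;

/* a "slotted" thread owns reader_slot[slot] exclusively, so a plain
 * store is enough to announce itself */
static int
bytelock_acquire_shared(bytelock *lock, int slot)
{
    lock->reader_slot[slot] = 1;        /* announce ourselves as a reader */
    __sync_synchronize();               /* the fencing instruction */

    if (lock->writer != 0)              /* did a writer get in before us? */
    {
        lock->reader_slot[slot] = 0;    /* back off; the caller falls back */
        return 0;                       /* to the slow (contested) path */
    }
    return 1;
}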

I implemented that idea, but with a few modifications. First, I don't
distinguish between "slotted" and "unslotted" threads, but allow
multiple backends to share a slot. This means my implementation cannot
use a simple unlocked increment to update the slot, but instead uses
"locked xadd". I also moved the slots out of the lock itself, and into
a separate set of LWLockPart structures. Each such structure contains
the counters for one slot id and multiple locks.

In effect, the resulting thing is an LWLock with a partitioned shared
counter. The partition one backend operates on for shared locks is
determined by its backend id.

I've added the implementation to the lock benchmarking tool at https://github.com/fgp/lockbench
and also pushed a patched version of postgres to https://github.com/fgp/postgres/tree/lwlock_part

The number of shared counter partitions is currently 4, but can easily
be adjusted in lwlock.h. The code uses GCC's atomic fetch and add
intrinsic if available, otherwise it falls back to using a per-partition
spin lock.
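
(In rough pseudo-C - the names are made up, and the real patch keeps the
counters in separate LWLockPart structures - the shared-lock fast path
amounts to this:)

#include <stdint.h>

#define LWLOCK_PARTS 4                  /* number of shared-counter partitions */

typedef struct
{
    volatile uint32_t shared_count[LWLOCK_PARTS];
} partitioned_shared_counter;

/* shared lockers only ever touch the partition picked by their backend id */
static void
shared_count_inc(partitioned_shared_counter *c, int backend_id)
{
    int part = backend_id % LWLOCK_PARTS;

#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
    __sync_fetch_and_add(&c->shared_count[part], 1);    /* "locked xadd" */
#else
    /* fallback: take a per-partition spinlock around a plain increment */
#endif
}

/* an exclusive locker has to look at every partition to find out
 * whether any shared holders remain */
static int
shared_holders_exist(partitioned_shared_counter *c)
{
    int i;

    for (i = 0; i < LWLOCK_PARTS; i++)
        if (c->shared_count[i] != 0)
            return 1;
    return 0;
}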

Benchmarking results look promising so far. This is the throughput
I get, in acquisition/release cycles per second, for the LWLock
implementations on a 4-core machine. pg_lwlock_part,N,X is the
partitioned lock with N lock partitions, using LOCKED XADD if
X == yes. The numbers are normalized to the throughput in the
single-worker case.

---------- Cycles / Second (Wall-Time) ---------------------
1       2       4       8       16      worker
wall    wall    wall    wall    wall    time
1       1.9     3.9     3.9     3.9     none
1       0.2     0.9     0.8     0.6     pg_lwlock_orig
1       0.4     0.3     0.3     0.3     pg_lwlock_cas
1       0.3     1.0     0.8     0.6     pg_lwlock_part,1,no
1       0.4     0.4     0.4     0.4     pg_lwlock_part,1,yes
1       2.0     0.4     0.4     0.3     pg_lwlock_part,2,no
1       2.0     0.7     0.8     0.7     pg_lwlock_part,2,yes
1       2.0     4.1     2.1     1.0     pg_lwlock_part,4,no
1       2.1     3.8     1.8     2.2     pg_lwlock_part,4,yes

These numbers show that as long as the number of workers does not
exceed the number of lock partitions, throughput increases approximately
linearly. It drops off afterwards, but that's to be expected -
the test executes LWLockAcquire/Release in a tight loop, so once there's
any kind of contention (even cache-line bouncing), performance drops.
This is also why the original LWLock beats CAS and the partitioned
lock with fewer than 4 partitions here - effectively, at least half of
the workers are sleeping at any given time, as the following
user vs. wall time numbers show.

---------- User-Time / Wall-Time ----------------------------
1    2    4    8    16    worker
1.0    1.9    3.9    3.9    3.9    none
1.0    1.9    1.1    1.3    1.8    pg_lwlock_orig
1.0    2.0    4.0    4.0    4.0    pg_lwlock_cas
1.0    1.9    1.1    1.4    1.8    pg_lwlock_part,1,no
1.0    2.0    4.0    3.9    4.0    pg_lwlock_part,1,yes
1.0    2.0    3.6    3.8    3.6    pg_lwlock_part,2,no
1.0    2.0    3.8    3.9    3.9    pg_lwlock_part,2,yes
1.0    2.0    4.1    4.1    3.9    pg_lwlock_part,4,no
1.0    2.0    4.0    3.9    3.9    pg_lwlock_part,4,yes

I lack the hardware to produce any meaningful benchmark results
with pgbench for this, so again it'd be very cool if someone
(Robert? ;-) who has access to that kind of hardware could give the
patched version from github a spin with pgbench.

The patched version is currently set to 4 shared counter partitions,
and should use LOCKED XADD if compiled with GCC >= 4.1 (or ICC).

> I'm wondering if the problem may be
> not so much that we have continuous spinlock contention, but rather
> that every once in a while a process gets time-sliced out while it
> holds a spinlock.  If it's an important spinlock (like the one
> protecting SInvalReadLock), the system will quickly evolve into a
> state where every single processor is doing nothing but trying to get
> that spinlock.  Even after the hapless lock-holder gets to run again
> and lets go of the lock, you have a whole pile of other backends who
> are sitting there firing off lock xchgb in a tight loop, and they can
> only get it one at a time, so you have ferocious cache line contention
> until the backlog clears.  Then things are OK again for a bit until
> the same thing happens to some other backend.  This is just a theory,
> I might be totally wrong...

Hm, but one would expect most of the spin lock contesters to have
given up and gone to sleep by the time the interrupted lock holder
resumes. If that is not what happens, then maybe we should revisit
the logic in SpinLockAcquire()...

best regards,
Florian Pflug



Re: spinlock contention

From
Robert Haas
Date:
On Thu, Jul 7, 2011 at 5:54 AM, Florian Pflug <fgp@phlo.org> wrote:
> In effect, the resulting thing is an LWLock with a partitioned shared
> counter. The partition one backend operates on for shared locks is
> determined by its backend id.
>
> I've added the implementation to the lock benchmarking tool at
>  https://github.com/fgp/lockbench
> and also pushed a patched version of postgres to
>  https://github.com/fgp/postgres/tree/lwlock_part
>
>> The number of shared counter partitions is currently 4, but can easily
> be adjusted in lwlock.h. The code uses GCC's atomic fetch and add
> intrinsic if available, otherwise it falls back to using a per-partition
> spin lock.

I think this is probably a good trade-off for locks that are most
frequently taken in shared mode (like SInvalReadLock), but it seems
like it could be a very bad trade-off for locks that are frequently
taken in both shared and exclusive mode (e.g. ProcArrayLock,
BufMappingLocks).

I don't want to fiddle with your git repo, but if you attach a patch
that applies to the master branch I'll give it a spin if I have time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spinlock contention

From
Florian Pflug
Date:
On Jul7, 2011, at 18:09 , Robert Haas wrote:
> On Thu, Jul 7, 2011 at 5:54 AM, Florian Pflug <fgp@phlo.org> wrote:
>> In effect, the resulting thing is an LWLock with a partitioned shared
>> counter. The partition one backend operates on for shared locks is
>> determined by its backend id.
>>
>> I've added the implementation to the lock benchmarking tool at
>>  https://github.com/fgp/lockbench
>> and also pushed a patched version of postgres to
>>  https://github.com/fgp/postgres/tree/lwlock_part
>>
>> The number of shared counter partitions is currently 4, but can easily
>> be adjusted in lwlock.h. The code uses GCC's atomic fetch and add
>> intrinsic if available, otherwise it falls back to using a per-partition
>> spin lock.
>
> I think this is probably a good trade-off for locks that are most
> frequently taken in shared mode (like SInvalReadLock), but it seems
> like it could be a very bad trade-off for locks that are frequently
> taken in both shared and exclusive mode (e.g. ProcArrayLock,
> BufMappingLocks).

That's definitely a concern. One idea I had to alleviate that is to
add a per-partition "used" flag to the LWLock struct, and set that
to true (if not already set) before incrementing a partition's
shared counter. Exclusive lockers would then only need to inspect those
partitions for which the flag is set, and would clear all flags after
having acquired the lock.

I actually tried to do that initially, but figuring out where and how
to safely clear the flag again turned out to be a bit hairy. So I decided
to keep it simple first, and wait to see whether that problem manifests
itself in practice.

> I don't want to fiddle with your git repo, but if you attach a patch
> that applies to the master branch I'll give it a spin if I have time.

Patch attached.

Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
spin lock instead of "locked xadd" to increment the shared counters.

best regards,
Florian Pflug

Attachment

Re: spinlock contention

From
Tom Lane
Date:
Florian Pflug <fgp@phlo.org> writes:
> Patch attached.

> Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
> spin lock instead of "locked xadd" to increment the shared counters.

That's already sufficient reason to reject the patch.  Not everyone
uses gcc, let alone very recent versions of gcc.
        regards, tom lane


Re: spinlock contention

From
Florian Pflug
Date:
On Jul8, 2011, at 16:21 , Tom Lane wrote:
> Florian Pflug <fgp@phlo.org> writes:
>> Patch attached.
> 
>> Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
>> spin lock instead of "locked xadd" to increment the shared counters.
> 
> That's already sufficient reason to reject the patch.  Not everyone
> uses gcc, let alone very recent versions of gcc.

This is a WIP version meant for testing, not a finished patch!

Spending time on making this work on every conceivable compiler before we
even know whether or not the approach is worthwhile at all seems ludicrous
to me.

A finished version would use inline assembly to avoid the GCC version
dependency, and would support as many additional compilers as there are
people with access to these compilers who offer to help...

But yeah, that will very probably leave some compilers unsupported
(in the "fall back to spin lock per partition" sense. Which, if the patch
proves worthwhile at all, probably still provides a benefit over the
current code).

If that is reason enough to reject the patch, i.e. if the policy is "we
don't want it for any if we cannot have it for all", then consider it
withdrawn.

best regards,
Florian Pflug



Re: spinlock contention

From
Stefan Kaltenbrunner
Date:
On 07/08/2011 04:21 PM, Tom Lane wrote:
> Florian Pflug <fgp@phlo.org> writes:
>> Patch attached.
> 
>> Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
>> spin lock instead of "locked xadd" to increment the shared counters.
> 
> That's already sufficient reason to reject the patch.  Not everyone
> uses gcc, let alone very recent versions of gcc.

hmm - 4.1.0 was released in February 2006, which will be much older than
even the 5-year support policy we have on pg when 9.2 is released; not
sure how much it will matter if we don't support a specific
optimization on a gcc that old..


Stefan



Re: spinlock contention

From
Florian Pflug
Date:
On Jul8, 2011, at 22:27 , Stefan Kaltenbrunner wrote:
> On 07/08/2011 04:21 PM, Tom Lane wrote:
>> Florian Pflug <fgp@phlo.org> writes:
>>> Patch attached.
>>
>>> Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
>>> spin lock instead of "locked xadd" to increment the shared counters.
>>
>> That's already sufficient reason to reject the patch.  Not everyone
>> uses gcc, let alone very recent versions of gcc.
>
> hmm - 4.1.0 was released in february 2006, which will be much older than
> even the 5 year support policy we have on pg when 9.2 will be released,
> not sure how much it will matter if we don't support as specific
> optimization on a gcc that old..

Still, it's not really hard to support older versions, at least on
x86 and x86-64. All it takes is some inline assembly. I just don't
want to put effort into this until we know whether or not the whole
approach is worthwhile.

Should someone want to test this patch, but can't because of the GCC
version restriction, please speak up.

best regards,
Florian Pflug



Re: spinlock contention

From
Florian Pflug
Date:
On Jul7, 2011, at 03:35 , Robert Haas wrote:
> Some poking around suggests that the problem isn't that
> spinlocks are routinely contended - it seems that we nearly always get
> the spinlock right off the bat.  I'm wondering if the problem may be
> not so much that we have continuous spinlock contention, but rather
> that every once in a while a process gets time-sliced out while it
> holds a spinlock.  If it's an important spinlock (like the one
> protecting SInvalReadLock), the system will quickly evolve into a
> state where every single processor is doing nothing but trying to get
> that spinlock.  Even after the hapless lock-holder gets to run again
> and lets go of the lock, you have a whole pile of other backends who
> are sitting there firing off lock xchgb in a tight loop, and they can
> only get it one at a time, so you have ferocious cache line contention
> until the backlog clears.

Pondering this some more, I came up with another idea, pretty much
orthogonal to the shared counter partitioning I posted a patch for.

If indeed that problem isn't spin lock contention, but rather losing
control of the CPU while holding a spin lock, then why not try to get
rid of the spin lock entirely? On Linux, that's easy - this is exactly
what futexes are about. But it occurred to me that kernel support isn't
actually needed to do that - futexes can effectively be emulated in
user-space using just a semaphore, and without doing a syscall except
when there's contention.

The algorithm is quite straightforward, if one assumes a lock-free
implementation of a queue (more on that below).

LockAcquire:
  (1) CAS the lock state to increment the reader count or set
      the exclusive bit in a loop while the lock looks free.
      If successful, we're done, otherwise we continue with (2).
  (2) Add ourself to the wait queue.
      [ Since we've added ourself to the queue, we *must* now
        decrement the semaphore no matter what, to keep the
        increment/decrement calls balanced. We're careful to
        maintain that invariant below. ]
  (3) Fence (AKA issue full memory barrier).
  (4) Re-check if the lock still looks taken. If it does,
      we decrement the semaphore (PGSemaphoreLock), and
      (upon wake-up) restart at (1).
      Otherwise, continue with (5).
  (5) Remove the first waiter from the queue and increment
      her semaphore. Rinse-and-repeat until we either removed
      ourself from the queue or the queue is empty.
  (6) Decrement our semaphore.
      [ (6) is necessary in the general case, see the remark
        below (2). But we can of course detect the case where
        we'd increment our own semaphore in (5) only to
        decrement it again in (6), and skip both operations. ]

LockRelease:
  (1) Set the lock state to 0, i.e. release the lock.
  (2) Fence (AKA issue full memory barrier).
  (3) If the lock still looks free, remove the first waiter from
      the queue and increment her semaphore. Rinse-and-repeat
      while the lock looks free and the queue is non-empty.
      [ From a correctness POV, we only have to wake up one
        waiter here, and that only if there isn't one who has been
        woken up but hasn't yet retried to take the lock. In
        reality, the decision whether and how many waiters to wake
        up would depend on their desired lock level, and some
        variant of what we currently call releaseOK. ]
 

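In C-ish pseudo-code, with the lock-free queue and the per-backend
semaphore reduced to hypothetical primitives (wq_enqueue_self() and
wq_dequeue() for the queue, sem_sleep_self() and sem_wake() standing in
for PGSemaphoreLock/Unlock, and try_fast_path()/lock_looks_free()
wrapping the CAS on the packed lock state), the scheme would look
roughly like this:

static void
lock_acquire(futex_lwlock *lock, bool exclusive)
{
    for (;;)
    {
        if (try_fast_path(lock, exclusive))      /* step (1): CAS loop       */
            return;

        wq_enqueue_self(lock);                   /* step (2)                 */
        __sync_synchronize();                    /* step (3): full fence     */

        if (!lock_looks_free(lock, exclusive))   /* step (4): still taken?   */
        {
            sem_sleep_self();                    /* sleep, then retry at (1) */
            continue;
        }

        /* steps (5) and (6): the lock was released before we blocked; wake
         * queued waiters until we've dequeued ourselves or the queue is
         * empty, then balance our own semaphore if someone else woke us */
        {
            proc_t *waiter;
            bool    removed_self = false;

            while (!removed_self && (waiter = wq_dequeue(lock)) != NULL)
            {
                if (waiter == my_proc())
                    removed_self = true;         /* skip wake+sleep on self  */
                else
                    sem_wake(waiter);
            }
            if (!removed_self)
                sem_sleep_self();                /* whoever dequeued us will
                                                  * sem_wake() us            */
        }
        /* loop around and retry the fast path */
    }
}

static void
lock_release(futex_lwlock *lock)
{
    clear_lock_state(lock);                      /* step (1)                 */
    __sync_synchronize();                        /* step (2): full fence     */

    {                                            /* step (3)                 */
        proc_t *waiter;

        while (lock_looks_free(lock, true) &&
               (waiter = wq_dequeue(lock)) != NULL)
            sem_wake(waiter);
    }
}
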
The fencing steps (LockAcquire:3) and (LockRelease:2) guarantee
that if we block in LockAcquire() a lock holder will see our
queue entry and thus will wake us up eventually.

Because we use a semaphore and not, say, a simple signal, we don't
have to worry about the precise ordering of block and unblock
operations - we just need to ensure they're balanced.

Now to the enqueue() and dequeue() primitives that the algorithm
above depends on. There are multiple published algorithms
for lock-free queues. Some googling turned up

  "An optimistic approach to lock-free FIFO queues",
  E.L. Mozes, N. Shavit, DOI 10.1007/s00446-007-0050-0

and

  "CAS-Based Lock-Free Algorithm for Shared Deques", M.M. Michael

The second one looks promising, since it only requires a single
CAS to enqueue and dequeue entries in the common case. Thus, it
should be just as efficient as our current spin-lock-based queue
in the uncontended case (and much better in the contested case, of
course).

[ Our case is, in fact, simpler than the generic setting that these
algorithms support. We only ever enqueue our *own* proc array entry
(so allocation and re-use of queue entries isn't much of an issue),
and always *process* the queue after enqueuing something - either
directly in LockAcquire or later in LockRelease. We thus don't really
need to support concurrent removal of queue entries, but might get
away with simply skipping queue processing if we detect that somebody
else is in the process of doing so. I think I have an idea for a
simpler lock-less queue that fits our needs, but I haven't yet
ironed out all the details (As it turns out, removing the very last
entry from the queue is the hard case). ]

Thoughts, comments and criticism are all very welcome!

best regards,
Florian Pflug





Re: spinlock contention

From
Robert Haas
Date:
On Jul 12, 2011, at 8:03 AM, Florian Pflug <fgp@phlo.org> wrote:
> The algorithm is quite straight forward, if one assumes a lock-free
> implementation of a queue (More on that below)

This is similar to the CAS-based LWLocks I played around with, though I
didn't use a lock-free queue.  I think I probably need to devote some time
to figuring out why that didn't work out to a win...

...Robert

Re: spinlock contention

From
Florian Pflug
Date:
On Jul13, 2011, at 00:10 , Robert Haas wrote:
> On Jul 12, 2011, at 8:03 AM, Florian Pflug <fgp@phlo.org> wrote:
>> The algorithm is quite straight forward, if one assumes a lock-free
>> implementation of a queue (More on that below)
> 
> This is similar to the CAS-based LWLocks I played around with, though
> I didn't use a lock-free queue.  I think I probably need to devote some
> time to figuring out why that didn't work out to a win...

Yeah, the non-waitqueue-related parts are mostly identical. The only
significant difference there is that the waiters-present bit is replaced
by fence-and-recheck.

What motivated me to research this was your theory that the problem is
not so much spin-lock contention, but rather those unlucky processes who
lose their time-slice while holding the lock.

If that is true, then maybe the problem with your CAS patch is that
once the LWLocks is contested heavily, the waiters-present bit will be
set pretty much all the time, and the code will thus fall back to
using the spin-lock.

*Reading the code again*

Hm, I just realized that you only clear the waiters-present bit after
emptying the queue completely. With rising contention, you might reach a
point where you never empty the queue completely (unless the lock is
only ever acquired in share mode, that is). As it stands, the CAS patch
is effectively neutralized at this point. It'll even have an adverse
effect due to the higher number of atomic operations compared to the
unpatched version.

I wonder if clearing the waiters-present bit only upon clearing the
queue completely is necessary for correctness. Wouldn't it be OK
to clear the bit after waking up at least one locker? That'd still
guarantee that you cannot end up with a blocked process but no lock
holder, I believe.

best regards
Florian Pflug



Re: spinlock contention

From
Robert Haas
Date:
On Jul 12, 2011, at 8:10 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Jul13, 2011, at 00:10 , Robert Haas wrote:
>> On Jul 12, 2011, at 8:03 AM, Florian Pflug <fgp@phlo.org> wrote:
>>> The algorithm is quite straight forward, if one assumes a lock-free
>>> implementation of a queue (More on that below)
>>
>> This is similar to the CAS-based LWLocks I played around with, though
>> I didn't use a lock-free queue.  I think I probably need to devote some
>> time to figuring out why that didn't work out to a win...
>
> Yeah, the non-waitqueue-related parts are mostly identical. The only
> significant difference there is that the waiters-present bit is replaced
> by fence-and-recheck.
>
> What motivated me to research this was your theory that the problem is
> not so much spin-lock contention, but rather those unlucky processes who
> lose their time-slice while holding the lock.
>
> If that is true, then maybe the problem with your CAS patch is that
> once the LWLocks is contested heavily, the waiters-present bit will be
> set pretty much all the time, and the code will thus fall back to
> using the spin-lock.
>
> *Reading the code again*
>
> Hm, I just realized that you only clear the waiters-present bit after
> emptying the queue completely. With rising contention, you might reach a
> point where you never empty the queue completely (unless the lock is
> only ever acquired in share mode, that is). As it stands, the CAS patch
> is effectively neutralized at this point. It'll even have an adverse
> effect due to the higher number of atomic operations compared to the
> unpatched version.

Hmm, yeah.

> I wonder if clearing the waiters-present bit only upon clearing the
> queue completely is necessary for correctness. Wouldn't it be OK
> to clear the bit after waking up at least one locker? That'd still
> guarantee that you cannot end up with a blocked process but no lock
> holder, I believe.

I don't think so - how does the next process that releases the lock know to release waiters?

...Robert

Re: spinlock contention

From
Florian Pflug
Date:
On Jul13, 2011, at 22:04 , Robert Haas wrote:
> On Jul 12, 2011, at 8:10 PM, Florian Pflug <fgp@phlo.org> wrote:
>> I wonder if clearing the waiters-present bit only upon clearing the
>> queue completely is necessary for correctness. Wouldn't it be OK
>> to clear the bit after waking up at least one locker? That'd still
>> guarantee that you cannot end up with a blocked process but no lock
>> holder, I believe.
> 
> I don't think so - how does the next process that releases the lock
> know to release waiters?

It won't :-(. Not that easily, at least.

The idea stemmed from the fact that my shared counter partitioning
patch gets away with something similar. But that only works because it
always re-checks for possible interference after doing the fast path,
so the idea isn't directly applicable to your patch, it seems.

Porting that idea to your CAS patch would probably make it nearly
identical to the shared-counter partitioning patch with the number of
partitions set to 1, so I see little point in that.

BTW, I think I've got a workable atomic queue that suits our needs sketched
up - I'll post the details once I've convinced myself that it's actually
correct. So should you believe that approach to be unworkable for some
reason, please speak up and save me the effort.

best regards,
Florian Pflug



Re: spinlock contention

From
Robert Haas
Date:
On Fri, Jul 8, 2011 at 6:02 AM, Florian Pflug <fgp@phlo.org> wrote:
>> I don't want to fiddle with your git repo, but if you attach a patch
>> that applies to the master branch I'll give it a spin if I have time.
>
> Patch attached.
>
> Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
> spin lock instead of "locked xadd" to increment the shared counters.

[ Back from vacation, catching up on email. ]

gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5)

pgbench -n -S -T 180 -c 32 -j 32

with lwlock-part patch:
tps = 36974.644091 (including connections establishing)

unpatched cd34647c666be867f95ef8fc0492c30356043f10:
tps = 39432.064293 (including connections establishing)

And with -c 8 -j 8:

tps = 26946.202428 (including connections establishing)
tps = 27206.507424 (including connections establishing)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spinlock contention

From
Florian Pflug
Date:
On Jul18, 2011, at 04:36 , Robert Haas wrote:
> On Fri, Jul 8, 2011 at 6:02 AM, Florian Pflug <fgp@phlo.org> wrote:
>>> I don't want to fiddle with your git repo, but if you attach a patch
>>> that applies to the master branch I'll give it a spin if I have time.
>> 
>> Patch attached.
>> 
>> Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
>> spin lock instead of "locked xadd" to increment the shared counters.
> 
> [ Back from vacation, catching up on email. ]
> 
> gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5)
> 
> pgbench -n -S -T 180 -c 32 -j 32
> 
> with lwlock-part patch:
> tps = 36974.644091 (including connections establishing)
> 
> unpatched cd34647c666be867f95ef8fc0492c30356043f10:
> tps = 39432.064293 (including connections establishing)
> 
> And with -c 8 -j 8:
> 
> tps = 26946.202428 (including connections establishing)
> tps = 27206.507424 (including connections establishing)

:-( That's disappointing, to say the least.

I also completely fail to understand what the heck is going on there.

I mean, you did conclusively prove that commenting out the SInval stuff
made a huge difference. There's also supposed to be hardly any invalidation
going on during a pgbench -S run. So, since the patch removes two of the
three spin-lock acquisitions from SIGetDataEntries() (so long as there are
no exclusive lockers of SInvalReadLock), there should be some effect.
Or so I'd think at least...

If anyone has I theory, I'd love to hear it.

best regards,
Florian Pflug