Re: spinlock contention - Mailing list pgsql-hackers

From Robert Haas
Subject Re: spinlock contention
Date
Msg-id BANLkTin+sGwrH7r7b0-S+-JVY0aA5MvHMg@mail.gmail.com
Whole thread Raw
In response to Re: spinlock contention  (Florian Pflug <fgp@phlo.org>)
Responses Re: spinlock contention
List pgsql-hackers
On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug <fgp@phlo.org> wrote:
> It seems hard to believe that there isn't some effect of two locks
> sharing a cache line. There are architectures (even some of the
> Intel architectures, I believe) where cache lines are 32 bit, though -
> so measuring this certainly isn't easy. Also, I'm absolute no
> expert for this, so it might very well be that' I'm missing something
> fundamental.

Well, I'm sure there is some effect, but my experiments seem to
indicate that it's not a very important one.  Again, please feel free
to provide contrary evidence.  I think the basic issue is that - in
the best possible case - padding the LWLocks so that you don't have
two locks sharing a cache line can reduce contention on the busier
lock by at most 2x.  (The less busy lock may get a larger reduction
but that may not help you much.)  If you what you really need is for
the contention to decrease by 1000x, you're just not really moving the
needle.  That's why the basic fast-relation-lock patch helps so much:
it replaces a system where every lock request results in contention
with a system where NONE of them do.

>> SInvalReadLock is a good example of a lock which seems barely to e
>> necessary at all.  Except on extraordinarily rare occasions, it is
>> only ever taken in share mode.
>
> This is the case we should optimize for, I think. Currently, there's
> contention even between two SHARE lockers of a LWLock because our
> LWLock implementation uses a spinlock underneath. I've googled around
> a bit, and found this:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git
>
> It's an userspace rwlock implementation based on atomic instructions
> and futexes (for blocking) written by Linus Torwalds as a response
> to the following thread
>
> http://lwn.net/Articles/284741/
>
> It seems to require only a single atomic increment instruction to
> acquire a share lock in the uncontested case, and a single compare-
> and-exchange instruction to acquire an exlcusive lock (again in
> the uncontested case).
>
> The code doesn't seem to have a license attached, and seems to have
> been written over a weekend and never touched since, so we might
> not want (or be able to) to use this directly. But it'd still be
> interesting to see if replacing our LWLocks with this implementation
> affects throughput.

I tried rewriting the LWLocks using CAS.  It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
be better, but I'm not holding my breath.  Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.

>> Only SICleanupQueue() needs to lock
>> out readers, and in the (probably common) case where all backends are
>> reasonably close to being caught up, it wouldn't even matter if the
>> determination of minMsgNum were a little bit fuzzy.  I've been
>> straining my brain trying to figure out whether there's some way to
>> rewrite this logic so that it can run concurrently with readers; then
>> we could get rid of SInvalReadLock altogether.  I have a feeling if I
>> were 25% smarter I could do it, but I'm not.
>
> This sounds like something which could be done with RCU (Read-Copy-Update)
> in a lockless way. The idea is to copy the data structure (or parts)
> of it before updating it, and to "commit" the change afterwards by
> adjusting a single pointer to point to the new structure (kinda like
> we do atomic commits by an atomic update of one bit in the clog).
>
> The hard part is usually to figure when it's OK to clean up or
> reuse the old version of the data structure - for that, you need to
> know that all previous readers have completed.
>
> Here's a rough description of how that could work for the invalidation
> queue. We'd have two queue structures in shared memory, each containing
> a version number. We'd also have a "current" pointer pointing to the
> most recent one. Each reader would publish the version number of the
> queue he's about to read (the "current" one) in shared memory
> (and issue a write barrier). SICleanupQueue() would check whether the
> non-current queue structure is unused by comparing its version to the
> published versions of all backends (after issuing a read barrier, and
> after taking a lock that prevents concurrent SICleanupQueue runs of
> course). If there still are users of that version, SICleanupQueue()
> out-waits them. Afterwards, it simply copies the current structure
> over the old one, does the necessary cleanups, increments the version
> number to be larger than the one of the "current" structure,
> and *then* updates the "current" pointer.

Interesting idea.

>> So my fallback position is to try to find some way to optimize the
>> common case where there are no messages to be read.  We're already
>> allowing readers and writers to run in parallel, so it must not be
>> important for a reader to see a message that a writer is in the
>> process of writing, but has not yet finished writing.  I believe the
>> idea here is that sinval messages should be emitted before releasing
>> the related locks, so if we've successfully acquired a lock on an
>> object, then any pertinent sinval messages must already be completely
>> written.  What I'm thinking we could do is just keep a global counter
>> indicating how many messages have been written, and each backend could
>> keep a counter indicating the highest-numbered message it has so far
>> read.  When a message is written, the global counter is bumped.  When
>> a backend goes to AcceptInvalidationMessages(), it compares the global
>> counter to its own counter, and if they're the same, then it doesn't
>> bother taking SInvalReadLock and just exits quickly.
>
> Sounds sensible. I do fear, however, that if we start to micro-optimize
> around every single LWLock we're in for quite a maintenance burden in
> the future. Even if each single modification is relatively simple, the
> complexity still adds ups. So I still believe that it'd be better to
> start with optimizing LWLocks in general, and to only touch the LWLocks
> which are still hot spots afterwards.

I was hoping to do it that way, too, but I have so far been able to
demonstrate a performance benefit from that approach.

>> There is an obvious (potential) memory-ordering problem here.  If it's
>> possible for a backend doing AcceptInvalidationMessages() to see a
>> stale version of the counter, then it might fail to read messages from
>> the queue that it really ought to have seen.  Protecting the global
>> counter with a spinlock defeats the purpose of doing any of this in
>> the first place, because the whole problem is too many people fighting
>> over the spinlock protecting SInvalReadLock.  It should be sufficient,
>> though, for the writing backend to execute a memory-barrier operation
>> just after bumping the global counter and the reading backend to
>> execute one just before.  As it happens, LWLockRelease() is such a
>> barrier, since it must acquire and release a spinlock, so we should be
>> able to just bump the counter right before releasing the LWLock and
>> call it good.  On the reading side, the immediately-preceding lock
>> acquisition seems like it ought to be sufficient - surely it can't be
>> possible to acquire a heavyweight lock on an object without seeing
>> memory updates that some other process made before releasing the same
>> heavyweight lock.
>
> If we start doing lockless algorithms, I really think we should add
> explicit barrier primitives. We could start out with something
> simple such as
>
> #define MemoryBarrierWrite \
>  do { slock_t l; S_UNLOCK(l); } while (0)
> #define MemoryBarrierRead \
>  do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
> #define MemoryBarrierReadWrite \
>  do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }
>
> But yeah, it seems that in the case of the sinval queue, the surrounding
> LWLocks actually provide the necessary barriers.

Yes, a set of general memory barrier primitives would be handy.
Unfortunately, it would probably be quite a bit of work to develop
this and verify that it works properly on all supported platforms.

>> A possible objection here is that a 32-bit counter might wrap around,
>> giving a false negative, and a read from a 64-bit counter might not be
>> atomic.  In practice, the chances of a problem here seem remote, but I
>> think we can guard against it easily enough.  We can put each
>> backend's counter of messages read into shared memory, protected by a
>> per-backend lock (probably the same LWLock the fast relation lock
>> patch already introduces).  When a backend compares its counter to the
>> global counter, it does so while holding this lock.  When
>> SICleanupQueue() resets a backend, it grabs the lock and sets that
>> backend's counter to a special value (like 0) that we guarantee will
>> never match the global counter (which we can cause to wrap from 2^32-1
>> to 1).  So then AcceptInvalidationMessages() - in the common case
>> where there are no invalidation messages to process - will only need
>> the per-backend LWLock rather than fighting over SInvalLock.
>
> OTOH, don't all architectures that are interesting targets of
> performance tuning provide atomic access to 64-bit values? Even i386
> has an 8-byte compare-and-exchange instruction I think...

Maybe so... but if I can get by without needing that, so much the
better.  I don't really want to have to have a separate code path for
architectures that don't support that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: ALTER TABLE lock strength reduction patch is unsafe
Next
From: Florian Pflug
Date:
Subject: Re: spinlock contention