On Wed, Jun 10, 2015 at 1:39 PM, Andres Freund <andres@anarazel.de> wrote:
> In the uncontended case lwlocks are just as fast as spinlocks now, with
> the exception of the local tracking array. They're faster if there's
> differences with read/write lockers.
If nothing else, the spinlock calls are inline, while the lwlock calls
involve a function call. So in the uncontended case I think the
spinlock stuff is like 2 instructions, an atomic and a branch.
>> I'm not altogether convinced that a simple replacement of spinlocks
>> with atomics is going to solve this problem. If it does, great. But
>> that doesn't eliminate spinning; it just moves it from the spinlock
>> loop to a CAS loop.
>
> Well, not necessarily. If you can write your algorithm in a way that
> xadd etc are used, instead of a lock cmpxchg, you're actually never
> spinning on x86 as it's guaranteed to succeed. I think that's possible
> for most of the places that currently lock buffer headers.
Well, it will succeed by looping inside the instruction, I suppose. But OK.
> (I had a version of the lwlock changes that used xadd for shared lock
> acquisition - but the code needed to back out in error cases made things
> more complicated, and the benefit on a four socket machine wasn't that
> large)
Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
try your patch there, bound to varying numbers of sockets.
>> This is clearly better: when the CAS works,
>> you're done, as opposed to having to then manipulate the cache line
>> further and release the spinlock, during any of which the cache line
>> may get stolen from you.
>
> It's not just that, it's also that there's no chance of sleeping while
> holding a lock. Which doesn't happen that infrequently in contended, cpu
> bound workloads.
That's a really good point.
>> However, I'm worried that it will still be possible to create the same
>> kinds of performance collapses that we see with spinlocks with the CAS
>> loops with which we place them - or even worse problems, because
>> clearly the spin_delay stuff *works*, and we're unlikely to
>> incorporate similarly sophisticated logic into every CAS loop.
>
> I doubt it's really neccessary, but It shouldn't be too hard to
> generalize the sleeping logic so it could be used with little code in
> the individual callsites.
We should probably experiment with it to see whether it is needed. I
am fine with leaving it out if it isn't, but it's worth testing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company