Re: Better LWLocks with compare-and-swap (9.4) - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Better LWLocks with compare-and-swap (9.4)
Date
Msg-id 519A938A.1070903@vmware.com
In response to Re: Better LWLocks with compare-and-swap (9.4)  (Daniel Farina <daniel@heroku.com>)
Responses Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On 16.05.2013 01:08, Daniel Farina wrote:
> On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
>> I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:
>>
>> -  64.09%  postgres  postgres           [.] tas
>>     - tas
>>        - 99.83% s_lock
>>           - 53.22% LWLockAcquire
>>              + 99.87% GetSnapshotData
>>           - 46.78% LWLockRelease
>>                GetSnapshotData
>>              + GetTransactionSnapshot
>> +   2.97%  postgres  postgres           [.] tas
>> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
>> +   1.44%  postgres  postgres           [.] GetSnapshotData
>> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
>> +   1.18%  postgres  postgres           [.] AllocSetAlloc
>> ...
>>
>> So, on this test, a lot of time is wasted spinning on the mutex of
>> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
>> surprisingly steep drop in performance once you go beyond 29 clients
>> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
>> is that after that point all the cores are busy, and processes start to be
>> sometimes context switched while holding the spinlock, which kills
>> performance.
>
> I have, I also used linux perf to come to this conclusion, and my
> determination was similar: a system was undergoing increasingly heavy
> load, in this case with processes >> number of processors.  It was
> also a phase-change type of event: at one moment everything would be
> going great, but once a critical threshold was hit, s_lock would
> consume enormous amount of CPU time.  I figured preemption while in
> the spinlock was to blame at the time, given the extreme nature

Stop the press! I'm getting the same speedup on that 32-core box I got 
with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;
 #define TAS(lock) tas(lock)

+#define TAS_SPIN(lock)    (*(lock) ? 1 : TAS(lock))
+
 static __inline__ int
 tas(volatile slock_t *lock)
 {

So, on this system, doing a non-locked test before the locked xchg instruction while spinning is a very good idea. That contradicts the
testing that was done earlier when the x86-64 implementation was added, 
as we have this comment in the tas() implementation:

>     /*
>      * On Opteron, using a non-locking test before the locking instruction
>      * is a huge loss.  On EM64T, it appears to be a wash or small loss,
>      * so we needn't bother to try to distinguish the sub-architectures.
>      */
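
To make the difference concrete, the two approaches look roughly like this (an illustrative C sketch using GCC builtins, not the actual s_lock.h code; the my_* names are made up):

typedef volatile unsigned char my_slock_t;

/* Plain test-and-set: every attempt executes a locked xchg, which bounces
 * the cache line between cores even while the lock is held. */
static inline int
my_tas(my_slock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1);   /* returns previous value */
}

/* Test-and-test-and-set, which is what the TAS_SPIN() one-liner gives the
 * spin loop: only issue the locked instruction when a cheap unlocked read
 * says the lock looks free. */
static inline int
my_tas_spin(my_slock_t *lock)
{
    return (*lock ? 1 : my_tas(lock));
}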

On my test system, the non-locking test is a big win. I tested this 
because I was reading this article from Intel:


http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/

It says very explicitly that the non-locking test is a good idea:

> Spinning on volatile read vs. spin on lock attempt
>
> One common mistake made by developers developing their own spin-wait loops is attempting to spin on an atomic
> instruction instead of spinning on a volatile read. Spinning on a dirty read instead of attempting to acquire a lock
> consumes less time and resources. This allows an application to only attempt to acquire a lock only when it is free.
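
The spin-wait loop it describes would be shaped something like this (again just a sketch, not PostgreSQL code):

/* Spin on a plain read of the lock byte, and attempt the atomic
 * exchange only once the lock looks free. */
static inline void
spin_acquire(volatile unsigned char *lock)
{
    for (;;)
    {
        while (*lock != 0)
            ;                                   /* dirty read, no bus locking */
        if (__sync_lock_test_and_set(lock, 1) == 0)
            return;                             /* acquired */
    }
}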

Now, I'm not sure what to do about this. If we put the non-locking test 
in there, according to the previous testing that would be a huge loss on 
Opterons.

Perhaps we should just sleep earlier, i.e. lower MAX_SPINS_PER_DELAY. That way, even if each TAS_SPIN test is very expensive, we don't spend too much time spinning if the lock is really busy, or held by a process that is sleeping.
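
For illustration only, the shape I have in mind is something like this simplified spin-then-sleep loop (not the real s_lock(); the helper name and the 1 ms nap are made up):

#include <unistd.h>

/* Spin at most spins_per_delay times, then usleep() to give up the CPU.
 * Lowering the cap means a backend stops burning cycles sooner when the
 * lock stays busy or its holder has been scheduled out. */
static void
spin_then_sleep(volatile unsigned char *lock, int spins_per_delay)
{
    int     spins = 0;

    /* TAS_SPIN-style test: unlocked read first, locked xchg only if free */
    while (*lock ? 1 : __sync_lock_test_and_set(lock, 1))
    {
        if (++spins >= spins_per_delay)
        {
            usleep(1000);       /* nap instead of spinning on */
            spins = 0;
        }
    }
}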

- Heikki


