Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4)) - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))
Date
Msg-id 521E2DFC.8050906@vmware.com
In response to Re: Better LWLocks with compare-and-swap (9.4)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 21.05.2013 00:20, Heikki Linnakangas wrote:
> On 16.05.2013 01:08, Daniel Farina wrote:
>> On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this
>>> profile, when
>>> I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux
>>> machine:
>>>
>>> -  64.09%  postgres  postgres            [.] tas
>>>    - tas
>>>       - 99.83% s_lock
>>>          - 53.22% LWLockAcquire
>>>             + 99.87% GetSnapshotData
>>>          - 46.78% LWLockRelease
>>>               GetSnapshotData
>>>             + GetTransactionSnapshot
>>> +   2.97%  postgres  postgres            [.] tas
>>> +   1.53%  postgres  libc-2.13.so        [.] 0x119873
>>> +   1.44%  postgres  postgres            [.] GetSnapshotData
>>> +   1.29%  postgres  [kernel.kallsyms]   [k] arch_local_irq_enable
>>> +   1.18%  postgres  postgres            [.] AllocSetAlloc
>>> ...
>>>
>>> So, on this test, a lot of time is wasted spinning on the mutex of
>>> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
>>> surprisingly steep drop in performance once you go beyond 29 clients
>>> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
>>> theory
>>> is that after that point all the cores are busy, and processes start
>>> to be
>>> sometimes context switched while holding the spinlock, which kills
>>> performance.
>>
>> I have. I also used linux perf to come to this conclusion, and my
>> determination was similar: a system was undergoing increasingly heavy
>> load, in this case with processes >> number of processors. It was
>> also a phase-change type of event: at one moment everything would be
>> going great, but once a critical threshold was hit, s_lock would
>> consume an enormous amount of CPU time. I figured preemption while in
>> the spinlock was to blame at the time, given the extreme nature
>
> Stop the press! I'm getting the same speedup on that 32-core box I got
> with the compare-and-swap patch, from this one-liner:
>
> --- a/src/include/storage/s_lock.h
> +++ b/src/include/storage/s_lock.h
> @@ -200,6 +200,8 @@ typedef unsigned char slock_t;
>
> #define TAS(lock) tas(lock)
>
> +#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
> +
> static __inline__ int
> tas(volatile slock_t *lock)
> {
>
> So, on this system, doing a non-locked test before the locked xchg
> instruction while spinning is a very good idea. That contradicts the
> testing that was done earlier when the x86-64 implementation was added,
> as we have this comment in the tas() implementation:
>
>> /*
>> * On Opteron, using a non-locking test before the locking instruction
>> * is a huge loss. On EM64T, it appears to be a wash or small loss,
>> * so we needn't bother to try to distinguish the sub-architectures.
>> */
>
> On my test system, the non-locking test is a big win. I tested this
> because I was reading this article from Intel:
>
> http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/
> It says very explicitly that the non-locking test is a good idea:
>
>> Spinning on volatile read vs. spin on lock attempt
>>
>> One common mistake made by developers developing their own spin-wait
>> loops is attempting to spin on an atomic instruction instead of
>> spinning on a volatile read. Spinning on a dirty read instead of
>> attempting to acquire a lock consumes less time and resources. This
>> allows an application to attempt to acquire a lock only when it is
>> free.
>
> Now, I'm not sure what to do about this. If we put the non-locking test
> in there, according to the previous testing that would be a huge loss on
> Opterons.

I think we should apply the attached patch to put a non-locked test in
TAS_SPIN() on x86_64.

Tom tested this back in 2011 on a 32-core Opteron [1] and found little
difference. On an 80-core Xeon (160 with hyperthreading) system, he
saw quite a significant gain from it [2]. Reading the thread, I don't
understand why it wasn't applied to x86_64 back then. Maybe it was for
the fear of having a negative impact on performance on old Opterons, but
I strongly suspect that the performance would be OK even on old Opterons
if you only do the non-locked test in TAS_SPIN() and not in TAS(). Back
when the comment about poor performance with a non-locking test on
Opteron was written, we used the same TAS() macro both on the first
acquisition attempt and while spinning.

I tried to find some sort of an authoritative guide from AMD that would
say how spinlocks should be implemented, but didn't find any. The Intel
guide I linked above says we should apply the patch. Linux uses a
ticket spinlock nowadays, which is slightly different, but if I'm
reading the older pre-ticket version correctly, it used to do the same:
a non-locking test while spinning, but not before the first attempt to
grab the lock. FreeBSD does the same.

So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to
master. But I would love to get feedback from people running different
x86_64 hardware.

[1] http://www.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us
[2] http://www.postgresql.org/message-id/25989.1314734752@sss.pgh.pa.us

- Heikki

