Re: Improving spin-lock implementation on ARM. - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Improving spin-lock implementation on ARM.
Date
Msg-id CAPpHfdvTd_VUWC+iK+qWtw14v9Hi6pTSfzaCXNdJoDoung7=GQ@mail.gmail.com
Whole thread Raw
In response to Re: Improving spin-lock implementation on ARM.  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Improving spin-lock implementation on ARM.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 26/11/2020 06:30, Krunal Bauskar wrote:
> > Improving spin-lock implementation on ARM.
> > ------------------------------------------------------------
> >
> > * Spin-Lock is known to have a significant effect on performance
> >    with increasing scalability.
> >
> > * Existing Spin-Lock implementation for ARM is sub-optimal due to
> >    use of TAS (test and swap)
> >
> > * TAS is implemented on ARM as load-store so even if the lock is not free,
> >    store operation will execute to replace the same value.
> >    This redundant operation (mainly store) is costly.
> >
> > * CAS is implemented on ARM as load-check-store-check that means if the
> >    lock is not free, check operation, post-load will cause the loop to
> >    return there-by saving on costlier store operation. [1]
>
> Can you add some code comments to explain that why CAS is cheaper than
> TAS on ARM?
>
> Is there some official ARM documentation, like a programmer's reference
> manual or something like that, that would show a reference
> implementation of a spinlock on ARM? It would be good to refer to an
> authoritative source on this.

Let me add my 2 cents.

I've compared assembly output of gcc implementations of CAS and TAS.
The sample C-program is attached.  I've compiled it on raspberry pi 4
using gcc 9.3.0.

The inner loop of CAS is as follows.  So, if the value loaded by ldaxr
doesn't match expected value, then we immediately quit the loop.

.L3:
        ldxr    w3, [x0]
        cmp     w3, w1
        bne     .L4
        stlxr   w4, w2, [x0]
        cbnz    w4, .L3
.L4:

The inner loop of TAS is as follows.  So it really does "stxr"
unconditionally.  In principle, it could check if a new value matches
the observed value and there is nothing to do, but it doesn't.
Moreover, stxr might fail, then we can spend multiple loops of
ldxr/stxr due to concurrent memory access.  AFAIK, those concurrent
accesses could reflect not only lock release, but also other
unsuccessful lock attempts.  So, good chance for extra loops to be
useless.

.L7:
        ldxr    w2, [x0]
        stxr    w3, w1, [x0]
        cbnz    w3, .L7

I've also googled for spinlock implementation on arm and found a blog
post about spinlock implementation in linux kernel [1].  Surprisingly
it doesn't use the trick to skip stxr if the lock is busy.  Instead,
they use some arm-specific power-saving option called WFE.

So, I think it's quite clear that switching from TAS to CAS on arm
would be a win.  But there could be other options to do this even
better.

Links
1. https://linux-concepts.blogspot.com/2018/05/spinlock-implementation-in-arm.html

------
Regards,
Alexander Korotkov

Attachment

pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Multi Inserts in CREATE TABLE AS - revived patch
Next
From: Bharath Rupireddy
Date:
Subject: Re: Multi Inserts in CREATE TABLE AS - revived patch