Re: Improving spin-lock implementation on ARM. - Mailing list pgsql-hackers

From Krunal Bauskar
Subject Re: Improving spin-lock implementation on ARM.
Date
Msg-id CAB10pyZfSW2ayVcUpqF8jf5TTxtY60jfZvKKYzyChfN55o6hXw@mail.gmail.com
Whole thread Raw
In response to Re: Improving spin-lock implementation on ARM.  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers

On Thu, 26 Nov 2020 at 16:02, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 26/11/2020 06:30, Krunal Bauskar wrote:
> Improving spin-lock implementation on ARM.
> ------------------------------------------------------------
>
> * Spin-Lock is known to have a significant effect on performance
>    with increasing scalability.
>
> * Existing Spin-Lock implementation for ARM is sub-optimal due to
>    use of TAS (test and swap)
>
> * TAS is implemented on ARM as load-store so even if the lock is not free,
>    store operation will execute to replace the same value.
>    This redundant operation (mainly store) is costly.
>
> * CAS is implemented on ARM as load-check-store-check that means if the
>    lock is not free, check operation, post-load will cause the loop to
>    return there-by saving on costlier store operation. [1]

Can you add some code comments to explain that why CAS is cheaper than
TAS on ARM?

1. As Alexey too pointed out in followup email
CAS = load value -> check if the value is expected -> if yes then replace (store new value) -> else exit/break
TAS = load value -> store new value

This means each TAS would be converted to 2 operations that are LOAD and STORE and of-course
STORE is costlier in terms of latency. CAS ensures optimization in this regard with an early check.

Let's look at some micro-benchmarking. I implemented a simple spin-loop using both approaches and 
as expected with increase scalability, CAS continues to out-perform TAS by a margin up to 50%.

---- TAS ----
Running 128 parallelism
Elapsed time: 1.34271 s
Running 256 parallelism
Elapsed time: 3.6487 s
Running 512 parallelism
Elapsed time: 11.3725 s
Running 1024 parallelism
Elapsed time: 43.5376 s
---- CAS ----
Running 128 parallelism
Elapsed time: 1.00131 s
Running 256 parallelism
Elapsed time: 2.53202 s
Running 512 parallelism
Elapsed time: 7.66829 s
Running 1024 parallelism
Elapsed time: 22.6294 s

This could be also observed from the perf profiling

TAS:
 15.57 │44:   ldxr  w0, [x19]
 83.93 │      stxr  w1, w21, [x19]

CAS:
 81.29 │58: ↓ b.ne  cc
....
  9.86 │cc:   ldaxr w0, [x22]  
  8.84 │      cmp   w0, #0x0   
       │    ↑ b.ne  58            
       │      stxr  w1, w20, [x22]  

In TAS: STORE is pretty costly.

2.  I have added the needed comment in the patch. Updated patch attached.

----------------------
Thanks for taking look at this and surely let me know if any more info is needed.
 

Is there some official ARM documentation, like a programmer's reference
manual or something like that, that would show a reference
implementation of a spinlock on ARM? It would be good to refer to an
authoritative source on this.

- Heikki


--
Regards,
Krunal Bauskar
Attachment

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: [PATCH] Add features to pg_stat_statements
Next
From: Peter Eisentraut
Date:
Subject: Re: Improper use about DatumGetInt32