On 26/11/2020 06:30, Krunal Bauskar wrote:
> Improving spin-lock implementation on ARM.
> ------------------------------------------------------------
>
> * Spin-Lock is known to have a significant effect on performance
> with increasing scalability.
>
> * Existing Spin-Lock implementation for ARM is sub-optimal due to
> use of TAS (test and swap)
>
> * TAS is implemented on ARM as load-store so even if the lock is not free,
> store operation will execute to replace the same value.
> This redundant operation (mainly store) is costly.
>
> * CAS is implemented on ARM as load-check-store-check that means if the
> lock is not free, check operation, post-load will cause the loop to
> return there-by saving on costlier store operation. [1]
Can you add some code comments to explain that why CAS is cheaper than
TAS on ARM?
1. As Alexey too pointed out in followup email
CAS = load value -> check if the value is expected -> if yes then replace (store new value) -> else exit/break
TAS = load value -> store new value
This means each TAS would be converted to 2 operations that are LOAD and STORE and of-course
STORE is costlier in terms of latency. CAS ensures optimization in this regard with an early check.
Let's look at some micro-benchmarking. I implemented a simple spin-loop using both approaches and
as expected with increase scalability, CAS continues to out-perform TAS by a margin up to 50%.
---- TAS ----
Running 128 parallelism
Elapsed time: 1.34271 s
Running 256 parallelism
Elapsed time: 3.6487 s
Running 512 parallelism
Elapsed time: 11.3725 s
Running 1024 parallelism
Elapsed time: 43.5376 s
---- CAS ----
Running 128 parallelism
Elapsed time: 1.00131 s
Running 256 parallelism
Elapsed time: 2.53202 s
Running 512 parallelism
Elapsed time: 7.66829 s
Running 1024 parallelism
Elapsed time: 22.6294 s
This could be also observed from the perf profiling
TAS:
15.57 │44: ldxr w0, [x19]
83.93 │ stxr w1, w21, [x19]
CAS:
....
9.86 │cc: ldaxr w0, [x22]
8.84 │ cmp w0, #0x0
│ ↑
b.ne 58
│ stxr w1, w20, [x22]
In TAS: STORE is pretty costly.
2. I have added the needed comment in the patch. Updated patch attached.
----------------------
Thanks for taking look at this and surely let me know if any more info is needed.
Is there some official ARM documentation, like a programmer's reference
manual or something like that, that would show a reference
implementation of a spinlock on ARM? It would be good to refer to an
authoritative source on this.
- Heikki