Re: Improving spin-lock implementation on ARM. - Mailing list pgsql-hackers

From Krunal Bauskar
Subject Re: Improving spin-lock implementation on ARM.
Msg-id CAB10pyZcDqfr7L_T27qcrVAC4PPipS62J0oQepvtrE=uQaO7Ag@mail.gmail.com
In response to Re: Improving spin-lock implementation on ARM.  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Improving spin-lock implementation on ARM.  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers


On Mon, 30 Nov 2020 at 11:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Krunal Bauskar <krunalbauskar@gmail.com> writes:
> On Mon, 30 Nov 2020 at 10:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The results I posted at [1] seem to contradict this for Apple's new
>> machines.

> For the results you saw on Mac-Mini was LSE enabled by default.

Hmm, I don't know how to get Apple's clang to admit what its default
settings are ... anybody?

However, it does accept "-march=armv8-a+lse", and that seems to
not be the default, because I get different results from my spinlock-
pounding test than I did yesterday.  Abbreviating into a table:

                --- CFLAGS=-O2 ---      --- CFLAGS="-O2 -march=armv8-a+lse" ---

TPS             HEAD    CAS patch       HEAD    CAS patch

clients=1       2127    2174            2612    2722
clients=2       1816    859             892     950
clients=4       714     519             610     468
clients=8       -       -               108     185

Thanks for trying this, Tom.

---------

Some of us may be surprised that enabling LSE causes a regression on HEAD itself (1816 -> 892 at 2 clients, 714 -> 610 at 4 clients), even though LSE is meant to improve performance.
Unfortunately, that is not always the case, at least in my previous experience with LSE.
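
On the question of whether LSE is on by default for a given compiler invocation: one possible check, assuming the compiler follows the ARM C Language Extensions (ACLE), is the predefined macro __ARM_FEATURE_ATOMICS, which should be defined only when the target includes the LSE atomic instructions. The sketch below is an illustration only; the function name is hypothetical and I have not verified the behaviour on Apple's clang.

```c
#include <stdio.h>

/*
 * Returns 1 if this translation unit was compiled for a target with
 * LSE atomics (ACLE macro __ARM_FEATURE_ATOMICS defined), else 0.
 * On non-ARM targets the macro is absent, so this returns 0.
 */
static int
lse_enabled_at_compile_time(void)
{
#if defined(__ARM_FEATURE_ATOMICS)
    return 1;
#else
    return 0;
#endif
}

/* Prints the result in a form that is easy to compare across builds. */
static void
report_lse(void)
{
    printf("LSE atomics at compile time: %s\n",
           lse_enabled_at_compile_time() ? "yes" : "no");
}
```

Calling report_lse() from main() in one build with CFLAGS=-O2 and one with CFLAGS="-O2 -march=armv8-a+lse" should show whether the two configurations actually differ.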

I am still wondering why CAS is slower than TAS on M1. What is special about M1 that other ARM architectures have not picked up?

Tom, sorry to bother you again, but this is raising a lot of curiosity about M1.
Whenever you get time, could you do some micro-benchmarking on M1 (to understand TAS vs CAS)?
Also, could you share the assembly code that is emitted for TAS vs CAS?
 

Unfortunately, that still doesn't lead me to think that either LSE
or CAS are net wins on this hardware.  It's quite clear that LSE
makes the uncontended case a good bit faster, but the contended case
is a lot worse, so is that really a tradeoff we want?

> * I would also suggest if possible try with higher scalability (more than 4
> to check if with increase scalability CAS out-perform).

As I said yesterday, running more than 4 processes is just going
to bring the low-performance cores into the equation, which is likely
to swamp any interesting comparison.  I did run the test with "-c 8"
today, as shown in the right-hand columns, and the results seem
to bear that out.

                        regards, tom lane


--
Regards,
Krunal Bauskar
