Re: Improving spin-lock implementation on ARM. - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Improving spin-lock implementation on ARM.
Date
Msg-id CAPpHfdsGqVd6EJ4mr_RZVE5xSiCNBy4MuSvdTrKmTpM0eyWGpg@mail.gmail.com
In response to Re: Improving spin-lock implementation on ARM.  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: Improving spin-lock implementation on ARM.  (Krunal Bauskar <krunalbauskar@gmail.com>)
List pgsql-hackers
On Sat, Nov 28, 2020 at 1:31 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I guess that might depend on the implementation of CAS and TAS.  I bet
> usage of CAS in spinlock gives advantage when ldxr/stxr are used, but
> not when swpal/casa are used.  I found out that I can force clang to
> use swpal/casa by setting "-march=armv8-a+lse".  I'm going to make
> some experiments on a multicore AWS graviton2 instance with different
> atomic implementation.

I've run some benchmarks on a c6gd.16xlarge EC2 instance with a
Graviton2 processor and 64 virtual CPUs (graphs and raw results are
attached).  I've analyzed two patches: the spinlock using CAS by Krunal
Bauskar, and my implementation of lwlock using ldrex/strex.  My ARM
lwlock patch is based on the same idea as my previous patch for Power:
we can put the lwlock attempt logic between ldrex and strex.  Unlike my
previous Power patch, the ARM patch doesn't contain assembly: instead
I've used C wrappers over ldrex/strex.
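
To illustrate the idea (this is not the patch itself), here is a minimal
sketch of a shared-mode lock attempt written with portable C11 atomics.
The state layout, the flag values and all sketch_* names are
hypothetical.  On a load-exclusive/store-exclusive architecture a weak
compare-exchange is typically compiled to a single exclusive load/store
attempt, so the check below sits conceptually close to the "logic
between the exclusive load and the exclusive store" approach; the real
patch does this more directly with wrappers around the exclusive
instructions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state layout, loosely modelled on an lwlock state word. */
#define SKETCH_EXCLUSIVE_FLAG ((uint32_t) 1 << 24)
#define SKETCH_SHARED_UNIT    ((uint32_t) 1)

/*
 * Try to take the lock in shared mode.  Returns true on success and
 * false if an exclusive holder is present.
 */
static bool
sketch_try_lock_shared(_Atomic uint32_t *state)
{
    uint32_t old = atomic_load_explicit(state, memory_order_relaxed);

    for (;;)
    {
        /* The "attempt logic": decide based on the value just observed. */
        if (old & SKETCH_EXCLUSIVE_FLAG)
            return false;

        /*
         * A weak compare-exchange may fail spuriously.  On failure it
         * refreshes 'old' with the current value, so we just retry.
         */
        if (atomic_compare_exchange_weak_explicit(state, &old,
                                                  old + SKETCH_SHARED_UNIT,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
    }
}
```

Note that the failure path never writes to the state word, so a waiter
that sees an exclusive holder does not dirty the cache line.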

I ran the first series of experiments with standard compile options,
so the LSE instructions from ARM v8.1 weren't used.  Atomics were
implemented using the ldrex/strex pair.
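
For reference, the two spinlock acquire styles being compared can be
sketched with C11 atomics as below.  This is an illustration under my
own assumptions, not PostgreSQL's s_lock.h code; the sketch_* names are
made up.  Without LSE, compilers typically lower both operations to
exclusive load/store loops; with "-march=armv8-a+lse" they can become
single instructions from the swp/cas family.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 = free, 1 = held. */
typedef _Atomic int sketch_slock_t;

/* TAS style: always swap in 1; the lock is ours if the old value was 0. */
static bool
sketch_tas_try_acquire(sketch_slock_t *lock)
{
    return atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0;
}

/* CAS style: store 1 only if the lock currently holds 0. */
static bool
sketch_cas_try_acquire(sketch_slock_t *lock)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(lock, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

/* Release is the same for both styles. */
static void
sketch_release(sketch_slock_t *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

The observable difference is on a held lock: the TAS variant writes 1
again, while the CAS variant fails its comparison and leaves the value
untouched.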

In the read-only benchmark, both the spinlock (cas-spinlock graph) and
lwlock (ldrex-strex-lwlock graph) patches give a noticeable performance
gain of similar magnitude.  However, the performance of the combination
of these patches (ldrex-strex-lwlock-cas-spinlock graph) is close to
the performance of the unpatched version.  That may seem
counterintuitive, but I've rechecked it multiple times.

In the read-write benchmark, both the spinlock and lwlock patches give
a more significant performance gain, and the lwlock patch has a larger
effect than the spinlock patch.  Notably, the combination of patches
now gives some cumulative effect instead of the counterintuitive
slowdown.

Then I tried compiling postgres with LSE instructions using the
"-march=armv8-a+lse" flag with clang (graphs with the -lse suffix).
The effect of LSE is HUGE!!!  The unpatched version with LSE is several
times faster than any version without LSE at high concurrency.  In both
the read-only and read-write benchmarks, the spinlock patch doesn't
show any significant difference.  The lwlock patch shows a great
slowdown with LSE.  Notably, in the read-write benchmark, the lwlock
patch with LSE shows worse results than the unpatched version without
LSE.  Probably, combining different atomics implementations isn't a
good idea.

It seems that the ARM Kunpeng 920 should support ARM v8.1.  I wonder
whether the published benchmark results were obtained with LSE.  I
suspect they were not.  I'd like to ask Krunal Bauskar and Amit
Khandekar to repeat these benchmarks with LSE.

My preliminary conclusions are as follows:
1) Since the effect of LSE is so huge, we should advise users of
multicore ARM servers to compile PostgreSQL with LSE support.  We
probably should provide separate packaging for ARM v8.1 and higher
(packages for ARM v8 are still needed for Raspberry Pi etc).
2) It seems that atomics in ARM v8.1 become very similar to x86
atomics, and they don't need special optimizations.  And I think ARM
v8 processors don't have so many cores and aren't so heavily used in
high-concurrency environments.  So, special optimizations for ARM v8
probably aren't worth it.
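
As a rough way to check whether a given build was compiled with LSE,
one could test the ACLE feature macro __ARM_FEATURE_ATOMICS, which
compilers implementing ACLE define when LSE atomics are enabled (for
example via "-march=armv8-a+lse").  The helper names below are
hypothetical, and the check only reflects compile-time flags, not what
the CPU supports at run time.

```c
#include <stdbool.h>

/*
 * Returns true if this translation unit was compiled for aarch64 with
 * LSE atomics enabled, judging by the ACLE feature macro.
 */
static bool
sketch_built_with_lse(void)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
    return true;
#else
    return false;
#endif
}

/* Human-readable description of the same compile-time information. */
static const char *
sketch_atomics_flavor(void)
{
#if !defined(__aarch64__)
    return "not aarch64";
#elif defined(__ARM_FEATURE_ATOMICS)
    return "aarch64 with LSE";
#else
    return "aarch64, LSE not enabled at compile time";
#endif
}
```

Printing sketch_atomics_flavor() from a small test program built with
the same flags as the server gives a quick sanity check of a package
build.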

Links
1. https://www.postgresql.org/message-id/CAB10pyamDkTFWU_BVGeEVmkc8%3DEhgCjr6QBk02SCdJtKpHkdFw%40mail.gmail.com
2. https://www.postgresql.org/message-id/CAPpHfdsKrh7c7P8-5eG-qW3VQobybbwqH%3DgL5Ck%2BdOES-gBbFg%40mail.gmail.com

------
Regards,
Alexander Korotkov
