Thread: spin_delay() for ARM

spin_delay() for ARM

From
Amit Khandekar
Date:
Hi,

We use (an equivalent of) the PAUSE instruction in spin_delay() for
Intel architectures. The goal is to slow down the spinlock tight loop
and thus prevent it from eating CPU and causing CPU starvation, so
that other processes get their fair share of CPU time. The Intel
documentation [1] clearly mentions this, along with other benefits of
PAUSE, such as lower power consumption and avoidance of memory-order
violations when exiting the loop.

Similar to PAUSE, the ARM architecture has a YIELD instruction, which
is also clearly documented [2]. It explicitly says that it is a way to
hint to the CPU that it is being executed in a spinlock loop, so that
the process can be preempted. But for ARM, we are not using any kind
of spin delay.

For PG spinlocks, the goal of both of these instructions is the same,
and both architectures recommend using them in spinlock loops.
Also, I found multiple places where YIELD is already used in the same
situation: the Linux kernel [3] and OpenJDK [4],[5].

Now, on ARM implementations that don't implement YIELD, it executes as
a no-op.  Unfortunately, the ARM machine I have does not implement
YIELD.  But recently there have been some ARM implementations that are
hyperthreaded, so they are expected to actually act on the YIELD,
although the docs do not explicitly say that YIELD has to be
implemented only by hyperthreaded implementations.

I ran some pgbench tests to test PAUSE/YIELD on the respective
architectures, once with the instruction present and once with it
removed.  I didn't see any change in the TPS numbers; they were
more or less the same. For ARM, this was expected, because my ARM
machine does not implement YIELD.

On my Intel Xeon machine with 8 cores, I also tried to test PAUSE
using a sample C program (attached spin.c). Here, many child processes
(many more than there are CPUs) wait in a tight loop for a shared
variable to become 0, while the parent process continuously increments
a sequence number for a fixed amount of time, after which it sets the
shared variable to 0. The children's tight loop calls PAUSE in each
iteration. My hope was that, because of the PAUSE in the children, the
parent process would get a larger share of the CPU, so that in a given
time the sequence number would reach a higher value. I also expected
the CPU cycles spent by the child processes to drop, thanks to PAUSE.
Neither of these happened; there was no change.

Possibly this test case is not right. Probably the preemption occurs
only within the set of hyperthreads attached to a single core, and in
my test case the parent process is the only one that is ready to run.
Still, I have attached the program (spin.c) for the archives, in case
somebody with a YIELD-supporting ARM machine wants to use it to test
YIELD.

Nevertheless, because we have clear documentation that strongly
recommends using it, and because it is already used in other places
such as the Linux kernel and the JDK, I think we should start using
YIELD for spin_delay() on ARM.

Attached is the trivial patch (spin_delay_for_arm.patch). To start
with, it contains changes only for aarch64. I haven't yet added
changes to configure[.in] to make sure YIELD compiles successfully
(YIELD is present in the manuals from ARMv6 onwards); I wanted to get
some comments first, so I haven't done the configure changes yet.


[1] https://c9x.me/x86/html/file_module_x86_id_232.html
[2] https://developer.arm.com/docs/100076/0100/instruction-set-reference/a64-general-instructions/yield
[3] https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/processor.h#L259
[4] http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html
[5] http://mail.openjdk.java.net/pipermail/aarch64-port-dev/2017-August/004880.html


--
Thanks,
-Amit Khandekar
Huawei Technologies


Re: spin_delay() for ARM

From
Andres Freund
Date:
Hi,

On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> On my Intel Xeon machine with 8 cores, I also tried to test PAUSE
> using a sample C program (attached spin.c). Here, many child processes
> (many more than there are CPUs) wait in a tight loop for a shared
> variable to become 0, while the parent process continuously increments
> a sequence number for a fixed amount of time, after which it sets the
> shared variable to 0. The children's tight loop calls PAUSE in each
> iteration. My hope was that, because of the PAUSE in the children, the
> parent process would get a larger share of the CPU, so that in a given
> time the sequence number would reach a higher value. I also expected
> the CPU cycles spent by the child processes to drop, thanks to PAUSE.
> Neither of these happened; there was no change.

> Possibly this test case is not right. Probably the preemption occurs
> only within the set of hyperthreads attached to a single core, and in
> my test case the parent process is the only one that is ready to run.
> Still, I have attached the program (spin.c) for the archives, in case
> somebody with a YIELD-supporting ARM machine wants to use it to test
> YIELD.

PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
just schedule another process - you won't see different CPU usage if you
measure it purely as the time running. You should be able to see a
difference if you measure with a profiler that shows you data from the
CPUs performance monitoring unit.

Greetings,

Andres Freund



Re: spin_delay() for ARM

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
>> On my Intel Xeon machine with 8 cores, I also tried to test PAUSE
>> using a sample C program (attached spin.c).

> PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
> just schedule another process - you won't see different CPU usage if you
> measure it purely as the time running. You should be able to see a
> difference if you measure with a profiler that shows you data from the
> CPUs performance monitoring unit.

A more useful test would be to directly experiment with contended
spinlocks.  As I recall, we had some test cases laying about when
we were fooling with the spin delay stuff on Intel --- maybe
resurrecting one of those would be useful?

            regards, tom lane



Re: spin_delay() for ARM

From
Tom Lane
Date:
I wrote:
> A more useful test would be to directly experiment with contended
> spinlocks.  As I recall, we had some test cases laying about when
> we were fooling with the spin delay stuff on Intel --- maybe
> resurrecting one of those would be useful?

The last really significant performance testing we did in this area
seems to have been in this thread:

https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com

A relevant point from that is Haas' comment

    I think optimizing spinlocks for machines with only a few CPUs is
    probably pointless.  Based on what I've seen so far, spinlock
    contention even at 16 CPUs is negligible pretty much no matter what
    you do.  Whether your implementation is fast or slow isn't going to
    matter, because even an inefficient implementation will account for
    only a negligible percentage of the total CPU time - much less than 1%
    - as opposed to a 64-core machine, where it's not that hard to find
    cases where spin-waits consume the *majority* of available CPU time
    (recall previous discussion of lseek).

So I wonder whether this patch is getting ahead of the game.  It does
seem that ARM systems with a couple dozen cores exist, but are they
common enough to optimize for yet?  Can we even find *one* to test on
and verify that this is a win and not a loss?  (Also, seeing that
there are so many different ARM vendors, results from just one
chipset might not be too trustworthy ...)

            regards, tom lane



Re: spin_delay() for ARM

From
Amit Khandekar
Date:


On Sat, 11 Apr 2020 at 00:47, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> > On my Intel Xeon machine with 8 cores, I also tried to test PAUSE
> > using a sample C program (attached spin.c). Here, many child processes
> > (many more than there are CPUs) wait in a tight loop for a shared
> > variable to become 0, while the parent process continuously increments
> > a sequence number for a fixed amount of time, after which it sets the
> > shared variable to 0. The children's tight loop calls PAUSE in each
> > iteration. My hope was that, because of the PAUSE in the children, the
> > parent process would get a larger share of the CPU, so that in a given
> > time the sequence number would reach a higher value. I also expected
> > the CPU cycles spent by the child processes to drop, thanks to PAUSE.
> > Neither of these happened; there was no change.
>
> > Possibly this test case is not right. Probably the preemption occurs
> > only within the set of hyperthreads attached to a single core, and in
> > my test case the parent process is the only one that is ready to run.
> > Still, I have attached the program (spin.c) for the archives, in case
> > somebody with a YIELD-supporting ARM machine wants to use it to test
> > YIELD.
>
> PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
> just schedule another process - you won't see different CPU usage if you
> measure it purely as the time running.

Yeah, I thought the OS scheduling would be an *indirect* consequence of the PAUSE, because of its slowing down of the CPU, but it looks like that does not happen.


> You should be able to see a
> difference if you measure with a profiler that shows you data from the
> CPUs performance monitoring unit.
Hmm, I had tried with perf and could see the PAUSE itself consuming 5% CPU, but I haven't yet played with per-process figures.



--
Thanks,
-Amit Khandekar
Huawei Technologies

Re: spin_delay() for ARM

From
Amit Khandekar
Date:


On Sat, 11 Apr 2020 at 04:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > A more useful test would be to directly experiment with contended
> > spinlocks.  As I recall, we had some test cases laying about when
> > we were fooling with the spin delay stuff on Intel --- maybe
> > resurrecting one of those would be useful?
>
> The last really significant performance testing we did in this area
> seems to have been in this thread:
>
> https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
>
> A relevant point from that is Haas' comment
>
>     I think optimizing spinlocks for machines with only a few CPUs is
>     probably pointless.  Based on what I've seen so far, spinlock
>     contention even at 16 CPUs is negligible pretty much no matter what
>     you do.  Whether your implementation is fast or slow isn't going to
>     matter, because even an inefficient implementation will account for
>     only a negligible percentage of the total CPU time - much less than 1%
>     - as opposed to a 64-core machine, where it's not that hard to find
>     cases where spin-waits consume the *majority* of available CPU time
>     (recall previous discussion of lseek).

Yeah, will check if I find some machines with large cores.


> So I wonder whether this patch is getting ahead of the game.  It does
> seem that ARM systems with a couple dozen cores exist, but are they
> common enough to optimize for yet?  Can we even find *one* to test on
> and verify that this is a win and not a loss?  (Also, seeing that
> there are so many different ARM vendors, results from just one
> chipset might not be too trustworthy ...)

OK. Yes, it would be worth waiting to see whether others in the community have ARM systems that implement YIELD; maybe after that we would gain some confidence. I hope to get such a machine to test on soon myself, but right now I have one that does not support it, so YIELD would just be a no-op.

--
Thanks,
-Amit Khandekar
Huawei Technologies

Re: spin_delay() for ARM

From
Amit Khandekar
Date:
On Mon, 13 Apr 2020 at 20:16, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On Sat, 11 Apr 2020 at 04:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I wrote:
> > > A more useful test would be to directly experiment with contended
> > > spinlocks.  As I recall, we had some test cases laying about when
> > > we were fooling with the spin delay stuff on Intel --- maybe
> > > resurrecting one of those would be useful?
> >
> > The last really significant performance testing we did in this area
> > seems to have been in this thread:
> >
> > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> >
> > A relevant point from that is Haas' comment
> >
> >     I think optimizing spinlocks for machines with only a few CPUs is
> >     probably pointless.  Based on what I've seen so far, spinlock
> >     contention even at 16 CPUs is negligible pretty much no matter what
> >     you do.  Whether your implementation is fast or slow isn't going to
> >     matter, because even an inefficient implementation will account for
> >     only a negligible percentage of the total CPU time - much less than 1%
> >     - as opposed to a 64-core machine, where it's not that hard to find
> >     cases where spin-waits consume the *majority* of available CPU time
> >     (recall previous discussion of lseek).
>
> Yeah, will check if I find some machines with large cores.

I got hold of a 32-CPU VM (actually a 16-core machine, but being
hyperthreaded, it presents 32 CPUs).
It was an Intel Xeon, 3GHz CPU, with 15GB of available memory.
Hypervisor: KVM. Single NUMA node.
PG parameters changed: shared_buffers = 8GB; max_connections = 1000

I compared pgbench results with HEAD versus with PAUSE removed, like this:
 perform_spin_delay(SpinDelayStatus *status)
 {
-       /* CPU-specific delay each time through the loop */
-       SPIN_DELAY();

Ran with increasing numbers of parallel clients:
pgbench -S -c $num -j $num -T 60 -M prepared
But I couldn't find any significant change in the TPS numbers with or
without PAUSE:

Clients     HEAD     Without_PAUSE
8         244446       247264
16        399939       399549
24        454189       453244
32       1097592      1098844
40       1090424      1087984
48       1068645      1075173
64       1035035      1039973
96        976578       970699

Maybe it will indeed show some difference only at around 64 cores, or
perhaps a bare-metal machine would help; but as of now I don't have
access to such a machine. Anyway, I thought I'd archive the results
with what I have.

Not relevant to the PAUSE stuff: note that when the number of parallel
clients goes from 24 to 32 (which equals the machine's CPU count), the
TPS shoots from 454189 to 1097592, which is more than double, with
just a ~30% increase in parallel sessions. I was not expecting this
much of a jump, because in the contended scenario the pgbench client
processes are already taking around 20% of the total CPU time of the
run. Maybe later I will get a chance to run a customized pgbench
script that calls a server-side function that keeps running an index
scan on pgbench_accounts, so as to make the pgbench clients almost
idle.

Thanks
-Amit Khandekar



Re: spin_delay() for ARM

From
Pavel Stehule
Date:


On Thu, 16 Apr 2020 at 9:18, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I got hold of a 32-CPU VM (actually a 16-core machine, but being
> hyperthreaded, it presents 32 CPUs).
> [...]
> Ran with increasing numbers of parallel clients:
> pgbench -S -c $num -j $num -T 60 -M prepared
> But I couldn't find any significant change in the TPS numbers with or
> without PAUSE.
> [...]

As far as I know, pgbench cannot be used for testing spinlock problems.

Maybe you can see this issue when you a) use a higher number of clients - hundreds or thousands - and b) decrease shared memory, so there will be pressure on the related spinlock.

Regards

Pavel


Thanks
-Amit Khandekar


Re: spin_delay() for ARM

From
Ants Aasma
Date:
On Thu, 16 Apr 2020 at 10:33, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> As far as I know, pgbench cannot be used for testing spinlock problems.
>
> Maybe you can see this issue when you a) use a higher number of clients - hundreds or thousands - and b) decrease shared memory, so there will be pressure on the related spinlock.

There really aren't many spinlocks left that could be tickled by a
normal workload. I looked for a way to trigger spinlock contention
when I prototyped a patch to replace spinlocks with futexes. The only
one I could figure out how to make contended was the lock protecting
parallel btree scans. A highly parallel index-only scan on a fully
cached index should create at least some spinlock contention.

Regards,
Ants Aasma



Re: spin_delay() for ARM

From
Robert Haas
Date:
On Thu, Apr 16, 2020 at 3:18 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Not relevant to the PAUSE stuff: note that when the number of parallel
> clients goes from 24 to 32 (which equals the machine's CPU count), the
> TPS shoots from 454189 to 1097592, which is more than double, with
> just a ~30% increase in parallel sessions.

I've seen stuff like this too. For instance, check out the graph from
this 2012 blog post:

http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-about-64.html

You can see that the performance growth is basically on a straight
line up to about 16 cores, but then it kinks downward until about 28,
after which it kinks sharply upward until about 36 cores.

I think this has something to do with the process scheduling behavior
of Linux, because I vaguely recall some discussion where somebody did
benchmarking on the same hardware on both Linux and one of the BSD
systems, and the effect didn't appear on BSD. They had other problems,
like a huge drop-off at higher core counts, but they didn't have that
effect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: spin_delay() for ARM

From
Thomas Munro
Date:
On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma <ants@cybertec.at> wrote:
> On Thu, 16 Apr 2020 at 10:33, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> > As far as I know, pgbench cannot be used for testing spinlock problems.
> >
> > Maybe you can see this issue when you a) use a higher number of clients - hundreds or thousands - and b) decrease shared memory, so there will be pressure on the related spinlock.
>
> There really aren't many spinlocks left that could be tickled by a
> normal workload. I looked for a way to trigger spinlock contention
> when I prototyped a patch to replace spinlocks with futexes. The only
> one that I could figure out a way to make contended was the lock
> protecting parallel btree scan. A highly parallel index only scan on a
> fully cached index should create at least some spinlock contention.

I suspect the snapshot-too-old "mutex_threshold" spinlock can become
contended under workloads that generate a high rate of
heap_page_prune_opt() calls with old_snapshot_threshold enabled.  One
way to do that is with a bunch of concurrent index scans that hit the
heap in random order.  Some notes about that:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com



Re: spin_delay() for ARM

From
Amit Khandekar
Date:
On Sat, 18 Apr 2020 at 03:30, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma <ants@cybertec.at> wrote:
> > On Thu, 16 Apr 2020 at 10:33, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> > > As far as I know, pgbench cannot be used for testing spinlock problems.
> > >
> > > Maybe you can see this issue when you a) use a higher number of clients - hundreds or thousands - and b) decrease shared memory, so there will be pressure on the related spinlock.
> >
> > There really aren't many spinlocks left that could be tickled by a
> > normal workload. I looked for a way to trigger spinlock contention
> > when I prototyped a patch to replace spinlocks with futexes. The only
> > one that I could figure out a way to make contended was the lock
> > protecting parallel btree scan. A highly parallel index only scan on a
> > fully cached index should create at least some spinlock contention.
>
> I suspect the snapshot-too-old "mutex_threshold" spinlock can become
> contended under workloads that generate a high rate of
> heap_page_prune_opt() calls with old_snapshot_threshold enabled.  One
> way to do that is with a bunch of concurrent index scans that hit the
> heap in random order.  Some notes about that:
>
> https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com

Thanks all for the inputs. Will keep these two particular scenarios in
mind, and try to get some bandwidth on this soon.


-- 
Thanks,
-Amit Khandekar
Huawei Technologies