Thread: spin_delay() for ARM
Hi,

We use (an equivalent of) the PAUSE instruction in spin_delay() for Intel architectures. The goal is to slow down the spinlock tight loop and thus prevent it from eating CPU and causing CPU starvation, so that other processes get their fair share of CPU time. The Intel documentation [1] clearly mentions this, along with other benefits of PAUSE, such as lower power consumption and avoidance of a memory-order violation when exiting the loop.

Similar to PAUSE, the ARM architecture has the YIELD instruction, which is also clearly documented [2]. It explicitly says that it is a way to hint to the CPU that it is being called in a spinlock loop and that this process can be preempted. But for ARM, we are not using any kind of spin delay. For PG spinlocks, the goal of both of these instructions is the same, and both architectures recommend using them in spinlock loops. I also found multiple places where YIELD is already used in the same situation: the Linux kernel [3] and OpenJDK [4][5].

On ARM implementations that don't implement YIELD, it executes as a no-op. Unfortunately, the ARM machine I have does not implement YIELD. But recently there have been some ARM implementations that are hyperthreaded, so they are expected to actually act on the YIELD, although the docs do not explicitly say that YIELD has to be implemented only by hyperthreaded implementations.

I ran some pgbench tests to test PAUSE/YIELD on the respective architectures, once with the instruction present and once with it removed. I didn't see any change in the TPS numbers; they were more or less the same. For ARM, this was expected because my ARM machine does not implement it.

On my Intel Xeon machine with 8 cores, I also tried to test PAUSE using a sample C program (attached spin.c). Here, many child processes (many more than CPUs) wait in a tight loop for a shared variable to become 0, while the parent process continuously increments a sequence number for a fixed amount of time, after which it sets the shared variable to 0. The child's tight loop calls PAUSE in each iteration. What I hoped was that, because of the PAUSE in the children, the parent process would get a larger share of the CPU, due to which, in a given time, the sequence number would reach a higher value. I also expected the CPU cycles spent by the child processes to drop, thanks to PAUSE. Neither of these happened; there was no change.

Possibly this test case is not right. Probably the process preemption occurs only within the set of hyperthreads attached to a single core, and in my test case the parent process is the only one ready to run. Still, I have attached the program (spin.c) for archival, in case somebody with a YIELD-supporting ARM machine wants to use it to test YIELD.

Nevertheless, I think that because we have clear documentation that strongly recommends using it, and because it has been used in other places such as the Linux kernel and the JDK, we should start using YIELD for spin_delay() on ARM.

Attached is the trivial patch (spin_delay_for_arm.patch). To start with, it contains changes only for aarch64. I haven't yet added changes in configure[.in] to make sure YIELD compiles successfully (YIELD is present in the manuals from ARMv6 onwards). Before that, I thought of getting some comments, so I didn't do the configure changes yet.
[1] https://c9x.me/x86/html/file_module_x86_id_232.html
[2] https://developer.arm.com/docs/100076/0100/instruction-set-reference/a64-general-instructions/yield
[3] https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/processor.h#L259
[4] http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html
[5] http://mail.openjdk.java.net/pipermail/aarch64-port-dev/2017-August/004880.html

--
Thanks,
-Amit Khandekar
Huawei Technologies
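[For readers without the attachment: the aarch64 hunk of spin_delay_for_arm.patch presumably adds something like the following to src/include/storage/s_lock.h. This is a sketch based only on the description above, not the actual patch.]

	/*
	 * Hypothetical sketch of the aarch64 case in s_lock.h; the real change
	 * is in the attached spin_delay_for_arm.patch.
	 */
	#if defined(__aarch64__)
	#define SPIN_DELAY() spin_delay()

	static __inline__ void
	spin_delay(void)
	{
		/*
		 * YIELD hints the core that we are in a spin-wait loop;
		 * implementations without SMT treat it as a NOP.
		 */
		__asm__ __volatile__(" yield \n");
	}
	#endif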
Hi,

On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> using a sample C program (attached spin.c). Here, many child processes
> (much more than CPUs) wait in a tight loop for a shared variable to
> become 0, while the parent process continuously increments a sequence
> number for a fixed amount of time, after which, it sets the shared
> variable to 0. The child's tight loop calls PAUSE in each iteration.
> What I hoped was that because of PAUSE in children, the parent process
> would get more share of the CPU, due to which, in a given time, the
> sequence number will reach a higher value. Also, I expected the CPU
> cycles spent by child processes to drop down, thanks to PAUSE. None of
> these happened. There was no change.
>
> Possibly, this testcase is not right. Probably the process preemption
> occurs only within the set of hyperthreads attached to a single core.
> And in my testcase, the parent process is the only one who is ready to
> run. Still, I have anyway attached the program (spin.c) for archival;
> in case somebody with a YIELD-supporting ARM machine wants to use it
> to test YIELD.

PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't just schedule another process - you won't see different CPU usage if you measure it purely as the time running. You should be able to see a difference if you measure with a profiler that shows you data from the CPU's performance monitoring unit.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
>> On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
>> using a sample C program (attached spin.c).

> PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
> just schedule another process - you won't see different CPU usage if you
> measure it purely as the time running. You should be able to see a
> difference if you measure with a profiler that shows you data from the
> CPU's performance monitoring unit.

A more useful test would be to directly experiment with contended spinlocks. As I recall, we had some test cases laying about when we were fooling with the spin delay stuff on Intel --- maybe resurrecting one of those would be useful?

			regards, tom lane
I wrote:
> A more useful test would be to directly experiment with contended
> spinlocks. As I recall, we had some test cases laying about when
> we were fooling with the spin delay stuff on Intel --- maybe
> resurrecting one of those would be useful?

The last really significant performance testing we did in this area seems to have been in this thread:

https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com

A relevant point from that is Haas' comment:

    I think optimizing spinlocks for machines with only a few CPUs is
    probably pointless. Based on what I've seen so far, spinlock
    contention even at 16 CPUs is negligible pretty much no matter what
    you do. Whether your implementation is fast or slow isn't going to
    matter, because even an inefficient implementation will account for
    only a negligible percentage of the total CPU time - much less than 1%
    - as opposed to a 64-core machine, where it's not that hard to find
    cases where spin-waits consume the *majority* of available CPU time
    (recall previous discussion of lseek).

So I wonder whether this patch is getting ahead of the game. It does seem that ARM systems with a couple dozen cores exist, but are they common enough to optimize for yet? Can we even find *one* to test on and verify that this is a win and not a loss? (Also, seeing that there are so many different ARM vendors, results from just one chipset might not be too trustworthy ...)

			regards, tom lane
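[Short of resurrecting those old test cases, a minimal standalone sketch of a directly contended spin loop is below. It deliberately does not use PostgreSQL's slock_t; it just reproduces the acquire/spin/release pattern with GCC atomic builtins, with the architecture hint in the retry loop, so the run time with and without the hint can be compared externally, e.g. under time(1). All names and parameters here are made up for the sketch.]

	/*
	 * Contended-spinlock sketch (not PostgreSQL code): NPROC workers hammer
	 * one test-and-set byte in shared memory and bump a protected counter.
	 * Compare run times with and without the hint in cpu_relax().
	 */
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/wait.h>

	#define NPROC 8
	#define ITERS 5000000L

	static inline void
	cpu_relax(void)
	{
	#if defined(__x86_64__) || defined(__i386__)
		__asm__ __volatile__("pause");
	#elif defined(__aarch64__)
		__asm__ __volatile__("yield");
	#endif
	}

	int
	main(void)
	{
		struct shared { unsigned char lock; unsigned long counter; };
		struct shared *sh = mmap(NULL, sizeof(*sh), PROT_READ | PROT_WRITE,
					 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

		for (int i = 0; i < NPROC; i++)
		{
			if (fork() == 0)
			{
				for (long j = 0; j < ITERS; j++)
				{
					while (__atomic_test_and_set(&sh->lock, __ATOMIC_ACQUIRE))
						cpu_relax();	/* spin with the hint under test */
					sh->counter++;		/* the protected "work" */
					__atomic_clear(&sh->lock, __ATOMIC_RELEASE);
				}
				_exit(0);
			}
		}
		while (wait(NULL) > 0)
			;
		printf("counter = %lu (expected %lu)\n", sh->counter,
		       (unsigned long) NPROC * ITERS);
		return 0;
	}

[As noted above, any effect is likely to be visible only at high core counts.]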
On Sat, 11 Apr 2020 at 00:47, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> > On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> > using a sample C program (attached spin.c). Here, many child processes
> > (much more than CPUs) wait in a tight loop for a shared variable to
> > become 0, while the parent process continuously increments a sequence
> > number for a fixed amount of time, after which, it sets the shared
> > variable to 0. The child's tight loop calls PAUSE in each iteration.
> > What I hoped was that because of PAUSE in children, the parent process
> > would get more share of the CPU, due to which, in a given time, the
> > sequence number will reach a higher value. Also, I expected the CPU
> > cycles spent by child processes to drop down, thanks to PAUSE. None of
> > these happened. There was no change.
>
> > Possibly, this testcase is not right. Probably the process preemption
> > occurs only within the set of hyperthreads attached to a single core.
> > And in my testcase, the parent process is the only one who is ready to
> > run. Still, I have anyway attached the program (spin.c) for archival;
> > in case somebody with a YIELD-supporting ARM machine wants to use it
> > to test YIELD.
>
> PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
> just schedule another process - you won't see different CPU usage if you
> measure it purely as the time running.
Yeah, I thought that the OS scheduling would be an *indirect* consequence of the PAUSE, because of its slowing down the CPU, but it looks like that does not happen.
> You should be able to see a
> difference if you measure with a profiler that shows you data from the
> CPUs performance monitoring unit.
Hmm, I had tried with perf and could see the PAUSE itself consuming 5% CPU. But I haven't yet played with per-process figures.
--
Thanks,
-Amit Khandekar
Huawei Technologies
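[For readers without the attachment, a rough reconstruction of the kind of harness spin.c describes is below: children spin on a shared flag issuing PAUSE while the parent counts for a fixed time. It is a sketch based only on the description upthread, not the actual file, and is x86-only since it uses the SSE2 PAUSE intrinsic.]

	/*
	 * Sketch of a spin.c-style harness (hypothetical reconstruction).
	 * Children busy-wait on a shared flag, issuing PAUSE on each iteration,
	 * while the parent increments a sequence number for a fixed time and
	 * then releases them.
	 */
	#include <stdio.h>
	#include <unistd.h>
	#include <time.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <emmintrin.h>		/* _mm_pause() */

	#define NCHILDREN 32		/* deliberately more than the CPU count */
	#define RUN_SECONDS 10

	int
	main(void)
	{
		volatile int *flag = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
					  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		unsigned long long seq = 0;
		time_t stop;

		*flag = 1;
		for (int i = 0; i < NCHILDREN; i++)
		{
			if (fork() == 0)
			{
				while (*flag)
					_mm_pause();	/* the hint under test */
				_exit(0);
			}
		}

		stop = time(NULL) + RUN_SECONDS;
		while (time(NULL) < stop)
			seq++;			/* the parent's "useful work" */

		*flag = 0;			/* release the spinners */
		while (wait(NULL) > 0)
			;
		printf("sequence reached %llu\n", seq);
		return 0;
	}

[As Andres points out above, PAUSE is not a scheduler hint, so a harness like this is not expected to change how much CPU the parent gets.]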
On Sat, 11 Apr 2020 at 04:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > A more useful test would be to directly experiment with contended
> > spinlocks. As I recall, we had some test cases laying about when
> > we were fooling with the spin delay stuff on Intel --- maybe
> > resurrecting one of those would be useful?
>
> The last really significant performance testing we did in this area
> seems to have been in this thread:
>
> https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
>
> A relevant point from that is Haas' comment
>
> I think optimizing spinlocks for machines with only a few CPUs is
> probably pointless. Based on what I've seen so far, spinlock
> contention even at 16 CPUs is negligible pretty much no matter what
> you do. Whether your implementation is fast or slow isn't going to
> matter, because even an inefficient implementation will account for
> only a negligible percentage of the total CPU time - much less than 1%
> - as opposed to a 64-core machine, where it's not that hard to find
> cases where spin-waits consume the *majority* of available CPU time
> (recall previous discussion of lseek).
Yeah, will check if I find some machines with large cores.
> So I wonder whether this patch is getting ahead of the game. It does
> seem that ARM systems with a couple dozen cores exist, but are they
> common enough to optimize for yet? Can we even find *one* to test on
> and verify that this is a win and not a loss? (Also, seeing that
> there are so many different ARM vendors, results from just one
> chipset might not be too trustworthy ...)
Ok. Yes, it would be worth waiting to see if there are others in the community with ARM systems that have implemented YIELD. Maybe after that we might gain some confidence. I also hope to get such a machine soon to test on, but right now I have one that does not support it, so it will be just a no-op.
--
Thanks,
-Amit Khandekar
Huawei Technologies
On Mon, 13 Apr 2020 at 20:16, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On Sat, 11 Apr 2020 at 04:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I wrote:
> > > A more useful test would be to directly experiment with contended
> > > spinlocks. As I recall, we had some test cases laying about when
> > > we were fooling with the spin delay stuff on Intel --- maybe
> > > resurrecting one of those would be useful?
> >
> > The last really significant performance testing we did in this area
> > seems to have been in this thread:
> >
> > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> >
> > A relevant point from that is Haas' comment
> >
> > I think optimizing spinlocks for machines with only a few CPUs is
> > probably pointless. Based on what I've seen so far, spinlock
> > contention even at 16 CPUs is negligible pretty much no matter what
> > you do. Whether your implementation is fast or slow isn't going to
> > matter, because even an inefficient implementation will account for
> > only a negligible percentage of the total CPU time - much less than 1%
> > - as opposed to a 64-core machine, where it's not that hard to find
> > cases where spin-waits consume the *majority* of available CPU time
> > (recall previous discussion of lseek).
>
> Yeah, will check if I find some machines with large cores.

I got hold of a 32 CPUs VM (actually it was a 16-core, but being hyperthreaded, CPUs were 32). It was an Intel Xeon, 3 GHz CPU. 15 GB available memory. Hypervisor: KVM. Single NUMA node.

PG parameters changed: shared_buffers: 8G; max_connections: 1000

I compared pgbench results with HEAD versus PAUSE removed like this:

 perform_spin_delay(SpinDelayStatus *status)
 {
-	/* CPU-specific delay each time through the loop */
-	SPIN_DELAY();

Ran with increasing number of parallel clients:
pgbench -S -c $num -j $num -T 60 -M prepared

But couldn't find any significant change in the TPS numbers with or without PAUSE:

Clients    HEAD        Without_PAUSE
8          244446      247264
16         399939      399549
24         454189      453244
32         1097592     1098844
40         1090424     1087984
48         1068645     1075173
64         1035035     1039973
96         976578      970699

May be it will indeed show some difference only with around 64 cores, or perhaps a bare metal machine will help; but as of now I didn't get such a machine. Anyways, I thought why not archive the results with whatever I have.

Not relevant to the PAUSE stuff .... Note that when the parallel clients reach from 24 to 32 (which equals the machine CPUs), the TPS shoots from 454189 to 1097592, which is more than double the speed gain with just a 30% increase in parallel sessions. I was not expecting this much speed gain, because with the contended scenario the pgbench processes are already taking around 20% of the total CPU time of the pgbench run. May be later on, I will get a chance to run with some customized pgbench script that runs a server function which keeps on running an index scan on pgbench_accounts, so as to make the pgbench clients almost idle.

Thanks
-Amit Khandekar
On Thu, 16 Apr 2020 at 9:18, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Mon, 13 Apr 2020 at 20:16, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On Sat, 11 Apr 2020 at 04:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I wrote:
> > > A more useful test would be to directly experiment with contended
> > > spinlocks. As I recall, we had some test cases laying about when
> > > we were fooling with the spin delay stuff on Intel --- maybe
> > > resurrecting one of those would be useful?
> >
> > The last really significant performance testing we did in this area
> > seems to have been in this thread:
> >
> > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> >
> > A relevant point from that is Haas' comment
> >
> > I think optimizing spinlocks for machines with only a few CPUs is
> > probably pointless. Based on what I've seen so far, spinlock
> > contention even at 16 CPUs is negligible pretty much no matter what
> > you do. Whether your implementation is fast or slow isn't going to
> > matter, because even an inefficient implementation will account for
> > only a negligible percentage of the total CPU time - much less than 1%
> > - as opposed to a 64-core machine, where it's not that hard to find
> > cases where spin-waits consume the *majority* of available CPU time
> > (recall previous discussion of lseek).
>
> Yeah, will check if I find some machines with large cores.
I got hold of a 32 CPUs VM (actually it was a 16-core, but being
hyperthreaded, CPUs were 32).
It was an Intel Xeon, 3 GHz CPU. 15 GB available memory. Hypervisor:
KVM. Single NUMA node.
PG parameters changed: shared_buffers: 8G; max_connections: 1000
I compared pgbench results with HEAD versus PAUSE removed like this :
perform_spin_delay(SpinDelayStatus *status)
{
- /* CPU-specific delay each time through the loop */
- SPIN_DELAY();
Ran with increasing number of parallel clients :
pgbench -S -c $num -j $num -T 60 -M prepared
But couldn't find any significant change in the TPS numbers with or
without PAUSE:
Clients HEAD Without_PAUSE
8 244446 247264
16 399939 399549
24 454189 453244
32 1097592 1098844
40 1090424 1087984
48 1068645 1075173
64 1035035 1039973
96 976578 970699
May be it will indeed show some difference only with around 64 cores,
or perhaps a bare metal machine will help; but as of now I didn't get
such a machine. Anyways, I thought why not archive the results with
whatever I have.
Not relevant to the PAUSE stuff .... Note that when the parallel
clients reach from 24 to 32 (which equals the machine CPUs), the TPS
shoots from 454189 to 1097592 which is more than double speed gain
with just a 30% increase in parallel sessions. I was not expecting
this much speed gain, because, with contended scenario already pgbench
processes are already taking around 20% of the total CPU time of
pgbench run. May be later on, I will get a chance to run with some
customized pgbench script that runs a server function which keeps on
running an index scan on pgbench_accounts, so as to make pgbench
clients almost idle.
From what I know, pgbench cannot be used for testing spinlock problems.
Maybe you can see this issue when you a) use a higher number of clients - hundreds or thousands - and b) decrease shared memory, so there will be pressure on the related spinlock.
Regards
Pavel
On Thu, 16 Apr 2020 at 10:33, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> From what I know, pgbench cannot be used for testing spinlock problems.
>
> Maybe you can see this issue when you a) use a higher number of clients
> - hundreds or thousands - and b) decrease shared memory, so there will
> be pressure on the related spinlock.

There really aren't many spinlocks left that could be tickled by a normal workload. I looked for a way to trigger spinlock contention when I prototyped a patch to replace spinlocks with futexes. The only one that I could figure out a way to make contended was the lock protecting parallel btree scan. A highly parallel index only scan on a fully cached index should create at least some spinlock contention.

Regards,
Ants Aasma
On Thu, Apr 16, 2020 at 3:18 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Not relevant to the PAUSE stuff .... Note that when the parallel
> clients reach from 24 to 32 (which equals the machine CPUs), the TPS
> shoots from 454189 to 1097592, which is more than double the speed gain
> with just a 30% increase in parallel sessions.

I've seen stuff like this too. For instance, check out the graph from this 2012 blog post:

http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-about-64.html

You can see that the performance growth is basically on a straight line up to about 16 cores, but then it kinks downward until about 28, after which it kinks sharply upward until about 36 cores. I think this has something to do with the process scheduling behavior of Linux, because I vaguely recall some discussion where somebody did benchmarking on the same hardware on both Linux and one of the BSD systems, and the effect didn't appear on BSD. They had other problems, like a huge drop-off at higher core counts, but they didn't have that effect.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma <ants@cybertec.at> wrote:
> On Thu, 16 Apr 2020 at 10:33, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> > From what I know, pgbench cannot be used for testing spinlock problems.
> >
> > Maybe you can see this issue when you a) use a higher number of clients
> > - hundreds or thousands - and b) decrease shared memory, so there will
> > be pressure on the related spinlock.
>
> There really aren't many spinlocks left that could be tickled by a
> normal workload. I looked for a way to trigger spinlock contention
> when I prototyped a patch to replace spinlocks with futexes. The only
> one that I could figure out a way to make contended was the lock
> protecting parallel btree scan. A highly parallel index only scan on a
> fully cached index should create at least some spinlock contention.

I suspect the snapshot-too-old "mutex_threshold" spinlock can become contended under workloads that generate a high rate of heap_page_prune_opt() calls with old_snapshot_threshold enabled. One way to do that is with a bunch of concurrent index scans that hit the heap in random order. Some notes about that:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com
On Sat, 18 Apr 2020 at 03:30, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma <ants@cybertec.at> wrote:
> > On Thu, 16 Apr 2020 at 10:33, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> > > From what I know, pgbench cannot be used for testing spinlock problems.
> > >
> > > Maybe you can see this issue when you a) use a higher number of clients
> > > - hundreds or thousands - and b) decrease shared memory, so there will
> > > be pressure on the related spinlock.
> >
> > There really aren't many spinlocks left that could be tickled by a
> > normal workload. I looked for a way to trigger spinlock contention
> > when I prototyped a patch to replace spinlocks with futexes. The only
> > one that I could figure out a way to make contended was the lock
> > protecting parallel btree scan. A highly parallel index only scan on a
> > fully cached index should create at least some spinlock contention.
>
> I suspect the snapshot-too-old "mutex_threshold" spinlock can become
> contended under workloads that generate a high rate of
> heap_page_prune_opt() calls with old_snapshot_threshold enabled. One
> way to do that is with a bunch of concurrent index scans that hit the
> heap in random order. Some notes about that:
>
> https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com

Thanks all for the inputs. Will keep these two particular scenarios in mind, and try to get some bandwidth on this soon.

--
Thanks,
-Amit Khandekar
Huawei Technologies