Thread: Better LWLocks with compare-and-swap (9.4)
I've been working on-and-off on the WAL-insert scaling patch. It's in pretty
good shape now, and I'll post it shortly, but one thing I noticed is that it
benefits a lot from using an atomic compare-and-swap instruction for the
contention-critical part.

I realized that we could also use compare-and-swap to make LWLocks scale
better. The LWLock struct is too large to compare-and-swap atomically, but we
can still use CAS to increment/decrement the shared/exclusive counters, when
there's no need to manipulate the wait queue. That would help with workloads
where you have a lot of CPUs, and a lot of backends need to acquire the same
lwlock in shared mode, but there's no real contention (ie. few exclusive
lockers).

pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when I
run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
   - tas
      - 99.83% s_lock
         - 53.22% LWLockAcquire
            + 99.87% GetSnapshotData
         - 46.78% LWLockRelease
              GetSnapshotData
              + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory is
that after that point all the cores are busy, and processes start to be
sometimes context switched while holding the spinlock, which kills
performance. Has anyone else seen that pattern? Curiously, I don't see that
when connecting pgbench via TCP over localhost, only when connecting via unix
domain sockets. Overall performance is higher over unix domain sockets, so I
guess the TCP layer adds some overhead, hurting performance, and also affects
scheduling somehow, making the steep drop go away.

Using a compare-and-swap instruction in place of spinning solves the problem
(green line in attached graph). This needs some more testing with different
workloads to make sure it doesn't hurt other scenarios, but I don't think it
will. I'm also waiting for a colleague to test this on a different machine,
where he saw a similar steep drop in performance.

The attached patch is still work-in-progress. There needs to be a configure
test and fallback to spinlock if a CAS instruction is not available. I used
the gcc __sync_val_compare_and_swap() builtin directly, that needs to be
abstracted. Also, in the case that the wait queue needs to be manipulated,
the code spins on the CAS instruction, but there is no delay mechanism like
there is on a regular spinlock; that needs to be added in somehow.

- Heikki
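For readers unfamiliar with the approach, here is a minimal sketch of what
CAS-based shared-mode acquisition could look like, built on the
__sync_val_compare_and_swap() builtin mentioned above. It is not the actual
patch: the packed 64-bit state word, the bit layout and the function name are
illustrative assumptions.

/*
 * Illustrative sketch only, not the actual patch: pack the exclusive flag
 * and the shared-holder count into one 64-bit word (the "state" field and
 * bit layout are hypothetical) and take a shared lock by bumping the count
 * with CAS, without ever touching the spinlock.
 */
#include <stdint.h>
#include <stdbool.h>

#define LW_FLAG_EXCLUSIVE   ((uint64_t) 1 << 32)    /* hypothetical layout */
#define LW_SHARED_MASK      ((uint64_t) 0xFFFFFFFF)

static bool
lwlock_try_acquire_shared(volatile uint64_t *state)
{
    uint64_t    old = *state;

    while ((old & LW_FLAG_EXCLUSIVE) == 0)
    {
        /* try to add ourselves as one more shared holder */
        uint64_t    seen = __sync_val_compare_and_swap(state, old, old + 1);

        if (seen == old)
            return true;        /* got the lock; no spinlock was taken */
        old = seen;             /* lost the race; retry with the fresh value */
    }
    return false;               /* exclusive holder present: take the slow path */
}

On failure the caller would fall back to the existing spinlock-protected slow
path that manipulates the wait queue, which is exactly the case the mail above
says still needs a proper delay mechanism.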
On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> The attached patch is still work-in-progress. There needs to be a configure
> test and fallback to spinlock if a CAS instruction is not available. I used
> the gcc __sync_val_compare_and_swap() builtin directly, that needs to be
> abstracted. Also, in the case that the wait queue needs to be manipulated,
> the code spins on the CAS instruction, but there is no delay mechanism like
> there is on a regular spinlock; that needs to be added in somehow.

These are really interesting results. Why is the CAS method so much faster
than TAS? Did you see any contention?

merlin
On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
> I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:
>
> -  64.09%  postgres  postgres           [.] tas
>    - tas
>       - 99.83% s_lock
>          - 53.22% LWLockAcquire
>             + 99.87% GetSnapshotData
>          - 46.78% LWLockRelease
>               GetSnapshotData
>               + GetTransactionSnapshot
> +   2.97%  postgres  postgres           [.] tas
> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
> +   1.44%  postgres  postgres           [.] GetSnapshotData
> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
> +   1.18%  postgres  postgres           [.] AllocSetAlloc
> ...
>
> So, on this test, a lot of time is wasted spinning on the mutex of
> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
> surprisingly steep drop in performance once you go beyond 29 clients
> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
> is that after that point all the cores are busy, and processes start to be
> sometimes context switched while holding the spinlock, which kills
> performance.

I have, I also used linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy load,
in this case with processes >> number of processors. It was also a
phase-change type of event: at one moment everything would be going great,
but once a critical threshold was hit, s_lock would consume enormous amount
of CPU time. I figured preemption while in the spinlock was to blame at the
time, given the extreme nature.
On Wed, May 15, 2013 at 3:08 PM, Daniel Farina <daniel@heroku.com> wrote:
> On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
>> I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:
>>
>> -  64.09%  postgres  postgres           [.] tas
>>    - tas
>>       - 99.83% s_lock
>>          - 53.22% LWLockAcquire
>>             + 99.87% GetSnapshotData
>>          - 46.78% LWLockRelease
>>               GetSnapshotData
>>               + GetTransactionSnapshot
>> +   2.97%  postgres  postgres           [.] tas
>> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
>> +   1.44%  postgres  postgres           [.] GetSnapshotData
>> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
>> +   1.18%  postgres  postgres           [.] AllocSetAlloc
>> ...
>>
>> So, on this test, a lot of time is wasted spinning on the mutex of
>> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
>> surprisingly steep drop in performance once you go beyond 29 clients
>> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
>> is that after that point all the cores are busy, and processes start to be
>> sometimes context switched while holding the spinlock, which kills
>> performance.

I accidentally dropped some important last words from Heikki's mail in my
quote, which makes my correspondence pretty bizarre to understand at the
outset. Apologies. He wrote:

>> [...] Has anyone else seen that pattern?

> I have, I also used linux perf to come to this conclusion, and my
> determination was similar: a system was undergoing increasingly heavy
> load, in this case with processes >> number of processors. It was
> also a phase-change type of event: at one moment everything would be
> going great, but once a critical threshold was hit, s_lock would
> consume enormous amount of CPU time. I figured preemption while in
> the spinlock was to blame at the time, given the extreme nature.
* Heikki Linnakangas (hlinnakangas@vmware.com) wrote:
> My theory is that after that point all the cores are busy,
> and processes start to be sometimes context switched while holding
> the spinlock, which kills performance. Has anyone else seen that
> pattern?

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock system
and replace it with kernel-backed calls which do it better, specifically by
dealing with issues like the above? Have you seen those threads in the past?
Any thoughts about moving in that direction?

> Curiously, I don't see that when connecting pgbench via TCP
> over localhost, only when connecting via unix domain sockets.
> Overall performance is higher over unix domain sockets, so I guess
> the TCP layer adds some overhead, hurting performance, and also
> affects scheduling somehow, making the steep drop go away.

I wonder if the kernel locks around unix domain sockets are helping us out
here, while it's not able to take advantage of such knowledge about the
process that's waiting when it's a TCP connection? Just a hunch.

Thanks,

Stephen
Stephen Frost <sfrost@snowman.net> writes:
> Isn't this the same issue which has prompted multiple people to propose
> (sometimes with code, as I recall) to rip out our internal spinlock
> system and replace it with kernel-backed calls which do it better,
> specifically by dealing with issues like the above? Have you seen those
> threads in the past? Any thoughts about moving in that direction?

All of the proposals of that sort that I've seen had a flavor of "my OS is
the only one that matters". While I don't object to platform-dependent
implementations of spinlocks as such, they're not much of a cure for a
generic performance issue.

			regards, tom lane
On 13-05-2013 09:50, Heikki Linnakangas wrote:
> I've been working on-and-off on the WAL-insert scaling patch. It's
> in pretty good shape now, and I'll post it shortly, but one thing I
> noticed is that it benefits a lot from using an atomic
> compare-and-swap instruction for the contention-critical part.
>
> I realized that we could also use compare-and-swap to make LWLocks
> scale better. The LWLock struct is too large to compare-and-swap
> atomically, but we can still use CAS to increment/decrement the
> shared/exclusive counters, when there's no need to manipulate the
> wait queue. That would help with workloads where you have a lot of
> CPUs, and a lot of backends need to acquire the same lwlock in
> shared mode, but there's no real contention (ie. few exclusive
> lockers).
>
> pgbench -S is such a workload. With 9.3beta1, I'm seeing this
> profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
> 32-core Linux machine:
>
> -  64.09%  postgres  postgres           [.] tas
>    - tas
>       - 99.83% s_lock
>          - 53.22% LWLockAcquire
>             + 99.87% GetSnapshotData
>          - 46.78% LWLockRelease
>               GetSnapshotData
>               + GetTransactionSnapshot
> +   2.97%  postgres  postgres           [.] tas
> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
> +   1.44%  postgres  postgres           [.] GetSnapshotData
> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
> +   1.18%  postgres  postgres           [.] AllocSetAlloc
> ...

I'd like to test this here but I couldn't reproduce that perf output here on
64-core or 24-core machines; could you post the changes to postgresql.conf
and the perf arguments that you used?

Thanks!

[]s
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://github.net/guedes - twitter: @guediz
On 18.05.2013 03:52, Dickson S. Guedes wrote:
>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this
>> profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
>> 32-core Linux machine:
>>
>> -  64.09%  postgres  postgres           [.] tas
>>    - tas
>>       - 99.83% s_lock
>>          - 53.22% LWLockAcquire
>>             + 99.87% GetSnapshotData
>>          - 46.78% LWLockRelease
>>               GetSnapshotData
>>               + GetTransactionSnapshot
>> +   2.97%  postgres  postgres           [.] tas
>> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
>> +   1.44%  postgres  postgres           [.] GetSnapshotData
>> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
>> +   1.18%  postgres  postgres           [.] AllocSetAlloc
>> ...
>
> I'd like to test this here but I couldn't reproduce that perf output
> here on 64-core or 24-core machines; could you post the changes to
> postgresql.conf and the perf arguments that you used?

Sure, here are the non-default postgresql.conf settings:

            name            |             current_setting
----------------------------+-----------------------------------------
 application_name           | psql
 autovacuum                 | off
 checkpoint_segments        | 70
 config_file                | /home/hlinnakangas/data/postgresql.conf
 data_directory             | /home/hlinnakangas/data
 default_text_search_config | pg_catalog.english
 hba_file                   | /home/hlinnakangas/data/pg_hba.conf
 ident_file                 | /home/hlinnakangas/data/pg_ident.conf
 lc_collate                 | en_US.UTF-8
 lc_ctype                   | en_US.UTF-8
 log_timezone               | US/Pacific
 logging_collector          | on
 max_connections            | 100
 max_stack_depth            | 2MB
 server_encoding            | UTF8
 shared_buffers             | 2GB
 synchronous_commit         | off
 TimeZone                   | US/Pacific
 transaction_deferrable     | off
 transaction_isolation      | read committed
 transaction_read_only      | off
 wal_buffers                | 16MB

While pgbench was running, I ran this:

perf record -p 6050 -g -e cpu-clock

to connect to one of the backends. (I used cpu-clock, because the default
cpu-cycles event didn't work on the box)

- Heikki
On Sun, 2013-05-19 at 09:29 +0300, Heikki Linnakangas wrote:
> On 18.05.2013 03:52, Dickson S. Guedes wrote:
> >> pgbench -S is such a workload. With 9.3beta1, I'm seeing this
> >> profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
> >> 32-core Linux machine:
> >>
> >> -  64.09%  postgres  postgres           [.] tas
> >>    - tas
> >>       - 99.83% s_lock
> >>          - 53.22% LWLockAcquire
> >>             + 99.87% GetSnapshotData
> >>          - 46.78% LWLockRelease
> >>               GetSnapshotData
> >>               + GetTransactionSnapshot
> >> +   2.97%  postgres  postgres           [.] tas
> >> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
> >> +   1.44%  postgres  postgres           [.] GetSnapshotData
> >> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
> >> +   1.18%  postgres  postgres           [.] AllocSetAlloc
> >> ...
> >
> > I'd like to test this here but I couldn't reproduce that perf output
> > here on 64-core or 24-core machines; could you post the changes to
> > postgresql.conf and the perf arguments that you used?
>
> Sure, here are the non-default postgresql.conf settings:

Thank you for your information.

> While pgbench was running, I ran this:
>
> perf record -p 6050 -g -e cpu-clock
>
> to connect to one of the backends. (I used cpu-clock, because the
> default cpu-cycles event didn't work on the box)

Hum, I was supposing that I was doing something wrong but I'm getting the
same result as before even using your test case, and my results are still
different from yours:

+  71,27%  postgres  postgres           [.] AtEOXact_Buffers
+   7,67%  postgres  postgres           [.] AtEOXact_CatCache
+   6,30%  postgres  postgres           [.] AllocSetCheck
+   5,34%  postgres  libc-2.12.so       [.] __mcount_internal
+   2,14%  postgres  [kernel.kallsyms]  [k] activate_page

It's a 64-core machine with PGDATA on a SAN.

vendor_id       : GenuineIntel
cpu family      : 6
model           : 47
model name      : Intel(R) Xeon(R) CPU E7- 4830  @ 2.13GHz
stepping        : 2
cpu MHz         : 1064.000
cache size      : 24576 KB
physical id     : 3
siblings        : 16
core id         : 24
cpu cores       : 8
apicid          : 241
initial apicid  : 241
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc
aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16
xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes lahf_lm ida arat epb dts
tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4255.87
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:

Would you like me to test some other configuration to try to simulate that
expected workload?

[]s
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A
On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:
> Hum, I was supposing that I was doing something wrong but I'm getting
> the same result as before even using your test case, and my results are
> still different from yours:
>
> +  71,27%  postgres  postgres           [.] AtEOXact_Buffers
> +   7,67%  postgres  postgres           [.] AtEOXact_CatCache
> +   6,30%  postgres  postgres           [.] AllocSetCheck
> +   5,34%  postgres  libc-2.12.so       [.] __mcount_internal
> +   2,14%  postgres  [kernel.kallsyms]  [k] activate_page

That looks like you have configured with --enable-cassert and probably also
--enable-profiling? The former will give completely distorted performance
results...

Greetings,

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, 2013-05-20 at 14:35 +0200, Andres Freund wrote:
> On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:
> > Hum, I was supposing that I was doing something wrong but I'm getting
> > the same result as before even using your test case, and my results are
> > still different from yours:
> >
> > +  71,27%  postgres  postgres           [.] AtEOXact_Buffers
> > +   7,67%  postgres  postgres           [.] AtEOXact_CatCache
> > +   6,30%  postgres  postgres           [.] AllocSetCheck
> > +   5,34%  postgres  libc-2.12.so       [.] __mcount_internal
> > +   2,14%  postgres  [kernel.kallsyms]  [k] activate_page
>
> That looks like you have configured with --enable-cassert and probably
> also --enable-profiling? The former will give completely distorted
> performance results...

Ah! Wrong PATH, so wrong binaries. Thanks Andres.

--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A
On 13.05.2013 17:21, Merlin Moncure wrote:
> On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> The attached patch is still work-in-progress. There needs to be a configure
>> test and fallback to spinlock if a CAS instruction is not available. I used
>> the gcc __sync_val_compare_and_swap() builtin directly, that needs to be
>> abstracted. Also, in the case that the wait queue needs to be manipulated,
>> the code spins on the CAS instruction, but there is no delay mechanism like
>> there is on a regular spinlock; that needs to be added in somehow.
>
> These are really interesting results. Why is the CAS method so much
> faster than TAS?

If my theory is right, the steep fall is caused by backends being sometimes
context switched while holding the spinlock. The CAS method alleviates that,
because the lock is never "held" by anyone. With a spinlock, if a process
gets context switched while holding the lock, everyone else will have to
wait until the process gets context switched back, and releases the
spinlock. With the CAS method, if a process gets switched in the CAS-loop,
it doesn't hurt others.

With the patch, the CAS-loop still spins, just like a spinlock, when the
wait queue has to be manipulated. So when that happens, it will behave more
or less like the spinlock implementation. I'm not too concerned about
optimizing that further; sleeping and signaling sleeping backends are
heavy-weight operations anyway.

> Did you see any contention?

With the current spinlock implementation? Yeah, per the profile I posted, a
lot of CPU time was spent spinning, which is a sign of contention. I didn't
see contention with the CAS implementation.

Attached is a new version of the patch. I cleaned it up a lot, and added a
configure test, and fallback to the good old spinlock implementation if
compare-and-swap is not available. Also, if the CAS loop needs to spin, like
a spinlock, it now sleeps after a while and updates the "spins_per_delay"
like a regular spinlock.

- Heikki
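A rough sketch of the slow-path behaviour described above, spinning on the
CAS but sleeping after a while. The LW_FLAG_WAITLIST_LOCKED bit, the retry
threshold and the plain usleep() call are illustrative assumptions only; the
actual patch adapts spins_per_delay and uses PostgreSQL's own delay
machinery.

/*
 * Sketch only, not the actual patch: when the wait queue must be touched,
 * grab a hypothetical "wait list locked" bit in the state word with a CAS
 * loop, and back off with a sleep if it takes too long, roughly like
 * s_lock() does for an ordinary spinlock.
 */
#include <stdint.h>
#include <unistd.h>

#define LW_FLAG_WAITLIST_LOCKED   ((uint64_t) 1 << 33)   /* hypothetical bit */

static void
lwlock_lock_wait_list(volatile uint64_t *state)
{
    int         spins = 0;

    for (;;)
    {
        /* expect the flag to be clear in the value we compare against */
        uint64_t    old = *state & ~LW_FLAG_WAITLIST_LOCKED;

        if (__sync_val_compare_and_swap(state, old,
                                        old | LW_FLAG_WAITLIST_LOCKED) == old)
            return;             /* we now own the wait list */

        if (++spins > 1000)     /* illustrative threshold */
        {
            usleep(1000);       /* back off for 1 ms instead of burning CPU */
            spins = 0;
        }
    }
}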
On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > Isn't this the same issue which has prompted multiple people to propose
> > (sometimes with code, as I recall) to rip out our internal spinlock
> > system and replace it with kernel-backed calls which do it better,
> > specifically by dealing with issues like the above? Have you seen those
> > threads in the past? Any thoughts about moving in that direction?
>
> All of the proposals of that sort that I've seen had a flavor of
> "my OS is the only one that matters". While I don't object to
> platform-dependent implementations of spinlocks as such, they're not
> much of a cure for a generic performance issue.

Uh, is this an x86-64-only optimization? Seems so.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
> diff --git a/configure.in b/configure.in
> index 4ea5699..ff8470e 100644
> --- a/configure.in
> +++ b/configure.in
> @@ -1445,17 +1445,6 @@ fi
>  AC_CHECK_FUNCS([strtoll strtoq], [break])
>  AC_CHECK_FUNCS([strtoull strtouq], [break])
>
> -AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
> -[AC_TRY_LINK([],
> -  [int lock = 0;
> -   __sync_lock_test_and_set(&lock, 1);
> -   __sync_lock_release(&lock);],
> -  [pgac_cv_gcc_int_atomics="yes"],
> -  [pgac_cv_gcc_int_atomics="no"])])
> -if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
> -  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
> -fi
> -

Careful here --- s_lock.h has some code conditional on HAVE_GCC_INT_ATOMICS
which your patch is not touching, yet it is removing the definition, unless
I'm misreading.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 20.05.2013 23:01, Bruce Momjian wrote:
> On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
>> Stephen Frost <sfrost@snowman.net> writes:
>>> Isn't this the same issue which has prompted multiple people to propose
>>> (sometimes with code, as I recall) to rip out our internal spinlock
>>> system and replace it with kernel-backed calls which do it better,
>>> specifically by dealing with issues like the above? Have you seen those
>>> threads in the past? Any thoughts about moving in that direction?
>>
>> All of the proposals of that sort that I've seen had a flavor of
>> "my OS is the only one that matters". While I don't object to
>> platform-dependent implementations of spinlocks as such, they're not
>> much of a cure for a generic performance issue.
>
> Uh, is this an x86-64-only optimization? Seems so.

All modern architectures have an atomic compare-and-swap instruction (or
something more powerful that can be used to implement it). That includes
x86, x86-64, ARM, PowerPC, among others.

There are some differences in how wide values can be swapped with it; 386
only supported 32-bit, until Pentium, which added a 64-bit variant. I used
the 64-bit variant in the patch, but for lwlocks, we could actually get away
with the 32-bit variant if we packed the booleans and the shared lock
counter more tightly.

- Heikki
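As an illustration of what packing the booleans and the shared lock counter
"more tightly" could mean, here is a hypothetical 32-bit layout; the bit
assignments and names are assumptions for illustration, not taken from the
patch.

/*
 * Hypothetical packing, for illustration only: the exclusive flag, a
 * "release OK" flag and the shared-holder count squeezed into one 32-bit
 * word, so that the 32-bit CAS variant is enough even on older CPUs.
 */
#include <stdint.h>

#define LW_VAL_EXCLUSIVE    ((uint32_t) 1 << 31)          /* exclusive holder */
#define LW_FLAG_RELEASE_OK  ((uint32_t) 1 << 30)          /* wakeup bookkeeping */
#define LW_SHARED_MASK      (((uint32_t) 1 << 24) - 1)    /* shared-holder count */

static inline int
lw_shared_holders(uint32_t state)
{
    return (int) (state & LW_SHARED_MASK);
}

static inline int
lw_held_exclusive(uint32_t state)
{
    return (state & LW_VAL_EXCLUSIVE) != 0;
}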
On 20.05.2013 23:11, Alvaro Herrera wrote:
>> diff --git a/configure.in b/configure.in
>> index 4ea5699..ff8470e 100644
>> --- a/configure.in
>> +++ b/configure.in
>> @@ -1445,17 +1445,6 @@ fi
>>  AC_CHECK_FUNCS([strtoll strtoq], [break])
>>  AC_CHECK_FUNCS([strtoull strtouq], [break])
>>
>> -AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
>> -[AC_TRY_LINK([],
>> -  [int lock = 0;
>> -   __sync_lock_test_and_set(&lock, 1);
>> -   __sync_lock_release(&lock);],
>> -  [pgac_cv_gcc_int_atomics="yes"],
>> -  [pgac_cv_gcc_int_atomics="no"])])
>> -if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
>> -  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
>> -fi
>> -
>
> Careful here --- s_lock.h has some code conditional on
> HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
> removing the definition, unless I'm misreading.

Thanks, good catch. I renamed HAVE_GCC_INT_ATOMICS to
HAVE_GCC_INT_TEST_AND_SET because "atomics" seems too generic when we also
test for __sync_val_compare_and_swap(p, oldval, newval).

- Heikki
On Mon, May 20, 2013 at 11:16:41PM +0300, Heikki Linnakangas wrote:
> On 20.05.2013 23:01, Bruce Momjian wrote:
> > On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
> > > Stephen Frost <sfrost@snowman.net> writes:
> > > > Isn't this the same issue which has prompted multiple people to propose
> > > > (sometimes with code, as I recall) to rip out our internal spinlock
> > > > system and replace it with kernel-backed calls which do it better,
> > > > specifically by dealing with issues like the above? Have you seen those
> > > > threads in the past? Any thoughts about moving in that direction?
> > >
> > > All of the proposals of that sort that I've seen had a flavor of
> > > "my OS is the only one that matters". While I don't object to
> > > platform-dependent implementations of spinlocks as such, they're not
> > > much of a cure for a generic performance issue.
> >
> > Uh, is this an x86-64-only optimization? Seems so.
>
> All modern architectures have an atomic compare-and-swap instruction
> (or something more powerful that can be used to implement it). That
> includes x86, x86-64, ARM, PowerPC, among others.
>
> There are some differences in how wide values can be swapped with
> it; 386 only supported 32-bit, until Pentium, which added a 64-bit
> variant. I used the 64-bit variant in the patch, but for lwlocks, we
> could actually get away with the 32-bit variant if we packed the
> booleans and the shared lock counter more tightly.

So we are going to need to add this kind of assembly language optimization
for every CPU type?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
On 16.05.2013 01:08, Daniel Farina wrote:
> On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
>> I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:
>>
>> -  64.09%  postgres  postgres           [.] tas
>>    - tas
>>       - 99.83% s_lock
>>          - 53.22% LWLockAcquire
>>             + 99.87% GetSnapshotData
>>          - 46.78% LWLockRelease
>>               GetSnapshotData
>>               + GetTransactionSnapshot
>> +   2.97%  postgres  postgres           [.] tas
>> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
>> +   1.44%  postgres  postgres           [.] GetSnapshotData
>> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
>> +   1.18%  postgres  postgres           [.] AllocSetAlloc
>> ...
>>
>> So, on this test, a lot of time is wasted spinning on the mutex of
>> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
>> surprisingly steep drop in performance once you go beyond 29 clients
>> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
>> is that after that point all the cores are busy, and processes start to be
>> sometimes context switched while holding the spinlock, which kills
>> performance.
>
> I have, I also used linux perf to come to this conclusion, and my
> determination was similar: a system was undergoing increasingly heavy
> load, in this case with processes >> number of processors. It was
> also a phase-change type of event: at one moment everything would be
> going great, but once a critical threshold was hit, s_lock would
> consume enormous amount of CPU time. I figured preemption while in
> the spinlock was to blame at the time, given the extreme nature.

Stop the press! I'm getting the same speedup on that 32-core box I got with
the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

 #define TAS(lock) tas(lock)

+#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
+
 static __inline__ int
 tas(volatile slock_t *lock)
 {

So, on this system, doing a non-locked test before the locked xchg
instruction while spinning, is a very good idea. That contradicts the
testing that was done earlier when the x86-64 implementation was added, as
we have this comment in the tas() implementation:

> /*
>  * On Opteron, using a non-locking test before the locking instruction
>  * is a huge loss. On EM64T, it appears to be a wash or small loss,
>  * so we needn't bother to try to distinguish the sub-architectures.
>  */

On my test system, the non-locking test is a big win. I tested this because
I was reading this article from Intel:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/

It says very explicitly that the non-locking test is a good idea:

> Spinning on volatile read vs. spin on lock attempt
>
> One common mistake made by developers developing their own spin-wait loops
> is attempting to spin on an atomic instruction instead of spinning on a
> volatile read. Spinning on a dirty read instead of attempting to acquire a
> lock consumes less time and resources. This allows an application to only
> attempt to acquire a lock only when it is free.

Now, I'm not sure what to do about this. If we put the non-locking test in
there, according to the previous testing that would be a huge loss on
Opterons. Perhaps we should just sleep earlier, ie. lower
MAX_SPINS_PER_DELAY.
That way, even if each TAS_SPIN test is very expensive, we don't spend too
much time spinning if it's really busy, or held by a process that is
sleeping.

- Heikki
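To spell out what the one-liner changes, here is a simplified sketch of a
spin loop using the proposed TAS_SPIN(). It is not the real s_lock() code
(which also has the spins_per_delay/sleep back-off), and tas() is modeled
here with the __sync_lock_test_and_set() builtin rather than the inline-asm
xchg from s_lock.h.

/*
 * Simplified sketch of the spin loop with the proposed TAS_SPIN(): the
 * first acquisition attempt still uses the locked xchg (TAS), but while
 * spinning we first do a plain read and only retry the locked instruction
 * once the lock looks free, so waiters share the cache line instead of
 * bouncing it around with locked operations.
 */
typedef unsigned char slock_t;

static inline int
tas(volatile slock_t *lock)
{
    /* stand-in for the inline-asm xchg in s_lock.h */
    return __sync_lock_test_and_set(lock, 1);
}

#define TAS(lock)       tas(lock)
#define TAS_SPIN(lock)  (*(lock) ? 1 : TAS(lock))

static void
spin_lock_sketch(volatile slock_t *lock)
{
    if (!TAS(lock))
        return;                 /* uncontended: got it on the first try */

    while (TAS_SPIN(lock))
    {
        /*
         * The dirty read failed or the xchg lost the race; keep spinning.
         * The real s_lock() adds cpu-delay and sleep back-off here.
         */
    }
}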
Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))
From: Heikki Linnakangas
On 21.05.2013 00:20, Heikki Linnakangas wrote:
> On 16.05.2013 01:08, Daniel Farina wrote:
>> On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile,
>>> when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux
>>> machine:
>>>
>>> -  64.09%  postgres  postgres           [.] tas
>>>    - tas
>>>       - 99.83% s_lock
>>>          - 53.22% LWLockAcquire
>>>             + 99.87% GetSnapshotData
>>>          - 46.78% LWLockRelease
>>>               GetSnapshotData
>>>               + GetTransactionSnapshot
>>> +   2.97%  postgres  postgres           [.] tas
>>> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
>>> +   1.44%  postgres  postgres           [.] GetSnapshotData
>>> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
>>> +   1.18%  postgres  postgres           [.] AllocSetAlloc
>>> ...
>>>
>>> So, on this test, a lot of time is wasted spinning on the mutex of
>>> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
>>> surprisingly steep drop in performance once you go beyond 29 clients
>>> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
>>> theory is that after that point all the cores are busy, and processes
>>> start to be sometimes context switched while holding the spinlock,
>>> which kills performance.
>>
>> I have, I also used linux perf to come to this conclusion, and my
>> determination was similar: a system was undergoing increasingly heavy
>> load, in this case with processes >> number of processors. It was
>> also a phase-change type of event: at one moment everything would be
>> going great, but once a critical threshold was hit, s_lock would
>> consume enormous amount of CPU time. I figured preemption while in
>> the spinlock was to blame at the time, given the extreme nature.
>
> Stop the press! I'm getting the same speedup on that 32-core box I got
> with the compare-and-swap patch, from this one-liner:
>
> --- a/src/include/storage/s_lock.h
> +++ b/src/include/storage/s_lock.h
> @@ -200,6 +200,8 @@ typedef unsigned char slock_t;
>
>  #define TAS(lock) tas(lock)
>
> +#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
> +
>  static __inline__ int
>  tas(volatile slock_t *lock)
>  {
>
> So, on this system, doing a non-locked test before the locked xchg
> instruction while spinning, is a very good idea. That contradicts the
> testing that was done earlier when the x86-64 implementation was added,
> as we have this comment in the tas() implementation:
>
>> /*
>>  * On Opteron, using a non-locking test before the locking instruction
>>  * is a huge loss. On EM64T, it appears to be a wash or small loss,
>>  * so we needn't bother to try to distinguish the sub-architectures.
>>  */
>
> On my test system, the non-locking test is a big win. I tested this
> because I was reading this article from Intel:
>
> http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/
>
> It says very explicitly that the non-locking test is a good idea:
>
>> Spinning on volatile read vs. spin on lock attempt
>>
>> One common mistake made by developers developing their own spin-wait
>> loops is attempting to spin on an atomic instruction instead of
>> spinning on a volatile read. Spinning on a dirty read instead of
>> attempting to acquire a lock consumes less time and resources. This
>> allows an application to only attempt to acquire a lock only when it
>> is free.
>
> Now, I'm not sure what to do about this.
> If we put the non-locking test in there, according to the previous testing
> that would be a huge loss on Opterons.

I think we should apply the attached patch to put a non-locked test in
TAS_SPIN() on x86_64.

Tom tested this back in 2011 on a 32-core Opteron [1] and found little
difference. And on an 80-core Xeon (160 with hyperthreading) system, he saw
a quite significant gain from it [2]. Reading the thread, I don't understand
why it wasn't applied to x86_64 back then. Maybe it was for fear of having a
negative impact on performance on old Opterons, but I strongly suspect that
the performance would be OK even on old Opterons if you only do the
non-locked test in TAS_SPIN() and not in TAS(). Back when the comment about
poor performance with a non-locking test on Opteron was written, we used the
same TAS() macro on the first acquisition attempt and while spinning.

I tried to find some sort of an authoritative guide from AMD that would say
how spinlocks should be implemented, but didn't find any. The Intel guide I
linked above says we should apply the patch. Linux uses a ticket spinlock
thingie nowadays, which is slightly different, but if I'm reading the older
pre-ticket version correctly, they used to do the same, ie. a non-locking
test while spinning, but not before the first attempt to grab the lock. Same
in FreeBSD.

So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to
master. But I would love to get feedback from people running different
x86_64 hardware.

[1] http://www.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us
[2] http://www.postgresql.org/message-id/25989.1314734752@sss.pgh.pa.us

- Heikki
Re: Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))
From: Tom Lane
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to
> master. But I would love to get feedback from people running different
> x86_64 hardware.

Surely this patch should update the existing comment at line 209? Or at
least point out that a non-locked test in TAS_SPIN is not the same as a
non-locked test in tas() itself.

Other than the commenting, I have no objection to this. I think you're
probably right that the old tests in which this looked like a bad idea were
adding the unlocked test to tas() not only the spin case.

			regards, tom lane
Re: Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))
From: Heikki Linnakangas
On 28.08.2013 20:21, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to
>> master. But I would love to get feedback from people running different
>> x86_64 hardware.
>
> Surely this patch should update the existing comment at line 209? Or at
> least point out that a non-locked test in TAS_SPIN is not the same as a
> non-locked test in tas() itself.

Committed with some comment changes. I also added a note to the 32-bit x86
implementation, suggesting we probably should do the same there; it's just
that no one has gotten around to doing the testing.

- Heikki