Thread: Better LWLocks with compare-and-swap (9.4)

Better LWLocks with compare-and-swap (9.4)

From
Heikki Linnakangas
Date:
I've been working on-and-off on the WAL-insert scaling patch. It's in
pretty good shape now, and I'll post it shortly, but one thing I noticed
is that it benefits a lot from using an atomic compare-and-swap
instruction for the contention-critical part.

I realized that we could also use compare-and-swap to make LWLocks scale
better. The LWLock struct is too large to compare-and-swap atomically,
but we can still use CAS to increment/decrement the shared/exclusive
counters, when there's no need to manipulate the wait queue. That would
help with workloads where you have a lot of CPUs, and a lot of backends
need to acquire the same lwlock in shared mode, but there's no real
contention (ie. few exclusive lockers).
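
To sketch the idea (the struct layout below is made up for
illustration; the real LWLock is laid out differently), the
shared-mode fast path boils down to a single CAS loop on one word:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical lock word: exclusive flag in the high bit, count of
 * shared holders in the low bits, all updatable with one 64-bit CAS. */
typedef struct
{
    volatile uint64_t state;
} cas_lwlock;

#define EXCLUSIVE_BIT  ((uint64_t) 1 << 63)

static bool
cas_lwlock_acquire_shared(cas_lwlock *lock)
{
    uint64_t    oldval = lock->state;

    while ((oldval & EXCLUSIVE_BIT) == 0)
    {
        /* Try to bump the shared count; no spinlock is ever taken. */
        uint64_t    seen = __sync_val_compare_and_swap(&lock->state,
                                                       oldval, oldval + 1);

        if (seen == oldval)
            return true;        /* got the lock in shared mode */
        oldval = seen;          /* state changed under us; retry */
    }
    return false;               /* exclusively held: caller must queue up */
}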

pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile,
when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux
machine:

-  64.09%  postgres  postgres           [.] tas
    - tas
       - 99.83% s_lock
          - 53.22% LWLockAcquire
             + 99.87% GetSnapshotData
          - 46.78% LWLockRelease
               GetSnapshotData
             + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
theory is that after that point all the cores are busy, and processes
start to be sometimes context switched while holding the spinlock, which
kills performance. Has anyone else seen that pattern? Curiously, I don't
see that when connecting pgbench via TCP over localhost, only when
connecting via unix domain sockets. Overall performance is higher over
unix domain sockets, so I guess the TCP layer adds some overhead,
hurting performance, and also affects scheduling somehow, making the
steep drop go away.

Using a compare-and-swap instruction in place of spinning solves the
problem (green line in attached graph). This needs some more testing
with different workloads to make sure it doesn't hurt other scenarios,
but I don't think it will. I'm also waiting for a colleague to test this
on a different machine, where he saw a similar steep drop in performance.

The attached patch is still work-in-progress. There needs to be a
configure test and fallback to spinlock if a CAS instruction is not
available. I used the gcc __sync_val_compare_and_swap() builtin
directly, that needs to be abstracted. Also, in the case that the wait
queue needs to be manipulated, the code spins on the CAS instruction,
but there is no delay mechanism like there is on a regular spinlock;
that needs to be added in somehow.
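
For reference, the builtin atomically performs "if (*ptr == oldval)
*ptr = newval" and returns the previous value, so the abstraction could
be as thin as a wrapper like this (pg_atomic_cas_u64 is just a made-up
name for illustration):

#include <stdint.h>

/* Returns the value *ptr held before the operation; the swap
 * happened iff the return value equals oldval. */
static inline uint64_t
pg_atomic_cas_u64(volatile uint64_t *ptr, uint64_t oldval, uint64_t newval)
{
    return __sync_val_compare_and_swap(ptr, oldval, newval);
}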

- Heikki

Attachment

Re: Better LWLocks with compare-and-swap (9.4)

From
Merlin Moncure
Date:
On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> The attached patch is still work-in-progress. There needs to be a configure
> test and fallback to spinlock if a CAS instruction is not available. I used
> the gcc __sync_val_compare_and_swap() builtin directly, that needs to be
> abstracted. Also, in the case that the wait queue needs to be manipulated,
> the code spins on the CAS instruction, but there is no delay mechanism like
> there is on a regular spinlock; that needs to be added in somehow.

These are really interesting results. Why is the CAS method so much
faster than TAS?  Did you see any contention?

merlin



Re: Better LWLocks with compare-and-swap (9.4)

From
Daniel Farina
Date:
On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
> I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:
>
> -  64.09%  postgres  postgres           [.] tas
>    - tas
>       - 99.83% s_lock
>          - 53.22% LWLockAcquire
>             + 99.87% GetSnapshotData
>          - 46.78% LWLockRelease
>               GetSnapshotData
>             + GetTransactionSnapshot
> +   2.97%  postgres  postgres           [.] tas
> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
> +   1.44%  postgres  postgres           [.] GetSnapshotData
> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
> +   1.18%  postgres  postgres           [.] AllocSetAlloc
> ...
>
> So, on this test, a lot of time is wasted spinning on the mutex of
> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
> surprisingly steep drop in performance once you go beyond 29 clients
> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
> is that after that point all the cores are busy, and processes start to be
> sometimes context switched while holding the spinlock, which kills
> performance.

I have. I also used Linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes >> number of processors.  It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time.  I figured preemption while in
the spinlock was to blame at the time, given the extreme nature.



Re: Better LWLocks with compare-and-swap (9.4)

From
Daniel Farina
Date:
On Wed, May 15, 2013 at 3:08 PM, Daniel Farina <daniel@heroku.com> wrote:
> On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
>> I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:
>>
>> -  64.09%  postgres  postgres           [.] tas
>>    - tas
>>       - 99.83% s_lock
>>          - 53.22% LWLockAcquire
>>             + 99.87% GetSnapshotData
>>          - 46.78% LWLockRelease
>>               GetSnapshotData
>>             + GetTransactionSnapshot
>> +   2.97%  postgres  postgres           [.] tas
>> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
>> +   1.44%  postgres  postgres           [.] GetSnapshotData
>> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
>> +   1.18%  postgres  postgres           [.] AllocSetAlloc
>> ...
>>
>> So, on this test, a lot of time is wasted spinning on the mutex of
>> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
>> surprisingly steep drop in performance once you go beyond 29 clients
>> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
>> is that after that point all the cores are busy, and processes start to be
>> sometimes context switched while holding the spinlock, which kills
>> performance.

I accidentally dropped some important words from the end of Heikki's
mail, which makes my correspondence pretty bizarre to understand at
the outset.  Apologies.  He wrote:

>> [...] Has anyone else seen that pattern?

> I have. I also used Linux perf to come to this conclusion, and my
> determination was similar: a system was undergoing increasingly heavy
> load, in this case with processes >> number of processors.  It was
> also a phase-change type of event: at one moment everything would be
> going great, but once a critical threshold was hit, s_lock would
> consume an enormous amount of CPU time.  I figured preemption while in
> the spinlock was to blame at the time, given the extreme nature.



Re: Better LWLocks with compare-and-swap (9.4)

From
Stephen Frost
Date:
* Heikki Linnakangas (hlinnakangas@vmware.com) wrote:
> My theory is that after that point all the cores are busy,
> and processes start to be sometimes context switched while holding
> the spinlock, which kills performance. Has anyone else seen that
> pattern?

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above?  Have you seen those
threads in the past?  Any thoughts about moving in that direction?

> Curiously, I don't see that when connecting pgbench via TCP
> over localhost, only when connecting via unix domain sockets.
> Overall performance is higher over unix domain sockets, so I guess
> the TCP layer adds some overhead, hurting performance, and also
> affects scheduling somehow, making the steep drop go away.

I wonder if the kernel's locking around unix domain sockets is helping
us out here, while it's not able to take advantage of such knowledge
about the waiting process when it's a TCP connection?  Just a hunch.
Thanks,
    Stephen

Re: Better LWLocks with compare-and-swap (9.4)

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> Isn't this the same issue which has prompted multiple people to propose
> (sometimes with code, as I recall) to rip out our internal spinlock
> system and replace it with kernel-backed calls which do it better,
> specifically by dealing with issues like the above?  Have you seen those
> threads in the past?  Any thoughts about moving in that direction?

All of the proposals of that sort that I've seen had a flavor of
"my OS is the only one that matters".  While I don't object to
platform-dependent implementations of spinlocks as such, they're not
much of a cure for a generic performance issue.
        regards, tom lane



Re: Better LWLocks with compare-and-swap (9.4)

From
"Dickson S. Guedes"
Date:
On 13-05-2013 09:50, Heikki Linnakangas wrote:
> I've been working on-and-off on the WAL-insert scaling patch. It's
> in pretty good shape now, and I'll post it shortly, but one thing I
> noticed is that it benefits a lot from using an atomic
> compare-and-swap instruction for the contention-critical part.
> 
> I realized that we could also use compare-and-swap to make LWLocks
> scale better. The LWLock struct is too large to compare-and-swap
> atomically, but we can still use CAS to increment/decrement the
> shared/exclusive counters, when there's no need to manipulate the
> wait queue. That would help with workloads where you have a lot of
> CPUs, and a lot of backends need to acquire the same lwlock in
> shared mode, but there's no real contention (ie. few exclusive
> lockers).
> 
> pgbench -S is such a workload. With 9.3beta1, I'm seeing this
> profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
> 32-core Linux machine:
> 
> -  64.09%  postgres  postgres           [.] tas
>     - tas
>        - 99.83% s_lock
>           - 53.22% LWLockAcquire
>              + 99.87% GetSnapshotData
>           - 46.78% LWLockRelease
>                GetSnapshotData
>              + GetTransactionSnapshot
> +   2.97%  postgres  postgres           [.] tas
> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
> +   1.44%  postgres  postgres           [.] GetSnapshotData
> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
> +   1.18%  postgres  postgres           [.] AllocSetAlloc
> ...

I'd like to test this here but I couldn't reproduce that perf output
on 64-core or 24-core machines. Could you post the changes to
postgresql.conf and the perf arguments that you used?

Thanks!

[]s
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://github.net/guedes - twitter: @guediz



Re: Better LWLocks with compare-and-swap (9.4)

From
Heikki Linnakangas
Date:
On 18.05.2013 03:52, Dickson S. Guedes wrote:
>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this
>> profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
>> 32-core Linux machine:
>>
>> -  64.09%  postgres  postgres           [.] tas
>>    - tas
>>       - 99.83% s_lock
>>          - 53.22% LWLockAcquire
>>             + 99.87% GetSnapshotData
>>          - 46.78% LWLockRelease
>>               GetSnapshotData
>>             + GetTransactionSnapshot
>> +   2.97%  postgres  postgres           [.] tas
>> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
>> +   1.44%  postgres  postgres           [.] GetSnapshotData
>> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
>> +   1.18%  postgres  postgres           [.] AllocSetAlloc
>> ...
>
> I'd like to test this here but I couldn't reproduce that perf output
> on 64-core or 24-core machines. Could you post the changes to
> postgresql.conf and the perf arguments that you used?

Sure, here are the non-default postgresql.conf settings:

             name             |             current_setting
------------------------------+------------------------------------------
 application_name             | psql
 autovacuum                   | off
 checkpoint_segments          | 70
 config_file                  | /home/hlinnakangas/data/postgresql.conf
 data_directory               | /home/hlinnakangas/data
 default_text_search_config   | pg_catalog.english
 hba_file                     | /home/hlinnakangas/data/pg_hba.conf
 ident_file                   | /home/hlinnakangas/data/pg_ident.conf
 lc_collate                   | en_US.UTF-8
 lc_ctype                     | en_US.UTF-8
 log_timezone                 | US/Pacific
 logging_collector            | on
 max_connections              | 100
 max_stack_depth              | 2MB
 server_encoding              | UTF8
 shared_buffers               | 2GB
 synchronous_commit           | off
 TimeZone                     | US/Pacific
 transaction_deferrable       | off
 transaction_isolation        | read committed
 transaction_read_only        | off
 wal_buffers                  | 16MB

While pgbench was running, I ran this:

perf record -p 6050 -g -e cpu-clock

to connect to one of the backends. (I used cpu-clock, because the 
default cpu-cycles event didn't work on the box)

- Heikki



Re: Better LWLocks with compare-and-swap (9.4)

From
"Dickson S. Guedes"
Date:
On Sun, 2013-05-19 at 09:29 +0300, Heikki Linnakangas wrote:
> On 18.05.2013 03:52, Dickson S. Guedes wrote:
> >> pgbench -S is such a workload. With 9.3beta1, I'm seeing this
> >> profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
> >> 32-core Linux machine:
> >>
> >> -  64.09%  postgres  postgres           [.] tas
> >>    - tas
> >>       - 99.83% s_lock
> >>          - 53.22% LWLockAcquire
> >>             + 99.87% GetSnapshotData
> >>          - 46.78% LWLockRelease
> >>               GetSnapshotData
> >>             + GetTransactionSnapshot
> >> +   2.97%  postgres  postgres           [.] tas
> >> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
> >> +   1.44%  postgres  postgres           [.] GetSnapshotData
> >> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
> >> +   1.18%  postgres  postgres           [.] AllocSetAlloc
> >> ...
> >
> > I'd like to test this here but I couldn't reproduce that perf output
> > on 64-core or 24-core machines. Could you post the changes to
> > postgresql.conf and the perf arguments that you used?
>
> Sure, here are the non-default postgresql.conf settings:


Thank you for your information.


> While pgbench was running, I ran this:
>
> perf record -p 6050 -g -e cpu-clock
>
> to connect to one of the backends. (I used cpu-clock, because the
> default cpu-cycles event didn't work on the box)


Hum, I was supposing that I was doing something wrong, but I'm getting
the same result as before even using your test case, and my results
are still different from yours:


+ 71,27% postgres postgres         [.] AtEOXact_Buffers
+  7,67% postgres postgres         [.] AtEOXact_CatCache
+  6,30% postgres postgres         [.] AllocSetCheck
+  5,34% postgres libc-2.12.so     [.] __mcount_internal
+  2,14% postgres [kernel.kallsyms][k] activate_page


It's a 64-core machine with PGDATA in a SAN.

vendor_id       : GenuineIntel
cpu family      : 6
model           : 47
model name      : Intel(R) Xeon(R) CPU E7- 4830  @ 2.13GHz
stepping        : 2
cpu MHz         : 1064.000
cache size      : 24576 KB
physical id     : 3
siblings        : 16
core id         : 24
cpu cores       : 8
apicid          : 241
initial apicid  : 241
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx
smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes
lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4255.87
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:


Would you like me to test some other configuration to try to simulate
the expected workload?


[]s
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A

Re: Better LWLocks with compare-and-swap (9.4)

From
Andres Freund
Date:
On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:
> Hum, I was supposing that I was doing something wrong, but I'm getting
> the same result as before even using your test case, and my results
> are still different from yours:
> 
> 
> + 71,27% postgres postgres         [.] AtEOXact_Buffers
> +  7,67% postgres postgres         [.] AtEOXact_CatCache
> +  6,30% postgres postgres         [.] AllocSetCheck
> +  5,34% postgres libc-2.12.so     [.] __mcount_internal
> +  2,14% postgres [kernel.kallsyms][k] activate_page

That looks like you have configured with --enable-cassert and probably
also --enable-profiling? The former will give completely distorted
performance results...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Better LWLocks with compare-and-swap (9.4)

From
"Dickson S. Guedes"
Date:
On Mon, 2013-05-20 at 14:35 +0200, Andres Freund wrote:
> On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:
> > Hum, I was supposing that I was doing something wrong, but I'm getting
> > the same result as before even using your test case, and my results
> > are still different from yours:
> >
> >
> > + 71,27% postgres postgres         [.] AtEOXact_Buffers
> > +  7,67% postgres postgres         [.] AtEOXact_CatCache
> > +  6,30% postgres postgres         [.] AllocSetCheck
> > +  5,34% postgres libc-2.12.so     [.] __mcount_internal
> > +  2,14% postgres [kernel.kallsyms][k] activate_page
>
> That looks like you have configured with --enable-cassert and probably
> also --enable-profiling? The former will give completely distorted
> performance results...


Ah! Wrong PATH, so wrong binaries. Thanks Andres.


--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A

Re: Better LWLocks with compare-and-swap (9.4)

From
Heikki Linnakangas
Date:
On 13.05.2013 17:21, Merlin Moncure wrote:
> On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>> The attached patch is still work-in-progress. There needs to be a configure
>> test and fallback to spinlock if a CAS instruction is not available. I used
>> the gcc __sync_val_compare_and_swap() builtin directly, that needs to be
>> abstracted. Also, in the case that the wait queue needs to be manipulated,
>> the code spins on the CAS instruction, but there is no delay mechanism like
>> there is on a regular spinlock; that needs to be added in somehow.
>
> These are really interesting results. Why is the CAS method so much
> faster than TAS?

If my theory is right, the steep fall is caused by backends being
sometimes context switched while holding the spinlock. The CAS method
alleviates that, because the lock is never "held" by anyone. With a
spinlock, if a process gets context switched while holding the lock,
everyone else will have to wait until the process gets context switched
back, and releases the spinlock. With the CAS method, if a process gets
switched in the CAS-loop, it doesn't hurt others.
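
Schematically (assuming a hypothetical packed "state" word; the real
struct members differ):

/* Spinlock method: the lock word stays set for the whole critical
 * section, so a backend preempted between acquire and release stalls
 * every other backend that needs the lwlock. */
SpinLockAcquire(&lwlock->mutex);
lwlock->shared++;
SpinLockRelease(&lwlock->mutex);

/* CAS method: the update is a single atomic step, so nothing is ever
 * "held", and a backend preempted inside the loop delays nobody. */
uint64_t    old;

do {
    old = lwlock->state;
} while (__sync_val_compare_and_swap(&lwlock->state, old, old + 1) != old);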

With the patch, the CAS-loop still spins, just like a spinlock, when the
wait queue has to be manipulated. So when that happens, it will behave
more or less like the spinlock implementation. I'm not too concerned
about optimizing that further; sleeping and signaling sleeping
backends are heavy-weight operations anyway.

> Did you see any contention?

With the current spinlock implementation? Yeah, per the profile I
posted, a lot of CPU time was spent spinning, which is a sign of
contention. I didn't see contention with the CAS implementation.

Attached is a new version of the patch. I cleaned it up a lot, and added
a configure test, and fallback to the good old spinlock implementation
if compare-and-swap is not available. Also, if the CAS loop needs to
spin, like a spinlock, it now sleeps after a while and updates
"spins_per_delay" like a regular spinlock.

- Heikki

Attachment

Re: Better LWLocks with compare-and-swap (9.4)

From
Bruce Momjian
Date:
On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > Isn't this the same issue which has prompted multiple people to propose
> > (sometimes with code, as I recall) to rip out our internal spinlock
> > system and replace it with kernel-backed calls which do it better,
> > specifically by dealing with issues like the above?  Have you seen those
> > threads in the past?  Any thoughts about moving in that direction?
> 
> All of the proposals of that sort that I've seen had a flavor of
> "my OS is the only one that matters".  While I don't object to
> platform-dependent implementations of spinlocks as such, they're not
> much of a cure for a generic performance issue.

Uh, is this an x86-64-only optimization?  Seems so.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Better LWLocks with compare-and-swap (9.4)

From
Alvaro Herrera
Date:
> diff --git a/configure.in b/configure.in
> index 4ea5699..ff8470e 100644
> --- a/configure.in
> +++ b/configure.in
> @@ -1445,17 +1445,6 @@ fi
>  AC_CHECK_FUNCS([strtoll strtoq], [break])
>  AC_CHECK_FUNCS([strtoull strtouq], [break])
>
> -AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
> -[AC_TRY_LINK([],
> -  [int lock = 0;
> -   __sync_lock_test_and_set(&lock, 1);
> -   __sync_lock_release(&lock);],
> -  [pgac_cv_gcc_int_atomics="yes"],
> -  [pgac_cv_gcc_int_atomics="no"])])
> -if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
> -  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
> -fi
> -

Careful here --- s_lock.h has some code conditional on
HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
removing the definition, unless I'm misreading.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Better LWLocks with compare-and-swap (9.4)

From
Heikki Linnakangas
Date:
On 20.05.2013 23:01, Bruce Momjian wrote:
> On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
>> Stephen Frost<sfrost@snowman.net>  writes:
>>> Isn't this the same issue which has prompted multiple people to propose
>>> (sometimes with code, as I recall) to rip out our internal spinlock
>>> system and replace it with kernel-backed calls which do it better,
>>> specifically by dealing with issues like the above?  Have you seen those
>>> threads in the past?  Any thoughts about moving in that direction?
>>
>> All of the proposals of that sort that I've seen had a flavor of
>> "my OS is the only one that matters".  While I don't object to
>> platform-dependent implementations of spinlocks as such, they're not
>> much of a cure for a generic performance issue.
>
> Uh, is this an x86-64-only optimization?  Seems so.

All modern architectures have an atomic compare-and-swap instruction (or 
something more powerful that can be used to implement it). That includes 
x86, x86-64, ARM, PowerPC, among others.

There are some differences in how wide values can be swapped with it; 
386 only supported 32-bit, until Pentium, which added a 64-bit variant. 
I used the 64-bit variant in the patch, but for lwlocks, we could 
actually get away with the 32-bit variant if we packed the booleans and 
the shared lock counter more tightly.
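
To make that concrete, a 32-bit packing could look something like this
(a hypothetical layout for illustration, not what the patch does):

#include <stdint.h>

#define LW_EXCLUSIVE_BIT    ((uint32_t) 1 << 31)        /* exclusive holder */
#define LW_HAS_WAITERS_BIT  ((uint32_t) 1 << 30)        /* wait queue nonempty */
#define LW_SHARED_MASK      (((uint32_t) 1 << 30) - 1)  /* shared-holder count */

/* One shared acquisition is just state + 1, still a single 32-bit CAS. */
static inline int
lw_shared_holders(uint32_t state)
{
    return (int) (state & LW_SHARED_MASK);
}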

- Heikki



Re: Better LWLocks with compare-and-swap (9.4)

From
Heikki Linnakangas
Date:
On 20.05.2013 23:11, Alvaro Herrera wrote:
>> diff --git a/configure.in b/configure.in
>> index 4ea5699..ff8470e 100644
>> --- a/configure.in
>> +++ b/configure.in
>> @@ -1445,17 +1445,6 @@ fi
>>   AC_CHECK_FUNCS([strtoll strtoq], [break])
>>   AC_CHECK_FUNCS([strtoull strtouq], [break])
>>
>> -AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
>> -[AC_TRY_LINK([],
>> -  [int lock = 0;
>> -   __sync_lock_test_and_set(&lock, 1);
>> -   __sync_lock_release(&lock);],
>> -  [pgac_cv_gcc_int_atomics="yes"],
>> -  [pgac_cv_gcc_int_atomics="no"])])
>> -if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
>> -  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
>> -fi
>> -
>
> Careful here --- s_lock.h has some code conditional on
> HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
> removing the definition, unless I'm misreading.

Thanks, good catch. I renamed HAVE_GCC_INT_ATOMICS to 
HAVE_GCC_INT_TEST_AND_SET because "atomics" seems too generic when we 
also test for __sync_val_compare_and_swap(p, oldval, newval).

- Heikki



Re: Better LWLocks with compare-and-swap (9.4)

From
Bruce Momjian
Date:
On Mon, May 20, 2013 at 11:16:41PM +0300, Heikki Linnakangas wrote:
> On 20.05.2013 23:01, Bruce Momjian wrote:
> >On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
> >>Stephen Frost<sfrost@snowman.net>  writes:
> >>>Isn't this the same issue which has prompted multiple people to propose
> >>>(sometimes with code, as I recall) to rip out our internal spinlock
> >>>system and replace it with kernel-backed calls which do it better,
> >>>specifically by dealing with issues like the above?  Have you seen those
> >>>threads in the past?  Any thoughts about moving in that direction?
> >>
> >>All of the proposals of that sort that I've seen had a flavor of
> >>"my OS is the only one that matters".  While I don't object to
> >>platform-dependent implementations of spinlocks as such, they're not
> >>much of a cure for a generic performance issue.
> >
> >Uh, is this an x86-64-only optimization?  Seems so.
> 
> All modern architectures have an atomic compare-and-swap instruction
> (or something more powerful that can be used to implement it). That
> includes x86, x86-64, ARM, PowerPC, among others.
> 
> There are some differences in how wide values can be swapped with
> it; 386 only supported 32-bit, until Pentium, which added a 64-bit
> variant. I used the 64-bit variant in the patch, but for lwlocks, we
> could actually get away with the 32-bit variant if we packed the
> booleans and the shared lock counter more tightly.

So we are going to need to add this kind of assembly language
optimization for every CPU type?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Better LWLocks with compare-and-swap (9.4)

From
Heikki Linnakangas
Date:
On 16.05.2013 01:08, Daniel Farina wrote:
> On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
>> I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:
>>
>> -  64.09%  postgres  postgres           [.] tas
>>     - tas
>>        - 99.83% s_lock
>>           - 53.22% LWLockAcquire
>>              + 99.87% GetSnapshotData
>>           - 46.78% LWLockRelease
>>                GetSnapshotData
>>              + GetTransactionSnapshot
>> +   2.97%  postgres  postgres           [.] tas
>> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
>> +   1.44%  postgres  postgres           [.] GetSnapshotData
>> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
>> +   1.18%  postgres  postgres           [.] AllocSetAlloc
>> ...
>>
>> So, on this test, a lot of time is wasted spinning on the mutex of
>> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
>> surprisingly steep drop in performance once you go beyond 29 clients
>> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
>> is that after that point all the cores are busy, and processes start to be
>> sometimes context switched while holding the spinlock, which kills
>> performance.
>
> I have. I also used Linux perf to come to this conclusion, and my
> determination was similar: a system was undergoing increasingly heavy
> load, in this case with processes >> number of processors.  It was
> also a phase-change type of event: at one moment everything would be
> going great, but once a critical threshold was hit, s_lock would
> consume an enormous amount of CPU time.  I figured preemption while in
> the spinlock was to blame at the time, given the extreme nature.

Stop the press! I'm getting the same speedup on that 32-core box I got 
with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;
 
 #define TAS(lock) tas(lock)
 
+#define TAS_SPIN(lock)    (*(lock) ? 1 : TAS(lock))
+
 static __inline__ int
 tas(volatile slock_t *lock)
 {

So, on this system, doing a non-locked test before the locked xchg 
instruction while spinning, is a very good idea. That contradicts the 
testing that was done earlier when the x86-64 implementation was added, 
as we have this comment in the tas() implementation:

>     /*
>      * On Opteron, using a non-locking test before the locking instruction
>      * is a huge loss.  On EM64T, it appears to be a wash or small loss,
>      * so we needn't bother to try to distinguish the sub-architectures.
>      */

On my test system, the non-locking test is a big win. I tested this 
because I was reading this article from Intel:


http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/.

It says very explicitly that the non-locking test is a good idea:

> Spinning on volatile read vs. spin on lock attempt
>
> One common mistake made by developers developing their own spin-wait
> loops is attempting to spin on an atomic instruction instead of
> spinning on a volatile read. Spinning on a dirty read instead of
> attempting to acquire a lock consumes less time and resources. This
> allows an application to only attempt to acquire a lock only when it
> is free.
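
In macro terms, the change is just that the wait loop re-reads the lock
word before retrying the atomic operation; schematically:

/* TAS_SPIN does a plain load first, which spins in the local cache
 * line, and issues the bus-locking xchg only when the lock looks free.
 * SPIN_DELAY() is the existing per-architecture delay hint. */
while (TAS_SPIN(lock))
    SPIN_DELAY();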
 

Now, I'm not sure what to do about this. If we put the non-locking test 
in there, according to the previous testing that would be a huge loss on 
Opterons.

Perhaps we should just sleep earlier, ie. lower MAX_SPINS_PER_DELAY. 
That way, even if each TAS_SPIN test is very expensive, we don't spend 
too much time spinning if it's really busy, or held by a process that is 
sleeping.

- Heikki



Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))

From
Heikki Linnakangas
Date:
On 21.05.2013 00:20, Heikki Linnakangas wrote:
> On 16.05.2013 01:08, Daniel Farina wrote:
>> On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this
>>> profile, when
>>> I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux
>>> machine:
>>>
>>> - 64.09% postgres postgres [.] tas
>>> - tas
>>> - 99.83% s_lock
>>> - 53.22% LWLockAcquire
>>> + 99.87% GetSnapshotData
>>> - 46.78% LWLockRelease
>>> GetSnapshotData
>>> + GetTransactionSnapshot
>>> + 2.97% postgres postgres [.] tas
>>> + 1.53% postgres libc-2.13.so [.] 0x119873
>>> + 1.44% postgres postgres [.] GetSnapshotData
>>> + 1.29% postgres [kernel.kallsyms] [k] arch_local_irq_enable
>>> + 1.18% postgres postgres [.] AllocSetAlloc
>>> ...
>>>
>>> So, on this test, a lot of time is wasted spinning on the mutex of
>>> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
>>> surprisingly steep drop in performance once you go beyond 29 clients
>>> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
>>> theory
>>> is that after that point all the cores are busy, and processes start
>>> to be
>>> sometimes context switched while holding the spinlock, which kills
>>> performance.
>>
>> I have. I also used Linux perf to come to this conclusion, and my
>> determination was similar: a system was undergoing increasingly heavy
>> load, in this case with processes >> number of processors. It was
>> also a phase-change type of event: at one moment everything would be
>> going great, but once a critical threshold was hit, s_lock would
>> consume an enormous amount of CPU time. I figured preemption while in
>> the spinlock was to blame at the time, given the extreme nature.
>
> Stop the press! I'm getting the same speedup on that 32-core box I got
> with the compare-and-swap patch, from this one-liner:
>
> --- a/src/include/storage/s_lock.h
> +++ b/src/include/storage/s_lock.h
> @@ -200,6 +200,8 @@ typedef unsigned char slock_t;
>
> #define TAS(lock) tas(lock)
>
> +#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
> +
> static __inline__ int
> tas(volatile slock_t *lock)
> {
>
> So, on this system, doing a non-locked test before the locked xchg
> instruction while spinning, is a very good idea. That contradicts the
> testing that was done earlier when the x86-64 implementation was added,
> as we have this comment in the tas() implementation:
>
>> /*
>> * On Opteron, using a non-locking test before the locking instruction
>> * is a huge loss. On EM64T, it appears to be a wash or small loss,
>> * so we needn't bother to try to distinguish the sub-architectures.
>> */
>
> On my test system, the non-locking test is a big win. I tested this
> because I was reading this article from Intel:
>
>
> http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/.
> It says very explicitly that the non-locking test is a good idea:
>
>> Spinning on volatile read vs. spin on lock attempt
>>
>> One common mistake made by developers developing their own spin-wait
>> loops is attempting to spin on an atomic instruction instead of
>> spinning on a volatile read. Spinning on a dirty read instead of
>> attempting to acquire a lock consumes less time and resources. This
>> allows an application to only attempt to acquire a lock only when it
>> is free.
>
> Now, I'm not sure what to do about this. If we put the non-locking test
> in there, according to the previous testing that would be a huge loss on
> Opterons.

I think we should apply the attached patch to put a non-locked test in
TAS_SPIN() on x86_64.

Tom tested this back in 2011 on a 32-core Opteron [1] and found little
difference. And on an 80-core Xeon (160 with hyperthreading) system, he
saw a quite significant gain from it [2]. Reading the thread, I don't
understand why it wasn't applied to x86_64 back then. Maybe it was for
the fear of having a negative impact on performance on old Opterons, but
I strongly suspect that the performance would be OK even on old Opterons
if you only do the non-locked test in TAS_SPIN() and not in TAS(). Back
when the comment about poor performance with a non-locking test on
Opteron was written, we used the same TAS() macro for the first attempt
and while spinning.

I tried to find some sort of an authoritative guide from AMD that would
say how spinlocks should be implemented, but didn't find any. The Intel
guide I linked above says we should apply the patch. Linux uses a ticket
spinlock thingie nowadays, which is slightly different, but if I'm
reading the older pre-ticket version correctly, they used to do the
same. Ie. non-locking test while spinning, but not before the first
attempt to grab the lock. Same in FreeBSD.

So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to
master. But I would love to get feedback from people running different
x86_64 hardware.

[1] http://www.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us
[2] http://www.postgresql.org/message-id/25989.1314734752@sss.pgh.pa.us

- Heikki

Attachment

Re: Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to 
> master. But I would love to get feedback from people running different 
> x86_64 hardware.

Surely this patch should update the existing comment at line 209?  Or at
least point out that a non-locked test in TAS_SPIN is not the same as a
non-locked test in tas() itself.

Other than the commenting, I have no objection to this.  I think you're
probably right that the old tests in which this looked like a bad idea
were adding the unlocked test to tas() not only the spin case.
        regards, tom lane



Re: Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))

From
Heikki Linnakangas
Date:
On 28.08.2013 20:21, Tom Lane wrote:
> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>> So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to
>> master. But I would love to get feedback from people running different
>> x86_64 hardware.
>
> Surely this patch should update the existing comment at line 209?  Or at
> least point out that a non-locked test in TAS_SPIN is not the same as a
> non-locked test in tas() itself.

Committed with some comment changes. I also added a note to the 32-bit
x86 implementation, suggesting we probably should do the same there;
it's just that no-one has gotten around to doing the testing.

- Heikki