Thread: s_lock() seems too aggressive for machines with many sockets

s_lock() seems too aggressive for machines with many sockets

From
Jan Wieck
Date:
Hi,

I think I may have found one of the problems PostgreSQL has on machines
with many NUMA nodes. I am not yet sure what exactly happens on the NUMA
bus, but there seems to be a tipping point at which the spinlock
concurrency wreaks havoc and the performance of the database collapses.

On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total,
a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput
then takes a very sharp dive and reaches around 20,000 TPS at 120
clients. It never recovers from there.

The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".
The 8 socket machine in question scales to over 350,000 TPS.

The patch is meant to demonstrate this effect only. It has a negative
performance impact on smaller machines and client counts < #cores, so
the real solution will probably look much different. But I thought it
would be good to share this and start the discussion about reevaluating
the spinlock code before PGCon.


Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


Re: s_lock() seems too aggressive for machines with many sockets

From
Andres Freund
Date:
On 2015-06-10 09:18:56 -0400, Jan Wieck wrote:
> On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a
> pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then
> takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It
> never recovers from there.

85k? Phew, that's pretty bad. What exact type of CPU is this? Which
pgbench scale? Did you use -M prepared?

Could you share a call graph perf profile?

> The attached patch demonstrates that less aggressive spinning and
> (much) more often delaying improves the performance "on this type of
> machine". The 8 socket machine in question scales to over 350,000 TPS.

Even that seems quite low. I've gotten over 500k TPS on a four socket
x86 machine, and about 700k on a 8 socket x86 machine.

Maybe we need to adjust the amount of spinning, but to me such drastic
differences are a hint that we should tackle the actual contention
point. Often a spinlock for something regularly heavily contended can be
worse than a queued lock.

Greetings,

Andres Freund



Re: s_lock() seems too aggressive for machines with many sockets

From
Bruce Momjian
Date:
On Wed, Jun 10, 2015 at 09:18:56AM -0400, Jan Wieck wrote:
> The attached patch demonstrates that less aggressive spinning and
> (much) more often delaying improves the performance "on this type of
> machine". The 8 socket machine in question scales to over 350,000
> TPS.
> 
> The patch is meant to demonstrate this effect only. It has a
> negative performance impact on smaller machines and client counts <
> #cores, so the real solution will probably look much different. But
> I thought it would be good to share this and start the discussion
> about reevaluating the spinlock code before PGCon.

Wow, you are in that code!  We kind of guessed on some of those
constants many years ago, and never revisited it.  It would be nice to
get a better heuristic for those, but I am concerned it would require
the user to specify the number of CPU cores.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: s_lock() seems too aggressive for machines with many sockets

From
Jan Wieck
Date:
On 06/10/2015 09:28 AM, Andres Freund wrote:
> On 2015-06-10 09:18:56 -0400, Jan Wieck wrote:
>> On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a
>> pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then
>> takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It
>> never recovers from there.
>
> 85k? Phew, that's pretty bad. What exact type of CPU is this? Which
> pgbench scale? Did you use -M prepared?

model name      : Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz

numactl --hardware shows the distance to the attached memory as 10, the 
distance to every other node as 21. I interpret that as the machine 
having one NUMA bus with all cpu packages attached to that, rather than 
individual connections from cpu to cpu or something different.

pgbench scale=300, -Msimple.
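For reference, an invocation of the kind described looks roughly like this
(database name, client count and run length here are placeholders, not the
exact values used):

    pgbench -i -s 300 pgbench
    pgbench -S -M simple -c 120 -j 120 -T 300 pgbench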

>
> Could you share a call graph perf profile?

I do not have them handy at the moment and the machine is in use for 
something else until tomorrow. I will forward perf and systemtap based 
graphs ASAP.

What led me into that spinlock area was the fact that a wall clock based 
systemtap FlameGraph showed a large portion of the time spent in
BufferPin() and BufferUnpin().

>
>> The attached patch demonstrates that less aggressive spinning and
>> (much) more often delaying improves the performance "on this type of
>> machine". The 8 socket machine in question scales to over 350,000 TPS.
>
> Even that seems quite low. I've gotten over 500k TPS on a four socket
> x86 machine, and about 700k on a 8 socket x86 machine.

There is more wrong with the machine in question than just that. But for 
the moment I am satisfied with having a machine where I can reproduce 
this phenomenon in what appears to be a worst case.

>
> Maybe we need to adjust the amount of spinning, but to me such drastic
> differences are a hint that we should tackle the actual contention
> point. Often a spinlock for something regularly heavily contended can be
> worse than a queued lock.

I have the impression that the code assumes that there is little penalty 
for accessing the shared byte in a tight loop from any number of cores 
in parallel. That apparently is true for some architectures and core 
counts, but no longer holds for these machines with many sockets.


Regards, Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info



Re: s_lock() seems too aggressive for machines with many sockets

From
Andres Freund
Date:
Hi,

On 2015-06-10 09:54:00 -0400, Jan Wieck wrote:
> model name      : Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz

> numactl --hardware shows the distance to the attached memory as 10, the
> distance to every other node as 21. I interpret that as the machine having
> one NUMA bus with all cpu packages attached to that, rather than individual
> connections from cpu to cpu or something different.

Generally that doesn't say very much - IIRC the distances are defined by
the BIOS.

> What led me into that spinlock area was the fact that a wall clock based
> systemtap FlameGraph showed a large portion of the time spent in
> BufferPin() and BufferUnpin().

I've seen that as a bottleneck in the past as well. My plan to fix that
is to "simply" make buffer pinning lockless for the majority of cases. I
don't have access to hardware to test that at higher node counts atm
though.

My guess is that the pins are on the btree root pages. But it'd be good
to confirm that.

> >Maybe we need to adjust the amount of spinning, but to me such drastic
> >differences are a hint that we should tackle the actual contention
> >point. Often a spinlock for something regularly heavily contended can be
> >worse than a queued lock.
> 
> I have the impression that the code assumes that there is little penalty for
> accessing the shared byte in a tight loop from any number of cores in
> parallel. That apparently is true for some architectures and core counts,
> but no longer holds for these machines with many sockets.

It's just generally a tradeoff. It's beneficial to spin longer if
there's only mild amounts of contention. If the likelihood of getting
the spinlock soon is high (i.e. existing, but low contention), it'll
nearly always be beneficial to spin. If the likelihood is low, it'll be
mostly beneficial to sleep.  The latter is especially true if a machine
is sufficiently overcommitted that it's likely that it'll sleep while
holding a spinlock. The danger of sleeping while holding a spinlock,
without targeted wakeup, is why spinlocks in userspace aren't a really
good idea.

My bet is that if you'd measure using different number iterations for
different spinlocks you'd find some where the higher number of
iterations is rather beneficial as well.

Greetings,

Andres Freund



Re: s_lock() seems too aggressive for machines with many sockets

From
Nils Goroll
Date:
On larger Linux machines, we have been running with spin locks replaced by
generic posix mutexes for years now. I personally haven't looked at the code for
ages, but we maintain a patch which pretty much does the same thing still:

Ref: http://www.postgresql.org/message-id/4FEDE0BF.7080203@schokola.de

I understand that there are systems out there which have less efficient posix
mutex implementations than Linux (which uses futexes), but I think it would
still be worth considering to do away with the roll-your-own spinlocks on
systems whose posix mutexes are known to behave.

Nils




Re: s_lock() seems too aggressive for machines with many sockets

From
Nils Goroll
Date:
On 10/06/15 16:05, Andres Freund wrote:
> it'll nearly always be beneficial to spin

Trouble is that postgres cannot know if the process holding the lock actually
does run, so if it doesn't, all we're doing is burn cycles and make the problem
worse.

Contrary to that, the kernel does know, so for a (f|m)utex which fails to
acquire immediately and thus needs to syscall, the kernel has the option to spin
only if the lock holder is running (the "adaptive" mutex).
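
For illustration, requesting such an adaptive mutex from glibc looks roughly
like this (a minimal sketch; PTHREAD_MUTEX_ADAPTIVE_NP is a non-portable GNU
extension, and error handling is omitted):

#define _GNU_SOURCE
#include <pthread.h>

static pthread_mutex_t lock;

static void
init_adaptive_lock(void)
{
    pthread_mutexattr_t attr;

    /* ask for a mutex that spins briefly before sleeping in the kernel */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
}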

Nils



Re: s_lock() seems too aggressive for machines with many sockets

From
Jan Wieck
Date:
On 06/10/2015 10:07 AM, Nils Goroll wrote:
> On larger Linux machines, we have been running with spin locks replaced by
> generic posix mutexes for years now. I personally haven't looked at the code for
> ages, but we maintain a patch which pretty much does the same thing still:
>
> Ref: http://www.postgresql.org/message-id/4FEDE0BF.7080203@schokola.de
>
> I understand that there are systems out there which have less efficient posix
> mutex implementations than Linux (which uses futexes), but I think it would
> still be worth considering to do away with the roll-your-own spinlocks on
> systems whose posix mutexes are known to behave.

I have played with test code that isolates a stripped down version of 
s_lock() and uses it with multiple threads. I then implemented multiple 
different versions of that s_lock(). The results with 200 concurrent 
threads are that using a __sync_val_compare_and_swap() to acquire the 
lock and then falling back to a futex() is limited to about 500,000 
locks/second. Spinning 10 times and then doing a usleep(1000) (one 
millisecond) gives me 25 million locks/second.

Note that the __sync_val_compare_and_swap() GCC built-in seems identical 
in performance to the assembler xchgb operation used by PostgreSQL 
today on x86_64.
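
For the record, the spin-then-sleep variant of the test looks roughly like
this (a simplified sketch with made-up names, not the actual test code):

#include <unistd.h>

static volatile int test_lock = 0;

static void
test_lock_acquire(void)
{
    for (;;)
    {
        int spins;

        for (spins = 0; spins < 10; spins++)
        {
            /* try to change 0 -> 1; returns the previous value */
            if (__sync_val_compare_and_swap(&test_lock, 0, 1) == 0)
                return;
        }
        usleep(1000);           /* back off for ~1ms before spinning again */
    }
}

static void
test_lock_release(void)
{
    __sync_lock_release(&test_lock);
}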


Regards, Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info



Re: s_lock() seems too aggressive for machines with many sockets

From
Tom Lane
Date:
Jan Wieck <jan@wi3ck.info> writes:
> The attached patch demonstrates that less aggressive spinning and (much) 
> more often delaying improves the performance "on this type of machine". 

Hm.  One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help.  The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no?  Maybe we just need to alter the feedback loop a bit.

(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)
        regards, tom lane



Re: s_lock() seems too aggressive for machines with many sockets

From
Andres Freund
Date:
On 2015-06-10 16:12:05 +0200, Nils Goroll wrote:
>
> On 10/06/15 16:05, Andres Freund wrote:
> > it'll nearly always be beneficial to spin
>
> Trouble is that postgres cannot know if the process holding the lock actually
> does run, so if it doesn't, all we're doing is burn cycles and make the problem
> worse.

That's precisely what I referred to in the bit you cut away...

> Contrary to that, the kernel does know, so for a (f|m)utex which fails to
> acquire immediately and thus needs to syscall, the kernel has the option to spin
> only if the lock holder is running (the "adaptive" mutex).

Unfortunately there's no portable futex support. That's what stopped us
from adopting them so far.  And even futexes can be significantly more
heavyweight under moderate contention than our spinlocks - It's rather
easy to reproduce scenarios where futexes cause significant slowdown in
comparison to spinning in userspace (just reproduce contention on a
spinlock where the protected area will be *very* short - i.e. no cache
misses, no branches or such).

I think we should eventually work on replacing most of the code that
currently uses spinlocks to either use lwlocks (which will enter the kernel
under contention, but not otherwise) or use lockless programming
techniques. I think there are relatively few relevant places left. Most
prominently the buffer header spinlocks...



Re: s_lock() seems too aggressive for machines with many sockets

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Unfortunately there's no portable futex support. That's what stopped us
> from adopting them so far.  And even futexes can be significantly more
> heavyweight under moderate contention than our spinlocks - It's rather
> easy to reproduce scenarios where futexes cause significant slowdown in
> comparison to spinning in userspace (just reproduce contention on a
> spinlock where the protected area will be *very* short - i.e. no cache
> misses, no branches or such).

Which, you'll note, is the ONLY case that's allowed by our coding rules
for spinlock use.  If there are any locking sections that are not very
short straight-line code, or at least code with easily predicted branches,
we need to fix those before we worry about the spinlock mechanism per se.
Optimizing for misuse of the mechanism is not the way.
        regards, tom lane



Re: s_lock() seems too aggressive for machines with many sockets

From
Jan Wieck
Date:
On 06/10/2015 10:20 AM, Tom Lane wrote:
> Jan Wieck <jan@wi3ck.info> writes:
>> The attached patch demonstrates that less aggressive spinning and (much)
>> more often delaying improves the performance "on this type of machine".
>
> Hm.  One thing worth asking is why the code didn't converge to a good
> value of spins_per_delay without help.  The value should drop every time
> we had to delay, so under heavy contention it ought to end up small
> anyhow, no?  Maybe we just need to alter the feedback loop a bit.
>
> (The comment about uniprocessors vs multiprocessors seems pretty wacko in
> this context, but at least the sign of the feedback term seems correct.)

The feedback loop looks heavily weighted towards increasing the spin 
count vs. decreasing it (100 up vs. 1 down). I have test time booked on 
the machine for tomorrow and will test that as well.

However, to me it seems that with the usual minimum sleep() interval of 
1ms, once we have to delay at all we are already losing. That spinning 
10x still outperforms the same code with 1,000 spins per delay by a 
factor of 5 tells me that "on this particular box" something is going 
horribly wrong once we get over the tipping point in concurrency. As 
said, I am not sure what exactly that is yet. At a minimum, the 
probability that another CPU package steals the cache line from the one 
holding the spinlock goes up, which cannot possibly be good for 
performance. But I would expect that to show a more gradual drop in 
throughput than what I see in the pgbench -S example. Kevin had 
speculated to me that at that tipping point the kernel may start feeling 
the need to relocate the memory page in question to whichever CPU 
package makes the most failing requests and thus ends up with a huge 
round-robin page relocation project. Unfortunately I don't know how to 
confirm or disprove that theory.

This is done on CentOS 7 with kernel 3.10 BTW. And no, I am not at 
liberty to install a different distribution or switch to another kernel.


Regards, Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info



Re: s_lock() seems too aggressive for machines with many sockets

From
Nils Goroll
Date:
On 10/06/15 16:20, Andres Freund wrote:
> That's precisely what I referred to in the bit you cut away...

I apologize, yes.

On 10/06/15 16:25, Tom Lane wrote:
> Optimizing for misuse of the mechanism is not the way.

I absolutely agree and I really appreciate all efforts towards lockless data
structures or at least better concurrency using classical mutual exclusion.

But still I am convinced that on today's massively parallel NUMAs, spinlocks are
plain wrong:

- Even if critical sections are kept minimal, they can still become hot spots

- When they do, we get potentially massive negative scalability; it will be
  hard to exclude the possibility of a system "tilting" under (potentially yet
  unknown) load patterns as long as userland slocks exist.

  Briefly: when slocks fail, they fail big time.

- slocks optimize for the best case, but I think on today's systems we should
  optimize for the worst case.

- The fact that well-behaved mutexes have a higher initial cost could even
  motivate good use of them rather than optimize misuse.
 

Cheers,

Nils



Re: s_lock() seems too aggressive for machines with many sockets

From
Andres Freund
Date:
On 2015-06-10 10:25:32 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Unfortunately there's no portable futex support. That's what stopped us
> > from adopting them so far.  And even futexes can be significantly more
> > heavyweight under moderate contention than our spinlocks - It's rather
> > easy to reproduce scenarios where futexes cause significant slowdown in
> > comparison to spinning in userspace (just reproduce contention on a
> > spinlock where the protected area will be *very* short - i.e. no cache
> > misses, no branches or such).
> 
> Which, you'll note, is the ONLY case that's allowed by our coding rules
> for spinlock use.  If there are any locking sections that are not very
> short straight-line code, or at least code with easily predicted branches,
> we need to fix those before we worry about the spinlock mechanism per
> se.

We haven't followed that all that strictly imo. While lwlocks are a bit
less problematic in 9.5 (as they take far fewer spinlocks), they're
still far from perfect as we manipulate linked lists while holding a
lock.  We also do lots of hard-to-predict stuff while the buffer header
spinlock is held...

> Optimizing for misuse of the mechanism is not the way.

Agreed. I'm not particularly interested in optimizing spinlocks. We
should get rid of most.

Greetings,

Andres Freund



Re: s_lock() seems too aggressive for machines with many sockets

From
Robert Haas
Date:
On Wed, Jun 10, 2015 at 10:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jan Wieck <jan@wi3ck.info> writes:
>> The attached patch demonstrates that less aggressive spinning and (much)
>> more often delaying improves the performance "on this type of machine".
>
> Hm.  One thing worth asking is why the code didn't converge to a good
> value of spins_per_delay without help.  The value should drop every time
> we had to delay, so under heavy contention it ought to end up small
> anyhow, no?  Maybe we just need to alter the feedback loop a bit.
>
> (The comment about uniprocessors vs multiprocessors seems pretty wacko in
> this context, but at least the sign of the feedback term seems correct.)

The code seems to have been written with the idea that we should
converge to MAX_SPINS_PER_DELAY if spinning *ever* works.  The way
that's implemented is that, if we get a spinlock without having to
delay, we add 100 to spins_per_delay, but if we have to delay at least
once (potentially hundreds of times), then we subtract 1.
spins_per_delay will be >900 most of the time even if only 1% of the
lock acquisitions manage to get the lock without delaying.

It is possible that, as you say, all we need to do is alter the
feedback loop so that, say, we subtract 1 every time we delay (rather
than every time we have at least 1 delay) and add 1 (rather than 100)
every time we don't end up needing to delay.  I'm a bit concerned,
though, that this would tend to make spins_per_delay unstable.
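
For readers following along, the current rule amounts to roughly the
following (a condensed paraphrase of s_lock.c, not a verbatim excerpt):

#define MIN_SPINS_PER_DELAY 10
#define MAX_SPINS_PER_DELAY 1000

static int spins_per_delay = 100;

static void
update_spins_per_delay(int delays_needed)
{
    if (delays_needed == 0)
    {
        /* got the lock by spinning alone: push the estimate up fast */
        if (spins_per_delay < MAX_SPINS_PER_DELAY)
            spins_per_delay += 100;
        if (spins_per_delay > MAX_SPINS_PER_DELAY)
            spins_per_delay = MAX_SPINS_PER_DELAY;
    }
    else
    {
        /* had to sleep at least once: pull the estimate down by just 1 */
        if (spins_per_delay > MIN_SPINS_PER_DELAY)
            spins_per_delay--;
    }
}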

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: s_lock() seems too aggressive for machines with many sockets

From
Andres Freund
Date:
On 2015-06-10 16:55:31 +0200, Nils Goroll wrote:
> But still I am convinced that on today's massively parallel NUMAs, spinlocks are
> plain wrong:

Sure. But a large number of installations are not using massive NUMA
systems, so we can't focus on optimizing for NUMA.

We definitely have quite some catching up to do there. Unfortunately most of
the problems are only reproducible on 4- and 8-socket machines, and it's
hard to get hands on those for prolonged amounts of time.

> - Even if critical sections are kept minimal, they can still become hot spots

That's why we started to remove several of them...

> - The fact that well behaved mutexes have a higher initial cost could even
>   motivate good use of them rather than optimize misuse.

Well. There are many locks in an RDBMS that can't realistically be
avoided. So optimizing for no and moderate contention isn't something
you can simply forgo.




Re: s_lock() seems too aggressive for machines with many sockets

From
Jan Wieck
Date:
On 06/10/2015 10:59 AM, Robert Haas wrote:
> On Wed, Jun 10, 2015 at 10:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jan Wieck <jan@wi3ck.info> writes:
>>> The attached patch demonstrates that less aggressive spinning and (much)
>>> more often delaying improves the performance "on this type of machine".
>>
>> Hm.  One thing worth asking is why the code didn't converge to a good
>> value of spins_per_delay without help.  The value should drop every time
>> we had to delay, so under heavy contention it ought to end up small
>> anyhow, no?  Maybe we just need to alter the feedback loop a bit.
>>
>> (The comment about uniprocessors vs multiprocessors seems pretty wacko in
>> this context, but at least the sign of the feedback term seems correct.)
>
> The code seems to have been written with the idea that we should
> converge to MAX_SPINS_PER_DELAY if spinning *ever* works.  The way
> that's implemented is that, if we get a spinlock without having to
> delay, we add 100 to spins_per_delay, but if we have to delay at least
> once (potentially hundreds of times), then we subtract 1.
> spins_per_delay will be >900 most of the time even if only 1% of the
> lock acquisitions manage to get the lock without delaying.

And note that spins_per_delay is global. Getting ANY lock without delay 
adds 100, regardless of whether it is a highly or lightly contended one. Your 
process only needs to hit one low-contention slock every 100 calls to 
securely peg this value at >=900.


Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info



Re: s_lock() seems too aggressive for machines with many sockets

From
Nils Goroll
Date:
On 10/06/15 16:18, Jan Wieck wrote:
> 
> I have played with test code that isolates a stripped down version of s_lock()
> and uses it with multiple threads. I then implemented multiple different
> versions of that s_lock(). The results with 200 concurrent threads are that
> using a __sync_val_compare_and_swap() to acquire the lock and then falling back
> to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and
> then doing a usleep(1000) (one millisecond) gives me 25 million locks/second.
> 
> Note that the __sync_val_compare_and_swap() GCC built in seems identical in
> performance with the assembler xchgb operation used by PostgreSQL today on x86_64.

These numbers don't work for me. Do I understand correctly that you are not
holding the lock for any reasonable time? If yes, the test case is invalid (the
uncontended case is never relevant). If no, the numbers don't match up - if you
held one lock for 1ms, you'd not get more than 1,000 locks/s anyway. If you had
200 locks, you'd get 200,000 locks/s.

Can you please explain what the message is you are trying to get across?

Thanks, Nils



Re: s_lock() seems too aggressive for machines with many sockets

From
Jan Wieck
Date:
On 06/10/2015 11:06 AM, Nils Goroll wrote:
> On 10/06/15 16:18, Jan Wieck wrote:
>>
>> I have played with test code that isolates a stripped down version of s_lock()
>> and uses it with multiple threads. I then implemented multiple different
>> versions of that s_lock(). The results with 200 concurrent threads are that
>> using a __sync_val_compare_and_swap() to acquire the lock and then falling back
>> to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and
>> then doing a usleep(1000) (one millisecond) gives me 25 million locks/second.
>>
>> Note that the __sync_val_compare_and_swap() GCC built in seems identical in
>> performance with the assembler xchgb operation used by PostgreSQL today on x86_64.
>
> These numbers don't work for me. Do IUC that you are not holding the lock for
> any reasonable time? If yes, the test case is invalid (the uncontended case is
> never relevant). If No, the numbers don't match up - if you held one lock for
> 1ms, you'd not get more than 1000 locks/s anyway. If you had 200 locks, you'd
> get 200.000 locks/s.
>
> Can you please explain what the message is you are trying to get across?

The test case is that 200 threads are running in a tight loop like this:

for (...)
{
    s_lock();
    // do something with a global variable
    s_unlock();
}

That is the most contended case I can think of, yet the short and 
predictable code while holding the lock is the intended use case for a 
spinlock.

The code in s_lock() is what does the multiple CAS attempts and then sleeps. 
The code never holds the lock for 1ms. Sorry if that wasn't clear.


Regards, Jan


-- 
Jan Wieck
Senior Software Engineer
http://slony.info



Re: s_lock() seems too aggressive for machines with many sockets

From
Nils Goroll
Date:

On 10/06/15 17:01, Andres Freund wrote:
>> > - The fact that well behaved mutexes have a higher initial cost could even
>> >   motivate good use of them rather than optimize misuse.
> Well. There's many locks in a RDBMS that can't realistically be
> avoided. So optimizing for no and moderate contention isn't something
> you can simply forgo.

Let's get back to my initial suggestion:

On 10/06/15 16:07, Nils Goroll wrote:
> I think it would
> still be worth considering to do away with the roll-your-own spinlocks on
> systems whose posix mutexes are known to behave.

Where we use the mutex patch we have not seen any relevant negative impact -
neither in benchmarks nor in production.

So, yes, postgres should still work fine on a 2-core laptop and I don't see any
reason why using posix mutexes *where they are known to behave* would do any harm.

And, to be honest, Linux is quite dominant, so solving the issue for this
platform would be a start at least.

Nils



Re: s_lock() seems too aggressive for machines with many sockets

From
Nils Goroll
Date:
On 10/06/15 17:12, Jan Wieck wrote:
> for (...)
> {
>     s_lock();
>     // do something with a global variable
>     s_unlock();
> }

OK, I understand now, thank you. I am not sure if this test case is appropriate
for the critical sections in postgres (if it were, we'd not have the problem we
are discussing).

Nils



Re: s_lock() seems too aggressive for machines with many sockets

From
Andres Freund
Date:
On 2015-06-10 16:07:50 +0200, Nils Goroll wrote:
> On larger Linux machines, we have been running with spin locks replaced by
> generic posix mutexes for years now. I personally haven't looked at the code for
> ages, but we maintain a patch which pretty much does the same thing still:

Interesting. I've been able to reproduce quite massive slowdowns doing
this on a 4 socket linux machine (after applying the lwlock patch that's
now in 9.5) in readonly workloads. As in 200%+ slower. And that was with
a new kernel/glibc.  That was primarily due to buffer header
spinlocks. For write workloads the difference was smaller, but still
noticeable. There, xlog.c's spinlocks were noticeable, even though they are
usually held only very briefly.

> Ref: http://www.postgresql.org/message-id/4FEDE0BF.7080203@schokola.de

Do you have any details about the workloads that scaled badly back then?
It'd be interesting to find out which spinlocks they bottlenecked
on.

> I understand that there are systems out there which have less efficient posix
> mutex implementations than Linux (which uses futexes), but I think it would
> still be worth considering to do away with the roll-your-own spinlocks on
> systems whose posix mutexes are known to behave.

If we get rid of the 'hot' spinlocks something very roughly like this
will possibly be realistic (we can't rely on pthreads, but ...). I don't
think it'll be before that.



Re: s_lock() seems too aggressive for machines with many sockets

From
Nils Goroll
Date:

On 10/06/15 17:17, Andres Freund wrote:
> On 2015-06-10 16:07:50 +0200, Nils Goroll wrote:
>> On larger Linux machines, we have been running with spin locks replaced by
>> generic posix mutexes for years now. I personally haven't looked at the code for
>> ages, but we maintain a patch which pretty much does the same thing still:
> 
> Interesting. I've been able to reproduce quite massive slowdowns doing
> this on a 4 socket linux machine (after applying the lwlock patch that's
> now in 9.5) 

Sorry, I cannot comment on this; 9.4.1 is the latest we are running in
production and I haven't even tested the patch with 9.5.

> As in 200%+ slower.

Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ?

>> Ref: http://www.postgresql.org/message-id/4FEDE0BF.7080203@schokola.de
> 
> Do you have any details about the workloads that scaled badly back then?
> It'd be interesting to find out which spinlocks they bottlenecked
> on.

OLTP. But really, the root cause from back then should be eliminated; this was
with 9.1.3.

I only got woken up by s_lock() in email subjects.

Nils



Re: s_lock() seems too aggressive for machines with many sockets

From
Andres Freund
Date:
On 2015-06-10 11:12:46 -0400, Jan Wieck wrote:
> The test case is that 200 threads are running in a tight loop like this:
> 
> for (...)
> {
>     s_lock();
>     // do something with a global variable
>     s_unlock();
> }
> 
> That is the most contended case I can think of, yet the short and
> predictable code while holding the lock is the intended use case for a
> spinlock.

But lots of our code isn't that predictable. Check e.g. the buffer
spinlock case:

static void
UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
{
    PrivateRefCountEntry *ref;

    /* not moving as we're likely deleting it soon anyway */
    ref = GetPrivateRefCountEntry(buf->buf_id + 1, false);
    Assert(ref != NULL);

    if (fixOwner)
        ResourceOwnerForgetBuffer(CurrentResourceOwner,
                                  BufferDescriptorGetBuffer(buf));

    Assert(ref->refcount > 0);
    ref->refcount--;
    if (ref->refcount == 0)
    {
        /* I'd better not still hold any locks on the buffer */
        Assert(!LWLockHeldByMe(buf->content_lock));
        Assert(!LWLockHeldByMe(buf->io_in_progress_lock));

        LockBufHdr(buf);

        /* Decrement the shared reference count */
        Assert(buf->refcount > 0);
        buf->refcount--;

        /* Support LockBufferForCleanup() */
        if ((buf->flags & BM_PIN_COUNT_WAITER) &&
            buf->refcount == 1)
        {
            /* we just released the last pin other than the waiter's */
            int         wait_backend_pid = buf->wait_backend_pid;

            buf->flags &= ~BM_PIN_COUNT_WAITER;
            UnlockBufHdr(buf);
            ProcSendSignal(wait_backend_pid);
        }
        else
            UnlockBufHdr(buf);

        ForgetPrivateRefCountEntry(ref);
    }
}


If you check the section where the spinlock is held, there's nontrivial
code executed. Under contention you'll have the problem that if a backend
tries to acquire the spinlock while another backend holds the lock,
it'll "steal" the cacheline on which the lock and the rest of the buffer
descriptor reside. Which means that operations like buf->refcount--, the
if (), and the &= will now have to wait till the cacheline is
transferred back (as the cacheline will directly be marked as modified
on the attempted locker). All the while other backends will also try to
acquire the pin, causing the same kind of trouble.

If this'd instead be written as

ret = pg_atomic_fetch_sub_u32(&buf->state, 1);

if (ret & BM_PIN_COUNT_WAITER)
{
    pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER);
    /* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER */
}

the whole thing looks entirely different. While there's obviously still
cacheline bouncing, every operation is guaranteed to make progress (on
x86 at least, but even elsewhere we'll never block another locker).

Now unlock is the simpler case, but...


- Andres




Re: s_lock() seems too aggressive for machines with many sockets

From
Andres Freund
Date:
On 2015-06-10 17:30:33 +0200, Nils Goroll wrote:
> On 10/06/15 17:17, Andres Freund wrote:
> > On 2015-06-10 16:07:50 +0200, Nils Goroll wrote:
> > Interesting. I've been able to reproduce quite massive slowdowns doing
> > this on a 4 socket linux machine (after applying the lwlock patch that's
> > now in 9.5) 
> 
> Sorry, I cannot comment on this, 9.4.1 is the latest we are running in
> production and I haven't even tested the patch with 9.5.

Ok.

> > As in 200%+ slower.
> 
> Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ?

Yes.

> >> Ref: http://www.postgresql.org/message-id/4FEDE0BF.7080203@schokola.de
> > 
> > Do you have any details about the workloads that scaled badly back then?
> > It'd be interesting to find out which spinlocks they bottlenecked
> > on.
> 
> OLTP. But really the root cause from back then should be eliminated, this was
> with 9.1.3

Hm, ok. Any chance you have profiles from back then? It'd be very
interesting to know which spinlock you were contending on. If we convert
spinlocks into something more heavyweight we'll want to benchmark the
actually contended cases to avoid regressions.



Re: s_lock() seems too aggressive for machines with many sockets

From
Jan Wieck
Date:
On 06/10/2015 11:34 AM, Andres Freund wrote:
> If you check the section where the spinlock is held there's nontrivial
> code executed. Under contention you'll have the problem that if backend
> tries to acquire the spinlock while another backend holds the lock,
> it'll "steal" the cacheline on which the lock and the rest of the buffer
> descriptor reside. Which means that operations like buf->refcount--, the
> if (), and the &= will now have to wait till the cacheline is
> transferred back (as the cacheline will directly be marked as modified
> on the attempted locker). All the while other backends will also try to
> acquire the pin, causing the same kind of trouble.

Yes, that is the possibility of "cache line stealing while holding the 
lock" that I was talking about. It looks like we are in wild agreement 
(as Kevin usually puts it).

>
> If this'd instead be written as
>
> ret = pg_atomic_fetch_sub_u32(&buf->state, 1);
>
> if (ret & BM_PIN_COUNT_WAITER)
> {
>     pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER);
>     /* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER */
> }

There are atomic AND and OR functions too, at least among the GCC 
built-ins. We might be able to come up with pg_atomic_...() versions of 
them and avoid the race part.
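
Something along these lines, for example (hypothetical wrapper names; only
the __sync builtins themselves are the existing GCC interface):

#include <stdint.h>

static inline uint32_t
my_atomic_fetch_and_u32(volatile uint32_t *ptr, uint32_t mask)
{
    /* atomically AND the mask in, returning the old value */
    return __sync_fetch_and_and(ptr, mask);
}

static inline uint32_t
my_atomic_fetch_or_u32(volatile uint32_t *ptr, uint32_t mask)
{
    /* atomically OR the mask in, returning the old value */
    return __sync_fetch_and_or(ptr, mask);
}

/* e.g. clearing a flag bit without a lock:
 *     my_atomic_fetch_and_u32(&buf_state, ~BM_PIN_COUNT_WAITER);
 */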

>
> the whole thing looks entirely different. While there's obviously still
> cacheline bouncing, every operation is guaranteed to make progress (on
> x86 at least, but even elsewhere we'll never block another locker).
>
> Now unlock is the simpler case, but...

While some locks may be avoidable and some may be replaced by atomic 
operations, I believe that we will still be left with some of them. 
Optimizing spinlocks, if we can do it in a generic fashion that does not 
hurt other platforms, will still give us something.


Regards, Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info



Re: s_lock() seems too aggressive for machines with many sockets

From
Nils Goroll
Date:
>>> As in 200%+ slower.
>> Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ?
> Yes.

Ok, if this can be validated, we might have a new case now for which my
suggestion would not be helpful. Reviewed, optimized code with short critical
sections and no hotspots by design could indeed be an exception where it makes
sense to keep slocks as they are.

> Hm, ok. Any chance you have profiles from back then?

IIUC I had shared all relevant data on the list. Does this help?
http://www.postgresql.org/message-id/4FE9EB27.9020502@schokola.de

Thanks, NIls



Re: s_lock() seems too aggressive for machines with many sockets

From
Andres Freund
Date:
On 2015-06-10 11:51:06 -0400, Jan Wieck wrote:
> >ret = pg_atomic_fetch_sub_u32(&buf->state, 1);
> >
> >if (ret & BM_PIN_COUNT_WAITER)
> >{
> >    pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER);
> >    /* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER */
> >}
> 
> There are atomic AND and OR functions too, at least in the GCC built in
> parts. We might be able to come up with pg_atomic_...() versions of them and
> avoid the race part.

The race part isn't actually about that. It's that BM_PIN_COUNT_WAITER
might have been set after the fetch_sub above.

fetch_sub() itself would actually be race-free to unset a flag, if it's
a proper power of two.
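
To illustrate (with a made-up flag value): because the flag occupies a single
bit and is known to be set, the subtraction cannot borrow into neighboring
bits, so a refcount packed into the same word stays intact:

#include <stdint.h>

#define MY_FLAG 0x40000000u     /* a proper power of two */

static uint32_t
clear_flag(volatile uint32_t *state)
{
    /* assumes MY_FLAG is currently set; returns the pre-subtraction value */
    return __sync_fetch_and_sub(state, MY_FLAG);
}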

> While some locks may be avoidable, some may be replaced by atomic
> operations, I believe that we will still be left with some of them.

Besides the two xlog.c ones and lwlock.c, which are hot otherwise? I
think we pretty much removed the rest?

> Optimizing spinlocks if we can do it in a generic fashion that does not hurt
> other platforms will still give us something.

Sure, I'm just doubtful that's easy.

I think we should just gank spinlocks asap. The hard part is removing
them from lwlock.c's slow path and the buffer headers imo. After that we
should imo be fine replacing them with lwlocks.



Re: s_lock() seems too aggressive for machines with many sockets

From
Robert Haas
Date:
On Wed, Jun 10, 2015 at 11:58 AM, Andres Freund <andres@anarazel.de> wrote:
> I think we should just gank spinlocks asap. The hard part is removing
> them from lwlock.c's slow path and the buffer headers imo. After that we
> should imo be fine replacing them with lwlocks.

Mmmph.  I'm not convinced there's any point in replacing
lightly-contended spinlocks with slower, more-complex lwlocks.  But
maybe it's best to stay away from that argument and focus on getting
rid of the spinlocks that are hosing us right now.

I'm not altogether convinced that a simple replacement of spinlocks
with atomics is going to solve this problem.  If it does, great.  But
that doesn't eliminate spinning; it just moves it from the spinlock
loop to a CAS loop.  This is clearly better: when the CAS works,
you're done, as opposed to having to then manipulate the cache line
further and release the spinlock, during any of which the cache line
may get stolen from you.  However, I'm worried that it will still be
possible to create the same kinds of performance collapses that we see
with spinlocks with the CAS loops with which we replace them - or even
worse problems, because clearly the spin_delay stuff *works*, and
we're unlikely to incorporate similarly sophisticated logic into every
CAS loop.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: s_lock() seems too aggressive for machines with many sockets

From
Andres Freund
Date:
On 2015-06-10 13:19:14 -0400, Robert Haas wrote:
> On Wed, Jun 10, 2015 at 11:58 AM, Andres Freund <andres@anarazel.de> wrote:
> > I think we should just gank spinlocks asap. The hard part is removing
> > them from lwlock.c's slow path and the buffer headers imo. After that we
> > should imo be fine replacing them with lwlocks.
> 
> Mmmph.  I'm not convinced there's any point in replacing
> lightly-contended spinlocks with slower, more-complex lwlocks.  But
> maybe it's best to stay away from that argument and focus on getting
> rid of the spinlocks that are hosing us right now.

In the uncontended case lwlocks are just as fast as spinlocks now, with
the exception of the local tracking array. They're faster if there's
differences with read/write lockers.

> I'm not altogether convinced that a simple replacement of spinlocks
> with atomics is going to solve this problem.  If it does, great.  But
> that doesn't eliminate spinning; it just moves it from the spinlock
> loop to a CAS loop.

Well, not necessarily. If you can write your algorithm in a way that
xadd etc are used, instead of a lock cmpxchg, you're actually never
spinning on x86 as it's guaranteed to succeed.  I think that's possible
for most of the places that currently lock buffer headers.
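
As a sketch of the difference (illustrative only, using the GCC builtins):
the compare-and-swap version can loop arbitrarily often under contention,
while the fetch-and-add version compiles to a single lock xadd on x86 and
always makes progress:

#include <stdint.h>

static void
increment_cas(volatile uint32_t *v)
{
    uint32_t old;

    do
    {
        old = *v;
        /* fails and retries if another CPU changed *v in the meantime */
    } while (!__sync_bool_compare_and_swap(v, old, old + 1));
}

static void
increment_xadd(volatile uint32_t *v)
{
    /* no retry loop: the add itself is the atomic operation */
    __sync_fetch_and_add(v, 1);
}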

(I had a version of the lwlock changes that used xadd for shared lock
acquisition - but the code needed to back out in error cases made things
more complicated, and the benefit on a four socket machine wasn't that
large)

> This is clearly better: when the CAS works,
> you're done, as opposed to having to then manipulate the cache line
> further and release the spinlock, during any of which the cache line
> may get stolen from you.

It's not just that, it's also that there's no chance of sleeping while
holding a lock. Which doesn't happen that infrequently in contended, cpu
bound workloads.

> However, I'm worried that it will still be possible to create the same
> kinds of performance collapses that we see with spinlocks with the CAS
> loops with which we replace them - or even worse problems, because
> clearly the spin_delay stuff *works*, and we're unlikely to
> incorporate similarly sophisticated logic into every CAS loop.

I doubt it's really necessary, but it shouldn't be too hard to
generalize the sleeping logic so it could be used with little code in
the individual callsites.

Greetings,

Andres Freund



Re: s_lock() seems too aggressive for machines with many sockets

From
Robert Haas
Date:
On Wed, Jun 10, 2015 at 1:39 PM, Andres Freund <andres@anarazel.de> wrote:
> In the uncontended case lwlocks are just as fast as spinlocks now, with
> the exception of the local tracking array. They're faster if there's
> differences with read/write lockers.

If nothing else, the spinlock calls are inline, while the lwlock calls
involve a function call.  So in the uncontended case I think the
spinlock stuff is like 2 instructions, an atomic and a branch.

>> I'm not altogether convinced that a simple replacement of spinlocks
>> with atomics is going to solve this problem.  If it does, great.  But
>> that doesn't eliminate spinning; it just moves it from the spinlock
>> loop to a CAS loop.
>
> Well, not necessarily. If you can write your algorithm in a way that
> xadd etc are used, instead of a lock cmpxchg, you're actually never
> spinning on x86 as it's guaranteed to succeed.  I think that's possible
> for most of the places that currently lock buffer headers.

Well, it will succeed by looping inside the instruction, I suppose.  But OK.

> (I had a version of the lwlock changes that used xadd for shared lock
> acquisition - but the code needed to back out in error cases made things
> more complicated, and the benefit on a four socket machine wasn't that
> large)

Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
try your patch there, bound to varying numbers of sockets.

>> This is clearly better: when the CAS works,
>> you're done, as opposed to having to then manipulate the cache line
>> further and release the spinlock, during any of which the cache line
>> may get stolen from you.
>
> It's not just that, it's also that there's no chance of sleeping while
> holding a lock. Which doesn't happen that infrequently in contended, cpu
> bound workloads.

That's a really good point.

>> However, I'm worried that it will still be possible to create the same
>> kinds of performance collapses that we see with spinlocks with the CAS
>> loops with which we replace them - or even worse problems, because
>> clearly the spin_delay stuff *works*, and we're unlikely to
>> incorporate similarly sophisticated logic into every CAS loop.
>
> I doubt it's really necessary, but it shouldn't be too hard to
> generalize the sleeping logic so it could be used with little code in
> the individual callsites.

We should probably experiment with it to see whether it is needed.  I
am fine with leaving it out if it isn't, but it's worth testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: s_lock() seems too aggressive for machines with many sockets

From
Andres Freund
Date:
On 2015-06-10 13:52:14 -0400, Robert Haas wrote:
> On Wed, Jun 10, 2015 at 1:39 PM, Andres Freund <andres@anarazel.de> wrote:
> > Well, not necessarily. If you can write your algorithm in a way that
> > xadd etc are used, instead of a lock cmpxchg, you're actually never
> > spinning on x86 as it's guaranteed to succeed.  I think that's possible
> > for most of the places that currently lock buffer headers.
> 
> Well, it will succeed by looping inside the instruction, I suppose.  But OK.

On x86 atomic ops hold the bus lock for a short while - that's why
they're that expensive - and in that case you directly can do useful
work (xadd) or just return after reading the current value if there was
a concurrent change (cmpxchg). Afaik there's a more fundamental
difference than one variant just doing the retry in microcode. It's hard
to get definitive answers to that.

> > (I had a version of the lwlock changes that used xadd for shared lock
> > acquisition - but the code needed to back out in error cases made things
> > more complicated, and the benefit on a four socket machine wasn't that
> > large)
> 
> Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
> try your patch there, bound to varying numbers of sockets.

It'd be a significant amount of work to rebase it on top of current HEAD. I
guess the easiest thing would be to try an older version of the patch
with the xadd in place, and use a tree from back then.

Greetings,

Andres Freund



Re: s_lock() seems too aggressive for machines with many sockets

From
Robert Haas
Date:
On Wed, Jun 10, 2015 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
>> Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
>> try your patch there, bound to varying numbers of sockets.
>
> It'd be a significant amount of work to rebase it on top of current HEAD. I
> guess the easiest thing would be to try an older version of the patch
> with the xadd in place, and use a tree from back then.

We could do that, I guess.  But we've made other significant
improvements in the meantime, so...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: s_lock() seems too aggressive for machines with many sockets

From
Jan Wieck
Date:
The whole thing turns out to have been based on wrong baseline data, taken with 
a pgbench client running from a remote machine. It all started out from 
an investigation against 9.3. Curiously enough, the s_lock() problem 
that existed in 9.3 has a very similar effect on throughput as a network 
bottleneck has on 9.5.

Sorry for the noise, Jan


On 06/10/2015 09:18 AM, Jan Wieck wrote:
> Hi,
>
> I think I may have found one of the problems PostgreSQL has on machines
> with many NUMA nodes. I am not yet sure what exactly happens on the NUMA
> bus, but there seems to be a tipping point at which the spinlock
> concurrency wreaks havoc and the performance of the database collapses.
>
> On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total,
> a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput
> then takes a very sharp dive and reaches around 20,000 TPS at 120
> clients. It never recovers from there.
>
> The attached patch demonstrates that less aggressive spinning and (much)
> more often delaying improves the performance "on this type of machine".
> The 8 socket machine in question scales to over 350,000 TPS.
>
> The patch is meant to demonstrate this effect only. It has a negative
> performance impact on smaller machines and client counts < #cores, so
> the real solution will probably look much different. But I thought it
> would be good to share this and start the discussion about reevaluating
> the spinlock code before PGCon.
>
>
> Regards, Jan
>


-- 
Jan Wieck
Senior Software Engineer
http://slony.info



Re: s_lock() seems too aggressive for machines with many sockets

From
Robert Haas
Date:
On Sun, Jun 14, 2015 at 9:52 AM, Jan Wieck <jan@wi3ck.info> wrote:
> The whole thing turns out to be based on wrong baseline data, taken with a
> pgbench client running from a remote machine. It all started out from an
> investigation against 9.3. Curiously enough, the s_lock() problem that
> existed in 9.3 has a very similar effect on throughput as a network
> bottleneck has on 9.5.

So, changing max_spins_per_delay no longer helps on 9.5?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company