Thread: s_lock() seems too aggressive for machines with many sockets
Hi, I think I may have found one of the problems PostgreSQL has on machines with many NUMA nodes. I am not yet sure what exactly happens on the NUMA bus, but there seems to be a tipping point at which the spinlock concurrency wreaks havoc and the performance of the database collapses. On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It never recovers from there. The attached patch demonstrates that less aggressive spinning and (much) more often delaying improves the performance "on this type of machine". The 8 socket machine in question scales to over 350,000 TPS. The patch is meant to demonstrate this effect only. It has a negative performance impact on smaller machines and client counts < #cores, so the real solution will probably look much different. But I thought it would be good to share this and start the discussion about reevaluating the spinlock code before PGCon. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
On 2015-06-10 09:18:56 -0400, Jan Wieck wrote: > On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a > pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then > takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It > never recovers from there. 85k? Phew, that's pretty bad. What exact type of CPU is this? Which pgbench scale? Did you use -M prepared? Could you share a call graph perf profile? > The attached patch demonstrates that less aggressive spinning and > (much) more often delaying improves the performance "on this type of > machine". The 8 socket machine in question scales to over 350,000 TPS. Even that seems quite low. I've gotten over 500k TPS on a four socket x86 machine, and about 700k on a 8 socket x86 machine. Maybe we need to adjust the amount of spinning, but to me such drastic differences are a hint that we should tackle the actual contention point. Often a spinlock for something regularly heavily contended can be worse than a queued lock. Greetings, Andres Freund
On Wed, Jun 10, 2015 at 09:18:56AM -0400, Jan Wieck wrote: > The attached patch demonstrates that less aggressive spinning and > (much) more often delaying improves the performance "on this type of > machine". The 8 socket machine in question scales to over 350,000 > TPS. > > The patch is meant to demonstrate this effect only. It has a > negative performance impact on smaller machines and client counts < > #cores, so the real solution will probably look much different. But > I thought it would be good to share this and start the discussion > about reevaluating the spinlock code before PGCon. Wow, you are in that code! We kind of guessed on some of those constants many years ago, and never revisited it. It would be nice to get a better heuristic for those, but I am concerned it would require the user to specify the number of CPU cores. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 06/10/2015 09:28 AM, Andres Freund wrote: > On 2015-06-10 09:18:56 -0400, Jan Wieck wrote: >> On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a >> pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then >> takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It >> never recovers from there. > > 85k? Phew, that's pretty bad. What exact type of CPU is this? Which > pgbench scale? Did you use -M prepared? model name : Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz numactl --hardware shows the distance to the attached memory as 10, the distance to every other node as 21. I interpret that as the machine having one NUMA bus with all cpu packages attached to that, rather than individual connections from cpu to cpu or something different. pgbench scale=300, -Msimple. > > Could you share a call graph perf profile? I do not have them handy at the moment and the machine is in use for something else until tomorrow. I will forward perf and systemtap based graphs ASAP. What led me into that spinlock area was the fact that a wall clock based systemtap FlameGraph showed a large portion of the time spent in BufferPin() and BufferUnpin(). > >> The attached patch demonstrates that less aggressive spinning and >> (much) more often delaying improves the performance "on this type of >> machine". The 8 socket machine in question scales to over 350,000 TPS. > > Even that seems quite low. I've gotten over 500k TPS on a four socket > x86 machine, and about 700k on a 8 socket x86 machine. There is more wrong with the machine in question than just that. But for the moment I am satisfied with having a machine where I can reproduce this phenomenon in what appears to be a worst case. > > Maybe we need to adjust the amount of spinning, but to me such drastic > differences are a hint that we should tackle the actual contention > point. Often a spinlock for something regularly heavily contended can be > worse than a queued lock. 
I have the impression that the code assumes that there is little penalty for accessing the shared byte in a tight loop from any number of cores in parallel. That apparently is true for some architectures and core counts, but no longer holds for these machines with many sockets. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Hi, On 2015-06-10 09:54:00 -0400, Jan Wieck wrote: > model name : Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz > numactl --hardware shows the distance to the attached memory as 10, the > distance to every other node as 21. I interpret that as the machine having > one NUMA bus with all cpu packages attached to that, rather than individual > connections from cpu to cpu or something different. Generally that doesn't say very much - IIRC the distances are defined by the bios. > What led me into that spinlock area was the fact that a wall clock based > systemtap FlameGraph showed a large portion of the time spent in > BufferPin() and BufferUnpin(). I've seen that as a bottleneck in the past as well. My plan to fix that is to "simply" make buffer pinning lockless for the majority of cases. I don't have access to hardware to test that at higher node counts atm though. My guess is that the pins are on the btree root pages. But it'd be good to confirm that. > >Maybe we need to adjust the amount of spinning, but to me such drastic > >differences are a hint that we should tackle the actual contention > >point. Often a spinlock for something regularly heavily contended can be > >worse than a queued lock. > > I have the impression that the code assumes that there is little penalty for > accessing the shared byte in a tight loop from any number of cores in > parallel. That apparently is true for some architectures and core counts, > but no longer holds for these machines with many sockets. It's just generally a tradeoff. It's beneficial to spin longer if there's only mild amounts of contention. If the likelihood of getting the spinlock soon is high (i.e. existing, but low contention), it'll nearly always be beneficial to spin. If the likelihood is low, it'll be mostly beneficial to sleep. The latter is especially true if a machine is sufficiently overcommitted that it's likely that it'll sleep while holding a spinlock. 
The danger of sleeping while holding a spinlock, without targeted wakeup, is why spinlocks in userspace aren't a really good idea. My bet is that if you'd measure using different number iterations for different spinlocks you'd find some where the higher number of iterations is rather beneficial as well. Greetings, Andres Freund
On larger Linux machines, we have been running with spin locks replaced by generic posix mutexes for years now. I personally haven't looked at the code for ages, but we maintain a patch which pretty much does the same thing still: Ref: http://www.postgresql.org/message-id/4FEDE0BF.7080203@schokola.de I understand that there are systems out there which have less efficient posix mutex implementations than Linux (which uses futexes), but I think it would still be worth considering to do away with the roll-your-own spinlocks on systems whose posix mutexes are known to behave. Nils
On 10/06/15 16:05, Andres Freund wrote: > it'll nearly always be beneficial to spin Trouble is that postgres cannot know if the process holding the lock actually does run, so if it doesn't, all we're doing is burn cycles and make the problem worse. Contrary to that, the kernel does know, so for a (f|m)utex which fails to acquire immediately and thus needs to syscall, the kernel has the option to spin only if the lock holder is running (the "adaptive" mutex). Nils
On 06/10/2015 10:07 AM, Nils Goroll wrote: > On larger Linux machines, we have been running with spin locks replaced by > generic posix mutexes for years now. I personally haven't looked at the code for > ages, but we maintain a patch which pretty much does the same thing still: > > Ref: http://www.postgresql.org/message-id/4FEDE0BF.7080203@schokola.de > > I understand that there are systems out there which have less efficient posix > mutex implementations than Linux (which uses futexes), but I think it would > still be worth considering to do away with the roll-your-own spinlocks on > systems whose posix mutexes are known to behave. I have played with test code that isolates a stripped down version of s_lock() and uses it with multiple threads. I then implemented multiple different versions of that s_lock(). The results with 200 concurrent threads are that using a __sync_val_compare_and_swap() to acquire the lock and then falling back to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and then doing a usleep(1000) (one millisecond) gives me 25 million locks/second. Note that the __sync_val_compare_and_swap() GCC built in seems identical in performance with the assembler xchgb operation used by PostgreSQL today on x86_64. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Jan Wieck <jan@wi3ck.info> writes: > The attached patch demonstrates that less aggressive spinning and (much) > more often delaying improves the performance "on this type of machine". Hm. One thing worth asking is why the code didn't converge to a good value of spins_per_delay without help. The value should drop every time we had to delay, so under heavy contention it ought to end up small anyhow, no? Maybe we just need to alter the feedback loop a bit. (The comment about uniprocessors vs multiprocessors seems pretty wacko in this context, but at least the sign of the feedback term seems correct.) regards, tom lane
On 2015-06-10 16:12:05 +0200, Nils Goroll wrote: > > On 10/06/15 16:05, Andres Freund wrote: > > it'll nearly always be beneficial to spin > > Trouble is that postgres cannot know if the process holding the lock actually > does run, so if it doesn't, all we're doing is burn cycles and make the problem > worse. That's precisely what I referred to in the bit you cut away... > Contrary to that, the kernel does know, so for a (f|m)utex which fails to > acquire immediately and thus needs to syscall, the kernel has the option to spin > only if the lock holder is running (the "adaptive" mutex). Unfortunately there's no portable futex support. That's what stopped us from adopting them so far. And even futexes can be significantly more heavyweight under moderate contention than our spinlocks - It's rather easy to reproduce scenarios where futexes cause significant slowdown in comparison to spinning in userspace (just reproduce contention on a spinlock where the protected area will be *very* short - i.e. no cache misses, no branches or such). I think we should eventually work on replacing most of the currently spinlock using code to either use lwlocks (which will enter the kernel under contention, but not otherwise) or use lockless programming techniques. I think there's relatively few relevant places left. Most prominently the buffer header spinlocks...
Andres Freund <andres@anarazel.de> writes: > Unfortunately there's no portable futex support. That's what stopped us > from adopting them so far. And even futexes can be significantly more > heavyweight under moderate contention than our spinlocks - It's rather > easy to reproduce scenarios where futexes cause significant slowdown in > comparison to spinning in userspace (just reproduce contention on a > spinlock where the protected area will be *very* short - i.e. no cache > misses, no branches or such). Which, you'll note, is the ONLY case that's allowed by our coding rules for spinlock use. If there are any locking sections that are not very short straight-line code, or at least code with easily predicted branches, we need to fix those before we worry about the spinlock mechanism per se. Optimizing for misuse of the mechanism is not the way. regards, tom lane
On 06/10/2015 10:20 AM, Tom Lane wrote: > Jan Wieck <jan@wi3ck.info> writes: >> The attached patch demonstrates that less aggressive spinning and (much) >> more often delaying improves the performance "on this type of machine". > > Hm. One thing worth asking is why the code didn't converge to a good > value of spins_per_delay without help. The value should drop every time > we had to delay, so under heavy contention it ought to end up small > anyhow, no? Maybe we just need to alter the feedback loop a bit. > > (The comment about uniprocessors vs multiprocessors seems pretty wacko in > this context, but at least the sign of the feedback term seems correct.) The feedback loop leans rather heavily towards increasing the spin count vs. decreasing it (100 up vs. 1 down). I have test time booked on the machine for tomorrow and will test that as well. However, to me it seems that with the usual minimum sleep() interval of 1ms, once we have to delay at all we are already losing. That spinning 10 times still outperforms the same code with 1,000 spins per delay by a factor of 5 tells me that "on this particular box" something is going horribly wrong once we get over the tipping point in concurrency. As said, I am not sure what exactly that is yet. At a minimum the probability that another CPU package is stealing the cache line from the one holding the spinlock is going up. Which cannot possibly be good for performance. But I would expect that to show a more gradual drop in throughput than what I see in the pgbench -S example. Kevin had speculated to me that it may be possible that at that tipping point the kernel starts feeling the need to relocate the memory page in question to whichever cpu package makes the most failing requests and thus ends up with a huge round-robin page relocation project. Unfortunately I don't know how to confirm or disprove that theory. This is done on CentOS 7 with kernel 3.10 BTW. 
And no, I am not at liberty to install a different distribution or switch to another kernel. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
On 10/06/15 16:20, Andres Freund wrote: > That's precisely what I referred to in the bit you cut away... I apologize, yes. On 10/06/15 16:25, Tom Lane wrote: > Optimizing for misuse of the mechanism is not the way. I absolutely agree and I really appreciate all efforts towards lockless data structures or at least better concurrency using classical mutual exclusion. But still I am convinced that on today's massively parallel NUMAs, spinlocks are plain wrong: - Even if critical sections are kept minimal, they can still become hot spots - When they do, we get potentially massive negative scalability, it will be hard to exclude the possibility of a system "tilting" under (potentially yet unknown) load patterns as long as userland slocks exist. Briefly: When slocks fail, they fail big time - slocks optimize for the best case, but I think on today's systems we should optimize for the worst case. - The fact that well behaved mutexes have a higher initial cost could even motivate good use of them rather than optimize misuse. Cheers, Nils
On 2015-06-10 10:25:32 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Unfortunately there's no portable futex support. That's what stopped us > > from adopting them so far. And even futexes can be significantly more > > heavyweight under moderate contention than our spinlocks - It's rather > > easy to reproduce scenarios where futexes cause significant slowdown in > > comparison to spinning in userspace (just reproduce contention on a > > spinlock where the protected area will be *very* short - i.e. no cache > > misses, no branches or such). > > Which, you'll note, is the ONLY case that's allowed by our coding rules > for spinlock use. If there are any locking sections that are not very > short straight-line code, or at least code with easily predicted branches, > we need to fix those before we worry about the spinlock mechanism per > se. We haven't followed that all that strictly imo. While lwlocks are a bit less problematic in 9.5 (as they take far fewer spinlocks), they're still far from perfect as we manipulate linked lists while holding a lock. We also do lots of hard to predict stuff while the buffer header spinlock is held... > Optimizing for misuse of the mechanism is not the way. Agreed. I'm not particularly interested in optimizing spinlocks. We should get rid of most. Greetings, Andres Freund
On Wed, Jun 10, 2015 at 10:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jan Wieck <jan@wi3ck.info> writes: >> The attached patch demonstrates that less aggressive spinning and (much) >> more often delaying improves the performance "on this type of machine". > > Hm. One thing worth asking is why the code didn't converge to a good > value of spins_per_delay without help. The value should drop every time > we had to delay, so under heavy contention it ought to end up small > anyhow, no? Maybe we just need to alter the feedback loop a bit. > > (The comment about uniprocessors vs multiprocessors seems pretty wacko in > this context, but at least the sign of the feedback term seems correct.) The code seems to have been written with the idea that we should converge to MAX_SPINS_PER_DELAY if spinning *ever* works. The way that's implemented is that, if we get a spinlock without having to delay, we add 100 to spins_per_delay, but if we have to delay at least once (potentially hundreds of times), then we subtract 1. spins_per_delay will be >900 most of the time even if only 1% of the lock acquisitions manage to get the lock without delaying. It is possible that, as you say, all we need to do is alter the feedback loop so that, say, we subtract 1 every time we delay (rather than every time we have at least 1 delay) and add 1 (rather than 100) every time we don't end up needing to delay. I'm a bit concerned, though, that this would tend to make spins_per_delay unstable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-06-10 16:55:31 +0200, Nils Goroll wrote: > But still I am convinced that on today's massively parallel NUMAs, spinlocks are > plain wrong: Sure. But a large number of installations are not using massive NUMA systems, so we can't focus on optimizing for NUMA. We definitely have quite some catchup to do there. Unfortunately most of the problems are only reproducible on 4, 8 socket machines, and it's hard to get hands on those for prolonged amounts of time. > - Even if critical sections are kept minimal, they can still become hot spots That's why we started to remove several of them... > - The fact that well behaved mutexes have a higher initial cost could even > motivate good use of them rather than optimize misuse. Well. There's many locks in an RDBMS that can't realistically be avoided. So optimizing for no and moderate contention isn't something you can simply forgo.
On 06/10/2015 10:59 AM, Robert Haas wrote: > On Wed, Jun 10, 2015 at 10:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Jan Wieck <jan@wi3ck.info> writes: >>> The attached patch demonstrates that less aggressive spinning and (much) >>> more often delaying improves the performance "on this type of machine". >> >> Hm. One thing worth asking is why the code didn't converge to a good >> value of spins_per_delay without help. The value should drop every time >> we had to delay, so under heavy contention it ought to end up small >> anyhow, no? Maybe we just need to alter the feedback loop a bit. >> >> (The comment about uniprocessors vs multiprocessors seems pretty wacko in >> this context, but at least the sign of the feedback term seems correct.) > > The code seems to have been written with the idea that we should > converge to MAX_SPINS_PER_DELAY if spinning *ever* works. The way > that's implemented is that, if we get a spinlock without having to > delay, we add 100 to spins_per_delay, but if we have to delay at least > once (potentially hundreds of times), then we subtract 1. > spins_per_delay will be >900 most of the time even if only 1% of the > lock acquisitions manage to get the lock without delaying. And note that spins_per_delay is global. Getting ANY lock without delay adds 100, regardless of whether it is a highly or a lightly contended one. Your process only needs to hit one low contention slock every 100 calls to securely peg this value >=900. Jan -- Jan Wieck Senior Software Engineer http://slony.info
On 10/06/15 16:18, Jan Wieck wrote: > > I have played with test code that isolates a stripped down version of s_lock() > and uses it with multiple threads. I then implemented multiple different > versions of that s_lock(). The results with 200 concurrent threads are that > using a __sync_val_compare_and_swap() to acquire the lock and then falling back > to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and > then doing a usleep(1000) (one millisecond) gives me 25 million locks/second. > > Note that the __sync_val_compare_and_swap() GCC built in seems identical in > performance with the assembler xchgb operation used by PostgreSQL today on x86_64. These numbers don't work for me. Do I understand correctly that you are not holding the lock for any reasonable time? If yes, the test case is invalid (the uncontended case is never relevant). If no, the numbers don't match up - if you held one lock for 1ms, you'd not get more than 1,000 locks/s anyway. If you had 200 locks, you'd get 200,000 locks/s. Can you please explain what the message is you are trying to get across? Thanks, Nils
On 06/10/2015 11:06 AM, Nils Goroll wrote: > On 10/06/15 16:18, Jan Wieck wrote: >> >> I have played with test code that isolates a stripped down version of s_lock() >> and uses it with multiple threads. I then implemented multiple different >> versions of that s_lock(). The results with 200 concurrent threads are that >> using a __sync_val_compare_and_swap() to acquire the lock and then falling back >> to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and >> then doing a usleep(1000) (one millisecond) gives me 25 million locks/second. >> >> Note that the __sync_val_compare_and_swap() GCC built in seems identical in >> performance with the assembler xchgb operation used by PostgreSQL today on x86_64. > > These numbers don't work for me. Do IUC that you are not holding the lock for > any reasonable time? If yes, the test case is invalid (the uncontended case is > never relevant). If No, the numbers don't match up - if you held one lock for > 1ms, you'd not get more than 1000 locks/s anyway. If you had 200 locks, you'd > get 200.000 locks/s. > > Can you please explain what the message is you are trying to get across? The test case is that 200 threads are running in a tight loop like this: for (...) { s_lock(); // do something with a global variable s_unlock(); } That is the most contended case I can think of, yet the short and predictable code while holding the lock is the intended use case for a spinlock. The code in s_lock() is what is doing multiple CAS attempts, then sleep. The code is never holding the lock for 1ms. Sorry if that wasn't clear. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
On 10/06/15 17:01, Andres Freund wrote: >> > - The fact that well behaved mutexes have a higher initial cost could even >> > motivate good use of them rather than optimize misuse. > Well. There's many locks in a RDBMS that can't realistically be > avoided. So optimizing for no and moderate contention isn't something > you can simply forgo. Let's get back to my initial suggestion: On 10/06/15 16:07, Nils Goroll wrote: > I think it would > still be worth considering to do away with the roll-your-own spinlocks on > systems whose posix mutexes are known to behave. Where we use the mutex patch we have not seen any relevant negative impact - neither in benchmarks nor in production. So, yes, postgres should still work fine on a 2-core laptop and I don't see any reason why using posix mutexes *where they are known to behave* would do any harm. And, to be honest, Linux is quite dominant, so solving the issue for this platform would be a start at least. Nils
On 10/06/15 17:12, Jan Wieck wrote: > for (...) > { > s_lock(); > // do something with a global variable > s_unlock(); > } OK, I understand now, thank you. I am not sure if this test case is appropriate for the critical sections in postgres (if it was, we'd not have the problem we are discussing). Nils
On 2015-06-10 16:07:50 +0200, Nils Goroll wrote: > On larger Linux machines, we have been running with spin locks replaced by > generic posix mutexes for years now. I personally haven't looked at the code for > ages, but we maintain a patch which pretty much does the same thing still: Interesting. I've been able to reproduce quite massive slowdowns doing this on a 4 socket linux machine (after applying the lwlock patch that's now in 9.5) in readonly workloads. As in 200%+ slower. And that was with a new kernel/glibc. That was primarily due to buffer header spinlocks. For write workloads the difference was smaller, but still noticeable. There xlog.c's spinlocks were noticeable, which are usually held very briefly. > Ref: http://www.postgresql.org/message-id/4FEDE0BF.7080203@schokola.de Do you have any details about the workloads that scaled badly back then? It'd be interesting to find out which spinlocks they bottlenecked on. > I understand that there are systems out there which have less efficient posix > mutex implementations than Linux (which uses futexes), but I think it would > still be worth considering to do away with the roll-your-own spinlocks on > systems whose posix mutexes are known to behave. If we get rid of the 'hot' spinlocks something very roughly like this will possibly be realistic (we can't rely on pthreads, but ...). I don't think it'll be before that.
On 10/06/15 17:17, Andres Freund wrote: > On 2015-06-10 16:07:50 +0200, Nils Goroll wrote: >> On larger Linux machines, we have been running with spin locks replaced by >> generic posix mutexes for years now. I personally haven't look at the code for >> ages, but we maintain a patch which pretty much does the same thing still: > > Interesting. I've been able to reproduce quite massive slowdowns doing > this on a 4 socket linux machine (after applying the lwlock patch that's > now in 9.5) Sorry, I cannot comment on this, 9.4.1 is the latest we are running in production and I haven't even tested the patch with 9.5. > As in 200%+ slower. Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ? >> Ref: http://www.postgresql.org/message-id/4FEDE0BF.7080203@schokola.de > > Do you have any details about the workloads that scaled badly back then? > It'd be interesting to find out which spinlocks they bottlenecked > on. OLTP. But really the root cause from back then should be eliminated, this was with 9.1.3 I only got woken up by s_lock() in email subjects. Nils
On 2015-06-10 11:12:46 -0400, Jan Wieck wrote: > The test case is that 200 threads are running in a tight loop like this: > > for (...) > { > s_lock(); > // do something with a global variable > s_unlock(); > } > > That is the most contended case I can think of, yet the short and > predictable code while holding the lock is the intended use case for a > spinlock. But lots of our code isn't that predictable. Check e.g. the buffer spinlock case:

static void
UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
{
	PrivateRefCountEntry *ref;

	/* not moving as we're likely deleting it soon anyway */
	ref = GetPrivateRefCountEntry(buf->buf_id + 1, false);
	Assert(ref != NULL);

	if (fixOwner)
		ResourceOwnerForgetBuffer(CurrentResourceOwner,
								  BufferDescriptorGetBuffer(buf));

	Assert(ref->refcount > 0);
	ref->refcount--;
	if (ref->refcount == 0)
	{
		/* I'd better not still hold any locks on the buffer */
		Assert(!LWLockHeldByMe(buf->content_lock));
		Assert(!LWLockHeldByMe(buf->io_in_progress_lock));

		LockBufHdr(buf);

		/* Decrement the shared reference count */
		Assert(buf->refcount > 0);
		buf->refcount--;

		/* Support LockBufferForCleanup() */
		if ((buf->flags & BM_PIN_COUNT_WAITER) && buf->refcount == 1)
		{
			/* we just released the last pin other than the waiter's */
			int		wait_backend_pid = buf->wait_backend_pid;

			buf->flags &= ~BM_PIN_COUNT_WAITER;
			UnlockBufHdr(buf);
			ProcSendSignal(wait_backend_pid);
		}
		else
			UnlockBufHdr(buf);

		ForgetPrivateRefCountEntry(ref);
	}
}

If you check the section where the spinlock is held there's nontrivial code executed. Under contention you'll have the problem that if a backend tries to acquire the spinlock while another backend holds the lock, it'll "steal" the cacheline on which the lock and the rest of the buffer descriptor resides. Which means that operations like buf->refcount--, the if (), and the &= will now have to wait till the cacheline is transferred back (as the cacheline will directly be marked as modified on the attempted locker). 
All the while other backends will also try to acquire the pin, causing the same kind of trouble. If this'd instead be written as

	ret = pg_atomic_fetch_sub_u32(&buf->state, 1);

	if (ret & BM_PIN_COUNT_WAITER)
	{
		pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER);
		/* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER */
	}

the whole thing looks entirely different. While there's obviously still cacheline bouncing, every operation is guaranteed to make progress (on x86 at least, but even elsewhere we'll never block another locker). Now unlock is the simpler case, but... - Andres
On 2015-06-10 17:30:33 +0200, Nils Goroll wrote: > On 10/06/15 17:17, Andres Freund wrote: > > On 2015-06-10 16:07:50 +0200, Nils Goroll wrote: > > Interesting. I've been able to reproduce quite massive slowdowns doing > > this on a 4 socket linux machine (after applying the lwlock patch that's > > now in 9.5) > > Sorry, I cannot comment on this, 9.4.1 is the latest we are running in > production and I haven't even tested the patch with 9.5. Ok. > > As in 200%+ slower. > > Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ? Yes. > >> Ref: http://www.postgresql.org/message-id/4FEDE0BF.7080203@schokola.de > > > > Do you have any details about the workloads that scaled badly back then? > > It'd be interesting to find out which spinlocks they bottlenecked > > on. > > OLTP. But really the root cause from back then should be eliminated, this was > with 9.1.3 Hm, ok. Any chance you have profiles from back then? It'd be very interesting to know which spinlock you were contending on. If we convert spinlocks into something more heavyweight we'll want to benchmark the actually contending cases to avoid regressions.
On 06/10/2015 11:34 AM, Andres Freund wrote: > If you check the section where the spinlock is held there's nontrivial > code executed. Under contention you'll have the problem that if backend > tries to acquire the the spinlock while another backend holds the lock, > it'll "steal" the cacheline on which the lock and the rest of the buffer > descriptor resides. Which means that operations like buf->refcount--,the > if (), and the &= will now have to wait till the cacheline is > transferred back (as the cacheline will directly be marked as modified > on the attempted locker). All the while other backends will also try to > acquire the pin, causing the same kind of trouble. Yes, that is the possibility of "cache line stealing while holding the lock" that I was talking about. It looks like we are in wild agreement (as Kevin usually puts it). > > If this'd instead be written as > > ret = pg_atomic_fetch_sub_u32(&buf->state, 1); > > if (ret & BM_PIN_COUNT_WAITER) > { > pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER); > /* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER */ > } There are atomic AND and OR functions too, at least in the GCC built in parts. We might be able to come up with pg_atomic_...() versions of them and avoid the race part. > > the whole thing looks entirely different. While there's obviously still > cacheline bouncing, every operation is guaranteed to make progress (on > x86 at least, but even elsewhere we'll never block another locker). > > Now unlock is the simpler case, but... While some locks may be avoidable, some may be replaced by atomic operations, I believe that we will still be left with some of them. Optimizing spinlocks if we can do it in a generic fashion that does not hurt other platforms will still give us something. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
>>> As in 200%+ slower.
>>
>> Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ?
>
> Yes.

Ok, if this can be validated, we might have a new case now for which my
suggestion would not be helpful. Reviewed, optimized code with short
critical sections and no hotspots by design could indeed be an exception
where spinlocks should be kept as they are.

> Hm, ok. Any chance you have profiles from back then?

IIUC I had shared all relevant data on the list. Does this help?

http://www.postgresql.org/message-id/4FE9EB27.9020502@schokola.de

Thanks, Nils
On 2015-06-10 11:51:06 -0400, Jan Wieck wrote:
> > ret = pg_atomic_fetch_sub_u32(&buf->state, 1);
> >
> > if (ret & BM_PIN_COUNT_WAITER)
> > {
> >     pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER);
> >     /* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER */
> > }
>
> There are atomic AND and OR functions too, at least in the GCC built in
> parts. We might be able to come up with pg_atomic_...() versions of them
> and avoid the race part.

The race part isn't actually about that. It's that BM_PIN_COUNT_WAITER
might have been set after the fetch_sub above. fetch_sub() itself would
actually be race-free to unset a flag, if it's a proper power of two.

> While some locks may be avoidable, some may be replaced by atomic
> operations, I believe that we will still be left with some of them.

Besides the two xlog.c ones and lwlock.c, which are hot otherwise? I
think we pretty much removed the rest?

> Optimizing spinlocks if we can do it in a generic fashion that does not
> hurt other platforms will still give us something.

Sure, I'm just doubtful that's easy. I think we should just gank
spinlocks asap. The hard part is removing them from lwlock.c's slow path
and the buffer headers imo. After that we should imo be fine replacing
them with lwlocks.
On Wed, Jun 10, 2015 at 11:58 AM, Andres Freund <andres@anarazel.de> wrote:
> I think we should just gank spinlocks asap. The hard part is removing
> them from lwlock.c's slow path and the buffer headers imo. After that we
> should imo be fine replacing them with lwlocks.

Mmmph. I'm not convinced there's any point in replacing
lightly-contended spinlocks with slower, more-complex lwlocks. But maybe
it's best to stay away from that argument and focus on getting rid of
the spinlocks that are hosing us right now.

I'm not altogether convinced that a simple replacement of spinlocks with
atomics is going to solve this problem. If it does, great. But that
doesn't eliminate spinning; it just moves it from the spinlock loop to a
CAS loop. This is clearly better: when the CAS works, you're done, as
opposed to having to then manipulate the cache line further and release
the spinlock, during any of which the cache line may get stolen from
you. However, I'm worried that it will still be possible to create the
same kinds of performance collapses that we see with spinlocks with the
CAS loops with which we replace them - or even worse problems, because
clearly the spin_delay stuff *works*, and we're unlikely to incorporate
similarly sophisticated logic into every CAS loop.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-06-10 13:19:14 -0400, Robert Haas wrote:
> On Wed, Jun 10, 2015 at 11:58 AM, Andres Freund <andres@anarazel.de> wrote:
> > I think we should just gank spinlocks asap. The hard part is removing
> > them from lwlock.c's slow path and the buffer headers imo. After that we
> > should imo be fine replacing them with lwlocks.
>
> Mmmph. I'm not convinced there's any point in replacing
> lightly-contended spinlocks with slower, more-complex lwlocks. But
> maybe it's best to stay away from that argument and focus on getting
> rid of the spinlocks that are hosing us right now.

In the uncontended case lwlocks are just as fast as spinlocks now, with
the exception of the local tracking array. They're faster if there's
differences with read/write lockers.

> I'm not altogether convinced that a simple replacement of spinlocks
> with atomics is going to solve this problem. If it does, great. But
> that doesn't eliminate spinning; it just moves it from the spinlock
> loop to a CAS loop.

Well, not necessarily. If you can write your algorithm in a way that
xadd etc are used, instead of a lock cmpxchg, you're actually never
spinning on x86 as it's guaranteed to succeed. I think that's possible
for most of the places that currently lock buffer headers.

(I had a version of the lwlock changes that used xadd for shared lock
acquisition - but the code needed to back out in error cases made things
more complicated, and the benefit on a four socket machine wasn't that
large)

> This is clearly better: when the CAS works, you're done, as opposed to
> having to then manipulate the cache line further and release the
> spinlock, during any of which the cache line may get stolen from you.

It's not just that, it's also that there's no chance of sleeping while
holding a lock. Which doesn't happen that infrequently in contended, cpu
bound workloads.
> However, I'm worried that it will still be possible to create the same
> kinds of performance collapses that we see with spinlocks with the CAS
> loops with which we replace them - or even worse problems, because
> clearly the spin_delay stuff *works*, and we're unlikely to
> incorporate similarly sophisticated logic into every CAS loop.

I doubt it's really necessary, but it shouldn't be too hard to
generalize the sleeping logic so it could be used with little code in
the individual callsites.

Greetings,

Andres Freund
On Wed, Jun 10, 2015 at 1:39 PM, Andres Freund <andres@anarazel.de> wrote:
> In the uncontended case lwlocks are just as fast as spinlocks now, with
> the exception of the local tracking array. They're faster if there's
> differences with read/write lockers.

If nothing else, the spinlock calls are inline, while the lwlock calls
involve a function call. So in the uncontended case I think the spinlock
stuff is like 2 instructions, an atomic and a branch.

>> I'm not altogether convinced that a simple replacement of spinlocks
>> with atomics is going to solve this problem. If it does, great. But
>> that doesn't eliminate spinning; it just moves it from the spinlock
>> loop to a CAS loop.
>
> Well, not necessarily. If you can write your algorithm in a way that
> xadd etc are used, instead of a lock cmpxchg, you're actually never
> spinning on x86 as it's guaranteed to succeed. I think that's possible
> for most of the places that currently lock buffer headers.

Well, it will succeed by looping inside the instruction, I suppose. But OK.

> (I had a version of the lwlock changes that used xadd for shared lock
> acquisition - but the code needed to back out in error cases made things
> more complicated, and the benefit on a four socket machine wasn't that
> large)

Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
try your patch there, bound to varying numbers of sockets.

>> This is clearly better: when the CAS works, you're done, as opposed to
>> having to then manipulate the cache line further and release the
>> spinlock, during any of which the cache line may get stolen from you.
>
> It's not just that, it's also that there's no chance of sleeping while
> holding a lock. Which doesn't happen that infrequently in contended, cpu
> bound workloads.

That's a really good point.
>> However, I'm worried that it will still be possible to create the same
>> kinds of performance collapses that we see with spinlocks with the CAS
>> loops with which we replace them - or even worse problems, because
>> clearly the spin_delay stuff *works*, and we're unlikely to
>> incorporate similarly sophisticated logic into every CAS loop.
>
> I doubt it's really necessary, but it shouldn't be too hard to
> generalize the sleeping logic so it could be used with little code in
> the individual callsites.

We should probably experiment with it to see whether it is needed. I am
fine with leaving it out if it isn't, but it's worth testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-06-10 13:52:14 -0400, Robert Haas wrote:
> On Wed, Jun 10, 2015 at 1:39 PM, Andres Freund <andres@anarazel.de> wrote:
> > Well, not necessarily. If you can write your algorithm in a way that
> > xadd etc are used, instead of a lock cmpxchg, you're actually never
> > spinning on x86 as it's guaranteed to succeed. I think that's possible
> > for most of the places that currently lock buffer headers.
>
> Well, it will succeed by looping inside the instruction, I suppose. But OK.

On x86 atomic ops hold the bus lock for a short while - that's why
they're that expensive - and in that case you directly can do useful
work (xadd) or just return after reading the current value if there was
a concurrent change (cmpxchg). Afaik there's a more fundamental
difference than one variant just doing the retry in microcode. It's hard
to get definitive answers to that.

> > (I had a version of the lwlock changes that used xadd for shared lock
> > acquisition - but the code needed to back out in error cases made things
> > more complicated, and the benefit on a four socket machine wasn't that
> > large)
>
> Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
> try your patch there, bound to varying numbers of sockets.

It'd be a significant amount of work to rebase it on top of current
HEAD. I guess the easiest thing would be to try an older version of the
patch with the xadd in place, and use a tree from back then.

Greetings,

Andres Freund
On Wed, Jun 10, 2015 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
>> Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
>> try your patch there, bound to varying numbers of sockets.
>
> It'd be a significant amount of work to rebase it on top of current
> HEAD. I guess the easiest thing would be to try an older version of the
> patch with the xadd in place, and use a tree from back then.

We could do that, I guess. But we've made other significant improvements
in the meantime, so...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
The whole thing turns out to be based on wrong baseline data, taken with
a pgbench client running from a remote machine. It all started out from
an investigation against 9.3. Curiously enough, the s_lock() problem
that existed in 9.3 has a very similar effect on throughput as a network
bottleneck has on 9.5.

Sorry for the noise, Jan

On 06/10/2015 09:18 AM, Jan Wieck wrote:
> Hi,
>
> I think I may have found one of the problems PostgreSQL has on machines
> with many NUMA nodes. I am not yet sure what exactly happens on the NUMA
> bus, but there seems to be a tipping point at which the spinlock
> concurrency wreaks havoc and the performance of the database collapses.
>
> On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total,
> a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput
> then takes a very sharp dive and reaches around 20,000 TPS at 120
> clients. It never recovers from there.
>
> The attached patch demonstrates that less aggressive spinning and (much)
> more often delaying improves the performance "on this type of machine".
> The 8 socket machine in question scales to over 350,000 TPS.
>
> The patch is meant to demonstrate this effect only. It has a negative
> performance impact on smaller machines and client counts < #cores, so
> the real solution will probably look much different. But I thought it
> would be good to share this and start the discussion about reevaluating
> the spinlock code before PGCon.
>
> Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info
On Sun, Jun 14, 2015 at 9:52 AM, Jan Wieck <jan@wi3ck.info> wrote:
> The whole thing turns out to be based on wrong baseline data, taken with
> a pgbench client running from a remote machine. It all started out from
> an investigation against 9.3. Curiously enough, the s_lock() problem
> that existed in 9.3 has a very similar effect on throughput as a network
> bottleneck has on 9.5.

So, changing max_spins_per_delay no longer helps on 9.5?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company