Thread: Spinlocks, yet again: analysis and proposed patches
The test case I just posted shows that our spinlock code, which we had thought largely done, is once again becoming a performance bottleneck. It's time to resurrect some of the ideas we kicked around in early 2002, and didn't pursue because we decided spinlocks weren't our major performance problem.

I started investigating this by trying to understand why the "futex spinlock" patch had helped the Red Hat customer I mentioned, when it hadn't done anything much in last year's experiments. It turns out that this customer was using an 8-way Opteron, and the futex patch helps a lot more on that architecture than others.

Some random experimentation eventually turned up part of the reason: the x86 TAS assembly code we are using is pessimal for x86_64. Specifically, s_lock.h uses this for both architectures:

    /* Use a non-locking test before asserting the bus lock */
    __asm__ __volatile__(
        "   cmpb    $0,%1   \n"
        "   jne     1f      \n"
        "   lock            \n"
        "   xchgb   %0,%1   \n"
        "1: \n"

and it turns out that deleting the cmpb and jne makes things go significantly faster on Opterons. So the "non-locking test" is not a win at all on x86_64. As best I can tell from testing, removing it wins because it causes the rate of spinlock delays to drop significantly. Why this should be is not completely obvious. I speculate that the Opterons are capable of pulling in a copy of the cache line that contains the spinlock and then running 100 iterations of the cmpb test without ever noticing that the processor holding the lock has now released it. Without the cmpb, they are forced by the locking xchgb test to actually look at the real state of the spinlock each time.

I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: it can only be a win if you expect a lot of contention for the spin lock, but in percentage terms we still have a very low conflict rate, so in most executions of the TAS macro, the cmpb is just wasted cycles. Bottom line: we definitely don't want it for x86_64, and maybe not at all, but we need more research to decide the latter.

The second reason that the futex patch is helping is that when a spinlock delay does occur, it allows the delaying process to be awoken almost immediately, rather than delaying 10 msec or more as the existing code does. However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want.

I looked into several other methods of doing the spinlock delay instead. I think all of these were suggested at one point or another in our earlier discussions of spinlocks:

1. Use sched_yield() if available: it does just what we want, ie, yield the processor without forcing any useless time delay before we can be rescheduled. This doesn't exist everywhere but it exists in recent Linuxen, so I tried it. It made for a beautiful improvement in my test case numbers: CPU utilization went to 100% and the context swap rate to almost nil. Unfortunately, I also saw fairly frequent "stuck spinlock" panics when running more queries than there were processors --- this despite increasing NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.)

2. Reduce the length of the select() delays in s_lock. The current code delays in quanta of 10msec, but on recent Linuxen (and I think other platforms too) the scheduling quantum is 1msec, so we can get the processor back faster if we ask for a 1msec delay. I tried this and it is definitely a win on Linux; it doesn't seem to hurt anything on older systems either, they just round the delay up to 10msec like before. So I think we should do this, even though it's only a partial answer.

3. Modify the spin loop to do a little more work between TAS() tests. In the existing code there are only about half a dozen instructions total in the normal spin loop, most of them very fast, with the result that the spinning processor does its best to monopolize the system bus with locked probe instructions. This is obviously not good, as it'll interfere with the ability of the spinlock holder to complete its work and release the lock. (The bulk of the spinlock uses are for LWLocks, and with the current data structures the LWLock's own state is usually going to be in the same cache line as the spinlock proper, so that cache grabs for the spinlock hurt the LWLock state updating as well as the eventual spinlock release.) I modified the code to use an integer modulo operation as part of the test for SPINS_PER_DELAY; on most processors this should keep the processor's attention and keep it away from the system bus for several times longer than an average instruction. (It's relevant here that the "rep;nop" used to slow the spin loop on recent Intel processors appears to be completely ignored by Opterons. Basically we need a non-hardware-specific trick to slow the loop.)

4. Modify SPINS_PER_DELAY. Looping more times instead of entering the kernel is a win for SMP boxes: whenever you give up and call the kernel to delay, it's likely that the spinlock holder will have released the lock before you even get all the way into the kernel, let alone get rescheduled again. I found that pushing SPINS_PER_DELAY above 200 caused the select delay rate to fall to almost nil in this test case. However, that would be counterproductive for uniprocessors where the optimal strategy is normally to give up the processor immediately. We've talked about making the spin count user configurable, but I never cared for that answer. Instead, I've worked up some code to automatically adjust the spin count depending on whether we seem to be in a uniprocessor or multiprocessor. The idea is to decrease the spin count (down to a lower limit) whenever s_lock has to delay in order to get a lock, and to increase the count (up to a limit) whenever we see the lock come free before we've had to delay. The former condition suggests that we are in a uniprocessor and the latter suggests we are in a multiprocessor. I recall having complained that this idea would probably be unstable, but in testing it seems to do pretty nicely at converging to the minimum spin count on a 1-CPU machine and the maximum limit on an SMP box. It could do with more testing though to see if it works the same for other people.

My proposal therefore is to do #2, #3, and #4, and to modify the TAS assembly code at least on x86_64.
Together, these changes bring my test case on a 4-way Opteron from N, runtime: 1 36s 2 61s 4 105s 8 198s to N, runtime: 1 35s 2 48s 4 58s 8 105s At 4 queries, CPU utilization is 100% and the context swap rate nearly nil, which seems to mean we've gained another (temporary no doubt) victory over the spinlock problem. I attach two proposed patches: the first removes the cmpb/jne from the x86 TAS assembly code, and the second one does the s_lock changes enumerated as points #2, #3, #4. The first one in particular needs more testing to see if it hurts performance on any non-Opteron x86 chips. (If so, we'd just make it conditional to x86_64.) Comments and testing invited. regards, tom lane *** REL8_0/src/include/storage/s_lock.h Sun Aug 28 20:41:44 2005 --- new/s_lock.h Fri Sep 9 14:58:44 2005 *************** *** 120,132 **** { register slock_t _res = 1; ! /* Use a non-locking test before asserting the bus lock */ __asm__ __volatile__( - " cmpb $0,%1 \n" - " jne 1f \n" - " lock \n" " xchgb %0,%1 \n" - "1: \n" : "+q"(_res), "+m"(*lock) : : "memory", "cc"); --- 120,128 ---- { register slock_t _res = 1; ! /* xchg implies a LOCK prefix, so no need to say LOCK explicitly */ __asm__ __volatile__( " xchgb %0,%1 \n" : "+q"(_res), "+m"(*lock) : : "memory", "cc"); *** /home/postgres/pgsql/src/backend/storage/lmgr/proc.c.orig Sat Aug 20 19:26:24 2005 --- /home/postgres/pgsql/src/backend/storage/lmgr/proc.c Sun Sep 11 15:07:20 2005 *************** *** 170,175 **** --- 170,177 ---- ProcGlobal->freeProcs = INVALID_OFFSET; + ProcGlobal->spins_per_delay = DEFAULT_SPINS_PER_DELAY; + /* * Pre-create the PGPROC structures and create a semaphore for * each. *************** *** 224,232 **** --- 226,239 ---- /* * Try to get a proc struct from the free list. If this fails, we * must be out of PGPROC structures (not to mention semaphores). + * + * While we are holding the ProcStructLock, also copy the current + * shared estimate of spins_per_delay to local storage. */ SpinLockAcquire(ProcStructLock); + set_spins_per_delay(procglobal->spins_per_delay); + myOffset = procglobal->freeProcs; if (myOffset != INVALID_OFFSET) *************** *** 318,338 **** Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS); dummyproc = &DummyProcs[proctype]; /* * dummyproc should not presently be in use by anyone else */ if (dummyproc->pid != 0) elog(FATAL, "DummyProc[%d] is in use by PID %d", proctype, dummyproc->pid); MyProc = dummyproc; /* * Initialize all fields of MyProc, except MyProc->sem which was set * up by InitProcGlobal. */ - MyProc->pid = MyProcPid; /* marks dummy proc as in use by me */ SHMQueueElemInit(&(MyProc->links)); MyProc->waitStatus = STATUS_OK; MyProc->xid = InvalidTransactionId; --- 325,362 ---- Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS); + /* + * Just for paranoia's sake, we use the ProcStructLock to protect + * assignment and releasing of DummyProcs entries. + * + * While we are holding the ProcStructLock, also copy the current + * shared estimate of spins_per_delay to local storage. 
+ */ + SpinLockAcquire(ProcStructLock); + + set_spins_per_delay(ProcGlobal->spins_per_delay); + dummyproc = &DummyProcs[proctype]; /* * dummyproc should not presently be in use by anyone else */ if (dummyproc->pid != 0) + { + SpinLockRelease(ProcStructLock); elog(FATAL, "DummyProc[%d] is in use by PID %d", proctype, dummyproc->pid); + } MyProc = dummyproc; + MyProc->pid = MyProcPid; /* marks dummy proc as in use by me */ + + SpinLockRelease(ProcStructLock); + /* * Initialize all fields of MyProc, except MyProc->sem which was set * up by InitProcGlobal. */ SHMQueueElemInit(&(MyProc->links)); MyProc->waitStatus = STATUS_OK; MyProc->xid = InvalidTransactionId; *************** *** 509,514 **** --- 533,541 ---- /* PGPROC struct isn't mine anymore */ MyProc = NULL; + /* Update shared estimate of spins_per_delay */ + procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay); + SpinLockRelease(ProcStructLock); } *************** *** 532,542 **** --- 559,576 ---- /* Release any LW locks I am holding (see notes above) */ LWLockReleaseAll(); + SpinLockAcquire(ProcStructLock); + /* Mark dummy proc no longer in use */ MyProc->pid = 0; /* PGPROC struct isn't mine anymore */ MyProc = NULL; + + /* Update shared estimate of spins_per_delay */ + ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay); + + SpinLockRelease(ProcStructLock); } *** /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c.orig Fri Aug 26 10:47:35 2005 --- /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c Sun Sep 11 17:43:20 2005 *************** *** 21,26 **** --- 21,30 ---- #include "storage/s_lock.h" #include "miscadmin.h" + + static int spins_per_delay = DEFAULT_SPINS_PER_DELAY; + + /* * s_lock_stuck() - complain about a stuck spinlock */ *************** *** 49,102 **** * We loop tightly for awhile, then delay using pg_usleep() and try * again. Preferably, "awhile" should be a small multiple of the * maximum time we expect a spinlock to be held. 100 iterations seems ! * about right. In most multi-CPU scenarios, the spinlock is probably ! * held by a process on another CPU and will be released before we ! * finish 100 iterations. However, on a uniprocessor, the tight loop ! * is just a waste of cycles, so don't iterate thousands of times. * * Once we do decide to block, we use randomly increasing pg_usleep() ! * delays. The first delay is 10 msec, then the delay randomly ! * increases to about one second, after which we reset to 10 msec and * start again. The idea here is that in the presence of heavy * contention we need to increase the delay, else the spinlock holder * may never get to run and release the lock. (Consider situation * where spinlock holder has been nice'd down in priority by the * scheduler --- it will not get scheduled until all would-be ! * acquirers are sleeping, so if we always use a 10-msec sleep, there * is a real possibility of starvation.) But we can't just clamp the * delay to an upper bound, else it would take a long time to make a * reasonable number of tries. * * We time out and declare error after NUM_DELAYS delays (thus, exactly * that many tries). With the given settings, this will usually take ! * 3 or so minutes. It seems better to fix the total number of tries * (and thus the probability of unintended failure) than to fix the * total time spent. * ! * The pg_usleep() delays are measured in centiseconds (0.01 sec) because ! * 10 msec is a common resolution limit at the OS level. */ ! #define SPINS_PER_DELAY 100 #define NUM_DELAYS 1000 ! 
#define MIN_DELAY_CSEC 1 ! #define MAX_DELAY_CSEC 100 ! int spins = 0; int delays = 0; ! int cur_delay = MIN_DELAY_CSEC; while (TAS(lock)) { /* CPU-specific delay each time through the loop */ SPIN_DELAY(); ! /* Block the process every SPINS_PER_DELAY tries */ ! if (++spins > SPINS_PER_DELAY) { if (++delays > NUM_DELAYS) s_lock_stuck(lock, file, line); ! pg_usleep(cur_delay * 10000L); #if defined(S_LOCK_TEST) fprintf(stdout, "*"); --- 53,121 ---- * We loop tightly for awhile, then delay using pg_usleep() and try * again. Preferably, "awhile" should be a small multiple of the * maximum time we expect a spinlock to be held. 100 iterations seems ! * about right as an initial guess. However, on a uniprocessor the ! * loop is a waste of cycles, while in a multi-CPU scenario it's usually ! * better to spin a bit longer than to call the kernel, so we try to ! * adapt the spin loop count depending on whether we seem to be in ! * a uniprocessor or multiprocessor. * * Once we do decide to block, we use randomly increasing pg_usleep() ! * delays. The first delay is 1 msec, then the delay randomly ! * increases to about one second, after which we reset to 1 msec and * start again. The idea here is that in the presence of heavy * contention we need to increase the delay, else the spinlock holder * may never get to run and release the lock. (Consider situation * where spinlock holder has been nice'd down in priority by the * scheduler --- it will not get scheduled until all would-be ! * acquirers are sleeping, so if we always use a 1-msec sleep, there * is a real possibility of starvation.) But we can't just clamp the * delay to an upper bound, else it would take a long time to make a * reasonable number of tries. * * We time out and declare error after NUM_DELAYS delays (thus, exactly * that many tries). With the given settings, this will usually take ! * 2 or so minutes. It seems better to fix the total number of tries * (and thus the probability of unintended failure) than to fix the * total time spent. * ! * The pg_usleep() delays are measured in milliseconds because 1 msec ! * is a common resolution limit at the OS level for newer platforms. ! * On older platforms the resolution limit is usually 10 msec, in ! * which case the total delay before timeout will be a bit more. */ ! #define MIN_SPINS_PER_DELAY 10 ! #define MAX_SPINS_PER_DELAY 1000 #define NUM_DELAYS 1000 ! #define MIN_DELAY_MSEC 1 ! #define MAX_DELAY_MSEC 1000 ! int spins = spins_per_delay; int delays = 0; ! int cur_delay = 0; while (TAS(lock)) { /* CPU-specific delay each time through the loop */ SPIN_DELAY(); ! /* ! * Block the process every spins_per_delay tries. ! * ! * What we are really testing for here is spins being decremented ! * to zero. We insert an unnecessary integer modulo operation ! * into the test because we'd like this loop to run longer than ! * just two or three instructions: ideally, the processor should ! * not be contending for the system bus for a little while here. ! */ ! if ((--spins % MAX_SPINS_PER_DELAY) == 0) { if (++delays > NUM_DELAYS) s_lock_stuck(lock, file, line); ! if (cur_delay == 0) /* first time to delay? */ ! cur_delay = MIN_DELAY_MSEC; ! ! pg_usleep(cur_delay * 1000L); #if defined(S_LOCK_TEST) fprintf(stdout, "*"); *************** *** 107,119 **** cur_delay += (int) (cur_delay * (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5); /* wrap back to minimum delay when max is exceeded */ ! if (cur_delay > MAX_DELAY_CSEC) ! cur_delay = MIN_DELAY_CSEC; ! 
spins = 0; } } } /* * Various TAS implementations that cannot live in s_lock.h as no inline --- 126,200 ---- cur_delay += (int) (cur_delay * (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5); /* wrap back to minimum delay when max is exceeded */ ! if (cur_delay > MAX_DELAY_MSEC) ! cur_delay = MIN_DELAY_MSEC; ! spins = spins_per_delay; } } + + /* + * If we were able to acquire the lock without delaying, it's a good + * indication we are in a multiprocessor. If we had to delay, it's + * a sign (but not a sure thing) that we are in a uniprocessor. + * Hence, we decrement spins_per_delay slowly when we had to delay, + * and increase it rapidly when we didn't. It's expected that + * spins_per_delay will converge to the minimum value on a uniprocessor + * and to the maximum value on a multiprocessor. + * + * Note: spins_per_delay is local within our current process. + * We want to average these observations across multiple backends, + * since it's relatively rare for this function to even get entered, + * and so a single backend might not live long enough to converge on + * a good value. That is handled by the two routines below. + */ + if (cur_delay == 0) + { + /* we never had to delay */ + spins_per_delay += 100; + spins_per_delay = Min(spins_per_delay, MAX_SPINS_PER_DELAY); + } + else + { + spins_per_delay -= 1; + spins_per_delay = Max(spins_per_delay, MIN_SPINS_PER_DELAY); + } + } + + + /* + * Set local copy of spins_per_delay during backend startup. + * + * NB: this has to be pretty fast as it is called while holding a spinlock + */ + void + set_spins_per_delay(int shared_spins_per_delay) + { + spins_per_delay = shared_spins_per_delay; + } + + /* + * Update shared estimate of spins_per_delay during backend exit. + * + * NB: this has to be pretty fast as it is called while holding a spinlock + */ + int + update_spins_per_delay(int shared_spins_per_delay) + { + /* + * We use an exponential moving average with a relatively slow + * adaption rate, so that noise in any one backend's result won't + * affect the shared value too much. As long as both inputs are + * within the allowed range, the result must be too, so we need not + * worry about clamping the result. + * + * We deliberately truncate rather than rounding; this is so that + * single adjustments inside a backend can affect the shared estimate + * (see the asymmetric adjustment rules above). + */ + return (shared_spins_per_delay * 15 + spins_per_delay) / 16; } + /* * Various TAS implementations that cannot live in s_lock.h as no inline *** /home/postgres/pgsql/src/include/storage/proc.h.orig Sat Aug 20 19:26:34 2005 --- /home/postgres/pgsql/src/include/storage/proc.h Sun Sep 11 15:07:09 2005 *************** *** 105,110 **** --- 105,112 ---- { /* Head of list of free PGPROC structures */ SHMEM_OFFSET freeProcs; + /* Current shared estimate of appropriate spins_per_delay value */ + int spins_per_delay; } PROC_HDR; *** /home/postgres/pgsql/src/include/storage/s_lock.h.orig Sun Aug 28 20:41:34 2005 --- /home/postgres/pgsql/src/include/storage/s_lock.h Sun Sep 11 15:07:09 2005 *************** *** 826,829 **** --- 826,835 ---- */ extern void s_lock(volatile slock_t *lock, const char *file, int line); + /* Support for dynamic adjustment of spins_per_delay */ + #define DEFAULT_SPINS_PER_DELAY 100 + + extern void set_spins_per_delay(int shared_spins_per_delay); + extern int update_spins_per_delay(int shared_spins_per_delay); + #endif /* S_LOCK_H */
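(If the cmpb turns out to help on plain x86 but to hurt on x86_64, the TAS routine could be made conditional, roughly as sketched below. This is only an illustration of that fallback, not part of the attached patches.)

static __inline__ int
tas(volatile slock_t *lock)
{
    register slock_t _res = 1;

#ifndef __x86_64__
    /* keep the non-locking test on 32-bit x86, where it may still help */
    __asm__ __volatile__(
        "   cmpb    $0,%1   \n"
        "   jne     1f      \n"
        "   lock            \n"
        "   xchgb   %0,%1   \n"
        "1: \n"
        :       "+q"(_res), "+m"(*lock)
        :
        :       "memory", "cc");
#else
    /* on x86_64, skip the cmpb; xchg asserts the bus lock all by itself */
    __asm__ __volatile__(
        "   xchgb   %0,%1   \n"
        :       "+q"(_res), "+m"(*lock)
        :
        :       "memory", "cc");
#endif
    return (int) _res;
}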
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: it can only be a win if you expect a lot of contention for the spin lock, but in percentage terms we still have a very low conflict rate, so in most executions of the TAS macro, the cmpb is just wasted cycles. Bottom line: we definitely don't want it for x86_64, and maybe not at all, but we need more research to decide the latter.

I think an important question is whether this is for x86_64 in general, or Opteron-specific. It could be that it's not the same on Intel's EM64Ts. One reason this might behave differently for Opterons is that it's a cc-NUMA instead of the "normal" SMP.

Something else to consider is the OS you're using. I've been told that Linux isn't that good in NUMA and FreeBSD might be better.

Kurt
Kurt Roeckx <kurt@roeckx.be> writes:
> On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
>> I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors:
> I think an important question is whether this is for x86_64 in general, or Opteron-specific. It could be that it's not the same on Intel's EM64Ts.

Good point --- anyone have one to try?

> Something else to consider is the OS you're using. I've been told that Linux isn't that good in NUMA and FreeBSD might be better.

It's hard to see how the OS could affect behavior at the level of processor cache operations --- unless they did something truly spectacularly stupid, like mark main memory non-cacheable.

regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Kurt Roeckx <kurt@roeckx.be> writes: > > On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: > >> I kinda suspect that the cmpb test is a no-op or loss on all > >> Intelish processors: > > > I think an important question is wether this is for x86_64 in > > general, of opteron specific. It could be that it's not the same > > on Intel's EM64Ts. > > Good point --- anyone have one to try? I've got one I can test on. I need to upgrade the kernel and some other things on it though (it's running 2.6.8 atm, and an older Debian/unstable which I should probably bring up to current). I'll work on it starting now and post results once I get some. Stephen
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > My proposal therefore is to do #2, #3, and #4, and to modify the TAS > assembly code at least on x86_64. Together, these changes bring > my test case on a 4-way Opteron from > > N, runtime: 1 36s 2 61s 4 105s 8 198s em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12: N, runtime: 1 31s 2 47s 4 86s 8 159s > to > > N, runtime: 1 35s 2 48s 4 58s 8 105s N, runtime: 1 32s 2 53s 4 90s 8 169s CPU utilization is definitely higher when running with the patch though. Hope this helps, happy to do additional testing if necessary. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12: > N, runtime: 1 31s 2 47s 4 86s 8 159s > N, runtime: 1 32s 2 53s 4 90s 8 169s Er, which (or both) of the two patches did you apply here? regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Stephen Frost <sfrost@snowman.net> writes: > > em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12: > > > N, runtime: 1 31s 2 47s 4 86s 8 159s > > > N, runtime: 1 32s 2 53s 4 90s 8 169s > > Er, which (or both) of the two patches did you apply here? Applied both, sorry that wasn't clear. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> Er, which (or both) of the two patches did you apply here? > Applied both, sorry that wasn't clear. Thanks. If you've got the time, could you try the two patches separately and see what you get? regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > My proposal therefore is to do #2, #3, and #4, and to modify the TAS > assembly code at least on x86_64. Together, these changes bring > my test case on a 4-way Opteron from > Some changes are based on tests and heuristics, so can we make them into the configure script so the choice could be made automatically? Regards, Qingqing
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Stephen Frost <sfrost@snowman.net> writes: > > * Tom Lane (tgl@sss.pgh.pa.us) wrote: > >> Er, which (or both) of the two patches did you apply here? > > > Applied both, sorry that wasn't clear. > > Thanks. If you've got the time, could you try the two patches > separately and see what you get? Sure. CVS Head: N, runtime: 1 31s 2 47s 4 86s 8 159s With just slock-no-cmpb.patch: N, runtime: 1 32s 2 39s 4 82s 8 167s With just spin-delay.patch N, runtime: 1 32s 2 52s 4 94s 8 164s With both: N, runtime: 1 32s 2 53s 4 90s 8 169s Hope that helps, Stephen
Tom Lane <tgl@sss.pgh.pa.us> writes: > > Something else to consider is the OS you're using. I've been > > told that Linux isn't that good in NUMA and FreeBSD might be > > better. > > It's hard to see how the OS could affect behavior at the level of > processor cache operations --- unless they did something truly > spectacularly stupid, like mark main memory non-cacheable. Well it could schedule processes on processors in ways that force less than optimal memory usage patterns. But maybe you should tell the Altix folk with their 32-processor 384Gb NUMA machines what you've "been told" about Linux not being that good in NUMA. Really, these kind of cargo cult anecdotes are pretty pointless. -- greg
* Stephen Frost (sfrost@snowman.net) wrote: > > Thanks. If you've got the time, could you try the two patches > > separately and see what you get? > > Sure. [...] Just to be clear- this was from a completely default 'make install' using the Debian configure options from 8.0.3 (which aren't that particularly interesting really- nls, integer-datetimes, debug, disable-rpath, tcl, perl, python, pam, krb5, openssl, gnu-ld, enable-thread-safety). If it'd be useful for the test I can adjust whichever parameters are appropriate (work_mem, cache_size, etc). There's absolutely nothing else going on except for these test, a few shells, top, etc, on the box. Enjoy, Stephen
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > Some changes are based on tests and heuristics, so can we make them into the > configure script so the choice could be made automatically? It's a bit premature to propose that, when we don't yet know if the suggested changes are a win or loss under any conditions, let alone how to automatically tell the difference. Right now we are in the find-out-the-facts mode ... regards, tom lane
Tom Lane wrote: > I attach two proposed patches: the first removes the cmpb/jne from > the x86 TAS assembly code, and the second one does the s_lock changes > enumerated as points #2, #3, #4. The first one in particular needs > more testing to see if it hurts performance on any non-Opteron x86 > chips. (If so, we'd just make it conditional to x86_64.) > 2x PIII 1G 2G Freebsd 6.0Beta4 8.1beta1 (2005-08-28): N runtime: 1 85s 2 139s 4 220s 8.1beta1 (2005-08-28) + patch 1 (s_lock.h only) N runtime: 1 89s 2 137s 4 229s 8.1beta1 (2005-08-28) + patch 2 N runtime: 1 84s 2 108s 4 214s Observe the interesting little speed improvement for patch 2 with 2 processes (seems to be repeatable). Let me know if you want to see vmstat output for any of these. regards Mark
Tom Lane wrote:
> Comments and testing invited.

I have tested the patches on a Dual Xeon 2.4 GHz w/ HT (no EM64T). (Configured with "CFLAGS='-O2 -mcpu=pentium4 -march=pentium4' --enable-cassert".) The results were pretty stable (around .2 seconds). I would not trust the numbers for N=2; Linux, at least 2.4, is not good at avoiding scheduling two running processes on two different HTs of the same core. Those values also had the most variance (> 1s). All other measures were quite stable over several runs.

CVS tip from 2005-09-12 ~16:00
1: 57s 2: 82s 4: 124s 8: 237s

with only slock-no-cmpb.patch applied
1: 55s 2: 79s 4: 119s 8: 229s

with only spin-delay.patch applied
1: 56s 2: 79s 4: 124s 8: 235s

with both patches applied
1: 55s 2: 78s 4: 124s 8: 235s

compare to 7.4.8 on the same machine ;-)
1: 92s 2: 235s 4: 474s 8: did not try ...

It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch does not really help much on this machine. That seems to match Stephen Frost's results with EM64T, if I read them correctly.

The cs rate is about 150 on CVS tip without patches and below 100 with the patches (all three cases). With 7.4.8 it's 230000-280000 with N>1. 8.1 is clearly the winner here. Great work, Tom.

I hope some more data helps.

Best Regards,
Michael Paesold
"Michael Paesold" <mpaesold@gmx.at> writes: > It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch > does not really help much on this machine. That seems to match Stephen > Frost's results with EM64T, if I read them correctly. Yeah, it's interesting that you both see slock-no-cmpb as saving some cycles and the second patch as giving them back. I wonder whether the integer-modulo-to-slow-the-loop trick is counterproductive on your machines. You both seem to be using hardware that will recognize rep;nop and maybe that's all that's needed. I probably should have broken down the spindelay patch into multiple components. But it's only a small change --- could you try simplifying the patched line if ((--spins % MAX_SPINS_PER_DELAY) == 0) to if (--spins == 0) and see how the patch does that way? regards, tom lane
Tom Lane wrote: > I probably should have broken down the spindelay patch into multiple > components. But it's only a small change --- could you try simplifying > the patched line > > if ((--spins % MAX_SPINS_PER_DELAY) == 0) > > to > > if (--spins == 0) > > and see how the patch does that way? I'll do tomorrow morning (CEST, i.e. in about 11 hours). Best Regards, Michael Paesold
I wrote: > ... and see how the patch does that way? BTW, please do look at "vmstat 1" while running the test case corresponding to your number of processors. It's hard to tell from the runtime alone whether the patch is fully accomplishing its goal of reducing wasted cycles. If you see user CPU percent go to 100 and context swaps drop to a low value, though, you know it's working. regards, tom lane
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: > The second reason that the futex patch is helping is that when > a spinlock delay does occur, it allows the delaying process to be > awoken almost immediately, rather than delaying 10 msec or more > as the existing code does. However, given that we are only expecting > the spinlock to be held for a couple dozen instructions, using the > kernel futex mechanism is huge overkill --- the in-kernel overhead > to manage the futex state is almost certainly several orders of > magnitude more than the delay we actually want. Why do you think so? AFAIK on uncontented case there will be no kernel access, only atomic inc/dec. On contented case you'll want task switch anyway, so the futex managing should not matter. Also this mechanism is specifically optimized for inter-process locking, I don't think you can get more efficient mechanism from side-effects from generic syscalls. If you don't want Linux-specific locking in core code, then it's another matter... > 1. Use sched_yield() if available: it does just what we want, > ie, yield the processor without forcing any useless time delay > before we can be rescheduled. This doesn't exist everywhere > but it exists in recent Linuxen, so I tried it. It made for a > beautiful improvement in my test case numbers: CPU utilization > went to 100% and the context swap rate to almost nil. Unfortunately, > I also saw fairly frequent "stuck spinlock" panics when running > more queries than there were processors --- this despite increasing > NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield > anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. > (I speculate that it's set up to only yield the processor to other > processes already affiliated to that processor. In any case, it > is definitely capable of getting through 10000 yields without > running the guy who's holding the spinlock.) This is intended behaviour of sched_yield. http://lwn.net/Articles/31462/ http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2 -- marko
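(For readers who haven't looked at the futex interface being discussed: glibc provides no wrapper for it, so it is reached via syscall(). A bare-bones, Linux-only sketch of the two primitives --- illustrative only, the actual futex patch is considerably more involved:)

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void
futex_wait(volatile int *addr, int expected)
{
    /* sleeps in the kernel only if *addr still equals 'expected' */
    syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static void
futex_wake(volatile int *addr, int nwaiters)
{
    /* wakes up to 'nwaiters' processes sleeping on addr */
    syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
}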
I wrote:
> I'll do tomorrow morning (CEST, i.e. in about 11 hours).

These are the tests with the change:

> if ((--spins % MAX_SPINS_PER_DELAY) == 0)
>
> to
>
> if (--spins == 0)

I have called the resulting patch (spin-delay + this change) spin-delay-2.

again with only slock-no-cmpb applied
1: 55s 4: 119s (cpu ~97%)

with only spin-delay-2 applied
1: 56s 4: 125s (cpu ~99.5%)

with slock-no-cmpb and spin-delay-2 applied
1: 56s 4: 125s (cpu ~100%)

(Note that the cpu averages are not calculated but estimated by looking at vmstat.)

Yesterday's results:
> CVS tip from 2005-09-12 ~16:00
> 1: 57s 2: 82s 4: 124s 8: 237s
>
> with only slock-no-cmpb.patch applied
> 1: 55s 2: 79s 4: 119s 8: 229s
>
> with only spin-delay.patch applied
> 1: 56s 2: 79s 4: 124s 8: 235s
>
> with both patches applied
> 1: 55s 2: 78s 4: 124s 8: 235s

To have other data, I have retested the patches on a single-cpu Intel P4 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2.4 GHz dual-Xeon results it's clear that this is in reality only one cpu. While the runtime for N=1 is better than the other system, for N=4 it's already worse. The situation with the patches is quite different, though. Unfortunately.

CVS tip from 2005-09-12:
1: 36s 2: 77s (cpu ~85%) 4: 159s (cpu ~98%)

only slock-no-cmpb:
1: 36s 2: 81s (cpu ~79%) 4: 177s (cpu ~94%)
(doesn't help this time)

only spin-delay:
1: 36s 2: 86s (cpu =100%) 4: 157s (cpu =100%)
(bad runtime for N=2 (repeatable), cpu not doing real work here?)

slock-no-cmpb and spin-delay:
1: 36s 2: 106s (cpu =100%) 4: 192s (cpu =100%)
(it gets worse)

only spin-delay-2:
1: 36s 2: 85s (cpu =100%) 4: 160s (cpu =100%)
(quite the same as spin-delay)

slock-no-cmpb and spin-delay-2:
1: 36s 2: 109s (cpu =100%) 4: 198s (cpu =100%)
(worse again)

CS rate was low (20-50) for all tests, increasing for N>2, which is to be expected. For this system I see no case for applying these patches.

Is there a portable way to detect the CPU we are running on? Do you have any other idea how to implement the delays?

Best Regards,
Michael Paesold
Marko Kreen <marko@l-t.ee> writes: > > (I speculate that it's set up to only yield the processor to other > > processes already affiliated to that processor. In any case, it > > is definitely capable of getting through 10000 yields without > > running the guy who's holding the spinlock.) Maybe it should try sched_yield once and then use select after that? -- greg
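(A sketch of that hybrid against the delay loop in the spin-delay patch --- hypothetical, the variables are the ones from the patched s_lock(), and sched_yield() would need <sched.h> plus a configure test:)

    if (++delays > NUM_DELAYS)
        s_lock_stuck(lock, file, line);

    if (delays == 1)
    {
        /* first miss: just give up the CPU once, with no timed sleep */
        (void) sched_yield();
    }
    else
    {
        /* later misses: fall back to the randomized timed sleeps */
        if (cur_delay == 0)
            cur_delay = MIN_DELAY_MSEC;
        pg_usleep(cur_delay * 1000L);
    }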
Marko Kreen <marko@l-t.ee> writes: > On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: >> However, given that we are only expecting >> the spinlock to be held for a couple dozen instructions, using the >> kernel futex mechanism is huge overkill --- the in-kernel overhead >> to manage the futex state is almost certainly several orders of >> magnitude more than the delay we actually want. > Why do you think so? AFAIK on uncontented case there will be no > kernel access, only atomic inc/dec. In the uncontended case, we never even enter s_lock() and so the entire mechanism of yielding is irrelevant. The problem that's being exposed by these test cases is that on multiprocessors, you can see a significant rate of spinlock contention (order of 100 events/second, which is still a tiny fraction of the number of TAS calls) and our existing mechanism for dealing with contention is just not efficient enough. > On contented case you'll want task switch anyway, so the futex > managing should not matter. No, we DON'T want a task switch. That's the entire point: in a multiprocessor, it's a good bet that the spinlock is held by a task running on another processor, and doing a task switch will take orders of magnitude longer than just spinning until the lock is released. You should yield only after spinning long enough to make it a strong probability that the spinlock is held by a process that's lost the CPU and needs to be rescheduled. > If you don't want Linux-specific locking in core code, then > it's another matter... Well, it's true, we don't particularly want a one-platform solution, but if it did what we wanted we might hold our noses and use it anyway. (I think, BTW, that using futexes at the spinlock level is misguided; what would be interesting would be to see if we could substitute for both LWLock and spinlock logic with one futex-based module.) >> I also saw fairly frequent "stuck spinlock" panics when running >> more queries than there were processors --- this despite increasing >> NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield >> anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. >> (I speculate that it's set up to only yield the processor to other >> processes already affiliated to that processor. In any case, it >> is definitely capable of getting through 10000 yields without >> running the guy who's holding the spinlock.) > This is intended behaviour of sched_yield. > http://lwn.net/Articles/31462/ > http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2 No; that page still says specifically "So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again." I can prove that that is NOT what happens, at least not on a multi-CPU Opteron with current FC4 kernel. However, if the newer kernels penalize a process calling sched_yield as heavily as this page claims, then it's not what we want anyway ... regards, tom lane
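(As a purely illustrative sketch of that parenthetical idea --- a single futex-based module standing in for the spinlock-plus-LWLock pair --- here is roughly what the exclusive-acquire side could look like, reusing the futex_wait/futex_wake wrappers sketched earlier. atomic_cas() (compare-and-swap, returning true on success) and atomic_xchg() are hypothetical atomic primitives, and a real replacement would also have to provide the shared/exclusive semantics of LWLocks.)

typedef volatile int futex_lock;    /* 0 = free, 1 = held, 2 = held with waiters */

static void
futex_lock_acquire(futex_lock *lock)
{
    int         i;

    /* spin briefly, hoping the holder is running on another CPU */
    for (i = 0; i < 100; i++)
    {
        if (atomic_cas(lock, 0, 1))     /* 0 -> 1 only if currently free */
            return;
        SPIN_DELAY();
    }

    /* give up: mark the lock contended and sleep until a release wakes us */
    while (atomic_xchg(lock, 2) != 0)
        futex_wait(lock, 2);
}

static void
futex_lock_release(futex_lock *lock)
{
    if (atomic_xchg(lock, 0) == 2)
        futex_wake(lock, 1);            /* someone may be sleeping; wake one */
}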
Marko Kreen <marko@l-t.ee> writes: > On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: > > The second reason that the futex patch is helping is that when > > a spinlock delay does occur, it allows the delaying process to be > > awoken almost immediately, rather than delaying 10 msec or more > > as the existing code does. However, given that we are only expecting > > the spinlock to be held for a couple dozen instructions, using the > > kernel futex mechanism is huge overkill --- the in-kernel overhead > > to manage the futex state is almost certainly several orders of > > magnitude more than the delay we actually want. > > Why do you think so? AFAIK on uncontented case there will be no > kernel access, only atomic inc/dec. On contented case you'll > want task switch anyway, so the futex managing should not > matter. Also this mechanism is specifically optimized for > inter-process locking, I don't think you can get more efficient > mechanism from side-effects from generic syscalls. > > If you don't want Linux-specific locking in core code, then > it's another matter... How does the futex using code compare with the new patches on this new benchmark test? -- greg
Greg Stark <gsstark@mit.edu> writes: > Marko Kreen <marko@l-t.ee> writes: > (I speculate that it's set up to only yield the processor to other > processes already affiliated to that processor. In any case, it > is definitely capable of getting through 10000 yields without > running the guy who's holding the spinlock.) > Maybe it should try sched_yield once and then use select after that? I tried that, actually, but it didn't seem to offer any particular benefit. (Maybe it would have helped more on older Linux kernels before they changed sched_yield?) I'm feeling even more disenchanted with sched_yield now that Marko pointed out that the behavior was changed recently. Here we have a construct that is not portable cross-platform, does not act as documented in its man page, and the kernel guys feel free to whack its behavior around in major ways without documenting that either. It seems to be a crap-shoot whether it will be useful on any particular machine or not. At least with the select() code we can be reasonably confident we know what will happen. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > > On contented case you'll want task switch anyway, so the futex > > managing should not matter. > > No, we DON'T want a task switch. That's the entire point: in a > multiprocessor, it's a good bet that the spinlock is held by a task > running on another processor, and doing a task switch will take orders > of magnitude longer than just spinning until the lock is released. > You should yield only after spinning long enough to make it a strong > probability that the spinlock is held by a process that's lost the > CPU and needs to be rescheduled. Does the futex code make any attempt to record the CPU of the process grabbing the lock? Clearly it wouldn't be a guarantee of anything but if it's only used for short-lived spinlocks while acquiring longer lived locks then maybe? > No; that page still says specifically "So a process calling > sched_yield() now must wait until all other runnable processes in the > system have used up their time slices before it will get the processor > again." I can prove that that is NOT what happens, at least not on > a multi-CPU Opteron with current FC4 kernel. However, if the newer > kernels penalize a process calling sched_yield as heavily as this page > claims, then it's not what we want anyway ... Well it would be no worse than select or any other random i/o syscall. It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. -- greg
Tom, All: > It seems to me what you've found is an outright bug in the linux scheduler. > Perhaps posting it to linux-kernel would be worthwhile. For people using this on Linux 2.6, which scheduler are you using? Deadline is the recommended one for databases, and does offer significant (+5-8%) benefits in some heavy-contention environments, at least in OSDL tests. -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > Tom, All: > >> It seems to me what you've found is an outright bug in the linux scheduler. >> Perhaps posting it to linux-kernel would be worthwhile. > > For people using this on Linux 2.6, which scheduler are you using? Deadline > is the recommended one for databases, and does offer significant (+5-8%) > benefits in some heavy-contention environments, at least in OSDL tests. I thought 'deadline' was an I/O scheduler, not a CPU scheduler? -Doug
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: > >> No; that page still says specifically "So a process calling >> sched_yield() now must wait until all other runnable processes in the >> system have used up their time slices before it will get the processor >> again." I can prove that that is NOT what happens, at least not on >> a multi-CPU Opteron with current FC4 kernel. However, if the newer >> kernels penalize a process calling sched_yield as heavily as this page >> claims, then it's not what we want anyway ... > > Well it would be no worse than select or any other random i/o syscall. > > It seems to me what you've found is an outright bug in the linux scheduler. > Perhaps posting it to linux-kernel would be worthwhile. People have complained on l-k several times about the 2.6 sched_yield() behavior; the response has basically been "if you rely on any particular sched_yield() behavior for synchronization, your app is broken--it's not a synchronization primitive." -Doug
Douglas McNaught <doug@mcnaught.org> writes: > > It seems to me what you've found is an outright bug in the linux scheduler. > > Perhaps posting it to linux-kernel would be worthwhile. > > People have complained on l-k several times about the 2.6 > sched_yield() behavior; the response has basically been "if you rely > on any particular sched_yield() behavior for synchronization, your app > is broken--it's not a synchronization primitive." They're not talking about this. They're talking about applications that spin on sched_yield() and expect it to reduce cpu load as if the process were calling sleep(). What Tom found was that some processes are never scheduled when sched_yield is called. There's no reason that should be happening. -- greg
Greg Stark <gsstark@mit.edu> writes: > What Tom found was that some processes are never scheduled when sched_yield is > called. There's no reason that should be happening. Yeah, that would probably be a bug... -Doug
On Tue, 13 Sep 2005 12:21:45 -0400 Douglas McNaught <doug@mcnaught.org> wrote: > Josh Berkus <josh@agliodbs.com> writes: > > > Tom, All: > > > >> It seems to me what you've found is an outright bug in the linux scheduler. > >> Perhaps posting it to linux-kernel would be worthwhile. > > > > For people using this on Linux 2.6, which scheduler are you using? Deadline > > is the recommended one for databases, and does offer significant (+5-8%) > > benefits in some heavy-contention environments, at least in OSDL tests. > > I thought 'deadline' was an I/O scheduler, not a CPU scheduler? Yes, that's correct. That's an i/o elevator algorithm. Mark
Douglas McNaught <doug@mcnaught.org> writes: > Greg Stark <gsstark@mit.edu> writes: >> What Tom found was that some processes are never scheduled when sched_yield is >> called. There's no reason that should be happening. > Yeah, that would probably be a bug... I suspect the kernel hackers might consider it deliberate in a NUMA-aware kernel. regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I'm feeling even more disenchanted with sched_yield now that Marko pointed out that the behavior was changed recently. Here we have a

To be fair, I'm not entirely sure 'recently' is quite the right word. It sounds like it changed during the 2.5 development cycle, which was between major kernel releases, and it looks like it was around Dec 2003.

Thanks,
Stephen
"Michael Paesold" <mpaesold@gmx.at> writes: > To have other data, I have retested the patches on a single-cpu Intel P4 > 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2,4 dual-Xeon > results it's clear that this is in reality only one cpu. While the runtime > for N=1 is better than the other system, for N=4 it's already worse. The > situation with the patches is quite different, though. Unfortunatly. > CVS tip from 2005-09-12: > 1: 36s 2: 77s (cpu ~85%) 4: 159s (cpu ~98%) > only slock-no-cmpb: > 1: 36s 2: 81s (cpu ~79%) 4: 177s (cpu ~94%) > (doesn't help this time) Hm. This is the first configuration we've seen in which slock-no-cmpb was a loss. Could you double-check that result? > Is there a portable way to detect the CPU we are running on? Do you have any > other idea how to implement the delays? I can't see any reasonable way to do runtime switching of the cmpb test --- whatever logic we put in to control it would cost as much or more than the cmpb anyway :-(. I think that has to be a compile-time choice. From my perspective it'd be acceptable to remove the cmpb only for x86_64, since only there does it seem to be a really significant win. On the other hand it seems that removing the cmpb is a net win on most x86 setups too, so maybe we should just do it and accept that there are some cases where it's not perfect. Looking back at the original discussion, I see that the cmpb was added with little discussion and no specific performance testing, so it shouldn't be presumed to have a lot of weight behind it: http://archives.postgresql.org/pgsql-patches/2003-12/msg00352.php The other thing that's going on in the patch is dynamic adjustment of spins_per_delay, and that could certainly be made much smarter than it is now. For instance, it'd cost almost nothing to make the lower and upper bounds for spins_per_delay be variables instead of constants. We could set the bounds during postmaster startup based on some criteria yet to be determined. regards, tom lane
On Tue, Sep 13, 2005 at 10:10:13AM -0400, Tom Lane wrote:
> Marko Kreen <marko@l-t.ee> writes:
> > On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> >> However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want.
>
> > Why do you think so? AFAIK on uncontented case there will be no kernel access, only atomic inc/dec.
>
> In the uncontended case, we never even enter s_lock() and so the entire mechanism of yielding is irrelevant. The problem that's being exposed by these test cases is that on multiprocessors, you can see a significant rate of spinlock contention (order of 100 events/second, which is still a tiny fraction of the number of TAS calls) and our existing mechanism for dealing with contention is just not efficient enough.
>
> > On contented case you'll want task switch anyway, so the futex managing should not matter.
>
> No, we DON'T want a task switch. That's the entire point: in a multiprocessor, it's a good bet that the spinlock is held by a task running on another processor, and doing a task switch will take orders of magnitude longer than just spinning until the lock is released. You should yield only after spinning long enough to make it a strong probability that the spinlock is held by a process that's lost the CPU and needs to be rescheduled.

Hmm. I guess this could be separated into 2 cases:

1. Light load - both lock owner and lock requester wont get scheduled while busy (owner in critical section, waiter spinning.)

2. Big load - either or both of them gets scheduled while busy. (waiter is scheduled by OS or voluntarily by eg. calling select())

So my impression is that currently you optimize for #1 at the expense of #2, while with futexes you'd optimize for #2 at the expense of #1. Additionally I'm pretty convinced that futexes give you the most efficient implementation for #2, as the kernel knows what processes are waiting on a particular lock, so it can make the best decisions for scheduling.

> > If you don't want Linux-specific locking in core code, then it's another matter...
>
> Well, it's true, we don't particularly want a one-platform solution, but if it did what we wanted we might hold our noses and use it anyway.
>
> (I think, BTW, that using futexes at the spinlock level is misguided; what would be interesting would be to see if we could substitute for both LWLock and spinlock logic with one futex-based module.)

Use pthreads ;)

> >> I also saw fairly frequent "stuck spinlock" panics when running more queries than there were processors --- this despite increasing NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.)
>
> > This is intended behaviour of sched_yield.
> > http://lwn.net/Articles/31462/
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2
>
> No; that page still says specifically "So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again." I can prove that that is NOT what happens, at least not on a multi-CPU Opteron with current FC4 kernel. However, if the newer kernels penalize a process calling sched_yield as heavily as this page claims, then it's not what we want anyway ...

My fault. As I saw that there is a problem with sched_yield, I said "I bet this is because of behaviour change" and only skimmed the exact details. But the point that sched_yield is not meant for such usage still stands.

About fast yielding, the comment on sys_sched_yield() says:

 * sys_sched_yield - yield the current processor to other threads.
 *
 * this function yields the current CPU by moving the calling thread
 * to the expired array. If there are no other threads running on this
 * CPU then this function will return.

So there just is nothing else to schedule on that CPU.

-- marko
Marko Kreen <marko@l-t.ee> writes: > Hmm. I guess this could be separated into 2 cases: > 1. Light load - both lock owner and lock requester wont get > scheduled while busy (owner in critical section, waiter > spinning.) > 2. Big load - either or both of them gets scheduled while busy. > (waiter is scheduled by OS or voluntarily by eg. calling select()) Don't forget that the coding rules for our spinlocks say that you mustn't hold any such lock for more than a couple dozen instructions, and certainly any kernel call while holding the lock is Right Out. There is *no* case where the holder of a spinlock is going to voluntarily give up the CPU. The design intention was that the odds of losing the CPU while holding a spinlock would be negligibly small, simply because we don't hold it very long. > About fast yielding, comment on sys_sched_yield() says: > * sys_sched_yield - yield the current processor to other threads. > * > * this function yields the current CPU by moving the calling thread > * to the expired array. If there are no other threads running on this > * CPU then this function will return. Mph. So that's pretty much exactly what I suspected... I just had a thought: it seems that the reason we are seeing a significant issue here is that on SMP machines, the cost of trading exclusively-owned cache lines back and forth between processors is so high that the TAS instructions (specifically the xchgb, in the x86 cases) represent a significant fraction of backend execution time all by themselves. (We know this is the case due to oprofile results, see discussions from last April.) What that means is that there's a fair chance of a process losing its timeslice immediately after the xchgb. Which is precisely the scenario we do not want, if the process successfully acquired the spinlock by means of the xchgb. We could ameliorate this if there were a way to acquire ownership of the cache line without necessarily winning the spinlock. I'm imagining that we insert a "dummy" locked instruction just ahead of the xchgb, which touches the spinlock in such a way as to not change its state. (xchgb won't do for this, but maybe one of the other lockable instructions will.) We do the xchgb just after this one. The idea is that if we don't own the cache line, the first instruction causes it to be faulted into the processor's cache, and if our timeslice expires while that is happening, we lose the processor without having acquired the spinlock. This assumes that once we've got the cache line, the xchgb that actually does the work can get executed with not much extra time spent and only low probability of someone else stealing the cache line back first. The fact that cmpb isn't helping proves that getting the cache line in a read-only fashion does *not* do enough to protect the xchgb in this way. But maybe another locking instruction would. Comments? regards, tom lane
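(To make that concrete: one candidate for the dummy instruction, in the context of the tas() routine shown earlier, is a LOCKed add of zero. It doesn't change the spinlock's value, but it is a locked read-modify-write, so it should force the cache line into this processor's exclusive ownership before the xchgb runs. Untested sketch only:)

    /* untested sketch: claim the cache line, then do the real test-and-set */
    __asm__ __volatile__(
        "   lock            \n"
        "   addb    $0,%1   \n"     /* no-op update: pulls the line in exclusively */
        "   xchgb   %0,%1   \n"     /* actual TAS; xchg implies LOCK by itself */
        :       "+q"(_res), "+m"(*lock)
        :
        :       "memory", "cc");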
I wrote: > We could ameliorate this if there were a way to acquire ownership of the > cache line without necessarily winning the spinlock. Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. This had seemed like a good idea at the time, on the theory that once you'd obtained the spinlock you'd also have pulled in the LWLock contents. But that's only optimal under an assumption of low contention. If there's high contention for the spinlock, then another processor spinning on the lock will be continuously taking away ownership of the cache line and thus slowing down the guy who's got the lock and is trying to examine/update the LWLock contents. Maybe it'd be better to allocate the spinlocks off by themselves. Then, spinning processors would not affect the processor that's updating the LWLock; only when it finishes doing that and needs to clear the spinlock will it have to contend with the spinners for the cache line containing the spinlock. This would add an instruction or so to LWLockAcquire and LWLockRelease, and would be of no benefit on uniprocessors, but it might be worth doing for multiprocessors. Another patch to test ... I'm starting to think that we might have to succumb to having a compile option "optimize for multiprocessor" or "optimize for single processor". It's pretty hard to see how we'd alter a data structure decision like this on the fly. regards, tom lane
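(A sketch of what that layout change might look like --- the names and the padding size here are made up, not taken from any posted patch:)

/* Spinlocks kept in their own padded array, so that processes spinning
 * on a lock's mutex don't keep stealing the cache line that holds the
 * LWLock fields the lock holder is updating. */
#define SPINLOCK_PAD_SIZE   64      /* assumed cache line size; needs tuning */

typedef union LWLockMutex
{
    slock_t     mutex;
    char        pad[SPINLOCK_PAD_SIZE];
} LWLockMutex;

static LWLockMutex *LWLockMutexArray;   /* allocated apart from the LWLock array */

(LWLockAcquire/LWLockRelease would then address LWLockMutexArray[lockid].mutex instead of lock->mutex, which is the extra instruction or so mentioned above.)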
On Tue, 13 Sep 2005 Tom Lane wrote: > I wrote: > > We could ameliorate this if there were a way to acquire ownership of the > > cache line without necessarily winning the spinlock. > > Another thought came to mind: maybe the current data layout for LWLocks > is bad. Right now, the spinlock that protects each LWLock data struct > is itself part of the struct, and since the structs aren't large (circa > 20 bytes), the whole thing is usually all in the same cache line. This > had seemed like a good idea at the time, on the theory that once you'd > obtained the spinlock you'd also have pulled in the LWLock contents. > But that's only optimal under an assumption of low contention. If > there's high contention for the spinlock, then another processor > spinning on the lock will be continuously taking away ownership of the > cache line and thus slowing down the guy who's got the lock and is > trying to examine/update the LWLock contents. If this were the case, perhaps fetching the spin lock with read-only permission first should have helped. Modern processors have store buffers so that a store miss does not slow down the processor. Therefore, if processor A has the spin lock and is examining and updating the LWLock on the same cache line, then as long as processor B doesn't try to write the same cache line, processor A won't be slowed down. What would happen is that the multiple writes coalesce in the store buffer and are flushed to memory when processor A regains write permission. I still think your first scenario is possible: just as processor B finally gets the lock, its time slice runs out. This sounds to me like it is likely to cause a chain of context switches. > Maybe it'd be better to allocate the spinlocks off by themselves. > Then, spinning processors would not affect the processor that's updating > the LWLock; only when it finishes doing that and needs to clear the > spinlock will it have to contend with the spinners for the cache line > containing the spinlock. > > This would add an instruction or so to LWLockAcquire and LWLockRelease, > and would be of no benefit on uniprocessors, but it might be worth doing > for multiprocessors. Another patch to test ... > > I'm starting to think that we might have to succumb to having a compile > option "optimize for multiprocessor" or "optimize for single processor". > It's pretty hard to see how we'd alter a data structure decision like > this on the fly. > > regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > I'm starting to think that we might have to succumb to having a compile > option "optimize for multiprocessor" or "optimize for single processor". > It's pretty hard to see how we'd alter a data structure decision like > this on the fly. I'd really hate to see this happen. In this case I don't think the change you're proposing would have much of an impact on a uniprocessor machine. Having separate compile-time options for uniprocessor and multiprocessor would potentially open the gates to other changes which *would* have a more serious impact on one or the other when compiled for the opposite. I think this would be a serious problem for binary distributions and correspondingly their users. Happy to test these changes, btw. Also, I'm redoing my tests from the other patches with all the various combinations; I'm a bit concerned that the recompiles I did for them previously (which were just 'make', not 'make clean', 'make') didn't actually recompile everything necessary. Sorry if the results differ. Hope to have that all done this evening. Thanks, Stephen
On Tue, 13 Sep 2005, Stephen Frost wrote: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: > > I'm starting to think that we might have to succumb to having a compile > > option "optimize for multiprocessor" or "optimize for single processor". > > It's pretty hard to see how we'd alter a data structure decision like > > this on the fly. > > I'd really hate to see this happen. In this case I don't think the > change you're proposing would have much of an impact on a uniprocessor > machine. Having separate compile-time options for uniprocessor and > multiprocessor would potentially open the gates to other changes which > *would* have a more serious impact on one or the other when compiled for > the opposite. I think this would be a serious problem for binary > distributions and correspondingly their users. It does make it painful for distribution/package maintainers, but I think the potential benefits for single/multi-CPU architectures are high. It means that our lock intrinsic on uniprocessors can just be a lock/delay loop without any spinning -- which is a complete waste of time if you only have one CPU. Doing this at runtime involves some pretty significant uglification of low-level code, I think. Gavin
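As a rough sketch of the uniprocessor-only behaviour Gavin describes -- hypothetical and heavily simplified (the real s_lock() also counts spins, adjusts delays, and panics on a stuck lock) -- the waiter would simply test once and sleep instead of spinning:

/*
 * Hypothetical uniprocessor-only lock loop, for illustration.  With a single
 * CPU, spinning can never make the lock holder release any sooner, so just
 * try the TAS and give up the processor for a short sleep on failure.
 */
static void
s_lock_uniprocessor(volatile slock_t *lock)
{
	while (TAS(lock))
		pg_usleep(1000L);		/* ~1msec; older code would use a select() delay */
}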
* Gavin Sherry (swm@linuxworld.com.au) wrote: > It does make it painful for distribution/package maintainers, but I think > the potential benefits for single/multi-CPU architectures are high. It > means that our lock intrinsic on uniprocessors can just be a lock/delay > loop without any spinning -- which is a complete waste of time if you only > have one CPU. > > Doing this at runtime involves some pretty significant uglification of > low-level code, I think. I suspect distributors would go for the multi-CPU setup (especially if a uniprocessor build is *broken* for multiprocessor) and then in a lot of cases you end up not actually getting any benefit. I'm afraid you'd also end up having to tell a lot of people who complain to recompile, who will then complain back to their distributors, etc. I suppose another option would be to provide separate packages... Could this be done as a shared library so it's more 'plug-and-play' to switch between the two? I dunno, just trying to think about how to deal with this without making it suck terribly bad for me (as a Debian user and maintainer). Certainly makes me wish there were a way to do it which made the kernel handle the uniprocessor/multiprocessor question and did the locking as appropriate for each. Ah, well. Thanks, Stephen
"Min Xu (Hsu)" <xu@cs.wisc.edu> writes: > ...If this were the case, perhaps first fetch the spin lock with read-only > permission should have helped. But the cmpb instruction in the 8.0 version of TAS would have done that, and I think we've already established that the cmpb is a loss on most machines (except maybe single-physical-CPU Xeons). I suggested in my other message that it might help to grab write permission on the cache line before actually trying to acquire the spin lock --- but I don't see how getting read-only permission first will help. regards, tom lane
Stephen Frost <sfrost@snowman.net> writes: > I suspect distributors would go for the multi-cpu setup (especially if > a uniprocessor build is *broken* for multiprocessor) and then in a > lot of cases you end up not actually getting any benefit. I'm afraid > you'd also end up having to tell alot of people who complain to > recompile, who will then complain back to their distributors, etc. Yeah. Being in charge of Red Hat's packaging of PG, I feel that pain as keenly as anybody ... and I *know* RH will not be interested in shipping two different packages. If we go this way, the RH distributions will use the --optimize-multi switch, because that's where the money is. The bottom line here is that we will have to make some compromises: if we want one-size-fits-all code, it will not be optimal for every single architecture. If we don't do one-size-fits-all, then we will pay for it in various other ways. regards, tom lane
On Tue, 13 Sep 2005 Tom Lane wrote: > "Min Xu (Hsu)" <xu@cs.wisc.edu> writes: > > ...If this were the case, perhaps fetching the spin lock with read-only > > permission first should have helped. > > But the cmpb instruction in the 8.0 version of TAS would have done that, > and I think we've already established that the cmpb is a loss on most > machines (except maybe single-physical-CPU Xeons). I suggested in my > other message that it might help to grab write permission on the cache > line before actually trying to acquire the spin lock --- but I don't see > how getting read-only permission first will help. Yes, I agree. What I was trying to say was that if the second scenario you hypothesize were true, i.e. that fetching a writable line before actually trying to acquire the spin lock would cause another processor to slow down its execution inside the critical section, then fetching read-only lines would have helped. As you said, however, the experimental results show that fetching read-only lines didn't help, which led me to wonder whether the second scenario you described was really happening.
"Min Xu (Hsu)" <xu@cs.wisc.edu> writes: > ... As you said, however, experimental results > shows fetching read-only lines didn't help, which led me wonder whether the > second scenario your described was really happening. I don't know --- we haven't tried it. I do intend to work up some patches for these ideas, and then we can test them. Or, if you happen to be wide awake at the moment, feel free to take a shot at it ... I'm out of steam and won't get to it before tomorrow... regards, tom lane
On Tue, 13 Sep 2005 Tom Lane wrote: > "Min Xu (Hsu)" <xu@cs.wisc.edu> writes: > > ... As you said, however, the experimental results > > show that fetching read-only lines didn't help, which led me to wonder whether the > > second scenario you described was really happening. > > I don't know --- we haven't tried it. I do intend to work up some > patches for these ideas, and then we can test them. I see. I look forward to it. > Or, if you happen to be wide awake at the moment, feel free to take a > shot at it ... I'm out of steam and won't get to it before tomorrow... I am a newbie here. Please don't let my comments discourage you from doing anything. I have very much enjoyed following your discussion and analysis of this problem.
Tom Lane wrote: > But the cmpb instruction in the 8.0 version of TAS would have done that, > and I think we've already established that the cmpb is a loss on most > machines (except maybe single-physical-CPU Xeons). Note that this was a regular Pentium 4 system, not a Xeon. Best Regards, Michael Paesold
> -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane > Sent: 13 September 2005 23:03 > To: Marko Kreen > Cc: pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Spinlocks, yet again: analysis and > proposed patches > > I'm starting to think that we might have to succumb to having > a compile > option "optimize for multiprocessor" or "optimize for single > processor". > It's pretty hard to see how we'd alter a data structure decision like > this on the fly. That would be a major pita for packagers of binary distributions. Regards, Dave.
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: > > I'm starting to think that we might have to succumb to having a compile > > option "optimize for multiprocessor" or "optimize for single processor". > > It's pretty hard to see how we'd alter a data structure decision like > > this on the fly. > > I'd really hate to see this happen. I suspect you're thinking too hard about this. It's not like some LWLocks would need SMP spinlocks and others uniprocessor locks. And it's not like it's going to change dynamically. Wouldn't it just be a couple: if (multiprocessor) { complex version } else { simple version } Ugly to be sure but it's not going to spread to other areas of the source base and you know the if statement isn't going to be growing any more arms. It's no uglier than doing the same thing with an #ifdef anyways. If you really really want to have two versions and you think using function pointers would impose too big a performance penalty you could play linker games. But that way lies madness. -- greg
Tom Lane wrote: > "Michael Paesold" <mpaesold@gmx.at> writes: >> To have other data, I have retested the patches on a single-cpu Intel P4 >> 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2,4 >> dual-Xeon >> results it's clear that this is in reality only one cpu. While the >> runtime >> for N=1 is better than the other system, for N=4 it's already worse. The >> situation with the patches is quite different, though. Unfortunatly. > >> CVS tip from 2005-09-12: >> 1: 36s 2: 77s (cpu ~85%) 4: 159s (cpu ~98%) > >> only slock-no-cmpb: >> 1: 36s 2: 81s (cpu ~79%) 4: 177s (cpu ~94%) >> (doesn't help this time) > > Hm. This is the first configuration we've seen in which slock-no-cmpb > was a loss. Could you double-check that result? The first tests were compiled with CFLAGS='-O2 -mcpu=pentium4 -march=pentium4'. I had redone the tests with just CFLAGS='-O2' yesterday. The difference was just about a second, but the result from the patch was the same. The results for N=4 and N=8 show the positive effect more clearly. configure: CFLAGS='-O2' --enable-casserts On RHEL 4.1, gcc (GCC) 3.4.3 20050227 (Red Hat 3.4.3-22.1) CVS tip from 2005-09-12: 1: 37s 2: 78s 4: 159s 8: 324 only slock-no-cmpb: 1: 37s 2: 82s (5%) 4: 178s (12%) 8: 362 (12%) configure: --enable-casserts (Btw. I have always done "make clean ; make ; make install" between tests) Best Regards, Michael Paesold > I can't see any reasonable way to do runtime switching of the cmpb test > --- whatever logic we put in to control it would cost as much or more > than the cmpb anyway :-(. I think that has to be a compile-time choice. > From my perspective it'd be acceptable to remove the cmpb only for > x86_64, since only there does it seem to be a really significant win. > On the other hand it seems that removing the cmpb is a net win on most > x86 setups too, so maybe we should just do it and accept that there are > some cases where it's not perfect. How many test cases do we have yet? Summary of the effects without the cmpb instruction seems to be: 8-way Opteron: better Dual/HT Xeon w/o EM64T: better Dual/HT EM64T: better for N<=cpus, worse for N>cpus (Stephen's) HT P4 w/o EM64T: worse (stronger for N>cpus) Have I missed other reports that did test the slock-no-cmpb.patch alone? Two of the systems with positive effects are x86_64, one is an older high-end Intel x86 chip. The negative effect is on a low-cost Pentium 4 with only hyper threading. According to the mentions thread's title, this was an optimization for hyperthreading, not regular multi-cpus. We could have more data, especially on newer and high-end systems. Could some of you test the slock-no-cmpb.patch? You'll need an otherwise idle system to get repeatable results. http://archives.postgresql.org/pgsql-hackers/2005-09/msg00565.php http://archives.postgresql.org/pgsql-hackers/2005-09/msg00566.php I have re-attached the relevant files from Tom's posts because in the archive it's not clear anymore what should go into which file. See instructions in the first messages above. The patch applies to CVS tip with patch -p1 < slock-no-cmpb.patch Best Regards, Michael Paesold
On Tue, Sep 13, 2005 at 08:14:47PM -0400, Stephen Frost wrote: > I suppose another option would be to provide separate packages... Could > this be done as a shared library so it's more 'plug-and-play' to switch > between the two? I dunno, just trying to think about how to deal with > this without making it suck terribly bad for me (as a Debian user and > maintainer). Note that the Linux kernel has played with moving spinlock code out of line. Due to the effect of having so many, it ended up that the memory saved by moving the code out of line actually benefited overall. An unconditional call to a function can't be that expensive, surely. However, it would *have* to be in the same compile unit, no shared libraries. ELF imposes some overhead for calls in different compile units, and using function pointers won't save you (see Ulrich Drepper's paper on it). However, to make it flexible you would need a pointer to a function pointer. Fortunately these variables won't change often, so the function pointer and the function itself should be in the cache if used often enough. Finally, the kernel solves the problem by saying: if you compile for uniprocessor, optimize the code away, since a lot of the issues don't apply. My main concern is how you even detect the number of processors in a portable way... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
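On the detection question, there is no single portable call, but here is a hedged sketch of the usual approach (the helper name is made up; sysconf(_SC_NPROCESSORS_ONLN) is available on Linux, Solaris, and most BSDs, and other platforms would need their own branch):

#include <unistd.h>

/*
 * Hypothetical helper: guess the number of online processors.  Where
 * _SC_NPROCESSORS_ONLN is not available, assume a multiprocessor, since
 * treating a uniprocessor as SMP only wastes some spins before delaying.
 */
static int
guess_num_processors(void)
{
#ifdef _SC_NPROCESSORS_ONLN
	long		n = sysconf(_SC_NPROCESSORS_ONLN);

	if (n > 0)
		return (int) n;
#endif
	return 2;					/* unknown: assume multiprocessor */
}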
Tom, et al., Updated, with full recompiles between everything and the new modification: N, runtime: Tip: 1 31s 2 37s 4 86s 8 159s no-cmpb: 1 32s 2 43s 4 83s 8 168s spin: 1 32s 2 51s 4 84s 8 160s spin+mod: 1 32s 2 51s 4 89s 8 158s spin+no-cmpb: 1 32s 2 51s 4 87s 8 163s spin+mod+no-cmpb: 1 32s 2 50s 4 86s 8 161s Unfortunately, the results don't seem to be terribly consistent between runs anyway: Run 2: Tip: 1 32s 2 43s 4 87s 8 160s no-cmpb: 1 31s 2 47s 4 83s 8 167s spin: 1 32s 2 52s 4 88s 8 154s spin+no-cmpb: 1 32s 2 51s 4 102s 8 166s spin+mod: 1 32s 2 53s 4 85s 8 154s spin+mod+no-cmpb: 1 32s 2 51s 4 91s 8 161s Hope it helps, Stephen
I wrote: > We could ameliorate this if there were a way to acquire ownership of the > cache line without necessarily winning the spinlock. I'm imagining > that we insert a "dummy" locked instruction just ahead of the xchgb, > which touches the spinlock in such a way as to not change its state. I tried this, using this tas code:

static __inline__ int
tas(volatile slock_t *lock)
{
	register slock_t _res = 1;
	register slock_t _dummy = 0;

	/* Use a locking test before trying to take the spinlock */
	/* xchg implies a LOCK prefix, so no need to say LOCK for it */
	__asm__ __volatile__(
		"	lock			\n"
		"	xaddb	%2,%1	\n"
		"	xchgb	%0,%1	\n"
		: "+q"(_res), "+m"(*lock), "+q"(_dummy)
		:
		: "memory", "cc");
	return (int) _res;
}

At least on Opteron, it's a loser. The previous best results (with slock-no-cmpb and spin-delay patches) were

1 31s	2 42s	4 51s	8 100s

and with this instead of slock-no-cmpb,

1 33s	2 45s	4 55s	8 104s

The xadd may indeed be helping in terms of protecting the xchg from end-of-timeslice --- the rate of select() delays is really tiny, one every few seconds, which is better than I saw before. But the extra cost of the extra locked operation isn't getting repaid overall. Feel free to try it on other hardware, but it doesn't look promising. BTW, I also determined that on that 4-way Opteron box, the integer modulo idea doesn't make any difference --- that is, spin-delay and what Michael called spin-delay-2 are the same speed. I think I had tried the modulo before adding the variable spin delay, and it did help in that configuration; but most likely, it was just helping by stretching out the amount of time spent looping before entering the kernel. So we can drop that idea too. regards, tom lane
I wrote: > Another thought came to mind: maybe the current data layout for LWLocks > is bad. Right now, the spinlock that protects each LWLock data struct > is itself part of the struct, and since the structs aren't large (circa > 20 bytes), the whole thing is usually all in the same cache line. ... > Maybe it'd be better to allocate the spinlocks off by themselves. Well, this is odd. I made up a patch to do this (attached) and found that it pretty much sucks. Still the 4-way Opteron, previous best (slock-no-cmpb and spin-delay-2): 1 31s 2 42s 4 51s 8 100s with lwlock-separate added: 1 31s 2 52s 4 106s 8 213s What I had expected to see was a penalty in the single-thread case (due to instructions added to the LWLock functions), but there isn't any. I did not expect to see a factor-of-2 penalty for four threads. I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad? Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere? Unfortunately I don't have root on the Opteron and can't run oprofile. But I'd really like to see some oprofile stats from these two cases so we can figure out what in the world is going on here. Can anyone help? regards, tom lane *** src/backend/storage/lmgr/lwlock.c.orig Sat Aug 20 19:26:24 2005 --- src/backend/storage/lmgr/lwlock.c Wed Sep 14 11:50:09 2005 *************** *** 27,35 **** #include "storage/spin.h" typedef struct LWLock { ! slock_t mutex; /* Protects LWLock and queue of PGPROCs */ bool releaseOK; /* T if ok to release waiters */ char exclusive; /* # of exclusive holders (0 or 1) */ int shared; /* # of shared holders (0..MaxBackends) */ --- 27,44 ---- #include "storage/spin.h" + /* + * Each LWLock has an associated spinlock, which we consider as locking the + * content of the LWLock struct as well as the associated queue of waiting + * PGPROCs. To minimize cache-line contention on multiprocessors, the + * spinlocks are kept physically separate from their LWLocks --- we don't + * want a spinning processor to interfere with access to the LWLock for + * the processor holding the spinlock. + */ + typedef struct LWLock { ! slock_t *mutex; /* Protects LWLock and queue of PGPROCs */ bool releaseOK; /* T if ok to release waiters */ char exclusive; /* # of exclusive holders (0 or 1) */ int shared; /* # of shared holders (0..MaxBackends) */ *************** *** 39,44 **** --- 48,65 ---- } LWLock; /* + * We allocate CACHE_LINE_SIZE bytes for the fixed LWLocks' spinlocks. + * We assume that the fixed locks are few enough and heavily used enough + * that it's worth trying to put each one's spinlock in its own cache line. + * The dynamic locks are assumed to be many and less heavily hit, so they + * get just MAXALIGN space apiece. (Packing spinlocks tightly sounds like a + * bad idea on machines where slock_t is just a byte ... even for lesser-used + * spinlocks, there would be enough locks per cache line to create a + * contention issue.) + */ + #define CACHE_LINE_SIZE 64 + + /* * This points to the array of LWLocks in shared memory. 
Backends inherit * the pointer by fork from the postmaster. LWLockIds are indexes into * the array. *************** *** 135,144 **** Size size; int numLocks = NumLWLocks(); ! /* Allocate the LWLocks plus space for shared allocation counter. */ size = mul_size(numLocks, sizeof(LWLock)); ! size = add_size(size, 2 * sizeof(int)); return size; } --- 156,174 ---- Size size; int numLocks = NumLWLocks(); ! /* Space for the array of LWLock structs. */ size = mul_size(numLocks, sizeof(LWLock)); ! /* Space for dynamic allocation counter and MAXALIGN padding. */ ! size = add_size(size, 2 * sizeof(int) + MAXIMUM_ALIGNOF); ! ! /* Space for the spinlocks --- see notes for CACHE_LINE_SIZE. */ ! size = add_size(size, ! mul_size((int) NumFixedLWLocks, ! Max(sizeof(slock_t), CACHE_LINE_SIZE))); ! size = add_size(size, ! mul_size(numLocks - (int) NumFixedLWLocks, ! Max(sizeof(slock_t), MAXIMUM_ALIGNOF))); return size; } *************** *** 153,182 **** int numLocks = NumLWLocks(); Size spaceLocks = LWLockShmemSize(); LWLock *lock; int id; /* Allocate space */ LWLockArray = (LWLock *) ShmemAlloc(spaceLocks); /* * Initialize all LWLocks to "unlocked" state */ for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++) { ! SpinLockInit(&lock->mutex); lock->releaseOK = true; lock->exclusive = 0; lock->shared = 0; lock->head = NULL; lock->tail = NULL; - } ! /* ! * Initialize the dynamic-allocation counter at the end of the array ! */ ! LWLockCounter = (int *) lock; ! LWLockCounter[0] = (int) NumFixedLWLocks; ! LWLockCounter[1] = numLocks; } --- 183,229 ---- int numLocks = NumLWLocks(); Size spaceLocks = LWLockShmemSize(); LWLock *lock; + char *mutexptr; int id; /* Allocate space */ LWLockArray = (LWLock *) ShmemAlloc(spaceLocks); /* + * Set up dynamic-allocation counter at the end of the array of structs. + * (Since struct LWLock contains ints, we should not need any extra + * alignment padding here.) + */ + LWLockCounter = (int *) (LWLockArray + numLocks); + LWLockCounter[0] = (int) NumFixedLWLocks; + LWLockCounter[1] = numLocks; + + /* + * We place the spinlocks after the counter, maxaligning for safety. + */ + mutexptr = (char *) (LWLockCounter + 2); + mutexptr = (char *) MAXALIGN(mutexptr); + + /* * Initialize all LWLocks to "unlocked" state */ for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++) { ! slock_t *mutex = (slock_t *) mutexptr; ! ! SpinLockInit(mutex); ! lock->mutex = mutex; lock->releaseOK = true; lock->exclusive = 0; lock->shared = 0; lock->head = NULL; lock->tail = NULL; ! if (id < (int) NumFixedLWLocks) ! mutexptr += Max(sizeof(slock_t), CACHE_LINE_SIZE); ! else ! mutexptr += Max(sizeof(slock_t), MAXIMUM_ALIGNOF); ! } } *************** *** 207,212 **** --- 254,260 ---- LWLockAcquire(LWLockId lockid, LWLockMode mode) { volatile LWLock *lock = LWLockArray + lockid; + volatile slock_t *mutex = lock->mutex; PGPROC *proc = MyProc; bool retry = false; int extraWaits = 0; *************** *** 253,259 **** bool mustwait; /* Acquire mutex. Time spent holding mutex should be short! */ ! SpinLockAcquire_NoHoldoff(&lock->mutex); /* If retrying, allow LWLockRelease to release waiters again */ if (retry) --- 301,307 ---- bool mustwait; /* Acquire mutex. Time spent holding mutex should be short! */ ! SpinLockAcquire_NoHoldoff(mutex); /* If retrying, allow LWLockRelease to release waiters again */ if (retry) *************** *** 304,310 **** lock->tail = proc; /* Can release the mutex now */ ! SpinLockRelease_NoHoldoff(&lock->mutex); /* * Wait until awakened. 
--- 352,358 ---- lock->tail = proc; /* Can release the mutex now */ ! SpinLockRelease_NoHoldoff(mutex); /* * Wait until awakened. *************** *** 336,342 **** } /* We are done updating shared state of the lock itself. */ ! SpinLockRelease_NoHoldoff(&lock->mutex); /* Add lock to list of locks held by this backend */ held_lwlocks[num_held_lwlocks++] = lockid; --- 384,390 ---- } /* We are done updating shared state of the lock itself. */ ! SpinLockRelease_NoHoldoff(mutex); /* Add lock to list of locks held by this backend */ held_lwlocks[num_held_lwlocks++] = lockid; *************** *** 359,364 **** --- 407,413 ---- LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode) { volatile LWLock *lock = LWLockArray + lockid; + volatile slock_t *mutex = lock->mutex; bool mustwait; PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, lock); *************** *** 375,381 **** HOLD_INTERRUPTS(); /* Acquire mutex. Time spent holding mutex should be short! */ ! SpinLockAcquire_NoHoldoff(&lock->mutex); /* If I can get the lock, do so quickly. */ if (mode == LW_EXCLUSIVE) --- 424,430 ---- HOLD_INTERRUPTS(); /* Acquire mutex. Time spent holding mutex should be short! */ ! SpinLockAcquire_NoHoldoff(mutex); /* If I can get the lock, do so quickly. */ if (mode == LW_EXCLUSIVE) *************** *** 400,406 **** } /* We are done updating shared state of the lock itself. */ ! SpinLockRelease_NoHoldoff(&lock->mutex); if (mustwait) { --- 449,455 ---- } /* We are done updating shared state of the lock itself. */ ! SpinLockRelease_NoHoldoff(mutex); if (mustwait) { *************** *** 424,429 **** --- 473,479 ---- LWLockRelease(LWLockId lockid) { volatile LWLock *lock = LWLockArray + lockid; + volatile slock_t *mutex = lock->mutex; PGPROC *head; PGPROC *proc; int i; *************** *** 446,452 **** held_lwlocks[i] = held_lwlocks[i + 1]; /* Acquire mutex. Time spent holding mutex should be short! */ ! SpinLockAcquire_NoHoldoff(&lock->mutex); /* Release my hold on lock */ if (lock->exclusive > 0) --- 496,502 ---- held_lwlocks[i] = held_lwlocks[i + 1]; /* Acquire mutex. Time spent holding mutex should be short! */ ! SpinLockAcquire_NoHoldoff(mutex); /* Release my hold on lock */ if (lock->exclusive > 0) *************** *** 494,500 **** } /* We are done updating shared state of the lock itself. */ ! SpinLockRelease_NoHoldoff(&lock->mutex); /* * Awaken any waiters I removed from the queue. --- 544,550 ---- } /* We are done updating shared state of the lock itself. */ ! SpinLockRelease_NoHoldoff(mutex); /* * Awaken any waiters I removed from the queue.
Tom Lane wrote: >I wrote: >> We could ameliorate this if there were a way to acquire ownership of the >> cache line without necessarily winning the spinlock. I'm imagining >> that we insert a "dummy" locked instruction just ahead of the xchgb, >> which touches the spinlock in such a way as to not change its state. > > I tried this, using this tas code: ... I have tried it on the P4 w/ HT. CVS tip 1: 37s 2: 78s 4: 159s 8: 324 only slock-no-cmpb 1: 37s 2: 82s 4: 178s 8: 362 only dummy-locking 1: 44s 2: 96s 4: 197s ... I guess there is no reason to try any more. Best Regards, Michael Paesold
Tom Lane wrote: > I guess what this means is that there's no real problem with losing the > cache line while manipulating the LWLock, which is what the patch was > intended to prevent. Instead, we're paying for swapping two cache lines > (the spinlock and the LWLock) across processors instead of just one line. > But that should at worst be a 2x inflation of the time previously spent > in LWLockAcquire/Release, which is surely not yet all of the application > ;-). Why the heck is this so bad? Should we expect that apparently > minor changes in shared data structures might be costing equivalently > huge penalties in SMP performance elsewhere? > > Unfortunately I don't have root on the Opteron and can't run oprofile. > But I'd really like to see some oprofile stats from these two cases > so we can figure out what in the world is going on here. Can anyone > help? I will try the patch here and see if it gives the same result. If so I could try to run with oprofile if you can give me a quick start. Best Regards, Michael Paesold
I wrote: > I guess what this means is that there's no real problem with losing the > cache line while manipulating the LWLock, which is what the patch was > intended to prevent. Instead, we're paying for swapping two cache lines > (the spinlock and the LWLock) across processors instead of just one line. > But that should at worst be a 2x inflation of the time previously spent > in LWLockAcquire/Release, which is surely not yet all of the application > ;-). Why the heck is this so bad? Should we expect that apparently > minor changes in shared data structures might be costing equivalently > huge penalties in SMP performance elsewhere? I did some oprofile work and found that the cost seems to be because (1) there's an increase in spinlock contention (more time spent in s_lock), and (2) there's more time spent in LWLockAcquire/Release. I'm not entirely clear about the reason for (1), but (2) apparently is because of the extra cache line swapping, as posited above. In the oprofile trace it's clear that the extra cost comes exactly in the statements that touch fields of the shared LWLock, which are the places where you might have to wait to acquire a cache line. I thought for a bit that the problem might come from having chosen to put a pointer to the spinlock into each LWLock; fetching that pointer implies an additional access to the contended cache line. However, changing the data structure to a simple linear array of spinlocks didn't help. So that whole idea seems to have died a painful death. One other interesting result is that the data structure change neither helped nor hurt on an EM64T machine with 2 physical (4 logical) processors. This is also x86_64, but evidently the caching behavior is totally different from Opterons. One thing that did seem to help a little bit was padding the LWLocks to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring the array starts on a 32-byte boundary. This ensures that we won't have any LWLocks crossing cache lines --- contended access to such an LWLock would probably incur the sort of large penalty seen above, because you'd be trading two cache lines back and forth not one. It seems that the important locks are not split that way in CVS tip, because the gain wasn't much, but I wonder whether some effect like this might explain some of the unexplainable performance changes we've noticed in the past (eg, in the dbt2 results). A seemingly unrelated small change in the size of other data structures in shared memory might move things around enough to make a performance-critical lock cross a cache line boundary. On regular x86, the LWLock struct is by chance exactly 16 bytes, so there's no alignment issue. But 64-bit platforms and platforms where slock_t is bigger than char are exposed to this risk. I'm going to go ahead and make that change, since it doesn't seem likely to have any downside. It might be interesting to look into forcing proper alignment of the shared buffer headers as well. regards, tom lane
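A minimal sketch of the padding technique being described, assuming a 32-byte target on x86_64 -- the names here (LWLockPadded, LWLOCK_PADDED_SIZE, create_lwlock_array) are invented for illustration and this is not the committed code:

/*
 * Sketch: pad each LWLock to a power-of-two size at least as large as the
 * struct, and round the start of the array up to that same boundary, so
 * that no single lock ever straddles a cache-line boundary.
 */
#define LWLOCK_PADDED_SIZE	32	/* must be a power of 2 and >= sizeof(LWLock) */

typedef union LWLockPadded
{
	LWLock		lock;
	char		pad[LWLOCK_PADDED_SIZE];
} LWLockPadded;

static LWLockPadded *LWLockArray;

static void
create_lwlock_array(int numLocks)
{
	char	   *ptr;

	/* over-allocate so the first element can be aligned to the padded size */
	ptr = (char *) ShmemAlloc(numLocks * sizeof(LWLockPadded) + LWLOCK_PADDED_SIZE);
	LWLockArray = (LWLockPadded *) TYPEALIGN(LWLOCK_PADDED_SIZE, ptr);
}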
On Thu, 15 Sep 2005, Tom Lane wrote: > One thing that did seem to help a little bit was padding the LWLocks > to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring > the array starts on a 32-byte boundary. This ensures that we won't have > any LWLocks crossing cache lines --- contended access to such an LWLock > would probably incur the sort of large penalty seen above, because you'd > be trading two cache lines back and forth not one. It seems that the > important locks are not split that way in CVS tip, because the gain > wasn't much, but I wonder whether some effect like this might explain > some of the unexplainable performance changes we've noticed in the past > (eg, in the dbt2 results). A seemingly unrelated small change in the > size of other data structures in shared memory might move things around > enough to make a performance-critical lock cross a cache line boundary. What about padding the LWLock to 64 bytes on these architectures. Both P4 and Opteron have 64 byte cache lines, IIRC. This would ensure that a cacheline doesn't hold two LWLocks. Gavin
> Tom Lane wrote > I'm going to go ahead and make that change, since it doesn't > seem likely > to have any downside. It might be interesting to look into forcing > proper alignment of the shared buffer headers as well. Just catching up on your mails - all of that sounds good so far. Everything mentioned so far talks about spinlocks in the general sense, rather than with respect to particular locks. The different lock types we have are held for different amounts of time and are also subject to differing amounts of contention. I think it would be useful to put in a capability to tune each class of lock according to the likelihood of delay and contention on it.

Long delay, low contention, e.g. CheckpointLock => waiter should sleep immediately
Medium delay, high contention, e.g. WALWriteLock => waiter spins if at head of queue, else sleeps immediately
Short delay, high contention, e.g. BufMappingLock => waiter spins and retries forever, 'cos the lock is coming soon
Short delay, low contention, sometimes long waits at end of VACUUM, e.g. FreeSpaceLock => waiter spins, then eventually sleeps

I'm not sure whether you'll agree with my characterisation of these locks, but the main thing I'm trying to get across is that one-size-fits-all isn't the optimal approach. I'm not saying either that we end up with individual characterisations for each lock type, but a few key ones could be catered for differently. Best Regards, Simon Riggs
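Purely as an illustration of that classification (none of these names exist in the source tree), the idea could be expressed as a per-lock wait policy:

/*
 * Hypothetical sketch of the lock classification above; illustration only.
 */
typedef enum LWLockWaitPolicy
{
	WAIT_SLEEP_IMMEDIATELY,		/* long hold, low contention: CheckpointLock */
	WAIT_SPIN_IF_HEAD_OF_QUEUE,	/* medium hold, high contention: WALWriteLock */
	WAIT_SPIN_AND_RETRY,		/* short hold, high contention: BufMappingLock */
	WAIT_SPIN_THEN_SLEEP		/* short hold, low contention: FreeSpaceLock */
} LWLockWaitPolicy;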
Gavin Sherry <swm@linuxworld.com.au> writes: > What about padding the LWLock to 64 bytes on these architectures. Both P4 > and Opteron have 64 byte cache lines, IIRC. This would ensure that a > cacheline doesn't hold two LWLocks. I tried that first, actually, but it was a net loss. I guess enlarging the array that much wastes too much cache space. regards, tom lane
On Thu, 15 Sep 2005, Tom Lane wrote: > Gavin Sherry <swm@linuxworld.com.au> writes: > > What about padding the LWLock to 64 bytes on these architectures. Both P4 > > and Opteron have 64 byte cache lines, IIRC. This would ensure that a > > cacheline doesn't hold two LWLocks. > > I tried that first, actually, but it was a net loss. I guess enlarging > the array that much wastes too much cache space. Interesting. On Xeon (2 phys, 4 log), with LWLock padded to 64 bytes and the cmpb/jump removed I get: [swm@backup pgsqlpad]$ for i in 1 2 4; do time ./nrun.sh $i; done real 0m54.362s user 0m0.003s sys 0m0.009s real 1m9.788s user 0m0.011s sys 0m0.013s real 2m8.870s user 0m0.016s sys 0m0.028s [swm@backup pgsqlpad]$ for i in 1 2 4; do time ./nrun.sh $i; done real 0m55.544s user 0m0.006s sys 0m0.007s real 1m9.313s user 0m0.007s sys 0m0.018s real 2m1.769s user 0m0.017s sys 0m0.027s This compares to the following, which is unpadded but has cmpb/jump removed but is otherwise vanilla: 1: 55: 2: 111: 4: 207 The decrease is small, but it's there. Gavin
Gregory Maxwell <gmaxwell@gmail.com> writes: > might be useful to align the structure so it always crosses two lines > and measure the performance difference.. the delta could be basically > attributed to the cache line bouncing since even one additional bounce > would overwhelm the other performance effects from the changed > alignment. Good idea. I goosed the struct declaration and setup code to arrange that the BufMappingLock's spinlock and the rest of its data were in different cache lines instead of the same one. The results (still on Red Hat's 4-way Opteron): previous best code (slock-no-cmpb and spin-delay-2): 1 31s 2 42s 4 51s 8 100s with LWLock padded to 32 bytes and correctly aligned: 1 31s 2 41s 4 51s 8 97s with LWLocks 32 bytes, but deliberately misaligned: 1 30s 2 50s 4 102s 8 200s There is no other reason than having to touch multiple cache lines for the second and third cases to be different: the array indexing code should be exactly the same. These last numbers are pretty close to what I got from the separated-spinlock patch: 1 31s 2 52s 4 106s 8 213s So it seems there's no doubt that it's the doubled cache traffic that was causing most of the problem there. regards, tom lane
Gavin Sherry <swm@linuxworld.com.au> writes: > Interesting. On Xeon (2 phys, 4 log), with LWLock padded to 64 bytes and > the cmpb/jump removed I get: > [ 1 55s 2 69s 4 128s ] > This compares to the following, which is unpadded but has cmpb/jump > removed but is otherwise vanilla: > 1: 55: 2: 111: 4: 207 Hmm, that's pretty significant. I tried it on a Xeon EM64T (thanks to Stephen Frost for providing access), also 2 phys 4 log, and get:

Yesterday's CVS tip:				1 32s	2 46s	4 88s	8 168s
plus no-cmpb and spindelay2:			1 32s	2 48s	4 100s	8 177s
plus just-committed code to pad LWLock to 32:	1 33s	2 50s	4 98s	8 179s
alter to pad to 64:				1 33s	2 38s	4 108s	8 180s

I don't know what to make of the 2-process time going down while 4-process goes up; that seems just weird. But both numbers are repeatable. regards, tom lane
On 9/15/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yesterday's CVS tip: > 1 32s 2 46s 4 88s 8 168s > plus no-cmpb and spindelay2: > 1 32s 2 48s 4 100s 8 177s > plus just-committed code to pad LWLock to 32: > 1 33s 2 50s 4 98s 8 179s > alter to pad to 64: > 1 33s 2 38s 4 108s 8 180s > > I don't know what to make of the 2-process time going down while > 4-process goes up; that seems just weird. But both numbers are > repeatable. It is odd. In the two-process case there is, assuming random behavior, a 1/2 chance that you've already got the right line, but in the 4-process case only a 1/4 chance (since we're on a 4-way box). This would explain why we don't see as much cost in the intentionally misaligned case. You'd expect a similar pattern of improvement with the 64-byte alignment (some in the two-process case, but more in the 4-process case), but here we see more improvement in the two-way case. If I had to guess I might say that the 64-byte alignment is removing much of the unneeded line bouncing in the two-process case but is at the same time creating more risk of bouncing caused by aliasing. Since two processes have a 1/2 chance, the aliasing isn't a problem, so the change is a win; but in the four-process case it's no longer a win, because with aliasing there is still a lot of fighting over the cache lines even if you pack well, and the decrease in packing makes odd aliasing somewhat more likely. This might also explain why the misaligned case performed so poorly in the 4-process case, since the misalignment didn't just increase the cost 2x, it also increased the likelihood of a bogus bounce due to aliasing. If this is the case, then it may be possible through very careful memory alignment to make sure that no two high-contention locks that are likely to be contended at once share the same line (through either aliasing or through being directly within the same line). Then again I could be completely wrong; my understanding of multiprocessor cache coherency is very limited, and I have no clue how cache aliasing fits into it... So the above is just uninformed conjecture.
Gregory Maxwell <gmaxwell@gmail.com> writes: > If I had to guess I might say that the 64byte alignment is removing > much of the unneeded line bouncing in the the two process case but is > at the same time creating more risk of bouncing caused by aliasing. It's an idea ... not sure if it's right or not. One thing I have noticed both on Xeon HT (2-physical-4-logical) and on Opteron (4 real processors) is that the 2-process times are significantly more variable than the 1, 4, or 8-process times: repeating the measurements shows variance around the 10% level for 2 processes but only around 1% for the others. What I think this means is that the kernel is scheduling the 2 processes onto 2 processors chosen-at-random, without awareness of whether those two processors are on the same chip (in the Xeon case) or have closer NUMA affinity (in the Opteron case). The other test cases are all symmetric and there's no way for the kernel to blow it too badly, but in the 2-process case it will be very visible whether the kernel understands the hardware or not. And the impression I have (which might be out of date) is that current Linux kernel releases are not very bright about these things. How does that tie into the point at hand about different results for 64-byte vs 32-byte LWLocks? Not sure, but I feel it's related somehow. Anyway, it'd be interesting to get some test results for these things on SMP machines running non-Linux kernels. Also, we ought to be more rigorous about reporting exactly which kernel we are using for any particular test. regards, tom lane
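One way to take the scheduler's placement decision out of the 2-process measurements would be to pin each test backend to a chosen processor. A Linux-only sketch, strictly test-harness material (the helper name is invented, and sched_setaffinity is not portable):

#define _GNU_SOURCE
#include <sched.h>

/*
 * Test-harness sketch: bind the calling process to one CPU, so the kernel
 * cannot place the two test processes on HT siblings or on distant NUMA
 * nodes at random.  Linux-specific and for experimentation only.
 */
static int
pin_to_cpu(int cpu)
{
	cpu_set_t	mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	return sched_setaffinity(0, sizeof(mask), &mask);	/* pid 0 = this process */
}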
I thought I'd throw SPARC into the equation (SPARC IIIi, in a dual SunFire v250): vanilla HEAD from ~1 week ago: bash-3.00$ for i in 1 2 4; do time ./nrun.sh $i; done real 1m49.037s user 0m0.008s sys 0m0.016s real 2m3.482s user 0m0.014s sys 0m0.026s real 3m54.215s user 0m0.028s sys 0m0.046s With the spin-delay patch: real 1m50.791s user 0m0.008s sys 0m0.016s real 2m0.553s user 0m0.015s sys 0m0.026s real 3m58.185s user 0m0.027s sys 0m0.048s Padding the LWLock to 64 bytes: bash-3.00$ for i in 1 2 4; do time ./nrun.sh $i; done real 1m49.594s user 0m0.008s sys 0m0.016s real 1m59.141s user 0m0.015s sys 0m0.026s real 3m56.236s user 0m0.027s sys 0m0.047s Padded to 128 bytes (I'm not sure about the cache line size): real 1m50.498s user 0m0.008s sys 0m0.016s real 2m0.169s user 0m0.015s sys 0m0.026s real 3m56.891s user 0m0.027s sys 0m0.048s ---- These details seem suspicious to me when compared to those we're seeing on x86 platforms. The reason is that padding the LWLock out to 64 or 128 bytes makes no difference (maybe the cache line is bigger?). The good thing is, the difference between 1 and 2 instances of the test is small -- precisely what we want. Thanks, Gavin
Tom, > What I think this means is that the kernel is scheduling the 2 processes > onto 2 processors chosen-at-random, without awareness of whether those > two processors are on the same chip (in the Xeon case) or have closer > NUMA affinity (in the Opteron case). That would be consistent with my experience with HT, and the reason why many software vendors recommend disabling it. Not sure about NUMA. -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > Tom, > > > What I think this means is that the kernel is scheduling the 2 processes > > onto 2 processors chosen-at-random, without awareness of whether those > > two processors are on the same chip (in the Xeon case) or have closer > > NUMA affinity (in the Opteron case). > > That would be consistent with my experience with HT, and the reason why many > software vendors recommend disabling it. Not sure about NUMA. What version of linux was this test with? What you describe was certainly well known for Linux 2.4 and earlier. It was well on its way to becoming established dogma cargo-cult style. However I was under the impression that 2.6 had moved beyond that problem. It would be very interesting to know if 2.6 still suffers from this. Also, those who recommend disabling HT on this basis should realize it didn't penalize the >2 process case the same way. You may be getting a benefit only in the cases where your system isn't heavily loaded. Ie, only when it doesn't really matter. -- greg
* Greg Stark (gsstark@mit.edu) wrote: > However I was under the impression that 2.6 had moved beyond that problem. > It would be very interesting to know if 2.6 still suffers from this. The tests on the em64t at my place were using 2.6.12. I had thought 2.6 was better about this too, but I don't have another explanation for it. CONFIG_SMT was enabled, etc.. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Greg Stark (gsstark@mit.edu) wrote: >> However I was under the impression that 2.6 had moved beyond that problem. >> It would be very interesting to know if 2.6 still suffers from this. > The tests on the em64t at my place were using 2.6.12. I had thought 2.6 > was better about this too, but I don't have another explanation for it. The 4-way Opteron I've been using at Red Hat is running 2.6.12-1.1398_FC4smp (Fedora Core 4 obviously). Red Hat in particular has been working hard in this area, and I thought that their recent kernels included NUMA fixes that weren't yet accepted upstream (at least not in the stable kernel branches). But it seems there's still a ways to go yet. It'd be real interesting to see comparable numbers from some non-Linux kernels, particularly commercial systems like Solaris. regards, tom lane
On Sat, 17 Sep 2005, Tom Lane wrote: > Stephen Frost <sfrost@snowman.net> writes: > > * Greg Stark (gsstark@mit.edu) wrote: > >> However I was under the impression that 2.6 had moved beyond that problem. > >> It would be very interesting to know if 2.6 still suffers from this. > > > The tests on the em64t at my place were using 2.6.12. I had thought 2.6 > > was better about this too, but I don't have another explanation for it. > > The 4-way Opteron I've been using at Red Hat is running > 2.6.12-1.1398_FC4smp (Fedora Core 4 obviously). Red Hat in particular > has been working hard in this area, and I thought that their recent > kernels included NUMA fixes that weren't yet accepted upstream (at least > not in the stable kernel branches). But it seems there's still a ways > to go yet. > > It'd be real interesting to see comparable numbers from some non-Linux > kernels, particularly commercial systems like Solaris. Did you see the Solaris results I posted? Gavin
Gavin Sherry <swm@linuxworld.com.au> writes: > On Sat, 17 Sep 2005, Tom Lane wrote: >> It'd be real interesting to see comparable numbers from some non-Linux >> kernels, particularly commercial systems like Solaris. > Did you see the Solaris results I posted? Are you speaking of http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php ? That doesn't seem directly relevant to the point, because it's for a 2-CPU machine; so there's no way to run a test case that uses more than one but less than all the processors. In either the "one" or "all" cases, performance ought to be pretty stable regardless of whether the kernel understands about any processor asymmetries that may exist in the hardware. Not to mention that I don't know of any asymmetries in a dual SPARC anyway. We really need to test this on comparable hardware, which I guess means we need Solaris/x86 on something with hyperthreading or known NUMA asymmetry. regards, tom lane
On Sat, Sep 17, 2005 at 01:40:28AM -0400, Tom Lane wrote: > Gavin Sherry <swm@linuxworld.com.au> writes: > > On Sat, 17 Sep 2005, Tom Lane wrote: > >> It'd be real interesting to see comparable numbers from some non-Linux > >> kernels, particularly commercial systems like Solaris. > > > Did you see the Solaris results I posted? > > Are you speaking of > http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php > ? > > That doesn't seem directly relevant to the point, because it's for a > 2-CPU machine; so there's no way to run a test case that uses more than > one but less than all the processors. In either the "one" or "all" > cases, performance ought to be pretty stable regardless of whether the > kernel understands about any processor asymmetries that may exist in > the hardware. Not to mention that I don't know of any asymmetries in > a dual SPARC anyway. We really need to test this on comparable > hardware, which I guess means we need Solaris/x86 on something with > hyperthreading or known NUMA asymmetry. I have access to a 4-way Opteron 852 running Solaris 10. What patches would you like me to test? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Sun, 18 Sep 2005, Jim C. Nasby wrote: > On Sat, Sep 17, 2005 at 01:40:28AM -0400, Tom Lane wrote: > > Gavin Sherry <swm@linuxworld.com.au> writes: > > > On Sat, 17 Sep 2005, Tom Lane wrote: > > >> It'd be real interesting to see comparable numbers from some non-Linux > > >> kernels, particularly commercial systems like Solaris. > > > > > Did you see the Solaris results I posted? > > > > Are you speaking of > > http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php > > ? > > > > That doesn't seem directly relevant to the point, because it's for a > > 2-CPU machine; so there's no way to run a test case that uses more than > > one but less than all the processors. In either the "one" or "all" > > cases, performance ought to be pretty stable regardless of whether the > > kernel understands about any processor asymmetries that may exist in > > the hardware. Not to mention that I don't know of any asymmetries in > > a dual SPARC anyway. We really need to test this on comparable > > hardware, which I guess means we need Solaris/x86 on something with > > hyperthreading or known NUMA asymmetry. > > I have access to a 4-way Opteron 852 running Solaris 10. What patches > would you like me to test? These ones here: http://archives.postgresql.org/pgsql-hackers/2005-09/msg00566.php Gavin
I have removed this TODO item: * Research use of sched_yield() for spinlock acquisition failure --------------------------------------------------------------------------- Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > Marko Kreen <marko@l-t.ee> writes: > > (I speculate that it's set up to only yield the processor to other > > processes already affiliated to that processor. In any case, it > > is definitely capable of getting through 10000 yields without > > running the guy who's holding the spinlock.) > > > Maybe it should try sched_yield once and then use select after that? > > I tried that, actually, but it didn't seem to offer any particular > benefit. (Maybe it would have helped more on older Linux kernels before > they changed sched_yield?) > > I'm feeling even more disenchanted with sched_yield now that Marko > pointed out that the behavior was changed recently. Here we have a > construct that is not portable cross-platform, does not act as > documented in its man page, and the kernel guys feel free to whack its > behavior around in major ways without documenting that either. It seems > to be a crap-shoot whether it will be useful on any particular machine > or not. At least with the select() code we can be reasonably confident > we know what will happen. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
> The fact that cmpb isn't helping proves that getting the cache line in a > read-only fashion does *not* do enough to protect the xchgb in this way. > But maybe another locking instruction would. Comments? > > regards, tom lane Hi, Tom! Perhaps this link will help: 'Replay: Unknown Features of the NetBurst Core' http://www.xbitlabs.com/articles/cpu/display/replay.html Best regards, Alexander Kirpa
Is there a TODO here? --------------------------------------------------------------------------- Tom Lane wrote: > The test case I just posted shows that our spinlock code, which we had > thought largely done, is once again becoming a performance bottleneck. > It's time to resurrect some of the ideas we kicked around in early > 2002, and didn't pursue because we decided spinlocks weren't our major > performance problem. > > I started investigating this by trying to understand why the "futex > spinlock" patch had helped the Red Hat customer I mentioned, when it > hadn't done anything much in last year's experiments. It turns out > that this customer was using an 8-way Opteron, and the futex patch > helps a lot more on that architecture than others. Some random > experimentation eventually turned up part of the reason: the x86 TAS > assembly code we are using is pessimal for x86_64. Specifically, > s_lock.h uses this for both architectures: > > /* Use a non-locking test before asserting the bus lock */ > __asm__ __volatile__( > " cmpb $0,%1 \n" > " jne 1f \n" > " lock \n" > " xchgb %0,%1 \n" > "1: \n" > > and it turns out that deleting the cmpb and jne makes things go > significantly faster on Opterons. So the "non-locking test" is > not a win at all on x86_64. As best I can tell from testing, > removing it wins because it causes the rate of spinlock delays > to drop significantly. Why this should be is not completely > obvious. I speculate that the Opterons are capable of pulling > in a copy of the cache line that contains the spinlock and then > running 100 iterations of the cmpb test without ever noticing > that the processor holding the lock has now released it. Without > the cmpb, they are forced by the locking xchgb test to actually > look at the real state of the spinlock each time. > > I kinda suspect that the cmpb test is a no-op or loss on all > Intelish processors: it can only be a win if you expect a lot > of contention for the spin lock, but in percentage terms we still > have a very low conflict rate, so in most executions of the TAS > macro, the cmpb is just wasted cycles. Bottom line: we definitely > don't want it for x86_64, and maybe not at all, but we need more > research to decide the latter. > > The second reason that the futex patch is helping is that when > a spinlock delay does occur, it allows the delaying process to be > awoken almost immediately, rather than delaying 10 msec or more > as the existing code does. However, given that we are only expecting > the spinlock to be held for a couple dozen instructions, using the > kernel futex mechanism is huge overkill --- the in-kernel overhead > to manage the futex state is almost certainly several orders of > magnitude more than the delay we actually want. > > I looked into several other methods of doing the spinlock delay > instead. I think all of these were suggested at one point or > another in our earlier discussions of spinlocks: > > 1. Use sched_yield() if available: it does just what we want, > ie, yield the processor without forcing any useless time delay > before we can be rescheduled. This doesn't exist everywhere > but it exists in recent Linuxen, so I tried it. It made for a > beautiful improvement in my test case numbers: CPU utilization > went to 100% and the context swap rate to almost nil. Unfortunately, > I also saw fairly frequent "stuck spinlock" panics when running > more queries than there were processors --- this despite increasing > NUM_DELAYS to 10000 in s_lock.c. 
So I don't trust sched_yield > anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. > (I speculate that it's set up to only yield the processor to other > processes already affiliated to that processor. In any case, it > is definitely capable of getting through 10000 yields without > running the guy who's holding the spinlock.) > > 2. Reduce the length of the select() delays in s_lock. The current code > delays in quanta of 10msec, but on recent Linuxen (and I think other > platforms too) the scheduling quantum is 1msec, so we can get the > processor back faster if we ask for a 1msec delay. I tried this and it > is definitely a win on Linux; it doesn't seem to hurt anything on older > systems either, they just round the delay up to 10msec like before. > So I think we should do this, even though it's only a partial answer. > > 3. Modify the spin loop to do a little more work between TAS() tests. > In the existing code there are only about half a dozen instructions > total in the normal spin loop, most of them very fast, with the result > that the spinning processor does its best to monopolize the system bus > with locked probe instructions. This is obviously not good, as it'll > interfere with the ability of the spinlock holder to complete its work > and release the lock. (The bulk of the spinlock uses are for LWLocks, > and with the current data structures the LWLock's own state is usually > going to be in the same cache line as the spinlock proper, so that > cache grabs for the spinlock hurt the LWLock state updating as well > as the eventual spinlock release.) I modified the code to use an > integer modulo operation as part of the test for SPINS_PER_DELAY; > on most processors this should keep the processor's attention and > keep it away from the system bus for several times longer than an > average instruction. (It's relevant here that the "rep;nop" used > to slow the spin loop on recent Intel processors appears to be > completely ignored by Opterons. Basically we need a non-hardware- > specific trick to slow the loop.) > > 4. Modify SPINS_PER_DELAY. Looping more times instead of entering the > kernel is a win for SMP boxes: whenever you give up and call the kernel > to delay, it's likely that the spinlock holder will have released the > lock before you even get all the way into the kernel, let alone get > rescheduled again. I found that pushing SPINS_PER_DELAY above 200 > caused the select delay rate to fall to almost nil in this test case. > However, that would be counterproductive for uniprocessors where the > optimal strategy is normally to give up the processor immediately. > We've talked about making the spin count user configurable, but I never > cared for that answer. Instead, I've worked up some code to > automatically adjust the spin count depending on whether we seem to be > in a uniprocessor or multiprocessor. The idea is to decrease the spin > count (down to a lower limit) whenever s_lock has to delay in order to > get a lock, and to increase the count (up to a limit) whenever we see > the lock come free before we've had to delay. The former condition > suggests that we are in a uniprocessor and the latter suggests we are in > a multiprocessor. I recall having complained that this idea would > probably be unstable, but in testing it seems to do pretty nicely at > converging to the minimum spin count on a 1-CPU machine and the maximum > limit on an SMP box. It could do with more testing though to see if it > works the same for other people. 
> > My proposal therefore is to do #2, #3, and #4, and to modify the TAS > assembly code at least on x86_64. Together, these changes bring > my test case on a 4-way Opteron from > > N, runtime: 1 36s 2 61s 4 105s 8 198s > > to > > N, runtime: 1 35s 2 48s 4 58s 8 105s > > At 4 queries, CPU utilization is 100% and the context swap rate nearly > nil, which seems to mean we've gained another (temporary no doubt) > victory over the spinlock problem. > > I attach two proposed patches: the first removes the cmpb/jne from > the x86 TAS assembly code, and the second one does the s_lock changes > enumerated as points #2, #3, #4. The first one in particular needs > more testing to see if it hurts performance on any non-Opteron x86 > chips. (If so, we'd just make it conditional to x86_64.) > > Comments and testing invited. > > regards, tom lane > Content-Description: slock-no-cmpb.patch > *** REL8_0/src/include/storage/s_lock.h Sun Aug 28 20:41:44 2005 > --- new/s_lock.h Fri Sep 9 14:58:44 2005 > *************** > *** 120,132 **** > { > register slock_t _res = 1; > > ! /* Use a non-locking test before asserting the bus lock */ > __asm__ __volatile__( > - " cmpb $0,%1 \n" > - " jne 1f \n" > - " lock \n" > " xchgb %0,%1 \n" > - "1: \n" > : "+q"(_res), "+m"(*lock) > : > : "memory", "cc"); > --- 120,128 ---- > { > register slock_t _res = 1; > > ! /* xchg implies a LOCK prefix, so no need to say LOCK explicitly */ > __asm__ __volatile__( > " xchgb %0,%1 \n" > : "+q"(_res), "+m"(*lock) > : > : "memory", "cc"); Content-Description: spin-delay.patch > *** /home/postgres/pgsql/src/backend/storage/lmgr/proc.c.orig Sat Aug 20 19:26:24 2005 > --- /home/postgres/pgsql/src/backend/storage/lmgr/proc.c Sun Sep 11 15:07:20 2005 > *************** > *** 170,175 **** > --- 170,177 ---- > > ProcGlobal->freeProcs = INVALID_OFFSET; > > + ProcGlobal->spins_per_delay = DEFAULT_SPINS_PER_DELAY; > + > /* > * Pre-create the PGPROC structures and create a semaphore for > * each. > *************** > *** 224,232 **** > --- 226,239 ---- > /* > * Try to get a proc struct from the free list. If this fails, we > * must be out of PGPROC structures (not to mention semaphores). > + * > + * While we are holding the ProcStructLock, also copy the current > + * shared estimate of spins_per_delay to local storage. > */ > SpinLockAcquire(ProcStructLock); > > + set_spins_per_delay(procglobal->spins_per_delay); > + > myOffset = procglobal->freeProcs; > > if (myOffset != INVALID_OFFSET) > *************** > *** 318,338 **** > > Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS); > > dummyproc = &DummyProcs[proctype]; > > /* > * dummyproc should not presently be in use by anyone else > */ > if (dummyproc->pid != 0) > elog(FATAL, "DummyProc[%d] is in use by PID %d", > proctype, dummyproc->pid); > MyProc = dummyproc; > > /* > * Initialize all fields of MyProc, except MyProc->sem which was set > * up by InitProcGlobal. > */ > - MyProc->pid = MyProcPid; /* marks dummy proc as in use by me */ > SHMQueueElemInit(&(MyProc->links)); > MyProc->waitStatus = STATUS_OK; > MyProc->xid = InvalidTransactionId; > --- 325,362 ---- > > Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS); > > + /* > + * Just for paranoia's sake, we use the ProcStructLock to protect > + * assignment and releasing of DummyProcs entries. > + * > + * While we are holding the ProcStructLock, also copy the current > + * shared estimate of spins_per_delay to local storage. 
> + */ > + SpinLockAcquire(ProcStructLock); > + > + set_spins_per_delay(ProcGlobal->spins_per_delay); > + > dummyproc = &DummyProcs[proctype]; > > /* > * dummyproc should not presently be in use by anyone else > */ > if (dummyproc->pid != 0) > + { > + SpinLockRelease(ProcStructLock); > elog(FATAL, "DummyProc[%d] is in use by PID %d", > proctype, dummyproc->pid); > + } > MyProc = dummyproc; > > + MyProc->pid = MyProcPid; /* marks dummy proc as in use by me */ > + > + SpinLockRelease(ProcStructLock); > + > /* > * Initialize all fields of MyProc, except MyProc->sem which was set > * up by InitProcGlobal. > */ > SHMQueueElemInit(&(MyProc->links)); > MyProc->waitStatus = STATUS_OK; > MyProc->xid = InvalidTransactionId; > *************** > *** 509,514 **** > --- 533,541 ---- > /* PGPROC struct isn't mine anymore */ > MyProc = NULL; > > + /* Update shared estimate of spins_per_delay */ > + procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay); > + > SpinLockRelease(ProcStructLock); > } > > *************** > *** 532,542 **** > --- 559,576 ---- > /* Release any LW locks I am holding (see notes above) */ > LWLockReleaseAll(); > > + SpinLockAcquire(ProcStructLock); > + > /* Mark dummy proc no longer in use */ > MyProc->pid = 0; > > /* PGPROC struct isn't mine anymore */ > MyProc = NULL; > + > + /* Update shared estimate of spins_per_delay */ > + ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay); > + > + SpinLockRelease(ProcStructLock); > } > > > *** /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c.orig Fri Aug 26 10:47:35 2005 > --- /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c Sun Sep 11 17:43:20 2005 > *************** > *** 21,26 **** > --- 21,30 ---- > #include "storage/s_lock.h" > #include "miscadmin.h" > > + > + static int spins_per_delay = DEFAULT_SPINS_PER_DELAY; > + > + > /* > * s_lock_stuck() - complain about a stuck spinlock > */ > *************** > *** 49,102 **** > * We loop tightly for awhile, then delay using pg_usleep() and try > * again. Preferably, "awhile" should be a small multiple of the > * maximum time we expect a spinlock to be held. 100 iterations seems > ! * about right. In most multi-CPU scenarios, the spinlock is probably > ! * held by a process on another CPU and will be released before we > ! * finish 100 iterations. However, on a uniprocessor, the tight loop > ! * is just a waste of cycles, so don't iterate thousands of times. > * > * Once we do decide to block, we use randomly increasing pg_usleep() > ! * delays. The first delay is 10 msec, then the delay randomly > ! * increases to about one second, after which we reset to 10 msec and > * start again. The idea here is that in the presence of heavy > * contention we need to increase the delay, else the spinlock holder > * may never get to run and release the lock. (Consider situation > * where spinlock holder has been nice'd down in priority by the > * scheduler --- it will not get scheduled until all would-be > ! * acquirers are sleeping, so if we always use a 10-msec sleep, there > * is a real possibility of starvation.) But we can't just clamp the > * delay to an upper bound, else it would take a long time to make a > * reasonable number of tries. > * > * We time out and declare error after NUM_DELAYS delays (thus, exactly > * that many tries). With the given settings, this will usually take > ! * 3 or so minutes. 
It seems better to fix the total number of tries > * (and thus the probability of unintended failure) than to fix the > * total time spent. > * > ! * The pg_usleep() delays are measured in centiseconds (0.01 sec) because > ! * 10 msec is a common resolution limit at the OS level. > */ > ! #define SPINS_PER_DELAY 100 > #define NUM_DELAYS 1000 > ! #define MIN_DELAY_CSEC 1 > ! #define MAX_DELAY_CSEC 100 > > ! int spins = 0; > int delays = 0; > ! int cur_delay = MIN_DELAY_CSEC; > > while (TAS(lock)) > { > /* CPU-specific delay each time through the loop */ > SPIN_DELAY(); > > ! /* Block the process every SPINS_PER_DELAY tries */ > ! if (++spins > SPINS_PER_DELAY) > { > if (++delays > NUM_DELAYS) > s_lock_stuck(lock, file, line); > > ! pg_usleep(cur_delay * 10000L); > > #if defined(S_LOCK_TEST) > fprintf(stdout, "*"); > --- 53,121 ---- > * We loop tightly for awhile, then delay using pg_usleep() and try > * again. Preferably, "awhile" should be a small multiple of the > * maximum time we expect a spinlock to be held. 100 iterations seems > ! * about right as an initial guess. However, on a uniprocessor the > ! * loop is a waste of cycles, while in a multi-CPU scenario it's usually > ! * better to spin a bit longer than to call the kernel, so we try to > ! * adapt the spin loop count depending on whether we seem to be in > ! * a uniprocessor or multiprocessor. > * > * Once we do decide to block, we use randomly increasing pg_usleep() > ! * delays. The first delay is 1 msec, then the delay randomly > ! * increases to about one second, after which we reset to 1 msec and > * start again. The idea here is that in the presence of heavy > * contention we need to increase the delay, else the spinlock holder > * may never get to run and release the lock. (Consider situation > * where spinlock holder has been nice'd down in priority by the > * scheduler --- it will not get scheduled until all would-be > ! * acquirers are sleeping, so if we always use a 1-msec sleep, there > * is a real possibility of starvation.) But we can't just clamp the > * delay to an upper bound, else it would take a long time to make a > * reasonable number of tries. > * > * We time out and declare error after NUM_DELAYS delays (thus, exactly > * that many tries). With the given settings, this will usually take > ! * 2 or so minutes. It seems better to fix the total number of tries > * (and thus the probability of unintended failure) than to fix the > * total time spent. > * > ! * The pg_usleep() delays are measured in milliseconds because 1 msec > ! * is a common resolution limit at the OS level for newer platforms. > ! * On older platforms the resolution limit is usually 10 msec, in > ! * which case the total delay before timeout will be a bit more. > */ > ! #define MIN_SPINS_PER_DELAY 10 > ! #define MAX_SPINS_PER_DELAY 1000 > #define NUM_DELAYS 1000 > ! #define MIN_DELAY_MSEC 1 > ! #define MAX_DELAY_MSEC 1000 > > ! int spins = spins_per_delay; > int delays = 0; > ! int cur_delay = 0; > > while (TAS(lock)) > { > /* CPU-specific delay each time through the loop */ > SPIN_DELAY(); > > ! /* > ! * Block the process every spins_per_delay tries. > ! * > ! * What we are really testing for here is spins being decremented > ! * to zero. We insert an unnecessary integer modulo operation > ! * into the test because we'd like this loop to run longer than > ! * just two or three instructions: ideally, the processor should > ! * not be contending for the system bus for a little while here. > ! */ > ! 
if ((--spins % MAX_SPINS_PER_DELAY) == 0) > { > if (++delays > NUM_DELAYS) > s_lock_stuck(lock, file, line); > > ! if (cur_delay == 0) /* first time to delay? */ > ! cur_delay = MIN_DELAY_MSEC; > ! > ! pg_usleep(cur_delay * 1000L); > > #if defined(S_LOCK_TEST) > fprintf(stdout, "*"); > *************** > *** 107,119 **** > cur_delay += (int) (cur_delay * > (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5); > /* wrap back to minimum delay when max is exceeded */ > ! if (cur_delay > MAX_DELAY_CSEC) > ! cur_delay = MIN_DELAY_CSEC; > > ! spins = 0; > } > } > } > > /* > * Various TAS implementations that cannot live in s_lock.h as no inline > --- 126,200 ---- > cur_delay += (int) (cur_delay * > (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5); > /* wrap back to minimum delay when max is exceeded */ > ! if (cur_delay > MAX_DELAY_MSEC) > ! cur_delay = MIN_DELAY_MSEC; > > ! spins = spins_per_delay; > } > } > + > + /* > + * If we were able to acquire the lock without delaying, it's a good > + * indication we are in a multiprocessor. If we had to delay, it's > + * a sign (but not a sure thing) that we are in a uniprocessor. > + * Hence, we decrement spins_per_delay slowly when we had to delay, > + * and increase it rapidly when we didn't. It's expected that > + * spins_per_delay will converge to the minimum value on a uniprocessor > + * and to the maximum value on a multiprocessor. > + * > + * Note: spins_per_delay is local within our current process. > + * We want to average these observations across multiple backends, > + * since it's relatively rare for this function to even get entered, > + * and so a single backend might not live long enough to converge on > + * a good value. That is handled by the two routines below. > + */ > + if (cur_delay == 0) > + { > + /* we never had to delay */ > + spins_per_delay += 100; > + spins_per_delay = Min(spins_per_delay, MAX_SPINS_PER_DELAY); > + } > + else > + { > + spins_per_delay -= 1; > + spins_per_delay = Max(spins_per_delay, MIN_SPINS_PER_DELAY); > + } > + } > + > + > + /* > + * Set local copy of spins_per_delay during backend startup. > + * > + * NB: this has to be pretty fast as it is called while holding a spinlock > + */ > + void > + set_spins_per_delay(int shared_spins_per_delay) > + { > + spins_per_delay = shared_spins_per_delay; > + } > + > + /* > + * Update shared estimate of spins_per_delay during backend exit. > + * > + * NB: this has to be pretty fast as it is called while holding a spinlock > + */ > + int > + update_spins_per_delay(int shared_spins_per_delay) > + { > + /* > + * We use an exponential moving average with a relatively slow > + * adaption rate, so that noise in any one backend's result won't > + * affect the shared value too much. As long as both inputs are > + * within the allowed range, the result must be too, so we need not > + * worry about clamping the result. > + * > + * We deliberately truncate rather than rounding; this is so that > + * single adjustments inside a backend can affect the shared estimate > + * (see the asymmetric adjustment rules above). 
> + */ > + return (shared_spins_per_delay * 15 + spins_per_delay) / 16; > } > + > > /* > * Various TAS implementations that cannot live in s_lock.h as no inline > *** /home/postgres/pgsql/src/include/storage/proc.h.orig Sat Aug 20 19:26:34 2005 > --- /home/postgres/pgsql/src/include/storage/proc.h Sun Sep 11 15:07:09 2005 > *************** > *** 105,110 **** > --- 105,112 ---- > { > /* Head of list of free PGPROC structures */ > SHMEM_OFFSET freeProcs; > + /* Current shared estimate of appropriate spins_per_delay value */ > + int spins_per_delay; > } PROC_HDR; > > > *** /home/postgres/pgsql/src/include/storage/s_lock.h.orig Sun Aug 28 20:41:34 2005 > --- /home/postgres/pgsql/src/include/storage/s_lock.h Sun Sep 11 15:07:09 2005 > *************** > *** 826,829 **** > --- 826,835 ---- > */ > extern void s_lock(volatile slock_t *lock, const char *file, int line); > > + /* Support for dynamic adjustment of spins_per_delay */ > + #define DEFAULT_SPINS_PER_DELAY 100 > + > + extern void set_spins_per_delay(int shared_spins_per_delay); > + extern int update_spins_per_delay(int shared_spins_per_delay); > + > #endif /* S_LOCK_H */ > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
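For readers who do not want to wade through the diff, the adaptive spins_per_delay logic in the spin-delay patch above boils down to a few lines. The sketch below copies the constants and adjustment rules from the patch, but the wrapper functions and the had_to_delay flag are invented for illustration; it is not a drop-in replacement for s_lock.c.

#define MIN_SPINS_PER_DELAY		10
#define MAX_SPINS_PER_DELAY		1000
#define DEFAULT_SPINS_PER_DELAY	100

static int	spins_per_delay = DEFAULT_SPINS_PER_DELAY;

/* Called after each s_lock(); had_to_delay says whether we ever slept. */
static void
adjust_spins_per_delay(int had_to_delay)
{
	if (!had_to_delay)
	{
		/* Lock came free while spinning: probably SMP, so spin longer next time. */
		spins_per_delay += 100;
		if (spins_per_delay > MAX_SPINS_PER_DELAY)
			spins_per_delay = MAX_SPINS_PER_DELAY;
	}
	else
	{
		/* Had to sleep: probably a uniprocessor, so back off slowly. */
		spins_per_delay -= 1;
		if (spins_per_delay < MIN_SPINS_PER_DELAY)
			spins_per_delay = MIN_SPINS_PER_DELAY;
	}
}

/*
 * At backend exit, fold the local estimate into the shared one: a truncating
 * exponential moving average, as in the patch's update_spins_per_delay().
 */
static int
merge_into_shared_estimate(int shared_spins_per_delay)
{
	return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
}

The asymmetry (add 100, subtract 1) lets the estimate climb quickly on an SMP box while drifting down only slowly on a uniprocessor, and the slow 1/16 blend keeps any one short-lived backend from dragging the shared value around.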
On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote: > I wrote: > > Another thought came to mind: maybe the current data layout for LWLocks > > is bad. Right now, the spinlock that protects each LWLock data struct > > is itself part of the struct, and since the structs aren't large (circa > > 20 bytes), the whole thing is usually all in the same cache line. ... > > Maybe it'd be better to allocate the spinlocks off by themselves. > > Well, this is odd. I made up a patch to do this (attached) and found > that it pretty much sucks. Still the 4-way Opteron, previous best > (slock-no-cmpb and spin-delay-2): > 1 31s 2 42s 4 51s 8 100s > with lwlock-separate added: > 1 31s 2 52s 4 106s 8 213s > > What I had expected to see was a penalty in the single-thread case (due > to instructions added to the LWLock functions), but there isn't any. > I did not expect to see a factor-of-2 penalty for four threads. > > I guess what this means is that there's no real problem with losing the > cache line while manipulating the LWLock, which is what the patch was > intended to prevent. Instead, we're paying for swapping two cache lines > (the spinlock and the LWLock) across processors instead of just one line. > But that should at worst be a 2x inflation of the time previously spent > in LWLockAcquire/Release, which is surely not yet all of the application > ;-). Why the heck is this so bad? (Just going back through the whole thread for completeness.) > Should we expect that apparently > minor changes in shared data structures might be costing equivalently > huge penalties in SMP performance elsewhere? Yes. That's the advice from Intel and AMD; but we should add that there is potential for improving performance also. The possible problem we were trying to avoid here was false sharing of the cache line by lock requestors of locks whose spin locks were adjacent in memory. Splitting the data structure was just one of the possible ways of avoiding that. The above shows that the possible solution described above didn't work, but there could be others. One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contested locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, probably because it costs very little to implement, even if the code would require some explanatory documentation. The lwlock structure in include/storage/lwlock.h looks like typedef enum LWLockId {BufMappingLock,BufFreelistLock,LockMgrLock,OidGenLock,XidGenLock,ProcArrayLock,SInvalLock,FreeSpaceLock,WALInsertLock,WALWriteLock,... Notice that the heavily contested locks (i.e. first 3 and the WAL locks) are adjacent to at least one other heavily contested lock. So they are certain to be in the same cache line and therefore to cause false sharing (on all CPU types, not just Intel and AMD (ref: Manufacturer's optimization handbooks). This could be replaced by... 
typedef enum LWLockId
{
BufMappingLock,
BufMappingLock_Padding1,
BufMappingLock_Padding2,
BufMappingLock_Padding3,
BufMappingLock_Padding4,
BufMappingLock_Padding5,
BufMappingLock_Padding6,
BufMappingLock_Padding7,
BufFreelistLock,
BufFreelistLock_Padding1,
BufFreelistLock_Padding2,
BufFreelistLock_Padding3,
BufFreelistLock_Padding4,
BufFreelistLock_Padding5,
BufFreelistLock_Padding6,
BufFreelistLock_Padding7,
LockMgrLock,
LockMgrLock_Padding1,
LockMgrLock_Padding2,
LockMgrLock_Padding3,
LockMgrLock_Padding4,
LockMgrLock_Padding5,
LockMgrLock_Padding6,
LockMgrLock_Padding7,
OidGenLock,
XidGenLock,
ProcArrayLock,
SInvalLock,
FreeSpaceLock,
WALInsertLock,
WALInsertLock_Padding1,
WALInsertLock_Padding2,
WALInsertLock_Padding3,
WALInsertLock_Padding4,
WALInsertLock_Padding5,
WALInsertLock_Padding6,
WALInsertLock_Padding7,
WALWriteLock,
WALWriteLock_Padding1,
WALWriteLock_Padding2,
WALWriteLock_Padding3,
WALWriteLock_Padding4,
WALWriteLock_Padding5,
WALWriteLock_Padding6,
WALWriteLock_Padding7,
...

where the number of padding locks is determined by how many lock structures fit within a 128 byte cache line. This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this. Best Regards, Simon Riggs
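The "how many padding locks" rule can be written down as a small compile-time calculation. In the sketch below the LWLock fields are placeholders (the real struct lives in lwlock.c) and the 128-byte line size is the figure assumed in this thread, so the computed count is illustrative only.

#include <stdio.h>

#define CACHE_LINE_SIZE 128		/* assumed line size, per the discussion above */

/* Placeholder stand-in for the real LWLock struct (circa 20 bytes in 8.0). */
typedef struct LWLock
{
	unsigned char mutex;		/* spinlock protecting the fields below */
	char		exclusive;		/* # of exclusive holders (0 or 1) */
	int			shared;			/* # of shared holders */
	void	   *head;			/* head of wait queue */
	void	   *tail;			/* tail of wait queue */
} LWLock;

/*
 * How many LWLocks share one cache line, and hence how many dummy LWLockId
 * entries are needed after each heavily contested lock.
 */
#define LOCKS_PER_CACHE_LINE	(CACHE_LINE_SIZE / sizeof(LWLock))
#define PADDING_LOCKS_NEEDED	(LOCKS_PER_CACHE_LINE - 1)

int
main(void)
{
	printf("sizeof(LWLock) = %zu, padding entries per hot lock = %zu\n",
		   sizeof(LWLock), (size_t) PADDING_LOCKS_NEEDED);
	return 0;
}

The seven padding entries shown above correspond to eight locks fitting in a 128-byte line; a larger struct would need fewer.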
> where the number of padding locks is determined by how many lock > structures fit within a 128 byte cache line. > > This isn't exactly elegant coding, but it provides a useful improvement > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks > pretty darn stupid. But it does follow the CPU optimization handbook > advice and I did see a noticeable improvement in performance and a > reduction in context switching. > > I'm not in a position to try this again now on 8.1beta, but I'd welcome > a performance test result from anybody that is. I'll supply a patch > against 8.1beta for anyone wanting to test this. > I don't have an 8 way available right now but I can run tests on a 4 way Opteron if that would be helpful. Emil
On Wed, 12 Oct 2005 18:44:50 +0100 Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote: > > I wrote: > > > Another thought came to mind: maybe the current data layout for LWLocks > > > is bad. Right now, the spinlock that protects each LWLock data struct > > > is itself part of the struct, and since the structs aren't large (circa > > > 20 bytes), the whole thing is usually all in the same cache line. ... > > > Maybe it'd be better to allocate the spinlocks off by themselves. > > > > Well, this is odd. I made up a patch to do this (attached) and found > > that it pretty much sucks. Still the 4-way Opteron, previous best > > (slock-no-cmpb and spin-delay-2): > > 1 31s 2 42s 4 51s 8 100s > > with lwlock-separate added: > > 1 31s 2 52s 4 106s 8 213s > > > > What I had expected to see was a penalty in the single-thread case (due > > to instructions added to the LWLock functions), but there isn't any. > > I did not expect to see a factor-of-2 penalty for four threads. > > > > I guess what this means is that there's no real problem with losing the > > cache line while manipulating the LWLock, which is what the patch was > > intended to prevent. Instead, we're paying for swapping two cache lines > > (the spinlock and the LWLock) across processors instead of just one line. > > But that should at worst be a 2x inflation of the time previously spent > > in LWLockAcquire/Release, which is surely not yet all of the application > > ;-). Why the heck is this so bad? > > (Just going back through the whole thread for completeness.) > > > Should we expect that apparently > > minor changes in shared data structures might be costing equivalently > > huge penalties in SMP performance elsewhere? > > Yes. That's the advice from Intel and AMD; but we should add that there > is potential for improving performance also. > > The possible problem we were trying to avoid here was false sharing of > the cache line by lock requestors of locks whose spin locks were > adjacent in memory. > > Splitting the data structure was just one of the possible ways of > avoiding that. The above shows that the possible solution described > above didn't work, but there could be others. > > One thing we tried in February was padding out the statically defined > locks with dummy lock definitions in the enum. This has the effect of > ensuring that the most contested locks are very definitely in their own > cache line and not shared with others. > That showed a noticeable improvement in performance, probably because it > costs very little to implement, even if the code would require some > explanatory documentation. > > The lwlock structure in include/storage/lwlock.h looks like > > typedef enum LWLockId > { > BufMappingLock, > BufFreelistLock, > LockMgrLock, > OidGenLock, > XidGenLock, > ProcArrayLock, > SInvalLock, > FreeSpaceLock, > WALInsertLock, > WALWriteLock, > ... > > Notice that the heavily contested locks (i.e. first 3 and the WAL locks) > are adjacent to at least one other heavily contested lock. So they are > certain to be in the same cache line and therefore to cause false > sharing (on all CPU types, not just Intel and AMD (ref: Manufacturer's > optimization handbooks). > > This could be replaced by... 
> > typedef enum LWLockId > { > BufMappingLock, > BufMappingLock_Padding1, > BufMappingLock_Padding2, > BufMappingLock_Padding3, > BufMappingLock_Padding4, > BufMappingLock_Padding5, > BufMappingLock_Padding6, > BufMappingLock_Padding7, > BufFreelistLock, > BufFreelistLock_Padding1, > BufFreelistLock_Padding2, > BufFreelistLock_Padding3, > BufFreelistLock_Padding4, > BufFreelistLock_Padding5, > BufFreelistLock_Padding6, > BufFreelistLock_Padding7, > LockMgrLock, > LockMgrLock_Padding1, > LockMgrLock_Padding2, > LockMgrLock_Padding3, > LockMgrLock_Padding4, > LockMgrLock_Padding5, > LockMgrLock_Padding6, > LockMgrLock_Padding7, > OidGenLock, > XidGenLock, > ProcArrayLock, > SInvalLock, > FreeSpaceLock, > WALInsertLock, > WALInsertLock_Padding1, > WALInsertLock_Padding2, > WALInsertLock_Padding3, > WALInsertLock_Padding4, > WALInsertLock_Padding5, > WALInsertLock_Padding6, > WALInsertLock_Padding7, > WALWriteLock, > WALWriteLock_Padding1, > WALWriteLock_Padding2, > WALWriteLock_Padding3, > WALWriteLock_Padding4, > WALWriteLock_Padding5, > WALWriteLock_Padding6, > WALWriteLock_Padding7, > ... > > where the number of padding locks is determined by how many lock > structures fit within a 128 byte cache line. > > This isn't exactly elegant coding, but it provides a useful improvement > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks > pretty darn stupid. But it does follow the CPU optimization handbook > advice and I did see a noticeable improvement in performance and a > reduction in context switching. > > I'm not in a position to try this again now on 8.1beta, but I'd welcome > a performance test result from anybody that is. I'll supply a patch > against 8.1beta for anyone wanting to test this. I'll take a patch. I have a 4 processor (8core) POWER5 system just about ready. Mark
On Wed, Oct 12, 2005 at 03:07:23PM -0400, Emil Briggs wrote: > > where the number of padding locks is determined by how many lock > > structures fit within a 128 byte cache line. > > > > This isn't exactly elegant coding, but it provides a useful improvement > > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks > > pretty darn stupid. But it does follow the CPU optimization handbook > > advice and I did see a noticeable improvement in performance and a > > reduction in context switching. > > > > I'm not in a position to try this again now on 8.1beta, but I'd welcome > > a performance test result from anybody that is. I'll supply a patch > > against 8.1beta for anyone wanting to test this. > > > > I don't have an 8 way available right now but I can run tests on a 4 way > Opteron if that would be helpful. Likewise I can test on a 4 way opteron running open solaris. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote: > > > > This isn't exactly elegant coding, but it provides a useful improvement > > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks > > pretty darn stupid. But it does follow the CPU optimization handbook > > advice and I did see a noticeable improvement in performance and a > > reduction in context switching. > > I'm not in a position to try this again now on 8.1beta, but I'd welcome > > a performance test result from anybody that is. I'll supply a patch > > against 8.1beta for anyone wanting to test this. > > Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which > I've just set up and probably needs a bit of tuning. I don't see much > difference but I'm wondering if the cacheline sizes are dramatically > different from Intel/AMD processors. I still need to take a closer look > to make sure I haven't grossly mistuned anything, but I'll let everyone > take a look: Well, the Power 5 architecture probably has the lowest overall memory delay you can get currently so in some ways that would negate the effects of the patch. (Cacheline is still 128 bytes, AFAICS). But it's clear the patch isn't significantly better (like it was with 8.0 when we tried this on the 8-way Itanium in Feb). > cvs 20051013 > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/ > 2501 notpm > > cvs 20051013 w/ lw.patch > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/ > 2519 notpm Could you re-run with wal_buffers = 32 ? (Without patch) Thanks Best Regards, Simon Riggs
On Thu, 20 Oct 2005 23:03:47 +0100 Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote: > > > > > > This isn't exactly elegant coding, but it provides a useful improvement > > > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks > > > pretty darn stupid. But it does follow the CPU optimization handbook > > > advice and I did see a noticeable improvement in performance and a > > > reduction in context switching. > > > > I'm not in a position to try this again now on 8.1beta, but I'd welcome > > > a performance test result from anybody that is. I'll supply a patch > > > against 8.1beta for anyone wanting to test this. > > > > Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which > > I've just set up and probably needs a bit of tuning. I don't see much > > difference but I'm wondering if the cacheline sizes are dramatically > > different from Intel/AMD processors. I still need to take a closer look > > to make sure I haven't grossly mistuned anything, but I'll let everyone > > take a look: > > Well, the Power 5 architecture probably has the lowest overall memory > delay you can get currently so in some ways that would negate the > effects of the patch. (Cacheline is still 128 bytes, AFAICS). But it's > clear the patch isn't significantly better (like it was with 8.0 when we > tried this on the 8-way Itanium in Feb). > > > cvs 20051013 > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/ > > 2501 notpm > > > > cvs 20051013 w/ lw.patch > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/ > > 2519 notpm > > Could you re-run with wal_buffers = 32 ? (Without patch) Thanks Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and redid the disk layout. Here's where I'm at now: cvs 20051013 http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/ 3257 notpm cvs 20051013 w/ lw.patch http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/ 3285 notpm Still not much of a difference with the patch. A quick glance over the iostat data suggests I'm still not i/o bound, but the i/o wait is rather high according to vmstat. Will try to see if there's anything else obvious to get the load up higher. Mark
On Mon, 2005-10-31 at 16:10 -0800, Mark Wong wrote: > On Thu, 20 Oct 2005 23:03:47 +0100 > Simon Riggs <simon@2ndquadrant.com> wrote: > > > On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote: > > > > > > > > This isn't exactly elegant coding, but it provides a useful improvement > > > > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks > > > > pretty darn stupid. But it does follow the CPU optimization handbook > > > > advice and I did see a noticeable improvement in performance and a > > > > reduction in context switching. > > > > > > I'm not in a position to try this again now on 8.1beta, but I'd welcome > > > > a performance test result from anybody that is. I'll supply a patch > > > > against 8.1beta for anyone wanting to test this. > > > > > > Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which > > > I've just set up and probably needs a bit of tuning. I don't see much > > > difference but I'm wondering if the cacheline sizes are dramatically > > > different from Intel/AMD processors. I still need to take a closer look > > > to make sure I haven't grossly mistuned anything, but I'll let everyone > > > take a look: > > > > Well, the Power 5 architecture probably has the lowest overall memory > > delay you can get currently so in some ways that would negate the > > effects of the patch. (Cacheline is still 128 bytes, AFAICS). But it's > > clear the patch isn't significantly better (like it was with 8.0 when we > > tried this on the 8-way Itanium in Feb). > > > > > cvs 20051013 > > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/ > > > 2501 notpm > > > > > > cvs 20051013 w/ lw.patch > > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/ > > > 2519 notpm > > > > Could you re-run with wal_buffers = 32 ? (Without patch) Thanks > > Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and > redid the disk layout. Here's where I'm at now: > > cvs 20051013 > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/ > 3257 notpm > > cvs 20051013 w/ lw.patch > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/ > 3285 notpm > > Still not much of a difference with the patch. A quick glance over the > iostat data suggests I'm still not i/o bound, but the i/o wait is rather > high according to vmstat. Will try to see if there's anything else > obvious to get the load up higher. OK, thats fine. I'm glad there's some gain, but not much yet. I think we should leave out doing any more tests on lw.patch for now. Concerned about the awful checkpointing. Can you bump wal_buffers to 8192 just to make sure? Thats way too high, but just to prove it. We need to rdeuce the number of blocks to be written at checkpoint. bgwriter_all_maxpages 5 -> 15bgwriter_all_percent 0.333bgwriter_delay 200 bgwriter_lru_maxpages 5 -> 7bgwriter_lru_percent 1 shared_buffers set lower to 100000(which should cause some amusement on-list) Best Regards, Simon Riggs
On Tue, 01 Nov 2005 07:32:32 +0000 Simon Riggs <simon@2ndquadrant.com> wrote: > On Mon, 2005-10-31 at 16:10 -0800, Mark Wong wrote: > > On Thu, 20 Oct 2005 23:03:47 +0100 > > Simon Riggs <simon@2ndquadrant.com> wrote: > > > > > On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote: > > > > > > > > > > This isn't exactly elegant coding, but it provides a useful improvement > > > > > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks > > > > > pretty darn stupid. But it does follow the CPU optimization handbook > > > > > advice and I did see a noticeable improvement in performance and a > > > > > reduction in context switching. > > > > > > > > I'm not in a position to try this again now on 8.1beta, but I'd welcome > > > > > a performance test result from anybody that is. I'll supply a patch > > > > > against 8.1beta for anyone wanting to test this. > > > > > > > > Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which > > > > I've just set up and probably needs a bit of tuning. I don't see much > > > > difference but I'm wondering if the cacheline sizes are dramatically > > > > different from Intel/AMD processors. I still need to take a closer look > > > > to make sure I haven't grossly mistuned anything, but I'll let everyone > > > > take a look: > > > > > > Well, the Power 5 architecture probably has the lowest overall memory > > > delay you can get currently so in some ways that would negate the > > > effects of the patch. (Cacheline is still 128 bytes, AFAICS). But it's > > > clear the patch isn't significantly better (like it was with 8.0 when we > > > tried this on the 8-way Itanium in Feb). > > > > > > > cvs 20051013 > > > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/ > > > > 2501 notpm > > > > > > > > cvs 20051013 w/ lw.patch > > > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/ > > > > 2519 notpm > > > > > > Could you re-run with wal_buffers = 32 ? (Without patch) Thanks > > > > Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and > > redid the disk layout. Here's where I'm at now: > > > > cvs 20051013 > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/ > > 3257 notpm > > > > cvs 20051013 w/ lw.patch > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/ > > 3285 notpm > > > > Still not much of a difference with the patch. A quick glance over the > > iostat data suggests I'm still not i/o bound, but the i/o wait is rather > > high according to vmstat. Will try to see if there's anything else > > obvious to get the load up higher. > > OK, thats fine. I'm glad there's some gain, but not much yet. I think we > should leave out doing any more tests on lw.patch for now. > > Concerned about the awful checkpointing. Can you bump wal_buffers to > 8192 just to make sure? Thats way too high, but just to prove it. > > We need to rdeuce the number of blocks to be written at checkpoint. 
> > bgwriter_all_maxpages 5 -> 15 > bgwriter_all_percent 0.333 > bgwriter_delay 200 > bgwriter_lru_maxpages 5 -> 7 > bgwriter_lru_percent 1 > > shared_buffers set lower to 100000 > (which should cause some amusement on-list) Okay, here goes, all with the same source base w/ the lw.patch: http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/ only increased wal_buffers to 8192 from 2048 3242 notpm http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/ only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7 3019 notpm (but more interesting graph) http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/ Same as the previously listed run with shared_buffers lowered to 10000 2503 notpm Mark
On Thu, 2005-11-03 at 08:03 -0800, Mark Wong wrote: > On Tue, 01 Nov 2005 07:32:32 +0000 > Simon Riggs <simon@2ndquadrant.com> wrote: > > Concerned about the awful checkpointing. Can you bump wal_buffers to > > 8192 just to make sure? Thats way too high, but just to prove it. > > > > We need to rdeuce the number of blocks to be written at checkpoint. > > > > bgwriter_all_maxpages 5 -> 15 > > bgwriter_all_percent 0.333 > > bgwriter_delay 200 > > bgwriter_lru_maxpages 5 -> 7 > > bgwriter_lru_percent 1 > > > > shared_buffers set lower to 100000 > > (which should cause some amusement on-list) > > > Okay, here goes, all with the same source base w/ the lw.patch: > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/ > only increased wal_buffers to 8192 from 2048 > 3242 notpm That looks to me like a clear negative effect from increasing wal_buffers. Try putting it back down to 1024. Looks like we need to plug that gap. > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/ > only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7 > 3019 notpm (but more interesting graph) Man that sucks. What the heck is happening there? Hackers - if you watching you should see this graph - it shows some very poor behaviour. I'm not happy with that performance at all.... any chance you could re- run that exact same test to see if we can get that repeatably? I see you have vm.dirty_writeback_centisecs = 0 which pretty much means we aren't ever writing to disk by the pdflush daemons, even when the bgwriter is active. Could we set the bgwriter stuff back to default and try vm.dirty_writeback_centisecs = 500 > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/ > Same as the previously listen run with hared_buffers lowered to 10000 > 2503 notpm Sorry, that was 100,000 not 10,000. Looks like we need dates on the log_line_prefix so we can check the logs. ...not sure about the oprofile results. Seems to show CreateLWLocks being as high as xlog_insert, which is mad. Either that shows startup time is excessive, or it means the oprofile timing range is too short. Not sure which. Best Regards, Simon Riggs
On Thu, 03 Nov 2005 18:29:09 +0000 Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, 2005-11-03 at 08:03 -0800, Mark Wong wrote: > > On Tue, 01 Nov 2005 07:32:32 +0000 > > Simon Riggs <simon@2ndquadrant.com> wrote: > > > Concerned about the awful checkpointing. Can you bump wal_buffers to > > > 8192 just to make sure? Thats way too high, but just to prove it. > > > > > > We need to rdeuce the number of blocks to be written at checkpoint. > > > > > > bgwriter_all_maxpages 5 -> 15 > > > bgwriter_all_percent 0.333 > > > bgwriter_delay 200 > > > bgwriter_lru_maxpages 5 -> 7 > > > bgwriter_lru_percent 1 > > > > > > shared_buffers set lower to 100000 > > > (which should cause some amusement on-list) > > > > > > Okay, here goes, all with the same source base w/ the lw.patch: > > > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/ > > only increased wal_buffers to 8192 from 2048 > > 3242 notpm > > That looks to me like a clear negative effect from increasing > wal_buffers. Try putting it back down to 1024. > Looks like we need to plug that gap. > > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/ > > only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7 > > 3019 notpm (but more interesting graph) > > Man that sucks. What the heck is happening there? Hackers - if you > watching you should see this graph - it shows some very poor behaviour. > > I'm not happy with that performance at all.... any chance you could re- > run that exact same test to see if we can get that repeatably? > > I see you have > vm.dirty_writeback_centisecs = 0 > > which pretty much means we aren't ever writing to disk by the pdflush > daemons, even when the bgwriter is active. > > Could we set the bgwriter stuff back to default and try > vm.dirty_writeback_centisecs = 500 http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/47/ 3309 notpm > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/ > > Same as the previously listen run with hared_buffers lowered to 10000 > > 2503 notpm > > Sorry, that was 100,000 not 10,000. Oops! http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/46/ 2794 notpm > Looks like we need dates on the log_line_prefix so we can check the > logs. Oops again! I didn't check to make sure I had set this correctly before I ran the last two tests, I'll get on it. > ...not sure about the oprofile results. Seems to show CreateLWLocks > being as high as xlog_insert, which is mad. Either that shows startup > time is excessive, or it means the oprofile timing range is too short. > Not sure which. Yeah, we've seen this before. I think I'll have to try pulling the oprofile cvs code to see if there's any improvement. I've been working with oprofile-0.9.1. Mark
Added to TODO list. --------------------------------------------------------------------------- Simon Riggs wrote: > On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote: > > I wrote: > > > Another thought came to mind: maybe the current data layout for LWLocks > > > is bad. Right now, the spinlock that protects each LWLock data struct > > > is itself part of the struct, and since the structs aren't large (circa > > > 20 bytes), the whole thing is usually all in the same cache line. ... > > > Maybe it'd be better to allocate the spinlocks off by themselves. > > > > Well, this is odd. I made up a patch to do this (attached) and found > > that it pretty much sucks. Still the 4-way Opteron, previous best > > (slock-no-cmpb and spin-delay-2): > > 1 31s 2 42s 4 51s 8 100s > > with lwlock-separate added: > > 1 31s 2 52s 4 106s 8 213s > > > > What I had expected to see was a penalty in the single-thread case (due > > to instructions added to the LWLock functions), but there isn't any. > > I did not expect to see a factor-of-2 penalty for four threads. > > > > I guess what this means is that there's no real problem with losing the > > cache line while manipulating the LWLock, which is what the patch was > > intended to prevent. Instead, we're paying for swapping two cache lines > > (the spinlock and the LWLock) across processors instead of just one line. > > But that should at worst be a 2x inflation of the time previously spent > > in LWLockAcquire/Release, which is surely not yet all of the application > > ;-). Why the heck is this so bad? > > (Just going back through the whole thread for completeness.) > > > Should we expect that apparently > > minor changes in shared data structures might be costing equivalently > > huge penalties in SMP performance elsewhere? > > Yes. That's the advice from Intel and AMD; but we should add that there > is potential for improving performance also. > > The possible problem we were trying to avoid here was false sharing of > the cache line by lock requestors of locks whose spin locks were > adjacent in memory. > > Splitting the data structure was just one of the possible ways of > avoiding that. The above shows that the possible solution described > above didn't work, but there could be others. > > One thing we tried in February was padding out the statically defined > locks with dummy lock definitions in the enum. This has the effect of > ensuring that the most contested locks are very definitely in their own > cache line and not shared with others. > That showed a noticeable improvement in performance, probably because it > costs very little to implement, even if the code would require some > explanatory documentation. > > The lwlock structure in include/storage/lwlock.h looks like > > typedef enum LWLockId > { > BufMappingLock, > BufFreelistLock, > LockMgrLock, > OidGenLock, > XidGenLock, > ProcArrayLock, > SInvalLock, > FreeSpaceLock, > WALInsertLock, > WALWriteLock, > ... > > Notice that the heavily contested locks (i.e. first 3 and the WAL locks) > are adjacent to at least one other heavily contested lock. So they are > certain to be in the same cache line and therefore to cause false > sharing (on all CPU types, not just Intel and AMD (ref: Manufacturer's > optimization handbooks). > > This could be replaced by... 
> > typedef enum LWLockId > { > BufMappingLock, > BufMappingLock_Padding1, > BufMappingLock_Padding2, > BufMappingLock_Padding3, > BufMappingLock_Padding4, > BufMappingLock_Padding5, > BufMappingLock_Padding6, > BufMappingLock_Padding7, > BufFreelistLock, > BufFreelistLock_Padding1, > BufFreelistLock_Padding2, > BufFreelistLock_Padding3, > BufFreelistLock_Padding4, > BufFreelistLock_Padding5, > BufFreelistLock_Padding6, > BufFreelistLock_Padding7, > LockMgrLock, > LockMgrLock_Padding1, > LockMgrLock_Padding2, > LockMgrLock_Padding3, > LockMgrLock_Padding4, > LockMgrLock_Padding5, > LockMgrLock_Padding6, > LockMgrLock_Padding7, > OidGenLock, > XidGenLock, > ProcArrayLock, > SInvalLock, > FreeSpaceLock, > WALInsertLock, > WALInsertLock_Padding1, > WALInsertLock_Padding2, > WALInsertLock_Padding3, > WALInsertLock_Padding4, > WALInsertLock_Padding5, > WALInsertLock_Padding6, > WALInsertLock_Padding7, > WALWriteLock, > WALWriteLock_Padding1, > WALWriteLock_Padding2, > WALWriteLock_Padding3, > WALWriteLock_Padding4, > WALWriteLock_Padding5, > WALWriteLock_Padding6, > WALWriteLock_Padding7, > ... > > where the number of padding locks is determined by how many lock > structures fit within a 128 byte cache line. > > This isn't exactly elegant coding, but it provides a useful improvement > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks > pretty darn stupid. But it does follow the CPU optimization handbook > advice and I did see a noticeable improvement in performance and a > reduction in context switching. > > I'm not in a position to try this again now on 8.1beta, but I'd welcome > a performance test result from anybody that is. I'll supply a patch > against 8.1beta for anyone wanting to test this. > > Best Regards, Simon Riggs > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > -- Bruce Momjian http://candle.pha.pa.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Bruce Momjian" <pgman@candle.pha.pa.us> wrote > > Added to TODO list. > > > One thing we tried in February was padding out the statically defined > > locks with dummy lock definitions in the enum. This has the effect of > > ensuring that the most contested locks are very definitely in their own > > cache line and not shared with others. > > That showed a noticeable improvement in performance, probably because it > > costs very little to implement, even if the code would require some > > explanatory documentation. > > Has this been done? See the LWLOCK_PADDED_SIZE macro in code. Regards, Qingqing
Qingqing Zhou wrote: > > "Bruce Momjian" <pgman@candle.pha.pa.us> wrote > > > > Added to TODO list. > > > > > One thing we tried in February was padding out the statically defined > > > locks with dummy lock definitions in the enum. This has the effect of > > > ensuring that the most contested locks are very definitely in their own > > > cache line and not shared with others. > > > That showed a noticeable improvement in performance, probably because it > > > costs very little to implement, even if the code would require some > > > explanatory documentation. > > > > > Has this been done? See the LWLOCK_PADDED_SIZE macro in code. Oh, yes, thanks. I thought it had, but I couldn't find anything in the area of the code where he proposed the patch. -- Bruce Momjian http://candle.pha.pa.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: >>> One thing we tried in February was padding out the statically defined >>> locks with dummy lock definitions in the enum. > Has this been done? See the LWLOCK_PADDED_SIZE macro in code. Not really --- that patch was intended to ensure that LWLocks don't unnecessarily cross two cache lines. It doesn't ensure that two different LWLocks aren't sharing a cache line. You could do that by increasing LWLOCK_PADDED_SIZE to the cache line size for your hardware, if you know what that is. I think a more effective answer might be to twiddle the order of "enum LWLockId" items so that the most heavily used LWLocks aren't close together. Haven't looked into it though. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > Not really --- that patch was intended to ensure that LWLocks don't > unnecessarily cross two cache lines. It doesn't ensure that two > different LWLocks aren't sharing a cache line. You could do that > by increasing LWLOCK_PADDED_SIZE to the cache line size for your > hardware, if you know what that is. > Exactly, this is one way -- if we make LWLOCK_PADDED_SIZE big enough, we can assure that one lwlock one cacheline. If so, maybe we should plug in a check like LMBench in ./configure to guess out current cacheline size. But this way looks like overkill -- a compromise is to pad only some of the LWLocks big enough but not all (for example, the buffer content lock array). Regards, Qingqing