Thread: spinlocks: generalizing "non-locking test"
Currently, the assembly for TAS() on x86 does a non-locking test before using an atomic operation to attempt to acquire the spinlock: __asm__ __volatile__( " cmpb $0,%1 \n" " jne 1f \n" " lock \n" " xchgb %0,%1 \n" "1: \n" : "+q"(_res), "+m"(*lock) : : "memory", "cc"); The reason this is a good idea is that if we fail to immediately acquire the spinlock, s_lock() will spin SPINS_PER_DELAY times in userspace calling TAS() each time before going to sleep. If we do an atomic operation for each spin, this generates a lot more bus traffic than is necessary. Doing a non-locking test (followed by an atomic operation to acquire the spinlock if appropriate) is therefore better on SMP systems. Currently x86 is the only platform on which we do this -- ISTM that all the other platforms that implement spinlocks via atomic operations could benefit from this technique. We could fix this by tweaking each platform's assembler to add a non-blocking test, but there might be an easier way. Rather than modifying platform-specific assembler, I believe this C sequence is equivalent to the non-locking test: volatile slock_t *lock = ...; if (*lock == 0) TAS(lock); Because the lock variable is volatile, the compiler should reload it from memory for each loop iteration. (If this is actually not a sufficient non-locking test, please let me know...) We could add a new s_lock.h macro, TAS_INNER_LOOP(), whose default implementation would be: #define TAS_INNER_LOOP(lock) \ if ((*lock) == 0) \ TAS(lock); And then remove the x86-specific non-locking test from TAS. Comments? -Neil
Neil Conway <neilc@samurai.com> writes: > Currently x86 is the only platform on which we do this -- ISTM that all > the other platforms that implement spinlocks via atomic operations could > benefit from this technique. Do you have any actual evidence for that opinion? ISTM this is dependent on a large set of assumptions about the CPU's bus behavior, boiling down to the conclusion that an extra conditional branch is cheaper than a locked bus cycle. It would be a serious error to transfer that conclusion to non-Intel chips without investigation. (For that matter, I'm not that thrilled about it even for the Intel chips, since the extra test is certainly 100% wasted cycles on any single-CPU machine, or indeed anywhere that the lock is not contended for.) I believe that an extra test would be a dead loss on PPC, for instance, because the initial load is already nonblocking there. > if (*lock == 0) > TAS(lock); This will certainly not work on HPPA, and it really shouldn't assume that zero is the non-locked state, and what happened to testing the TAS result? On the whole it seems far easier to stick an extra load into the asm code on those platforms where we think it's a win. Especially since we have already done that ;-) regards, tom lane
On Mon, 2004-10-18 at 04:13, Tom Lane wrote: > Do you have any actual evidence for that opinion? ISTM this is > dependent on a large set of assumptions about the CPU's bus behavior, > boiling down to the conclusion that an extra conditional branch is > cheaper than a locked bus cycle. I think the conditional branch is effectively free: we're spinning in a busy-wait loop, so we're effectively throwing away cycles until the lock is free anyway. The important things are: (a) we interfere as little as possible with concurrent system activity while spinning (b) we notice promptly that the lock is free. ISTM that doing a non-locking test before the locking test achieves those two conditions. > (For that matter, I'm not that thrilled about it even for the Intel > chips, since the extra test is certainly 100% wasted cycles on any > single-CPU machine, or indeed anywhere that the lock is not contended > for.) Yes, but the proper fix for this is not to spin at all on UP machines; I have a grungy patch to do this that I will clean up and send in for 8.1. > > if (*lock == 0) > > TAS(lock); > > This will certainly not work on HPPA, and it really shouldn't assume > that zero is the non-locked state, and what happened to testing the > TAS result? Erm, perhaps I should have emphasized that that was pseudo-code :) A more realistic implementation would be: #define TAS_INNER_LOOP(lock) (S_LOCK_FREE(lock) ? TAS(lock) : 1) -Neil
Neil Conway <neilc@samurai.com> writes: > On Mon, 2004-10-18 at 04:13, Tom Lane wrote: >> Do you have any actual evidence for that opinion? ISTM this is >> dependent on a large set of assumptions about the CPU's bus behavior, >> boiling down to the conclusion that an extra conditional branch is >> cheaper than a locked bus cycle. > I think the conditional branch is effectively free: we're spinning in a > busy-wait loop, so we're effectively throwing away cycles until the lock > is free anyway. Only once we've begun to spin. The first time through, it's not at all clear whether the extra test is worthwhile --- it's certainly a win if the lock is always already held, and certainly a loss if the lock is always free, and otherwise you have to do some benchmarking to decide if you want it or not. We have the ASM-level test on those platforms where people seem to think that it is worthwhile, but not everywhere. As near as I can tell, your proposal is to never have the extra test on the initial TAS (because you'll remove it from the ASM) and to have it always on repeated tests (inside the spin loop). This is not what is done elsewhere AFAICS, and I'm not prepared to take on faith that it's a win. regards, tom lane
On Sun, Oct 17, 2004 at 11:16:50PM +1000, Neil Conway wrote: > Currently, the assembly for TAS() on x86 does a non-locking test before > using an atomic operation to attempt to acquire the spinlock: > > __asm__ __volatile__( > " cmpb $0,%1 \n" > " jne 1f \n" > " lock \n" > " xchgb %0,%1 \n" > "1: \n" > : "+q"(_res), "+m"(*lock) > : > : "memory", "cc"); > > The reason this is a good idea is that if we fail to immediately acquire > the spinlock, s_lock() will spin SPINS_PER_DELAY times in userspace > calling TAS() each time before going to sleep. If we do an atomic > operation for each spin, this generates a lot more bus traffic than is > necessary. Doing a non-locking test (followed by an atomic operation to > acquire the spinlock if appropriate) is therefore better on SMP systems. > > Currently x86 is the only platform on which we do this -- ISTM that all > the other platforms that implement spinlocks via atomic operations could > benefit from this technique. > > We could fix this by tweaking each platform's assembler to add a > non-blocking test, but there might be an easier way. Rather than > modifying platform-specific assembler, I believe this C sequence is > equivalent to the non-locking test: > > volatile slock_t *lock = ...; > > if (*lock == 0) > TAS(lock); > > Because the lock variable is volatile, the compiler should reload it > from memory for each loop iteration. (If this is actually not a > sufficient non-locking test, please let me know...) > > We could add a new s_lock.h macro, TAS_INNER_LOOP(), whose default > implementation would be: > > #define TAS_INNER_LOOP(lock) \ > if ((*lock) == 0) \ > TAS(lock); > > And then remove the x86-specific non-locking test from TAS. > > Comments? > > -Neil > Steve Hemminger had this to say: The linux kernel code is: #define UNLOCKED 1 ... while (atomic_dec(lock) != 0) { do { rep_nop(); } while(*lock != UNLOCKED); } To do the equivalent thing in postgres would mean while(TAS(lock)) { do { rep_nop(); } while (*lock); The point is do the locking test first (assume success is possible) if that doesn't work spin without doing locked operations and make sure and do the rep; nop; for the hyperthreaded CPU's
On Mon, 2004-10-18 at 11:53, Tom Lane wrote: > Only once we've begun to spin. The first time through, it's not at all > clear whether the extra test is worthwhile --- it's certainly a win if > the lock is always already held, and certainly a loss if the lock is > always free Granted, but I think you've mostly conceded my point: every _subsequent_ time TAS() is invoked, the non-locking test is a clear win (with the possible exception of PPC). Therefore we have two cases: "initial TAS" and "TAS inside loop", so so the logical implementation is two distinct macros. Of course, there may well be platforms on which TAS() is defined to TAS_INNER_LOOP() or vice versa, but this decision will vary by platform. > you have to do some benchmarking to decide if you want it or not. I agree that benchmarking is worth doing before making changes. > We have the ASM-level test on those platforms > where people seem to think that it is worthwhile, but not everywhere. That is certainly an optimistic interpretation :-) I would say an equally likely theory is that there is only one platform on which people have bothered to try a non-blocking test and see if it improves performance, and accordingly we only have one platform on which a non-locking test is used. (If anyone out there _has_ modified the spinlock implementation for PostgreSQL on a particular platform to use a non-locking initial test and found it hasn't improved performance, please speak up.) -Neil
Neil Conway <neilc@samurai.com> writes: > Granted, but I think you've mostly conceded my point: every _subsequent_ > time TAS() is invoked, the non-locking test is a clear win (with the > possible exception of PPC). I'm not real sure. One point here is that the standard advice about this stuff is generally thinking in terms of an *extremely* tight spin loop, ie while (TAS(lock)) ; The loop in s_lock.c has a bit more overhead than that. Also, because we only use spinlocks to protect LWLocks, the expected hold time for a spinlock is just a couple dozen instructions, which is probably less than the expected time in most other uses of spinlocks. So I think it's less than clear that we should expect TAS to fail, even within the loop. Basically I'd like to see some tests proving that there's actually any value in it before we go complicating the assembly-code API ... regards, tom lane