Thread: spinlocks: generalizing "non-locking test"

spinlocks: generalizing "non-locking test"

From
Neil Conway
Date:
Currently, the assembly for TAS() on x86 does a non-locking test before 
using an atomic operation to attempt to acquire the spinlock:
    __asm__ __volatile__(
        "    cmpb    $0,%1    \n"
        "    jne    1f    \n"
        "    lock        \n"
        "    xchgb    %0,%1    \n"
        "1: \n"
:        "+q"(_res), "+m"(*lock)
:
:        "memory", "cc");

The reason this is a good idea is that if we fail to immediately acquire 
the spinlock, s_lock() will spin SPINS_PER_DELAY times in userspace 
calling TAS() each time before going to sleep. If we do an atomic 
operation for each spin, this generates a lot more bus traffic than is 
necessary. Doing a non-locking test (followed by an atomic operation to 
acquire the spinlock if appropriate) is therefore better on SMP systems.
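
A minimal sketch of the idea in the paragraph above, written with C11 atomics instead of the x86 inline assembly. The names demo_slock, demo_tas(), demo_ttas() and demo_unlock() are hypothetical and are not part of s_lock.h; the return convention (0 means acquired) is assumed to match TAS().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type for illustration; not PostgreSQL's slock_t. */
typedef struct { atomic_uchar held; } demo_slock;

/* Plain test-and-set: always performs a locked read-modify-write.
 * Returns 0 if the lock was acquired, nonzero if it was already held. */
static int demo_tas(demo_slock *lock)
{
    return atomic_exchange_explicit(&lock->held, 1, memory_order_acquire);
}

/* Non-locking test first: an ordinary load, then the locked exchange
 * only if the lock looked free. Same return convention as demo_tas(). */
static int demo_ttas(demo_slock *lock)
{
    if (atomic_load_explicit(&lock->held, memory_order_relaxed) != 0)
        return 1;               /* looks held: skip the locked operation */
    return demo_tas(lock);
}

/* Release the lock; the assert is only a debugging aid for the sketch. */
static void demo_unlock(demo_slock *lock)
{
    assert(atomic_load_explicit(&lock->held, memory_order_relaxed) != 0);
    atomic_store_explicit(&lock->held, 0, memory_order_release);
}
```

Whether the extra load actually pays off depends on the platform and on how contended the lock is, so this only shows the shape of the technique.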

Currently x86 is the only platform on which we do this -- ISTM that all 
the other platforms that implement spinlocks via atomic operations could 
benefit from this technique.

We could fix this by tweaking each platform's assembler to add a 
non-blocking test, but there might be an easier way. Rather than 
modifying platform-specific assembler, I believe this C sequence is 
equivalent to the non-locking test:
    volatile slock_t *lock = ...;

    if (*lock == 0)
        TAS(lock);

Because the lock variable is volatile, the compiler should reload it 
from memory for each loop iteration. (If this is actually not a 
sufficient non-locking test, please let me know...)

We could add a new s_lock.h macro, TAS_INNER_LOOP(), whose default 
implementation would be:
    #define TAS_INNER_LOOP(lock) \
        if ((*lock) == 0) \
            TAS(lock);

And then remove the x86-specific non-locking test from TAS.

Comments?

-Neil


Re: spinlocks: generalizing "non-locking test"

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> Currently x86 is the only platform on which we do this -- ISTM that all 
> the other platforms that implement spinlocks via atomic operations could 
> benefit from this technique.

Do you have any actual evidence for that opinion?  ISTM this is
dependent on a large set of assumptions about the CPU's bus behavior,
boiling down to the conclusion that an extra conditional branch is
cheaper than a locked bus cycle.  It would be a serious error to
transfer that conclusion to non-Intel chips without investigation.
(For that matter, I'm not that thrilled about it even for the Intel
chips, since the extra test is certainly 100% wasted cycles on any
single-CPU machine, or indeed anywhere that the lock is not contended
for.)

I believe that an extra test would be a dead loss on PPC, for instance,
because the initial load is already nonblocking there.

>      if (*lock == 0)
>          TAS(lock);

This will certainly not work on HPPA, and it really shouldn't assume
that zero is the non-locked state, and what happened to testing the
TAS result?

On the whole it seems far easier to stick an extra load into the asm
code on those platforms where we think it's a win.  Especially since
we have already done that ;-)
        regards, tom lane


Re: spinlocks: generalizing "non-locking test"

From
Neil Conway
Date:
On Mon, 2004-10-18 at 04:13, Tom Lane wrote:
> Do you have any actual evidence for that opinion?  ISTM this is
> dependent on a large set of assumptions about the CPU's bus behavior,
> boiling down to the conclusion that an extra conditional branch is
> cheaper than a locked bus cycle.

I think the conditional branch is effectively free: we're spinning in a
busy-wait loop, so we're effectively throwing away cycles until the lock
is free anyway. The important things are: (a) we interfere as little as
possible with concurrent system activity while spinning (b) we notice
promptly that the lock is free. ISTM that doing a non-locking test
before the locking test achieves those two conditions.

> (For that matter, I'm not that thrilled about it even for the Intel
> chips, since the extra test is certainly 100% wasted cycles on any
> single-CPU machine, or indeed anywhere that the lock is not contended
> for.)

Yes, but the proper fix for this is not to spin at all on UP machines; I
have a grungy patch to do this that I will clean up and send in for 8.1.

> >      if (*lock == 0)
> >          TAS(lock);
> 
> This will certainly not work on HPPA, and it really shouldn't assume
> that zero is the non-locked state, and what happened to testing the
> TAS result?

Erm, perhaps I should have emphasized that that was pseudo-code :) A
more realistic implementation would be:

#define TAS_INNER_LOOP(lock) (S_LOCK_FREE(lock) ? TAS(lock) : 1)

-Neil




Re: spinlocks: generalizing "non-locking test"

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> On Mon, 2004-10-18 at 04:13, Tom Lane wrote:
>> Do you have any actual evidence for that opinion?  ISTM this is
>> dependent on a large set of assumptions about the CPU's bus behavior,
>> boiling down to the conclusion that an extra conditional branch is
>> cheaper than a locked bus cycle.

> I think the conditional branch is effectively free: we're spinning in a
> busy-wait loop, so we're effectively throwing away cycles until the lock
> is free anyway.

Only once we've begun to spin.  The first time through, it's not at all
clear whether the extra test is worthwhile --- it's certainly a win if
the lock is always already held, and certainly a loss if the lock is
always free, and otherwise you have to do some benchmarking to decide
if you want it or not.  We have the ASM-level test on those platforms
where people seem to think that it is worthwhile, but not everywhere.
As near as I can tell, your proposal is to never have the extra test on
the initial TAS (because you'll remove it from the ASM) and to have it
always on repeated tests (inside the spin loop).  This is not what is
done elsewhere AFAICS, and I'm not prepared to take on faith that it's
a win.
        regards, tom lane


Re: spinlocks: generalizing "non-locking test"

From
Mark Wong
Date:
On Sun, Oct 17, 2004 at 11:16:50PM +1000, Neil Conway wrote:
> Currently, the assembly for TAS() on x86 does a non-locking test before 
> using an atomic operation to attempt to acquire the spinlock:
> 
>     __asm__ __volatile__(
>         "    cmpb    $0,%1    \n"
>         "    jne    1f    \n"
>         "    lock        \n"
>         "    xchgb    %0,%1    \n"
>         "1: \n"
> :        "+q"(_res), "+m"(*lock)
> :
> :        "memory", "cc");
> 
> The reason this is a good idea is that if we fail to immediately acquire 
> the spinlock, s_lock() will spin SPINS_PER_DELAY times in userspace 
> calling TAS() each time before going to sleep. If we do an atomic 
> operation for each spin, this generates a lot more bus traffic than is 
> necessary. Doing a non-locking test (followed by an atomic operation to 
> acquire the spinlock if appropriate) is therefore better on SMP systems.
> 
> Currently x86 is the only platform on which we do this -- ISTM that all 
> the other platforms that implement spinlocks via atomic operations could 
> benefit from this technique.
> 
> We could fix this by tweaking each platform's assembler to add a 
> non-blocking test, but there might be an easier way. Rather than 
> modifying platform-specific assembler, I believe this C sequence is 
> equivalent to the non-locking test:
> 
>      volatile slock_t *lock = ...;
> 
>      if (*lock == 0)
>          TAS(lock);
> 
> Because the lock variable is volatile, the compiler should reload it 
> from memory for each loop iteration. (If this is actually not a 
> sufficient non-locking test, please let me know...)
> 
> We could add a new s_lock.h macro, TAS_INNER_LOOP(), whose default 
> implementation would be:
> 
>      #define TAS_INNER_LOOP(lock) \
>          if ((*lock) == 0) \
>              TAS(lock);
> 
> And then remove the x86-specific non-locking test from TAS.
> 
> Comments?
> 
> -Neil
> 

Steve Hemminger had this to say:


The linux kernel code is:
#define UNLOCKED 1

...
   while (atomic_dec(lock) != 0) {
      do {
         rep_nop();
      } while (*lock != UNLOCKED);
   }

To do the equivalent thing in postgres would mean
   while (TAS(lock)) {
      do {
         rep_nop();
      } while (*lock);
   }

The point is to do the locking test first (assume success is possible);
if that doesn't work, spin without doing locked operations, and make
sure to do the rep; nop; for the hyperthreaded CPUs.
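
A hedged sketch of that pattern as standalone C, assuming C11 atomics and GCC/Clang inline assembly. demo_cpu_relax(), demo_lock_t, demo_spin_lock() and demo_spin_unlock() are hypothetical names; demo_cpu_relax() stands in for the kernel's rep_nop() and does nothing on non-x86 targets.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for rep_nop(): "rep; nop" is the x86 PAUSE hint. */
static inline void demo_cpu_relax(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("rep; nop" ::: "memory");
#endif
}

typedef atomic_int demo_lock_t;     /* 0 = free, 1 = held (assumed) */

static void demo_spin_lock(demo_lock_t *lock)
{
    /* Locked attempt first: assume success is possible. */
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire) != 0) {
        /* Failed: wait with plain loads until the lock looks free. */
        do {
            demo_cpu_relax();
        } while (atomic_load_explicit(lock, memory_order_relaxed) != 0);
    }
}

static void demo_spin_unlock(demo_lock_t *lock)
{
    assert(atomic_load_explicit(lock, memory_order_relaxed) != 0);
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

Unlike the loop in s_lock.c, this sketch never sleeps, so it only illustrates the ordering of the locked and non-locked tests.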


Re: spinlocks: generalizing "non-locking test"

From
Neil Conway
Date:
On Mon, 2004-10-18 at 11:53, Tom Lane wrote:
> Only once we've begun to spin.  The first time through, it's not at all
> clear whether the extra test is worthwhile --- it's certainly a win if
> the lock is always already held, and certainly a loss if the lock is
> always free

Granted, but I think you've mostly conceded my point: every _subsequent_
time TAS() is invoked, the non-locking test is a clear win (with the
possible exception of PPC). Therefore we have two cases: "initial TAS"
and "TAS inside loop", so the logical implementation is two distinct
macros. Of course, there may well be platforms on which TAS() is defined
to TAS_INNER_LOOP() or vice versa, but this decision will vary by
platform.
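
To make the two-macro split concrete, here is a rough sketch with C11 atomics. Everything prefixed DEMO_ or demo_ is a hypothetical stand-in (for TAS(), S_LOCK_FREE(), TAS_INNER_LOOP(), SPINS_PER_DELAY and slock_t), and the counter exists only to show how many locked operations are issued; it is not how s_lock.c is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdatomic.h>

typedef atomic_int demo_slock_t;    /* hypothetical stand-in for slock_t */

static long demo_locked_ops;        /* demo-only counter, not thread-safe */

/* Locked test-and-set; returns 0 if the lock was acquired. */
static int demo_tas_fn(demo_slock_t *lock)
{
    demo_locked_ops++;
    return atomic_exchange(lock, 1);
}

#define DEMO_TAS(lock)            demo_tas_fn(lock)
#define DEMO_S_LOCK_FREE(lock)    (atomic_load(lock) == 0)
/* Inside the loop: a locked operation only if the lock looks free. */
#define DEMO_TAS_INNER_LOOP(lock) (DEMO_S_LOCK_FREE(lock) ? DEMO_TAS(lock) : 1)
#define DEMO_SPINS_PER_DELAY      100

/* Returns 0 if acquired, 1 if the caller should sleep and retry. */
static int demo_try_lock(demo_slock_t *lock)
{
    int spins;

    assert(lock != NULL);
    if (!DEMO_TAS(lock))            /* "initial TAS": no extra test */
        return 0;
    for (spins = 0; spins < DEMO_SPINS_PER_DELAY; spins++)
        if (!DEMO_TAS_INNER_LOOP(lock))   /* "TAS inside loop" */
            return 0;
    return 1;
}
```

With the lock held throughout, one call to demo_try_lock() issues a single locked operation (the initial one) instead of 101; whether that saving matters in practice is the open benchmarking question.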

> you have to do some benchmarking to decide if you want it or not. 

I agree that benchmarking is worth doing before making changes.

> We have the ASM-level test on those platforms
> where people seem to think that it is worthwhile, but not everywhere.

That is certainly an optimistic interpretation :-) I would say an
equally likely theory is that there is only one platform on which people
have bothered to try a non-blocking test and see if it improves
performance, and accordingly we only have one platform on which a
non-locking test is used.

(If anyone out there _has_ modified the spinlock implementation for
PostgreSQL on a particular platform to use a non-locking initial test
and found it hasn't improved performance, please speak up.)

-Neil




Re: spinlocks: generalizing "non-locking test"

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> Granted, but I think you've mostly conceded my point: every _subsequent_
> time TAS() is invoked, the non-locking test is a clear win (with the
> possible exception of PPC).

I'm not real sure.  One point here is that the standard advice about
this stuff is generally thinking in terms of an *extremely* tight spin
loop, ie
    while (TAS(lock))
        ;

The loop in s_lock.c has a bit more overhead than that.  Also, because
we only use spinlocks to protect LWLocks, the expected hold time for a
spinlock is just a couple dozen instructions, which is probably less
than the expected time in most other uses of spinlocks.  So I think it's
less than clear that we should expect TAS to fail, even within the loop.

Basically I'd like to see some tests proving that there's actually any
value in it before we go complicating the assembly-code API ...
        regards, tom lane