Thread: Assuming that TAS() will succeed the first time is verboten

Assuming that TAS() will succeed the first time is verboten

From
Tom Lane
Date:
I have been digging into the observed failure

FATAL: Checkpoint lock is busy while data base is shutting down

on some Alpha machines.  It apparently doesn't happen on all Alphas,
but it's quite reproducible on some of them.

The bottom line turns out to be that on the Alpha hardware, it is
possible for TAS() to fail even when the lock is initially zero,
because that hardware's locking protocol will fail to acquire the
lock if the ldq_l/stq_c sequence is interrupted.  TAS() *must* be
called in a retry loop on Alphas.  Thus, the coding presently in
xlog.c,
   while (TAS(&(XLogCtl->chkp_lck)))   {       struct timeval delay = {2, 0};
       if (shutdown)           elog(STOP, "Checkpoint lock is busy while data base is shutting down");       (void)
select(0,NULL, NULL, NULL, &delay);   }
 

is no good because it does not allow for multiple retries.

Offhand I see no good reason why the above-quoted code isn't just
   S_LOCK(&(XLogCtl->chkp_lck));

and propose to fix this problem by reducing it to that.  If the lock
is held when it shouldn't be, we'll fail with a stuck-spinlock error.

It also bothers me that xlog.c contains several places where there is a
potentially infinite wait for a lock.  It seems to me that these should
time out with stuck-spinlock messages.  Do you object to such a change?
        regards, tom lane


Re: Assuming that TAS() will succeed the first time is verboten

From
ncm@zembu.com (Nathan Myers)
Date:
On Thu, Dec 28, 2000 at 03:54:50PM -0500, Tom Lane wrote:
> I have been digging into the observed failure
> 
> FATAL: Checkpoint lock is busy while data base is shutting down
> 
> on some Alpha machines.  It apparently doesn't happen on all Alphas,
> but it's quite reproducible on some of them.
> 
> The bottom line turns out to be that on the Alpha hardware, it is
> possible for TAS() to fail even when the lock is initially zero,
> because that hardware's locking protocol will fail to acquire the
> lock if the ldq_l/stq_c sequence is interrupted.  TAS() *must* be
> called in a retry loop on Alphas.  

As I understand it, this is normal semantics for the load-locked/
store-conditional primitive.  The idea is that the normal case is fast, 
but anything that might interfere with absolute correctness causes it 
to fail and need to be retried (generally just once). 

> It also bothers me that xlog.c contains several places where there is a
> potentially infinite wait for a lock.  It seems to me that these should
> time out with stuck-spinlock messages.  Do you object to such a change?

A spinlock held for more than a few cycles indicates a bug.

I wonder about the advisability of using spinlocks in user-level code 
which might be swapped out any time.  Normally, spinlocks are taken in 
kernel code with great care about interrupts and context switches while
the lock is held.  I don't know how one could take the necessary 
precautions at user level.  

Nathan Myers
ncm@zembu.com


RE: Assuming that TAS() will succeed the first time is verboten

From
"Mikheev, Vadim"
Date:
> The bottom line turns out to be that on the Alpha hardware, it is
> possible for TAS() to fail even when the lock is initially zero,
> because that hardware's locking protocol will fail to acquire the
> lock if the ldq_l/stq_c sequence is interrupted.  TAS() *must* be
> called in a retry loop on Alphas.  Thus, the coding presently in
> xlog.c,
> 
>     while (TAS(&(XLogCtl->chkp_lck)))
>     {
>         struct timeval delay = {2, 0};
> 
>         if (shutdown)
>             elog(STOP, "Checkpoint lock is busy while data 
> base is shutting down");
>         (void) select(0, NULL, NULL, NULL, &delay);
>     }
> 
> is no good because it does not allow for multiple retries.
> 
> Offhand I see no good reason why the above-quoted code isn't just
> 
>     S_LOCK(&(XLogCtl->chkp_lck));
> 
> and propose to fix this problem by reducing it to that.  If the lock
> is held when it shouldn't be, we'll fail with a stuck-spinlock error.

It was not test against stuck slock: postmaster can shutdown db only
when there is no backends running and this was just additional test
for that. Assuming that postmaster does right things I don't object
to get rid of this test, but I would just remove shutdown test itself
- checkpoints run long time and 2 sec timeout is better than ones
in s_lock_sleep();

> It also bothers me that xlog.c contains several places where 
> there is a potentially infinite wait for a lock.  It seems to me
> that these should time out with stuck-spinlock messages. Do you object
> to such a change?

I always considered time-based decision is slock stuck or not as hack -
in old times elog(ERROR/FATAL) didn't unlock slocks sometimes, isn't
it fixed now?

Anyway I don't object if it bothers you -:)
But please do not use S_LOCK - as you see WAL code try to do other
things if slock of "primary interest" is busy.

Vadim


Re: Assuming that TAS() will succeed the first time is verboten

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> Anyway I don't object if it bothers you -:)
> But please do not use S_LOCK - as you see WAL code try to do other
> things if slock of "primary interest" is busy.

In some places, yes.  But I also saw a number of places where S_LOCK is
sufficient, and I think it's clearer to code that way whenever there's
not useful work to do before acquiring the lock.  For example, is
   if (updrqst)   {       unsigned    i = 0;
       for (;;)       {           if (!TAS(&(XLogCtl->info_lck)))           {               if
(XLByteLT(XLogCtl->LgwrRqst.Write,LgwrRqst.Write))                   XLogCtl->LgwrRqst.Write = LgwrRqst.Write;
    S_UNLOCK(&(XLogCtl->info_lck));               break;           }           s_lock_sleep(i++);       }   }
 

really better than
   if (updrqst)   {       S_LOCK(&(XLogCtl->info_lck));       if (XLByteLT(XLogCtl->LgwrRqst.Write, LgwrRqst.Write))
      XLogCtl->LgwrRqst.Write = LgwrRqst.Write;       S_UNLOCK(&(XLogCtl->info_lck));   }
 

?  I don't think so...

What I'm thinking of doing for the places where there is useful work
to do while waiting is
spins = 0;while (TAS(lock)){    /* do useful work here */    S_LOCK_SLEEP(spins++);}

where S_LOCK_SLEEP() expands to do the same things as the body of the
existing loop in s_lock().
        regards, tom lane


RE: Assuming that TAS() will succeed the first time is verboten

From
"Mikheev, Vadim"
Date:
Ok.

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Thursday, December 28, 2000 2:07 PM
> To: Mikheev, Vadim
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: Assuming that TAS() will succeed the first time 
> is verboten
> 
> 
> 
> "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> > Anyway I don't object if it bothers you -:)
> > But please do not use S_LOCK - as you see WAL code try to do other
> > things if slock of "primary interest" is busy.
> 
> In some places, yes.  But I also saw a number of places where 
> S_LOCK is
> sufficient, and I think it's clearer to code that way whenever there's
> not useful work to do before acquiring the lock.  For example, is
> 
>     if (updrqst)
>     {
>         unsigned    i = 0;
> 
>         for (;;)
>         {
>             if (!TAS(&(XLogCtl->info_lck)))
>             {
>                 if (XLByteLT(XLogCtl->LgwrRqst.Write, LgwrRqst.Write))
>                     XLogCtl->LgwrRqst.Write = LgwrRqst.Write;
>                 S_UNLOCK(&(XLogCtl->info_lck));
>                 break;
>             }
>             s_lock_sleep(i++);
>         }
>     }
> 
> really better than
> 
>     if (updrqst)
>     {
>         S_LOCK(&(XLogCtl->info_lck));
>         if (XLByteLT(XLogCtl->LgwrRqst.Write, LgwrRqst.Write))
>             XLogCtl->LgwrRqst.Write = LgwrRqst.Write;
>         S_UNLOCK(&(XLogCtl->info_lck));
>     }
> 
> ?  I don't think so...
> 
> What I'm thinking of doing for the places where there is useful work
> to do while waiting is
> 
>     spins = 0;
>     while (TAS(lock))
>     {
>         /* do useful work here */
>         S_LOCK_SLEEP(spins++);
>     }
> 
> where S_LOCK_SLEEP() expands to do the same things as the body of the
> existing loop in s_lock().
> 
>             regards, tom lane
> 


Re: Assuming that TAS() will succeed the first time is verboten

From
Tom Lane
Date:
ncm@zembu.com (Nathan Myers) writes:
> I wonder about the advisability of using spinlocks in user-level code 
> which might be swapped out any time.

The reason we use spinlocks is that we expect the lock to succeed (not
block) the majority of the time, and we want the code to fall through
as quickly as possible in that case.  In particular we do *not* want to
expend a kernel call when we are able to acquire the lock immediately.
It's not a true "spin" lock because we don't sit in a tight loop when
we do have to wait for the lock --- we use select() to delay for a small
interval before trying again.  See src/backend/storage/buffer/s_lock.c.

The design is reasonable, even if a little bit offbeat.
        regards, tom lane


Re: Assuming that TAS() will succeed the first time is verboten

From
ncm@zembu.com (Nathan Myers)
Date:
On Thu, Dec 28, 2000 at 05:12:22PM -0500, Tom Lane wrote:
> ncm@zembu.com (Nathan Myers) writes:
> > I wonder about the advisability of using spinlocks in user-level code 
> > which might be swapped out any time.
> 
> The reason we use spinlocks is that we expect the lock to succeed (not
> block) the majority of the time, and we want the code to fall through
> as quickly as possible in that case.  In particular we do *not* want to
> expend a kernel call when we are able to acquire the lock immediately.

Most implementations of mutex and semaphore do no system call if they 
get the lock; if they fail to get the lock, they block in the kernel 
without any need for complicated "back-off" loops.  

> It's not a true "spin" lock because we don't sit in a tight loop when
> we do have to wait for the lock --- we use select() to delay for a small
> interval before trying again.  See src/backend/storage/buffer/s_lock.c.
> 
> The design is reasonable, even if a little bit offbeat.

I suspect "a little bit offbeat" qualifies as an extreme understatement.  
In particular, it's not really a spinlock, in the standard sense.

The code is based on some odd assumptions.  A select() with 0 delay 
returns immediately unless there is an interrupt during its (very short!) 
time in kernel space.  On a single processor this is extremely unlikely 
to result in a change to the lock.  I suspect
 #define S_NSPINCYCLE    2 #define S_MAX_BUSY      1000 * S_NSPINCYCLE
 int                     s_spincycle[S_NSPINCYCLE] = {1,10000};

would give better results, assuming we want to keep the existing
mechanism.

Nathan Myers
ncm@zembu.com


Re: Assuming that TAS() will succeed the first time is verboten

From
"Dominic J. Eidson"
Date:
On Thu, 28 Dec 2000, Nathan Myers wrote:

> The code is based on some odd assumptions.  A select() with 0 delay 
> returns immediately unless there is an interrupt during its (very short!) 

If you look closely, it's a select with a 2 second timeout.

>>> { 2, 0 }


-- 
Dominic J. Eidson                                       "Baruk Khazad! Khazad ai-menu!" - Gimli
-------------------------------------------------------------------------------
http://www.the-infinite.org/              http://www.the-infinite.org/~dominic/



Re: Assuming that TAS() will succeed the first time is verboten

From
ncm@zembu.com (Nathan Myers)
Date:
On Thu, Dec 28, 2000 at 05:26:23PM -0600, Dominic J. Eidson wrote:
> On Thu, 28 Dec 2000, Nathan Myers wrote:
> 
> > The code is based on some odd assumptions.  A select() with 0 delay 
> > returns immediately unless there is an interrupt during its (very short!) 
> 
> If you look closely, it's a select with a 2 second timeout.
> 
> >>> { 2, 0 }

I don't see that in src/backend/storage/buffer/s_lock.c:
 delay.tv_sec = 0; delay.tv_usec = s_spincycle[spin % S_NSPINCYCLE]; (void) select(0, NULL, NULL, NULL, &delay);

Nathan Myers
ncm@zembu.com


Re: Assuming that TAS() will succeed the first time is verboten

From
Tom Lane
Date:
ncm@zembu.com (Nathan Myers) writes:
> On Thu, Dec 28, 2000 at 05:12:22PM -0500, Tom Lane wrote:
>> The reason we use spinlocks is that we expect the lock to succeed (not
>> block) the majority of the time, and we want the code to fall through
>> as quickly as possible in that case.  In particular we do *not* want to
>> expend a kernel call when we are able to acquire the lock immediately.

> Most implementations of mutex and semaphore do no system call if they 
> get the lock; if they fail to get the lock, they block in the kernel 
> without any need for complicated "back-off" loops.  

There's been some talk of reimplementing S_LOCK() and friends to use
Posix user-space semaphores, on platforms that provide such.  But it'll
be a *long* time before we can expect such facilities to be available
everywhere.

> The code is based on some odd assumptions.  A select() with 0 delay 
> returns immediately unless there is an interrupt during its (very short!) 
> time in kernel space.

Yeah, I've wondered whether the 0 entries in s_spincycle[] shouldn't be
1.  The code author evidently expected select() to at least yield the
processor even with delay 0, but the select() man pages I have handy
say that it will "return immediately" when delay is 0.
        regards, tom lane


Re: Assuming that TAS() will succeed the first time is verboten

From
Alfred Perlstein
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [001228 14:25] wrote:
> ncm@zembu.com (Nathan Myers) writes:
> > I wonder about the advisability of using spinlocks in user-level code 
> > which might be swapped out any time.
> 
> The reason we use spinlocks is that we expect the lock to succeed (not
> block) the majority of the time, and we want the code to fall through
> as quickly as possible in that case.  In particular we do *not* want to
> expend a kernel call when we are able to acquire the lock immediately.
> It's not a true "spin" lock because we don't sit in a tight loop when
> we do have to wait for the lock --- we use select() to delay for a small
> interval before trying again.  See src/backend/storage/buffer/s_lock.c.
> 
> The design is reasonable, even if a little bit offbeat.

It sounds pretty bad, if you have a contested lock you'll trap into
the kernel each time you miss, crossing the protection boundry and
then waiting.  It's a tough call to make, because on UP systems
you loose bigtime by spinning for your quantum, however on SMP
systems there's a chance that the lock is owned by a process on
another CPU and spinning might be benificial.

One trick that may help is calling sched_yield(2) on a lock miss,
it's a POSIX call and quite new so you'd need a 'configure' test
for it.

http://www.freebsd.org/cgi/man.cgi?query=sched_yield&apropos=0&sektion=0&manpath=FreeBSD+4.2-RELEASE&format=html

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."


Re: Assuming that TAS() will succeed the first time is verboten

From
Tom Lane
Date:
Alfred Perlstein <bright@wintelcom.net> writes:
> One trick that may help is calling sched_yield(2) on a lock miss,
> it's a POSIX call and quite new so you'd need a 'configure' test
> for it.

The author of the current s_lock code seems to have thought that
select() with a zero delay would do the equivalent of sched_yield().
I'm not sure if that's true on very many kernels, if indeed any...

I doubt we could buy much by depending on sched_yield(); if you want
to assume POSIX facilities, ISTM you might as well go for user-space
semaphores and forget the whole TAS mechanism.
        regards, tom lane


Re: Assuming that TAS() will succeed the first time is verboten

From
Bruce Momjian
Date:
> Alfred Perlstein <bright@wintelcom.net> writes:
> > One trick that may help is calling sched_yield(2) on a lock miss,
> > it's a POSIX call and quite new so you'd need a 'configure' test
> > for it.
> 
> The author of the current s_lock code seems to have thought that
> select() with a zero delay would do the equivalent of sched_yield().
> I'm not sure if that's true on very many kernels, if indeed any...
> 
> I doubt we could buy much by depending on sched_yield(); if you want
> to assume POSIX facilities, ISTM you might as well go for user-space
> semaphores and forget the whole TAS mechanism.


Another issue is that sched_yield brings in the pthreads library/hooks
on some OS's, which we certainly want to avoid.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Assuming that TAS() will succeed the first time is verboten

From
Alfred Perlstein
Date:
* Bruce Momjian <pgman@candle.pha.pa.us> [010101 23:59] wrote:
> > Alfred Perlstein <bright@wintelcom.net> writes:
> > > One trick that may help is calling sched_yield(2) on a lock miss,
> > > it's a POSIX call and quite new so you'd need a 'configure' test
> > > for it.
> > 
> > The author of the current s_lock code seems to have thought that
> > select() with a zero delay would do the equivalent of sched_yield().
> > I'm not sure if that's true on very many kernels, if indeed any...
> > 
> > I doubt we could buy much by depending on sched_yield(); if you want
> > to assume POSIX facilities, ISTM you might as well go for user-space
> > semaphores and forget the whole TAS mechanism.
> 
> 
> Another issue is that sched_yield brings in the pthreads library/hooks
> on some OS's, which we certainly want to avoid.

I know it's a major undertaking, but since the work is sort of done,
have you guys considered the port to solaris threads and seeing about
making a pthreads port of that?

I know it would probably get you considerable gains under Windows
at the expense of dropping some really really legacy system.

Or you could do what apache (is rumored) does and have it do either
threads or processes or both...

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."