Thread: Assuming that TAS() will succeed the first time is verboten
I have been digging into the observed failure

	FATAL: Checkpoint lock is busy while data base is shutting down

on some Alpha machines.  It apparently doesn't happen on all Alphas,
but it's quite reproducible on some of them.

The bottom line turns out to be that on the Alpha hardware, it is
possible for TAS() to fail even when the lock is initially zero,
because that hardware's locking protocol will fail to acquire the
lock if the ldq_l/stq_c sequence is interrupted.  TAS() *must* be
called in a retry loop on Alphas.  Thus, the coding presently in
xlog.c,

	while (TAS(&(XLogCtl->chkp_lck)))
	{
		struct timeval delay = {2, 0};

		if (shutdown)
			elog(STOP, "Checkpoint lock is busy while data base is shutting down");
		(void) select(0, NULL, NULL, NULL, &delay);
	}

is no good because it does not allow for multiple retries.

Offhand I see no good reason why the above-quoted code isn't just

	S_LOCK(&(XLogCtl->chkp_lck));

and propose to fix this problem by reducing it to that.  If the lock
is held when it shouldn't be, we'll fail with a stuck-spinlock error.

It also bothers me that xlog.c contains several places where there is
a potentially infinite wait for a lock.  It seems to me that these
should time out with stuck-spinlock messages.  Do you object to such
a change?

			regards, tom lane
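[Editorial sketch] The rule Tom states is easy to illustrate outside Alpha assembly. A minimal, self-contained sketch of the retry-loop discipline, using C11 atomic_flag as a portable stand-in for the ldq_l/stq_c-based TAS(); the names and the retry bound are illustrative, not PostgreSQL's actual code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Portable stand-in for TAS(): returns 0 on acquire, nonzero if busy.
 * On LL/SC hardware (Alpha ldq_l/stq_c) the underlying sequence can
 * fail spuriously even when the lock word is zero, so one failed call
 * proves nothing about the lock's state. */
static int tas(atomic_flag *lock)
{
    return atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

/* S_LOCK-style acquire: keep retrying, and report a stuck spinlock
 * only after a large bound -- never after a single failed TAS(). */
static bool s_lock_sketch(atomic_flag *lock)
{
    for (long spins = 0; spins < 1000000L; spins++)
    {
        if (!tas(lock))
            return true;        /* acquired */
        /* the real code would sleep briefly here, as s_lock() does */
    }
    return false;               /* stuck spinlock */
}

static void s_unlock_sketch(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```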
Re: Assuming that TAS() will succeed the first time is verboten
From
ncm@zembu.com (Nathan Myers)
Date:
On Thu, Dec 28, 2000 at 03:54:50PM -0500, Tom Lane wrote:
> I have been digging into the observed failure
>
>	FATAL: Checkpoint lock is busy while data base is shutting down
>
> on some Alpha machines.  It apparently doesn't happen on all Alphas,
> but it's quite reproducible on some of them.
>
> The bottom line turns out to be that on the Alpha hardware, it is
> possible for TAS() to fail even when the lock is initially zero,
> because that hardware's locking protocol will fail to acquire the
> lock if the ldq_l/stq_c sequence is interrupted.  TAS() *must* be
> called in a retry loop on Alphas.

As I understand it, this is normal semantics for the load-locked/
store-conditional primitive.  The idea is that the normal case is
fast, but anything that might interfere with absolute correctness
causes it to fail and need to be retried (generally just once).

> It also bothers me that xlog.c contains several places where there is a
> potentially infinite wait for a lock.  It seems to me that these should
> time out with stuck-spinlock messages.  Do you object to such a change?

A spinlock held for more than a few cycles indicates a bug.

I wonder about the advisability of using spinlocks in user-level code
which might be swapped out any time.  Normally, spinlocks are taken in
kernel code with great care about interrupts and context switches
while the lock is held.  I don't know how one could take the necessary
precautions at user level.

Nathan Myers
ncm@zembu.com
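[Editorial sketch] C11 later standardized exactly these semantics: atomic_compare_exchange_weak is permitted to fail spuriously, precisely so it can compile to a single LL/SC pair, and is therefore only ever correct inside a retry loop. An illustrative sketch (not PostgreSQL code):

```c
#include <assert.h>
#include <stdatomic.h>

/* Increment a counter with a weak compare-and-swap.  The weak form is
 * allowed to fail even when *obj == expected (e.g. an interrupted
 * LL/SC sequence), so the loop is mandatory -- the same rule Alpha
 * imposes on TAS(). */
static int fetch_increment(atomic_int *obj)
{
    int old = atomic_load(obj);

    while (!atomic_compare_exchange_weak(obj, &old, old + 1))
    {
        /* on failure, 'old' has been reloaded with the current value */
    }
    return old;
}
```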
> The bottom line turns out to be that on the Alpha hardware, it is
> possible for TAS() to fail even when the lock is initially zero,
> because that hardware's locking protocol will fail to acquire the
> lock if the ldq_l/stq_c sequence is interrupted.  TAS() *must* be
> called in a retry loop on Alphas.  Thus, the coding presently in
> xlog.c,
>
>	while (TAS(&(XLogCtl->chkp_lck)))
>	{
>		struct timeval delay = {2, 0};
>
>		if (shutdown)
>			elog(STOP, "Checkpoint lock is busy while data
>			base is shutting down");
>		(void) select(0, NULL, NULL, NULL, &delay);
>	}
>
> is no good because it does not allow for multiple retries.
>
> Offhand I see no good reason why the above-quoted code isn't just
>
>	S_LOCK(&(XLogCtl->chkp_lck));
>
> and propose to fix this problem by reducing it to that.  If the lock
> is held when it shouldn't be, we'll fail with a stuck-spinlock error.

It was not a test against a stuck slock: the postmaster can shut down
the database only when there are no backends running, and this was
just an additional test for that.  Assuming that the postmaster does
the right things, I don't object to getting rid of this test, but I
would just remove the shutdown test itself - checkpoints run a long
time, and a 2 sec timeout is better than the ones in s_lock_sleep().

> It also bothers me that xlog.c contains several places where
> there is a potentially infinite wait for a lock.  It seems to me
> that these should time out with stuck-spinlock messages.  Do you object
> to such a change?

I always considered the time-based decision about whether a slock is
stuck to be a hack - in old times elog(ERROR/FATAL) sometimes didn't
unlock slocks; isn't that fixed now?  Anyway, I don't object if it
bothers you -:)
But please do not use S_LOCK - as you see, the WAL code tries to do
other things if the slock of "primary interest" is busy.

Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> Anyway I don't object if it bothers you -:)
> But please do not use S_LOCK - as you see WAL code try to do other
> things if slock of "primary interest" is busy.

In some places, yes.  But I also saw a number of places where S_LOCK
is sufficient, and I think it's clearer to code that way whenever
there's not useful work to do before acquiring the lock.  For example,
is

	if (updrqst)
	{
		unsigned	i = 0;

		for (;;)
		{
			if (!TAS(&(XLogCtl->info_lck)))
			{
				if (XLByteLT(XLogCtl->LgwrRqst.Write, LgwrRqst.Write))
					XLogCtl->LgwrRqst.Write = LgwrRqst.Write;
				S_UNLOCK(&(XLogCtl->info_lck));
				break;
			}
			s_lock_sleep(i++);
		}
	}

really better than

	if (updrqst)
	{
		S_LOCK(&(XLogCtl->info_lck));
		if (XLByteLT(XLogCtl->LgwrRqst.Write, LgwrRqst.Write))
			XLogCtl->LgwrRqst.Write = LgwrRqst.Write;
		S_UNLOCK(&(XLogCtl->info_lck));
	}

?  I don't think so...

What I'm thinking of doing for the places where there is useful work
to do while waiting is

	spins = 0;
	while (TAS(lock))
	{
		/* do useful work here */
		S_LOCK_SLEEP(spins++);
	}

where S_LOCK_SLEEP() expands to do the same things as the body of the
existing loop in s_lock().

			regards, tom lane
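[Editorial sketch] One plausible shape for the proposed S_LOCK_SLEEP(), written here as a function, assuming it is lifted straight from the body of the existing loop in s_lock(); the delay table and its values are illustrative, not the real ones:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Illustrative delay table in microseconds; the real table and its
 * values live in s_lock.c. */
#define S_NSPINCYCLE 4
static const int s_spincycle[S_NSPINCYCLE] = {0, 0, 10000, 10000};

/* The body of s_lock()'s existing wait loop, factored out so a caller
 * can do useful work between TAS() attempts:
 *
 *     spins = 0;
 *     while (TAS(lock))
 *     {
 *         // do useful work here
 *         s_lock_sleep_sketch(spins++);
 *     }
 */
static void s_lock_sleep_sketch(unsigned spins)
{
    struct timeval delay;

    delay.tv_sec = 0;
    delay.tv_usec = s_spincycle[spins % S_NSPINCYCLE];
    (void) select(0, NULL, NULL, NULL, &delay);
}
```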
Ok.

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Thursday, December 28, 2000 2:07 PM
> To: Mikheev, Vadim
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: Assuming that TAS() will succeed the first time
> is verboten
ncm@zembu.com (Nathan Myers) writes:
> I wonder about the advisability of using spinlocks in user-level code
> which might be swapped out any time.

The reason we use spinlocks is that we expect the lock to succeed (not
block) the majority of the time, and we want the code to fall through
as quickly as possible in that case.  In particular we do *not* want
to expend a kernel call when we are able to acquire the lock
immediately.

It's not a true "spin" lock because we don't sit in a tight loop when
we do have to wait for the lock --- we use select() to delay for a
small interval before trying again.  See
src/backend/storage/buffer/s_lock.c.

The design is reasonable, even if a little bit offbeat.

			regards, tom lane
Re: Assuming that TAS() will succeed the first time is verboten
From
ncm@zembu.com (Nathan Myers)
Date:
On Thu, Dec 28, 2000 at 05:12:22PM -0500, Tom Lane wrote:
> ncm@zembu.com (Nathan Myers) writes:
> > I wonder about the advisability of using spinlocks in user-level code
> > which might be swapped out any time.
>
> The reason we use spinlocks is that we expect the lock to succeed (not
> block) the majority of the time, and we want the code to fall through
> as quickly as possible in that case.  In particular we do *not* want to
> expend a kernel call when we are able to acquire the lock immediately.

Most implementations of mutex and semaphore do no system call if they
get the lock; if they fail to get the lock, they block in the kernel
without any need for complicated "back-off" loops.

> It's not a true "spin" lock because we don't sit in a tight loop when
> we do have to wait for the lock --- we use select() to delay for a small
> interval before trying again.  See src/backend/storage/buffer/s_lock.c.
>
> The design is reasonable, even if a little bit offbeat.

I suspect "a little bit offbeat" qualifies as an extreme
understatement.  In particular, it's not really a spinlock, in the
standard sense.

The code is based on some odd assumptions.  A select() with 0 delay
returns immediately unless there is an interrupt during its (very
short!) time in kernel space.  On a single processor this is extremely
unlikely to result in a change to the lock.  I suspect

	#define S_NSPINCYCLE	2
	#define S_MAX_BUSY	1000 * S_NSPINCYCLE

	int	s_spincycle[S_NSPINCYCLE] = {1, 10000};

would give better results, assuming we want to keep the existing
mechanism.

Nathan Myers
ncm@zembu.com
On Thu, 28 Dec 2000, Nathan Myers wrote:

> The code is based on some odd assumptions.  A select() with 0 delay
> returns immediately unless there is an interrupt during its (very short!)

If you look closely, it's a select with a 2 second timeout.

	>>> { 2, 0 }

--
Dominic J. Eidson
"Baruk Khazad! Khazad ai-menu!" - Gimli
-------------------------------------------------------------------------------
http://www.the-infinite.org/ http://www.the-infinite.org/~dominic/
Re: Assuming that TAS() will succeed the first time is verboten
From
ncm@zembu.com (Nathan Myers)
Date:
On Thu, Dec 28, 2000 at 05:26:23PM -0600, Dominic J. Eidson wrote:
> On Thu, 28 Dec 2000, Nathan Myers wrote:
>
> > The code is based on some odd assumptions.  A select() with 0 delay
> > returns immediately unless there is an interrupt during its (very short!)
>
> If you look closely, it's a select with a 2 second timeout.
>
>	>>> { 2, 0 }

I don't see that in src/backend/storage/buffer/s_lock.c:

	delay.tv_sec = 0;
	delay.tv_usec = s_spincycle[spin % S_NSPINCYCLE];
	(void) select(0, NULL, NULL, NULL, &delay);

Nathan Myers
ncm@zembu.com
ncm@zembu.com (Nathan Myers) writes:
> On Thu, Dec 28, 2000 at 05:12:22PM -0500, Tom Lane wrote:
>> The reason we use spinlocks is that we expect the lock to succeed (not
>> block) the majority of the time, and we want the code to fall through
>> as quickly as possible in that case.  In particular we do *not* want to
>> expend a kernel call when we are able to acquire the lock immediately.

> Most implementations of mutex and semaphore do no system call if they
> get the lock; if they fail to get the lock, they block in the kernel
> without any need for complicated "back-off" loops.

There's been some talk of reimplementing S_LOCK() and friends to use
Posix user-space semaphores, on platforms that provide such.  But
it'll be a *long* time before we can expect such facilities to be
available everywhere.

> The code is based on some odd assumptions.  A select() with 0 delay
> returns immediately unless there is an interrupt during its (very short!)
> time in kernel space.

Yeah, I've wondered whether the 0 entries in s_spincycle[] shouldn't
be 1.  The code author evidently expected select() to at least yield
the processor even with delay 0, but the select() man pages I have
handy say that it will "return immediately" when delay is 0.

			regards, tom lane
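[Editorial sketch] The man-page claim is easy to check. A small timing sketch (not PostgreSQL code; the helper name is made up) showing that select() with a zero timeout returns at once, while a nonzero timeout really does put the process to sleep:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Wall-clock microseconds actually consumed by a select() pause of
 * 'usec' microseconds. */
static long pause_usec(long usec)
{
    struct timeval delay, t0, t1;

    delay.tv_sec = 0;
    delay.tv_usec = usec;
    gettimeofday(&t0, NULL);
    (void) select(0, NULL, NULL, NULL, &delay);
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
}
```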
* Tom Lane <tgl@sss.pgh.pa.us> [001228 14:25] wrote:
> ncm@zembu.com (Nathan Myers) writes:
> > I wonder about the advisability of using spinlocks in user-level code
> > which might be swapped out any time.
>
> The reason we use spinlocks is that we expect the lock to succeed (not
> block) the majority of the time, and we want the code to fall through
> as quickly as possible in that case.  In particular we do *not* want to
> expend a kernel call when we are able to acquire the lock immediately.
> It's not a true "spin" lock because we don't sit in a tight loop when
> we do have to wait for the lock --- we use select() to delay for a small
> interval before trying again.  See src/backend/storage/buffer/s_lock.c.
>
> The design is reasonable, even if a little bit offbeat.

It sounds pretty bad: if you have a contested lock, you'll trap into
the kernel each time you miss, crossing the protection boundary and
then waiting.

It's a tough call to make, because on UP systems you lose bigtime by
spinning for your quantum, while on SMP systems there's a chance that
the lock is owned by a process on another CPU and spinning might be
beneficial.

One trick that may help is calling sched_yield(2) on a lock miss;
it's a POSIX call and quite new, so you'd need a 'configure' test for
it.

http://www.freebsd.org/cgi/man.cgi?query=sched_yield&apropos=0&sektion=0&manpath=FreeBSD+4.2-RELEASE&format=html

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
Alfred Perlstein <bright@wintelcom.net> writes:
> One trick that may help is calling sched_yield(2) on a lock miss,
> it's a POSIX call and quite new so you'd need a 'configure' test
> for it.

The author of the current s_lock code seems to have thought that
select() with a zero delay would do the equivalent of sched_yield().
I'm not sure if that's true on very many kernels, if indeed any...

I doubt we could buy much by depending on sched_yield(); if you want
to assume POSIX facilities, ISTM you might as well go for user-space
semaphores and forget the whole TAS mechanism.

			regards, tom lane
> Alfred Perlstein <bright@wintelcom.net> writes:
> > One trick that may help is calling sched_yield(2) on a lock miss,
> > it's a POSIX call and quite new so you'd need a 'configure' test
> > for it.
>
> The author of the current s_lock code seems to have thought that
> select() with a zero delay would do the equivalent of sched_yield().
> I'm not sure if that's true on very many kernels, if indeed any...
>
> I doubt we could buy much by depending on sched_yield(); if you want
> to assume POSIX facilities, ISTM you might as well go for user-space
> semaphores and forget the whole TAS mechanism.

Another issue is that sched_yield brings in the pthreads library/hooks
on some OS's, which we certainly want to avoid.

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,   |  830 Blythe Avenue
  +  Christ can be your backup.      |  Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [010101 23:59] wrote:
> Another issue is that sched_yield brings in the pthreads library/hooks
> on some OS's, which we certainly want to avoid.

I know it's a major undertaking, but since the work is sort of done,
have you guys considered the port to Solaris threads and seeing about
making a pthreads port of that?  I know it would probably get you
considerable gains under Windows at the expense of dropping some
really, really legacy systems.

Or you could do what Apache is rumored to do and have it use either
threads or processes or both...

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."