Thread: Adjustment of spinlock sleep delays
I've been thinking about Ludwig Lim's recent report of a "stuck spinlock" failure on a heavily loaded machine.  Although I originally found this hard to believe, there is a scenario which makes it plausible.  Suppose that we have a bunch of recently-started backends as well as one or more that have been running a long time --- long enough that the scheduler has niced them down a priority level or two.  Now suppose that one of the old-timers gets interrupted while holding a spinlock (an event of small but nonzero probability), and that before it can get scheduled again, several of the newer, higher-priority backends all start trying to acquire the same spinlock.  The "acquire" code looks like "try to grab the spinlock a few times, then sleep for 10 msec, then try again; give up after 1 minute".  If there are enough backends trying this that cycling through all of them takes at least 10 msec, then the lower-priority backend will never get scheduled, and after a minute we get the dreaded "stuck spinlock".

To forestall this scenario, I'm thinking of introducing backoff into the sleep intervals --- that is, after the first failure to get the spinlock, sleep 10 msec; after the second, sleep 20 msec, then 40, etc, with a maximum sleep time of maybe a second.  The number of iterations would be reduced so that we still time out after a minute's total delay.

Comments?

			regards, tom lane
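For concreteness, the proposal might look like this in C (a rough sketch only: tas_try(), the inner spin count of 100, and the constants are illustrative assumptions, not the actual s_lock.c code):

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Hypothetical test-and-set primitive; returns true on success. */
    extern bool tas_try(volatile int *lock);

    #define MIN_DELAY_MSEC  10      /* first sleep: 10 msec */
    #define MAX_DELAY_MSEC  1000    /* cap each sleep at one second */
    #define TIMEOUT_MSEC    60000   /* still give up after a minute asleep */

    static void
    sleep_msec(int msec)
    {
        struct timespec ts = { msec / 1000, (msec % 1000) * 1000000L };

        nanosleep(&ts, NULL);
    }

    void
    spin_acquire(volatile int *lock)
    {
        int delay = MIN_DELAY_MSEC;
        int slept = 0;

        for (;;)
        {
            int i;

            /* try to grab the spinlock a few times before sleeping */
            for (i = 0; i < 100; i++)
                if (tas_try(lock))
                    return;

            if (slept >= TIMEOUT_MSEC)
            {
                fprintf(stderr, "stuck spinlock detected\n");
                abort();
            }

            sleep_msec(delay);
            slept += delay;

            /* proposed backoff: 10, 20, 40, ... msec, capped at 1 sec */
            if (delay < MAX_DELAY_MSEC)
                delay *= 2;
        }
    }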
Tom Lane wrote:
> I've been thinking about Ludwig Lim's recent report of a "stuck
> spinlock" failure on a heavily loaded machine. Although I originally
> found this hard to believe, there is a scenario which makes it
> plausible. Suppose that we have a bunch of recently-started backends
> as well as one or more that have been running a long time --- long
> enough that the scheduler has niced them down a priority level or two.
> Now suppose that one of the old-timers gets interrupted while holding
> a spinlock (an event of small but nonzero probability), and that before
> it can get scheduled again, several of the newer, higher-priority
> backends all start trying to acquire the same spinlock. The "acquire"
> code looks like "try to grab the spinlock a few times, then sleep for
> 10 msec, then try again; give up after 1 minute". If there are enough
> backends trying this that cycling through all of them takes at least
> 10 msec, then the lower-priority backend will never get scheduled, and
> after a minute we get the dreaded "stuck spinlock".
>
> To forestall this scenario, I'm thinking of introducing backoff into the
> sleep intervals --- that is, after the first failure to get the spinlock,
> sleep 10 msec; after the second, sleep 20 msec, then 40, etc, with a
> maximum sleep time of maybe a second. The number of iterations would be
> reduced so that we still time out after a minute's total delay.
>
> Comments?

Should there be any correlation between the manner by which the backoff occurs and the number of active backends?

Mike Mascari
mascarm@mascari.com
> To forestall this scenario, I'm thinking of introducing backoff into the
> sleep intervals --- that is, after the first failure to get the spinlock,
> sleep 10 msec; after the second, sleep 20 msec, then 40, etc, with a
> maximum sleep time of maybe a second. The number of iterations would be
> reduced so that we still time out after a minute's total delay.

After the first few sleeps, should it add a random() element to the delay time?
Mike Mascari <mascarm@mascari.com> writes:
> Should there be any correlation between the manner by which the
> backoff occurs and the number of active backends?

If we could guess how many are contending for the same spinlock, maybe we could use that info ... but I don't see a reasonably cheap way to do so.

			regards, tom lane
Rod Taylor <rbt@rbt.ca> writes:
> After the first few sleeps, should it add a random() element to the
> delay time?

Hmm, that's a thought --- but how big a random element?

Fooling with the original idea, I'm having trouble with getting both plausible backoff and a reasonable number of attempts before failing.  I tried the sequence

	10 msec, 20 msec, 40, 80, ..., 1280 (1.28 sec), repeat

but this only gives a couple of hundred tries before one minute has elapsed, which seems uncomfortably low.  Maybe there's no alternative, though, if we want any good-sized delays in there.

			regards, tom lane
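To put a number on "a couple of hundred": one full cycle of that sequence sleeps 10 + 20 + 40 + 80 + 160 + 320 + 640 + 1280 = 2550 msec across 8 attempts, so a one-minute budget allows about 60000 / 2550, or roughly 23 cycles --- around 190 sleep-and-retry attempts in total.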
On Tue, 2003-08-05 at 18:19, Tom Lane wrote:
> Rod Taylor <rbt@rbt.ca> writes:
> > After the first few sleeps, should it add a random() element to the
> > delay time?
>
> Hmm, that's a thought --- but how big a random element?
>
> Fooling with the original idea, I'm having trouble with getting both
> plausible backoff and a reasonable number of attempts before failing.
> I tried the sequence
>
> 	10 msec, 20 msec, 40, 80, ..., 1280 (1.28 sec), repeat
>
> but this only gives a couple of hundred tries before one minute has
> elapsed, which seems uncomfortably low. Maybe there's no alternative,
> though, if we want any good-sized delays in there.

How about (round to nearest 10 msec):

	time = oldtime + oldtime / 2 + oldtime * rand()
	while (time > 1 second)
		time = time - 0.80 sec

This would stagger the wakeup times, and ensure a larger number of retries -- but the times should be large enough after the first few tries (larger than 200 msec) that further backoff won't be required.
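Rod's update rule, rendered as a small C helper for concreteness (the function name and the exact rounding are illustrative assumptions, not code from the thread):

    #include <stdlib.h>

    /* Next sleep delay in msec, per the rule above: grow by 1.5x plus
     * a random fraction of the old delay, fold back under one second,
     * then round to the nearest 10 msec. */
    static int
    next_delay_msec(int oldtime)
    {
        double r = (double) rand() / RAND_MAX;  /* uniform 0..1 */
        double t = oldtime + oldtime / 2.0 + oldtime * r;

        while (t > 1000.0)      /* over 1 second? */
            t -= 800.0;         /* pull back by 0.80 sec */

        return (int) ((t + 5.0) / 10.0) * 10;
    }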
Rod Taylor <rbt@rbt.ca> writes:
> How about (round to nearest 10 msec):
>
> 	time = oldtime + oldtime / 2 + oldtime * rand()
> 	while (time > 1 second)
> 		time = time - 0.80 sec
>
> This would stagger the wakeup times, and ensure a larger number of
> retries -- but the times should be large enough after the first few
> tries (larger than 200 msec) that further backoff won't be required.

But after the first few tries the sleep time would always exceed 200 msec, so there would be a *maximum* of 60*5 = 300 tries before failing --- probably a lot less, like about 120 on average.  The random component should already help to scatter the wakeups pretty well, so I'm thinking about just

	if (oldtime > 1 sec)
		time = 10 msec
	else
		time = oldtime + oldtime * rand()

ie random growth of a maximum of 2x per try, and reset to minimum delay when you get past 1 sec.  This would guarantee at least as many tries as I'm getting currently with the deterministic algorithm (which is effectively this if rand() always returned 1).

			regards, tom lane
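And Tom's counter-proposal in the same form (again just a sketch; the helper name is an assumption):

    #include <stdlib.h>

    /* Grow the delay by a random factor between 1x and 2x per try,
     * resetting to the 10 msec minimum once it passes 1 second. */
    static int
    next_delay_msec(int oldtime)
    {
        double r = (double) rand() / RAND_MAX;  /* uniform 0..1 */

        if (oldtime > 1000)     /* past 1 sec: reset to minimum */
            return 10;
        return oldtime + (int) (oldtime * r);
    }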
> To forestall this scenario, I'm thinking of introducing backoff into the
> sleep intervals --- that is, after the first failure to get the spinlock,
> sleep 10 msec; after the second, sleep 20 msec, then 40, etc, with a
> maximum sleep time of maybe a second. The number of iterations would be
> reduced so that we still time out after a minute's total delay.

Sounds good.

Chris
From: "Tom Lane" <tgl@sss.pgh.pa.us> > To forestall this scenario, I'm thinking of introducing backoff into the > sleep intervals --- that is, after first failure to get the spinlock, > sleep 10 msec; after the second, sleep 20 msec, then 40, etc, with a > maximum sleep time of maybe a second. The number of iterations would be > reduced so that we still time out after a minute's total delay. What about use the same algorithm used in ethernet when a collision is detected? When a collision occurs: 1) Stop sending. 2) Wait for some random amount of time between 0-T seconds 3) Try again. 4) If collision occurs again,increase T using an algorithm, go to Step 2. algorithm: T = rand ( 2n -1 ) * 51.2 ms where n- number of collisions Regards Gaetano Mendola
"Mendola Gaetano" <mendola@bigfoot.com> writes: > What about use the same algorithm used in ethernet when a collision is > detected? Random backoff is what Rod suggested, but I don't care for the ethernet method in detail, because it allows for only a fairly small number of retries before giving up. It's okay for an ethernet to drop packets under load ... it's not okay for us to fail to acquire a spinlock. regards, tom lane
I said:
> The random component should already help to scatter the wakeups pretty
> well, so I'm thinking about just
>
> 	if (oldtime > 1 sec)
> 		time = 10 msec
> 	else
> 		time = oldtime + oldtime * rand()
>
> ie random growth of a maximum of 2x per try, and reset to minimum delay
> when you get past 1 sec. This would guarantee at least as many tries
> as I'm getting currently with the deterministic algorithm (which is
> effectively this if rand() always returned 1).

Eventually it occurred to me that when using random delays, we should set the timeout to occur after a fixed number of tries, not after a fixed total time spent.  This is because the probability of unwanted failure (ie, the spinlock isn't really stuck, you just managed to always look when someone else had it) depends directly on the number of tries.

I've committed code that does the above with a limit of 1000 iterations; timeout seems to take about 3.5 minutes on average.  (In the prior code we would make 6000 attempts over a period of 1 minute.)

			regards, tom lane
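Relative to the first sketch in this thread, the committed behavior changes two things: the loop terminates after a fixed number of sleeps rather than a fixed total time, and the growth step is randomized with a reset past 1 second.  A self-contained C sketch (tas_try(), the inner spin count, and the error handling are again illustrative stand-ins for the real s_lock.c internals):

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Hypothetical test-and-set primitive; returns true on success. */
    extern bool tas_try(volatile int *lock);

    #define MIN_DELAY_MSEC  10
    #define MAX_DELAY_MSEC  1000
    #define MAX_SLEEPS      1000    /* fail after a fixed number of tries */

    static void
    sleep_msec(int msec)
    {
        struct timespec ts = { msec / 1000, (msec % 1000) * 1000000L };

        nanosleep(&ts, NULL);
    }

    void
    spin_acquire(volatile int *lock)
    {
        int delay = MIN_DELAY_MSEC;
        int sleeps;

        for (sleeps = 0; sleeps < MAX_SLEEPS; sleeps++)
        {
            int i;

            /* spin a little before each sleep */
            for (i = 0; i < 100; i++)
                if (tas_try(lock))
                    return;

            sleep_msec(delay);

            /* random growth of at most 2x; reset once past 1 sec */
            if (delay > MAX_DELAY_MSEC)
                delay = MIN_DELAY_MSEC;
            else
                delay += (int) (delay * ((double) rand() / RAND_MAX));
        }

        fprintf(stderr, "stuck spinlock detected\n");
        abort();
    }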