Thread: Adjustment of spinlock sleep delays

Adjustment of spinlock sleep delays

From
Tom Lane
Date:
I've been thinking about Ludwig Lim's recent report of a "stuck
spinlock" failure on a heavily loaded machine.  Although I originally
found this hard to believe, there is a scenario which makes it
plausible.  Suppose that we have a bunch of recently-started backends
as well as one or more that have been running a long time --- long
enough that the scheduler has niced them down a priority level or two.
Now suppose that one of the old-timers gets interrupted while holding
a spinlock (an event of small but nonzero probability), and that before
it can get scheduled again, several of the newer, higher-priority
backends all start trying to acquire the same spinlock.  The "acquire"
code looks like "try to grab the spinlock a few times, then sleep for
10 msec, then try again; give up after 1 minute".  If there are enough
backends trying this that cycling through all of them takes at least
10 msec, then the lower-priority backend will never get scheduled, and
after a minute we get the dreaded "stuck spinlock".

To forestall this scenario, I'm thinking of introducing backoff into the
sleep intervals --- that is, after first failure to get the spinlock,
sleep 10 msec; after the second, sleep 20 msec, then 40, etc, with a
maximum sleep time of maybe a second.  The number of iterations would be
reduced so that we still time out after a minute's total delay.

Comments?
        regards, tom lane


Re: Adjustment of spinlock sleep delays

From
Mike Mascari
Date:
Tom Lane wrote:

> I've been thinking about Ludwig Lim's recent report of a "stuck
> spinlock" failure on a heavily loaded machine.  Although I originally
> found this hard to believe, there is a scenario which makes it
> plausible.  Suppose that we have a bunch of recently-started backends
> as well as one or more that have been running a long time --- long
> enough that the scheduler has niced them down a priority level or two.
> Now suppose that one of the old-timers gets interrupted while holding
> a spinlock (an event of small but nonzero probability), and that before
> it can get scheduled again, several of the newer, higher-priority
> backends all start trying to acquire the same spinlock.  The "acquire"
> code looks like "try to grab the spinlock a few times, then sleep for
> 10 msec, then try again; give up after 1 minute".  If there are enough
> backends trying this that cycling through all of them takes at least
> 10 msec, then the lower-priority backend will never get scheduled, and
> after a minute we get the dreaded "stuck spinlock".
> 
> To forestall this scenario, I'm thinking of introducing backoff into the
> sleep intervals --- that is, after first failure to get the spinlock,
> sleep 10 msec; after the second, sleep 20 msec, then 40, etc, with a
> maximum sleep time of maybe a second.  The number of iterations would be
> reduced so that we still time out after a minute's total delay.
> 
> Comments?

Should there be any correlation between the manner by which the
backoff occurs and the number of active backends?

Mike Mascari
mascarm@mascari.com






Re: Adjustment of spinlock sleep delays

From
Rod Taylor
Date:
> To forestall this scenario, I'm thinking of introducing backoff into the
> sleep intervals --- that is, after first failure to get the spinlock,
> sleep 10 msec; after the second, sleep 20 msec, then 40, etc, with a
> maximum sleep time of maybe a second.  The number of iterations would be
> reduced so that we still time out after a minute's total delay.

After the first few sleeps should it add a random() element to the delay
time?


Re: Adjustment of spinlock sleep delays

From
Tom Lane
Date:
Mike Mascari <mascarm@mascari.com> writes:
> Should there be any correlation between the manner by which the
> backoff occurs and the number of active backends?

If we could guess how many are contending for the same spinlock, maybe
we could use that info ... but I don't see a reasonably cheap way to do
so.
        regards, tom lane


Re: Adjustment of spinlock sleep delays

From
Tom Lane
Date:
Rod Taylor <rbt@rbt.ca> writes:
> After the first few sleeps should it add a random() element to the delay
> time?

Hmm, that's a thought --- but how big a random element?

Fooling with the original idea, I'm having trouble with getting both
plausible backoff and a reasonable number of attempts before failing.
I tried the sequence
10 msec, 20 msec, 40, 80, ..., 1280 (1.28 sec), repeat

but this only gives a couple of hundred tries before one minute has
elapsed, which seems uncomfortably low.  Maybe there's no alternative,
though, if we want any good-sized delays in there.
        regards, tom lane


Re: Adjustment of spinlock sleep delays

From
Rod Taylor
Date:
On Tue, 2003-08-05 at 18:19, Tom Lane wrote:
> Rod Taylor <rbt@rbt.ca> writes:
> > After the first few sleeps should it add a random() element to the delay
> > time?
>
> Hmm, that's a thought --- but how big a random element?
>
> Fooling with the original idea, I'm having trouble with getting both
> plausible backoff and a reasonable number of attempts before failing.
> I tried the sequence
>
>     10 msec, 20 msec, 40, 80, ..., 1280 (1.28 sec), repeat
>
> but this only gives a couple of hundred tries before one minute has
> elapsed, which seems uncomfortably low.  Maybe there's no alternative,
> though, if we want any good-sized delays in there.

How about (round to nearest 10msec):

time = oldtime + oldtime / 2 + oldtime * rand()

while (time > 1 second)time = time - 0.80sec

This would stagger the wakeup times, and ensure a larger number of
retries -- but the times should be large enough after the first few
tries (larger than 200msec) that further backoff won't be required.

Re: Adjustment of spinlock sleep delays

From
Tom Lane
Date:
Rod Taylor <rbt@rbt.ca> writes:
> How about (round to nearest 10msec):

> time =3D oldtime + oldtime / 2 + oldtime * rand()

> while (time > 1 second)
>     time =3D time - 0.80sec

> This would stagger the wakeup times, and ensure a larger number of
> retries -- but the times should be large enough after the first few
> tries (larger than 200msec) that further backoff won't be required.

But after the first few tries the sleep time would always exceed
200msec, so there would be a *maximum* of 60*5 = 300 tries before
failing --- probably a lot less, like about 120 on average.

The random component should already help to scatter the wakeups pretty
well, so I'm thinking about just
if (oldtime > 1 sec)    time = 10msecelse    time = oldtime + oldtime * rand()

ie random growth of a maximum of 2x per try, and reset to minimum delay
when you get past 1 sec.  This would guarantee at least as many tries
as I'm getting currently with the deterministic algorithm (which is
effectively this if rand() always returned 1).
        regards, tom lane


Re: Adjustment of spinlock sleep delays

From
"Christopher Kings-Lynne"
Date:
> To forestall this scenario, I'm thinking of introducing backoff into the
> sleep intervals --- that is, after first failure to get the spinlock,
> sleep 10 msec; after the second, sleep 20 msec, then 40, etc, with a
> maximum sleep time of maybe a second.  The number of iterations would be
> reduced so that we still time out after a minute's total delay.

Sounds good.

Chris



Re: Adjustment of spinlock sleep delays

From
"Mendola Gaetano"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us>

> To forestall this scenario, I'm thinking of introducing backoff into the
> sleep intervals --- that is, after first failure to get the spinlock,
> sleep 10 msec; after the second, sleep 20 msec, then 40, etc, with a
> maximum sleep time of maybe a second.  The number of iterations would be
> reduced so that we still time out after a minute's total delay.


What about use the same algorithm used in ethernet when a collision is
detected?

When a collision occurs: 
 1) Stop sending. 2) Wait for some random amount of time between 0-T seconds 3) Try again. 4) If collision occurs
again,increase T using an algorithm, go to         Step 2. 
 
algorithm:

T = rand ( 2n -1 ) * 51.2 ms

where n- number of collisions 



Regards
Gaetano Mendola




Re: Adjustment of spinlock sleep delays

From
Tom Lane
Date:
"Mendola Gaetano" <mendola@bigfoot.com> writes:
> What about use the same algorithm used in ethernet when a collision is
> detected?

Random backoff is what Rod suggested, but I don't care for the ethernet
method in detail, because it allows for only a fairly small number of
retries before giving up.  It's okay for an ethernet to drop packets
under load ... it's not okay for us to fail to acquire a spinlock.
        regards, tom lane


Re: Adjustment of spinlock sleep delays

From
Tom Lane
Date:
I said:
> The random component should already help to scatter the wakeups pretty
> well, so I'm thinking about just
>     if (oldtime > 1 sec)
>         time = 10msec
>     else
>         time = oldtime + oldtime * rand()
> ie random growth of a maximum of 2x per try, and reset to minimum delay
> when you get past 1 sec.  This would guarantee at least as many tries
> as I'm getting currently with the deterministic algorithm (which is
> effectively this if rand() always returned 1).

Eventually it occurred to me that when using random delays, we should
set the timeout to occur after a fixed number of tries, not after a
fixed total time spent.  This is because the probability of unwanted
failure (ie, the spinlock isn't really stuck, you just managed to always
look when someone else had it) depends directly on the number of tries.

I've committed code that does the above with a limit of 1000 iterations;
timeout seems to take about 3.5 minutes on average.  (In the prior code
we would make 6000 attempts over a period of 1 minute.)
        regards, tom lane