Thread: Microsecond sleeps with select()

Microsecond sleeps with select()

From: Bruce Momjian

A comment on microsecond delays using select().  Most Unix kernels run
at 100Hz, meaning that they have a programmable timer that interrupts
the CPU every 10 milliseconds.  The kernel only gets to control the CPU
during those tick interrupts or when a user application makes a kernel
call.

Therefore, it is no surprise that most Unix kernels can't do
5-microsecond sleeps.  The only way they could do it would be to
reprogram the timer interrupt if a wakeup were going to occur in less
than 10 milliseconds.  I doubt many kernels do that, because I don't
think timer interrupt programming is a very quick or accurate
operation.  Also, reprogramming it would make the actual 100Hz timer
unreliable.
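
You can see the effect from user space with a quick test like this
(just a sketch, not from any of our code; the 5-microsecond request is
only an example and the result will vary by platform):

#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
    struct timeval  before,
                    after,
                    delay;

    delay.tv_sec = 0;
    delay.tv_usec = 5;          /* ask for a 5-microsecond sleep */

    gettimeofday(&before, NULL);
    select(0, NULL, NULL, NULL, &delay);
    gettimeofday(&after, NULL);

    /* On a 100Hz kernel this typically prints 10000-20000, not 5. */
    printf("slept %ld usec\n",
           (long) (after.tv_sec - before.tv_sec) * 1000000L +
           (after.tv_usec - before.tv_usec));
    return 0;
}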

Now, kernels could check on return from kernel to user code to see if
someone is ready to be woken up, but I doubt they do that either.
Looking at the BSDI kernel, all timeouts are expressed in ticks, which
are 10 milliseconds.  Obviously there is no checking during kernel call
returns, because they don't even store the sleeps with enough resolution
to perform a check.  In fact, the kernel doesn't even have a way to
measure microsecond timings.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  + If your life is a hard drive,      |  830 Blythe Avenue
  + Christ can be your backup.         |  Drexel Hill, Pennsylvania 19026
 


Re: Microsecond sleeps with select()

From: Lamar Owen

Bruce Momjian wrote:
> In fact, the kernel doesn't even have a way
> to measure microsecond timings.

Linux has patches available to do microsecond timings, but they're
nonportable, of course.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11


Re: Microsecond sleeps with select()

From: Tom Lane

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> A comment on microsecond delays using select().  Most Unix kernels run
> at 100Hz, meaning that they have a programmable timer that interrupts
> the CPU every 10 milliseconds.

Right --- this probably also explains my observation that some kernels
seem to add an extra 10msec to the requested sleep time.  Actually
they're interpreting a one-clock-tick select() delay as "wait till
the next clock tick, plus one tick".  The actual delay will be between
one and two ticks depending on just when you went to sleep.
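
(For example, a one-tick request of 10 msec issued 4 msec into the
current tick would sleep about 6 msec to reach the next tick boundary,
plus one more full tick, or roughly 16 msec in all.)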

I have been thinking some more about the s_lock() delay loop in
connection with this.  We currently have

/*
 * Each time we busy spin we select the next element of this array as the
 * number of microseconds to wait. This accomplishes pseudo random back-off.
 * Values are not critical but 10 milliseconds is a common platform
 * granularity.
 *
 * Total time to cycle through all 20 entries might be about .07 sec,
 * so the given value of S_MAX_BUSY results in timeout after ~70 sec.
 */
#define S_NSPINCYCLE    20
#define S_MAX_BUSY      1000 * S_NSPINCYCLE

int     s_spincycle[S_NSPINCYCLE] =
{       0, 0, 0, 0, 10000, 0, 0, 0, 10000, 0,
        0, 10000, 0, 0, 10000, 0, 10000, 0, 10000, 10000
};

Having read the select(2) man page more closely, I now realize that
it is *defined* not to yield the processor when the requested delay
is zero: it just checks the file ready status and returns immediately.
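
In other words, a call like this (just illustrating the documented
behavior) only polls the descriptor sets and comes straight back:

#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Zero timeout: select() checks descriptor status and returns at once. */
static void
null_delay(void)
{
    struct timeval  nodelay;

    nodelay.tv_sec = 0;
    nodelay.tv_usec = 0;

    select(0, NULL, NULL, NULL, &nodelay);
}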

Therefore, on a single-CPU machine, the zero entries in this array
are a gold-plated, guaranteed waste of time: we'll cycle through the
kernel call, go back and retest the spinlock, and find it still locked.

On a multi-CPU machine, the time wasted in the kernel call might
possibly be enough to allow the lock holder (if running on another
CPU) to finish its work and release the lock.  But it's more likely
that we're just wasting CPU.

On either single or multi CPUs, there is no "pseudo random backoff"
behavior here whatsoever.  Spinning through the zero entries will take
some small number of microseconds, and then we hit the one-clock-tick
select() delay.  In reality, every backend waiting for a spinlock will
be awoken on every clock tick.

If your kernel is one of the ones that interprets a one-tick delay
request as "rest of the current tick plus a tick", then all the
delays are actually two ticks.  In this case, on average half the
population of waiting backends will be awoken on each alternate tick.
A little better but not much.

In short: s_spincycle in its current form does not do anything anywhere
near what the author thought it would.  It's wasted complexity.

I am thinking about simplifying s_lock_sleep down to simple
wait-one-tick-on-every-call logic.  An alternative is to keep
s_spincycle, but populate it with, say, 10000, 20000 and larger entries,
which would offer some hope of actual random-backoff behavior.
Either change would clearly be a win on single-CPU machines, and I doubt
it would hurt on multi-CPU machines.
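
For illustration, the second option might look something like this
(the values are strictly for example, not a tested patch):

int     s_spincycle[S_NSPINCYCLE] =
{       10000, 10000, 20000, 10000, 20000, 30000, 10000, 20000, 30000, 40000,
        10000, 20000, 30000, 40000, 50000, 10000, 30000, 50000, 60000, 60000
};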

Comments?
        regards, tom lane


Re: Microsecond sleeps with select()

From: Bruce Momjian

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > A comment on microsecond delays using select().  Most Unix kernels run
> > at 100Hz, meaning that they have a programmable timer that interrupts
> > the CPU every 10 milliseconds.
> 
> Right --- this probably also explains my observation that some kernels
> seem to add an extra 10msec to the requested sleep time.  Actually
> they're interpreting a one-clock-tick select() delay as "wait till
> the next clock tick, plus one tick".  The actual delay will be between
> one and two ticks depending on just when you went to sleep.
> 

The BSDI code would be pselect():

    /*
     * If poll wait was tiny, this could be zero; we will
     * have to round it up to avoid sleeping forever.  If
     * we retry below, the timercmp above will get us out.
     * Note that if wait was 0, the timercmp will prevent
     * us from getting here the first time.
     */
    timo = hzto(&atv);
    if (timo == 0)
        timo = 1;

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  + If your life is a hard drive,      |  830 Blythe Avenue
  + Christ can be your backup.         |  Drexel Hill, Pennsylvania 19026
 


Re: Microsecond sleeps with select()

From: Bruce Momjian

> I have been thinking some more about the s_lock() delay loop in
> connection with this.  We currently have
> 
> /*
>  * Each time we busy spin we select the next element of this array as the
>  * number of microseconds to wait. This accomplishes pseudo random back-off.
>  * Values are not critical but 10 milliseconds is a common platform
>  * granularity.
>  *
>  * Total time to cycle through all 20 entries might be about .07 sec,
>  * so the given value of S_MAX_BUSY results in timeout after ~70 sec.
>  */
> #define S_NSPINCYCLE    20
> #define S_MAX_BUSY    1000 * S_NSPINCYCLE
> 
> int    s_spincycle[S_NSPINCYCLE] =
> {    0, 0, 0, 0, 10000, 0, 0, 0, 10000, 0,
>     0, 10000, 0, 0, 10000, 0, 10000, 0, 10000, 10000
> };
> 
> Having read the select(2) man page more closely, I now realize that
> it is *defined* not to yield the processor when the requested delay
> is zero: it just checks the file ready status and returns immediately.

Actually, a kernel call does accomplish something.  On return from the
kernel call, process priorities are checked and the CPU may be yielded
to a higher-priority backend that perhaps just had its I/O completed.

I think the 0 and 10000 are correct.  They would be zero ticks and one
tick.  You think 5000 and 10000 would be better?  I can see that.


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  + If your life is a hard drive,      |  830 Blythe Avenue
  + Christ can be your backup.         |  Drexel Hill, Pennsylvania 19026
 


Re: Microsecond sleeps with select()

From: Tom Lane

Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Having read the select(2) man page more closely, I now realize that
>> it is *defined* not to yield the processor when the requested delay
>> is zero: it just checks the file ready status and returns immediately.

> Actually, a kernel call is something.  On kernel call return, process
> priorities are checked and the CPU may be yielded to a higher-priority
> backend that perhaps just had its I/O completed.

So *if* some I/O just completed, the call *might* do what we need,
which is yield the CPU.  Otherwise we're just wasting cycles, and
will continue to waste them until we do a select with a nonzero
delay.  I propose we cut out the spinning and just do a nonzero delay
immediately.

> I think the 0 and 10000 are correct.  They would be zero ticks and one
> tick.  You think 5000 and 10000 would be better?  I can see that.

No, I am not suggesting that, because there is no difference between
5000 and 10000.

All of this stuff probably ought to be replaced with a less-bogus
mechanism (POSIX semaphores maybe?), but not in late beta.
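
(By "less bogus" I mean something in the direction of this sketch using
the POSIX semaphore calls; for separate backend processes the sem_t
would have to live in shared memory, and this is nothing like a
worked-out patch:)

#include <semaphore.h>

/*
 * Sketch only: the waiter blocks in the kernel until the lock holder
 * posts, instead of polling select() on a timer.  The semaphore must be
 * in shared memory and initialized with sem_init(&sem, 1, ...) to work
 * across processes.
 */
void
wait_for_lock(sem_t *lock_sem)
{
    sem_wait(lock_sem);         /* sleep until someone calls sem_post() */
}

void
release_lock(sem_t *lock_sem)
{
    sem_post(lock_sem);         /* wake one waiter, if any */
}
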
        regards, tom lane


Re: Microsecond sleeps with select()

From: Bruce Momjian

> So *if* some I/O just completed, the call *might* do what we need,
> which is yield the CPU.  Otherwise we're just wasting cycles, and
> will continue to waste them until we do a select with a nonzero
> delay.  I propose we cut out the spinning and just do a nonzero delay
> immediately.

Well, any backend with a higher priority would get run instead of the
current process.  The question is how that would happen.  I will say
that, because of CPU cache issues, the system tries _not_ to switch
processes if the current one still needs the CPU, so the zero may be
bogus.

> 
> > I think the 0 and 10000 are correct.  They would be zero ticks and one
> > tick.  You think 5000 and 10000 would be better?  I can see that.
> 
> No, I am not suggesting that, because there is no difference between
> 5000 and 10000.
> 
> All of this stuff probably ought to be replaced with a less-bogus
> mechanism (POSIX semaphores maybe?), but not in late beta.

Good question.  We have sched_yield(), but that is a threads function,
or at least it is available only in the pthreads library.
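
Roughly, it would be used like this (a sketch only; the lock variable
here is just a stand-in for the real spinlock test):

#include <sched.h>

extern volatile int *lock;      /* stand-in for the real spinlock */

void
wait_on_lock(void)
{
    /*
     * Instead of select() with a zero timeout, give up the CPU
     * explicitly between probes.  sched_yield() is POSIX.1b, but on
     * some platforms it ships only with the threads library.
     */
    while (*lock != 0)          /* still held by someone else */
        sched_yield();
}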

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  + If your life is a hard drive,      |  830 Blythe Avenue
  + Christ can be your backup.         |  Drexel Hill, Pennsylvania 19026
 


Re: Microsecond sleeps with select()

From: ncm@zembu.com (Nathan Myers)

On Sat, Feb 17, 2001 at 12:26:31PM -0500, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > A comment on microsecond delays using select().  Most Unix kernels run
> > at 100Hz, meaning that they have a programmable timer that interrupts
> > the CPU every 10 milliseconds.
> 
> Right --- this probably also explains my observation that some kernels
> seem to add an extra 10msec to the requested sleep time.  Actually
> they're interpreting a one-clock-tick select() delay as "wait till
> the next clock tick, plus one tick".  The actual delay will be between
> one and two ticks depending on just when you went to sleep.
> ...
> In short: s_spincycle in its current form does not do anything anywhere
> near what the author thought it would.  It's wasted complexity.
> 
> I am thinking about simplifying s_lock_sleep down to simple
> wait-one-tick-on-every-call logic.  An alternative is to keep
> s_spincycle, but populate it with, say, 10000, 20000 and larger entries,
> which would offer some hope of actual random-backoff behavior.
> Either change would clearly be a win on single-CPU machines, and I doubt
> it would hurt on multi-CPU machines.
> 
> Comments?

I don't believe that most kernels schedule only on clock ticks.
They schedule on a clock tick *or* whenever the process yields, 
which on a loaded system may be much more frequently.

The question is whether, when scheduling, the kernel considers processes
that have requested to sleep for less than a clock tick as "ready" once
their actual requested time expires.  On V7 Unix, the answer was no,
because the kernel had no way to measure any time shorter than a
tick, so it rounded up all sleeps to "the next tick".

Certainly there are machines and kernels that count time more precisely 
(isn't PG ported to QNX?).  We do users of such kernels no favors by 
pretending they only count clock ticks.  Furthermore, a 1ms clock
tick is pretty common, e.g. on Alpha boxes.  A 10ms initial delay is 
ten clock ticks, far longer than seems appropriate.

This argues for yielding the minimum discernable amount of time (1us)
and then backing off to a less-minimal time (1ms).  On systems that 
chug at 10ms, this is equivalent to a sleep of up-to-10ms (i.e. until 
the next tick), then a sequence of 10ms sleeps; on dumbOS Alphas, it's 
equivalent to a sequence of 1ms sleeps; and on a smartOS on an Alpha it's 
equivalent to a short, variable time (long enough for other runnable 
processes to run and yield) followed by a sequence of 1ms sleeps.  
(Some of the numbers above are doubled on really dumb kernels, as
Tom noted.)
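
In code, the back-off I am describing is roughly this (a sketch, not a
patch):

#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * First retry: ask for the minimum (1 usec), which on a fine-grained
 * kernel is just a brief yield.  Later retries: 1 msec.  On a
 * 10ms-tick kernel both round up to a tick or two, so nothing is lost
 * there.
 */
static void
delay_for_retry(int ntries)
{
    struct timeval  delay;

    delay.tv_sec = 0;
    delay.tv_usec = (ntries == 0) ? 1 : 1000;

    select(0, NULL, NULL, NULL, &delay);
}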

Nathan Myers
ncm@zembu.com


Re: Microsecond sleeps with select()

From: Tom Lane

ncm@zembu.com (Nathan Myers) writes:
> Certainly there are machines and kernels that count time more precisely 
> (isn't PG ported to QNX?).  We do users of such kernels no favors by 
> pretending they only count clock ticks.  Furthermore, a 1ms clock
> tick is pretty common, e.g. on Alpha boxes.

Okay, I didn't know there were any popular systems that did that.

> This argues for yielding the minimum discernable amount of time (1us)
> and then backing off to a less-minimal time (1ms).

Fair enough.  As you say, it's the same result on machines with coarse
time resolution, and it should help on smarter boxes.  The main thing
is that I want to change the zero entries in s_spincycle, which
clearly aren't doing what the author intended.
        regards, tom lane