Re: Microsecond sleeps with select() - Mailing list pgsql-hackers

From ncm@zembu.com (Nathan Myers)
Subject Re: Microsecond sleeps with select()
Date
Msg-id 20010217153515.B16600@store.zembu.com
In response to Re: Microsecond sleeps with select()  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Microsecond sleeps with select()  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers

On Sat, Feb 17, 2001 at 12:26:31PM -0500, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > A comment on microsecond delays using select().  Most Unix kernels run
> > at 100hz, meaning that they have a programmable timer that interrupts
> > the CPU every 10 milliseconds.
> 
> Right --- this probably also explains my observation that some kernels
> seem to add an extra 10msec to the requested sleep time.  Actually
> they're interpreting a one-clock-tick select() delay as "wait till
> the next clock tick, plus one tick".  The actual delay will be between
> one and two ticks depending on just when you went to sleep.
> ...
> In short: s_spincycle in its current form does not do anything anywhere
> near what the author thought it would.  It's wasted complexity.
> 
> I am thinking about simplifying s_lock_sleep down to simple
> wait-one-tick-on-every-call logic.  An alternative is to keep
> s_spincycle, but populate it with, say, 10000, 20000 and larger entries,
> which would offer some hope of actual random-backoff behavior.
> Either change would clearly be a win on single-CPU machines, and I doubt
> it would hurt on multi-CPU machines.
> 
> Comments?

I don't believe that most kernels schedule only on clock ticks.
They schedule on a clock tick *or* whenever the running process yields,
which on a loaded system may happen much more often than once per tick.

The question is whether, when scheduling, the kernel considers processes
that have requested to sleep less than a clock tick as "ready" once
their actual request time expires.  On V7 Unix, the answer was no, 
because the kernel had no way to measure any time shorter than a
tick, so it rounded up all sleeps to "the next tick".
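
A rough sketch of that rounding, as a simplified model rather than any
particular kernel's code: a sleeper is woken only at the first tick
boundary at or after its requested wake-up time.  The function name and
parameters below are hypothetical, chosen only for illustration.

```c
#include <assert.h>

/*
 * Simplified model (an assumption, not real kernel code): the kernel
 * only notices expired sleeps on tick boundaries.
 *
 * phase_us   - microseconds elapsed since the last tick when the
 *              process goes to sleep (0 <= phase_us < tick_us)
 * request_us - requested sleep length in microseconds
 * tick_us    - clock tick length in microseconds (10000 at 100Hz)
 *
 * Returns the modeled actual delay in microseconds.
 */
static long
tick_rounded_delay_us(long phase_us, long request_us, long tick_us)
{
    long deadline_us;
    long wake_us;

    assert(tick_us > 0);
    assert(phase_us >= 0 && phase_us < tick_us);
    assert(request_us >= 0);

    /* Requested wake-up time, measured from the last tick. */
    deadline_us = phase_us + request_us;

    /* Round up to the first tick boundary at or after the deadline. */
    wake_us = ((deadline_us + tick_us - 1) / tick_us) * tick_us;

    return wake_us - phase_us;
}
```

Under this model a 1us request made 3ms into a 10ms tick sleeps 7ms,
while a 10ms request made at the same moment sleeps 17ms, i.e. between
one and two ticks.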

Certainly there are machines and kernels that count time more precisely 
(isn't PG ported to QNX?).  We do users of such kernels no favors by 
pretending they only count clock ticks.  Furthermore, a 1ms clock
tick is pretty common, e.g. on Alpha boxes.  A 10ms initial delay is 
ten clock ticks, far longer than seems appropriate.

This argues for yielding for the minimum discernible amount of time (1us)
and then backing off to a less-minimal time (1ms).  On systems that 
chug at 10ms, this is equivalent to a sleep of up-to-10ms (i.e. until 
the next tick), then a sequence of 10ms sleeps; on dumbOS Alphas, it's 
equivalent to a sequence of 1ms sleeps; and on a smartOS on an Alpha it's 
equivalent to a short, variable time (long enough for other runnable 
processes to run and yield) followed by a sequence of 1ms sleeps.  
(Some of the numbers above are doubled on really dumb kernels, as
Tom noted.)
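
A minimal sketch of that scheme, assuming a select()-based sleep with no
file descriptors.  This is not the existing s_lock_sleep code; the
names sleep_usec, backoff_delay_us, MIN_DELAY_US and BACKOFF_DELAY_US
are hypothetical, and the 1us and 1ms values are just the ones
suggested above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical values, taken from the proposal above. */
#define MIN_DELAY_US      1L     /* first sleep: smallest expressible time */
#define BACKOFF_DELAY_US  1000L  /* every later sleep: 1ms */

/*
 * Sleep for roughly usec microseconds by calling select() with no file
 * descriptors and only a timeout.  The actual delay depends on how the
 * kernel rounds the timeout.  Returns select()'s result: 0 when the
 * timeout expired, -1 on error (for example EINTR).
 */
static int
sleep_usec(long usec)
{
    struct timeval tv;

    assert(usec >= 0);
    tv.tv_sec = usec / 1000000L;
    tv.tv_usec = usec % 1000000L;
    return select(0, NULL, NULL, NULL, &tv);
}

/*
 * Delay to request on the given consecutive failed attempt to take the
 * lock (0 for the first failure): minimal at first, then the longer
 * back-off time.
 */
static long
backoff_delay_us(unsigned attempt)
{
    return (attempt == 0) ? MIN_DELAY_US : BACKOFF_DELAY_US;
}
```

A caller would spin on the lock and, after each failed attempt, call
sleep_usec(backoff_delay_us(attempt)) with an incrementing attempt
counter.  How long each call really sleeps is up to the kernel, which
is the point of the cases listed above.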

Nathan Myers
ncm@zembu.com

