Re: Microsecond sleeps with select() - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Microsecond sleeps with select()
Msg-id 4120.982430791@sss.pgh.pa.us
In response to Microsecond sleeps with select()  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: Microsecond sleeps with select()  (Bruce Momjian <pgman@candle.pha.pa.us>)
Re: Microsecond sleeps with select()  (Bruce Momjian <pgman@candle.pha.pa.us>)
Re: Microsecond sleeps with select()  (ncm@zembu.com (Nathan Myers))
List pgsql-hackers
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> A comment on microsecond delays using select().  Most Unix kernels run
> at 100hz, meaning that they have a programmable timer that interrupts
> the CPU every 10 milliseconds.

Right --- this probably also explains my observation that some kernels
seem to add an extra 10msec to the requested sleep time.  Actually
they're interpreting a one-clock-tick select() delay as "wait till
the next clock tick, plus one tick".  The actual delay will be between
one and two ticks depending on just when you went to sleep.
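
For anyone who wants to verify their own kernel's behavior, a quick
test along these lines (untested sketch, assuming gettimeofday() is
available) shows the rounding: ask for a tiny delay and time what you
actually get.

#include <stdio.h>
#include <sys/time.h>
#include <sys/select.h>

int
main(void)
{
    struct timeval before, after, delay;
    long        elapsed;
    int         i;

    for (i = 0; i < 5; i++)
    {
        /* re-set the timeout each time; some kernels modify it */
        delay.tv_sec = 0;
        delay.tv_usec = 1;      /* ask for one microsecond */

        gettimeofday(&before, NULL);
        (void) select(0, NULL, NULL, NULL, &delay);
        gettimeofday(&after, NULL);

        elapsed = (after.tv_sec - before.tv_sec) * 1000000L +
            (after.tv_usec - before.tv_usec);
        /* on a 100Hz kernel, expect 10000-20000 usec, not 1 */
        printf("requested 1 usec, slept %ld usec\n", elapsed);
    }
    return 0;
}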

I have been thinking some more about the s_lock() delay loop in
connection with this.  We currently have

/*
 * Each time we busy spin we select the next element of this array as the
 * number of microseconds to wait. This accomplishes pseudo random back-off.
 * Values are not critical but 10 milliseconds is a common platform
 * granularity.
 *
 * Total time to cycle through all 20 entries might be about .07 sec,
 * so the given value of S_MAX_BUSY results in timeout after ~70 sec.
 */
 
#define S_NSPINCYCLE    20
#define S_MAX_BUSY    1000 * S_NSPINCYCLE

int    s_spincycle[S_NSPINCYCLE] =
{
    0, 0, 0, 0, 10000, 0, 0, 0, 10000, 0,
    0, 10000, 0, 0, 10000, 0, 10000, 0, 10000, 10000
};
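
For reference, the sleep routine consumes that array essentially like
this (simplified sketch; the real code in s_lock.c also tracks the
spin count against S_MAX_BUSY to detect a stuck lock):

static void
s_lock_sleep(unsigned spin)
{
    struct timeval delay;

    delay.tv_sec = 0;
    delay.tv_usec = s_spincycle[spin % S_NSPINCYCLE];
    (void) select(0, NULL, NULL, NULL, &delay);
}

/* and the caller, in essence (TAS() is the test-and-set primitive): */
while (TAS(lock))
    s_lock_sleep(spins++);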

Having read the select(2) man page more closely, I now realize that
it is *defined* not to yield the processor when the requested delay
is zero: it just checks the file ready status and returns immediately.
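
That is, each zero entry in the array amounts to no more than:

struct timeval delay;

delay.tv_sec = 0;
delay.tv_usec = 0;
/* defined to poll the (here empty) descriptor sets and return
 * immediately --- no yield, no delay */
(void) select(0, NULL, NULL, NULL, &delay);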

Therefore, on a single-CPU machine, the zero entries in this array
are a gold-plated, guaranteed waste of time: we'll cycle through the
kernel call, go back and retest the spinlock, and find it still locked.

On a multi-CPU machine, the time wasted in the kernel call might
possibly be enough to allow the lock holder (if running on another
CPU) to finish its work and release the lock.  But it's more likely
that we're just wasting CPU.

On single- and multi-CPU machines alike, there is no "pseudo random backoff"
behavior here whatsoever.  Spinning through the zero entries will take
some small number of microseconds, and then we hit the one-clock-tick
select() delay.  In reality, every backend waiting for a spinlock will
be awoken on every clock tick.

If your kernel is one of the ones that interprets a one-tick delay
request as "rest of the current tick plus a tick", then all the
delays are actually two ticks.  In this case, on average half the
population of waiting backends will be awoken on each alternate tick.
A little better but not much.

In short: s_spincycle in its current form does not do anything anywhere
near what the author thought it would.  It's wasted complexity.

I am thinking about simplifying s_lock_sleep down to simple
wait-one-tick-on-every-call logic.  An alternative is to keep
s_spincycle, but populate it with, say, 10000, 20000 and larger entries,
which would offer some hope of actual random-backoff behavior.
Either change would clearly be a win on single-CPU machines, and I doubt
it would hurt on multi-CPU machines.
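
To be concrete, the simple form would reduce s_lock_sleep to something
like this (a sketch, not committed code):

static void
s_lock_sleep(void)
{
    struct timeval delay;

    delay.tv_sec = 0;
    delay.tv_usec = 10000;      /* one tick on a 100Hz kernel; the
                                 * kernel rounds up to its own
                                 * granularity anyway */
    (void) select(0, NULL, NULL, NULL, &delay);
}

while the other alternative would keep the array but fill it with
nonzero, varying entries, say (illustrative values only):

int    s_spincycle[S_NSPINCYCLE] =
{
    10000, 20000, 10000, 30000, 20000,
    10000, 40000, 20000, 30000, 50000,
    10000, 20000, 60000, 30000, 10000,
    40000, 20000, 70000, 30000, 80000
};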

Comments?
        regards, tom lane

