Thread: Spinlock backoff algorithm

Spinlock backoff algorithm

From: Magne Mæhre
I was playing with a Nevada server and noticed a rush on the FPU
(the Nevada has a single shared FPU for its 32 threads).
Looking at the spinlock code, I found:
 cur_delay += (int) (cur_delay * ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);

I understand the reasoning for the backoff (as of the discussion on
2003-08-05), but is there any particular reason for using floating
point operations here?  Maybe a modulo would be just as good (or
better, since it doesn't involve the FPU)?

Something like:   cur_delay += random() % (cur_delay + 1);


--Magne


Re: Spinlock backoff algorithm

From: Zdenek Kotala
Magne Mæhre wrote:
> 
> I was playing with a Nevada server and noticed a rush on the FPU
> (the Nevada has a single shared FPU for its 32 threads).

Probably you mean Niagara ;-).

    Zdenek


Re: Spinlock backoff algorithm

From: Tom Lane
Magne Mæhre <Magne.Mahre@Sun.COM> writes:
> I understand the reasoning for the backoff (as of the discussion on
> 2003-08-05), but is there any particular reason for using floating
> point operations here ?   Maybe a modulo would be just as good (or
> better since it doesn't involve the FPU) ?

My goodness that's a hardware-dependent proposal.  Shall we discuss
how many CPUs there are where an integer division is *slower* than
a floating-point op?

Why do you think that a couple of FP ops here are a problem, anyway?
This is a code path where we've already yielded the processor, so
by definition the repetition rate has to be pretty low.

The other problem with using modulo is that it makes the result depend
mostly on the low-order bits of the random() result, rather than mostly
on the high-order bits; with lower-grade implementations of random(),
the lower bits are materially less random than the higher.  Now
admittedly high-grade randomness is probably not too important for this
specific context, but I dislike putting in poor coding practices that
someone might see and copy without thinking...
        regards, tom lane


Re: Spinlock backoff algorithm

From: Mark Mielke
Tom Lane wrote:
> Magne Mæhre <Magne.Mahre@Sun.COM> writes:
>   
>> I understand the reasoning for the backoff (as of the discussion on
>> 2003-08-05), but is there any particular reason for using floating
>> point operations here ?   Maybe a modulo would be just as good (or
>> better since it doesn't involve the FPU) ?
>>     
> My goodness that's a hardware-dependent proposal.  Shall we discuss
> how many CPUs there are where an integer division is *slower* than
> a floating-point op?
>   
Hi Tom:

Do you have one in mind, or is this a straw man? :-)

> Why do you think that a couple of FP ops here are a problem, anyway?
> This is a code path where we've already yielded the processor, so
> by definition the repetition rate has to be pretty low.
Yielded the processor? Or yielded the lock? With 32 active threads 
contending for the lock, but first contending for the FPU, one could see 
how it might be relevant.

I think I agree with you that this won't be the only problem; however, 
the FPU need not contribute to the problem either.

> The other problem with using modulo is that it makes the result depend
> mostly on the low-order bits of the random() result, rather than mostly
> on the high-order bits; with lower-grade implementations of random(),
> the lower bits are materially less random than the higher.  Now
> admittedly high-grade randomness is probably not too important for this
> specific context, but I dislike putting in poor coding practices that
> someone might see and copy without thinking...
>   
If this were a serious problem, there is the >> operator. I see it as a 
poor coding practice to make assumptions about which bits are most 
"random" in a call to random(). If anybody fixes the problem you 
describe, the opposite may become true: perhaps the lowest bits would be 
the most random. If random() is broken, random() should be fixed. Coding 
to the implementation details of specific implementations of 
random() is just as bad. :-)

IMHO, use of the FPU should be avoided wherever possible on any platform. On 
some platforms, the FPU may be fully implemented in software. My memory 
is faint, but I think SPARC v7 either implemented division in software, or had 
a trap that implemented it in software.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>


Re: Spinlock backoff algorithm

From: Tom Lane
Mark Mielke <mark@mark.mielke.cc> writes:
> Tom Lane wrote:
>> My goodness that's a hardware-dependent proposal.  Shall we discuss
>> how many CPUs there are where an integer division is *slower* than
>> a floating-point op?

> Do you have one in mind, or is this a straw man? :-)

I've got one upstairs (HPPA), and I believe that it's actually a pretty
common situation in scientifically-oriented workstations from a few
years back.

>> Why do you think that a couple of FP ops here are a problem, anyway?
>> This is a code path where we've already yielded the processor, so
>> by definition the repetition rate has to be pretty low.

> Yielded the processor?

Yielded the processor, as in pg_usleep.  It is absolutely impossible for
any thread to execute that line of code more than 1000 times per second,
and the expected rate would be very much less.  Furthermore, the entire
point of the function is to try to make processes come out of the sleep
at different times, so they shouldn't be ganging up on the FPU anyway.

There may be some place where we have an unnecessarily high amount
of FP usage, but I very much doubt that this is it.
        regards, tom lane


Re: Spinlock backoff algorithm

From: Greg Smith
On Wed, 14 Nov 2007, Mark Mielke wrote:

>> The other problem with using modulo is that it makes the result depend
>> mostly on the low-order bits of the random() result, rather than mostly
>> on the high-order bits; with lower-grade implementations of random(),
>> the lower bits are materially less random than the higher.

> If this was a serious problem, there is the >> operator. I see it as a poor 
> coding practice to make assumptions about which bits are most "random" in a 
> call to random().

There are many types of pseudo-random number generators where the 
low-order bits are not so random, and the assumption Tom has described is 
pretty likely to be true.  See http://www.fourmilab.ch/random/ as one 
comment about the badness of the standard UNIX random generator for 
example.

There is an interesting discussion of this issue along with code showing a 
way to improve things while only using integer math (which in some cases 
uses >> as you suggest) as part of the Java standard library:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/Random.html#nextInt(int)

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Spinlock backoff algorithm

From: "Trevor Talbot"
On 11/14/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The other problem with using modulo is that it makes the result depend
> mostly on the low-order bits of the random() result, rather than mostly
> on the high-order bits; with lower-grade implementations of random(),
> the lower bits are materially less random than the higher.  Now
> admittedly high-grade randomness is probably not too important for this
> specific context, but I dislike putting in poor coding practices that
> someone might see and copy without thinking...

If there's a dependency on a particular quality of random()
implementation, why not just include one?  Mersenne Twister is easy,
while not being of cryptographic strength.
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html


Re: Spinlock backoff algorithm

From: Josh Berkus
Tom,

> I've got one upstairs (HPPA), and I believe that it's actually a pretty
> common situation in scientifically-oriented workstations from a few
> years back.

Last I checked, scientific workstations aren't exactly a common platform for 
PostgreSQL servers.

The question is, for our most common platforms (like AMD and Intel), is the FPU 
notably slower/more contended than integer division?  I had the impression that 
it was, but my knowledge of chip architectures is liable to be out of date.

Can we have a hardware geek speak up?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Spinlock backoff algorithm

From: "Joshua D. Drake"
Josh Berkus wrote:
> Tom,
> 
>> I've got one upstairs (HPPA), and I believe that it's actually a pretty
>> common situation in scientifically-oriented workstations from a few
>> years back.
> 
> Last I checked, scientific workstations aren't exactly a common platform for 
> PostgreSQL servers.
> 
> The question is, for our most common platforms (like AMD and Intel) is the FPU 
> notably slower/more contended than integer division?  I'd the impression that 
> it was, but my knowledge of chip architectures is liable to be out of date.
> 
> Can we have a hardware geek speak up?
> 

http://www.intel.com/performance/server/xeon/intspd.htm
http://www.intel.com/performance/server/xeon/fpspeed.htm


Sincerely,

Joshua D. Drake



Re: Spinlock backoff algorithm

From: Steve Atkins
On Nov 14, 2007, at 6:57 PM, Josh Berkus wrote:

> Tom,
>
>> I've got one upstairs (HPPA), and I believe that it's actually a pretty
>> common situation in scientifically-oriented workstations from a few
>> years back.
>
> Last I checked, scientific workstations aren't exactly a common platform for
> PostgreSQL servers.
>
> The question is, for our most common platforms (like AMD and Intel) is the FPU
> notably slower/more contended than integer division?  I'd the impression that
> it was, but my knowledge of chip architectures is liable to be out of date.
>
> Can we have a hardware geek speak up?

Somewhat. The last version of K7 I looked at had three integer
execution units versus one floating point unit.

They're also scheduled fairly independently, meaning that casts from
double to integer or back again will have some minor negative effects
on the pipeline or the scheduler, more so than the use of floating
point itself.

In the grand scheme of things, though, I don't believe it's a big
deal for typical code on most modern desktop CPUs, certainly not
compared to memory starvation, use of less-than-optimal compilers and
all the other reasons the pipeline might stall. I might care in the
innermost of inner loops, but possibly not even then, unless a
profiler told me differently.

Cheers,  Steve



Re: Spinlock backoff algorithm

From: "Gregory Maxwell"
On Nov 14, 2007 10:12 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> http://www.intel.com/performance/server/xeon/intspd.htm
> http://www.intel.com/performance/server/xeon/fpspeed.htm

That says precisely nothing about the matter at hand. Someone should
simply change it and benchmark it in pgsql. I doubt you'll see a
difference there on regular AMD/Intel ... and if it makes the sun
hyperthreaded cpu happier...


Re: Spinlock backoff algorithm

From: "Joshua D. Drake"
Gregory Maxwell wrote:
> On Nov 14, 2007 10:12 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
>> http://www.intel.com/performance/server/xeon/intspd.htm
>> http://www.intel.com/performance/server/xeon/fpspeed.htm
> 
> That says precisely nothing about the matter at hand. Someone should
> simply change it and benchmark it in pgsql. I doubt you'll see a
> difference there on regular AMD/Intel ... and if it makes the sun
> hyperthreaded cpu happier...

Are you offering?

Joshua D. Drake




Re: Spinlock backoff algorithm

From: Josh Berkus
Greg,

> That says precisely nothing about the matter at hand. Someone should
> simply change it and benchmark it in pgsql. I doubt you'll see a
> difference there on regular AMD/Intel ... and if it makes the sun
> hyperthreaded cpu happier...

Nah, if it's only Niagara, it's not worth bothering.  I was under the 
impression that most x86 chips had similar issues with having fewer FP units 
than integer units, but I could be wrong.  It's also worth thinking about the 
new Intel multi-core architectures; do they starve FP as well?

On a busy OLTP system, spinlock waits happen hundreds of times per second, so, 
like ProcArrayLock, it's frequent enough to call for micro-optimization.

I think maybe we should try making the change and testing it on Niagara and on 
some standard x86 platforms, and then on the new x86 architectures, to see if 
it makes any difference in CPU utilization.

FYI, in case nobody had guessed, this is coming out of a study Magne is doing 
on improving PostgreSQL SMP scalability.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Spinlock backoff algorithm

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> Nah, if it's only Niagara, it's not worth bothering.

It's not only that aspect of it --- it's that I am 100% convinced that
Magne has misidentified the source of whatever FPU contention he's
seeing.  The floating-point code in s_lock() is executed only just
after having returned from a sleep that is at least one millisecond
and often many times that.  If Niagara cannot handle a few kiloflops
then you need to find some other company to work for ;-)

I am interested to find out what the true cause of the reported FPU
contention is, but I'll bet our next lunch that s_lock.c ain't it.
        regards, tom lane


Re: Spinlock backoff algorithm

From: Martijn van Oosterhout
On Wed, Nov 14, 2007 at 06:57:18PM -0800, Josh Berkus wrote:
> The question is, for our most common platforms (like AMD and Intel) is the FPU
> notably slower/more contended than integer division?  I'd the impression that
> it was, but my knowledge of chip architectures is liable to be out of date.
>
> Can we have a hardware geek speak up?

I did some searching, and the best figures I can find for integer
division are 15-40 cycles, whereas for floating point the best I found was 5
cycles. The FP units on modern processors are very highly optimised, but
the figures quoted are for pipelined division, which is not what we're
doing here.

http://www.behardware.com/art/imprimer/623/

I did find a thread about how integer division on the Itanium was
implemented by copying the integers to the FP registers, doing the
division and rounding there and copying back. So at least there it's
not true.

http://gcc.gnu.org/ml/gcc-patches/2005-12/msg01518.html

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Off-Topic: Floating point division academia? (was: Re: Spinlock backoff algorithm)

From: Mark Mielke
Martijn van Oosterhout wrote:
> On Wed, Nov 14, 2007 at 06:57:18PM -0800, Josh Berkus wrote:
>> The question is, for our most common platforms (like AMD and Intel) is the FPU
>> notably slower/more contended than integer division?  I'd the impression that
>> it was, but my knowledge of chip architectures is liable to be out of date.
>>
>> Can we have a hardware geek speak up?
>
> I did some searching and the best figures I can find for integer
> division are 15-40 cycles whereas for floating point the best I found was 5
> cycles. The FP units on modern processors are very highly optimised, but
> the figures quoted are for pipelined division, which is not what we're
> doing here.

I agree with Tom that, although the user might be experiencing odd issues 
since PostgreSQL was not designed for 32 light-weight cores like the Niagara, 
this particular issue is unlikely to be the primary one.

In response to the academic discussion above, did you also look up the time 
to convert from INT -> FLOAT, and then FLOAT -> INT?

Also, when dividing by a constant, the compiler performs automatic 
optimization tricks, such as:

$ cat a.c
int f(int i)
{
    return i / 7;
}
$ gcc -O3 -S a.c
$ cat a.s
...
.globl f
        .type   f, @function
f:
.LFB2:
        movl    %edi, %eax
        movl    $-1840700269, %edx
        imull   %edx
        leal    (%rdx,%rdi), %eax
        sarl    $31, %edi
        sarl    $2, %eax
        subl    %edi, %eax
        ret

No divide. No need to convert from INT -> FLOAT -> INT.

Here is non-conclusive evidence both that integer division by a constant may 
still be faster on a fairly modern processor (AMD X2 3800+), and that, as Tom 
believes, this issue is a waste of time:

$ cat a.c
#define MAX_RANDOM_VALUE  (0x7FFFFFFF)

int old_f(int i)
{
    return ((double) i / (double) MAX_RANDOM_VALUE) + 0.5;
}

int new_f(int i)
{
    return (i + MAX_RANDOM_VALUE - 1) / MAX_RANDOM_VALUE;
}
$ cat old.c
int old_f(int);
int new_f(int);

int main()
{
    for (int i = 0; i < 100000000; i++)
        old_f(50);
    return 0;
}
$ cat new.c
int old_f(int);
int new_f(int);

int main()
{
    for (int i = 0; i < 100000000; i++)
        new_f(50);
    return 0;
}
$ gcc -o old -O3 -std=c99 old.c a.c
$ gcc -o new -O3 -std=c99 new.c a.c
$ time ./old
./old  1.58s user 0.00s system 99% cpu 1.590 total
$ time ./old
./old  1.31s user 0.00s system 99% cpu 1.314 total
$ time ./old
./old  1.14s user 0.00s system 99% cpu 1.142 total
$ time ./old
./old  1.46s user 0.00s system 99% cpu 1.461 total
$ time ./new
./new  0.68s user 0.00s system 99% cpu 0.682 total
$ time ./new
./new  0.70s user 0.01s system 99% cpu 0.703 total
$ time ./new
./new  0.70s user 0.00s system 99% cpu 0.705 total
$ time ./new
./new  0.70s user 0.00s system 99% cpu 0.703 total

I say non-conclusive, because:
1) my CPU is probably doing branch prediction and may possibly be doing 
out-of-order execution. Since it has more integer units than floating point 
units, this works in favour of integers. The contention case described by the 
original poster does not allow for multiple integer units to be in use at the 
same time.
2) as Tom has already said, 100 million divides in either 0.7 seconds or 1.5 
seconds is still orders of magnitude more than the contended case would ever 
be called.

For the future, please consider doing integer division when working with 
integers and the rvalue is a constant. For this case, it doesn't seem worth 
the risk of breaking the code.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: Spinlock backoff algorithm

From: Gregory Stark
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Mark Mielke <mark@mark.mielke.cc> writes:
>> Tom Lane wrote:
>>> My goodness that's a hardware-dependent proposal.  Shall we discuss
>>> how many CPUs there are where an integer division is *slower* than
>>> a floating-point op?
>
>> Do you have one in mind, or is this a straw man? :-)
>
> I've got one upstairs (HPPA), and I believe that it's actually a pretty
> common situation in scientifically-oriented workstations from a few
> years back.

I think floating point is fast on many common platforms, even many i386
variants. But usually that's assuming you're comparing doing a whole bunch of
work in floating point or integer math. Converting a bunch of integers to
floating point for a single operation doesn't seem like a case that's going to
shine on any floating point unit.

And there are plenty of platforms with *no* floating point. On embedded 486,
ARM, etc the floating point will be emulated by either the compiler or the
kernel.

But as Tom said, there are plenty of places using floating point arithmetic in
the backend. The planner does all of its selectivity and costing in floating
point, for example.

I wonder if you wouldn't be better off going whole hog with soft floating
point emulation. I bet Sun's cc has an option to implement floating point
arithmetic entirely in user-side software.

Another option might be to compile just some modules with soft float and do
utils/adt/*.c using hardfloat -- so user-defined types still use the floating
point unit but the database's internal arithmetic is done using software
emulated math.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!


Re: Spinlock backoff algorithm

From: Zdenek Kotala
Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
> 
>> Mark Mielke <mark@mark.mielke.cc> writes:
>>> Tom Lane wrote:
>>>> My goodness that's a hardware-dependent proposal.  Shall we discuss
>>>> how many CPUs there are where an integer division is *slower* than
>>>> a floating-point op?
>>> Do you have one in mind, or is this a straw man? :-)
>> I've got one upstairs (HPPA), and I believe that it's actually a pretty
>> common situation in scientifically-oriented workstations from a few
>> years back.
> 
> I think floating point is fast on many common platforms, even many i386
> variants. But usually that's assuming you're comparing doing a whole bunch of
> work in floating point or integer math. Converting a bunch of integers to
> floating point for a single operation doesn't seem like a case that's going to
> shine on any floating point unit.

Just for completeness: a task context switch is more complex (slower) when the 
application uses FP operations. It is not important for PostgreSQL, because it 
uses FP operations in many places anyway, but it is good to know.

    Zdenek