Re: tweaking MemSet() performance - 7.4.5 - Mailing list pgsql-hackers

From Marc Colosimo
Subject Re: tweaking MemSet() performance - 7.4.5
Msg-id E60A72E0-08E7-11D9-B617-000A95A5D8B2@mitre.org
In response to Re: tweaking MemSet() performance - 7.4.5  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: tweaking MemSet() performance - 7.4.5
List pgsql-hackers
On Sep 17, 2004, at 3:55 PM, Tom Lane wrote:

> Marc Colosimo <mcolosimo@mitre.org> writes:
>> I'm using 7.4.5 on Mac OS X (G5) and was profiling it to see why it is
>> SO SLOW at committing  inserts and deletes into a large database. One
>> of the many slowdowns was from MemSet. I found an old (2002) thread
>> about this and retried the tests  (see below). The main point is that
>> the system memset crushes pg's!!
>
> Hmm.  I tried to duplicate this on my G4 laptop, and found that they
> were more or less on a par for small-to-middling block sizes (using
> "gcc -O2").  Darwin's memset code must have some additional tweaks for
> use on G5 hardware.  Good for Apple --- this is the sort of thing that
> OS vendors *ought* to be doing.  The fact that we can beat the system
> memset on so many platforms is an indictment of those platforms.
>
>> Is it possible to add a define to call
>> the system memset at build time! This probably isn't the case on other
>> systems.
>
> Feel free to hack the definition of MemSet in src/include/c.h.  See the
> comments for it for more context.
>
> Note that for small compile-time-constant block sizes (a case your test
> program doesn't test, but it's common in pgsql), gcc with a sufficiently
> high optimization setting can unroll the loop into a linear sequence of
> words zeroings.  I would expect that to beat the system memset up to a
> few dozen words, no matter how tense the memset coding is.  So you
> probably want to think in terms of reducing MEMSET_LOOP_LIMIT rather
> than diking out the macro code altogether.  Or maybe reduce MemSet to
> "memset(...)" but leave MemSetAligned and/or MemSetTest/MemSetLoop
> as-is.  In any case, reporting results without mentioning the compiler
> and optimization level in use isn't going to convince anybody ...
>

Oops, I used the same settings as in the old thread (-O2, gcc 3.3). If I
understand what you are saying, then yes, it turns out that PG's MemSet
is faster for smaller block sizes (see below; the crossover is between
32 and 64). I just replaced the whole MemSet with memset, and it is now
very low when I profile. I could squeeze more out of it if I spent more
time trying to understand it (change MEMSET_LOOP_LIMIT to 32 and then
call memset above that?). I'm now working on understanding spin locks
and friends. Putting in a sync call (in s_lock.h) is a real time killer
and bad for performance (it takes up 35 cycles).
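
For illustration, the "loop up to a limit, memset after that" idea could be sketched roughly as below. This is a simplified, hypothetical sketch and not the actual MemSet macro from src/include/c.h: the names SKETCH_LOOP_LIMIT and sketch_memset_zero are made up here, the limit of 32 is assumed to be in bytes, and only zero-filling is handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical threshold in bytes; 32 is the value floated in the text. */
#define SKETCH_LOOP_LIMIT 32

/*
 * Zero-fill len bytes at start.
 * Small, long-aligned blocks whose length is a multiple of sizeof(long)
 * are cleared one word at a time; everything else goes to the system memset.
 */
static inline void
sketch_memset_zero(void *start, size_t len)
{
    if (((uintptr_t) start & (sizeof(long) - 1)) == 0 &&
        (len & (sizeof(long) - 1)) == 0 &&
        len <= SKETCH_LOOP_LIMIT)
    {
        long *p = (long *) start;
        long *stop = (long *) ((char *) start + len);

        /* Word-at-a-time loop; a compiler may unroll this for constant len. */
        while (p < stop)
            *p++ = 0;
    }
    else
    {
        /* Large or unaligned block: defer to the platform's memset. */
        memset(start, 0, len);
    }
}
```

Where the limit should sit depends on the compiler, the optimization level, and the platform's memset, so the value would need to be measured rather than assumed.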

Run on a single-processor G5 (1.8 GHz; the other run was on a dual-processor 2 GHz G5):
pgMemSet:
* 4
0.070u 0.000s 0:00.15 46.6%     0+0k 0+0io 0pf+0w
* 8
0.090u 0.000s 0:00.16 56.2%     0+0k 0+0io 0pf+0w
* 16
0.120u 0.000s 0:00.17 70.5%     0+0k 0+0io 0pf+0w
* 32
0.180u 0.000s 0:00.29 62.0%     0+0k 0+0io 0pf+0w
* 64
0.450u 0.000s 0:00.92 48.9%     0+0k 0+0io 0pf+0w


memset:
* 4
0.170u 0.010s 0:00.44 40.9%     0+0k 0+0io 0pf+0w
* 8
0.190u 0.000s 0:00.42 45.2%     0+0k 0+0io 0pf+0w
* 16
0.190u 0.010s 0:00.39 51.2%     0+0k 0+0io 0pf+0w
* 32
0.200u 0.000s 0:00.39 51.2%     0+0k 0+0io 0pf+0w
* 64
0.260u 0.000s 0:00.38 68.4%     0+0k 0+0io 0pf+0w
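
The test program behind these numbers is not included in this message. A comparison of this kind could be set up along the following lines; this is a minimal, hypothetical harness (zero_fn, zero_with_memset, zero_with_loop, and time_zeroing are names invented for the sketch), and it measures CPU time with clock() rather than using the shell's time output shown above.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any routine that zero-fills len bytes at buf. */
typedef void (*zero_fn)(void *buf, size_t len);

/* Wrapper so the system memset fits the zero_fn signature. */
static void
zero_with_memset(void *buf, size_t len)
{
    memset(buf, 0, len);
}

/*
 * Word-at-a-time loop. Assumes buf is long-aligned and len is a
 * multiple of sizeof(long); the caller must guarantee both.
 */
static void
zero_with_loop(void *buf, size_t len)
{
    long *p = (long *) buf;
    long *stop = (long *) ((char *) buf + len);

    while (p < stop)
        *p++ = 0;
}

/* Run fn over the buffer the given number of times; return CPU seconds. */
static double
time_zeroing(zero_fn fn, void *buf, size_t len, long iterations)
{
    volatile char *sink = (volatile char *) buf;
    clock_t begin = clock();

    for (long i = 0; i < iterations; i++)
    {
        fn(buf, len);
        /* Dirty the first byte through a volatile pointer so the
         * compiler cannot treat repeated zeroing as redundant. */
        sink[0] = 1;
    }
    return (double) (clock() - begin) / CLOCKS_PER_SEC;
}
```

Calling time_zeroing once with zero_with_loop and once with zero_with_memset for each block size gives a pair of timings comparable to the two lists above. As noted in the quoted reply, the compiler and optimization level should be reported alongside any results.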


Marc



