Thread: tweaking MemSet() performance - 7.4.5

tweaking MemSet() performance - 7.4.5

From
Marc Colosimo
Date:
Hi,

I'm using 7.4.5 on Mac OS X (G5) and was profiling it to see why it is 
SO SLOW at committing inserts and deletes into a large database. One 
of the many slowdowns was from MemSet. I found an old (2002) thread 
about this and reran the tests (see below). The main point is that 
the system memset crushes pg's!! Is it possible to add a define to call 
the system memset at build time? This probably isn't the case on other 
systems.

I wanted to know the size of FunctionCallInfoData (used in execQual.c) 
because the profiler's advice was to use the system call for anything 
over 128.

Here are my results:

pgMemSet
* 64
0.410u 0.000s 0:00.42 97.6%     0+0k 0+0io 0pf+0w
* 128
0.600u 0.000s 0:00.61 98.3%     0+0k 0+0io 0pf+0w
* 176 (size of fcinfo is 176; used in execQual.c, which was being very slow here!)
0.790u 0.000s 0:00.79 100.0%    0+0k 0+0io 0pf+0w
* 256
1.040u 0.000s 0:01.08 96.2%     0+0k 0+0io 0pf+0w
* 512
2.030u 0.000s 0:02.04 99.5%     0+0k 0+0io 0pf+0w
* 1024
3.950u 0.010s 0:03.94 100.5%    0+0k 0+0io 0pf+0w
* 2048
7.710u 0.000s 0:07.75 99.4%     0+0k 0+0io 0pf+0w
* 4096
15.390u 0.000s 0:15.37 100.1%   0+0k 0+0io 0pf+0w

system memset
* 64
0.260u 0.000s 0:00.25 104.0%    0+0k 0+0io 0pf+0w
* 128
0.310u 0.000s 0:00.31 100.0%    0+0k 0+0io 0pf+0w
* 176 Size of fcinfo is 176
0.300u 0.010s 0:00.30 103.3%    0+0k 0+0io 0pf+0w
* 256
0.310u 0.000s 0:00.30 103.3%    0+0k 0+0io 0pf+0w
* 512
0.350u 0.000s 0:00.33 106.0%    0+0k 0+0io 0pf+0w
* 1024
0.590u 0.010s 0:00.63 95.2%     0+0k 0+0io 0pf+0w
* 2048
0.780u 0.000s 0:00.77 101.2%    0+0k 0+0io 0pf+0w
* 4096
1.320u 0.000s 0:01.33 99.2%     0+0k 0+0io 0pf+0w

    #include <string.h>
    #include "postgres.h"
    #include "fmgr.h"

    /* Keep the tested sizes below MemSet's fall-back-to-memset threshold */
    #undef MEMSET_LOOP_LIMIT
    #define MEMSET_LOOP_LIMIT  1000000

    int
    main(int argc, char **argv)
    {
        int         len = atoi(argv[1]);    /* block size to clear */
        char        buffer[len];
        long long   i;
        FunctionCallInfoData fcinfo;

        /* sizeof yields a size_t, so cast it to match %d */
        printf("Size of fcinfo is %d\n", (int) sizeof(fcinfo));

        for (i = 0; i < 9900000; i++)
            MemSet(buffer, 0, len);
            /* memset(buffer, 0, len); */

        return 0;
    }




Re: tweaking MemSet() performance - 7.4.5

From
Tom Lane
Date:
Marc Colosimo <mcolosimo@mitre.org> writes:
> I'm using 7.4.5 on Mac OS X (G5) and was profiling it to see why it is 
> SO SLOW at committing  inserts and deletes into a large database. One 
> of the many slowdowns was from MemSet. I found an old (2002) thread 
> about this and retried the tests  (see below). The main point is that 
> the system memset crushes pg's!!

Hmm.  I tried to duplicate this on my G4 laptop, and found that they
were more or less on a par for small-to-middling block sizes (using
"gcc -O2").  Darwin's memset code must have some additional tweaks for
use on G5 hardware.  Good for Apple --- this is the sort of thing that
OS vendors *ought* to be doing.  The fact that we can beat the system
memset on so many platforms is an indictment of those platforms.

> Is it possible to add a define to call 
> the system memset at build time! This probably isn't the case on other 
> systems.

Feel free to hack the definition of MemSet in src/include/c.h.  See the
comments for it for more context.

Note that for small compile-time-constant block sizes (a case your test
program doesn't test, but it's common in pgsql), gcc with a sufficiently
high optimization setting can unroll the loop into a linear sequence of
word zeroings.  I would expect that to beat the system memset up to a
few dozen words, no matter how tense the memset coding is.  So you
probably want to think in terms of reducing MEMSET_LOOP_LIMIT rather
than diking out the macro code altogether.  Or maybe reduce MemSet to
"memset(...)" but leave MemSetAligned and/or MemSetTest/MemSetLoop
as-is.  In any case, reporting results without mentioning the compiler
and optimization level in use isn't going to convince anybody ...
        regards, tom lane


Re: tweaking MemSet() performance - 7.4.5

From
Marc Colosimo
Date:
On Sep 17, 2004, at 3:55 PM, Tom Lane wrote:

> Marc Colosimo <mcolosimo@mitre.org> writes:
>> I'm using 7.4.5 on Mac OS X (G5) and was profiling it to see why it is
>> SO SLOW at committing  inserts and deletes into a large database. One
>> of the many slowdowns was from MemSet. I found an old (2002) thread
>> about this and retried the tests  (see below). The main point is that
>> the system memset crushes pg's!!
>
> Hmm.  I tried to duplicate this on my G4 laptop, and found that they
> were more or less on a par for small-to-middling block sizes (using
> "gcc -O2").  Darwin's memset code must have some additional tweaks for
> use on G5 hardware.  Good for Apple --- this is the sort of thing that
> OS vendors *ought* to be doing.  The fact that we can beat the system
> memset on so many platforms is an indictment of those platforms.
>
>> Is it possible to add a define to call
>> the system memset at build time! This probably isn't the case on other
>> systems.
>
> Feel free to hack the definition of MemSet in src/include/c.h.  See the
> comments for it for more context.
>
> Note that for small compile-time-constant block sizes (a case your test
> program doesn't test, but it's common in pgsql), gcc with a sufficiently
> high optimization setting can unroll the loop into a linear sequence of
> word zeroings.  I would expect that to beat the system memset up to a
> few dozen words, no matter how tense the memset coding is.  So you
> probably want to think in terms of reducing MEMSET_LOOP_LIMIT rather
> than diking out the macro code altogether.  Or maybe reduce MemSet to
> "memset(...)" but leave MemSetAligned and/or MemSetTest/MemSetLoop
> as-is.  In any case, reporting results without mentioning the compiler
> and optimization level in use isn't going to convince anybody ...
>

Oops, I used the same settings as in the old thread (-O2, gcc 
3.3). If I understand what you are saying, then it turns out that yes, PG's 
MemSet is faster for smaller block sizes (see below; the crossover is 
between 32 and 64). I just replaced the whole MemSet with memset and it 
is now very low when I profile. I could squeeze more out of it if I spent 
more time trying to understand it (change MEMSET_LOOP_LIMIT to 32 and 
then call memset above that?). I'm now working on understanding spin 
locks and friends. Putting in a sync call (in s_lock.h) is really a time 
killer and bad for performance (it takes up 35 cycles).

Run on a single-processor G5 (1.8 GHz; the earlier run was on a dual-processor 2 GHz G5)
pgMemSet:
*  4
0.070u 0.000s 0:00.15 46.6%     0+0k 0+0io 0pf+0w
* 8
0.090u 0.000s 0:00.16 56.2%     0+0k 0+0io 0pf+0w
* 16
0.120u 0.000s 0:00.17 70.5%     0+0k 0+0io 0pf+0w
* 32
0.180u 0.000s 0:00.29 62.0%     0+0k 0+0io 0pf+0w
* 64
0.450u 0.000s 0:00.92 48.9%     0+0k 0+0io 0pf+0w


memset:
* 4
0.170u 0.010s 0:00.44 40.9%     0+0k 0+0io 0pf+0w
* 8
0.190u 0.000s 0:00.42 45.2%     0+0k 0+0io 0pf+0w
* 16
0.190u 0.010s 0:00.39 51.2%     0+0k 0+0io 0pf+0w
* 32
0.200u 0.000s 0:00.39 51.2%     0+0k 0+0io 0pf+0w
* 64
0.260u 0.000s 0:00.38 68.4%     0+0k 0+0io 0pf+0w


Marc




Re: tweaking MemSet() performance - 7.4.5

From
Manfred Spraul
Date:
Marc Colosimo wrote:

> Oops, I used the same setting as in the old hacking message (-O2, gcc 
> 3.3). If I understand what you are saying, then it turns out yes, PG's 
> MemSet is faster for smaller blocksizes (see below, between 32 and 
> 64). I just replaced the whole MemSet with memset and it is now very 
> low when I profile.

Could you check what the OS-X memset function does internally?
One trick to speed up memset is to bypass the cache and bulk-write 
directly from write buffers to main memory. i386 cpus support that and 
in microbenchmarks it's 3 times faster (or something like that). 
Unfortunately it's a loss in real-world tests: Typically a structure is 
initialized with memset and then immediately accessed. If the memset 
bypasses the cache then the following access will cause a cache line 
miss, which can be so slow that using the faster memset can result in a 
net performance loss.

> I could squeeze more out of it if I spent more time trying to 
> understand it (change MEMSET_LOOP_LIMIT to 32 and then add memset 
> after that?). I'm now working on understanding Spin Locks and 
> friends. Putting in a sync call (in s_lock.h) is really a time killer 
> and bad for performance (it takes up 35 cycles).
>
That's the price you pay for weakly ordered memory access.
Linux on ppc uses eieio, on ppc64 lwsync is used. Could you check if 
they are faster?

--   Manfred


Re: tweaking MemSet() performance - 7.4.5

From
Tom Lane
Date:
Manfred Spraul <manfred@colorfullife.com> writes:
> That's the price you pay for weakly ordered memory access.
> Linux on ppc uses eieio, on ppc64 lwsync is used. Could you check if 
> they are faster?

I recall looking at lwsync and being concerned about portability
problems --- older assemblers will fail to recognize it.  I'd want
to see some hard evidence that changing sync to lwsync would be a
significant performance win before taking any portability risk here.
        regards, tom lane


Re: tweaking MemSet() performance - 7.4.5

From
mcolosimo@mitre.org
Date:
>Marc Colosimo wrote:
>
>> Oops, I used the same setting as in the old hacking message (-O2, gcc 
>> 3.3). If I understand what you are saying, then it turns out yes, PG's 
>> MemSet is faster for smaller blocksizes (see below, between 32 and 
>> 64). I just replaced the whole MemSet with memset and it is now very 
>> low when I profile.
>
>Could you check what the OS-X memset function does internally?
>One trick to speed up memset is to bypass the cache and bulk-write 
>directly from write buffers to main memory. i386 cpus support that and 
>in microbenchmarks it's 3 times faster (or something like that). 
>Unfortunately it's a loss in real-world tests: Typically a structure is 
>initialized with memset and then immediately accessed. If the memset 
>bypasses the cache then the following access will cause a cache line 
>miss, which can be so slow that using the faster memset can result in a 
>net performance loss.
>

Could you suggest some structs to test? If I get your meaning, I would
make a loop that sets and then reads from the structure.
 

>> I could squeeze more out of it if I spent more time trying to 
>> understand it (change MEMSET_LOOP_LIMIT to 32 and then add memset 
>> after that?). I'm now working on understanding Spin Locks and 
>> friends. Putting in a sync call (in s_lock.h) is really a time killer 
>> and bad for performance (it takes up 35 cycles).
>>
>That's the price you pay for weakly ordered memory access.
>Linux on ppc uses eieio, on ppc64 lwsync is used. Could you check if 
>they are faster?
>

I found the reason why "sync" was put in
<http://archives.postgresql.org/pgsql-bugs/2002-09/msg00239.php>, but it is
odd why it works. Why syncing one processor prevents the other from doing
something is interesting. What type of shared memory is being used on OS X?
I'm confused about the two types of semaphores, SysV or POSIX
<http://archives.postgresql.org/pgsql-patches/2001-01/msg00052.php>. It
seems POSIX is the way to go on OS X.
 

Marc





Re: tweaking MemSet() performance - 7.4.5

From
Manfred Spraul
Date:
mcolosimo@mitre.org wrote:

>>If the memset 
>>bypasses the cache then the following access will cause a cache line 
>>miss, which can be so slow that using the faster memset can result in a 
>>net performance loss.
>>
>Could you suggest some structs to test? If I get your meaning, I would make
>a loop that sets then reads from the structure.
>
Read the sources and the cpu specs. Benchmarking such problems is 
virtually impossible.
I don't have OS-X, thus I checked the Linux-kernel sources: It seems 
that the power architecture doesn't have the same problem as x86.
There is a special clear cacheline instruction for large memsets and the 
rest is done through carefully optimized store byte/halfword/word/double 
word sequences.

Thus I'd check what happens if you memset not perfectly aligned buffers. 
That's another point where over-optimized functions sometimes break 
down. If there is no slowdown, then I'd replace the postgres function 
with the OS provided function.

I'd add some __builtin_constant_p() optimizations, but I guess Tom won't 
like gcc hacks ;-)
--   Manfred


Re: tweaking MemSet() performance - 7.4.5

From
Karel Zak
Date:
On Sat, 2004-09-25 at 23:23 +0200, Manfred Spraul wrote:
> mcolosimo@mitre.org wrote:
> 
> >>If the memset 
> >>bypasses the cache then the following access will cause a cache line 
> >>miss, which can be so slow that using the faster memset can result in a 
> >>net performance loss.
> >>
> >Could you suggest some structs to test? If I get your meaning, I would make
> >a loop that sets then reads from the structure.
> >
> Read the sources and the cpu specs. Benchmarking such problems is 
> virtually impossible.
> I don't have OS-X, thus I checked the Linux-kernel sources: It seems 
> that the power architecture doesn't have the same problem as x86.
> There is a special clear cacheline instruction for large memsets and the 
> rest is done through carefully optimized store byte/halfword/word/double 
> word sequences.
> 
> Thus I'd check what happens if you memset not perfectly aligned buffers. 
> That's another point where over-optimized functions sometimes break 
> down. If there is no slowdown, then I'd replace the postgres function 
> with the OS provided function.
> 
> I'd add some __builtin_constant_p() optimizations, but I guess Tom won't 
> like gcc hacks ;-)

I think it cannot be a problem if you write it in some .h file (in the
port directory?) as a macro under "#ifdef GCC". The other question is the
real advantage of hacks like this in practical PG usage :-)
Karel

-- 
Karel Zak
http://home.zf.jcu.cz/~zakkr



Re: tweaking MemSet() performance - 7.4.5

From
Bruce Momjian
Date:
Karel Zak wrote:
> On Sat, 2004-09-25 at 23:23 +0200, Manfred Spraul wrote:
> > mcolosimo@mitre.org wrote:
> > 
> > >>If the memset 
> > >>bypasses the cache then the following access will cause a cache line 
> > >>miss, which can be so slow that using the faster memset can result in a 
> > >>net performance loss.
> > >>
> > >Could you suggest some structs to test? If I get your meaning, I would make
> > >a loop that sets then reads from the structure.
> > >
> > Read the sources and the cpu specs. Benchmarking such problems is 
> > virtually impossible.
> > I don't have OS-X, thus I checked the Linux-kernel sources: It seems 
> > that the power architecture doesn't have the same problem as x86.
> > There is a special clear cacheline instruction for large memsets and the 
> > rest is done through carefully optimized store byte/halfword/word/double 
> > word sequences.
> > 
> > Thus I'd check what happens if you memset not perfectly aligned buffers. 
> > That's another point where over-optimized functions sometimes break 
> > down. If there is no slowdown, then I'd replace the postgres function 
> > with the OS provided function.
> > 
> > I'd add some __builtin_constant_p() optimizations, but I guess Tom won't 
> > like gcc hacks ;-)
> 
> I think it cannot be problem if you write it to some .h file (in port
> directory?) as macro with "#ifdef GCC". The other thing is real
> advantage of hacks like this in practical PG usage :-)

The reason MemSet is a win is not that the C code is great but because
it eliminates a function call.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: tweaking MemSet() performance - 7.4.5

From
Neil Conway
Date:
On Wed, 2004-09-29 at 21:37, Bruce Momjian wrote:
> The reason MemSet is a win is not that the C code is great but because
> it eliminates a function call.

A reasonable compiler ought to be able to implement memset() as a
compiler intrinsic where it makes sense to do so. MSVC++ can certainly
do this; per the GCC 3.4 docs, it seems GCC can/does as well:

The ISO C90 functions abort, abs, acos, asin, atan2, atan, calloc, ceil,
cosh, cos, exit, exp, fabs, floor, fmod, fprintf, fputs, frexp, fscanf,
labs, ldexp, log10, log, malloc, memcmp, memcpy, memset, modf, pow,
printf, putchar, puts, scanf, sinh, sin, snprintf, sprintf, sqrt,
sscanf, strcat, strchr, strcmp, strcpy, strcspn, strlen, strncat,
strncmp, strncpy, strpbrk, strrchr, strspn, strstr, tanh, tan, vfprintf,
vprintf and vsprintf are all recognized as built-in functions unless
-fno-builtin is specified (or -fno-builtin-function is specified for an
individual function). All of these functions have corresponding versions
prefixed with __builtin_.

(http://gcc.gnu.org/onlinedocs/gcc-3.4.2/gcc/Other-Builtins.html#Other-Builtins)

-Neil




Re: tweaking MemSet() performance - 7.4.5

From
Bruce Momjian
Date:
Neil Conway wrote:
> On Wed, 2004-09-29 at 21:37, Bruce Momjian wrote:
> > The reason MemSet is a win is not that the C code is great but because
> > it eliminates a function call.
> 
> A reasonable compiler ought to be able to implement memset() as a
> compiler intrinsic where it makes sense to do so. MSVC++ can certainly
> do this; per the GCC 3.4 docs, it seems GCC can/does as well:
> 
> The ISO C90 functions abort, abs, acos, asin, atan2, atan, calloc, ceil,
> cosh, cos, exit, exp, fabs, floor, fmod, fprintf, fputs, frexp, fscanf,
> labs, ldexp, log10, log, malloc, memcmp, memcpy, memset, modf, pow,
> printf, putchar, puts, scanf, sinh, sin, snprintf, sprintf, sqrt,
> sscanf, strcat, strchr, strcmp, strcpy, strcspn, strlen, strncat,
> strncmp, strncpy, strpbrk, strrchr, strspn, strstr, tanh, tan, vfprintf,
> vprintf and vsprintf are all recognized as built-in functions unless
> -fno-builtin is specified (or -fno-builtin-function is specified for an
> individual function). All of these functions have corresponding versions
> prefixed with __builtin_.
> 
> (http://gcc.gnu.org/onlinedocs/gcc-3.4.2/gcc/Other-Builtins.html#Other-Builtins)

MemSet was written when gcc 2.X wasn't even stable yet.  Have you run
any tests on 3.4 to see if MemSet is still a win with that compiler?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: tweaking MemSet() performance - 7.4.5

From
Marc Colosimo
Date:
On Sep 29, 2004, at 7:37 AM, Bruce Momjian wrote:

> Karel Zak wrote:
>> On Sat, 2004-09-25 at 23:23 +0200, Manfred Spraul wrote:
>>> mcolosimo@mitre.org wrote:
>>>
>>>>> If the memset bypasses the cache then the following access will cause
>>>>> a cache line miss, which can be so slow that using the faster memset
>>>>> can result in a net performance loss.
>>>>
>>>> Could you suggest some structs to test? If I get your meaning, I 
>>>> would make a loop that sets then reads from the structure.
>>>>
>>> Read the sources and the cpu specs. Benchmarking such problems is
>>> virtually impossible.
>>> I don't have OS-X, thus I checked the Linux-kernel sources: It seems
>>> that the power architecture doesn't have the same problem as x86.
>>> There is a special clear cacheline instruction for large memsets and
>>> the rest is done through carefully optimized store
>>> byte/halfword/word/double word sequences.
>>>
>>> Thus I'd check what happens if you memset not perfectly aligned
>>> buffers. That's another point where over-optimized functions sometimes
>>> break down. If there is no slowdown, then I'd replace the postgres
>>> function with the OS provided function.
>>>

All memory (via malloc and friends) will be aligned on OS X, unless you 
remove padding (which I don't think you do).

>>> I'd add some __builtin_constant_p() optimizations, but I guess Tom
>>> won't like gcc hacks ;-)
>>
>> I think it cannot be problem if you write it to some .h file (in port
>> directory?) as macro with "#ifdef GCC". The other thing is real
>> advantage of hacks like this in practical PG usage :-)
>
> The reason MemSet is a win is not that the C code is great but because
> it eliminates a function call.
>

Using memset (in place of MemSet) really did speed things up. I think 
the function call overhead is okay. As for real-world usage, the function 
ExecMakeFunctionResult dropped from the top of the list when profiling 
(now < 1% vs. 16% before)! This was doing a big nasty delete (w/ 
cascading) and insert in a cursor.

Here are results for a Mac G4 (single processor), OS X 10.3, using -O2. 
This time the Mac memset wins all around. Someone posted that this 
wasn't the case.

PG MemSet:
pgmemset_test 32
0.670u 0.020s 0:00.70 98.5%     0+0k 0+0io 0pf+0w
pgmemset_test 64
1.060u 0.000s 0:01.05 100.9%    0+0k 0+0io 0pf+0w
pgmemset_test 128
1.750u 0.010s 0:01.76 100.0%    0+0k 0+0io 0pf+0w
pgmemset_test 512
6.010u 0.030s 0:06.04 100.0%    0+0k 0+0io 0pf+0w

Mac memset:
memset_test 32
0.660u 0.020s 0:00.67 101.4%    0+0k 0+0io 0pf+0w
memset_test 64
0.720u 0.000s 0:00.72 100.0%    0+0k 0+0io 0pf+0w
memset_test 128
0.800u 0.010s 0:00.81 100.0%    0+0k 0+0io 0pf+0w
memset_test 512
1.470u 0.010s 0:01.48 100.0%    0+0k 0+0io 0pf+0w

I also checked setting a byte right after the memset, and it does slow 
things down a tiny bit. But the slowdown is the same for both MemSet and 
memset for sizes under 64.




Re: tweaking MemSet() performance - 7.4.5

From
Peter Eisentraut
Date:
Bruce Momjian wrote:
> MemSet was written when gcc 2.X wasn't even stable yet.  Have you run
> any tests on 3.4 to see if MemSet is still a win with that compiler?

I did a test years ago that showed that memset is usually at least 
as good as MemSet:

http://archives.postgresql.org/pgsql-patches/2002-10/msg00085.php

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/