Thread: tweaking MemSet() performance - 7.4.5
Hi,

I'm using 7.4.5 on Mac OS X (G5) and was profiling it to see why it is SO SLOW at committing inserts and deletes into a large database. One of the many slowdowns was from MemSet. I found an old (2002) thread about this and retried the tests (see below). The main point is that the system memset crushes pg's!! Is it possible to add a define to call the system memset at build time? This probably isn't the case on other systems.

I wanted to know the size of FunctionCallInfoData (in execQual.c) because the profiler said that if it was over 128 then use the system call.

Here are my results:

pgMemSet
*   64   0.410u 0.000s 0:00.42 97.6% 0+0k 0+0io 0pf+0w
*  128   0.600u 0.000s 0:00.61 98.3% 0+0k 0+0io 0pf+0w
*  176   0.790u 0.000s 0:00.79 100.0% 0+0k 0+0io 0pf+0w
         (Size of fcinfo is 176, used in execQual.c which was being very slow here!)
*  256   1.040u 0.000s 0:01.08 96.2% 0+0k 0+0io 0pf+0w
*  512   2.030u 0.000s 0:02.04 99.5% 0+0k 0+0io 0pf+0w
* 1024   3.950u 0.010s 0:03.94 100.5% 0+0k 0+0io 0pf+0w
* 2048   7.710u 0.000s 0:07.75 99.4% 0+0k 0+0io 0pf+0w
* 4096  15.390u 0.000s 0:15.37 100.1% 0+0k 0+0io 0pf+0w

system memset
*   64   0.260u 0.000s 0:00.25 104.0% 0+0k 0+0io 0pf+0w
*  128   0.310u 0.000s 0:00.31 100.0% 0+0k 0+0io 0pf+0w
*  176   0.300u 0.010s 0:00.30 103.3% 0+0k 0+0io 0pf+0w
         (Size of fcinfo is 176)
*  256   0.310u 0.000s 0:00.30 103.3% 0+0k 0+0io 0pf+0w
*  512   0.350u 0.000s 0:00.33 106.0% 0+0k 0+0io 0pf+0w
* 1024   0.590u 0.010s 0:00.63 95.2% 0+0k 0+0io 0pf+0w
* 2048   0.780u 0.000s 0:00.77 101.2% 0+0k 0+0io 0pf+0w
* 4096   1.320u 0.000s 0:01.33 99.2% 0+0k 0+0io 0pf+0w

    #include <string.h>
    #include "postgres.h"
    #include "fmgr.h"

    /* Raise the limit so MemSet always takes its inline loop */
    #undef MEMSET_LOOP_LIMIT
    #define MEMSET_LOOP_LIMIT 1000000

    int
    main(int argc, char **argv)
    {
        int         len = atoi(argv[1]);    /* block size in bytes */
        char        buffer[len];
        long long   i;
        FunctionCallInfoData fcinfo;

        /* sizeof yields a size_t, so cast it to match %d */
        printf("Size of fcinfo is %d\n", (int) sizeof(fcinfo));

        for (i = 0; i < 9900000; i++)
            MemSet(buffer, 0, len);
            /* for the system memset run: memset(buffer, 0, len); */

        return 0;
    }

Marc Colosimo <mcolosimo@mitre.org> writes:
> I'm using 7.4.5 on Mac OS X (G5) and was profiling it to see why it is
> SO SLOW at committing inserts and deletes into a large database. One
> of the many slowdowns was from MemSet. I found an old (2002) thread
> about this and retried the tests (see below). The main point is that
> the system memset crushes pg's!!

Hmm. I tried to duplicate this on my G4 laptop, and found that they were more or less on a par for small-to-middling block sizes (using "gcc -O2"). Darwin's memset code must have some additional tweaks for use on G5 hardware. Good for Apple --- this is the sort of thing that OS vendors *ought* to be doing. The fact that we can beat the system memset on so many platforms is an indictment of those platforms.

> Is it possible to add a define to call
> the system memset at build time! This probably isn't the case on other
> systems.

Feel free to hack the definition of MemSet in src/include/c.h. See the comments for it for more context.

Note that for small compile-time-constant block sizes (a case your test program doesn't test, but it's common in pgsql), gcc with a sufficiently high optimization setting can unroll the loop into a linear sequence of word zeroings. I would expect that to beat the system memset up to a few dozen words, no matter how tense the memset coding is. So you probably want to think in terms of reducing MEMSET_LOOP_LIMIT rather than diking out the macro code altogether. Or maybe reduce MemSet to "memset(...)" but leave MemSetAligned and/or MemSetTest/MemSetLoop as-is.

In any case, reporting results without mentioning the compiler and optimization level in use isn't going to convince anybody ...

regards, tom lane

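The trade-off Tom describes (an inline word loop for small blocks, the system memset() above a cut-off) can be sketched roughly as follows. This is a simplified illustration and not the actual MemSet macro from src/include/c.h: the names sketch_zero and SKETCH_LOOP_LIMIT are made up for this example, the limit of 64 is an arbitrary placeholder for MEMSET_LOOP_LIMIT, and the function is meant for int-typed or malloc'd storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-off in bytes; stands in for MEMSET_LOOP_LIMIT. */
#define SKETCH_LOOP_LIMIT 64

/*
 * Zero len bytes at start.  Small, word-aligned requests are cleared with
 * an inline word loop; everything else is handed to the system memset().
 * Returns 1 if the inline loop was used, 0 if memset() was called.
 */
static inline int
sketch_zero(void *start, size_t len)
{
    const size_t mask = sizeof(int32_t) - 1;

    assert(start != NULL);

    if (((uintptr_t) start & mask) == 0 &&  /* pointer is word aligned */
        (len & mask) == 0 &&                /* length is a whole number of words */
        len <= SKETCH_LOOP_LIMIT)           /* small enough for the loop */
    {
        int32_t *p = (int32_t *) start;
        int32_t *stop = (int32_t *) ((char *) start + len);

        while (p < stop)
            *p++ = 0;
        return 1;
    }

    memset(start, 0, len);
    return 0;
}
```

Lowering SKETCH_LOOP_LIMIT moves more requests onto the memset() path, which is the kind of tuning suggested above. When len is a compile-time constant, an optimizing compiler may unroll the loop entirely.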
On Sep 17, 2004, at 3:55 PM, Tom Lane wrote:
> [...]
> Note that for small compile-time-constant block sizes (a case your test
> program doesn't test, but it's common in pgsql), gcc with a sufficiently
> high optimization setting can unroll the loop into a linear sequence of
> words zeroings. I would expect that to beat the system memset up to a
> few dozen words, no matter how tense the memset coding is. So you
> probably want to think in terms of reducing MEMSET_LOOP_LIMIT rather
> than diking out the macro code altogether. Or maybe reduce MemSet to
> "memset(...)" but leave MemSetAligned and/or MemSetTest/MemSetLoop
> as-is. In any case, reporting results without mentioning the compiler
> and optimization level in use isn't going to convince anybody ...

Oops, I used the same settings as in the old hackers message (-O2, gcc 3.3).

If I understand what you are saying, then it turns out that yes, PG's MemSet is faster for smaller block sizes (see below; the crossover is between 32 and 64 bytes). I just replaced the whole MemSet with memset and it is now very low when I profile. I could squeeze more out of it if I spent more time trying to understand it (change MEMSET_LOOP_LIMIT to 32 and then add memset after that?).

I'm now working on understanding spin locks and friends. Putting in a sync call (in s_lock.h) is really a time killer and bad for performance (it takes up 35 cycles).

These were run on a single-processor G5 (1.8 GHz; the earlier run was on a dual-processor 2 GHz G5).

pgMemSet:
*  4   0.070u 0.000s 0:00.15 46.6% 0+0k 0+0io 0pf+0w
*  8   0.090u 0.000s 0:00.16 56.2% 0+0k 0+0io 0pf+0w
* 16   0.120u 0.000s 0:00.17 70.5% 0+0k 0+0io 0pf+0w
* 32   0.180u 0.000s 0:00.29 62.0% 0+0k 0+0io 0pf+0w
* 64   0.450u 0.000s 0:00.92 48.9% 0+0k 0+0io 0pf+0w

memset:
*  4   0.170u 0.010s 0:00.44 40.9% 0+0k 0+0io 0pf+0w
*  8   0.190u 0.000s 0:00.42 45.2% 0+0k 0+0io 0pf+0w
* 16   0.190u 0.010s 0:00.39 51.2% 0+0k 0+0io 0pf+0w
* 32   0.200u 0.000s 0:00.39 51.2% 0+0k 0+0io 0pf+0w
* 64   0.260u 0.000s 0:00.38 68.4% 0+0k 0+0io 0pf+0w

Marc

Marc Colosimo wrote:
> Oops, I used the same setting as in the old hacking message (-O2, gcc
> 3.3). If I understand what you are saying, then it turns out yes, PG's
> MemSet is faster for smaller blocksizes (see below, between 32 and
> 64). I just replaced the whole MemSet with memset and it is now very
> low when I profile.

Could you check what the OS-X memset function does internally?

One trick to speed up memset is to bypass the cache and bulk-write directly from write buffers to main memory. i386 CPUs support that, and in microbenchmarks it's 3 times faster (or something like that). Unfortunately it's a loss in real-world tests: typically a structure is initialized with memset and then immediately accessed. If the memset bypasses the cache, then the following access will cause a cache line miss, which can be so slow that using the faster memset can result in a net performance loss.

> I could squeeze more out of it if I spent more time trying to
> understand it (change MEMSET_LOOP_LIMIT to 32 and then add memset
> after that?). I'm now working on understanding Spin Locks and
> friends. Putting in a sync call (in s_lock.h) is really a time killer
> and bad for performance (it takes up 35 cycles).

That's the price you pay for weakly ordered memory access. Linux on ppc uses eieio; on ppc64, lwsync is used. Could you check if they are faster?

--
Manfred

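The access pattern Manfred describes, a structure cleared with memset and then read immediately, can be exercised with a loop along these lines. This is only a sketch: SketchCallInfo is a made-up stand-in rather than PostgreSQL's FunctionCallInfoData, and the volatile pointer is merely an attempt to discourage the compiler from folding the reads away, which an aggressive optimizer may still partly defeat.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a structure such as FunctionCallInfoData. */
typedef struct SketchCallInfo
{
    void   *fn;
    int     nargs;
    long    args[16];
    char    argnull[16];
} SketchCallInfo;

/*
 * Clear the struct, then read a few fields back right away, 'rounds' times.
 * Returns the sum of the values read, which is always 0 after a correct
 * clear; returning it gives the compiler a reason to keep the reads.
 */
static long
sketch_set_then_read(long rounds)
{
    SketchCallInfo info;
    volatile SketchCallInfo *vp = &info;    /* reads go through a volatile lvalue */
    long    sum = 0;
    long    i;

    assert(rounds >= 0);

    for (i = 0; i < rounds; i++)
    {
        memset(&info, 0, sizeof(info));
        sum += vp->nargs + vp->args[0] + vp->args[15] + vp->argnull[15];
    }
    return sum;
}
```

Calling sketch_set_then_read() with a large round count from a small driver and timing it with the shell's time command, once per memset implementation, would give a number comparable to the ones posted in this thread.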
Manfred Spraul <manfred@colorfullife.com> writes:
> That's the price you pay for weakly ordered memory access.
> Linux on ppc uses eieio, on ppc64 lwsync is used. Could you check if
> they are faster?

I recall looking at lwsync and being concerned about portability problems --- older assemblers will fail to recognize it. I'd want to see some hard evidence that changing sync to lwsync would be a significant performance win before taking any portability risk here.

regards, tom lane

>Marc Colosimo wrote:
>
>> Oops, I used the same setting as in the old hacking message (-O2, gcc
>> 3.3). If I understand what you are saying, then it turns out yes, PG's
>> MemSet is faster for smaller blocksizes (see below, between 32 and
>> 64). I just replaced the whole MemSet with memset and it is now very
>> low when I profile.
>
>Could you check what the OS-X memset function does internally?
>One trick to speed up memset is to bypass the cache and bulk-write
>directly from write buffers to main memory. i386 cpus support that and
>in microbenchmarks it's 3 times faster (or something like that).
>Unfortunately it's a loss in real-world tests: Typically a structure is
>initialized with memset and then immediately accessed. If the memset
>bypasses the cache then the following access will cause a cache line
>miss, which can be so slow that using the faster memset can result in a
>net performance loss.

Could you suggest some structs to test? If I get your meaning, I would make a loop that sets and then reads from the structure.

>> I could squeeze more out of it if I spent more time trying to
>> understand it (change MEMSET_LOOP_LIMIT to 32 and then add memset
>> after that?). I'm now working on understanding Spin Locks and
>> friends. Putting in a sync call (in s_lock.h) is really a time killer
>> and bad for performance (it takes up 35 cycles).
>>
>That's the price you pay for weakly ordered memory access.
>Linux on ppc uses eieio, on ppc64 lwsync is used. Could you check if
>they are faster?

I found the reason why "sync" was put in <http://archives.postgresql.org/pgsql-bugs/2002-09/msg00239.php>, but it is odd why it works. Why syncing one processor prevents the other from doing something is interesting.

What type of shared memory is being used on OS X? I'm confused about the two types of semaphores, SysV or POSIX. <http://archives.postgresql.org/pgsql-patches/2001-01/msg00052.php> It seems that POSIX is the way to go on OS X.

Marc

mcolosimo@mitre.org wrote:
>> If the memset
>> bypasses the cache then the following access will cause a cache line
>> miss, which can be so slow that using the faster memset can result in a
>> net performance loss.
>
> Could you suggest some structs to test? If I get your meaning, I would
> make a loop that sets then reads from the structure.

Read the sources and the cpu specs. Benchmarking such problems is virtually impossible.

I don't have OS-X, thus I checked the Linux-kernel sources: It seems that the power architecture doesn't have the same problem as x86. There is a special clear cacheline instruction for large memsets and the rest is done through carefully optimized store byte/halfword/word/double word sequences.

Thus I'd check what happens if you memset not perfectly aligned buffers. That's another point where over-optimized functions sometimes break down. If there is no slowdown, then I'd replace the postgres function with the OS provided function.

I'd add some __builtin_constant_p() optimizations, but I guess Tom won't like gcc hacks ;-)

--
Manfred

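One way to try the misaligned-buffer check suggested above is a small helper that clears a region starting at a chosen byte offset and then verifies that nothing outside the region was touched. This is a sketch under assumptions: the name sketch_memset_at_offset and the size limits are invented for illustration, and since the repeated memset calls write the same bytes, an optimizer may legally merge them, so any timing taken from it should be read with care.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_GUARD    16      /* untouched bytes kept on each side */
#define SKETCH_MAX_OFF  15      /* largest start offset supported */
#define SKETCH_MAX_LEN  4096    /* largest block size supported */

/*
 * memset() len bytes, 'rounds' times, at byte offset 'off' into a buffer.
 * Afterwards check that exactly those len bytes are zero and that every
 * other byte still holds the 0xAA fill pattern.  Returns 1 on success.
 */
static int
sketch_memset_at_offset(size_t off, size_t len, long rounds)
{
    static unsigned char buf[SKETCH_MAX_LEN + 2 * SKETCH_GUARD + SKETCH_MAX_OFF];
    unsigned char *target = buf + SKETCH_GUARD + off;
    size_t  first = SKETCH_GUARD + off;     /* index of first cleared byte */
    size_t  i;
    long    r;

    assert(off <= SKETCH_MAX_OFF);
    assert(len <= SKETCH_MAX_LEN);
    assert(rounds >= 1);

    memset(buf, 0xAA, sizeof(buf));

    for (r = 0; r < rounds; r++)
        memset(target, 0, len);

    for (i = 0; i < sizeof(buf); i++)
    {
        int     inside = (i >= first && i < first + len);

        if (buf[i] != (inside ? 0x00 : 0xAA))
            return 0;
    }
    return 1;
}
```

Running it for offsets 0 through 7 with the block sizes used earlier in the thread (for example 176) and timing each run separately would show whether the system memset slows down on unaligned starts.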
On Sat, 2004-09-25 at 23:23 +0200, Manfred Spraul wrote:
> [...]
> Thus I'd check what happens if you memset not perfectly aligned buffers.
> That's another point where over-optimized functions sometimes break
> down. If there is no slowdown, then I'd replace the postgres function
> with the OS provided function.
>
> I'd add some __builtin_constant_p() optimizations, but I guess Tom won't
> like gcc hacks ;-)

I don't think it would be a problem if you write it in some .h file (in the port directory?) as a macro with "#ifdef GCC". The other question is what the real advantage of hacks like this is in practical PG usage :-)

Karel

--
Karel Zak
http://home.zf.jcu.cz/~zakkr

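A compiler-guarded macro of the kind being discussed might look roughly like this. It is a sketch, not a proposed patch: SKETCH_ZERO, sketch_inline_zero and SKETCH_INLINE_LIMIT are invented names, the 64-byte limit is arbitrary, and the guard uses __GNUC__, the macro GCC itself predefines, rather than the literal "GCC" mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_INLINE_LIMIT 64  /* hypothetical cut-off, in bytes */

/* Byte-wise clear; simple enough for a compiler to unroll when len is constant. */
static inline void *
sketch_inline_zero(void *start, size_t len)
{
    unsigned char *p = (unsigned char *) start;
    size_t  i;

    assert(start != NULL || len == 0);

    for (i = 0; i < len; i++)
        p[i] = 0;
    return start;
}

#if defined(__GNUC__)
/*
 * GCC and compatible compilers: take the inline path only when the length
 * is a small compile-time constant, otherwise call memset().
 * Note that 'len' is evaluated more than once.
 */
#define SKETCH_ZERO(start, len) \
    ((__builtin_constant_p(len) && (len) <= SKETCH_INLINE_LIMIT) \
        ? sketch_inline_zero((start), (len)) \
        : memset((start), 0, (len)))
#else
/* Other compilers: always use the system memset(). */
#define SKETCH_ZERO(start, len) memset((start), 0, (len))
#endif
```

Both branches clear the same bytes, so the macro only changes how the work is done, never the result. Whether it pays off in real PostgreSQL workloads is exactly the open question raised above.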
Karel Zak wrote:
> On Sat, 2004-09-25 at 23:23 +0200, Manfred Spraul wrote:
> > [...]
> > I'd add some __builtin_constant_p() optimizations, but I guess Tom won't
> > like gcc hacks ;-)
>
> I think it cannot be problem if you write it to some .h file (in port
> directory?) as macro with "#ifdef GCC". The other thing is real
> advantage of hacks like this in practical PG usage :-)

The reason MemSet is a win is not that the C code is great but because it eliminates a function call.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

On Wed, 2004-09-29 at 21:37, Bruce Momjian wrote:
> The reason MemSet is a win is not that the C code is great but because
> it eliminates a function call.

A reasonable compiler ought to be able to implement memset() as a compiler intrinsic where it makes sense to do so. MSVC++ can certainly do this; per the GCC 3.4 docs, it seems GCC can/does as well:

    The ISO C90 functions abort, abs, acos, asin, atan2, atan, calloc,
    ceil, cosh, cos, exit, exp, fabs, floor, fmod, fprintf, fputs, frexp,
    fscanf, labs, ldexp, log10, log, malloc, memcmp, memcpy, memset, modf,
    pow, printf, putchar, puts, scanf, sinh, sin, snprintf, sprintf, sqrt,
    sscanf, strcat, strchr, strcmp, strcpy, strcspn, strlen, strncat,
    strncmp, strncpy, strpbrk, strrchr, strspn, strstr, tanh, tan,
    vfprintf, vprintf and vsprintf are all recognized as built-in
    functions unless -fno-builtin is specified (or -fno-builtin-function
    is specified for an individual function). All of these functions have
    corresponding versions prefixed with __builtin_.

(http://gcc.gnu.org/onlinedocs/gcc-3.4.2/gcc/Other-Builtins.html#Other-Builtins)

-Neil

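The case Neil has in mind can be illustrated with a plain memset() call whose length is a compile-time constant. In this sketch SketchNode and sketch_init_node are invented names; whether a given compiler actually expands the call inline depends on its version and optimization level, so the comment describes what is permitted rather than what is guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical small structure of the kind that gets zeroed frequently. */
typedef struct SketchNode
{
    int     tag;
    long    value;
    void   *next;
} SketchNode;

static void
sketch_init_node(SketchNode *node)
{
    assert(node != NULL);

    /*
     * sizeof(*node) is a compile-time constant.  A compiler that treats
     * memset as a built-in is free to replace this call with a few inline
     * stores instead of calling the library function.
     */
    memset(node, 0, sizeof(*node));
}
```

One way to see what a particular compiler does is to compile with "gcc -O2 -S" and look for a call to memset in the assembly, then repeat with -fno-builtin-memset (the per-function form of the option quoted above) for comparison.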
Neil Conway wrote:
> On Wed, 2004-09-29 at 21:37, Bruce Momjian wrote:
> > The reason MemSet is a win is not that the C code is great but because
> > it eliminates a function call.
>
> A reasonable compiler ought to be able to implement memset() as a
> compiler intrinsic where it makes sense to do so. MSVC++ can certainly
> do this; per the GCC 3.4 docs, it seems GCC can/does as well:
> [...]
> (http://gcc.gnu.org/onlinedocs/gcc-3.4.2/gcc/Other-Builtins.html#Other-Builtins)

MemSet was written when gcc 2.X wasn't even stable yet. Have you run any tests on 3.4 to see if MemSet is still a win with that compiler?

--
  Bruce Momjian

On Sep 29, 2004, at 7:37 AM, Bruce Momjian wrote:
> Karel Zak wrote:
>> On Sat, 2004-09-25 at 23:23 +0200, Manfred Spraul wrote:
>>> [...]
>>> Thus I'd check what happens if you memset not perfectly aligned
>>> buffers.
>>> That's another point where over-optimized functions sometimes break
>>> down. If there is no slowdown, then I'd replace the postgres function
>>> with the OS provided function.

All memory (via malloc and friends) will be aligned on OS X, unless you remove padding (which I don't think you do).

>>> I'd add some __builtin_constant_p() optimizations, but I guess Tom
>>> won't like gcc hacks ;-)
>>
>> I think it cannot be problem if you write it to some .h file (in port
>> directory?) as macro with "#ifdef GCC". The other thing is real
>> advantage of hacks like this in practical PG usage :-)
>
> The reason MemSet is a win is not that the C code is great but because
> it eliminates a function call.

Using memset really did speed things up. I think the function call overhead is okay. As for real-world usage, the function ExecMakeFunctionResult dropped from the top of the list when profiling (now < 1% vs 16% before)! This was doing a big, nasty delete (with cascading) and insert in a cursor.

Here are results for a Mac G4 (single processor), OS 10.3, using -O2. This time the Mac memset wins all around. Someone posted that this wasn't the case.

PG MemSet:
pgmemset_test  32   0.670u 0.020s 0:00.70 98.5% 0+0k 0+0io 0pf+0w
pgmemset_test  64   1.060u 0.000s 0:01.05 100.9% 0+0k 0+0io 0pf+0w
pgmemset_test 128   1.750u 0.010s 0:01.76 100.0% 0+0k 0+0io 0pf+0w
pgmemset_test 512   6.010u 0.030s 0:06.04 100.0% 0+0k 0+0io 0pf+0w

Mac memset:
memset_test  32   0.660u 0.020s 0:00.67 101.4% 0+0k 0+0io 0pf+0w
memset_test  64   0.720u 0.000s 0:00.72 100.0% 0+0k 0+0io 0pf+0w
memset_test 128   0.800u 0.010s 0:00.81 100.0% 0+0k 0+0io 0pf+0w
memset_test 512   1.470u 0.010s 0:01.48 100.0% 0+0k 0+0io 0pf+0w

I also checked setting a byte right after the memset, and it does slow things down a tiny bit. But the slowdown is the same for both MemSet and memset for sizes under 64.

Bruce Momjian wrote:
> MemSet was written when gcc 2.X wasn't even stable yet. Have you run
> any tests on 3.4 to see if MemSet is still a win with that compiler?

I did a test years ago that showed that memset is usually at least as good as MemSet:

http://archives.postgresql.org/pgsql-patches/2002-10/msg00085.php

--
Peter Eisentraut
http://developer.postgresql.org/~petere/