Thread: tweaking MemSet() performance
In include/c.h, MemSet() is defined to be different than the stock
function memset() only when copying less than or equal to
MEMSET_LOOP_LIMIT bytes (currently 64). The comments above the macro
definition note:

 * We got the 64 number by testing this against the stock memset() on
 * BSD/OS 3.0. Larger values were slower. bjm 1997/09/11
 *
 * I think the crossover point could be a good deal higher for
 * most platforms, actually. tgl 2000-03-19

I decided to investigate Tom's suggestion and determine the
performance of MemSet() versus memset() on my machine, for various
values of MEMSET_LOOP_LIMIT. The machine this is being tested on is a
Pentium 4 1.8 GHz with RDRAM, running Linux 2.4.19pre8 with GCC 3.1.1
and glibc 2.2.5 -- the results may or may not apply to other
machines.

The test program was:

    #include <string.h>
    #include "postgres.h"

    #undef MEMSET_LOOP_LIMIT
    #define MEMSET_LOOP_LIMIT BUFFER_SIZE

    int
    main(void)
    {
        char buffer[BUFFER_SIZE];
        long long i;

        for (i = 0; i < 99000000; i++)
        {
            MemSet(buffer, 0, sizeof(buffer));
        }

        return 0;
    }

(I manually changed MemSet() to memset() when testing the performance
of the latter function.)

It was compiled like so:

    gcc -O2 -DBUFFER_SIZE=xxx -Ipgsql/src/include memset.c

(The -O2 optimization flag is important: the results are significantly
different if it is not used.)

Here are the results (each timing is the 'total' listing from 'time
./a.out'):

BUFFER_SIZE = 64
    MemSet() -> 2.756, 2.810, 2.789
    memset() -> 13.844, 13.782, 13.778

BUFFER_SIZE = 128
    MemSet() -> 5.848, 5.989, 5.861
    memset() -> 15.637, 15.631, 15.631

BUFFER_SIZE = 256
    MemSet() -> 9.602, 9.652, 9.633
    memset() -> 19.305, 19.370, 19.302

BUFFER_SIZE = 512
    MemSet() -> 17.416, 17.462, 17.353
    memset() -> 26.657, 26.658, 26.678

BUFFER_SIZE = 1024
    MemSet() -> 32.144, 32.179, 32.086
    memset() -> 41.186, 41.115, 41.176

BUFFER_SIZE = 2048
    MemSet() -> 60.39, 60.48, 60.32
    memset() -> 71.19, 71.18, 71.17

BUFFER_SIZE = 4096
    MemSet() -> 118.29, 120.07, 118.69
    memset() -> 131.40, 131.41

... at which point I stopped benchmarking.

Is the benchmark above a reasonable assessment of memset() / MemSet()
performance when copying word-aligned amounts of memory? If so, what's
a good value for MEMSET_LOOP_LIMIT (perhaps 512)?

Also, if anyone would like to contribute the results of doing the
benchmark on their particular system, that might provide some useful
additional data points.

Cheers,

Neil

-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
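For readers who do not have c.h at hand, the idea Neil describes can be sketched roughly as follows. This is an illustration only, not the text of the real macro: the names sketch_memset and SKETCH_LOOP_LIMIT are made up here, and the exact set of conditions (aligned start, whole number of words, zero fill value) is an assumption based on the "word-aligned" remark above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical threshold; the thread debates 64, 512 and 1024. */
#define SKETCH_LOOP_LIMIT 1024

/*
 * Zero-fill with an inline word loop when the request is small and
 * word-aligned; in every other case defer to the library memset().
 * Callers should pass memory that may legally be accessed as uint32_t.
 */
static inline void
sketch_memset(void *start, int val, size_t len)
{
    const size_t mask = sizeof(uint32_t) - 1;

    if (((uintptr_t) start & mask) == 0 &&  /* start is word-aligned */
        (len & mask) == 0 &&                /* whole number of words */
        val == 0 &&                         /* the loop only writes zeros */
        len <= SKETCH_LOOP_LIMIT)           /* small enough for the loop */
    {
        uint32_t *p = (uint32_t *) start;
        uint32_t *stop = (uint32_t *) ((char *) start + len);

        while (p < stop)
            *p++ = 0;
    }
    else
        memset(start, val, len);
}
```

The question in this thread is simply where SKETCH_LOOP_LIMIT (MEMSET_LOOP_LIMIT in the real header) should sit so that the loop branch is taken only when it is actually the faster one.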
I consider this a very good test. As you can see from the date of my
last test, 1997/09/11, I think I may have had a dual Pentium Pro at that
point, and hardware has certainly changed since then. I did try 128 at
that time and found it to be slower, but with newer hardware, it is very
possible it has improved.

I remember in writing that macro how surprised I was that there was any
improvement, but obviously there is a gain and the gain is getting
bigger.

I tested the following program:

    #include <stdlib.h>     /* for atoi() */
    #include <string.h>
    #include "postgres.h"

    #undef MEMSET_LOOP_LIMIT
    #define MEMSET_LOOP_LIMIT 1000000

    int
    main(int argc, char **argv)
    {
        int len = atoi(argv[1]);    /* buffer length from the command line */
        char buffer[len];
        long long i;

        for (i = 0; i < 9900000; i++)
            MemSet(buffer, 0, len);
        return 0;
    }

and, yes, -O2 is significant! Looks like we use -O2 on all platforms
that use GCC so we should be OK there.

I tested with the following script:

    for TIME in 64 128 256 512 1024 2048 4096; do echo "*$TIME\c"; time tst1 $TIME; done

and got for MemSet:

    size    real        user        sys
    *64     0m1.001s    0m1.000s    0m0.003s
    *128    0m1.578s    0m1.567s    0m0.013s
    *256    0m2.723s    0m2.723s    0m0.003s
    *512    0m5.044s    0m5.029s    0m0.013s
    *1024   0m9.621s    0m9.621s    0m0.003s
    *2048   0m18.821s   0m18.811s   0m0.013s
    *4096   0m37.266s   0m37.266s   0m0.003s

and for memset():

    size    real        user        sys
    *64     0m1.813s    0m1.801s    0m0.014s
    *128    0m2.489s    0m2.499s    0m0.994s
    *256    0m4.397s    0m5.389s    0m0.005s
    *512    0m5.186s    0m6.170s    0m0.015s
    *1024   0m6.676s    0m6.676s    0m0.003s
    *2048   0m9.766s    0m9.776s    0m0.994s
    *4096   0m15.970s   0m15.954s   0m0.003s

so for BSD/OS, the break-even is 512.

I am on a dual P3/550 using 2.95.2. I will tell you exactly why my
break-even is lower than most: I have assembly language memset()
functions in libc on BSD/OS.

I suggest changing the MEMSET_LOOP_LIMIT to 512.
-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Thu, Aug 29, 2002 at 01:27:41AM -0400, Neil Conway wrote:
>
> Also, if anyone would like to contribute the results of doing the
> benchmark on their particular system, that might provide some useful
> additional data points.

Ok, here's a run on a Sun E450, Solaris 7. I presume your "total"
time label corresponds to my "real" time. That's what I'm including,
anyway.

System Configuration: Sun Microsystems sun4u Sun Enterprise 450 (2
X UltraSPARC-II 400MHz)
System clock frequency: 100 MHz
Memory size: 2560 Megabytes

BUFFER_SIZE = 64
    MemSet(): 0m13.343s,12.567s,13.659s
    memset(): 0m1.255s,0m1.258s,0m1.254s

BUFFER_SIZE = 128
    MemSet(): 0m21.347s,0m21.200s,0m20.541s
    memset(): 0m18.041s,0m17.963s,0m17.990s

BUFFER_SIZE = 256
    MemSet(): 0m38.023s,0m37.480s,0m37.631s
    memset(): 0m25.969s,0m26.047s,0m26.012s

BUFFER_SIZE = 512
    MemSet(): 1m9.226s,1m9.901s,1m10.148s
    memset(): 2m17.897s,2m18.310s,2m17.984s

BUFFER_SIZE = 1024
    MemSet(): 2m13.690s,2m13.981s,2m13.206s
    memset(): 4m43.195s,4m43.405s,4m43.390s

. . .at which point I gave up.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
Andrew Sullivan wrote:
> On Thu, Aug 29, 2002 at 01:27:41AM -0400, Neil Conway wrote:
> >
> > Also, if anyone would like to contribute the results of doing the
> > benchmark on their particular system, that might provide some useful
> > additional data points.
>
> Ok, here's a run on a Sun E450, Solaris 7. I presume your "total"
> time label corresponds to my "real" time. That's what I'm including,
> anyway.

Now, these are unusual results. In the 64 case, MemSet is dramatically
slower, and it only starts to win around 512, and seems to speed up
after that.

These are strange results. The idea of MemSet was to prevent the
function call overhead for memset, but in such a case, you would think
the function call overhead would reduce as a percentage of the total
time as the buffer got longer.

In your results it seems to suggest that memset() gets slower for longer
buffer lengths, and a for loop starts to win at longer sizes. Should I
pull out my Solaris kernel source and see what memset() is doing?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Thu, 29 Aug 2002 19:35:13 -0400 (EDT),
Bruce Momjian <pgman@candle.pha.pa.us> wrote:

> In your results it seems to suggest that memset() gets slower for longer
> buffer lengths, and a for loop starts to win at longer sizes. Should I
> pull out my Solaris kernel source and see what memset() is doing?

No, because memset() belongs to the libc AFAICS... Do you have source
code for that?

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"To think that the specter we see is illusory does not rob it of its
terror; it only adds the new terror of madness" (Perelandra, C.S. Lewis)
On Thu, 2002-08-29 at 18:53, Alvaro Herrera wrote:
> On Thu, 29 Aug 2002 19:35:13 -0400 (EDT),
> Bruce Momjian <pgman@candle.pha.pa.us> wrote:
>
> > In your results it seems to suggest that memset() gets slower for longer
> > buffer lengths, and a for loop starts to win at longer sizes. Should I
> > pull out my Solaris kernel source and see what memset() is doing?
>
> No, because memset() belongs to the libc AFAICS... Do you have source
> code for that?

And if you do, what vintage is it? I believe Solaris has mucked with
stuff over the last few rev's.

-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Alvaro Herrera wrote:
> On Thu, 29 Aug 2002 19:35:13 -0400 (EDT),
> Bruce Momjian <pgman@candle.pha.pa.us> wrote:
>
> > In your results it seems to suggest that memset() gets slower for longer
> > buffer lengths, and a for loop starts to win at longer sizes. Should I
> > pull out my Solaris kernel source and see what memset() is doing?
>
> No, because memset() belongs to the libc AFAICS... Do you have source
> code for that?

You bet. I have source code to it all, libs, /bin, etc.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Larry Rosenman wrote:
> > No, because memset() belongs to the libc AFAICS... Do you have source
> > code for that?
> and if you do, what vintage is it? I believe Solaris has mucked with
> stuff over the last few rev's.

8.0. Looks like there is interest, so I will dig the CDs out of the
box the movers moved and take a look. Now where is that...

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Larry Rosenman wrote:
> > No, because memset() belongs to the libc AFAICS... Do you have source
> > code for that?
> and if you do, what vintage is it? I believe Solaris has mucked with
> stuff over the last few rev's.

OK, I am not permitted to discuss the contents of the source with anyone
except other Solaris source licensees, but I can say that there isn't
anything fancy in the source. There is nothing that would explain the
slowdown of memset >512 bytes compared to MemSet. All lengths 64, 128,
... use the same algorithm in the memset code.

I got the source from the now-cancelled Solaris Foundation Source
Program:

    http://wwws.sun.com/software/solaris/source/

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Would you please retest this. I have attached my email showing a
simpler test that is less error-prone.

I can't come up with any scenario that would produce what you have
reported. If I look at function call cost, MemSet loop efficiency, and
memset loop efficiency, I can't come up with a combination that produces
what you reported.

The standard assumption is that function call overhead is significant,
and that memset is faster than C MemSet. What compiler are you using?
Is the memset() call being inlined by the compiler? You will have to
look at the assembler code to be sure.

My only guess is that memset is inlined and that it is only moving
single bytes. If that is the case, there is no function call overhead
and it would explain why MemSet gets faster as the buffer gets larger.

---------------------------------------------------------------------------

Andrew Sullivan wrote:
> Ok, here's a run on a Sun E450, Solaris 7. I presume your "total"
> time label corresponds to my "real" time. That's what I'm including,
> anyway.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
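Reading the assembler output (gcc's -S option) is the direct way to answer the inlining question. As a complementary check, one common trick is to call memset() through a volatile function pointer, which the compiler cannot expand inline, and compare its timing with a direct call. The sketch below only shows the two call forms; fill_via_call, fill_direct and memset_call are hypothetical names, and nothing here was run by anyone in the thread.

```c
#include <stddef.h>
#include <string.h>

/*
 * Because the pointer is volatile, the compiler must reload it at every
 * call and cannot assume it still points at memset, so it has to emit a
 * genuine function call instead of an inline expansion.
 */
static void *(*volatile memset_call)(void *, int, size_t) = memset;

/* Always pays the function call overhead. */
void
fill_via_call(void *buf, int val, size_t len)
{
    memset_call(buf, val, len);
}

/* The compiler is free to inline or otherwise specialize this call. */
void
fill_direct(void *buf, int val, size_t len)
{
    memset(buf, val, len);
}
```

If the two variants time the same at a given size, the direct call is probably not being inlined there; a large gap would point toward the inlining guess above.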
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Would you please retest this. I have attached my email showing a
> simpler test that is less error-prone.

What did you consider less error-prone, exactly?

Neil's original test considered the case where both the value being
set and the buffer length (second and third args of MemSet) are
compile-time constants. Your test used a compile-time-constant second
arg and a variable third arg. It's obvious from looking at the source
of MemSet that this will make a difference in what an optimizing
compiler can do.

I believe that both cases are interesting in practice in the Postgres
sources, but I have no idea about their relative frequency of
occurrence.

FWIW, I get numbers like the following for the constant-third-arg
scenario, using "gcc -O2" with gcc 2.95.3 on HPUX 10.20, HPPA C180
processor:

    bufsize     MemSet      memset
    64          0m1.71s     0m4.89s
    128         0m2.51s     0m5.36s
    256         0m4.11s     0m7.02s
    512         0m7.32s     0m10.31s
    1024        0m13.74s    0m16.90s
    2048        0m26.58s    0m30.08s
    4096        0m52.24s    0m56.43s

So I'd go for a crossover point of *at least* 512. IIRC, I got
similar numbers two years ago that led me to put the comment into c.h
that Neil is reacting to...

			regards, tom lane
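The distinction Tom draws can be made concrete with two call sites for the same macro. In this sketch ZERO_FILL is a simplified, hypothetical stand-in for MemSet() (zero fill only, LOOP_LIMIT chosen arbitrarily). With a constant length the compiler may be able to evaluate the size tests at compile time and specialize the loop; with a run-time length both the tests and the loop bound have to stay in the generated code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOOP_LIMIT 1024         /* hypothetical threshold */

/* Simplified stand-in for MemSet(): zero fill, word loop or memset(). */
#define ZERO_FILL(start, len) \
    do { \
        uint32_t *p_ = (uint32_t *) (start); \
        size_t len_ = (len); \
        if ((len_ & (sizeof(uint32_t) - 1)) == 0 && len_ <= LOOP_LIMIT) \
        { \
            uint32_t *stop_ = (uint32_t *) ((char *) p_ + len_); \
            while (p_ < stop_) \
                *p_++ = 0; \
        } \
        else \
            memset(p_, 0, len_); \
    } while (0)

/* Length is a compile-time constant, as in Neil's test program. */
void
clear_fixed(uint32_t buf[64])
{
    ZERO_FILL(buf, 64 * sizeof(uint32_t));
}

/* Length is only known at run time, as in Bruce's test program. */
void
clear_variable(uint32_t *buf, size_t len)
{
    ZERO_FILL(buf, len);
}
```

Benchmarking both forms separately would show how much of the measured difference comes from the compiler's knowledge of the length rather than from the fill itself.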
On Thu, Aug 29, 2002 at 07:35:13PM -0400, Bruce Momjian wrote:
> Andrew Sullivan wrote:
> > Ok, here's a run on a Sun E450, Solaris 7. I presume your "total"
> > time label corresponds to my "real" time. That's what I'm including,
> > anyway.
>
> Now, these are unusual results. In the 64 case, MemSet is dramatically
> slower, and it only starts to win around 512, and seems to speed up
> after that.
>
> These are strange results. The idea of MemSet was to prevent the

Yes, I was rather surprised, too. In fact, the first couple of runs
I thought I must have made a mistake and compiled with (for instance)
MemSet() instead of memset(). But I triple-checked, and I hadn't.

FWIW, here's an example of what I used to call the compiler:

    gcc -O2 -DBUFFER_SIZE=1024 -Ipostgresql-7.2.1/src/include/ memset.c

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
On Thu, Aug 29, 2002 at 11:07:03PM -0400, Bruce Momjian wrote:

> and that memset it faster than C MemSet. What compiler are you using?

Sorry. Should have included that.

    $gcc --version
    2.95.3

> Is the memset() call being inlined by the compiler? You will have to
> look at the assembler code to be sure.

No idea. I can maybe check this out later, but I'll have to ask one
of my colleagues for help. My knowledge of what I am looking at runs
out way before looking at assembler code :(

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Would you please retest this. I have attached my email showing a
> > simpler test that is less error-prone.
>
> What did you consider less error-prone, exactly?
>
> Neil's original test considered the case where both the value being
> set and the buffer length (second and third args of MemSet) are
> compile-time constants. Your test used a compile-time-constant second
> arg and a variable third arg. It's obvious from looking at the source
> of MemSet that this will make a difference in what an optimizing
> compiler can do.

It was less error-prone because you don't have to recompile for every
constant. Your point that a non-constant length may affect the
optimizer is possible, though I assumed that for lengths >= 64 the
length would not be significant to the optimizer.

Should we take it to 1024 as a switchover point? I am low at 512, and
others are higher, so 1024 seems like a good average.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Thu, Aug 29, 2002 at 11:07:03PM -0400, Bruce Momjian wrote:
>
> Would you please retest this. I have attached my email showing a
> simpler test that is less error-prone.

Ok, here you go. Same machine as before, 2-way UltraSPARC-II @400
MHz, 2.5 G, gcc 2.95.3, Solaris 7. This gcc compiles 32 bit apps.

MemSet():

    size    real        user        sys
    *64     0m1.298s    0m1.290s    0m0.010s
    *128    0m2.251s    0m2.250s    0m0.000s
    *256    0m3.734s    0m3.720s    0m0.010s
    *512    0m7.041s    0m7.010s    0m0.020s
    *1024   0m13.353s   0m13.350s   0m0.000s
    *2048   0m26.178s   0m26.040s   0m0.000s
    *4096   0m51.838s   0m51.670s   0m0.010s

and memset()

    size    real        user        sys
    *64     0m1.469s    0m1.460s    0m0.000s
    *128    0m1.813s    0m1.810s    0m0.000s
    *256    0m2.747s    0m2.730s    0m0.010s
    *512    0m12.478s   0m12.370s   0m0.010s
    *1024   0m26.128s   0m26.010s   0m0.000s
    *2048   0m57.663s   0m57.320s   0m0.010s
    *4096   1m53.772s   1m53.290s   0m0.000s

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
Andrew Sullivan wrote:
> On Thu, Aug 29, 2002 at 01:27:41AM -0400, Neil Conway wrote:
> > Also, if anyone would like to contribute the results of doing the
> > benchmark on their particular system, that might provide some useful
> > additional data points.

Linux 2.4.18 (preempt) gcc 2.95.4 Dual Athlon 1600XP 1Gb DDR

    size    memset()    MemSet()
    *64     2.999s      3.640s
    *128    4.211s      5.933s
    *256    6.624s      14.889s
    *512    11.182s     28.583s
    *1024   20.288s     55.853s
    *2048   38.513s     1m50.555s
    *4096   1m15.010s   3m40.381s

Linux 2.4.16 gcc 2.95.4 Dual Celeron 400 512Mb PC66

    size    memset()    MemSet()
    *64     15.618s     12.864s
    *128    24.524s     21.852s
    *256    53.963s     52.012s
    *512    1m31.232s   1m39.445s
    *1024   2m44.609s   3m14.567s
    *2048   5m12.630s   6m24.916s
    *4096   10m8.183s   12m43.830s

Ashley Cambrell
OK, seems we have wildly different values for MemSet for different
machines. I am going to up the MEMSET_LOOP_LIMIT value to 1024 because
it seems to be the best value on most machines. We can revisit this in
7.4.

I wonder if a configure test is going to be required for this
eventually. I think random page size needs the same handling.

Maybe I should add to TODO:

	o compute optimal MEMSET_LOOP_LIMIT value via configure.

Is there a significant benefit? Can someone run some query with MemSet
vs. memset and see a timing difference? You can use the new GUC param
log_duration I just committed.

Remember, I added MemSet to eliminate the function call overhead, but at
this point, we are now seeing that MemSet beats memset() for ordinary
memory setting, and function call overhead isn't even an issue with the
larger buffer sizes.

---------------------------------------------------------------------------

Andrew Sullivan wrote:
> On Thu, Aug 29, 2002 at 11:07:03PM -0400, Bruce Momjian wrote:
> >
> > Would you please retest this. I have attached my email showing a
> > simpler test that is less error-prone.
>
> Ok, here you go. Same machine as before, 2-way UltraSPARC-II @400
> MHz, 2.5 G, gcc 2.95.3, Solaris 7. This gcc compiles 32 bit apps.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
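A configure-time probe for the TODO item would presumably have to time both approaches on the build machine and report where memset() starts to win. The following is a rough sketch of such a probe, not an actual configure test: all function names are hypothetical, the doubling search from 64 bytes mirrors the sizes tried in this thread, and clock() is used only because it is portable. One caveat: some newer optimizing compilers may recognize the zeroing loop and turn it into a memset() call themselves, which would hide the difference being measured.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*fill_fn)(uint32_t *buf, size_t len);

/* Word-at-a-time zeroing loop, standing in for the MemSet() fast path. */
void
loop_fill(uint32_t *buf, size_t len)
{
    uint32_t *stop = buf + len / sizeof(uint32_t);

    while (buf < stop)
        *buf++ = 0;
}

/* The library routine, for comparison. */
void
libc_fill(uint32_t *buf, size_t len)
{
    memset(buf, 0, len);
}

/* CPU seconds spent calling fn(buf, len) iters times. */
double
time_fill(fill_fn fn, uint32_t *buf, size_t len, long iters)
{
    volatile uint32_t sink = 0;
    clock_t start = clock();
    long i;

    for (i = 0; i < iters; i++)
    {
        fn(buf, len);
        sink = buf[0];          /* read back so the stores are not dead */
    }
    (void) sink;
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}

/*
 * Smallest power-of-two length in bytes, from 64 up to max_len, at which
 * memset() was measured faster than the loop; 0 if the loop won at every
 * size tried.
 */
size_t
find_crossover(uint32_t *buf, size_t max_len, long iters)
{
    size_t len;

    for (len = 64; len <= max_len; len *= 2)
    {
        double t_loop = time_fill(loop_fill, buf, len, iters);
        double t_libc = time_fill(libc_fill, buf, len, iters);

        if (t_libc < t_loop)
            return len;
    }
    return 0;
}
```

A driver would allocate a word-aligned buffer of max_len bytes, call find_crossover() with a few million iterations, and print the result. Given how noisy the numbers in this thread are, a real probe would likely need repeated runs before its answer could be trusted.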