Thread: tweaking MemSet() performance

tweaking MemSet() performance

From
Neil Conway
Date:
In include/c.h, MemSet() is defined to differ from the stock
memset() function only when setting MEMSET_LOOP_LIMIT bytes or fewer
(currently 64). The comments above the macro definition note:

 *    We got the 64 number by testing this against the stock memset() on
 *    BSD/OS 3.0. Larger values were slower.    bjm 1997/09/11
 *
 *    I think the crossover point could be a good deal higher for
 *    most platforms, actually.  tgl 2000-03-19
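
For readers without c.h at hand, the idea behind the macro can be sketched roughly as follows. This is a simplified illustration, not a copy of the PostgreSQL source: the names SKETCH_MEMSET and SKETCH_LOOP_LIMIT, the 4-byte word size, and the exact set of conditions are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for MEMSET_LOOP_LIMIT; 64 is the value quoted above. */
#define SKETCH_LOOP_LIMIT 64

/*
 * Zero a buffer with an inline word loop when that is cheap, otherwise
 * defer to the library memset().  The loop is taken only if the pointer
 * and length are 4-byte aligned, the fill value is zero, and the length
 * does not exceed SKETCH_LOOP_LIMIT.
 */
#define SKETCH_MEMSET(start, val, len) \
    do \
    { \
        void   *sk_start = (start); \
        int     sk_val = (val); \
        size_t  sk_len = (len); \
 \
        if (((uintptr_t) sk_start & 3) == 0 && \
            (sk_len & 3) == 0 && \
            sk_val == 0 && \
            sk_len <= SKETCH_LOOP_LIMIT) \
        { \
            uint32_t *sk_p = (uint32_t *) sk_start; \
            uint32_t *sk_stop = (uint32_t *) ((char *) sk_start + sk_len); \
 \
            while (sk_p < sk_stop) \
                *sk_p++ = 0; \
        } \
        else \
            memset(sk_start, sk_val, sk_len); \
    } while (0)
```

With this shape, raising the limit only changes which calls take the inline loop; calls with a non-zero value or an unaligned buffer go to memset() regardless.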

I decided to investigate Tom's suggestion and determine the
performance of MemSet() versus memset() on my machine, for various
values of MEMSET_LOOP_LIMIT. The machine this is being tested on is a
Pentium 4 1.8 GHz with RDRAM, running Linux 2.4.19pre8 with GCC 3.1.1
and glibc 2.2.5 -- the results may or may not apply to other
machines.

The test program was:

#include <string.h>
#include "postgres.h"

#undef MEMSET_LOOP_LIMIT
#define MEMSET_LOOP_LIMIT BUFFER_SIZE

int
main(void)
{
    char buffer[BUFFER_SIZE];
    long long i;

    for (i = 0; i < 99000000; i++)
    {
        MemSet(buffer, 0, sizeof(buffer));
    }

    return 0;
}

(I manually changed MemSet() to memset() when testing the performance
of the latter function.)

It was compiled like so:
       gcc -O2 -DBUFFER_SIZE=xxx -Ipgsql/src/include memset.c

(The -O2 optimization flag is important: the results are significantly
different if it is not used.)
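
One reason the optimization level can matter so much in a loop like this is that the buffer is never read, so an optimizer is free to drop or merge some of the stores. The following is a hedged sketch of a timing helper that makes this harder. The name time_memset is hypothetical, it uses the standard clock() function rather than the shell's time, and the volatile read is a common precaution rather than a guarantee.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Return the CPU seconds spent on "iters" memset() calls over "len" bytes.
 * "len" must be non-zero.
 */
static double
time_memset(char *buf, size_t len, long iters)
{
    volatile char sink = 0;
    clock_t       begin = clock();

    for (long i = 0; i < iters; i++)
    {
        memset(buf, 0, len);
        /* Read one byte back through a volatile pointer so the stores
         * are harder for the optimizer to discard as dead. */
        sink ^= ((volatile char *) buf)[(size_t) i % len];
    }
    (void) sink;
    return (double) (clock() - begin) / CLOCKS_PER_SEC;
}
```

Swapping the memset() call for the macro under test gives the matching measurement for the inline loop.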

Here are the results (each timing is the 'total' listing from 'time
./a.out'):

BUFFER_SIZE = 64
        MemSet() -> 2.756, 2.810, 2.789
        memset() -> 13.844, 13.782, 13.778

BUFFER_SIZE = 128
        MemSet() -> 5.848, 5.989, 5.861
        memset() -> 15.637, 15.631, 15.631

BUFFER_SIZE = 256
        MemSet() -> 9.602, 9.652, 9.633
        memset() -> 19.305, 19.370, 19.302

BUFFER_SIZE = 512
        MemSet() -> 17.416, 17.462, 17.353
        memset() -> 26.657, 26.658, 26.678

BUFFER_SIZE = 1024
        MemSet() -> 32.144, 32.179, 32.086
        memset() -> 41.186, 41.115, 41.176

BUFFER_SIZE = 2048
        MemSet() -> 60.39, 60.48, 60.32
        memset() -> 71.19, 71.18, 71.17

BUFFER_SIZE = 4096
        MemSet() -> 118.29, 120.07, 118.69
        memset() -> 131.40, 131.41

... at which point I stopped benchmarking.
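
To compare rows it can help to convert each total into a per-call cost. With 99,000,000 iterations, the first 64-byte MemSet() run (2.756 s) works out to about 27.8 ns per call, and the first memset() run (13.844 s) to about 139.8 ns. A minimal sketch of that arithmetic, with a hypothetical helper name:

```c
/* Convert a total run time in seconds into nanoseconds per call. */
static double
ns_per_call(double total_seconds, long iterations)
{
    return total_seconds * 1e9 / (double) iterations;
}
```

For example, ns_per_call(2.756, 99000000L) gives the 64-byte MemSet() figure quoted in this paragraph.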

Is the benchmark above a reasonable assessment of memset() / MemSet()
performance when setting word-aligned amounts of memory? If so, what's
a good value for MEMSET_LOOP_LIMIT (perhaps 512)?

Also, if anyone would like to contribute the results of doing the
benchmark on their particular system, that might provide some useful
additional data points.

Cheers,

Neil

-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC



Re: tweaking MemSet() performance

From
Bruce Momjian
Date:
I consider this a very good test.  As you can see from the date of my
last test, 1997/09/11, I think I may have had a dual Pentium Pro at that
point, and hardware has certainly changed since then.  I did try 128 at
that time and found it to be slower, but with newer hardware, it is very
possible it has improved.

I remember in writing that macro how surprised I was that there was any
improvements, but obviously there is a gain and the gain is getting
bigger.

I tested the following program:

    #include <string.h>
    #include "postgres.h"

    #undef    MEMSET_LOOP_LIMIT
    #define    MEMSET_LOOP_LIMIT 1000000

    int
    main(int argc, char **argv)
    {
        int        len = atoi(argv[1]);
        char        buffer[len];
        long long    i;

        for (i = 0; i < 9900000; i++)
            MemSet(buffer, 0, len);
        return 0;
    }
 

and, yes, -O2 is significant!  Looks like we use -O2 on all platforms
that use GCC so we should be OK there.

I tested with the following script:
for TIME in 64 128 256 512 1024 2048 4096; do echo "*$TIME\c";time tst1 $TIME; done

and got for MemSet:

    size     real         user         sys
    *64      0m1.001s     0m1.000s     0m0.003s
    *128     0m1.578s     0m1.567s     0m0.013s
    *256     0m2.723s     0m2.723s     0m0.003s
    *512     0m5.044s     0m5.029s     0m0.013s
    *1024    0m9.621s     0m9.621s     0m0.003s
    *2048    0m18.821s    0m18.811s    0m0.013s
    *4096    0m37.266s    0m37.266s    0m0.003s
 

and for memset():

    size     real         user         sys
    *64      0m1.813s     0m1.801s     0m0.014s
    *128     0m2.489s     0m2.499s     0m0.994s
    *256     0m4.397s     0m5.389s     0m0.005s
    *512     0m5.186s     0m6.170s     0m0.015s
    *1024    0m6.676s     0m6.676s     0m0.003s
    *2048    0m9.766s     0m9.776s     0m0.994s
    *4096    0m15.970s    0m15.954s    0m0.003s
 

so for BSD/OS, the break-even is 512.
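
The break-even figure can be read off the two columns of "real" times mechanically. A small sketch of that comparison follows; the function name and the array layout are hypothetical and only illustrate the reasoning.

```c
#include <stddef.h>

/*
 * Scan the sizes in increasing order and return the largest size at which
 * the inline loop is still faster than memset().  Scanning stops at the
 * first size where the loop loses.  Returns 0 if the loop never wins.
 */
static int
last_size_loop_wins(const int *sizes, const double *loop_secs,
                    const double *libc_secs, size_t n)
{
    int best = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (loop_secs[i] >= libc_secs[i])
            break;
        best = sizes[i];
    }
    return best;
}
```

Fed the seven "real" times above for MemSet and memset(), this returns 512: the loop is still slightly ahead at 512 bytes (5.044s against 5.186s) and behind at 1024.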

I am on a dual P3/550 using 2.95.2.  I will tell you exactly why my
break-even is lower than most --- I have assembly language memset()
functions in libc on BSD/OS.

I suggest changing the MEMSET_LOOP_LIMIT to 512.

---------------------------------------------------------------------------

Neil Conway wrote:
> In include/c.h, MemSet() is defined to be different than the stock
> function memset() only when copying less than or equal to
> MEMSET_LOOP_LIMIT bytes (currently 64). The comments above the macro
> definition note:
> 
>  *    We got the 64 number by testing this against the stock memset() on
>  *    BSD/OS 3.0. Larger values were slower.    bjm 1997/09/11
>  *
>  *    I think the crossover point could be a good deal higher for
>  *    most platforms, actually.  tgl 2000-03-19
> 
> I decided to investigate Tom's suggestion and determine the
> performance of MemSet() versus memset() on my machine, for various
> values of MEMSET_LOOP_LIMIT. The machine this is being tested on is a
> Pentium 4 1.8 Ghz with RDRAM, running Linux 2.4.19pre8 with GCC 3.1.1
> and glibc 2.2.5 -- the results may or may not apply to other
> machines.
> 
> The test program was:
> 
> #include <string.h>
> #include "postgres.h"
> 
> #undef MEMSET_LOOP_LIMIT
> #define MEMSET_LOOP_LIMIT BUFFER_SIZE
> 
> int
> main(void)
> {
>     char buffer[BUFFER_SIZE];
>     long long i;
> 
>     for (i = 0; i < 99000000; i++)
>     {
>         MemSet(buffer, 0, sizeof(buffer));
>     }
> 
>     return 0;
> }
> 
> (I manually changed MemSet() to memset() when testing the performance
> of the latter function.)
> 
> It was compiled like so:
> 
>         gcc -O2 -DBUFFER_SIZE=xxx -Ipgsql/src/include memset.c
> 
> (The -O2 optimization flag is important: the results are significantly
> different if it is not used.)
> 
> Here are the results (each timing is the 'total' listing from 'time
> ./a.out'):
> 
> BUFFER_SIZE = 64
>         MemSet() -> 2.756, 2.810, 2.789
>         memset() -> 13.844, 13.782, 13.778
> 
> BUFFER_SIZE = 128
>         MemSet() -> 5.848, 5.989, 5.861
>         memset() -> 15.637, 15.631, 15.631
> 
> BUFFER_SIZE = 256
>         MemSet() -> 9.602, 9.652, 9.633
>         memset() -> 19.305, 19.370, 19.302
> 
> BUFFER_SIZE = 512
>         MemSet() -> 17.416, 17.462, 17.353
>         memset() -> 26.657, 26.658, 26.678
> 
> BUFFER_SIZE = 1024
>         MemSet() -> 32.144, 32.179, 32.086
>         memset() -> 41.186, 41.115, 41.176
> 
> BUFFER_SIZE = 2048
>         MemSet() -> 60.39, 60.48, 60.32
>         memset() -> 71.19, 71.18, 71.17
> 
> BUFFER_SIZE = 4096
>         MemSet() -> 118.29, 120.07, 118.69
>         memset() -> 131.40, 131.41
> 
> ... at which point I stopped benchmarking.
> 
> Is the benchmark above a reasonable assessment of memset() / MemSet()
> performance when copying word-aligned amounts of memory? If so, what's
> a good value for MEMSET_LOOP_LIMIT (perhaps 512)?
> 
> Also, if anyone would like to contribute the results of doing the
> benchmark on their particular system, that might provide some useful
> additional data points.
> 
> Cheers,
> 
> Neil
> 
> -- 
> Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org


Re: tweaking MemSet() performance

From
Andrew Sullivan
Date:
On Thu, Aug 29, 2002 at 01:27:41AM -0400, Neil Conway wrote:
> 
> Also, if anyone would like to contribute the results of doing the
> benchmark on their particular system, that might provide some useful
> additional data points.

Ok, here's a run on a Sun E450, Solaris 7.  I presume your "total"
time label corresponds to my "real" time.  That's what I'm including,
anyway.
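
Because the timings in this thread arrive in different formats (plain seconds from Neil, "NmS.SSSs" strings here), a small converter makes them comparable. This is only a sketch; the function name is hypothetical, and it accepts only the "NmS.SSSs" form.

```c
#include <stdio.h>

/*
 * Parse a time string such as "1m9.226s" and return the value in seconds.
 * Returns -1.0 if the text is not in minutes-and-seconds form.
 */
static double
parse_real_time(const char *text)
{
    int    minutes;
    double seconds;

    if (sscanf(text, "%dm%lfs", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}
```

A value written without the minutes part, such as "12.567s", is rejected rather than guessed at.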

System Configuration:  Sun Microsystems  sun4u Sun Enterprise 450 (2
X UltraSPARC-II 400MHz)
System clock frequency: 100 MHz
Memory size: 2560 Megabytes

BUFFER_SIZE = 64
        MemSet(): 0m13.343s,12.567s,13.659s
        memset(): 0m1.255s,0m1.258s,0m1.254s

BUFFER_SIZE = 128
        MemSet(): 0m21.347s,0m21.200s,0m20.541s
        memset(): 0m18.041s,0m17.963s,0m17.990s

BUFFER_SIZE = 256
        MemSet(): 0m38.023s,0m37.480s,0m37.631s
        memset(): 0m25.969s,0m26.047s,0m26.012s

BUFFER_SIZE = 512
        MemSet(): 1m9.226s,1m9.901s,1m10.148s
        memset(): 2m17.897s,2m18.310s,2m17.984s

BUFFER_SIZE = 1024
        MemSet(): 2m13.690s,2m13.981s,2m13.206s
        memset(): 4m43.195s,4m43.405s,4m43.390s

. . .at which point I gave up.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: tweaking MemSet() performance

From
Bruce Momjian
Date:
Andrew Sullivan wrote:
> On Thu, Aug 29, 2002 at 01:27:41AM -0400, Neil Conway wrote:
> > 
> > Also, if anyone would like to contribute the results of doing the
> > benchmark on their particular system, that might provide some useful
> > additional data points.
> 
> Ok, here's a run on a Sun E450, Solaris 7.  I presume your "total"
> time label corresponds to my "real" time.  That's what I'm including,
> anyway.


Now, these are unusual results.  In the 64 case, MemSet is dramatically
slower, and it only starts to win around 512, and seems to speed up
after that.

These are strange results.  The idea of MemSet was to prevent the
function call overhead for memset, but in such a case, you would think
the function call overhead would reduce as a percentage of the total
time as the buffer got longer.

In your results it seems to suggest that memset() gets slower for longer
buffer lengths, and a for loop starts to win at longer sizes.  Should I
pull out my Solaris kernel source and see what memset() is doing?

---------------------------------------------------------------------------



> System Configuration:  Sun Microsystems  sun4u Sun Enterprise 450 (2
> X UltraSPARC-II 400MHz)
> System clock frequency: 100 MHz
> Memory size: 2560 Megabytes
> 
> BUFFER_SIZE = 64
>         MemSet(): 0m13.343s,12.567s,13.659s
>         memset(): 0m1.255s,0m1.258s,0m1.254s
>         
> BUFFER_SIZE = 128
>         MemSet(): 0m21.347s,0m21.200s,0m20.541s
>         memset(): 0m18.041s,0m17.963s,0m17.990s
>         
> BUFFER_SIZE = 256
>         MemSet(): 0m38.023s,0m37.480s,0m37.631s
>         memset(): 0m25.969s,0m26.047s,0m26.012s
>         
> BUFFER_SIZE = 512
>         MemSet(): 1m9.226s,1m9.901s,1m10.148s
>         memset(): 2m17.897s,2m18.310s,2m17.984s
> 
> BUFFER_SIZE = 1024
>         MemSet(): 2m13.690s,2m13.981s,2m13.206s
>         memset(): 4m43.195s,4m43.405s,4m43.390s
> 
> . . .at which point I gave up.
> 
> A
> 
> -- 
> ----
> Andrew Sullivan                         204-4141 Yonge Street
> Liberty RMS                           Toronto, Ontario Canada
> <andrew@libertyrms.info>                              M2P 2A8
>                                          +1 416 646 3304 x110
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: tweaking MemSet() performance

From
Alvaro Herrera
Date:
On Thu, 29 Aug 2002 19:35:13 -0400 (EDT)
Bruce Momjian <pgman@candle.pha.pa.us> wrote:

> In your results it seems to suggest that memset() gets slower for longer
> buffer lengths, and a for loop starts to win at longer sizes.  Should I
> pull out my Solaris kernel source and see what memset() is doing?

No, because memset() belongs to the libc AFAICS...  Do you have source
code for that?

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"To think that the specter we see is illusory does not strip it of its horror;
it only adds the new terror of madness" (Perelandra, CSLewis)


Re: tweaking MemSet() performance

From
Larry Rosenman
Date:
On Thu, 2002-08-29 at 18:53, Alvaro Herrera wrote:
> On Thu, 29 Aug 2002 19:35:13 -0400 (EDT)
> Bruce Momjian <pgman@candle.pha.pa.us> wrote:
>
> > In your results it seems to suggest that memset() gets slower for longer
> > buffer lengths, and a for loop starts to win at longer sizes.  Should I
> > pull out my Solaris kernel source and see what memset() is doing?
>
> No, because memset() belongs to the libc AFAICS...  Do you have source
> code for that?
and if you do, what vintage is it?  I believe Solaris has mucked with
stuff over the last few rev's.

--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749



Re: tweaking MemSet() performance

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> On Thu, 29 Aug 2002 19:35:13 -0400 (EDT)
> Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> 
> > In your results it seems to suggest that memset() gets slower for longer
> > buffer lengths, and a for loop starts to win at longer sizes.  Should I
> > pull out my Solaris kernel source and see what memset() is doing?
> 
> No, because memset() belongs to the libc AFAICS...  Do you have source
> code for that?

You bet.  I have source code to it all, libs, /bin, etc.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: tweaking MemSet() performance

From
Bruce Momjian
Date:
Larry Rosenman wrote:
> On Thu, 2002-08-29 at 18:53, Alvaro Herrera wrote:
> > On Thu, 29 Aug 2002 19:35:13 -0400 (EDT)
> > Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> > 
> > > In your results it seems to suggest that memset() gets slower for longer
> > > buffer lengths, and a for loop starts to win at longer sizes.  Should I
> > > pull out my Solaris kernel source and see what memset() is doing?
> > 
> > No, because memset() belongs to the libc AFAICS...  Do you have source
> > code for that?
> and if you do, what vintage is it?  I believe Solaris has mucked with
> stuff over the last few rev's. 

8.0.  Looks like there is interest, so I will dig the CDs out of the
box the movers moved and take a look.  Now where is that...

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: tweaking MemSet() performance

From
Bruce Momjian
Date:
Larry Rosenman wrote:
> On Thu, 2002-08-29 at 18:53, Alvaro Herrera wrote:
> > On Thu, 29 Aug 2002 19:35:13 -0400 (EDT)
> > Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> > 
> > > In your results it seems to suggest that memset() gets slower for longer
> > > buffer lengths, and a for loop starts to win at longer sizes.  Should I
> > > pull out my Solaris kernel source and see what memset() is doing?
> > 
> > No, because memset() belongs to the libc AFAICS...  Do you have source
> > code for that?
> and if you do, what vintage is it?  I believe Solaris has mucked with
> stuff over the last few rev's. 

OK, I am not permitted to discuss the contents of the source with anyone
except other Solaris source licensees, but I can say that there isn't
anything fancy in the source.

There is nothing that would explain the slowdown of memset >512 bytes
compared to MemSet.  All lengths 64, 128, ... use the same algorithm in
the memset code.

I got the source from the now-cancelled Solaris Foundation Source
Program:
http://wwws.sun.com/software/solaris/source/

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: tweaking MemSet() performance

From
Bruce Momjian
Date:
Would you please retest this.  I have attached my email showing a
simpler test that is less error-prone.

I can't come up with any scenario that would produce what you have
reported.  If I look at function call cost, MemSet loop efficiency, and
memset loop efficiency, I can't come up with a combination that produces
what you reported.

The standard assumption is that function call overhead is significant,
and that memset is faster than C MemSet.  What compiler are you using?
Is the memset() call being inlined by the compiler?  You will have to
look at the assembler code to be sure.

My only guess is that memset is inlined and that it is only moving
single bytes.  If that is the case, there is no function call overhead
and it would explain why MemSet gets faster as the buffer gets larger.
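
The guess above can be made concrete by comparing the two store patterns. The sketch below is purely illustrative: the function names are hypothetical, and nothing here claims that any particular memset() is implemented either way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One store per byte: "len" stores in total. */
static void
zero_bytewise(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = 0;
}

/*
 * One store per 4-byte word: len / 4 stores in total.
 * "len" is in bytes and must be a multiple of 4.
 */
static void
zero_wordwise(uint32_t *buf, size_t len)
{
    size_t words = len / sizeof(uint32_t);

    assert(len % sizeof(uint32_t) == 0);
    for (size_t i = 0; i < words; i++)
        buf[i] = 0;
}
```

If an inlined memset() behaved like the first function, it would issue four times as many stores as a word loop, and that gap would grow with the buffer size. Note that an optimizing compiler may rewrite either loop, so the generated assembler is still the only reliable evidence.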

---------------------------------------------------------------------------

Andrew Sullivan wrote:
> On Thu, Aug 29, 2002 at 01:27:41AM -0400, Neil Conway wrote:
> >
> > Also, if anyone would like to contribute the results of doing the
> > benchmark on their particular system, that might provide some useful
> > additional data points.
>
> Ok, here's a run on a Sun E450, Solaris 7.  I presume your "total"
> time label corresponds to my "real" time.  That's what I'm including,
> anyway.
>
> System Configuration:  Sun Microsystems  sun4u Sun Enterprise 450 (2
> X UltraSPARC-II 400MHz)
> System clock frequency: 100 MHz
> Memory size: 2560 Megabytes
>
> BUFFER_SIZE = 64
>         MemSet(): 0m13.343s,12.567s,13.659s
>         memset(): 0m1.255s,0m1.258s,0m1.254s
>
> BUFFER_SIZE = 128
>         MemSet(): 0m21.347s,0m21.200s,0m20.541s
>         memset(): 0m18.041s,0m17.963s,0m17.990s
>
> BUFFER_SIZE = 256
>         MemSet(): 0m38.023s,0m37.480s,0m37.631s
>         memset(): 0m25.969s,0m26.047s,0m26.012s
>
> BUFFER_SIZE = 512
>         MemSet(): 1m9.226s,1m9.901s,1m10.148s
>         memset(): 2m17.897s,2m18.310s,2m17.984s
>
> BUFFER_SIZE = 1024
>         MemSet(): 2m13.690s,2m13.981s,2m13.206s
>         memset(): 4m43.195s,4m43.405s,4m43.390s
>
> . . .at which point I gave up.
>
> A
>
> --
> ----
> Andrew Sullivan                         204-4141 Yonge Street
> Liberty RMS                           Toronto, Ontario Canada
> <andrew@libertyrms.info>                              M2P 2A8
>                                          +1 416 646 3304 x110
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: tweaking MemSet() performance

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Would you please retest this.  I have attached my email showing a
> simpler test that is less error-prone.

What did you consider less error-prone, exactly?

Neil's original test considered the case where both the value being
set and the buffer length (second and third args of MemSet) are
compile-time constants.  Your test used a compile-time-constant second
arg and a variable third arg.  It's obvious from looking at the source
of MemSet that this will make a difference in what an optimizing
compiler can do.
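
The difference between the two tests can be sketched by looking at what the compiler knows at each call site. The macro below is a cut-down stand-in written for this illustration, not the c.h definition; all names, the 4-byte word size, and both constants are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_LOOP_LIMIT  1024   /* hypothetical limit for this sketch */
#define SKETCH_BUFFER_SIZE 256    /* hypothetical compile-time length */

/* Inline word loop for aligned, zero-valued, short fills; else memset(). */
#define SKETCH_MEMSET(start, val, len) \
    do \
    { \
        void   *sk_start = (start); \
        int     sk_val = (val); \
        size_t  sk_len = (len); \
 \
        if (((uintptr_t) sk_start & 3) == 0 && \
            (sk_len & 3) == 0 && \
            sk_val == 0 && \
            sk_len <= SKETCH_LOOP_LIMIT) \
        { \
            uint32_t *sk_p = (uint32_t *) sk_start; \
            uint32_t *sk_stop = (uint32_t *) ((char *) sk_start + sk_len); \
 \
            while (sk_p < sk_stop) \
                *sk_p++ = 0; \
        } \
        else \
            memset(sk_start, sk_val, sk_len); \
    } while (0)

/*
 * Value and length are compile-time constants: the value test, the length
 * alignment test, and the limit test can all be decided by the compiler,
 * leaving only the pointer alignment check for run time.
 */
static void
clear_constant_len(void *buf)
{
    SKETCH_MEMSET(buf, 0, SKETCH_BUFFER_SIZE);
}

/*
 * Length is a run-time value: the length alignment and limit tests must
 * be evaluated on every call, and the loop bound is unknown to the compiler.
 */
static void
clear_variable_len(void *buf, size_t len)
{
    SKETCH_MEMSET(buf, 0, len);
}
```

Comparing the assembler generated for the two functions at -O2 is one way to see how much of the macro survives in each case.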

I believe that both cases are interesting in practice in the Postgres
sources, but I have no idea about their relative frequency of
occurrence.

FWIW, I get numbers like the following for the constant-third-arg
scenario, using "gcc -O2" with gcc 2.95.3 on HPUX 10.20, HPPA C180
processor:

bufsize     MemSet       memset
64          0m1.71s      0m4.89s
128         0m2.51s      0m5.36s
256         0m4.11s      0m7.02s
512         0m7.32s      0m10.31s
1024        0m13.74s     0m16.90s
2048        0m26.58s     0m30.08s
4096        0m52.24s     0m56.43s

So I'd go for a crossover point of *at least* 512.  IIRC, I got
similar numbers two years ago that led me to put the comment into
c.h that Neil is reacting to...
        regards, tom lane


Re: tweaking MemSet() performance

From
Andrew Sullivan
Date:
On Thu, Aug 29, 2002 at 07:35:13PM -0400, Bruce Momjian wrote:
> Andrew Sullivan wrote:
> > On Thu, Aug 29, 2002 at 01:27:41AM -0400, Neil Conway wrote:
> > > 
> > > Also, if anyone would like to contribute the results of doing the
> > > benchmark on their particular system, that might provide some useful
> > > additional data points.
> > 
> > Ok, here's a run on a Sun E450, Solaris 7.  I presume your "total"
> > time label corresponds to my "real" time.  That's what I'm including,
> > anyway.
> 
> 
> Now, these are unusual results.  In the 64 case, MemSet is dramatically
> slower, and it only starts to win around 512, and seems to speed up
> after that.
> 
> These are strange results.  The idea of MemSet was to prevent the

Yes, I was rather surprised, too.  In fact, the first couple of runs
I thought I must have made a mistake and compiled with (for instance)
MemSet() instead of memset().  But I triple-checked, and I hadn't.

FWIW, here's an example of what I used to call the compiler:

gcc -O2 -DBUFFER_SIZE=1024 -Ipostgresql-7.2.1/src/include/ memset.c

A
-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: tweaking MemSet() performance

From
Andrew Sullivan
Date:
On Thu, Aug 29, 2002 at 11:07:03PM -0400, Bruce Momjian wrote:

> and that memset is faster than C MemSet.  What compiler are you using? 

Sorry.  Should have included that.

$gcc --version
2.95.3

> Is the memset() call being inlined by the compiler?  You will have to
> look at the assembler code to be sure.

No idea.  I can maybe check this out later, but I'll have to ask one
of my colleagues for help.  My knowledge of what I am looking at runs
out way before looking at assembler code :(

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: tweaking MemSet() performance

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Would you please retest this.  I have attached my email showing a
> > simpler test that is less error-prone.
> 
> What did you consider less error-prone, exactly?
> 
> Neil's original test considered the case where both the value being
> set and the buffer length (second and third args of MemSet) are
> compile-time constants.  Your test used a compile-time-constant second
> arg and a variable third arg.  It's obvious from looking at the source
> of MemSet that this will make a difference in what an optimizing
> compiler can do.

It was less error-prone because you don't have to recompile for every
constant.  Your point that a non-constant length may affect the
optimizer is plausible, though I assumed that for lengths >= 64 it
would not be significant to the optimizer.

Should we take it to 1024 as a switchover point?  I am low at 512, and
others are higher, so 1024 seems like a good average.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: tweaking MemSet() performance

From
Andrew Sullivan
Date:
On Thu, Aug 29, 2002 at 11:07:03PM -0400, Bruce Momjian wrote:
> 
> Would you please retest this.  I have attached my email showing a
> simpler test that is less error-prone.

Ok, here you go.  Same machine as before, 2-way UltraSPARC-II @400
MHz, 2.5 G, gcc 2.95.3, Solaris 7.  This gcc compiles 32 bit apps.  

MemSet():

*64

real    0m1.298s
user    0m1.290s
sys    0m0.010s
*128

real    0m2.251s
user    0m2.250s
sys    0m0.000s
*256

real    0m3.734s
user    0m3.720s
sys    0m0.010s
*512

real    0m7.041s
user    0m7.010s
sys    0m0.020s
*1024

real    0m13.353s
user    0m13.350s
sys    0m0.000s
*2048

real    0m26.178s
user    0m26.040s
sys    0m0.000s
*4096

real    0m51.838s
user    0m51.670s
sys    0m0.010s

and memset()

*64

real    0m1.469s
user    0m1.460s
sys    0m0.000s
*128

real    0m1.813s
user    0m1.810s
sys    0m0.000s
*256

real    0m2.747s
user    0m2.730s
sys    0m0.010s
*512

real    0m12.478s
user    0m12.370s
sys    0m0.010s
*1024

real    0m26.128s
user    0m26.010s
sys    0m0.000s
*2048

real    0m57.663s
user    0m57.320s
sys    0m0.010s
*4096

real    1m53.772s
user    1m53.290s
sys    0m0.000s

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110



Re: tweaking MemSet() performance

From
Ashley Cambrell
Date:
Andrew Sullivan wrote:
> On Thu, Aug 29, 2002 at 01:27:41AM -0400, Neil Conway wrote:
> > Also, if anyone would like to contribute the results of doing the
> > benchmark on their particular system, that might provide some useful
> > additional data points.

Linux 2.4.18 (preempt) gcc 2.95.4 Dual Athlon 1600XP 1Gb DDR

bufsize     memset()      MemSet()
64          2.999s        3.640s
128         4.211s        5.933s
256         6.624s        14.889s
512         11.182s       28.583s
1024        20.288s       55.853s
2048        38.513s       1m50.555s
4096        1m15.010s     3m40.381s

Linux 2.4.16 gcc 2.95.4 Dual Celeron 400 512Mb PC66

bufsize     memset()      MemSet()
64          15.618s       12.864s
128         24.524s       21.852s
256         53.963s       52.012s
512         1m31.232s     1m39.445s
1024        2m44.609s     3m14.567s
2048        5m12.630s     6m24.916s
4096        10m8.183s     12m43.830s

Ashley Cambrell

Re: tweaking MemSet() performance

From
Bruce Momjian
Date:
OK, seems we have wildly different values for MemSet for different
machines.  I am going to up the MEMSET_LOOP_LIMIT value to 1024 because
it seems to be the best value on most machines.  We can revisit this in
7.4.

I wonder if a configure test is going to be required for this
eventually.  I think random page size needs the same handling.

Maybe I should add to TODO:
 o compute optimal MEMSET_LOOP_LIMIT value via configure.

Is there a significant benefit?  Can someone run some query with MemSet
vs. memset and see a timing difference?  You can use the new GUC param
log_duration I just committed.

Remember, I added MemSet to eliminate the function call overhead, but at
this point, we are now seeing that MemSet beats memset() for ordinary
memory setting, and function call overhead isn't even an issue with the
larger buffer sizes.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073