Thread: Warm-cache prefetching

Warm-cache prefetching

From
Qingqing Zhou
Date:
I found an interesting paper on improving index speed by prefetching memory
data into the L1/L2 cache (there was a discussion about prefetching disk
data into memory several days ago, in the "ice-breaker thread"):
http://www.cs.cmu.edu/~chensm/papers/index_pf_final.pdf

There is also a related technique used to speed up memcpy:
http://people.redhat.com/arjanv/pIII.c

I wonder if we could use it to speed up in-memory scan operations for the
heap or indexes. Tom's patch has made the scan handle a page (vs. a row) at
a time, which is a basis for this optimization.

I wrote a program to try to simulate it, but I am not good at micro-
optimization, and I only get a very weak but kind-of-stable improvement. I
wonder if anyone here is interested in taking a look.

Regards,
Qingqing

----------------------------------

Test results
--------------
Cache line size: 64
CPU: P4 2.4G
$#./prefetch 10 16
Sum: -951304192: with prefetch on - duration: 42.163 ms
Sum: -951304192: with prefetch off - duration: 42.838 ms
Sum: -951304192: with prefetch on - duration: 44.044 ms
Sum: -951304192: with prefetch off - duration: 42.792 ms
Sum: -951304192: with prefetch on - duration: 42.324 ms
Sum: -951304192: with prefetch off - duration: 42.803 ms
Sum: -951304192: with prefetch on - duration: 42.189 ms
Sum: -951304192: with prefetch off - duration: 42.801 ms
Sum: -951304192: with prefetch on - duration: 42.155 ms
Sum: -951304192: with prefetch off - duration: 42.827 ms
Sum: -951304192: with prefetch on - duration: 42.179 ms
Sum: -951304192: with prefetch off - duration: 42.798 ms
Sum: -951304192: with prefetch on - duration: 42.180 ms
Sum: -951304192: with prefetch off - duration: 42.804 ms
Sum: -951304192: with prefetch on - duration: 42.193 ms
Sum: -951304192: with prefetch off - duration: 42.827 ms
Sum: -951304192: with prefetch on - duration: 42.164 ms
Sum: -951304192: with prefetch off - duration: 42.810 ms
Sum: -951304192: with prefetch on - duration: 42.182 ms
Sum: -951304192: with prefetch off - duration: 42.826 ms

Test program
----------------

/*
 * prefetch.c
 *        PostgreSQL warm-cache sequential scan simulator with prefetch
 */

#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <errno.h>
#include <sys/time.h>

typedef char bool;
#define true    ((bool) 1)
#define false    ((bool) 0)

#define BLCKSZ    8192
#define CACHESZ    64
#define NBLCKS    5000

int    sum;

int
main(int argc, char *argv[])
{
    int     i, rounds;
    char    *blocks;
    int    cpu_cost;

    if (argc != 3)
    {
        fprintf(stderr, "usage: prefetch <rounds> <cpu_cost [1, 16]>\n");
        exit(-1);
    }

    rounds = atoi(argv[1]);
    cpu_cost  = atoi(argv[2]);
    if (cpu_cost > 16)
        exit(-1);

    for (i = 0; i < 2*rounds; i++)
    {
        int    j, k;
        struct    timeval start_t, stop_t;
        bool    enable = i%2?false:true;
        char    *blck;

        blocks = (char *)malloc(BLCKSZ*NBLCKS);
        memset(blocks, 'a', BLCKSZ*NBLCKS);

        sum = 0;
        gettimeofday(&start_t, NULL);

        for (j = 0; j < NBLCKS; j++)
        {
            blck = blocks + j*BLCKSZ;
            for (k=0; k < BLCKSZ; k+=CACHESZ)
            {
                int    *s = (int *)(blck + k);
                int    u = cpu_cost;

                if (enable)
                {
                    /* prefetch ahead */
                    __asm__ __volatile__ (
                                "1: prefetchnta 128(%0)\n"
                                    : : "r" (s) : "memory" );
                }

                /* pretend to process current tuple */
                while (u--) sum += (*(s+u))*(*(s+u));
            }
        }
        gettimeofday(&stop_t, NULL);

        free(blocks);

        /* measure the time */
        if (stop_t.tv_usec < start_t.tv_usec)
        {
            stop_t.tv_sec--;
            stop_t.tv_usec += 1000000;
        }
        fprintf (stdout, "Sum: %d: with prefetch %s - duration: %ld.%03ld ms\n",
                sum,
                enable?"on":"off",
                (long) ((stop_t.tv_sec - start_t.tv_sec) * 1000 +
                        (stop_t.tv_usec - start_t.tv_usec) / 1000),
                (long) (stop_t.tv_usec - start_t.tv_usec) % 1000);
    }

    exit(0);
}


Re: Warm-cache prefetching

From
"Luke Lonergan"
Date:
Qingqing,

On 12/8/05 8:07 PM, "Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote:

>                                         /* prefetch ahead */
>                                         __asm__ __volatile__ (
>                                         "1: prefetchnta 128(%0)\n"
>                                                 : : "r" (s) : "memory" );

I think this kind / grain of prefetch is handled as a compiler optimization
in the latest GNU compilers, and further there are some memory streaming
operations for the Pentium 4 ISA that are now part of the standard compiler
optimizations done by gcc.
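
(If memory serves, the relevant switch is -fprefetch-loop-arrays, available
since gcc 3.1, combined with an -march that provides the prefetch
instructions, e.g.:

    gcc -O2 -march=pentium4 -fprefetch-loop-arrays -o prefetch prefetch.c

It only applies to loops whose access pattern gcc can analyze, so it is
worth verifying against the particular gcc version.)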

What I think would be tremendously beneficial is to implement L2 cache
blocking in certain key code paths like sort.  What I mean by "cache
blocking" is performing as many operations on a block of memory (maybe 128
pages worth for a 1MB cache) as possible, then moving to the next batch of
memory and performing all of the work on that batch, etc.

The other thing to consider in conjunction with this would be maximizing use
of the instruction cache, increasing use of parallel functional units, and
minimizing pipeline stalls.  The best way to do this would be to group
operations into tighter groups and to separate out branching:

So instead of structures like this:

  function_row_at_a_time(row)
    if conditions
      do some work
    else if other
      do different work
    else if error
      print error_log

You'd have:

  function_buffer_at_a_time(buffer_of_rows)
    loop on sizeof(buffer_of_rows) / sizeof_L2cache
      do a lot of work on each row

    loop on sizeof(buffer_of_rows) / sizeof_L2cache
      if error exit
The ideas in the above optimizations:
- Delay work until a buffer can be gathered
- Increase the "computational intensity" of the loops by putting more
instructions together
- While in loops doing lots of work, avoid branches / jumps
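
To make this concrete, here is a minimal C sketch of the batching pattern
(the worker functions and the 1MB L2 size are assumptions for illustration,
not actual PostgreSQL code):

#include <stddef.h>

#define L2_BYTES    (1024 * 1024)            /* assumed 1MB L2 cache */
#define PAGE_BYTES  8192
#define BATCH_PAGES (L2_BYTES / PAGE_BYTES)  /* ~128 pages per batch */

/* hypothetical workers: a branch-light work pass and a separate check pass */
extern void process_rows(char *pages, size_t npages);
extern int  check_rows(char *pages, size_t npages);

static void
function_buffer_at_a_time(char *buffer, size_t npages)
{
    size_t i;

    for (i = 0; i < npages; i += BATCH_PAGES)
    {
        size_t n = (npages - i < BATCH_PAGES) ? (npages - i) : BATCH_PAGES;
        char   *batch = buffer + i * PAGE_BYTES;

        /* do all the work on this L2-resident batch before moving on */
        process_rows(batch, n);

        /* branching/error handling separated into its own pass */
        if (check_rows(batch, n) != 0)
            return;
    }
}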

- Luke




Re: Warm-cache prefetching

From
"Qingqing Zhou"
Date:
""Luke Lonergan"" <llonergan@greenplum.com> wrote
>
>>                                         /* prefetch ahead */
>>                                         __asm__ __volatile__ (
>>                                         "1: prefetchnta 128(%0)\n"
>>                                                 : : "r" (s) : "memory" );
>
> I think this kind / grain of prefetch is handled as a compiler 
> optimization
> in the latest GNU compilers, and further there are some memory streaming
> operations for the Pentium 4 ISA that are now part of the standard 
> compiler
> optimizations done by gcc.
>

Is there any special gcc optimization flag needed to enable this? I just 
tried both 2.96 and 4.0.1 with -O2. Unfortunately, sse_clear_page() 
core-dumps when compiled with 4.0.1 at this line:
       __asm__ __volatile__ (" movntdq %%xmm0, %0"::"m"(sse_save[0]) );

So I removed that test (sorry ...). There is no non-trivial difference 
AFAICS. The results are attached. I will look into the other parts of your 
thread tomorrow,

Regards,
Qingqing

---

*#ll prefp3-*
-rwx------    1 zhouqq   jmgrp         38k Dec  9 00:49 prefp3-296
-rwx------    1 zhouqq   jmgrp         16k Dec  9 00:49 prefp3-401
*#./prefp3-296
2392.975 MHz
clear_page function 'gcc clear_page()'   took 27142 cycles per page (172.7 MB/s)
clear_page function 'normal clear_page()'        took 27161 cycles per page (172.6 MB/s)
clear_page function 'mmx clear_page()   '        took 17293 cycles per page (271.1 MB/s)
clear_page function 'gcc clear_page()'   took 27174 cycles per page (172.5 MB/s)
clear_page function 'normal clear_page()'        took 27142 cycles per page (172.7 MB/s)
clear_page function 'mmx clear_page()   '        took 17291 cycles per page (271.1 MB/s)

copy_page function 'normal copy_page()'  took 18552 cycles per page (252.7 MB/s)
copy_page function 'mmx copy_page()   '  took 12511 cycles per page (374.6 MB/s)
copy_page function 'sse copy_page()   '  took 12318 cycles per page (380.5 MB/s)
*#./prefp3-401
2392.970 MHz
clear_page function 'gcc clear_page()'   took 27120 cycles per page (172.8 MB/s)
clear_page function 'normal clear_page()'        took 27151 cycles per page (172.6 MB/s)
clear_page function 'mmx clear_page()   '        took 17295 cycles per page (271.0 MB/s)
clear_page function 'gcc clear_page()'   took 27152 cycles per page (172.6 MB/s)
clear_page function 'normal clear_page()'        took 27114 cycles per page (172.9 MB/s)
clear_page function 'mmx clear_page()   '        took 17296 cycles per page (271.0 MB/s)

copy_page function 'normal copy_page()'  took 18586 cycles per page (252.2 MB/s)
copy_page function 'mmx copy_page()   '  took 12620 cycles per page (371.4 MB/s)
copy_page function 'sse copy_page()   '  took 12698 cycles per page (369.1 MB/s)




Re: Warm-cache prefetching

From
"Min Xu (Hsu)"
Date:
Perhaps because P4 is already doing H/W prefetching?

http://www.tomshardware.com/2000/11/20/intel/page5.html

I ran the test program on an opteron 2.2G:

% ./a.out 10 16
Sum: -951304192: with prefetch on - duration: 81.166 ms
Sum: -951304192: with prefetch off - duration: 79.769 ms
Sum: -951304192: with prefetch on - duration: 81.173 ms
Sum: -951304192: with prefetch off - duration: 79.718 ms
Sum: -951304192: with prefetch on - duration: 83.293 ms
Sum: -951304192: with prefetch off - duration: 79.731 ms
Sum: -951304192: with prefetch on - duration: 81.227 ms
Sum: -951304192: with prefetch off - duration: 79.851 ms
Sum: -951304192: with prefetch on - duration: 81.003 ms
Sum: -951304192: with prefetch off - duration: 79.724 ms
Sum: -951304192: with prefetch on - duration: 81.084 ms
Sum: -951304192: with prefetch off - duration: 79.728 ms
Sum: -951304192: with prefetch on - duration: 81.009 ms
Sum: -951304192: with prefetch off - duration: 79.723 ms
Sum: -951304192: with prefetch on - duration: 81.074 ms
Sum: -951304192: with prefetch off - duration: 79.719 ms
Sum: -951304192: with prefetch on - duration: 81.188 ms
Sum: -951304192: with prefetch off - duration: 79.724 ms
Sum: -951304192: with prefetch on - duration: 81.075 ms
Sum: -951304192: with prefetch off - duration: 79.719 ms

I got a slowdown.

I ran it on a PIII 650M:

% ./a.out 10 16
Sum: -951304192: with prefetch on - duration: 284.952 ms
Sum: -951304192: with prefetch off - duration: 291.439 ms
Sum: -951304192: with prefetch on - duration: 290.690 ms
Sum: -951304192: with prefetch off - duration: 299.692 ms
Sum: -951304192: with prefetch on - duration: 295.287 ms
Sum: -951304192: with prefetch off - duration: 290.992 ms
Sum: -951304192: with prefetch on - duration: 285.116 ms
Sum: -951304192: with prefetch off - duration: 294.127 ms
Sum: -951304192: with prefetch on - duration: 286.986 ms
Sum: -951304192: with prefetch off - duration: 291.001 ms
Sum: -951304192: with prefetch on - duration: 283.233 ms
Sum: -951304192: with prefetch off - duration: 401.910 ms
Sum: -951304192: with prefetch on - duration: 297.021 ms
Sum: -951304192: with prefetch off - duration: 307.814 ms
Sum: -951304192: with prefetch on - duration: 287.201 ms
Sum: -951304192: with prefetch off - duration: 303.870 ms
Sum: -951304192: with prefetch on - duration: 286.962 ms
Sum: -951304192: with prefetch off - duration: 352.779 ms
Sum: -951304192: with prefetch on - duration: 283.245 ms
Sum: -951304192: with prefetch off - duration: 294.422 ms

looks like some speedup to me.

On Thu, 08 Dec 2005 Qingqing Zhou wrote:
> 
> I found an interesting paper on improving index speed by prefetching memory
> data into the L1/L2 cache (there was a discussion about prefetching disk
> data into memory several days ago, in the "ice-breaker thread"):
> http://www.cs.cmu.edu/~chensm/papers/index_pf_final.pdf
> 
> There is also a related technique used to speed up memcpy:
> http://people.redhat.com/arjanv/pIII.c
> 
> I wonder if we could use it to speed up in-memory scan operations for the
> heap or indexes. Tom's patch has made the scan handle a page (vs. a row) at
> a time, which is a basis for this optimization.
> 
> I wrote a program to try to simulate it, but I am not good at micro-
> optimization, and I only get a very weak but kind-of-stable improvement. I
> wonder if anyone here is interested in taking a look.
> 
> Regards,
> Qingqing
> 
> ----------------------------------
> 
> Test results
> --------------
> Cache line size: 64
> CPU: P4 2.4G
> $#./prefetch 10 16
> Sum: -951304192: with prefetch on - duration: 42.163 ms
> Sum: -951304192: with prefetch off - duration: 42.838 ms
> Sum: -951304192: with prefetch on - duration: 44.044 ms
> Sum: -951304192: with prefetch off - duration: 42.792 ms
> Sum: -951304192: with prefetch on - duration: 42.324 ms
> Sum: -951304192: with prefetch off - duration: 42.803 ms
> Sum: -951304192: with prefetch on - duration: 42.189 ms
> Sum: -951304192: with prefetch off - duration: 42.801 ms
> Sum: -951304192: with prefetch on - duration: 42.155 ms
> Sum: -951304192: with prefetch off - duration: 42.827 ms
> Sum: -951304192: with prefetch on - duration: 42.179 ms
> Sum: -951304192: with prefetch off - duration: 42.798 ms
> Sum: -951304192: with prefetch on - duration: 42.180 ms
> Sum: -951304192: with prefetch off - duration: 42.804 ms
> Sum: -951304192: with prefetch on - duration: 42.193 ms
> Sum: -951304192: with prefetch off - duration: 42.827 ms
> Sum: -951304192: with prefetch on - duration: 42.164 ms
> Sum: -951304192: with prefetch off - duration: 42.810 ms
> Sum: -951304192: with prefetch on - duration: 42.182 ms
> Sum: -951304192: with prefetch off - duration: 42.826 ms
> 
> Test program
> ----------------
> 
> /*
>  * prefetch.c
>  *        PostgreSQL warm-cache sequential scan simulator with prefetch
>  */
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <memory.h>
> #include <errno.h>
> #include <sys/time.h>
> 
> typedef char bool;
> #define true    ((bool) 1)
> #define false    ((bool) 0)
> 
> #define BLCKSZ    8192
> #define CACHESZ    64
> #define NBLCKS    5000
> 
> int    sum;
> 
> int
> main(int argc, char *argv[])
> {
>     int     i, rounds;
>     char    *blocks;
>     int    cpu_cost;
> 
>     if (argc != 3)
>     {
>         fprintf(stderr, "usage: prefetch <rounds> <cpu_cost [1, 16]>\n");
>         exit(-1);
>     }
> 
>     rounds = atoi(argv[1]);
>     cpu_cost  = atoi(argv[2]);
>     if (cpu_cost > 16)
>         exit(-1);
> 
>     for (i = 0; i < 2*rounds; i++)
>     {
>         int    j, k;
>         struct    timeval start_t, stop_t;
>         bool    enable = i%2?false:true;
>         char    *blck;
> 
>         blocks = (char *)malloc(BLCKSZ*NBLCKS);
>         memset(blocks, 'a', BLCKSZ*NBLCKS);
> 
>         sum = 0;
>         gettimeofday(&start_t, NULL);
> 
>         for (j = 0; j < NBLCKS; j++)
>         {
>             blck = blocks + j*BLCKSZ;
>             for (k=0; k < BLCKSZ; k+=CACHESZ)
>             {
>                 int    *s = (int *)(blck + k);
>                 int    u = cpu_cost;
> 
>                 if (enable)
>                 {
>                     /* prefetch ahead */
>                     __asm__ __volatile__ (
>                                 "1: prefetchnta 128(%0)\n"
>                                     : : "r" (s) : "memory" );
>                 }
> 
>                 /* pretend to process current tuple */
>                 while (u--) sum += (*(s+u))*(*(s+u));
>             }
>         }
>         gettimeofday(&stop_t, NULL);
> 
>         free(blocks);
> 
>         /* measure the time */
>         if (stop_t.tv_usec < start_t.tv_usec)
>         {
>             stop_t.tv_sec--;
>             stop_t.tv_usec += 1000000;
>         }
>         fprintf (stdout, "Sum: %d: with prefetch %s - duration: %ld.%03ld ms\n",
>                 sum,
>                 enable?"on":"off",
>                 (long) ((stop_t.tv_sec - start_t.tv_sec) * 1000 +
>                         (stop_t.tv_usec - start_t.tv_usec) / 1000),
>                 (long) (stop_t.tv_usec - start_t.tv_usec) % 1000);
> 
>     }
> 
>     exit(0);
> }
> 


Re: Warm-cache prefetching

From
Simon Riggs
Date:
This technique and others are discussed in detail in the Intel
Optimization Manual:

http://apps.intel.com/scripts-util/download.asp?url=/design/PentiumII/manuals/24512701.pdf&title=Intel%AE+Architecture+Optimization+Reference+Manual&fullpg=3&site=Developer

A similar manual exists for AMD and other architectures.

On Thu, 2005-12-08 at 23:07 -0500, Qingqing Zhou wrote:
> I wrote a program to try to simulate it, but I am not good at micro-
> optimization, and I only get a very weak but kind-of-stable improvement. I
> wonder if anyone here is interested in taking a look.

You may be trying to use the memory too early. Prefetched memory takes
time to arrive in cache, so you may need to issue prefetch calls for N+2,
N+3 etc. rather than simply N+1.

p.6-11 covers this.
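
For example, the inner loop of the test program could be reworked along
these lines (an untested sketch using gcc's __builtin_prefetch rather than
inline assembly; PF_DIST = 4 cache lines is only a guess that needs tuning):

#define PF_DIST 4    /* cache lines to prefetch ahead; tune per CPU */

/* BLCKSZ and CACHESZ as in the posted test program */
static int
scan_block(char *blck, int cpu_cost)
{
    int k, sum = 0;

    for (k = 0; k < BLCKSZ; k += CACHESZ)
    {
        int *s = (int *) (blck + k);
        int  u = cpu_cost;

        /* ask for the line we will need PF_DIST iterations from now;
         * prefetching a little past the block end is harmless */
        __builtin_prefetch(blck + k + PF_DIST * CACHESZ, 0, 0);

        /* pretend to process current tuple */
        while (u--)
            sum += s[u] * s[u];
    }
    return sum;
}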

There's a lot of papers around coming up with interesting sounding
techniques. AFAICS, all they've done is read the optimization guide and
tried to apply that wisdom, so it seems a good idea to go back to the
source.

I think many of these techniques are generic across architectures, so
there is much to be done in this area, IMHO. Though we must respect
portability and confirm any tuning through testing.

Best Regards, Simon Riggs



Re: Warm-cache prefetching

From
Bruce Momjian
Date:
Do these optimizations have any effect on database software?  I know
call overhead shows up as a performance bottleneck, but do these other
optimizations also measurably improve performance?

---------------------------------------------------------------------------

Simon Riggs wrote:
> This technique and others are discussed in detail in the Intel
> Optimization Manual:
>
http://apps.intel.com/scripts-util/download.asp?url=/design/PentiumII/manuals/24512701.pdf&title=Intel%AE+Architecture+Optimization+Reference+Manual&fullpg=3&site=Developer
> 
> Similar manual exists for AMD and other architectures.
> 
> On Thu, 2005-12-08 at 23:07 -0500, Qingqing Zhou wrote:
> > I wrote a program to try to simulate it, but I am not good at micro-
> > optimization, and I only get a very weak but kind-of-stable improvement. I
> > wonder if anyone here is interested in taking a look.
> 
> You may be trying to use the memory too early. Prefetched memory takes
> time to arrive in cache, so you may need to issue prefetch calls for N+2,
> N+3 etc. rather than simply N+1.
> 
> p.6-11 covers this.
> 
> There's a lot of papers around coming up with interesting sounding
> techniques. AFAICS, all they've done is read the optimization guide and
> tried to apply that wisdom, so it seems a good idea to go back to the
> source.
> 
> I think many of these techniques are generic across architectures, so
> there is much to be done in this area, IMHO. Though we must respect
> portability and confirm any tuning through testing.
> 
> Best Regards, Simon Riggs
> 
> 
> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Warm-cache prefetching

From
Bruce Momjian
Date:
Kenneth Marshall wrote:
> The main benefit of pre-fetching optimization is to allow just-
> in-time data delivery to the processor. There are numerous papers
> illustrating the dramatic increase in data throughput by using
> datastructures designed to take advantage of prefetching. Factors
> of 3-7 can be realized and this can greatly increase database
> performance. The first step needed to take advantage of the ability
> of pre-fetching to reduce memory latency is to design the index
> page layout with an internal blocking of the cache-line size.
> Then issue pre-fetch instructions for the memory you are going
> to need to process the index page far enough in advance to allow
> it to be in a cache-line by the time it is needed. 

I can see that being useful for a single-user application that doesn't
have locking or I/O bottlenecks, and doesn't have a multi-stage design
like a database.  Do we do enough of such processing that we will _see_
an improvement, or will our code become more complex and it will be
harder to make algorithmic optimizations to our code?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Warm-cache prefetching

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I can see that being useful for a single-user application that doesn't
> have locking or I/O bottlenecks, and doesn't have a multi-stage design
> like a database.  Do we do enough of such processing that we will _see_
> an improvement, or will our code become more complex and it will be
> harder to make algorithmic optimizations to our code?

The main concern I've got about this is the probable negative effect on
code readability.  There's a limit to the extent to which I'm willing to
uglify the code for processor-specific optimizations, and that limit is
not real far off.  There are a lot of other design levels we can work at
to obtain speedups that won't depend on the assumption we are running
on this-year's Intel hardware.
        regards, tom lane


Re: Warm-cache prefetching

From
Kenneth Marshall
Date:
On Fri, Dec 09, 2005 at 10:37:25AM -0500, Bruce Momjian wrote:
> Kenneth Marshall wrote:
> > The main benefit of pre-fetching optimization is to allow just-
> > in-time data delivery to the processor. There are numerous papers
> > illustrating the dramatic increase in data throughput by using
> > datastructures designed to take advantage of prefetching. Factors
> > of 3-7 can be realized and this can greatly increase database
> > performance. The first step needed to take advantage of the ability
> > of pre-fetching to reduce memory latency is to design the index
> > page layout with an internal blocking of the cache-line size.
> > Then issue pre-fetch instructions for the memory you are going
> > to need to process the index page far enough in advance to allow
> > it to be in a cache-line by the time it is needed. 
> 
> I can see that being useful for a single-user application that doesn't
> have locking or I/O bottlenecks, and doesn't have a multi-stage design
> like a database.  Do we do enough of such processing that we will _see_
> an improvement, or will our code become more complex and it will be
> harder to make algorithmic optimizations to our code?
> 

We should certainly consider all of the trade-offs involved. But if
processing a single index page takes 1/5 the time or less, then the
DB can process 5x the lookups in the same amount of time. That would
be very nice in a multi-user DB.

Ken


Re: Warm-cache prefetching

From
Kenneth Marshall
Date:
The main benefit of pre-fetching optimization is to allow just-
in-time data delivery to the processor. There are numerous papers
illustrating the dramatic increase in data throughput by using
datastructures designed to take advantage of prefetching. Factors
of 3-7 can be realized and this can greatly increase database
performance. The first step needed to take advantage of the ability
of pre-fetching to reduce memory latency is to design the index
page layout with an internal blocking of the cache-line size.
Then issue pre-fetch instructions for the memory you are going
to need to process the index page far enough in advance to allow
it to be in a cache-line by the time it is needed. 
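
As a rough illustration of the layout idea (made-up structures and
constants, not the actual PostgreSQL index page format):

#include <stdint.h>

#define CACHE_LINE      64
#define KEYS_PER_GROUP  (CACHE_LINE / sizeof(uint32_t))  /* 16 keys */
#define PF_AHEAD        2   /* groups to prefetch ahead; needs tuning */

/* index keys blocked into cache-line-sized groups within the page */
typedef struct KeyGroup
{
    uint32_t keys[KEYS_PER_GROUP];
} KeyGroup;

static int
search_page(const KeyGroup *groups, int ngroups, uint32_t target)
{
    int g, i;

    for (g = 0; g < ngroups; g++)
    {
        /* issue the prefetch early enough that the group is already
         * in a cache line by the time we get to it */
        if (g + PF_AHEAD < ngroups)
            __builtin_prefetch(&groups[g + PF_AHEAD], 0, 3);

        for (i = 0; i < (int) KEYS_PER_GROUP; i++)
            if (groups[g].keys[i] == target)
                return g * (int) KEYS_PER_GROUP + i;
    }
    return -1;
}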

Ken

On Fri, Dec 09, 2005 at 09:43:33AM -0500, Bruce Momjian wrote:
> 
> Do these optimizations have any effect on database software?  I know
> call overhead shows up as a performance bottleneck, but do these other
> optimizations also measurably improve performance?
> 
> ---------------------------------------------------------------------------
> 
> Simon Riggs wrote:
> > This technique and others are discussed in detail in the Intel
> > Optimization Manual:
> >
http://apps.intel.com/scripts-util/download.asp?url=/design/PentiumII/manuals/24512701.pdf&title=Intel%AE+Architecture+Optimization+Reference+Manual&fullpg=3&site=Developer
> > 
> > Similar manual exists for AMD and other architectures.
> > 
> > On Thu, 2005-12-08 at 23:07 -0500, Qingqing Zhou wrote:
> > > I wrote a program to try to simulate it, but I am not good at micro-
> > > optimization, and I only get a very weak but kind-of-stable improvement. I
> > > wonder if anyone here is interested in taking a look.
> > 
> > You may be trying to use the memory too early. Prefetched memory takes
> > time to arrive in cache, so you may need to issue prefetch calls for N+2,
> > N+3 etc. rather than simply N+1.
> > 
> > p.6-11 covers this.
> > 
> > There's a lot of papers around coming up with interesting sounding
> > techniques. AFAICS, all they've done is read the optimization guide and
> > tried to apply that wisdom, so it seems a good idea to go back to the
> > source.
> > 
> > I think many of these techniques are generic across architectures, so
> > there is much to be done in this area, IMHO. Though we must respect
> > portability and confirm any tuning through testing.
> > 
> > Best Regards, Simon Riggs
> > 
> > 
> > 
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
> 


Re: Warm-cache prefetching

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I can see that being useful for a single-user application that doesn't
> > have locking or I/O bottlenecks, and doesn't have a multi-stage design
> > like a database.  Do we do enough of such processing that we will _see_
> > an improvement, or will our code become more complex and it will be
> > harder to make algorithmic optimizations to our code?
> 
> The main concern I've got about this is the probable negative effect on
> code readability.  There's a limit to the extent to which I'm willing to
> uglify the code for processor-specific optimizations, and that limit is
> not real far off.  There are a lot of other design levels we can work at
> to obtain speedups that won't depend on the assumption we are running
> on this-year's Intel hardware.

That is my guess too.  We have seen speedups by inlining and optimizing
frequently-called functions and using assembler for spinlocks.  Proof of
the assembler is in /pg/include/storage/s_lock.h and proof of the
inlining is in /pg/include/access/heapam.h.  Those were chosen for
optimization because they were used a lot.

I think the big question is whether there are other areas that have a
similar CPU load and can be meaningfully optimized, and whether the
optimization includes such things as multi-staging.  I think we should
take a wait-and-see attitude and see what test results people get.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Warm-cache prefetching

From
Simon Riggs
Date:
On Fri, 2005-12-09 at 09:43 -0500, Bruce Momjian wrote:
> Do these optimizations have any affect on database software?  I know
> call overhead shows up as a performance bottleneck, but do these others
> optimizations also measurably improve performance?

Many of them can, but nowhere near as much benefit as you'd get by
applying them to a graphics/media app.

We've already taken note of the effects of buffer alignment and padding,
which was pretty much straight out of this manual. 

Recent research suggests that the CPU<->memory bottleneck is growing
worse each year, so attention to this area should have longer-term
payoffs. Recent CPU changes indicate that the number of CPU cores
available is going to increase dramatically in the next few years, so
scalability is critical as well.

The trick is to find the optimization sweet spot, and to be certain that
it is reasonably broadly applicable. But I think we will find some
benefit somewhere, so I think it's worth the look, at the same time
checking that the compiler guys haven't solved it for us first.

Best Regards, Simon Riggs




Re: Warm-cache prefetching

From
Qingqing Zhou
Date:
On Thu, 8 Dec 2005, Min Xu (Hsu) wrote:

> Perhaps because P4 is already doing H/W prefetching?
>
> http://www.tomshardware.com/2000/11/20/intel/page5.html
>
> I ran the test program on an opteron 2.2G:
> Got slowdown.
>
> I ran it on a PIII 650M:
> looks like some speedup to me.
>

Ok, I see ... so this optimization relies on the CPU type and even its
speed. If many programs are fighting for the L2 cache, then things will
get screwed up.

Regards,
Qingqing


Re: Warm-cache prefetching

From
"Luke Lonergan"
Date:
Bruce,

It (the compute intensity optimization) is what we did for copy parsing, and it sped up by a factor of 100+.

The rest of the copy path could use some work too.

The virtual tuples in 8.1 are another example of grouping operations into
more compact chunks instead of doing bits at a time.

The L2 cache optimization could pay huge dividends for certain parts of the
code path, like sorting for instance.  A simple outer loop that strip-mines
in L2-sized chunks in sorting could speed things up 5 to 10 times in that
section.
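
A hedged sketch of what that outer loop might look like (plain qsort over
assumed 1MB L2-sized runs, merge pass omitted; an illustration, not the
actual sort code):

#include <stdlib.h>

#define L2_BYTES  (1024 * 1024)        /* assumed 1MB L2 */
#define RUN_ELEMS (L2_BYTES / sizeof(int))

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;

    return (x > y) - (x < y);
}

/* sort each cache-resident run in place; a k-way merge of the sorted
 * runs (omitted) then reads each run sequentially, which prefetches well */
static void
sort_in_l2_runs(int *data, size_t n)
{
    size_t i;

    for (i = 0; i < n; i += RUN_ELEMS)
    {
        size_t run = (n - i < RUN_ELEMS) ? (n - i) : RUN_ELEMS;

        qsort(data + i, run, sizeof(int), cmp_int);
    }
}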
 
- Luke
--------------------------
Sent from my BlackBerry Wireless Device


-----Original Message-----
From: pgsql-hackers-owner@postgresql.org <pgsql-hackers-owner@postgresql.org>
To: Simon Riggs <simon@2ndquadrant.com>
CC: Qingqing Zhou <zhouqq@cs.toronto.edu>; pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Sent: Fri Dec 09 09:43:33 2005
Subject: Re: [HACKERS] Warm-cache prefetching


Do these optimizations have any affect on database software?  I know
call overhead shows up as a performance bottleneck, but do these others
optimizations also measurably improve performance?

---------------------------------------------------------------------------

Simon Riggs wrote:
> This technique and others are discussed in detail in the Intel
> Optimization Manual:
>
http://apps.intel.com/scripts-util/download.asp?url=/design/PentiumII/manuals/24512701.pdf&title=Intel%AE+Architecture+Optimization+Reference+Manual&fullpg=3&site=Developer
> 
> Similar manual exists for AMD and other architectures.
> 
> On Thu, 2005-12-08 at 23:07 -0500, Qingqing Zhou wrote:
> > I wrote a program to try to simulate it, but I am not good at micro-
> > optimization, and I only get a very weak but kind-of-stable improvement. I
> > wonder if anyone here is interested in taking a look.
> 
> You may be trying to use the memory too early. Prefetched memory takes
> time to arrive in cache, so you may need to issue prefetch calls for N+2,
> N+3 etc. rather than simply N+1.
> 
> p.6-11 covers this.
> 
> There's a lot of papers around coming up with interesting sounding
> techniques. AFAICS, all they've done is read the optimization guide and
> tried to apply that wisdom, so it seems a good idea to go back to the
> source.
> 
> I think many of these techniques are generic across architectures, so
> there is much to be done in this area, IMHO. Though we must respect
> portability and confirm any tuning through testing.
> 
> Best Regards, Simon Riggs
> 
> 
> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073



Re: Warm-cache prefetching

From
Bruce Momjian
Date:
Luke Lonergan wrote:
> Bruce,
> 
> It (the compute intensity optimization) is what we did for copy parsing,
> and it sped up by a factor of 100+.
> 
> The rest of the copy path could use some work too.
> 
> The virtual tuples in 8.1 are another example of grouping operations
> into more compact chunks instead of doing bits at a time.
> 
> The l2 cache optimization could pay huge dividends for certain parts
> of the code path, like sorting for instance.  A simple outer loop that
> strip mines in l2 sized chunks in sorting could speed things up 5 to
> 10 times in that section.

Yes, I can see COPY and sort being good for optimization because we
spend a large percentage of the command in a small section of the code.
I just don't know what other things have a similar localized operating
area.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Warm-cache prefetching

From
Andrew Dunstan
Date:

Luke Lonergan wrote:

>Bruce,
>
>It (the compute intensity optimization) is what we did for copy parsing, and it sped up by a factor of 100+.
>  
>

The changes made to COPY were portable, though.

cheers

andrew


Re: Warm-cache prefetching

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Luke Lonergan wrote:
>> It (the compute intensity optimization) is what we did for copy parsing, and it sped up by a factor of 100+.

> The changes made to COPY were portable, though.

In fact, the changes made to COPY had absolutely nada to do with any of
the things discussed in this thread.

BTW, "sped up by 100%" (which is already an overstatement of what was
actually accomplished) is a long way from "sped up by a factor of 100".
        regards, tom lane


Re: Warm-cache prefetching

From
"Luke Lonergan"
Date:
Tom,

On 12/9/05 2:14 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Andrew Dunstan <andrew@dunslane.net> writes:
>> Luke Lonergan wrote:
>>> It (the compute intensity optimization) is what we did for copy parsing, and
>>> it sped up by a factor of 100+.
> 
>> The changes made to COPY were portable, though.
> 
> In fact, the changes made to COPY had absolutely nada to do with any of
> the things discussed in this thread.

Yes, they do; you must not have read my post in this thread on compute
intensity and removing pipeline stalls.
> BTW, "sped up by 100%" (which is already an overstatement of what was
> actually accomplished) is a long way from "sped up by a factor of 100".

Wrong again - the code section that did parsing sped up by 100x, but the
overall improvement was *only* 100% in our version and 60% in your version.

- Luke




Re: Warm-cache prefetching

From
Kenneth Marshall
Date:
On Fri, Dec 09, 2005 at 11:32:48AM -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > I can see that being useful for a single-user application that doesn't
> > > have locking or I/O bottlenecks, and doesn't have a multi-stage design
> > > like a database.  Do we do enough of such processing that we will _see_
> > > an improvement, or will our code become more complex and it will be
> > > harder to make algorithmic optimizations to our code?
> > 
> > The main concern I've got about this is the probable negative effect on
> > code readability.  There's a limit to the extent to which I'm willing to
> > uglify the code for processor-specific optimizations, and that limit is
> > not real far off.  There are a lot of other design levels we can work at
> > to obtain speedups that won't depend on the assumption we are running
> > on this-year's Intel hardware.
> 
> That is my guess too.  We have seen speedups by inlining and optimizing
> frequently-called functions and using assembler for spinlocks.  Proof of
> the assembler is in /pg/include/storage/s_lock.h and proof of the
> inlining is in /pg/include/access/heapam.h.  Those were chosen for
> optimization because they were used a lot.
> 
> I think the big question is whether there are other areas that have a
> similar CPU load and can be meaningfully optimized, and does the
> optimization include such things as multi-staging.  I think we should
> take a wait and see attitude and see what test results people get.
> 
I also agree that we should go for the most bang for the buck and include
the coding/maintenance aspects in the cost. Prefetching is not available
just in x86 processors; most modern processors now support memory prefetch
operations. If our designs going forward do not consider the cache-line
stalls the processor incurs while waiting for data, there will be
substantial performance gains that stay forever out of reach.

Ken


Re: Warm-cache prefetching

From
"Qingqing Zhou"
Date:
"Simon Riggs" <simon@2ndquadrant.com> wrote
>
> You may be trying to use the memory too early. Prefetched memory takes
> time to arrive in cache, so you may need to issue prefetch calls for N
> +2, N+3 etc rather than simply N+1.
>
> p.6-11 covers this.
>

I actually tried it and observed no improvement. Also, this may 
conflict with the "try to mix prefetch with computation" suggestion from the 
manual that you pointed out. But anyway, this looks fixable compared to 
the following "prefetch distance" problem. As I read the manual, the 
prefetch distance is one key factor in the efficiency, which also matches 
our intuition. However, the CPU clocks needed to process each tuple on a 
page might be quite different:

---
for (each tuple on a page)
{
    if (ItemIdIsUsed(lpp))    /* some stopped here */
    {
        ...
        /* some involves deeper function calls here */
        valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
        if (valid)
            scan->rs_vistuples[ntup++] = lineoff;
    }
}
---

So it is pretty hard to predict the prefetch distance. The prefetch 
improvements to memcpy/memmove do not have this problem: the prefetch 
distance can be fixed, and it does not change with the different-speed 
CPUs of the same processor series.

Maybe the L2 cache is big enough that there is no need to worry about 
fetching too far ahead? That seems not to be true, since the idea is 
vulnerable to a busy system: no data will be kept in L2 for you for long.

As Luke suggested, the code above the scan operators, like sort, might be a 
better place to look. I will take a look there.

Regards,
Qingqing