Thread: Warm-cache prefetching
I found an interesting paper on improving index speed by prefetching memory data into the L1/L2 cache (there was a discussion several days ago, in the "ice-breaker thread", about prefetching disk data into memory):
http://www.cs.cmu.edu/~chensm/papers/index_pf_final.pdf

Also a related technique used to speed up memcpy:
http://people.redhat.com/arjanv/pIII.c

I wonder if we could use it to speed up in-memory scan operations for heaps or indexes. Tom's patch has made the scan handle a page (vs. a row) at a time, which is a basis for this optimization.

I wrote a program to try to simulate it, but I am not good at micro-optimization, and I only get a very weak but kind-of-stable improvement. I wonder if any people here are interested in taking a look.

Regards,
Qingqing

----------------------------------

Test results
--------------
Cache line size: 64
CPU: P4 2.4G

$#./prefetch 10 16
Sum: -951304192: with prefetch on - duration: 42.163 ms
Sum: -951304192: with prefetch off - duration: 42.838 ms
Sum: -951304192: with prefetch on - duration: 44.044 ms
Sum: -951304192: with prefetch off - duration: 42.792 ms
Sum: -951304192: with prefetch on - duration: 42.324 ms
Sum: -951304192: with prefetch off - duration: 42.803 ms
Sum: -951304192: with prefetch on - duration: 42.189 ms
Sum: -951304192: with prefetch off - duration: 42.801 ms
Sum: -951304192: with prefetch on - duration: 42.155 ms
Sum: -951304192: with prefetch off - duration: 42.827 ms
Sum: -951304192: with prefetch on - duration: 42.179 ms
Sum: -951304192: with prefetch off - duration: 42.798 ms
Sum: -951304192: with prefetch on - duration: 42.180 ms
Sum: -951304192: with prefetch off - duration: 42.804 ms
Sum: -951304192: with prefetch on - duration: 42.193 ms
Sum: -951304192: with prefetch off - duration: 42.827 ms
Sum: -951304192: with prefetch on - duration: 42.164 ms
Sum: -951304192: with prefetch off - duration: 42.810 ms
Sum: -951304192: with prefetch on - duration: 42.182 ms
Sum: -951304192: with prefetch off - duration: 42.826 ms

Test program
----------------

/*
 * prefetch.c
 *    PostgreSQL warm-cache sequential scan simulator with prefetch
 */

#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <errno.h>
#include <sys/time.h>

typedef char bool;
#define true    ((bool) 1)
#define false   ((bool) 0)

#define BLCKSZ  8192        /* PostgreSQL page size */
#define CACHESZ 64          /* cache line size */
#define NBLCKS  5000        /* pages scanned per round */

int sum;

int
main(int argc, char *argv[])
{
    int     i, rounds;
    char   *blocks;
    int     cpu_cost;

    if (argc != 3)
    {
        fprintf(stderr, "usage: prefetch <rounds> <cpu_cost [1, 16]>\n");
        exit(-1);
    }

    rounds = atoi(argv[1]);
    cpu_cost = atoi(argv[2]);
    if (cpu_cost > 16)
        exit(-1);

    for (i = 0; i < 2 * rounds; i++)
    {
        int     j, k;
        struct timeval start_t, stop_t;
        bool    enable = i % 2 ? false : true;  /* alternate on/off rounds */
        char   *blck;

        blocks = (char *) malloc(BLCKSZ * NBLCKS);
        memset(blocks, 'a', BLCKSZ * NBLCKS);

        sum = 0;
        gettimeofday(&start_t, NULL);

        for (j = 0; j < NBLCKS; j++)
        {
            blck = blocks + j * BLCKSZ;
            for (k = 0; k < BLCKSZ; k += CACHESZ)
            {
                int    *s = (int *) (blck + k);
                int     u = cpu_cost;

                if (enable)
                {
                    /* prefetch two cache lines (128 bytes) ahead */
                    __asm__ __volatile__ (
                        "1: prefetchnta 128(%0)\n"
                        : : "r" (s) : "memory");
                }

                /* pretend to process current tuple */
                while (u--)
                    sum += (*(s + u)) * (*(s + u));
            }
        }
        gettimeofday(&stop_t, NULL);

        free(blocks);

        /* measure the time */
        if (stop_t.tv_usec < start_t.tv_usec)
        {
            stop_t.tv_sec--;
            stop_t.tv_usec += 1000000;
        }
        fprintf(stdout, "Sum: %d: with prefetch %s - duration: %ld.%03ld ms\n",
                sum,
                enable ? "on" : "off",
                (long) ((stop_t.tv_sec - start_t.tv_sec) * 1000 +
                        (stop_t.tv_usec - start_t.tv_usec) / 1000),
                (long) (stop_t.tv_usec - start_t.tv_usec) % 1000);
    }

    exit(0);
}
Qingqing,

On 12/8/05 8:07 PM, "Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote:

> /* prefetch ahead */
> __asm__ __volatile__ (
>     "1: prefetchnta 128(%0)\n"
>     : : "r" (s) : "memory" );

I think this kind / grain of prefetch is handled as a compiler optimization in the latest GNU compilers, and further there are some memory streaming operations for the Pentium 4 ISA that are now part of the standard compiler optimizations done by gcc.

What I think would be tremendously beneficial is to implement L2 cache blocking in certain key code paths like sort. What I mean by "cache blocking" is performing as many operations on a block of memory (maybe 128 pages' worth for a 1MB cache) as possible, then moving to the next batch of memory and performing all of the work on that batch, etc.

The other thing to consider in conjunction with this would be maximizing use of the instruction cache, increasing use of parallel functional units, and minimizing pipeline stalls. The best way to do this would be to group operations into tighter groups and to separate out branching.

So instead of structures like this:

    function_row_at_a_time(row)
        if conditions
            do some work
        else if other
            do different work
        else if error
            print error_log

You'd have:

    function_buffer_at_a_time(buffer_of_rows)
        loop on sizeof(buffer_of_rows) / sizeof_L2cache
            do a lot of work on each row
        loop on sizeof(buffer_of_rows) / sizeof_L2cache
            if error
                exit

The ideas in the above optimizations:
- Delay work until a buffer can be gathered
- Increase the "computational intensity" of the loops by putting more instructions together
- While in loops doing lots of work, avoid branches / jumps

- Luke
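To make the cache-blocking and branch-separation ideas concrete, here is a minimal sketch in the style of the pseudocode above. It is illustrative only - the Row layout, the 1MB batch size, and the function name are assumptions, not anything from the PostgreSQL source:

    #include <stddef.h>

    #define L2_BYTES    (1024 * 1024)       /* assume a 1MB L2 cache */

    typedef struct
    {
        int     key;
        int     payload;
        int     error;
    } Row;                                  /* illustrative row layout */

    #define BATCH_ROWS  (L2_BYTES / sizeof(Row))

    static long
    process_buffer(Row *rows, size_t nrows)
    {
        long    sum = 0;
        size_t  base, i, n;

        for (base = 0; base < nrows; base += BATCH_ROWS)
        {
            n = (nrows - base < BATCH_ROWS) ? (nrows - base) : BATCH_ROWS;

            /* pass 1: compute-intensive work on an L2-resident batch, no branches */
            for (i = 0; i < n; i++)
                sum += (long) rows[base + i].key * rows[base + i].payload;

            /* pass 2: branching separated out into its own loop */
            for (i = 0; i < n; i++)
                if (rows[base + i].error)
                    return -1;
        }
        return sum;
    }

Each batch is touched twice while it is still resident in L2, so the second (branchy) pass costs little extra memory traffic.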
""Luke Lonergan"" <llonergan@greenplum.com> wrote > >> /* prefetch ahead */ >> __asm__ __volatile__ ( >> "1: prefetchnta 128(%0)\n" >> : : "r" (s) : "memory" ); > > I think this kind / grain of prefetch is handled as a compiler > optimization > in the latest GNU compilers, and further there are some memory streaming > operations for the Pentium 4 ISA that are now part of the standard > compiler > optimizations done by gcc. > Is there any special kind of optimization flag of gcc needed to support this? I just tried both 2.96 and 4.01 with O2. Unfortunately, sse_clear_page() encounters a core-dump by 4.0.1 at this line: __asm__ __volatile__ (" movntdq %%xmm0, %0"::"m"(sse_save[0]) ); So I removed this test (sorry ...). There is no non-trivial difference AFAICS. The results is attached. I will look into the other parts of your thread tomorrow, Regards, Qingqing --- *#ll prefp3-* -rwx------ 1 zhouqq jmgrp 38k Dec 9 00:49 prefp3-296 -rwx------ 1 zhouqq jmgrp 16k Dec 9 00:49 prefp3-401 *#./prefp3-296 2392.975 MHz clear_page function 'gcc clear_page()' took 27142 cycles per page (172.7 MB/s) clear_page function 'normal clear_page()' took 27161 cycles per page (172.6 MB/s) clear_page function 'mmx clear_page() ' took 17293 cycles per page (271.1 MB/s) clear_page function 'gcc clear_page()' took 27174 cycles per page (172.5 MB/s) clear_page function 'normal clear_page()' took 27142 cycles per page (172.7 MB/s) clear_page function 'mmx clear_page() ' took 17291 cycles per page (271.1 MB/s) copy_page function 'normal copy_page()' took 18552 cycles per page (252.7 MB/s) copy_page function 'mmx copy_page() ' took 12511 cycles per page (374.6 MB/s) copy_page function 'sse copy_page() ' took 12318 cycles per page (380.5 MB/s) *#./prefp3-401 2392.970 MHz clear_page function 'gcc clear_page()' took 27120 cycles per page (172.8 MB/s) clear_page function 'normal clear_page()' took 27151 cycles per page (172.6 MB/s) clear_page function 'mmx clear_page() ' took 17295 cycles per page (271.0 MB/s) clear_page function 'gcc clear_page()' took 27152 cycles per page (172.6 MB/s) clear_page function 'normal clear_page()' took 27114 cycles per page (172.9 MB/s) clear_page function 'mmx clear_page() ' took 17296 cycles per page (271.0 MB/s) copy_page function 'normal copy_page()' took 18586 cycles per page (252.2 MB/s) copy_page function 'mmx copy_page() ' took 12620 cycles per page (371.4 MB/s) copy_page function 'sse copy_page() ' took 12698 cycles per page (369.1 MB/s)
Perhaps because the P4 is already doing H/W prefetching?

http://www.tomshardware.com/2000/11/20/intel/page5.html

I ran the test program on an Opteron 2.2G:

% ./a.out 10 16
Sum: -951304192: with prefetch on - duration: 81.166 ms
Sum: -951304192: with prefetch off - duration: 79.769 ms
Sum: -951304192: with prefetch on - duration: 81.173 ms
Sum: -951304192: with prefetch off - duration: 79.718 ms
Sum: -951304192: with prefetch on - duration: 83.293 ms
Sum: -951304192: with prefetch off - duration: 79.731 ms
Sum: -951304192: with prefetch on - duration: 81.227 ms
Sum: -951304192: with prefetch off - duration: 79.851 ms
Sum: -951304192: with prefetch on - duration: 81.003 ms
Sum: -951304192: with prefetch off - duration: 79.724 ms
Sum: -951304192: with prefetch on - duration: 81.084 ms
Sum: -951304192: with prefetch off - duration: 79.728 ms
Sum: -951304192: with prefetch on - duration: 81.009 ms
Sum: -951304192: with prefetch off - duration: 79.723 ms
Sum: -951304192: with prefetch on - duration: 81.074 ms
Sum: -951304192: with prefetch off - duration: 79.719 ms
Sum: -951304192: with prefetch on - duration: 81.188 ms
Sum: -951304192: with prefetch off - duration: 79.724 ms
Sum: -951304192: with prefetch on - duration: 81.075 ms
Sum: -951304192: with prefetch off - duration: 79.719 ms

Got a slowdown.

I ran it on a PIII 650M:

% ./a.out 10 16
Sum: -951304192: with prefetch on - duration: 284.952 ms
Sum: -951304192: with prefetch off - duration: 291.439 ms
Sum: -951304192: with prefetch on - duration: 290.690 ms
Sum: -951304192: with prefetch off - duration: 299.692 ms
Sum: -951304192: with prefetch on - duration: 295.287 ms
Sum: -951304192: with prefetch off - duration: 290.992 ms
Sum: -951304192: with prefetch on - duration: 285.116 ms
Sum: -951304192: with prefetch off - duration: 294.127 ms
Sum: -951304192: with prefetch on - duration: 286.986 ms
Sum: -951304192: with prefetch off - duration: 291.001 ms
Sum: -951304192: with prefetch on - duration: 283.233 ms
Sum: -951304192: with prefetch off - duration: 401.910 ms
Sum: -951304192: with prefetch on - duration: 297.021 ms
Sum: -951304192: with prefetch off - duration: 307.814 ms
Sum: -951304192: with prefetch on - duration: 287.201 ms
Sum: -951304192: with prefetch off - duration: 303.870 ms
Sum: -951304192: with prefetch on - duration: 286.962 ms
Sum: -951304192: with prefetch off - duration: 352.779 ms
Sum: -951304192: with prefetch on - duration: 283.245 ms
Sum: -951304192: with prefetch off - duration: 294.422 ms

Looks like some speedup to me.

On Thu, 08 Dec 2005 Qingqing Zhou wrote:
>
> I found an interesting paper on improving index speed by prefetching
> memory data into the L1/L2 cache (there was a discussion several days
> ago, in the "ice-breaker thread", about prefetching disk data into
> memory):
> http://www.cs.cmu.edu/~chensm/papers/index_pf_final.pdf
>
> Also a related technique used to speed up memcpy:
> http://people.redhat.com/arjanv/pIII.c
>
> I wonder if we could use it to speed up in-memory scan operations for
> heaps or indexes. Tom's patch has made the scan handle a page (vs. a
> row) at a time, which is a basis for this optimization.
>
> I wrote a program to try to simulate it, but I am not good at
> micro-optimization, and I only get a very weak but kind-of-stable
> improvement. I wonder if any people here are interested in taking a look.
> > Regards, > Qingqing > > ---------------------------------- > > Test results > -------------- > Cache line size: 64 > CPU: P4 2.4G > $#./prefetch 10 16 > Sum: -951304192: with prefetch on - duration: 42.163 ms > Sum: -951304192: with prefetch off - duration: 42.838 ms > Sum: -951304192: with prefetch on - duration: 44.044 ms > Sum: -951304192: with prefetch off - duration: 42.792 ms > Sum: -951304192: with prefetch on - duration: 42.324 ms > Sum: -951304192: with prefetch off - duration: 42.803 ms > Sum: -951304192: with prefetch on - duration: 42.189 ms > Sum: -951304192: with prefetch off - duration: 42.801 ms > Sum: -951304192: with prefetch on - duration: 42.155 ms > Sum: -951304192: with prefetch off - duration: 42.827 ms > Sum: -951304192: with prefetch on - duration: 42.179 ms > Sum: -951304192: with prefetch off - duration: 42.798 ms > Sum: -951304192: with prefetch on - duration: 42.180 ms > Sum: -951304192: with prefetch off - duration: 42.804 ms > Sum: -951304192: with prefetch on - duration: 42.193 ms > Sum: -951304192: with prefetch off - duration: 42.827 ms > Sum: -951304192: with prefetch on - duration: 42.164 ms > Sum: -951304192: with prefetch off - duration: 42.810 ms > Sum: -951304192: with prefetch on - duration: 42.182 ms > Sum: -951304192: with prefetch off - duration: 42.826 ms > > Test program > ---------------- > > /* > * prefetch.c > * PostgreSQL warm-cache sequential scan simulator with prefetch > */ > > #include <stdio.h> > #include <stdlib.h> > #include <memory.h> > #include <errno.h> > #include <sys/time.h> > > typedef char bool; > #define true ((bool) 1) > #define false ((bool) 0) > > #define BLCKSZ 8192 > #define CACHESZ 64 > #define NBLCKS 5000 > > int sum; > > int > main(int argc, char *argv[]) > { > int i, rounds; > char *blocks; > int cpu_cost; > > if (argc != 3) > { > fprintf(stderr, "usage: prefetch <rounds> <cpu_cost [1, 16]>\n"); > exit(-1); > } > > rounds = atoi(argv[1]); > cpu_cost = atoi(argv[2]); > if (cpu_cost > 16) > exit(-1); > > for (i = 0; i < 2*rounds; i++) > { > int j, k; > struct timeval start_t, stop_t; > bool enable = i%2?false:true; > char *blck; > > blocks = (char *)malloc(BLCKSZ*NBLCKS); > memset(blocks, 'a', BLCKSZ*NBLCKS); > > sum = 0; > gettimeofday(&start_t, NULL); > > for (j = 0; j < NBLCKS; j++) > { > blck = blocks + j*BLCKSZ; > for (k=0; k < BLCKSZ; k+=CACHESZ) > { > int *s = (int *)(blck + k); > int u = cpu_cost; > > if (enable) > { > /* prefetch ahead */ > __asm__ __volatile__ ( > "1: prefetchnta 128(%0)\n" > : : "r" (s) : "memory" ); > } > > /* pretend to process current tuple */ > while (u--) sum += (*(s+u))*(*(s+u)); > } > } > gettimeofday(&stop_t, NULL); > > free(blocks); > > /* measure the time */ > if (stop_t.tv_usec < start_t.tv_usec) > { > stop_t.tv_sec--; > stop_t.tv_usec += 1000000; > } > fprintf (stdout, "Sum: %d: with prefetch %s - duration: %ld.%03ld ms\n", > sum, > enable?"on":"off", > (long) ((stop_t.tv_sec - start_t.tv_sec) * 1000 + > (stop_t.tv_usec - start_t.tv_usec) / 1000), > (long) (stop_t.tv_usec - start_t.tv_usec) % 1000); > > } > > exit(0); > } > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
This technique and others are discussed in detail in the Intel Optimization Manual:
http://apps.intel.com/scripts-util/download.asp?url=/design/PentiumII/manuals/24512701.pdf&title=Intel%AE+Architecture+Optimization+Reference+Manual&fullpg=3&site=Developer

Similar manuals exist for AMD and other architectures.

On Thu, 2005-12-08 at 23:07 -0500, Qingqing Zhou wrote:
> I wrote a program to try to simulate it, but I am not good at
> micro-optimization, and I only get a very weak but kind-of-stable
> improvement. I wonder if any people here are interested in taking a look.

You may be trying to use the memory too early. Prefetched memory takes time to arrive in the cache, so you may need to issue prefetch calls for N+2, N+3, etc. rather than simply N+1.

p. 6-11 covers this.

There are a lot of papers around coming up with interesting-sounding techniques. AFAICS, all they've done is read the optimization guide and tried to apply that wisdom, so it seems a good idea to go back to the source.

I think many of these techniques are generic across architectures, so there is much to be done in this area, IMHO. Though we must respect portability and confirm any tuning through testing.

Best Regards, Simon Riggs
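Simon's point about prefetch distance can be sketched as follows. This is a hypothetical example, not code from the thread: it assumes GCC's __builtin_prefetch (available since GCC 3.1, and compiled to prefetchnta-style instructions when the locality hint is 0), and PF_DIST would need tuning per CPU:

    #define CACHESZ 64      /* cache line size, as in the test program */
    #define PF_DIST 4       /* how many lines ahead to fetch; needs tuning */

    static long
    scan_page(const char *page, int pagesz)
    {
        long    sum = 0;
        int     k;

        for (k = 0; k < pagesz; k += CACHESZ)
        {
            /* request the line we will need PF_DIST iterations from now */
            if (k + PF_DIST * CACHESZ < pagesz)
                __builtin_prefetch(page + k + PF_DIST * CACHESZ, 0, 0);

            /* pretend to process the current tuple */
            sum += *(const int *) (page + k);
        }
        return sum;
    }

The idea is that the per-iteration work must roughly cover the memory latency; fetching only one line ahead (PF_DIST = 1) often means the data has not arrived by the time it is needed.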
Do these optimizations have any effect on database software? I know call overhead shows up as a performance bottleneck, but do these other optimizations also measurably improve performance?

---------------------------------------------------------------------------

Simon Riggs wrote:
> This technique and others are discussed in detail in the Intel
> Optimization Manual:
> http://apps.intel.com/scripts-util/download.asp?url=/design/PentiumII/manuals/24512701.pdf&title=Intel%AE+Architecture+Optimization+Reference+Manual&fullpg=3&site=Developer
>
> Similar manuals exist for AMD and other architectures.
>
> [snip]
>
> I think many of these techniques are generic across architectures, so
> there is much to be done in this area, IMHO. Though we must respect
> portability and confirm any tuning through testing.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Kenneth Marshall wrote:
> The main benefit of pre-fetching optimization is to allow just-in-time
> data delivery to the processor. There are numerous papers illustrating
> the dramatic increase in data throughput from using data structures
> designed to take advantage of prefetching; factors of 3-7 can be
> realized, which can greatly increase database performance. The first
> step needed to take advantage of pre-fetching's ability to reduce
> memory latency is to design the index page layout with an internal
> blocking of the cache-line size. Then issue pre-fetch instructions for
> the memory you are going to need to process the index page far enough
> in advance that it is in a cache line by the time it is needed.

I can see that being useful for a single-user application that doesn't have locking or I/O bottlenecks, and doesn't have a multi-stage design like a database. Do we do enough of such processing that we will _see_ an improvement, or will our code become more complex, making it harder to make algorithmic optimizations?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I can see that being useful for a single-user application that doesn't > have locking or I/O bottlenecks, and doesn't have a multi-stage design > like a database. Do we do enough of such processing that we will _see_ > an improvement, or will our code become more complex and it will be > harder to make algorithmic optimizations to our code? The main concern I've got about this is the probable negative effect on code readability. There's a limit to the extent to which I'm willing to uglify the code for processor-specific optimizations, and that limit is not real far off. There are a lot of other design levels we can work at to obtain speedups that won't depend on the assumption we are running on this-year's Intel hardware. regards, tom lane
On Fri, Dec 09, 2005 at 10:37:25AM -0500, Bruce Momjian wrote:
> Kenneth Marshall wrote:
> > The main benefit of pre-fetching optimization is to allow just-in-time
> > data delivery to the processor.
> > [snip]
>
> I can see that being useful for a single-user application that doesn't
> have locking or I/O bottlenecks, and doesn't have a multi-stage design
> like a database. Do we do enough of such processing that we will _see_
> an improvement, or will our code become more complex, making it harder
> to make algorithmic optimizations?

We should certainly consider all of the trade-offs involved. But if processing a single index page takes 1/5 the time or less, then the DB can process 5x the lookups in the same amount of time. That would be very nice in a multi-user DB.

Ken
The main benefit of pre-fetching optimization is to allow just-in-time data delivery to the processor. There are numerous papers illustrating the dramatic increase in data throughput from using data structures designed to take advantage of prefetching; factors of 3-7 can be realized, which can greatly increase database performance. The first step needed to take advantage of pre-fetching's ability to reduce memory latency is to design the index page layout with an internal blocking of the cache-line size. Then issue pre-fetch instructions for the memory you are going to need to process the index page far enough in advance that it is in a cache line by the time it is needed.

Ken

On Fri, Dec 09, 2005 at 09:43:33AM -0500, Bruce Momjian wrote:
>
> Do these optimizations have any effect on database software? I know
> call overhead shows up as a performance bottleneck, but do these other
> optimizations also measurably improve performance?
>
> [snip]
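A cache-line-blocked index page layout of the kind Kenneth describes might look roughly like this. This is a speculative sketch of the layout idea only - PostgreSQL's real btree page format is quite different, and the KeyBlock structure, block names, and two-block prefetch distance are all assumptions:

    #define CACHE_LINE      64
    #define KEYS_PER_BLOCK  ((CACHE_LINE - sizeof(short)) / sizeof(int))  /* 15 */

    /* one cache line holds a block of keys plus its count */
    typedef struct
    {
        int     keys[KEYS_PER_BLOCK];
        short   nkeys;
    } __attribute__((aligned(CACHE_LINE))) KeyBlock;

    static int
    search_page(const KeyBlock *blocks, int nblocks, int target)
    {
        int     b, i;

        for (b = 0; b < nblocks; b++)
        {
            /* fetch two blocks ahead so the line arrives before we need it */
            if (b + 2 < nblocks)
                __builtin_prefetch(&blocks[b + 2], 0, 1);

            for (i = 0; i < blocks[b].nkeys; i++)
                if (blocks[b].keys[i] == target)
                    return b * (int) KEYS_PER_BLOCK + i;
        }
        return -1;      /* not found */
    }

Because each KeyBlock sits in exactly one cache line, every prefetch covers a whole group of keys, and the search never straddles a line boundary.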
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I can see that being useful for a single-user application that doesn't
> > have locking or I/O bottlenecks, and doesn't have a multi-stage design
> > like a database. Do we do enough of such processing that we will _see_
> > an improvement, or will our code become more complex, making it harder
> > to make algorithmic optimizations?
>
> The main concern I've got about this is the probable negative effect on
> code readability. There's a limit to the extent to which I'm willing to
> uglify the code for processor-specific optimizations, and that limit is
> not real far off. There are a lot of other design levels we can work at
> to obtain speedups that won't depend on the assumption that we are
> running on this-year's Intel hardware.

That is my guess too. We have seen speedups from inlining and optimizing frequently-called functions and from using assembler for spinlocks. Proof of the assembler is in /pg/include/storage/s_lock.h, and proof of the inlining is in /pg/include/access/heapam.h. Those were chosen for optimization because they were used a lot.

I think the big question is whether there are other areas that have a similar CPU load and can be meaningfully optimized, and whether the optimization involves such things as multi-staging. I think we should take a wait-and-see attitude and see what test results people get.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Fri, 2005-12-09 at 09:43 -0500, Bruce Momjian wrote:
> Do these optimizations have any effect on database software? I know
> call overhead shows up as a performance bottleneck, but do these other
> optimizations also measurably improve performance?

Many of them can, though with nowhere near as much benefit as you'd get by applying them to a graphics/media app. We've already taken note of the effects of buffer alignment and padding, which was pretty much straight out of this manual.

Recent research suggests that the CPU<->memory bottleneck is growing worse each year, so attention to this area should have longer-term payoffs. Recent CPU changes indicate that the number of CPU cores available is going to increase dramatically in the next few years, so scalability is critical also.

The trick is to find the optimization sweet spot and to be certain that it is reasonably broadly applicable. But I think we will find some benefit somewhere, so I think it's worth the look, at the same time checking that the compiler guys haven't solved it for us first.

Best Regards, Simon Riggs
On Thu, 8 Dec 2005, Min Xu (Hsu) wrote:
> Perhaps because the P4 is already doing H/W prefetching?
>
> http://www.tomshardware.com/2000/11/20/intel/page5.html
>
> I ran the test program on an Opteron 2.2G:
> Got a slowdown.
>
> I ran it on a PIII 650M:
> Looks like some speedup to me.

Ok, I see ... so this optimization relies on the CPU type and even the CPU speed. If many programs are fighting for the L2 cache, then things will be screwed up.

Regards,
Qingqing
Bruce,

It (the compute-intensity optimization) is what we did for COPY parsing, and it sped up by a factor of 100+. The rest of the COPY path could use some work too.

The virtual tuples in 8.1 are another example of grouping operations into more compact chunks instead of doing bits at a time.

The L2 cache optimization could pay huge dividends for certain parts of the code path, like sorting for instance. A simple outer loop that strip-mines in L2-sized chunks in sorting could speed things up 5 to 10 times in that section.

- Luke

--------------------------
Sent from my BlackBerry Wireless Device

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org <pgsql-hackers-owner@postgresql.org>
To: Simon Riggs <simon@2ndquadrant.com>
CC: Qingqing Zhou <zhouqq@cs.toronto.edu>; pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Sent: Fri Dec 09 09:43:33 2005
Subject: Re: [HACKERS] Warm-cache prefetching

Do these optimizations have any effect on database software? I know call overhead shows up as a performance bottleneck, but do these other optimizations also measurably improve performance?

[snip]
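One way to picture the strip-mining loop Luke mentions - a speculative sketch, not anything from the actual tuplesort code: sort each L2-sized run while it is cache-resident, then (in a real implementation) merge the sorted runs afterwards:

    #include <stdlib.h>

    #define L2_BYTES    (1024 * 1024)       /* assume a 1MB L2 cache */

    static int
    cmp_int(const void *a, const void *b)
    {
        int     ia = *(const int *) a;
        int     ib = *(const int *) b;

        return (ia > ib) - (ia < ib);
    }

    /*
     * Strip-mine: each qsort pass works on a run that fits in L2,
     * so comparisons and swaps stay cache-resident.
     */
    static void
    strip_mine_sort(int *a, size_t n)
    {
        size_t  run = L2_BYTES / sizeof(int);
        size_t  i;

        for (i = 0; i < n; i += run)
        {
            size_t  len = (n - i < run) ? (n - i) : run;

            qsort(a + i, len, sizeof(int), cmp_int);
        }
        /* ... a multiway merge of the sorted runs would follow here ... */
    }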
Luke Lonergan wrote: > Bruce, > > It (the compute intensity optimization) is what we did for copy parsing, > and it sped up by a factor of 100+. > > The rest of the copy path could use some work too. > > Yge virtual tuples in 8.1 are another example of grouping operations > into more compact chunks instead of doing bits at a time. > > The l2 cache optimization could pay huge dividends for certain parts > of the code path, like sorting for instance. A simple outer loop that > strip mines in l2 sized chunks in sorting could speed things up 5 to > 10 times in that section. Yes, I can see COPY and sort being good for optimization because we spend a large percentage of the command in a small section of the code. I just don't know what other things have a similar localized operating area. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Luke Lonergan wrote: >Bruce, > >It (the compute intensity optimization) is what we did for copy parsing, and it sped up by a factor of 100+. > > The changes made to COPY were portable, though. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Luke Lonergan wrote: >> It (the compute intensity optimization) is what we did for copy parsing, and it sped up by a factor of 100+. > The changes made to COPY were portable, though. In fact, the changes made to COPY had absolutely nada to do with any of the things discussed in this thread. BTW, "sped up by 100%" (which is already an overstatement of what was actually accomplished) is a long way from "sped up by a factor of 100". regards, tom lane
Tom,

On 12/9/05 2:14 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Andrew Dunstan <andrew@dunslane.net> writes:
>> Luke Lonergan wrote:
>>> It (the compute-intensity optimization) is what we did for COPY
>>> parsing, and it sped up by a factor of 100+.
>
>> The changes made to COPY were portable, though.
>
> In fact, the changes made to COPY had absolutely nada to do with any of
> the things discussed in this thread.

Yes, they do - you must not have read my post in this thread on compute intensity and removing pipeline stalls.

> BTW, "sped up by 100%" (which is already an overstatement of what was
> actually accomplished) is a long way from "sped up by a factor of 100".

Wrong again - the code section that did parsing sped up by 100x, but the overall improvement was *only* 100% in our version and 60% in your version.

- Luke
On Fri, Dec 09, 2005 at 11:32:48AM -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > The main concern I've got about this is the probable negative effect on
> > code readability. There's a limit to the extent to which I'm willing to
> > uglify the code for processor-specific optimizations, and that limit is
> > not real far off. There are a lot of other design levels we can work at
> > to obtain speedups that won't depend on the assumption that we are
> > running on this-year's Intel hardware.
>
> That is my guess too. We have seen speedups from inlining and optimizing
> frequently-called functions and from using assembler for spinlocks.
> Proof of the assembler is in /pg/include/storage/s_lock.h, and proof of
> the inlining is in /pg/include/access/heapam.h. Those were chosen for
> optimization because they were used a lot.
>
> I think the big question is whether there are other areas that have a
> similar CPU load and can be meaningfully optimized, and whether the
> optimization involves such things as multi-staging. I think we should
> take a wait-and-see attitude and see what test results people get.

I also agree that we should go for the most bang for the buck and include the coding/maintenance aspects in the cost. Pre-fetching is not only available in x86 processors; most modern processors now support memory prefetch operations. If our designs going forward do not consider the cache-line stalls the processor incurs while waiting for data, substantial performance gains will remain forever out of reach.

Ken
"Simon Riggs" <simon@2ndquadrant.com> wrote > > You may be trying to use the memory too early. Prefetched memory takes > time to arrive in cache, so you may need to issue prefetch calls for N > +2, N+3 etc rather than simply N+1. > > p.6-11 covers this. > I actually tried it and no improvements have been observed. Also, this may conflict with "try to mix prefetch with computation" suggestion from the manual that you pointed out. But anyway, this looks like fixable compared to the following "prefetch distance" problem. As I read from the manual, this is one key factor of the efficiency, which also matches our intuition. However, when we process each tuple on a page, CPU clocks that are needed might be quite different: ---for (each tuple on a page){ if (ItemIdIsUsed(lpp)) /* some stopped here */ { ... /* some involves deeper functioncalls here */ valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer); if (valid) scan->rs_vistuples[ntup++]= lineoff; }} --- So it is pretty hard to predicate the prefetch distance. The prefetch improvements to memcpy/memmove does not have this problem, the prefecth distance can be fixed, and it does not change due to the different speed CPUs of the same processor serials. Maybe L2 cache is big enough so no need to worry about fetch too ahead? Seems not true, since this idea is vulnerable to a busy system. No data in L2 will be saved for you for a long time. As Luke suggested, the code above scan operators like sort might be a better place to look at. I will take a look there. Regards, Qingqing