This technique and others are discussed in detail in the Intel
Optimization Manual:
http://apps.intel.com/scripts-util/download.asp?url=/design/PentiumII/manuals/24512701.pdf&title=Intel%AE+Architecture+Optimization+Reference+Manual&fullpg=3&site=Developer
Similar manuals exist for AMD and other architectures.
On Thu, 2005-12-08 at 23:07 -0500, Qingqing Zhou wrote:
> I write a program try to simulate it, but I am not good at micro
> optimization, and I just get a very weak but kind-of-stable improvement. I
> wonder if any people here are interested to take a look.
You may be trying to use the memory too early. Prefetched memory takes
time to arrive in cache, so you may need to issue prefetch calls for
N+2, N+3, etc. rather than simply N+1.
Page 6-11 of the manual covers this.
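
To make the point about prefetch distance concrete, here is a minimal
sketch. It is not the program from the quoted mail; the Item struct,
sum_items() and PREFETCH_DISTANCE are hypothetical names chosen for
illustration, and it assumes a compiler that provides
__builtin_prefetch (GCC or Clang). The right distance depends on the
hardware and the work done per element, so treat the value below as a
starting point for testing rather than a recommendation.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch.
 * A distance of 1 (N+1) may be too short for the data to arrive in time. */
#ifndef PREFETCH_DISTANCE
#define PREFETCH_DISTANCE 4
#endif

/* Hypothetical item type, padded to 64 bytes on typical LP64 platforms
 * so that each item roughly fills one cache line. */
typedef struct Item
{
    long value;
    char pad[56];
} Item;

/* Sum the values reached through an array of pointers. While working on
 * element i, request element i + PREFETCH_DISTANCE so that it has time
 * to reach the cache before it is dereferenced. */
long
sum_items(Item *const *items, size_t n)
{
    long   total = 0;
    size_t i;

    for (i = 0; i < n; i++)
    {
        /* Guard so we never read a pointer past the end of the array. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(items[i + PREFETCH_DISTANCE], 0, 3);

        total += items[i]->value;
    }
    return total;
}
```

Building with different -DPREFETCH_DISTANCE=N values and timing each
one is a simple way to see whether N+1 really is too early on a given
machine.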
There are a lot of papers around that come up with interesting-sounding
techniques. AFAICS, all they've done is read the optimization guide and
tried to apply that wisdom, so it seems a good idea to go back to the
source.
I think many of these techniques are generic across architectures, so
there is much to be done in this area, IMHO, though we must respect
portability and confirm any tuning through testing.
Best Regards, Simon Riggs