On Tue, 2005-08-09 at 21:48 -0700, Luke Lonergan wrote:
> The key thing that is missing is micro-parallelism in the character
> processing in this version. By "inverting the loop", or putting
> the characters into a buffer on the outside, then doing fast character
> scanning inside with special "fix-up" cases, we exposed long runs of
> pipeline-able code to the compiler.
>
> I think there is another way to accomplish the same thing and still preserve
> the current structure, but it requires "strip mining" the character buffer
> into chunks that can be processed with an explicit loop to check for the
> different characters. While it may seem artificial (it is), it will provide
> the compiler with the ability to pipeline the character finding logic over
> long runs. The other necessary element is avoiding pipeline stalls from
> the "if" conditions as much as possible.
This is a key point, IMHO.
That part of the code was specifically written to take advantage of
processing pipelines in the hardware, not because the actual theoretical
algorithm for that approach was itself faster.
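To make the "strip mining" idea concrete, here is a rough sketch in C of
the kind of chunked scanning loop Luke describes. The chunk size, the
function names and the set of special characters are invented for
illustration; this is not the actual COPY code.

/*
 * Hypothetical sketch of strip-mined character scanning, not the real
 * COPY parser.  The buffer is walked in fixed-size chunks; the inner
 * loop only tests for the few interesting characters, so the compiler
 * can keep a long, predictable run of work in the pipeline, and the
 * rarer "fix-up" handling stays outside the hot path.
 */
#include <stddef.h>
#include <stdbool.h>

#define CHUNK_SIZE 64            /* arbitrary chunk size for illustration */

static inline bool
is_special(char c)
{
    /* delimiter, quote, escape and newline are the interesting cases */
    return c == ',' || c == '"' || c == '\\' || c == '\n';
}

void
scan_buffer(const char *buf, size_t len)
{
    size_t  pos = 0;

    while (pos < len)
    {
        size_t  end = pos + CHUNK_SIZE;

        if (end > len)
            end = len;

        /*
         * Tight inner loop over one chunk: a simple counted loop with a
         * predictable exit test, which pipelines well.
         */
        for (size_t i = pos; i < end; i++)
        {
            if (is_special(buf[i]))
            {
                /* rare path: hand off to the existing fix-up logic */
                /* handle_special(buf, i);   -- placeholder only */
            }
        }

        pos = end;
    }
}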
Nobody has said what compiler and hardware they have been using; since
both Alon and Tom say their character finding logic is faster, the
difference is likely to come down to that. Name your platforms,
gentlemen, please.
My feeling is that we may learn something here that applies more widely
across many parts of the code.
Best Regards, Simon Riggs