On 01/01/2015 09:17 AM, Abhijit Menon-Sen wrote:
> Hi.
>
> OK, here are the patches with the various suggestions applied.
>
> I found that the alignment didn't seem to make much difference for the
> CRC32* instructions, so I changed to process (len/8)*8bytes followed by
> (len%8)*1bytes, the way the Linux kernel does.
Ok.
In the slicing-by-8 version, I wonder if it would be better to do
single-byte loads to c0-c7, instead of two 4-byte loads and shifts.
4-byte loads are presumably faster than single byte loads, but then
you'd avoid the shifts. And then you could go straight into the
8-bytes-at-a-time loop, without the initial single-byte processing to
get the start address aligned. (the Linux implementation doesn't do
that, so maybe it's a bad idea, but might be worth testing..)
Looking at the Linux implementation, I think it only does the bswap once
per call, not inside the hot loop. Would it even make sense to keep the
crc variable in different byte order, and only do the byte-swap once in
END_CRC32() ?
The comments need some work. I note that there is no mention of the
slicing-by-8 algorithm anywhere in the comments (in the first patch).
Instead of checking for "defined(__GNUC__) || defined(__clang__)",
should add an explicit configure test for __builtin_bswap32().
- Heikki