Re: Enabling Checksums - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: Enabling Checksums
Msg-id CA+CSw_u=P8goPbpvcN8iABUacQumD_VH9MHvKiXQ+fgsqyT0SQ@mail.gmail.com
In response to Re: Enabling Checksums  (Greg Smith <greg@2ndQuadrant.com>)
Responses Re: Enabling Checksums  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Enabling Checksums  (Greg Smith <greg@2ndQuadrant.com>)
List pgsql-hackers
On Mon, Mar 18, 2013 at 2:04 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 3/15/13 5:32 AM, Ants Aasma wrote:
>>
>> Best case using the CRC32 instruction would be 6.8 bytes/cycle [1].
>> But this got me thinking about how to do this faster...
>> [1]
>> http://www.drdobbs.com/parallel/fast-parallelized-crc-computation-using/229401411
>
>
> The optimization work you went through here looked very nice. Unfortunately,
> a few things seem to be pushing toward using a CRC16 instead of the Fletcher
> approach.  It seems possible to execute a CRC16 in a reasonable enough time,
> in the same neighborhood as the Fletcher one. And there is some hope that
> hardware acceleration for CRCs will be available in a system API/compiler
> feature one day, making them even cheaper.
>
> Ants, do you think you could take a similar look at optimizing a CRC16
> calculation?  I'm back to where I can do a full performance comparison run
> again starting tomorrow, with the latest version of this patch, and I'd like
> to do that with a CRC16 implementation or two.  I'm not sure if it's
> possible to get a quicker implementation because the target is a CRC16, or
> whether it's useful to consider truncating a CRC32 into a CRC16.

I looked for fast CRC implementations on the net. The fastest plain C
variant I could find was one produced by Intel's R&D department
(available under a BSD license [1], requires some porting). It does 8
parallel 8-bit table lookups, requiring an 8*256*4 = 8 kB lookup
table.
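For reference, a minimal slicing-by-8 sketch looks like this (the polynomial and table derivation here are the standard reflected CRC32; the Intel code may differ in details):

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[8][256];

static void crc32_init(void)
{
    /* Table 0 is the classic one-byte-at-a-time table. */
    for (int i = 0; i < 256; i++) {
        uint32_t c = (uint32_t) i;
        for (int j = 0; j < 8; j++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table[0][i] = c;
    }
    /* Table t advances the CRC across t additional zero bytes. */
    for (int t = 1; t < 8; t++)
        for (int i = 0; i < 256; i++)
            crc_table[t][i] = (crc_table[t - 1][i] >> 8) ^
                              crc_table[0][crc_table[t - 1][i] & 0xff];
}

static uint32_t crc32_slice8(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xffffffffu;

    while (len >= 8) {
        crc ^= (uint32_t) buf[0] | (uint32_t) buf[1] << 8 |
               (uint32_t) buf[2] << 16 | (uint32_t) buf[3] << 24;
        /* Eight independent table lookups, XORed together. */
        crc = crc_table[7][crc & 0xff] ^
              crc_table[6][(crc >> 8) & 0xff] ^
              crc_table[5][(crc >> 16) & 0xff] ^
              crc_table[4][crc >> 24] ^
              crc_table[3][buf[4]] ^
              crc_table[2][buf[5]] ^
              crc_table[1][buf[6]] ^
              crc_table[0][buf[7]];
        buf += 8;
        len -= 8;
    }
    while (len--)               /* tail bytes one at a time */
        crc = (crc >> 8) ^ crc_table[0][(crc ^ *buf++) & 0xff];
    return crc ^ 0xffffffffu;
}
```

The eight lookups per 8-byte block are independent of each other, which is what gives the superscalar hardware something to chew on.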

Using the table lookup method, a CRC16 would run at exactly the same
speed but the table would be 2x smaller. There is also an option to
use 4 lookup tables; that approach is said to be about 2x slower for
2x less data.

I took a look at the assembly generated for the slice-by-8 algorithm.
It seems that GCC, for some mysterious reason, decides to accumulate
the XORs in a serial chain, losing superscalar execution
possibilities. If it could be coaxed into accumulating the XORs in a
tree pattern, performance should improve by somewhere between 1.5x and 2x.

For CRC32C there is also the option to use the crc32 instruction
available on newer Intel machines and run 3 parallel CRC calculations
to hide the 3-cycle latency of that instruction, combining them
at the end [2]. Skimming the paper, it looks like there are some
patents in this area, so if we wish to implement this we would have
to see how to navigate around them. The other issue is that the crc32
instruction is Intel-only so far. The cited performance is 6.8
bytes/cycle.
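For illustration, a single-stream version using the SSE4.2 intrinsic might look like the sketch below; the 3-way interleaved variant from [2] additionally needs a carryless-multiply step to recombine the three partial CRCs, which is the part the patents cover:

```c
#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>          /* SSE4.2 crc32 intrinsics */

/* Single-stream hardware CRC32C sketch. The cited 6.8 bytes/cycle
 * needs three interleaved streams; this version is latency-bound at
 * roughly 8 bytes per 3 cycles. It will fault on CPUs without SSE4.2. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(const unsigned char *buf, size_t len)
{
    uint64_t crc = 0xffffffffu;

    while (len >= 8) {
        uint64_t word;
        __builtin_memcpy(&word, buf, 8);   /* unaligned-safe load */
        crc = _mm_crc32_u64(crc, word);
        buf += 8;
        len -= 8;
    }
    while (len--)
        crc = _mm_crc32_u8((uint32_t) crc, *buf++);
    return (uint32_t) crc ^ 0xffffffffu;
}
```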

There is also the option of using the pclmulqdq instruction to do
generic CRCs in 16-byte blocks. It is available on Intel Westmere and
up (2010+) and AMD Bulldozer and up (2011+); sample assembly is given
in the Intel paper [3]. The cited speed is 0.88 bytes/cycle.

I lifted the benchmark framework from the Intel slicing-by-8 code and
ran some tests on the implementations I had available: the slicing-by-8
CRC from Intel, the Fletcher implementation from the checksum patch,
my parallel 16-bit checksum approach, and a hand-coded 32-bit parallel
checksum I had (it requires SSE4.1 as implemented, but on the Sandy
Bridge platform its performance should be equivalent to a 16-bit
version requiring only SSE2).

Here are the results:
gcc 4.7 -O2, 8192-byte buffer:
CRC32 slicing by 8 Algorithm (bytes/cycle), 0.524249
Fletcher Algorithm: (bytes/cycle), 1.930567
SIMD Algorithm (gcc): (bytes/cycle), 0.575617
SIMD Algorithm (hand coded): (bytes/cycle), 9.196853

gcc 4.7 -O2 -ftree-vectorize -funroll-loops, 8192-byte buffer:
CRC32 slicing by 8 Algorithm (bytes/cycle), 0.523573
Fletcher Algorithm: (bytes/cycle), 3.316269
SIMD Algorithm (gcc): (bytes/cycle), 7.866682
SIMD Algorithm (hand coded): (bytes/cycle), 9.114214

Notes:
* As you can see, the CRC-based approach would have roughly 4x the
performance overhead of the Fletcher algorithm as implemented in the
current patch.
* This benchmark is the best case for the slicing CRC algorithm.
Real-world use might not have the lookup table in cache.
* We should probably check what the code path length from the read
syscall to the checksum calculation is. We don't want it to contain
anything that would push the page out of cache.
* Even a pclmulqdq-based implementation would be a lot slower than Fletcher.
* The Fletcher algorithm benefits greatly from unrolling, as the loop
body is so cheap and the algorithm is ALU-bound.
* As predicted, the SIMD algorithm is quite slow if the compiler won't
vectorize it. But notice that its performance is still comparable to
unaccelerated CRC.
* The vectorized SIMD gcc variant outperforms the claimed performance
of hardware-accelerated crc32 while using only SSE2 features
(available in the base x86-64 instruction set). The gap isn't large,
though.
* Vectorized SIMD performance is surprisingly close to the hand-coded
version. I'm not sure if something is holding back the hand-coded
version or if measurement overhead is coming into play; this would
require further investigation. perf accounted 25% of the execution
time to the rdtsc instructions in the measurement loop for the
hand-coded variant, and not all of that is from the pipeline flush.
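For context, the Fletcher family boils down to two running sums, which is why unrolling helps so much: the loop body is just two adds. A generic Fletcher-32 sketch over 16-bit words (not necessarily the patch's exact variant):

```c
#include <stdint.h>
#include <stddef.h>

/* Generic Fletcher-32: s1 sums the data words, s2 sums the running s1,
 * so each word's contribution depends on its position. The two
 * dependent adds per word are what make unrolling and multiple
 * accumulators pay off so well. */
static uint32_t fletcher32(const uint16_t *data, size_t words)
{
    uint32_t s1 = 0, s2 = 0;

    while (words--) {
        s1 = (s1 + *data++) % 65535;
        s2 = (s2 + s1) % 65535;
    }
    return (s2 << 16) | s1;
}
```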

My 2¢ is that we should either go with a truncated CRC32C, in the hope
that hardware implementations become more widespread and we can pick
an optimized implementation based on cpuid at runtime; or, if we need
performance right now, go with the parallel implementation and amend
the build infrastructure to support vectorization where possible. That
would give good performance to 99% of users out there, and the ones
missing out would still have a solution as fast as the best CRC
algorithm.
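The runtime cpuid dispatch could be as simple as the following sketch, using GCC's cpu-detection builtins (names here are illustrative; the accelerated implementation is elided):

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t (*checksum_fn)(const unsigned char *, size_t);

/* Portable bitwise CRC32C fallback (Castagnoli polynomial, reflected).
 * Slow, but works everywhere. */
static uint32_t crc32c_sw(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xffffffffu;

    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    }
    return crc ^ 0xffffffffu;
}

static checksum_fn page_checksum = crc32c_sw;

/* Run once at startup: install the accelerated implementation when the
 * CPU supports it. */
static void select_checksum(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2")) {
        /* page_checksum = crc32c_hw;  -- crc32-instruction version */
    }
}
```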

I don't really have a lot of cycles left to devote to this this week.
I can maybe help code one of the approaches into PostgreSQL to measure
what the real-world effect is. Or if you'd like to test the SIMD
version, you can take my last patch in this thread and compare
versions built with CFLAGS="-O2 -ftree-vectorize -funroll-loops".
Check with "objdump -d src/backend/storage/page/bufpage.o | grep
pmullw" to verify that it is vectorized.

The parallel multiply-by-prime-and-add algorithm would also need
verification that it gives good detection of interesting error cases.
It's widely used as a hash function, so it shouldn't be too bad.
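A scalar model of that multiply-by-prime-and-add scheme, for reference when testing error detection (the lane count and multiplier here are illustrative, not the patch's actual constants):

```c
#include <stdint.h>
#include <stddef.h>

#define N_LANES 16        /* illustrative: 16-bit lanes in one SSE2 register */
#define PRIME   0xC001u   /* illustrative odd multiplier */

/* N_LANES independent multiply-and-add accumulators striped across the
 * buffer, folded at the end. Because the lanes are independent, a
 * vectorizing compiler can map the inner loop onto SIMD multiply/add.
 * Assumes len is a multiple of N_LANES, as disk pages are. */
static uint32_t checksum_parallel(const unsigned char *buf, size_t len)
{
    uint32_t lane[N_LANES] = {0};

    for (size_t i = 0; i + N_LANES <= len; i += N_LANES)
        for (int j = 0; j < N_LANES; j++)
            lane[j] = lane[j] * PRIME + buf[i + j];

    /* Fold the lanes with one more multiply chain so that which lane a
     * byte landed in affects the result. */
    uint32_t sum = 0;
    for (int j = 0; j < N_LANES; j++)
        sum = sum * PRIME + lane[j];
    return sum;
}
```

Verification would then mean checking that single-bit flips, swapped bytes, and zeroed runs all change the result with high probability.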

I have also attached the test infrastructure I used so you can
replicate the results if you wish. Compile with "gcc -g -O2
[-ftree-vectorize -funroll-loops] crc.c 8x256_tables.c -lm -o crc"
and run with "./crc -t warm -d warm -i 1 -p 8192 -n 100000". If you
don't have an SSE4.1-capable CPU (x86 produced in the last 2 years),
the last test will crash, so you may want to comment it out.

[1] http://sourceforge.net/projects/slicing-by-8/
[2]
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf
[3] http://download.intel.com/embedded/processor/whitepaper/327889.pdf

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

