Re: Enabling Checksums - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: Enabling Checksums
Date
Msg-id CA+CSw_tpKCxj31nmm6xesyA78zaAYWsrJQRGVVE6imLzLy+ApA@mail.gmail.com
Whole thread Raw
In response to Re: Enabling Checksums  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Enabling Checksums  (Ants Aasma <ants@cybertec.at>)
Re: Enabling Checksums  (Andres Freund <andres@2ndquadrant.com>)
Re: Enabling Checksums  (Greg Smith <greg@2ndQuadrant.com>)
Re: Enabling Checksums  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Wed, Mar 6, 2013 at 1:34 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Fletcher's checksum is good in general, I was mainly worried about
> truncating the Fletcher-64 into two 8-bit values. I can't spot any obvious
> weakness in it, but if it's indeed faster and as good as a straightforward
> Fletcher-16, I wonder why that method is not more widely used.

As implented, the fletcher algorithm as implemented results in:

checksum low byte = (blkno + sum over i [0..N) (x_i)) % 255 + 1
checksum high byte = (blkno + sum over i in [0..N) ((N - i)*x_i)) % 255 + 1

Where N is the number of 4 bytes words in the page and x_i is the i-th
word. As modular arithmetic is a ring, it is easy to show that any
addition or subtraction of a multiple of 255 = 0xFF will result in no
change to the resulting value. The most obvious case here is that you
can swap any number of bytes from 0x00 to 0xFF or back without
affecting the hash.

> Another thought is that perhaps something like CRC32C would be faster to
> calculate on modern hardware, and could be safely truncated to 16-bits using
> the same technique you're using to truncate the Fletcher's Checksum. Greg's
> tests showed that the overhead of CRC calculation is significant in some
> workloads, so it would be good to spend some time to optimize that. It'd be
> difficult to change the algorithm in a future release without breaking
> on-disk compatibility, so let's make sure we pick the best one.

I took a look at how the fletcher-64 compiles. It's a very tight loop
of 1 mov, 3 adds and a cmp/jne. Guestimating the performance on a
modern CPU, if the buffer is still in L1, I would expect this to run
at about 2 bytes/cycle depending on actual scheduling efficiency. Peak
execution unit capacity would results in 4/3 cycles per 4 bytes or 3
bytes/cycle. Coincidentally 2 bytes/cycle would result in about 20%
overhead for ReadBuffer on my machine - close to the overall overhead
measured.

Best case using the CRC32 instruction would be 6.8 bytes/cycle [1].
But this got me thinking about how to do this faster. It seems to me
that the fastest approach would be to accumulate many checksums in
parallel and combine in the end to take advantage of vector
instructions. A quick look at vector instructions and their
throughputs and latencies shows that best bet would be to use the
common (hash = hash*prime + value) mechanism with 16bit values. For
processors made in the last 5 years, accumulating atleast 64 16bit
checksums in parallel would be required to achieve optimal throughput
(3-5 cycle latency for pmullw, 1 cycle for paddw with parallel issue
capability, total 6 cycles * 8 values per vector, rounding up to next
power of two). By unrolling the inner loop, this should be able to run
at a throughput of 1 cycle per 16byte vector on all recent x86's, the
necessary vector instructions are available on all x86-64 CPUs.

I was able to coax GCC to vectorize the code in the attached patch (on
top of checksums-20130312.patch.gz) by adding -ftree-vectorize and
-funroll-loops. But for some silly reason GCC insists on storing the
intermediate values on to stack on each iteration negating any
possible performance benefit. If anyone thinks this avenue is worth
further investigation and would like to do performance tests, I can
whip together a manual asm implementation.

I'm not really sure if parallel checksums would be worth doing or not.
On one hand, enabling data parallelism would make it more future
proof, on the other hand, the unvectorized variant is slower than
Fletcher-64.

On another note, I think I found a bug with the current latest patch.
  for (i = SizeOfPageHeaderData; i < BLCKSZ / sizeof(uint32); i++)

should probably be
   for (i = SizeOfPageHeaderData / sizeof(uint32); i < BLCKSZ /
sizeof(uint32); i++)


[1] http://www.drdobbs.com/parallel/fast-parallelized-crc-computation-using/229401411

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Re: Proposal for Allow postgresql.conf values to be changed via SQL [review] (Fix memory growth)
Next
From: Ants Aasma
Date:
Subject: Re: Enabling Checksums