Re: [REVIEW] Re: Compression of full-page-writes - Mailing list pgsql-hackers

From Arthur Silva
Subject Re: [REVIEW] Re: Compression of full-page-writes
Msg-id CAO_YK0W5Jp2m2jhn20oF80nvxHWfAnhdQ-1ZABrrG4MKZuAiQg@mail.gmail.com
In response to Re: [REVIEW] Re: Compression of full-page-writes  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [REVIEW] Re: Compression of full-page-writes
List pgsql-hackers

On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
>> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs@gmail.com> wrote:
>>> That's not entirely true. CRC-32C beats pretty much everything with the same
>>> length quality-wise and has both hardware implementations and highly
>>> optimized software versions.

>> For better or for worse CRC is biased by detecting all single bit
>> errors, the detection capability of larger errors is slightly
>> diminished. The quality of the other algorithms I mentioned is also
>> very good, while producing uniformly varying output.

> There's also much more literature about the various CRCs in comparison
> to some of these hash algorithms.

Indeed.  CRCs have well-understood properties for error detection.
Have any of these new algorithms been analyzed even a hundredth as
thoroughly?  No.  I'm unimpressed by evidence-free claims that
something else is "also very good".

Now, CRCs are designed for detecting the sorts of short burst errors
that are (or were, back in the day) common on phone lines.  You could
certainly make an argument that that's not the type of threat we face
for PG data.  However, I've not seen anyone actually make such an
argument, let alone demonstrate that some other algorithm would be better.
To start with, you'd need to explain precisely what other error pattern
is more important to defend against, and why.

                        regards, tom lane

MySQL went this way as well, changing the CRC polynomial in 5.6.

What we are looking for here is uniqueness, and thus better error detection. We don't need the avalanche effect, cryptographic security, or uniform bit distribution.
As far as I'm aware, CRC32C is unbeaten collision-wise and is time-proven.

I couldn't find tests with xxhash and crc32 on the same hardware, so I spent some time putting together a benchmark (see attachment; to run it, just start run.sh).

I included a crc32 implementation using SSE4.2 instructions (which work on pretty much any Intel processor built after 2008 and any AMD processor built after 2012),
a portable Slice-By-8 software implementation, and xxhash, since it's the fastest software 32-bit hash I know of.
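
For readers without the attachment, here is a minimal sketch of a portable CRC-32C in software. It is not the benchmark's code: it uses a plain byte-at-a-time lookup table rather than Slice-By-8, so it will be slower, and the names crc32c and crc32c_build_table are chosen here for illustration only. It assumes the usual CRC-32C parameters (reflected Castagnoli polynomial 0x82F63B78, initial value and final XOR of 0xFFFFFFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected (LSB-first) form of the Castagnoli polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

static uint32_t crc32c_table[256];
static int crc32c_table_ready = 0;

/* Fill the 256-entry table: entry i is the CRC state after shifting
 * the single byte i through eight bit-steps of the polynomial. */
static void crc32c_build_table(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t crc = i;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ CRC32C_POLY_REFLECTED
                             : (crc >> 1);
        crc32c_table[i] = crc;
    }
    crc32c_table_ready = 1;
}

/* Compute CRC-32C over len bytes, one table lookup per input byte.
 * The table is built lazily on first use (not thread-safe). */
uint32_t crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;          /* standard initial value */

    if (!crc32c_table_ready)
        crc32c_build_table();

    while (len--)
        crc = crc32c_table[(crc ^ *p++) & 0xFFu] ^ (crc >> 8);

    return crc ^ 0xFFFFFFFFu;            /* standard final XOR */
}
```

A quick sanity check is the standard check string: crc32c("123456789", 9) should return 0xE3069283. Slice-By-8 computes the same function but consumes eight bytes per iteration using eight such tables, and the SSE4.2 crc32 instruction implements the same polynomial in hardware.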

Here are the results of running the test program on my i5-4200M:

crc sb8: 90444623
elapsed: 0.513688s
speed: 1.485220 GB/s

crc hw: 90444623
elapsed: 0.048327s
speed: 15.786877 GB/s

xxhash: 7f4a8d5
elapsed: 0.182100s
speed: 4.189663 GB/s

The hardware version is insanely fast and works on the majority of Postgres setups, and the fallback software implementation is only 2.8x slower than the fastest 32-bit hash around.
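
The ratios behind that comparison can be rechecked from the throughput numbers posted above. The snippet below is only a small arithmetic sketch; the constant names and the speedup helper are made up for illustration, and the values are copied from the results in this message.

```c
/* Throughput figures copied from the results above, in GB/s. */
static const double SPEED_CRC_SB8 = 1.485220;   /* Slice-By-8 software CRC */
static const double SPEED_CRC_HW  = 15.786877;  /* SSE4.2 hardware CRC */
static const double SPEED_XXHASH  = 4.189663;   /* xxhash */

/* How many times faster `fast` is than `slow`. */
double speedup(double fast, double slow)
{
    return fast / slow;
}
```

With these inputs, speedup(SPEED_XXHASH, SPEED_CRC_SB8) comes out at roughly 2.8, speedup(SPEED_CRC_HW, SPEED_XXHASH) at roughly 3.8, and speedup(SPEED_CRC_HW, SPEED_CRC_SB8) at a little over 10.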

Hopefully it'll be useful in the discussion.
Attachment
