Re: Re: CRC - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Re: CRC
Date
Msg-id 7712.976476998@sss.pgh.pa.us
Whole thread Raw
In response to Re: Re: CRC  (ncm@zembu.com (Nathan Myers))
Responses Re: Re: CRC  (Bruce Guenter <bruceg@em.ca>)
List pgsql-hackers
ncm@zembu.com (Nathan Myers) writes:
> Minutiae aside, it's clear that the MD5 and CRC are "comparable",
> regardless of CPU.

We've established that the inner loops are pretty comparable.  I'm
still concerned about the startup/termination costs of MD5 on short
records.  The numbers Bruce and I were trading were mostly for records
long enough to make the startup costs negligible, but they're not
negligible for sub-100-byte records.

> For a 32-bit hash, the proven characteristics of CRCs are critical in 
> some applications.  With a good 64-bit hash, the probability of any 
> collision whether from a burst error or otherwise becomes much lower 
> than every other systematic source of error -- the details just don't
> matter any more.

That's a good point.  Of course the critical phrase there is *good*
hash, ie, one without any systematic weaknesses, but as long as we
don't use a "method chosen at random" you're right, it hardly matters.

However, this just begs the question: can't the same be said of a 32-bit
checksum?  My argument the other day essentially was that 32 bits is
plenty for what we need to do, and I have not heard a refutation.

One thing we should look at before going with a 64-bit method is the
extra storage space for the larger checksum.  We can clearly afford
an extra 32 bits for a checksum on an 8K disk page, but if Vadim is
envisioning checksumming each individual XLOG record then the extra
space is more annoying.

Also, there's the KISS issue.  When it takes less than a dozen lines
to do a CRC, versus pages to do MD5, you have to ask yourself what the
extra code space is buying you... also whether you want to get into
licensing issues by borrowing someone else's code.  The MD5 code that
Bruce was using is GPL, not BSD, and so couldn't go into the Postgres
core anyway.  

> MD4 would be a better choice than MD5, despite that a theoretical attack
> on MD4 has been described (albeit never executed).  We don't even care 
> about real attacks, never mind theoretical ones.  What matters is that 
> MD4 is entirely good enough, and faster to compute than MD5.

> I find these results very encouraging.  BSD-licensed MD4 code is readily
> available, e.g. from any of the BSDs themselves.

MD4 would be worth looking at, especially if it has less
startup/shutdown overhead than MD5.  I think a 64-bit true CRC might
also be worth looking at, just for completeness.  But I don't know
where to find code for one.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: SQL 'in' vs join.
Next
From: Tom Lane
Date:
Subject: Re: Re: CRC