Re: CRC was: Re: beta testing version - Mailing list pgsql-hackers

From Bruce Guenter
Subject Re: CRC was: Re: beta testing version
Date
Msg-id 20001208152825.B11989@em.ca
Whole thread Raw
In response to Re: CRC was: Re: beta testing version  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Fri, Dec 08, 2000 at 03:38:09PM -0500, Tom Lane wrote:
> Bruce Guenter <bruceg@em.ca> writes:
> > MD5 is a cryptographic hash, which means (AFAIK) that ideally it is
> > impossible to produce a collision using any other method than brute
> > force attempts.
> True but irrelevant.  What we need to worry about is the probability
> that a random error will be detected,

Which I indicated immediately after the sentence you quoted.  The
probability that a random error will be detected is the same as the
probability of a collision in the hash given two different inputs.  The
brute force note means that the probability of a collision is as good as
random.

> MD5 is designed for a purpose that really doesn't have much to do with
> error detection, when you think about it.  It says "you will have a hard
> time computing a different string that produces the same hash as some
> prespecified string".  This is not the same as promising
> better-than-random odds against a damaged copy of some string having the
> same hash as the original.

It does provide as-good-as-random odds against a damaged copy of some
string having the same hash as the original -- nobody has been able to
exhibit any collisions in MD5 (see http://cr.yp.to/papers/hash127.ps,
page 18 for notes on this).

> CRC, on the other hand, is specifically
> designed for error detection, and for localized errors (such as a
> corrupted byte or two) it does a provably better-than-random job.
> For nonlocalized errors you don't get a guarantee, but you do get
> same-as-random odds of detection (ie, 1 in 2^N for an N-bit CRC).

For the log, the CRC's primary function (as far as I understand it)
would be to guard against inconsistent transaction being treated as
consistent data.  Such inconsistent transactions would be partially
written, resulting in errors much larger than a small burst.

For guarding the actual record data, I agree with you 100% -- what we're
likely to see is a few localized bytes with flipped bits due to hardware
failure of one kind or another.  However, if the data is really
critical, an ECC may be more appropriate, but that would make the data
significantly larger (9/8 for the algorithms I've seen).

> I really doubt that MD5 can beat a CRC with the same number of output
> bits for the purpose of error detection;

Agreed.  However, MD5 provides four times as many bits as the standard
32-bit CRC.

(I think I initially suggested you could take an arbitrary 32 bits out
of MD5 to provide a check code "as good as CRC-32".  I now take that
back.  Due to the burst error nature of CRCs, nothing else could be as
good as it, unless the alternate algorithm also made some guarantees,
which MD5 definitely doesn't.)

> (Wild-pointer
> stomps on disk buffers are an example of the sort of thing that may
> look like a burst error.)

Actually, wild-pointer incidents involving disk buffers at the kernel
level, from my experience, are characterized by content from one file
appearing in another, which is distinctly different than a burst error,
and more like what would be seen if a log record were partially written.
--
Bruce Guenter <bruceg@em.ca>                       http://em.ca/~bruceg/

pgsql-hackers by date:

Previous
From: Bruce Guenter
Date:
Subject: Re: Re: CRC
Next
From: "Mikheev, Vadim"
Date:
Subject: RE: Hash index on macaddr -> crash