Re: better page-level checksums - Mailing list pgsql-hackers

From Robert Haas
Subject Re: better page-level checksums
Date
Msg-id CA+TgmoYKE-X=G=tC-yqn7e1=rpb4m94FRHq9Q3qtLRpJgh4NnQ@mail.gmail.com
In response to Re: better page-level checksums  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
List pgsql-hackers
On Wed, Jun 15, 2022 at 4:54 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> It's hard to get any definite information about what size of checksum is
> "good enough", since after all it depends on what kinds of errors you
> expect and what kinds of probabilities you want to accept.  But the best
> I could gather so far is that 16-bit CRC are good until about 16 kB
> block size.

Not really. There's a lot of misinformation on this topic floating
around on this mailing list, and some of that misinformation is my
fault. I keep learning more about it. However, I'm pretty confident
that, on the one hand, there's no hard limit on the size of the data
that can be effectively validated via a CRC, and, on the other hand,
CRC isn't a particularly great algorithm, although it does have
interesting advantages for certain purposes.

For example, according to
https://en.wikipedia.org/wiki/Mathematics_of_cyclic_redundancy_checks#Error_detection_strength
a CRC is guaranteed to detect all single-bit errors. This property is
easy to achieve: for example, a parity bit has this property.
According to the same source, a CRC is guaranteed to detect two-bit
errors only if the distance between them is less than some limit that
gets larger as the CRC gets wider. Imagine that you have a CRC-16 of a
message 64k+1 bits in length. Suppose that an error in the first bit
changes the result from v to v'. Can we, by flipping a second bit
later in the message, change the final result from v' back to v? The
calculation only has 64k possible answers, and we have 64k bits we can
flip to try to get the desired answer. If every one of those bit flips
produces a different answer, then one of those answers must be v --
which means detection of two-bit errors is not guaranteed. If at
least two of those bit flips produce the same answer, then consider
the messages produced by those two different bit flips. They differ
from each other by exactly two bits and yet produced the same CRC, so
detection of two-bit errors is still not guaranteed.
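
To make that pigeonhole argument concrete, here's a quick throwaway
sketch -- not anything from the tree, just plain C using the common
CRC-16/CCITT polynomial 0x1021 and an arbitrary 8200-byte (65600-bit)
message as an example. It flips the first bit of the message and then
searches for a second bit flip that puts the CRC back where it
started, i.e. an undetected two-bit error:

/*
 * Brute-force illustration of the pigeonhole argument.  Throwaway
 * sketch, not PostgreSQL code; the CRC-16/CCITT polynomial (0x1021)
 * is used purely as an example.
 */
#include <stdint.h>
#include <stdio.h>

#define NBYTES  8200                    /* a bit more than 2^16 bits */
#define NBITS   (NBYTES * 8)

static uint16_t
crc16(const uint8_t *buf, int len)
{
    uint16_t    crc = 0xFFFF;

    for (int i = 0; i < len; i++)
    {
        crc ^= (uint16_t) buf[i] << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (crc << 1) ^ 0x1021 : (crc << 1);
    }
    return crc;
}

static void
flip_bit(uint8_t *buf, int bit)
{
    buf[bit / 8] ^= (uint8_t) (1 << (bit % 8));
}

int
main(void)
{
    static uint8_t msg[NBYTES];         /* all zeroes; contents don't matter */
    uint16_t    orig = crc16(msg, NBYTES);

    flip_bit(msg, 0);                   /* the first bit error */

    for (int i = 1; i < NBITS; i++)
    {
        flip_bit(msg, i);               /* candidate second bit error */
        if (crc16(msg, NBYTES) == orig)
        {
            printf("flipping bits 0 and %d leaves the CRC at 0x%04X\n",
                   i, orig);
            return 0;
        }
        flip_bit(msg, i);               /* put it back and keep looking */
    }
    printf("no undetected two-bit error found\n");
    return 0;
}

(It recomputes the whole CRC for every candidate position, so it takes
a few seconds; exploiting the linearity of CRCs would make it instant,
but speed isn't the point here.) If I've got the math right, with a
message shorter than this polynomial's period -- 32767 bits -- the
search should come up empty, which is the guarantee the Wikipedia page
is talking about.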

On the other hand, it's still highly likely. If a message of length
2^16+1 bits contains two bit errors, one of which is in the first bit,
the chances that the other one is in exactly the right place to cancel
out the first error are about 2^-16. That's not zero, but it's just as
good as our chances of detecting a replacement of the entire message
with some other message chosen completely at random. I think the
reason why discussion of CRCs tends to focus on the types of bit
errors that it can detect is that the algorithm was designed when
people were doing stuff like sending files over a modem. It's easy to
understand how individual bits could get garbled without anybody
noticing, while large-scale corruption would be less likely, but the
risks are not necessarily the same for a PostgreSQL data file. Lower
levels of the stack are probably already using checksums to try to
detect errors at the level of the physical medium. I'm sure some stuff
slips through the cracks, but in practice we also see failure modes
where the filesystem substitutes 8kB of data from an unrelated file,
or where a torn write in combination with unreliable fsync results in
half of the page contents being from an older version of the page.
These kinds of large-scale replacements aren't what CRCs are designed
to detect, and the chances that we will detect them are roughly
1 - 2^-n for an n-bit checksum, whether we use a CRC or something else.
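
Just to put rough numbers on that (simple arithmetic on my part,
nothing measured): a well-behaved 16-bit checksum would miss such a
wholesale replacement about once in 65,536 tries, a 32-bit one about
once in 4.3 billion, and a 64-bit one about once in 1.8 * 10^19 -- and
none of those figures depend on the block size.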

Of course, that partly depends on the algorithm quality. If an
algorithm is more likely to generate some results than others, then
its actual error detection rate will not be as good as the number of
output bits would suggest. If the result doesn't depend equally on
every input bit, then the actual error detection rate will not be as
good as the number of output bits would suggest. And CRC-32 is
apparently not great by modern standards:

https://github.com/rurban/smhasher

Compare the results for CRC-32 with, say, Spooky32. Apparently the
latter is faster yet produces better output. So maybe we would've been
better off if we'd made Spooky32 the default algorithm for backup
manifest checksums rather than CRC-32.
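
As a crude illustration of the bias point (hypothetical numbers, not a
claim about any particular algorithm): if a 16-bit checksum somehow
produced only 256 of its 65,536 possible values, a random replacement
would collide with the stored value about once in 256 tries rather
than once in 65,536 -- you'd be paying for 16 bits of protection and
getting the equivalent of 8.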

> The benefits of doubling the checksum size are exponential rather than
> linear, so there is no significant benefit of using a 64-bit checksum
> over a 32-bit one, for supported block sizes (current max is 32 kB).

I'm still unconvinced that the block size is very relevant here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


