Re: better page-level checksums - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: better page-level checksums
Msg-id: CA+TgmoYKE-X=G=tC-yqn7e1=rpb4m94FRHq9Q3qtLRpJgh4NnQ@mail.gmail.com
In response to: Re: better page-level checksums (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
List: pgsql-hackers
On Wed, Jun 15, 2022 at 4:54 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> It's hard to get any definite information about what size of checksum is
> "good enough", since after all it depends on what kinds of errors you
> expect and what kinds of probabilities you want to accept. But the best
> I could gather so far is that 16-bit CRC are good until about 16 kB
> block size.

Not really. There's a lot of misinformation on this topic floating around on this mailing list, and some of that misinformation is my fault. I keep learning more about this topic. However, I'm pretty confident that, on the one hand, there's no hard limit on the size of the data that can be effectively validated via a CRC, and on the other hand, CRC isn't a particularly great algorithm, although it does have certain interesting advantages for certain purposes.

For example, according to https://en.wikipedia.org/wiki/Mathematics_of_cyclic_redundancy_checks#Error_detection_strength a CRC is guaranteed to detect all single-bit errors. This property is easy to achieve: for example, a parity bit has this property. According to the same source, a CRC is guaranteed to detect two-bit errors only if the distance between them is less than some limit that gets larger as the CRC gets wider.

Imagine that you have a CRC-16 of a message 64k+1 bits in length. Suppose that an error in the first bit changes the result from v to v'. Can we, by flipping a second bit later in the message, change the final result from v' back to v? The calculation only has 64k possible answers, and we have 64k bits we can flip to try to get the desired answer. If every one of those bit flips produces a different answer, then one of those answers must be v -- which means detection of two-bit errors is not guaranteed. If at least two of those bit flips produce the same answer, then consider the messages produced by those two different bit flips. They differ from each other by exactly two bits and yet produce the same CRC, so detection of two-bit errors is still not guaranteed. (A scaled-down demonstration of this counting argument is included below.)

On the other hand, it's still highly likely. If a message of length 2^16+1 bits contains two bit errors, one of which is in the first bit, the chances that the other one is in exactly the right place to cancel out the first error are about 2^-16. That's not zero, but it's just as good as our chances of detecting a replacement of the entire message with some other message chosen completely at random.

I think the reason why discussion of CRCs tends to focus on the types of bit errors that it can detect is that the algorithm was designed when people were doing stuff like sending files over a modem. It's easy to understand how individual bits could get garbled without anybody noticing, while large-scale corruption would be less likely, but the risks are not necessarily the same for a PostgreSQL data file. Lower levels of the stack are probably already using checksums to try to detect errors at the level of the physical medium. I'm sure some stuff slips through the cracks, but in practice we also see failure modes where the filesystem substitutes 8kB of data from an unrelated file, or where a torn write in combination with an unreliable fsync results in half of the page contents being from an older version of the page. These kinds of large-scale replacements aren't what CRCs are designed to detect, and the chances that we will detect them are roughly 1 - 2^-bits, whether we use a CRC or something else. Of course, that partly depends on the algorithm quality.
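Here's the promised demonstration of the counting argument -- a throwaway Python sketch of my own, not anything proposed for the tree. To keep it instant I've scaled it down from CRC-16 over 2^16+1 bits to CRC-8 over 2^8+1 bits, and the helper names (crc8, flip_bit) are just made up for the example; the same reasoning applies at full size. It flips the first bit and then looks for a second flip that either restores the original checksum or collides with another single-flip result; by the pigeonhole argument one of the two must happen.

import os

def crc8(data, poly=0x07):
    # Plain bitwise CRC-8 with zero init and no final xor; any CRC would do.
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def flip_bit(data, bitpos):
    # Return a copy of data with the given bit (MSB-first numbering) inverted.
    out = bytearray(data)
    out[bitpos // 8] ^= 0x80 >> (bitpos % 8)
    return bytes(out)

nbits = 2**8 + 1                      # 257 bits, stored in 33 bytes
msg = os.urandom((nbits + 7) // 8)
orig = crc8(msg)
damaged = flip_bit(msg, 0)            # the first bit error: CRC goes from v to v'

seen = {}
for pos in range(1, nbits):           # candidate positions for the second error
    c = crc8(flip_bit(damaged, pos))
    if c == orig:
        print("flipping bits 0 and %d leaves the CRC unchanged" % pos)
        break
    if c in seen:
        # Two messages that differ in exactly two bits share a checksum.
        print("flipping bit %d or bit %d of the damaged message gives the same "
              "CRC: a two-bit difference the CRC cannot see" % (seen[c], pos))
        break
    seen[c] = pos

Since each single-bit flip is guaranteed to change the CRC, all 256 candidate results avoid v', so either one of them lands back on v or two of them collide; the loop always prints an undetected two-bit difference.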
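And on the whole-page-replacement case, here's a second toy sketch, again purely illustrative: the 64-byte "pages", the trial count, and the helper names are all made up for the example, and nothing here is PostgreSQL code. It shows a reasonably well-mixed 16-bit checksum (a truncated zlib.crc32) missing a random replacement at about the expected 2^-16 rate, while a checksum whose outputs are not uniformly distributed -- a plain byte sum -- does noticeably worse, which is the point about algorithm quality I'm getting to next.

import os, zlib

def crc16ish(page):
    return zlib.crc32(page) & 0xFFFF   # well-mixed 16 bits

def sum16(page):
    # Deliberately poor: the sum of 64 bytes can't even reach most of the
    # 65536 possible values, and the values it can reach are far from uniform.
    return sum(page) & 0xFFFF

TRIALS, PAGE_SIZE = 200_000, 64
orig = os.urandom(PAGE_SIZE)
orig_crc, orig_sum = crc16ish(orig), sum16(orig)

missed_crc = missed_sum = 0
for _ in range(TRIALS):
    replacement = os.urandom(PAGE_SIZE)     # the page swapped wholesale
    missed_crc += crc16ish(replacement) == orig_crc
    missed_sum += sum16(replacement) == orig_sum

print("truncated CRC-32 missed %d of %d replacements (~%d expected at 2^-16)"
      % (missed_crc, TRIALS, round(TRIALS / 65536)))
print("byte-sum checksum missed %d of %d replacements" % (missed_sum, TRIALS))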
If an algorithm is more likely to generate some results than others, then its actual error detection rate will not be as good as the number of output bits would suggest. If the result doesn't depend equally on every input bit, then the actual error detection rate will not be as good as the number of output bits would suggest. And CRC-32 is apparently not great by modern standards: https://github.com/rurban/smhasher

Compare the results for CRC-32 with, say, Spooky32. Apparently the latter is faster yet produces better output. So maybe we would've been better off if we'd made Spooky32 the default algorithm for backup manifest checksums rather than CRC-32.

> The benefits of doubling the checksum size are exponential rather than
> linear, so there is no significant benefit of using a 64-bit checksum
> over a 32-bit one, for supported block sizes (current max is 32 kB).

I'm still unconvinced that the block size is very relevant here.

--
Robert Haas
EDB: http://www.enterprisedb.com