On Sun, Jan 22, 2017 at 7:37 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Exactly, and that awareness will allow a user to prevent further data
> loss or corruption. Slow corruption over time is a very much known and
> accepted real-world case that people do experience, as well as bit
> flipping enough for someone to write a not-that-old blog post about
> them:
>
> https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1

I have no doubt that low-frequency cosmic-ray bit flipping in main
memory is a real phenomenon, having worked at a company that runs
enough computers to see ECC messages in kernel logs on a regular
basis. But our checksums can't actually help with that, can they? We
verify checksums on the way into shared buffers, and compute new
checksums on the way back to disk, so any bit-flipping that happens in
between those two times -- while your data is a sitting duck in shared
buffers -- would not be detected by this scheme. That's ECC's job.
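
To make that window concrete, here is a toy model of the read/verify and
write/recompute cycle. It is only a sketch under loud assumptions:
zlib.crc32 stands in for the real page checksum, and write_page,
read_page, flip_bit and demo are hypothetical names invented for this
illustration, not anything in the PostgreSQL source.

```python
import zlib
from typing import Tuple


def checksum(payload: bytes) -> int:
    # Stand-in only (assumption): CRC-32 is not PostgreSQL's page checksum.
    return zlib.crc32(payload)


def write_page(payload: bytes) -> Tuple[bytes, int]:
    # On the way back to disk: compute a fresh checksum over whatever
    # bytes are currently in the buffer, right or wrong.
    return payload, checksum(payload)


def read_page(stored: Tuple[bytes, int]) -> bytes:
    # On the way into shared buffers: verify the stored checksum.
    payload, stored_sum = stored
    if checksum(payload) != stored_sum:
        raise ValueError("checksum mismatch")
    return payload


def flip_bit(data: bytes, byte_index: int = 0, bit: int = 0) -> bytes:
    # Simulate a single flipped bit.
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


def demo() -> Tuple[bool, bool]:
    original = b"important row data"

    # Case 1: the bit flips while the page sits in the buffer pool.
    buf = read_page(write_page(original))   # verified on the way in
    buf = flip_bit(buf)                     # corruption in memory
    rewritten = write_page(buf)             # new checksum covers bad data
    memory_flip_detected = False
    try:
        read_page(rewritten)
    except ValueError:
        memory_flip_detected = True

    # Case 2: the bit flips somewhere in the disk subsystem.
    payload, stored_sum = write_page(original)
    damaged = (flip_bit(payload), stored_sum)
    disk_flip_detected = False
    try:
        read_page(damaged)
    except ValueError:
        disk_flip_detected = True

    return memory_flip_detected, disk_flip_detected
```

In this model demo() reports the in-memory flip as undetected and the
on-disk flip as detected, which is the asymmetry described above.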

So the risk being defended against is corruption while in the disk
subsystem, whatever that might consist of (and certainly that includes
more buffers in strange places that themselves are susceptible to
memory faults etc., and hopefully they have their own error detection
and correction). Certainly the ZFS community thinks that pile of
turtles can't be trusted and that extra checks are worthwhile, and you
can find anecdotal reports and studies about filesystem corruption
being detected, for example in the links from
https://en.wikipedia.org/wiki/ZFS#Data_integrity .

So +1 for enabling it by default. I always turn that on.

--
Thomas Munro
http://www.enterprisedb.com