Re: [HACKERS] Checksums by default? - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: [HACKERS] Checksums by default?
Msg-id 20170121223510.GA18360@tamriel.snowman.net
In response to Re: [HACKERS] Checksums by default?  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
Thomas,

* Thomas Munro (thomas.munro@enterprisedb.com) wrote:
> On Sun, Jan 22, 2017 at 7:37 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > Exactly, and that awareness will allow a user to prevent further data
> > loss or corruption.  Slow corruption over time is a very much known and
> > accepted real-world case that people do experience, as well as bit
> > flipping enough for someone to write a not-that-old blog post about
> > them:
> >
> > https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1
>
> I have no doubt that low frequency cosmic ray bit flipping in main
> memory is a real phenomenon, having worked at a company that runs
> enough computers to see ECC messages in kernel logs on a regular
> basis.  But our checksums can't actually help with that, can they?  We
> verify checksums on the way into shared buffers, and compute new
> checksums on the way back to disk, so any bit-flipping that happens in
> between those two times -- while your data is a sitting duck in shared
> buffers -- would not be detected by this scheme.  That's ECC's job.

Ideally, everyone's gonna run with ECC and have that handle it.  That
said, there's still the possibility that the bit is flipped after we've
calculated the checksum but before it's hit disk, or before it's
actually been written out to the storage system underneath.  You're
correct that if the bit is flipped before we go to write the buffer out,
we won't detect that case, but there's not much help for that without
compromising performance more than even I'd be ok with.

> So the risk being defended against is corruption while in the disk
> subsystem, whatever that might consist of (and certainly that includes
> more buffers in strange places that themselves are susceptible to
> memory faults etc, and hopefully they have their own error detection
> and correction).  Certainly the ZFS community thinks that pile of
> turtles can't be trusted and that extra checks are worthwhile, and you
> can find anecdotal reports and studies about filesystem corruption
> being detected, for example in the links from
> https://en.wikipedia.org/wiki/ZFS#Data_integrity .

Agreed.  wrt your point above, if you consider "everything that happens
after we have passed over a given bit to incorporate its value into our
CRC" to be "disk subsystem" then I think we're in agreement on this
point and that there's a bunch of stuff that happens then which could be
caught by checking our CRC.

I've seen some other funny things out there in the wild too though, like
a page suddenly being half-zero'd because the virtualization system ran
out of memory and barfed.  I realize that our CRC might not catch such a
case if it's in our shared buffers before we write the page out, but if
it happens in the kernel's write buffer after we pushed it from shared
buffers then our CRC would detect it.
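
To make those windows concrete, here's a rough Python sketch, not
PostgreSQL code.  It uses zlib.crc32 purely as a stand-in for the page
checksum, and the helper names (write_out, verify_on_read, flip_bit,
zero_second_half) are made up for illustration.  The only thing it's
meant to show is the ordering: damage that lands before the checksum is
computed gets checksummed along with the page and goes unnoticed, while
damage that lands afterwards is caught on the next read.

```python
import zlib

PAGE_SIZE = 8192  # default PostgreSQL block size


def checksum(page: bytes) -> int:
    # Stand-in only: zlib.crc32 is an assumption for this sketch,
    # not the algorithm PostgreSQL actually uses for page checksums.
    return zlib.crc32(page)


def flip_bit(page: bytes, bit: int) -> bytes:
    # Simulate a single bit flip at the given bit offset.
    buf = bytearray(page)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


def zero_second_half(page: bytes) -> bytes:
    # Simulate a page that suddenly ends up half-zeroed.
    half = len(page) // 2
    return page[:half] + bytes(len(page) - half)


def write_out(page, corrupt_before=None, corrupt_after=None):
    # corrupt_before: damage while the page sits in shared buffers,
    #                 i.e. before the checksum is computed.
    # corrupt_after:  damage after the checksum is computed (kernel
    #                 write buffer, storage system, and so on).
    if corrupt_before is not None:
        page = corrupt_before(page)
    stored_checksum = checksum(page)
    if corrupt_after is not None:
        page = corrupt_after(page)
    return page, stored_checksum


def verify_on_read(stored_page: bytes, stored_checksum: int) -> bool:
    # True means the checksum matches, so no corruption is reported.
    return checksum(stored_page) == stored_checksum


original = bytes((i * 31 + 7) % 256 for i in range(PAGE_SIZE))

# Case 1: bit flipped in shared buffers, before the checksum.
page1, sum1 = write_out(original,
                        corrupt_before=lambda p: flip_bit(p, 12345))
undetected_in_buffers = verify_on_read(page1, sum1) and page1 != original

# Case 2: bit flipped after the checksum was computed.
page2, sum2 = write_out(original,
                        corrupt_after=lambda p: flip_bit(p, 12345))
detected_after_checksum = not verify_on_read(page2, sum2)

# Case 3: page half-zeroed after it left shared buffers.
page3, sum3 = write_out(original, corrupt_after=zero_second_half)
detected_half_zeroed = not verify_on_read(page3, sum3)

print("flip before checksum goes unnoticed:", undetected_in_buffers)
print("flip after checksum is detected:   ", detected_after_checksum)
print("half-zeroed page is detected:      ", detected_half_zeroed)
```

Case 1 is the "sitting duck in shared buffers" situation that only ECC
can help with; cases 2 and 3 are the ones the checksum is there for.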

Which actually brings up another point when it comes to whether CRCs
save you from data loss: they certainly do if you catch the corruption
before you have expired the WAL and the WAL data is clean.

> So +1 for enabling it by default.  I always turn that on.

Ditto.

Thanks!

Stephen
