Re: Enabling Checksums - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: Enabling Checksums
Date
Msg-id 514D2D67.2070404@nasby.net
In response to Re: Enabling Checksums  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
I realize Simon relented on this, but FWIW...

On 3/16/13 4:02 PM, Simon Riggs wrote:
> Most other data we store doesn't consist of
> large runs of 0x00 or 0xFF as data. Most data is more complex than
> that, so any runs of 0s or 1s written to the block will be detected.
...

It's not that uncommon for folks to have tables that have a bunch of int[2,4,8]s all in a row, and I'd bet it's not
uncommon for a lot of those fields to be zero.
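
To make that concrete, here is a rough sketch in Python of how quickly zero-valued integer columns turn into runs of 0x00 bytes. The row layout (one int2, one int4, one int8, packed little-endian with no padding) and the helper name longest_zero_run are assumptions for illustration only; this is not PostgreSQL's actual on-disk tuple format, which also has headers and alignment padding.

```python
import struct

def longest_zero_run(buf: bytes) -> int:
    """Return the length of the longest run of consecutive 0x00 bytes."""
    best = cur = 0
    for b in buf:
        cur = cur + 1 if b == 0 else 0
        best = max(best, cur)
    return best

# Hypothetical row of (int2, int4, int8), little-endian, no padding.
# "<hiq" packs to 2 + 4 + 8 = 14 bytes.
row_all_zero = struct.pack("<hiq", 0, 0, 0)
row_small = struct.pack("<hiq", 1, 0, 7)

print(longest_zero_run(row_all_zero))  # the whole row is one zero run
print(longest_zero_run(row_small))     # high-order bytes of small values are zero too
```

Even the row with non-zero values carries a fair stretch of zero bytes, because small integers leave their high-order bytes at 0x00.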
 

> Checksums are for detecting problems. What kind of problems? Sporadic
> changes of bits? Or repeated errors. If we were trying to trap
> isolated bit changes then CRC-32 would be appropriate. But I'm
> assuming that whatever causes the problem is going to recur,

That's opposite to my experience. When we've had corruption events we will normally have one to several blocks with
problems show up essentially all at once. Of course we can't prove that all the corruption happened at exactly the same
time, but I believe it's a strong possibility. If it wasn't exactly the same time it was certainly over a span of
minutes to hours... *but* we've never seen new corruption occur after we start an investigation (we frequently wait
several hours for the next time we can take an outage without incurring a huge loss in revenue). That we would run for a
number of hours with no additional corruption leads me to believe that whatever caused the corruption was essentially a
"one-time" [1] event.
 

[1] One-time except for the fact that there were several periods where we would have corruption occur in 12 or 6 month
intervals.


