Re: Enabling Checksums - Mailing list pgsql-hackers

From: Simon Riggs
Subject: Re: Enabling Checksums
Msg-id: CA+U5nMLVN_M9ReFZPM1O9jZ6QX=f5h4-jG14MHX3ax5ZWiB=9w@mail.gmail.com
In response to: Re: Enabling Checksums (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: Enabling Checksums (Tom Lane <tgl@sss.pgh.pa.us>)
           Re: Enabling Checksums (Jim Nasby <jim@nasby.net>)
List: pgsql-hackers
On 15 March 2013 13:08, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-03-15 14:32:57 +0200, Ants Aasma wrote:
>> On Wed, Mar 6, 2013 at 1:34 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>> > Fletcher's checksum is good in general, I was mainly worried about
>> > truncating the Fletcher-64 into two 8-bit values. I can't spot any obvious
>> > weakness in it, but if it's indeed faster and as good as a straightforward
>> > Fletcher-16, I wonder why that method is not more widely used.
>>
>> As implemented, the Fletcher algorithm results in:
>>
>> checksum low byte = (blkno + sum over i in [0..N) (x_i)) % 255 + 1
>> checksum high byte = (blkno + sum over i in [0..N) ((N - i)*x_i)) % 255 + 1
>>
>> Where N is the number of 4 bytes words in the page and x_i is the i-th
>> word. As modular arithmetic is a ring, it is easy to show that any
>> addition or subtraction of a multiple of 255 = 0xFF will result in no
>> change to the resulting value. The most obvious case here is that you
>> can swap any number of bytes from 0x00 to 0xFF or back without
>> affecting the hash.
>
> I commented on this before; I personally think this property makes Fletcher a
> not-so-good fit for this. It's not uncommon for parts of a block to be all-zero,
> and many disk corruptions actually change whole runs of bytes.

I think you're right to pick up on this point, and Ants has done a
great job of explaining the issue more clearly.
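
To make the property concrete, here is a rough, self-contained sketch of
the two bytes as Ants describes them (my own illustration with made-up
names, not the code in the patch):

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ  8192            /* PostgreSQL's default page size */
#define NWORDS  (BLCKSZ / 4)    /* N: number of 4-byte words per page */

/*
 * Each checksum byte is a sum over the page's 4-byte words, reduced
 * mod 255 and offset by 1 so that neither byte can be zero.
 */
static void
checksum_sketch(const uint32_t *words, uint32_t blkno,
                uint8_t *lo, uint8_t *hi)
{
    uint64_t    sum_lo = blkno;
    uint64_t    sum_hi = blkno;
    uint32_t    i;

    for (i = 0; i < NWORDS; i++)
    {
        sum_lo = (sum_lo + words[i]) % 255;
        sum_hi = (sum_hi + (uint64_t) (NWORDS - i) * words[i]) % 255;
    }
    *lo = (uint8_t) (sum_lo + 1);
    *hi = (uint8_t) (sum_hi + 1);
}

int
main(void)
{
    static uint32_t page[NWORDS];       /* zero-filled "page" */
    uint8_t     lo, hi, lo2, hi2;

    page[0] = 0xDEADBEEF;               /* pretend header word */
    checksum_sketch(page, 42, &lo, &hi);

    /* Flip one byte in the zeroed area from 0x00 to 0xFF ... */
    ((uint8_t *) page)[4096] = 0xFF;
    checksum_sketch(page, 42, &lo2, &hi2);

    /* ... and the checksum is unchanged, exactly as described above. */
    printf("before: %02X%02X  after: %02X%02X\n",
           (unsigned) hi, (unsigned) lo, (unsigned) hi2, (unsigned) lo2);
    return 0;
}

Running that prints the same value before and after the 0x00 -> 0xFF
flip, which is the collision Andres is worried about.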

My perspective, after some thought, is that this doesn't matter to the
overall effectiveness of this feature.

PG blocks do have large runs of 0x00 in them, though those are in the
hole in the centre of the block. If we don't detect problems there,
it's not such a big deal. Most other data we store doesn't consist of
large runs of 0x00 or 0xFF; most data is more complex than that, so
any runs of 0s or 1s written over it will be detected. So what we need
to look at is how this weakness affects the quality of our detection.
I would guess that our detection might only be 99% effective rather
than 100% effective. I'm not sure the issue is even that bad, but
let's look at what would happen if it were.

Checksums are for detecting problems. What kind of problems? Sporadic
changes of bits, or repeated errors? If we were trying to trap
isolated bit changes then CRC-32 would be appropriate. But I'm
assuming that whatever causes the problem is going to recur, so what
we want to do is detect hardware that is starting to go bad and needs
to be replaced. On that view, errors show a repetitive pattern,
increasing in frequency and coverage over time; an "issue" is not an
isolated incident, it's the beginning of a series of related problems.
This is much the same as the idea that for every mouse you see in your
house there are another 10 you don't, and if you ignore the sighting
of a mouse, the problem will get worse, often quickly. What we want to
detect is infestations/mouse colonies, rather than isolated,
non-repeating visitors.

Running checksums on the whole block gives us roughly a x1000 better
chance of detecting a run of issues than we have with just header
checks. Whether the check itself is 99% or 100% effective doesn't do
much to alter the overall *gain* in detection rate we get from using
checksums, so I'd say it is less important that the check itself be
watertight. In fact, no checksum is watertight; it is a technique that
trades performance for detection quality. Even a detector that spotted
only 90% of real errors would still be a massive gain in overall
detection, because we are applying the check across the whole block.
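
As a rough back-of-the-envelope illustration (my numbers, picked purely
for the sake of argument, not measured): if a corruption event lands
uniformly at random in an 8192-byte page and the existing header sanity
checks effectively cover only a few dozen bytes, a header-only check
notices something like 24/8192 ~= 0.3% of such events, while a
whole-page checksum that catches even 99% of the corruptions it sees
notices ~99% of them. Next to that difference in coverage, the last
percentage point of checksum quality is noise.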

What we need is a cheap way of detecting problems as early as
possible. Checksums don't prevent disk corruption; they just alert us
to its presence, allowing us to avoid data loss by reverting to
backups. If we don't detect things early enough, we find that
reverting to a backup doesn't work because the backed-up blocks are
corrupt too. Fletcher-16 seems to be the best combination of speed
vs. quality.

What I think we could do here is allow people to set their checksum
algorithm with a plugin. But if we do that, we open up the possibility
of user error: people changing checksum algorithms without realising
that doing so won't change values already calculated. That would be a
bad usability problem in itself and is almost certain to bite, since
user error is a larger source of real-world problems than hardware
error. So I'd rather not do that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


