Re: Enabling Checksums - Mailing list pgsql-hackers

From Daniel Farina
Subject Re: Enabling Checksums
Date
Msg-id CAAZKuFaw=+bmJK2+Rf46b1aaHd=SbOAjeHAD_WEG6Nj5DrPfyw@mail.gmail.com
In response to Re: Enabling Checksums  (Greg Smith <greg@2ndQuadrant.com>)
List pgsql-hackers
On Mon, Mar 18, 2013 at 7:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I wasn't trying to flog EBS as any more or less reliable than other types of
> storage.  What I was trying to emphasize, similarly to your "quite a
> stretch" comment, was the uncertainty involved when such deployments fail.
> Failures happen due to many causes outside of just EBS itself.  But people
> are so far removed from the physical objects that fail, it's harder now to
> point blame the right way when things fail.

I didn't mean to imply you personally were going out of your way to
flog EBS, but there is a sufficient vacuum in the narrative that
someone could reasonably interpret it that way, so I want to set it
straight.  The problem is the quantity of databases per human.  The
Pythons said it best: 'A simple question of weight ratios.'

> A quick example will demonstrate what I mean.  Let's say my server at home
> dies.  There's some terrible log messages, it crashes, and when it comes
> back up it's broken.  Troubleshooting and possibly replacement parts follow.
> I will normally expect an eventual resolution that includes data like "the
> drive showed X SMART errors" or "I swapped the memory with a similar system
> and the problem followed the RAM".  I'll learn something about what failed
> that I might use as feedback to adjust my practices.  But an EC2+EBS failure
> doesn't let you get to the root cause effectively most of the time, and that
> makes people nervous.

Yes, the layering makes it tougher to do vertical treatment of obscure
issues.  Redundancy has often been the preferred solution here: bugs
come and go all the time, and everyone at each level tries to fix what
they can without much coordination from the layer above or below.
There are hopefully benefits in throughput of progress at each level
from this abstraction, but predicting when any one particular issue
will be understood top to bottom is even harder than it already was.

Also, I think the line of reasoning presented is biased towards a
certain class of database: there are many, many databases with minimal
funding and oversight being run in the traditional way, and the odds
they'll get a vigorous root cause analysis in the event of an obscure
issue are already close to nil.  Although there are other
considerations at play (like not just leaving those users with nothing
more than a "bad block" message), checksums open some avenues to
gradually benefit those use cases, too.

> I can already see "how do checksums alone help narrow the blame?" as the
> next question.  I'll post something summarizing how I use them for that
> tomorrow, just out of juice for that tonight.

Not from me.  It seems pretty intuitive from here how
database-maintained checksums assist in partitioning the problem.

--
fdr


