Re: Enabling Checksums - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: Enabling Checksums
Date
Msg-id CAFj8pRAcXKLXqsQvkZw8FFLNd2SSX0axgRGVmd1pczPO1ee2FQ@mail.gmail.com
In response to Re: Enabling Checksums  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
2013/3/8 Bruce Momjian <bruce@momjian.us>:
> On Mon, Mar  4, 2013 at 05:04:27PM -0800, Daniel Farina wrote:
>> Putting aside the not-so-rosy predictions seen elsewhere in this
>> thread about the availability of a high performance, reliable
>> checksumming file system available on common platforms, I'd like to
>> express what benefit this feature will have to me:
>>
>> Corruption has easily occupied more than one person-month of time last
>> year for us.  This year to date I've burned two weeks, although
>> admittedly this was probably the result of statistical clustering.
>> Other colleagues of mine have probably put in a week or two in
>> aggregate in this year to date.  The ability to quickly, accurately,
>> and maybe at some later date proactively find good backups to run
>> WAL recovery from is one of the biggest strides we can make in the
>> operation of Postgres.  The especially ugly cases are where the page
>> header is not corrupt, so full page images can carry along malformed
>> tuples...basically, when the corruption works its way into the WAL,
>> we're in much worse shape.  Checksums would hopefully prevent this
>> case, converting them into corrupt pages that will not be modified.
>>
>> It would be better yet if I could write tools to find the last-good
>> version of pages, and so I think tight integration with Postgres will
>> see a lot of benefits that would be quite difficult and non-portable
>> when relying on file system checksumming.
>
> I see Heroku has corruption experience, and I know Jim Nasby has
> struggled with corruption in the past.
>
> I also see the checksum patch is taking a beating.  I wanted to step
> back and ask what percentage of known corruption cases will this
> checksum patch detect?  What percentage of these corruptions would
> filesystem checksums have detected?
>
> Also, don't all modern storage drives have built-in checksums, and
> report problems to the system administrator?  Does smartctl help report
> storage corruption?
>
> Let me take a guess at answering this --- we have several layers in a
> database server:
>
>         1 storage
>         2 storage controller
>         3 file system
>         4 RAM
>         5 CPU
>
> My guess is that storage checksums only cover layer 1, while our patch
> covers layers 1-3, and probably not 4-5 because we only compute the
> checksum on write.
>
> If that is correct, the open question is what percentage of corruption
> happens in layers 1-3?
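
To make the layer argument above concrete, here is a minimal sketch of why a checksum computed only on write can catch damage that happens after the write (layers 1-3) but not damage that is already in memory before the checksum is taken (layers 4-5). Everything in it is an assumption for illustration: CRC-32 from Python's zlib stands in for whatever algorithm the patch uses, and the page layout, the 8192-byte size, and the helper names write_page, verify_page and flip_bit are hypothetical, not PostgreSQL code.

```python
import zlib

PAGE_SIZE = 8192      # assumed page size, for illustration only
CHECKSUM_BYTES = 4    # CRC-32 is a stand-in, not the patch's algorithm


def write_page(payload: bytes) -> bytes:
    """Prepend a CRC-32 of the payload, computed at write time."""
    checksum = zlib.crc32(payload)
    return checksum.to_bytes(CHECKSUM_BYTES, "big") + payload


def verify_page(stored: bytes) -> bool:
    """Recompute the CRC-32 on read and compare it with the stored value."""
    expected = int.from_bytes(stored[:CHECKSUM_BYTES], "big")
    return zlib.crc32(stored[CHECKSUM_BYTES:]) == expected


def flip_bit(data: bytes, byte_index: int) -> bytes:
    """Simulate corruption by flipping the lowest bit of one byte."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 0x01
    return bytes(damaged)


payload = b"tuple data".ljust(PAGE_SIZE - CHECKSUM_BYTES, b"\x00")

# Case 1: the page is damaged after the checksum was computed
# (storage, controller or file system). The mismatch is detected on read.
stored = write_page(payload)
damaged_on_disk = flip_bit(stored, 100)
detected_after_write = not verify_page(damaged_on_disk)

# Case 2: the page is damaged in memory before the checksum is computed
# (RAM or CPU). The checksum faithfully covers the bad data, so the
# page verifies and the corruption goes unnoticed.
damaged_in_ram = flip_bit(payload, 100)
stored_bad = write_page(damaged_in_ram)
detected_before_write = not verify_page(stored_bad)

print(detected_after_write, detected_before_write)
```

In this sketch the first case is flagged and the second is not, which is the distinction between layers 1-3 and layers 4-5 drawn in the quoted guess.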

I work with an important Czech bank, and they ask for checksums, as
they would for any other tool that improves the chance of detecting
failures. So the lack of checksums hurts PostgreSQL's usability for
critical systems; speed is not that important there.

Regards

Pavel

>
> --
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
>
>   + It's impossible for everything to be true. +
>