Re: [HACKERS] Checksums by default? - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Checksums by default?
Date
Msg-id CA+TgmoZRG_Vik+giTBAOgRCPtkb5tC0AOauoyGc8=Kpjqdhvgg@mail.gmail.com
In response to Re: [HACKERS] Checksums by default?  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Fri, Feb 10, 2017 at 7:38 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Incidentally, I've been dealing with a checksum failure reported by a
> customer last week, and based on the experience I tend to agree that we
> don't have the tools needed to deal with checksum failures. I think such
> tooling should be a 'must have' for enabling checksums by default.
>
> In this particular case the checksum failure is particularly annoying
> because it happens during recovery (on a standby, after a restart), during
> startup, so FATAL means shutdown.
>
> I've managed to inspect the page in a different way (dd and pageinspect from
> another instance), and it looks fine - no obvious data corruption, the only
> thing that seems borked is the checksum itself, and only three consecutive
> bits are flipped in the checksum. So this doesn't seem like a "stale
> checksum" - a hardware issue is a possibility (the machine has ECC RAM
> though), but it might just as easily be a bug in PostgreSQL, where something
> scribbles over the checksum due to a buffer overflow, just before we write
> the buffer to the OS. So 'false failures' are not an entirely impossible thing.
>
> And no, backups may not be a suitable solution - the failure happens on a
> standby, and the page (luckily) is not corrupted on the master. Which means
> that perhaps the standby got corrupted by the WAL, which would affect the
> backups too. I can't verify this, though, because the WAL has already been
> removed from the archive. But it's a possibility.
>
> So I think we're not ready to enable checksums by default for everyone, not
> until we can provide tools to deal with failures like this (I don't think
> users will be amused if we tell them to use 'dd' and inspect the pages in a
> hex editor).
>
> ISTM the way forward is to keep the current default (disabled), but to allow
> enabling checksums on the fly. That will mostly fix the issue for people who
> actually want checksums but don't realize they need to enable them at initdb
> time (and starting from scratch is not an option for them), are running on
> good hardware and are capable of dealing with checksum errors if needed,
> even without more built-in tooling.
>
> Being able to disable checksums on the fly is nice, but it only really
> solves the issue of extra overhead - it doesn't really help with the failures
> (particularly when you can't even start the database, because of a checksum
> failure in the startup phase).
>
> So, shall we discuss what tooling would be useful / desirable?

FWIW, I appreciate this analysis and I think it's exactly the kind of
thing we need to set a strategy for moving forward.
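
To make the "dd and inspect the page" step above a bit more concrete, here is
a minimal sketch of the kind of check being described: pull one block out of a
relation file, read the stored checksum from the page header, and see which
bits differ from the value the server calculated. The function names are
hypothetical, and it assumes the default 8kB block size, a little-endian
machine, and that pd_checksum is the 16-bit field at byte offset 8 of the page
header (right after pd_lsn). The calculated value is taken from the server's
error message, not recomputed here.

```python
import struct

BLCKSZ = 8192  # assumption: default PostgreSQL block size


def read_block(path, blkno, blcksz=BLCKSZ):
    """Read one block from a relation segment file (like dd bs=8192 skip=N count=1)."""
    with open(path, "rb") as f:
        f.seek(blkno * blcksz)
        page = f.read(blcksz)
    if len(page) != blcksz:
        raise ValueError("short read: block %d not fully present" % blkno)
    return page


def stored_checksum(page):
    """Return pd_checksum: bytes 0-7 hold pd_lsn, the uint16 at offset 8 is the checksum.

    Little-endian byte order is an assumption about the host.
    """
    return struct.unpack_from("<H", page, 8)[0]


def flipped_bits(stored, calculated):
    """List the bit positions (0 = least significant) where the two 16-bit values differ."""
    diff = (stored ^ calculated) & 0xFFFF
    return [i for i in range(16) if (diff >> i) & 1]


def is_consecutive_run(bits):
    """True if the differing bit positions form a single unbroken run."""
    return bool(bits) and bits == list(range(bits[0], bits[0] + len(bits)))
```

A short run of adjacent flipped bits, with the rest of the page looking sane,
is a hint in the direction described above (something damaged the checksum
field itself), but this sketch cannot tell a hardware fault from a software
bug.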

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


