Re: Offline enabling/disabling of data checksums - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Offline enabling/disabling of data checksums
Msg-id 20190105221214.GW2528@tamriel.snowman.net
In response to Re: Offline enabling/disabling of data checksums  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 12/27/18 11:43 AM, Magnus Hagander wrote:
> > Plus, the majority of people *should* want them on :) We don't run with
> > say synchronous_commit=off by default either to make it easier on those
> > that don't want to pay the overhead of full data safety :P (I know it's
> > not a direct match, but you get the idea)

+1 to having them on by default; we should have done that a long time
ago.

> I don't know, TBH. I agree that making the on/off change cheaper moves
> us closer to 'on' by default, because they may disable it if needed. But
> it's not the whole story.
>
> If we enable checksums by default, 99% of users will have them enabled.

Yes, and they'll then be able to catch data corruption much earlier.
Today, 99% of our users don't have them enabled and have no clue if
their data has been corrupted on disk, or not.  That's not good.

> That means more people will actually observe data corruption cases that
> went unnoticed so far. What shall we do with that? We don't have very
> good answers to that (tooling, docs) and I'd say "disable checksums" is
> not a particularly amazing response in this case :-(

Now that we've got a number of tools available which will check the
checksums in a running system and throw up warnings when failures are
found (pg_basebackup, pgBackRest and I think other backup tools,
pg_checksums...), users will see corruption and have the option to
restore from a backup before those backups expire and they're left
with a corrupt database and backups which also have that corruption.

This ongoing call for specific tooling to do "something" about checksums
is certainly good, but it's not right to say that we don't have existing
documentation; we do, quite a bit of it, and it's all under the heading
of "Backup and Recovery".

> FWIW I don't know what to do about that. We certainly can't prevent the
> data corruption, but maybe we could help with fixing it (although that's
> bound to be low-level work).

There's been some effort to try and automagically correct corrupted
pages but it's certainly not something I'm ready to trust beyond a
"well, this is what it might have been" review.  The answer today is to
find a backup which isn't corrupt and restore from it on a known-good
system.  If adding explicit documentation to that effect would reduce
your level of concern when it comes to enabling checksums by default,
then I'm happy to do that.

Thanks!

Stephen
