Re: Offline enabling/disabling of data checksums - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: Offline enabling/disabling of data checksums
Date
Msg-id CABUevExjCOKDRX=B6fBkoYs7Q2jYJLZS4Bxh1pi0b2_BV-7DEA@mail.gmail.com
Whole thread Raw
In response to Re: Offline enabling/disabling of data checksums  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Offline enabling/disabling of data checksums
List pgsql-hackers


On Thu, Mar 14, 2019 at 5:39 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Mar 13, 2019 at 08:56:33PM +0900, Michael Paquier wrote:
> On Wed, Mar 13, 2019 at 12:09:24PM +0100, Michael Banck wrote:
> > The attached patch should do the above, on top of Michael's last
> > patchset.
>
> What you are doing here looks like a good defense in itself.

More thoughts on that, and here is a short summary of the thread.

+       /* Check if control file has changed */
+       if (controlfile_last_updated != ControlFile->time)
+       {
+           fprintf(stderr, _("%s: control file has changed since startup\n"), progname);
+           exit(1);
+       }
Actually, under the conditions discussed on this thread that Postgres
is started in parallel of pg_checksums, imagine the following
scenario:
- pg_checksums starts to enable checksums, it reads a block to
calculate its checksum, then calculates it.
- Postgres has been started in parallel, writes the same block.
- pg_checksums finishes the block calculation, writes back the block
it has just read.
- Postgres stops, some data is lost.
- At the end of pg_checksums, we complain that the control file has
been updated since the start of pg_checksums.
I think that we should be way more noisy about this error message
document properly that Postgres should not be started while checksums
are enabled.  Basically, I guess that it should mention that there is
a risk of corruption because of this parallel operation.

Hence, based on what I could read on this thread, we'd like to have
the following things added to the core patch set:
1) When enabling checksums, fsync the data folder.  Then update the
control file, and finally fsync the control file (+ flush of the
parent folder to make the whole durable).  This way a host crash never
actually means that we finish in an inconsistent state.
2) When checksums are disabled, update the control file, then fsync
it + its parent folder.
3) Add tracking of the control file data, and complain loudly before
trying to update the file if there are any inconsistencies found.
4) Document with a big-fat-red warning that postgres should not be
worked on while the tool is enabling or disabling checksums.

Given that the failure is data corruption, I don't think big fat warning is enough. We should really make it impossible to start up the postmaster by mistake during the checksum generation. People don't read the documentation until it's too late. And it might not even be under their control - some automated tool might go in and try to start postgres, and boom, corruption.

One big-hammer method could be similar to what pg_upgrade does -- temporarily rename away the controlfile so postgresql can't start, and when done, put it back.

--

pgsql-hackers by date:

Previous
From: Magnus Hagander
Date:
Subject: Re: Offline enabling/disabling of data checksums
Next
From: Christoph Berg
Date:
Subject: Re: Offline enabling/disabling of data checksums