Re: Online verification of checksums - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Online verification of checksums
Date
Msg-id 20181122173928.GB3415@tamriel.snowman.net
Whole thread Raw
In response to Re: Online verification of checksums  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Online verification of checksums
List pgsql-hackers
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 11/22/18 2:12 AM, Stephen Frost wrote:
> >* Michael Banck (michael.banck@credativ.de) wrote:
> >>On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote:
> >>>The "check if page was modified since checkpoint" does not look useful when
> >>>offline. Maybe it lacks a comment to say that this cannot (should not ?)
> >>>happen when offline, but even then I would not like it to be true: ISTM that
> >>>no page should be allowed to be skipped on the checkpoint condition when
> >>>offline, but it is probably ok to skip with the new page test, which make me
> >>>still think that they should be counted and reported separately, or at least
> >>>the checkpoint skip test should not be run when offline.
> >>
> >>What is the rationale to not skip on the checkpoint condition when the
> >>instance is offline?  If it was shutdown cleanly, this should not
> >>happen, if the instance crashed, those would be spurious errors that
> >>would get repaired on recovery.
> >>
> >>I have not changed that for now.
> >
> >Agreed- this is an important check even in offline mode.
>
> Yeah. I suppose we could detect if the shutdown was clean (like pg_rewind
> does), and then skip the check. Or perhaps we should still do the check
> (without a retry), and report it as issue when we find a page with LSN newer
> than the last checkpoint.

I agree that it'd be nice to report an issue if it's a clean shutdown
but there's an LSN newer than the last checkpoint, though I suspect that
would be more useful in debugging and such and not so useful for users.

> In any case, the check is pretty cheap (comparing two 64-bit values), and I
> don't see how skipping it would optimize anything. It would make the code a
> tad simpler, but we still need the check for the online mode.

Yeah, I'd just keep the check.

> A minor detail is that the reads/writes should be atomic at the sector
> level, which used to be 512B, so it's not just about pages torn in 4kB/4kB
> manner, but possibly an arbitrary mix of 512B chunks from old and new
> version.

Sure.

> This also explains why we don't need any delay - the reread happens after
> the write must have already written the page header, so the new LSN must be
> already visible.

Agreed.

Thanks!

Stephen

Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Online verification of checksums
Next
From: Tomas Vondra
Date:
Subject: Re: Tid scan improvements