Re: Online verification of checksums - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Online verification of checksums
Msg-id 20180917144651.GX4184@tamriel.snowman.net
In response to Re: Online verification of checksums  (Michael Banck <michael.banck@credativ.de>)
List pgsql-hackers
Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:
> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> > Obviously, pg_verify_checksums can't do that easily because it's
> > supposed to run from outside the database instance.
>
> It reads pg_control anyway, so couldn't we just take
> ControlFile->checkPoint?
>
> Other than that, basebackup.c seems to only look at pages which haven't
> been changed since the backup starting checkpoint (see above if
> statement). That's reasonable for backups, but is it just as reasonable
> for online verification?

Right, basebackup doesn't need to look at other pages.
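
For reference, since the tool reads pg_control anyway, pulling the
last-checkpoint LSN out of it in a frontend program could look roughly
like the sketch below (the get_controlfile() signature shown is the
v11-era frontend API, and the error handling is just illustrative):

    /*
     * Rough sketch: fetch the last-checkpoint LSN from pg_control in a
     * frontend program.  Assumes the v11-era get_controlfile() signature;
     * details may differ between versions.
     */
    #include "postgres_fe.h"
    #include "catalog/pg_control.h"
    #include "common/controldata_utils.h"

    static XLogRecPtr
    last_checkpoint_lsn(const char *datadir, const char *progname)
    {
        bool             crc_ok;
        ControlFileData *controlfile;
        XLogRecPtr       lsn;

        controlfile = get_controlfile(datadir, progname, &crc_ok);
        if (!crc_ok)
        {
            fprintf(stderr, "%s: pg_control CRC check failed\n", progname);
            exit(1);
        }

        /* LSN of the last checkpoint record (ControlFile->checkPoint) */
        lsn = controlfile->checkPoint;
        pfree(controlfile);

        return lsn;
    }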

> > The pg_verify_checksum apparently ignores this skip logic, because on
> > the retry it simply re-reads the page again, verifies the checksum and
> > reports an error. Which is broken, because the newly read page might be
> > torn again due to a concurrent write.
>
> Well, ok.

On the re-read, though, the newly read page will have an updated LSN, so
basebackup can tell that what happened was a rewrite of the page; it no
longer has to care about that page and can skip it.

I haven't looked, but if basebackup isn't checking the LSN again for the
newly read page then that'd be broken.  I believe it does, though (at
least, that's the algorithm we came up with for pgBackRest, and I know
David shared it when the basebackup code was being written).
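
To spell the retry out, the logic being discussed is roughly the sketch
below (just a sketch: read_block() is a trivial pread() wrapper, the
skipped_blocks counter is hypothetical, and segment offsets and new-page
handling are left out):

    /*
     * Sketch of the re-read logic: on a checksum mismatch, read the block
     * once more; if the fresh copy carries an LSN newer than the last
     * checkpoint, treat it as a concurrent write and skip it, otherwise
     * report it as a real failure.
     */
    #include "postgres_fe.h"
    #include "storage/bufpage.h"
    #include "storage/checksum.h"
    #include "storage/checksum_impl.h"  /* frontend needs the implementation */
    #include <unistd.h>

    static int64 skipped_blocks = 0;    /* reported at the end of the run */

    static bool
    read_block(int fd, BlockNumber blkno, char *buf)
    {
        return pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ) == BLCKSZ;
    }

    static void
    check_block(int fd, BlockNumber blkno, XLogRecPtr checkpoint_lsn)
    {
        char        buf[BLCKSZ];
        PageHeader  hdr = (PageHeader) buf;

        if (!read_block(fd, blkno, buf))
            return;                     /* short read: handled elsewhere */

        if (pg_checksum_page(buf, blkno) == hdr->pd_checksum)
            return;                     /* good on the first read */

        /* Mismatch: could be a torn, in-flight write, so read it again. */
        if (!read_block(fd, blkno, buf))
            return;

        if (PageGetLSN((Page) buf) > checkpoint_lsn)
        {
            /* Page was rewritten after the last checkpoint: skip it. */
            skipped_blocks++;
            return;
        }

        if (pg_checksum_page(buf, blkno) != hdr->pd_checksum)
            fprintf(stderr, "checksum verification failed in block %u\n",
                    blkno);             /* still bad after the re-read */
    }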

> > So IMHO this should do something similar to basebackup - check the page
> > LSN, and if it changed then skip the page.
> >
> > I'm afraid this requires using the last checkpoint LSN, the way startptr
> > is used in basebackup. In particular we can't simply remember LSN from
> > the first read, because we might actually read [B1,A2] on the first try,
> > and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
> > torn in various other ways, not necessarily at the 4kB boundary - it
> > might be torn right after the LSN, for example).
>
> I'd prefer to come up with a plan where we don't just give up once we
> see a new LSN, if possible. If I run a modified pg_verify_checksums
> which skips on newer pages in a tight benchmark, basically everything
> gets skipped as checkpoints don't happen often enough.

I'm really not sure how you expect to be able to do something different
here.  Even if we started poking into shared buffers, all you'd be able
to see is that there's a bunch of dirty pages- and we don't maintain the
checksums in shared buffers, so it's not like you could verify them
there.

You could possibly have an option that says "force a checkpoint" but,
honestly, that's really not all that interesting either- all you'd be
doing is forcing all the pages to be written out from shared buffers
into the kernel cache and then reading them back from there instead.
It's not like you'd actually be able to tell if there was a disk/storage
error, because you'd only be looking at the kernel cache.

> So how about we do check every page, but if one fails on retry, and the
> LSN is newer than the checkpoint, we then skip it? Is that logic sound?

I thought that's what basebackup did- if it doesn't do that today, then
it really should.

> In any case, if we decide we really should skip the page if it is newer
> than the checkpoint, I think it makes sense to track those skipped pages
> and print their number out at the end, if there are any.

Not sure what the point of this is.  If we wanted to really do something
to cross-check here, we'd track the pages that were skipped and then
look through the WAL to make sure that they're there.  That's something
we've talked about doing with pgBackRest, but we don't do it currently.

> > FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> > to protect against anything, really.
>
> Well, I've noticed that without it I get sporadic checksum failures on
> reread, so I've added it to make them go away. It was certainly a
> phenomenological decision that I am happy to trade for a better one.

That then sounds like we really aren't re-checking the LSN, and we
really should be, to avoid getting these sporadic checksum failures on
reread..

> Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
> or some such file which pg_verify_checksums gets confused on, I guess we
> should skip that as well.  Can we assume that all files that start with
> the ones in skip[] are safe to skip or should we have an exception for
> files starting with pg_internal.init?

Everything listed in skip is safe to skip on a restore..  I've not
really thought too much about if they're all safe to skip when checking
checksums for an online system, but I would generally think so..
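
For the pg_internal.init.<pid> case, one option would be to make the
skip check match on prefixes rather than exact names, roughly as in the
sketch below (the skip[] list mirrors what the tool already excludes;
whether prefix matching is too permissive is exactly the open question):

    /*
     * Sketch: skip check matching on prefixes, so that transient files
     * like pg_internal.init.606 are excluded along with pg_internal.init
     * itself.
     */
    #include "postgres_fe.h"
    #include <string.h>

    static const char *const skip[] = {
        "pg_control",
        "pg_filenode.map",
        "pg_internal.init",
        "PG_VERSION",
        NULL,
    };

    static bool
    skipfile(const char *fn)
    {
        const char *const *f;

        for (f = skip; *f; f++)
        {
            if (strncmp(fn, *f, strlen(*f)) == 0)
                return true;
        }
        return false;
    }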

Thanks!

Stephen
