Home > mailing lists

Re: Online verification of checksums - Mailing list pgsql-hackers

From	Stephen Frost
Subject	Re: Online verification of checksums
Date	September 17, 2018 19:42:46
Msg-id	20180917164246.GZ4184@tamriel.snowman.net Whole thread Raw
In response to	Re: Online verification of checksums (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses	Re: Online verification of checksums
List	pgsql-hackers

Tree view

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 09/17/2018 04:46 PM, Stephen Frost wrote:
> > * Michael Banck (michael.banck@credativ.de) wrote:
> >> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> >>> Obviously, pg_verify_checksums can't do that easily because it's
> >>> supposed to run from outside the database instance.
> >>
> >> It reads pg_control anyway, so couldn't we just take
> >> ControlFile->checkPoint?
> >>
> >> Other than that, basebackup.c seems to only look at pages which haven't
> >> been changed since the backup starting checkpoint (see above if
> >> statement). That's reasonable for backups, but is it just as reasonable
> >> for online verification?
> >
> > Right, basebackup doesn't need to look at other pages.
> >
> >>> The pg_verify_checksum apparently ignores this skip logic, because on
> >>> the retry it simply re-reads the page again, verifies the checksum and
> >>> reports an error. Which is broken, because the newly read page might be
> >>> torn again due to a concurrent write.
> >>
> >> Well, ok.
> >
> > The newly read page will have an updated LSN though then on the re-read,
> > in which case basebackup can know that what happened was a rewrite of
> > the page and it no longer has to care about the page and can skip it.
> >
> > I haven't looked, but if basebackup isn't checking the LSN again for the
> > newly read page then that'd be broken, but I believe it does (at least,
> > that's the algorithm we came up with for pgBackRest, and I know David
> > shared that when the basebackup code was being written).
>
> Yes, basebackup does check the LSN on re-read, and skips the page if it
> changed on re-read (because it eliminates the consistency guarantees
> provided by the checkpoint).

Ok, good, though I'm not sure what you mean by 'eliminates the
consistency guarantees provided by the checkpoint'.  The point is that
the page will be in the WAL and the WAL will be replayed during the
restore of the backup.

> >> So how about we do check every page, but if one fails on retry, and the
> >> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
> >
> > I thought that's what basebackup did- if it doesn't do that today, then
> > it really should.
>
> The crucial distinction here is that the trick is not in comparing LSNs
> from the two page reads, but comparing it to the checkpoint LSN. If it's
> greater, the page may be torn or broken, and there's no way to know
> which case it is - so basebackup simply skips it.

Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it.  IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.

> >> In any case, if we decide we really should skip the page if it is newer
> >> than the checkpoint, I think it makes sense to track those skipped pages
> >> and print their number out at the end, if there are any.
> >
> > Not sure what the point of this is.  If we wanted to really do something
> > to cross-check here, we'd track the pages that were skipped and then
> > look through the WAL to make sure that they're there.  That's something
> > we've talked about doing with pgBackRest, but don't currently.
>
> I agree simply printing the page numbers seems rather useless. What we
> could do is remember which pages we skipped and then try checking them
> after another checkpoint. Or something like that.

I'm still not sure I'm seeing the point of that.  They're still going to
almost certainly be in the kernel cache.  The reason for checking
against the WAL would be to detect errors in PG where we aren't putting
a page into the WAL when it really should be, or something similar,
which seems like it at least could be useful.

Maybe to put it another way- there's very little point in checking the
checksum of a page which we know must be re-written during recovery to
get to a consistent point.  I don't think it hurts in the general case,
but I wouldn't write a lot of code which then needs to be tested to
handle it.  I also don't think that we really need to make
pg_verify_checksum spend lots of extra cycles trying to verify that
*every* page had its checksum validated when we know that lots of pages
are going to be in memory marked dirty and our checking of them will be
ultimately pointless as they'll either be written out by the
checkpointer or some other process, or we'll replay them from the WAL if
we crash.

> >>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> >>> to protect against anything, really.
> >>
> >> Well, I've noticed that without it I get sporadic checksum failures on
> >> reread, so I've added it to make them go away. It was certainly a
> >> phenomenological decision that I am happy to trade for a better one.
> >
> > That then sounds like we really aren't re-checking the LSN, and we
> > really should be, to avoid getting these sporadic checksum failures on
> > reread..
>
> Again, it's not enough to check the LSN against the preceding read. We
> need a checkpoint LSN or something like that.

I actually tend to disagree with you that, for this purpose, it's
actually necessary to check against the checkpoint LSN- if the LSN
changed and everything is operating correctly then the new LSN must be
more recent than the last checkpoint location or things are broken
badly.

Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway).  The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.

When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that..  I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.

Thanks!

Stephen

Attachment

signature.asc

pgsql-hackers by date:

From: Tomas Vondra
Date: 17 September 2018, 19:25:02
Subject: Re: Online verification of checksums

From: Tomas Vondra
Date: 17 September 2018, 20:00:31
Subject: Re: Online verification of checksums

Re: Online verification of checksums - Mailing list pgsql-hackers

Attachment

Previous

Next