Re: Online verification of checksums - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Online verification of checksums
Msg-id 20201123222852.GP16415@tamriel.snowman.net
In response to Re: Online verification of checksums  (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
Responses Re: Online verification of checksums  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Greetings,

* Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
> On 23.11.2020 18:35, Stephen Frost wrote:
> >* Anastasia Lubennikova (a.lubennikova@postgrespro.ru) wrote:
> >>On 21.11.2020 04:30, Michael Paquier wrote:
> >>>The only method I can think as being really
> >>>reliable is based on two facts:
> >>>- Do a check only on pd_checksums, as that validates the full contents
> >>>of the page.
> >>>- When doing a retry, make sure that there is no concurrent I/O
> >>>activity in the shared buffers.  This requires an API we don't have
> >>>yet.
> >>It seems reasonable to me to rely on checksums only.
> >>
> >>As for retry, I think that API for concurrent I/O will be complicated.
> >>Instead, we can introduce a function to read the page directly from shared
> >>buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
> >>solution to me. Do you see any possible problems with it?
> >We might end up reading pages back in that have been evicted, for one
> >thing, which doesn't seem great,
> TBH, I think it is highly unlikely that the page that was just updated will
> be evicted.

Is it though..?  Consider that the page which was being written out was
being written specifically to free a buffer for use by another backend-
while perhaps that doesn't happen all the time, it certainly happens
often enough on very busy systems.

> >and this also seems likely to be
> >awkward for cases which aren't using the replication protocol, unless
> >every process maintains a connection to PG the entire time, which also
> >doesn't seem great.
> Have I missed something? Now pg_basebackup has only one process + one child
> process for streaming. Anyway, I totally agree with your argument. The need
> to maintain connection(s) to PG is the most unpleasant part of the proposed
> approach.

I was thinking beyond pg_basebackup, yes; apologies for that not being
clear but that's what I was meaning when I said "aren't using the
replication protocol".

> >Also- what is the point of reading the page from shared buffers
> >anyway..?
> Well... Reading a page from shared buffers is a reliable way to get a
> correct page from postgres under any concurrent load. So it just seems
> natural to me.

Yes, that's true, but if a dirty page was just written out by a backend
in order to be able to evict it, so that the backend can then pull in a
new page, then having pg_basebackup pull that page back in really isn't
great.

> >All we need to do is prove that the page will be rewritten
> >during WAL replay.
> Yes, and this is the tricky part. Until you explained it in your latest
> message, I wasn't sure how we could distinguish a concurrent update from
> page header corruption. Now I agree that if the page LSN was updated and
> increased between rereads, it is safe enough to conclude that we have some
> concurrent load.

Even in this case, it's almost free to compare the LSN to the starting
backup LSN, and to the current LSN position, and make sure it's
somewhere between the two.  While that doesn't entirely eliminate the
possibility that the page happened to get corrupted *and* return a
different result on subsequent reads *and* that it was corrupted in such
a way that the LSN ended up falling between the starting backup LSN and
the current LSN, it's certainly reducing the chances of a false negative
a fair bit.
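
To make that concrete, here's roughly what I'm picturing (just a sketch,
not taken from any existing patch; the helper name and the direct
extraction of pd_lsn from the raw page image are mine, and this assumes
we're reading the page on the same host, so no byte-swapping concerns):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * Pull pd_lsn out of a raw 8K page image.  pd_lsn is the first field
     * of the page header, stored as two 32-bit halves (high word first).
     */
    static uint64_t
    page_lsn(const char *page)
    {
        uint32_t    hi;
        uint32_t    lo;

        memcpy(&hi, page, sizeof(hi));
        memcpy(&lo, page + sizeof(hi), sizeof(lo));
        return ((uint64_t) hi << 32) | lo;
    }

    /*
     * After a checksum failure and a re-read, accept the page as "torn
     * by a concurrent write" only if its LSN moved forward and now falls
     * between the backup start LSN and the current LSN.
     */
    static bool
    page_plausibly_in_flight(const char *reread_page,
                             uint64_t first_read_lsn,
                             uint64_t backup_start_lsn,
                             uint64_t current_lsn)
    {
        uint64_t    lsn = page_lsn(reread_page);

        return lsn > first_read_lsn &&
               lsn >= backup_start_lsn &&
               lsn <= current_lsn;
    }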

A concern here, however, is- can we be 100% sure that we'll get a
different result from the two subsequent reads?  For my part, at least,
I've been doubtful that it's possible but it'd be nice to hear it from
someone who has really looked at the kernel side.  To try and clarify,
let me illustrate:

pg_basebackup (the backend that's sending data to it anyway) starts
reading an 8K page, but gets interrupted halfway through, meaning that
it's read 4K and is now paused.

PG writes that same 8K page, and is able to successfully write the
entire block.

pg_basebackup then wakes up, reads the second half, computes a checksum
and gets a checksum failure.

At this point the question is: if pg_basebackup loops, seeks and
re-reads the same 8K block again, is it possible that pg_basebackup will
get the "old" starting 4K and the "new" ending 4K again?  I'd like to
think that the answer is 'no' and that the kernel will guarantee that if
we managed to read a "new" ending 4K block then the following read of
the full 8K block would be guaranteed to give us the "new" starting 4K.
If that is truly guaranteed then we could be much more confident that
the idea here of simply checking for an ascending LSN, which falls
between the starting LSN of the backup and the current LSN (or perhaps
the final LSN for the backup) would be sufficient to detect this case.
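
In code form, the flow I'm describing would look something like this
(again only a sketch that builds on the helpers above;
verify_page_checksum and the pread-based I/O are stand-ins for whatever
the real backend code ends up doing, not functions from any patch):

    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Assumed to exist elsewhere; see the sketch above. */
    extern bool verify_page_checksum(const char *page, uint32_t blkno);
    extern uint64_t page_lsn(const char *page);
    extern bool page_plausibly_in_flight(const char *reread_page,
                                         uint64_t first_read_lsn,
                                         uint64_t backup_start_lsn,
                                         uint64_t current_lsn);

    /*
     * Read one block and decide whether a checksum failure is a torn
     * read of a concurrent write (harmless, since WAL replay rewrites
     * the page) or something to report as corruption.  Returns true if
     * it should be reported.
     */
    static bool
    block_is_corrupt(int fd, uint32_t blkno,
                     uint64_t backup_start_lsn, uint64_t current_lsn)
    {
        char        page[BLCKSZ];
        uint64_t    first_lsn;

        if (pread(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
            return true;            /* short read: treat as a failure */
        if (verify_page_checksum(page, blkno))
            return false;           /* clean on the first read */

        first_lsn = page_lsn(page);

        /*
         * Re-read immediately: if the first read was torn by a
         * concurrent write, the second read should see the fully
         * written page.
         */
        if (pread(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
            return true;
        if (verify_page_checksum(page, blkno))
            return false;

        /*
         * Still failing: only accept the page if its LSN moved forward
         * into the backup's LSN range, i.e. it's covered by WAL replay.
         */
        return !page_plausibly_in_flight(page, first_lsn,
                                         backup_start_lsn, current_lsn);
    }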

I would also think that, if we can trust that, then there really isn't
any need for the delay in performing the re-read, which I have to admit
that I don't particularly care for.

> >  If we can prove that, we don't actually care what
> >the contents of the page are.  We certainly can't calculate the
> >checksum on a page we plucked out of shared buffers since we only
> >calculate the checksum when we go to write the page out.
> Good point. I was thinking that we could recalculate the checksum. Or even
> save the page without it, as we have checked the LSN and know for sure that
> it will be rewritten by WAL replay.

At the point that we know the page is in the WAL which must be replayed
to make this backup consistent, we could theoretically zero the page out
of the actual backup (or if we're doing some kind of incremental magic,
skip it entirely, as long as we zero-fill it on restore).
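
In other words, something as trivial as this, purely for illustration
(emit_block_to_backup is a made-up name, not anything that exists today):

    #include <stdbool.h>
    #include <string.h>

    #define BLCKSZ 8192

    /*
     * If the page is known to be covered by WAL that the backup will
     * replay, its on-disk contents don't matter: put zeros into the
     * backup instead of the (possibly torn) image we read.
     */
    static void
    emit_block_to_backup(char *backup_buf, const char *page,
                         bool covered_by_wal_replay)
    {
        if (covered_by_wal_replay)
            memset(backup_buf, 0, BLCKSZ);
        else
            memcpy(backup_buf, page, BLCKSZ);
    }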

> To sum up, I agree with your proposal to reread the page and rely on
> ascending LSNs. Can you submit a patch?

Probably would make sense to give Michael an opportunity to comment and
get his thoughts on this, and for him to update the patch if he agrees.

As it relates to pgbackrest, we're currently contemplating having a
higher-level loop which, upon detecting any page with an invalid
checksum, continues to scan to the end of that file and performs the
compression, encryption, et al, but then loops back after we've
completed that file and skips through it again, re-reading those pages
which didn't have a valid checksum the first time to see if their LSN
has changed and is within the range of the backup.  This will certainly
give more opportunity for the kernel to 'catch up', if needed, and give
us an updated page without a random 100ms delay.  It will also make it
easier for us to, eventually, check and make sure the page was in the
WAL that was produced as part of the backup, giving us a complete
guarantee that the contents of this page don't matter and that the
failed checksum isn't a sign of latent storage corruption.
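
To sketch that out (this isn't pgbackrest's actual code, just an
illustration of the two-pass idea, assuming fixed 8K blocks and the same
made-up helpers as above):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Assumed helpers, as sketched earlier in this message. */
    extern bool verify_page_checksum(const char *page, uint32_t blkno);
    extern uint64_t page_lsn(const char *page);

    /*
     * Two-pass scan of one relation file.  Pass 1: scan every block,
     * noting the ones whose checksum fails, while the file is
     * compressed/encrypted and streamed into the backup as usual.
     * Pass 2: once the whole file is done, re-read only the noted
     * blocks and treat them as corrupt unless their LSN now falls
     * within the backup's LSN range.  Returns the number of blocks
     * still considered corrupt.
     */
    static uint32_t
    scan_file_two_pass(int fd, uint32_t nblocks,
                       uint64_t backup_start_lsn, uint64_t backup_stop_lsn)
    {
        char        page[BLCKSZ];
        uint32_t   *suspects;
        uint32_t    nsuspects = 0;
        uint32_t    ncorrupt = 0;

        suspects = malloc(sizeof(uint32_t) * (nblocks ? nblocks : 1));
        if (suspects == NULL)
            return 0;               /* error handling elided in this sketch */

        /* Pass 1: full scan, remember the failing blocks. */
        for (uint32_t blkno = 0; blkno < nblocks; blkno++)
        {
            if (pread(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ) == BLCKSZ &&
                !verify_page_checksum(page, blkno))
                suspects[nsuspects++] = blkno;

            /* ... block is compressed, encrypted and written out here ... */
        }

        /* Pass 2: re-read only the suspects; no artificial delay needed. */
        for (uint32_t i = 0; i < nsuspects; i++)
        {
            uint32_t    blkno = suspects[i];
            uint64_t    lsn;

            if (pread(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
            {
                ncorrupt++;
                continue;
            }
            if (verify_page_checksum(page, blkno))
                continue;

            lsn = page_lsn(page);
            if (lsn < backup_start_lsn || lsn > backup_stop_lsn)
                ncorrupt++;         /* not explainable by a concurrent write */
        }

        free(suspects);
        return ncorrupt;
    }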

Thanks,

Stephen

