
From Stephen Frost
Subject Re: a very minor bug and a couple of comment changes for basebackup.c
Msg-id ZAZ+pjj5yD2XqbFE@tamriel.snowman.net
In response to Re: a very minor bug and a couple of comment changes for basebackup.c  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> Thanks for the review. I have committed the patches.

No objections to what was committed.

> On Thu, Mar 2, 2023 at 2:59 AM Michael Paquier <michael@paquier.xyz> wrote:
> > There is more to it: the page LSN is checked before its checksum.
> > Hence, if the page's LSN is corrupted in a way that it is higher than
> > sink->bbs_state->startptr, the checksum verification is just skipped
> > while the page is broken but not reported as such.  (Spoiler: this has
> > been mentioned in the past, and maybe we'd better remove this stuff in
> > its entirety.)
>
> Yep. It's pretty embarrassing that we have a checksum verification
> feature that has both known ways of producing false positives and
> known ways of producing false negatives and we have no plan to ever
> fix that, we're just going to keep shipping what we've got. I think
> it's pretty clear that the feature shouldn't have been committed like
> this; valid criticisms of the design were offered and simply not
> addressed, not even by updating the comments or documentation with
> disclaimers. I find the code in sendFile() pretty ugly, too. For all
> of that, I'm a bit uncertain whether ripping it out is the right thing
> to do. It might be (I'm not sure) that it tends to work decently well
> in practice. Like, yes, it could produce false checksum warnings, but
> does that actually happen to people? It's probably not too likely that
> a read() or write() of 8kB gets updated after doing only part of the
> I/O, so the concurrent read or write is fairly likely to be on-CPU, I
> would guess, and therefore this algorithm might kind of work OK in
> practice despite being garbage on a theoretical level. Similarly, the
> problems with how the LSN is vetted make it likely that a page
> replaced with random garbage will go undetected, but a page where a
> few bytes get flipped in a random position within the page is likely
> to get caught, and maybe the latter is actually a bigger risk than the
> former. I don't really know. I'd be interested to hear from anyone
> with a lot of practical experience using the feature. A few anecdotes
> of the form "this routine fails to tell us about problems" or "this
> often complains about problems that are not real" or "this has really
> helped us out on a number of occasions and we have had no problems
> with it" would be really helpful.

The concern about the LSN is certainly a valid one, and we ended up
ripping that check out of pgbackrest in favor of a different approach:
simply re-read the page and see if it changed.  If it changed, we punt
and figure it was a hot page that PG was actively writing, so it'll be
in the WAL and we don't have to worry about it.  We're generally more
concerned about latent on-disk corruption going undetected than about
some kind of in-memory corruption, and the page changing in the
filesystem cache without some other process writing to it just seems
awfully unlikely.

https://github.com/pgbackrest/pgbackrest/commit/9eec98c61302121134d2067326dbd2cd0f2f0b9c
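
For illustration, a minimal sketch of that re-read logic (not the
actual pgbackrest code; readPageAt() and checksum_matches() are
hypothetical stand-ins, and BLCKSZ is just PG's default block size):

#include <stdbool.h>
#include <string.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/* hypothetical helpers: read one block at an offset / verify its checksum */
extern bool readPageAt(int fd, char *buf, long offset);
extern bool checksum_matches(const char *page, unsigned blkno);

static bool
page_ok_with_reread(int fd, long offset, unsigned blkno)
{
    char first[BLCKSZ];
    char second[BLCKSZ];

    if (!readPageAt(fd, first, offset))
        return false;                   /* read error, reported separately */

    if (checksum_matches(first, blkno))
        return true;                    /* checksum verifies, done */

    /*
     * Checksum failed: re-read the page and compare.  If the bytes moved
     * between the two reads, PG was concurrently writing the page, so it
     * will be replayed from WAL and we punt rather than report corruption.
     */
    if (!readPageAt(fd, second, offset))
        return false;

    if (memcmp(first, second, BLCKSZ) != 0)
        return true;                    /* hot page: covered by WAL */

    return false;                       /* stable and invalid: corruption */
}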

From a practical perspective, while this has the aforementioned risk
of the loop running twice fast enough that it re-reads the same page
without the I/O on that page having progressed at all, I'm not aware of
even one report of that actually happening.  We absolutely have seen
cases where the first read picks up a torn page, and that's not even
uncommon in busy environments.

One of the ideas I've had for addressing all of this is to not depend
on inferring things from what's been written out, but instead to just
ask PG.  That's one of the (relatively few...) benefits I see to an
archive_library: the ability to check whether a given page is in shared
buffers and, if so, whether its LSN is past the start of the backup.
Given that pg_basebackup is already working with a lot of server-side
code, perhaps this could be a direction for it to go in too.
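
To make that concrete, a rough sketch of what such a check might look
like; none of this is an existing API, and a real version would key on
(relfilenode, fork, block) rather than a bare block number:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* same width as PG's LSN type */

/* hypothetical hooks a server-side module might expose */
extern bool page_in_shared_buffers(uint32_t blkno);
extern XLogRecPtr buffered_page_lsn(uint32_t blkno);

static bool
checksum_failure_explained(uint32_t blkno, XLogRecPtr backup_start_lsn)
{
    /*
     * If the page is resident in shared buffers and its LSN is past the
     * start of the backup, PG is modifying it during the backup and WAL
     * replay will make it consistent, so a failed on-disk checksum for
     * that page need not be reported as corruption.
     */
    return page_in_shared_buffers(blkno) &&
           buffered_page_lsn(blkno) >= backup_start_lsn;
}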

Another idea we played around with was keeping track of the LSNs of
the pages with invalid checksums and then, once the backup's end LSN is
known, checking that they fall within the backup's start LSN to end LSN
range; while that wouldn't completely prevent a corrupted LSN from
escaping detection, it'd sure make it a whole lot less likely.
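
The check itself is trivial; a sketch, assuming the per-page LSNs of
the failures were recorded during the copy:

#include <stdbool.h>
#include <stdint.h>

/*
 * A random value scribbled over the LSN field is very unlikely to land
 * inside the narrow [start, end] window of one backup, so requiring this
 * makes the "skip because the LSN is new" escape hatch much harder to
 * trip by accident.
 */
static bool
lsn_within_backup(uint64_t page_lsn,
                  uint64_t backup_start_lsn,
                  uint64_t backup_end_lsn)
{
    return page_lsn >= backup_start_lsn && page_lsn <= backup_end_lsn;
}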

Lastly, I've also wondered about trying to pull (clean) pages from
shared_buffers directly instead of reading them off of disk at all,
perhaps using a ring buffer in shared_buffers to read them in if
they're not already there, and letting all the usual checksum
validation and such happen with PG handling it.  Dirty pages would have
to be handled specially though, presumably by locking them and then
reading the page from disk, as I don't think folks would be pleased if
we decided to forcibly write them out.  For an incremental backup, if
PG gives you reliable knowledge that the page is already in the WAL,
perhaps you could even skip the page entirely.
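
Purely as a control-flow sketch of that idea (every helper below is
hypothetical):

#include <stdbool.h>
#include <stdint.h>

/* all hypothetical: stand-ins for buffer-manager and copy primitives */
extern bool page_in_shared_buffers(uint32_t blkno);
extern bool page_is_dirty(uint32_t blkno);
extern void lock_buffer_shared(uint32_t blkno);
extern void unlock_buffer(uint32_t blkno);
extern void read_page_via_ring_buffer(uint32_t blkno);
extern void copy_page_from_buffer(uint32_t blkno);
extern void copy_page_from_disk(uint32_t blkno);

static void
backup_copy_page(uint32_t blkno)
{
    if (page_in_shared_buffers(blkno))
    {
        if (page_is_dirty(blkno))
        {
            /*
             * Don't force the dirty page out; briefly lock it and copy
             * the on-disk version instead, relying on WAL replay to fix
             * it up at restore time.
             */
            lock_buffer_shared(blkno);
            copy_page_from_disk(blkno);
            unlock_buffer(blkno);
        }
        else
            copy_page_from_buffer(blkno);   /* clean: PG already validated it */
    }
    else
    {
        /* pull it in through a small ring so we don't evict hot buffers */
        read_page_via_ring_buffer(blkno);
        copy_page_from_buffer(blkno);
    }
}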

Independently of verifying PG checksums, we've also considered an
approach where we take the file-level checksums we have in the
pgbackrest manifest and check whether a file's contents have changed on
the filesystem without its timestamp changing, as that would certainly
be indicative of something having gone wrong.
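
A sketch of that cross-check (manifest_entry, file_mtime(), and
file_checksum() are hypothetical):

#include <stdbool.h>
#include <string.h>
#include <time.h>

typedef struct
{
    const char *path;
    const char *checksum;       /* digest recorded in the manifest */
    time_t      mtime;          /* mtime recorded in the manifest */
} manifest_entry;

/* hypothetical helpers */
extern time_t      file_mtime(const char *path);
extern const char *file_checksum(const char *path);

static bool
file_silently_changed(const manifest_entry *e)
{
    /*
     * mtime unchanged but contents differ: nothing legitimately rewrote
     * the file, so the bytes changing underneath it is a strong signal
     * of on-disk corruption.
     */
    return file_mtime(e->path) == e->mtime &&
           strcmp(file_checksum(e->path), e->checksum) != 0;
}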

Certainly interested in other ideas about how to improve things in this
area.  I don't think we should rip it out though.

Thanks,

Stephen

