a very minor bug and a couple of comment changes for basebackup.c

From: Robert Haas
Date: Thu, 2 Feb 2023 15:23:51 -0500
Here are a few small patches for basebackup.c:

0001 fixes what I believe to be a slight logical error in sendFile(),
introduced by me during the v15 development cycle when I added
the bbsink abstraction. I believe that it is theoretically possible
for this to cause an assertion failure, although the chances of that
actually happening seem extremely remote in practice. I don't believe
there are any consequences worse than that; for instance, I don't
think this can result in your backup getting corrupted. See the
proposed commit message for full details. Because I believe that this
is formally a bug, I am inclined to think that this should be
back-patched, but I also think it's fairly likely that no one would
ever notice if we didn't. However, patch authors have been known to be
wrong about the consequences of their own bugs from time to time, so
please do let me know if this seems more serious to you than what I'm
indicating, or conversely if you think it's not a problem at all for
some reason.

0002 removes an old comment from the file that I find useless and
slightly misleading.

0003 rewrites a comment about the way that we verify checksums during
backups. If we get a checksum mismatch, we reread the block and see if
the perceived problem goes away. If it doesn't, then we report it.
This is intended as protection against the backup reading a block
while some other process is in the midst of writing it, but there's no
guarantee that any concurrent write will finish quickly enough for our
second read attempt to see the updated contents. The comment claims
otherwise, and that's false, and I'm getting tired of reading this
false claim every time I read this code, so I rewrote the comment to
say what I believe to be true, namely, that our algorithm is flaky and
we have no good way to fix that right now. I'm pretty sure that Andres
pointed this problem out when this feature was under discussion, but
somehow it's still like this. There's another nearby comment which is
also false or at least misleading for basically the same reasons which
probably should be rewritten too, but I was a bit less certain how to
rewrite it and it wasn't making me as annoyed as this one, so for now
I only rewrote the one.
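
In code terms, the algorithm is roughly this (a simplified sketch,
not the actual sendFile() source; read_block() is a stand-in for the
real read path):

    int     tries;

    for (tries = 0; tries < 2; tries++)
    {
        read_block(fd, blkno, page);    /* stand-in for the real read */
        if (((PageHeader) page)->pd_checksum ==
            pg_checksum_page(page, blkno))
            break;                      /* checksum looks fine */
    }
    if (tries == 2)
        checksum_failures++;            /* mismatch persisted: report */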

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


Re: a very minor bug and a couple of comment changes for basebackup.c

From: Michael Paquier
Date: Thu, 2 Mar 2023
On Thu, Feb 02, 2023 at 03:23:51PM -0500, Robert Haas wrote:
> 0001 fixes what I believe to be a slight logical error in sendFile(),
> introduced by me during the v15 development cycle when I added
> the bbsink abstraction. I believe that it is theoretically possible
> for this to cause an assertion failure, although the chances of that
> actually happening seem extremely remote in practice. I don't believe
> there are any consequences worse than that; for instance, I don't
> think this can result in your backup getting corrupted. See the
> proposed commit message for full details. Because I believe that this
> is formally a bug, I am inclined to think that this should be
> back-patched, but I also think it's fairly likely that no one would
> ever notice if we didn't. However, patch authors have been known to be
> wrong about the consequences of their own bugs from time to time, so
> please do let me know if this seems more serious to you than what I'm
> indicating, or conversely if you think it's not a problem at all for
> some reason.

Seems right.  I think that you should backpatch that, as
VERIFY_CHECKSUMS is the default.

> 0002 removes an old comment from the file that I find useless and
> slightly misleading.

Okay.

> 0003 rewrites a comment about the way that we verify checksums during
> backups. If we get a checksum mismatch, we reread the block and see if
> the perceived problem goes away. If it doesn't, then we report it.
> This is intended as protection against the backup reading a block
> while some other process is in the midst of writing it, but there's no
> guarantee that any concurrent write will finish quickly enough for our
> second read attempt to see the updated contents.

There is more to it: the page LSN is checked before its checksum.
Hence, if the page's LSN is corrupted in a way that makes it higher
than sink->bbs_state->startptr, the checksum verification is just
skipped, and the page is broken but never reported as such.  (Spoiler:
this has been mentioned in the past, and maybe we'd better remove this
stuff in its entirety.)
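
In rough shape, the code does this (a sketch, not the exact source):

    /* Sketch only; the real test is in sendFile(). */
    if (!PageIsNew((Page) page) &&
        PageGetLSN((Page) page) < sink->bbs_state->startptr)
    {
        /* Only now is the checksum looked at. */
        if (pg_checksum_page(page, blkno) !=
            ((PageHeader) page)->pd_checksum)
            checksum_failures++;
    }
    /*
     * Otherwise the page is assumed to be concurrently written and is
     * skipped silently, even if it is garbage whose "LSN" happens to
     * be a large value.
     */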

> The comment claims
> otherwise, and that's false, and I'm getting tired of reading this
> false claim every time I read this code, so I rewrote the comment to
> say what I believe to be true, namely, that our algorithm is flaky and
> we have no good way to fix that right now. I'm pretty sure that Andres
> pointed this problem out when this feature was under discussion, but
> somehow it's still like this. There's another nearby comment which is
> also false or at least misleading for basically the same reasons which
> probably should be rewritten too, but I was a bit less certain how to
> rewrite it and it wasn't making me as annoyed as this one, so for now
> I only rewrote the one.

Indeed, that's an improvement ;)
--
Michael


Re: a very minor bug and a couple of comment changes for basebackup.c

From: Robert Haas
Thanks for the review. I have committed the patches.

On Thu, Mar 2, 2023 at 2:59 AM Michael Paquier <michael@paquier.xyz> wrote:
> Seems right.  I think that you should backpatch that, as
> VERIFY_CHECKSUMS is the default.

Done.

> There is more to it: the page LSN is checked before its checksum.
> Hence, if the page's LSN is corrupted in a way that makes it higher
> than sink->bbs_state->startptr, the checksum verification is just
> skipped, and the page is broken but never reported as such.  (Spoiler:
> this has been mentioned in the past, and maybe we'd better remove this
> stuff in its entirety.)

Yep. It's pretty embarrassing that we have a checksum verification
feature that has both known ways of producing false positives and
known ways of producing false negatives and we have no plan to ever
fix that; we're just going to keep shipping what we've got. I think
it's pretty clear that the feature shouldn't have been committed like
this; valid criticisms of the design were offered and simply not
addressed, not even by updating the comments or documentation with
disclaimers. I find the code in sendFile() pretty ugly, too. For all
of that, I'm a bit uncertain whether ripping it out is the right thing
to do. It might be (I'm not sure) that it tends to work decently well
in practice. Like, yes, it could produce false checksum warnings, but
does that actually happen to people? It's probably not too likely that
a read() or write() of 8kB gets suspended after doing only part of the
I/O, so the concurrent read or write is fairly likely to be on-CPU, I
would guess, and therefore this algorithm might kind of work OK in
practice despite being garbage on a theoretical level. Similarly, the
problems with how the LSN is vetted make it likely that a page
replaced with random garbage will go undetected, but a page where a
few bytes get flipped in a random position within the page is likely
to get caught, and maybe the latter is actually a bigger risk than the
former. I don't really know. I'd be interested to hear from anyone
with a lot of practical experience using the feature. A few anecdotes
of the form "this routine fails to tell us about problems" or "this
often complains about problems that are not real" or "this has really
helped us out on a number of occasions and we have had no problems
with it" would be really helpful.

On the other hand, we could just say that the code is nonsense and
therefore, regardless of practical experience, it ought to be removed.
I'm somewhat open to that idea, too.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: a very minor bug and a couple of comment changes for basebackup.c

From: Stephen Frost
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> Thanks for the review. I have committed the patches.

No objections to what was committed.

> On Thu, Mar 2, 2023 at 2:59 AM Michael Paquier <michael@paquier.xyz> wrote:
> > There is more to it: the page LSN is checked before its checksum.
> > Hence, if the page's LSN is corrupted in a way that makes it higher
> > than sink->bbs_state->startptr, the checksum verification is just
> > skipped, and the page is broken but never reported as such.  (Spoiler:
> > this has been mentioned in the past, and maybe we'd better remove this
> > stuff in its entirety.)
>
> Yep. It's pretty embarrassing that we have a checksum verification
> feature that has both known ways of producing false positives and
> known ways of producing false negatives and we have no plan to ever
> fix that; we're just going to keep shipping what we've got. I think
> it's pretty clear that the feature shouldn't have been committed like
> this; valid criticisms of the design were offered and simply not
> addressed, not even by updating the comments or documentation with
> disclaimers. I find the code in sendFile() pretty ugly, too. For all
> of that, I'm a bit uncertain whether ripping it out is the right thing
> to do. It might be (I'm not sure) that it tends to work decently well
> in practice. Like, yes, it could produce false checksum warnings, but
> does that actually happen to people? It's probably not too likely that
> a read() or write() of 8kB gets suspended after doing only part of the
> I/O, so the concurrent read or write is fairly likely to be on-CPU, I
> would guess, and therefore this algorithm might kind of work OK in
> practice despite being garbage on a theoretical level. Similarly, the
> problems with how the LSN is vetted make it likely that a page
> replaced with random garbage will go undetected, but a page where a
> few bytes get flipped in a random position within the page is likely
> to get caught, and maybe the latter is actually a bigger risk than the
> former. I don't really know. I'd be interested to hear from anyone
> with a lot of practical experience using the feature. A few anecdotes
> of the form "this routine fails to tell us about problems" or "this
> often complains about problems that are not real" or "this has really
> helped us out on a number of occasions and we have had no problems
> with it" would be really helpful.

The concern about the LSN is certainly a valid one, and we ended up
ripping that check out of pgbackrest in favor of taking a different
approach: simply re-read the page and see if it changed.  If it
changed, then we punt and figure that it was a hot page that PG was
actively writing to, so it'll be in the WAL and we don't have to worry
about it.  We're generally concerned more about latent on-disk
corruption that is missed than about some kind of in-memory
corruption, and the page changing in the filesystem cache without some
other process writing to it just seems awfully unlikely.

https://github.com/pgbackrest/pgbackrest/commit/9eec98c61302121134d2067326dbd2cd0f2f0b9c
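
In rough outline, the idea is just this (a sketch, not pgbackrest's
actual implementation; read_page() and page_checksum_ok() are
stand-ins):

    char    buf1[BLCKSZ];
    char    buf2[BLCKSZ];

    read_page(fd, blkno, buf1);
    if (!page_checksum_ok(buf1, blkno))
    {
        /* Re-read and compare. */
        read_page(fd, blkno, buf2);
        if (memcmp(buf1, buf2, BLCKSZ) != 0)
        {
            /*
             * The page changed between reads: assume PG is actively
             * writing it, so WAL replay will take care of it.
             */
        }
        else if (!page_checksum_ok(buf2, blkno))
            report_invalid_page(blkno);     /* stable and still bad */
    }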

From a practical perspective, while this has the aforementioned risk
of the loop running twice fast enough that it re-reads the same page
without the I/O on the page having progressed at all, I'm not aware of
even one report of that actually happening.  We absolutely have seen
cases where the first read picks up a torn page, and that's not even
uncommon in busy environments.

One of the ideas I've had around how to address all of this is to not
depend on inferring things from what's been written out, but instead
to just ask PG.  That's one of the (relatively few...) benefits that I
see to an archive_library: the ability to check if a given page is in
shared buffers and, if so, what its LSN is, to see if it's past the
start of the backup.  Given that pg_basebackup is already working with
a lot of server-side code, perhaps this could be a direction to go in
for it.
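
Something shaped like this, hypothetically (no such API exists today;
LookupPageLSNInSharedBuffers() and verify_page_from_disk() are made-up
names):

    XLogRecPtr  lsn;

    if (LookupPageLSNInSharedBuffers(rlocator, forknum, blkno, &lsn) &&
        lsn >= backup_start_lsn)
    {
        /* Hot page: WAL replay covers it, skip disk verification. */
    }
    else
        verify_page_from_disk(path, blkno);     /* also hypothetical */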

Another idea we played around with was keeping track of the LSNs of
the pages with invalid checksums and checking that they fall within
the backup's start LSN to end LSN range; while that wouldn't
completely prevent corrupted LSNs from escaping detection under the
LSN approach, it'd sure make it a whole lot less likely.
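
That is, roughly (a hypothetical sketch, not pgbackrest code):

    /*
     * For a page whose checksum failed, an LSN outside the backup's
     * LSN window can't be explained by writes made during the backup,
     * so the LSN itself is suspect.
     */
    if (page_lsn < backup_start_lsn || page_lsn > backup_stop_lsn)
        report_suspect_page(blkno);
    else
    {
        /* Plausibly written during the backup; WAL will fix it up. */
    }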

Lastly, I've also wondered about trying to pull (clean) pages from
shared_buffers directly instead of reading them off of disk at all,
perhaps using a ring buffer in shared_buffers to read them in if
they're not already there, and letting all the usual checksum
validation and such happen with PG handling it.  Dirty pages would
have to be handled though, presumably by locking them and then reading
the page from disk, as I don't think folks would be pleased if we
decided to forcibly write them out.  For an incremental backup, if you
have reliable knowledge that the page is in the WAL from PG, perhaps
you could even skip the page entirely.

Independently of verifying PG checksums, we've also considered an
approach where we take the file-level checksums we have in the
pgbackrest manifest and check if the file has changed on the
filesystem without the timestamp changing, as that would certainly be
indicative of something having gone wrong.
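
The shape of that check would be something like this (hypothetical;
checksum_matches() and friends are stand-ins):

    struct stat st;

    /*
     * If the content checksum no longer matches the manifest but the
     * mtime is unchanged, something modified the file behind
     * everyone's back, which smells like corruption.
     */
    if (stat(path, &st) == 0 &&
        st.st_mtime == manifest_mtime &&
        !checksum_matches(path, manifest_checksum))
        report_corrupt_file(path);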

Certainly interested in other ideas about how to improve things in this
area.  I don't think we should rip it out though.

Thanks,

Stephen
