Re: a very minor bug and a couple of comment changes for basebackup.c - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: a very minor bug and a couple of comment changes for basebackup.c
Date
Msg-id ZABXZ/mfr40VLK7w@paquier.xyz
Whole thread Raw
In response to a very minor bug and a couple of comment changes for basebackup.c  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: a very minor bug and a couple of comment changes for basebackup.c
List pgsql-hackers
On Thu, Feb 02, 2023 at 03:23:51PM -0500, Robert Haas wrote:
> 0001 fixes what I believe to be a slight logical error in sendFile(),
> introduced by me during the v15 development cycle when I introduced
> the bbsink abstraction. I believe that it is theoretically possible
> for this to cause an assertion failure, although the chances of that
> actually happening seem extremely remote in practice. I don't believe
> there are any consequences worse than that; for instance, I don't
> think this can result in your backup getting corrupted. See the
> proposed commit message for full details. Because I believe that this
> is formally a bug, I am inclined to think that this should be
> back-patched, but I also think it's fairly likely that no one would
> ever notice if we didn't. However, patch authors have been known to be
> wrong about the consequences of their own bugs from time to time, so
> please do let me know if this seems more serious to you than what I'm
> indicating, or conversely if you think it's not a problem at all for
> some reason.

Seems right, I think that you should backpatch that as
VERIFY_CHECKSUMS is the default.

> 0002 removes an old comment from the file that I find useless and
> slightly misleading.

Okay.

> 0003 rewrites a comment about the way that we verify checksums during
> backups. If we get a checksum mismatch, we reread the block and see if
> the perceived problem goes away. If it doesn't, then we report it.
> This is intended as protection against the backup reading a block
> while some other process is in the midst of writing it, but there's no
> guarantee that any concurrent write will finish quickly enough for our
> second read attempt to see the updated contents.

There is more to it: the page LSN is checked before its checksum.
Hence, if the page's LSN is corrupted in a way that it is higher than
sink->bbs_state->startptr, the checksum verification is just skipped
while the page is broken but not reported as such.  (Spoiler: this has
been mentioned in the past, and maybe we'd better remove this stuff in
its entirety.)

> The comment claims
> otherwise, and that's false, and I'm getting tired of reading this
> false claim every time I read this code, so I rewrote the comment to
> say what I believe to be true, namely, that our algorithm is flaky and
> we have no good way to fix that right now. I'm pretty sure that Andres
> pointed this problem out when this feature was under discussion, but
> somehow it's still like this. There's another nearby comment which is
> also false or at least misleading for basically the same reasons which
> probably should be rewritten too, but I was a bit less certain how to
> rewrite it and it wasn't making me as annoyed as this one, so for now
> I only rewrote the one.

Indeed, that's an improvement ;)
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: "Drouvot, Bertrand"
Date:
Subject: Re: Generate pg_stat_get_xact*() functions with Macros
Next
From: "shiy.fnst@fujitsu.com"
Date:
Subject: RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher