On Thu, Feb 02, 2023 at 03:23:51PM -0500, Robert Haas wrote:
> 0001 fixes what I believe to be a slight logical error in sendFile(),
> introduced by me during the v15 development cycle when I introduced
> the bbsink abstraction. I believe that it is theoretically possible
> for this to cause an assertion failure, although the chances of that
> actually happening seem extremely remote in practice. I don't believe
> there are any consequences worse than that; for instance, I don't
> think this can result in your backup getting corrupted. See the
> proposed commit message for full details. Because I believe that this
> is formally a bug, I am inclined to think that this should be
> back-patched, but I also think it's fairly likely that no one would
> ever notice if we didn't. However, patch authors have been known to be
> wrong about the consequences of their own bugs from time to time, so
> please do let me know if this seems more serious to you than what I'm
> indicating, or conversely if you think it's not a problem at all for
> some reason.
Seems right, I think that you should backpatch that as
VERIFY_CHECKSUMS is the default.
> 0002 removes an old comment from the file that I find useless and
> slightly misleading.
Okay.
> 0003 rewrites a comment about the way that we verify checksums during
> backups. If we get a checksum mismatch, we reread the block and see if
> the perceived problem goes away. If it doesn't, then we report it.
> This is intended as protection against the backup reading a block
> while some other process is in the midst of writing it, but there's no
> guarantee that any concurrent write will finish quickly enough for our
> second read attempt to see the updated contents.
There is more to it: the page LSN is checked before its checksum.
Hence, if the page's LSN is corrupted in a way that it is higher than
sink->bbs_state->startptr, the checksum verification is just skipped
while the page is broken but not reported as such. (Spoiler: this has
been mentioned in the past, and maybe we'd better remove this stuff in
its entirety.)
> The comment claims
> otherwise, and that's false, and I'm getting tired of reading this
> false claim every time I read this code, so I rewrote the comment to
> say what I believe to be true, namely, that our algorithm is flaky and
> we have no good way to fix that right now. I'm pretty sure that Andres
> pointed this problem out when this feature was under discussion, but
> somehow it's still like this. There's another nearby comment which is
> also false or at least misleading for basically the same reasons which
> probably should be rewritten too, but I was a bit less certain how to
> rewrite it and it wasn't making me as annoyed as this one, so for now
> I only rewrote the one.
Indeed, that's an improvement ;)
--
Michael