On Mon, Oct 23, 2023 at 8:43 PM Andres Freund <andres@anarazel.de> wrote:
> The source of the emode=13=DEBUG2 is that that's hardcoded in
> WaitForWALToBecomeAvailable(). I guess the error ought to come from
> XLogPageRead(), but all that happens is this:
>
> case XLREAD_FAIL:
> if (readFile >= 0)
> close(readFile);
> readFile = -1;
> readLen = 0;
> readSource = XLOG_FROM_ANY;
> return XLREAD_FAIL;
I've been under the impression that the guiding principle here is that
we shouldn't error out upon hitting a condition that should only cause
us to switch sources. I think WaitForWALToBecomeAvailable() is
supposed to set things up so that XLogPageRead()'s call to pg_pread()
will succeed. If it says it can't, then XLogPageRead() is only obliged
to pass that information up to the caller, who can decide to wait
longer for the data to show up, or give up, or whatever it wants to
do. On the other hand, if WaitForWALToBecomeAvailable() says that it's
fine to go ahead and call pg_pread() and pg_pread() then fails, then
that means that we've got a problem with the WAL file other than it
just not being available yet, like it's the wrong length or there was
an I/O error, and those are reportable errors. Said differently, in
the former case, the WAL is not broken, merely not currently
available; in the latter case, it's broken.
The legibility and maintainability of this code are certainly not
great. The xlogreader mechanism is extremely useful, but maybe we
should have done more cleanup of the underlying mechanism first. It's
a giant ball of spaghetti code that is challenging to understand and
almost impossible to validate thoroughly (as you just discovered).
--
Robert Haas
EDB: http://www.enterprisedb.com