Re: logical decoding and replication of sequences - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: logical decoding and replication of sequences
Date
Msg-id 20220808.173322.1349884302599281820.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: logical decoding and replication of sequences  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: logical decoding and replication of sequences
List pgsql-hackers
At Mon, 8 Aug 2022 18:15:46 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> On Mon, Aug 8, 2022 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Thanks for the repro patch and bisection work.  Looking...
> 
> I don't have the complete explanation yet, but it's something like
> this.  We hit the following branch in xlogrecovery.c...
> 
>         if (StandbyMode &&
>                 !XLogReaderValidatePageHeader(xlogreader,
> targetPagePtr, readBuf))
>         {
>                 /*
>                  * Emit this error right now then retry this page
> immediately. Use
>                  * errmsg_internal() because the message was already translated.
>                  */
>                 if (xlogreader->errormsg_buf[0])
>                         ereport(emode_for_corrupt_record(emode,
> xlogreader->EndRecPtr),
>                                         (errmsg_internal("%s",
> xlogreader->errormsg_buf)));
> 
>                 /* reset any error XLogReaderValidatePageHeader()
> might have set */
>                 xlogreader->errormsg_buf[0] = '\0';
>                 goto next_record_is_invalid;
>         }
> 
> ... but, even though there was a (suppressed) error, nothing
> invalidates the reader's page buffer.  Normally,
> XLogReadValidatePageHeader() failure or any other kind of error
> encountered by xlogreader.c'd decoding logic would do that, but here
> the read_page callback is directly calling the header validation.
> Without prefetching, that doesn't seem to matter, but reading ahead
> can cause us to have the problem page in our buffer at the wrong time,
> and then not re-read it when we should.  Or something like that.
> 
> The attached patch that simply moves the cache invalidation into
> report_invalid_record(), so that it's reached by the above code and
> everything else that reports an error, seems to fix the problem in
> src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch
> applied, and passes check-world without it.  I need to look at this
> some more, though, and figure out if it's the right fix.

If WaitForWALToBecomeAvailable returned by promotion, ReadPageInteral
misses the chance to inavlidate reader-state.  That state is not an
error while in StandbyMode.

In the repro case, XLogPageRead returns XLREAD_WOULDBLOCK after the
first failure.  This situation (of course) was not considered when
that code was introduced. If that function is going to return with
XLREAD_WOULDBLOCK while lastSourceFailed, it should be turned into
XLREAD_FAIL. So, the following also works.

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21088e78f6..9f242fe656 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3220,7 +3220,9 @@ retry:
                                             xlogreader->nonblocking))
         {
             case XLREAD_WOULDBLOCK:
-                return XLREAD_WOULDBLOCK;
+                if (!lastSourceFailed)
+                    return XLREAD_WOULDBLOCK;
+                /* Fall through.  */
             case XLREAD_FAIL:
                 if (readFile >= 0)
                     close(readFile);

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: r.zharkov@postgrespro.ru
Date:
Subject: Re: Checking pgwin32_is_junction() errors
Next
From: Peter Smith
Date:
Subject: Re: Column Filtering in Logical Replication