Re: logical decoding and replication of sequences - Mailing list pgsql-hackers
From | Kyotaro Horiguchi |
---|---|
Subject | Re: logical decoding and replication of sequences |
Date | |
Msg-id | 20220808.173322.1349884302599281820.horikyota.ntt@gmail.com Whole thread Raw |
In response to | Re: logical decoding and replication of sequences (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: logical decoding and replication of sequences
|
List | pgsql-hackers |
At Mon, 8 Aug 2022 18:15:46 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in > On Mon, Aug 8, 2022 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > Thanks for the repro patch and bisection work. Looking... > > I don't have the complete explanation yet, but it's something like > this. We hit the following branch in xlogrecovery.c... > > if (StandbyMode && > !XLogReaderValidatePageHeader(xlogreader, > targetPagePtr, readBuf)) > { > /* > * Emit this error right now then retry this page > immediately. Use > * errmsg_internal() because the message was already translated. > */ > if (xlogreader->errormsg_buf[0]) > ereport(emode_for_corrupt_record(emode, > xlogreader->EndRecPtr), > (errmsg_internal("%s", > xlogreader->errormsg_buf))); > > /* reset any error XLogReaderValidatePageHeader() > might have set */ > xlogreader->errormsg_buf[0] = '\0'; > goto next_record_is_invalid; > } > > ... but, even though there was a (suppressed) error, nothing > invalidates the reader's page buffer. Normally, > XLogReadValidatePageHeader() failure or any other kind of error > encountered by xlogreader.c'd decoding logic would do that, but here > the read_page callback is directly calling the header validation. > Without prefetching, that doesn't seem to matter, but reading ahead > can cause us to have the problem page in our buffer at the wrong time, > and then not re-read it when we should. Or something like that. > > The attached patch that simply moves the cache invalidation into > report_invalid_record(), so that it's reached by the above code and > everything else that reports an error, seems to fix the problem in > src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch > applied, and passes check-world without it. I need to look at this > some more, though, and figure out if it's the right fix. If WaitForWALToBecomeAvailable returned by promotion, ReadPageInteral misses the chance to inavlidate reader-state. That state is not an error while in StandbyMode. In the repro case, XLogPageRead returns XLREAD_WOULDBLOCK after the first failure. This situation (of course) was not considered when that code was introduced. If that function is going to return with XLREAD_WOULDBLOCK while lastSourceFailed, it should be turned into XLREAD_FAIL. So, the following also works. diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 21088e78f6..9f242fe656 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -3220,7 +3220,9 @@ retry: xlogreader->nonblocking)) { case XLREAD_WOULDBLOCK: - return XLREAD_WOULDBLOCK; + if (!lastSourceFailed) + return XLREAD_WOULDBLOCK; + /* Fall through. */ case XLREAD_FAIL: if (readFile >= 0) close(readFile); regards. -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-hackers by date: