Re: Infinite loop in XLogPageRead() on standby - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Infinite loop in XLogPageRead() on standby
Date
Msg-id 20240301.131637.563635299005880728.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: Infinite loop in XLogPageRead() on standby  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Infinite loop in XLogPageRead() on standby
List pgsql-hackers
At Fri, 01 Mar 2024 12:37:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Anyway, our current policy here is to avoid record-rereads beyond
> source switches. However, fixing this seems to require that source
> switches cause record rereads unless some additional information is
> available to know if replay LSN needs to back up.

It seems to me that the error messages are related to commit 0668719801.

XLogPageRead:
> * Check the page header immediately, so that we can retry immediately if
> * it's not valid. This may seem unnecessary, because ReadPageInternal()
> * validates the page header anyway, and would propagate the failure up to
> * ReadRecord(), which would retry. However, there's a corner case with
> * continuation records, if a record is split across two pages such that
> * we would need to read the two pages from different sources. For
...
>    if (StandbyMode &&
>        !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
>    {
>        /*
>         * Emit this error right now then retry this page immediately. Use
>         * errmsg_internal() because the message was already translated.
>         */
>        if (xlogreader->errormsg_buf[0])
>            ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
>                    (errmsg_internal("%s", xlogreader->errormsg_buf)));

This code intends to prevent a page header error from causing a record
reread, when a record is required to be read from multiple sources. We
could restrict this to only fire at segment boundaries. At segment
boundaries, we won't let LSN back up by using XLP_FIRST_IS_CONTRECORD.

Having thought up to this point, I now believe that we should
completely prevent LSN from going back in any case. One drawback is
that the fix cannot be back-patched.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Next
From: Peter Smith
Date:
Subject: Re: Synchronizing slots from primary to standby