Thread: corruption of WAL page header is never reported

corruption of WAL page header is never reported

From
Yugo NAGATA
Date:
Hello,

I found that any corruption of WAL page header found during recovery is never
reported in log messages. If wal page header is broken, it is detected in
XLogReaderValidatePageHeader called from  XLogPageRead, but the error messages
are always reset and never reported.

        if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
        {
               /* reset any error XLogReaderValidatePageHeader() might have set */
               xlogreader->errormsg_buf[0] = '\0';
               goto next_record_is_invalid;
        }

Since the commit 06687198018, we call XLogReaderValidatePageHeader here so that
we can check a page header and retry immediately if it's invalid, but the error
message is reset immediately and not reported. I guess the reason why the error
message is reset is because we might get the right WAL after some retries.
However, I think it is better to report the error for each check in order to let
users know the actual issues founded in the WAL.

I attached a patch to fix it in this way.

Or, if we wouldn't like to report an error for each check and also what we want
to check here is just about old recycled WAL instead of header corruption itself, 
I wander that we could check just xlp_pageaddr instead of calling
XLogReaderValidatePageHeader.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Attachment

Re: corruption of WAL page header is never reported

From
Ranier Vilela
Date:
Em sáb., 17 de jul. de 2021 às 16:57, Yugo NAGATA <nagata@sraoss.co.jp> escreveu:
Hello,

I found that any corruption of WAL page header found during recovery is never
reported in log messages. If wal page header is broken, it is detected in
XLogReaderValidatePageHeader called from  XLogPageRead, but the error messages
are always reset and never reported.

        if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
        {
               /* reset any error XLogReaderValidatePageHeader() might have set */
               xlogreader->errormsg_buf[0] = '\0';
               goto next_record_is_invalid;
        }

Since the commit 06687198018, we call XLogReaderValidatePageHeader here so that
we can check a page header and retry immediately if it's invalid, but the error
message is reset immediately and not reported. I guess the reason why the error
message is reset is because we might get the right WAL after some retries.
However, I think it is better to report the error for each check in order to let
users know the actual issues founded in the WAL.

I attached a patch to fix it in this way.
I think to keep the same behavior as before, is necessary always run:

/* reset any error XLogReaderValidatePageHeader() might have set */
xlogreader->errormsg_buf[0] = '\0';

not?

regards,
Ranier Vilela
Attachment

Re: corruption of WAL page header is never reported

From
Yugo NAGATA
Date:
On Sat, 17 Jul 2021 18:40:02 -0300
Ranier Vilela <ranier.vf@gmail.com> wrote:

> Em sáb., 17 de jul. de 2021 às 16:57, Yugo NAGATA <nagata@sraoss.co.jp>
> escreveu:
> 
> > Hello,
> >
> > I found that any corruption of WAL page header found during recovery is
> > never
> > reported in log messages. If wal page header is broken, it is detected in
> > XLogReaderValidatePageHeader called from  XLogPageRead, but the error
> > messages
> > are always reset and never reported.
> >
> >         if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
> > readBuf))
> >         {
> >                /* reset any error XLogReaderValidatePageHeader() might
> > have set */
> >                xlogreader->errormsg_buf[0] = '\0';
> >                goto next_record_is_invalid;
> >         }
> >
> > Since the commit 06687198018, we call XLogReaderValidatePageHeader here so
> > that
> > we can check a page header and retry immediately if it's invalid, but the
> > error
> > message is reset immediately and not reported. I guess the reason why the
> > error
> > message is reset is because we might get the right WAL after some retries.
> > However, I think it is better to report the error for each check in order
> > to let
> > users know the actual issues founded in the WAL.
> >
> > I attached a patch to fix it in this way.
> >
> I think to keep the same behavior as before, is necessary always run:
> 
> /* reset any error XLogReaderValidatePageHeader() might have set */
> xlogreader->errormsg_buf[0] = '\0';
> 
> not?

If we are not in StandbyMode, the check is not retried, and an error is returned
immediately. So, I think ,we don't have to display an error message in such cases,
and neither reset it. Instead, it would be better to leave the error message
handling to the caller of XLogReadRecord.

Regards,
Yugo Nagat

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
Hello.

At Sun, 18 Jul 2021 04:55:05 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in 
> Hello,
> 
> I found that any corruption of WAL page header found during recovery is never
> reported in log messages. If wal page header is broken, it is detected in
> XLogReaderValidatePageHeader called from  XLogPageRead, but the error messages
> are always reset and never reported.

Good catch!  Currently recovery stops showing no reason if it is
stopped by page-header errors.

> I attached a patch to fix it in this way.

However, it is a kind of a roof-over-a-roof.  What we should do is
just omitting the check in XLogPageRead while in standby mode.

> Or, if we wouldn't like to report an error for each check and also what we want
> to check here is just about old recycled WAL instead of header corruption itself, 
> I wander that we could check just xlp_pageaddr instead of calling
> XLogReaderValidatePageHeader.

I'm not sure. But as described in the commit message, the commit
intended to save a common case and there's no obvious reason to (and
not to) restrict the check only to page address. So it uses the
established checking function.

I was tempted to adjust the comment just above by adding "while in
standby mode", but "so that we can retry immediately" is suggesting
that so I didn't do that in the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 30033d810bcc784da55600792484603e1c46b3d7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 19 Jul 2021 14:49:34 +0900
Subject: [PATCH v1] Don't forget message of hage-header errors while not in
 standby mode

The commit 0668719801 intended to omit page-header errors only while
in standby mode but actually it is always forgotten.  As the result
the message of the end of a crash recovery lacks the reason for the
stop. Fix that by doing the additional check only while in standby
mode.
---
 src/backend/access/transam/xlog.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ee9515139..79513fb8b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12317,7 +12317,8 @@ retry:
      * Validating the page header is cheap enough that doing it twice
      * shouldn't be a big deal from a performance point of view.
      */
-    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+    if (StandbyMode &&
+        !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
     {
         /* reset any error XLogReaderValidatePageHeader() might have set */
         xlogreader->errormsg_buf[0] = '\0';
-- 
2.27.0


Re: corruption of WAL page header is never reported

From
Yugo NAGATA
Date:
On Mon, 19 Jul 2021 15:14:41 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> Hello.
> 
> At Sun, 18 Jul 2021 04:55:05 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in 
> > Hello,
> > 
> > I found that any corruption of WAL page header found during recovery is never
> > reported in log messages. If wal page header is broken, it is detected in
> > XLogReaderValidatePageHeader called from  XLogPageRead, but the error messages
> > are always reset and never reported.
> 
> Good catch!  Currently recovery stops showing no reason if it is
> stopped by page-header errors.
> 
> > I attached a patch to fix it in this way.
> 
> However, it is a kind of a roof-over-a-roof.  What we should do is
> just omitting the check in XLogPageRead while in standby mode.

Your patch doesn't fix the issue that the error message is never reported in
standby mode. When a WAL page header is broken, the standby would silently repeat
retrying forever.

I think we have to let users know the corruption of WAL page header even in
standby mode, not? A corruption of WAL record header is always reported,
by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)


Regards,
Yugo Nagata


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in 
> Your patch doesn't fix the issue that the error message is never reported in
> standby mode. When a WAL page header is broken, the standby would silently repeat
> retrying forever.

Ok, I see your point and agree to that.

> I think we have to let users know the corruption of WAL page header even in
> standby mode, not? A corruption of WAL record header is always reported,
> by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)

Howeer, I'm still on the opinion that we don't need to check that
while in standby mode.

How about the attached?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From fe23fbf51569eb7a305bd9e619e918aec0b87bf7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 19 Jul 2021 14:49:34 +0900
Subject: [PATCH] Don't hide xlog page-header errors while recovery

The commit 0668719801 intended to handle the case of standby mode but
it inadvertently catches also while not in standby mode.  Change the
condition to catch only while in non-standby mode.

Addition to that, it is reasonable to show the error message even in
standby mode. Show the error message if any.

Author: Yugo NAGATA <nagata@sraoss.co.jp>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/xlog.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ee9515139..221e7d1f07 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12317,8 +12317,17 @@ retry:
      * Validating the page header is cheap enough that doing it twice
      * shouldn't be a big deal from a performance point of view.
      */
-    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+    if (StandbyMode &&
+        !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
     {
+        /*
+         * in this case we consume this error right now then retry immediately,
+         * the message is already translated
+         */
+        if (xlogreader->errormsg_buf[0])
+            ereport(emode_for_corrupt_record(emode, EndRecPtr),
+                    (errmsg_internal("%s", xlogreader->errormsg_buf)));
+
         /* reset any error XLogReaderValidatePageHeader() might have set */
         xlogreader->errormsg_buf[0] = '\0';
         goto next_record_is_invalid;
-- 
2.27.0


Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
me> Howeer, I'm still on the opinion that we don't need to check that
me> while in standby mode.

Of course it's typo of  "while not in standby mode".

sorry..

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: corruption of WAL page header is never reported

From
Yugo NAGATA
Date:
On Mon, 19 Jul 2021 17:47:07 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in 
> > Your patch doesn't fix the issue that the error message is never reported in
> > standby mode. When a WAL page header is broken, the standby would silently repeat
> > retrying forever.
> 
> Ok, I see your point and agree to that.
> 
> > I think we have to let users know the corruption of WAL page header even in
> > standby mode, not? A corruption of WAL record header is always reported,
> > by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)
> 
> Howeer, I'm still on the opinion that we don't need to check that
> while in standby mode.
> 
> How about the attached?

On Mon, 19 Jul 2021 17:50:16 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> me> Howeer, I'm still on the opinion that we don't need to check that
> me> while in standby mode.
> 
> Of course it's typo of  "while not in standby mode".

Thanks for updating the patch. I agree with you.

I think it is nice to fix to perform the check only during standby mode
because it make a bit clearer why we check it immediately in XLogPageRead.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: corruption of WAL page header is never reported

From
Ranier Vilela
Date:
Em seg., 19 de jul. de 2021 às 06:15, Yugo NAGATA <nagata@sraoss.co.jp> escreveu:
On Mon, 19 Jul 2021 17:47:07 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in
> > Your patch doesn't fix the issue that the error message is never reported in
> > standby mode. When a WAL page header is broken, the standby would silently repeat
> > retrying forever.
>
> Ok, I see your point and agree to that.
>
> > I think we have to let users know the corruption of WAL page header even in
> > standby mode, not? A corruption of WAL record header is always reported,
> > by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)
>
> Howeer, I'm still on the opinion that we don't need to check that
> while in standby mode.
>
> How about the attached?

On Mon, 19 Jul 2021 17:50:16 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> me> Howeer, I'm still on the opinion that we don't need to check that
> me> while in standby mode.
>
> Of course it's typo of  "while not in standby mode".

Thanks for updating the patch. I agree with you.

I think it is nice to fix to perform the check only during standby mode
because it make a bit clearer why we check it immediately in XLogPageRead.
And as I had reviewed, your first patch was wrong and now with the Kyotaro version,
to keep the same behavior, it is necessary to reset the error, correct?

regards,
Ranier Vilela

Re: corruption of WAL page header is never reported

From
Yugo NAGATA
Date:
On Mon, 19 Jul 2021 06:32:28 -0300
Ranier Vilela <ranier.vf@gmail.com> wrote:

> Em seg., 19 de jul. de 2021 às 06:15, Yugo NAGATA <nagata@sraoss.co.jp>
> escreveu:
> 
> > On Mon, 19 Jul 2021 17:47:07 +0900 (JST)
> > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > > At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp>
> > wrote in
> > > > Your patch doesn't fix the issue that the error message is never
> > reported in
> > > > standby mode. When a WAL page header is broken, the standby would
> > silently repeat
> > > > retrying forever.
> > >
> > > Ok, I see your point and agree to that.
> > >
> > > > I think we have to let users know the corruption of WAL page header
> > even in
> > > > standby mode, not? A corruption of WAL record header is always
> > reported,
> > > > by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)
> > >
> > > Howeer, I'm still on the opinion that we don't need to check that
> > > while in standby mode.
> > >
> > > How about the attached?
> >
> > On Mon, 19 Jul 2021 17:50:16 +0900 (JST)
> > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > > me> Howeer, I'm still on the opinion that we don't need to check that
> > > me> while in standby mode.
> > >
> > > Of course it's typo of  "while not in standby mode".
> >
> > Thanks for updating the patch. I agree with you.
> >
> > I think it is nice to fix to perform the check only during standby mode
> > because it make a bit clearer why we check it immediately in XLogPageRead.
> >
> And as I had reviewed, your first patch was wrong and now with the Kyotaro
> version,
> to keep the same behavior, it is necessary to reset the error, correct?

Well, I think my first patch was not wrong. The difference with the latest
patch is just whether we perform the additional check when we are not in
standby mode or not. The behavior is basically the same although which function
detects and prints the page-header error in cases of crash recovery is different.
On the other hand, in your patch, the error message was always ommitted in cases
of crash recovery, and it seemed to me wrong.

Regards,
Yugo Nagata


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: corruption of WAL page header is never reported

From
Fujii Masao
Date:

On 2021/07/19 18:52, Yugo NAGATA wrote:
> Well, I think my first patch was not wrong. The difference with the latest
> patch is just whether we perform the additional check when we are not in
> standby mode or not. The behavior is basically the same although which function
> detects and prints the page-header error in cases of crash recovery is different.

Yes, so which patch do you think is better? I like your version
because there seems no reason why XLogPageRead() should skip
XLogReaderValidatePageHeader() when not in standby mode.

Also I'm tempted to move ereport() and reset of errmsg_buf to
under next_record_is_invalid as follows. That is, in standby mode
whenever we find an invalid record and retry reading WAL page
in XLogPageRead(), we report the error message and reset it.
For now in XLogPageRead(), only XLogReaderValidatePageHeader()
sets errmsg_buf, but in the future other code or function doing that
may be added. For that case, the following change seems more elegant.
Thought?

     * shouldn't be a big deal from a performance point of view.
       */
      if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
-    {
-        /* reset any error XLogReaderValidatePageHeader() might have set */
-        xlogreader->errormsg_buf[0] = '\0';
          goto next_record_is_invalid;
-    }
  
      return readLen;
  
@@ -12517,7 +12513,17 @@ next_record_is_invalid:
  
      /* In standby-mode, keep trying */
      if (StandbyMode)
+    {
+        if (xlogreader->errormsg_buf[0])
+        {
+            ereport(emode_for_corrupt_record(emode, EndRecPtr),
+                    (errmsg_internal("%s", xlogreader->errormsg_buf)));
+
+            /* reset any error XLogReaderValidatePageHeader() might have set */
+            xlogreader->errormsg_buf[0] = '\0';
+        }
          goto retry;
+    }
      else
          return -1;
  }

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
At Thu, 2 Sep 2021 12:19:25 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/07/19 18:52, Yugo NAGATA wrote:
> > Well, I think my first patch was not wrong. The difference with the
> > latest
> > patch is just whether we perform the additional check when we are not
> > in
> > standby mode or not. The behavior is basically the same although which
> > function
> > detects and prints the page-header error in cases of crash recovery is
> > different.
> 
> Yes, so which patch do you think is better? I like your version
> because there seems no reason why XLogPageRead() should skip
> XLogReaderValidatePageHeader() when not in standby mode.

Did you read the comment just above?

xlog.c:12523
>     * Check the page header immediately, so that we can retry immediately if
>     * it's not valid. This may seem unnecessary, because XLogReadRecord()
>     * validates the page header anyway, and would propagate the failure up to
>     * ReadRecord(), which would retry. However, there's a corner case with
>     * continuation records, if a record is split across two pages such that

So when not in standby mode, the same check is performed by xlogreader
which has the responsibility to validate the binary data read by
XLogPageRead. The page-header validation is a compromise to save a
specific case.

> Also I'm tempted to move ereport() and reset of errmsg_buf to
> under next_record_is_invalid as follows. That is, in standby mode
> whenever we find an invalid record and retry reading WAL page
> in XLogPageRead(), we report the error message and reset it.
> For now in XLogPageRead(), only XLogReaderValidatePageHeader()
> sets errmsg_buf, but in the future other code or function doing that
> may be added. For that case, the following change seems more elegant.
> Thought?

I don't think it is good choice to conflate read-failure and header
validation failure from the view of modularity.

>      * shouldn't be a big deal from a performance point of view.
>       */
>      if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
> -    {
> -        /* reset any error XLogReaderValidatePageHeader() might have set */
> -        xlogreader->errormsg_buf[0] = '\0';
>          goto next_record_is_invalid;
> -    }
>        return readLen;
>  @@ -12517,7 +12513,17 @@ next_record_is_invalid:
>        /* In standby-mode, keep trying */
>      if (StandbyMode)
> +    {
> +        if (xlogreader->errormsg_buf[0])
> +        {
> +            ereport(emode_for_corrupt_record(emode, EndRecPtr),
> + (errmsg_internal("%s", xlogreader->errormsg_buf)));
> +
> + /* reset any error XLogReaderValidatePageHeader() might have set */
> +            xlogreader->errormsg_buf[0] = '\0';
> +        }
>          goto retry;
> +    }
>      else
>          return -1;
>  }

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
Sorry, please let me add something.

At Thu, 02 Sep 2021 13:17:16 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 2 Sep 2021 12:19:25 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> > Also I'm tempted to move ereport() and reset of errmsg_buf to
> > under next_record_is_invalid as follows. That is, in standby mode
> > whenever we find an invalid record and retry reading WAL page
> > in XLogPageRead(), we report the error message and reset it.
> > For now in XLogPageRead(), only XLogReaderValidatePageHeader()
> > sets errmsg_buf, but in the future other code or function doing that
> > may be added. For that case, the following change seems more elegant.
> > Thought?
> 
> I don't think it is good choice to conflate read-failure and header
> validation failure from the view of modularity.

In other words, XLogReaderValidatePageHeader is foreign for
XLogPageRead and the function indeuces the need of extra care for
errormsg_buf that is not relevant to the elog-capable module.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
(.. sorry for the chained-mails)

At Thu, 02 Sep 2021 13:31:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Sorry, please let me add something.
> 
> At Thu, 02 Sep 2021 13:17:16 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > At Thu, 2 Sep 2021 12:19:25 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> > > Also I'm tempted to move ereport() and reset of errmsg_buf to
> > > under next_record_is_invalid as follows. That is, in standby mode
> > > whenever we find an invalid record and retry reading WAL page
> > > in XLogPageRead(), we report the error message and reset it.
> > > For now in XLogPageRead(), only XLogReaderValidatePageHeader()
> > > sets errmsg_buf, but in the future other code or function doing that
> > > may be added. For that case, the following change seems more elegant.
> > > Thought?
> > 
> > I don't think it is good choice to conflate read-failure and header
> > validation failure from the view of modularity.
> 
> In other words, XLogReaderValidatePageHeader is foreign for
> XLogPageRead and the function indeuces the need of extra care for
> errormsg_buf that is not relevant to the elog-capable module.

However, I can agree that the error handling code can be moved further
later.  Like this,

>      * shouldn't be a big deal from a performance point of view.
>      */
-    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
-        /* reset any error XLogReaderValidatePageHeader() might have set */
-        xlogreader->errormsg_buf[0] = '\0';
-        goto next_record_is_invalid;
+   if (... && XLogReaderValidatePageHeader())
+       goto page_header_is_invalid;
...
>   return readlen;
>
+ page_header_is_invalid:
+    /*
+     * in this case we consume this error right now then retry immediately,
+     * the message is already translated
+     */
+    if (xlogreader->errormsg_buf[0])
+        ereport(emode_for_corrupt_record(emode, EndRecPtr),
+                (errmsg_internal("%s", xlogreader->errormsg_buf)));
+
+     /* reset any error XLogReaderValidatePageHeader() might have set */
+     xlogreader->errormsg_buf[0] = '\0';
> 
> next_record_is_invalid:
>     lastSourceFailed = true;

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: corruption of WAL page header is never reported

From
Fujii Masao
Date:

On 2021/09/02 13:17, Kyotaro Horiguchi wrote:
> Did you read the comment just above?

Yes!

> 
> xlog.c:12523
>>      * Check the page header immediately, so that we can retry immediately if
>>      * it's not valid. This may seem unnecessary, because XLogReadRecord()
>>      * validates the page header anyway, and would propagate the failure up to
>>      * ReadRecord(), which would retry. However, there's a corner case with
>>      * continuation records, if a record is split across two pages such that
> 
> So when not in standby mode, the same check is performed by xlogreader
> which has the responsibility to validate the binary data read by
> XLogPageRead. The page-header validation is a compromise to save a
> specific case.

Yes, so XLogPageRead() can skip the validation check of page head if not
in standby mode. On the other hand, there is no problem if it still performs
the validation check as it does for now. No?

> I don't think it is good choice to conflate read-failure and header
> validation failure from the view of modularity.

I don't think that the proposed change does that. But maybe I failed to get
your point yet... Anyway the proposed change just tries to reset
errormsg_buf whenever XLogPageRead() retries, whatever error happened
before. Also if errormsg_buf is set at that moment, it's reported.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
At Thu, 2 Sep 2021 14:44:31 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/09/02 13:17, Kyotaro Horiguchi wrote:
> > Did you read the comment just above?
> 
> Yes!

Glad to hear that, or..:p

> > xlog.c:12523
> >>      * Check the page header immediately, so that we can retry immediately if
> >>      * it's not valid. This may seem unnecessary, because XLogReadRecord()
> >>      * validates the page header anyway, and would propagate the failure up to
> >>      * ReadRecord(), which would retry. However, there's a corner case with
> >>      * continuation records, if a record is split across two pages such that
> > So when not in standby mode, the same check is performed by xlogreader
> > which has the responsibility to validate the binary data read by
> > XLogPageRead. The page-header validation is a compromise to save a
> > specific case.
> 
> Yes, so XLogPageRead() can skip the validation check of page head if
> not
> in standby mode. On the other hand, there is no problem if it still
> performs
> the validation check as it does for now. No?

Practically yes, and it has always been like that as you say.

> > I don't think it is good choice to conflate read-failure and header
> > validation failure from the view of modularity.
> 
> I don't think that the proposed change does that. But maybe I failed

It's about your idea in a recent mail. not about the proposed
patch(es).

> to get
> your point yet... Anyway the proposed change just tries to reset
> errormsg_buf whenever XLogPageRead() retries, whatever error happened
> before. Also if errormsg_buf is set at that moment, it's reported.

I believe errmsg_buf is an interface to emit error messages dedicated
to xlogreader that doesn't have an access to elog facility, and
xlogreader doesn't (or ought not to or expect to) suppose
non-xlogreader callback functions set the variable.  In that sense I
don't think theoriginally proposed patch is proper for the reason that
the non-xlogreader callback function may set errmsg_buf.  This is what
I meant by the word "modularity".

For that reason I avoided in my second proposal to call
XLogReaderValidatePageHeader() at all while not in standby mode,
because calling the validator function while in non-standby mode
results in the non-xlogreader function return errmsg_buf.  Of course
we can instead always consume errmsg_buf in the function but I don't
like to shadow the caller's task.

Does that makes sense?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: corruption of WAL page header is never reported

From
Fujii Masao
Date:

On 2021/09/02 16:26, Kyotaro Horiguchi wrote:
> I believe errmsg_buf is an interface to emit error messages dedicated
> to xlogreader that doesn't have an access to elog facility, and
> xlogreader doesn't (or ought not to or expect to) suppose
> non-xlogreader callback functions set the variable.  In that sense I
> don't think theoriginally proposed patch is proper for the reason that
> the non-xlogreader callback function may set errmsg_buf.  This is what
> I meant by the word "modularity".
> 
> For that reason I avoided in my second proposal to call
> XLogReaderValidatePageHeader() at all while not in standby mode,
> because calling the validator function while in non-standby mode
> results in the non-xlogreader function return errmsg_buf.  Of course
> we can instead always consume errmsg_buf in the function but I don't
> like to shadow the caller's task.

Understood. Thanks for clarifying this!

> Does that makes sense?

Yes, I'm fine with your latest patch.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
At Thu, 2 Sep 2021 21:52:00 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/09/02 16:26, Kyotaro Horiguchi wrote:
> > I believe errmsg_buf is an interface to emit error messages dedicated
> > to xlogreader that doesn't have an access to elog facility, and
> > xlogreader doesn't (or ought not to or expect to) suppose
> > non-xlogreader callback functions set the variable.  In that sense I
> > don't think theoriginally proposed patch is proper for the reason that
> > the non-xlogreader callback function may set errmsg_buf.  This is what
> > I meant by the word "modularity".
> > For that reason I avoided in my second proposal to call
> > XLogReaderValidatePageHeader() at all while not in standby mode,
> > because calling the validator function while in non-standby mode
> > results in the non-xlogreader function return errmsg_buf.  Of course
> > we can instead always consume errmsg_buf in the function but I don't
> > like to shadow the caller's task.
> 
> Understood. Thanks for clarifying this!
> 
> > Does that makes sense?
> 
> Yes, I'm fine with your latest patch.

Thanks. Maybe some additional comment is needed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
At Fri, 03 Sep 2021 16:55:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > Yes, I'm fine with your latest patch.
> 
> Thanks. Maybe some additional comment is needed.

Something like this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03e..b621ad6b0f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12496,9 +12496,21 @@ retry:
      *
      * Validating the page header is cheap enough that doing it twice
      * shouldn't be a big deal from a performance point of view.
+     *
+     * Don't call XLogReaderValidatePageHeader here while not in standby mode
+     * so that this function won't return with a valid errmsg_buf.
      */
-    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+    if (StandbyMode &&
+        !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
     {
+        /*
+         * emit this error right now then retry this page immediately
+         * the message is already translated
+         */
+        if (xlogreader->errormsg_buf[0])
+            ereport(emode_for_corrupt_record(emode, EndRecPtr),
+                    (errmsg_internal("%s", xlogreader->errormsg_buf)));
+
         /* reset any error XLogReaderValidatePageHeader() might have set */
         xlogreader->errormsg_buf[0] = '\0';
         goto next_record_is_invalid;

Re: corruption of WAL page header is never reported

From
Fujii Masao
Date:

On 2021/09/06 3:11, Alvaro Herrera wrote:
> On 2021-Sep-03, Kyotaro Horiguchi wrote:
> 
>> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
>> index 24165ab03e..b621ad6b0f 100644
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -12496,9 +12496,21 @@ retry:
>>        *
>>        * Validating the page header is cheap enough that doing it twice
>>        * shouldn't be a big deal from a performance point of view.
>> +     *
>> +     * Don't call XLogReaderValidatePageHeader here while not in standby mode
>> +     * so that this function won't return with a valid errmsg_buf.
>>        */
>> -    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
>> +    if (StandbyMode &&
>> +        !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
> 
> OK, but I don't understand why we have a comment that says (referring to
> non-standby mode) "doing it twice shouldn't be a big deal", followed by
> "Don't do it twice while not in standby mode" -- that seems quite
> contradictory.  I think the new comment should overwrite the previous
> one, something like this:
> 
> -     * Validating the page header is cheap enough that doing it twice
> -     * shouldn't be a big deal from a performance point of view.
> +     *
> +     * We do this in standby mode only,
> +     * so that this function won't return with a valid errmsg_buf.

Even if we do this while NOT in standby mode, ISTM that this function doesn't
return with a valid errmsg_buf because it's reset. So probably the comment
should be updated as follows?

-------------------------
We don't do this while not in standby mode because we don't need to retry
immediately if the page header is not valid. Instead, XLogReadRecord() is
responsible to check the page header.
-------------------------

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: corruption of WAL page header is never reported

From
Fujii Masao
Date:

On 2021/09/07 2:02, Fujii Masao wrote:
> Even if we do this while NOT in standby mode, ISTM that this function doesn't
> return with a valid errmsg_buf because it's reset. So probably the comment
> should be updated as follows?
> 
> -------------------------
> We don't do this while not in standby mode because we don't need to retry
> immediately if the page header is not valid. Instead, XLogReadRecord() is
> responsible to check the page header.
> -------------------------

I updated the comment as above. Patch attached.

-     * it's not valid. This may seem unnecessary, because XLogReadRecord()
+     * it's not valid. This may seem unnecessary, because ReadPageInternal()
       * validates the page header anyway, and would propagate the failure up to

I also applied this change because ReadPageInternal() not XLogReadRecord()
calls XLogReaderValidatePageHeader().

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
At Fri, 10 Sep 2021 10:38:39 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/09/07 2:02, Fujii Masao wrote:
> > Even if we do this while NOT in standby mode, ISTM that this function
> > doesn't
> > return with a valid errmsg_buf because it's reset. So probably the
> > comment
> > should be updated as follows?
> > -------------------------
> > We don't do this while not in standby mode because we don't need to
> > retry
> > immediately if the page header is not valid. Instead, XLogReadRecord()
> > is
> > responsible to check the page header.
> > -------------------------
> 
> I updated the comment as above. Patch attached.
> 
> -     * it's not valid. This may seem unnecessary, because XLogReadRecord()
> + * it's not valid. This may seem unnecessary, because
> ReadPageInternal()
>       * validates the page header anyway, and would propagate the failure up to
> 
> I also applied this change because ReadPageInternal() not
> XLogReadRecord()
> calls XLogReaderValidatePageHeader().

Yeah, good catch.


+     * Note that we don't do this while not in standby mode because we don't
+     * need to retry immediately if the page header is not valid. Instead,
+     * ReadPageInternal() is responsible for validating the page header.

The point here is "retry this page, not this record", so "we don't need
to retry immediately" looks a bit ambiguous. So how about something
like this?

Note that we don't do this while not in standby mode because we don't
need to avoid retrying this entire record even if the page header is
not valid. Instead, ReadPageInternal() is responsible for validating
the page header in that case.

Everything else looks fine to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: corruption of WAL page header is never reported

From
Fujii Masao
Date:

On 2021/09/13 11:00, Kyotaro Horiguchi wrote:
> The point here is "retry this page, not this record", so "we don't need
> to retry immediately" looks a bit ambiguous. So how about something
> like this?
> 
> Note that we don't do this while not in standby mode because we don't
> need to avoid retrying this entire record even if the page header is
> not valid. Instead, ReadPageInternal() is responsible for validating
> the page header in that case.

You mean that, while not in standby mode, we need to retry reading
the entire record if the page head is invalid? I was thinking that
we basically give up replaying further records in that case becase
we're not in standby mode.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
At Mon, 13 Sep 2021 14:56:11 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/09/13 11:00, Kyotaro Horiguchi wrote:
> > The point here is "retry this page, not this record", so "we don't
> > need
> > to retry immediately" looks a bit ambiguous. So how about something
> > like this?
> > Note that we don't do this while not in standby mode because we don't
> > need to avoid retrying this entire record even if the page header is
> > not valid. Instead, ReadPageInternal() is responsible for validating
> > the page header in that case.
> 
> You mean that, while not in standby mode, we need to retry reading
> the entire record if the page head is invalid? I was thinking that
> we basically give up replaying further records in that case becase
> we're not in standby mode.

I wrote "while not in standby mode, we don't need to avoid retry the
entire record" but that doesn't mean the inversion "while in standby
mode, we need to do avoid that". In the first place retry doesn't
happen while not in standby mode.  I don't come up with a nice
phrasing but something like this works?

Note that we don't do this while not in standby mode because this is
required only to avoid retrying this entire record for an invalid page
header while in standby mode. Instead, ReadPageInternal() is
responsible for validating the page header in that case.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
At Mon, 13 Sep 2021 17:21:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Mon, 13 Sep 2021 14:56:11 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> > 
> > 
> > On 2021/09/13 11:00, Kyotaro Horiguchi wrote:
> > > The point here is "retry this page, not this record", so "we don't
> > > need
> > > to retry immediately" looks a bit ambiguous. So how about something
> > > like this?
> > > Note that we don't do this while not in standby mode because we don't
> > > need to avoid retrying this entire record even if the page header is
> > > not valid. Instead, ReadPageInternal() is responsible for validating
> > > the page header in that case.
> > 
> > You mean that, while not in standby mode, we need to retry reading
> > the entire record if the page head is invalid? I was thinking that
> > we basically give up replaying further records in that case becase
> > we're not in standby mode.

Sorry, my brain can be easily twisted..

> I wrote "while not in standby mode, we don't need to avoid retry the
> entire record" but that doesn't mean the inversion "while in standby
> mode, we need to do avoid that". In the first place retry doesn't
> happen while not in standby mode.  I don't come up with a nice
> phrasing but something like this works?

Mmm. Something's got wrong badly...

I wrote "we don't need to avoid retry the entire record" but that
doesn't mean "we need to retry the entire record".  That simply means
"we don't care if retry happens or not."

In the first place retry doesn't happen while not in standby mode.  I
don't come up with a nice phrasing but something like this works?

> Note that we don't do this while not in standby mode because this is
> required only to avoid retrying this entire record for an invalid page
> header while in standby mode. Instead, ReadPageInternal() is
> responsible for validating the page header in that case.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: corruption of WAL page header is never reported

From
Fujii Masao
Date:

On 2021/09/13 17:21, Kyotaro Horiguchi wrote:
> I wrote "while not in standby mode, we don't need to avoid retry the
> entire record" but that doesn't mean the inversion "while in standby
> mode, we need to do avoid that". In the first place retry doesn't
> happen while not in standby mode.  I don't come up with a nice
> phrasing but something like this works?
> 
> Note that we don't do this while not in standby mode because this is
> required only to avoid retrying this entire record for an invalid page
> header while in standby mode. Instead, ReadPageInternal() is
> responsible for validating the page header in that case.

I think that it's better to comment why "retry" is not necessary
when not in standby mode.

-------------------
When not in standby mode, an invalid page header should cause recovery
to end, not retry reading the page, so we don't need to validate the page
header here for the retry. Instead, ReadPageInternal() is responsible for
the validation.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: corruption of WAL page header is never reported

From
Kyotaro Horiguchi
Date:
At Tue, 5 Oct 2021 00:59:46 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> I think that it's better to comment why "retry" is not necessary
> when not in standby mode.
> 
> -------------------
> When not in standby mode, an invalid page header should cause recovery
> to end, not retry reading the page, so we don't need to validate the
> page
> header here for the retry. Instead, ReadPageInternal() is responsible
> for
> the validation.

LGTM.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: corruption of WAL page header is never reported

From
Fujii Masao
Date:

On 2021/10/05 10:58, Kyotaro Horiguchi wrote:
> At Tue, 5 Oct 2021 00:59:46 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
>> I think that it's better to comment why "retry" is not necessary
>> when not in standby mode.
>>
>> -------------------
>> When not in standby mode, an invalid page header should cause recovery
>> to end, not retry reading the page, so we don't need to validate the
>> page
>> header here for the retry. Instead, ReadPageInternal() is responsible
>> for
>> the validation.
> 
> LGTM.

Thanks for the review! I updated the comment and pushed the patch.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION