Thread: corruption of WAL page header is never reported
Hello, I found that any corruption of WAL page header found during recovery is never reported in log messages. If wal page header is broken, it is detected in XLogReaderValidatePageHeader called from XLogPageRead, but the error messages are always reset and never reported. if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) { /* reset any error XLogReaderValidatePageHeader() might have set */ xlogreader->errormsg_buf[0] = '\0'; goto next_record_is_invalid; } Since the commit 06687198018, we call XLogReaderValidatePageHeader here so that we can check a page header and retry immediately if it's invalid, but the error message is reset immediately and not reported. I guess the reason why the error message is reset is because we might get the right WAL after some retries. However, I think it is better to report the error for each check in order to let users know the actual issues founded in the WAL. I attached a patch to fix it in this way. Or, if we wouldn't like to report an error for each check and also what we want to check here is just about old recycled WAL instead of header corruption itself, I wander that we could check just xlp_pageaddr instead of calling XLogReaderValidatePageHeader. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
Attachment
Em sáb., 17 de jul. de 2021 às 16:57, Yugo NAGATA <nagata@sraoss.co.jp> escreveu:
Hello,
I found that any corruption of WAL page header found during recovery is never
reported in log messages. If wal page header is broken, it is detected in
XLogReaderValidatePageHeader called from XLogPageRead, but the error messages
are always reset and never reported.
if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
{
/* reset any error XLogReaderValidatePageHeader() might have set */
xlogreader->errormsg_buf[0] = '\0';
goto next_record_is_invalid;
}
Since the commit 06687198018, we call XLogReaderValidatePageHeader here so that
we can check a page header and retry immediately if it's invalid, but the error
message is reset immediately and not reported. I guess the reason why the error
message is reset is because we might get the right WAL after some retries.
However, I think it is better to report the error for each check in order to let
users know the actual issues founded in the WAL.
I attached a patch to fix it in this way.
I think to keep the same behavior as before, is necessary always run:
/* reset any error XLogReaderValidatePageHeader() might have set */
xlogreader->errormsg_buf[0] = '\0';
xlogreader->errormsg_buf[0] = '\0';
not?
Ranier Vilela
Attachment
On Sat, 17 Jul 2021 18:40:02 -0300 Ranier Vilela <ranier.vf@gmail.com> wrote: > Em sáb., 17 de jul. de 2021 às 16:57, Yugo NAGATA <nagata@sraoss.co.jp> > escreveu: > > > Hello, > > > > I found that any corruption of WAL page header found during recovery is > > never > > reported in log messages. If wal page header is broken, it is detected in > > XLogReaderValidatePageHeader called from XLogPageRead, but the error > > messages > > are always reset and never reported. > > > > if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, > > readBuf)) > > { > > /* reset any error XLogReaderValidatePageHeader() might > > have set */ > > xlogreader->errormsg_buf[0] = '\0'; > > goto next_record_is_invalid; > > } > > > > Since the commit 06687198018, we call XLogReaderValidatePageHeader here so > > that > > we can check a page header and retry immediately if it's invalid, but the > > error > > message is reset immediately and not reported. I guess the reason why the > > error > > message is reset is because we might get the right WAL after some retries. > > However, I think it is better to report the error for each check in order > > to let > > users know the actual issues founded in the WAL. > > > > I attached a patch to fix it in this way. > > > I think to keep the same behavior as before, is necessary always run: > > /* reset any error XLogReaderValidatePageHeader() might have set */ > xlogreader->errormsg_buf[0] = '\0'; > > not? If we are not in StandbyMode, the check is not retried, and an error is returned immediately. So, I think ,we don't have to display an error message in such cases, and neither reset it. Instead, it would be better to leave the error message handling to the caller of XLogReadRecord. Regards, Yugo Nagat -- Yugo NAGATA <nagata@sraoss.co.jp>
Hello. At Sun, 18 Jul 2021 04:55:05 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in > Hello, > > I found that any corruption of WAL page header found during recovery is never > reported in log messages. If wal page header is broken, it is detected in > XLogReaderValidatePageHeader called from XLogPageRead, but the error messages > are always reset and never reported. Good catch! Currently recovery stops showing no reason if it is stopped by page-header errors. > I attached a patch to fix it in this way. However, it is a kind of a roof-over-a-roof. What we should do is just omitting the check in XLogPageRead while in standby mode. > Or, if we wouldn't like to report an error for each check and also what we want > to check here is just about old recycled WAL instead of header corruption itself, > I wander that we could check just xlp_pageaddr instead of calling > XLogReaderValidatePageHeader. I'm not sure. But as described in the commit message, the commit intended to save a common case and there's no obvious reason to (and not to) restrict the check only to page address. So it uses the established checking function. I was tempted to adjust the comment just above by adding "while in standby mode", but "so that we can retry immediately" is suggesting that so I didn't do that in the attached. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 30033d810bcc784da55600792484603e1c46b3d7 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Mon, 19 Jul 2021 14:49:34 +0900 Subject: [PATCH v1] Don't forget message of hage-header errors while not in standby mode The commit 0668719801 intended to omit page-header errors only while in standby mode but actually it is always forgotten. As the result the message of the end of a crash recovery lacks the reason for the stop. Fix that by doing the additional check only while in standby mode. --- src/backend/access/transam/xlog.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 2ee9515139..79513fb8b5 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -12317,7 +12317,8 @@ retry: * Validating the page header is cheap enough that doing it twice * shouldn't be a big deal from a performance point of view. */ - if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) + if (StandbyMode && + !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) { /* reset any error XLogReaderValidatePageHeader() might have set */ xlogreader->errormsg_buf[0] = '\0'; -- 2.27.0
On Mon, 19 Jul 2021 15:14:41 +0900 (JST) Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > Hello. > > At Sun, 18 Jul 2021 04:55:05 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in > > Hello, > > > > I found that any corruption of WAL page header found during recovery is never > > reported in log messages. If wal page header is broken, it is detected in > > XLogReaderValidatePageHeader called from XLogPageRead, but the error messages > > are always reset and never reported. > > Good catch! Currently recovery stops showing no reason if it is > stopped by page-header errors. > > > I attached a patch to fix it in this way. > > However, it is a kind of a roof-over-a-roof. What we should do is > just omitting the check in XLogPageRead while in standby mode. Your patch doesn't fix the issue that the error message is never reported in standby mode. When a WAL page header is broken, the standby would silently repeat retrying forever. I think we have to let users know the corruption of WAL page header even in standby mode, not? A corruption of WAL record header is always reported, by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.) Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in > Your patch doesn't fix the issue that the error message is never reported in > standby mode. When a WAL page header is broken, the standby would silently repeat > retrying forever. Ok, I see your point and agree to that. > I think we have to let users know the corruption of WAL page header even in > standby mode, not? A corruption of WAL record header is always reported, > by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.) Howeer, I'm still on the opinion that we don't need to check that while in standby mode. How about the attached? regards. -- Kyotaro Horiguchi NTT Open Source Software Center From fe23fbf51569eb7a305bd9e619e918aec0b87bf7 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Mon, 19 Jul 2021 14:49:34 +0900 Subject: [PATCH] Don't hide xlog page-header errors while recovery The commit 0668719801 intended to handle the case of standby mode but it inadvertently catches also while not in standby mode. Change the condition to catch only while in non-standby mode. Addition to that, it is reasonable to show the error message even in standby mode. Show the error message if any. Author: Yugo NAGATA <nagata@sraoss.co.jp> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> --- src/backend/access/transam/xlog.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 2ee9515139..221e7d1f07 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -12317,8 +12317,17 @@ retry: * Validating the page header is cheap enough that doing it twice * shouldn't be a big deal from a performance point of view. */ - if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) + if (StandbyMode && + !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) { + /* + * in this case we consume this error right now then retry immediately, + * the message is already translated + */ + if (xlogreader->errormsg_buf[0]) + ereport(emode_for_corrupt_record(emode, EndRecPtr), + (errmsg_internal("%s", xlogreader->errormsg_buf))); + /* reset any error XLogReaderValidatePageHeader() might have set */ xlogreader->errormsg_buf[0] = '\0'; goto next_record_is_invalid; -- 2.27.0
me> Howeer, I'm still on the opinion that we don't need to check that me> while in standby mode. Of course it's typo of "while not in standby mode". sorry.. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, 19 Jul 2021 17:47:07 +0900 (JST) Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in > > Your patch doesn't fix the issue that the error message is never reported in > > standby mode. When a WAL page header is broken, the standby would silently repeat > > retrying forever. > > Ok, I see your point and agree to that. > > > I think we have to let users know the corruption of WAL page header even in > > standby mode, not? A corruption of WAL record header is always reported, > > by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.) > > Howeer, I'm still on the opinion that we don't need to check that > while in standby mode. > > How about the attached? On Mon, 19 Jul 2021 17:50:16 +0900 (JST) Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > me> Howeer, I'm still on the opinion that we don't need to check that > me> while in standby mode. > > Of course it's typo of "while not in standby mode". Thanks for updating the patch. I agree with you. I think it is nice to fix to perform the check only during standby mode because it make a bit clearer why we check it immediately in XLogPageRead. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
Em seg., 19 de jul. de 2021 às 06:15, Yugo NAGATA <nagata@sraoss.co.jp> escreveu:
On Mon, 19 Jul 2021 17:47:07 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in
> > Your patch doesn't fix the issue that the error message is never reported in
> > standby mode. When a WAL page header is broken, the standby would silently repeat
> > retrying forever.
>
> Ok, I see your point and agree to that.
>
> > I think we have to let users know the corruption of WAL page header even in
> > standby mode, not? A corruption of WAL record header is always reported,
> > by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)
>
> Howeer, I'm still on the opinion that we don't need to check that
> while in standby mode.
>
> How about the attached?
On Mon, 19 Jul 2021 17:50:16 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> me> Howeer, I'm still on the opinion that we don't need to check that
> me> while in standby mode.
>
> Of course it's typo of "while not in standby mode".
Thanks for updating the patch. I agree with you.
I think it is nice to fix to perform the check only during standby mode
because it make a bit clearer why we check it immediately in XLogPageRead.
And as I had reviewed, your first patch was wrong and now with the Kyotaro version,
to keep the same behavior, it is necessary to reset the error, correct?
regards,
Ranier Vilela
On Mon, 19 Jul 2021 06:32:28 -0300 Ranier Vilela <ranier.vf@gmail.com> wrote: > Em seg., 19 de jul. de 2021 às 06:15, Yugo NAGATA <nagata@sraoss.co.jp> > escreveu: > > > On Mon, 19 Jul 2021 17:47:07 +0900 (JST) > > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > > At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp> > > wrote in > > > > Your patch doesn't fix the issue that the error message is never > > reported in > > > > standby mode. When a WAL page header is broken, the standby would > > silently repeat > > > > retrying forever. > > > > > > Ok, I see your point and agree to that. > > > > > > > I think we have to let users know the corruption of WAL page header > > even in > > > > standby mode, not? A corruption of WAL record header is always > > reported, > > > > by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.) > > > > > > Howeer, I'm still on the opinion that we don't need to check that > > > while in standby mode. > > > > > > How about the attached? > > > > On Mon, 19 Jul 2021 17:50:16 +0900 (JST) > > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > > me> Howeer, I'm still on the opinion that we don't need to check that > > > me> while in standby mode. > > > > > > Of course it's typo of "while not in standby mode". > > > > Thanks for updating the patch. I agree with you. > > > > I think it is nice to fix to perform the check only during standby mode > > because it make a bit clearer why we check it immediately in XLogPageRead. > > > And as I had reviewed, your first patch was wrong and now with the Kyotaro > version, > to keep the same behavior, it is necessary to reset the error, correct? Well, I think my first patch was not wrong. The difference with the latest patch is just whether we perform the additional check when we are not in standby mode or not. The behavior is basically the same although which function detects and prints the page-header error in cases of crash recovery is different. On the other hand, in your patch, the error message was always ommitted in cases of crash recovery, and it seemed to me wrong. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
On 2021/07/19 18:52, Yugo NAGATA wrote: > Well, I think my first patch was not wrong. The difference with the latest > patch is just whether we perform the additional check when we are not in > standby mode or not. The behavior is basically the same although which function > detects and prints the page-header error in cases of crash recovery is different. Yes, so which patch do you think is better? I like your version because there seems no reason why XLogPageRead() should skip XLogReaderValidatePageHeader() when not in standby mode. Also I'm tempted to move ereport() and reset of errmsg_buf to under next_record_is_invalid as follows. That is, in standby mode whenever we find an invalid record and retry reading WAL page in XLogPageRead(), we report the error message and reset it. For now in XLogPageRead(), only XLogReaderValidatePageHeader() sets errmsg_buf, but in the future other code or function doing that may be added. For that case, the following change seems more elegant. Thought? * shouldn't be a big deal from a performance point of view. */ if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) - { - /* reset any error XLogReaderValidatePageHeader() might have set */ - xlogreader->errormsg_buf[0] = '\0'; goto next_record_is_invalid; - } return readLen; @@ -12517,7 +12513,17 @@ next_record_is_invalid: /* In standby-mode, keep trying */ if (StandbyMode) + { + if (xlogreader->errormsg_buf[0]) + { + ereport(emode_for_corrupt_record(emode, EndRecPtr), + (errmsg_internal("%s", xlogreader->errormsg_buf))); + + /* reset any error XLogReaderValidatePageHeader() might have set */ + xlogreader->errormsg_buf[0] = '\0'; + } goto retry; + } else return -1; } Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
At Thu, 2 Sep 2021 12:19:25 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2021/07/19 18:52, Yugo NAGATA wrote: > > Well, I think my first patch was not wrong. The difference with the > > latest > > patch is just whether we perform the additional check when we are not > > in > > standby mode or not. The behavior is basically the same although which > > function > > detects and prints the page-header error in cases of crash recovery is > > different. > > Yes, so which patch do you think is better? I like your version > because there seems no reason why XLogPageRead() should skip > XLogReaderValidatePageHeader() when not in standby mode. Did you read the comment just above? xlog.c:12523 > * Check the page header immediately, so that we can retry immediately if > * it's not valid. This may seem unnecessary, because XLogReadRecord() > * validates the page header anyway, and would propagate the failure up to > * ReadRecord(), which would retry. However, there's a corner case with > * continuation records, if a record is split across two pages such that So when not in standby mode, the same check is performed by xlogreader which has the responsibility to validate the binary data read by XLogPageRead. The page-header validation is a compromise to save a specific case. > Also I'm tempted to move ereport() and reset of errmsg_buf to > under next_record_is_invalid as follows. That is, in standby mode > whenever we find an invalid record and retry reading WAL page > in XLogPageRead(), we report the error message and reset it. > For now in XLogPageRead(), only XLogReaderValidatePageHeader() > sets errmsg_buf, but in the future other code or function doing that > may be added. For that case, the following change seems more elegant. > Thought? I don't think it is good choice to conflate read-failure and header validation failure from the view of modularity. > * shouldn't be a big deal from a performance point of view. > */ > if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) > - { > - /* reset any error XLogReaderValidatePageHeader() might have set */ > - xlogreader->errormsg_buf[0] = '\0'; > goto next_record_is_invalid; > - } > return readLen; > @@ -12517,7 +12513,17 @@ next_record_is_invalid: > /* In standby-mode, keep trying */ > if (StandbyMode) > + { > + if (xlogreader->errormsg_buf[0]) > + { > + ereport(emode_for_corrupt_record(emode, EndRecPtr), > + (errmsg_internal("%s", xlogreader->errormsg_buf))); > + > + /* reset any error XLogReaderValidatePageHeader() might have set */ > + xlogreader->errormsg_buf[0] = '\0'; > + } > goto retry; > + } > else > return -1; > } regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Sorry, please let me add something. At Thu, 02 Sep 2021 13:17:16 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Thu, 2 Sep 2021 12:19:25 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > Also I'm tempted to move ereport() and reset of errmsg_buf to > > under next_record_is_invalid as follows. That is, in standby mode > > whenever we find an invalid record and retry reading WAL page > > in XLogPageRead(), we report the error message and reset it. > > For now in XLogPageRead(), only XLogReaderValidatePageHeader() > > sets errmsg_buf, but in the future other code or function doing that > > may be added. For that case, the following change seems more elegant. > > Thought? > > I don't think it is good choice to conflate read-failure and header > validation failure from the view of modularity. In other words, XLogReaderValidatePageHeader is foreign for XLogPageRead and the function indeuces the need of extra care for errormsg_buf that is not relevant to the elog-capable module. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
(.. sorry for the chained-mails) At Thu, 02 Sep 2021 13:31:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > Sorry, please let me add something. > > At Thu, 02 Sep 2021 13:17:16 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > At Thu, 2 Sep 2021 12:19:25 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > Also I'm tempted to move ereport() and reset of errmsg_buf to > > > under next_record_is_invalid as follows. That is, in standby mode > > > whenever we find an invalid record and retry reading WAL page > > > in XLogPageRead(), we report the error message and reset it. > > > For now in XLogPageRead(), only XLogReaderValidatePageHeader() > > > sets errmsg_buf, but in the future other code or function doing that > > > may be added. For that case, the following change seems more elegant. > > > Thought? > > > > I don't think it is good choice to conflate read-failure and header > > validation failure from the view of modularity. > > In other words, XLogReaderValidatePageHeader is foreign for > XLogPageRead and the function indeuces the need of extra care for > errormsg_buf that is not relevant to the elog-capable module. However, I can agree that the error handling code can be moved further later. Like this, > * shouldn't be a big deal from a performance point of view. > */ - if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) - /* reset any error XLogReaderValidatePageHeader() might have set */ - xlogreader->errormsg_buf[0] = '\0'; - goto next_record_is_invalid; + if (... && XLogReaderValidatePageHeader()) + goto page_header_is_invalid; ... > return readlen; > + page_header_is_invalid: + /* + * in this case we consume this error right now then retry immediately, + * the message is already translated + */ + if (xlogreader->errormsg_buf[0]) + ereport(emode_for_corrupt_record(emode, EndRecPtr), + (errmsg_internal("%s", xlogreader->errormsg_buf))); + + /* reset any error XLogReaderValidatePageHeader() might have set */ + xlogreader->errormsg_buf[0] = '\0'; > > next_record_is_invalid: > lastSourceFailed = true; regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/09/02 13:17, Kyotaro Horiguchi wrote: > Did you read the comment just above? Yes! > > xlog.c:12523 >> * Check the page header immediately, so that we can retry immediately if >> * it's not valid. This may seem unnecessary, because XLogReadRecord() >> * validates the page header anyway, and would propagate the failure up to >> * ReadRecord(), which would retry. However, there's a corner case with >> * continuation records, if a record is split across two pages such that > > So when not in standby mode, the same check is performed by xlogreader > which has the responsibility to validate the binary data read by > XLogPageRead. The page-header validation is a compromise to save a > specific case. Yes, so XLogPageRead() can skip the validation check of page head if not in standby mode. On the other hand, there is no problem if it still performs the validation check as it does for now. No? > I don't think it is good choice to conflate read-failure and header > validation failure from the view of modularity. I don't think that the proposed change does that. But maybe I failed to get your point yet... Anyway the proposed change just tries to reset errormsg_buf whenever XLogPageRead() retries, whatever error happened before. Also if errormsg_buf is set at that moment, it's reported. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
At Thu, 2 Sep 2021 14:44:31 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2021/09/02 13:17, Kyotaro Horiguchi wrote: > > Did you read the comment just above? > > Yes! Glad to hear that, or..:p > > xlog.c:12523 > >> * Check the page header immediately, so that we can retry immediately if > >> * it's not valid. This may seem unnecessary, because XLogReadRecord() > >> * validates the page header anyway, and would propagate the failure up to > >> * ReadRecord(), which would retry. However, there's a corner case with > >> * continuation records, if a record is split across two pages such that > > So when not in standby mode, the same check is performed by xlogreader > > which has the responsibility to validate the binary data read by > > XLogPageRead. The page-header validation is a compromise to save a > > specific case. > > Yes, so XLogPageRead() can skip the validation check of page head if > not > in standby mode. On the other hand, there is no problem if it still > performs > the validation check as it does for now. No? Practically yes, and it has always been like that as you say. > > I don't think it is good choice to conflate read-failure and header > > validation failure from the view of modularity. > > I don't think that the proposed change does that. But maybe I failed It's about your idea in a recent mail. not about the proposed patch(es). > to get > your point yet... Anyway the proposed change just tries to reset > errormsg_buf whenever XLogPageRead() retries, whatever error happened > before. Also if errormsg_buf is set at that moment, it's reported. I believe errmsg_buf is an interface to emit error messages dedicated to xlogreader that doesn't have an access to elog facility, and xlogreader doesn't (or ought not to or expect to) suppose non-xlogreader callback functions set the variable. In that sense I don't think theoriginally proposed patch is proper for the reason that the non-xlogreader callback function may set errmsg_buf. This is what I meant by the word "modularity". For that reason I avoided in my second proposal to call XLogReaderValidatePageHeader() at all while not in standby mode, because calling the validator function while in non-standby mode results in the non-xlogreader function return errmsg_buf. Of course we can instead always consume errmsg_buf in the function but I don't like to shadow the caller's task. Does that makes sense? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/09/02 16:26, Kyotaro Horiguchi wrote: > I believe errmsg_buf is an interface to emit error messages dedicated > to xlogreader that doesn't have an access to elog facility, and > xlogreader doesn't (or ought not to or expect to) suppose > non-xlogreader callback functions set the variable. In that sense I > don't think theoriginally proposed patch is proper for the reason that > the non-xlogreader callback function may set errmsg_buf. This is what > I meant by the word "modularity". > > For that reason I avoided in my second proposal to call > XLogReaderValidatePageHeader() at all while not in standby mode, > because calling the validator function while in non-standby mode > results in the non-xlogreader function return errmsg_buf. Of course > we can instead always consume errmsg_buf in the function but I don't > like to shadow the caller's task. Understood. Thanks for clarifying this! > Does that makes sense? Yes, I'm fine with your latest patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
At Thu, 2 Sep 2021 21:52:00 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2021/09/02 16:26, Kyotaro Horiguchi wrote: > > I believe errmsg_buf is an interface to emit error messages dedicated > > to xlogreader that doesn't have an access to elog facility, and > > xlogreader doesn't (or ought not to or expect to) suppose > > non-xlogreader callback functions set the variable. In that sense I > > don't think theoriginally proposed patch is proper for the reason that > > the non-xlogreader callback function may set errmsg_buf. This is what > > I meant by the word "modularity". > > For that reason I avoided in my second proposal to call > > XLogReaderValidatePageHeader() at all while not in standby mode, > > because calling the validator function while in non-standby mode > > results in the non-xlogreader function return errmsg_buf. Of course > > we can instead always consume errmsg_buf in the function but I don't > > like to shadow the caller's task. > > Understood. Thanks for clarifying this! > > > Does that makes sense? > > Yes, I'm fine with your latest patch. Thanks. Maybe some additional comment is needed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 03 Sep 2021 16:55:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > Yes, I'm fine with your latest patch. > > Thanks. Maybe some additional comment is needed. Something like this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 24165ab03e..b621ad6b0f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -12496,9 +12496,21 @@ retry: * * Validating the page header is cheap enough that doing it twice * shouldn't be a big deal from a performance point of view. + * + * Don't call XLogReaderValidatePageHeader here while not in standby mode + * so that this function won't return with a valid errmsg_buf. */ - if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) + if (StandbyMode && + !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) { + /* + * emit this error right now then retry this page immediately + * the message is already translated + */ + if (xlogreader->errormsg_buf[0]) + ereport(emode_for_corrupt_record(emode, EndRecPtr), + (errmsg_internal("%s", xlogreader->errormsg_buf))); + /* reset any error XLogReaderValidatePageHeader() might have set */ xlogreader->errormsg_buf[0] = '\0'; goto next_record_is_invalid;
On 2021/09/06 3:11, Alvaro Herrera wrote: > On 2021-Sep-03, Kyotaro Horiguchi wrote: > >> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c >> index 24165ab03e..b621ad6b0f 100644 >> --- a/src/backend/access/transam/xlog.c >> +++ b/src/backend/access/transam/xlog.c >> @@ -12496,9 +12496,21 @@ retry: >> * >> * Validating the page header is cheap enough that doing it twice >> * shouldn't be a big deal from a performance point of view. >> + * >> + * Don't call XLogReaderValidatePageHeader here while not in standby mode >> + * so that this function won't return with a valid errmsg_buf. >> */ >> - if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) >> + if (StandbyMode && >> + !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) > > OK, but I don't understand why we have a comment that says (referring to > non-standby mode) "doing it twice shouldn't be a big deal", followed by > "Don't do it twice while not in standby mode" -- that seems quite > contradictory. I think the new comment should overwrite the previous > one, something like this: > > - * Validating the page header is cheap enough that doing it twice > - * shouldn't be a big deal from a performance point of view. > + * > + * We do this in standby mode only, > + * so that this function won't return with a valid errmsg_buf. Even if we do this while NOT in standby mode, ISTM that this function doesn't return with a valid errmsg_buf because it's reset. So probably the comment should be updated as follows? ------------------------- We don't do this while not in standby mode because we don't need to retry immediately if the page header is not valid. Instead, XLogReadRecord() is responsible to check the page header. ------------------------- Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021/09/07 2:02, Fujii Masao wrote: > Even if we do this while NOT in standby mode, ISTM that this function doesn't > return with a valid errmsg_buf because it's reset. So probably the comment > should be updated as follows? > > ------------------------- > We don't do this while not in standby mode because we don't need to retry > immediately if the page header is not valid. Instead, XLogReadRecord() is > responsible to check the page header. > ------------------------- I updated the comment as above. Patch attached. - * it's not valid. This may seem unnecessary, because XLogReadRecord() + * it's not valid. This may seem unnecessary, because ReadPageInternal() * validates the page header anyway, and would propagate the failure up to I also applied this change because ReadPageInternal() not XLogReadRecord() calls XLogReaderValidatePageHeader(). Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
At Fri, 10 Sep 2021 10:38:39 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2021/09/07 2:02, Fujii Masao wrote: > > Even if we do this while NOT in standby mode, ISTM that this function > > doesn't > > return with a valid errmsg_buf because it's reset. So probably the > > comment > > should be updated as follows? > > ------------------------- > > We don't do this while not in standby mode because we don't need to > > retry > > immediately if the page header is not valid. Instead, XLogReadRecord() > > is > > responsible to check the page header. > > ------------------------- > > I updated the comment as above. Patch attached. > > - * it's not valid. This may seem unnecessary, because XLogReadRecord() > + * it's not valid. This may seem unnecessary, because > ReadPageInternal() > * validates the page header anyway, and would propagate the failure up to > > I also applied this change because ReadPageInternal() not > XLogReadRecord() > calls XLogReaderValidatePageHeader(). Yeah, good catch. + * Note that we don't do this while not in standby mode because we don't + * need to retry immediately if the page header is not valid. Instead, + * ReadPageInternal() is responsible for validating the page header. The point here is "retry this page, not this record", so "we don't need to retry immediately" looks a bit ambiguous. So how about something like this? Note that we don't do this while not in standby mode because we don't need to avoid retrying this entire record even if the page header is not valid. Instead, ReadPageInternal() is responsible for validating the page header in that case. Everything else looks fine to me. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/09/13 11:00, Kyotaro Horiguchi wrote: > The point here is "retry this page, not this record", so "we don't need > to retry immediately" looks a bit ambiguous. So how about something > like this? > > Note that we don't do this while not in standby mode because we don't > need to avoid retrying this entire record even if the page header is > not valid. Instead, ReadPageInternal() is responsible for validating > the page header in that case. You mean that, while not in standby mode, we need to retry reading the entire record if the page head is invalid? I was thinking that we basically give up replaying further records in that case becase we're not in standby mode. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
At Mon, 13 Sep 2021 14:56:11 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2021/09/13 11:00, Kyotaro Horiguchi wrote: > > The point here is "retry this page, not this record", so "we don't > > need > > to retry immediately" looks a bit ambiguous. So how about something > > like this? > > Note that we don't do this while not in standby mode because we don't > > need to avoid retrying this entire record even if the page header is > > not valid. Instead, ReadPageInternal() is responsible for validating > > the page header in that case. > > You mean that, while not in standby mode, we need to retry reading > the entire record if the page head is invalid? I was thinking that > we basically give up replaying further records in that case becase > we're not in standby mode. I wrote "while not in standby mode, we don't need to avoid retry the entire record" but that doesn't mean the inversion "while in standby mode, we need to do avoid that". In the first place retry doesn't happen while not in standby mode. I don't come up with a nice phrasing but something like this works? Note that we don't do this while not in standby mode because this is required only to avoid retrying this entire record for an invalid page header while in standby mode. Instead, ReadPageInternal() is responsible for validating the page header in that case. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Mon, 13 Sep 2021 17:21:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Mon, 13 Sep 2021 14:56:11 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > > > > On 2021/09/13 11:00, Kyotaro Horiguchi wrote: > > > The point here is "retry this page, not this record", so "we don't > > > need > > > to retry immediately" looks a bit ambiguous. So how about something > > > like this? > > > Note that we don't do this while not in standby mode because we don't > > > need to avoid retrying this entire record even if the page header is > > > not valid. Instead, ReadPageInternal() is responsible for validating > > > the page header in that case. > > > > You mean that, while not in standby mode, we need to retry reading > > the entire record if the page head is invalid? I was thinking that > > we basically give up replaying further records in that case becase > > we're not in standby mode. Sorry, my brain can be easily twisted.. > I wrote "while not in standby mode, we don't need to avoid retry the > entire record" but that doesn't mean the inversion "while in standby > mode, we need to do avoid that". In the first place retry doesn't > happen while not in standby mode. I don't come up with a nice > phrasing but something like this works? Mmm. Something's got wrong badly... I wrote "we don't need to avoid retry the entire record" but that doesn't mean "we need to retry the entire record". That simply means "we don't care if retry happens or not." In the first place retry doesn't happen while not in standby mode. I don't come up with a nice phrasing but something like this works? > Note that we don't do this while not in standby mode because this is > required only to avoid retrying this entire record for an invalid page > header while in standby mode. Instead, ReadPageInternal() is > responsible for validating the page header in that case. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/09/13 17:21, Kyotaro Horiguchi wrote: > I wrote "while not in standby mode, we don't need to avoid retry the > entire record" but that doesn't mean the inversion "while in standby > mode, we need to do avoid that". In the first place retry doesn't > happen while not in standby mode. I don't come up with a nice > phrasing but something like this works? > > Note that we don't do this while not in standby mode because this is > required only to avoid retrying this entire record for an invalid page > header while in standby mode. Instead, ReadPageInternal() is > responsible for validating the page header in that case. I think that it's better to comment why "retry" is not necessary when not in standby mode. ------------------- When not in standby mode, an invalid page header should cause recovery to end, not retry reading the page, so we don't need to validate the page header here for the retry. Instead, ReadPageInternal() is responsible for the validation. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
At Tue, 5 Oct 2021 00:59:46 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > I think that it's better to comment why "retry" is not necessary > when not in standby mode. > > ------------------- > When not in standby mode, an invalid page header should cause recovery > to end, not retry reading the page, so we don't need to validate the > page > header here for the retry. Instead, ReadPageInternal() is responsible > for > the validation. LGTM. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/10/05 10:58, Kyotaro Horiguchi wrote: > At Tue, 5 Oct 2021 00:59:46 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in >> I think that it's better to comment why "retry" is not necessary >> when not in standby mode. >> >> ------------------- >> When not in standby mode, an invalid page header should cause recovery >> to end, not retry reading the page, so we don't need to validate the >> page >> header here for the retry. Instead, ReadPageInternal() is responsible >> for >> the validation. > > LGTM. Thanks for the review! I updated the comment and pushed the patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION