Re: Wall shiping replica failed to recover database with error:invalid contrecord length 1956 at FED/38FFE208 - Mailing list pgsql-general

From Kyotaro Horiguchi
Subject Re: Wall shiping replica failed to recover database with error:invalid contrecord length 1956 at FED/38FFE208
Date
Msg-id 20200117.172655.1384889922565817808.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: Wall shiping replica failed to recover database with error: invalid contrecord length 1956 at FED/38FFE208  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Thank you, and sorry for overlooking your comment.

At Thu, 14 Nov 2019 12:28:13 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > At Wed, 2 Oct 2019 19:24:02 -0400, Stephen Frost <sfrost@snowman.net> wrote in 
> >> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >>> Yeah, those messages are all pretty ancient, from when WAL was new and not
> >>> to be trusted much.  Perhaps the thing to do is move the existing info
> >>> into DETAIL and make the primary message be something like "reached
> >>> apparent end of WAL stream".
> 
> >> Yes, +1 on that.
> 
> > What do you think about this?
> 
> It seems overcomplicated.  Why do you need to reach into
> emode_for_corrupt_record's private state? 
> emode_for_corrupt_record's private state?  I think we could just
> change the message text without that, and leave the emode
> machinations as-is.

I tried to avoid messages at the same LSN. But it is done by
emode_for_corrupt_record by reducing error level. I reverted that
part.

> I don't understand the restriction to "if (RecPtr == InvalidXLogRecPtr)"
> either?  Maybe that's fine but the comment fails to explain it.

"When we end up here while reading successive recored" meant that, but
it is not explicit. If RecPtr was not invalid, it means that the
caller is requesting for a specific record that should exist.  It is
not end-of-wal at all. I rewrote the comment.

> Another point is that, as the comment for emode_for_corrupt_record
> notes, we don't really want to consider that we've hit end-of-WAL
> when reading any source except XLOG_FROM_PG_WAL.  So I think the
> change of message ought to include a test of the source, but this
> doesn't.

Maybe no. The function just mutes messages for repeated error on
XLOG_FROM_WAL. We consider end-of-WAL while XLOG_FROM_ARCHIVE.  In
this version the "reached end of WAL" is not emitted when
emode_for_corrupt_record decided not to show the message.

> Also, the business with an "operation" string violates the message
> translatability guideline about "don't assemble a message out of
> phrases".  If we want to have that extra detail, it's better just
> to make three separate ereport() calls with separately translatable
> messages.

Though I thought that the operation is just a noun and it is safely
processed separately, I followed your comment. And this version takes
the more verbose one of the previous two.

> Also, it seems wrong that the page TLI check, just below, is where
> it is and isn't part of the main set of page header sanity checks.
> That's sort of unrelated to this patch, except not really, because
> shouldn't a failure of that test also be treated as an "end of WAL"
> condition?

It seems checking if xlogreader did something wrong, since it is
pluggable.  I'm not sure there is any concrete reason for that,
though.  As for recovery, it is actually redundant because
XLogFileReadAnyTLI already checked that for the record, but it doesn't
matter.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 94ee9c982afc5b39bb062c59a95ac2323a4109c8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 17 Jan 2020 17:15:58 +0900
Subject: [PATCH] Change end-of-WAL message less scary

At the end of WAL during recovery, users read just "invalid record"
message in logs, which is too scary. This patch change the message for
the case to "reached end of WAL" to let users know that they are not
in a trouble.
---
 src/backend/access/transam/xlog.c | 48 +++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7f4f784c0e..a7757ea4bf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4279,6 +4279,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
         EndRecPtr = xlogreader->EndRecPtr;
         if (record == NULL)
         {
+            int actual_emode =
+                emode_for_corrupt_record(emode,
+                                         RecPtr ? RecPtr : EndRecPtr);
+
             if (readFile >= 0)
             {
                 close(readFile);
@@ -4286,15 +4290,43 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * Invalid RecPtr here means we are reading successive records
+             * during recovery. If we get here during recovery, we assume that
+             * we reached the end of WAL.  Otherwise something's really wrong
+             * and we report just only the errormsg if any. If we don't receive
+             * errormsg here, we already logged something.  We are going to
+             * emit duplicate message at the same LSN if
+             * emode_for_currupt_record decided to mute it.  We don't repeat
+             * "reached end of WAL" in the muted messages.
+             *
+             * Note: errormsg is alreay translated.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode,
-                                                 RecPtr ? RecPtr : EndRecPtr),
-                        (errmsg_internal("%s", errormsg) /* already translated */ ));
+            if (RecPtr == InvalidXLogRecPtr && actual_emode == emode)
+            {
+                if (StandbyMode)
+                    ereport(actual_emode,
+                            (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during streaming replication",
+                                     (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+                                     ThisTimeLineID,
+                                     xlogSourceNames[currentSource]),
+                             (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+                else if (InArchiveRecovery)
+                    ereport(actual_emode,
+                            (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during archive recovery",
+                                     (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+                                     ThisTimeLineID,
+                                     xlogSourceNames[currentSource]),
+                             (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+                else
+                    ereport(actual_emode,
+                            (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery",
+                                     (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+                                     ThisTimeLineID,
+                                     xlogSourceNames[currentSource]),
+                             (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
+            else if (errormsg)
+                ereport(actual_emode, (errmsg_internal("%s", errormsg)));
         }
 
         /*
-- 
2.23.0


pgsql-general by date:

Previous
From: "Arnaud L."
Date:
Subject: Re: minimal wal_level on subscriber
Next
From: ramesh penumalli
Date:
Subject: postgresql commands(psql,createdb,dropdb) are not working from shell script