Re: Make mesage at end-of-recovery less scary. - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Make mesage at end-of-recovery less scary.
Date
Msg-id 20200305.160650.1027844943773682801.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: Make mesage at end-of-recovery less scary.  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Make mesage at end-of-recovery less scary.  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
Hello.

I changed the condition from randAccess to fetching_ckpt considering
the discussion in another thread [1]. Then I moved the block that
shows the new messages to more appropriate place.

At Fri, 28 Feb 2020 17:28:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > 
> > Have you considered an error context here?  Your patch leads to a bit
> > of duplication with the message a bit down of what you are changing
> > where the end of local pg_wal is reached.
> 
> It is a DEBUG message and it is for the time moving from crash
> recovery to archive recovery. I could remove that but decided to leave
> it for tracability.

I modified the message so that it has the same look to the new
messages, but I left it being DEBUG1, since it is just a intermediate
state. We should finally see one of the new three messages.

After the messages changed, another message from wal sender came to
look redundant.

| [20866] LOG:  replication terminated by primary server
| [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
| [20866] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
| [20851] LOG:  reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode
| [20851] DETAIL:  invalid record length at 0/30001C8: wanted 24, got 0

I changed the above to the below, which looks more adequate.

| [24271]  LOG:  replication terminated by primary server on timeline 1 at 0/3000240.
| [24271]  FATAL:  could not send end-of-streaming message to primary: no COPY in progress
| [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
| [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0

> > > +    * reached the end of WAL.  Otherwise something's really wrong and
> > > +    * we report just only the errormsg if any. If we don't receive
> > 
> > This sentence sounds strange to me.  Or you meant "Something is wrong,
> > so use errormsg as report if it is set"?

The message no longer exists.

> > > +    (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery",
> > 
> > FWIW, you are introducing three times the same typo, in the same
> > word, in three different messages.
> 
> They're copy-pasto.  I refrained from constructing an error message
> from multiple nonindipendent parts.  Are you suggesting to do so?

The tree times repetition of almost same phrases is very unreadable. I
rewrote it in more simple shape.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 17ee82e5d44dd5a932ed69b8a1ea91a23d170952 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v2] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is happening.
Make this message less scary as "reached end of WAL".
---
 src/backend/access/transam/xlog.c     | 72 ++++++++++++++++++++-------
 src/backend/replication/walreceiver.c |  3 +-
 2 files changed, 55 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d19408b3be..849cf6fe6b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4282,12 +4282,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         ReadRecPtr = xlogreader->ReadRecPtr;
         EndRecPtr = xlogreader->EndRecPtr;
         if (record == NULL)
         {
+            ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
+
             if (readFile >= 0)
             {
                 close(readFile);
@@ -4295,14 +4298,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we are fetching checkpoint, we emit the error message right
+             * now. Otherwise the error is regarded as "end of WAL" and the
+             * message if any is shown as a part of the end-of-WAL message
+             * below.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, EndRecPtr),
+            if (fetching_ckpt && errormsg)
+            {
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
+            }
         }
 
         /*
@@ -4332,11 +4337,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
-        {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
 
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
+        {
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4349,11 +4355,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery,
enteringarchive recovery",
 
+                                         (uint32) (ErrRecPtr >> 32),
+                                         (uint32) ErrRecPtr,
+                                         ThisTimeLineID,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4391,12 +4404,35 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             *  We reached the end of WAL, show the messages just once at the
+             *  same LSN.
+             */
+            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            {
+                char *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg (fmt, (uint32) (EndRecPtr >> 32),
+                                 (uint32) EndRecPtr,
+                                 ThisTimeLineID,
+                                 xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2ab15c3cbb..682dbb4e1f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -478,8 +478,7 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
+                                    (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
                                                startpointTLI,
                                                (uint32) (LogstreamResult.Write >> 32), (uint32)
LogstreamResult.Write)));
                             endofwal = true;
-- 
2.18.2


pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Next
From: Peter Eisentraut
Date:
Subject: Re: pg_stat_progress_basebackup - progress reporting forpg_basebackup, in the server side