Re: Make mesage at end-of-recovery less scary. - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Make mesage at end-of-recovery less scary.
Date
Msg-id 20200324.105250.1584846618506385212.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: Make mesage at end-of-recovery less scary.  (Andres Freund <andres@anarazel.de>)
Responses Re: Make mesage at end-of-recovery less scary.  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
At Mon, 23 Mar 2020 12:47:36 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2020-03-23 10:37:16 +0100, Peter Eisentraut wrote:
> > On 2020-03-05 08:06, Kyotaro Horiguchi wrote:
> > > | [20866] LOG:  replication terminated by primary server
> > > | [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
> > > | [20866] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> 
> IMO it's a bug that we see this FATAL. I seem to recall that we didn't
> use to get that?

I thought that it is a convention that A auxiliary process uses ERROR
(which is turned into FATAL in ereport) to exit, which I didn't like
so much, but it was out of scope of this patch.

As for the message bove, the FATAL is preceded by the "LOG:
replication terminated by" message, that means walreceiver tries to
send new data after disconnection just to fail, which is
unreasonable. I think we should exit immediately after detecting
disconnection. The FATAL is gone by the attached.

> > > | [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
> > > | [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0
> > 
> > Is this the before and after?  That doesn't seem like a substantial
> > improvement to me.  You still get the "scary" message at the end.
> 
> It seems like a minor improvement - folding the DETAIL into the message
> makes sense to me here. But it indeed doesn't really address the main
> issue.
> 
> I think we don't want to elide the information about how the end of the
> WAL was detected - there are some issues where I found that quite
> helpful. But we could reformulate it to be clearer that it's informative
> output, not a bug.  E.g. something roughly like
> 
> LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
> DETAIL: End detected due to invalid record length at 0/3000240: expected 24, got 0
> (I first elided the position in the DETAIL, but it could differ from the
> one in LOG)
> 
> I don't find that very satisfying, but I can't come up with something
> that provides the current information, while being less scary than my
> suggestion?

The 0-length record is not an "invalid" state during recovery, so we
can add the message for the state as "record length is 0 at %X/%X". I
think if other states found there, it implies something wrong.

LSN is redundantly shown but I'm not sure if it is better to remove it
from either of the two lines.

| LOG:  reached end of WAL at 0/3000850 on timeline 1 in pg_wal during crash recovery
| DETAIL:  record length is 0 at 0/3000850

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 47511afed5f8acf92abaf1cd6fcfecc1faea9c87 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v3] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is happening.
Make this message less scary as "reached end of WAL".
---
 src/backend/access/transam/xlog.c       | 69 ++++++++++++++++++-------
 src/backend/access/transam/xlogreader.c |  9 ++++
 src/backend/replication/walreceiver.c   | 11 ++--
 3 files changed, 67 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 793c076da6..6c2924dfb7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4283,12 +4283,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         ReadRecPtr = xlogreader->ReadRecPtr;
         EndRecPtr = xlogreader->EndRecPtr;
         if (record == NULL)
         {
+            ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
+
             if (readFile >= 0)
             {
                 close(readFile);
@@ -4296,13 +4299,13 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we are fetching checkpoint, we emit the error message right
+             * now. Otherwise the error is regarded as "end of WAL" and the
+             * message if any is shown as a part of the end-of-WAL message
+             * below.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, EndRecPtr),
+            if (fetching_ckpt && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -4333,11 +4336,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
-        {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
 
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
+        {
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4350,11 +4354,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery,
enteringarchive recovery",
 
+                                         (uint32) (ErrRecPtr >> 32),
+                                         (uint32) ErrRecPtr,
+                                         ThisTimeLineID,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4392,12 +4403,34 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             *  We reached the end of WAL, show the messages just once at the
+             *  same LSN.
+             */
+            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            {
+                char *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg(fmt, (uint32) (EndRecPtr >> 32),
+                                (uint32) EndRecPtr,    ThisTimeLineID,
+                                xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 32f02256ed..9ea1305364 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -682,6 +682,15 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+
+    if (record->xl_tot_len == 0)
+    {
+        /* This is strictly not an invalid state, so phrase it as so. */
+        report_invalid_record(state,
+                              "record length is 0 at %X/%X",
+                              (uint32) (RecPtr >> 32), (uint32) RecPtr);
+        return false;
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 25e0333c9e..da978d4047 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -479,12 +479,15 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
+                                    (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
                                                startpointTLI,
                                                (uint32) (LogstreamResult.Write >> 32), (uint32)
LogstreamResult.Write)));
-                            endofwal = true;
-                            break;
+
+                            /*
+                             * we have no longer anything to do on the broken
+                             * connection other than exiting.
+                             */
+                            proc_exit(1);
                         }
                         len = walrcv_receive(wrconn, &buf, &wait_fd);
                     }
-- 
2.18.2


pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: pg_stat_progress_basebackup - progress reporting forpg_basebackup, in the server side
Next
From: Fujii Masao
Date:
Subject: Re: Wait event that should be reported while waiting for WALarchiving to finish