Re: Make mesage at end-of-recovery less scary. - Mailing list pgsql-hackers
| From | Kyotaro Horiguchi |
|---|---|
| Subject | Re: Make mesage at end-of-recovery less scary. |
| Date | |
| Msg-id | 20200305.160650.1027844943773682801.horikyota.ntt@gmail.com Whole thread Raw |
| In response to | Re: Make mesage at end-of-recovery less scary. (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
| Responses |
Re: Make mesage at end-of-recovery less scary.
|
| List | pgsql-hackers |
Hello.
I changed the condition from randAccess to fetching_ckpt considering
the discussion in another thread [1]. Then I moved the block that
shows the new messages to more appropriate place.
At Fri, 28 Feb 2020 17:28:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> >
> > Have you considered an error context here? Your patch leads to a bit
> > of duplication with the message a bit down of what you are changing
> > where the end of local pg_wal is reached.
>
> It is a DEBUG message and it is for the time moving from crash
> recovery to archive recovery. I could remove that but decided to leave
> it for tracability.
I modified the message so that it has the same look to the new
messages, but I left it being DEBUG1, since it is just a intermediate
state. We should finally see one of the new three messages.
After the messages changed, another message from wal sender came to
look redundant.
| [20866] LOG: replication terminated by primary server
| [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8.
| [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress
| [20851] LOG: reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode
| [20851] DETAIL: invalid record length at 0/30001C8: wanted 24, got 0
I changed the above to the below, which looks more adequate.
| [24271] LOG: replication terminated by primary server on timeline 1 at 0/3000240.
| [24271] FATAL: could not send end-of-streaming message to primary: no COPY in progress
| [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
| [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0
> > > + * reached the end of WAL. Otherwise something's really wrong and
> > > + * we report just only the errormsg if any. If we don't receive
> >
> > This sentence sounds strange to me. Or you meant "Something is wrong,
> > so use errormsg as report if it is set"?
The message no longer exists.
> > > + (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery",
> >
> > FWIW, you are introducing three times the same typo, in the same
> > word, in three different messages.
>
> They're copy-pasto. I refrained from constructing an error message
> from multiple nonindipendent parts. Are you suggesting to do so?
The tree times repetition of almost same phrases is very unreadable. I
rewrote it in more simple shape.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From 17ee82e5d44dd5a932ed69b8a1ea91a23d170952 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v2] Make End-Of-Recovery error less scary
When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is happening.
Make this message less scary as "reached end of WAL".
---
src/backend/access/transam/xlog.c | 72 ++++++++++++++++++++-------
src/backend/replication/walreceiver.c | 3 +-
2 files changed, 55 insertions(+), 20 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d19408b3be..849cf6fe6b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4282,12 +4282,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
for (;;)
{
char *errormsg;
+ XLogRecPtr ErrRecPtr = InvalidXLogRecPtr;
record = XLogReadRecord(xlogreader, &errormsg);
ReadRecPtr = xlogreader->ReadRecPtr;
EndRecPtr = xlogreader->EndRecPtr;
if (record == NULL)
{
+ ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
+
if (readFile >= 0)
{
close(readFile);
@@ -4295,14 +4298,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
}
/*
- * We only end up here without a message when XLogPageRead()
- * failed - in that case we already logged something. In
- * StandbyMode that only happens if we have been triggered, so we
- * shouldn't loop anymore in that case.
+ * If we are fetching checkpoint, we emit the error message right
+ * now. Otherwise the error is regarded as "end of WAL" and the
+ * message if any is shown as a part of the end-of-WAL message
+ * below.
*/
- if (errormsg)
- ereport(emode_for_corrupt_record(emode, EndRecPtr),
+ if (fetching_ckpt && errormsg)
+ {
+ ereport(emode_for_corrupt_record(emode, ErrRecPtr),
(errmsg_internal("%s", errormsg) /* already translated */ ));
+ }
}
/*
@@ -4332,11 +4337,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
/* Great, got a record */
return record;
}
- else
- {
- /* No valid record available from this source */
- lastSourceFailed = true;
+ /* No valid record available from this source */
+ lastSourceFailed = true;
+
+ if (!fetching_ckpt)
+ {
/*
* If archive recovery was requested, but we were still doing
* crash recovery, switch to archive recovery and retry using the
@@ -4349,11 +4355,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
* we'd have no idea how far we'd have to replay to reach
* consistency. So err on the safe side and give up.
*/
- if (!InArchiveRecovery && ArchiveRecoveryRequested &&
- !fetching_ckpt)
+ if (!InArchiveRecovery && ArchiveRecoveryRequested)
{
+ /*
+ * We don't report this as LOG, since we don't stop recovery
+ * here
+ */
ereport(DEBUG1,
- (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+ (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery,
enteringarchive recovery",
+ (uint32) (ErrRecPtr >> 32),
+ (uint32) ErrRecPtr,
+ ThisTimeLineID,
+ xlogSourceNames[currentSource])));
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -4391,12 +4404,35 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
continue;
}
- /* In standby mode, loop back to retry. Otherwise, give up. */
- if (StandbyMode && !CheckForStandbyTrigger())
- continue;
- else
- return NULL;
+ /*
+ * We reached the end of WAL, show the messages just once at the
+ * same LSN.
+ */
+ if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+ {
+ char *fmt;
+
+ if (StandbyMode)
+ fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+ else if (InArchiveRecovery)
+ fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+ else
+ fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+ ereport(LOG,
+ (errmsg (fmt, (uint32) (EndRecPtr >> 32),
+ (uint32) EndRecPtr,
+ ThisTimeLineID,
+ xlogSourceNames[currentSource]),
+ (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+ }
}
+
+ /* In standby mode, loop back to retry. Otherwise, give up. */
+ if (StandbyMode && !CheckForStandbyTrigger())
+ continue;
+ else
+ return NULL;
}
}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2ab15c3cbb..682dbb4e1f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -478,8 +478,7 @@ WalReceiverMain(void)
else if (len < 0)
{
ereport(LOG,
- (errmsg("replication terminated by primary server"),
- errdetail("End of WAL reached on timeline %u at %X/%X.",
+ (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
startpointTLI,
(uint32) (LogstreamResult.Write >> 32), (uint32)
LogstreamResult.Write)));
endofwal = true;
--
2.18.2
pgsql-hackers by date: