Re: Make mesage at end-of-recovery less scary. - Mailing list pgsql-hackers

From Ashutosh Sharma
Subject Re: Make mesage at end-of-recovery less scary.
Date
Msg-id CAE9k0P=9gBHdBff2Rhq9pwa9ghNft_aHHSmO51K+EdDnfnkpFQ@mail.gmail.com
Whole thread Raw
In response to Re: Make mesage at end-of-recovery less scary.  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Make mesage at end-of-recovery less scary.
List pgsql-hackers
On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Hi, Ashutosh.
>
> At Tue, 8 Feb 2022 18:35:34 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > Here are some of my review comments on the v11 patch:
>
> Thank you for taking a look on this.
>
> > -                       (errmsg_internal("reached end of WAL in
> > pg_wal, entering archive recovery")));
> > +                       (errmsg_internal("reached end of WAL at %X/%X
> > on timeline %u in %s during crash recovery, entering archive
> > recovery",
> > +                                        LSN_FORMAT_ARGS(ErrRecPtr),
> > +                                        replayTLI,
> > +                                        xlogSourceNames[currentSource])));
> >
> > Why crash recovery? Won't this message get printed even during PITR?
>
> It is in the if-block with the following condition.
>
> >        * If archive recovery was requested, but we were still doing
> >        * crash recovery, switch to archive recovery and retry using the
> >        * offline archive. We have now replayed all the valid WAL in
> >        * pg_wal, so we are presumably now consistent.
> ...
> >    if (!InArchiveRecovery && ArchiveRecoveryRequested)
>
> This means archive-recovery is requested but not started yet. That is,
> we've just finished crash recovery.  The existing comment cited
> together is mentioning that.
>
> At the end of PITR (or archive recovery), the other code works.
>

This is quite understandable, the point here is that the message that
we are emitting says, we have just finished reading the wal files in
the pg_wal directory during crash recovery and are now entering
archive recovery when we are actually doing point-in-time recovery
which seems a bit misleading.

> > /*
> >  * If we haven't emit an error message, we have safely reached the
> >  * end-of-WAL.
> >  */
> > if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
> > {
> >       char       *fmt;
> >
> >       if (StandbyMode)
> >               fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
> >       else if (InArchiveRecovery)
> >               fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
> >       else
> >               fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
>
> The last among the above messages is choosed when archive-recovery is
> not requested at all.
>
> > I just did a PITR and could see these messages in the logfile.
>
> Yeah, the log lines are describing that the server starting with crash
> recovery to run PITR.
>
> > 2022-02-08 18:00:44.367 IST [86185] LOG:  starting point-in-time
> > recovery to WAL location (LSN) "0/5227790"
> > 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not
> > properly shut down; automatic recovery in progress
>
> Well. I guess that the "automatic recovery" is ambiguous.  Does it
> make sense if the second line were like the follows instead?
>
> + 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not properly shut down; crash recovery in progress
>

Well, according to me the current message looks fine.

> > Lastly, are we also planning to backport this patch?
>
> This is apparent a behavioral change, not a bug fix, which I think we
> regard as not appropriate for back-patching.
>
>
> As the result, I made the following chages in the version 11.
>
> 1. Changed the condition for the "end-of-WAL" message from
>    emode_for_corrupt_record to the EndOfWAL flag.
>
> 2. Corrected the wording of end-of-wal to end-of-WAL.
>
> 3. In the "reached end of WAL" message, report the LSN of the
>   beginning of failed record instead of the beginning of the
>   last-succeeded record.
>
> 4. In the changed message in walreceiver.c, I swapped LSN and timeline
>   so that they are in the same order with other similar messages.
>

Thanks for sharing this information.

==

Here is one more comment:

One more comment:

+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;

$chkptfile is declared twice in the same scope. We can probably remove
the first one.

--
With Regards,
Ashutosh Sharma.



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Database-level collation version tracking
Next
From: Alexander Pyhalov
Date:
Subject: Justin Pryzby