Thread: PANIC during crash recovery of a recently promoted standby

From: Pavan Deolasee
Hello,

I recently investigated a problem where a standby is promoted to be the new master. The promoted standby crashes shortly thereafter for whatever reason. Upon running crash recovery, the promoted standby (now master) PANICs with a message such as:

PANIC,XX000,"WAL contains references to invalid pages",,,,,,,,"XLogCheckInvalidPages, xlogutils.c:242",""

After investigation, I could recreate a reproduction scenario for this problem. The attached TAP test (thanks Alvaro for converting my bash script to a TAP test) demonstrates the problem. The test is probably sensitive to timing, but it reproduces the problem consistently at least at my end. While the original report was for 9.6, I can reproduce it on master, and thus it probably affects all supported releases.

Investigations point to a possible bug where we fail to update minRecoveryPoint after completing the ongoing restart point upon promotion. IMV after promotion the new master must always recover to the end of the WAL to ensure that all changes are applied correctly. But what we have instead is that minRecoveryPoint remains set to a prior location because of this:

    /*
     * Update pg_control, using current time.  Check that it still shows
     * IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
     * this is a quick hack to make sure nothing really bad happens if somehow
     * we get here after the end-of-recovery checkpoint.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
    {
        ControlFile->checkPoint = lastCheckPointRecPtr;
        ControlFile->checkPointCopy = lastCheckPoint;
        ControlFile->time = (pg_time_t) time(NULL);

        /*
         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
         * this will have happened already while writing out dirty buffers,
         * but not necessarily - e.g. because no buffers were dirtied.  We do
         * this because a non-exclusive base backup uses minRecoveryPoint to
         * determine which WAL files must be included in the backup, and the
         * file (or files) containing the checkpoint record must be included,
         * at a minimum. Note that for an ordinary restart of recovery there's
         * no value in having the minimum recovery point any earlier than this
         * anyway, because redo will begin just after the checkpoint record.
         */
        if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
        {
            ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
            ControlFile->minRecoveryPointTLI = lastCheckPoint.ThisTimeLineID;

            /* update local copy */
            minRecoveryPoint = ControlFile->minRecoveryPoint;
            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
        }
        if (flags & CHECKPOINT_IS_SHUTDOWN)
            ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
        UpdateControlFile();
    }
    LWLockRelease(ControlFileLock);
    


After promotion, minRecoveryPoint is only updated (cleared) when the first regular checkpoint completes. If a crash happens before that, we will run crash recovery with a stale minRecoveryPoint, which results in the PANIC that we diagnosed. The test case was written to reproduce the issue as reported to us. Thus the test case TRUNCATEs and extends the table at hand after promotion. The crash shortly thereafter leaves the pages in an uninitialised state because the shared buffers have not yet been flushed to disk.

During crash recovery, we see uninitialised pages for the WAL records written before the promotion. These pages are remembered, and we expect to see either a DROP TABLE or a TRUNCATE WAL record before minRecoveryPoint is reached. But since minRecoveryPoint still points to a WAL location prior to the TRUNCATE operation, crash recovery hits minRecoveryPoint before seeing the TRUNCATE WAL record. That results in a PANIC.
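To make the failure sequence concrete, here is a deliberately tiny C model of it. All names and LSNs are invented for illustration; this is not the real xlogutils.c logic, just the shape of the bug: an invalid-page reference is remembered, and recovery stops at the stale minRecoveryPoint before replaying the TRUNCATE record that would have forgotten it.

```c
/*
 * Toy model of the PANIC described above.  All names and LSNs are
 * invented; this is not PostgreSQL code.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define LSN_PAGE_REF        100 /* record touching a page missing on disk */
#define LSN_TRUNCATE        200 /* TRUNCATE record that forgets the ref */
#define STALE_MIN_RECOVERY  150 /* stale minRecoveryPoint from the standby */

static int invalid_page_refs;   /* models the invalid-pages hash table */

static void
replay(XLogRecPtr lsn)
{
    if (lsn == LSN_PAGE_REF)
        invalid_page_refs++;    /* remember the uninitialised page */
    else if (lsn == LSN_TRUNCATE)
        invalid_page_refs = 0;  /* cleanup, as on DROP/TRUNCATE replay */
}

/*
 * Replay until min_recovery_point is reached; return true if leftover
 * invalid-page references would trigger the PANIC.
 */
static bool
recovery_panics(XLogRecPtr min_recovery_point)
{
    XLogRecPtr  stream[] = {LSN_PAGE_REF, LSN_TRUNCATE};

    invalid_page_refs = 0;
    for (int i = 0; i < 2; i++)
    {
        if (stream[i] > min_recovery_point)
            break;              /* "consistency reached", checks fire here */
        replay(stream[i]);
    }
    return invalid_page_refs > 0;
}
```

With the stale value, replay stops between the two records and the leftover reference causes the PANIC; replaying to the end of WAL does not.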

I propose that we should always clear minRecoveryPoint after promotion to ensure that crash recovery always runs to the end if a just-promoted standby crashes before completing its first regular checkpoint. A WIP patch is attached.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

Re: PANIC during crash recovery of a recently promoted standby

From: Michael Paquier
On Thu, May 10, 2018 at 10:52:12AM +0530, Pavan Deolasee wrote:
> I propose that we should always clear the minRecoveryPoint after promotion
> to ensure that crash recovery always run to the end if a just-promoted
> standby crashes before completing its first regular checkpoint. A WIP patch
> is attached.

I have been playing with your patch and upgraded the test to check as
well for cascading standbys.  We could use that in the final patch.
That's definitely something to add in the recovery test suite, and the
sleep phases should be replaced by waits on replay and/or flush.

Still, that approach looks sensitive to me.  A restart point could be
running while the end-of-recovery record is inserted, so your patch
could update minRecoveryPoint to InvalidXLogRecPtr, and then a restart
point would happily update again the control file's minRecoveryPoint to
lastCheckPointEndPtr because it would see that the former is older than
lastCheckPointEndPtr (let's not forget that InvalidXLogRecPtr is 0), so
you could still crash on invalid pages?
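The race can be sketched in miniature (a toy model assuming only that InvalidXLogRecPtr is 0, as noted above; none of this is the real CreateRestartPoint code):

```c
/*
 * Toy model of the race: the WIP patch clears minRecoveryPoint at
 * promotion, but a concurrent restart point can immediately overwrite
 * the cleared (0) value, since 0 compares as older than any real LSN.
 */
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

static XLogRecPtr ctrl_minRecoveryPoint;    /* models the control file */

/* WIP-patch behaviour at end of recovery: clear minRecoveryPoint. */
static void
end_of_recovery_clear(void)
{
    ctrl_minRecoveryPoint = InvalidXLogRecPtr;
}

/* Restart-point-style update: advance whenever the stored value is older. */
static void
restart_point_update(XLogRecPtr lastCheckPointEndPtr)
{
    if (ctrl_minRecoveryPoint < lastCheckPointEndPtr)
        ctrl_minRecoveryPoint = lastCheckPointEndPtr;
}
```

After the clear, a late restart point with any lastCheckPointEndPtr sees 0 as "older" and writes a finite, stale value back, reopening the crash window.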

I need to think a bit more about that stuff, but one idea would be to
use a special state in the control file to mark it as ending recovery,
this way we would control race conditions with restart points.
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From: Alvaro Herrera
Michael Paquier wrote:
> On Thu, May 10, 2018 at 10:52:12AM +0530, Pavan Deolasee wrote:
> > I propose that we should always clear the minRecoveryPoint after promotion
> > to ensure that crash recovery always run to the end if a just-promoted
> > standby crashes before completing its first regular checkpoint. A WIP patch
> > is attached.
> 
> I have been playing with your patch and upgraded the test to check as
> well for cascading standbys.  We could use that in the final patch.
> That's definitely something to add in the recovery test suite, and the
> sleep phases should be replaced by waits on replay and/or flush.
> 
> Still, that approach looks sensitive to me.  A restart point could be
> running while the end-of-recovery record is inserted, so your patch
> could update minRecoveryPoint to InvalidXLogRecPtr, and then a restart
> point would happily update again the control file's minRecoveryPoint to
> lastCheckPointEndPtr because it would see that the former is older than
> lastCheckPointEndPtr (let's not forget that InvalidXLogRecPtr is 0), so
> you could still crash on invalid pages?

Yeah, I had this exact comment, but I was unable to come up with a test
case that would cause a problem.

> I need to think a bit more about that stuff, but one idea would be to
> use a special state in the control file to mark it as ending recovery,
> this way we would control race conditions with restart points.

Hmm.  Can we change the control file in released branches?  (It should
be possible to make the new server understand both old and new formats,
but I think this is breaking new ground and it looks easy to introduce
more bugs there.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PANIC during crash recovery of a recently promoted standby

From: Michael Paquier
On Fri, May 11, 2018 at 12:09:58PM -0300, Alvaro Herrera wrote:
> Yeah, I had this exact comment, but I was unable to come up with a test
> case that would cause a problem.

pg_ctl promote would wait for the control file to be updated, so you
cannot use it in the TAP tests to trigger the promotion.  Still I think
I found one after waking up?  Please note I have not tested it:
- Use a custom trigger file and then trigger promotion with a signal.
- Use a sleep command in recovery_end_command to increase the window, as
what matters is sleeping after CreateEndOfRecoveryRecord updates the
control file.
- Issue a restart point on the standby, which will update the control
file.
- Stop the standby with immediate mode.
- Start the standby, it should see unreferenced pages.

> Hmm.  Can we change the control file in released branches?  (It should
> be possible to make the new server understand both old and new formats,
> but I think this is breaking new ground and it looks easy to introduce
> more bugs there.)

We definitely can't, even if the new value is added at the end of
DBState :(

A couple of wild ideas, not tested, again after waking up:
1) We could also abuse existing values by using the existing
DB_IN_CRASH_RECOVERY or DB_STARTUP.  Still, that's not completely
accurate, as the cluster may be open for business as a hot standby.
2) Invent a new special value for XLogRecPtr, normally impossible to
reach, which uses high bits.
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From: Michael Paquier
On Sat, May 12, 2018 at 07:41:33AM +0900, Michael Paquier wrote:
> pg_ctl promote would wait for the control file to be updated, so you
> cannot use it in the TAP tests to trigger the promotion.  Still I think
> I found one after waking up?  Please note I have not tested it:
> - Use a custom trigger file and then trigger promotion with a signal.
> - Use a sleep command in recovery_end_command to increase the window, as
> what matters is sleeping after CreateEndOfRecoveryRecord updates the
> control file.
> - Issue a restart point on the standby, which will update the control
> file.
> - Stop the standby with immediate mode.
> - Start the standby, it should see unreferenced pages.

I have been looking at that this morning, and actually I have been able
to create a test case where minRecoveryPoint goes backwards using
Pavan's patch.  Using a sleep in recovery_end_command proved not to be
enough, so I had to patch the backend with a couple of sleeps and some
processing, mainly:
- One sleep in CreateRestartPoint to make a restart point wait before
updating the control file, which I set at 5s.
- One sleep just after calling CreateEndOfRecoveryRecord, which has been
set at 20s.
- Trigger an asynchronous checkpoint using IPC::Run::start.
- Trigger a promotion with a trigger file and SIGUSR2 to the
postmaster.

The rest of the test is crafted with some magic wait number and adds
some logs to ease the monitoring of the issue.  In order to get a crash,
I think that you would need to crash the backend after creating the last
WAL records which generate the invalid page references, and also slow
down the last restart point which makes minRecoveryPoint go backwards,
which is err...  Complicated to make deterministic.  Still if you apply
the patch attached you would see log entries on the standby as follows:
2018-05-14 12:24:15.065 JST [17352] LOG:  selected new timeline ID: 2
2018-05-14 12:24:15.074 JST [17352] LOG:  archive recovery complete
2018-05-14 12:24:15.074 JST [17352] WARNING:  CreateEndOfRecoveryRecord: minRecoveryPoint is 0/32C4258 before update
2018-05-14 12:24:15.074 JST [17352] WARNING:  CreateEndOfRecoveryRecord: minRecoveryPoint is 0/0 after update
2018-05-14 12:24:17.067 JST [17353] WARNING:  checkPointCopy.redo = 0/30B3D70, lastCheckPoint.redo = 0/31BC208
2018-05-14 12:24:17.067 JST [17353] WARNING:  CreateRestartPoint: minRecoveryPoint is 0/0 before update, lastCheckPointEndPtr is 0/31BC2B0
2018-05-14 12:24:17.067 JST [17353] WARNING:  CreateRestartPoint: minRecoveryPoint is 0/31BC2B0 after update

So minRecoveryPoint can definitely go backwards here, which is not
good.  Attached is a patch which includes Pavan's fix on top of the test
case I crafted from what is upthread.  You just need to apply it on
HEAD.
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From: Pavan Deolasee


On Fri, May 11, 2018 at 8:39 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Michael Paquier wrote:
> On Thu, May 10, 2018 at 10:52:12AM +0530, Pavan Deolasee wrote:
> > I propose that we should always clear the minRecoveryPoint after promotion
> > to ensure that crash recovery always run to the end if a just-promoted
> > standby crashes before completing its first regular checkpoint. A WIP patch
> > is attached.
>
> I have been playing with your patch and upgraded the test to check as
> well for cascading standbys.  We could use that in the final patch.
> That's definitely something to add in the recovery test suite, and the
> sleep phases should be replaced by waits on replay and/or flush.
>
> Still, that approach looks sensitive to me.  A restart point could be
> running while the end-of-recovery record is inserted, so your patch
> could update minRecoveryPoint to InvalidXLogRecPtr, and then a restart
> point would happily update again the control file's minRecoveryPoint to
> lastCheckPointEndPtr because it would see that the former is older than
> lastCheckPointEndPtr (let's not forget that InvalidXLogRecPtr is 0), so
> you could still crash on invalid pages?

Yeah, I had this exact comment, but I was unable to come up with a test
case that would cause a problem.

Looks like I didn't understand Alvaro's comment when he mentioned it to me off-list. But I now see what Michael and Alvaro mean, and that indeed seems like a problem. I was thinking that the test for (ControlFile->state == DB_IN_ARCHIVE_RECOVERY) would ensure that minRecoveryPoint can't be updated after the standby is promoted. While that's true once the state is DB_IN_PRODUCTION, a restart point may finish after we have written the end-of-recovery record but before we're in production, and thus minRecoveryPoint may again be set.
 

> I need to think a bit more about that stuff, but one idea would be to
> use a special state in the control file to mark it as ending recovery,
> this way we would control race conditions with restart points.

Hmm.  Can we change the control file in released branches?  (It should
be possible to make the new server understand both old and new formats,
but I think this is breaking new ground and it looks easy to introduce
more bugs there.)


Can't we just remember that in shared memory state instead of writing it to the control file? Then, if we've already performed end-of-recovery, we don't update minRecoveryPoint when a restart point completes.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PANIC during crash recovery of a recently promoted standby

From: Michael Paquier
On Mon, May 14, 2018 at 01:14:22PM +0530, Pavan Deolasee wrote:
> Looks like I didn't understand Alvaro's comment when he mentioned it to me
> off-list. But I now see what Michael and Alvaro mean and that indeed seems
> like a problem. I was thinking that the test for (ControlFile->state ==
> DB_IN_ARCHIVE_RECOVERY) will ensure that minRecoveryPoint can't be updated
> after the standby is promoted. While that's true for a DB_IN_PRODUCTION,  the
> RestartPoint may finish after we have written end-of-recovery record, but
> before we're in production and thus the minRecoveryPoint may again be set.

Yeah, this has been something I considered as well at first, but I was
not confident enough that setting minRecoveryPoint to InvalidXLogRecPtr
was actually a safe thing for timeline switches.

So I have spent a good portion of today testing and playing with it to
be confident enough that this was right, and I have finished with the
attached.  The patch adds a new flag to XLogCtl which marks whether the
control file has been updated after the end-of-recovery record has been
written, so that minRecoveryPoint does not get updated because of a
restart point running in parallel.
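A miniature model of how such a flag closes the race described upthread; the field name endOfRecoveryDone is invented here, and the real patch may use different names:

```c
/*
 * Toy model of the proposed fix: a flag in shared state records that
 * the end-of-recovery record has been written, and a late restart
 * point skips its minRecoveryPoint update.  Hypothetical names; not
 * the actual patch.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

static struct
{
    bool endOfRecoveryDone;     /* hypothetical XLogCtl flag */
} XLogCtlModel;

static XLogRecPtr ctrl_minRecoveryPoint;    /* models the control file */

static void
create_end_of_recovery_record(void)
{
    ctrl_minRecoveryPoint = InvalidXLogRecPtr;
    XLogCtlModel.endOfRecoveryDone = true;
}

/* A restart point finishing late now checks the flag first. */
static void
restart_point_update(XLogRecPtr lastCheckPointEndPtr)
{
    if (XLogCtlModel.endOfRecoveryDone)
        return;                 /* promotion already happened: skip */
    if (ctrl_minRecoveryPoint < lastCheckPointEndPtr)
        ctrl_minRecoveryPoint = lastCheckPointEndPtr;
}
```

With the flag set at promotion, the late restart point no longer resurrects a stale minRecoveryPoint.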

I have also reworked the test case you sent, removing the manual sleeps
and replacing them with proper wait points.  There is also no point in
waiting after promotion, as pg_ctl promote implies a wait.  Another
important thing is that you need wal_log_hints = off to see a crash,
whereas allows_streaming actually turns it on.

Comments are welcome.
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From: Kyotaro HORIGUCHI
Hello.

At Thu, 24 May 2018 16:57:07 +0900, Michael Paquier <michael@paquier.xyz> wrote in
<20180524075707.GE15445@paquier.xyz>
> On Mon, May 14, 2018 at 01:14:22PM +0530, Pavan Deolasee wrote:
> > Looks like I didn't understand Alvaro's comment when he mentioned it to me
> > off-list. But I now see what Michael and Alvaro mean and that indeed seems
> > like a problem. I was thinking that the test for (ControlFile->state ==
> > DB_IN_ARCHIVE_RECOVERY) will ensure that minRecoveryPoint can't be updated
> > after the standby is promoted. While that's true for a DB_IN_PRODUCTION,  the
> > RestartPoint may finish after we have written end-of-recovery record, but
> > before we're in production and thus the minRecoveryPoint may again be set.
> 
> Yeah, this has been something I considered as well first, but I was not
> confident enough that setting up minRecoveryPoint to InvalidXLogRecPtr
> was actually a safe thing for timeline switches.
> 
> So I have spent a good portion of today testing and playing with it to
> be confident enough that this was right, and I have finished with the
> attached.  The patch adds a new flag to XLogCtl which marks if the
> control file has been updated after the end-of-recovery record has been
> written, so as minRecoveryPoint does not get updated because of a
> restart point running in parallel.
> 
> I have also reworked the test case you sent, removing the manuals sleeps
> and replacing them with correct wait points.  There is also no point to
> wait after promotion as pg_ctl promote implies a wait.  Another
> important thing is that you need to use wal_log_hints = off to see a 
> crash, which is something that allows_streaming actually enables.
> 
> Comments are welcome.

As the result of some poking around, my diagnosis is different
from yours.

(I believe that) by definition recovery doesn't end until the
end-of-recovery checkpoint ends, so from that viewpoint I think it
is wrong to clear ControlFile->minRecoveryPoint before the end.

Invalid-page checking during crash recovery is harmful rather than
useless. It is done by CheckRecoveryConsistency even in crash
recovery, against its expectation, because there is a case where
minRecoveryPoint is valid but InArchiveRecovery is false. The two
variables there seem to be in contradiction with each other.

The immediate cause of the contradiction is that StartupXLOG wrongly
initializes the local minRecoveryPoint from the control file's value
even under crash recovery. updateMinRecoveryPoint should also be
turned off during crash recovery. The case of a crash after the last
checkpoint's end has been handled in UpdateMinRecoveryPoint, but it
forgot to consider this case, where crash recovery has been started
while the control file is still holding a valid minRecoveryPoint.

Finally, the attached patch seems to fix the issue. It passes
015_promotion.pl, and the problem doesn't happen with
014_promotion_bug.pl.  It also passes the existing tests
002-014.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 192c36bc96135d9108c499d01016f7eb6fc28fd1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 7 Jun 2018 16:02:41 +0900
Subject: [PATCH] Fix wrong behavior during crash recovery

After being killed before the end-of-recovery checkpoint ends, the
server restarts in crash recovery mode. However, it can wrongly perform
invalid-page checks of WAL records, which is not expected during crash
recovery, and then PANIC.

This patch prevents the wrong checking by suppressing needless updates
of minRecoveryPoint during crash recovery.
---
 src/backend/access/transam/xlog.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adbd6a2126..283c26cb6c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -821,8 +821,11 @@ static XLogSource XLogReceiptSource = 0;    /* XLOG_FROM_* code */
 static XLogRecPtr ReadRecPtr;    /* start of last record read */
 static XLogRecPtr EndRecPtr;    /* end+1 of last record read */
 
-static XLogRecPtr minRecoveryPoint; /* local copy of
-                                     * ControlFile->minRecoveryPoint */
+/*
+ * local copy of ControlFile->minRecoveryPoint. InvalidXLogRecPtr means we're
+ * in crash recovery
+ */
+static XLogRecPtr minRecoveryPoint = InvalidXLogRecPtr;
 static TimeLineID minRecoveryPointTLI;
 static bool updateMinRecoveryPoint = true;
 
@@ -2717,14 +2720,7 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
     minRecoveryPoint = ControlFile->minRecoveryPoint;
     minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-    /*
-     * An invalid minRecoveryPoint means that we need to recover all the WAL,
-     * i.e., we're doing crash recovery.  We never modify the control file's
-     * value in that case, so we can short-circuit future checks here too.
-     */
-    if (minRecoveryPoint == 0)
-        updateMinRecoveryPoint = false;
-    else if (force || minRecoveryPoint < lsn)
+    if (force || minRecoveryPoint < lsn)
     {
         XLogRecPtr    newMinRecoveryPoint;
         TimeLineID    newMinRecoveryPointTLI;
@@ -6841,6 +6837,7 @@ StartupXLOG(void)
                                 ControlFile->checkPointCopy.ThisTimeLineID,
                                 recoveryTargetTLI)));
             ControlFile->state = DB_IN_CRASH_RECOVERY;
+            updateMinRecoveryPoint = false;
         }
         ControlFile->checkPoint = checkPointLoc;
         ControlFile->checkPointCopy = checkPoint;
@@ -6852,6 +6849,10 @@ StartupXLOG(void)
                 ControlFile->minRecoveryPoint = checkPoint.redo;
                 ControlFile->minRecoveryPointTLI = checkPoint.ThisTimeLineID;
             }
+
+            /* also initialize our local copies */
+            minRecoveryPoint = ControlFile->minRecoveryPoint;
+            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         }
 
         /*
@@ -6889,10 +6890,6 @@ StartupXLOG(void)
         /* No need to hold ControlFileLock yet, we aren't up far enough */
         UpdateControlFile();
 
-        /* initialize our local copy of minRecoveryPoint */
-        minRecoveryPoint = ControlFile->minRecoveryPoint;
-        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
-
         /*
          * Reset pgstat data, because it may be invalid after recovery.
          */
@@ -7858,6 +7855,8 @@ CheckRecoveryConsistency(void)
     if (XLogRecPtrIsInvalid(minRecoveryPoint))
         return;
 
+    Assert(InArchiveRecovery);
+
     /*
      * assume that we are called in the startup process, and hence don't need
      * a lock to read lastReplayedEndRecPtr
-- 
2.16.3


Re: PANIC during crash recovery of a recently promoted standby

From: Michael Paquier
On Thu, Jun 07, 2018 at 07:58:29PM +0900, Kyotaro HORIGUCHI wrote:
> Invalid-page checking during crash recovery is harmful rather than
> useless. It is done by CheckRecoveryConsistency even in crash
> recovery, against its expectation, because there is a case where
> minRecoveryPoint is valid but InArchiveRecovery is false. The two
> variables there seem to be in contradiction with each other.
>
> The immediate cause of the contradiction is that StartupXLOG wrongly
> initializes the local minRecoveryPoint from the control file's value
> even under crash recovery. updateMinRecoveryPoint should also be
> turned off during crash recovery. The case of a crash after the last
> checkpoint's end has been handled in UpdateMinRecoveryPoint, but it
> forgot to consider this case, where crash recovery has been started
> while the control file is still holding a valid minRecoveryPoint.

Hmm.  This patch looks interesting, and those issues need a very careful
lookup.  This is one of those things which should be fixed as part of
the extra CF, so I am planning to look at it in detail very soon,
perhaps by the end of this week...

I have by the way noticed that nothing was registered in the CF app:
https://commitfest.postgresql.org/18/1680/
I have added as well a bullet point in the open item page of v11 for
older bugs.
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From: Michael Paquier
On Thu, Jun 07, 2018 at 07:58:29PM +0900, Kyotaro HORIGUCHI wrote:
> (I believe that) by definition recovery doesn't end until the
> end-of-recovery checkpoint ends, so from that viewpoint I think it
> is wrong to clear ControlFile->minRecoveryPoint before the end.
>
> Invalid-page checking during crash recovery is harmful rather than
> useless. It is done by CheckRecoveryConsistency even in crash
> recovery, against its expectation, because there is a case where
> minRecoveryPoint is valid but InArchiveRecovery is false. The two
> variables there seem to be in contradiction with each other.
>
> The immediate cause of the contradiction is that StartupXLOG wrongly
> initializes the local minRecoveryPoint from the control file's value
> even under crash recovery. updateMinRecoveryPoint should also be
> turned off during crash recovery. The case of a crash after the last
> checkpoint's end has been handled in UpdateMinRecoveryPoint, but it
> forgot to consider this case, where crash recovery has been started
> while the control file is still holding a valid
> minRecoveryPoint.

I have been digesting your proposal and I reviewed it, and I think that
what you are proposing here is on the right track.  However, the updates
around minRecoveryPoint and minRecoveryPointTLI ought to be more
consistent, because that could cause future issues.  I have spotted two
bugs where I think the problem is not fixed: when replaying a
XLOG_PARAMETER_CHANGE WAL record, minRecoveryPoint and
minRecoveryPointTLI would still get updated from the control file
values, which can still lead to failures as CheckRecoveryConsistency
could still happily trigger a PANIC; so I think that we had better keep
those values consistent as long as crash recovery runs.  And
XLogNeedsFlush() also has a similar problem.

Note that there is as well the case where the startup process switches
from crash recovery to archive recovery, in which case the update of the
local copies has to happen once the switch is done.  Your patch covers
that by just updating updateMinRecoveryPoint each time crash recovery
happens, but that's not completely consistent; it also seems to miss
the fact that updateMinRecoveryPoint needs to be switched back to true,
as the startup process can update the control file.  Actually,
updateMinRecoveryPoint needs to be switched back to true in that case,
or the startup process would not be able to update minRecoveryPoint
when it calls XLogFlush, for example.

The thought of getting rid of updateMinRecoveryPoint entirely has
crossed my mind, but that's not wise, as its default value allows the
checkpointer to update minRecoveryPoint when started, which also has to
happen once the startup process gets out of recovery and tells the
postmaster to start the checkpointer.  For backends as well that's a
sane default.

There is also no point in taking ControlFileLock when checking whether
the local copy of minRecoveryPoint is valid, so this can be
bypassed.

The assertion in CheckRecoveryConsistency is definitely a good idea as
this should only be called by the startup process, so we can keep it.

In the attached, I have fixed the issues I found and added the test case
which should be included in the final commit.  The concept is pretty
simple: keep the local copy of minRecoveryPoint at InvalidXLogRecPtr as
long as crash recovery runs, and switch it back to normal if there is a
switch to archive recovery after crash recovery.
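The idea can be condensed into a toy model (simplified stand-in names for xlog.c's static state; not the actual patch):

```c
/*
 * Toy model of the approach: during crash recovery the local copy of
 * minRecoveryPoint stays invalid and updates are disabled; switching
 * to archive recovery re-reads the control file value and re-enables
 * updates.  Names are simplified stand-ins, not PostgreSQL code.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(p) ((p) == InvalidXLogRecPtr)

static XLogRecPtr localMinRecoveryPoint = InvalidXLogRecPtr;
static bool updateMinRecoveryPoint = false; /* off during crash recovery */

/* Crash recovery turns into archive recovery: pick up the control file. */
static void
switch_to_archive_recovery(XLogRecPtr ctrl_value)
{
    localMinRecoveryPoint = ctrl_value;
    updateMinRecoveryPoint = true;  /* startup process may update again */
}

static void
update_min_recovery_point(XLogRecPtr lsn)
{
    if (!updateMinRecoveryPoint)
        return;                     /* no-op while in crash recovery */
    if (localMinRecoveryPoint < lsn)
        localMinRecoveryPoint = lsn;
}

/* CheckRecoveryConsistency-style guard: nothing to check while invalid. */
static bool
consistency_check_active(void)
{
    return !XLogRecPtrIsInvalid(localMinRecoveryPoint);
}
```

While the local copy is invalid, updates are no-ops and the consistency checks stay off, so crash recovery simply runs to the end of WAL; after a switch to archive recovery the normal behaviour resumes.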

This is not really a complicated patch, but it took a lot of energy from
me over the last couple of days, given the many scenarios to think
about...  So an extra pair of eyes from another committer would be
welcome.  I am letting it cool down for a couple of days now.
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From: Pavan Deolasee


On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier <michael@paquier.xyz> wrote:


This is not really a complicated patch, and it took a lot of energy from
me the last couple of days per the nature of the many scenarios to think
about... 

Thanks for the efforts. It wasn't an easy bug to chase to begin with. So I am not surprised there were additional problems that I missed.
 
So an extra pair of eyes from another committer would be
welcome.  I am letting that cool down for a couple of days now.

I am not a committer, so I don't know if my pair of eyes counts, but FWIW the patch looks good to me except for a couple of minor points.

+/*
+ * Local copies of equivalent fields in the control file.  When running
+ * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
+ * expect to replay all the WAL available, and updateMinRecoveryPoint is
+ * switched to false to prevent any updates while replaying records.
+ * Those values are kept consistent as long as crash recovery runs.
+ */
 static XLogRecPtr minRecoveryPoint; /* local copy of
  * ControlFile->minRecoveryPoint */

The inline comment looks unnecessary now that we have the comment at the top.

 
@@ -4266,6 +4276,12 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
  minRecoveryPoint = ControlFile->minRecoveryPoint;
  minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
+ /*
+ * The startup process can update its local copy of
+ * minRecoveryPoint from that point.
+ */
 
s/that/this
 
Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PANIC during crash recovery of a recently promoted standby

From: Michael Paquier
On Fri, Jun 22, 2018 at 10:08:24AM +0530, Pavan Deolasee wrote:
> On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier <michael@paquier.xyz>
> wrote:
>> So an extra pair of eyes from another committer would be
>> welcome.  I am letting that cool down for a couple of days now.
>
> I am not a committer, so don't know if my pair of eyes count, but FWIW the
> patch looks good to me except couple of minor points.

Thanks for grabbing some time, Pavan.  Any help is welcome!

> +/*
> + * Local copies of equivalent fields in the control file.  When running
> + * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
> + * expect to replay all the WAL available, and updateMinRecoveryPoint is
> + * switched to false to prevent any updates while replaying records.
> + * Those values are kept consistent as long as crash recovery runs.
> + */
>  static XLogRecPtr minRecoveryPoint; /* local copy of
>   * ControlFile->minRecoveryPoint */
>
> The inline comment looks unnecessary now that we have comment at the
> top.

I'll fix that.

> @@ -4266,6 +4276,12 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr
> RecPtr, int emode,
>   minRecoveryPoint = ControlFile->minRecoveryPoint;
>   minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
>
> + /*
> + * The startup process can update its local copy of
> + * minRecoveryPoint from that point.
> + */
>
> s/that/this

This one too.
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From
Kyotaro HORIGUCHI
Date:
Hello, sorry for the absence; I have looked at the second patch.

At Fri, 22 Jun 2018 13:45:21 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180622044521.GC5215@paquier.xyz>
> On Fri, Jun 22, 2018 at 10:08:24AM +0530, Pavan Deolasee wrote:
> > On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier <michael@paquier.xyz>
> > wrote:
> >> So an extra pair of eyes from another committer would be
> >> welcome.  I am letting that cool down for a couple of days now.
> > 
> > I am not a committer, so don't know if my pair of eyes count, but FWIW the
> > patch looks good to me except for a couple of minor points.
> 
> Thanks for grabbing some time, Pavan.  Any help is welcome!

In the previous mail:
> I have spotted two
> bugs where I think the problem is not fixed: when replaying a WAL record
> XLOG_PARAMETER_CHANGE, minRecoveryPoint and minRecoveryPointTLI would
> still get updated from the control file values which can still lead to
> failures as CheckRecoveryConsistency could still happily trigger a
> PANIC, so I think that we had better maintain those values consistent as

The fixes of StartupXLOG, CheckRecoveryConsistency, ReadRecord and
xlog_redo look fine (functionally; remarks mentioned below).

> long as crash recovery runs.  And XLogNeedsFlush() also has a similar
> problem.

Here, on the other hand, this patch turns off
updateMinRecoveryPoint if minRecoveryPoint is invalid when
RecoveryInProgress() == true.  However, RecoveryInProgress() == true
means archive recovery has already started, and minRecoveryPoint
*should* be updated in that case.  Actually minRecoveryPoint is
updated just below.  If this is really the right thing, I think some
explanation of the reason is required here.

In xlog_redo there is still a "minRecoveryPoint != 0", which ought
to be written as "!XLogRecPtrIsInvalid(minRecoveryPoint)".  (It
seems to be the only one.  Double negation is a bit uneasy, but there
are many instances of this kind of coding.)

# I'll be out all of next week.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: PANIC during crash recovery of a recently promoted standby

From
Michael Paquier
Date:
On Fri, Jun 22, 2018 at 02:34:02PM +0900, Kyotaro HORIGUCHI wrote:
> Hello, sorry for the absence; I have looked at the second patch.

Thanks for the review!

> At Fri, 22 Jun 2018 13:45:21 +0900, Michael Paquier
> <michael@paquier.xyz> wrote in <20180622044521.GC5215@paquier.xyz>
>> long as crash recovery runs.  And XLogNeedsFlush() also has a similar
>> problem.
>
> Here, on the other hand, this patch turns off
> updateMinRecoveryPoint if minRecoveryPoint is invalid when
> RecoveryInProgress() == true.  However, RecoveryInProgress() == true
> means archive recovery has already started, and minRecoveryPoint
> *should* be updated in that case.  Actually minRecoveryPoint is
> updated just below.  If this is really the right thing, I think some
> explanation of the reason is required here.

LocalRecoveryInProgress is just a local copy of SharedRecoveryInProgress
so RecoveryInProgress also returns true if crash recovery is running.
But perhaps I am missing what you mean?  The point here is that redo can
call XLogNeedsFlush, no?

> In xlog_redo there is still a "minRecoveryPoint != 0", which ought
> to be written as "!XLogRecPtrIsInvalid(minRecoveryPoint)".  (It
> seems to be the only one.  Double negation is a bit uneasy, but there
> are many instances of this kind of coding.)

It is possible to use directly a comparison with InvalidXLogRecPtr
instead of a double negation.

> # I'll be out all of next week.

Enjoy your vacations!
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From
Michael Paquier
Date:
On Fri, Jun 22, 2018 at 03:25:48PM +0900, Michael Paquier wrote:
> On Fri, Jun 22, 2018 at 02:34:02PM +0900, Kyotaro HORIGUCHI wrote:
>> Hello, sorry for the absence; I have looked at the second patch.
>
> Thanks for the review!

I have been spending some time testing and torturing the patch for all
stable branches, and I have finished with the set of patches attached.

My testing has involved using the TAP suite, where I have roughly
backported the infrastructure in v10 down to older versions, which
required tweaking Makefile.global.in and finding out again that pg_ctl
start has switched to the wait mode by default in 10.

I have spent a bit of time testing this on HEAD, 10 and 9.6.  For 9.5,
9.4 and 9.3 I have reproduced the failure and tested the patch, but I
lacked time to perform more tests.  The patch set for 9.3~9.5 applies
without conflict across the 3 branches.  9.6 has a conflict in a
comment, and v10 had an extra comment conflict.

Feel free to have a look, I am not completely done with this stuff and
I'll work more tomorrow on checking 9.3~9.5.
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From
Michael Paquier
Date:
Adding Heikki and Andres in CC here for awareness..

On Wed, Jun 27, 2018 at 05:29:38PM +0900, Michael Paquier wrote:
> I have spent a bit of time testing this on HEAD, 10 and 9.6.  For 9.5,
> 9.4 and 9.3 I have reproduced the failure and tested the patch, but I
> lacked time to perform more tests.  The patch set for 9.3~9.5 applies
> without conflict across the 3 branches.  9.6 has a conflict in a
> comment, and v10 had an extra comment conflict.
>
> Feel free to have a look, I am not completely done with this stuff and
> I'll work more tomorrow on checking 9.3~9.5.

And I have been able to spend the time I wanted on this patch series,
testing 9.3 to 9.5.  Attached are a couple of patches you can use to
reproduce the failures for all the branches:
- For master and 10, the tests are included in the patch and are
proposed for commit.
- On 9.6, I had to tweak the TAP scripts as pg_ctl start has switched to
use the wait mode by default.
- On 9.5, there is a tweak to src/Makefile.global.in which cleans up
tmp_check, and a couple of GUCs not compatible.
- On 9.4, I had to tweak src/Makefile.global.in so that the temporary
installation path is correct.  Again some GUCs had to be tweaked.
- On 9.3, there is no TAP infrastructure, so I tweaked
src/test/recovery/Makefile to be able to run the tests.

I have also created a bash script which emulates what the TAP test does,
attached as well.  Visibly for timing reasons, I have not been able to
reproduce the problem with it.  Anyway, running (and actually sort of
back-porting) the TAP suite so that the problematic test case can be run
is possible with the sets attached, and it shows the failure, so we can
use that.

Thoughts?  I would love more input about the patch concept.
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From
Kyotaro HORIGUCHI
Date:
Hello.

At Fri, 22 Jun 2018 15:25:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180622062548.GE5215@paquier.xyz>
> On Fri, Jun 22, 2018 at 02:34:02PM +0900, Kyotaro HORIGUCHI wrote:
> > Hello, sorry for the absence; I have looked at the second patch.
> 
> Thanks for the review!
> 
> > At Fri, 22 Jun 2018 13:45:21 +0900, Michael Paquier
> > <michael@paquier.xyz> wrote in <20180622044521.GC5215@paquier.xyz>
> >> long as crash recovery runs.  And XLogNeedsFlush() also has a similar
> >> problem.
> > 
> > Here, on the other hand, this patch turns off
> > updateMinRecoveryPoint if minRecoveryPoint is invalid when
> > RecoveryInProgress() == true.  However, RecoveryInProgress() == true
> > means archive recovery has already started, and minRecoveryPoint
> > *should* be updated in that case.  Actually minRecoveryPoint is
> > updated just below.  If this is really the right thing, I think some
> > explanation of the reason is required here.
> 
> LocalRecoveryInProgress is just a local copy of SharedRecoveryInProgress
> so RecoveryInProgress also returns true if crash recovery is running.
> But perhaps I am missing what you mean?  The point here is that redo can
> call XLogNeedsFlush, no?

My concern at the time was the necessity of turning off
updateMinRecoveryPoint on the fly.  (My previous comment was a bit
confused, sorry.)

When minRecoveryPoint is invalid, there are only two possible cases:
we may be at the very beginning of archive recovery, or we may be
running crash recovery.  In the latter case, we have detected crash
recovery before redo starts, so we can turn off updateMinRecoveryPoint
immediately; no further check is needed and it is (I think) easier to
understand.

> > In xlog_redo there is still a "minRecoveryPoint != 0", which ought
> > to be written as "!XLogRecPtrIsInvalid(minRecoveryPoint)".  (It
> > seems to be the only one.  Double negation is a bit uneasy, but there
> > are many instances of this kind of coding.)
> 
> It is possible to use directly a comparison with InvalidXLogRecPtr
> instead of a double negation.

I'm not sure whether it is an abstraction of the invalid value, or
just a shortcut for the value.  That's right if it's the latter.
(There are several places in the patch where the invalid LSN is
assumed to be smaller than any valid value.)

The second diff is the difference of the first patch from
promote_panic_master.diff.

On further thought: as we confirmed upthread (and as existing
comments say), (minRecoveryPoint == 0) and !InArchiveRecovery are
always equivalent, and updateMinRecoveryPoint becomes equivalent to
them with the v3 patch.  That is, we can just remove the variable,
and the attached third patch does exactly that.  It also passes all
recovery tests including the new 015.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dcfef36591..d86137ae8b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -821,8 +821,14 @@ static XLogSource XLogReceiptSource = 0;    /* XLOG_FROM_* code */
 static XLogRecPtr ReadRecPtr;    /* start of last record read */
 static XLogRecPtr EndRecPtr;    /* end+1 of last record read */
 
-static XLogRecPtr minRecoveryPoint; /* local copy of
-                                     * ControlFile->minRecoveryPoint */
+/*
+ * Local copies of equivalent fields in the control file.  When running
+ * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
+ * expect to replay all the WAL available, and updateMinRecoveryPoint is
+ * switched to false to prevent any updates while replaying records.
+ * Those values are kept consistent as long as crash recovery runs.
+ */
+static XLogRecPtr minRecoveryPoint;
 static TimeLineID minRecoveryPointTLI;
 static bool updateMinRecoveryPoint = true;
 
@@ -2707,7 +2713,7 @@ XLogGetReplicationSlotMinimumLSN(void)
 static void
 UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 {
-    /* Quick check using our local copy of the variable */
+    /* Check using our local copy of minRecoveryPoint */
     if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
         return;
 
@@ -2717,14 +2723,7 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
     minRecoveryPoint = ControlFile->minRecoveryPoint;
     minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-    /*
-     * An invalid minRecoveryPoint means that we need to recover all the WAL,
-     * i.e., we're doing crash recovery.  We never modify the control file's
-     * value in that case, so we can short-circuit future checks here too.
-     */
-    if (minRecoveryPoint == 0)
-        updateMinRecoveryPoint = false;
-    else if (force || minRecoveryPoint < lsn)
+    if (force || minRecoveryPoint < lsn)
     {
         XLogRecPtr    newMinRecoveryPoint;
         TimeLineID    newMinRecoveryPointTLI;
@@ -3110,7 +3109,7 @@ XLogNeedsFlush(XLogRecPtr record)
      */
     if (RecoveryInProgress())
     {
-        /* Quick exit if already known updated */
+        /* Quick exit if already known to be updated or cannot be updated */
         if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
             return false;
 
@@ -3124,20 +3123,8 @@ XLogNeedsFlush(XLogRecPtr record)
         minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         LWLockRelease(ControlFileLock);
 
-        /*
-         * An invalid minRecoveryPoint means that we need to recover all the
-         * WAL, i.e., we're doing crash recovery.  We never modify the control
-         * file's value in that case, so we can short-circuit future checks
-         * here too.
-         */
-        if (minRecoveryPoint == 0)
-            updateMinRecoveryPoint = false;
-
         /* check again */
-        if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
-            return false;
-        else
-            return true;
+        return record > minRecoveryPoint;
     }
 
     /* Quick exit if already known flushed */
@@ -4269,6 +4256,12 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
                 minRecoveryPoint = ControlFile->minRecoveryPoint;
                 minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
+                /*
+                 * The startup process can update its local copy of
+                 * minRecoveryPoint from this point.
+                 */
+                updateMinRecoveryPoint = true;
+
                 UpdateControlFile();
                 LWLockRelease(ControlFileLock);
 
@@ -6892,9 +6885,31 @@ StartupXLOG(void)
         /* No need to hold ControlFileLock yet, we aren't up far enough */
         UpdateControlFile();
 
-        /* initialize our local copy of minRecoveryPoint */
-        minRecoveryPoint = ControlFile->minRecoveryPoint;
-        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        /*
+         * Initialize our local copy of minRecoveryPoint.  When doing crash
+         * recovery we want to replay up to the end of WAL.  Particularly, in
+         * the case of a promoted standby, the minRecoveryPoint value in the
+         * control file is only updated after the first checkpoint.  However,
+         * if the instance crashes before the first post-recovery checkpoint
+         * is completed then recovery will use a stale location causing the
+         * startup process to think that there are still invalid page
+         * references when checking for data consistency.
+         */
+        if (InArchiveRecovery)
+        {
+            minRecoveryPoint = ControlFile->minRecoveryPoint;
+            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        }
+        else
+        {
+            /*
+             * We are to run crash recovery. We shouldn't update
+             * minRecoveryPoint until crash recovery ends.
+             */
+            updateMinRecoveryPoint = false;
+            minRecoveryPoint = InvalidXLogRecPtr;
+            minRecoveryPointTLI = 0;
+        }
 
         /*
          * Reset pgstat data, because it may be invalid after recovery.
@@ -7861,6 +7876,8 @@ CheckRecoveryConsistency(void)
     if (XLogRecPtrIsInvalid(minRecoveryPoint))
         return;
 
+    Assert(InArchiveRecovery);
+
     /*
      * assume that we are called in the startup process, and hence don't need
      * a lock to read lastReplayedEndRecPtr
@@ -9949,11 +9966,16 @@ xlog_redo(XLogReaderState *record)
          * Update minRecoveryPoint to ensure that if recovery is aborted, we
          * recover back up to this point before allowing hot standby again.
          * This is important if the max_* settings are decreased, to ensure
-         * you don't run queries against the WAL preceding the change.
+         * you don't run queries against the WAL preceding the change. The
+         * local copies cannot be updated as long as crash recovery is
+         * happening and we expect all the WAL to be replayed.
          */
-        minRecoveryPoint = ControlFile->minRecoveryPoint;
-        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
-        if (minRecoveryPoint != 0 && minRecoveryPoint < lsn)
+        if (InArchiveRecovery)
+        {
+            minRecoveryPoint = ControlFile->minRecoveryPoint;
+            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        }
+        if (minRecoveryPoint != InvalidXLogRecPtr && minRecoveryPoint < lsn)
         {
             ControlFile->minRecoveryPoint = lsn;
             ControlFile->minRecoveryPointTLI = ThisTimeLineID;
2716c2716
<     /* Quick check using our local copy of the variable */
---
>     /* Check using our local copy of minRecoveryPoint */
2720,2732d2719
<     /*
<      * An invalid minRecoveryPoint means that we need to recover all the WAL,
<      * i.e., we're doing crash recovery.  We never modify the control file's
<      * value in that case, so we can short-circuit future checks here too. The
<      * local values of minRecoveryPoint and minRecoveryPointTLI should not be
<      * updated until crash recovery finishes.
<      */
<     if (XLogRecPtrIsInvalid(minRecoveryPoint))
<     {
<         updateMinRecoveryPoint = false;
<         return;
<     }
< 
3125,3133d3111
<         /*
<          * An invalid minRecoveryPoint means that we need to recover all the
<          * WAL, i.e., we're doing crash recovery.  We never modify the control
<          * file's value in that case, so we can short-circuit future checks
<          * here too.
<          */
<         if (XLogRecPtrIsInvalid(minRecoveryPoint))
<             updateMinRecoveryPoint = false;
< 
6926a6905,6909
>             /*
>              * We are to run crash recovery. We shouldn't update
>              * minRecoveryPoint until crash recovery ends.
>              */
>             updateMinRecoveryPoint = false;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dcfef36591..8a37784653 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -821,10 +821,13 @@ static XLogSource XLogReceiptSource = 0;    /* XLOG_FROM_* code */
 static XLogRecPtr ReadRecPtr;    /* start of last record read */
 static XLogRecPtr EndRecPtr;    /* end+1 of last record read */
 
-static XLogRecPtr minRecoveryPoint; /* local copy of
-                                     * ControlFile->minRecoveryPoint */
+/*
+ * Local copies of equivalent fields in the control file.  When running
+ * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
+ * expect to replay all the WAL available.
+ */
+static XLogRecPtr minRecoveryPoint;
 static TimeLineID minRecoveryPointTLI;
-static bool updateMinRecoveryPoint = true;
 
 /*
  * Have we reached a consistent database state? In crash recovery, we have
@@ -2707,8 +2710,13 @@ XLogGetReplicationSlotMinimumLSN(void)
 static void
 UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 {
-    /* Quick check using our local copy of the variable */
-    if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
+    /*
+     * Quick check using our local copy of minRecoveryPoint.  Invalid
+     * minRecoveryPoint value means we are running crash recovery and don't
+     * need to update it.
+     */
+    if (XLogRecPtrIsInvalid(minRecoveryPoint) ||
+        (!force && lsn <= minRecoveryPoint))
         return;
 
     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
@@ -2717,14 +2725,7 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
     minRecoveryPoint = ControlFile->minRecoveryPoint;
     minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-    /*
-     * An invalid minRecoveryPoint means that we need to recover all the WAL,
-     * i.e., we're doing crash recovery.  We never modify the control file's
-     * value in that case, so we can short-circuit future checks here too.
-     */
-    if (minRecoveryPoint == 0)
-        updateMinRecoveryPoint = false;
-    else if (force || minRecoveryPoint < lsn)
+    if (force || minRecoveryPoint < lsn)
     {
         XLogRecPtr    newMinRecoveryPoint;
         TimeLineID    newMinRecoveryPointTLI;
@@ -3110,8 +3111,8 @@ XLogNeedsFlush(XLogRecPtr record)
      */
     if (RecoveryInProgress())
     {
-        /* Quick exit if already known updated */
-        if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
+        /* Quick exit if already known to be updated or cannot be updated */
+        if (record <= minRecoveryPoint || XLogRecPtrIsInvalid(minRecoveryPoint))
             return false;
 
         /*
@@ -3124,20 +3125,8 @@ XLogNeedsFlush(XLogRecPtr record)
         minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         LWLockRelease(ControlFileLock);
 
-        /*
-         * An invalid minRecoveryPoint means that we need to recover all the
-         * WAL, i.e., we're doing crash recovery.  We never modify the control
-         * file's value in that case, so we can short-circuit future checks
-         * here too.
-         */
-        if (minRecoveryPoint == 0)
-            updateMinRecoveryPoint = false;
-
         /* check again */
-        if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
-            return false;
-        else
-            return true;
+        return record > minRecoveryPoint;
     }
 
     /* Quick exit if already known flushed */
@@ -6892,9 +6881,31 @@ StartupXLOG(void)
         /* No need to hold ControlFileLock yet, we aren't up far enough */
         UpdateControlFile();
 
-        /* initialize our local copy of minRecoveryPoint */
-        minRecoveryPoint = ControlFile->minRecoveryPoint;
-        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        /*
+         * Initialize our local copy of minRecoveryPoint.  When doing crash
+         * recovery we want to replay up to the end of WAL.  Particularly, in
+         * the case of a promoted standby, the minRecoveryPoint value in the
+         * control file is only updated after the first checkpoint.  However,
+         * if the instance crashes before the first post-recovery checkpoint
+         * is completed then recovery will use a stale location causing the
+         * startup process to think that there are still invalid page
+         * references when checking for data consistency.
+         */
+        if (InArchiveRecovery)
+        {
+            minRecoveryPoint = ControlFile->minRecoveryPoint;
+            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        }
+        else
+        {
+            /*
+             * We are to run crash recovery.  Set minRecoveryPoint to an
+             * invalid value, ignoring the shared variable; this inhibits
+             * updating the variable during crash recovery.
+             */
+            minRecoveryPoint = InvalidXLogRecPtr;
+            minRecoveryPointTLI = 0;
+        }
 
         /*
          * Reset pgstat data, because it may be invalid after recovery.
@@ -7861,6 +7872,8 @@ CheckRecoveryConsistency(void)
     if (XLogRecPtrIsInvalid(minRecoveryPoint))
         return;
 
+    Assert(InArchiveRecovery);
+
     /*
      * assume that we are called in the startup process, and hence don't need
      * a lock to read lastReplayedEndRecPtr
@@ -9265,6 +9278,7 @@ CreateRestartPoint(int flags)
          * no value in having the minimum recovery point any earlier than this
          * anyway, because redo will begin just after the checkpoint record.
          */
+        Assert(ControlFile->minRecoveryPoint != InvalidXLogRecPtr);
         if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
         {
             ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -9949,14 +9963,21 @@ xlog_redo(XLogReaderState *record)
          * Update minRecoveryPoint to ensure that if recovery is aborted, we
          * recover back up to this point before allowing hot standby again.
          * This is important if the max_* settings are decreased, to ensure
-         * you don't run queries against the WAL preceding the change.
+         * you don't run queries against the WAL preceding the change.
+         * Neither local nor shared variables can be updated, and we expect
+         * all the WAL to be replayed, while crash recovery is happening.
          */
-        minRecoveryPoint = ControlFile->minRecoveryPoint;
-        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
-        if (minRecoveryPoint != 0 && minRecoveryPoint < lsn)
+        if (InArchiveRecovery)
         {
-            ControlFile->minRecoveryPoint = lsn;
-            ControlFile->minRecoveryPointTLI = ThisTimeLineID;
+            Assert(minRecoveryPoint != InvalidXLogRecPtr);
+            minRecoveryPoint = ControlFile->minRecoveryPoint;
+            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+
+            if (minRecoveryPoint < lsn)
+            {
+                ControlFile->minRecoveryPoint = lsn;
+                ControlFile->minRecoveryPointTLI = ThisTimeLineID;
+            }
         }
 
         CommitTsParameterChange(xlrec.track_commit_timestamp,

Re: PANIC during crash recovery of a recently promoted standby

From
Michael Paquier
Date:
On Mon, Jul 02, 2018 at 04:25:13PM +0900, Kyotaro HORIGUCHI wrote:
> When minRecoveryPoint is invalid, there are only two possible cases:
> we may be at the very beginning of archive recovery, or we may be
> running crash recovery.  In the latter case, we have detected crash
> recovery before redo starts, so we can turn off updateMinRecoveryPoint
> immediately; no further check is needed and it is (I think) easier to
> understand.

Er, you are missing the point that updateMinRecoveryPoint is also used
by processes other than the startup process, like the checkpointer,
which actually need to update minRecoveryPoint and rely on the default
value of updateMinRecoveryPoint, which is true...

I am planning to finish wrapping this patch, hopefully on Wednesday JST
time, or in the worst case on Thursday.  I have had this problem on my
mind for a couple of days now and could not find a case where the
approach taken would cause a problem.  Opinions are welcome.
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From
Michael Paquier
Date:
On Mon, Jul 02, 2018 at 10:41:05PM +0900, Michael Paquier wrote:
> I am planning to finish wrapping this patch, hopefully on Wednesday
> JST time, or in the worst case on Thursday.  I have had this problem
> on my mind for a couple of days now and could not find a case where
> the approach taken would cause a problem.  Opinions are welcome.

Okay, pushed and back-patched.  Thanks to all who participated in the
thread!
--
Michael

Attachment

Re: PANIC during crash recovery of a recently promoted standby

From
Pavan Deolasee
Date:


On Thu, Jul 5, 2018 at 7:20 AM, Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Jul 02, 2018 at 10:41:05PM +0900, Michael Paquier wrote:
> I am planning to finish wrapping this patch, hopefully on Wednesday
> JST time, or in the worst case on Thursday.  I have had this problem
> on my mind for a couple of days now and could not find a case where
> the approach taken would cause a problem.  Opinions are welcome.

Okay, pushed and back-patched.  Thanks to all who participated in the
thread!

Many thanks, Michael, for doing the gruelling work of coming up with a more complete fix and verifying all the cases in the various back branches.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PANIC during crash recovery of a recently promoted standby

From
Michael Paquier
Date:
On Thu, Jul 05, 2018 at 01:03:14PM +0530, Pavan Deolasee wrote:
> Many thanks, Michael, for doing the gruelling work of coming up with a
> more complete fix and verifying all the cases in the various back
> branches.

No problem.  I hope I got the credits right.  If there is anything wrong
please feel free to let me know.
--
Michael

Attachment