PANIC during crash recovery of a recently promoted standby - Mailing list pgsql-hackers

From Pavan Deolasee
Subject PANIC during crash recovery of a recently promoted standby
Msg-id CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
Responses Re: PANIC during crash recovery of a recently promoted standby  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Hello,

I recently investigated a problem in which a standby is promoted to be the new master and then crashes shortly thereafter for whatever reason. When crash recovery runs, the promoted standby (now master) PANICs with a message such as:

PANIC,XX000,"WAL contains references to invalid pages",,,,,,,,"XLogCheckInvalidPages, xlogutils.c:242",""

After investigation, I could create a reproduction scenario for this problem. The attached TAP test (thanks to Alvaro for converting my bash script to a TAP test) demonstrates the problem. The test is probably sensitive to timing, but it reproduces the problem consistently, at least at my end. While the original report was for 9.6, I can reproduce it on master, and thus it probably affects all supported releases.

Investigations point to a possible bug where we fail to update minRecoveryPoint after completing the ongoing restart point upon promotion. IMV after promotion the new master must always recover to the end of the WAL to ensure that all changes are applied correctly. But what we have instead is that minRecoveryPoint remains set to a prior location because of this:

    /*
     * Update pg_control, using current time.  Check that it still shows
     * IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
     * this is a quick hack to make sure nothing really bad happens if somehow
     * we get here after the end-of-recovery checkpoint.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
    {
        ControlFile->checkPoint = lastCheckPointRecPtr;
        ControlFile->checkPointCopy = lastCheckPoint;
        ControlFile->time = (pg_time_t) time(NULL);

        /*
         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
         * this will have happened already while writing out dirty buffers,
         * but not necessarily - e.g. because no buffers were dirtied.  We do
         * this because a non-exclusive base backup uses minRecoveryPoint to
         * determine which WAL files must be included in the backup, and the
         * file (or files) containing the checkpoint record must be included,
         * at a minimum. Note that for an ordinary restart of recovery there's
         * no value in having the minimum recovery point any earlier than this
         * anyway, because redo will begin just after the checkpoint record.
         */
        if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
        {
            ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
            ControlFile->minRecoveryPointTLI = lastCheckPoint.ThisTimeLineID;

            /* update local copy */
            minRecoveryPoint = ControlFile->minRecoveryPoint;
            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
        }
        if (flags & CHECKPOINT_IS_SHUTDOWN)
            ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
        UpdateControlFile();
    }
    LWLockRelease(ControlFileLock);
    


After promotion, minRecoveryPoint is only updated (cleared) when the first regular checkpoint completes. If a crash happens before that, crash recovery runs with a stale minRecoveryPoint, which results in the PANIC that we diagnosed. The test case was written to reproduce the issue as reported to us; thus it TRUNCATEs and extends the table at hand after promotion. The crash shortly thereafter leaves the pages in an uninitialised state because the shared buffers have not yet been flushed to disk.

During crash recovery, we see uninitialised pages for the WAL records written before the promotion. These pages are remembered, and we expect to see either a DROP TABLE or a TRUNCATE WAL record before minRecoveryPoint is reached. But since minRecoveryPoint still points to a WAL location prior to the TRUNCATE operation, crash recovery hits minRecoveryPoint before seeing the TRUNCATE WAL record. That results in the PANIC.
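
The sequence described above can be sketched as a toy replay loop. This is a simplified illustration under stated assumptions, not the real recovery code: ToyRecord, TOY_INVALID_LSN and toy_replay_panics are hypothetical names, there is a single relation, and unresolved invalid-page references are merely counted instead of being tracked per page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ToyLsn;                    /* hypothetical WAL location */
#define TOY_INVALID_LSN ((ToyLsn) 0)        /* "no minimum recovery point" */

typedef enum
{
    TOY_PAGE_REF,       /* record touches a page of the relation */
    TOY_TRUNCATE        /* record truncates (or drops) the relation */
} ToyRecKind;

typedef struct
{
    ToyLsn      lsn;
    ToyRecKind  kind;
    bool        page_exists;    /* for TOY_PAGE_REF: is the page usable on disk? */
} ToyRecord;

/*
 * Replays the toy WAL and reports whether the invalid-page check would fail.
 * The check runs once when min_recovery_point is reached (if it is set),
 * and again at the end of the WAL.
 */
bool
toy_replay_panics(const ToyRecord *recs, size_t n, ToyLsn min_recovery_point)
{
    size_t  unresolved = 0;
    bool    checked_at_min = false;

    assert(recs != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (recs[i].kind == TOY_TRUNCATE)
            unresolved = 0;             /* earlier invalid references are forgiven */
        else if (!recs[i].page_exists)
            unresolved++;               /* remember the invalid page reference */

        if (!checked_at_min &&
            min_recovery_point != TOY_INVALID_LSN &&
            recs[i].lsn >= min_recovery_point)
        {
            checked_at_min = true;
            if (unresolved > 0)
                return true;            /* check fails before the TRUNCATE is seen */
        }
    }

    return unresolved > 0;              /* final check at the end of the WAL */
}
```

With invalid page references at toy LSNs 100 and 200 and a truncate at 300, a minimum recovery point of 200 makes the check fail, while an unset one lets replay reach the truncate first.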

I propose that we always clear minRecoveryPoint after promotion to ensure that crash recovery always runs to the end of the WAL if a just-promoted standby crashes before completing its first regular checkpoint. A WIP patch is attached.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
