Possible corruption by CreateRestartPoint at promotion - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Possible corruption by CreateRestartPoint at promotion
Date
Msg-id 20220316.102444.2193181487576617583.horikyota.ntt@gmail.com
Whole thread Raw
Responses Re: Possible corruption by CreateRestartPoint at promotion  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: Possible corruption by CreateRestartPoint at promotion  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers
Hello,  (Cc:ed Fujii-san)

This is a diverged topic from [1], which is summarized as $SUBJECT.

To recap:

While discussing on additional LSNs in checkpoint log message,
Fujii-san pointed out [2] that there is a case where
CreateRestartPoint leaves unrecoverable database when concurrent
promotion happens. That corruption is "fixed" by the next checkpoint
so it is not a severe corruption.

AFAICS since 9.5, no check(/restart)pionts won't run concurrently with
restartpoint [3].  So I propose to remove the code path as attached.

regards.


[1] https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

[2] https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com

[3] https://www.postgresql.org/message-id/20220222.174401.765586897814316743.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 57823c5690469cf404de9fbdf37f55d09abaedd5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 4 Mar 2022 13:18:30 +0900
Subject: [PATCH v4] Correctly update contfol file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 69 ++++++++++++++++++++-----------
 1 file changed, 44 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ed16f279b1..dd9c564988 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6900,6 +6900,9 @@ CreateRestartPoint(int flags)
     XLogSegNo    _logSegNo;
     TimestampTz xtime;
 
+    /* we don't assume concurrent checkpoint/restartpoint to run */
+    Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
     /* Get a local copy of the last safe checkpoint record. */
     SpinLockAcquire(&XLogCtl->info_lck);
     lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -6965,7 +6968,7 @@ CreateRestartPoint(int flags)
 
     /* Also update the info_lck-protected copy */
     SpinLockAcquire(&XLogCtl->info_lck);
-    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+    XLogCtl->RedoRecPtr = RedoRecPtr;
     SpinLockRelease(&XLogCtl->info_lck);
 
     /*
@@ -6984,7 +6987,10 @@ CreateRestartPoint(int flags)
     /* Update the process title */
     update_checkpoint_display(flags, true, false);
 
-    CheckPointGuts(lastCheckPoint.redo, flags);
+    CheckPointGuts(RedoRecPtr, flags);
+
+    /* Update pg_control */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
     /*
      * Remember the prior checkpoint's redo ptr for
@@ -6992,30 +6998,26 @@ CreateRestartPoint(int flags)
      */
     PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+    Assert (PriorRedoPtr < RedoRecPtr);
+
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
+
     /*
-     * Update pg_control, using current time.  Check that it still shows
-     * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-     * this is a quick hack to make sure nothing really bad happens if somehow
-     * we get here after the end-of-recovery checkpoint.
+     * Ensure minRecoveryPoint is past the checkpoint record while archive
+     * recovery is still ongoing.  Normally, this will have happened already
+     * while writing out dirty buffers, but not necessarily - e.g. because no
+     * buffers were dirtied.  We do this because a non-exclusive base backup
+     * uses minRecoveryPoint to determine which WAL files must be included in
+     * the backup, and the file (or files) containing the checkpoint record
+     * must be included, at a minimum. Note that for an ordinary restart of
+     * recovery there's no value in having the minimum recovery point any
+     * earlier than this anyway, because redo will begin just after the
+     * checkpoint record.  This is a quick hack to make sure nothing really bad
+     * happens if somehow we get here after the end-of-recovery checkpoint.
      */
-    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
     {
-        ControlFile->checkPoint = lastCheckPointRecPtr;
-        ControlFile->checkPointCopy = lastCheckPoint;
-
-        /*
-         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-         * this will have happened already while writing out dirty buffers,
-         * but not necessarily - e.g. because no buffers were dirtied.  We do
-         * this because a non-exclusive base backup uses minRecoveryPoint to
-         * determine which WAL files must be included in the backup, and the
-         * file (or files) containing the checkpoint record must be included,
-         * at a minimum. Note that for an ordinary restart of recovery there's
-         * no value in having the minimum recovery point any earlier than this
-         * anyway, because redo will begin just after the checkpoint record.
-         */
         if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
         {
             ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -7027,8 +7029,25 @@ CreateRestartPoint(int flags)
         }
         if (flags & CHECKPOINT_IS_SHUTDOWN)
             ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-        UpdateControlFile();
     }
+    else
+    {
+        /* recovery mode is not supposed to end during shutdown restartpoint */
+        Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0);
+
+        /*
+         * Aarchive recovery has ended. Crash recovery ever after should
+         * always recover to the end of WAL
+         */
+        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+        ControlFile->minRecoveryPointTLI = 0;
+
+        /* also update local copy */
+        LocalMinRecoveryPoint = InvalidXLogRecPtr;
+        LocalMinRecoveryPointTLI = 0;
+    }
+
+    UpdateControlFile();
     LWLockRelease(ControlFileLock);
 
     /*
@@ -7105,7 +7124,7 @@ CreateRestartPoint(int flags)
     xtime = GetLatestXTime();
     ereport((log_checkpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    LSN_FORMAT_ARGS(lastCheckPoint.redo)),
+                    LSN_FORMAT_ARGS(RedoRecPtr)),
              xtime ? errdetail("Last completed transaction was at log time %s.",
                                timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0

From e983f3d4c2dbeea742aed0ef1e209e7821f6687f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 14 Feb 2022 13:04:33 +0900
Subject: [PATCH v2] Correctly update contfol file at the end of archive recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 73 ++++++++++++++++++++-----------
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6208e123e5..ff4a90eacc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9587,6 +9587,9 @@ CreateRestartPoint(int flags)
     XLogSegNo    _logSegNo;
     TimestampTz xtime;
 
+    /* we don't assume concurrent checkpoint/restartpoint to run */
+    Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
     /* Get a local copy of the last safe checkpoint record. */
     SpinLockAcquire(&XLogCtl->info_lck);
     lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -9653,7 +9656,7 @@ CreateRestartPoint(int flags)
 
     /* Also update the info_lck-protected copy */
     SpinLockAcquire(&XLogCtl->info_lck);
-    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+    XLogCtl->RedoRecPtr = RedoRecPtr;
     SpinLockRelease(&XLogCtl->info_lck);
 
     /*
@@ -9672,7 +9675,10 @@ CreateRestartPoint(int flags)
     /* Update the process title */
     update_checkpoint_display(flags, true, false);
 
-    CheckPointGuts(lastCheckPoint.redo, flags);
+    CheckPointGuts(RedoRecPtr, flags);
+
+    /* Update pg_control */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
     /*
      * Remember the prior checkpoint's redo ptr for
@@ -9680,31 +9686,29 @@ CreateRestartPoint(int flags)
      */
     PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+    Assert (PriorRedoPtr < RedoRecPtr);
+
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
+
+    /* Update control file using current time */
+    ControlFile->time = (pg_time_t) time(NULL);
+
     /*
-     * Update pg_control, using current time.  Check that it still shows
-     * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-     * this is a quick hack to make sure nothing really bad happens if somehow
-     * we get here after the end-of-recovery checkpoint.
+     * Ensure minRecoveryPoint is past the checkpoint record while archive
+     * recovery is still ongoing.  Normally, this will have happened already
+     * while writing out dirty buffers, but not necessarily - e.g. because no
+     * buffers were dirtied.  We do this because a non-exclusive base backup
+     * uses minRecoveryPoint to determine which WAL files must be included in
+     * the backup, and the file (or files) containing the checkpoint record
+     * must be included, at a minimum. Note that for an ordinary restart of
+     * recovery there's no value in having the minimum recovery point any
+     * earlier than this anyway, because redo will begin just after the
+     * checkpoint record.  This is a quick hack to make sure nothing really bad
+     * happens if somehow we get here after the end-of-recovery checkpoint.
      */
-    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
     {
-        ControlFile->checkPoint = lastCheckPointRecPtr;
-        ControlFile->checkPointCopy = lastCheckPoint;
-        ControlFile->time = (pg_time_t) time(NULL);
-
-        /*
-         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-         * this will have happened already while writing out dirty buffers,
-         * but not necessarily - e.g. because no buffers were dirtied.  We do
-         * this because a non-exclusive base backup uses minRecoveryPoint to
-         * determine which WAL files must be included in the backup, and the
-         * file (or files) containing the checkpoint record must be included,
-         * at a minimum. Note that for an ordinary restart of recovery there's
-         * no value in having the minimum recovery point any earlier than this
-         * anyway, because redo will begin just after the checkpoint record.
-         */
         if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
         {
             ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -9716,8 +9720,25 @@ CreateRestartPoint(int flags)
         }
         if (flags & CHECKPOINT_IS_SHUTDOWN)
             ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-        UpdateControlFile();
     }
+    else
+    {
+        /* recovery mode is not supposed to end during shutdown restartpoint */
+        Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0);
+
+        /*
+         * Aarchive recovery has ended. Crash recovery ever after should
+         * always recover to the end of WAL
+         */
+        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+        ControlFile->minRecoveryPointTLI = 0;
+
+        /* also update local copy */
+        minRecoveryPoint = InvalidXLogRecPtr;
+        minRecoveryPointTLI = 0;
+    }
+
+    UpdateControlFile();
     LWLockRelease(ControlFileLock);
 
     /*
@@ -9804,7 +9825,7 @@ CreateRestartPoint(int flags)
     xtime = GetLatestXTime();
     ereport((log_checkpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    LSN_FORMAT_ARGS(lastCheckPoint.redo)),
+                    LSN_FORMAT_ARGS(RedoRecPtr)),
              xtime ? errdetail("Last completed transaction was at log time %s.",
                                timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0

From 13329169b996509a3a853afb9c283c3b27e0eab7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Feb 2022 14:46:41 +0900
Subject: [PATCH v2] Correctly update contfol file at the end of archive 
 recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 73 ++++++++++++++++++++-----------
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3d76fad128..3670ff81e7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9376,6 +9376,9 @@ CreateRestartPoint(int flags)
      */
     LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
 
+    /* we don't assume concurrent checkpoint/restartpoint to run */
+    Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
     /* Get a local copy of the last safe checkpoint record. */
     SpinLockAcquire(&XLogCtl->info_lck);
     lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -9445,7 +9448,7 @@ CreateRestartPoint(int flags)
 
     /* Also update the info_lck-protected copy */
     SpinLockAcquire(&XLogCtl->info_lck);
-    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+    XLogCtl->RedoRecPtr = RedoRecPtr;
     SpinLockRelease(&XLogCtl->info_lck);
 
     /*
@@ -9461,7 +9464,10 @@ CreateRestartPoint(int flags)
     if (log_checkpoints)
         LogCheckpointStart(flags, true);
 
-    CheckPointGuts(lastCheckPoint.redo, flags);
+    CheckPointGuts(RedoRecPtr, flags);
+
+    /* Update pg_control */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
     /*
      * Remember the prior checkpoint's redo ptr for
@@ -9469,31 +9475,29 @@ CreateRestartPoint(int flags)
      */
     PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+    Assert (PriorRedoPtr < RedoRecPtr);
+
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
+
+    /* Update control file using current time */
+    ControlFile->time = (pg_time_t) time(NULL);
+
     /*
-     * Update pg_control, using current time.  Check that it still shows
-     * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-     * this is a quick hack to make sure nothing really bad happens if somehow
-     * we get here after the end-of-recovery checkpoint.
+     * Ensure minRecoveryPoint is past the checkpoint record while archive
+     * recovery is still ongoing.  Normally, this will have happened already
+     * while writing out dirty buffers, but not necessarily - e.g. because no
+     * buffers were dirtied.  We do this because a non-exclusive base backup
+     * uses minRecoveryPoint to determine which WAL files must be included in
+     * the backup, and the file (or files) containing the checkpoint record
+     * must be included, at a minimum. Note that for an ordinary restart of
+     * recovery there's no value in having the minimum recovery point any
+     * earlier than this anyway, because redo will begin just after the
+     * checkpoint record.  This is a quick hack to make sure nothing really bad
+     * happens if somehow we get here after the end-of-recovery checkpoint.
      */
-    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
     {
-        ControlFile->checkPoint = lastCheckPointRecPtr;
-        ControlFile->checkPointCopy = lastCheckPoint;
-        ControlFile->time = (pg_time_t) time(NULL);
-
-        /*
-         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-         * this will have happened already while writing out dirty buffers,
-         * but not necessarily - e.g. because no buffers were dirtied.  We do
-         * this because a non-exclusive base backup uses minRecoveryPoint to
-         * determine which WAL files must be included in the backup, and the
-         * file (or files) containing the checkpoint record must be included,
-         * at a minimum. Note that for an ordinary restart of recovery there's
-         * no value in having the minimum recovery point any earlier than this
-         * anyway, because redo will begin just after the checkpoint record.
-         */
         if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
         {
             ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -9505,8 +9509,25 @@ CreateRestartPoint(int flags)
         }
         if (flags & CHECKPOINT_IS_SHUTDOWN)
             ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-        UpdateControlFile();
     }
+    else
+    {
+        /* recovery mode is not supposed to end during shutdown restartpoint */
+        Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0);
+
+        /*
+         * Aarchive recovery has ended. Crash recovery ever after should
+         * always recover to the end of WAL
+         */
+        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+        ControlFile->minRecoveryPointTLI = 0;
+
+        /* also update local copy */
+        minRecoveryPoint = InvalidXLogRecPtr;
+        minRecoveryPointTLI = 0;
+    }
+
+    UpdateControlFile();
     LWLockRelease(ControlFileLock);
 
     /*
@@ -9590,7 +9611,7 @@ CreateRestartPoint(int flags)
     xtime = GetLatestXTime();
     ereport((log_checkpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
+                    (uint32) (RedoRecPtr >> 32), (uint32) RedoRecPtr),
              xtime ? errdetail("Last completed transaction was at log time %s.",
                                timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0

From c89e2b509723b68897f2af49a154af2a69f0747b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Feb 2022 15:04:00 +0900
Subject: [PATCH v3] Correctly update contfol file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 71 +++++++++++++++++++------------
 1 file changed, 44 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 885558f291..2b2568c475 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9334,7 +9334,7 @@ CreateRestartPoint(int flags)
 
     /* Also update the info_lck-protected copy */
     SpinLockAcquire(&XLogCtl->info_lck);
-    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+    XLogCtl->RedoRecPtr = RedoRecPtr;
     SpinLockRelease(&XLogCtl->info_lck);
 
     /*
@@ -9350,7 +9350,10 @@ CreateRestartPoint(int flags)
     if (log_checkpoints)
         LogCheckpointStart(flags, true);
 
-    CheckPointGuts(lastCheckPoint.redo, flags);
+    CheckPointGuts(RedoRecPtr, flags);
+
+    /* Update pg_control */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
     /*
      * Remember the prior checkpoint's redo ptr for
@@ -9358,31 +9361,28 @@ CreateRestartPoint(int flags)
      */
     PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+    Assert (PriorRedoPtr < RedoRecPtr);
+
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
+
+    /* Update control file using current time */
+    ControlFile->time = (pg_time_t) time(NULL);
+
     /*
-     * Update pg_control, using current time.  Check that it still shows
-     * IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-     * this is a quick hack to make sure nothing really bad happens if somehow
-     * we get here after the end-of-recovery checkpoint.
+     * Ensure minRecoveryPoint is past the checkpoint record while archive
+     * recovery is still ongoing.  Normally, this will have happened already
+     * while writing out dirty buffers, but not necessarily - e.g. because no
+     * buffers were dirtied.  We do this because a non-exclusive base backup
+     * uses minRecoveryPoint to determine which WAL files must be included in
+     * the backup, and the file (or files) containing the checkpoint record
+     * must be included, at a minimum. Note that for an ordinary restart of
+     * recovery there's no value in having the minimum recovery point any
+     * earlier than this anyway, because redo will begin just after the
+     * checkpoint record.
      */
-    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
     {
-        ControlFile->checkPoint = lastCheckPointRecPtr;
-        ControlFile->checkPointCopy = lastCheckPoint;
-        ControlFile->time = (pg_time_t) time(NULL);
-
-        /*
-         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-         * this will have happened already while writing out dirty buffers,
-         * but not necessarily - e.g. because no buffers were dirtied.  We do
-         * this because a non-exclusive base backup uses minRecoveryPoint to
-         * determine which WAL files must be included in the backup, and the
-         * file (or files) containing the checkpoint record must be included,
-         * at a minimum. Note that for an ordinary restart of recovery there's
-         * no value in having the minimum recovery point any earlier than this
-         * anyway, because redo will begin just after the checkpoint record.
-         */
         if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
         {
             ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -9393,9 +9393,26 @@ CreateRestartPoint(int flags)
             minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         }
         if (flags & CHECKPOINT_IS_SHUTDOWN)
-            ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-        UpdateControlFile();
+        ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
     }
+    else
+    {
+        /* recovery mode is not supposed to end during shutdown restartpoint */
+        Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0);
+
+        /*
+         * Aarchive recovery has ended. Crash recovery ever after should
+         * always recover to the end of WAL
+         */
+        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+        ControlFile->minRecoveryPointTLI = 0;
+
+        /* also update local copy */
+        minRecoveryPoint = InvalidXLogRecPtr;
+        minRecoveryPointTLI = 0;
+    }
+
+    UpdateControlFile();
     LWLockRelease(ControlFileLock);
 
     /*
@@ -9470,7 +9487,7 @@ CreateRestartPoint(int flags)
     xtime = GetLatestXTime();
     ereport((log_checkpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
+                    (uint32) (RedoRecPtr >> 32), (uint32) RedoRecPtr),
              xtime ? errdetail("Last completed transaction was at log time %s.",
                                timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0

From 7dd174d165b3639b573bfc47c2e8b2fba61395c5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Feb 2022 16:35:16 +0900
Subject: [PATCH v3] Correctly update contfol file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 73 +++++++++++++++++++------------
 1 file changed, 45 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c64febdb53..9fb66ad7d5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9434,7 +9434,7 @@ CreateRestartPoint(int flags)
 
     /* Also update the info_lck-protected copy */
     SpinLockAcquire(&XLogCtl->info_lck);
-    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+    XLogCtl->RedoRecPtr = RedoRecPtr;
     SpinLockRelease(&XLogCtl->info_lck);
 
     /*
@@ -9450,7 +9450,10 @@ CreateRestartPoint(int flags)
     if (log_checkpoints)
         LogCheckpointStart(flags, true);
 
-    CheckPointGuts(lastCheckPoint.redo, flags);
+    CheckPointGuts(RedoRecPtr, flags);
+
+    /* Update pg_control */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
     /*
      * Remember the prior checkpoint's redo pointer, used later to determine
@@ -9458,32 +9461,29 @@ CreateRestartPoint(int flags)
      */
     PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+    Assert (PriorRedoPtr < RedoRecPtr);
+
+    ControlFile->prevCheckPoint = ControlFile->checkPoint;
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
+
+    /* Update control file using current time */
+    ControlFile->time = (pg_time_t) time(NULL);
+
     /*
-     * Update pg_control, using current time.  Check that it still shows
-     * IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-     * this is a quick hack to make sure nothing really bad happens if somehow
-     * we get here after the end-of-recovery checkpoint.
+     * Ensure minRecoveryPoint is past the checkpoint record while archive
+     * recovery is still running.  Normally, this will have happened already
+     * while writing out dirty buffers, but not necessarily - e.g. because no
+     * buffers were dirtied.  We do this because a non-exclusive base backup
+     * uses minRecoveryPoint to determine which WAL files must be included in
+     * the backup, and the file (or files) containing the checkpoint record
+     * must be included, at a minimum. Note that for an ordinary restart of
+     * recovery there's no value in having the minimum recovery point any
+     * earlier than this anyway, because redo will begin just after the
+     * checkpoint record.
      */
-    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
     {
-        ControlFile->prevCheckPoint = ControlFile->checkPoint;
-        ControlFile->checkPoint = lastCheckPointRecPtr;
-        ControlFile->checkPointCopy = lastCheckPoint;
-        ControlFile->time = (pg_time_t) time(NULL);
-
-        /*
-         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-         * this will have happened already while writing out dirty buffers,
-         * but not necessarily - e.g. because no buffers were dirtied.  We do
-         * this because a non-exclusive base backup uses minRecoveryPoint to
-         * determine which WAL files must be included in the backup, and the
-         * file (or files) containing the checkpoint record must be included,
-         * at a minimum. Note that for an ordinary restart of recovery there's
-         * no value in having the minimum recovery point any earlier than this
-         * anyway, because redo will begin just after the checkpoint record.
-         */
         if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
         {
             ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -9494,9 +9494,26 @@ CreateRestartPoint(int flags)
             minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         }
         if (flags & CHECKPOINT_IS_SHUTDOWN)
-            ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-        UpdateControlFile();
+        ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
     }
+    else
+    {
+        /* recovery mode is not supposed to end during shutdown restartpoint */
+        Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0);
+
+        /*
+         * Aarchive recovery has ended. Crash recovery ever after should
+         * always recover to the end of WAL
+         */
+        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+        ControlFile->minRecoveryPointTLI = 0;
+
+        /* also update local copy */
+        minRecoveryPoint = InvalidXLogRecPtr;
+        minRecoveryPointTLI = 0;
+    }
+
+    UpdateControlFile();
     LWLockRelease(ControlFileLock);
 
     /*
@@ -9579,7 +9596,7 @@ CreateRestartPoint(int flags)
     xtime = GetLatestXTime();
     ereport((log_checkpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
+                    (uint32) (RedoRecPtr >> 32), (uint32) RedoRecPtr),
              xtime ? errdetail("last completed transaction was at log time %s",
                                timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0


pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Skipping logical replication transactions on subscriber side
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Add checkpoint and redo LSN to LogCheckpointEnd log message