Thread: Possible corruption by CreateRestartPoint at promotion

Possible corruption by CreateRestartPoint at promotion

From
Kyotaro Horiguchi
Date:
Hello,  (Cc:ed Fujii-san)

This is a diverged topic from [1], which is summarized as $SUBJECT.

To recap:

While discussing on additional LSNs in checkpoint log message,
Fujii-san pointed out [2] that there is a case where
CreateRestartPoint leaves unrecoverable database when concurrent
promotion happens. That corruption is "fixed" by the next checkpoint
so it is not a severe corruption.

AFAICS since 9.5, no check(/restart)pionts won't run concurrently with
restartpoint [3].  So I propose to remove the code path as attached.

regards.


[1] https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

[2] https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com

[3] https://www.postgresql.org/message-id/20220222.174401.765586897814316743.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 57823c5690469cf404de9fbdf37f55d09abaedd5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 4 Mar 2022 13:18:30 +0900
Subject: [PATCH v4] Correctly update contfol file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 69 ++++++++++++++++++++-----------
 1 file changed, 44 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ed16f279b1..dd9c564988 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6900,6 +6900,9 @@ CreateRestartPoint(int flags)
     XLogSegNo    _logSegNo;
     TimestampTz xtime;
 
+    /* we don't assume concurrent checkpoint/restartpoint to run */
+    Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
     /* Get a local copy of the last safe checkpoint record. */
     SpinLockAcquire(&XLogCtl->info_lck);
     lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -6965,7 +6968,7 @@ CreateRestartPoint(int flags)
 
     /* Also update the info_lck-protected copy */
     SpinLockAcquire(&XLogCtl->info_lck);
-    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+    XLogCtl->RedoRecPtr = RedoRecPtr;
     SpinLockRelease(&XLogCtl->info_lck);
 
     /*
@@ -6984,7 +6987,10 @@ CreateRestartPoint(int flags)
     /* Update the process title */
     update_checkpoint_display(flags, true, false);
 
-    CheckPointGuts(lastCheckPoint.redo, flags);
+    CheckPointGuts(RedoRecPtr, flags);
+
+    /* Update pg_control */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
     /*
      * Remember the prior checkpoint's redo ptr for
@@ -6992,30 +6998,26 @@ CreateRestartPoint(int flags)
      */
     PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+    Assert (PriorRedoPtr < RedoRecPtr);
+
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
+
     /*
-     * Update pg_control, using current time.  Check that it still shows
-     * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-     * this is a quick hack to make sure nothing really bad happens if somehow
-     * we get here after the end-of-recovery checkpoint.
+     * Ensure minRecoveryPoint is past the checkpoint record while archive
+     * recovery is still ongoing.  Normally, this will have happened already
+     * while writing out dirty buffers, but not necessarily - e.g. because no
+     * buffers were dirtied.  We do this because a non-exclusive base backup
+     * uses minRecoveryPoint to determine which WAL files must be included in
+     * the backup, and the file (or files) containing the checkpoint record
+     * must be included, at a minimum. Note that for an ordinary restart of
+     * recovery there's no value in having the minimum recovery point any
+     * earlier than this anyway, because redo will begin just after the
+     * checkpoint record.  This is a quick hack to make sure nothing really bad
+     * happens if somehow we get here after the end-of-recovery checkpoint.
      */
-    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
     {
-        ControlFile->checkPoint = lastCheckPointRecPtr;
-        ControlFile->checkPointCopy = lastCheckPoint;
-
-        /*
-         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-         * this will have happened already while writing out dirty buffers,
-         * but not necessarily - e.g. because no buffers were dirtied.  We do
-         * this because a non-exclusive base backup uses minRecoveryPoint to
-         * determine which WAL files must be included in the backup, and the
-         * file (or files) containing the checkpoint record must be included,
-         * at a minimum. Note that for an ordinary restart of recovery there's
-         * no value in having the minimum recovery point any earlier than this
-         * anyway, because redo will begin just after the checkpoint record.
-         */
         if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
         {
             ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -7027,8 +7029,25 @@ CreateRestartPoint(int flags)
         }
         if (flags & CHECKPOINT_IS_SHUTDOWN)
             ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-        UpdateControlFile();
     }
+    else
+    {
+        /* recovery mode is not supposed to end during shutdown restartpoint */
+        Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0);
+
+        /*
+         * Aarchive recovery has ended. Crash recovery ever after should
+         * always recover to the end of WAL
+         */
+        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+        ControlFile->minRecoveryPointTLI = 0;
+
+        /* also update local copy */
+        LocalMinRecoveryPoint = InvalidXLogRecPtr;
+        LocalMinRecoveryPointTLI = 0;
+    }
+
+    UpdateControlFile();
     LWLockRelease(ControlFileLock);
 
     /*
@@ -7105,7 +7124,7 @@ CreateRestartPoint(int flags)
     xtime = GetLatestXTime();
     ereport((log_checkpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    LSN_FORMAT_ARGS(lastCheckPoint.redo)),
+                    LSN_FORMAT_ARGS(RedoRecPtr)),
              xtime ? errdetail("Last completed transaction was at log time %s.",
                                timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0

From e983f3d4c2dbeea742aed0ef1e209e7821f6687f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 14 Feb 2022 13:04:33 +0900
Subject: [PATCH v2] Correctly update contfol file at the end of archive recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 73 ++++++++++++++++++++-----------
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6208e123e5..ff4a90eacc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9587,6 +9587,9 @@ CreateRestartPoint(int flags)
     XLogSegNo    _logSegNo;
     TimestampTz xtime;
 
+    /* we don't assume concurrent checkpoint/restartpoint to run */
+    Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
     /* Get a local copy of the last safe checkpoint record. */
     SpinLockAcquire(&XLogCtl->info_lck);
     lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -9653,7 +9656,7 @@ CreateRestartPoint(int flags)
 
     /* Also update the info_lck-protected copy */
     SpinLockAcquire(&XLogCtl->info_lck);
-    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+    XLogCtl->RedoRecPtr = RedoRecPtr;
     SpinLockRelease(&XLogCtl->info_lck);
 
     /*
@@ -9672,7 +9675,10 @@ CreateRestartPoint(int flags)
     /* Update the process title */
     update_checkpoint_display(flags, true, false);
 
-    CheckPointGuts(lastCheckPoint.redo, flags);
+    CheckPointGuts(RedoRecPtr, flags);
+
+    /* Update pg_control */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
     /*
      * Remember the prior checkpoint's redo ptr for
@@ -9680,31 +9686,29 @@ CreateRestartPoint(int flags)
      */
     PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+    Assert (PriorRedoPtr < RedoRecPtr);
+
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
+
+    /* Update control file using current time */
+    ControlFile->time = (pg_time_t) time(NULL);
+
     /*
-     * Update pg_control, using current time.  Check that it still shows
-     * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-     * this is a quick hack to make sure nothing really bad happens if somehow
-     * we get here after the end-of-recovery checkpoint.
+     * Ensure minRecoveryPoint is past the checkpoint record while archive
+     * recovery is still ongoing.  Normally, this will have happened already
+     * while writing out dirty buffers, but not necessarily - e.g. because no
+     * buffers were dirtied.  We do this because a non-exclusive base backup
+     * uses minRecoveryPoint to determine which WAL files must be included in
+     * the backup, and the file (or files) containing the checkpoint record
+     * must be included, at a minimum. Note that for an ordinary restart of
+     * recovery there's no value in having the minimum recovery point any
+     * earlier than this anyway, because redo will begin just after the
+     * checkpoint record.  This is a quick hack to make sure nothing really bad
+     * happens if somehow we get here after the end-of-recovery checkpoint.
      */
-    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
     {
-        ControlFile->checkPoint = lastCheckPointRecPtr;
-        ControlFile->checkPointCopy = lastCheckPoint;
-        ControlFile->time = (pg_time_t) time(NULL);
-
-        /*
-         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-         * this will have happened already while writing out dirty buffers,
-         * but not necessarily - e.g. because no buffers were dirtied.  We do
-         * this because a non-exclusive base backup uses minRecoveryPoint to
-         * determine which WAL files must be included in the backup, and the
-         * file (or files) containing the checkpoint record must be included,
-         * at a minimum. Note that for an ordinary restart of recovery there's
-         * no value in having the minimum recovery point any earlier than this
-         * anyway, because redo will begin just after the checkpoint record.
-         */
         if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
         {
             ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -9716,8 +9720,25 @@ CreateRestartPoint(int flags)
         }
         if (flags & CHECKPOINT_IS_SHUTDOWN)
             ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-        UpdateControlFile();
     }
+    else
+    {
+        /* recovery mode is not supposed to end during shutdown restartpoint */
+        Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0);
+
+        /*
+         * Aarchive recovery has ended. Crash recovery ever after should
+         * always recover to the end of WAL
+         */
+        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+        ControlFile->minRecoveryPointTLI = 0;
+
+        /* also update local copy */
+        minRecoveryPoint = InvalidXLogRecPtr;
+        minRecoveryPointTLI = 0;
+    }
+
+    UpdateControlFile();
     LWLockRelease(ControlFileLock);
 
     /*
@@ -9804,7 +9825,7 @@ CreateRestartPoint(int flags)
     xtime = GetLatestXTime();
     ereport((log_checkpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    LSN_FORMAT_ARGS(lastCheckPoint.redo)),
+                    LSN_FORMAT_ARGS(RedoRecPtr)),
              xtime ? errdetail("Last completed transaction was at log time %s.",
                                timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0

From 13329169b996509a3a853afb9c283c3b27e0eab7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Feb 2022 14:46:41 +0900
Subject: [PATCH v2] Correctly update contfol file at the end of archive 
 recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 73 ++++++++++++++++++++-----------
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3d76fad128..3670ff81e7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9376,6 +9376,9 @@ CreateRestartPoint(int flags)
      */
     LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
 
+    /* we don't assume concurrent checkpoint/restartpoint to run */
+    Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
     /* Get a local copy of the last safe checkpoint record. */
     SpinLockAcquire(&XLogCtl->info_lck);
     lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -9445,7 +9448,7 @@ CreateRestartPoint(int flags)
 
     /* Also update the info_lck-protected copy */
     SpinLockAcquire(&XLogCtl->info_lck);
-    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+    XLogCtl->RedoRecPtr = RedoRecPtr;
     SpinLockRelease(&XLogCtl->info_lck);
 
     /*
@@ -9461,7 +9464,10 @@ CreateRestartPoint(int flags)
     if (log_checkpoints)
         LogCheckpointStart(flags, true);
 
-    CheckPointGuts(lastCheckPoint.redo, flags);
+    CheckPointGuts(RedoRecPtr, flags);
+
+    /* Update pg_control */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
     /*
      * Remember the prior checkpoint's redo ptr for
@@ -9469,31 +9475,29 @@ CreateRestartPoint(int flags)
      */
     PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+    Assert (PriorRedoPtr < RedoRecPtr);
+
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
+
+    /* Update control file using current time */
+    ControlFile->time = (pg_time_t) time(NULL);
+
     /*
-     * Update pg_control, using current time.  Check that it still shows
-     * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-     * this is a quick hack to make sure nothing really bad happens if somehow
-     * we get here after the end-of-recovery checkpoint.
+     * Ensure minRecoveryPoint is past the checkpoint record while archive
+     * recovery is still ongoing.  Normally, this will have happened already
+     * while writing out dirty buffers, but not necessarily - e.g. because no
+     * buffers were dirtied.  We do this because a non-exclusive base backup
+     * uses minRecoveryPoint to determine which WAL files must be included in
+     * the backup, and the file (or files) containing the checkpoint record
+     * must be included, at a minimum. Note that for an ordinary restart of
+     * recovery there's no value in having the minimum recovery point any
+     * earlier than this anyway, because redo will begin just after the
+     * checkpoint record.  This is a quick hack to make sure nothing really bad
+     * happens if somehow we get here after the end-of-recovery checkpoint.
      */
-    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
     {
-        ControlFile->checkPoint = lastCheckPointRecPtr;
-        ControlFile->checkPointCopy = lastCheckPoint;
-        ControlFile->time = (pg_time_t) time(NULL);
-
-        /*
-         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-         * this will have happened already while writing out dirty buffers,
-         * but not necessarily - e.g. because no buffers were dirtied.  We do
-         * this because a non-exclusive base backup uses minRecoveryPoint to
-         * determine which WAL files must be included in the backup, and the
-         * file (or files) containing the checkpoint record must be included,
-         * at a minimum. Note that for an ordinary restart of recovery there's
-         * no value in having the minimum recovery point any earlier than this
-         * anyway, because redo will begin just after the checkpoint record.
-         */
         if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
         {
             ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -9505,8 +9509,25 @@ CreateRestartPoint(int flags)
         }
         if (flags & CHECKPOINT_IS_SHUTDOWN)
             ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-        UpdateControlFile();
     }
+    else
+    {
+        /* recovery mode is not supposed to end during shutdown restartpoint */
+        Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0);
+
+        /*
+         * Aarchive recovery has ended. Crash recovery ever after should
+         * always recover to the end of WAL
+         */
+        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+        ControlFile->minRecoveryPointTLI = 0;
+
+        /* also update local copy */
+        minRecoveryPoint = InvalidXLogRecPtr;
+        minRecoveryPointTLI = 0;
+    }
+
+    UpdateControlFile();
     LWLockRelease(ControlFileLock);
 
     /*
@@ -9590,7 +9611,7 @@ CreateRestartPoint(int flags)
     xtime = GetLatestXTime();
     ereport((log_checkpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
+                    (uint32) (RedoRecPtr >> 32), (uint32) RedoRecPtr),
              xtime ? errdetail("Last completed transaction was at log time %s.",
                                timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0

From c89e2b509723b68897f2af49a154af2a69f0747b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Feb 2022 15:04:00 +0900
Subject: [PATCH v3] Correctly update contfol file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 71 +++++++++++++++++++------------
 1 file changed, 44 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 885558f291..2b2568c475 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9334,7 +9334,7 @@ CreateRestartPoint(int flags)
 
     /* Also update the info_lck-protected copy */
     SpinLockAcquire(&XLogCtl->info_lck);
-    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+    XLogCtl->RedoRecPtr = RedoRecPtr;
     SpinLockRelease(&XLogCtl->info_lck);
 
     /*
@@ -9350,7 +9350,10 @@ CreateRestartPoint(int flags)
     if (log_checkpoints)
         LogCheckpointStart(flags, true);
 
-    CheckPointGuts(lastCheckPoint.redo, flags);
+    CheckPointGuts(RedoRecPtr, flags);
+
+    /* Update pg_control */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
     /*
      * Remember the prior checkpoint's redo ptr for
@@ -9358,31 +9361,28 @@ CreateRestartPoint(int flags)
      */
     PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+    Assert (PriorRedoPtr < RedoRecPtr);
+
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
+
+    /* Update control file using current time */
+    ControlFile->time = (pg_time_t) time(NULL);
+
     /*
-     * Update pg_control, using current time.  Check that it still shows
-     * IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-     * this is a quick hack to make sure nothing really bad happens if somehow
-     * we get here after the end-of-recovery checkpoint.
+     * Ensure minRecoveryPoint is past the checkpoint record while archive
+     * recovery is still ongoing.  Normally, this will have happened already
+     * while writing out dirty buffers, but not necessarily - e.g. because no
+     * buffers were dirtied.  We do this because a non-exclusive base backup
+     * uses minRecoveryPoint to determine which WAL files must be included in
+     * the backup, and the file (or files) containing the checkpoint record
+     * must be included, at a minimum. Note that for an ordinary restart of
+     * recovery there's no value in having the minimum recovery point any
+     * earlier than this anyway, because redo will begin just after the
+     * checkpoint record.
      */
-    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
     {
-        ControlFile->checkPoint = lastCheckPointRecPtr;
-        ControlFile->checkPointCopy = lastCheckPoint;
-        ControlFile->time = (pg_time_t) time(NULL);
-
-        /*
-         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-         * this will have happened already while writing out dirty buffers,
-         * but not necessarily - e.g. because no buffers were dirtied.  We do
-         * this because a non-exclusive base backup uses minRecoveryPoint to
-         * determine which WAL files must be included in the backup, and the
-         * file (or files) containing the checkpoint record must be included,
-         * at a minimum. Note that for an ordinary restart of recovery there's
-         * no value in having the minimum recovery point any earlier than this
-         * anyway, because redo will begin just after the checkpoint record.
-         */
         if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
         {
             ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -9393,9 +9393,26 @@ CreateRestartPoint(int flags)
             minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         }
         if (flags & CHECKPOINT_IS_SHUTDOWN)
-            ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-        UpdateControlFile();
+        ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
     }
+    else
+    {
+        /* recovery mode is not supposed to end during shutdown restartpoint */
+        Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0);
+
+        /*
+         * Aarchive recovery has ended. Crash recovery ever after should
+         * always recover to the end of WAL
+         */
+        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+        ControlFile->minRecoveryPointTLI = 0;
+
+        /* also update local copy */
+        minRecoveryPoint = InvalidXLogRecPtr;
+        minRecoveryPointTLI = 0;
+    }
+
+    UpdateControlFile();
     LWLockRelease(ControlFileLock);
 
     /*
@@ -9470,7 +9487,7 @@ CreateRestartPoint(int flags)
     xtime = GetLatestXTime();
     ereport((log_checkpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
+                    (uint32) (RedoRecPtr >> 32), (uint32) RedoRecPtr),
              xtime ? errdetail("Last completed transaction was at log time %s.",
                                timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0

From 7dd174d165b3639b573bfc47c2e8b2fba61395c5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Feb 2022 16:35:16 +0900
Subject: [PATCH v3] Correctly update contfol file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 73 +++++++++++++++++++------------
 1 file changed, 45 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c64febdb53..9fb66ad7d5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9434,7 +9434,7 @@ CreateRestartPoint(int flags)
 
     /* Also update the info_lck-protected copy */
     SpinLockAcquire(&XLogCtl->info_lck);
-    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+    XLogCtl->RedoRecPtr = RedoRecPtr;
     SpinLockRelease(&XLogCtl->info_lck);
 
     /*
@@ -9450,7 +9450,10 @@ CreateRestartPoint(int flags)
     if (log_checkpoints)
         LogCheckpointStart(flags, true);
 
-    CheckPointGuts(lastCheckPoint.redo, flags);
+    CheckPointGuts(RedoRecPtr, flags);
+
+    /* Update pg_control */
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
     /*
      * Remember the prior checkpoint's redo pointer, used later to determine
@@ -9458,32 +9461,29 @@ CreateRestartPoint(int flags)
      */
     PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+    Assert (PriorRedoPtr < RedoRecPtr);
+
+    ControlFile->prevCheckPoint = ControlFile->checkPoint;
+    ControlFile->checkPoint = lastCheckPointRecPtr;
+    ControlFile->checkPointCopy = lastCheckPoint;
+
+    /* Update control file using current time */
+    ControlFile->time = (pg_time_t) time(NULL);
+
     /*
-     * Update pg_control, using current time.  Check that it still shows
-     * IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-     * this is a quick hack to make sure nothing really bad happens if somehow
-     * we get here after the end-of-recovery checkpoint.
+     * Ensure minRecoveryPoint is past the checkpoint record while archive
+     * recovery is still running.  Normally, this will have happened already
+     * while writing out dirty buffers, but not necessarily - e.g. because no
+     * buffers were dirtied.  We do this because a non-exclusive base backup
+     * uses minRecoveryPoint to determine which WAL files must be included in
+     * the backup, and the file (or files) containing the checkpoint record
+     * must be included, at a minimum. Note that for an ordinary restart of
+     * recovery there's no value in having the minimum recovery point any
+     * earlier than this anyway, because redo will begin just after the
+     * checkpoint record.
      */
-    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
     {
-        ControlFile->prevCheckPoint = ControlFile->checkPoint;
-        ControlFile->checkPoint = lastCheckPointRecPtr;
-        ControlFile->checkPointCopy = lastCheckPoint;
-        ControlFile->time = (pg_time_t) time(NULL);
-
-        /*
-         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-         * this will have happened already while writing out dirty buffers,
-         * but not necessarily - e.g. because no buffers were dirtied.  We do
-         * this because a non-exclusive base backup uses minRecoveryPoint to
-         * determine which WAL files must be included in the backup, and the
-         * file (or files) containing the checkpoint record must be included,
-         * at a minimum. Note that for an ordinary restart of recovery there's
-         * no value in having the minimum recovery point any earlier than this
-         * anyway, because redo will begin just after the checkpoint record.
-         */
         if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
         {
             ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -9494,9 +9494,26 @@ CreateRestartPoint(int flags)
             minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         }
         if (flags & CHECKPOINT_IS_SHUTDOWN)
-            ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-        UpdateControlFile();
+        ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
     }
+    else
+    {
+        /* recovery mode is not supposed to end during shutdown restartpoint */
+        Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0);
+
+        /*
+         * Aarchive recovery has ended. Crash recovery ever after should
+         * always recover to the end of WAL
+         */
+        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+        ControlFile->minRecoveryPointTLI = 0;
+
+        /* also update local copy */
+        minRecoveryPoint = InvalidXLogRecPtr;
+        minRecoveryPointTLI = 0;
+    }
+
+    UpdateControlFile();
     LWLockRelease(ControlFileLock);
 
     /*
@@ -9579,7 +9596,7 @@ CreateRestartPoint(int flags)
     xtime = GetLatestXTime();
     ereport((log_checkpoints ? LOG : DEBUG2),
             (errmsg("recovery restart point at %X/%X",
-                    (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
+                    (uint32) (RedoRecPtr >> 32), (uint32) RedoRecPtr),
              xtime ? errdetail("last completed transaction was at log time %s",
                                timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0


Re: Possible corruption by CreateRestartPoint at promotion

From
Kyotaro Horiguchi
Date:
Just for the record.

An instance of the corruption showed up in this mailing list [1].

[1] https://www.postgresql.org/message-id/flat/9EB4CF63-1107-470E-B5A4-061FB9EF8CC8%40outlook.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Possible corruption by CreateRestartPoint at promotion

From
Nathan Bossart
Date:
On Wed, Mar 16, 2022 at 10:24:44AM +0900, Kyotaro Horiguchi wrote:
> While discussing on additional LSNs in checkpoint log message,
> Fujii-san pointed out [2] that there is a case where
> CreateRestartPoint leaves unrecoverable database when concurrent
> promotion happens. That corruption is "fixed" by the next checkpoint
> so it is not a severe corruption.

I suspect we'll start seeing this problem more often once end-of-recovery
checkpoints are removed [0].  Would you mind creating a commitfest entry
for this thread?  I didn't see one.

> AFAICS since 9.5, no check(/restart)pionts won't run concurrently with
> restartpoint [3].  So I propose to remove the code path as attached.

Yeah, this "quick hack" has been around for some time (2de48a8), and I
believe much has changed since then, so something like what you're
proposing is probably the right thing to do.

>      /* Also update the info_lck-protected copy */
>      SpinLockAcquire(&XLogCtl->info_lck);
> -    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
> +    XLogCtl->RedoRecPtr = RedoRecPtr;
>      SpinLockRelease(&XLogCtl->info_lck);
>  
>      /*
> @@ -6984,7 +6987,10 @@ CreateRestartPoint(int flags)
>      /* Update the process title */
>      update_checkpoint_display(flags, true, false);
>  
> -    CheckPointGuts(lastCheckPoint.redo, flags);
> +    CheckPointGuts(RedoRecPtr, flags);

I don't understand the purpose of these changes.  Are these related to the
fix, or is this just tidying up?

[0] https://postgr.es/m/CA%2BTgmoY%2BSJLTjma4Hfn1sA7S6CZAgbihYd%3DKzO6srd7Ut%3DXVBQ%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Possible corruption by CreateRestartPoint at promotion

From
Kyotaro Horiguchi
Date:
At Tue, 26 Apr 2022 11:33:49 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> On Wed, Mar 16, 2022 at 10:24:44AM +0900, Kyotaro Horiguchi wrote:
> > While discussing on additional LSNs in checkpoint log message,
> > Fujii-san pointed out [2] that there is a case where
> > CreateRestartPoint leaves unrecoverable database when concurrent
> > promotion happens. That corruption is "fixed" by the next checkpoint
> > so it is not a severe corruption.
> 
> I suspect we'll start seeing this problem more often once end-of-recovery
> checkpoints are removed [0].  Would you mind creating a commitfest entry
> for this thread?  I didn't see one.

I'm not sure the patch makes any change here, because restart points
don't run while crash recovery, since no checkpoint records seen
during a crash recovery.  Anyway the patch doesn't apply anymore so
rebased, but only the one for master for the lack of time for now.

> > AFAICS since 9.5, no check(/restart)pionts won't run concurrently with
> > restartpoint [3].  So I propose to remove the code path as attached.
> 
> Yeah, this "quick hack" has been around for some time (2de48a8), and I
> believe much has changed since then, so something like what you're
> proposing is probably the right thing to do.

Thanks for checking!

> >      /* Also update the info_lck-protected copy */
> >      SpinLockAcquire(&XLogCtl->info_lck);
> > -    XLogCtl->RedoRecPtr = lastCheckPoint.redo;
> > +    XLogCtl->RedoRecPtr = RedoRecPtr;
> >      SpinLockRelease(&XLogCtl->info_lck);
> >  
> >      /*
> > @@ -6984,7 +6987,10 @@ CreateRestartPoint(int flags)
> >      /* Update the process title */
> >      update_checkpoint_display(flags, true, false);
> >  
> > -    CheckPointGuts(lastCheckPoint.redo, flags);
> > +    CheckPointGuts(RedoRecPtr, flags);
> 
> I don't understand the purpose of these changes.  Are these related to the
> fix, or is this just tidying up?

The latter, since the mixed use of two not-guaranteed-to-be-same
variables at the same time for the same purpose made me perplexed (but
I feel the change can hardly incorporated alone). However, you're
right that it is irrelevant to the fix, so removed including other
instances of the same.

> [0] https://postgr.es/m/CA%2BTgmoY%2BSJLTjma4Hfn1sA7S6CZAgbihYd%3DKzO6srd7Ut%3DXVBQ%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: Possible corruption by CreateRestartPoint at promotion

From
Nathan Bossart
Date:
On Wed, Apr 27, 2022 at 10:43:53AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 26 Apr 2022 11:33:49 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
>> I suspect we'll start seeing this problem more often once end-of-recovery
>> checkpoints are removed [0].  Would you mind creating a commitfest entry
>> for this thread?  I didn't see one.
> 
> I'm not sure the patch makes any change here, because restart points
> don't run while crash recovery, since no checkpoint records seen
> during a crash recovery.  Anyway the patch doesn't apply anymore so
> rebased, but only the one for master for the lack of time for now.

Thanks for the new patch!  Yeah, it wouldn't affect crash recovery, but
IIUC Robert's patch also applies to archive recovery.

> +    /* Update pg_control */
> +    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +
>      /*
>       * Remember the prior checkpoint's redo ptr for
>       * UpdateCheckPointDistanceEstimate()
>       */
>      PriorRedoPtr = ControlFile->checkPointCopy.redo;

nitpick: Why move the LWLockAcquire() all the way up here?

> +    Assert (PriorRedoPtr < RedoRecPtr);

I think this could use a short explanation.

> +     * Ensure minRecoveryPoint is past the checkpoint record while archive
> +     * recovery is still ongoing.  Normally, this will have happened already
> +     * while writing out dirty buffers, but not necessarily - e.g. because no
> +     * buffers were dirtied.  We do this because a non-exclusive base backup
> +     * uses minRecoveryPoint to determine which WAL files must be included in
> +     * the backup, and the file (or files) containing the checkpoint record
> +     * must be included, at a minimum. Note that for an ordinary restart of
> +     * recovery there's no value in having the minimum recovery point any
> +     * earlier than this anyway, because redo will begin just after the
> +     * checkpoint record.

nitpick: Since exclusive backup mode is now removed, we don't need to
specify that the base backup is non-exclusive.

> +        /*
> +         * Aarchive recovery has ended. Crash recovery ever after should
> +         * always recover to the end of WAL
> +         */
> +        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
> +        ControlFile->minRecoveryPointTLI = 0;
> +
> +        /* also update local copy */
> +        LocalMinRecoveryPoint = InvalidXLogRecPtr;
> +        LocalMinRecoveryPointTLI = 0;

Should this be handled by the code that changes the control file state to
DB_IN_PRODUCTION instead?  It looks like this is ordinarily done in the
next checkpoint.  It's not clear to me why it is done this way.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re:Possible corruption by CreateRestartPoint at promotion

From
"Rui Zhao"
Date:
Kyotaro's patch seems good to me and fixes the test case in my patch.
Do you have interest in adding a test like one in my patch?

> +	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +
> 	/*
> 	 * Remember the prior checkpoint's redo ptr for
> 	 * UpdateCheckPointDistanceEstimate()
> 	 */
> 	PriorRedoPtr = ControlFile->checkPointCopy.redo;
> 
> +	Assert (PriorRedoPtr < RedoRecPtr);
Maybe PriorRedoPtr does not need to be under LWLockAcquire?

regards.
-- 
Zhao Rui
Alibaba Cloud: https://www.aliyun.com/
------------------ Original ------------------
From: "Kyotaro Horiguchi" <horikyota.ntt@gmail.com>;
Date: Wed, Mar 16, 2022 09:24 AM
To: "pgsql-hackers"<pgsql-hackers@lists.postgresql.org>;
Cc: "masao.fujii"<masao.fujii@oss.nttdata.com>;
Subject: Possible corruption by CreateRestartPoint at promotion

Hello,  (Cc:ed Fujii-san)

This is a diverged topic from [1], which is summarized as $SUBJECT.

To recap:

While discussing on additional LSNs in checkpoint log message,
Fujii-san pointed out [2] that there is a case where
CreateRestartPoint leaves unrecoverable database when concurrent
promotion happens. That corruption is "fixed" by the next checkpoint
so it is not a severe corruption.

AFAICS since 9.5, no check(/restart)pionts won't run concurrently with
restartpoint [3].  So I propose to remove the code path as attached.

regards.


[1] https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

[2] https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com

[3] https://www.postgresql.org/message-id/20220222.174401.765586897814316743.horikyota.ntt%40gmail.com

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: Possible corruption by CreateRestartPoint at promotion

From
Michael Paquier
Date:
On Tue, Apr 26, 2022 at 08:26:09PM -0700, Nathan Bossart wrote:
> On Wed, Apr 27, 2022 at 10:43:53AM +0900, Kyotaro Horiguchi wrote:
>> At Tue, 26 Apr 2022 11:33:49 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
>>> I suspect we'll start seeing this problem more often once end-of-recovery
>>> checkpoints are removed [0].  Would you mind creating a commitfest entry
>>> for this thread?  I didn't see one.
>>
>> I'm not sure the patch makes any change here, because restart points
>> don't run while crash recovery, since no checkpoint records seen
>> during a crash recovery.  Anyway the patch doesn't apply anymore so
>> rebased, but only the one for master for the lack of time for now.
>
> Thanks for the new patch!  Yeah, it wouldn't affect crash recovery, but
> IIUC Robert's patch also applies to archive recovery.
>
>> +    /* Update pg_control */
>> +    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
>> +
>>      /*
>>       * Remember the prior checkpoint's redo ptr for
>>       * UpdateCheckPointDistanceEstimate()
>>       */
>>      PriorRedoPtr = ControlFile->checkPointCopy.redo;
>
> nitpick: Why move the LWLockAcquire() all the way up here?

Yeah, that should not be necessary.  InitWalRecovery() is the only
place outside the checkpointer that would touch this field, but that
happens far too early in the startup sequence to matter with the
checkpointer.

>> +    Assert (PriorRedoPtr < RedoRecPtr);
>
> I think this could use a short explanation.

That's just to make sure that the current redo LSN is always older
than the one prior that.  It does not seem really necessary to me to
add that.

>> +        /*
>> +         * Aarchive recovery has ended. Crash recovery ever after should
>> +         * always recover to the end of WAL
>> +         */

s/Aarchive/Archive/.

>> +        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
>> +        ControlFile->minRecoveryPointTLI = 0;
>> +
>> +        /* also update local copy */
>> +        LocalMinRecoveryPoint = InvalidXLogRecPtr;
>> +        LocalMinRecoveryPointTLI = 0;
>
> Should this be handled by the code that changes the control file state to
> DB_IN_PRODUCTION instead?  It looks like this is ordinarily done in the
> next checkpoint.  It's not clear to me why it is done this way.

Anyway, that would be the work of the end-of-recovery checkpoint
requested at the end of StartupXLOG() once a promotion happens or of
the checkpoint requested by PerformRecoveryXLogAction() in the second
case, no?  So, I don't quite see why we need to update
minRecoveryPoint and minRecoveryPointTLI in the control file here, as
much as this does not have to be part of the end-of-recovery code
that switches the control file to DB_IN_PRODUCTION.

-   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-       ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
-   {
7ff23c6 has removed the last call to CreateCheckpoint() outside the
checkpointer, meaning that there is one less concurrent race to worry
about, but I have to admit that this change, to update the control
file's checkPoint and checkPointCopy even if we don't check after
ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the
code less robust in ~14.  So I am questioning whether a backpatch
is actually worth the risk here.
--
Michael

Attachment

Re: Possible corruption by CreateRestartPoint at promotion

From
Michael Paquier
Date:
On Wed, Apr 27, 2022 at 12:36:10PM +0800, Rui Zhao wrote:
> Do you have interest in adding a test like one in my patch?

I have studied the test case you are proposing, and I am afraid that
it is too expensive as designed.  And it is actually racy as you
expect the restart point to take longer than the promotion with a
timing based on an arbitrary (and large!) amount of data inserted into
the primary.  Well, the promotion should be shorter than the restart
point in any case, but such tests should be designed so as they would
work reliably on slow machines while being able to complete quickly on
fast machines.

It would much better if the test is designed so as the restart point
is stopped at an arbitrary step rather than throttled, moving on when
the promotion of the standby is done.  A well-known method, that would
not work on Windows, is to rely on SIGSTOP that could be used on the
checkpointer for such things.  Anyway, we don't have any mean to
reliably stop a restart point while in the middle of its processing,
do we?
--
Michael

Attachment

Re: Possible corruption by CreateRestartPoint at promotion

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Wed, Apr 27, 2022 at 12:36:10PM +0800, Rui Zhao wrote:
>> Do you have interest in adding a test like one in my patch?

> I have studied the test case you are proposing, and I am afraid that
> it is too expensive as designed.

That was my feeling too.  It's certainly a useful test for verifying
that we fixed the problem, but that doesn't mean that it's worth the
cycles to add it as a permanent fixture in check-world, even if we
could get rid of the timing assumptions.

            regards, tom lane



Re: Possible corruption by CreateRestartPoint at promotion

From
Nathan Bossart
Date:
On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote:
> On Tue, Apr 26, 2022 at 08:26:09PM -0700, Nathan Bossart wrote:
>> On Wed, Apr 27, 2022 at 10:43:53AM +0900, Kyotaro Horiguchi wrote:
>>> +        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
>>> +        ControlFile->minRecoveryPointTLI = 0;
>>> +
>>> +        /* also update local copy */
>>> +        LocalMinRecoveryPoint = InvalidXLogRecPtr;
>>> +        LocalMinRecoveryPointTLI = 0;
>> 
>> Should this be handled by the code that changes the control file state to
>> DB_IN_PRODUCTION instead?  It looks like this is ordinarily done in the
>> next checkpoint.  It's not clear to me why it is done this way.
> 
> Anyway, that would be the work of the end-of-recovery checkpoint
> requested at the end of StartupXLOG() once a promotion happens or of
> the checkpoint requested by PerformRecoveryXLogAction() in the second
> case, no?  So, I don't quite see why we need to update
> minRecoveryPoint and minRecoveryPointTLI in the control file here, as
> much as this does not have to be part of the end-of-recovery code
> that switches the control file to DB_IN_PRODUCTION.

+1. We probably don't need to reset minRecoveryPoint here.

> -   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
> -       ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
> -   {
> 7ff23c6 has removed the last call to CreateCheckpoint() outside the
> checkpointer, meaning that there is one less concurrent race to worry
> about, but I have to admit that this change, to update the control
> file's checkPoint and checkPointCopy even if we don't check after
> ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the
> code less robust in ~14.  So I am questioning whether a backpatch
> is actually worth the risk here.

IMO we should still check this before updating ControlFile to be safe.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Possible corruption by CreateRestartPoint at promotion

From
Michael Paquier
Date:
On Wed, Apr 27, 2022 at 11:09:45AM -0700, Nathan Bossart wrote:
> On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote:
>> -   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
>> -       ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
>> -   {
>> 7ff23c6 has removed the last call to CreateCheckpoint() outside the
>> checkpointer, meaning that there is one less concurrent race to worry
>> about, but I have to admit that this change, to update the control
>> file's checkPoint and checkPointCopy even if we don't check after
>> ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the
>> code less robust in ~14.  So I am questioning whether a backpatch
>> is actually worth the risk here.
>
> IMO we should still check this before updating ControlFile to be safe.

Sure.  Fine by me to play it safe.
--
Michael

Attachment

Re: Possible corruption by CreateRestartPoint at promotion

From
Kyotaro Horiguchi
Date:
At Wed, 27 Apr 2022 14:16:01 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Tue, Apr 26, 2022 at 08:26:09PM -0700, Nathan Bossart wrote:
> > On Wed, Apr 27, 2022 at 10:43:53AM +0900, Kyotaro Horiguchi wrote:
> >> At Tue, 26 Apr 2022 11:33:49 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> >>> I suspect we'll start seeing this problem more often once end-of-recovery
> >>> checkpoints are removed [0].  Would you mind creating a commitfest entry
> >>> for this thread?  I didn't see one.
> >> 
> >> I'm not sure the patch makes any change here, because restart points
> >> don't run while crash recovery, since no checkpoint records seen
> >> during a crash recovery.  Anyway the patch doesn't apply anymore so
> >> rebased, but only the one for master for the lack of time for now.
> > 
> > Thanks for the new patch!  Yeah, it wouldn't affect crash recovery, but
> > IIUC Robert's patch also applies to archive recovery.
> > 
> >> +    /* Update pg_control */
> >> +    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> >> +
> >>      /*
> >>       * Remember the prior checkpoint's redo ptr for
> >>       * UpdateCheckPointDistanceEstimate()
> >>       */
> >>      PriorRedoPtr = ControlFile->checkPointCopy.redo;
> > 
> > nitpick: Why move the LWLockAcquire() all the way up here?
> 
> Yeah, that should not be necessary.  InitWalRecovery() is the only
> place outside the checkpointer that would touch this field, but that
> happens far too early in the startup sequence to matter with the
> checkpointer.

Yes it is not necessary. I just wanted to apparently ensure not to
access ControlFile outside ControlFileLoc.  So I won't be opposed to
reverting it since, as you say, it is *actuall* safe..

> >> +    Assert (PriorRedoPtr < RedoRecPtr);
> > 
> > I think this could use a short explanation.
> 
> That's just to make sure that the current redo LSN is always older
> than the one prior that.  It does not seem really necessary to me to
> add that.

Just after we call UpdateCheckPointDistanceEstimate(RedoRecPtr -
PriorRedoPtr). Don't we really need any safeguard against giving a
wrap-arounded (in other words, treamendously large) value to the
function?  Actually it doesn't seem to happen now, but I don't
confident that that never ever happens.

That being said, I'm a minority here, so removed it.

> >> +        /*
> >> +         * Aarchive recovery has ended. Crash recovery ever after should
> >> +         * always recover to the end of WAL
> >> +         */
> 
> s/Aarchive/Archive/.

Oops! Fixed.

> >> +        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
> >> +        ControlFile->minRecoveryPointTLI = 0;
> >> +
> >> +        /* also update local copy */
> >> +        LocalMinRecoveryPoint = InvalidXLogRecPtr;
> >> +        LocalMinRecoveryPointTLI = 0;
> > 
> > Should this be handled by the code that changes the control file state to
> > DB_IN_PRODUCTION instead?  It looks like this is ordinarily done in the
> > next checkpoint.  It's not clear to me why it is done this way.
> 
> Anyway, that would be the work of the end-of-recovery checkpoint
> requested at the end of StartupXLOG() once a promotion happens or of
> the checkpoint requested by PerformRecoveryXLogAction() in the second
> case, no?  So, I don't quite see why we need to update

Eventually the work is done by StartupXLOG().  So we don't need to do
that at all even in CreateCheckPoint().  If we expect that the
end-of-recovery checkpoint clears it, that won't happen if if the last
restart point takes so long time that the end-of-recovery checkpoint
request is ignored. If DB_IN_ARCHIVE_RECOVERY ended while the restart
point is running, it is highly possible that the end-of-recovery
checkpoint trigger is ignored. In that case the values are cleard at
the next checkpoint.

In short, if we want to reset them at so-called end-of-recovery
checkpoint, we should do that also in CreateRecoveryPoint.

So, it is not removed in this version.

> minRecoveryPoint and minRecoveryPointTLI in the control file here, as
> much as this does not have to be part of the end-of-recovery code
> that switches the control file to DB_IN_PRODUCTION.
> 
> -   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
> -       ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
> -   {
> 7ff23c6 has removed the last call to CreateCheckpoint() outside the
> checkpointer, meaning that there is one less concurrent race to worry
> about, but I have to admit that this change, to update the control

Sure.  In very early stage the reasoning to rmove the code was it.
And the rason for proposing to back-patch the same to older versions
is basing on the further investigation, and I'm not fully confident
that for the earlier versions.

> file's checkPoint and checkPointCopy even if we don't check after
> ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the
> code less robust in ~14.  So I am questioning whether a backpatch
> is actually worth the risk here.

Agreed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Possible corruption by CreateRestartPoint at promotion

From
Kyotaro Horiguchi
Date:
At Thu, 28 Apr 2022 09:12:13 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Wed, Apr 27, 2022 at 11:09:45AM -0700, Nathan Bossart wrote:
> > On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote:
> >> -   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
> >> -       ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
> >> -   {
> >> 7ff23c6 has removed the last call to CreateCheckpoint() outside the
> >> checkpointer, meaning that there is one less concurrent race to worry
> >> about, but I have to admit that this change, to update the control
> >> file's checkPoint and checkPointCopy even if we don't check after
> >> ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the
> >> code less robust in ~14.  So I am questioning whether a backpatch
> >> is actually worth the risk here.
> > 
> > IMO we should still check this before updating ControlFile to be safe.
> 
> Sure.  Fine by me to play it safe.

Why do we consider concurrent check/restart points here while we don't
consider the same for ControlFile->checkPointCopy?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Possible corruption by CreateRestartPoint at promotion

From
Kyotaro Horiguchi
Date:
At Wed, 27 Apr 2022 01:31:55 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Michael Paquier <michael@paquier.xyz> writes:
> > On Wed, Apr 27, 2022 at 12:36:10PM +0800, Rui Zhao wrote:
> >> Do you have interest in adding a test like one in my patch?
> 
> > I have studied the test case you are proposing, and I am afraid that
> > it is too expensive as designed.
> 
> That was my feeling too.  It's certainly a useful test for verifying
> that we fixed the problem, but that doesn't mean that it's worth the
> cycles to add it as a permanent fixture in check-world, even if we
> could get rid of the timing assumptions.

My first feeling is the same.  And I don't find a way to cause this
cheap and reliably without inserting a dedicate debugging-aid code.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Possible corruption by CreateRestartPoint at promotion

From
Michael Paquier
Date:
On Thu, Apr 28, 2022 at 11:43:57AM +0900, Kyotaro Horiguchi wrote:
> At Thu, 28 Apr 2022 09:12:13 +0900, Michael Paquier <michael@paquier.xyz> wrote in
>> On Wed, Apr 27, 2022 at 11:09:45AM -0700, Nathan Bossart wrote:
>>> On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote:
>>>> -   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
>>>> -       ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
>>>> -   {
>>>> 7ff23c6 has removed the last call to CreateCheckpoint() outside the
>>>> checkpointer, meaning that there is one less concurrent race to worry
>>>> about, but I have to admit that this change, to update the control
>>>> file's checkPoint and checkPointCopy even if we don't check after
>>>> ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the
>>>> code less robust in ~14.  So I am questioning whether a backpatch
>>>> is actually worth the risk here.
>>>
>>> IMO we should still check this before updating ControlFile to be safe.
>>
>> Sure.  Fine by me to play it safe.
>
> Why do we consider concurrent check/restart points here while we don't
> consider the same for ControlFile->checkPointCopy?

I am not sure what you mean here.  FWIW, I am translating the
suggestion of Nathan to split the existing check in
CreateRestartPoint() that we are discussing here into two if blocks,
instead of just one:
- Move the update of checkPoint and checkPointCopy into its own if
block, controlled only by the check on
(ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
- Keep the code updating minRecoveryPoint and minRecoveryPointTLI
mostly as-is, but just do the update when the control file state is
DB_IN_ARCHIVE_RECOVERY.  Of course we need to keep the check on
(minRecoveryPoint < lastCheckPointEndPtr).

v5 is mostly doing that.
--
Michael

Attachment

Re: Possible corruption by CreateRestartPoint at promotion

From
Michael Paquier
Date:
On Thu, Apr 28, 2022 at 03:49:42PM +0900, Michael Paquier wrote:
> I am not sure what you mean here.  FWIW, I am translating the
> suggestion of Nathan to split the existing check in
> CreateRestartPoint() that we are discussing here into two if blocks,
> instead of just one:
> - Move the update of checkPoint and checkPointCopy into its own if
> block, controlled only by the check on
> (ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
> - Keep the code updating minRecoveryPoint and minRecoveryPointTLI
> mostly as-is, but just do the update when the control file state is
> DB_IN_ARCHIVE_RECOVERY.  Of course we need to keep the check on
> (minRecoveryPoint < lastCheckPointEndPtr).

And I have spent a bit of this stuff to finish with the attached.  It
will be a plus to get that done on HEAD for beta1, so I'll try to deal
with it on Monday.  I am still a bit stressed about the back branches
as concurrent checkpoints are possible via CreateCheckPoint() from the
startup process (not the case of HEAD), but the stable branches will
have a new point release very soon so let's revisit this choice there
later.  v6 attached includes a TAP test, but I don't intend to include
it as it is expensive.
--
Michael

Attachment

Re: Possible corruption by CreateRestartPoint at promotion

From
Nathan Bossart
Date:
On Fri, May 06, 2022 at 07:58:43PM +0900, Michael Paquier wrote:
> And I have spent a bit of this stuff to finish with the attached.  It
> will be a plus to get that done on HEAD for beta1, so I'll try to deal
> with it on Monday.  I am still a bit stressed about the back branches
> as concurrent checkpoints are possible via CreateCheckPoint() from the
> startup process (not the case of HEAD), but the stable branches will
> have a new point release very soon so let's revisit this choice there
> later.  v6 attached includes a TAP test, but I don't intend to include
> it as it is expensive.

I was looking at other changes in this area (e.g., 3c64dcb), and now I'm
wondering if we actually should invalidate the minRecoveryPoint when the
control file no longer indicates archive recovery.  Specifically, what
happens if a base backup of a newly promoted standby is used for a
point-in-time restore?  If the checkpoint location is updated and all
previous segments have been recycled/removed, it seems like the
minRecoveryPoint might point to a missing segment.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Possible corruption by CreateRestartPoint at promotion

From
Michael Paquier
Date:
On Fri, May 06, 2022 at 08:52:45AM -0700, Nathan Bossart wrote:
> I was looking at other changes in this area (e.g., 3c64dcb), and now I'm
> wondering if we actually should invalidate the minRecoveryPoint when the
> control file no longer indicates archive recovery.  Specifically, what
> happens if a base backup of a newly promoted standby is used for a
> point-in-time restore?  If the checkpoint location is updated and all
> previous segments have been recycled/removed, it seems like the
> minRecoveryPoint might point to a missing segment.

A new checkpoint is enforced at the beginning of the backup which
would update minRecoveryPoint and minRecoveryPointTLI, while we don't
a allow a backup to finish if it began on a standby has just promoted
in-between.
--
Michael

Attachment

Re: Possible corruption by CreateRestartPoint at promotion

From
Michael Paquier
Date:
On Fri, May 06, 2022 at 07:58:43PM +0900, Michael Paquier wrote:
> And I have spent a bit of this stuff to finish with the attached.  It
> will be a plus to get that done on HEAD for beta1, so I'll try to deal
> with it on Monday.  I am still a bit stressed about the back branches
> as concurrent checkpoints are possible via CreateCheckPoint() from the
> startup process (not the case of HEAD), but the stable branches will
> have a new point release very soon so let's revisit this choice there
> later.  v6 attached includes a TAP test, but I don't intend to include
> it as it is expensive.

Okay, applied this one on HEAD after going back-and-forth on it for
the last couple of days.  I have found myself shaping the patch in
what looks like its simplest form, by applying the check based on an
older checkpoint to all the fields updated in the control file, with
the check on DB_IN_ARCHIVE_RECOVERY applying to the addition of
DB_SHUTDOWNED_IN_RECOVERY (got initialially surprised that this was
having side effects on pg_rewind) and the minRecoveryPoint
calculations.

Now, it would be nice to get a test for this stuff, and we are going
to need something cheaper than what's been proposed upthread.  This
comes down to the point of being able to put a deterministic stop
in a restart point while it is processing, meaning that we need to
interact with one of the internal routines of CheckPointGuts().  One
fancy way to do so would be to forcibly take a LWLock to stuck the
restart point until it is released.  Using a SQL function for that
would be possible, if not artistic.  Perhaps we don't need such a
function though, if we could stuck arbitrarily the internals of a
checkpoint?  Any ideas?
--
Michael

Attachment

Re: Possible corruption by CreateRestartPoint at promotion

From
Michael Paquier
Date:
On Mon, May 09, 2022 at 09:24:06AM +0900, Michael Paquier wrote:
> Okay, applied this one on HEAD after going back-and-forth on it for
> the last couple of days.  I have found myself shaping the patch in
> what looks like its simplest form, by applying the check based on an
> older checkpoint to all the fields updated in the control file, with
> the check on DB_IN_ARCHIVE_RECOVERY applying to the addition of
> DB_SHUTDOWNED_IN_RECOVERY (got initialially surprised that this was
> having side effects on pg_rewind) and the minRecoveryPoint
> calculations.

It took me some time to make sure that this would be safe, but done
now for all the stable branches.

> Now, it would be nice to get a test for this stuff, and we are going
> to need something cheaper than what's been proposed upthread.  This
> comes down to the point of being able to put a deterministic stop
> in a restart point while it is processing, meaning that we need to
> interact with one of the internal routines of CheckPointGuts().  One
> fancy way to do so would be to forcibly take a LWLock to stuck the
> restart point until it is released.  Using a SQL function for that
> would be possible, if not artistic.  Perhaps we don't need such a
> function though, if we could stuck arbitrarily the internals of a
> checkpoint?  Any ideas?

One thing that we could do here is to resurrect the patch that adds
support for stop points in the code..
--
Michael

Attachment