Re: Corruption during WAL replay - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Corruption during WAL replay
Date
Msg-id 20220318.102109.162855329039722212.horikyota.ntt@gmail.com
In response to Re: Corruption during WAL replay  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Corruption during WAL replay  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
At Wed, 16 Mar 2022 10:14:56 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> Hmm. I think the last two instances of "buffers" in this comment
> should actually say "blocks".

Ok. I replaced them with "blocks" and it looks nicer. Thanks!

> > I'll try that, if you are already working on it, please inform me. (It
> > may more than likely be too late..)
> 
> If you want to take a crack at that, I'd be delighted.

In the end, no two of the versions from 10 to 14 accept the same patch,
so each version needs its own.

As a cross-version check, I compared all combinations of the patches
for two adjacent versions and confirmed that no hunks are lost.

All versions pass check-world.


The differences between each pair of adjacent versions are as follows.

master->14:

 A hunk fails due to the change in how rel->rd_smgr is accessed.

14->13:

 Several hunks fail due to simple context differences.

13->12:

 Many hunks fail due to the migration of delayChkpt from PGPROC to
 PGXACT and context differences caused by the change in the FSM
 truncation logic in RelationTruncate.

12->11:

 Several hunks fail due to the removal of the volatile qualifier from
 pointers to PGPROC/PGXACT.

11->10:

 A hunk fails due to a context difference caused by the additional
 PGPROC member tempNamespaceId.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From c88858d3e5681005ba0396b7e7ebcde4322b3308 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 15 Mar 2022 12:29:14 -0400
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 29 ++++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 37 ++++++++++++++++++++++++-
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 120 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 6a70d49738..9f65c600d0 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3088,8 +3088,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyProc->delayChkpt);
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3115,7 +3115,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 874c8ed125..4dc8ccc12b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -475,7 +475,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     proc->xid = xid;
     Assert(proc->xmin == InvalidTransactionId);
-    proc->delayChkpt = false;
+    proc->delayChkpt = 0;
     proc->statusFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1164,7 +1164,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1207,7 +1208,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2266,7 +2267,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2314,7 +2316,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8964ddf3eb..3596a7d734 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1387,8 +1387,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyProc->delayChkpt = true;
+        MyProc->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1489,7 +1490,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyProc->delayChkpt = false;
+        MyProc->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f436471b27..ece71b9208 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6503,18 +6503,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f4eb54b63c..462e23503e 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1011,7 +1011,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyProc->delayChkpt);
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..ce5568ff08 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -325,6 +325,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)

     RelationPreTruncate(rel);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if replay begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -363,13 +379,24 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work to truncate relation forks */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(RelationGetSmgr(rel), forks, nforks, blocks);

+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyProc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
+
     /*
      * Update upper-level FSM pages to account for the truncation. This is
      * important because the just-truncated pages were likely marked as
      * all-free, and would be preferentially selected.
+     *
+     * NB: There's no point in delaying checkpoints until this is done.
+     * Because the FSM is not WAL-logged, we have to be prepared for the
+     * possibility of corruption after a crash anyway.
      */
     if (need_fsm_vacuum)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..11005edc73 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3911,7 +3911,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckPoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyProc->delayChkpt = delayChkpt = true;
+            Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyProc->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3944,7 +3946,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyProc->delayChkpt = false;
+            MyProc->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 13d192ec2b..735763cc24 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -698,7 +698,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)

         proc->lxid = InvalidLocalTransactionId;
         proc->xmin = InvalidTransactionId;
-        proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        proc->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         /* must be cleared with xid/xmin: */
@@ -737,7 +740,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
     proc->xid = InvalidTransactionId;
     proc->lxid = InvalidLocalTransactionId;
     proc->xmin = InvalidTransactionId;
-    proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    proc->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* must be cleared with xid/xmin: */
@@ -3053,7 +3059,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGPROC.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -3067,13 +3074,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -3085,7 +3094,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         int            pgprocno = arrayP->pgprocnos[index];
         PGPROC       *proc = &allProcs[pgprocno];

-        if (proc->delayChkpt)
+        if ((proc->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -3111,12 +3120,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -3127,7 +3138,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((proc->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 90283f8a9f..df080cd332 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -393,7 +393,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -578,7 +578,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index a58888f9e9..36ecf7d005 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -86,6 +86,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGPROC.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGPROC of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actually destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 typedef enum
 {
     PROC_WAIT_STATUS_OK,
@@ -191,7 +226,7 @@ struct PGPROC
     pg_atomic_uint64 waitStart; /* time at which wait for lock acquisition
                                  * started */

-    bool        delayChkpt;        /* true if this proc delays checkpoint start */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     uint8        statusFlags;    /* this backend's status flags, see PROC_*
                                  * above. mirrored in
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index e03692053e..1b2cfac5ad 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -59,8 +59,9 @@ extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0
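
The flag handling in the patch above can be sketched in isolation. What
follows is only an illustration and not part of the patch: DemoProc and
the demo_* helpers are hypothetical stand-ins for PGPROC and the
procarray.c code, with no locking, shared memory, or virtual transaction
IDs. Only the two DELAY_CHKPT_* bit values are taken from the proc.h
hunk. It shows why delayChkpt changes from bool to int: the two flags are
set, cleared, and tested independently of each other.

```c
#include <assert.h>
#include <stddef.h>

/* Same bit values as the DELAY_CHKPT_* macros added to proc.h by the patch. */
#define DELAY_CHKPT_START     (1 << 0)
#define DELAY_CHKPT_COMPLETE  (1 << 1)

/* Hypothetical stand-in for PGPROC; only the field under discussion is modeled. */
typedef struct DemoProc
{
    int delayChkpt;     /* bitmask of DELAY_CHKPT_* flags */
} DemoProc;

/* Set one flag. As in the patch, assert that it is not already set. */
void
demo_set_delay(DemoProc *proc, int flag)
{
    assert((proc->delayChkpt & flag) == 0);
    proc->delayChkpt |= flag;
}

/* Clear one flag and leave every other bit untouched. */
void
demo_clear_delay(DemoProc *proc, int flag)
{
    proc->delayChkpt &= ~flag;
}

/*
 * Count the procs that have any bit of "type" set. This uses the same
 * bit test as the patched GetVirtualXIDsDelayingChkpt(), but returns a
 * plain count instead of an array of virtual transaction IDs.
 */
int
demo_count_delaying(const DemoProc *procs, size_t nprocs, int type)
{
    int count = 0;

    assert(type != 0);

    for (size_t i = 0; i < nprocs; i++)
    {
        if ((procs[i].delayChkpt & type) != 0)
            count++;
    }
    return count;
}
```

In this sketch, a proc that has only DELAY_CHKPT_COMPLETE set is not
counted when the caller asks for DELAY_CHKPT_START. That matches the two
separate wait loops in CreateCheckPoint: the first, before
CheckPointGuts, waits only on DELAY_CHKPT_START, and the second, after
it, waits only on DELAY_CHKPT_COMPLETE.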

From 71493542cda97f75d0737e3434d9aaab2beadd5f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 17 Mar 2022 14:54:25 +0900
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 29 ++++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 37 ++++++++++++++++++++++++-
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 120 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b643564f16..50d8bab9e2 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3075,8 +3075,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyProc->delayChkpt);
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3102,7 +3102,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7cc76c1db7..dea3f485f7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -474,7 +474,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     proc->xid = xid;
     Assert(proc->xmin == InvalidTransactionId);
-    proc->delayChkpt = false;
+    proc->delayChkpt = 0;
     proc->statusFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1165,7 +1165,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1208,7 +1209,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2275,7 +2276,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2323,7 +2325,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 514044f3db..c5e7261921 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1335,8 +1335,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyProc->delayChkpt = true;
+        MyProc->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1437,7 +1438,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyProc->delayChkpt = false;
+        MyProc->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3e71aea71f..7cc49819f0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9228,18 +9228,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b153fad594..1af4a90c41 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -925,7 +925,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyProc->delayChkpt);
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cba7a9ada0..fa5682dce8 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -325,6 +325,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)

     RelationPreTruncate(rel);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if replay begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -363,13 +379,24 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work to truncate relation forks */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(rel->rd_smgr, forks, nforks, blocks);

+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyProc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
+
     /*
      * Update upper-level FSM pages to account for the truncation. This is
      * important because the just-truncated pages were likely marked as
      * all-free, and would be preferentially selected.
+     *
+     * NB: There's no point in delaying checkpoints until this is done.
+     * Because the FSM is not WAL-logged, we have to be prepared for the
+     * possibility of corruption after a crash anyway.
      */
     if (need_fsm_vacuum)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ffc6056c60..a55545a187 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3946,7 +3946,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyProc->delayChkpt = delayChkpt = true;
+            Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyProc->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3979,7 +3981,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyProc->delayChkpt = false;
+            MyProc->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index f047f9a242..ae71d7538b 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -689,7 +689,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)

         proc->lxid = InvalidLocalTransactionId;
         proc->xmin = InvalidTransactionId;
-        proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        proc->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         /* must be cleared with xid/xmin: */
@@ -728,7 +731,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
     proc->xid = InvalidTransactionId;
     proc->lxid = InvalidLocalTransactionId;
     proc->xmin = InvalidTransactionId;
-    proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    proc->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* must be cleared with xid/xmin: */
@@ -3043,7 +3049,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGPROC.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -3057,13 +3064,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -3075,7 +3084,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         int            pgprocno = arrayP->pgprocnos[index];
         PGPROC       *proc = &allProcs[pgprocno];

-        if (proc->delayChkpt)
+        if ((proc->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -3101,12 +3110,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -3117,7 +3128,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((proc->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 2575ea1ca0..c50a419a54 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -394,7 +394,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -579,7 +579,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index cfabfdbedf..b78012ec2b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -86,6 +86,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGPROC.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGPROC of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actually destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 typedef enum
 {
     PROC_WAIT_STATUS_OK,
@@ -191,7 +226,7 @@ struct PGPROC
     pg_atomic_uint64 waitStart; /* time at which wait for lock acquisition
                                  * started */

-    bool        delayChkpt;        /* true if this proc delays checkpoint start */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     uint8        statusFlags;    /* this backend's status flags, see PROC_*
                                  * above. mirrored in
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index b01fa52139..93de230a32 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -59,8 +59,9 @@ extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0
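
For readers skimming the hunks above, the core of the change is that delayChkpt turns from a bool into a small bitmask, so a backend can hold the "start" and "complete" delays independently. The following is only a minimal standalone sketch of that pattern, not code from the patch: MockProc and the mock_* helpers are hypothetical names made up for illustration, and only the two DELAY_CHKPT_* values are taken from the proc.h hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same bit values as in the patch's proc.h hunk. */
#define DELAY_CHKPT_START     (1 << 0)
#define DELAY_CHKPT_COMPLETE  (1 << 1)

/* Hypothetical stand-in for PGPROC; only the field under discussion. */
typedef struct MockProc
{
    int delayChkpt;             /* DELAY_CHKPT_* flags */
} MockProc;

/* Set one flag; like the patch, assert that it was not already set. */
void
mock_delay_set(MockProc *proc, int flag)
{
    assert((proc->delayChkpt & flag) == 0);
    proc->delayChkpt |= flag;
}

/* Clear one flag and leave any other flag untouched. */
void
mock_delay_clear(MockProc *proc, int flag)
{
    proc->delayChkpt &= ~flag;
}

/* True if any proc in the array has at least one of the bits in "type". */
bool
mock_any_delaying(const MockProc *procs, size_t nprocs, int type)
{
    assert(type != 0);
    for (size_t i = 0; i < nprocs; i++)
    {
        if ((procs[i].delayChkpt & type) != 0)
            return true;
    }
    return false;
}
```

The point of using |= and &= ~flag instead of plain assignment is that clearing DELAY_CHKPT_START in one code path cannot accidentally drop a DELAY_CHKPT_COMPLETE that another path in the same backend still holds.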

From f1832b4aaa3fcd06777a1d3bd9e322b3d85dd634 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 17 Mar 2022 19:11:22 +0900
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 29 ++++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 37 ++++++++++++++++++++++++-
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 120 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 7990b5e5dd..3e6443fd41 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3071,8 +3071,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyProc->delayChkpt);
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3098,7 +3098,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b1a221849a..716c17c98f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -476,7 +476,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     pgxact->xid = xid;
     pgxact->xmin = InvalidTransactionId;
-    proc->delayChkpt = false;
+    proc->delayChkpt = 0;
     pgxact->vacuumFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1170,7 +1170,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1213,7 +1214,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2286,7 +2287,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2334,7 +2336,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index fb6220e491..da6ce5a09e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1308,8 +1308,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyProc->delayChkpt = true;
+        MyProc->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1410,7 +1411,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyProc->delayChkpt = false;
+        MyProc->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7bef438d9a..9522c6531f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9022,18 +9022,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..5cff486d9e 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -904,7 +904,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyProc->delayChkpt);
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 74216785b7..0eb14cc885 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -325,6 +325,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)

     RelationPreTruncate(rel);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if replay begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -363,13 +379,24 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work to truncate relation forks */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(rel->rd_smgr, forks, nforks, blocks);

+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyProc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
+
     /*
      * Update upper-level FSM pages to account for the truncation. This is
      * important because the just-truncated pages were likely marked as
      * all-free, and would be preferentially selected.
+     *
+     * NB: There's no point in delaying checkpoints until this is done.
+     * Because the FSM is not WAL-logged, we have to be prepared for the
+     * possibility of corruption after a crash anyway.
      */
     if (need_fsm_vacuum)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 597afedef7..033ef46811 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3647,7 +3647,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyProc->delayChkpt = delayChkpt = true;
+            Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyProc->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3680,7 +3682,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyProc->delayChkpt = false;
+            MyProc->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 02b157243e..725680f34f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -434,7 +434,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
         pgxact->xmin = InvalidTransactionId;
         /* must be cleared with xid/xmin: */
         pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-        proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        proc->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         Assert(pgxact->nxids == 0);
@@ -456,7 +459,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
     pgxact->xmin = InvalidTransactionId;
     /* must be cleared with xid/xmin: */
     pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-    proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    proc->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* Clear the subtransaction-XID cache too while holding the lock */
@@ -2272,7 +2278,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGPROC.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -2286,13 +2293,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -2304,7 +2313,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         int            pgprocno = arrayP->pgprocnos[index];
         PGPROC       *proc = &allProcs[pgprocno];

-        if (proc->delayChkpt)
+        if ((proc->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -2330,12 +2339,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -2346,7 +2357,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((proc->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 0d70b03eeb..f3a6c598bf 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -396,7 +396,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -578,7 +578,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index b3ea1a2586..5798b91186 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -83,6 +83,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGPROC.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGPROC of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actually destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 /*
  * Each backend has a PGPROC struct in shared memory.  There is also a list of
  * currently-unused PGPROC structs that will be reallocated to new backends.
@@ -149,7 +184,7 @@ struct PGPROC
     LOCKMASK    heldLocks;        /* bitmask for lock types already held on this
                                  * lock object by this backend */

-    bool        delayChkpt;        /* true if this proc delays checkpoint start */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     /*
      * Info to allow us to wait for synchronous replication, if needed.
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 200ef8db27..4dee2dab10 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -92,8 +92,9 @@ extern TransactionId GetOldestXmin(Relation rel, int flags);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0
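
The proc.h comment in the patch above describes a checkpoint as three phases with one flag guarding each transition. As a rough illustration of that rule only, here is a simplified model. It assumes the flags "in effect" are just the bitwise OR over all backends, and it deliberately ignores that the real CreateCheckPoint snapshots a list of VXIDs and waits only for those. MockPhase and the mock_* functions are hypothetical names, not part of PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>

/* Same bit values as in the patch's proc.h hunk. */
#define DELAY_CHKPT_START     (1 << 0)
#define DELAY_CHKPT_COMPLETE  (1 << 1)

/* The three phases named in the proc.h comment (hypothetical enum). */
typedef enum MockPhase
{
    MOCK_PHASE_REDO_CHOSEN = 1,     /* (1) redo pointer location determined */
    MOCK_PHASE_DATA_FLUSHED = 2,    /* (2) data written durably to disk */
    MOCK_PHASE_RECORD_LOGGED = 3    /* (3) checkpoint record WAL-logged */
} MockPhase;

/* A flag is in effect if any backend has the bit set: OR them together. */
int
mock_flags_in_effect(const int *backend_flags, size_t nbackends)
{
    int     flags = 0;

    for (size_t i = 0; i < nbackends; i++)
        flags |= backend_flags[i];
    return flags;
}

/* Furthest phase a checkpoint may reach while these flags stay set. */
MockPhase
mock_furthest_phase(int flags)
{
    if ((flags & DELAY_CHKPT_START) != 0)
        return MOCK_PHASE_REDO_CHOSEN;      /* blocked before phase 2 */
    if ((flags & DELAY_CHKPT_COMPLETE) != 0)
        return MOCK_PHASE_DATA_FLUSHED;     /* blocked before phase 3 */
    return MOCK_PHASE_RECORD_LOGGED;
}
```

In this model a backend in RelationTruncate holding only DELAY_CHKPT_COMPLETE lets the checkpoint flush data but keeps it from writing the checkpoint record, which is the window the second wait loop in the xlog.c hunk covers.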

From 3eb3c1df1fbccd7eb3dc0dcc1ed99938e5c12e44 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 17 Mar 2022 19:32:38 +0900
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 26 ++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 38 +++++++++++++++++++++++--
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 117 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 09748905a8..757346cbbb 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3069,8 +3069,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyPgXact->delayChkpt);
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3096,7 +3096,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6def1820ca..602ca41054 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -477,7 +477,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     pgxact->xid = xid;
     pgxact->xmin = InvalidTransactionId;
-    pgxact->delayChkpt = false;
+    pgxact->delayChkpt = 0;
     pgxact->vacuumFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1187,7 +1187,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1230,7 +1231,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2337,7 +2338,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2385,7 +2387,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9c6b87c6ec..9d23298b2b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1306,8 +1306,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyPgXact->delayChkpt = true;
+        MyPgXact->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1408,7 +1409,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyPgXact->delayChkpt = false;
+        MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a30314bc83..9135985eaf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8920,18 +8920,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 24a6f3148b..b51b0edd67 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -899,7 +899,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyPgXact->delayChkpt);
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f899b25c0e..5a6324fec4 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -252,6 +253,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
     if (vm)
         visibilitymap_truncate(rel, nblocks);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if replay begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -290,8 +307,15 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
+
+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
 }

 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01c09fd532..7d11b0963f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3514,7 +3514,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyPgXact->delayChkpt = delayChkpt = true;
+            Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyPgXact->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3547,7 +3549,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyPgXact->delayChkpt = false;
+            MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ec7e210226..39093253fe 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -434,7 +434,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
         pgxact->xmin = InvalidTransactionId;
         /* must be cleared with xid/xmin: */
         pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-        pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        pgxact->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         Assert(pgxact->nxids == 0);
@@ -456,7 +459,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
     pgxact->xmin = InvalidTransactionId;
     /* must be cleared with xid/xmin: */
     pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-    pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    pgxact->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* Clear the subtransaction-XID cache too while holding the lock */
@@ -2261,7 +2267,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGXACT.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGXACT.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -2275,13 +2282,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -2294,7 +2303,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         PGPROC       *proc = &allProcs[pgprocno];
         PGXACT       *pgxact = &allPgXact[pgprocno];

-        if (pgxact->delayChkpt)
+        if ((pgxact->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -2320,12 +2329,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -2337,7 +2348,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (pgxact->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((pgxact->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 4850df2e14..59291e01f4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -397,7 +397,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -579,7 +579,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 43d0854a41..2a16fd23d4 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -76,6 +76,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGPROC.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGPROC of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actually destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 /*
  * Each backend has a PGPROC struct in shared memory.  There is also a list of
  * currently-unused PGPROC structs that will be reallocated to new backends.
@@ -232,8 +267,7 @@ typedef struct PGXACT

     uint8        vacuumFlags;    /* vacuum-related flags, see above */
     bool        overflowed;
-    bool        delayChkpt;        /* true if this proc delays checkpoint start;
-                                 * previously called InCommit */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     uint8        nxids;
 } PGXACT;
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index d1dc0ffe28..d9ca460efc 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -92,8 +92,9 @@ extern TransactionId GetOldestXmin(Relation rel, int flags);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0
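
Note on the flag handling in the patch above: delayChkpt changes from a bool
to a small bit set, so a backend can hold DELAY_CHKPT_START and
DELAY_CHKPT_COMPLETE independently. The following is only a minimal sketch of
that set/clear/test pattern outside PostgreSQL. DemoXact and the demo_*
helpers are hypothetical names used for illustration; only the two flag
values are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in the patch's proc.h hunk. */
#define DELAY_CHKPT_START     (1<<0)
#define DELAY_CHKPT_COMPLETE  (1<<1)

/* Hypothetical stand-in for PGXACT; only the field of interest is kept. */
typedef struct DemoXact
{
    int delayChkpt;             /* for DELAY_CHKPT_* flags */
} DemoXact;

/* Set one flag; like the patch, assert that it was not already set. */
static void
demo_set_delay(DemoXact *x, int flag)
{
    assert((x->delayChkpt & flag) == 0);
    x->delayChkpt |= flag;
}

/* Clear one flag and leave the other bits untouched. */
static void
demo_clear_delay(DemoXact *x, int flag)
{
    x->delayChkpt &= ~flag;
}

/* True if any bit in "type" is set, mirroring (delayChkpt & type) != 0. */
static bool
demo_is_delaying(const DemoXact *x, int type)
{
    assert(type != 0);
    return (x->delayChkpt & type) != 0;
}
```

Clearing with "&= ~flag" rather than assigning 0 is what lets a backend drop
one kind of delay while still holding the other; the end-of-transaction paths
in the patch assign 0 because they want both cleared.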

From 30fd7eea362f38a64f62fc91123bc387dabed15f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 17 Mar 2022 19:36:10 +0900
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 26 ++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 38 +++++++++++++++++++++++--
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 117 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ad9e7ff8f0..5612db0e21 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3069,8 +3069,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyPgXact->delayChkpt);
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3096,7 +3096,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8b402c3a1d..769a5fd714 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -476,7 +476,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     pgxact->xid = xid;
     pgxact->xmin = InvalidTransactionId;
-    pgxact->delayChkpt = false;
+    pgxact->delayChkpt = 0;
     pgxact->vacuumFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1175,7 +1175,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1218,7 +1219,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2352,7 +2353,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2400,7 +2402,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e32b05d17f..5a86b6575e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1239,8 +1239,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyPgXact->delayChkpt = true;
+        MyPgXact->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1341,7 +1342,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyPgXact->delayChkpt = false;
+        MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c68dc1b9a8..53e109b0aa 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9064,18 +9064,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c033e7bd4c..a8c140b06f 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -899,7 +899,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyPgXact->delayChkpt);
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5df4382b7e..5d6f456c70 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -27,6 +27,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -248,6 +249,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
     if (vm)
         visibilitymap_truncate(rel, nblocks);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if replay begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -286,8 +303,15 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
+
+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
 }

 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 459151519a..027d5067a0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3471,7 +3471,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyPgXact->delayChkpt = delayChkpt = true;
+            Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyPgXact->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3504,7 +3506,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyPgXact->delayChkpt = false;
+            MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 465ca66857..d88d955091 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -433,7 +433,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
         pgxact->xmin = InvalidTransactionId;
         /* must be cleared with xid/xmin: */
         pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-        pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        pgxact->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         Assert(pgxact->nxids == 0);
@@ -455,7 +458,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
     pgxact->xmin = InvalidTransactionId;
     /* must be cleared with xid/xmin: */
     pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-    pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    pgxact->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* Clear the subtransaction-XID cache too while holding the lock */
@@ -2267,7 +2273,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGXACT.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGXACT.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -2281,13 +2288,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -2300,7 +2309,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         volatile PGPROC *proc = &allProcs[pgprocno];
         volatile PGXACT *pgxact = &allPgXact[pgprocno];

-        if (pgxact->delayChkpt)
+        if ((pgxact->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -2326,12 +2335,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -2343,7 +2354,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (pgxact->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((pgxact->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 69a1e37289..aaecfa67b7 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -380,7 +380,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -562,7 +562,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 95c9592b21..e76ca8a11e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -76,6 +76,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGPROC.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGPROC of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actually destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 /*
  * Each backend has a PGPROC struct in shared memory.  There is also a list of
  * currently-unused PGPROC structs that will be reallocated to new backends.
@@ -232,8 +267,7 @@ typedef struct PGXACT

     uint8        vacuumFlags;    /* vacuum-related flags, see above */
     bool        overflowed;
-    bool        delayChkpt;        /* true if this proc delays checkpoint start;
-                                 * previously called InCommit */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     uint8        nxids;
 } PGXACT;
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a3a1bf724c..a69632a70c 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -92,8 +92,9 @@ extern TransactionId GetOldestXmin(Relation rel, int flags);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0
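
Note on the checkpoint ordering in the patch above: the proc.h comment
describes three phases (choose the redo point, flush data, WAL-log the
checkpoint), with DELAY_CHKPT_START gating the step from phase 1 to 2 and
DELAY_CHKPT_COMPLETE gating the step from phase 2 to 3. The sketch below
models only that gating rule over a plain array of per-backend flag words.
It is a simplification under stated assumptions: the real CreateCheckPoint()
first collects the VXIDs of the delaying backends and then polls only those,
and it evaluates the second gate after CheckPointGuts() has run. DemoPhase
and the demo_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as defined in the patch's proc.h hunk. */
#define DELAY_CHKPT_START     (1<<0)
#define DELAY_CHKPT_COMPLETE  (1<<1)

/* Hypothetical names for the three phases described in proc.h. */
typedef enum DemoPhase
{
    DEMO_PHASE_REDO_CHOSEN = 1,     /* (1) redo pointer location decided */
    DEMO_PHASE_DATA_FLUSHED = 2,    /* (2) data written durably to disk */
    DEMO_PHASE_RECORD_LOGGED = 3    /* (3) checkpoint record WAL-logged */
} DemoPhase;

/* True if any backend has at least one bit of "type" set. */
static bool
demo_any_delaying(const int *delay_flags, size_t nbackends, int type)
{
    assert(type != 0);
    for (size_t i = 0; i < nbackends; i++)
    {
        if ((delay_flags[i] & type) != 0)
            return true;
    }
    return false;
}

/*
 * Furthest phase a checkpoint could reach without waiting, given a fixed
 * snapshot of the flags. A START holder stops it after phase 1; a COMPLETE
 * holder stops it after phase 2.
 */
static DemoPhase
demo_furthest_phase(const int *delay_flags, size_t nbackends)
{
    if (demo_any_delaying(delay_flags, nbackends, DELAY_CHKPT_START))
        return DEMO_PHASE_REDO_CHOSEN;
    if (demo_any_delaying(delay_flags, nbackends, DELAY_CHKPT_COMPLETE))
        return DEMO_PHASE_DATA_FLUSHED;
    return DEMO_PHASE_RECORD_LOGGED;
}
```

In this model a backend inside RelationTruncate() holding only
DELAY_CHKPT_COMPLETE lets the checkpoint flush data but keeps the checkpoint
record from being written until the on-disk truncation has finished.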

From f0b1e3bee795a54d2a701889dd5956283fbc2cf6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 17 Mar 2022 19:40:45 +0900
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 26 ++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 38 +++++++++++++++++++++++--
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 117 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index cdaf499348..1e52972bbf 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3069,8 +3069,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyPgXact->delayChkpt);
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3096,7 +3096,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 3eb33be69b..c61b2736a1 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -478,7 +478,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     pgxact->xid = xid;
     pgxact->xmin = InvalidTransactionId;
-    pgxact->delayChkpt = false;
+    pgxact->delayChkpt = 0;
     pgxact->vacuumFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1159,7 +1159,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1191,7 +1192,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2284,7 +2285,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2332,7 +2334,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 25a3a4f97e..ccd99c38c2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1247,8 +1247,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyPgXact->delayChkpt = true;
+        MyPgXact->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1349,7 +1350,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyPgXact->delayChkpt = false;
+        MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e8bdde764..5087b5fe0a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9022,18 +9022,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 579d8de775..6ff19814d4 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -899,7 +899,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyPgXact->delayChkpt);
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9a5fde00ca..729fb92c5f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -249,6 +250,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
     if (vm)
         visibilitymap_truncate(rel, nblocks);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if replay begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -287,8 +304,15 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
+
+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
 }

 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bafe91ab0d..0b7bdb8634 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3469,7 +3469,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyPgXact->delayChkpt = delayChkpt = true;
+            Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyPgXact->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3502,7 +3504,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyPgXact->delayChkpt = false;
+            MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index d739812f23..134b63f28b 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -433,7 +433,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
         pgxact->xmin = InvalidTransactionId;
         /* must be cleared with xid/xmin: */
         pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-        pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        pgxact->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         Assert(pgxact->nxids == 0);
@@ -455,7 +458,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
     pgxact->xmin = InvalidTransactionId;
     /* must be cleared with xid/xmin: */
     pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-    pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    pgxact->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* Clear the subtransaction-XID cache too while holding the lock */
@@ -2259,7 +2265,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGXACT.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGXACT.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -2273,13 +2280,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -2292,7 +2301,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         volatile PGPROC *proc = &allProcs[pgprocno];
         volatile PGXACT *pgxact = &allPgXact[pgprocno];

-        if (pgxact->delayChkpt)
+        if ((pgxact->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -2318,12 +2327,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -2335,7 +2346,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (pgxact->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((pgxact->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 857dfdab09..e5370df019 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -377,7 +377,7 @@ InitProcess(void)
     MyProc->databaseId = InvalidOid;
     MyProc->roleId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -550,7 +550,7 @@ InitAuxiliaryProcess(void)
     MyProc->databaseId = InvalidOid;
     MyProc->roleId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 947f69d634..d8dd7bf5e1 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -75,6 +75,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGXACT.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGXACT of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actually destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 /*
  * Each backend has a PGPROC struct in shared memory.  There is also a list of
  * currently-unused PGPROC structs that will be reallocated to new backends.
@@ -217,8 +252,7 @@ typedef struct PGXACT

     uint8        vacuumFlags;    /* vacuum-related flags, see above */
     bool        overflowed;
-    bool        delayChkpt;        /* true if this proc delays checkpoint start;
-                                 * previously called InCommit */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     uint8        nxids;
 } PGXACT;
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 08b4b030bb..2b60b27604 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -92,8 +92,9 @@ extern TransactionId GetOldestXmin(Relation rel, int flags);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0
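
For readers skimming the patch, the way the two delayChkpt bits gate the
checkpoint phases can be sketched roughly as below. This is only an
illustration, not code from the patch: SketchBackend,
sketch_any_delaying() and sketch_may_leave_phase() are hypothetical names,
and the sketch polls all backends directly, whereas the patch takes a
snapshot of VXIDs with GetVirtualXIDsDelayingChkpt() and waits only for
those. Only the DELAY_CHKPT_* values are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same bit values as the DELAY_CHKPT_* flags added by the patch. */
#define DELAY_CHKPT_START     (1 << 0)
#define DELAY_CHKPT_COMPLETE  (1 << 1)

/* Hypothetical stand-in for the per-backend entry holding delayChkpt. */
typedef struct SketchBackend
{
    int delayChkpt;             /* bitmask of DELAY_CHKPT_* flags */
} SketchBackend;

/* Hypothetical helper: true if any backend has a bit of "type" set. */
static bool
sketch_any_delaying(const SketchBackend *backends, size_t n, int type)
{
    assert(type != 0);

    for (size_t i = 0; i < n; i++)
    {
        if ((backends[i].delayChkpt & type) != 0)
            return true;
    }
    return false;
}

/*
 * Hypothetical helper: may the checkpointer leave "phase"?
 * Phase 1 = redo pointer chosen, phase 2 = data flushed to disk.
 * START blocks the 1 -> 2 transition, COMPLETE blocks 2 -> 3.
 */
static bool
sketch_may_leave_phase(const SketchBackend *backends, size_t n, int phase)
{
    int type = (phase == 1) ? DELAY_CHKPT_START : DELAY_CHKPT_COMPLETE;

    return !sketch_any_delaying(backends, n, type);
}
```

Because the two flags are independent bits, a backend in the middle of
RelationTruncate() (COMPLETE set) does not stop a checkpoint from starting
its flush; it only stops the checkpoint record from being written until the
files have actually been truncated.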

