From c5283146ca0e3cd2355a19d4d29fe455a6ab9fa2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 19 Oct 2023 15:43:15 -0400
Subject: [PATCH 2/2] Panic if filesystem truncation fails.

It was originally thought that failure to PANIC in the event of a
failed truncation was harmless, but we now know otherwise. Without the
PANIC, you can end up with a corrupted database and broken standbys.
Add a critical section to force a PANIC in this case, and add a
lengthy comment explaining why it's necessary.

Reported by (XXX, multiple people, make an actual list). Patch by me.
---
 src/backend/catalog/storage.c | 46 +++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 13 deletions(-)

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index bef4269368..08914d61e3 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -350,11 +350,11 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 	 * this code executes.
 	 *
 	 * Second, the call to smgrtruncate() below will in turn call
-	 * RegisterSyncRequest(). We need the sync request created by that call
-	 * to be processed before the checkpoint completes. CheckPointGuts()
-	 * will call ProcessSyncRequests(), but if we register our sync request
-	 * after that happens, then the WAL record for the truncation could end
-	 * up preceding the checkpoint record, while the actual sync doesn't happen
+	 * RegisterSyncRequest(). We need the sync request created by that call to
+	 * be processed before the checkpoint completes. CheckPointGuts() will
+	 * call ProcessSyncRequests(), but if we register our sync request after
+	 * that happens, then the WAL record for the truncation could end up
+	 * preceding the checkpoint record, while the actual sync doesn't happen
 	 * until the next checkpoint. To prevent that, we need to set
 	 * DELAY_CHKPT_START here. That way, if the XLOG_SMGR_TRUNCATE precedes
 	 * the redo pointer of a concurrent checkpoint, we're guaranteed that the
@@ -365,14 +365,33 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 	MyProc->delayChkptFlags |= DELAY_CHKPT_START | DELAY_CHKPT_COMPLETE;
 
 	/*
-	 * We WAL-log the truncation before actually truncating, which means
-	 * trouble if the truncation fails. If we then crash, the WAL replay
-	 * likely isn't going to succeed in the truncation either, and cause a
-	 * PANIC. It's tempting to put a critical section here, but that cure
-	 * would be worse than the disease. It would turn a usually harmless
-	 * failure to truncate, that might spell trouble at WAL replay, into a
-	 * certain PANIC.
+	 * We WAL-log the truncation before actually truncating. That means that
+	 * if truncation subsequently fails, we have to PANIC. If we don't, some
+	 * really bad things can happen. First, consider that we discard dirty
+	 * buffers from memory before actually truncating on disk. That means that
+	 * in effect we revert to an older on disk state where, perhaps, some
+	 * tuples that have been deleted since the last checkpoint are still
+	 * visible. That's equivalent to allowing a committed transaction to
+	 * become partially uncommitted, which is clearly unacceptable. Second,
+	 * consider that truncation may succeed on some or all standbys even
+	 * though it failed on the primary. That means that the primary and
+	 * standby are now in persistently different states.
+	 *
+	 * In general, this is much as if we wrote a WAL record for a change to
+	 * the contents of some page and then (for some reason) found ourselves
+	 * unable to perform the corresponding page modification in memory. That
+	 * situation also causes a PANIC. However, this case is considerably more
+	 * uncomfortable, because an in-memory modification to page contents
+	 * shouldn't really ever fail unless we've got a bug in the code
+	 * somewhere. In contrast, trying to truncate a file on disk certainly can
+	 * fail for a variety of reasons that are outside our control. If it does,
+	 * then recovery will probably also fail, which probably won't be much fun
+	 * for the user. They'll have to fix whatever is making us unable to
+	 * truncate files on disk before they can get the database up and running
+	 * again. But forcing them to fix permissions (or whatever the problem is)
+	 * beats corrupting the database.
 	 */
+	START_CRIT_SECTION();
 	if (RelationNeedsWAL(rel))
 	{
 		/*
@@ -409,7 +428,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 	 */
 	smgrtruncate(RelationGetSmgr(rel), forks, nforks, blocks);
 
-	/* We've done all the critical work, so checkpoints are OK now. */
+	/* We've done all the critical work. */
+	END_CRIT_SECTION();
 	MyProc->delayChkptFlags &= ~(DELAY_CHKPT_START | DELAY_CHKPT_COMPLETE);
 
 	/*
-- 
2.37.1 (Apple Git-137.1)