From c5283146ca0e3cd2355a19d4d29fe455a6ab9fa2 Mon Sep 17 00:00:00 2001 From: Robert Haas Date: Thu, 19 Oct 2023 15:43:15 -0400 Subject: [PATCH 2/2] Panic if filesystem truncation fails. It was originally thought that failure to PANIC in the event of a failed truncation was harmless, but we now know otherwise. Without the PANIC, you can end up with a corrupted database and broken standbys. Add a critical section to force a PANIC in this case, and add a lengthy comment explaining why it's necessary. Reported by (XXX, multiple people, make an actual list). Patch by me. --- src/backend/catalog/storage.c | 46 +++++++++++++++++++++++++---------- 1 file changed, 33 insertions(+), 13 deletions(-) diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index bef4269368..08914d61e3 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -350,11 +350,11 @@ RelationTruncate(Relation rel, BlockNumber nblocks) * this code executes. * * Second, the call to smgrtruncate() below will in turn call - * RegisterSyncRequest(). We need the sync request created by that call - * to be processed before the checkpoint completes. CheckPointGuts() - * will call ProcessSyncRequests(), but if we register our sync request - * after that happens, then the WAL record for the truncation could end - * up preceding the checkpoint record, while the actual sync doesn't happen + * RegisterSyncRequest(). We need the sync request created by that call to + * be processed before the checkpoint completes. CheckPointGuts() will + * call ProcessSyncRequests(), but if we register our sync request after + * that happens, then the WAL record for the truncation could end up + * preceding the checkpoint record, while the actual sync doesn't happen * until the next checkpoint. To prevent that, we need to set * DELAY_CHKPT_START here. That way, if the XLOG_SMGR_TRUNCATE precedes * the redo pointer of a concurrent checkpoint, we're guaranteed that the @@ -365,14 +365,33 @@ RelationTruncate(Relation rel, BlockNumber nblocks) MyProc->delayChkptFlags |= DELAY_CHKPT_START | DELAY_CHKPT_COMPLETE; /* - * We WAL-log the truncation before actually truncating, which means - * trouble if the truncation fails. If we then crash, the WAL replay - * likely isn't going to succeed in the truncation either, and cause a - * PANIC. It's tempting to put a critical section here, but that cure - * would be worse than the disease. It would turn a usually harmless - * failure to truncate, that might spell trouble at WAL replay, into a - * certain PANIC. + * We WAL-log the truncation before actually truncating. That means that + * if truncation subsequently fails, we have to PANIC. If we don't, some + * really bad things can happen. First, consider that we discard dirty + * buffers from memory before actually truncating on disk. That means that + * in effect we revert to an older on disk state where, perhaps, some + * tuples that have been deleted since the last checkpoint are still + * visible. That's equivalent to allowing a committed transaction to + * become partially uncommitted, which is clearly unacceptable. Second, + * consider that truncation may succeed on some or all standbys even + * though it failed on the primary. That means that the primary and + * standby are now in persistently different states. + * + * In general, this is much as if we wrote a WAL record for a change to + * the contents of some page and then (for some reason) found ourselves + * unable to perform the corresponding page modification in memory. That + * situation also causes a PANIC. However, this case is considerably more + * uncomfortable, because an in-memory modification to page contents + * shouldn't really ever fail unless we've got a bug in the code + * somewhere. In contrast, trying to truncate a file on disk certainly can + * fail for a variety of reasons that are outside our control. If it does, + * then recovery will probably also fail, which probably won't be much fun + * for the user. They'll have to fix whatever is making us unable to + * truncate files on disk before they can get the database up and running + * again. But forcing them to fix permissions (or whatever the problem is) + * beats corrupting the database. */ + START_CRIT_SECTION(); if (RelationNeedsWAL(rel)) { /* @@ -409,7 +428,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks) */ smgrtruncate(RelationGetSmgr(rel), forks, nforks, blocks); - /* We've done all the critical work, so checkpoints are OK now. */ + /* We've done all the critical work. */ + END_CRIT_SECTION(); MyProc->delayChkptFlags &= ~(DELAY_CHKPT_START | DELAY_CHKPT_COMPLETE); /* -- 2.37.1 (Apple Git-137.1)