Thread: Lots of FSM-related fragility in transaction commit
Joseph Shraibman reported some rather unpleasant behavior that seems to be due to trying to drop a table whose FSM file has got no permissions: http://archives.postgresql.org/pgsql-general/2011-12/msg00246.php One question is how the file got that way, and whether Postgres did anything wrong to cause it to not have permissions. But what I'm on the warpath about now is the abysmally bad error response. What ought to happen if we fail to delete a file during a DROP TABLE is a WARNING, no more. Not a PANIC that recurs during crash recovery. I tried to reproduce this, and found a different but equally bad behavior: regression=# create table foo as select * from tenk1; SELECT 10000 regression=# delete from foo where unique1%10 = 0; DELETE 1000 regression=# vacuum foo; VACUUM regression=# select 'foo'::regclass::oid; oid -------52860 (1 row) [ go and "chmod 000 52860_fsm" ] regression=# drop table foo; WARNING: AbortTransaction while in COMMIT state ERROR: could not open file "base/27666/52860_fsm": Permission denied PANIC: cannot abort transaction 7413, it was already committed ERROR: could not open file "base/27666/52860_fsm": Permission denied PANIC: cannot abort transaction 7413, it was already committed The connection to the server was lost. Attempting reset: Failed. I thought that removing the unreadable file would let me restart, but after "rm 52860_fsm" and trying again to start the server, there's a different problem: LOG: database system was interrupted while in recovery at 2011-12-08 00:56:11 EST HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery. LOG: database system was not properly shut down; automatic recovery in progress LOG: consistent recovery state reached at 0/5D71BA8 LOG: redo starts at 0/5100050 WARNING: page 18 of relation base/27666/52860 is uninitialized CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18 PANIC: WAL contains references to invalid pages CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18 LOG: startup process (PID 14507) was terminated by signal 6 LOG: aborting startup due to startup process failure Note that this isn't even the same symptom Shraibman hit, since apparently he was failing on replaying the commit record. How is it that the main table file managed to have uninitialized pages? I suspect that "redo visible" WAL processing is breaking one or more of the fundamental WAL rules, perhaps by not generating a full-page image when it needs to. So this is really a whole lot worse than our behavior was in pre-FSM days, and it needs to get fixed. regards, tom lane
On 08.12.2011 08:20, Tom Lane wrote: > I thought that removing the unreadable file would let me restart, > but after "rm 52860_fsm" and trying again to start the server, > there's a different problem: > > LOG: database system was interrupted while in recovery at 2011-12-08 00:56:11 EST > HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery. > LOG: database system was not properly shut down; automatic recovery in progress > LOG: consistent recovery state reached at 0/5D71BA8 > LOG: redo starts at 0/5100050 > WARNING: page 18 of relation base/27666/52860 is uninitialized > CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18 > PANIC: WAL contains references to invalid pages > CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18 > LOG: startup process (PID 14507) was terminated by signal 6 > LOG: aborting startup due to startup process failure > > Note that this isn't even the same symptom Shraibman hit, since > apparently he was failing on replaying the commit record. How is it > that the main table file managed to have uninitialized pages? > I suspect that "redo visible" WAL processing is breaking one or more of > the fundamental WAL rules, perhaps by not generating a full-page image > when it needs to. > > So this is really a whole lot worse than our behavior was in pre-FSM > days, and it needs to get fixed. This bug was actually introduced only recently. Notice how the log says "consistent recovery state reached at 0/5D71BA8". This interacts badly with Fujii's patch I committed last week: > commit 1e616f639156b2431725f7823c999486ca46c1ea > Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Fri Dec 2 10:49:54 2011 +0200 > > During recovery, if we reach consistent state and still have entries in the > invalid-page hash table, PANIC immediately. Immediate PANIC is much better > than waiting for end-of-recovery, which is what we did before, because the > end-of-recovery might not come until months later if this is a standby > server. > ... Recovery thinks it has reached consistency early on, so it PANICs as soon as it sees a reference to a page that belongs to a table that was later dropped. The bug here is that we consider having immediately reached consistency during crash recovery. That's a false notion: during crash recovery the database isn't consistent until *all* WAL has been replayed. We should not set reachedMinRecoveryPoint during crash recovery at all. And perhaps the flag should be renamed to reachedConsistency, to make it clear that during crash recovery it's not enough to reach the nominal minRecoveryPoint. That was harmless until last week, because reachedMinRecoveryPoint was not used for anything unless you're doing archive recovery and hot standby was enabled, but IMO the "consistent recovery state reached" log message was misleading even then. I propose that we apply the attached patch to master and backbranches. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 08.12.2011 08:20, Tom Lane wrote: >> So this is really a whole lot worse than our behavior was in pre-FSM >> days, and it needs to get fixed. > This bug was actually introduced only recently. Notice how the log says > "consistent recovery state reached at 0/5D71BA8". This interacts badly > with Fujii's patch I committed last week: You're right, I was testing on HEAD not 9.1.x. > That was harmless until last week, because reachedMinRecoveryPoint was > not used for anything unless you're doing archive recovery and hot > standby was enabled, but IMO the "consistent recovery state reached" log > message was misleading even then. I propose that we apply the attached > patch to master and backbranches. Looks sane to me, though I've not tested to see what effect it has on the case I was testing. regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 08.12.2011 08:20, Tom Lane wrote: >> I thought that removing the unreadable file would let me restart, >> but after "rm 52860_fsm" and trying again to start the server, >> there's a different problem: >> LOG: consistent recovery state reached at 0/5D71BA8 >> LOG: redo starts at 0/5100050 >> WARNING: page 18 of relation base/27666/52860 is uninitialized >> CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18 >> PANIC: WAL contains references to invalid pages > The bug here is that we consider having immediately reached consistency > during crash recovery. That fixes only part of the problem I was on about, though: I don't believe that this type of situation should occur in the first place. We should not be throwing ERROR during post-commit attempts to unlink the physical files of deleted relations. After some investigation I believe that the problem was introduced by the changes to support multiple "forks" in a relation. Specifically, instead of just doing "smgrdounlink()" during post-commit, we now do something like this: for (fork = 0; fork <= MAX_FORKNUM; fork++) { if (smgrexists(srel, fork)) smgrdounlink(srel, fork, false); } and if you look into mdexists() you'll find that it throws ERROR for any unexpected errno. This makes smgrdounlink's attempts to be robust rather pointless. I'm not entirely sure whether that behavior of mdexists is a good thing, but it strikes me that the smgrexists call is pretty unnecessary here in the first place. We should just get rid of it and do smgrdounlink unconditionally. The only reason it's there is to keep smgrdounlink from whinging about ENOENT on non-primary forks, which is simply stupid for smgrdounlink to be doing anyway --- it is actually *more* willing to complain about ENOENT on non-primary than primary forks, which has no sense that I can detect. Accordingly I propose the attached patch for HEAD, and similar changes in all branches that have multiple forks. Note that this makes the warning-report conditions identical for both code paths in mdunlink. regards, tom lane diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index d2fecb1ecb9171fc2c6dd2887d846c8a16779bb0..c5a2541807e148c29c790b0fe12a52c0389f4626 100644 *** a/src/backend/access/transam/twophase.c --- b/src/backend/access/transam/twophase.c *************** FinishPreparedTransaction(const char *gi *** 1366,1373 **** for (fork = 0; fork <= MAX_FORKNUM; fork++) { ! if (smgrexists(srel, fork)) ! smgrdounlink(srel, fork, false); } smgrclose(srel); } --- 1366,1372 ---- for (fork = 0; fork <= MAX_FORKNUM; fork++) { ! smgrdounlink(srel, fork, false); } smgrclose(srel); } diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index c383011b5f6538fa2b90fa5f7778da7ff59fa679..4177a6c5cf95d5e1a1b49b9f11c17098b2158529 100644 *** a/src/backend/access/transam/xact.c --- b/src/backend/access/transam/xact.c *************** xact_redo_commit_internal(TransactionId *** 4595,4605 **** for (fork = 0; fork <= MAX_FORKNUM; fork++) { ! if (smgrexists(srel, fork)) ! { ! XLogDropRelation(xnodes[i], fork); ! smgrdounlink(srel, fork, true); ! } } smgrclose(srel); } --- 4595,4602 ---- for (fork = 0; fork <= MAX_FORKNUM; fork++) { ! XLogDropRelation(xnodes[i], fork); ! smgrdounlink(srel, fork, true); } smgrclose(srel); } *************** xact_redo_abort(xl_xact_abort *xlrec, Tr *** 4738,4748 **** for (fork = 0; fork <= MAX_FORKNUM; fork++) { ! if (smgrexists(srel, fork)) ! { ! XLogDropRelation(xlrec->xnodes[i], fork); ! smgrdounlink(srel, fork, true); ! } } smgrclose(srel); } --- 4735,4742 ---- for (fork = 0; fork <= MAX_FORKNUM; fork++) { ! XLogDropRelation(xlrec->xnodes[i], fork); ! smgrdounlink(srel, fork, true); } smgrclose(srel); } diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index fff933d403e7ca5e7b551b987e1684cd6a6e98a7..450e292f4013f9be5418f482654cd33c14c6d344 100644 *** a/src/backend/catalog/storage.c --- b/src/backend/catalog/storage.c *************** smgrDoPendingDeletes(bool isCommit) *** 361,368 **** srel = smgropen(pending->relnode, pending->backend); for (i = 0; i <= MAX_FORKNUM; i++) { ! if (smgrexists(srel, i)) ! smgrdounlink(srel, i, false); } smgrclose(srel); } --- 361,367 ---- srel = smgropen(pending->relnode, pending->backend); for (i = 0; i <= MAX_FORKNUM; i++) { ! smgrdounlink(srel, i, false); } smgrclose(srel); } diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index a761369d39c2f7aa4c71131d79ef6a0edc61b4ea..7150a00407e3b28dae1cca2e6063251f2c5b681c 100644 *** a/src/backend/storage/smgr/md.c --- b/src/backend/storage/smgr/md.c *************** mdcreate(SMgrRelation reln, ForkNumber f *** 323,329 **** * number until it's safe, because relfilenode assignment skips over any * existing file. * ! * If isRedo is true, it's okay for the relation to be already gone. * Also, we should remove the file immediately instead of queuing a request * for later, since during redo there's no possibility of creating a * conflicting relation. --- 323,335 ---- * number until it's safe, because relfilenode assignment skips over any * existing file. * ! * All the above applies only to the relation's main fork; other forks can ! * just be removed immediately, since they are not needed to prevent the ! * relfilenode number from being recycled. Also, we do not carefully ! * track whether other forks have been created or not, but just attempt to ! * unlink them unconditionally; so we should never complain about ENOENT. ! * ! * If isRedo is true, it's unsurprising for the relation to be already gone. * Also, we should remove the file immediately instead of queuing a request * for later, since during redo there's no possibility of creating a * conflicting relation. *************** mdunlink(RelFileNodeBackend rnode, ForkN *** 351,363 **** if (isRedo || forkNum != MAIN_FORKNUM) { ret = unlink(path); ! if (ret < 0) ! { ! if (!isRedo || errno != ENOENT) ! ereport(WARNING, ! (errcode_for_file_access(), ! errmsg("could not remove file \"%s\": %m", path))); ! } } else { --- 357,366 ---- if (isRedo || forkNum != MAIN_FORKNUM) { ret = unlink(path); ! if (ret < 0 && errno != ENOENT) ! ereport(WARNING, ! (errcode_for_file_access(), ! errmsg("could not remove file \"%s\": %m", path))); } else {