Thread: Lots of FSM-related fragility in transaction commit

Lots of FSM-related fragility in transaction commit

From
Tom Lane
Date:
Joseph Shraibman reported some rather unpleasant behavior that
seems to be due to trying to drop a table whose FSM file has got
no permissions:
http://archives.postgresql.org/pgsql-general/2011-12/msg00246.php

One question is how the file got that way, and whether Postgres did
anything wrong to cause it to not have permissions.  But what I'm on the
warpath about now is the abysmally bad error response.  What ought to
happen if we fail to delete a file during a DROP TABLE is a WARNING,
no more.  Not a PANIC that recurs during crash recovery.

I tried to reproduce this, and found a different but equally bad
behavior:

regression=# create table foo as select * from tenk1;
SELECT 10000
regression=# delete from foo where unique1%10 = 0;
DELETE 1000
regression=# vacuum foo;
VACUUM
regression=# select 'foo'::regclass::oid; oid  
-------52860
(1 row)

[ go and "chmod 000 52860_fsm" ]

regression=# drop table foo;
WARNING:  AbortTransaction while in COMMIT state
ERROR:  could not open file "base/27666/52860_fsm": Permission denied
PANIC:  cannot abort transaction 7413, it was already committed
ERROR:  could not open file "base/27666/52860_fsm": Permission denied
PANIC:  cannot abort transaction 7413, it was already committed
The connection to the server was lost. Attempting reset: Failed.

I thought that removing the unreadable file would let me restart,
but after "rm 52860_fsm" and trying again to start the server,
there's a different problem:

LOG:  database system was interrupted while in recovery at 2011-12-08 00:56:11 EST
HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  consistent recovery state reached at 0/5D71BA8
LOG:  redo starts at 0/5100050
WARNING:  page 18 of relation base/27666/52860 is uninitialized
CONTEXT:  xlog redo visible: rel 1663/27666/52860; blk 18
PANIC:  WAL contains references to invalid pages
CONTEXT:  xlog redo visible: rel 1663/27666/52860; blk 18
LOG:  startup process (PID 14507) was terminated by signal 6
LOG:  aborting startup due to startup process failure

Note that this isn't even the same symptom Shraibman hit, since
apparently he was failing on replaying the commit record.  How is it
that the main table file managed to have uninitialized pages?
I suspect that "redo visible" WAL processing is breaking one or more of
the fundamental WAL rules, perhaps by not generating a full-page image
when it needs to.

So this is really a whole lot worse than our behavior was in pre-FSM
days, and it needs to get fixed.
        regards, tom lane


Re: Lots of FSM-related fragility in transaction commit

From
Heikki Linnakangas
Date:
On 08.12.2011 08:20, Tom Lane wrote:
> I thought that removing the unreadable file would let me restart,
> but after "rm 52860_fsm" and trying again to start the server,
> there's a different problem:
>
> LOG:  database system was interrupted while in recovery at 2011-12-08 00:56:11 EST
> HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
> LOG:  database system was not properly shut down; automatic recovery in progress
> LOG:  consistent recovery state reached at 0/5D71BA8
> LOG:  redo starts at 0/5100050
> WARNING:  page 18 of relation base/27666/52860 is uninitialized
> CONTEXT:  xlog redo visible: rel 1663/27666/52860; blk 18
> PANIC:  WAL contains references to invalid pages
> CONTEXT:  xlog redo visible: rel 1663/27666/52860; blk 18
> LOG:  startup process (PID 14507) was terminated by signal 6
> LOG:  aborting startup due to startup process failure
>
> Note that this isn't even the same symptom Shraibman hit, since
> apparently he was failing on replaying the commit record.  How is it
> that the main table file managed to have uninitialized pages?
> I suspect that "redo visible" WAL processing is breaking one or more of
> the fundamental WAL rules, perhaps by not generating a full-page image
> when it needs to.
>
> So this is really a whole lot worse than our behavior was in pre-FSM
> days, and it needs to get fixed.

This bug was actually introduced only recently. Notice how the log says
"consistent recovery state reached at 0/5D71BA8". This interacts badly
with Fujii's patch I committed last week:

> commit 1e616f639156b2431725f7823c999486ca46c1ea
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date:   Fri Dec 2 10:49:54 2011 +0200
>
>     During recovery, if we reach consistent state and still have entries in the
>     invalid-page hash table, PANIC immediately. Immediate PANIC is much better
>     than waiting for end-of-recovery, which is what we did before, because the
>     end-of-recovery might not come until months later if this is a standby
>     server.
> ...

Recovery thinks it has reached consistency early on, so it PANICs as
soon as it sees a reference to a page that belongs to a table that was
later dropped.

The bug here is that we consider having immediately reached consistency
during crash recovery. That's a false notion: during crash recovery the
database isn't consistent until *all* WAL has been replayed. We should
not set reachedMinRecoveryPoint during crash recovery at all. And
perhaps the flag should be renamed to reachedConsistency, to make it
clear that during crash recovery it's not enough to reach the nominal
minRecoveryPoint.

That was harmless until last week, because reachedMinRecoveryPoint was
not used for anything unless you're doing archive recovery and hot
standby was enabled, but IMO the "consistent recovery state reached" log
message was misleading even then. I propose that we apply the attached
patch to master and backbranches.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Lots of FSM-related fragility in transaction commit

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 08.12.2011 08:20, Tom Lane wrote:
>> So this is really a whole lot worse than our behavior was in pre-FSM
>> days, and it needs to get fixed.

> This bug was actually introduced only recently. Notice how the log says 
> "consistent recovery state reached at 0/5D71BA8". This interacts badly 
> with Fujii's patch I committed last week:

You're right, I was testing on HEAD not 9.1.x.

> That was harmless until last week, because reachedMinRecoveryPoint was 
> not used for anything unless you're doing archive recovery and hot 
> standby was enabled, but IMO the "consistent recovery state reached" log 
> message was misleading even then. I propose that we apply the attached 
> patch to master and backbranches.

Looks sane to me, though I've not tested to see what effect it has on
the case I was testing.
        regards, tom lane


Re: Lots of FSM-related fragility in transaction commit

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 08.12.2011 08:20, Tom Lane wrote:
>> I thought that removing the unreadable file would let me restart,
>> but after "rm 52860_fsm" and trying again to start the server,
>> there's a different problem:
>> LOG:  consistent recovery state reached at 0/5D71BA8
>> LOG:  redo starts at 0/5100050
>> WARNING:  page 18 of relation base/27666/52860 is uninitialized
>> CONTEXT:  xlog redo visible: rel 1663/27666/52860; blk 18
>> PANIC:  WAL contains references to invalid pages

> The bug here is that we consider having immediately reached consistency
> during crash recovery.

That fixes only part of the problem I was on about, though: I don't
believe that this type of situation should occur in the first place.
We should not be throwing ERROR during post-commit attempts to unlink
the physical files of deleted relations.

After some investigation I believe that the problem was introduced
by the changes to support multiple "forks" in a relation.  Specifically,
instead of just doing "smgrdounlink()" during post-commit, we now
do something like this:

        for (fork = 0; fork <= MAX_FORKNUM; fork++)
        {
            if (smgrexists(srel, fork))
                smgrdounlink(srel, fork, false);
        }

and if you look into mdexists() you'll find that it throws ERROR for any
unexpected errno.  This makes smgrdounlink's attempts to be robust
rather pointless.

I'm not entirely sure whether that behavior of mdexists is a good thing,
but it strikes me that the smgrexists call is pretty unnecessary here in
the first place.  We should just get rid of it and do smgrdounlink
unconditionally.  The only reason it's there is to keep smgrdounlink
from whinging about ENOENT on non-primary forks, which is simply stupid
for smgrdounlink to be doing anyway --- it is actually *more* willing to
complain about ENOENT on non-primary than primary forks, which has no
sense that I can detect.

Accordingly I propose the attached patch for HEAD, and similar changes
in all branches that have multiple forks.  Note that this makes the
warning-report conditions identical for both code paths in mdunlink.

            regards, tom lane


diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d2fecb1ecb9171fc2c6dd2887d846c8a16779bb0..c5a2541807e148c29c790b0fe12a52c0389f4626 100644
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
*************** FinishPreparedTransaction(const char *gi
*** 1366,1373 ****

          for (fork = 0; fork <= MAX_FORKNUM; fork++)
          {
!             if (smgrexists(srel, fork))
!                 smgrdounlink(srel, fork, false);
          }
          smgrclose(srel);
      }
--- 1366,1372 ----

          for (fork = 0; fork <= MAX_FORKNUM; fork++)
          {
!             smgrdounlink(srel, fork, false);
          }
          smgrclose(srel);
      }
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c383011b5f6538fa2b90fa5f7778da7ff59fa679..4177a6c5cf95d5e1a1b49b9f11c17098b2158529 100644
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*************** xact_redo_commit_internal(TransactionId
*** 4595,4605 ****

          for (fork = 0; fork <= MAX_FORKNUM; fork++)
          {
!             if (smgrexists(srel, fork))
!             {
!                 XLogDropRelation(xnodes[i], fork);
!                 smgrdounlink(srel, fork, true);
!             }
          }
          smgrclose(srel);
      }
--- 4595,4602 ----

          for (fork = 0; fork <= MAX_FORKNUM; fork++)
          {
!             XLogDropRelation(xnodes[i], fork);
!             smgrdounlink(srel, fork, true);
          }
          smgrclose(srel);
      }
*************** xact_redo_abort(xl_xact_abort *xlrec, Tr
*** 4738,4748 ****

          for (fork = 0; fork <= MAX_FORKNUM; fork++)
          {
!             if (smgrexists(srel, fork))
!             {
!                 XLogDropRelation(xlrec->xnodes[i], fork);
!                 smgrdounlink(srel, fork, true);
!             }
          }
          smgrclose(srel);
      }
--- 4735,4742 ----

          for (fork = 0; fork <= MAX_FORKNUM; fork++)
          {
!             XLogDropRelation(xlrec->xnodes[i], fork);
!             smgrdounlink(srel, fork, true);
          }
          smgrclose(srel);
      }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fff933d403e7ca5e7b551b987e1684cd6a6e98a7..450e292f4013f9be5418f482654cd33c14c6d344 100644
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
*************** smgrDoPendingDeletes(bool isCommit)
*** 361,368 ****
                  srel = smgropen(pending->relnode, pending->backend);
                  for (i = 0; i <= MAX_FORKNUM; i++)
                  {
!                     if (smgrexists(srel, i))
!                         smgrdounlink(srel, i, false);
                  }
                  smgrclose(srel);
              }
--- 361,367 ----
                  srel = smgropen(pending->relnode, pending->backend);
                  for (i = 0; i <= MAX_FORKNUM; i++)
                  {
!                     smgrdounlink(srel, i, false);
                  }
                  smgrclose(srel);
              }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index a761369d39c2f7aa4c71131d79ef6a0edc61b4ea..7150a00407e3b28dae1cca2e6063251f2c5b681c 100644
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
*************** mdcreate(SMgrRelation reln, ForkNumber f
*** 323,329 ****
   * number until it's safe, because relfilenode assignment skips over any
   * existing file.
   *
!  * If isRedo is true, it's okay for the relation to be already gone.
   * Also, we should remove the file immediately instead of queuing a request
   * for later, since during redo there's no possibility of creating a
   * conflicting relation.
--- 323,335 ----
   * number until it's safe, because relfilenode assignment skips over any
   * existing file.
   *
!  * All the above applies only to the relation's main fork; other forks can
!  * just be removed immediately, since they are not needed to prevent the
!  * relfilenode number from being recycled.  Also, we do not carefully
!  * track whether other forks have been created or not, but just attempt to
!  * unlink them unconditionally; so we should never complain about ENOENT.
!  *
!  * If isRedo is true, it's unsurprising for the relation to be already gone.
   * Also, we should remove the file immediately instead of queuing a request
   * for later, since during redo there's no possibility of creating a
   * conflicting relation.
*************** mdunlink(RelFileNodeBackend rnode, ForkN
*** 351,363 ****
      if (isRedo || forkNum != MAIN_FORKNUM)
      {
          ret = unlink(path);
!         if (ret < 0)
!         {
!             if (!isRedo || errno != ENOENT)
!                 ereport(WARNING,
!                         (errcode_for_file_access(),
!                          errmsg("could not remove file \"%s\": %m", path)));
!         }
      }
      else
      {
--- 357,366 ----
      if (isRedo || forkNum != MAIN_FORKNUM)
      {
          ret = unlink(path);
!         if (ret < 0 && errno != ENOENT)
!             ereport(WARNING,
!                     (errcode_for_file_access(),
!                      errmsg("could not remove file \"%s\": %m", path)));
      }
      else
      {