ERROR during end-of-xact/FATAL - Mailing list pgsql-hackers

From Noah Misch
Subject ERROR during end-of-xact/FATAL
Msg-id 20131031145234.GA621493@tornado.leadboat.com
Responses Re: ERROR during end-of-xact/FATAL  (Amit Kapila <amit.kapila16@gmail.com>)
Re: ERROR during end-of-xact/FATAL  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
CommitTransaction() and AbortTransaction() both do much work, and large
portions of that work either should not or must not throw errors.  An error
during either function will, as usual, siglongjmp() out.  Ordinarily,
PostgresMain() will regain control and fire off a fresh AbortTransaction().
The consequences thereof depend on the original function's progress:

- Before the function updates CurrentTransactionState->state, an ERROR is fully acceptable.  CommitTransaction()
specifically places failure-prone tasks accordingly; AbortTransaction() has no analogous tasks.
 

- After the function updates CurrentTransactionState->state, an ERROR yields a user-unfriendly message, e.g. "WARNING:
AbortTransaction while in COMMIT state".  This is not itself harmful, but we've largely kept the things that can fail for
pedestrian reasons ahead of that point.
 

- After CommitTransaction() calls RecordTransactionCommit() for an xid-bearing transaction, an ERROR upgrades to e.g.
"PANIC: cannot abort transaction 805, it was already committed".
 

- After AbortTransaction() calls ProcArrayEndTransaction() for an xid-bearing transaction, an ERROR will lead to this
assertion failure:
 
 TRAP: FailedAssertion("!(((allPgXact[proc->pgprocno].xid) != ((TransactionId) 0)))", File: "procarray.c", Line: 396)
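
The first three cases above can be modeled with a toy program.  This is only a
sketch: every toy_* name and enum is hypothetical, plain setjmp()/longjmp()
stands in for sigsetjmp()/siglongjmp(), and the "state" and "commit recorded"
flags are drastic simplifications of CurrentTransactionState->state and
RecordTransactionCommit().  The fourth case (a failure inside the abort path
itself) is not modeled.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical stand-ins; these are not PostgreSQL's real definitions. */
typedef enum { TOY_INPROGRESS, TOY_COMMIT, TOY_DEFAULT } ToyState;

typedef enum {
    TOY_COMMITTED,      /* no error at all */
    TOY_CLEAN_ABORT,    /* ERROR before the state update: fully acceptable */
    TOY_WARNING_ABORT,  /* ERROR after the state update: WARNING on abort */
    TOY_PANIC           /* ERROR after the commit was recorded */
} ToyOutcome;

static jmp_buf  toy_error_context;   /* where a simulated ERROR lands */
static ToyState toy_state;
static int      toy_commit_recorded;

/* Simulated ERROR: jump back to the main loop. */
static void toy_error(void)
{
    longjmp(toy_error_context, 1);
}

/* fail_at: 0 = no failure; 1, 2 or 3 = fail at the matching point below. */
static void toy_commit(int fail_at)
{
    if (fail_at == 1)
        toy_error();                 /* before the state update */
    toy_state = TOY_COMMIT;
    if (fail_at == 2)
        toy_error();                 /* after the state update */
    toy_commit_recorded = 1;         /* stand-in for recording the commit */
    if (fail_at == 3)
        toy_error();                 /* after the commit is recorded */
    toy_state = TOY_DEFAULT;
}

/* The fresh abort fired by the main loop after an ERROR. */
static ToyOutcome toy_abort(void)
{
    ToyOutcome outcome;

    if (toy_commit_recorded)
        outcome = TOY_PANIC;         /* cannot abort: already committed */
    else if (toy_state != TOY_INPROGRESS)
        outcome = TOY_WARNING_ABORT; /* "while in COMMIT state" */
    else
        outcome = TOY_CLEAN_ABORT;
    toy_state = TOY_DEFAULT;
    return outcome;
}

/* Stand-in for the main loop regaining control after an ERROR. */
ToyOutcome toy_run_transaction(int fail_at)
{
    assert(fail_at >= 0 && fail_at <= 3);
    toy_state = TOY_INPROGRESS;
    toy_commit_recorded = 0;

    if (setjmp(toy_error_context) != 0)
        return toy_abort();

    toy_commit(fail_at);
    return TOY_COMMITTED;
}
```

Calling toy_run_transaction() with 1, 2 and 3 walks through the three cases in
order; the later the failure point, the worse the outcome of the follow-up
abort.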

If the original AbortTransaction() pertained to a FATAL, the situation is
worse.  errfinish() promotes the ERROR thrown from AbortTransaction() to
another FATAL, so we reenter proc_exit().  Thanks to the following logic in
shmem_exit(), we will never return to AbortTransaction():

    /*
     * call all the registered callbacks.
     *
     * Note that since we decrement on_proc_exit_index each time, if a
     * callback calls ereport(ERROR) or ereport(FATAL) then it won't be
     * invoked again when control comes back here (nor will the
     * previously-completed callbacks).  So, an infinite loop should not be
     * possible.
     */
 

As a result, we miss any cleanups that had not yet happened in the original
AbortTransaction().  In particular, this can leak heavyweight locks.  An
asserts build subsequently fails this way:

TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File: "proc.c", Line: 788)

In a production build, the affected PGPROC slot just continues to hold the
lock until the next backend claiming that slot calls LockReleaseAll().  Oops.
Bottom line: most bets are off given an ERROR after RecordTransactionCommit()
in CommitTransaction() or anywhere in AbortTransaction().
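
The FATAL-then-ERROR sequence can be illustrated with a small model of a
callback list that decrements its index before each call.  This is a sketch
under loud assumptions: all toy_* names are hypothetical, longjmp() stands in
for the error being promoted to FATAL and reentering the exit path, and the
abort work is assumed to run inside one exit callback with lock release as
its last step.

```c
#include <assert.h>
#include <setjmp.h>

#define TOY_MAX_CALLBACKS 4

typedef void (*toy_callback)(void);

/* Hypothetical callback registry, loosely modeled on the quoted comment. */
static toy_callback toy_callbacks[TOY_MAX_CALLBACKS];
static int          toy_callback_index;
static jmp_buf      toy_exit_context;  /* reentry point of the exit path */
static int          toy_fail_in_abort;

int toy_abort_calls;   /* how often the abort callback was entered */
int toy_other_calls;   /* how often the unrelated callback ran */
int toy_locks_held;    /* 1 while the simulated lock is still held */

static void toy_register(toy_callback fn)
{
    assert(toy_callback_index < TOY_MAX_CALLBACKS);
    toy_callbacks[toy_callback_index++] = fn;
}

/* Abort work: may fail before it reaches the lock release. */
static void toy_abort_callback(void)
{
    toy_abort_calls++;
    if (toy_fail_in_abort)
        longjmp(toy_exit_context, 1);  /* simulated ERROR promoted to FATAL */
    toy_locks_held = 0;                /* the cleanup that gets skipped */
}

static void toy_other_callback(void)
{
    toy_other_calls++;
}

/* Decrement first, then call: a failing callback is never invoked again. */
static void toy_run_callbacks(void)
{
    while (--toy_callback_index >= 0)
        toy_callbacks[toy_callback_index]();
    toy_callback_index = 0;
}

/* Returns 1 if the simulated lock leaked, 0 if it was released. */
int toy_proc_exit(int fail_in_abort)
{
    toy_callback_index = 0;
    toy_abort_calls = 0;
    toy_other_calls = 0;
    toy_locks_held = 1;
    toy_fail_in_abort = fail_in_abort;

    toy_register(toy_other_callback);  /* registered first, runs last */
    toy_register(toy_abort_callback);  /* registered last, runs first */

    (void) setjmp(toy_exit_context);   /* a failing callback lands here */
    toy_run_callbacks();
    return toy_locks_held;
}
```

With fail_in_abort set, the abort callback is entered exactly once, the
remaining callback still runs, and the function reports the lock as still
held: the loop cannot spin forever, but the unfinished cleanup is lost.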


Now, while those assertion failures are worth preventing on general principle,
the actual field importance depends on whether things actually do fail in the
vulnerable end-of-xact work.  We've prevented the errors that would otherwise
be relatively common, but there are several rare ways to get a late ERROR.
Incomplete list:

- If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee relpathbackend() call palloc(); this is
true in all supported branches.  In 9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().  (In fact,
it does so even when the pending list is empty -- this is the only palloc() during a trivial transaction commit.)
A palloc() failure there yields a PANIC during commit.
 

- ResourceOwnerRelease() calls FileClose() during abort, and FileClose() raises an ERROR when close() returns EIO.

- AtEOXact_Inval() can lead to calls like RelationReloadIndexInfo(), which has many ways to throw errors.  This
precedes releasing heavyweight locks, so an error here during an abort pertaining to a FATAL exit orphans locks as
described above.  This relates to another recent thread:
 
http://www.postgresql.org/message-id/20130805170931.GA369289@tornado.leadboat.com


What should we do to mitigate these problems?  Certainly we can harden
individual end-of-xact tasks to not throw errors, as we have in the past.
What higher-level strategies should we consider?  What about for the unclean
result of the FATAL-then-ERROR scenario in particular?  If we can't manage to
free a shared memory resource like a lock or buffer pin, we really must PANIC.
Releasing those things is quite reliable, though.  The tasks that have the
highest chance of capsizing the AbortTransaction() are of backend-local
interest, or they're tasks for which we tolerate failure as a rule
(e.g. unlinking files).
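
For the "tolerate failure as a rule" category, one hardening pattern is to
report the failure as a warning and carry on rather than raise an error.  The
following is a minimal sketch of that pattern, not PostgreSQL code:
toy_unlink_tolerant() and toy_unlink_warnings are hypothetical names, and
writing to stderr stands in for a WARNING-level report.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical counter of tolerated failures, for inspection by callers. */
int toy_unlink_warnings;

/*
 * Try to remove a file during cleanup.  Returns 0 if the file was removed
 * and 1 if the failure was tolerated.  It never aborts the caller's cleanup
 * because of an unlink() failure.
 */
int toy_unlink_tolerant(const char *path)
{
    int saved_errno;

    assert(path != NULL);

    if (unlink(path) == 0)
        return 0;

    saved_errno = errno;    /* capture before any other library call */
    toy_unlink_warnings++;
    fprintf(stderr, "WARNING: could not remove file \"%s\": %s\n",
            path, strerror(saved_errno));
    return 1;
}
```

A caller in an abort path would ignore a return value of 1 and proceed to the
steps that must not be skipped, such as releasing shared resources.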

Robert Haas provided a large slice of the research for this report.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com


