Re: ERROR during end-of-xact/FATAL - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: ERROR during end-of-xact/FATAL
Date
Msg-id CAA4eK1KDgb3o9Ds7V14kZgWPCVQkZnmAKeRfXdK352jpiVzv6g@mail.gmail.com
Whole thread Raw
In response to ERROR during end-of-xact/FATAL  (Noah Misch <noah@leadboat.com>)
Responses Re: ERROR during end-of-xact/FATAL  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
On Thu, Oct 31, 2013 at 8:22 PM, Noah Misch <noah@leadboat.com> wrote:
> CommitTransaction() and AbortTransaction() both do much work, and large
> portions of that work either should not or must not throw errors.  An error
> during either function will, as usual, siglongjmp() out.  Ordinarily,
> PostgresMain() will regain control and fire off a fresh AbortTransaction().
> The consequences thereof depend on the original function's progress:
>
> - Before the function updates CurrentTransactionState->state, an ERROR is
>   fully acceptable.  CommitTransaction() specifically places failure-prone
>   tasks accordingly; AbortTransaction() has no analogous tasks.
>
> - After the function updates CurrentTransactionState->state, an ERROR yields a
>   user-unfriendly e.g. "WARNING:  AbortTransaction while in COMMIT state".
>   This is not itself harmful, but we've largely kept the things that can fail
>   for pedestrian reasons ahead of that point.
>
> - After CommitTransaction() calls RecordTransactionCommit() for an xid-bearing
>   transaction, an ERROR upgrades to e.g. "PANIC:  cannot abort transaction
>   805, it was already committed".
>
> - After AbortTransaction() calls ProcArrayEndTransaction() for an xid-bearing
>   transaction, an ERROR will lead to this assertion failure:
>
>   TRAP: FailedAssertion("!(((allPgXact[proc->pgprocno].xid) != ((TransactionId) 0)))", File: "procarray.c", Line:
396)
>
> If the original AbortTransaction() pertained to a FATAL, the situation is
> worse.  errfinish() promotes the ERROR thrown from AbortTransaction() to
> another FATAL,

isn't errstart promotes ERROR to FATAL?

> so we reenter proc_exit().  Thanks to the following logic in
> shmem_exit(),

following comment is in proc_exit_prepare(), but the effect will be
same as described you due to logic in shmem_exit().

> we will never return to AbortTransaction():
>
>         /*
>          * call all the registered callbacks.
>          *
>          * Note that since we decrement on_proc_exit_index each time, if a
>          * callback calls ereport(ERROR) or ereport(FATAL) then it won't be
>          * invoked again when control comes back here (nor will the
>          * previously-completed callbacks).  So, an infinite loop should not be
>          * possible.
>          */
>
> As a result, we miss any cleanups that had not yet happened in the original
> AbortTransaction().  In particular, this can leak heavyweight locks.  An
> asserts build subsequently fails this way:
>
> TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File: "proc.c", Line: 788)
>
> In a production build, the affected PGPROC slot just continues to hold the
> lock until the next backend claiming that slot calls LockReleaseAll().  Oops.
> Bottom line: most bets are off given an ERROR after RecordTransactionCommit()
> in CommitTransaction() or anywhere in AbortTransaction().

When I tried above scenario, I hit Assert at different place

shmem_exit()->pgstat_beshutdown_hook()->pgstat_report_stat()
pgstat_report_stat(bool force)
{
..
Assert(entry->trans == NULL);
}

The reason is that in AbortTransaction, an ERROR is thrown before call
to AtEOXact_PgStat(false).

This means that in the situation when an ERROR occurs in
AbortTransaction which is called as a result of FATAL error, there are
many more possibilities of Assert.

>
> Now, while those assertion failures are worth preventing on general principle,
> the actual field importance depends on whether things actually do fail in the
> vulnerable end-of-xact work.  We've prevented the errors that would otherwise
> be relatively common, but there are several rare ways to get a late ERROR.
> Incomplete list:
>
> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
>   relpathbackend() call palloc(); this is true in all supported branches.  In
>   9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
>   (In fact, it does so even when the pending list is empty -- this is the only
>   palloc() during a trivial transaction commit.)  palloc() failure there
>   yields a PANIC during commit.
>
> - ResourceOwnerRelease() calls FileClose() during abort, and FileClose()
>   raises an ERROR when close() returns EIO.
>
> - AtEOXact_Inval() can lead to calls like RelationReloadIndexInfo(), which has
>   many ways to throw errors.  This precedes releasing heavyweight locks, so an
>   error here during an abort pertaining to a FATAL exit orphans locks as
>   described above.  This relates into another recent thread:
> http://www.postgresql.org/message-id/20130805170931.GA369289@tornado.leadboat.com
>
>
> What should we do to mitigate these problems?  Certainly we can harden
> individual end-of-xact tasks to not throw errors, as we have in the past.
> What higher-level strategies should we consider?  What about for the unclean
> result of the FATAL-then-ERROR scenario in particular?

About unclean FATAL-then-ERROR scenario, one way to deal at high level
could be to treat such a case as backend crash in which case
postmaster reinitialises shared memory and other stuff.

> If we can't manage to
> free a shared memory resource like a lock or buffer pin, we really must PANIC.

Can't we try to initialise the shared memory and other resources,
wouldn't that resolve the problem's that can occur due to scenario
explained by you?

> Releasing those things is quite reliable, though.  The tasks that have the
> highest chance of capsizing the AbortTransaction() are of backend-local
> interest, or they're tasks for which we tolerate failure as a rule
> (e.g. unlinking files).
>
> Robert Haas provided a large slice of the research for this report.

This analysis is quite useful because I feel if any user hits this
problem, the repercussion is much higher, even though chances of
hitting such problem is less.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: git diff --check whitespace checks, gitattributes
Next
From: Craig Ringer
Date:
Subject: Re: [v9.4] row level security