Re: ERROR during end-of-xact/FATAL - Mailing list pgsql-hackers

From Noah Misch
Subject Re: ERROR during end-of-xact/FATAL
Date
Msg-id 20131113160412.GE833090@tornado.leadboat.com
Whole thread Raw
In response to Re: ERROR during end-of-xact/FATAL  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: ERROR during end-of-xact/FATAL  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Nov 12, 2013 at 09:55:34AM -0500, Robert Haas wrote:
> On Fri, Nov 8, 2013 at 4:13 PM, Noah Misch <noah@leadboat.com> wrote:
> > A PANIC will reinitialize everything relevant, largely resolving the problems
> > around ERROR during FATAL.  It's a heavy-handed solution, but it may well be
> > the best solution.  Efforts to harden CommitTransaction() and
> > AbortTransaction() seem well-spent, but the additional effort to make FATAL
> > exit cope where AbortTransaction() or another exit action could not cope seems
> > to be slicing ever-smaller portions of additional robustness.
> >
> > I pondered a variant of that conclusion that distinguished critical cleanup
> > needs from the rest.  Each shared resource (heavyweight locks, buffer pins,
> > LWLocks) would have an on_shmem_exit() callback that cleans up the resource
> > under a critical section.  (AtProcExit_Buffers() used to fill such a role, but
> > resowner.c's work during AbortTransaction() has mostly supplanted it.)  The
> > ShutdownPostgres callback would not use a critical section, so lesser failures
> > in AbortTransaction() would not upgrade to a PANIC.  But I'm leaning against
> > such a complication on the grounds that it would add seldom-tested code paths
> > posing as much a chance of eroding robustness as bolstering it.
> 
> The current situation is just plain weird: in the ERROR-then-ERROR
> case, we emit a WARNING and bounce right back into AbortTransaction(),
> and if it doesn't work any better the second time than the first time,
> we recurse again, and eventually if it fails enough times in a row, we
> just give up and PANIC.  But in the ERROR-then-FATAL case, we *don't*
> retry AbortTransaction(); instead, we just continue running the rest
> of the on_shmem_exit callbacks and then exit.
> 
> So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL.
> That's bizarre.

Quite so.

> Given that that's where we are, promoting an ERROR during FATAL
> processing to PANIC doesn't seem like it's losing much; we're
> essentially already doing that in the (probably more likely) case of a
> persistent ERROR during ERROR processing.  But since PANIC sucks, I'd
> rather go the other direction: let's make an ERROR during ERROR
> processing promote to FATAL.  And then let's do what you write above:
> make sure that there's a separate on-shmem-exit callback for each
> critical shared memory resource and that we call of those during FATAL
> processing.

Many of the factors that can cause AbortTransaction() to fail can also cause
CommitTransaction() to fail, and those would still PANIC if the transaction
had an xid.  How practical might it be to also escape from an error during
CommitTransaction() with a FATAL instead of PANIC?  There's more to fix up in
that case (sinval, NOTIFY), but it may be within reach.  If such a technique
can only reasonably fix abort, though, I have doubts it buys us enough.

> It seems to me that that's how things were originally designed to
> work, but that we've drifted away from it basically because the
> low-level callbacks to release heavyweight locks and buffer pins
> turned out to be kinda, uh, slow, and we thought those code paths
> couldn't be taken anyway (turns out they can).  I think we could
> either make those routines very fast, or arrange only to run that code
> at all in the case where AbortTransaction() didn't complete
> successfully.

Agreed; the performance hazards look tractable.

> It's true that such code will be rarely run, but the
> logic is simple enough that I think we can verify it by hand, and it's
> sure nice to avoid PANICs.

True.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: alter_table regression test problem
Next
From: Heikki Linnakangas
Date:
Subject: Re: Race condition in b-tree page deletion