Re: ERROR during end-of-xact/FATAL - Mailing list pgsql-hackers

From Robert Haas
Subject Re: ERROR during end-of-xact/FATAL
Date
Msg-id CA+TgmoY1uGDRVtKGFp347QehoKCEO4NPwYP=r7VscGC31NgcMQ@mail.gmail.com
Whole thread Raw
In response to Re: ERROR during end-of-xact/FATAL  (Noah Misch <noah@leadboat.com>)
Responses Re: ERROR during end-of-xact/FATAL  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
On Fri, Nov 8, 2013 at 4:13 PM, Noah Misch <noah@leadboat.com> wrote:
> A PANIC will reinitialize everything relevant, largely resolving the problems
> around ERROR during FATAL.  It's a heavy-handed solution, but it may well be
> the best solution.  Efforts to harden CommitTransaction() and
> AbortTransaction() seem well-spent, but the additional effort to make FATAL
> exit cope where AbortTransaction() or another exit action could not cope seems
> to be slicing ever-smaller portions of additional robustness.
>
> I pondered a variant of that conclusion that distinguished critical cleanup
> needs from the rest.  Each shared resource (heavyweight locks, buffer pins,
> LWLocks) would have an on_shmem_exit() callback that cleans up the resource
> under a critical section.  (AtProcExit_Buffers() used to fill such a role, but
> resowner.c's work during AbortTransaction() has mostly supplanted it.)  The
> ShutdownPostgres callback would not use a critical section, so lesser failures
> in AbortTransaction() would not upgrade to a PANIC.  But I'm leaning against
> such a complication on the grounds that it would add seldom-tested code paths
> posing as much a chance of eroding robustness as bolstering it.

The current situation is just plain weird: in the ERROR-then-ERROR
case, we emit a WARNING and bounce right back into AbortTransaction(),
and if it doesn't work any better the second time than the first time,
we recurse again, and eventually if it fails enough times in a row, we
just give up and PANIC.  But in the ERROR-then-FATAL case, we *don't*
retry AbortTransaction(); instead, we just continue running the rest
of the on_shmem_exit callbacks and then exit.

So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL.
That's bizarre.

Given that that's where we are, promoting an ERROR during FATAL
processing to PANIC doesn't seem like it's losing much; we're
essentially already doing that in the (probably more likely) case of a
persistent ERROR during ERROR processing.  But since PANIC sucks, I'd
rather go the other direction: let's make an ERROR during ERROR
processing promote to FATAL.  And then let's do what you write above:
make sure that there's a separate on-shmem-exit callback for each
critical shared memory resource and that we call of those during FATAL
processing.

It seems to me that that's how things were originally designed to
work, but that we've drifted away from it basically because the
low-level callbacks to release heavyweight locks and buffer pins
turned out to be kinda, uh, slow, and we thought those code paths
couldn't be taken anyway (turns out they can).  I think we could
either make those routines very fast, or arrange only to run that code
at all in the case where AbortTransaction() didn't complete
successfully.  It's true that such code will be rarely run, but the
logic is simple enough that I think we can verify it by hand, and it's
sure nice to avoid PANICs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Clang 3.3 Analyzer Results
Next
From: Tom Lane
Date:
Subject: Re: What's needed for cache-only table scan?