Thread: ERROR during end-of-xact/FATAL

ERROR during end-of-xact/FATAL

From: Noah Misch
CommitTransaction() and AbortTransaction() both do much work, and large
portions of that work either should not or must not throw errors.  An error
during either function will, as usual, siglongjmp() out.  Ordinarily,
PostgresMain() will regain control and fire off a fresh AbortTransaction().
The consequences thereof depend on the original function's progress:

- Before the function updates CurrentTransactionState->state, an ERROR is
  fully acceptable.  CommitTransaction() specifically places failure-prone
  tasks accordingly; AbortTransaction() has no analogous tasks.

- After the function updates CurrentTransactionState->state, an ERROR yields a
  user-unfriendly e.g. "WARNING:  AbortTransaction while in COMMIT state".
  This is not itself harmful, but we've largely kept the things that can fail
  for pedestrian reasons ahead of that point.

- After CommitTransaction() calls RecordTransactionCommit() for an xid-bearing
  transaction, an ERROR upgrades to e.g. "PANIC:  cannot abort transaction
  805, it was already committed".

- After AbortTransaction() calls ProcArrayEndTransaction() for an xid-bearing
  transaction, an ERROR will lead to this assertion failure:

  TRAP: FailedAssertion("!(((allPgXact[proc->pgprocno].xid) != ((TransactionId) 0)))", File: "procarray.c", Line: 396)

If the original AbortTransaction() pertained to a FATAL, the situation is
worse.  errfinish() promotes the ERROR thrown from AbortTransaction() to
another FATAL, so we reenter proc_exit().  Thanks to the following logic in
shmem_exit(), we will never return to AbortTransaction():
        /*
         * call all the registered callbacks.
         *
         * Note that since we decrement on_proc_exit_index each time, if a
         * callback calls ereport(ERROR) or ereport(FATAL) then it won't be
         * invoked again when control comes back here (nor will the
         * previously-completed callbacks).  So, an infinite loop should not be
         * possible.
         */
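
For concreteness, here is a minimal sketch of that decrement-before-call
pattern -- not the actual ipc.c source, all names invented:

typedef void (*exit_callback) (int code, void *arg);

#define MAX_EXIT_CALLBACKS 20

static struct
{
    exit_callback function;
    void       *arg;
} exit_list[MAX_EXIT_CALLBACKS];

static int  exit_list_index = 0;    /* number of registered callbacks */

static void
run_exit_callbacks(int code)
{
    /*
     * Decrement before each call: if a callback throws and we later reenter
     * this loop, that entry is never invoked again, and whatever cleanup it
     * had not yet finished is simply lost.
     */
    while (--exit_list_index >= 0)
        (*exit_list[exit_list_index].function) (code,
                                                exit_list[exit_list_index].arg);
}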
 

As a result, we miss any cleanups that had not yet happened in the original
AbortTransaction().  In particular, this can leak heavyweight locks.  An
asserts build subsequently fails this way:

TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File: "proc.c", Line: 788)

In a production build, the affected PGPROC slot just continues to hold the
lock until the next backend claiming that slot calls LockReleaseAll().  Oops.
Bottom line: most bets are off given an ERROR after RecordTransactionCommit()
in CommitTransaction() or anywhere in AbortTransaction().


Now, while those assertion failures are worth preventing on general principle,
the actual field importance depends on whether things actually do fail in the
vulnerable end-of-xact work.  We've prevented the errors that would otherwise
be relatively common, but there are several rare ways to get a late ERROR.
Incomplete list:

- If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
  relpathbackend() call palloc(); this is true in all supported branches.  In
  9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
  (In fact, it does so even when the pending list is empty -- this is the only
  palloc() during a trivial transaction commit.)  palloc() failure there
  yields a PANIC during commit.
 

- ResourceOwnerRelease() calls FileClose() during abort, and FileClose() raises an ERROR when close() returns EIO.

- AtEOXact_Inval() can lead to calls like RelationReloadIndexInfo(), which has
  many ways to throw errors.  This precedes releasing heavyweight locks, so an
  error here during an abort pertaining to a FATAL exit orphans locks as
  described above.  This relates to another recent thread:
 
http://www.postgresql.org/message-id/20130805170931.GA369289@tornado.leadboat.com


What should we do to mitigate these problems?  Certainly we can harden
individual end-of-xact tasks to not throw errors, as we have in the past.
What higher-level strategies should we consider?  What about for the unclean
result of the FATAL-then-ERROR scenario in particular?  If we can't manage to
free a shared memory resource like a lock or buffer pin, we really must PANIC.
Releasing those things is quite reliable, though.  The tasks that have the
highest chance of capsizing the AbortTransaction() are of backend-local
interest, or they're tasks for which we tolerate failure as a rule
(e.g. unlinking files).

Robert Haas provided a large slice of the research for this report.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: ERROR during end-of-xact/FATAL

From: Amit Kapila
On Thu, Oct 31, 2013 at 8:22 PM, Noah Misch <noah@leadboat.com> wrote:
> CommitTransaction() and AbortTransaction() both do much work, and large
> portions of that work either should not or must not throw errors.  An error
> during either function will, as usual, siglongjmp() out.  Ordinarily,
> PostgresMain() will regain control and fire off a fresh AbortTransaction().
> The consequences thereof depend on the original function's progress:
>
> - Before the function updates CurrentTransactionState->state, an ERROR is
>   fully acceptable.  CommitTransaction() specifically places failure-prone
>   tasks accordingly; AbortTransaction() has no analogous tasks.
>
> - After the function updates CurrentTransactionState->state, an ERROR yields a
>   user-unfriendly e.g. "WARNING:  AbortTransaction while in COMMIT state".
>   This is not itself harmful, but we've largely kept the things that can fail
>   for pedestrian reasons ahead of that point.
>
> - After CommitTransaction() calls RecordTransactionCommit() for an xid-bearing
>   transaction, an ERROR upgrades to e.g. "PANIC:  cannot abort transaction
>   805, it was already committed".
>
> - After AbortTransaction() calls ProcArrayEndTransaction() for an xid-bearing
>   transaction, an ERROR will lead to this assertion failure:
>
>   TRAP: FailedAssertion("!(((allPgXact[proc->pgprocno].xid) != ((TransactionId) 0)))", File: "procarray.c", Line: 396)
>
> If the original AbortTransaction() pertained to a FATAL, the situation is
> worse.  errfinish() promotes the ERROR thrown from AbortTransaction() to
> another FATAL,

Isn't it errstart() that promotes ERROR to FATAL?

> so we reenter proc_exit().  Thanks to the following logic in
> shmem_exit(),

The following comment is in proc_exit_prepare(), but the effect will be
the same as you described, due to the logic in shmem_exit().

> we will never return to AbortTransaction():
>
>         /*
>          * call all the registered callbacks.
>          *
>          * Note that since we decrement on_proc_exit_index each time, if a
>          * callback calls ereport(ERROR) or ereport(FATAL) then it won't be
>          * invoked again when control comes back here (nor will the
>          * previously-completed callbacks).  So, an infinite loop should not be
>          * possible.
>          */
>
> As a result, we miss any cleanups that had not yet happened in the original
> AbortTransaction().  In particular, this can leak heavyweight locks.  An
> asserts build subsequently fails this way:
>
> TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File: "proc.c", Line: 788)
>
> In a production build, the affected PGPROC slot just continues to hold the
> lock until the next backend claiming that slot calls LockReleaseAll().  Oops.
> Bottom line: most bets are off given an ERROR after RecordTransactionCommit()
> in CommitTransaction() or anywhere in AbortTransaction().

When I tried the above scenario, I hit an Assert at a different place:

shmem_exit()->pgstat_beshutdown_hook()->pgstat_report_stat()
pgstat_report_stat(bool force)
{
..
Assert(entry->trans == NULL);
}

The reason is that in AbortTransaction(), an ERROR is thrown before the
call to AtEOXact_PgStat(false).

This means that when an ERROR occurs in an AbortTransaction() that was
called as a result of a FATAL error, there are many more possible
assertion failures.

>
> Now, while those assertion failures are worth preventing on general principle,
> the actual field importance depends on whether things actually do fail in the
> vulnerable end-of-xact work.  We've prevented the errors that would otherwise
> be relatively common, but there are several rare ways to get a late ERROR.
> Incomplete list:
>
> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
>   relpathbackend() call palloc(); this is true in all supported branches.  In
>   9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
>   (In fact, it does so even when the pending list is empty -- this is the only
>   palloc() during a trivial transaction commit.)  palloc() failure there
>   yields a PANIC during commit.
>
> - ResourceOwnerRelease() calls FileClose() during abort, and FileClose()
>   raises an ERROR when close() returns EIO.
>
> - AtEOXact_Inval() can lead to calls like RelationReloadIndexInfo(), which has
>   many ways to throw errors.  This precedes releasing heavyweight locks, so an
>   error here during an abort pertaining to a FATAL exit orphans locks as
>   described above.  This relates into another recent thread:
> http://www.postgresql.org/message-id/20130805170931.GA369289@tornado.leadboat.com
>
>
> What should we do to mitigate these problems?  Certainly we can harden
> individual end-of-xact tasks to not throw errors, as we have in the past.
> What higher-level strategies should we consider?  What about for the unclean
> result of the FATAL-then-ERROR scenario in particular?

About the unclean FATAL-then-ERROR scenario, one way to deal with it at a
high level could be to treat such a case as a backend crash, in which case
the postmaster reinitialises shared memory and other stuff.

> If we can't manage to
> free a shared memory resource like a lock or buffer pin, we really must PANIC.

Can't we try to reinitialise the shared memory and other resources?
Wouldn't that resolve the problems that can occur due to the scenario
you explained?

> Releasing those things is quite reliable, though.  The tasks that have the
> highest chance of capsizing the AbortTransaction() are of backend-local
> interest, or they're tasks for which we tolerate failure as a rule
> (e.g. unlinking files).
>
> Robert Haas provided a large slice of the research for this report.

This analysis is quite useful because I feel that if any user hits this
problem, the repercussions are severe, even though the chances of hitting
it are low.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR during end-of-xact/FATAL

From: Alvaro Herrera
Noah Misch wrote:

> Incomplete list:
> 
> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
>   relpathbackend() call palloc(); this is true in all supported branches.  In
>   9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
>   (In fact, it does so even when the pending list is empty -- this is the only
>   palloc() during a trivial transaction commit.)  palloc() failure there
>   yields a PANIC during commit.

I think we should fix this routine to avoid the palloc when not necessary.
That initial palloc is pointless.

Also, there have been previous discussions about having relpathbackend
not use palloc at all.  That was only because we wanted to use it in
pg_xlogdump which didn't have palloc support at the time, so it's no
longer as pressing; but perhaps it's still worthy of consideration.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: ERROR during end-of-xact/FATAL

From: Robert Haas
On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Noah Misch wrote:
>> Incomplete list:
>>
>> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
>>   relpathbackend() call palloc(); this is true in all supported branches.  In
>>   9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
>>   (In fact, it does so even when the pending list is empty -- this is the only
>>   palloc() during a trivial transaction commit.)  palloc() failure there
>>   yields a PANIC during commit.
>
> I think we should fix this routine to avoid the palloc when not necessary.
> That initial palloc is pointless.
>
> Also, there have been previous discussions about having relpathbackend
> not use palloc at all.  That was only because we wanted to use it in
> pg_xlogdump which didn't have palloc support at the time, so it's no
> longer as pressing; but perhaps it's still worthy of consideration.

+1, but I'd like it better if we could find a way of avoiding the
palloc in all cases.  Panicking because we run out of memory at the
wrong time isn't really very nice.  Maybe the information needs to be
maintained in the format in which it ultimately needs to be returned,
so that we needn't rejigger it in the middle of a critical section.
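
Something like the following, say -- purely a sketch, with invented names
and none of the real bookkeeping (nest levels, commit-vs-abort flags) that
smgr's pending-delete list actually carries:

#include "postgres.h"
#include "storage/relfilenode.h"
#include "utils/memutils.h"

/*
 * Sketch: keep the pending unlinks in exactly the array form that
 * RecordTransactionCommit() wants, growing the array at the time each
 * unlink is scheduled (nowhere near a critical section), so that handing
 * the list over at commit time allocates nothing at all.
 */
static RelFileNode *pending_rnodes = NULL;
static int  pending_nrels = 0;
static int  pending_maxrels = 0;

static void
remember_pending_unlink(RelFileNode rnode)
{
    if (pending_nrels >= pending_maxrels)
    {
        pending_maxrels = Max(8, pending_maxrels * 2);
        if (pending_rnodes == NULL)
            pending_rnodes = MemoryContextAlloc(TopMemoryContext,
                                    pending_maxrels * sizeof(RelFileNode));
        else
            pending_rnodes = repalloc(pending_rnodes,
                                    pending_maxrels * sizeof(RelFileNode));
    }
    pending_rnodes[pending_nrels++] = rnode;
}

/* commit path: no palloc, just hand back what we already have */
static int
get_pending_deletes(RelFileNode **ptr)
{
    *ptr = pending_rnodes;
    return pending_nrels;
}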

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ERROR during end-of-xact/FATAL

From: Noah Misch
On Wed, Nov 06, 2013 at 10:14:53AM +0530, Amit Kapila wrote:
> On Thu, Oct 31, 2013 at 8:22 PM, Noah Misch <noah@leadboat.com> wrote:
> > If the original AbortTransaction() pertained to a FATAL, the situation is
> > worse.  errfinish() promotes the ERROR thrown from AbortTransaction() to
> > another FATAL,
> 
> isn't errstart promotes ERROR to FATAL?

Right.

> When I tried above scenario, I hit Assert at different place
...
> This means that in the situation when an ERROR occurs in
> AbortTransaction which is called as a result of FATAL error, there are
> many more possibilities of Assert.

Agreed.

> About unclean FATAL-then-ERROR scenario, one way to deal at high level
> could be to treat such a case as backend crash in which case
> postmaster reinitialises shared memory and other stuff.
> 
> > If we can't manage to
> > free a shared memory resource like a lock or buffer pin, we really must PANIC.
> 
> Can't we try to initialise the shared memory and other resources,
> wouldn't that resolve the problem's that can occur due to scenario
> explained by you?

A PANIC will reinitialize everything relevant, largely resolving the problems
around ERROR during FATAL.  It's a heavy-handed solution, but it may well be
the best solution.  Efforts to harden CommitTransaction() and
AbortTransaction() seem well-spent, but the additional effort to make FATAL
exit cope where AbortTransaction() or another exit action could not cope seems
to be slicing ever-smaller portions of additional robustness.

I pondered a variant of that conclusion that distinguished critical cleanup
needs from the rest.  Each shared resource (heavyweight locks, buffer pins,
LWLocks) would have an on_shmem_exit() callback that cleans up the resource
under a critical section.  (AtProcExit_Buffers() used to fill such a role, but
resowner.c's work during AbortTransaction() has mostly supplanted it.)  The
ShutdownPostgres callback would not use a critical section, so lesser failures
in AbortTransaction() would not upgrade to a PANIC.  But I'm leaning against
such a complication on the grounds that it would add seldom-tested code paths
posing as much a chance of eroding robustness as bolstering it.
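
In code, each such callback would be something along these lines (sketch
only; the names are invented, and e.g. USER_LOCKMETHOD locks would need the
same treatment):

#include "postgres.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/lock.h"

/* One per-resource exit callback, run unconditionally at proc_exit(). */
static void
AtProcExit_ReleaseHeavyweightLocks(int code, Datum arg)
{
    /*
     * Inside the critical section any ERROR escalates to PANIC, so we either
     * release the shared resource or take the cluster down -- we never
     * silently leak it into the PGPROC slot.
     */
    START_CRIT_SECTION();
    LockReleaseAll(DEFAULT_LOCKMETHOD, true);
    END_CRIT_SECTION();
}

/* registered once at backend startup (hypothetical helper) */
void
RegisterLockExitCallback(void)
{
    on_shmem_exit(AtProcExit_ReleaseHeavyweightLocks, (Datum) 0);
}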

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: ERROR during end-of-xact/FATAL

From: Amit Kapila
On Sat, Nov 9, 2013 at 2:43 AM, Noah Misch <noah@leadboat.com> wrote:
> On Wed, Nov 06, 2013 at 10:14:53AM +0530, Amit Kapila wrote:
>> On Thu, Oct 31, 2013 at 8:22 PM, Noah Misch <noah@leadboat.com> wrote:
>> About unclean FATAL-then-ERROR scenario, one way to deal at high level
>> could be to treat such a case as backend crash in which case
>> postmaster reinitialises shared memory and other stuff.
>>
>> > If we can't manage to
>> > free a shared memory resource like a lock or buffer pin, we really must PANIC.
>>
>> Can't we try to initialise the shared memory and other resources,
>> wouldn't that resolve the problem's that can occur due to scenario
>> explained by you?
>
> A PANIC will reinitialize everything relevant, largely resolving the problems
> around ERROR during FATAL.  It's a heavy-handed solution, but it may well be
> the best solution.  Efforts to harden CommitTransaction() and
> AbortTransaction() seem well-spent, but the additional effort to make FATAL
> exit cope where AbortTransaction() or another exit action could not cope seems
> to be slicing ever-smaller portions of additional robustness.
>
> I pondered a variant of that conclusion that distinguished critical cleanup
> needs from the rest.  Each shared resource (heavyweight locks, buffer pins,
> LWLocks) would have an on_shmem_exit() callback that cleans up the resource
> under a critical section.  (AtProcExit_Buffers() used to fill such a role, but
> resowner.c's work during AbortTransaction() has mostly supplanted it.)  The
> ShutdownPostgres callback would not use a critical section, so lesser failures
> in AbortTransaction() would not upgrade to a PANIC.  But I'm leaning against
> such a complication on the grounds that it would add seldom-tested code paths
> posing as much a chance of eroding robustness as bolstering it.

I think here a PANIC is the safe and less complicated solution for this
situation.  Apart from that, we can try to avoid palloc() and other such
errors in the AbortTransaction()/CommitTransaction() path.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR during end-of-xact/FATAL

From: Robert Haas
On Fri, Nov 8, 2013 at 4:13 PM, Noah Misch <noah@leadboat.com> wrote:
> A PANIC will reinitialize everything relevant, largely resolving the problems
> around ERROR during FATAL.  It's a heavy-handed solution, but it may well be
> the best solution.  Efforts to harden CommitTransaction() and
> AbortTransaction() seem well-spent, but the additional effort to make FATAL
> exit cope where AbortTransaction() or another exit action could not cope seems
> to be slicing ever-smaller portions of additional robustness.
>
> I pondered a variant of that conclusion that distinguished critical cleanup
> needs from the rest.  Each shared resource (heavyweight locks, buffer pins,
> LWLocks) would have an on_shmem_exit() callback that cleans up the resource
> under a critical section.  (AtProcExit_Buffers() used to fill such a role, but
> resowner.c's work during AbortTransaction() has mostly supplanted it.)  The
> ShutdownPostgres callback would not use a critical section, so lesser failures
> in AbortTransaction() would not upgrade to a PANIC.  But I'm leaning against
> such a complication on the grounds that it would add seldom-tested code paths
> posing as much a chance of eroding robustness as bolstering it.

The current situation is just plain weird: in the ERROR-then-ERROR
case, we emit a WARNING and bounce right back into AbortTransaction(),
and if it doesn't work any better the second time than the first time,
we recurse again, and eventually if it fails enough times in a row, we
just give up and PANIC.  But in the ERROR-then-FATAL case, we *don't*
retry AbortTransaction(); instead, we just continue running the rest
of the on_shmem_exit callbacks and then exit.

So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL.
That's bizarre.
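
Here's a self-contained toy model of that asymmetry -- not our code, every
name invented -- in case it helps:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf err_jmp;
static int  abort_attempts = 0;

/* stand-in for ereport(ERROR): longjmp back to the "main loop" */
static void
throw_error(const char *msg)
{
    printf("ERROR: %s\n", msg);
    longjmp(err_jmp, 1);
}

/* stand-in for AbortTransaction(): fails a few times, then succeeds */
static void
abort_transaction(void)
{
    if (++abort_attempts <= 3)
        throw_error("cleanup step failed");     /* e.g. FileClose() hits EIO */
    printf("abort completed on attempt %d\n", abort_attempts);
}

int
main(void)
{
    /*
     * ERROR-then-ERROR: the main loop lands back here after each failure and
     * retries the abort until it succeeds (the real backend gives up and
     * PANICs after enough consecutive failures).
     */
    for (;;)
    {
        if (setjmp(err_jmp) == 0)
        {
            abort_transaction();
            break;
        }
        /* longjmp landed here; loop around and retry */
    }

    /*
     * FATAL-then-ERROR: exit callbacks run once each; a callback that throws
     * is skipped and never retried, so its unfinished cleanup (e.g. releasing
     * heavyweight locks) is simply lost.
     */
    abort_attempts = 0;
    if (setjmp(err_jmp) == 0)
        abort_transaction();                    /* throws; not retried */
    printf("process exits with cleanup incomplete\n");
    return 0;
}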

Given that that's where we are, promoting an ERROR during FATAL
processing to PANIC doesn't seem like it's losing much; we're
essentially already doing that in the (probably more likely) case of a
persistent ERROR during ERROR processing.  But since PANIC sucks, I'd
rather go the other direction: let's make an ERROR during ERROR
processing promote to FATAL.  And then let's do what you write above:
make sure that there's a separate on-shmem-exit callback for each
critical shared memory resource and that we call all of those during FATAL
processing.

It seems to me that that's how things were originally designed to
work, but that we've drifted away from it basically because the
low-level callbacks to release heavyweight locks and buffer pins
turned out to be kinda, uh, slow, and we thought those code paths
couldn't be taken anyway (turns out they can).  I think we could
either make those routines very fast, or arrange only to run that code
at all in the case where AbortTransaction() didn't complete
successfully.  It's true that such code will be rarely run, but the
logic is simple enough that I think we can verify it by hand, and it's
sure nice to avoid PANICs.
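
The only-run-it-when-needed variant could be as simple as a flag, roughly
(names invented):

#include "postgres.h"
#include "storage/ipc.h"

/* Set at the very end of a successful CommitTransaction()/AbortTransaction(). */
static bool xact_shared_resources_released = false;

/* on_shmem_exit() callback: only does its slow work when actually needed. */
static void
AtProcExit_SharedResourceBackstop(int code, Datum arg)
{
    if (xact_shared_resources_released)
        return;                 /* the normal transaction path did the work */

    /*
     * We only get here when CommitTransaction()/AbortTransaction() failed
     * partway through, so the low-level release of locks and pins that would
     * go here can afford to be slow.
     */
}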

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ERROR during end-of-xact/FATAL

From: Noah Misch
On Tue, Nov 12, 2013 at 09:55:34AM -0500, Robert Haas wrote:
> On Fri, Nov 8, 2013 at 4:13 PM, Noah Misch <noah@leadboat.com> wrote:
> > A PANIC will reinitialize everything relevant, largely resolving the problems
> > around ERROR during FATAL.  It's a heavy-handed solution, but it may well be
> > the best solution.  Efforts to harden CommitTransaction() and
> > AbortTransaction() seem well-spent, but the additional effort to make FATAL
> > exit cope where AbortTransaction() or another exit action could not cope seems
> > to be slicing ever-smaller portions of additional robustness.
> >
> > I pondered a variant of that conclusion that distinguished critical cleanup
> > needs from the rest.  Each shared resource (heavyweight locks, buffer pins,
> > LWLocks) would have an on_shmem_exit() callback that cleans up the resource
> > under a critical section.  (AtProcExit_Buffers() used to fill such a role, but
> > resowner.c's work during AbortTransaction() has mostly supplanted it.)  The
> > ShutdownPostgres callback would not use a critical section, so lesser failures
> > in AbortTransaction() would not upgrade to a PANIC.  But I'm leaning against
> > such a complication on the grounds that it would add seldom-tested code paths
> > posing as much a chance of eroding robustness as bolstering it.
> 
> The current situation is just plain weird: in the ERROR-then-ERROR
> case, we emit a WARNING and bounce right back into AbortTransaction(),
> and if it doesn't work any better the second time than the first time,
> we recurse again, and eventually if it fails enough times in a row, we
> just give up and PANIC.  But in the ERROR-then-FATAL case, we *don't*
> retry AbortTransaction(); instead, we just continue running the rest
> of the on_shmem_exit callbacks and then exit.
> 
> So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL.
> That's bizarre.

Quite so.

> Given that that's where we are, promoting an ERROR during FATAL
> processing to PANIC doesn't seem like it's losing much; we're
> essentially already doing that in the (probably more likely) case of a
> persistent ERROR during ERROR processing.  But since PANIC sucks, I'd
> rather go the other direction: let's make an ERROR during ERROR
> processing promote to FATAL.  And then let's do what you write above:
> make sure that there's a separate on-shmem-exit callback for each
> critical shared memory resource and that we call of those during FATAL
> processing.

Many of the factors that can cause AbortTransaction() to fail can also cause
CommitTransaction() to fail, and those would still PANIC if the transaction
had an xid.  How practical might it be to also escape from an error during
CommitTransaction() with a FATAL instead of PANIC?  There's more to fix up in
that case (sinval, NOTIFY), but it may be within reach.  If such a technique
can only reasonably fix abort, though, I have doubts it buys us enough.

> It seems to me that that's how things were originally designed to
> work, but that we've drifted away from it basically because the
> low-level callbacks to release heavyweight locks and buffer pins
> turned out to be kinda, uh, slow, and we thought those code paths
> couldn't be taken anyway (turns out they can).  I think we could
> either make those routines very fast, or arrange only to run that code
> at all in the case where AbortTransaction() didn't complete
> successfully.

Agreed; the performance hazards look tractable.

> It's true that such code will be rarely run, but the
> logic is simple enough that I think we can verify it by hand, and it's
> sure nice to avoid PANICs.

True.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: ERROR during end-of-xact/FATAL

From: Robert Haas
On Wed, Nov 13, 2013 at 11:04 AM, Noah Misch <noah@leadboat.com> wrote:
>> So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL.
>> That's bizarre.
>
> Quite so.
>
>> Given that that's where we are, promoting an ERROR during FATAL
>> processing to PANIC doesn't seem like it's losing much; we're
>> essentially already doing that in the (probably more likely) case of a
>> persistent ERROR during ERROR processing.  But since PANIC sucks, I'd
>> rather go the other direction: let's make an ERROR during ERROR
>> processing promote to FATAL.  And then let's do what you write above:
>> make sure that there's a separate on-shmem-exit callback for each
>> critical shared memory resource and that we call of those during FATAL
>> processing.
>
> Many of the factors that can cause AbortTransaction() to fail can also cause
> CommitTransaction() to fail, and those would still PANIC if the transaction
> had an xid.  How practical might it be to also escape from an error during
> CommitTransaction() with a FATAL instead of PANIC?  There's more to fix up in
> that case (sinval, NOTIFY), but it may be within reach.  If such a technique
> can only reasonably fix abort, though, I have doubts it buys us enough.

The critical stuff that's got to happen after
RecordTransactionCommit() appears to be ProcArrayEndTransaction() and
AtEOXact_Inval(). Unfortunately, the latter is well after the point
when we're supposed to only be doing "non-critical resource cleanup",
notwithstanding which it appears to be critical.

So here's a sketch.  Hoist the preparatory logic in
RecordTransactionCommit() - smgrGetPendingDeletes,
xactGetCommittedChildren, and xactGetCommittedInvalidationMessages up
into the caller and do it before setting TRANS_COMMIT.  If any of that
stuff fails, we'll land in AbortTransaction() which must cope.  As
soon as we exit the commit critical section, set a flag somewhere
(where?) indicating that we have written our commit record; when that
flag is set, (a) promote any ERROR after that point through the end of
commit cleanup to FATAL and (b) if we enter AbortTransaction(), don't
try to RecordTransactionAbort().
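
In code form, roughly -- every name below is made up, and this is only the
shape, not a patch:

#include <stdbool.h>

/*
 * Hand-waved sketch of the proposed ordering.  Every name here is invented;
 * the real routines live in xact.c and take different arguments.
 */
static bool commit_record_written = false;

/* stand-ins for the real work */
static void gather_pending_deletes(void) {}
static void gather_committed_children(void) {}
static void gather_invalidation_messages(void) {}
static void write_commit_record(void) {}
static void write_abort_record(void) {}
static void post_commit_cleanup(void) {}

static void
commit_transaction_sketch(void)
{
    /*
     * Hoisted out of RecordTransactionCommit(): do the failure-prone
     * preparation before setting TRANS_COMMIT, so an ERROR here still lands
     * in an ordinary AbortTransaction().
     */
    gather_pending_deletes();
    gather_committed_children();
    gather_invalidation_messages();

    /* ... set TRANS_COMMIT, enter the critical section ... */
    write_commit_record();
    /* ... leave the critical section ... */
    commit_record_written = true;   /* the "flag somewhere" */

    /*
     * With the flag set, an ERROR anywhere in the remaining cleanup would be
     * promoted to FATAL rather than bouncing into a doomed abort.
     */
    post_commit_cleanup();
    commit_record_written = false;
}

static void
abort_transaction_sketch(void)
{
    /*
     * If the commit record already went out, don't try to write an abort
     * record on top of it.
     */
    if (!commit_record_written)
        write_abort_record();
    /* ... release resources as usual ... */
}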

I can't see that the notification stuff requires fixup in this case;
AFAICS, it is just adjusting backend-local state, and it's OK to
disregard any problems there during a FATAL exit.  Do you see
something to the contrary?  But invalidation messages are a problem:
if we commit and exit without sending our queued-up invalidation
messages, Bad Things Will Happen.  Perhaps we could arrange things so
that in that case only, we just PANIC.   That would allow most write
transactions to get by with FATAL, promoting to PANIC only in the case
of transactions that have modified system catalogs and only until the
invalidations have actually been sent.  Avoiding the PANIC in that
case seems to require some additional wizardry which is not entirely
clear to me at this time.

I think we'll have to approach the various problems in this area
stepwise, or we'll never make any progress.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ERROR during end-of-xact/FATAL

From: Alvaro Herrera
Robert Haas escribió:
> On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Noah Misch wrote:
> >> Incomplete list:
> >>
> >> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
> >>   relpathbackend() call palloc(); this is true in all supported branches.  In
> >>   9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
> >>   (In fact, it does so even when the pending list is empty -- this is the only
> >>   palloc() during a trivial transaction commit.)  palloc() failure there
> >>   yields a PANIC during commit.
> >
> > I think we should fix this routine to avoid the palloc when not necessary.
> > That initial palloc is pointless.

Here's a trivial patch we could apply to 9.3 immediately.  Anything else
such as the ideas proposed below would require more effort than anyone
can probably spend here soon.
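
Roughly the shape of it, for the archives (illustrative only; the attached
patch is the real thing, and "Thing" stands in for the actual element type):

#include "postgres.h"

/*
 * Allocate the result array lazily, the first time we actually find
 * something to return, so the common nothing-pending commit does no
 * allocation at all.
 */
typedef struct Thing
{
    int         dummy;
} Thing;

static Thing *
collect_pending(Thing **pending_list, int npending, int *nitems_out)
{
    Thing      *items = NULL;
    int         nitems = 0;
    int         maxitems = 0;
    int         i;

    for (i = 0; i < npending; i++)
    {
        if (pending_list[i] == NULL)    /* stand-in for "skip this entry" */
            continue;

        if (items == NULL)
        {
            maxitems = 8;
            items = palloc(maxitems * sizeof(Thing));
        }
        else if (nitems >= maxitems)
        {
            maxitems *= 2;
            items = repalloc(items, maxitems * sizeof(Thing));
        }
        items[nitems++] = *pending_list[i];
    }

    *nitems_out = nitems;
    return items;               /* NULL when nothing qualified */
}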


> > Also, there have been previous discussions about having relpathbackend
> > not use palloc at all.  That was only because we wanted to use it in
> > pg_xlogdump which didn't have palloc support at the time, so it's no
> > longer as pressing; but perhaps it's still worthy of consideration.
>
> +1, but I'd like it better if we could find a way of avoiding the
> palloc in all cases.  Panicking because we run out of memory at the
> wrong time isn't really very nice.  Maybe the information needs to be
> maintained in the format in which it ultimately needs to be returned,
> so that we needn't rejigger it in the middle of a critical section.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: ERROR during end-of-xact/FATAL

From: Robert Haas
On Thu, Nov 28, 2013 at 10:10 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas escribió:
>> On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > Noah Misch wrote:
>> >> Incomplete list:
>> >>
>> >> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
>> >>   relpathbackend() call palloc(); this is true in all supported branches.  In
>> >>   9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
>> >>   (In fact, it does so even when the pending list is empty -- this is the only
>> >>   palloc() during a trivial transaction commit.)  palloc() failure there
>> >>   yields a PANIC during commit.
>> >
>> > I think we should fix this routine to avoid the palloc when not necessary.
>> > That initial palloc is pointless.
>
> Here's a trivial patch we could apply to 9.3 immediately.  Anything else
> such as the ideas proposed below would require more effort than anyone
> can probably spend here soon.

Yeah, this seems like a good thing to do for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ERROR during end-of-xact/FATAL

From: Alvaro Herrera
Robert Haas escribió:
> On Thu, Nov 28, 2013 at 10:10 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Robert Haas escribió:
> >> On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >> > Noah Misch wrote:
> >> >> Incomplete list:
> >> >>
> >> >> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
> >> >>   relpathbackend() call palloc(); this is true in all supported branches.  In
> >> >>   9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
> >> >>   (In fact, it does so even when the pending list is empty -- this is the only
> >> >>   palloc() during a trivial transaction commit.)  palloc() failure there
> >> >>   yields a PANIC during commit.
> >> >
> >> > I think we should fix this routine to avoid the palloc when not necessary.
> >> > That initial palloc is pointless.
> >
> > Here's a trivial patch we could apply to 9.3 immediately.  Anything else
> > such as the ideas proposed below would require more effort than anyone
> > can probably spend here soon.
> 
> Yeah, this seems like a good thing to do for now.

Pushed, thanks.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: ERROR during end-of-xact/FATAL

From: Noah Misch
On Fri, Nov 15, 2013 at 09:51:32AM -0500, Robert Haas wrote:
> On Wed, Nov 13, 2013 at 11:04 AM, Noah Misch <noah@leadboat.com> wrote:
> >> So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL.
> >> That's bizarre.
> >
> > Quite so.
> >
> >> Given that that's where we are, promoting an ERROR during FATAL
> >> processing to PANIC doesn't seem like it's losing much; we're
> >> essentially already doing that in the (probably more likely) case of a
> >> persistent ERROR during ERROR processing.  But since PANIC sucks, I'd
> >> rather go the other direction: let's make an ERROR during ERROR
> >> processing promote to FATAL.  And then let's do what you write above:
> >> make sure that there's a separate on-shmem-exit callback for each
> >> critical shared memory resource and that we call of those during FATAL
> >> processing.
> >
> > Many of the factors that can cause AbortTransaction() to fail can also cause
> > CommitTransaction() to fail, and those would still PANIC if the transaction
> > had an xid.  How practical might it be to also escape from an error during
> > CommitTransaction() with a FATAL instead of PANIC?  There's more to fix up in
> > that case (sinval, NOTIFY), but it may be within reach.  If such a technique
> > can only reasonably fix abort, though, I have doubts it buys us enough.
> 
> The critical stuff that's got to happen after
> RecordTransactionCommit() appears to be ProcArrayEndTransaction() and
> AtEOXact_Inval(). Unfortunately, the latter is well after the point
> when we're supposed to only be doing "non-critical resource cleanup",
> nonwithstanding which it appears to be critical.
> 
> So here's a sketch.  Hoist the preparatory logic in
> RecordTransactionCommit() - smgrGetPendingDeletes,
> xactGetCommittedChildren, and xactGetCommittedInvalidationMessages up
> into the caller and do it before setting TRANS_COMMIT.  If any of that
> stuff fails, we'll land in AbortTransaction() which must cope.  As
> soon as we exit the commit critical section, set a flag somewhere
> (where?) indicating that we have written our commit record; when that
> flag is set, (a) promote any ERROR after that point through the end of
> commit cleanup to FATAL and (b) if we enter AbortTransaction(), don't
> try to RecordTransactionAbort().
> 
> I can't see that the notification stuff requires fixup in this case;
> AFAICS, it is just adjusting backend-local state, and it's OK to
> disregard any problems there during a FATAL exit.  Do you see
> something to the contrary?

The part not to miss is SignalBackends() in ProcessCompletedNotifies().  It
happens outside CommitTransaction(), just before the backend goes idle.
Without that, listeners could miss our notification until some other NOTIFY
transaction signals them.  That said, since ProcessCompletedNotifies() already
accepts the possibility of failure, it would not be crazy to overlook this.

> But invalidation messages are a problem:
> if we commit and exit without sending our queued-up invalidation
> messages, Bad Things Will Happen.  Perhaps we could arrange things so
> that in that case only, we just PANIC.   That would allow most write
> transactions to get by with FATAL, promoting to PANIC only in the case
> of transactions that have modified system catalogs and only until the
> invalidations have actually been sent.  Avoiding the PANIC in that
> case seems to require some additional wizardry which is not entirely
> clear to me at this time.

Agreed.

> I think we'll have to approach the various problems in this area
> stepwise, or we'll never make any progress.

That's one option.  The larger point is that allowing ERROR-late-in-COMMIT and
FATAL + ERROR scenarios to get away with FATAL requires this sort of analysis
of every exit callback and all the later stages of CommitTransaction().  The
current bug level shows such analysis hasn't been happening, and the dearth of
bug reports suggests the scenarios are rare.  I don't think the project stands
to gain enough from the changes contemplated here and, perhaps more notably,
from performing such analysis during review of future patches that touch
CommitTransaction() and the exit sequence.  It remains my recommendation to
run proc_exit_prepare() and the post-CLOG actions of CommitTransaction() and
AbortTransaction() in critical sections, giving the scenarios blunt but
reliable treatment.  If reports of excess PANIC arise, the thing to fix is the
code causing the late error.
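
In other words, something with this shape (illustration, not a patch):

#include "postgres.h"
#include "miscadmin.h"

/*
 * The blunt treatment, in miniature: run a late cleanup step under a
 * critical section, so an ERROR thrown inside it escalates to PANIC
 * instead of leaving shared state half-cleaned.
 */
static void
run_protected(void (*cleanup_step) (void))
{
    START_CRIT_SECTION();
    cleanup_step();
    END_CRIT_SECTION();
}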

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com