Thread: Fatal Errors

Fatal Errors

From
Simon Riggs
Date:
Is it possible to have a FATAL error that crashes a backend and for it
to *not* have written an abort WAL record for any previously active
transaction? 

I think yes, but haven't managed to create this situation while testing
for it. If we either *always* write a WAL record, or PANIC then that
makes some coding easier, so seems sensible to check.

-- Simon Riggs           www.2ndQuadrant.com
   PostgreSQL Training, Services and Support



Re: Fatal Errors

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> Is it possible to have a FATAL error that crashes a backend and for it
> to *not* have written an abort WAL record for any previously active
> transaction? 

Well, a FATAL error will still go through transaction abort before
exiting, IIRC.  The problem case is a PANIC or an actual core dump.

> If we either *always* write a WAL record, or PANIC then that
> makes some coding easier,

Like what?
        regards, tom lane


Re: Fatal Errors

From
Simon Riggs
Date:
On Mon, 2008-09-29 at 10:30 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > Is it possible to have a FATAL error that crashes a backend and for it
> > to *not* have written an abort WAL record for any previously active
> > transaction? 
> 
> Well, a FATAL error will still go through transaction abort before
> exiting, IIRC.  The problem case is a PANIC or an actual core dump.

> > If we either *always* write a WAL record, or PANIC then that
> > makes some coding easier,
> 
> Like what?

For constructing snapshots during standby. I need a data structure where
emulated-as-running transactions can live. If backend birth/death is
intimately tied to WAL visible events then I can use dummy PGPROC
structures. If not, then I will have to create a special area that can
expand to cater for the possibility that a backend dies and WAL replay
won't know about it - which also means I would need to periodically dump
a list of running backends into WAL.
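
As a rough illustration of the fixed-slot idea, here is a minimal sketch in C. Every name in it (StandbyProcTable, replay_xid_seen and so on) is hypothetical and nothing here is PostgreSQL's actual PGPROC or ProcArray code. It assumes one slot per emulated backend, claimed when replay first sees a transaction and freed only when replay sees its commit or abort record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names are hypothetical; this is not PostgreSQL's PGPROC or ProcArray code. */
#define MAX_EMULATED_BACKENDS 4

typedef uint32_t Xid;
#define INVALID_XID ((Xid) 0)

/* One dummy slot per emulated backend; xid is INVALID_XID when the slot is free. */
typedef struct DummyProc {
    Xid xid;
} DummyProc;

typedef struct StandbyProcTable {
    DummyProc slots[MAX_EMULATED_BACKENDS];
} StandbyProcTable;

void table_init(StandbyProcTable *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_EMULATED_BACKENDS; i++)
        t->slots[i].xid = INVALID_XID;
}

/* Replay saw a WAL record for xid: track it. Returns false when no slot is free. */
bool replay_xid_seen(StandbyProcTable *t, Xid xid)
{
    assert(t != NULL && xid != INVALID_XID);
    for (size_t i = 0; i < MAX_EMULATED_BACKENDS; i++)
        if (t->slots[i].xid == xid)
            return true;            /* already tracked */
    for (size_t i = 0; i < MAX_EMULATED_BACKENDS; i++) {
        if (t->slots[i].xid == INVALID_XID) {
            t->slots[i].xid = xid;
            return true;
        }
    }
    return false;                   /* table full */
}

/* Replay saw a commit or abort record for xid: free its slot.
 * Returns false if the xid was not being tracked. */
bool replay_xid_ended(StandbyProcTable *t, Xid xid)
{
    assert(t != NULL && xid != INVALID_XID);
    for (size_t i = 0; i < MAX_EMULATED_BACKENDS; i++) {
        if (t->slots[i].xid == xid) {
            t->slots[i].xid = INVALID_XID;
            return true;
        }
    }
    return false;
}

/* Number of transactions the standby currently treats as running. */
size_t running_count(const StandbyProcTable *t)
{
    assert(t != NULL);
    size_t n = 0;
    for (size_t i = 0; i < MAX_EMULATED_BACKENDS; i++)
        if (t->slots[i].xid != INVALID_XID)
            n++;
    return n;
}
```

In this sketch a slot is released only by replay_xid_ended(), so a transaction whose abort record never reaches WAL keeps its slot indefinitely, and enough of those make replay_xid_seen() return false. That is the case the fixed-size design has to rule out or work around.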

PANIC isn't a problem case because we'll end up generating a shutdown
checkpoint which shows the backends have been terminated.

-- Simon Riggs           www.2ndQuadrant.com
   PostgreSQL Training, Services and Support



Re: Fatal Errors

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Mon, 2008-09-29 at 10:30 -0400, Tom Lane wrote:
>> Like what?

> For constructing snapshots during standby. I need a data structure where
> emulated-as-running transactions can live. If backend birth/death is
> intimately tied to WAL visible events then I can use dummy PGPROC
> structures. If not, then I will have to create a special area that can
> expand to cater for the possibility that a backend dies and WAL replay
> won't know about it - which also means I would need to periodically dump
> a list of running backends into WAL.

Mph.  I find the idea of assuming that there must be an abort record to
be unacceptably fragile.  Consider the possibility that the transaction
gets an error while trying to run AbortTransaction.  Some of that code
is a CRITICAL_SECTION, but I don't think I like the idea that all of it
has to be one.

> PANIC isn't a problem case because we'll end up generating a shutdown
> checkpoint which shows the backends have been terminated.

Thought you were trying to get rid of the shutdown checkpoint during
restart?
        regards, tom lane


Re: Fatal Errors

From
Simon Riggs
Date:
On Mon, 2008-09-29 at 11:18 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Mon, 2008-09-29 at 10:30 -0400, Tom Lane wrote:
> >> Like what?
> 
> > For constructing snapshots during standby. I need a data structure where
> > emulated-as-running transactions can live. If backend birth/death is
> > intimately tied to WAL visible events then I can use dummy PGPROC
> > structures. If not, then I will have to create a special area that can
> > expand to cater for the possibility that a backend dies and WAL replay
> > won't know about it - which also means I would need to periodically dump
> > a list of running backends into WAL.
> 
> Mph.  I find the idea of assuming that there must be an abort record to
> be unacceptably fragile.  Consider the possibility that the transaction
> gets an error while trying to run AbortTransaction.  Some of that code
> is a CRITICAL_SECTION, but I don't think I like the idea that all of it
> has to be one.

Aware of the possible fragility, hence the post.

A few thoughts:

* Is it close enough that we can get away with having a few spare slots
to cater for that possibility?

* Might we make AbortTransaction critical just as far as the
END_CRIT_SECTION after XLogInsert in RecordTransactionAbort(), but no
further? Don't expect yes, but seems worth recording thoughts.
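
To make the second thought concrete, here is a small model in C of what widening the critical section would change. It assumes, as I understand the existing behaviour, that an error raised while a critical section is open is escalated to PANIC. The three abort steps, the Level enum and all function names are hypothetical simplifications, not the real AbortTransaction() code.

```c
#include <assert.h>

/* Hypothetical model only; not the real elog or xact.c code. */
typedef enum { LVL_ERROR, LVL_FATAL, LVL_PANIC } Level;

/* Simplified abort path: cleanup, then the WAL insert, then more cleanup. */
typedef enum {
    STEP_PRE_CLEANUP,
    STEP_WAL_INSERT,
    STEP_POST_CLEANUP
} AbortStep;

int crit_depth = 0;     /* nesting depth of open critical sections */

void start_crit(void) { crit_depth++; }

void end_crit(void)
{
    assert(crit_depth > 0);
    crit_depth--;
}

/* Assumed rule: any error inside a critical section becomes a PANIC. */
Level effective_level(Level requested)
{
    return crit_depth > 0 ? LVL_PANIC : requested;
}

/*
 * Walk the abort steps with the critical section opened at crit_from and
 * closed right after the WAL insert. Returns the level that an ERROR raised
 * at failing_step would end up with.
 */
Level failure_level(AbortStep crit_from, AbortStep failing_step)
{
    Level result = LVL_ERROR;

    assert(crit_from <= STEP_WAL_INSERT);
    for (int step = STEP_PRE_CLEANUP; step <= STEP_POST_CLEANUP; step++) {
        if (step == (int) crit_from)
            start_crit();
        if (step == (int) failing_step)
            result = effective_level(LVL_ERROR);
        if (step == STEP_WAL_INSERT)
            end_crit();
    }
    return result;
}
```

With the critical section opened only around the insert, failure_level(STEP_WAL_INSERT, STEP_PRE_CLEANUP) stays at LVL_ERROR and no abort record is guaranteed. Opening it from the first step turns the same failure into LVL_PANIC, while a failure in the later cleanup is unaffected either way.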

> > PANIC isn't a problem case because we'll end up generating a shutdown
> > checkpoint which shows the backends have been terminated.
> 
> Thought you were trying to get rid of the shutdown checkpoint during
> restart?

Yes, but if I do there would still be a WAL record of some kind there to
allow us to confirm the change of tli.

Anyway, I thought you wanted me to keep it now?

-- Simon Riggs           www.2ndQuadrant.com
   PostgreSQL Training, Services and Support



Re: Fatal Errors

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> * Might we make AbortTransaction critical just as far as the
> END_CRIT_SECTION after XLogInsert in RecordTransactionAbort(), but no
> further? Don't expect yes, but seems worth recording thoughts.

The problem is that pretty much everything that proc_exit runs would
have to become critical, AFAICS.  And a lot of that code is explicitly
intended not to be critical --- that's why we split it up into multiple
proc_exit callbacks.  If one fails we pick up with the next, after a
recursive call to elog().
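
The "pick up with the next" behaviour can be sketched roughly as follows. This is a simplified model, not the real proc_exit code: all names are hypothetical, and longjmp() stands in for the error path that re-enters the exit routine. The point it shows is that each callback is taken off the list before it runs, so a failing callback is never retried and the remaining ones still get their turn.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical model of exit callbacks; not the real proc_exit code. */
#define MAX_EXIT_CALLBACKS 8

typedef void (*ExitCallback)(void);

ExitCallback exit_callbacks[MAX_EXIT_CALLBACKS];
int exit_callback_count = 0;
jmp_buf exit_recovery;

char exit_log[16];      /* records the order in which demo callbacks ran */
int exit_log_len = 0;

void log_step(char c)
{
    if (exit_log_len < (int) sizeof(exit_log) - 1)
        exit_log[exit_log_len++] = c;
}

void register_exit_callback(ExitCallback cb)
{
    assert(exit_callback_count < MAX_EXIT_CALLBACKS);
    exit_callbacks[exit_callback_count++] = cb;
}

/* Stand-in for an error raised inside a callback. */
void raise_error_in_callback(void)
{
    longjmp(exit_recovery, 1);
}

/* Run callbacks last-registered-first; returns how many of them failed. */
int run_exit_callbacks(void)
{
    volatile int failures = 0;  /* volatile: modified between setjmp and longjmp */

    if (setjmp(exit_recovery) != 0)
        failures++;             /* a callback failed; carry on with the rest */

    while (exit_callback_count > 0) {
        /* Pop before calling, so a failing callback is not run again. */
        ExitCallback cb = exit_callbacks[--exit_callback_count];
        cb();
    }
    return failures;
}

/* Demo callbacks used to show the ordering. */
void demo_release_locks(void) { log_step('L'); }
void demo_failing(void)       { log_step('F'); raise_error_in_callback(); }
void demo_close_files(void)   { log_step('C'); }
```

Registering demo_release_locks, demo_failing and demo_close_files in that order and then calling run_exit_callbacks() runs them in reverse, reports one failure, and still reaches demo_release_locks. Making every such callback critical would replace that recovery with a PANIC.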

In any case it is clear that there will be failure cases where an
abort record cannot be written --- out of disk space for WAL being
one obvious example.  Are we sure that we can, or want to, guarantee
that those all result in PANIC?  (We do already PANIC on out of disk
space for WAL, but I'm not so sure about generalizing that to any
possible failure.)

>> Thought you were trying to get rid of the shutdown checkpoint during
>> restart?

> Yes, but if I do there would still be a WAL record of some kind there to
> allow us to confirm the change of tli.

> Anyway, I thought you wanted me to keep it now?

No, I don't have a strong opinion one way or the other on that bit.
But an ordinary crash and restart shouldn't generate a tli change.
        regards, tom lane


Re: Fatal Errors

From
Simon Riggs
Date:
On Mon, 2008-09-29 at 12:14 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > * Might we make AbortTransaction critical just as far as the
> > END_CRIT_SECTION after XLogInsert in RecordTransactionAbort(), but no
> > further? Don't expect yes, but seems worth recording thoughts.
> 
> The problem is that pretty much everything that proc_exit runs would
> have to become critical, AFAICS.  And a lot of that code is explicitly
> intended not to be critical --- that's why we split it up into multiple
> proc_exit callbacks.  If one fails we pick up with the next, after a
> recursive call to elog().

OK...next idea. If we didn't PANIC, then Postmaster knows about child
death and fumbles around with the ProcArray.

Will it be OK to simply WAL log ProcArrayAdd() and ProcArrayRemove()?

Methinks postmaster can't do this. But might be able to ask somebody
else to do it for him? 

The next person to run ProcArrayAdd() could be left a message to say
last user of this proc index didn't clean up and we need to log it. That
way we can WAL log the ProcArrayRemove() and the ProcArrayAdd() in one
message.
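
Very roughly, and only as a sketch of the idea rather than a proposed patch, the "leave a message" scheme could look like this in C. The slot layout, the record struct and every function name are hypothetical. It assumes whoever cleans up after a dead child cannot write WAL, so it only sets a flag, and the next backend to take the slot emits one record covering both the removal and its own add.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; not the real ProcArrayAdd()/ProcArrayRemove() code. */
#define NUM_PROC_SLOTS 4

typedef struct ProcSlot {
    bool in_use;
    bool unlogged_removal;  /* message: last user left without a WAL record */
} ProcSlot;

/* Stand-in for the WAL record describing a change to one slot. */
typedef struct SlotWalRecord {
    int  slot;
    bool adds;      /* a backend now occupies the slot */
    bool removes;   /* the previous occupant is gone */
} SlotWalRecord;

ProcSlot proc_slots[NUM_PROC_SLOTS];

/* New backend takes a slot; also logs any removal the last user missed. */
SlotWalRecord slot_add(int slot)
{
    assert(slot >= 0 && slot < NUM_PROC_SLOTS);
    assert(!proc_slots[slot].in_use);

    SlotWalRecord rec = { slot, true, proc_slots[slot].unlogged_removal };
    proc_slots[slot].in_use = true;
    proc_slots[slot].unlogged_removal = false;  /* message consumed */
    return rec;
}

/* Clean backend exit: the removal is logged straight away. */
SlotWalRecord slot_remove(int slot)
{
    assert(slot >= 0 && slot < NUM_PROC_SLOTS);
    assert(proc_slots[slot].in_use);

    SlotWalRecord rec = { slot, false, true };
    proc_slots[slot].in_use = false;
    return rec;
}

/* Cleanup after a dead child: no WAL written, just leave the message. */
void slot_reap_without_wal(int slot)
{
    assert(slot >= 0 && slot < NUM_PROC_SLOTS);
    assert(proc_slots[slot].in_use);

    proc_slots[slot].in_use = false;
    proc_slots[slot].unlogged_removal = true;
}
```

One thing the sketch makes visible: until some new backend reuses that slot, WAL replay still has no record of the removal.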

-- Simon Riggs           www.2ndQuadrant.com
   PostgreSQL Training, Services and Support