Thread: Recovery of Multi-stage WAL actions
We've had two hard-to-diagnose errors in recovery in recent months. ISTM that the core issue is the way we allow Resource Managers to have multi-stage WAL actions that persist for long periods of time. This means we have no way of telling whether the answer rm_safe_restartpoint() == false is a momentary, valid state or a progressively worsening indicator of a subtle RM bug.

An example of a multi-stage WAL action would be an index split inside one of the Resource Managers. Now that kind of action shouldn't take very long, though theoretically it could for various reasons.

Right now we have a log message to cope with this:
http://archives.postgresql.org/pgsql-committers/2007-08/msg00374.php
but it's not nearly as helpful as we'd like it to be. We could back-patch this to 8.2, but I have a potentially better proposal.

I very much want to encourage authors of new Resource Managers, and it looks like we may be getting at least 3 new RMs that produce WAL records: hash indexes (currently not WAL-logged), bitmap indexes and clustered indexes for 8.4. We should be realistic that new bugs probably will occur in recovery code for existing and new RMs.

What I'd like to do is force all of the RMs to record the lsn of any WAL record that starts an incomplete action. Then, if an incomplete action lives for more than a certain period of time, it will be possible to produce a log message saying "incomplete split has survived for X seconds, in xlog time". That way we'll see log messages if any of the RMs start to push an incomplete action onto their list and then not consume it again.

We might trust each RM to implement code to LOG messages if its code goes a little awry, but I'd prefer some mechanism that allows the main server to check what's happening in each RM. That way we'd have a cross-check on whether the RM is well-behaved, plus we'd only need to implement the checking code once. Right now very similar, yet different, code runs inside each RM.

So my proposal is to have an incomplete-split remember/forget API that forces each RM to expose its incomplete-split List. Currently each RM has a hook, rm_safe_restartpoint(), so that each RM manages its own List. My new thought is to have one function, safeRestartPoint(), that inspects each of the incomplete-split Lists to see if they are empty. If the lists are non-empty, it would then inspect the age of each list entry to see if it is worth reporting as a possible issue. Each RM would then store incomplete splits using ResourceManagerRememberIncompleteEvent(lsn, id_data, payload??) and ResourceManagerForgetIncompleteEvent(id_data). Implementation is a bit hazy on that last part, but I think the overall idea is clear (rough sketch below).

That should mean that any incomplete split that lasts for the length of one restartpoint, which is *at least* one checkpoint duration, should cause a LOG message to be produced. We might even go as far as to ignore super-long-lived and therefore spurious incomplete splits when we issue rm_cleanup(), for fear of allowing RM bugs to kill recovery.

I'd like to suggest that those changes be performed now for 8.3 *and* back-patched for 8.2. I want to make sure that all users are able to diagnose server errors and report them. I'm guessing that might raise a few eyebrows, but I think it's justifiable. Bugs in complex code are inevitable and should not be seen to reflect badly upon RM authors. However, our inability to recognise RM bugs that do occur doesn't seem acceptable to me, especially since they may save themselves up for the moment of PITR fail-over.
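To make the remember/forget idea concrete, here is a minimal sketch of what the central machinery might look like. Everything here is illustrative only: the struct layout, the extra rmid/id_len parameters and the exact log wording are guesses at a design, not code from the tree.

/* Illustrative sketch only -- names and mechanics are all up for debate */

#include "postgres.h"
#include "access/rmgr.h"        /* RmgrId */
#include "access/xlogdefs.h"    /* XLogRecPtr */
#include "nodes/pg_list.h"

typedef struct IncompleteAction
{
    RmgrId      rmid;           /* which resource manager logged it */
    XLogRecPtr  start_lsn;      /* lsn of the record that began the action */
    void       *id_data;        /* RM-private key, matched on the forget call;
                                 * assumed palloc'd by the caller */
    Size        id_len;
    void       *payload;        /* whatever the RM needs for rm_cleanup() */
} IncompleteAction;

static List *incomplete_actions = NIL;  /* one central list, not per-RM */

/* An RM calls this when replaying a record that starts a multi-stage action */
void
ResourceManagerRememberIncompleteEvent(RmgrId rmid, XLogRecPtr lsn,
                                       void *id_data, Size id_len,
                                       void *payload)
{
    IncompleteAction *action = palloc(sizeof(IncompleteAction));

    action->rmid = rmid;
    action->start_lsn = lsn;
    action->id_data = id_data;
    action->id_len = id_len;
    action->payload = payload;
    incomplete_actions = lappend(incomplete_actions, action);
}

/* An RM calls this when it replays the record that completes the action */
void
ResourceManagerForgetIncompleteEvent(RmgrId rmid, void *id_data, Size id_len)
{
    ListCell   *cell;

    foreach(cell, incomplete_actions)
    {
        IncompleteAction *action = (IncompleteAction *) lfirst(cell);

        if (action->rmid == rmid && action->id_len == id_len &&
            memcmp(action->id_data, id_data, id_len) == 0)
        {
            incomplete_actions = list_delete_ptr(incomplete_actions, action);
            pfree(action);
            return;
        }
    }
    /* a forget without a matching remember is itself a sign of an RM bug */
    elog(LOG, "resource manager %d forgot an action it never remembered",
         (int) rmid);
}

/* One central check, replacing the per-RM rm_safe_restartpoint() hooks */
bool
safeRestartPoint(XLogRecPtr current_lsn)
{
    ListCell   *cell;

    foreach(cell, incomplete_actions)
    {
        IncompleteAction *action = (IncompleteAction *) lfirst(cell);

        elog(LOG, "incomplete action for resource manager %d has survived "
             "from %X/%X to %X/%X in xlog time",
             (int) action->rmid,
             action->start_lsn.xlogid, action->start_lsn.xrecoff,
             current_lsn.xlogid, current_lsn.xrecoff);
    }
    return (incomplete_actions == NIL);
}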
You might persuade me I'm being over-zealous here, but High Availability is something we have to be zealous about.

It should also be possible to allow the server to stay up even if one of the RMs fails to recover properly. That would need to be settable, so I really only mean it for "optional" RMs, i.e. index RMs only. For those cases we should be able to mark affected indexes as corrupt (rough sketch below). Automatic rebuild of corrupt indexes could also be possible, should it occur. That would be an 8.4 action... :-)

Comments appreciated, as ever.

-- 
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
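To illustrate the index-marking idea above: one conceivable mechanism, purely an assumption here rather than part of the proposal, is to reuse the pg_index.indisvalid flag that CREATE INDEX CONCURRENTLY already clears for not-yet-usable indexes. Recovery itself cannot run catalog updates, so something like this could only run once the server is out of recovery:

/* Hypothetical sketch: make the planner ignore a damaged index by
 * clearing pg_index.indisvalid.  Assumes normal (post-recovery)
 * catalog access; mark_index_corrupt is an invented name. */

#include "postgres.h"
#include "access/heapam.h"
#include "catalog/indexing.h"
#include "catalog/pg_index.h"
#include "utils/syscache.h"

static void
mark_index_corrupt(Oid indexoid)
{
    Relation    pg_index;
    HeapTuple   tuple;
    Form_pg_index indexForm;

    pg_index = heap_open(IndexRelationId, RowExclusiveLock);

    tuple = SearchSysCacheCopy(INDEXRELID,
                               ObjectIdGetDatum(indexoid),
                               0, 0, 0);
    if (!HeapTupleIsValid(tuple))
        elog(ERROR, "cache lookup failed for index %u", indexoid);
    indexForm = (Form_pg_index) GETSTRUCT(tuple);

    indexForm->indisvalid = false;  /* planner will now skip this index */

    simple_heap_update(pg_index, &tuple->t_self, tuple);
    CatalogUpdateIndexes(pg_index, tuple);

    heap_close(pg_index, RowExclusiveLock);
}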
Simon Riggs <simon@2ndquadrant.com> writes:
> What I'd like to do is force all of the RMs to record the lsn of any WAL
> record that starts an incomplete action.
> ...
> I'd like to suggest that those changes be performed now for 8.3 *and*
> back-patched for 8.2.

There is zero chance of the former and less than zero of the latter; making major, hard-to-test changes in the rather-hypothetical hope of discovering entirely-hypothetical bugs is not my idea of the way to run a stable release branch.

Feel free to create a patch for 8.4 though --- I think this is a reasonable suggestion for making *future* RM development safer. (Also, if there are any bugs in the extant code, we'd presumably back-patch fixes discovered in 8.4 testing, as long as you don't whack the code around to the point of unrecognizability...)

			regards, tom lane
On Tue, 2007-10-30 at 00:48 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > What I'd like to do is force all of the RMs to record the lsn of any WAL
> > record that starts an incomplete action.
> > ...
> > I'd like to suggest that those changes be performed now for 8.3 *and*
> > back-patched for 8.2.
>
> There is zero chance of the former and less than zero of the latter;
> making major, hard-to-test changes in the rather-hypothetical hope
> of discovering entirely-hypothetical bugs is not my idea of the way
> to run a stable release branch.

I understand that initial thought.

Right now, the two bugs discovered seem likely to mean that anyone running a combination of 8.2, a GIN index and Warm Standby has a bug that will only show itself when the standby server comes out of standby. At the very least we should add the DEBUG2 message I referred to earlier into 8.2, and then release a new dot release immediately for 8.2 to allow users to avoid this error. The problem is that we then have no way of knowing whether there is a 3rd bug after that.

I'm not happy about my suggestion and its possible effects on stability. However, if I'm running Postgres HA, then I *need* to know that there is no bug, not just presume there is not one. The latter presumption is looking weak after two bugs. So adding logging code to that area is important.

If you're still adamant, then I would suggest we modify the GIN xlog code so it has a simple test for orphaned incomplete splits (rough sketch below). I didn't want to single out that particular code, but I think we should do something for 8.2 and 8.3. I'll look at a simplest-possible patch for that.

> Feel free to create a patch for 8.4 though --- I think this is a
> reasonable suggestion for making *future* RM development safer.
> (Also, if there are any bugs in the extant code, we'd presumably
> back-patch fixes discovered in 8.4 testing, as long as you don't
> whack the code around to the point of unrecognizability...)

OK, will do.

-- 
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
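For reference, a minimal sketch of the kind of orphan check meant above. The struct approximates what ginxlog.c already keeps for incomplete splits; the start_lsn field and the LOG message are the additions, and the exact layout here is from memory, not verified against the tree.

/* Sketch only: GIN's incomplete-splits list, extended with the starting
 * lsn so gin_safe_restartpoint() can complain about stragglers. */

typedef struct ginIncompleteSplit
{
    RelFileNode node;
    BlockNumber leftBlkno;
    BlockNumber rightBlkno;
    BlockNumber rootBlkno;
    XLogRecPtr  start_lsn;  /* NEW: lsn of the record that began the split */
} ginIncompleteSplit;

static List *incomplete_splits = NIL;

bool
gin_safe_restartpoint(void)
{
    ListCell   *cell;

    foreach(cell, incomplete_splits)
    {
        ginIncompleteSplit *split = (ginIncompleteSplit *) lfirst(cell);

        /* an entry that outlives a restartpoint deserves a closer look */
        elog(LOG, "incomplete GIN split of blocks %u/%u in rel %u/%u/%u "
             "pending since %X/%X",
             split->leftBlkno, split->rightBlkno,
             split->node.spcNode, split->node.dbNode, split->node.relNode,
             split->start_lsn.xlogid, split->start_lsn.xrecoff);
    }
    return (incomplete_splits == NIL);
}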
Added to TODO:

> * Have resource managers report the duration of their status changes
>
>   http://archives.postgresql.org/pgsql-hackers/2007-10/msg01468.php

---------------------------------------------------------------------------

Simon Riggs wrote:
> We've had two hard-to-diagnose errors in recovery in recent months. ISTM
> that the core issue is the way we allow Resource Managers to have
> multi-stage WAL actions that persist for long periods of time. This
> means we have no way of telling whether the answer
> rm_safe_restartpoint() == false is a momentary, valid state or a
> progressively worsening indicator of a subtle RM bug.
> [...]
-- 
Bruce Momjian <bruce@momjian.us>          http://momjian.us
EnterpriseDB                              http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +