Thread: Recovery of Multi-stage WAL actions
We've had two hard-to-diagnose errors in recovery in recent months. ISTM that the core issue is the way we allow Resource Managers to have multi-stage WAL actions that persist for long periods of time. This means we have no way of telling whether the answer rm_safe_restartpoint() == false is a momentary, valid state or a progressively worsening indicator of a subtle RM bug.

An example of a multi-stage WAL action would be an index split inside one of the Resource Managers. Now that kind of action shouldn't take very long, though theoretically it could for various reasons.

Right now we have a log message to cope with this:
http://archives.postgresql.org/pgsql-committers/2007-08/msg00374.php
but it's not nearly as helpful as we'd like it to be. We could back-patch this to 8.2, but I have a potentially better proposal.

I very much want to encourage authors of new Resource Managers, and it looks like we may be getting at least 3 new RMs that produce WAL records: hash indexes (currently not WAL-logged), bitmap indexes and clustered indexes for 8.4. We should be realistic that new bugs probably will occur in recovery code for existing and new RMs.

What I'd like to do is force all of the RMs to record the lsn of any WAL record that starts an incomplete action. Then, if an incomplete action lives for more than a certain period of time, it will be possible to produce a log message saying "incomplete split has survived for X seconds, in xlog time". That way we'll see log messages if any of the RMs start to push an incomplete action onto their list and then not consume it again.

We might trust each RM to implement code to LOG messages if its code goes a little awry, but I'd prefer some mechanism that allows the main server to check what's happening in each RM. That way we'd have a cross-check on whether the RM is well-behaved, plus we'd only need to implement the checking code once. Right now very similar, yet different, code runs inside each RM.

So my proposal is to have an incomplete-split remember/forget API that forces each RM to expose its incomplete-split List. Currently each RM has a hook, rm_safe_restartpoint(), so that each RM manages its own List. My new thought is to have one function, safeRestartPoint(), that inspects each of the incomplete-split Lists to see if they are empty. If the lists are non-empty, it would then inspect the age of each list entry to see if it is worth reporting as a possible issue. Each RM would then store incomplete splits using ResourceManagerRememberIncompleteEvent(lsn, id_data, payload??) and ResourceManagerForgetIncompleteEvent(id_data). Implementation is a bit hazy on that last part, but I think the overall idea is clear (rough sketch below).

That should mean that any incomplete split that lasts for the length of one restartpoint, which is *at least* one checkpoint duration, should cause a LOG message to be produced. We might even go as far as to ignore super-long-lived and therefore spurious incomplete splits when we issue rm_cleanup(), for fear of allowing RM bugs to kill recovery.

I'd like to suggest that those changes be performed now for 8.3 *and* back-patched for 8.2. I want to make sure that all users are able to diagnose server errors and report them. I'm guessing that might raise a few eyebrows, but I think it's justifiable. Bugs in complex code are inevitable and should not be seen to reflect badly upon RM authors. However, our inability to recognise RM bugs that do occur doesn't seem acceptable to me, especially since they may save themselves up for the moment of PITR fail-over.
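To make the remember/forget idea concrete, here is a minimal sketch of what the central machinery might look like. Everything here is illustrative only: the struct layout, the extra rmid/id_len parameters and the exact log wording are guesses at a design, not code from the tree.

/* Illustrative sketch only -- names and mechanics are all up for debate */

#include "postgres.h"
#include "access/rmgr.h"        /* RmgrId */
#include "access/xlogdefs.h"    /* XLogRecPtr */
#include "nodes/pg_list.h"

typedef struct IncompleteAction
{
    RmgrId      rmid;           /* which resource manager logged it */
    XLogRecPtr  start_lsn;      /* lsn of the record that began the action */
    void       *id_data;        /* RM-private key, matched on the forget call;
                                 * assumed palloc'd by the caller */
    Size        id_len;
    void       *payload;        /* whatever the RM needs for rm_cleanup() */
} IncompleteAction;

static List *incomplete_actions = NIL;  /* one central list, not per-RM */

/* An RM calls this when replaying a record that starts a multi-stage action */
void
ResourceManagerRememberIncompleteEvent(RmgrId rmid, XLogRecPtr lsn,
                                       void *id_data, Size id_len,
                                       void *payload)
{
    IncompleteAction *action = palloc(sizeof(IncompleteAction));

    action->rmid = rmid;
    action->start_lsn = lsn;
    action->id_data = id_data;
    action->id_len = id_len;
    action->payload = payload;
    incomplete_actions = lappend(incomplete_actions, action);
}

/* An RM calls this when it replays the record that completes the action */
void
ResourceManagerForgetIncompleteEvent(RmgrId rmid, void *id_data, Size id_len)
{
    ListCell   *cell;

    foreach(cell, incomplete_actions)
    {
        IncompleteAction *action = (IncompleteAction *) lfirst(cell);

        if (action->rmid == rmid && action->id_len == id_len &&
            memcmp(action->id_data, id_data, id_len) == 0)
        {
            incomplete_actions = list_delete_ptr(incomplete_actions, action);
            pfree(action);
            return;
        }
    }
    /* a forget without a matching remember is itself a sign of an RM bug */
    elog(LOG, "resource manager %d forgot an action it never remembered",
         (int) rmid);
}

/* One central check, replacing the per-RM rm_safe_restartpoint() hooks */
bool
safeRestartPoint(XLogRecPtr current_lsn)
{
    ListCell   *cell;

    foreach(cell, incomplete_actions)
    {
        IncompleteAction *action = (IncompleteAction *) lfirst(cell);

        elog(LOG, "incomplete action for resource manager %d has survived "
             "from %X/%X to %X/%X in xlog time",
             (int) action->rmid,
             action->start_lsn.xlogid, action->start_lsn.xrecoff,
             current_lsn.xlogid, current_lsn.xrecoff);
    }
    return (incomplete_actions == NIL);
}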
You might persuade me I'm being over-zealous here, but High Availability is something we have to be zealous about.

It should also be possible to allow the server to stay up even if one of the RMs fails to recover properly. That would need to be settable, so I really only mean it for "optional" RMs, i.e. index RMs only. For those cases we should be able to mark affected indexes as corrupt (rough sketch below). Automatic rebuild of corrupt indexes could also be possible, should it occur. That would be an 8.4 action... :-)

Comments appreciated, as ever.

-- 
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
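To illustrate the index-marking idea above: one conceivable mechanism, purely an assumption here rather than part of the proposal, is to reuse the pg_index.indisvalid flag that CREATE INDEX CONCURRENTLY already clears for not-yet-usable indexes. Recovery itself cannot run catalog updates, so something like this could only run once the server is out of recovery:

/* Hypothetical sketch: make the planner ignore a damaged index by
 * clearing pg_index.indisvalid.  Assumes normal (post-recovery)
 * catalog access; mark_index_corrupt is an invented name. */

#include "postgres.h"
#include "access/heapam.h"
#include "catalog/indexing.h"
#include "catalog/pg_index.h"
#include "utils/syscache.h"

static void
mark_index_corrupt(Oid indexoid)
{
    Relation    pg_index;
    HeapTuple   tuple;
    Form_pg_index indexForm;

    pg_index = heap_open(IndexRelationId, RowExclusiveLock);

    tuple = SearchSysCacheCopy(INDEXRELID,
                               ObjectIdGetDatum(indexoid),
                               0, 0, 0);
    if (!HeapTupleIsValid(tuple))
        elog(ERROR, "cache lookup failed for index %u", indexoid);
    indexForm = (Form_pg_index) GETSTRUCT(tuple);

    indexForm->indisvalid = false;  /* planner will now skip this index */

    simple_heap_update(pg_index, &tuple->t_self, tuple);
    CatalogUpdateIndexes(pg_index, tuple);

    heap_close(pg_index, RowExclusiveLock);
}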
Simon Riggs <simon@2ndquadrant.com> writes:
> What I'd like to do is force all of the RMs to record the lsn of any WAL
> record that starts an incomplete action.
> ...
> I'd like to suggest that those changes be performed now for 8.3 *and*
> back-patched for 8.2.

There is zero chance of the former and less than zero of the latter; making major, hard-to-test changes in the rather-hypothetical hope of discovering entirely-hypothetical bugs is not my idea of the way to run a stable release branch.

Feel free to create a patch for 8.4 though --- I think this is a reasonable suggestion for making *future* RM development safer. (Also, if there are any bugs in the extant code, we'd presumably back-patch fixes discovered in 8.4 testing, as long as you don't whack the code around to the point of unrecognizability...)

			regards, tom lane
On Tue, 2007-10-30 at 00:48 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > What I'd like to do is force all of the RMs to record the lsn of any WAL
> > record that starts an incomplete action.
> > ...
> > I'd like to suggest that those changes be performed now for 8.3 *and*
> > back-patched for 8.2.
>
> There is zero chance of the former and less than zero of the latter;
> making major, hard-to-test changes in the rather-hypothetical hope
> of discovering entirely-hypothetical bugs is not my idea of the way
> to run a stable release branch.

I understand that initial thought.

Right now, the two bugs discovered seem likely to mean that anyone running a combination of 8.2, a GIN index and Warm Standby has a bug that will only show itself when the standby server comes out of standby. At the very least we should add the DEBUG2 message I referred to earlier into 8.2, and then release a new dot release immediately for 8.2 to allow users to avoid this error. The problem is that we then have no way of knowing whether there is a 3rd bug after that.

I'm not happy about my suggestion and its possible effects on stability. However, if I'm running Postgres HA, then I *need* to know that there is no bug, not just presume there is not one. The latter presumption is looking weak after two bugs. So adding logging code to that area is important.

If you're still adamant, then I would suggest we modify the GIN xlog code so it has a simple test for orphaned incomplete splits (rough sketch below). I didn't want to single out that particular code, but I think we should do something for 8.2 and 8.3. I'll look at a simplest-possible patch for that.

> Feel free to create a patch for 8.4 though --- I think this is a
> reasonable suggestion for making *future* RM development safer.
> (Also, if there are any bugs in the extant code, we'd presumably
> back-patch fixes discovered in 8.4 testing, as long as you don't
> whack the code around to the point of unrecognizability...)

OK, will do.

-- 
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
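For reference, a minimal sketch of the kind of orphan check meant above. The struct approximates what ginxlog.c already keeps for incomplete splits; the start_lsn field and the LOG message are the additions, and the exact layout here is from memory, not verified against the tree.

/* Sketch only: GIN's incomplete-splits list, extended with the starting
 * lsn so gin_safe_restartpoint() can complain about stragglers. */

typedef struct ginIncompleteSplit
{
    RelFileNode node;
    BlockNumber leftBlkno;
    BlockNumber rightBlkno;
    BlockNumber rootBlkno;
    XLogRecPtr  start_lsn;  /* NEW: lsn of the record that began the split */
} ginIncompleteSplit;

static List *incomplete_splits = NIL;

bool
gin_safe_restartpoint(void)
{
    ListCell   *cell;

    foreach(cell, incomplete_splits)
    {
        ginIncompleteSplit *split = (ginIncompleteSplit *) lfirst(cell);

        /* an entry that outlives a restartpoint deserves a closer look */
        elog(LOG, "incomplete GIN split of blocks %u/%u in rel %u/%u/%u "
             "pending since %X/%X",
             split->leftBlkno, split->rightBlkno,
             split->node.spcNode, split->node.dbNode, split->node.relNode,
             split->start_lsn.xlogid, split->start_lsn.xrecoff);
    }
    return (incomplete_splits == NIL);
}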
Added to TODO:

> * Have resource managers report the duration of their status changes
>
>   http://archives.postgresql.org/pgsql-hackers/2007-10/msg01468.php

---------------------------------------------------------------------------

Simon Riggs wrote:
> We've had two hard-to-diagnose errors in recovery in recent months. ISTM
> that the core issue is the way we allow Resource Managers to have
> multi-stage WAL actions that persist for long periods of time. This
> means we have no way of telling whether the answer
> rm_safe_restartpoint() == false is a momentary, valid state or a
> progressively worsening indicator of a subtle RM bug.
> [...]
-- 
Bruce Momjian <bruce@momjian.us>          http://momjian.us
EnterpriseDB                              http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +