Thread: pg_rewind exiting with error code 1 when source and target are on the same timeline
pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Michael Paquier
Date:
Hi all, I have been pinged internally by a user by the fact that pg_rewind returns 1 as exit code if the target and source nodes are on the same timeline. Actually, in this case, it feels weird to consider that as a failure as no rewind should be needed, the node that is behind the other in terms of WAL replay could just be reconnected to the other node. This is of course assuming no nodes are being "promoted" the way for example repmgr does by deleting recovery.conf and restarting the node, in which case both the target and source would be on the same timeline but they forked. But I think that we should not care about this case with pg_rewind as it has been designed to rewind nodes with different timelines. My point is that returning 1 in this case prevents users the possibility to make the difference between a run of pg_rewind that is not necessary and something that actually failed in this case, the idea being for example to be able to roll in a base backup or take other actions should a failure occur during a rewind or if one node does not satisfy one of pre-processing sanity checks. Attached is a patch aimed at changing that. Regards, -- Michael
Attachment
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Peter Eisentraut
Date:
On 10/20/15 9:31 PM, Michael Paquier wrote: > I have been pinged internally by a user by the fact that pg_rewind > returns 1 as exit code if the target and source nodes are on the same > timeline. I agree that that should not be an error. But I think in that case we should just delete that check and have the subsequent checks end up at the existing "no rewind required" result.
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Michael Paquier
Date:
On Thu, Oct 22, 2015 at 4:08 AM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 10/20/15 9:31 PM, Michael Paquier wrote: >> I have been pinged internally by a user by the fact that pg_rewind >> returns 1 as exit code if the target and source nodes are on the same >> timeline. > > I agree that that should not be an error. > But I think in that case we should just delete that check and have the > subsequent checks end up at the existing "no rewind required" result. I am not sure that I am not getting completely your point, why would it be a win to remove this safety check? We surely do not want to look for the common ancestor timeline if the target and source nodes have the same timeline, so we should not remove this check and just set rewind_needed to false to fallback to the same exit(0) for all those code paths. Per se the attached for example. -- Michael
Attachment
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Francisco Olarte
Date:
On Thu, Oct 22, 2015 at 7:42 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Oct 22, 2015 at 4:08 AM, Peter Eisentraut <peter_e@gmx.net> wrote: >> I agree that that should not be an error. >> But I think in that case we should just delete that check and have the >> subsequent checks end up at the existing "no rewind required" result. > I am not sure that I am not getting completely your point, why would > it be a win to remove this safety check? We surely do not want to look > for the common ancestor timeline if the target and source nodes have > the same timeline, so we should not remove this check and just set > rewind_needed to false to fallback to the same exit(0) for all those > code paths. Per se the attached for example. I'm not familiar with pg_rewind, but from a programming perspective I agree with Peter. If you will end with a no-rewind-required result when not checking for equality the same timeline check is not a safety check, but an optimisation. Then removing it can make sense, as deleted code cannot contain bugs. If searching for the common ancestor can lead to problems then its a safety check. Francisco Olarte.
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Peter Eisentraut
Date:
On 10/22/15 1:42 AM, Michael Paquier wrote: > I am not sure that I am not getting completely your point, why would > it be a win to remove this safety check? We surely do not want to look > for the common ancestor timeline if the target and source nodes have > the same timeline, so we should not remove this check and just set > rewind_needed to false to fallback to the same exit(0) for all those > code paths. Per se the attached for example. After playing with this a bit, I think your patch is correct. The code has drifted a bit in the meantime, so attached is an updated patch. Your patch added variable initializations for divergerec and lastcommontli. Was that to avoid compiler warnings about uninitialized variables? I didn't get these here. If that's the issue, maybe these initializations should be moved to the if (same timeline) branch, to get it a bit closer to the action. I also added a test for this case. (It would currently fail.)
Attachment
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Michael Paquier
Date:
On Fri, Dec 4, 2015 at 12:22 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 10/22/15 1:42 AM, Michael Paquier wrote: >> I am not sure that I am not getting completely your point, why would >> it be a win to remove this safety check? We surely do not want to look >> for the common ancestor timeline if the target and source nodes have >> the same timeline, so we should not remove this check and just set >> rewind_needed to false to fallback to the same exit(0) for all those >> code paths. Per se the attached for example. > > After playing with this a bit, I think your patch is correct. The code > has drifted a bit in the meantime, so attached is an updated patch. Thanks for looking at it. > Your patch added variable initializations for divergerec and > lastcommontli. Was that to avoid compiler warnings about uninitialized > variables? I didn't get these here. If that's the issue, maybe these > initializations should be moved to the if (same timeline) branch, to get > it a bit closer to the action. Looking again at this patch, I don't actually recall why I added them. It does not matter to initialize them anyway.. > I also added a test for this case. (It would currently fail.) Thanks, this looks good to me. -- Michael
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Peter Eisentraut
Date:
On 12/3/15 11:10 PM, Michael Paquier wrote: > On Fri, Dec 4, 2015 at 12:22 PM, Peter Eisentraut <peter_e@gmx.net> wrote: >> On 10/22/15 1:42 AM, Michael Paquier wrote: >>> I am not sure that I am not getting completely your point, why would >>> it be a win to remove this safety check? We surely do not want to look >>> for the common ancestor timeline if the target and source nodes have >>> the same timeline, so we should not remove this check and just set >>> rewind_needed to false to fallback to the same exit(0) for all those >>> code paths. Per se the attached for example. >> >> After playing with this a bit, I think your patch is correct. The code >> has drifted a bit in the meantime, so attached is an updated patch. > > Thanks for looking at it. I committed this to master. It's also on the 9.5 open item list, but if I backport it then the tests don't pass. Still looking. Not sure yet if this is because of code changes in pg_rewind master or test infrastructure changes in master.
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Michael Paquier
Date:
On Sat, Dec 12, 2015 at 11:13 AM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 12/3/15 11:10 PM, Michael Paquier wrote: >> On Fri, Dec 4, 2015 at 12:22 PM, Peter Eisentraut <peter_e@gmx.net> wrote: >>> On 10/22/15 1:42 AM, Michael Paquier wrote: >>>> I am not sure that I am not getting completely your point, why would >>>> it be a win to remove this safety check? We surely do not want to look >>>> for the common ancestor timeline if the target and source nodes have >>>> the same timeline, so we should not remove this check and just set >>>> rewind_needed to false to fallback to the same exit(0) for all those >>>> code paths. Per se the attached for example. >>> >>> After playing with this a bit, I think your patch is correct. The code >>> has drifted a bit in the meantime, so attached is an updated patch. >> >> Thanks for looking at it. > > I committed this to master. It's also on the 9.5 open item list, but if > I backport it then the tests don't pass. Still looking. Not sure yet > if this is because of code changes in pg_rewind master or test > infrastructure changes in master. The failure is as follows: source data directory must be shut down cleanly Failure, exiting not ok 1 - pg_rewind local And is caused by the fact that master checks that the source node has been stopped with DB_SHUTDOWNED and not DB_SHUTDOWNED_IN_RECOVERY. On master we allow both. This has been discussed here as well previously: http://www.postgresql.org/message-id/55FA2537.4070600@gmx.net Attached is a patch for 9.5, the test case also needs to have a call to RewindTest::clean_rewind_test(). Regards, -- Michael
Attachment
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes: > On 12/3/15 11:10 PM, Michael Paquier wrote: >> On Fri, Dec 4, 2015 at 12:22 PM, Peter Eisentraut <peter_e@gmx.net> wrote: >>> After playing with this a bit, I think your patch is correct. The code >>> has drifted a bit in the meantime, so attached is an updated patch. >> Thanks for looking at it. > I committed this to master. It's also on the 9.5 open item list, but if > I backport it then the tests don't pass. Still looking. Not sure yet > if this is because of code changes in pg_rewind master or test > infrastructure changes in master. I poked into this and found that the problem is that 9.5 is lacking the hunks of commit e50cda78 that teach sanityChecks() to allow the control file state to be DB_SHUTDOWNED_IN_RECOVERY, to wit @@ -374,10 +380,11 @@ sanityChecks(void) /* * Target cluster better not be running. This doesn't guard against * someone starting the cluster concurrently. Also, this is probably more - * strict than necessary; it's OK if the master was not shut down cleanly, - * as long as it isn't running at the moment. + * strict than necessary; it's OK if the target node was not shut down + * cleanly, as long as it isn't running at the moment. */ - if (ControlFile_target.state != DB_SHUTDOWNED) + if (ControlFile_target.state != DB_SHUTDOWNED && + ControlFile_target.state != DB_SHUTDOWNED_IN_RECOVERY) pg_fatal("target server must be shut down cleanly\n"); /* @@ -385,75 +392,149 @@ sanityChecks(void) * server is shut down. There isn't any very strong reason for this * limitation, but better safe than sorry. */ - if (datadir_source && ControlFile_source.state != DB_SHUTDOWNED) + if (datadir_source && + ControlFile_source.state != DB_SHUTDOWNED && + ControlFile_source.state != DB_SHUTDOWNED_IN_RECOVERY) pg_fatal("source data directory must be shut down cleanly\n"); } (Actually, it's only the second of these that is critical to make the test pass, but I should think we should apply both of them if either.) If I apply these, without any of the rest of e50cda78, everything seems fine. I'm going to go ahead and push that in the interests of getting some buildfarm cycles on it; but if someone could confirm that this is not an insane thing to do, it'd help. regards, tom lane
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Michael Paquier
Date:
On Tue, Dec 15, 2015 at 8:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > /* > @@ -385,75 +392,149 @@ sanityChecks(void) > * server is shut down. There isn't any very strong reason for this > * limitation, but better safe than sorry. > */ > - if (datadir_source && ControlFile_source.state != DB_SHUTDOWNED) > + if (datadir_source && > + ControlFile_source.state != DB_SHUTDOWNED && > + ControlFile_source.state != DB_SHUTDOWNED_IN_RECOVERY) > pg_fatal("source data directory must be shut down cleanly\n"); > } > > (Actually, it's only the second of these that is critical to make the > test pass, but I should think we should apply both of them if either.) Yep. > If I apply these, without any of the rest of e50cda78, everything seems > fine. I'm going to go ahead and push that in the interests of getting > some buildfarm cycles on it; but if someone could confirm that this > is not an insane thing to do, it'd help. That's not an insane behavior to me. Actually the patch you pushed to REL9_5_STABLE is missing something: a call to RewindTest::clean_rewind_test(); in t/005 to clean up the nodes involved in the test. I think that in the current state the nodes would stay online until they notice that postmaster.pid is missing. (FWIW there was a patch I sent upthread that did exactly the same thing :) ) -- Michael
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes: > That's not an insane behavior to me. Actually the patch you pushed to > REL9_5_STABLE is missing something: a call to > RewindTest::clean_rewind_test(); in t/005 to clean up the nodes > involved in the test. I think that in the current state the nodes > would stay online until they notice that postmaster.pid is missing. Hm, it's just like HEAD, but yeah I see the other four test files do that. Will fix. regards, tom lane
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Peter Eisentraut
Date:
On 12/12/15 6:08 AM, Michael Paquier wrote: >> I committed this to master. It's also on the 9.5 open item list, but if >> I backport it then the tests don't pass. Still looking. Not sure yet >> if this is because of code changes in pg_rewind master or test >> infrastructure changes in master. > > The failure is as follows: > source data directory must be shut down cleanly > Failure, exiting > not ok 1 - pg_rewind local > > And is caused by the fact that master checks that the source node has > been stopped with DB_SHUTDOWNED and not DB_SHUTDOWNED_IN_RECOVERY. It seems that the/a problem is actually that the test setup tries a rewind using the standby as the source and the standby also as the target. The source should be the primary, and then it doesn't matter whether we allow DB_SHUTDOWNED_IN_RECOVERY. Need to check that test setup.
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes: > On 12/12/15 6:08 AM, Michael Paquier wrote: >> And is caused by the fact that master checks that the source node has >> been stopped with DB_SHUTDOWNED and not DB_SHUTDOWNED_IN_RECOVERY. > It seems that the/a problem is actually that the test setup tries a > rewind using the standby as the source and the standby also as the > target. The source should be the primary, and then it doesn't matter > whether we allow DB_SHUTDOWNED_IN_RECOVERY. Need to check that test setup. While the test setup might have been a bit weird, I think that it uncovered an actual bug in pg_rewind, or at least a very undesirable and unnecessary restriction. Why should people not be allowed to use it to sync in that direction? regards, tom lane
Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
From
Peter Eisentraut
Date:
On 12/15/15 3:15 PM, Tom Lane wrote: > Peter Eisentraut <peter_e@gmx.net> writes: >> On 12/12/15 6:08 AM, Michael Paquier wrote: >>> And is caused by the fact that master checks that the source node has >>> been stopped with DB_SHUTDOWNED and not DB_SHUTDOWNED_IN_RECOVERY. > >> It seems that the/a problem is actually that the test setup tries a >> rewind using the standby as the source and the standby also as the >> target. The source should be the primary, and then it doesn't matter >> whether we allow DB_SHUTDOWNED_IN_RECOVERY. Need to check that test setup. > > While the test setup might have been a bit weird, I think that it > uncovered an actual bug in pg_rewind, or at least a very undesirable > and unnecessary restriction. Why should people not be allowed to use > it to sync in that direction? That issue was known and was fixed in 9.6. I don't mind that it was fixed in 9.5 as well.