Thread: Improve sleep processing of pg_rewind TAP tests
Hi all, TAP tests of pg_rewind are using in 2 places hardcoded values to wait for a given amount of time for some events. In HEAD, those things are: 1) Wait for 1s for standby to catch up. 2) Wait for 2s for promotion of standby. However after discussion with a colleague we have noticed that those values may not be enough in slow environments, a value of up to 10s being sometimes needed after promotion to make tests pass. And actually the current way of doing is not reliable because it depends on how the environment is able to handle quickly standby replay and promotion (I am expecting issues regarding that on Windows btw, and on small spec machines). Attached is a patch improving this sleep logic and doing the following things: 1) To ensure that standby has caught up, check replay position on the standby and compare it with the current WAL position of master. 2) To ensure that promotion is effective, use pg_is_in_recovery() and continue processing until we are sure that the standby is out of recovery. In each case the patch attached makes a maximum of 30 attempts, each attempt waiting 1s before processing, so we let a margin of up to 30s for environments to handle replay and promotion properly. Note that this patch adds a small routine called command_result in TestLib.pm, able to return stdout, stderr and the exit code. That's really handy, and I am planning to use something like that as well in the replication test suite. Regards, -- Michael
Attachment
Michael Paquier wrote: > However after discussion with a colleague we have noticed that those > values may not be enough in slow environments, a value of up to 10s > being sometimes needed after promotion to make tests pass. Yeah, hardcoded sleep times are not reliable. (/me would love to get rid of hardcoded times in isolationtester timeout.spec test ...) > Attached is a patch improving this sleep logic and doing the following things: > 1) To ensure that standby has caught up, check replay position on the > standby and compare it with the current WAL position of master. > 2) To ensure that promotion is effective, use pg_is_in_recovery() and > continue processing until we are sure that the standby is out of > recovery. Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use? I thought the -w would wait until promotion has taken effect, so there's no need to sleep additional time. > Note that this patch adds a small routine called command_result in > TestLib.pm, able to return stdout, stderr and the exit code. That's > really handy, and I am planning to use something like that as well in > the replication test suite. Makes sense to me. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 16, 2015 at 12:51 PM, Alvaro Herrera wrote: > Michael Paquier wrote: > >> However after discussion with a colleague we have noticed that those >> values may not be enough in slow environments, a value of up to 10s >> being sometimes needed after promotion to make tests pass. > > Yeah, hardcoded sleep times are not reliable. (/me would love to get > rid of hardcoded times in isolationtester timeout.spec test ...) > >> Attached is a patch improving this sleep logic and doing the following things: >> 1) To ensure that standby has caught up, check replay position on the >> standby and compare it with the current WAL position of master. > >> 2) To ensure that promotion is effective, use pg_is_in_recovery() and >> continue processing until we are sure that the standby is out of >> recovery. > > Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use? I > thought the -w would wait until promotion has taken effect, so there's > no need to sleep additional time. Visibly that's not the case for this test case, the timing issues that we saw happened not because of the standby not catching up, but because of the promotion not taking effect in a timely fashion. And that's as well something I saw on my OSX box some days ago as well, explaining why the sleep time has been increased from 1 to 2 in 53ba1077. This patch just takes it the careful way. In perl, system waits for the process of pg_ctl to exit before moving on, perhaps the problem is that pg_ctl thinks that promotion is done, but the node is not quite ready, explaining why the test fails. -- Michael
On Thu, Apr 16, 2015 at 1:00 PM, Michael Paquier wrote: > Visibly that's not the case for this test case, the timing issues that > we saw happened not because of the standby not catching up, but > because of the promotion not taking effect in a timely fashion. And > that's as well something I saw on my OSX box some days ago as well, > explaining why the sleep time has been increased from 1 to 2 in > 53ba1077. This patch just takes it the careful way. In perl, system > waits for the process of pg_ctl to exit before moving on, perhaps the > problem is that pg_ctl thinks that promotion is done, but the node is > not quite ready, explaining why the test fails. I have just made a run of the TAP tests of pg_rewind on my raspberry PI 1 (hamster), where the tests are very slow, and I noticed that it takes up to 10s to get confirmation from standby that it has caught up with the changes from master, and 5s to get confirmation that standby has been promoted. So the tests in HEAD would be broken without this patch, and 30s gives visibly enough room. Note as well that I would like to enable the TAP tests on this buildfarm machine btw. It may be good to get some input where things are really slow even for other tests. Regards, -- Michael
On 04/16/2015 06:51 AM, Alvaro Herrera wrote: > Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use? I > thought the -w would wait until promotion has taken effect, so there's > no need to sleep additional time. -w is not supported with pg_ctl promote. Only start, stop and restart. It's accepted, but it doesn't do anything. Which isn't very nice... - Heikki
On 04/20/2015 05:21 AM, Michael Paquier wrote: > I have just made a run of the TAP tests of pg_rewind on my raspberry > PI 1 (hamster), where the tests are very slow, and I noticed that it > takes up to 10s to get confirmation from standby that it has caught up > with the changes from master, and 5s to get confirmation that standby > has been promoted. So the tests in HEAD would be broken without this > patch, and 30s gives visibly enough room. Ok, applied, with some refactoring: I put the sleep-retry loop into a separate function. - Heikki
On Wed, Apr 22, 2015 at 8:40 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 04/20/2015 05:21 AM, Michael Paquier wrote: >> >> I have just made a run of the TAP tests of pg_rewind on my raspberry >> PI 1 (hamster), where the tests are very slow, and I noticed that it >> takes up to 10s to get confirmation from standby that it has caught up >> with the changes from master, and 5s to get confirmation that standby >> has been promoted. So the tests in HEAD would be broken without this >> patch, and 30s gives visibly enough room. > > > Ok, applied, with some refactoring: I put the sleep-retry loop into a > separate function. Thanks. -- Michael
On April 22, 2015 11:14:08 AM Heikki Linnakangas wrote: > On 04/16/2015 06:51 AM, Alvaro Herrera wrote: > > Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use? I > > thought the -w would wait until promotion has taken effect, so there's > > no need to sleep additional time. > > -w is not supported with pg_ctl promote. Only start, stop and restart. > It's accepted, but it doesn't do anything. Which isn't very nice... I'm futzing with the pg_ctl cmd line parameters for my patch notifying pg_ctl that restart didn't work. I'll fix this as well. > > - Heikki jan