pg_ctl/pg_rewind tests vs. slow AIX buildfarm members - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | pg_ctl/pg_rewind tests vs. slow AIX buildfarm members |
Date | |
Msg-id | 20150903062500.GB2973274@tornado.leadboat.com |
Responses | Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
My AIX buildfarm members have failed the BinInstallCheck step on and off since inception. It became more frequent when I added animals sungazer and tern alongside the older hornet and mandrill. The animals share a machine with each other and with dozens of other developers. I setpriority() the animals to the lowest available priority, so they probably lose the CPU for long periods. Separately, this machine has slow filesystem metadata operations. For example, git-new-workdir takes ~50s for a PostgreSQL tree.

The pg_rewind suite has failed a few times when crash recovery took longer than the 60s pg_ctl default timeout. Disabling fsync (commit 7d7a103) reduced median crash recovery time by 75%, which may suffice. If not, I'll be inclined to add --timeout=900 to each pg_ctl invocation.

The pg_ctl suite has failed with "not ok 12 - second pg_ctl start succeeds". You can reproduce that by adding "sleep 3;" between that test and the one before it. The timing dependency comes from the pg_ctl "slop" time:

    /*
     * Make sanity checks.  If it's for a standalone backend
     * (negative PID), or the recorded start time is before
     * pg_ctl started, then either we are looking at the wrong
     * data directory, or this is a pre-existing pidfile that
     * hasn't (yet?) been overwritten by our child postmaster.
     * Allow 2 seconds slop for possible cross-process clock
     * skew.
     */

The "second pg_ctl start succeeds" tested-for behavior is actually a minor bug that we'd ideally fix as described in the last paragraph of the commit 3c485ca log message:

    All of this could be improved if we rewrote start_postmaster() so that it
    could report the child postmaster's PID, so that we'd know a-priori the
    correct PID to test with postmaster_is_alive().  That looks like a bit too
    much change for so late in the 9.1 development cycle, unfortunately.
I recommend we invert the test expectation and, pending the ideal pg_ctl fix, add the "sleep 3" to avoid falling within the time slop:

--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -35,6 +35,7 @@ close CONF;
 command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
 	'pg_ctl start -w');
-command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
-	'second pg_ctl start succeeds');
+sleep 3;    # bridge test_postmaster_connection() slop threshold
+command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
+	'second pg_ctl start fails');
 command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
 	'pg_ctl stop -w');

Alternately, I could just remove the test. crake failed the same way, once:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06
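
To illustrate why a 3-second delay flips the test result, here is a minimal sketch of the sanity check that the quoted pg_ctl comment describes. It is not the pg_ctl source: the helper name pidfile_looks_stale(), its parameters, and the PIDFILE_SLOP_SECONDS macro are hypothetical, and treating a PID of zero as invalid is an assumption beyond the "negative PID" wording in the comment.

```c
#include <stdbool.h>
#include <time.h>

/* Tolerated cross-process clock skew, per the quoted pg_ctl comment. */
#define PIDFILE_SLOP_SECONDS 2

/*
 * Hypothetical helper, not a real pg_ctl function.
 *
 * Returns true when the pidfile should be ignored: it either names a
 * standalone backend (non-positive PID; rejecting zero is an assumption)
 * or records a start time more than the slop before pg_ctl itself started.
 */
bool
pidfile_looks_stale(long pmpid, time_t pmstart, time_t pgctl_start)
{
    if (pmpid <= 0)
        return true;

    /* Older than "pg_ctl start minus slop" means a pre-existing pidfile. */
    return pmstart < pgctl_start - PIDFILE_SLOP_SECONDS;
}
```

Under this sketch, a second pg_ctl launched within 2 seconds of the first postmaster's recorded start time accepts the existing pidfile as if it belonged to its own child, which is the behavior the old "second pg_ctl start succeeds" test relied on. Launched 3 or more seconds later, it treats the pidfile as pre-existing, which is what the "sleep 3" guarantees.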