pg_ctl/pg_rewind tests vs. slow AIX buildfarm members - Mailing list pgsql-hackers

From Noah Misch
Subject pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Date
Msg-id 20150903062500.GB2973274@tornado.leadboat.com
Responses Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
My AIX buildfarm members have failed the BinInstallCheck step on and off since
inception.  It became more frequent when I added animals sungazer and tern
alongside the older hornet and mandrill.  The animals share a machine with
each other and with dozens of other developers.  I setpriority() the animals
to the lowest available priority, so they probably lose the CPU for long
periods.  Separately, this machine has slow filesystem metadata operations.
For example, git-new-workdir takes ~50s for a PostgreSQL tree.

The pg_rewind suite has failed a few times when crash recovery took longer
than the 60s pg_ctl default timeout.  Disabling fsync (commit 7d7a103) reduced
median crash recovery time by 75%, which may suffice.  If not, I'll be
inclined to add --timeout=900 to each pg_ctl invocation.
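To illustrate why a slow crash recovery trips the default, here is a minimal sketch of a wait loop with a timeout, assuming pg_ctl -w polls the server about once per second until it is ready or the timeout elapses. The names wait_until_ready() and ready_after_n_polls() are hypothetical and exist only for this example; this is not the pg_ctl source.

```c
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical readiness probe: returns true once the server accepts connections. */
typedef bool (*ready_check_fn) (void *arg);

/*
 * Hypothetical helper: poll check() up to timeout_secs times, sleeping
 * poll_interval_secs between attempts.  Returns true if the server became
 * ready in time, false if we gave up.
 */
static bool
wait_until_ready(ready_check_fn check, void *arg,
                 int timeout_secs, unsigned poll_interval_secs)
{
    assert(timeout_secs >= 0);

    for (int i = 0; i < timeout_secs; i++)
    {
        if (check(arg))
            return true;
        sleep(poll_interval_secs);
    }
    return false;
}

/* Stand-in for crash recovery that needs *arg more polls before finishing. */
static bool
ready_after_n_polls(void *arg)
{
    int *remaining = arg;

    if (*remaining <= 0)
        return true;
    (*remaining)--;
    return false;
}
```

With a one-second poll interval, a recovery that needs 90 polls is reported as a failure under a 60-poll limit and as a success under a 900-poll limit, even though the server behaves identically in both runs.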


The pg_ctl suite has failed with "not ok 12 - second pg_ctl start succeeds".
You can reproduce that by adding "sleep 3;" between that test and the one
before it.  The timing dependency comes from the pg_ctl "slop" time:
    /*
     * Make sanity checks.  If it's for a standalone backend
     * (negative PID), or the recorded start time is before
     * pg_ctl started, then either we are looking at the wrong
     * data directory, or this is a pre-existing pidfile that
     * hasn't (yet?) been overwritten by our child postmaster.
     * Allow 2 seconds slop for possible cross-process clock
     * skew.
     */
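The check that comment describes can be sketched roughly as follows. This is a simplified illustration, not the pg_ctl source: the function name pidfile_looks_stale() and its parameters are hypothetical, and the slop is passed in explicitly rather than hard-coded as 2.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper.  Returns true when the pidfile should be treated as
 * stale or foreign: either it belongs to a standalone backend (negative
 * PID), or its recorded start time predates pg_ctl's own start time by more
 * than the allowed slop.
 */
static bool
pidfile_looks_stale(long pmpid, long pmstart, long pgctl_start, long slop)
{
    assert(slop >= 0);

    return pmpid <= 0 || pmstart < pgctl_start - slop;
}
```

Under these assumptions, a second pg_ctl start issued within 2 seconds of the first postmaster's start time accepts the existing pidfile as if it belonged to its own child, which is how the second start can appear to succeed. With a gap of 3 seconds the pidfile is classified as pre-existing instead.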
 

The "second pg_ctl start succeeds" tested-for behavior is actually a minor bug
that we'd ideally fix as described in the last paragraph of the commit 3c485ca
log message:
   All of this could be improved if we rewrote start_postmaster() so that it
   could report the child postmaster's PID, so that we'd know a-priori the
   correct PID to test with postmaster_is_alive().  That looks like a bit too
   much change for so late in the 9.1 development cycle, unfortunately.
 

I recommend we invert the test expectation and, pending the ideal pg_ctl fix,
add the "sleep 3" to avoid falling within the time slop:

--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -35,6 +35,7 @@ close CONF;
 command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
     'pg_ctl start -w');
-command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
-    'second pg_ctl start succeeds');
+sleep 3;    # bridge test_postmaster_connection() slop threshold
+command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
+    'second pg_ctl start fails');
 command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
     'pg_ctl stop -w');
 


Alternatively, I could just remove the test.

crake failed the same way, once:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06
