Regression tests versus the buildfarm environment - Mailing list pgsql-hackers

There's an interesting buildfarm failure here:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=polecat&dt=2010-08-10%2023:46:10
It appears to me that this was caused by the concurrent run of another
buildfarm animal on the same physical machine, namely:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=colugos&dt=2010-08-11%2000:02:58
Both animals are trying to test HEAD, which means that pg_regress
defaults to the same postmaster port number in both builds:
    if (temp_install && !port_specified_by_user)

        /*
         * To reduce chances of interference with parallel installations, use
         * a port number starting in the private range (49152-65535)
         * calculated from the version number.
         */
        port = 0xC000 | (PG_VERSION_NUM & 0x3FFF);
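
To see why both animals land on the same port, here is a small standalone sketch of that calculation. The function name default_temp_port is made up for illustration, and the value 90100 for PG_VERSION_NUM is an assumption about what HEAD reported at the time; neither comes from pg_regress itself.

```c
#include <stdio.h>

/*
 * Hypothetical helper mirroring the expression quoted above: keep the low
 * 14 bits of the version number and set the top two bits of a 16-bit port,
 * which always lands in the private range 49152-65535.
 */
int
default_temp_port(int pg_version_num)
{
    return 0xC000 | (pg_version_num & 0x3FFF);
}

/* Convenience wrapper for printing the result (also hypothetical). */
void
print_default_temp_port(int pg_version_num)
{
    printf("version %d -> port %d\n",
           pg_version_num, default_temp_port(pg_version_num));
}
```

With the assumed value 90100, the masked part is 8180 and the result is 49152 + 8180 = 57332, which matches the port in the logs below. Any two builds of the same branch get the same answer, regardless of which animal runs them.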
 

We observe colugos successfully starting on that port:

============== starting postmaster                    ==============
running on port 57332 with pid 47019
============== creating database "regression"         ==============
CREATE DATABASE
ALTER DATABASE
... etc etc ...

polecat comes along what must be only moments later, and tries to use
the same port for its temp install:

============== starting postmaster                    ==============
running on port 57332 with pid 47022
============== creating database "regression"         ==============
ERROR:  duplicate key value violates unique constraint "pg_database_datname_index"
DETAIL:  Key (datname)=(regression) already exists.
command failed: "/usr/local/src/build-farm-3.2/builds/HEAD/pgsql.15278/src/test/regress/./tmp_check/install//usr/local/src/build-farm-3.2/builds/HEAD/inst/bin/psql" -X -c "CREATE DATABASE \"regression\" TEMPLATE=template0 ENCODING='SQL_ASCII' LC_COLLATE='C' LC_CTYPE='C'" "postgres"
pg_ctl: PID file "/usr/local/src/build-farm-3.2/builds/HEAD/pgsql.15278/src/test/regress/./tmp_check/data/postmaster.pid" does not exist
Is server running?

pg_regress: could not stop postmaster: exit code was 256

Now the postmaster log shows that the second postmaster correctly
recognized that the port number was already in use, so it bailed out:

================== pgsql.15278/src/test/regress/log/postmaster.log ===================
[4c61f2d2.b7ae:1] FATAL:  lock file "/tmp/.s.PGSQL.57332.lock" already exists
[4c61f2d2.b7ae:2] HINT:  Is another postmaster (PID 47019) using socket file "/tmp/.s.PGSQL.57332"?

However, pg_regress failed to have a clue about what had happened,
and bulled ahead trying to run the regression tests (against the
postmaster started by the other pg_regress instance).  A look at the
code shows that it is merely trying to run psql, and if psql reports
that it can connect to the specified port, then pg_regress thinks the
postmaster started OK.  Of course, psql was really reporting that it
could connect to the other instance's postmaster.
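
For illustration only, one cheap sanity check would be to compare the PID that pg_regress reports for the process it launched against the first line of postmaster.pid in its own temp data directory. The sketch below is not pg_regress code: the helper name pidfile_matches is hypothetical, and it ignores the startup window in which the PID file has not been written yet.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns 1 if the file at pidfile_path exists and its
 * first line parses to expected_pid, and 0 otherwise (missing file, empty
 * file, non-numeric first line, or a different PID).
 */
int
pidfile_matches(const char *pidfile_path, long expected_pid)
{
    FILE   *fp = fopen(pidfile_path, "r");
    char    line[64];
    char   *end;
    long    pid;

    if (fp == NULL)
        return 0;               /* no PID file: our postmaster is not up */

    if (fgets(line, sizeof(line), fp) == NULL)
    {
        fclose(fp);
        return 0;               /* empty or unreadable file */
    }
    fclose(fp);

    pid = strtol(line, &end, 10);
    if (end == line)
        return 0;               /* first line did not start with a number */

    return pid == expected_pid;
}
```

In the failure above, polecat's tmp_check/data/postmaster.pid never existed, so a check along these lines would have returned 0 even though psql connected successfully.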


I've seen similar multiple-postmaster-interference symptoms before in
the buildfarm, but never really understood the cause.

I am not sure if there's anything very good we can do about the
problem of pg_regress misidentifying the postmaster it's managed to
connect to.  A real solution would probably be much more trouble than
it's worth, anyway.  However, it does seem like we ought to be able to
do something about two buildfarm critters defaulting to the same choice
of port number.  The buildfarm infrastructure goes to great lengths to
pick nonconflicting port numbers for the "installed" postmasters it
runs; but we're ignoring all that effort and just using a hardwired
port number for "make check".  This is dumb.

pg_regress does have a --port argument that can be used to override
that default.  I don't know whether the buildfarm script calls
pg_regress directly or does "make check".  If the latter, we'd need to
twiddle the Makefiles to allow a port number to get passed in.  But
this seems well worthwhile to me.

Comments?
        regards, tom lane

