Thread: Explanation for intermittent buildfarm pg_upgradecheck failures

Explanation for intermittent buildfarm pg_upgradecheck failures

From: Tom Lane
Observe the smoking gun at
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2015-08-02%2003%3A30%3A02

to wit this extract from pg_upgrade_server.log:

command:
"/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/install//home/pg/build-farm-4.15.1/build/HEAD/inst/bin/pg_ctl"
-w -D "/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/data.old" -o "" stop >>
"pg_upgrade_server.log" 2>&1
 
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  shutting down
waiting for server to shut down.....LOG:  database system is shut downdone
server stopped

command:
"/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/install//home/pg/build-farm-4.15.1/build/HEAD/inst/bin/pg_ctl"
-w -l "pg_upgrade_server.log" -D "/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/data"
-o "-p 57832 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c listen_addresses='' -c
unix_socket_permissions=0700 -c
unix_socket_directories='/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade'" start >>
"pg_upgrade_server.log" 2>&1
 
waiting for server to start....FATAL:  lock file
"/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/.s.PGSQL.57832.lock" already exists
 
HINT:  Is another postmaster (PID 12295) using socket file
"/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/.s.PGSQL.57832"?
.... stopped waiting
pg_ctl: could not start server
Examine the log output.

Now, the old postmaster clearly was shut down, so how come the socket lock
file still existed?  The answer is that pg_ctl is only watching for the
postmaster.pid file to disappear, and *the postmaster deletes its lock
files in the wrong order*.  strace'ing HEAD on my own machine shows this
sequence of kernel calls during postmaster stop:

wait4(-1, 0x7ffff23f3cbc, WNOHANG, NULL) = -1 ECHILD (No child processes)
stat("backup_label", 0x7ffff23f3bd0)    = -1 ENOENT (No such file or directory)
munmap(0x7fd16e8f8000, 2316)            = 0
unlink("/dev/shm/PostgreSQL.1804289383") = 0
semctl(1547862018, 0, IPC_RMID, 0)      = 0
semctl(1547894787, 0, IPC_RMID, 0)      = 0
semctl(1547927556, 0, IPC_RMID, 0)      = 0
semctl(1547960325, 0, IPC_RMID, 0)      = 0
semctl(1547993094, 0, IPC_RMID, 0)      = 0
semctl(1548025863, 0, IPC_RMID, 0)      = 0
semctl(1548058632, 0, IPC_RMID, 0)      = 0
semctl(1548091401, 0, IPC_RMID, 0)      = 0
shmdt(0x7fd16e8f9000)                   = 0
munmap(0x7fd165b3c000, 148488192)       = 0
shmctl(194314244, IPC_RMID, 0)          = 0
unlink("/tmp/.s.PGSQL.5432")            = 0
unlink("postmaster.pid")                = 0
unlink("/tmp/.s.PGSQL.5432.lock")       = 0
exit_group(0)                           = ?
+++ exited with 0 +++

I haven't looked to find out why the unlinks happen in this order, but on
a heavily loaded machine, it's certainly possible that the process would
lose the CPU after unlink("postmaster.pid"), and then a new postmaster
could get far enough to see the socket lock file still there.  So that
would account for low-probability failures in the pg_upgradecheck test,
which is exactly what we've been seeing.
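
To spell out the race, the relevant pg_ctl behavior is roughly the loop
below.  (This is only an illustrative sketch, not the actual pg_ctl source;
the function name and polling interval are made up.)  "pg_ctl -w stop"
declares victory as soon as postmaster.pid disappears, so any file the old
postmaster unlinks after the pidfile can still be present when the next
postmaster is launched:

/* Hypothetical sketch of the wait-for-shutdown logic (not real pg_ctl code). */
#include <unistd.h>
#include <sys/stat.h>

static int
wait_for_postmaster_stop(const char *pidfile_path, int max_seconds)
{
    struct stat st;
    int         i;

    for (i = 0; i < max_seconds * 10; i++)
    {
        if (stat(pidfile_path, &st) != 0)
            return 0;           /* pidfile gone: report "server stopped" */
        usleep(100 * 1000);     /* poll again shortly */
    }
    return -1;                  /* timed out waiting for shutdown */
}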

It seems clear to me that what the exit sequence ought to be is
unlink socket, unlink socket lock file, repeat for any additional
sockets, then unlink postmaster.pid.  Can anyone think of a reason
why the current ordering isn't a bug?
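
Concretely, the ordering I have in mind looks about like this (a hypothetical
sketch, not the actual backend cleanup code; MAXPGPATH and the function name
are just placeholders here):

/* Sketch of the proposed ordering: sockets and their lock files first,
 * postmaster.pid last, since the pidfile is what "pg_ctl -w" watches. */
#include <stdio.h>
#include <unistd.h>

#define MAXPGPATH 1024          /* assumed; defined elsewhere in the real tree */

static void
remove_socket_files_then_pidfile(char **socket_paths, int nsockets)
{
    int         i;

    for (i = 0; i < nsockets; i++)
    {
        char        lockpath[MAXPGPATH];

        unlink(socket_paths[i]);        /* e.g. /tmp/.s.PGSQL.5432 */
        snprintf(lockpath, sizeof(lockpath), "%s.lock", socket_paths[i]);
        unlink(lockpath);               /* e.g. /tmp/.s.PGSQL.5432.lock */
    }
    unlink("postmaster.pid");           /* only now let pg_ctl report "stopped" */
}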

Another point that's rather obvious from the trace is that at no time
do we bother to close the postmaster's TCP socket; it only goes away when
the postmaster process dies.  So there's a second race condition here:
a new postmaster could be unable to bind() to the desired TCP port
because the old one still has it open.  I think we've seen unexplained
failures of the TCP-socket-in-use variety, too.
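
To illustrate that second race (again just a sketch, not PostgreSQL code):
a fresh bind() on the same port can fail with EADDRINUSE for as long as the
old process still holds its listening socket, which would look exactly like
an unexplained address-already-in-use startup failure:

/* Illustrative only: bind() to a port still held open by a dying process. */
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int
try_bind_port(int port)
{
    struct sockaddr_in addr;
    int         fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0)
    {
        /* EADDRINUSE here if the old postmaster still has the port open */
        fprintf(stderr, "bind to port %d failed: %s\n", port, strerror(errno));
        close(fd);
        return -1;
    }
    return fd;
}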
        regards, tom lane



Re: Explanation for intermittent buildfarm pg_upgradecheck failures

From: Tom Lane
I wrote:
> unlink("/tmp/.s.PGSQL.5432")            = 0
> unlink("postmaster.pid")                = 0
> unlink("/tmp/.s.PGSQL.5432.lock")       = 0
> exit_group(0)                           = ?
> +++ exited with 0 +++

> I haven't looked to find out why the unlinks happen in this order, but on
> a heavily loaded machine, it's certainly possible that the process would
> lose the CPU after unlink("postmaster.pid"), and then a new postmaster
> could get far enough to see the socket lock file still there.  So that
> would account for low-probability failures in the pg_upgradecheck test,
> which is exactly what we've been seeing.

Further experimentation says that 9.0-9.2 do this in the expected order;
so somebody broke it during 9.3.

The lack of a close() on the postmaster socket goes all the way back
though.
        regards, tom lane



Re: Explanation for intermittent buildfarm pg_upgradecheck failures

From: Tom Lane
I wrote:
> Further experimentation says that 9.0-9.2 do this in the expected order;
> so somebody broke it during 9.3.

Depressingly enough, the somebody was me.  Fixed now.
        regards, tom lane



Re: Explanation for intermittent buildfarm pg_upgradecheck failures

From: Michael Paquier
On Mon, Aug 3, 2015 at 1:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I haven't looked to find out why the unlinks happen in this order, but on
> a heavily loaded machine, it's certainly possible that the process would
> lose the CPU after unlink("postmaster.pid"), and then a new postmaster
> could get far enough to see the socket lock file still there.  So that
> would account for low-probability failures in the pg_upgradecheck test,
> which is exactly what we've been seeing.

Oh... This may explain the different failures seen with the TAP tests on
hamster, and with pg_upgrade on axolotl as well. It is rather easy to get
both of those machines heavily loaded.
-- 
Michael