Explanation for intermittent buildfarm pg_upgradecheck failures - Mailing list pgsql-hackers

From Tom Lane
Subject Explanation for intermittent buildfarm pg_upgradecheck failures
Date
Msg-id 3150.1438533017@sss.pgh.pa.us
Responses Re: Explanation for intermittent buildfarm pg_upgradecheck failures  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Explanation for intermittent buildfarm pg_upgradecheck failures  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers

Observe the smoking gun at
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2015-08-02%2003%3A30%3A02

to wit this extract from pg_upgrade_server.log:

command:
"/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/install//home/pg/build-farm-4.15.1/build/HEAD/inst/bin/pg_ctl"
-w -D "/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/data.old" -o ""  stop >>
"pg_upgrade_server.log" 2>&1
 
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  shutting down
waiting for server to shut down.....LOG:  database system is shut downdone
server stopped

command:
"/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/install//home/pg/build-farm-4.15.1/build/HEAD/inst/bin/pg_ctl"
-w -l "pg_upgrade_server.log" -D "/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/data"
-o "-p 57832 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off  -c listen_addresses='' -c
unix_socket_permissions=0700 -c
unix_socket_directories='/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade'" start >>
"pg_upgrade_server.log" 2>&1
 
waiting for server to start....FATAL:  lock file
"/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/.s.PGSQL.57832.lock" already exists
 
HINT:  Is another postmaster (PID 12295) using socket file
"/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/.s.PGSQL.57832"?
.... stopped waiting
pg_ctl: could not start server
Examine the log output.

Now, the old postmaster clearly was shut down, so how come the socket lock
file still existed?  The answer is that what pg_ctl watches for is the
disappearance of the postmaster.pid file, and *the postmaster deletes its lock
files in the wrong order*.  strace'ing HEAD on my own machine shows this
sequence of kernel calls during postmaster stop:

wait4(-1, 0x7ffff23f3cbc, WNOHANG, NULL) = -1 ECHILD (No child processes)
stat("backup_label", 0x7ffff23f3bd0)    = -1 ENOENT (No such file or directory)
munmap(0x7fd16e8f8000, 2316)            = 0
unlink("/dev/shm/PostgreSQL.1804289383") = 0
semctl(1547862018, 0, IPC_RMID, 0)      = 0
semctl(1547894787, 0, IPC_RMID, 0)      = 0
semctl(1547927556, 0, IPC_RMID, 0)      = 0
semctl(1547960325, 0, IPC_RMID, 0)      = 0
semctl(1547993094, 0, IPC_RMID, 0)      = 0
semctl(1548025863, 0, IPC_RMID, 0)      = 0
semctl(1548058632, 0, IPC_RMID, 0)      = 0
semctl(1548091401, 0, IPC_RMID, 0)      = 0
shmdt(0x7fd16e8f9000)                   = 0
munmap(0x7fd165b3c000, 148488192)       = 0
shmctl(194314244, IPC_RMID, 0)          = 0
unlink("/tmp/.s.PGSQL.5432")            = 0
unlink("postmaster.pid")                = 0
unlink("/tmp/.s.PGSQL.5432.lock")       = 0
exit_group(0)                           = ?
+++ exited with 0 +++
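
(For context, the shutdown wait in "pg_ctl -w ... stop" amounts to polling for
postmaster.pid to go away.  The sketch below only illustrates that behavior;
it is not the actual pg_ctl code, and the file name and timeout are just
placeholders.  The point is that nothing in this wait looks at the socket
lock file, which is why the unlink ordering in the postmaster matters.)

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

/* Illustration only: poll until the pid file disappears, the way a
 * "pg_ctl -w ... stop" decides the old postmaster is gone. */
bool
wait_for_pidfile_removal(const char *pid_path, int max_seconds)
{
    struct stat st;

    for (int i = 0; i < max_seconds; i++)
    {
        if (stat(pid_path, &st) != 0)
            return true;        /* pid file gone; caller assumes shutdown done */
        printf(".");
        fflush(stdout);
        sleep(1);
    }
    return false;               /* timed out */
}

int
main(void)
{
    if (wait_for_pidfile_removal("postmaster.pid", 60))
        printf("\nserver stopped\n");
    else
        printf("\nserver did not shut down\n");
    return 0;
}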

I haven't looked to find out why the unlinks happen in this order, but on
a heavily loaded machine, it's certainly possible that the process would
lose the CPU after unlink("postmaster.pid"), and then a new postmaster
could get far enough to see the socket lock file still there.  So that
would account for low-probability failures in the pg_upgradecheck test,
which is exactly what we've been seeing.

It seems clear to me that what the exit sequence ought to be is
unlink socket, unlink socket lock file, repeat for any additional
sockets, then unlink postmaster.pid.  Can anyone think of a reason
why the current ordering isn't a bug?
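
To make that concrete, the shutdown-time cleanup I have in mind would go
roughly like the sketch below.  The function name and the literal paths are
hypothetical, chosen only to mirror the strace output above; this is not the
actual postmaster code.

#include <unistd.h>

/* Hypothetical sketch of the proposed ordering: remove each socket and
 * then its lock file, and remove postmaster.pid only after every socket
 * lock file is gone, so a new postmaster that starts as soon as
 * postmaster.pid disappears cannot trip over a leftover lock file. */
static void
remove_socket_and_lockfile(const char *sock_path, const char *lock_path)
{
    unlink(sock_path);          /* e.g. /tmp/.s.PGSQL.5432 */
    unlink(lock_path);          /* e.g. /tmp/.s.PGSQL.5432.lock */
}

int
main(void)
{
    /* repeat for each configured unix_socket_directories entry */
    remove_socket_and_lockfile("/tmp/.s.PGSQL.5432", "/tmp/.s.PGSQL.5432.lock");

    /* only now drop the pid file that pg_ctl is watching */
    unlink("postmaster.pid");
    return 0;
}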

Another point that's rather obvious from the trace is that at no time
do we bother to close the postmaster's TCP socket; it only goes away when
the postmaster process dies.  So there's a second race condition here:
a new postmaster could be unable to bind() to the desired TCP port
because the old one still has it open.  I think we've seen unexplained
failures of the TCP-socket-in-use variety, too.
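
Closing the listen sockets explicitly during shutdown, before the lock files
are removed, would close that window as well.  Again, only a sketch: the
array below is a stand-in for however the postmaster actually tracks its
listen sockets.

#include <unistd.h>

#define MAX_LISTEN 64           /* arbitrary bound for this sketch */

static int  listen_fds[MAX_LISTEN];
static int  num_listen_fds = 0;

/* Sketch: close every listen socket before the pid and lock files go
 * away, so a new postmaster can bind() the TCP port at once instead of
 * racing against the old process's still-open descriptor. */
static void
close_listen_sockets(void)
{
    for (int i = 0; i < num_listen_fds; i++)
    {
        if (listen_fds[i] >= 0)
        {
            close(listen_fds[i]);
            listen_fds[i] = -1;
        }
    }
    num_listen_fds = 0;
}

int
main(void)
{
    close_listen_sockets();     /* would run just before the unlinks above */
    return 0;
}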
        regards, tom lane


