Explanation for intermittent buildfarm pg_upgradecheck failures - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Explanation for intermittent buildfarm pg_upgradecheck failures |
Date | |
Msg-id | 3150.1438533017@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: Explanation for intermittent buildfarm pg_upgradecheck failures
Re: Explanation for intermittent buildfarm pg_upgradecheck failures |
List | pgsql-hackers |
Observe the smoking gun at http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mule&dt=2015-08-02%2003%3A30%3A02 to wit this extract from pg_upgrade_server.log: command: "/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/install//home/pg/build-farm-4.15.1/build/HEAD/inst/bin/pg_ctl" -w-D "/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/data.old" -o "" stop >> "pg_upgrade_server.log"2>&1 LOG: received fast shutdown request LOG: aborting any active transactions LOG: shutting down waiting for server to shut down.....LOG: database system is shut downdone server stopped command: "/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/install//home/pg/build-farm-4.15.1/build/HEAD/inst/bin/pg_ctl" -w-l "pg_upgrade_server.log" -D "/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/data" -o"-p 57832 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700-c unix_socket_directories='/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade'"start >> "pg_upgrade_server.log"2>&1 waiting for server to start....FATAL: lock file "/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/.s.PGSQL.57832.lock"already exists HINT: Is another postmaster (PID 12295) using socket file "/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/.s.PGSQL.57832"? .... stopped waiting pg_ctl: could not start server Examine the log output. Now, the old postmaster clearly was shut down, so how come the socket lock file still existed? The answer is that what pg_ctl is watching for is for the postmaster.pid file to disappear, and *the postmaster deletes its lock files in the wrong order*. strace'ing HEAD on my own machine shows this sequence of kernel calls during postmaster stop: wait4(-1, 0x7ffff23f3cbc, WNOHANG, NULL) = -1 ECHILD (No child processes) stat("backup_label", 0x7ffff23f3bd0) = -1 ENOENT (No such file or directory) munmap(0x7fd16e8f8000, 2316) = 0 unlink("/dev/shm/PostgreSQL.1804289383") = 0 semctl(1547862018, 0, IPC_RMID, 0) = 0 semctl(1547894787, 0, IPC_RMID, 0) = 0 semctl(1547927556, 0, IPC_RMID, 0) = 0 semctl(1547960325, 0, IPC_RMID, 0) = 0 semctl(1547993094, 0, IPC_RMID, 0) = 0 semctl(1548025863, 0, IPC_RMID, 0) = 0 semctl(1548058632, 0, IPC_RMID, 0) = 0 semctl(1548091401, 0, IPC_RMID, 0) = 0 shmdt(0x7fd16e8f9000) = 0 munmap(0x7fd165b3c000, 148488192) = 0 shmctl(194314244, IPC_RMID, 0) = 0 unlink("/tmp/.s.PGSQL.5432") = 0 unlink("postmaster.pid") = 0 unlink("/tmp/.s.PGSQL.5432.lock") = 0 exit_group(0) = ? +++ exited with 0 +++ I haven't looked to find out why the unlinks happen in this order, but on a heavily loaded machine, it's certainly possible that the process would lose the CPU after unlink("postmaster.pid"), and then a new postmaster could get far enough to see the socket lock file still there. So that would account for low-probability failures in the pg_upgradecheck test, which is exactly what we've been seeing. It seems clear to me that what the exit sequence ought to be is unlink socket, unlink socket lock file, repeat for any additional sockets, then unlink postmaster.pid. Can anyone think of a reason why the current ordering isn't a bug? Another point that's rather obvious from the trace is that at no time do we bother to close the postmaster's TCP socket; it only goes away when the postmaster process dies. So there's a second race condition here: a new postmaster could be unable to bind() to the desired TCP port because the old one still has it open. I think we've seen unexplained failures of the TCP-socket-in-use variety, too. regards, tom lane
pgsql-hackers by date: