Re: Buildfarm owners: check if your HEAD build is stuck - Mailing list pgsql-hackers
| From | Andrew Dunstan |
|---|---|
| Subject | Re: Buildfarm owners: check if your HEAD build is stuck |
| Date | |
| Msg-id | 44DE8DC6.2010903@dunslane.net |
| In response to | Buildfarm owners: check if your HEAD build is stuck (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: Buildfarm owners: check if your HEAD build is stuck |
| List | pgsql-hackers |
Tom Lane wrote:
> A number of the buildfarm machines have been failing HEAD builds
> at the "make check" stage since last night, with complaints like
> this one from emu:
>
> ================== pgsql.21911/src/test/regress/log/postmaster.log ===================
> FATAL: lock file "/tmp/.s.PGSQL.55678.lock" already exists
> HINT: Is another postmaster (PID 23692) using socket file "/tmp/.s.PGSQL.55678"?
>
> What's happened is that that GUC patch that was in the tree for a few
> hours broke postmaster startup on some machines (for as-yet-unidentified
> reasons). The postmaster does actually start and establish its
> lockfiles, but it never gets to the stage of being able to accept
> connections.
>
> After the buildfarm script rm -rf's the build tree, the postmaster
> process is still there but "disembodied" (its executable file is
> probably gone, for example, or at least in the state of zero remaining
> directory links). But it's still got that socket file and lockfile
> in /tmp, and this prevents another postmaster from starting with the
> same port number.
>
> If you've got this situation, you'll need to do a manual "kill" on the
> PID mentioned in the lock file before things will start working again.
> (pg_ctl won't work because it looks for the data directory
> postmaster.pid file, which is long gone.) More generally you might want
> to look through a ps listing for unexpected postgres-owned processes.
>
> I'm not sure whether there's anything much we can do to prevent such
> problems in future. Maybe it'd be reasonable for pg_regress to do a
> kill -9 on its postmaster child process if it gives up waiting for the
> postmaster to accept connections.
>
>
>
That's amazingly ugly, and well diagnosed.
BTW, buildfarm processes would typically not be postgres owned, at least
not on my machines. I run either as myself or as a special buildfarm user.
I'm trying to think how we could harden the buildfarm script to avoid
such situations, although I am so far without any great revelations.
The idea of getting pg_regress to send a signal isn't bad, but what if the
PID gets reused? We know that not all systems allocate PIDs in a
cyclical fashion.
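On the PID-reuse worry, a rough POSIX-only sketch may help. It assumes the process being killed is a direct child that the parent has not yet reaped with waitpid(); in that case the kernel keeps the PID reserved, so the signal cannot land on an unrelated process. Whether pg_regress holds its postmaster that way is an assumption here, and the names wait_or_kill() and spawn_sleeper() are hypothetical, not taken from pg_regress. The guarantee also does not hold if SIGCHLD is set to SIG_IGN, since children are then reaped automatically.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a stuck server: fork a child that just sleeps.
 * Returns the child's PID in the parent, or -1 if fork() failed.
 */
pid_t spawn_sleeper(unsigned int secs)
{
    pid_t pid = fork();

    if (pid == 0)
    {
        sleep(secs);
        _exit(0);
    }
    return pid;
}

/*
 * Wait up to timeout_secs for a direct, not-yet-reaped child to exit.
 * If it is still running after that, send SIGKILL and reap it.
 *
 * Returns 0 if the child exited on its own, 1 if SIGKILL ended it,
 * and -1 on error.  Until waitpid() collects the child, its PID stays
 * allocated (as a zombie if it has already exited), so kill() here
 * cannot hit a recycled PID.
 */
int wait_or_kill(pid_t child, int timeout_secs)
{
    int status;
    struct timespec tick = {0, 100 * 1000 * 1000};  /* poll every 100 ms */
    int polls = timeout_secs * 10;

    assert(child > 0);

    for (int i = 0; i < polls; i++)
    {
        pid_t r = waitpid(child, &status, WNOHANG);

        if (r == child)
            return 0;           /* exited by itself, already reaped */
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* Gave up waiting: kill the child, then reap it. */
    if (kill(child, SIGKILL) != 0)
        return -1;
    if (waitpid(child, &status, 0) != child)
        return -1;

    /* The child may have exited just before the signal arrived. */
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) ? 1 : 0;
}
```

For example, wait_or_kill(spawn_sleeper(30), 1) gives up after about one second and kills the sleeper. This covers only the parent/child case; it does nothing for an orphaned postmaster whose PID is known only from a lock file, where reuse remains a real risk.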
Also, I see the pg_regress code has this comment:
/*
 * Fail immediately if postmaster has exited
 *
 * XXX is there a way to do this on Windows?
 */
As I understand it, the way to do it is to call OpenProcess() - if that
succeeds then it is still there. I guess if needed we could even do that
in src/port/kill.c so that kill(pid,0) would work. But I would want
confirmation from the Windows gurus.
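To make the idea concrete, here is a hedged sketch of a liveness check with a kill(pid, 0) probe on POSIX and an OpenProcess() probe on Windows. The function name process_is_alive() is hypothetical and this is not the code in src/port/kill.c. The Windows branch is untested and goes one step beyond the suggestion above: a process object can outlive the process while other handles to it remain open, so the sketch also checks whether the handle is signaled instead of relying on OpenProcess() succeeding. The error-code mapping is likewise an assumption that should be verified.

```c
#include <assert.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>
#endif

/*
 * Hypothetical helper: returns 1 if a process with this PID appears to
 * exist, 0 if it does not, and -1 if that cannot be determined.
 */
int process_is_alive(long pid)
{
    assert(pid > 0);

#ifdef _WIN32
    /* SYNCHRONIZE is the access right needed to wait on the handle. */
    HANDLE h = OpenProcess(SYNCHRONIZE, FALSE, (DWORD) pid);
    DWORD  w;

    if (h == NULL)
    {
        /* Assumption: an unknown PID is reported as an invalid parameter. */
        return (GetLastError() == ERROR_INVALID_PARAMETER) ? 0 : -1;
    }

    /* A zero timeout polls: WAIT_TIMEOUT means the process is still running. */
    w = WaitForSingleObject(h, 0);
    CloseHandle(h);

    if (w == WAIT_TIMEOUT)
        return 1;
    if (w == WAIT_OBJECT_0)
        return 0;
    return -1;
#else
    /* Signal 0 performs the existence and permission checks only. */
    if (kill((pid_t) pid, 0) == 0)
        return 1;
    if (errno == ESRCH)
        return 0;               /* no such process */
    if (errno == EPERM)
        return 1;               /* exists, but owned by another user */
    return -1;
#endif
}
```

Note that on POSIX this also reports an exited but unreaped child (a zombie) as alive, and on either platform it cannot tell the original process from a later one that happens to have been given the same PID.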
cheers
andrew