Re: Buildfarm owners: check if your HEAD build is stuck - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: Buildfarm owners: check if your HEAD build is stuck
Msg-id 44DE8DC6.2010903@dunslane.net
In response to Buildfarm owners: check if your HEAD build is stuck  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Buildfarm owners: check if your HEAD build is stuck
List pgsql-hackers

Tom Lane wrote:
> A number of the buildfarm machines have been failing HEAD builds
> at the "make check" stage since last night, with complaints like
> this one from emu: 
>
> ================== pgsql.21911/src/test/regress/log/postmaster.log ===================
> FATAL:  lock file "/tmp/.s.PGSQL.55678.lock" already exists
> HINT:  Is another postmaster (PID 23692) using socket file "/tmp/.s.PGSQL.55678"?
>
> What's happened is that that GUC patch that was in the tree for a few
> hours broke postmaster startup on some machines (for as-yet-unidentified
> reasons).  The postmaster does actually start and establish its
> lockfiles, but it never gets to the stage of being able to accept
> connections.
>
> After the buildfarm script rm -rf's the build tree, the postmaster
> process is still there but "disembodied" (its executable file is
> probably gone, for example, or at least in the state of zero remaining
> directory links).  But it's still got that socket file and lockfile
> in /tmp, and this prevents another postmaster from starting with the
> same port number.
>
> If you've got this situation, you'll need to do a manual "kill" on the
> PID mentioned in the lock file before things will start working again.
> (pg_ctl won't work because it looks for the data directory
> postmaster.pid file, which is long gone.)  More generally you might want
> to look through a ps listing for unexpected postgres-owned processes.
>
> I'm not sure whether there's anything much we can do to prevent such
> problems in future.  Maybe it'd be reasonable for pg_regress to do a
> kill -9 on its postmaster child process if it gives up waiting for the
> postmaster to accept connections.
>

That's amazingly ugly, and well diagnosed.

BTW, buildfarm processes would typically not be postgres-owned, at least
not on my machines. I run either as myself or as a special buildfarm user.

I'm trying to think how we could harden the buildfarm script to avoid 
such situations, although I am so far without any great revelations.

The idea of getting pg_regress to send a signal isn't bad, but what if
the PID gets reused? We know that not all systems allocate PIDs in a
cyclical fashion.
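
On the PID-reuse worry, here is a minimal sketch of one possible
mitigation, not actual pg_regress code. It assumes the signalling
process is the direct parent of the postmaster and has not yet reaped
it; on POSIX systems an unreaped child's PID stays reserved, so the
signal cannot hit an unrelated process. The function name, the timeout
handling, and the use of a marker file as a stand-in for "postmaster
accepts connections" are all assumptions made for illustration.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll once per second, for up to timeout_secs,
 * until ready_path exists (a stand-in for a real connection test).
 * If the child never becomes ready, SIGKILL it and reap it.
 *
 * Returns true if the child became ready, false if it exited on its
 * own or had to be killed.
 */
bool
wait_or_kill_child(pid_t child, int timeout_secs, const char *ready_path)
{
    int i;

    /* kill() with 0 or a negative PID targets process groups; refuse. */
    assert(child > 0);

    for (i = 0; i < timeout_secs; i++)
    {
        if (access(ready_path, F_OK) == 0)
            return true;

        /* If the child has already exited, this reaps it: do not signal. */
        if (waitpid(child, NULL, WNOHANG) == child)
            return false;

        sleep(1);
    }

    /* Still our unreaped child, so its PID cannot have been recycled. */
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return false;
}
```

This does not help the buildfarm script clean up a postmaster left
behind by an earlier run, since that process is no longer anyone's
unreaped child.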

Also, I see the pg_regress code has this comment:

    /*
     * Fail immediately if postmaster has exited
     *
     * XXX is there a way to do this on Windows?
     */

As I understand it, the way to do it is to call OpenProcess(): if that
succeeds, the process is still there. If needed, we could even do that
in src/port/kill.c so that kill(pid, 0) would work. But I would want
confirmation from the Windows gurus.
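
A rough sketch of the OpenProcess() idea, untested on Windows and not
taken from src/port/kill.c. The helper name process_is_alive is
hypothetical. One assumption beyond what is said above: a process
object can remain openable after the process exits while other handles
to it are still open, so the sketch also checks GetExitCodeProcess()
against STILL_ACTIVE. The non-Windows branch is the usual kill(pid, 0)
probe, included for comparison.

```c
#include <assert.h>
#include <stdbool.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: true if the process with this ID is still running. */
bool
process_is_alive(unsigned long pid)
{
    HANDLE  h;
    DWORD   code;
    bool    alive;

    assert(pid > 0);

    h = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, (DWORD) pid);
    if (h == NULL)
        return false;           /* no such process, or access denied */

    /* The handle may refer to an exited process, so check the exit code. */
    alive = GetExitCodeProcess(h, &code) && code == STILL_ACTIVE;
    CloseHandle(h);
    return alive;
}

#else
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper: true if the process with this ID still exists. */
bool
process_is_alive(unsigned long pid)
{
    assert(pid > 0);

    /* Signal 0 delivers nothing; only the existence/permission check runs. */
    if (kill((pid_t) pid, 0) == 0)
        return true;

    /* EPERM means the process exists but belongs to another user. */
    return errno == EPERM;
}
#endif
```

Note that the Windows branch reports "access denied" as not alive,
whereas the POSIX branch treats EPERM as alive; a real implementation
would have to pick one behaviour.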


cheers

andrew

