Re: Idea for improving buildfarm robustness - Mailing list pgsql-hackers

From: Andrew Dunstan
Subject: Re: Idea for improving buildfarm robustness
Date:
Msg-id: 560AE191.8060504@dunslane.net
In response to: Idea for improving buildfarm robustness (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Idea for improving buildfarm robustness (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

On 09/29/2015 02:48 PM, Tom Lane wrote:
> A problem the buildfarm has had for a long time is that if for some reason
> the scripts fail to stop a test postmaster, the postmaster process will
> hang around and cause subsequent runs to fail because of socket conflicts.
> This seems to have gotten a lot worse lately due to the influx of very
> slow buildfarm machines, but the risk has always been there.
>
> I've been thinking about teaching the buildfarm script to "kill -9"
> any postmasters left around at the end of the run, but that's fairly
> problematic: how do you find such processes (since "ps" output isn't
> hugely portable, especially not to Windows), and how do you tell them
> apart from postmasters not started by the script?  So the idea was on
> hold.
>
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away.  Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.
>
> An easy way to do that would be to have it check every so often if
> pg_control can still be read.  We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken.  (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)
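
Just to sketch what that check might look like (recheck_data_directory is an invented name, and the exact errno list is only illustrative):

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Does the data directory still look alive?  The postmaster's cwd is
 * $PGDATA, so just try to open pg_control.  Transient resource exhaustion
 * (ENFILE/EMFILE) and permission hiccups (EPERM/EACCES, e.g. antivirus on
 * Windows) are deliberately not treated as fatal.
 */
static bool
recheck_data_directory(void)
{
    int     fd = open("global/pg_control", O_RDONLY);

    if (fd >= 0)
    {
        close(fd);
        return true;
    }

    if (errno == ENFILE || errno == EMFILE ||
        errno == EPERM || errno == EACCES)
        return true;            /* give it the benefit of the doubt */

    return false;               /* ENOENT and friends: horribly broken */
}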
>
> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time.  Once an hour would be
> plenty to fix the buildfarm's problem, I should think.
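
For the timing side, a sketch of how it could piggy-back on the same sort of bookkeeping as TouchSocketLockFiles (last_datadir_check_time and maybe_check_data_directory are made-up names, and it assumes the recheck_data_directory() sketch above):

#include <time.h>

#define DATADIR_RECHECK_SECS    3600    /* once an hour is plenty here */

static time_t last_datadir_check_time = 0;

/*
 * Called from the postmaster's main loop with the current time.  The
 * per-iteration cost is one comparison, same as the socket lock file
 * touching.
 */
static void
maybe_check_data_directory(time_t now)
{
    if (now - last_datadir_check_time < DATADIR_RECHECK_SECS)
        return;
    last_datadir_check_time = now;

    if (!recheck_data_directory())
    {
        /* data directory has vanished; do whatever "hara-kiri" ends up meaning */
    }
}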
>
> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up.  I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit.  The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.
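
If it goes the SIGQUIT-ish route, the hara-kiri itself might be no more than this (kill_all_children is a stand-in for the postmaster's existing child-signalling machinery, not a real function):

#include <signal.h>
#include <unistd.h>

extern void kill_all_children(int sig);     /* stand-in, not a real function */

/*
 * The data directory is gone: force the children down the way an immediate
 * shutdown would, then get out without trying to touch the (vanished) disk.
 */
static void
data_directory_gone(void)
{
    kill_all_children(SIGQUIT);
    _exit(1);
}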
>
> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario.  Or maybe that's overly conservative.
>
> Thoughts?
>
>             



It's a fine idea. This is much more likely to be robust than any 
buildfarm client fix.

Not every buildfarm member uses cassert, so I'm not sure that's the best 
way to go. axolotl doesn't, and it's one of those that regularly has 
speed problems. Maybe a not-very-well-publicized GUC, or an environment 
setting? Or maybe just enable this anyway. If the data directory is gone,
what's the point in keeping the postmaster around? Shutting it down
doesn't seem likely to cause any damage.
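
For the GUC flavour, it needn't be more than a developer option along these lines (the name, variable, and placement are all invented, just to show the shape):

/* hypothetical addition to ConfigureNamesBool[] in guc.c */
{
    {"exit_on_lost_data_directory", PGC_POSTMASTER, DEVELOPER_OPTIONS,
        gettext_noop("Shut down if the data directory disappears."),
        NULL,
        GUC_NOT_IN_SAMPLE
    },
    &exit_on_lost_data_directory,
    false,
    NULL, NULL, NULL
},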


cheers

andrew


