Re: Idea for improving buildfarm robustness - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Idea for improving buildfarm robustness
Date
Msg-id 20150929185749.GG3685@tamriel.snowman.net
In response to Idea for improving buildfarm robustness  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Idea for improving buildfarm robustness  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away.  Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.

Yes, please.

> An easy way to do that would be to have it check every so often if
> pg_control can still be read.  We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken.  (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)

Sounds pretty reasonable to me.
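
To make the errno handling concrete, here is a minimal C sketch of the check
as I read the proposal.  It is not taken from any patch: the function names
are hypothetical, and treating ENOTDIR as one of the "or similar" errors is my
own assumption.  Only the errno classification (fail on ENOENT, ignore ENFILE,
EMFILE and EPERM) comes from the quoted text.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: does this errno mean the data directory is gone?
 * ENOTDIR is an assumed member of the "or similar" group.
 */
bool
errno_means_datadir_gone(int err)
{
    switch (err)
    {
        case ENOENT:
        case ENOTDIR:
            return true;
        default:
            /* ENFILE, EMFILE, EPERM, EACCES, ...: transient or ambiguous */
            return false;
    }
}

/*
 * Hypothetical helper: returns false only when the control file is
 * conclusively missing; any inconclusive failure counts as "still present".
 */
bool
datadir_still_present(const char *datadir)
{
    char        path[4096];
    int         len;
    int         fd;

    assert(datadir != NULL);

    len = snprintf(path, sizeof(path), "%s/global/pg_control", datadir);
    if (len < 0 || (size_t) len >= sizeof(path))
        return true;            /* path did not fit: not conclusive */

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return !errno_means_datadir_gone(errno);

    close(fd);
    return true;
}
```

Keeping the classification in its own function makes it easy to add or drop
error codes per platform without touching the file access itself.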

> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time.  Once an hour would be
> plenty to fix the buildfarm's problem, I should think.

I have a bad (?) habit of doing exactly this (removing the data directory
out from under a running postmaster) during development, and would really
like the check to run a bit more often than once an hour, unless there's a
particular problem with that.
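
The "separate variable" approach could look roughly like the following C
sketch.  The variable name, the function name and the 60-second interval are
all hypothetical, chosen only to show the rate limiting; nothing here is from
the postmaster source.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical interval; the thread only says "every few minutes" or so. */
#define DATADIR_CHECK_INTERVAL_SECS 60

/* Hypothetical counterpart of the "last_touch_time" idea. */
static time_t last_datadir_check_time = 0;

/*
 * Returns true when at least DATADIR_CHECK_INTERVAL_SECS have elapsed since
 * the last time this function returned true, and records the new time.
 */
bool
datadir_check_is_due(time_t now)
{
    assert(now != (time_t) -1);     /* time() reports failure as -1 */

    if (now - last_datadir_check_time < DATADIR_CHECK_INTERVAL_SECS)
        return false;

    last_datadir_check_time = now;
    return true;
}
```

The main loop would call datadir_check_is_due(time(NULL)) on each iteration
and only touch the file system when it returns true, so the extra cost per
loop is a single comparison.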

> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up.  I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit.  The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.

I like the idea of killing all the children and then exiting.
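
As a rough C sketch of "kill all the children and then exit": the functions
and the array of child PIDs below are hypothetical stand-ins, since the real
postmaster keeps its own bookkeeping of backends, and the exit code of 1 is
just one of the options mentioned in the quoted text.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send SIGQUIT to every tracked child process.
 * Returns the number of children that were signaled successfully.
 */
int
signal_children_quit(const pid_t *children, size_t nchildren)
{
    int         signaled = 0;
    size_t      i;

    assert(children != NULL || nchildren == 0);

    for (i = 0; i < nchildren; i++)
    {
        /* Skip unused slots; pid 0 and negative pids address process groups. */
        if (children[i] > 0 && kill(children[i], SIGQUIT) == 0)
            signaled++;
    }
    return signaled;
}

/*
 * Hypothetical shutdown path once the data directory is found to be gone:
 * tell the children to quit, then exit without running atexit handlers.
 */
void
shut_down_after_datadir_loss(const pid_t *children, size_t nchildren)
{
    signal_children_quit(children, nchildren);
    _exit(1);
}
```

This sketch does not wait for the children to exit, on the theory that there
is nothing left to clean up once the data directory has disappeared.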

> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario.  Or maybe that's overly conservative.

That would work for my use-case.  Perhaps enable it only in
--enable-cassert builds on the back branches, but enable it unconditionally
in master and see how things go for 9.6?  I agree that it feels overly
conservative, but given our recent history, we should be overly cautious
with the back branches.

> Thoughts?

Thanks!

Stephen
