Idea for improving buildfarm robustness - Mailing list pgsql-hackers

From Tom Lane
Subject Idea for improving buildfarm robustness
Date
Msg-id 32221.1443552538@sss.pgh.pa.us
Responses Re: Idea for improving buildfarm robustness  (Stephen Frost <sfrost@snowman.net>)
Re: Idea for improving buildfarm robustness  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-hackers
A problem the buildfarm has had for a long time is that if for some reason
the scripts fail to stop a test postmaster, the postmaster process will
hang around and cause subsequent runs to fail because of socket conflicts.
This seems to have gotten a lot worse lately due to the influx of very
slow buildfarm machines, but the risk has always been there.

I've been thinking about teaching the buildfarm script to "kill -9"
any postmasters left around at the end of the run, but that's fairly
problematic: how do you find such processes (since "ps" output isn't
hugely portable, especially not to Windows), and how do you tell them
apart from postmasters not started by the script?  So the idea was on
hold.

But today I thought of another way: suppose that we teach the postmaster
to commit hara-kiri if the $PGDATA directory goes away.  Since the
buildfarm script definitely does remove all the temporary data directories
it creates, this ought to get the job done.

An easy way to do that would be to have it check every so often if
pg_control can still be read.  We should not have it fail on ENFILE or
EMFILE, since that would create a new failure hazard under heavy load,
but ENOENT or similar would be reasonable grounds for deciding that
something is horribly broken.  (At least on Windows, failing on EPERM
doesn't seem wise either, since we've seen antivirus products randomly
causing such errors.)
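
To make the errno distinction concrete, here is a minimal POSIX-only sketch of such a check. The function name and the datadir argument are hypothetical, not existing postmaster code, and treating ENOTDIR as "similar" to ENOENT is an assumption. It only tries to open $PGDATA/global/pg_control and looks at why the open failed; anything other than a clear "it's not there" is treated as inconclusive.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: return true only if pg_control is definitely missing.
 * "datadir" stands in for $PGDATA; the name and signature are illustrative.
 */
static bool
data_directory_is_gone(const char *datadir)
{
    char        path[1024];
    int         fd;

    snprintf(path, sizeof(path), "%s/global/pg_control", datadir);

    fd = open(path, O_RDONLY);
    if (fd >= 0)
    {
        /* Still readable, so nothing is wrong. */
        close(fd);
        return false;
    }

    switch (errno)
    {
        case ENOENT:
        case ENOTDIR:           /* assumption: counts as "ENOENT or similar" */
            return true;
        default:
            /*
             * ENFILE, EMFILE, EPERM, EACCES and so on: possibly transient
             * (heavy load, antivirus interference), so do not give up.
             */
            return false;
    }
}
```

A real version would presumably also want to log the failure before acting on it.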

I wouldn't want to do this every time through the postmaster's main loop,
but we could do this once an hour for no added cost by adding the check
where it does TouchSocketLockFiles; or once every few minutes if we
carried a separate variable like last_touch_time.  Once an hour would be
plenty to fix the buildfarm's problem, I should think.
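
The "separate variable" variant could look roughly like the sketch below. The variable name, the function name, and the five-minute interval are all assumptions chosen for illustration; the point is only that the main loop pays for one time comparison per iteration and does the filesystem check just when the interval has elapsed.

```c
#include <stdbool.h>
#include <time.h>

/* Assumed interval; "every few minutes" is all the proposal specifies. */
#define DATADIR_CHECK_INTERVAL_SECS (5 * 60)

/* Hypothetical counterpart to last_touch_time. */
static time_t last_datadir_check_time = 0;

/*
 * Return true if the data directory check should run now, and if so
 * remember "now" as the time of the latest check.
 */
static bool
datadir_check_is_due(time_t now)
{
    if (now - last_datadir_check_time < DATADIR_CHECK_INTERVAL_SECS)
        return false;

    last_datadir_check_time = now;
    return true;
}
```

The main loop would call something like datadir_check_is_due(time(NULL)) once per iteration and run the pg_control check only when it returns true.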

Another question is what exactly "commit hara-kiri" should consist of.
We could just abort() or _exit(1) and leave it to child processes to
notice that the postmaster is gone, or we could make an effort to clean
up.  I'd be a bit inclined to treat it like a SIGQUIT situation, i.e.,
kill all the children and exit.  The children are probably having
problems of their own if the data directory's gone, so forcing
termination might be best to keep them from getting stuck.
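
As a rough illustration of the "kill all the children and exit" option, the sketch below signals a list of child PIDs with SIGQUIT. The function name and the plain array of PIDs are hypothetical simplifications; this is not how the postmaster actually tracks its children.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: send SIGQUIT to every tracked child process.
 * Returns the number of children that were successfully signalled.
 * Entries <= 0 are skipped so we never signal a process group by accident.
 */
static int
quit_all_children(const pid_t *children, size_t nchildren)
{
    int         signalled = 0;

    for (size_t i = 0; i < nchildren; i++)
    {
        if (children[i] > 0 && kill(children[i], SIGQUIT) == 0)
            signalled++;
    }
    return signalled;
}
```

After signalling, the parent would exit with a nonzero status, as opposed to the cheaper alternative of calling abort() or _exit(1) straight away and letting the children notice on their own.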

Also, perhaps we'd only enable this behavior in --enable-cassert builds,
to avoid any risk of a postmaster incorrectly choosing to suicide in a
production scenario.  Or maybe that's overly conservative.

Thoughts?
        regards, tom lane


