Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
Date
Msg-id 20141003214748.GG25554@awork2.anarazel.de
Whole thread Raw
In response to Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
List pgsql-hackers
On 2014-10-03 14:38:10 -0400, Robert Haas wrote:
> On Fri, Oct 3, 2014 at 1:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Further debugging reveals that sigusr1_handler() gets called
> > repeatedly, to start autovacuum workers, and it keeps waking up and
> > starting them.  But that doesn't cause the background workers to get
> > started either, because although sigusr1_handler() contains a call to
> > maybe_start_bgworker, it only does that if start_bgworker = true,
> > which only happens if
> > CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
> > which on these calls it isn't.
> > And I think this might also be the missing ingredient answering
> > Andres's question from before: why doesn't the 60-second
> > select()-timeout cause the background worker to eventually start even
> > if the SELECT doesn't get interrupted?  There seems to be a SIGUSR1
> > arriving about every 3 seconds, and I bet that's resetting the timer
> > every time.

Ick. There's a couple reports about HPUX behaving that way for select(),
so that seems like a reasonable theory. And a good explanation why only
anole was seing this.

> For now I propose to address this by committing the attached patch,
> which gets rid of the separate start_bgworker flag inside
> sigusr1_handler() and instead uses the same StartWorkerNeeded ||
> HaveCrashedWorker test that we use inside ServerLoop() to decide
> whether to call maybe_start_bgworker().

That seems sane.

> Either more signals will
> arrive (in which case the signal handler will launch an additional
> background worker every time a signal arrives) or they won't (in which
> case the 60-second timeout will eventually expire, and ServerLoop()
> will kick into high gear and satisfy all outstanding requests).  This
> isn't really right, because there might still be a quite noticeable
> delay waiting for workers to get launched, but at least the delay
> would be bounded to at most 60 seconds rather than, as at present,
> potentially infinite.

Right, it will somewhat fix the current issue. But aren't the
consequences of a potentially never/very seldomly run ServerLoop() more
severe? There's a couple more things done in that loop :(

> A totally correct fix will require a bit more thought.  A search of
> the git history reveals that the problem of a signal restarting the
> timeout is not new: Tom fixed a similar problem back in 2007 by making
> the autovacuum launcher sleep for at most a second at a time.  Such a
> fix isn't ideal here, because we really don't want an up-to-1-second
> delay launching a newly-registered background worker if there's a way
> to avoid that -- it's probably OK for launching daemons, but it's not
> so hot for parallel query.  However, we could:
> 
> (1) Use the self-pipe trick. We could not directly use a latch, at
> least not without a new API, because we might be waiting on more than
> one socket.

I think this is the way to go. And I think we should make the latch API
support that case, instead of reinventing it. That'll also have the
advantage of automatically using poll() (which is more efficient),
rather than select(), when supported by the platform.

> (2) Have the postmaster not set SA_RESTART for the sigusr1 handler.  I
> don't know how platform-independent this approach would be.

This makes me wary. Postmaster doesn't do too much stuff, but it'd still
be annoying to have to make sure everything always does loop around
EINTR.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: UPSERT wiki page, and SQL MERGE syntax
Next
From: Andres Freund
Date:
Subject: Re: pg_receivexlog and replication slots