Re: problems on Solaris - Mailing list pgsql-hackers

From Robert Haas
Subject Re: problems on Solaris
Msg-id CA+TgmoajRM0RJePuDxw2FK1Gts4gMAgVmbQ+9tHszYr4UsomEw@mail.gmail.com
In response to Re: problems on Solaris  (Andres Freund <andres@anarazel.de>)
Responses Re: problems on Solaris  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:
> Hm. So we have a *occasional* stack size exceeded failure and an
> occasional spinlock error in test_shm_mq. I'm inclined to think that
> this is a shm_mq problem, and not a more general locking problem - it
> seems likely, but not guaranteed, that that'd have materialized
> elsewhere.

I think the problem might be that the spinlock-based memory barrier is
not re-entrant.  Suppose some kind of barrier operation is in progress,
and we've acquired the dummy spinlock but not yet released it.  Just
then, we receive a signal.  Since the shm_mq code sets
set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
SetLatch now includes barrier operations, so we'll try to acquire and
release the spinlock despite already holding it.  Oops.
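
To make the scenario concrete, here is a minimal standalone sketch of the
failure mode.  It is not PostgreSQL code: emulated_barrier(),
fake_sigusr1_handler() and demo_reentrancy() are hypothetical names, the
"signal" is simulated by a hook called while the lock is held, and the spin
is bounded so that the demo reports a stuck lock instead of hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the dummy spinlock behind an emulated barrier. */
static atomic_flag dummy_lock = ATOMIC_FLAG_INIT;

/* Bounded spin so the demo reports "stuck" instead of spinning forever. */
#define MAX_SPINS 1000

/* Hook run while the lock is held; simulates a signal being delivered. */
typedef void (*interrupt_hook) (void);

static bool nested_barrier_stuck = false;

/*
 * Emulated memory barrier: acquire and release a dummy lock.
 * Returns false if the lock could not be acquired within MAX_SPINS tries.
 */
static bool
emulated_barrier(interrupt_hook hook)
{
    int         spins = 0;

    while (atomic_flag_test_and_set(&dummy_lock))
    {
        if (++spins >= MAX_SPINS)
            return false;       /* lock is held and will not be released */
    }

    if (hook != NULL)
        hook();                 /* "signal" arrives while the lock is held */

    atomic_flag_clear(&dummy_lock);
    return true;
}

/* Simulated handler: like the SetLatch case above, it needs a barrier too. */
static void
fake_sigusr1_handler(void)
{
    if (!emulated_barrier(NULL))
        nested_barrier_stuck = true;
}

/* Runs the scenario once; returns true if the nested barrier got stuck. */
static bool
demo_reentrancy(void)
{
    bool        outer_ok = emulated_barrier(fake_sigusr1_handler);

    assert(outer_ok);           /* the outer barrier itself completes */
    return nested_barrier_stuck;
}
```

In this sketch demo_reentrancy() returns true: the nested call can never get
the lock, because the only code that would release it is the interrupted
outer call.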

> Robert: IIRC there was some problems with shm_mq tests being stuck
> before, right?

The last round of investigation, on anole, resulted in this fix:

commit d0410d66037c2f3f9bee45e0a2db9e47eeba2bb4
Author: Robert Haas <rhaas@postgresql.org>
Date:   Sat Oct 4 21:25:41 2014 -0400

    Eliminate one background-worker-related flag variable.

    Teach sigusr1_handler() to use the same test for whether a worker
    might need to be started as ServerLoop().  Aside from being perhaps
    a bit simpler, this prevents a potentially-unbounded delay when
    starting a background worker.  On some platforms, select() doesn't
    return when interrupted by a signal, but is instead restarted,
    including a reset of the timeout to the originally-requested value.
    If signals arrive often enough, but no connection requests arrive,
    sigusr1_handler() will be executed repeatedly, but the body of
    ServerLoop() won't be reached.  This change ensures that, even in
    that case, background workers will eventually get launched.

    This is far from a perfect fix; really, we need select() to return
    control to ServerLoop() after an interrupt, either via the self-pipe
    trick or some other mechanism.  But that's going to require more
    work and discussion, so let's do this for now to at least mitigate
    the damage.

    Per investigation of test_shm_mq failures on buildfarm member anole.

The problem here isn't really with test_shm_mq; it's with the
postmaster.  To really make this work properly, we need to be able to
use latches in the postmaster, and we need to generalize
WaitLatchOrSocket so that it can wait for a latch or any of n sockets.
Then ServerLoop can use that instead of calling select directly.  This
will probably look a lot like what you did to get rid of
ImmediateInterruptOK.
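
For illustration only, here is a rough sketch of the general shape of such a
wait, built on the self-pipe trick and poll().  None of this is the actual
latch API: wake_init(), wake_set(), wait_wakeup_or_sockets(), MAX_SOCKS and
the WAIT_* flags are all made-up names, and error handling is minimal.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_SOCKS   16          /* arbitrary limit for this sketch */
#define WAIT_WAKEUP 1           /* wake_set() was called */
#define WAIT_SOCKET 2           /* at least one socket is ready */

static int  wake_pipe[2] = {-1, -1};    /* [0] read end, [1] write end */

static int
set_nonblocking(int fd)
{
    int         flags = fcntl(fd, F_GETFL);

    return (flags < 0) ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Create the self-pipe.  Returns 0 on success, -1 on failure. */
static int
wake_init(void)
{
    if (pipe(wake_pipe) < 0)
        return -1;
    if (set_nonblocking(wake_pipe[0]) < 0 || set_nonblocking(wake_pipe[1]) < 0)
        return -1;
    return 0;
}

/* Uses only write(), so it is safe to call from a signal handler. */
static void
wake_set(void)
{
    int         save_errno = errno;
    char        b = 0;
    ssize_t     rc = write(wake_pipe[1], &b, 1);

    (void) rc;                  /* a full pipe means a wakeup is pending */
    errno = save_errno;
}

/*
 * Wait for a wakeup or for any of nsocks sockets to become readable.
 * Returns -1 on error, 0 on timeout or interruption, else WAIT_* flags.
 * *ready_sock is set to the index of the first ready socket, or -1.
 */
static int
wait_wakeup_or_sockets(const int *socks, int nsocks, int timeout_ms,
                       int *ready_sock)
{
    struct pollfd fds[MAX_SOCKS + 1];
    char        buf[64];
    int         result = 0;
    int         i;

    assert(nsocks >= 0 && nsocks <= MAX_SOCKS && ready_sock != NULL);
    *ready_sock = -1;

    fds[0].fd = wake_pipe[0];
    fds[0].events = POLLIN;
    fds[0].revents = 0;
    for (i = 0; i < nsocks; i++)
    {
        fds[i + 1].fd = socks[i];
        fds[i + 1].events = POLLIN;
        fds[i + 1].revents = 0;
    }

    if (poll(fds, (nfds_t) (nsocks + 1), timeout_ms) < 0)
        return (errno == EINTR) ? 0 : -1;   /* caller decides whether to retry */

    if (fds[0].revents & POLLIN)
    {
        while (read(wake_pipe[0], buf, sizeof(buf)) > 0)
            ;                   /* drain all pending wakeup bytes */
        result |= WAIT_WAKEUP;
    }
    for (i = 0; i < nsocks; i++)
    {
        if (fds[i + 1].revents & (POLLIN | POLLHUP | POLLERR))
        {
            *ready_sock = i;
            result |= WAIT_SOCKET;
            break;
        }
    }
    return result;
}
```

The point of the pipe is that a signal handler calling wake_set() makes the
read end readable, so the wait returns promptly even on platforms where the
interrupted system call would otherwise be restarted with a fresh timeout.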

But all of that seems unrelated to the current problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


