Thread: random failing builds on spoonbill - backends not exiting...

random failing builds on spoonbill - backends not exiting...

From
Stefan Kaltenbrunner
Date:
It has now happened at least twice that builds on spoonbill started to
fail after a failure during ECPGCheck:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-06-19%2023%3A00%3A04

the first failure was:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-05-24%2023%3A00%3A05


so in both cases the postmaster was not shutting down properly and was
in fact still running - I attached gdb to the still-running postmaster:


(gdb) bt
#0  0x0000000208eb5928 in poll () from /usr/lib/libc.so.62.0
#1  0x000000020a972b88 in _thread_kern_poll (wait_reqd=Variable
"wait_reqd" is not available.
) at /usr/src/lib/libpthread/uthread/uthread_kern.c:784
#2  0x000000020a973d04 in _thread_kern_sched (scp=0x0) at
/usr/src/lib/libpthread/uthread/uthread_kern.c:384
#3  0x000000020a96c080 in select (numfds=Variable "numfds" is not available.
) at /usr/src/lib/libpthread/uthread/uthread_select.c:170
#4  0x00000000003a2894 in ServerLoop () at postmaster.c:1321
#5  0x00000000003a45ac in PostmasterMain (argc=Variable "argc" is not
available.
) at postmaster.c:1121
#6  0x0000000000326df8 in main (argc=6, argv=0xffffffffffff14f8) at
main.c:199
(gdb) print Shutdown
$2 = 2
(gdb) print pmState
$3 = PM_WAIT_BACKENDS
(gdb) p *(Backend *) (BackendList->dll_head)
Cannot access memory at address 0x0
(gdb) p *BackendList
$9 = {dll_head = 0x0, dll_tail = 0x0}
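
If I read postmaster.c correctly, Shutdown = 2 is a fast shutdown, and in
PM_WAIT_BACKENDS the state machine is waiting roughly like this before it
will advance (paraphrased from memory, not the exact source):

/* paraphrase of PostmasterStateMachine() in postmaster.c */
if (pmState == PM_WAIT_BACKENDS)
{
    /*
     * All regular backends are gone (hence the empty BackendList above),
     * but the postmaster also requires the already-signaled auxiliary
     * processes - bgwriter, walwriter, autovacuum launcher - to have
     * exited before it moves on.
     */
    if (CountChildren(BACKEND_TYPE_NORMAL) == 0 &&
        StartupPID == 0 &&
        BgWriterPID == 0 &&
        WalWriterPID == 0 &&
        AutoVacPID == 0)
        pmState = PM_SHUTDOWN;      /* never reached here */
}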

the postmaster and all the auxiliary processes are still running:

pgbuild  18020  0.0  1.2  5952 12408 ??  I     Wed04AM    0:03.98

/home/pgbuild/pgbuildfarm/HEAD/pgsql.5709/src/interfaces/ecpg/test/./tmp_check/install//home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres
-D /
pgbuild  21483  0.0  0.7  6088  7296 ??  Is    Wed04AM    0:00.68
postgres: checkpointer process    (postgres)
pgbuild  12480  0.0  0.4  5952  4464 ??  Ss    Wed04AM    0:06.88
postgres: writer process    (postgres)
pgbuild   9841  0.0  0.5  5952  4936 ??  Ss    Wed04AM    0:06.92
postgres: wal writer process    (postgres)
pgbuild    623  0.1  0.6  7424  6288 ??  Ss    Wed04AM    4:16.76
postgres: autovacuum launcher process    (postgres)
pgbuild  30949  0.0  0.4  6280  3896 ??  Ss    Wed04AM    0:40.94
postgres: stats collector process    (postgres)


sending a manual kill -15 to any of them does not seem to make them
exit either...

I did some further investigations with Robert on IM but I don't think
he has any further ideas other than that I have a weird OS :)
It seems worth noting that this is OpenBSD 5.1 on sparc64, which has a
new threading implementation compared to older OpenBSD versions.


Stefan


Re: random failing builds on spoonbill - backends not exiting...

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> It has now happened at least twice that builds on spoonbill started to
> fail after a failure during ECPGCheck:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-06-19%2023%3A00%3A04
> the first failure was:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-05-24%2023%3A00%3A05
> so in both cases the postmaster was not shutting down properly

panther has been showing similar postmaster-does-not-shut-down failures
every so often, though IIRC always in the IsolationCheck step not ECPG.

> I did some further investigations with Robert on IM but I don't think
> he has any further ideas other than that I have a weird OS :)
> It seems worth noting that this is OpenBSD 5.1 on sparc64, which has a
> new threading implementation compared to older OpenBSD versions.

But we don't use threading ...

Still, panther is NetBSD so there may be some general BSD flavor to
whatever's going on here.
        regards, tom lane


Re: random failing builds on spoonbill - backends not exiting...

From
Andrew Dunstan
Date:

On 06/22/2012 02:34 PM, Tom Lane wrote:
> Stefan Kaltenbrunner<stefan@kaltenbrunner.cc>  writes:
>> It has now happened at least twice that builds on spoonbill started to
>> fail after a failure during ECPGCheck:
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-06-19%2023%3A00%3A04
>> the first failure was:
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-05-24%2023%3A00%3A05
>> so in both cases the postmaster was not shutting down properly
> panther has been showing similar postmaster-does-not-shut-down failures
> every so often, though IIRC always in the IsolationCheck step not ECPG.



brolga (Cygwin) has had similar sporadic failures on the ecpg checks - so
much so that, since I have been unable to debug them in the time I have
available, I have disabled the ecpg checks for the time being.


cheers

andrew


Re: random failing builds on spoonbill - backends not exiting...

From
Robert Haas
Date:
On Fri, Jun 22, 2012 at 2:16 PM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
> sending a manual kill -15 to any of them does not seem to make them
> exit either...
>
> I did some further investigations with Robert on IM but I don't think
> he has any further ideas other than that I have a weird OS :)
> It seems worth noting that this is OpenBSD 5.1 on sparc64, which has a
> new threading implementation compared to older OpenBSD versions.

I remarked to Stefan that the symptoms seem consistent with the idea
that the children have signals blocked.  But I don't know how that
could happen.
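
If it happens again, it would be easy to check from inside a child:
something like this (a quick untested sketch, nothing that exists in the
tree) or the equivalent done by hand from gdb would show the mask:

#include <signal.h>
#include <stdio.h>

/* untested: report which signals the calling process has blocked */
static void
print_blocked_signals(void)
{
    sigset_t cur;
    int signo;

    /* with a NULL new set, sigprocmask() just fetches the current mask */
    sigprocmask(SIG_SETMASK, NULL, &cur);
    for (signo = 1; signo < NSIG; signo++)
        if (sigismember(&cur, signo))
            fprintf(stderr, "signal %d blocked\n", signo);
}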

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: random failing builds on spoonbill - backends not exiting...

From
Andres Freund
Date:
On Friday, June 22, 2012 08:51:55 PM Robert Haas wrote:
> On Fri, Jun 22, 2012 at 2:16 PM, Stefan Kaltenbrunner
> 
> <stefan@kaltenbrunner.cc> wrote:
> > sending a manual kill -15 to any of them does not seem to make them
> > exit either...
> > 
> > I did some further investigations with Robert on IM but I don't think
> > he has any further ideas other than that I have a weird OS :)
> > It seems worth noting that this is OpenBSD 5.1 on sparc64, which has a
> > new threading implementation compared to older OpenBSD versions.
> 
> I remarked to Stefan that the symptoms seem consistent with the idea
> that the children have signals blocked.  But I don't know how that
> could happen.
You cannot block sigkill.

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: random failing builds on spoonbill - backends not exiting...

From
Stefan Kaltenbrunner
Date:
On 06/22/2012 08:34 PM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> It has now happened at least twice that builds on spoonbill started to
>> fail after a failure during ECPGCheck:
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-06-19%2023%3A00%3A04
>> the first failure was:
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-05-24%2023%3A00%3A05
>> so in both cases the postmaster was not shutting down properly
> 
> panther has been showing similar postmaster-does-not-shut-down failures
> every so often, though IIRC always in the IsolationCheck step not ECPG.

hmm

> 
>> I did some further investigations with Robert on IM but I don't think
>> he has any further ideas other than that I have a weird OS :)
>> It seems worth noting that this is OpenBSD 5.1 on sparc64, which has a
>> new threading implementation compared to older OpenBSD versions.
> 
> But we don't use threading ...
> 
> Still, panther is NetBSD so there may be some general BSD flavor to
> whatever's going on here.

yeah, the threading reference was mostly because all the backtraces
contain references to the threading libs, and because the threading tests
are the last ones run by the ECPG checks...


Stefan


Re: random failing builds on spoonbill - backends not exiting...

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> On 06/22/2012 08:34 PM, Tom Lane wrote:
>> Still, panther is NetBSD so there may be some general BSD flavor to
>> whatever's going on here.

> yeah, the threading reference was mostly because all the backtraces
> contain references to the threading libs, and because the threading tests
> are the last ones run by the ECPG checks...

It is weird that this seems to be happening only in the context of the
ecpg and isolation tests, because it's not clear why client-side
activity would have anything to do with it.  Your gdb investigations
confirm that all the actual client-serving backends are gone, and the
postmaster knows it.  But the background service processes haven't shut
down.  AFAICS the postmaster could not have reached PM_WAIT_BACKENDS
state without signaling them, so why aren't they shutting down, and why
does it matter which set of tests we'd been running?

My first thought about it was that maybe a signal got missed, but it's
hard to credit that bgwriter, walwriter, and autovac would all have
missed signals concurrently.  (checkpointer and stats collector don't
get signaled yet, so it's not surprising those are still around.)

I wonder whether signal_child() could have failed?  It logs about
such failures, but only at debug3 which seems overly taciturn.
I wonder if we should crank that up to LOG level.
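
That would be about a one-line change, roughly (quoting from memory,
untested):

    /* in signal_child(), postmaster.c */
    if (kill(pid, signal) < 0)
        elog(LOG, "kill(%ld,%d) failed: %m", (long) pid, signal);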
        regards, tom lane


Re: random failing builds on spoonbill - backends not exiting...

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Friday, June 22, 2012 08:51:55 PM Robert Haas wrote:
>> I remarked to Stefan that the symptoms seem consistent with the idea
>> that the children have signals blocked.  But I don't know how that
>> could happen.

> You cannot block sigkill.

sigterm is at issue, not sigkill.  But I don't care for the
signals-blocked theory either, at least not in three different children
at the same time.

(Hey Stefan, is there a way on BSD to check a process's signals-blocked
state from outside?  If so, next time this happens you should try to
determine the children's signal state.)
        regards, tom lane


Re: random failing builds on spoonbill - backends not exiting...

From
Stefan Kaltenbrunner
Date:
On 06/22/2012 09:39 PM, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On Friday, June 22, 2012 08:51:55 PM Robert Haas wrote:
>>> I remarked to Stefan that the symptoms seem consistent with the idea
>>> that the children have signals blocked.  But I don't know how that
>>> could happen.
> 
>> You cannot block sigkill.
> 
> sigterm is at issue, not sigkill.  But I don't care for the
> signals-blocked theory either, at least not in three different children
> at the same time.
> 
> (Hey Stefan, is there a way on BSD to check a process's signals-blocked
> state from outside?  If so, next time this happens you should try to
> determine the children's signal state.)

with help from RhodiumToad on IRC:

#  ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 12480
 PID  PENDING   CAUGHT  IGNORED  BLOCKED COMMAND
12480 20004004 34084005 c942b002 fffefeff postgres: writer process
(postgres)

#  ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 9841
 PID  PENDING   CAUGHT  IGNORED  BLOCKED COMMAND
 9841 20004004 34084007 c942b000 fffefeff postgres: wal writer process
(postgres)
 


Stefan


Re: random failing builds on spoonbill - backends not exiting...

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> On 06/22/2012 09:39 PM, Tom Lane wrote:
>> (Hey Stefan, is there a way on BSD to check a process's signals-blocked
>> state from outside?  If so, next time this happens you should try to
>> determine the children's signal state.)

> with help from RhodiumToad on IRC:

> #  ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 12480

>   PID  PENDING   CAUGHT  IGNORED  BLOCKED COMMAND
> 12480 20004004 34084005 c942b002 fffefeff postgres: writer process
> (postgres)

> #  ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 9841
>   PID  PENDING   CAUGHT  IGNORED  BLOCKED COMMAND
>  9841 20004004 34084007 c942b000 fffefeff postgres: wal writer process
>   (postgres)

Well, the nonzero PENDING masks sure look like a smoking gun, but why
are there multiple pending signals?  And I'm not sure I know OpenBSD's
signal numbers by heart.  Could you convert those masks into text signal
name lists for us?
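
If there's no handy tool for that, a throwaway program along these lines
ought to do it (untested; assumes the usual BSD convention that bit
(signo - 1) in the mask represents signal signo):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>

int
main(int argc, char **argv)
{
    unsigned long mask;
    int signo;

    if (argc != 2)
        return 1;
    mask = strtoul(argv[1], NULL, 16);          /* e.g. 20004004 */
    for (signo = 1; signo <= 32; signo++)
        if (mask & (1UL << (signo - 1)))
            printf("%d\t%s\n", signo, strsignal(signo));
    return 0;
}

Feed it each of the four columns from the ps output.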
        regards, tom lane


Re: random failing builds on spoonbill - backends not exiting...

From
Stefan Kaltenbrunner
Date:
On 06/22/2012 11:02 PM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> On 06/22/2012 09:39 PM, Tom Lane wrote:
>>> (Hey Stefan, is there a way on BSD to check a process's signals-blocked
>>> state from outside?  If so, next time this happens you should try to
>>> determine the children's signal state.)
> 
>> with help from RhodiumToad on IRC:
> 
>> #  ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 12480
> 
>>   PID  PENDING   CAUGHT  IGNORED  BLOCKED COMMAND
>> 12480 20004004 34084005 c942b002 fffefeff postgres: writer process
>> (postgres)
> 
>> #  ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 9841
>>   PID  PENDING   CAUGHT  IGNORED  BLOCKED COMMAND
>>  9841 20004004 34084007 c942b000 fffefeff postgres: wal writer process
>>   (postgres)
> 
> Well, the nonzero PENDING masks sure look like a smoking gun, but why
> are there multiple pending signals?  And I'm not sure I know OpenBSD's
> signal numbers by heart.  Could you convert those masks into text signal
> name lists for us?

this seems to be SIGUSR1, SIGTERM and SIGQUIT



Stefan


Re: random failing builds on spoonbill - backends not exiting...

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> PID  PENDING   CAUGHT  IGNORED  BLOCKED COMMAND
>>> 12480 20004004 34084005 c942b002 fffefeff postgres: writer process
>>> 9841 20004004 34084007 c942b000 fffefeff postgres: wal writer process

> this seems to be SIGUSR1, SIGTERM and SIGQUIT

OK, I looked up OpenBSD's signal numbers on the web.  It looks to me
like these two processes have everything blocked except KILL and STOP
(which are unblockable of course).  I do not see any place in the PG
code that could possibly set such a mask (note that BlockSig should
have more holes in it than that).  So I'm thinking these must be
blocked inside some system function that's installed a restrictive
signal mask, or some such function forgot to restore the mask on exit.
Could you gdb each of these processes and get a stack trace?
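
For reference, pqinitmask() builds BlockSig roughly like this (from
memory, so the exact list of holes may be off):

/* src/backend/libpq/pqsignal.c, paraphrased */
sigfillset(&BlockSig);
#ifdef SIGTRAP
sigdelset(&BlockSig, SIGTRAP);
#endif
sigdelset(&BlockSig, SIGABRT);
sigdelset(&BlockSig, SIGILL);
sigdelset(&BlockSig, SIGFPE);
sigdelset(&BlockSig, SIGSEGV);
sigdelset(&BlockSig, SIGBUS);
#ifdef SIGSYS
sigdelset(&BlockSig, SIGSYS);
#endif
sigdelset(&BlockSig, SIGCONT);

fffefeff, on the other hand, has exactly two holes - bits 8 and 16,
i.e. SIGKILL (9) and SIGSTOP (17) - so it cannot have come from BlockSig.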
        regards, tom lane


Re: random failing builds on spoonbill - backends not exiting...

From
Tom Lane
Date:
oh, and just for comparison's sake, what do the postmaster's signal
masks look like?
        regards, tom lane


Re: random failing builds on spoonbill - backends not exiting...

From
Robert Haas
Date:
On Fri, Jun 22, 2012 at 2:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On Friday, June 22, 2012 08:51:55 PM Robert Haas wrote:
>> On Fri, Jun 22, 2012 at 2:16 PM, Stefan Kaltenbrunner
>>
>> <stefan@kaltenbrunner.cc> wrote:
>> > sending a manual kill -15 to any of them does not seem to make them
>> > exit either...
>> >
>> > I did some further investigations with Robert on IM but I don't think
>> > he has any further ideas other than that I have a weird OS :)
>> > It seems worth noting that this is OpenBSD 5.1 on sparc64, which has a
>> > new threading implementation compared to older OpenBSD versions.
>>
>> I remarked to Stefan that the symptoms seem consistent with the idea
>> that the children have signals blocked.  But I don't know how that
>> could happen.
> You cannot block sigkill.

Obviously.  The issue is: the postmaster apparently sent SIGTERM (15)
to all of these children, and yet they're not dead; and a manual
SIGTERM doesn't kill them either.  I'm sure SIGKILL (9) would do the
trick, but then it's not a clean shutdown.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: random failing builds on spoonbill - backends not exiting...

From
Stefan Kaltenbrunner
Date:
On 06/22/2012 11:53 PM, Tom Lane wrote:
> oh, and just for comparison's sake, what do the postmaster's signal
> masks look like?


#  ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 18020
 PID  PENDING   CAUGHT  IGNORED  BLOCKED COMMAND
18020        0 74084007 8972b000        0
/home/pgbuild/pgbuildfarm/HEAD/pgsql.5709/src/interfaces/ecpg/test/./t...


Stefan


Re: random failing builds on spoonbill - backends not exiting...

From
Stefan Kaltenbrunner
Date:
On 06/22/2012 11:47 PM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>> PID  PENDING   CAUGHT  IGNORED  BLOCKED COMMAND
>>>> 12480 20004004 34084005 c942b002 fffefeff postgres: writer process
>>>> 9841 20004004 34084007 c942b000 fffefeff postgres: wal writer process
> 
>> this seems to be SIGUSR1, SIGTERM and SIGQUIT
> 
> OK, I looked up OpenBSD's signal numbers on the web.  It looks to me
> like these two processes have everything blocked except KILL and STOP
> (which are unblockable of course).  I do not see any place in the PG
> code that could possibly set such a mask (note that BlockSig should
> have more holes in it than that).  So I'm thinking these must be
> blocked inside some system function that's installed a restrictive
> signal mask, or some such function forgot to restore the mask on exit.
> Could you gdb each of these processes and get a stack trace?


background writer (12480):

(gdb) bt
#0  0x0000000208eb5928 in poll () from /usr/lib/libc.so.62.0
#1  0x000000020a972b88 in _thread_kern_poll (wait_reqd=Variable
"wait_reqd" is not available.
) at /usr/src/lib/libpthread/uthread/uthread_kern.c:784
#2  0x000000020a973d04 in _thread_kern_sched (scp=0x0) at
/usr/src/lib/libpthread/uthread/uthread_kern.c:384
#3  0x000000020a96b35c in poll (fds=0xfffffffffffefa80, nfds=Variable
"nfds" is not available.
) at /usr/src/lib/libpthread/uthread/uthread_poll.c:94
#4  0x0000000000395538 in WaitLatchOrSocket (latch=0x212bdc97c,
wakeEvents=25, sock=-1, timeout=10000) at pg_latch.c:286
#5  0x0000000000399800 in BackgroundWriterMain () at bgwriter.c:325
#6  0x0000000000201850 in AuxiliaryProcessMain (argc=2,
argv=0xfffffffffffefd98) at bootstrap.c:419
#7  0x00000000003a1534 in StartChildProcess (type=BgWriterProcess) at
postmaster.c:4518
#8  0x00000000003a7574 in reaper (postgres_signal_arg=Variable
"postgres_signal_arg" is not available.
) at postmaster.c:2385
#9  0x000000020a974528 in _dispatch_signal (sig=20,
scp=0xffffffffffff03e0) at /usr/src/lib/libpthread/uthread/uthread_sig.c:408
#10 0x000000020a97462c in _dispatch_signals (scp=0xffffffffffff03e0) at
/usr/src/lib/libpthread/uthread/uthread_sig.c:437
#11 0x000000020a974e28 in _thread_sig_handler (sig=20,
info=0xffffffffffff0420, scp=0xffffffffffff03e0) at
/usr/src/lib/libpthread/uthread/uthread_sig.c:139
#12 <signal handler called>
#13 _thread_kern_set_timeout (timeout=0xffffffffffff0630) at
/usr/src/lib/libpthread/uthread/uthread_kern.c:989
#14 0x000000020a96bc8c in select (numfds=9, readfds=0xffffffffffff0730,
writefds=0x0, exceptfds=0x0, timeout=Variable "timeout" is not available.
) at /usr/src/lib/libpthread/uthread/uthread_select.c:85
#15 0x00000000003a2894 in ServerLoop () at postmaster.c:1321
#16 0x00000000003a45ac in PostmasterMain (argc=Variable "argc" is not
available.
) at postmaster.c:1121
#17 0x0000000000326df8 in main (argc=6, argv=0xffffffffffff14f8) at
main.c:199


wal writer (9841):

#0  0x0000000208eb5928 in poll () from /usr/lib/libc.so.62.0
#1  0x000000020a972b88 in _thread_kern_poll (wait_reqd=Variable
"wait_reqd" is not available.
) at /usr/src/lib/libpthread/uthread/uthread_kern.c:784
#2  0x000000020a973d04 in _thread_kern_sched (scp=0x0) at
/usr/src/lib/libpthread/uthread/uthread_kern.c:384
#3  0x000000020a96b35c in poll (fds=0xfffffffffffefa80, nfds=Variable
"nfds" is not available.
) at /usr/src/lib/libpthread/uthread/uthread_poll.c:94
#4  0x0000000000395538 in WaitLatchOrSocket (latch=0x212bdc69c,
wakeEvents=25, sock=-1, timeout=5000) at pg_latch.c:286
#5  0x00000000003aa794 in WalWriterMain () at walwriter.c:301
#6  0x0000000000201878 in AuxiliaryProcessMain (argc=2,
argv=0xfffffffffffefd98) at bootstrap.c:430
#7  0x00000000003a1534 in StartChildProcess (type=WalWriterProcess) at
postmaster.c:4518
#8  0x00000000003a7564 in reaper (postgres_signal_arg=Variable
"postgres_signal_arg" is not available.
) at postmaster.c:2387
#9  0x000000020a974528 in _dispatch_signal (sig=20,
scp=0xffffffffffff03e0) at /usr/src/lib/libpthread/uthread/uthread_sig.c:408
#10 0x000000020a97462c in _dispatch_signals (scp=0xffffffffffff03e0) at
/usr/src/lib/libpthread/uthread/uthread_sig.c:437
#11 0x000000020a974e28 in _thread_sig_handler (sig=20,
info=0xffffffffffff0420, scp=0xffffffffffff03e0) at
/usr/src/lib/libpthread/uthread/uthread_sig.c:139
#12 <signal handler called>
#13 _thread_kern_set_timeout (timeout=0xffffffffffff0630) at
/usr/src/lib/libpthread/uthread/uthread_kern.c:989
#14 0x000000020a96bc8c in select (numfds=9, readfds=0xffffffffffff0730,
writefds=0x0, exceptfds=0x0, timeout=Variable "timeout" is not available.
) at /usr/src/lib/libpthread/uthread/uthread_select.c:85
#15 0x00000000003a2894 in ServerLoop () at postmaster.c:1321
#16 0x00000000003a45ac in PostmasterMain (argc=Variable "argc" is not
available.
) at postmaster.c:1121
#17 0x0000000000326df8 in main (argc=6, argv=0xffffffffffff14f8) at
main.c:199




Stefan


Re: random failing builds on spoonbill - backends not exiting...

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> On 06/22/2012 11:47 PM, Tom Lane wrote:
>> Could you gdb each of these processes and get a stack trace?

[ unsurprising stack traces ]

OK, so they're waiting exactly where they should be.

So what we know is that the shutdown failure is caused by the child
processes having all blockable signals blocked.  What we don't know
is how they could have gotten that way.  I do not see any way that
the code in Postgres would ever block all signals, which makes this
look a lot like a BSD libc bug.  But that doesn't help much.

The only way I can think of to narrow it down is to run the postmaster
under "strace -f" or local equivalent, and look for sigprocmask calls
that set all the bits.  That could be pretty tedious though, if the
occurrence of the bug is as low-probability as it seems to be.

Given your earlier comment about a new threading library, I wonder
whether this has something to do with confusion between process-wide
and per-thread signal masks ...
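
One way to probe for that kind of confusion without waiting for the
buildfarm to trip again: interrupt a select() with a signal and see what
the mask looks like afterward.  Untested sketch - on a sane
libc/libpthread the loop at the end should print nothing:

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/time.h>

static void
handler(int sig)
{
    /* nothing to do; we only care about the mask after return */
    (void) sig;
}

int
main(void)
{
    struct sigaction sa;
    struct timeval tv = {5, 0};
    sigset_t cur;
    int signo;

    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGALRM, &sa, NULL);

    alarm(1);                         /* fires while select() is waiting */
    select(0, NULL, NULL, NULL, &tv); /* should return -1 with EINTR */

    sigprocmask(SIG_SETMASK, NULL, &cur);
    for (signo = 1; signo < NSIG; signo++)
        if (sigismember(&cur, signo))
            printf("signal %d left blocked\n", signo);
    return 0;
}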
        regards, tom lane