Thread: Archiver not exiting upon crash

Archiver not exiting upon crash

From
Jeff Janes
Date:
I've been testing the crash recovery of REL9_2_BETA1, using the same
method I posted in the "Scaling XLog insertion" thread.  I have the
checkpointer occasionally throw a FATAL error, which causes the
postmaster to take down all of the other processes (DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally
and possibly corrupted shared memory.)  and initiate recovery.

However, sometimes the automatic recovery never initiates.  It looks
like the postmaster is waiting for the archiver to exit before it
starts recovery, and the archiver is waiting for something, I don't
really know what.

This happens on about 10% of the crashes on REL9_2_BETA1, although I
imagine that number is extremely depend on minutiae of the setup,
hardware, and phase of the moon, as it is probably some kind of race.

This behavior is also present in 9_1_STABLE, although at a much lower
prevalence (about 1%).  If fact it seems to go back at least to 8.4.0.

If I kill -9 the archiver, then recovery initiates and proceeds as normal.

I don't know the best way to tackle this.  By staring at the code, by
"git bisect" (which is hard to do, because I don't know if the
behavior was ever not there, and because the problem only occurs
statistically it can take many hours per iteration), or some other
method?

Thanks,

Jeff


Re: Archiver not exiting upon crash

From
Fujii Masao
Date:
On Sat, May 19, 2012 at 1:23 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I've been testing the crash recovery of REL9_2_BETA1, using the same
> method I posted in the "Scaling XLog insertion" thread.  I have the
> checkpointer occasionally throw a FATAL error,

We should also fix this problem? If yes, could you show us the self-contained
test case to reproduce it?

> However, sometimes the automatic recovery never initiates.  It looks
> like the postmaster is waiting for the archiver to exit before it
> starts recovery, and the archiver is waiting for something, I don't
> really know what.

Could you get the backtrace of the waiting archiver?
http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

You might have gotten the following problem which was discussed before.
This problem was fixed in SIGQUIT signal handler of a backend, but ISTM
not that of an archiver.
http://archives.postgresql.org/pgsql-admin/2009-11/msg00088.php

Regards,

--
Fujii Masao


Re: Archiver not exiting upon crash

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> ... sometimes the automatic recovery never initiates.  It looks
> like the postmaster is waiting for the archiver to exit before it
> starts recovery, and the archiver is waiting for something, I don't
> really know what.

Can you try poking into the archiver's state with gdb?  It would be
useful to see a stack trace when it's hung, as well as the local and
static variables of pgarch.c.
        regards, tom lane


Re: Archiver not exiting upon crash

From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes:
> You might have gotten the following problem which was discussed before.
> This problem was fixed in SIGQUIT signal handler of a backend, but ISTM
> not that of an archiver.
> http://archives.postgresql.org/pgsql-admin/2009-11/msg00088.php

pgarch.c's SIGQUIT handler just does exit(1), so it seems a bit unlikely
that that solution would make a difference.  But we need to see a stack
trace to rule it out or not.
        regards, tom lane


Re: Archiver not exiting upon crash

From
Tom Lane
Date:
I wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> You might have gotten the following problem which was discussed before.
>> This problem was fixed in SIGQUIT signal handler of a backend, but ISTM
>> not that of an archiver.
>> http://archives.postgresql.org/pgsql-admin/2009-11/msg00088.php

> pgarch.c's SIGQUIT handler just does exit(1), so it seems a bit unlikely
> that that solution would make a difference.

... but having said that, I see Peter's commit
d6de43099ac0bddb4b1da40088487616da892164 only touched postgres.c's
quickdie(), and not all the *other* background processes with identical
coding.  That seems a clear oversight, so I will go fix it.  Doesn't
explain why the archiver would get confused, though, since that file
doesn't have any code that tries to re-enable signals after entering the
signal handler.
        regards, tom lane


Re: Archiver not exiting upon crash

From
Tom Lane
Date:
I wrote:
> ... but having said that, I see Peter's commit
> d6de43099ac0bddb4b1da40088487616da892164 only touched postgres.c's
> quickdie(), and not all the *other* background processes with identical
> coding.  That seems a clear oversight, so I will go fix it.  Doesn't
> explain why the archiver would get confused, though, since that file
> doesn't have any code that tries to re-enable signals after entering the
> signal handler.

... wait, scratch that.  AFAICS, that commit was totally useless,
because BlockSig should always already contain SIGQUIT.  I don't think
there's a need to propagate the same useless code elsewhere.

In the case of pgarch.c, there might be some reason to add
PG_SETMASK(&BlockSig) to its SIGQUIT handler, just to be sure that
*other* signals are blocked before we go into the exit(1) code.
I'm still having a hard time believing that that's Jeff's issue, though.
        regards, tom lane


Re: Archiver not exiting upon crash

From
Peter Eisentraut
Date:
On mån, 2012-05-21 at 12:52 -0400, Tom Lane wrote:
> I see Peter's commit d6de43099ac0bddb4b1da40088487616da892164 only
> touched postgres.c's quickdie(), and not all the *other* background
> processes with identical coding.  That seems a clear oversight, so I
> will go fix it.

None[*] of the other signal handlers call elog/ereport, which was the
cause of the original problem.  It might still be good to review this
across the board, though.

* - for some value of "none", I didn't check all of them in detail



Re: Archiver not exiting upon crash

From
Peter Eisentraut
Date:
On mån, 2012-05-21 at 13:14 -0400, Tom Lane wrote:
> > ... but having said that, I see Peter's commit
> > d6de43099ac0bddb4b1da40088487616da892164 only touched postgres.c's
> > quickdie(), and not all the *other* background processes with
> identical
> > coding.  That seems a clear oversight, so I will go fix it.  Doesn't
> > explain why the archiver would get confused, though, since that file
> > doesn't have any code that tries to re-enable signals after entering
> the
> > signal handler.
> 
> ... wait, scratch that.  AFAICS, that commit was totally useless,
> because BlockSig should always already contain SIGQUIT.

No, because PostgresMain() deletes it from BlockSig.



Re: Archiver not exiting upon crash

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On mån, 2012-05-21 at 13:14 -0400, Tom Lane wrote:
>> ... wait, scratch that.  AFAICS, that commit was totally useless,
>> because BlockSig should always already contain SIGQUIT.

> No, because PostgresMain() deletes it from BlockSig.

Ah.  So potentially we have an issue in all the background processes
that have copied-and-pasted that sigdelset call, which seems to be most
of them.

I'm inclined to think this does need to be fixed.  Even if it
accidentally works now, it seems fragile as can be; and certainly
it's underdocumented.
        regards, tom lane


Re: Archiver not exiting upon crash

From
Fujii Masao
Date:
On Thu, May 24, 2012 at 1:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> On mån, 2012-05-21 at 13:14 -0400, Tom Lane wrote:
>>> ... wait, scratch that.  AFAICS, that commit was totally useless,
>>> because BlockSig should always already contain SIGQUIT.
>
>> No, because PostgresMain() deletes it from BlockSig.
>
> Ah.  So potentially we have an issue in all the background processes
> that have copied-and-pasted that sigdelset call, which seems to be most
> of them.
>
> I'm inclined to think this does need to be fixed.  Even if it
> accidentally works now, it seems fragile as can be; and certainly
> it's underdocumented.

+1

And I found that PG_SETMASK(&UnBlockSig) in WalSenderMain() is useless because
it's always called in PostgresMain() before entering WalSenderMain().

Regards,

--
Fujii Masao


Re: Archiver not exiting upon crash

From
Jeff Janes
Date:
On Mon, May 21, 2012 at 9:22 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sat, May 19, 2012 at 1:23 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I've been testing the crash recovery of REL9_2_BETA1, using the same
>> method I posted in the "Scaling XLog insertion" thread.  I have the
>> checkpointer occasionally throw a FATAL error,
>
> We should also fix this problem? If yes, could you show us the self-contained
> test case to reproduce it?

It doesn't need to be fixed.  I intentionally apply a patch to cause
this FATAL error.  In order to exercise the crash recovery, I need a
way to cause crashes.

>
>> However, sometimes the automatic recovery never initiates.  It looks
>> like the postmaster is waiting for the archiver to exit before it
>> starts recovery, and the archiver is waiting for something, I don't
>> really know what.
>
> Could you get the backtrace of the waiting archiver?
> http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

Backtrace of the archiver:

#0  0x0000003da12ca32f in poll () from /lib64/libc.so.6
#1  0x00000000005ef8b9 in WaitLatchOrSocket (latch=0xafe238,
wakeEvents=25, sock=-1, timeout=60000) at pg_latch.c:278
#2  0x00000000005f2b84 in PgArchiverMain () at pgarch.c:423
#3  pgarch_start () at pgarch.c:179
#4  0x00000000005fbd72 in reaper (postgres_signal_arg=<value optimized
out>) at postmaster.c:2383
#5  <signal handler called>
#6  0x0000003da12cc3a3 in __select_nocancel () from /lib64/libc.so.6
#7  0x00000000005f8edc in ServerLoop () at postmaster.c:1316
#8  0x00000000005fa327 in PostmasterMain (argc=3, argv=0xb7c1720) at
postmaster.c:1116
#9  0x000000000059c6ee in main (argc=3, argv=<value optimized out>) at
main.c:199

I also attached gdb to the archiver when it was working normally, and
I get the exact same backtrace.

It looks to me like the SIGQUIT from the postmaster is simply getting
lost.  And from what little I understand of signal handling, this is a
known race with system(3).  The archive_command, child of archiver,
exits before it can receive the signal sent to the entire archiver
process group, so it doesn't set its exit status to show it was
signalled.  But the signal sent directly to the archiver reaches it
while it is still ignoring SIGQUITs.

I caught a hang in motion with strace.  Here, 22448 is the archiver,
and 29475 is the archive_command that was in process when the crash
happened.

[pid 22448] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) ==
0}], 0, NULL) = 29475
[pid 22448] --- SIGQUIT (Quit) @ 0 (0) ---
[pid 22448] rt_sigaction(SIGINT, {0x1, [], SA_RESTORER|SA_RESTART,
0x3da12302f0},  <unfinished ...>
[pid 22448] <... rt_sigaction resumed> NULL, 8) = 0
[pid 22448] rt_sigaction(SIGQUIT, {0x5f29a0, [],
SA_RESTORER|SA_RESTART, 0x3da12302f0},  <unfinished ...>



And just in case it is useful, a backtrace for the postmaster during the hang:

#0  0x0000003da12cc3a3 in __select_nocancel () from /lib64/libc.so.6
#1  0x00000000007226ba in pg_usleep (microsec=<value optimized out>)
at pgsleep.c:43
#2  0x00000000005f9306 in ServerLoop () at postmaster.c:1305
#3  0x00000000005fa327 in PostmasterMain (argc=3, argv=0xb7c1720) at
postmaster.c:1116
#4  0x000000000059c6ee in main (argc=3, argv=<value optimized out>) at
main.c:199


If the SIGQUIT is getting lost in a race, could it just be blocked
during the system(3) call?
I don't know what happens if you call system(3) with SIGQUIT being blocked.

Or maybe the postmaster should not be infinitely patient, but send
another round of signals after a brief delay.

Cheers,

Jeff


Re: Archiver not exiting upon crash

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> It looks to me like the SIGQUIT from the postmaster is simply getting
> lost.  And from what little I understand of signal handling, this is a
> known race with system(3).  The archive_command, child of archiver,
> exits before it can receive the signal sent to the entire archiver
> process group, so it doesn't set its exit status to show it was
> signalled.  But the signal sent directly to the archiver reaches it
> while it is still ignoring SIGQUITs.

Ugh.

> If the SIGQUIT is getting lost in a race, could it just be blocked
> during the system(3) call?
> I don't know what happens if you call system(3) with SIGQUIT being blocked.

On my machine, man system(3) saith:
    system() ignores the SIGINT and SIGQUIT signals, and blocks the    SIGCHLD signal, while waiting for the command to
terminate. If this    might cause the application to miss a signal that would have killed    it, the application should
examinethe return value from system() and    take whatever action is appropriate to the application if the command
terminateddue to receipt of a signal.
 

Now, the code that directly calls system(), namely pgarch_archiveXlog(),
knows this perfectly well, as per the comment at lines 590ff in HEAD.
However, the code that *calls* it did not get the memo :-(, and appears
to be willing to retry regardless.

> Or maybe the postmaster should not be infinitely patient, but send
> another round of signals after a brief delay.

If the first one was ignored, later ones might be too.

I'm inclined to think that we should change pgarch_archiveXlog to
detect these specific signal conditions and just directly exit(),
rather than giving its caller a chance to blow the decision.
        regards, tom lane


Re: Archiver not exiting upon crash

From
Tom Lane
Date:
I wrote:
> On my machine, man system(3) saith:

>      system() ignores the SIGINT and SIGQUIT signals, and blocks the
>      SIGCHLD signal, while waiting for the command to terminate.  If this
>      might cause the application to miss a signal that would have killed
>      it, the application should examine the return value from system() and
>      take whatever action is appropriate to the application if the command
>      terminated due to receipt of a signal.

> Now, the code that directly calls system(), namely pgarch_archiveXlog(),
> knows this perfectly well, as per the comment at lines 590ff in HEAD.
> However, the code that *calls* it did not get the memo :-(, and appears
> to be willing to retry regardless.

Uh, no, scratch that: the code in pgarch_archiveXlog ought to be doing
ereport(FATAL) if this case applies.  So it's not clear what is
happening for you, unless system() returned a misleading return code.
You might try adding a debug printout to show the actual returned value.
        regards, tom lane


Re: Archiver not exiting upon crash

From
Jeff Janes
Date:
On Wed, May 23, 2012 at 1:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> It looks to me like the SIGQUIT from the postmaster is simply getting
>> lost.  And from what little I understand of signal handling, this is a
>> known race with system(3).  The archive_command, child of archiver,
>> exits before it can receive the signal sent to the entire archiver
>> process group, so it doesn't set its exit status to show it was
>> signalled.  But the signal sent directly to the archiver reaches it
>> while it is still ignoring SIGQUITs.
>
> Ugh.
>
>> If the SIGQUIT is getting lost in a race, could it just be blocked
>> during the system(3) call?
>> I don't know what happens if you call system(3) with SIGQUIT being blocked.
>
> On my machine, man system(3) saith:
>
>     system() ignores the SIGINT and SIGQUIT signals, and blocks the
>     SIGCHLD signal, while waiting for the command to terminate.  If this
>     might cause the application to miss a signal that would have killed
>     it, the application should examine the return value from system() and
>     take whatever action is appropriate to the application if the command
>     terminated due to receipt of a signal.

But what happens if the SIGQUIT is blocked before the system(3) is
invoked?  Does the ignore take precedence over the block, or does the
block take precedence over the ignore, and so the signal is still
waiting once the block is reversed after the system(3) is over?  I
could write a test program to see, but that wouldn't be very good
evidence of the portability.

>
> Now, the code that directly calls system(), namely pgarch_archiveXlog(),
> knows this perfectly well, as per the comment at lines 590ff in HEAD.
> However, the code that *calls* it did not get the memo :-(, and appears
> to be willing to retry regardless.

But if the signal is lost, how could it know to do anything differently?

>> Or maybe the postmaster should not be infinitely patient, but send
>> another round of signals after a brief delay.
>
> If the first one was ignored, later ones might be too.

True, but I don't see how it make anything worse.   If the postmaster
is going to hang for eternity anyway, it might as well do something
every now and then.  Eventually a signal is highly likely to get
through.  (And in this case it would be almost certain to, as it is a
rare race and the archiver would run out of things to archive anyway.)

>
> I'm inclined to think that we should change pgarch_archiveXlog to
> detect these specific signal conditions and just directly exit(),
> rather than giving its caller a chance to blow the decision.

But that is exactly the problem.  How do you detect a signal that gets lost?

I wonder if this is a quirk of my current hardware/kernel.  I have
occasionally seen hangs on other systems during this same exercise,
but they seemed much less common and I never investigated them.

Cheers,

Jeff


Re: Archiver not exiting upon crash

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Wed, May 23, 2012 at 1:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> On my machine, man system(3) saith:
>> 
>>     system() ignores the SIGINT and SIGQUIT signals, and blocks the
>>     SIGCHLD signal, while waiting for the command to terminate.  If this
>>     might cause the application to miss a signal that would have killed
>>     it, the application should examine the return value from system() and
>>     take whatever action is appropriate to the application if the command
>>     terminated due to receipt of a signal.

> But what happens if the SIGQUIT is blocked before the system(3) is
> invoked?  Does the ignore take precedence over the block, or does the
> block take precedence over the ignore, and so the signal is still
> waiting once the block is reversed after the system(3) is over?  I
> could write a test program to see, but that wouldn't be very good
> evidence of the portability.

AFAICT from the POSIX spec for system(3), that would be a bug in
system().  You are supposed to be able to tell from the return value
whether a signal happened.  Note that the signals are only supposed to
be ignored while waiting for the command to terminate; therefore, if
a signal does get ignored in the parent process, it should be delivered
to the child, and so we should see its effects in the child's return
code.

Of course, if you are using an archive_command that might itself ignore
SIGQUIT, perhaps that's where the issue is.
        regards, tom lane


Re: Archiver not exiting upon crash

From
Tom Lane
Date:
I wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> But what happens if the SIGQUIT is blocked before the system(3) is
>> invoked?  Does the ignore take precedence over the block, or does the
>> block take precedence over the ignore, and so the signal is still
>> waiting once the block is reversed after the system(3) is over?  I
>> could write a test program to see, but that wouldn't be very good
>> evidence of the portability.

> AFAICT from the POSIX spec for system(3), that would be a bug in
> system().

Actually, on further thought, it seems like there is *necessarily* a
race condition in this.  There must be some interval where the child
process has already exited but the waiting parent hasn't de-ignored the
signals.  So if SIGQUIT is delivered just then, it must go into the
aether.  This, together with the thought that the child process might
accidentally or intentionally ignore the signal, makes me think that
maybe you're right and we need to retransmit the SIGQUIT occasionally.

However, I remain unsatisfied with this idea as an explanation for the
behavior you're seeing.  In the first place, that race condition window
ought not be wide enough to allow failure probabilities as high as 10%.
In the second place, that code has been like that for a long while,
so this theory absolutely does not explain why you're seeing a
materially higher probability of failure in HEAD than 9.1.  There is
something else going on.
        regards, tom lane


Re: Archiver not exiting upon crash

From
Jeff Janes
Date:
On Wed, May 23, 2012 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> But what happens if the SIGQUIT is blocked before the system(3) is
>>> invoked?  Does the ignore take precedence over the block, or does the
>>> block take precedence over the ignore, and so the signal is still
>>> waiting once the block is reversed after the system(3) is over?  I
>>> could write a test program to see, but that wouldn't be very good
>>> evidence of the portability.
>
>> AFAICT from the POSIX spec for system(3), that would be a bug in
>> system().
>
> Actually, on further thought, it seems like there is *necessarily* a
> race condition in this.  There must be some interval where the child
> process has already exited but the waiting parent hasn't de-ignored the
> signals.

Yup.  the posix man page doesn't say that this is guaranteed to be
atomic, so I assume no such guarantee is even attempted.  For that
matter, the man page for system(3) doesn't even tell how one goes
about signaling the child--unless the signal is sent to the negation
of the group leader the child won't receive it.  (Postgres handles
group-leader thing correctly, I just point it out to show the man page
is not intended to be complete)

> However, I remain unsatisfied with this idea as an explanation for the
> behavior you're seeing.  In the first place, that race condition window
> ought not be wide enough to allow failure probabilities as high as 10%.
> In the second place, that code has been like that for a long while,
> so this theory absolutely does not explain why you're seeing a
> materially higher probability of failure in HEAD than 9.1.  There is
> something else going on.

After a while trying to bisect the behavior, I decided it was a mug's
game.  Both arms of the race (the firing of archive_command and the
engineered crash) are triggered indirectly be the same event, the
start of a checkpoint.  Small changes in the code can lead to small
changes in the timing which make drastic changes in how likely it is
that the two arms collide exactly at the vulnerability.

So my test harness is an inexplicably effective show-case for the
vulnerability, but it is not the reason the vulnerability should be
fixed.

By the way, my archive_command is very fast, as all it does it echo
the date into a log file.  I want postgres to think it is archive
mode, but for this purpose I don't want to actually deal with having
an archive.

Cheers,

Jeff


Re: Archiver not exiting upon crash

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Wed, May 23, 2012 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, I remain unsatisfied with this idea as an explanation for the
>> behavior you're seeing.  In the first place, that race condition window
>> ought not be wide enough to allow failure probabilities as high as 10%.
>> In the second place, that code has been like that for a long while,
>> so this theory absolutely does not explain why you're seeing a
>> materially higher probability of failure in HEAD than 9.1.  There is
>> something else going on.

> After a while trying to bisect the behavior, I decided it was a mug's
> game.  Both arms of the race (the firing of archive_command and the
> engineered crash) are triggered indirectly be the same event, the
> start of a checkpoint.  Small changes in the code can lead to small
> changes in the timing which make drastic changes in how likely it is
> that the two arms collide exactly at the vulnerability.

Ah.  OK, that sounds more plausible than "it just happened".

> So my test harness is an inexplicably effective show-case for the
> vulnerability, but it is not the reason the vulnerability should be
> fixed.

Agreed.
        regards, tom lane


Re: Archiver not exiting upon crash

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> So my test harness is an inexplicably effective show-case for the
> vulnerability, but it is not the reason the vulnerability should be
> fixed.

I spent a bit of time looking into this.  In principle the postmaster
could be fixed to repeat the SIGQUIT signal every second or so, but
it would be a rather considerable wart on code that's already
overcomplicated for my taste (given that bugs in the postmaster are
basically game over).  An example of the problems is that the
postmaster's own SIGQUIT signal handler presently just sends out the
child signals and calls exit().  We'd need some new "waiting for
children to die" state to allow for retransmission in such cases.
As far as the postmaster is concerned, it would be a lot better if
SIGQUIT could be considered reliable.

This leads me to the thought that maybe we have to deprecate use of
system() in Postgres backend code.  If we simply fork and exec without
touching the signal handling, ISTM that would do what we want, and the
amount of added code would be pretty minimal (a lot less than would have
to be added to the postmaster to address this the other way).  However,
I'm not too sure what would be required to make it go on Windows.
Comments?  Does the Windows emulation of system() even have this issue
to begin with?
        regards, tom lane