Thread: Archiver not exiting upon crash
I've been testing the crash recovery of REL9_2_BETA1, using the same method I posted in the "Scaling XLog insertion" thread. I have the checkpointer occasionally throw a FATAL error, which causes the postmaster to take down all of the other processes (DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.) and initiate recovery.

However, sometimes the automatic recovery never initiates. It looks like the postmaster is waiting for the archiver to exit before it starts recovery, and the archiver is waiting for something, I don't really know what. This happens on about 10% of the crashes on REL9_2_BETA1, although I imagine that number is extremely dependent on minutiae of the setup, hardware, and phase of the moon, as it is probably some kind of race. This behavior is also present in 9_1_STABLE, although at a much lower prevalence (about 1%). In fact, it seems to go back at least to 8.4.0. If I kill -9 the archiver, then recovery initiates and proceeds as normal.

I don't know the best way to tackle this. By staring at the code, by "git bisect" (which is hard to do, because I don't know if the behavior was ever not there, and because the problem only occurs statistically it can take many hours per iteration), or some other method?

Thanks,

Jeff
On Sat, May 19, 2012 at 1:23 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I've been testing the crash recovery of REL9_2_BETA1, using the same
> method I posted in the "Scaling XLog insertion" thread. I have the
> checkpointer occasionally throw a FATAL error,

Should we also fix this problem? If so, could you show us a self-contained test case to reproduce it?

> However, sometimes the automatic recovery never initiates. It looks
> like the postmaster is waiting for the archiver to exit before it
> starts recovery, and the archiver is waiting for something, I don't
> really know what.

Could you get a backtrace of the waiting archiver?
http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

You might have hit the following problem, which was discussed before. That problem was fixed in the SIGQUIT signal handler of a backend, but ISTM not in that of the archiver.
http://archives.postgresql.org/pgsql-admin/2009-11/msg00088.php

Regards,

--
Fujii Masao
Jeff Janes <jeff.janes@gmail.com> writes:
> ... sometimes the automatic recovery never initiates. It looks
> like the postmaster is waiting for the archiver to exit before it
> starts recovery, and the archiver is waiting for something, I don't
> really know what.

Can you try poking into the archiver's state with gdb? It would be useful to see a stack trace when it's hung, as well as the local and static variables of pgarch.c.

			regards, tom lane
Fujii Masao <masao.fujii@gmail.com> writes:
> You might have gotten the following problem which was discussed before.
> This problem was fixed in SIGQUIT signal handler of a backend, but ISTM
> not that of an archiver.
> http://archives.postgresql.org/pgsql-admin/2009-11/msg00088.php

pgarch.c's SIGQUIT handler just does exit(1), so it seems a bit unlikely that that solution would make a difference. But we need to see a stack trace to rule it out or not.

			regards, tom lane
I wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> You might have gotten the following problem which was discussed before.
>> This problem was fixed in SIGQUIT signal handler of a backend, but ISTM
>> not that of an archiver.
>> http://archives.postgresql.org/pgsql-admin/2009-11/msg00088.php

> pgarch.c's SIGQUIT handler just does exit(1), so it seems a bit unlikely
> that that solution would make a difference.

... but having said that, I see Peter's commit d6de43099ac0bddb4b1da40088487616da892164 only touched postgres.c's quickdie(), and not all the *other* background processes with identical coding. That seems a clear oversight, so I will go fix it. Doesn't explain why the archiver would get confused, though, since that file doesn't have any code that tries to re-enable signals after entering the signal handler.

			regards, tom lane
I wrote:
> ... but having said that, I see Peter's commit
> d6de43099ac0bddb4b1da40088487616da892164 only touched postgres.c's
> quickdie(), and not all the *other* background processes with identical
> coding. That seems a clear oversight, so I will go fix it. Doesn't
> explain why the archiver would get confused, though, since that file
> doesn't have any code that tries to re-enable signals after entering the
> signal handler.

... wait, scratch that. AFAICS, that commit was totally useless, because BlockSig should always already contain SIGQUIT. I don't think there's a need to propagate the same useless code elsewhere.

In the case of pgarch.c, there might be some reason to add PG_SETMASK(&BlockSig) to its SIGQUIT handler, just to be sure that *other* signals are blocked before we go into the exit(1) code. I'm still having a hard time believing that that's Jeff's issue, though.

			regards, tom lane
On mån, 2012-05-21 at 12:52 -0400, Tom Lane wrote:
> I see Peter's commit d6de43099ac0bddb4b1da40088487616da892164 only
> touched postgres.c's quickdie(), and not all the *other* background
> processes with identical coding. That seems a clear oversight, so I
> will go fix it.

None[*] of the other signal handlers call elog/ereport, which was the cause of the original problem. It might still be good to review this across the board, though.

* - for some value of "none"; I didn't check all of them in detail
On mån, 2012-05-21 at 13:14 -0400, Tom Lane wrote:
>> ... but having said that, I see Peter's commit
>> d6de43099ac0bddb4b1da40088487616da892164 only touched postgres.c's
>> quickdie(), and not all the *other* background processes with identical
>> coding. That seems a clear oversight, so I will go fix it. Doesn't
>> explain why the archiver would get confused, though, since that file
>> doesn't have any code that tries to re-enable signals after entering the
>> signal handler.
>
> ... wait, scratch that. AFAICS, that commit was totally useless,
> because BlockSig should always already contain SIGQUIT.

No, because PostgresMain() deletes it from BlockSig.
Peter Eisentraut <peter_e@gmx.net> writes:
> On mån, 2012-05-21 at 13:14 -0400, Tom Lane wrote:
>> ... wait, scratch that. AFAICS, that commit was totally useless,
>> because BlockSig should always already contain SIGQUIT.

> No, because PostgresMain() deletes it from BlockSig.

Ah. So potentially we have an issue in all the background processes that have copied-and-pasted that sigdelset call, which seems to be most of them.

I'm inclined to think this does need to be fixed. Even if it accidentally works now, it seems fragile as can be; and certainly it's underdocumented.

			regards, tom lane
On Thu, May 24, 2012 at 1:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> On mån, 2012-05-21 at 13:14 -0400, Tom Lane wrote:
>>> ... wait, scratch that. AFAICS, that commit was totally useless,
>>> because BlockSig should always already contain SIGQUIT.
>
>> No, because PostgresMain() deletes it from BlockSig.
>
> Ah. So potentially we have an issue in all the background processes
> that have copied-and-pasted that sigdelset call, which seems to be most
> of them.
>
> I'm inclined to think this does need to be fixed. Even if it
> accidentally works now, it seems fragile as can be; and certainly
> it's underdocumented.

+1

And I found that PG_SETMASK(&UnBlockSig) in WalSenderMain() is useless because it's always called in PostgresMain() before entering WalSenderMain().

Regards,

--
Fujii Masao
On Mon, May 21, 2012 at 9:22 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sat, May 19, 2012 at 1:23 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I've been testing the crash recovery of REL9_2_BETA1, using the same
>> method I posted in the "Scaling XLog insertion" thread. I have the
>> checkpointer occasionally throw a FATAL error,
>
> We should also fix this problem? If yes, could you show us the self-contained
> test case to reproduce it?

It doesn't need to be fixed. I intentionally apply a patch to cause this FATAL error. In order to exercise the crash recovery, I need a way to cause crashes.

>> However, sometimes the automatic recovery never initiates. It looks
>> like the postmaster is waiting for the archiver to exit before it
>> starts recovery, and the archiver is waiting for something, I don't
>> really know what.
>
> Could you get the backtrace of the waiting archiver?
> http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

Backtrace of the archiver:

#0  0x0000003da12ca32f in poll () from /lib64/libc.so.6
#1  0x00000000005ef8b9 in WaitLatchOrSocket (latch=0xafe238, wakeEvents=25, sock=-1, timeout=60000) at pg_latch.c:278
#2  0x00000000005f2b84 in PgArchiverMain () at pgarch.c:423
#3  pgarch_start () at pgarch.c:179
#4  0x00000000005fbd72 in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2383
#5  <signal handler called>
#6  0x0000003da12cc3a3 in __select_nocancel () from /lib64/libc.so.6
#7  0x00000000005f8edc in ServerLoop () at postmaster.c:1316
#8  0x00000000005fa327 in PostmasterMain (argc=3, argv=0xb7c1720) at postmaster.c:1116
#9  0x000000000059c6ee in main (argc=3, argv=<value optimized out>) at main.c:199

I also attached gdb to the archiver when it was working normally, and I got the exact same backtrace.

It looks to me like the SIGQUIT from the postmaster is simply getting lost. And from what little I understand of signal handling, this is a known race with system(3).
The archive_command, child of the archiver, exits before it can receive the signal sent to the entire archiver process group, so its exit status doesn't show that it was signalled. But the signal sent directly to the archiver reaches it while it is still ignoring SIGQUITs.

I caught a hang in motion with strace. Here, 22448 is the archiver, and 29475 is the archive_command that was in progress when the crash happened.

[pid 22448] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 29475
[pid 22448] --- SIGQUIT (Quit) @ 0 (0) ---
[pid 22448] rt_sigaction(SIGINT, {0x1, [], SA_RESTORER|SA_RESTART, 0x3da12302f0}, <unfinished ...>
[pid 22448] <... rt_sigaction resumed> NULL, 8) = 0
[pid 22448] rt_sigaction(SIGQUIT, {0x5f29a0, [], SA_RESTORER|SA_RESTART, 0x3da12302f0}, <unfinished ...>

And just in case it is useful, a backtrace of the postmaster during the hang:

#0  0x0000003da12cc3a3 in __select_nocancel () from /lib64/libc.so.6
#1  0x00000000007226ba in pg_usleep (microsec=<value optimized out>) at pgsleep.c:43
#2  0x00000000005f9306 in ServerLoop () at postmaster.c:1305
#3  0x00000000005fa327 in PostmasterMain (argc=3, argv=0xb7c1720) at postmaster.c:1116
#4  0x000000000059c6ee in main (argc=3, argv=<value optimized out>) at main.c:199

If the SIGQUIT is getting lost in a race, could it just be blocked during the system(3) call? I don't know what happens if you call system(3) with SIGQUIT blocked.

Or maybe the postmaster should not be infinitely patient, but should send another round of signals after a brief delay.

Cheers,

Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> It looks to me like the SIGQUIT from the postmaster is simply getting
> lost. And from what little I understand of signal handling, this is a
> known race with system(3). The archive_command, child of archiver,
> exits before it can receive the signal sent to the entire archiver
> process group, so it doesn't set its exit status to show it was
> signalled. But the signal sent directly to the archiver reaches it
> while it is still ignoring SIGQUITs.

Ugh.

> If the SIGQUIT is getting lost in a race, could it just be blocked
> during the system(3) call?
> I don't know what happens if you call system(3) with SIGQUIT being blocked.

On my machine, man system(3) saith:

    system() ignores the SIGINT and SIGQUIT signals, and blocks the
    SIGCHLD signal, while waiting for the command to terminate. If this
    might cause the application to miss a signal that would have killed
    it, the application should examine the return value from system()
    and take whatever action is appropriate to the application if the
    command terminated due to receipt of a signal.

Now, the code that directly calls system(), namely pgarch_archiveXlog(), knows this perfectly well, as per the comment at lines 590ff in HEAD. However, the code that *calls* it did not get the memo :-(, and appears to be willing to retry regardless.

> Or maybe the postmaster should not be infinitely patient, but send
> another round of signals after a brief delay.

If the first one was ignored, later ones might be too.

I'm inclined to think that we should change pgarch_archiveXlog to detect these specific signal conditions and just directly exit(), rather than giving its caller a chance to blow the decision.

			regards, tom lane
I wrote:
> On my machine, man system(3) saith:
>     system() ignores the SIGINT and SIGQUIT signals, and blocks the
>     SIGCHLD signal, while waiting for the command to terminate. If this
>     might cause the application to miss a signal that would have killed
>     it, the application should examine the return value from system() and
>     take whatever action is appropriate to the application if the command
>     terminated due to receipt of a signal.

> Now, the code that directly calls system(), namely pgarch_archiveXlog(),
> knows this perfectly well, as per the comment at lines 590ff in HEAD.
> However, the code that *calls* it did not get the memo :-(, and appears
> to be willing to retry regardless.

Uh, no, scratch that: the code in pgarch_archiveXlog ought to be doing ereport(FATAL) if this case applies. So it's not clear what is happening for you, unless system() returned a misleading return code. You might try adding a debug printout to show the actual returned value.

			regards, tom lane
On Wed, May 23, 2012 at 1:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> It looks to me like the SIGQUIT from the postmaster is simply getting
>> lost. And from what little I understand of signal handling, this is a
>> known race with system(3). The archive_command, child of archiver,
>> exits before it can receive the signal sent to the entire archiver
>> process group, so it doesn't set its exit status to show it was
>> signalled. But the signal sent directly to the archiver reaches it
>> while it is still ignoring SIGQUITs.
>
> Ugh.
>
>> If the SIGQUIT is getting lost in a race, could it just be blocked
>> during the system(3) call?
>> I don't know what happens if you call system(3) with SIGQUIT being blocked.
>
> On my machine, man system(3) saith:
>
>     system() ignores the SIGINT and SIGQUIT signals, and blocks the
>     SIGCHLD signal, while waiting for the command to terminate. If this
>     might cause the application to miss a signal that would have killed
>     it, the application should examine the return value from system() and
>     take whatever action is appropriate to the application if the command
>     terminated due to receipt of a signal.

But what happens if SIGQUIT is blocked before system(3) is invoked? Does the ignore take precedence over the block, or does the block take precedence over the ignore, so that the signal is still pending once the block is removed after system(3) returns? I could write a test program to see, but that wouldn't be very good evidence of portability.

> Now, the code that directly calls system(), namely pgarch_archiveXlog(),
> knows this perfectly well, as per the comment at lines 590ff in HEAD.
> However, the code that *calls* it did not get the memo :-(, and appears
> to be willing to retry regardless.

But if the signal is lost, how could it know to do anything differently?
>> Or maybe the postmaster should not be infinitely patient, but send
>> another round of signals after a brief delay.
>
> If the first one was ignored, later ones might be too.

True, but I don't see how it makes anything worse. If the postmaster is going to hang for eternity anyway, it might as well do something every now and then. Eventually a signal is highly likely to get through. (And in this case it would be almost certain to, as it is a rare race and the archiver would run out of things to archive anyway.)

> I'm inclined to think that we should change pgarch_archiveXlog to
> detect these specific signal conditions and just directly exit(),
> rather than giving its caller a chance to blow the decision.

But that is exactly the problem. How do you detect a signal that gets lost?

I wonder if this is a quirk of my current hardware/kernel. I have occasionally seen hangs on other systems during this same exercise, but they seemed much less common and I never investigated them.

Cheers,

Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Wed, May 23, 2012 at 1:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> On my machine, man system(3) saith:
>>
>>     system() ignores the SIGINT and SIGQUIT signals, and blocks the
>>     SIGCHLD signal, while waiting for the command to terminate. If this
>>     might cause the application to miss a signal that would have killed
>>     it, the application should examine the return value from system() and
>>     take whatever action is appropriate to the application if the command
>>     terminated due to receipt of a signal.

> But what happens if the SIGQUIT is blocked before the system(3) is
> invoked? Does the ignore take precedence over the block, or does the
> block take precedence over the ignore, and so the signal is still
> waiting once the block is reversed after the system(3) is over? I
> could write a test program to see, but that wouldn't be very good
> evidence of the portability.

AFAICT from the POSIX spec for system(3), that would be a bug in system(). You are supposed to be able to tell from the return value whether a signal happened. Note that the signals are only supposed to be ignored while waiting for the command to terminate; therefore, if a signal does get ignored in the parent process, it should be delivered to the child, and so we should see its effects in the child's return code.

Of course, if you are using an archive_command that might itself ignore SIGQUIT, perhaps that's where the issue is.

			regards, tom lane
I wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> But what happens if the SIGQUIT is blocked before the system(3) is
>> invoked? Does the ignore take precedence over the block, or does the
>> block take precedence over the ignore, and so the signal is still
>> waiting once the block is reversed after the system(3) is over? I
>> could write a test program to see, but that wouldn't be very good
>> evidence of the portability.

> AFAICT from the POSIX spec for system(3), that would be a bug in
> system().

Actually, on further thought, it seems like there is *necessarily* a race condition in this. There must be some interval where the child process has already exited but the waiting parent hasn't de-ignored the signals. So if SIGQUIT is delivered just then, it must go into the aether. This, together with the thought that the child process might accidentally or intentionally ignore the signal, makes me think that maybe you're right and we need to retransmit the SIGQUIT occasionally.

However, I remain unsatisfied with this idea as an explanation for the behavior you're seeing. In the first place, that race condition window ought not be wide enough to allow failure probabilities as high as 10%. In the second place, that code has been like that for a long while, so this theory absolutely does not explain why you're seeing a materially higher probability of failure in HEAD than 9.1. There is something else going on.

			regards, tom lane
On Wed, May 23, 2012 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> But what happens if the SIGQUIT is blocked before the system(3) is
>>> invoked? Does the ignore take precedence over the block, or does the
>>> block take precedence over the ignore, and so the signal is still
>>> waiting once the block is reversed after the system(3) is over? I
>>> could write a test program to see, but that wouldn't be very good
>>> evidence of the portability.
>
>> AFAICT from the POSIX spec for system(3), that would be a bug in
>> system().
>
> Actually, on further thought, it seems like there is *necessarily* a
> race condition in this. There must be some interval where the child
> process has already exited but the waiting parent hasn't de-ignored the
> signals.

Yup. The POSIX man page doesn't say this is guaranteed to be atomic, so I assume no such guarantee is even attempted. For that matter, the man page for system(3) doesn't even tell how one goes about signaling the child: unless the signal is sent to the negation of the group leader, the child won't receive it. (Postgres handles the group-leader thing correctly; I just point it out to show the man page is not intended to be complete.)

> However, I remain unsatisfied with this idea as an explanation for the
> behavior you're seeing. In the first place, that race condition window
> ought not be wide enough to allow failure probabilities as high as 10%.
> In the second place, that code has been like that for a long while,
> so this theory absolutely does not explain why you're seeing a
> materially higher probability of failure in HEAD than 9.1. There is
> something else going on.

After a while trying to bisect the behavior, I decided it was a mug's game. Both arms of the race (the firing of archive_command and the engineered crash) are triggered indirectly by the same event, the start of a checkpoint.
Small changes in the code can lead to small changes in the timing, which make drastic changes in how likely it is that the two arms collide exactly at the vulnerability. So my test harness is an inexplicably effective show-case for the vulnerability, but it is not the reason the vulnerability should be fixed.

By the way, my archive_command is very fast, as all it does is echo the date into a log file. I want postgres to think it is in archive mode, but for this purpose I don't want to actually deal with having an archive.

Cheers,

Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Wed, May 23, 2012 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, I remain unsatisfied with this idea as an explanation for the
>> behavior you're seeing. In the first place, that race condition window
>> ought not be wide enough to allow failure probabilities as high as 10%.
>> In the second place, that code has been like that for a long while,
>> so this theory absolutely does not explain why you're seeing a
>> materially higher probability of failure in HEAD than 9.1. There is
>> something else going on.

> After a while trying to bisect the behavior, I decided it was a mug's
> game. Both arms of the race (the firing of archive_command and the
> engineered crash) are triggered indirectly by the same event, the
> start of a checkpoint. Small changes in the code can lead to small
> changes in the timing which make drastic changes in how likely it is
> that the two arms collide exactly at the vulnerability.

Ah. OK, that sounds more plausible than "it just happened".

> So my test harness is an inexplicably effective show-case for the
> vulnerability, but it is not the reason the vulnerability should be
> fixed.

Agreed.

			regards, tom lane
Jeff Janes <jeff.janes@gmail.com> writes:
> So my test harness is an inexplicably effective show-case for the
> vulnerability, but it is not the reason the vulnerability should be
> fixed.

I spent a bit of time looking into this. In principle the postmaster could be fixed to repeat the SIGQUIT signal every second or so, but it would be a rather considerable wart on code that's already overcomplicated for my taste (given that bugs in the postmaster are basically game over). An example of the problems is that the postmaster's own SIGQUIT signal handler presently just sends out the child signals and calls exit(). We'd need some new "waiting for children to die" state to allow for retransmission in such cases. As far as the postmaster is concerned, it would be a lot better if SIGQUIT could be considered reliable.

This leads me to the thought that maybe we have to deprecate use of system() in Postgres backend code. If we simply fork and exec without touching the signal handling, ISTM that would do what we want, and the amount of added code would be pretty minimal (a lot less than would have to be added to the postmaster to address this the other way). However, I'm not too sure what would be required to make it go on Windows. Comments? Does the Windows emulation of system() even have this issue to begin with?

			regards, tom lane