Thread: stats collector dies in current
I see stats collector processes die in current when I suspend
postmaster then put it in background from a terminal:

$ ps x
  :
  :
21638 pts/1    S      0:00 /bin/bash -i
30525 pts/1    S      0:00 postmaster
30527 pts/1    S      0:00 postgres: writer process
30528 pts/1    S      0:00 postgres: stats buffer process
30529 pts/1    S      0:00 postgres: stats collector process
30530 pts/1    R      0:00 ps x
$ fg
postmaster

[1]+  Stopped                 postmaster
$ bg
[1]+ postmaster &
LOG:  statistics collector process (PID 30528) exited with exit code 1

Is this normal?
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I see stats collector processes die in current when I suspend
> postmaster then put it in background from a terminal:

> Is this normal?

Doesn't 7.4 behave the same?

It looks to me like 7.4 and current have the same signal handling.
I'm not sure why a tstp/cont sequence would create a problem on
your platform (which is what, btw?) but it ought to cause the same
problem in 7.4 ...

                        regards, tom lane
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > I see stats collector processes die in current when I suspend
> > postmaster then put it in background from a terminal:
>
> > Is this normal?
>
> Doesn't 7.4 behave the same?

No.

> It looks to me like 7.4 and current have the same signal handling.
> I'm not sure why a tstp/cont sequence would create a problem on
> your platform (which is what, btw?) but it ought to cause the same
> problem in 7.4 ...

This is a Linux box with kernel 2.4.22 (x86). I also noticed that the
background writer process has almost the same signal handling, but it
is not killed.
--
Tatsuo Ishii
On 8/14/2004 11:38 PM, Tom Lane wrote:

> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> I see stats collector processes die in current when I suspend
>> postmaster then put it in background from a terminal:
>
>> Is this normal?
>
> Doesn't 7.4 behave the same?
>
> It looks to me like 7.4 and current have the same signal handling.
> I'm not sure why a tstp/cont sequence would create a problem on
> your platform (which is what, btw?) but it ought to cause the same
> problem in 7.4 ...

In that context, is SIGTSTP similar to SIGSTOP in that it cannot be
caught or ignored?


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:
> In that context, is SIGTSTP similar to SIGSTOP in that it cannot be
> caught or ignored?

Possibly.  I've reproduced the problem here on an RHL 8 system
(2.4.18 kernel) and I think it's a kernel bug.  Points:

1. AFAICS, the only case where the stats buffer process will exit(1)
without logging a prior message is where it's gotten SIGCHLD.  So,
hypothesis: it is the collector process (grandchild process) that
is dying.

2. Experiment one: try to strace the collector process to see what
it's doing.  Result: failure goes away!!!

3. Experiment two: try to strace the buffer process.  Result: indeed
it's getting SIGCHLD (in fact it seems to get it before SIGTSTP arrives).

So at the very least we've got a Heisenbug, but my opinion is we are
seeing broken kernel behavior.

The only difference in signal handling that I can see from 7.4 is that
the collector process explicitly executes pqsignal calls to re-establish
all the signal handlers it should have inherited from its parent.
I suspect (but haven't tested) that removing that supposedly redundant
code would make the failure go away again.  The handler re-establishment
was put in because it is needed for the EXEC_BACKEND case, but possibly
we could make it #ifndef EXEC_BACKEND to work around this problem.

                        regards, tom lane
Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>
>>In that context, is SIGTSTP similar to SIGSTOP in that it cannot be
>>caught or ignored?
>
>
> Possibly.  I've reproduced the problem here on an RHL 8 system
> (2.4.18 kernel) and I think it's a kernel bug.  Points:

[...]

I can reproduce this on a 2.6.7 kernel.

I think pqsignal should be passing SA_NOCLDSTOP in sa_flags, or
alternatively that the stats buffer process should check that its child
really did die rather than receive a stop signal. The sigaction manpage
says:

>        sa_flags specifies a set of flags which modify the behaviour of
>        the signal handling process. It is formed by the bitwise OR of
>        zero or more of the following:
>
>               SA_NOCLDSTOP
>                      If signum is SIGCHLD, do not receive notification
>                      when child processes stop (i.e., when child
>                      processes receive one of SIGSTOP, SIGTSTP, SIGTTIN
>                      or SIGTTOU).

signal(7) says that SIGCHLD is generated when a child is stopped or
terminated.

A bit of experimentation in the stats buffer process seems to confirm
this -- while it is receiving a SIGCHLD, calling waitpid() with WNOHANG
returns immediately with no dead processes.

-O
Oliver Jowett <oliver@opencloud.com> writes:
> I think pqsignal should be passing SA_NOCLDSTOP in sa_flags,

Hmm, that does look like a good idea ... but it does not explain why
7.4 doesn't have the same problem.

                        regards, tom lane
Oliver Jowett <oliver@opencloud.com> writes:
> I think pqsignal should be passing SA_NOCLDSTOP in sa_flags,

With that patch applied, the problem is indeed gone on my system.
However, I would still like to know why 7.4 didn't show the same
misbehavior, when it isn't using this flag.

                        regards, tom lane
Tom Lane wrote:
> Oliver Jowett <oliver@opencloud.com> writes:
>
>>I think pqsignal should be passing SA_NOCLDSTOP in sa_flags,
>
>
> With that patch applied, the problem is indeed gone on my system.
> However, I would still like to know why 7.4 didn't show the same
> misbehavior, when it isn't using this flag.

It looks like the 7.4 code never unblocks signals in the collector
process, so that process never gets stopped by SIGTSTP.

On the 7.4.1 install I have to hand, from /proc/<pid>/status, the buffer
process reports:

SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000001006a07
SigCgt: 0000000000010000

while the collector process has:

SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: ffffffff3ff9fa07
SigIgn: 0000000001007a07
SigCgt: 0000000000000000

TSTP is signal 20 here which appears to be blocked (mask of 80000) in
the collector process.

A quick glance at the REL7_4_STABLE pgstat.c shows only one PG_SETMASK,
executed in the buffer process only.

-O
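The mask arithmetic in this message can be checked with a few lines of C. This is a minimal sketch, assuming the Linux convention that signal N occupies bit N-1 of the hexadecimal masks in /proc/<pid>/status; parse_sigmask and mask_has_signal are hypothetical helper names, and the signal numbers are written as literals because they are platform specific (20 is TSTP on the box described above).

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a hex mask as printed after "SigBlk:", "SigIgn:" or "SigCgt:". */
static unsigned long long
parse_sigmask(const char *hex)
{
    return strtoull(hex, NULL, 16);
}

/* Return 1 if signal number signo is set in mask (signal N is bit N-1). */
static int
mask_has_signal(unsigned long long mask, int signo)
{
    assert(signo >= 1 && signo <= 64);
    return (int) ((mask >> (signo - 1)) & 1ULL);
}
```

Applied to the values quoted above, mask_has_signal(parse_sigmask("ffffffff3ff9fa07"), 20) reports TSTP as blocked in the collector, while the buffer process's all-zero SigBlk reports it as unblocked. By the same rule, the buffer process's SigCgt value 10000 corresponds to signal 17.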
Oliver Jowett <oliver@opencloud.com> writes:
> Tom Lane wrote:
>> However, I would still like to know why 7.4 didn't show the same
>> misbehavior, when it isn't using this flag.

> It looks like the 7.4 code never unblocks signals in the collector
> process, so that process never gets stopped by SIGTSTP.

Good catch --- that seems to explain all the facts.  Since the collector
SIG_IGN's all the signals it'd be likely to get in normal operation,
it's not surprising we did not notice this.

                        regards, tom lane