Thread: stats collector dies in current

stats collector dies in current

From

Tatsuo Ishii

Date:

15 August 2004, 00:03:30

I see stats collector processes die in current when I suspend
postmaster then put it in background from a terminal:

$ ps x
:
:
21638 pts/1    S      0:00 /bin/bash -i
30525 pts/1    S      0:00 postmaster
30527 pts/1    S      0:00 postgres: writer process   
30528 pts/1    S      0:00 postgres: stats buffer process   
30529 pts/1    S      0:00 postgres: stats collector process   
30530 pts/1    R      0:00 ps x
$ fg
postmaster

[1]+  Stopped                 postmaster
$ bg
[1]+ postmaster &
LOG:  statistics collector process (PID 30528) exited with exit code 1

Is this normal?
--
Tatsuo Ishii

Re: stats collector dies in current

From

Tom Lane

Date:

15 August 2004, 00:38:43

Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I see stats collector processes die in current when I suspend
> postmaster then put it in background from a terminal:

> Is this normal?

Doesn't 7.4 behave the same?

It looks to me like 7.4 and current have the same signal handling.
I'm not sure why a tstp/cont sequence would create a problem on
your platform (which is what, btw?) but it ought to cause the same
problem in 7.4 ...
        regards, tom lane

Re: stats collector dies in current

From

Tatsuo Ishii

Date:

15 August 2004, 00:49:41

> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > I see stats collector processes die in current when I suspend
> > postmaster then put it in background from a terminal:
> 
> > Is this normal?
> 
> Doesn't 7.4 behave the same?

No.

> It looks to me like 7.4 and current have the same signal handling.
> I'm not sure why a tstp/cont sequence would create a problem on
> your platform (which is what, btw?) but it ought to cause the same
> problem in 7.4 ...

This is a Linux box with kernel 2.4.22 (x86). I also noticed that the
background writer process has almost the same signal handling, but it
is not killed.
--
Tatsuo Ishii

Re: stats collector dies in current

From

Jan Wieck

Date:

15 August 2004, 00:55:01

On 8/14/2004 11:38 PM, Tom Lane wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> I see stats collector processes die in current when I suspend
>> postmaster then put it in background from a terminal:
> 
>> Is this normal?
> 
> Doesn't 7.4 behave the same?
> 
> It looks to me like 7.4 and current have the same signal handling.
> I'm not sure why a tstp/cont sequence would create a problem on
> your platform (which is what, btw?) but it ought to cause the same
> problem in 7.4 ...

In that context, is SIGTSTP similar to SIGSTOP in that it cannot be 
caught or ignored?


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

Re: stats collector dies in current

From

Tom Lane

Date:

15 August 2004, 01:19:06

Jan Wieck <JanWieck@Yahoo.com> writes:
> In that context, is SIGTSTP similar to SIGSTOP in that it cannot be 
> caught or ignored?

Possibly.  I've reproduced the problem here on an RHL 8 system
(2.4.18 kernel) and I think it's a kernel bug.  Points:

1. AFAICS, the only case where the stats buffer process will exit(1)
without logging a prior message is where it's gotten SIGCHLD.  So,
hypothesis: it is the collector process (grandchild process) that
is dying.

2. Experiment one: try to strace the collector process to see what
it's doing.  Result: failure goes away!!!

3. Experiment two: try to strace the buffer process.  Result: indeed
it's getting SIGCHLD (in fact it seems to get it before SIGTSTP
arrives).

So at the very least we've got a Heisenbug, but my opinion is we are
seeing broken kernel behavior.

The only difference in signal handling that I can see from 7.4 is that
the collector process explicitly executes pqsignal calls to re-establish
all the signal handlers it should have inherited from its parent.
I suspect (but haven't tested) that removing that supposedly redundant
code would make the failure go away again.

The handler re-establishment was put in because it is needed for the
EXEC_BACKEND case, but possibly we could make it #ifdef EXEC_BACKEND
to work around this problem.
        regards, tom lane

Re: stats collector dies in current

From

Oliver Jowett

Date:

15 August 2004, 02:07:41

Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> 
>>In that context, is SIGTSTP similar to SIGSTOP in that it cannot be 
>>caught or ignored?
> 
> 
> Possibly.  I've reproduced the problem here on an RHL 8 system
> (2.4.18 kernel) and I think it's a kernel bug.  Points:

[...]

I can reproduce this on a 2.6.7 kernel.

I think pqsignal should be passing SA_NOCLDSTOP in sa_flags, or 
alternatively that the stats buffer process should check that its child 
really did die rather than receive a stop signal. The sigaction manpage 
says:

>        sa_flags specifies a set of flags which modify the behaviour of
>        the signal handling process. It is formed by the bitwise OR of
>        zero or more of the following:
>
>               SA_NOCLDSTOP
>                      If signum is SIGCHLD, do not receive notification
>                      when child processes stop (i.e., when child
>                      processes receive one of SIGSTOP, SIGTSTP, SIGTTIN
>                      or SIGTTOU).

signal(7) says that SIGCHLD is generated when a child is stopped or 
terminated.
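
A minimal sketch of the first suggestion, assuming a simplified stand-in
for pqsignal. This is not the actual patch: install_handler, on_sigchld
and sigchld_count_for_stopped_child are hypothetical names used only for
illustration. It installs a SIGCHLD handler with or without SA_NOCLDSTOP
and counts the SIGCHLDs that arrive when a child merely stops:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t sigchld_seen = 0;

static void
on_sigchld(int signo)
{
    (void) signo;
    sigchld_seen++;
}

/* Hypothetical stand-in for pqsignal(): install func via sigaction(). */
static int
install_handler(int signo, void (*func) (int), int no_cld_stop)
{
    struct sigaction act;

    assert(signo != SIGKILL && signo != SIGSTOP);   /* cannot be caught */
    memset(&act, 0, sizeof(act));
    act.sa_handler = func;
    sigemptyset(&act.sa_mask);
    act.sa_flags = SA_RESTART;
    if (signo == SIGCHLD && no_cld_stop)
        act.sa_flags |= SA_NOCLDSTOP;   /* no SIGCHLD for stopped children */
    return sigaction(signo, &act, NULL);
}

/*
 * Fork a child that stops itself, and return how many SIGCHLDs the parent
 * saw for that stop (-1 on error).
 */
static int
sigchld_count_for_stopped_child(int no_cld_stop)
{
    struct timespec grace = {0, 50 * 1000 * 1000};  /* 50 ms */
    int         status;
    int         seen;
    pid_t       pid;

    sigchld_seen = 0;
    if (install_handler(SIGCHLD, on_sigchld, no_cld_stop) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        raise(SIGSTOP);         /* child: stop itself, then exit if resumed */
        _exit(0);
    }

    /* WUNTRACED makes waitpid() report the stop instead of waiting for exit */
    if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
        return -1;

    nanosleep(&grace, NULL);    /* let any in-flight SIGCHLD arrive */
    seen = sigchld_seen;

    kill(pid, SIGKILL);         /* clean up: kill and reap the child */
    waitpid(pid, &status, 0);
    signal(SIGCHLD, SIG_DFL);
    return seen;
}
```

Under these assumptions, sigchld_count_for_stopped_child(0) should report
at least one SIGCHLD for the stop, while sigchld_count_for_stopped_child(1)
should report none.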

A bit of experimentation in the stats buffer process seems to confirm 
this -- while it is receiving a SIGCHLD, calling waitpid() with WNOHANG 
returns immediately with no dead processes.
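
The alternative, having the parent verify that its child really died,
might look roughly like the sketch below. The helper names child_has_died
and demo_stop_versus_death are hypothetical; waitpid() and the W* macros
are standard POSIX. Because WUNTRACED is not passed in child_has_died, a
child that is only stopped is not reported and waitpid() returns 0, which
matches the observation above:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Return 1 if the child has terminated (and reap it), 0 if it is still
 * alive (running or merely stopped), -1 on error.
 */
static int
child_has_died(pid_t pid)
{
    int         status;
    pid_t       rc;

    assert(pid > 0);            /* pid <= 0 means "any child" to waitpid() */
    rc = waitpid(pid, &status, WNOHANG);
    if (rc == 0)
        return 0;               /* no termination to report: not dead */
    if (rc == pid && (WIFEXITED(status) || WIFSIGNALED(status)))
        return 1;               /* child exited or was killed by a signal */
    return -1;                  /* waitpid() failed or unexpected status */
}

/*
 * Demo: returns 1 if a stopped child is reported as alive and the same
 * child is reported as dead once it has been killed, 0 otherwise.
 */
static int
demo_stop_versus_death(void)
{
    struct timespec tick = {0, 10 * 1000 * 1000};   /* 10 ms */
    int         status;
    int         died = 0;
    int         i;
    pid_t       pid;

    signal(SIGCHLD, SIG_DFL);   /* make sure dead children stay waitable */
    pid = fork();
    if (pid < 0)
        return 0;
    if (pid == 0)
    {
        raise(SIGSTOP);         /* child: stop itself */
        _exit(0);
    }

    /* wait until the child has actually stopped */
    if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
        return 0;
    if (child_has_died(pid) != 0)
        return 0;               /* a stop must not look like a death */

    kill(pid, SIGKILL);
    for (i = 0; i < 500 && died == 0; i++)  /* poll for up to about 5 s */
    {
        died = child_has_died(pid);
        if (died == 0)
            nanosleep(&tick, NULL);
    }
    return died == 1;
}
```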

-O

Re: stats collector dies in current

From

Tom Lane

Date:

15 August 2004, 02:14:46

Oliver Jowett <oliver@opencloud.com> writes:
> I think pqsignal should be passing SA_NOCLDSTOP in sa_flags,

Hmm, that does look like a good idea ... but it does not explain why 7.4
doesn't have the same problem.
        regards, tom lane

Re: stats collector dies in current

From

Tom Lane

Date:

15 August 2004, 02:38:05

Oliver Jowett <oliver@opencloud.com> writes:
> I think pqsignal should be passing SA_NOCLDSTOP in sa_flags,

With that patch applied, the problem is indeed gone on my system.
However, I would still like to know why 7.4 didn't show the same
misbehavior, when it isn't using this flag.
        regards, tom lane

Re: stats collector dies in current

From

Oliver Jowett

Date:

15 August 2004, 03:02:21

Tom Lane wrote:
> Oliver Jowett <oliver@opencloud.com> writes:
> 
>>I think pqsignal should be passing SA_NOCLDSTOP in sa_flags,
> 
> 
> With that patch applied, the problem is indeed gone on my system.
> However, I would still like to know why 7.4 didn't show the same
> misbehavior, when it isn't using this flag.

It looks like the 7.4 code never unblocks signals in the collector 
process, so that process never gets stopped by SIGTSTP.

On the 7.4.1 install I have to hand, from /proc/<pid>/status, the buffer 
process reports:

SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000001006a07
SigCgt: 0000000000010000

while the collector process has:

SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: ffffffff3ff9fa07
SigIgn: 0000000001007a07
SigCgt: 0000000000000000

TSTP is signal 20 here, which appears to be blocked (bit 0x80000 of
SigBlk) in the collector process.

A quick glance at the REL7_4_STABLE pgstat.c shows only one PG_SETMASK, 
executed in the buffer process only.
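
For reference, the masks above can be decoded with a few lines of C. This
is only a sketch; parse_status_mask and mask_has_signal are hypothetical
helper names, and the one assumption is the layout used by
/proc/<pid>/status on Linux, where signal N corresponds to bit N-1 of the
hexadecimal mask:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a hex mask as printed in /proc/<pid>/status. */
static uint64_t
parse_status_mask(const char *hex)
{
    return (uint64_t) strtoull(hex, NULL, 16);
}

/* Return 1 if signal number signo is set in the mask, else 0. */
static int
mask_has_signal(uint64_t mask, int signo)
{
    assert(signo >= 1 && signo <= 64);
    return (int) ((mask >> (signo - 1)) & 1);
}
```

With the values quoted above, mask_has_signal() returns 1 for signal 20
in the collector's SigBlk (ffffffff3ff9fa07) and 0 for the buffer
process's all-zero SigBlk. The buffer process's SigCgt of 0000000000010000
has only bit 16 set, which is signal 17 (SIGCHLD on x86 Linux).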

-O

Re: stats collector dies in current

From

Tom Lane

Date:

15 August 2004, 03:15:36

Oliver Jowett <oliver@opencloud.com> writes:
> Tom Lane wrote:
>> However, I would still like to know why 7.4 didn't show the same
>> misbehavior, when it isn't using this flag.

> It looks like the 7.4 code never unblocks signals in the collector 
> process, so that process never gets stopped by SIGTSTP.

Good catch --- that seems to explain all the facts.

Since the collector SIG_IGN's all the signals it'd be likely to get in
normal operation, it's not surprising we did not notice this.
        regards, tom lane