Re: checkpointer code behaving strangely on postmaster -T - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: checkpointer code behaving strangely on postmaster -T
Date
Msg-id 1336770275-sup-7739@alvh.no-ip.org
In response to Re: checkpointer code behaving strangely on postmaster -T  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: checkpointer code behaving strangely on postmaster -T
List pgsql-hackers
Excerpts from Tom Lane's message of Fri May 11 16:50:01 -0400 2012:

> > Yep, it's still there as far as I can tell.  A backtrace from the
> > checkpointer shows it's waiting on the latch.
>
> I'm confused about what you did here and whether this isn't just pilot
> error.  If you run with -T then the postmaster will just SIGSTOP the
> remaining child processes, but then it will sit and wait for them to
> die, since the state machine expects them to react as though they'd been
> sent SIGQUIT.

The sequence of events is:
  1. start postmaster -T
  2. crash a backend
  3. SIGINT postmaster
  4. SIGCONT all child processes

My expectation is that postmaster should exit normally after this.  What
happens instead is that all processes exit, except checkpointer.  And in
fact, postmaster is now in PM_WAIT_BACKENDS state, so sending SIGINT a
second time will not shut down checkpointer either.

Maybe we can consider this to be just pilot error, but then why do all
other processes exit normally?  To me it just seems an oversight in
checkpointer shutdown handling in conjunction with -T.

> If you SIGCONT any of them then that process will resume,
> totally ignorant that it's supposed to die.  So "kill -CONTd, of course"
> makes no sense to me.  I tried killing one child with -KILL, then
> sending SIGINT to the postmaster, then killing the remaining
> already-stopped children, and the postmaster did exit as expected after
> the last child died.

Uhm, after you SIGINT'd postmaster, didn't it shut down all children?  That
would be odd.
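
To illustrate the signal semantics in the quoted paragraph above, here is a
minimal Python sketch.  It is not PostgreSQL code: the function name
stop_continue_then_quit and the sleeping child process are hypothetical
stand-ins for a postmaster child, and it assumes a POSIX system where
os.WCONTINUED is available (for example Linux).  It shows that a process
that is stopped and then continued simply keeps running, and only exits
once it actually receives a terminating signal such as SIGQUIT.

```python
import os
import signal
import subprocess
import sys
import time


def stop_continue_then_quit():
    """Stop, continue, then SIGQUIT a toy child; report what was observed."""
    # Hypothetical stand-in for a postmaster child: it just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    pid = child.pid

    # SIGSTOP freezes the child; WUNTRACED lets the parent observe the stop.
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    # SIGCONT resumes it; nothing tells the child it was meant to die.
    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, os.WCONTINUED)
    was_continued = os.WIFCONTINUED(status)

    # Give the child a moment, then confirm it is still running.
    time.sleep(0.2)
    alive_after_cont = child.poll() is None

    # Only an explicit terminating signal makes it exit.
    os.kill(pid, signal.SIGQUIT)
    returncode = child.wait(timeout=10)

    return was_stopped, was_continued, alive_after_cont, returncode


if __name__ == "__main__":
    print(stop_continue_then_quit())
```

A negative return code from subprocess means the child was terminated by
that signal number, so the last element identifies SIGQUIT as the cause of
death rather than SIGCONT.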

> So I don't see any bug here.  And, after closer inspection, your
> previous proposed patch is quite bogus because checkpointer is not
> supposed to stop yet when the other processes are being terminated
> normally.

Well, the patch does send the signal only when FatalError is set.  So it
should only affect -T behavior.

> Possibly it'd be useful to teach the postmaster more thoroughly about
> SIGSTOP and have a way for it to really kill the remaining children
> after you've finished investigating their state.  But frankly this
> is the first time I've heard of anybody using that feature at all;
> I always thought it was a vestigial hangover from days when the kernel
> was too stupid to write separate core dump files for each backend.
> I'd rather remove SendStop than add more complexity there.

Hah.  I've used it a few times, but I can see that multiple core files
are okay.  Maybe you're right and we should just remove it.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

