Re: checkpointer code behaving strangely on postmaster -T - Mailing list pgsql-hackers

From Tom Lane
Subject Re: checkpointer code behaving strangely on postmaster -T
Date
Msg-id 15684.1336769401@sss.pgh.pa.us
In response to Re: checkpointer code behaving strangely on postmaster -T  (Alvaro Herrera <alvherre@commandprompt.com>)
Responses Re: checkpointer code behaving strangely on postmaster -T
List pgsql-hackers
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Tom Lane's message of jue may 10 02:27:32 -0400 2012:
>> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>>> I noticed while doing some tests that the checkpointer process does not
>>> recover very nicely after a backend crashes under postmaster -T (after
>>> all processes have been kill -CONTd, of course, and postmaster told to
>>> shut down via Ctrl-C on its console).  For some reason it seems to get
>>> stuck in a loop doing sleep(0.5s).  In another case I caught it trying to
>>> do a checkpoint, but it was progressing a single page each time and then
>>> sleeping.  In that condition, the checkpoint took a very long time to
>>> finish.

>> Is this still a problem as of HEAD?  I think I've fixed some issues in
>> the checkpointer's outer loop logic, but not sure if what you saw is
>> still there.

> Yep, it's still there as far as I can tell.  A backtrace from the
> checkpointer shows it's waiting on the latch.

I'm confused about what you did here and whether this isn't just pilot
error.  If you run with -T then the postmaster will just SIGSTOP the
remaining child processes, but then it will sit and wait for them to
die, since the state machine expects them to react as though they'd been
sent SIGQUIT.  If you SIGCONT any of them then that process will resume,
totally ignorant that it's supposed to die.  So "kill -CONTd, of course"
makes no sense to me.  I tried killing one child with -KILL, then
sending SIGINT to the postmaster, then killing the remaining
already-stopped children, and the postmaster did exit as expected after
the last child died.
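
To make the SIGSTOP/SIGCONT point concrete, here is a minimal sketch in
Python.  It is not PostgreSQL code: the function name stop_continue_kill
and the sleeping child are hypothetical stand-ins for a backend, and the
parent plays the role of whoever sends the signals.  It only shows that a
stopped process which receives SIGCONT simply carries on, and has to be
killed explicitly before its parent can reap it.

```python
import os
import signal
import time


def stop_continue_kill():
    """Fork a child, then SIGSTOP, SIGCONT, and SIGKILL it, recording each state."""
    pid = os.fork()
    if pid == 0:
        # Child: hypothetical stand-in for a backend that just keeps working.
        while True:
            time.sleep(0.05)

    events = []

    # Freeze the child, as the source says the postmaster does under -T.
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    events.append("stopped" if os.WIFSTOPPED(status) else "other")

    # Resume it.  Nothing tells the child it was supposed to die.
    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, os.WCONTINUED)
    events.append("continued" if os.WIFCONTINUED(status) else "other")

    # Give it a moment, then check without blocking: it has not exited.
    time.sleep(0.1)
    reaped, _ = os.waitpid(pid, os.WNOHANG)
    events.append("still running" if reaped == 0 else "exited")

    # Only an explicit kill makes it go away so the parent can reap it.
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    killed = os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL
    events.append("killed" if killed else "other")

    return events


if __name__ == "__main__":
    print(stop_continue_kill())
```

On a Unix system this should report the child as stopped, continued,
still running, and finally killed, which is the same sequence as the
manual test described above: the parent only sees the child go away
once something actually kills it.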

So I don't see any bug here.  And, after closer inspection, your
previously proposed patch is quite bogus because the checkpointer is not
supposed to stop yet when the other processes are being terminated
normally.

Possibly it'd be useful to teach the postmaster more thoroughly about
SIGSTOP and have a way for it to really kill the remaining children
after you've finished investigating their state.  But frankly this
is the first time I've heard of anybody using that feature at all;
I always thought it was a vestigial hangover from days when the kernel
was too stupid to write separate core dump files for each backend.
I'd rather remove SendStop than add more complexity there.
        regards, tom lane

