Checkpoint process signal handling seems wrong - Mailing list pgsql-hackers

From	Tom Lane
Subject	Checkpoint process signal handling seems wrong
Date	March 8, 2001 13:34:54
Msg-id	28179.984076478@sss.pgh.pa.us
List	pgsql-hackers

I am currently looking at a frozen system: a backend crashed during XLOG
write (which I was deliberately provoking, via running it out of disk
space), and now the postmaster is unable to recover because it's waiting
around for a checkpoint process that it had launched milliseconds before
the crash.  The checkpoint process, unfortunately, is not going to quit
anytime soon because it's hung up trying to get a spinlock that the
crashing backend left locked.

Eventually the checkpoint process will time out the spinlock and abort
(but please note that this is true only because I insisted --- Vadim
wanted to have infinite timeouts on the WAL spinlocks.  I think this is
good evidence that that's a bad idea).  However, while sitting here
looking at it I can't help wondering whether the checkpoint process
shouldn't have responded to the SIGTERM that the postmaster sent it
when the other backend crashed.
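
To make the timeout point concrete, here is a minimal sketch of a spinlock acquire that gives up after a bounded number of attempts instead of spinning forever. It is not the actual PostgreSQL spinlock code: the type demo_spinlock, the demo_spin_* functions, and the max_tries parameter are all hypothetical names, and it assumes a C11 compiler with <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration; not PostgreSQL's slock_t. */
typedef struct
{
    atomic_flag held;
} demo_spinlock;

/* Put the lock into the unlocked state. */
static void
demo_spin_init(demo_spinlock *lock)
{
    atomic_flag_clear(&lock->held);
}

/*
 * Try to take the lock at most max_tries times.
 * Returns true if the lock was acquired, false if every attempt failed,
 * which lets the caller abort instead of waiting forever on a lock whose
 * owner has died.
 */
static bool
demo_spin_acquire(demo_spinlock *lock, int max_tries)
{
    for (int i = 0; i < max_tries; i++)
    {
        /* test_and_set returns the previous value: false means we got it */
        if (!atomic_flag_test_and_set_explicit(&lock->held,
                                               memory_order_acquire))
            return true;
    }
    return false;               /* timed out */
}

/* Release a lock previously taken with demo_spin_acquire. */
static void
demo_spin_release(demo_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

With an infinite timeout the loop would have no exit when the holder crashes without releasing; the bounded version at least turns a permanent hang into an error the caller can act on. A real implementation would presumably also back off or sleep between attempts rather than spin tightly.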

Is it really such a good idea for the checkpoint process to ignore
SIGTERM?
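
For comparison, a process that honors SIGTERM rather than ignoring it could look roughly like the sketch below. This is only an illustration under assumptions, not the checkpoint process's real code: demo_shutdown_requested, demo_on_sigterm, demo_install_sigterm_handler, and demo_wait_interrupted are hypothetical names, and the wait loop stands in for whatever the process is blocked on.

```c
#include <signal.h>
#include <stdbool.h>

/* Set by the handler; volatile sig_atomic_t is safe to write from one. */
static volatile sig_atomic_t demo_shutdown_requested = 0;

/* Hypothetical SIGTERM handler: only records that shutdown was requested. */
static void
demo_on_sigterm(int signo)
{
    (void) signo;
    demo_shutdown_requested = 1;
}

/* Install the handler instead of SIG_IGN. Returns true on success. */
static bool
demo_install_sigterm_handler(void)
{
    return signal(SIGTERM, demo_on_sigterm) != SIG_ERR;
}

/*
 * Hypothetical wait loop standing in for "retry the lock".
 * Returns true if it stopped because SIGTERM arrived,
 * false if it used up max_polls without being told to quit.
 */
static bool
demo_wait_interrupted(int max_polls)
{
    for (int i = 0; i < max_polls; i++)
    {
        if (demo_shutdown_requested)
            return true;        /* give up promptly so the parent can recover */
        /* a real loop would retry the lock acquisition here */
    }
    return false;
}
```

The design choice being illustrated is that the handler does nothing but set a flag, and the blocking loop checks that flag on each iteration, so a termination request from the parent is noticed even while the process is stuck waiting.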

While we're at it: is it really such a good idea to use elog(STOP)
all over the place in the WAL stuff?  If XLogFileInit had chosen
to exit with elog(FATAL), then we would have released the spinlock
on the way out of the failing backend, and the checkpointer wouldn't
be stuck.
        regards, tom lane
