Thread: A note about SIGTERM illusion and reality
If you've paid any attention to our signal handling and shutdown procedures, you know that there is a whole lot of design and logic based on the assumption that during a forced system shutdown, we will see SIGTERM delivered to all PG processes, with a little bit of grace time before we get SIGKILL'ed.

I had occasion to test this yesterday (thank you, Duquesne Light) and couldn't help noticing a lack of expected behavior in the postmaster log after the lights came back on. The database recovered fine, but it had to recover --- there was no orderly shutdown as intended, and the only indication that the postmaster had any notice at all was a log entry about a SIGKILL on the walwriter process.

After digging in the man page for init(8) on a couple of machines, I realized that the SIGTERM-then-SIGKILL behavior only applies to processes that are launched directly by init. Now, I can recall having started the postmaster from an inittab entry on a few systems I maintained years ago, but it's certainly not been the recommended practice for a long time --- I think all modern distributions use SysV init scripts or something comparable. The inittab idea still has some attraction because it guarantees automatic restart if the postmaster dies ... but it's been a long time since that was a big hazard.

Even more bit-rot in the concept: init will only SIGTERM its direct child and members of that child's process group. Not too long ago we made most of the postmaster children do setsid() to create their own process groups, so even if you did launch the postmaster from an inittab entry, things wouldn't work as intended.

I have no idea what (if anything) we should do about this; but it seems clear that there's some design thinking that could stand to be revisited.

regards, tom lane
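To make the process-group point concrete, here is a minimal sketch in Python (not PostgreSQL code; the `child_pgid` helper is hypothetical and exists only for illustration). It forks a child, optionally has it call setsid(), and reports the process group the child ends up in. A child that calls setsid() leaves its parent's group, so a signal aimed at that group would not reach it.

```python
import os


def child_pgid(call_setsid: bool) -> int:
    """Fork a child, optionally call setsid() in it, and return the
    child's process group id as seen from inside the child."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report our process group to the parent, then exit
        # immediately without running any parent cleanup code.
        os.close(read_fd)
        if call_setsid:
            os.setsid()  # new session and new process group
        os.write(write_fd, str(os.getpgrp()).encode())
        os._exit(0)
    # Parent: read until the child exits and its end of the pipe closes.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return int(data)


if __name__ == "__main__":
    print("parent group:       ", os.getpgrp())
    print("child, no setsid(): ", child_pgid(False))
    print("child, with setsid():", child_pgid(True))
```

Without setsid() the child reports the parent's process group; with setsid() it reports a different one, which is the situation described above for most postmaster children.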
I wrote:
> After digging in the man page for init(8) on a couple of machines,
> I realized that the SIGTERM-then-SIGKILL behavior only applies to
> processes that are launched directly by init.

Actually, after further experimentation, it seems this is very platform-dependent.

Current Linux (tested on Fedora 8) actually does behave in the way our code expects, ie, every process gets a SIGTERM. The default SIGTERM-to-SIGKILL delay is only 5 seconds, which is likely not enough for a checkpoint in a busy database, but at least we tried :-(.

Mac OS X seems to issue everything SIGQUIT, instead of SIGTERM. I didn't experiment to see how much grace period there might be. No idea about other BSDen, though OS X's behavior might be typical.

HPUX seems to go straight to SIGKILL.

So the bottom line is that the best bet is to rely on an initscript's stop command to issue "pg_ctl stop -m fast", which will give us enough time to perform a clean shutdown. The other behavior is only of interest for databases that aren't controlled by an initscript known to the system. But there does seem to be some value in our designed response to SIGTERM, on at least one popular platform, so I no longer feel a need to rethink that.

regards, tom lane
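The effect of a short SIGTERM-to-SIGKILL window can be illustrated with a small sketch in Python. This is not how init or the postmaster is implemented: the `stop_with_grace` helper, the `CHILD` script, and the timings are all hypothetical, and the child's sleep merely stands in for a shutdown checkpoint. The parent sends SIGTERM, waits for a grace period, and falls back to SIGKILL if the child has not finished by then.

```python
import signal
import subprocess
import sys

# Hypothetical child: on SIGTERM it does some "shutdown work" (a sleep
# standing in for a checkpoint) and then exits cleanly with status 0.
CHILD = r"""
import signal, sys, time

def on_sigterm(signum, frame):
    time.sleep(float(sys.argv[1]))
    sys.exit(0)

signal.signal(signal.SIGTERM, on_sigterm)
print("ready", flush=True)
while True:
    time.sleep(60)
"""


def stop_with_grace(shutdown_seconds: float, grace_seconds: float) -> int:
    """Send SIGTERM, wait up to grace_seconds, then SIGKILL.

    Returns the child's return code: 0 for a clean exit, or the
    negated signal number if the child was killed by a signal.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(shutdown_seconds)],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # block until the SIGTERM handler is installed
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # grace period ran out: SIGKILL, no cleanup possible
        proc.wait()
    proc.stdout.close()
    return proc.returncode


if __name__ == "__main__":
    print("enough grace time:", stop_with_grace(0.1, 5.0))
    print("too little grace: ", stop_with_grace(5.0, 0.3))
```

When the shutdown work fits inside the grace period the child exits on its own terms; when it does not, the child is killed mid-shutdown, which corresponds to the case above where the database has to run recovery on the next start.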