Thread: Is *that* why debugging backend startup is so hard!?
I just spent a rather frustrating hour trying to debug a backend startup failure --- and getting nowhere because I couldn't catch the failure in a debugger, or even step to where I thought it might be. I've seen this sort of difficulty before, and always had to resort to expedients like putting in printf's. But tonight I finally realized what the problem is.

The early stages of startup are run under signal mask BlockSig, which we initialize to include *EVERY SIGNAL* (except SIGUSR1 for some reason). In particular SIGTRAP is blocked, which prevents debugger breakpoints from working. Even sillier, normally-fatal signals like SIGSEGV are blocked. I now know by observation that HPUX, at least, takes this literally: for example, if you've blocked SEGV you don't hear about bus errors, you just keep going. Possibly rather slowly, if every attempted instruction execution causes the hardware to fault to the kernel, but by golly the system will keep trying to run your code.

Needless to say I find this braindead in the extreme. Will anyone object if I change the signal masks so that we never ever block SIGABRT, SIGILL, SIGSEGV, SIGBUS, SIGTRAP, SIGCONT, SIGSYS? Any other candidates? Are there any systems that do not define all of these signal names?

BTW, once I turned this silliness off, I was able to home in on my bug within minutes...

regards, tom lane

PS: The postmaster spends most of its time running under BlockSig too. Good thing we haven't had many postmaster bugs lately.
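For readers who want to see the shape of the proposed change, here is a minimal sketch in C. It is not the actual PostgreSQL code: the function name init_block_mask is hypothetical, and the sketch ignores the SIGUSR1 exception mentioned above. It fills a mask with every signal and then removes the seven signals from the proposal, so that breakpoints and hard faults are still delivered while the mask is in force.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700       /* ask for the POSIX/XSI signal API */
#endif
#include <signal.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the PostgreSQL sources): build a mask that
 * blocks every signal except the ones the proposal says must never be
 * blocked.  Returns 0 on success, -1 if any sigset call fails.
 */
int
init_block_mask(sigset_t *mask)
{
    /* The list proposed in the message above. */
    static const int never_block[] = {
        SIGABRT, SIGILL, SIGSEGV, SIGBUS, SIGTRAP, SIGCONT, SIGSYS
    };
    size_t      i;

    /* Start from "block everything", as the existing mask does. */
    if (sigfillset(mask) != 0)
        return -1;

    /* Then punch holes for the debugger and hard-fault signals. */
    for (i = 0; i < sizeof(never_block) / sizeof(never_block[0]); i++)
    {
        if (sigdelset(mask, never_block[i]) != 0)
            return -1;
    }
    return 0;
}
```

A caller would presumably install the result with sigprocmask(SIG_SETMASK, &mask, NULL) at the point where the old all-inclusive mask was applied.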
> Needless to say I find this braindead in the extreme. Will anyone
> object if I change the signal masks so that we never ever block
> SIGABRT, SIGILL, SIGSEGV, SIGBUS, SIGTRAP, SIGCONT, SIGSYS? Any
> other candidates? Are there any systems that do not define all
> of these signal names?
>
> BTW, once I turned this silliness off, I was able to home in on
> my bug within minutes...

Go ahead. The current setup sounds very broken. Why do they even bother doing all this? Seems we should identify the signals we want to block, and just block those.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Needless to say I find this braindead in the extreme.

Wow, definitely braindead. Trapping some of them on systems that can programmatically generate a stack backtrace might be useful -- it would help reporting what happened. Blocking them and continuing seems about the most dangerous thing that could be done; if we've just got SIGSEGV or similar, the code is confused and isn't to be trusted to safely modify data!

> Will anyone object if I change the signal masks so that we never
> ever block SIGABRT, SIGILL, SIGSEGV, SIGBUS, SIGTRAP, SIGCONT,
> SIGSYS? Any other candidates? Are there any systems that do not
> define all of these signal names?

I'd expect these everywhere; certainly they're all defined in the "Single Unix Specification, version 2". Some of them don't exist in ANSI C, if that matters. Usually it's easy enough to wrap code that cares in

    #ifdef SIGABRT
    ...
    #endif

so when/if a platform shows up that lacks one or more it's easy to fix.

Potential additions to your list:

    SIGFPE
    SIGSTOP (can't be blocked)

Regards,

Giles
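The #ifdef wrapping suggested above might look roughly like the following sketch. The names guarded_signals and unblock_guarded_signals are hypothetical, chosen only for illustration. Each signal name is guarded individually, so a platform that lacks one simply drops that entry. SIGFPE is included as suggested; SIGSTOP is left out because the system never allows it to be blocked in the first place.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700       /* ask for the POSIX/XSI signal API */
#endif
#include <signal.h>

/*
 * Hypothetical table of signals that should never be blocked.  Every name
 * is wrapped in its own #ifdef so the file still compiles on a platform
 * that does not define it.
 */
static const int guarded_signals[] = {
#ifdef SIGABRT
    SIGABRT,
#endif
#ifdef SIGILL
    SIGILL,
#endif
#ifdef SIGSEGV
    SIGSEGV,
#endif
#ifdef SIGBUS
    SIGBUS,
#endif
#ifdef SIGTRAP
    SIGTRAP,
#endif
#ifdef SIGCONT
    SIGCONT,
#endif
#ifdef SIGSYS
    SIGSYS,
#endif
#ifdef SIGFPE
    SIGFPE,
#endif
    0                           /* sentinel: 0 is never a valid signal number */
};

/*
 * Remove every guarded signal from an existing mask.
 * Returns the number of signals successfully removed.
 */
int
unblock_guarded_signals(sigset_t *mask)
{
    int         removed = 0;
    const int  *sig;

    for (sig = guarded_signals; *sig != 0; sig++)
    {
        if (sigdelset(mask, *sig) == 0)
            removed++;
    }
    return removed;
}
```

Keeping the list in one table means a port to a platform with a missing signal name needs no code change at all, only the guard that is already there.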
Tom Lane writes:
> I just spent a rather frustrating hour trying to debug a backend startup
> failure --- and getting nowhere because I couldn't catch the failure in
> a debugger, or even step to where I thought it might be. I've seen this
> sort of difficulty before, and always had to resort to expedients like
> putting in printf's. But tonight I finally realized what the problem is.

Could that be contributing to the Heisenbug I described on Sunday in "Pid file magically disappears"?

--
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> I just spent a rather frustrating hour trying to debug a backend startup
>> failure --- and getting nowhere because I couldn't catch the failure in
>> a debugger, or even step to where I thought it might be. I've seen this
>> sort of difficulty before, and always had to resort to expedients like
>> putting in printf's. But tonight I finally realized what the problem is.

> Could that be contributing to the Heisenbug I described on Sunday in "Pid
> file magically disappears"?

Hm. Maybe. I haven't tried to reproduce the pid-file issue here (I'm up to my eyebrows in memmgr at the moment). But the blocking of SEGV and friends could certainly lead to some odd behavior, due to code plowing on after getting an error that should have crashed it. Depending on how robust your local implementation of abort(3) is, it's even possible that the code would fall through a failed Assert() test...

regards, tom lane