Thread: unexpected SIGALRM
Hi all,

I've examined a cygwin hangup issue for a pretty long time.
It seems there are plural causes though it's hard for me to
identify them all.

Anyway I found some unexpected SIGALRM cases.
It may be caused by a cygwin's bug but isn't it safer to
return immediately from HandleDeadLock in any platform
unless the backend is waiting for a lock ?

The following is a back trace I saw very luckily.

#0  0x77f827e8 in _libkernel32_a_iname ()
#1  0x77e56a15 in _libkernel32_a_iname ()
#2  0x77e56a3d in _libkernel32_a_iname ()
#3  0x00587535 in semop ()
#4  0x005086b0 in IpcSemaphoreLock (semId=2688, sem=1, interruptOK=0 '\000')
    at ipc.c:422
#5  0x005114f7 in LWLockAcquire (lockid=LockMgrLock, mode=LW_EXCLUSIVE)
    at lwlock.c:272
#6  0x005101ba in HandleDeadLock (postgres_signal_arg=14) at proc.c:862
#7  0x6100fb63 in _libkernel32_a_iname ()
#8  0x0058650d in do_semop ()
#9  0x00587535 in semop ()
#10 0x005086b0 in IpcSemaphoreLock (semId=2688, sem=1, interruptOK=0 '\000')
    at ipc.c:422
#11 0x005114f7 in LWLockAcquire (lockid=LockMgrLock, mode=LW_EXCLUSIVE)
    at lwlock.c:272
#12 0x0050e60b in LockRelease (lockmethod=1, locktag=0x22f338, xid=17826,
    lockmode=1) at lock.c:1018
#13 0x0050c8f5 in UnlockRelation (relation=0xa06c168, lockmode=1)
    at lmgr.c:217
#14 0x0041d29e in index_endscan (scan=0xa08a8f8) at indexam.c:288
#15 0x004ae558 in ExecCloseR (node=0xa089a88) at execAmi.c:232
#16 0x004b8de2 in ExecEndIndexScan (node=0xa089a88) at nodeIndexscan.c:474
#17 0x004b1e41 in ExecEndNode (node=0xa089a88, parent=0x0)
    at execProcnode.c:495

regards,
Hiroshi Inoue

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > Anyway I found some unexpected SIGALRM cases. > It may be caused by a cygwin's bug but isn't it safer to > return immediately from HandleDeadLock in any platform > unless the backend is waiting for a lock ? If we can't rely on the signal handling facilities to interrupt only when they're supposed to, I think HandleDeadlock is the least of our worries :-(. I'm not excited about inserting an ad-hoc test to work around (only) one manifestation of a system-level bug. regards, tom lane
Tom Lane wrote:
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > Anyway I found some unexpected SIGALRM cases.
> > It may be caused by a cygwin's bug but isn't it safer to
> > return immediately from HandleDeadLock in any platform
> > unless the backend is waiting for a lock ?
>
> If we can't rely on the signal handling facilities to interrupt only
> when they're supposed to, I think HandleDeadlock is the least of our
> worries :-(.

I'm not sure if it's a cygwin issue.
Isn't it preferable for a dbms to be insensitive to
other (e.g. OS's) bugs anyway ?
Or how about blocking SIGALRM signals except when
the backend is waiting for a lock ? It seems a better
fix because it would also fix another issue.

> I'm not excited about inserting an ad-hoc test to work
> around (only) one manifestation of a system-level bug.

OK so cygwin isn't considered as a supported platform ?

regards,
Hiroshi Inoue

Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Tom Lane wrote:
>> I'm not excited about inserting an ad-hoc test to work
>> around (only) one manifestation of a system-level bug.

> OK so cygwin isn't considered as a supported platform ?

I don't consider it our responsibility to work around cygwin bugs,
as opposed to reporting said bugs and expecting the cygwin folk to
fix 'em.  If the cost of such a workaround is minimal, then I'd be
willing to consider it; but in this case, you're talking about adding
another pair of kernel calls to every lock blockage.  That seems
nontrivial.

But the more important argument is this: if cygwin contains a bug
that allows it to fire interrupts when it should not, how much
improvement do we really get from plugging this one hole?  Surely
there are other places that will have similar problems.  For that
matter, how can you be sure that adding a sigsetmask call will
prevent it from firing the interrupt --- how is that any more secure
than setitimer?

I'd say the correct course of action is to report the problem to the
cygwin people first, and ask them whether a user-level workaround is
possible/useful.

regards, tom lane

Tom Lane wrote:
>
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > Tom Lane wrote:
> >> I'm not excited about inserting an ad-hoc test to work
> >> around (only) one manifestation of a system-level bug.
>
> > OK so cygwin isn't considered as a supported platform ?
>
> I don't consider it our responsibility to work around cygwin bugs,
> as opposed to reporting said bugs and expecting the cygwin folk to
> fix 'em.

OK I would leave as it is. I've already wasted a lot of time.
It has been 3 months since a pgbench hangup problem was
reported by Yutaka Tanida.

regards,
Hiroshi Inoue