Re: spoonbill vs. -HEAD - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: spoonbill vs. -HEAD |
Date | |
Msg-id | 10127.1364940110@sss.pgh.pa.us |
In response to | Re: spoonbill vs. -HEAD (Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>) |
Responses | Re: spoonbill vs. -HEAD |
List | pgsql-hackers |
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> On 03/26/2013 11:30 PM, Tom Lane wrote:
>> A different line of thought is that the cancel was received by the
>> backend but didn't succeed in cancelling the query for some reason.

> I added the "pgcancel failed" codepath you suggested but it does not
> seem to get triggered at all so the above might actually be what is
> happening...

Stefan was kind enough to grant me access to spoonbill, and after some
experimentation I found out the problem. It seems that OpenBSD blocks
additional deliveries of a signal while the signal handler is in
progress, and that this is implemented by just calling sigprocmask()
before and after calling the handler. Therefore, if the handler doesn't
return normally --- like, say, it longjmps --- the restoration of the
previous mask never happens. So we're left with the signal still
blocked, meaning second and subsequent attempts to interrupt the backend
don't work.

This isn't revealed by the regular regression tests because they don't
exercise PQcancel, but several recently-added isolation tests do attempt
to PQcancel the same backend more than once.

It's a bit surprising that it's taken us this long to recognize the
problem. Typical use of PQcancel doesn't necessarily cause a failure:
StatementCancelHandler() won't exit through longjmp unless
ImmediateInterruptOK is true, which is only the case while waiting for a
heavyweight lock. But still, you'd think somebody would've run into the
case in normal usage.

I think the simplest fix is to insert "PG_SETMASK(&UnBlockSig)" into
StatementCancelHandler() and any other handlers that might exit via
longjmp. I'm a bit inclined to only do this on platforms where a
problem is demonstrable, which so far is only OpenBSD. (You'd think
that all BSDen would have the same issue, but the buildfarm shows
otherwise.)

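The failure mode and the proposed fix can be sketched in a small
standalone C fragment. This is only an illustration under stated
assumptions, not PostgreSQL code: cancel_handler(), signal_left_blocked()
and the use of SIGUSR1 are hypothetical stand-ins for
StatementCancelHandler() and the cancel signal. sigsetjmp() is called
with savemask = 0 so that the jump does not restore the signal mask on
any POSIX platform, which mimics the behavior reported here for OpenBSD.
The explicit sigprocmask(SIG_UNBLOCK, ...) call plays the role of the
suggested PG_SETMASK(&UnBlockSig).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical names; nothing here comes from the PostgreSQL sources. */
static sigjmp_buf jump_target;
static volatile sig_atomic_t unblock_in_handler = 0;

/* Stand-in for a handler that exits via longjmp instead of returning. */
static void
cancel_handler(int signo)
{
    if (unblock_in_handler)
    {
        /* The fix: unblock the signal ourselves before jumping out. */
        sigset_t set;

        sigemptyset(&set);
        sigaddset(&set, signo);
        sigprocmask(SIG_UNBLOCK, &set, NULL);
    }
    siglongjmp(jump_target, 1);
}

/*
 * Deliver SIGUSR1 once, leave the handler via siglongjmp, and report
 * whether SIGUSR1 is still blocked afterwards.
 */
static bool
signal_left_blocked(bool unblock)
{
    struct sigaction sa, old_sa;
    sigset_t set, old_mask, cur;
    bool blocked;

    unblock_in_handler = unblock;

    /* Start with SIGUSR1 unblocked, remembering the caller's mask. */
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_UNBLOCK, &set, &old_mask);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = cancel_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;            /* no SA_NODEFER: signal is blocked in handler */
    sigaction(SIGUSR1, &sa, &old_sa);

    /* savemask = 0: siglongjmp will not restore the signal mask. */
    if (sigsetjmp(jump_target, 0) == 0)
        raise(SIGUSR1);         /* handler runs and jumps back here */

    /* Query the current mask without changing it. */
    sigprocmask(SIG_SETMASK, NULL, &cur);
    blocked = (sigismember(&cur, SIGUSR1) == 1);

    /* Put the caller's handler and mask back. */
    sigaction(SIGUSR1, &old_sa, NULL);
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    return blocked;
}
```

Calling signal_left_blocked(false) should report the signal as still
blocked after the jump, so a second raise(SIGUSR1) would not reach the
handler; signal_left_blocked(true) should report it as unblocked.
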
BTW, this does not seem to explain the symptoms shown at
http://www.postgresql.org/message-id/4FE4D89A.8020002@kaltenbrunner.cc
because what we were seeing there was that *all* signals appeared to be
blocked. However, after this round of debugging I no longer have a lot
of faith in OpenBSD's ps, because it was lying to me about whether the
process had signals blocked or not (or at least, it couldn't see the
effects of the interrupt signal disable, although when I added debugging
code to print the active signal mask according to sigprocmask() I got
told the truth). So I'm not sure how much trust to put in those older
ps results. It's possible that the previous failures were a
manifestation of something related to this bug.

			regards, tom lane
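
Debugging code of the kind mentioned above, which asks sigprocmask()
directly instead of trusting ps, might look roughly like the following.
This is a guess at the general shape, not the code that was actually
used: signal_is_blocked() and print_blocked_signals() are hypothetical
helper names, and the upper bound on signal numbers is left to the
caller as an assumption.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <signal.h>
#include <stdio.h>

/*
 * Return 1 if signo is blocked in the calling thread's signal mask,
 * 0 if it is not, and -1 if the mask could not be read.
 */
static int
signal_is_blocked(int signo)
{
    sigset_t cur;

    /* A NULL "set" argument leaves the mask unchanged and just reads it. */
    if (sigprocmask(SIG_SETMASK, NULL, &cur) != 0)
        return -1;
    return sigismember(&cur, signo) == 1 ? 1 : 0;
}

/* Print the numbers of all blocked signals in the range 1..max_signo. */
static void
print_blocked_signals(FILE *out, int max_signo)
{
    fprintf(out, "blocked signals:");
    for (int signo = 1; signo <= max_signo; signo++)
    {
        if (signal_is_blocked(signo) == 1)
            fprintf(out, " %d", signo);
    }
    fprintf(out, "\n");
}
```

For example, print_blocked_signals(stderr, 31) called right after a
handler has exited via longjmp would show whether the cancel signal is
still in the mask.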