Thread: SIGFPE handler is naive
A member of our technical team brought it to my attention that if you have a recalcitrant backend that doesn't die when you send it a SIGTERM, and you don't want to cause a crash-and-restart cycle by sending it SIGQUIT, you can have your cake and eat it too by sending it SIGFPE, which will cheerfully throw an error from wherever inside the backend you happen to be, whether it is safe or not. I suppose we had it in mind when this was coded that SIGFPE would only be generated by the system rather than by user activity, and it is a pretty neat trick for working around the lack of CHECK_FOR_INTERRUPTS() in some place where you really wish there were one, but the possibility of setting yourself on fire also seems rather high. Since the person who made the discovery realized the utility of the trick but not the fact that it might destabilize the backend, it seems that the surface area for self-destruction is non-zero. Should we do something to plug this, and if so, what? If not, should we document the danger? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 13, 2012 at 01:04:58PM -0400, Robert Haas wrote: > A member of our technical team brought it to my attention that if you > have a recalcitrant backend that doesn't die when you send it a > SIGTERM, and you don't want to cause a crash-and-restart cycle by > sending it SIGQUIT, you can have your cake and eat it too by sending > it SIGFPE, which will cheerfully throw an error from wherever inside > the backend you happen to be, whether it is safe or not. I suppose we > had it in mind when this was coded that SIGFPE would only be generated > by the system rather than by user activity, and it is a pretty neat > trick for working around the lack of CHECK_FOR_INTERRUPTS() in some > place where you really wish there were one, but the possibility of > setting yourself on fire also seems rather high. Since the person who > made the discovery realized the utility of the trick but not the fact > that it might destabilize the backend, it seems that the surface area > for self-destruction is non-zero. > > Should we do something to plug this, and if so, what? If not, should > we document the danger? If this use of SIGFPE is handy, we should expose it under a better name. What hazards make it unsafe? In a critical section, it escalates to a PANIC anyway. In third-party library code, the library state may become corrupt. In backend code skipping a PG_TRY/PG_CATCH for sections that "cannot" throw errors, cleanup will never happen. That's plenty of danger, but I can sympathize with folks wanting to take the risk and use SIGFPE this way. Overall, though, I think it best to plug this. We could set a flag before each operation, like evaluation of SQL arithmetic, for which SIGFPE is normal. If the signal handler sees the flag set, raise ERROR. Otherwise, PANIC. Code running with the flag set would, of course, need to be ready for a spontaneous elog(ERROR) at any instruction. Thanks, nm
On Mon, Aug 13, 2012 at 10:14 PM, Noah Misch <noah@leadboat.com> wrote: > If this use of SIGFPE is handy, we should expose it under a better name. What > hazards make it unsafe? Well, the most obvious problem is that a backend might receive it while holding a spinlock. > Overall, though, I think it best to plug this. We could set a flag before > each operation, like evaluation of SQL arithmetic, for which SIGFPE is normal. > If the signal handler sees the flag set, raise ERROR. Otherwise, PANIC. Code > running with the flag set would, of course, need to be ready for a spontaneous > elog(ERROR) at any instruction. Yeah, that's what I thought of, too. It seems like it'd be a lot of work to get there, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Aug 13, 2012 at 10:14 PM, Noah Misch <noah@leadboat.com> wrote: >> Overall, though, I think it best to plug this. We could set a flag before >> each operation, like evaluation of SQL arithmetic, for which SIGFPE is normal. > Yeah, that's what I thought of, too. It seems like it'd be a lot of > work to get there, though. That would depend on how many places there are where SIGFPE is expected. Are we sure there are any? Maybe we should just remove the handler and let SIGFPE be treated as a core dump. regards, tom lane
> Should we do something to plug this, and if so, what? If not, should > we document the danger? I am not sure if I really understood the intention of the question correctly, but if the question was if pg should try to work around misuse of signals, then my answer would be a definite no. IMHO, the signal handler should check if the signal was received for good reasons (as proposed by Noah) and handle it appropriately, but otherwise ignore it. Nils
It is possible to check if the signal was synchronous or was sent from an external process. You can check siginfo->si_pid to see who sent you the signal. I'm not sure checking that and handling it at check_for_interrupts if it's asynchronous is the best solution or not though. I'm a bit confused. Didn't Tom do the laborious process of checking the whole source tree for situations where there's shared memory cleanup to be done in and arrange for it to happen? That was the blocking factor to get pg_cancel_backend() to work. Is the problem that the sigfpe handler doesn't invoke atexit() handlers? -- greg
On Mon, Aug 13, 2012 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Aug 13, 2012 at 10:14 PM, Noah Misch <noah@leadboat.com> wrote: >>> Overall, though, I think it best to plug this. We could set a flag before >>> each operation, like evaluation of SQL arithmetic, for which SIGFPE is normal. > >> Yeah, that's what I thought of, too. It seems like it'd be a lot of >> work to get there, though. > > That would depend on how many places there are where SIGFPE is expected. > Are we sure there are any? Maybe we should just remove the handler and > let SIGFPE be treated as a core dump. No clue. According to Wikipedia, it is commonly caused by dividing by zero, or by dividing by INT_MIN by -1, resulting in a quotient out of range for the type. I'd be willing to bet that we have got all the division-by-zero cases patched up just because we try pretty hard to emit the right error message for such cases, but I'm a lot less certain that things like INT_MIN/-1 can't happen anywhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Aug 14, 2012 at 6:50 AM, Greg Stark <stark@mit.edu> wrote: > It is possible to check if the signal was synchronous or was sent from > an external process. You can check siginfo->si_pid to see who sent you > the signal. I'm not sure checking that and handling it at > check_for_interrupts if it's asynchronous is the best solution or not > though. If that's portable it might be an option, but I doubt that it is. > I'm a bit confused. Didn't Tom do the laborious process of checking > the whole source tree for situations where there's shared memory > cleanup to be done in and arrange for it to happen? That was the > blocking factor to get pg_cancel_backend() to work. Is the problem > that the sigfpe handler doesn't invoke atexit() handlers? No, the problem is that SIGFPE throws an error *from the signal handler* rather than waiting for ProcessInterrupts(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 13, 2012 at 11:52:06PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Mon, Aug 13, 2012 at 10:14 PM, Noah Misch <noah@leadboat.com> wrote: > >> Overall, though, I think it best to plug this. We could set a flag before > >> each operation, like evaluation of SQL arithmetic, for which SIGFPE is normal. > > > Yeah, that's what I thought of, too. It seems like it'd be a lot of > > work to get there, though. > > That would depend on how many places there are where SIGFPE is expected. > Are we sure there are any? Maybe we should just remove the handler and > let SIGFPE be treated as a core dump. > > regards, tom lane > Wouldn't any user level divide-by-zero code cause a SIGFPE? Regards, Ken
On Tue, Aug 14, 2012 at 08:38:44AM -0400, Robert Haas wrote: > On Mon, Aug 13, 2012 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > That would depend on how many places there are where SIGFPE is expected. > > Are we sure there are any? Maybe we should just remove the handler and > > let SIGFPE be treated as a core dump. > > No clue. According to Wikipedia, it is commonly caused by dividing by > zero, or by dividing by INT_MIN by -1, resulting in a quotient out of > range for the type. I'd be willing to bet that we have got all the > division-by-zero cases patched up just because we try pretty hard to > emit the right error message for such cases, but I'm a lot less > certain that things like INT_MIN/-1 can't happen anywhere. [local] test=# select -9223372036854775808/-1; ERROR: floating-point exception
On Tue, Aug 14, 2012 at 8:55 AM, ktm@rice.edu <ktm@rice.edu> wrote: > On Mon, Aug 13, 2012 at 11:52:06PM -0400, Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >> > On Mon, Aug 13, 2012 at 10:14 PM, Noah Misch <noah@leadboat.com> wrote: >> >> Overall, though, I think it best to plug this. We could set a flag before >> >> each operation, like evaluation of SQL arithmetic, for which SIGFPE is normal. >> >> > Yeah, that's what I thought of, too. It seems like it'd be a lot of >> > work to get there, though. >> >> That would depend on how many places there are where SIGFPE is expected. >> Are we sure there are any? Maybe we should just remove the handler and >> let SIGFPE be treated as a core dump. > > Wouldn't any user level divide-by-zero code cause a SIGFPE? If it's written in C, sure. If it's written in SQL, no, because we check for that inside int4div et all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Noah Misch <noah@leadboat.com> writes: > On Tue, Aug 14, 2012 at 08:38:44AM -0400, Robert Haas wrote: >> On Mon, Aug 13, 2012 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> That would depend on how many places there are where SIGFPE is expected. >>> Are we sure there are any? Maybe we should just remove the handler and >>> let SIGFPE be treated as a core dump. >> No clue. According to Wikipedia, it is commonly caused by dividing by >> zero, or by dividing by INT_MIN by -1, resulting in a quotient out of >> range for the type. I'd be willing to bet that we have got all the >> division-by-zero cases patched up just because we try pretty hard to >> emit the right error message for such cases, but I'm a lot less >> certain that things like INT_MIN/-1 can't happen anywhere. > [local] test=# select -9223372036854775808/-1; > ERROR: floating-point exception On reflection I think we should just leave this alone. If we try to tighten it up, what we will mainly accomplish is to make the system less robust, not more, because any place we miss then represents an easily reproducible DOS attack. If somebody's dumb enough to think that SIGFPE'ing a backend represents a supported way of canceling a query, that's his problem not ours. We don't need to expend large amounts of effort on being sure we slap his hand. regards, tom lane
On Tue, Aug 14, 2012 at 08:40:06AM -0400, Robert Haas wrote: > On Tue, Aug 14, 2012 at 6:50 AM, Greg Stark <stark@mit.edu> wrote: > > It is possible to check if the signal was synchronous or was sent from > > an external process. You can check siginfo->si_pid to see who sent you > > the signal. I'm not sure checking that and handling it at > > check_for_interrupts if it's asynchronous is the best solution or not > > though. > > If that's portable it might be an option, but I doubt that it is. I suspect it is portable. Nonetheless, kill() is not the only SIGFPE source that ought to produce a PANIC. Library code might trigger the signal, at which point we cannot assume that elog(ERROR) will leave an acceptable system state. To call this fixed, we need a whitelist of safe sources, not a blacklist of bogus sources. That said, I agree that the effort and risk may be out of proportion.
Noah Misch <noah@leadboat.com> writes: > On Tue, Aug 14, 2012 at 08:40:06AM -0400, Robert Haas wrote: >> On Tue, Aug 14, 2012 at 6:50 AM, Greg Stark <stark@mit.edu> wrote: >>> It is possible to check if the signal was synchronous or was sent from >>> an external process. You can check siginfo->si_pid to see who sent you >>> the signal. I'm not sure checking that and handling it at >>> check_for_interrupts if it's asynchronous is the best solution or not >>> though. >> If that's portable it might be an option, but I doubt that it is. > I suspect it is portable. Single Unix Spec V2 says (on the sigaction man page) The si_code member contains a code identifying the cause of thesignal. If the value of si_code is less than or equal to 0,then thesignal was generated by a process and si_pid and si_uid respectivelyindicate the process ID and the real user IDof the sender. I'm not sure I would trust checking si_pid alone; it would definitely fail on my old HPUX box, where I see that field is union'ed with si_addr and so will read as garbage for a locally-sourced SIGFPE. But it might be that checking si_code alone would work reliably. I think that rejecting an externally sourced SIGFPE might be worth doing if we can distinguish that reliably. regards, tom lane
On Tue, Aug 14, 2012 at 5:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Noah Misch <noah@leadboat.com> writes: >> On Tue, Aug 14, 2012 at 08:40:06AM -0400, Robert Haas wrote: >>> On Tue, Aug 14, 2012 at 6:50 AM, Greg Stark <stark@mit.edu> wrote: >>>> It is possible to check if the signal was synchronous or was sent from >>>> an external process. You can check siginfo->si_pid to see who sent you >>>> the signal. I'm not sure checking that and handling it at >>>> check_for_interrupts if it's asynchronous is the best solution or not >>>> though. > >>> If that's portable it might be an option, but I doubt that it is. > >> I suspect it is portable. > > Single Unix Spec V2 says (on the sigaction man page) > > The si_code member contains a code identifying the cause of the > signal. If the value of si_code is less than or equal to 0, then the > signal was generated by a process and si_pid and si_uid respectively > indicate the process ID and the real user ID of the sender. > > I'm not sure I would trust checking si_pid alone; it would definitely > fail on my old HPUX box, where I see that field is union'ed with si_addr > and so will read as garbage for a locally-sourced SIGFPE. But it might > be that checking si_code alone would work reliably. > > I think that rejecting an externally sourced SIGFPE might be worth doing > if we can distinguish that reliably. Currently, our signal handlers on all platforms are declared to take just an int. We'd need to change that in order to try to do this. Any thoughts on how we could go about that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company