Thread: signal handling in plpython

signal handling in plpython

From
Mario De Frutos Dieguez
Date:
Hello everyone :).

First of all, I want to introduce myself to the list. My name is Mario de Frutos and I work at CARTO :)

I come here asking for some advice/help because we're facing some unexpected behavior when we try to interrupt functions doing CPU-intensive operations in plpython.

Our problem is that we're not able to interrupt them while they're doing CPU-intensive work. For example, when calculating Moran using PySAL, Postgres's SIGINT handler is not able to cancel it.

Here are some possible solutions that I've tried, without success:

- If we don't add a custom signal handler, we can't interrupt the function while it's doing CPU-intensive work. When the `SIGINT` signal is delivered, the system can't interrupt it until the function ends.
- If we add a custom signal handler for `SIGINT`, we can interrupt the CPU-intensive function, but we can't interrupt data-fetching operations like `plpy.execute(query)`, because we have overridden the Postgres handler for that signal.
- As a third option, I added a Python context manager to wrap the CPU-intensive part (the Moran function from PySAL), for testing purposes:
```
import signal
from contextlib import contextmanager


def _signal_handler(signal_code, frame):
    plpy.error('INTERRUPTED BY USER!!')


@contextmanager
def interruptible():
    try:
        signal.signal(signal.SIGINT, _signal_handler)
        yield
    finally:
        # Restore the default behaviour for the signal
        signal.signal(signal.SIGINT, signal.SIG_DFL)
```
  This doesn't work as expected because in the `finally` clause we try to reset to the default behavior, but in Postgres the behavior for the SIGINT signal is defined by a [custom handler](https://github.com/postgres/postgres/blob/master/src/include/tcop/tcopprot.h#L66).
  If we try to retrieve the old handler using `signal.getsignal`, we get `None`, because the Postgres handler wasn't installed from Python.

So, after going back and forth, I came up with two possible solutions:
- Add [custom code] in `plpython` that lets us restore the signal handler after the CPU-intensive function finishes. It seems to work, but I'm still doing some tests. This option lets us call it explicitly from the `finally` part of a decorator/context manager.
- Reset the signal handler at the beginning of `plpy.execute` and similar functions, like [here].

As a bonus, we also want to implement the SIGALRM part, to mimic the "statement timeout" behavior.
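For reference, here is a minimal plain-Python sketch of what that SIGALRM part could look like (all names here are hypothetical, and inside a backend this would clobber Postgres's own SIGALRM handler, the same problem as with SIGINT):

```python
import signal
from contextlib import contextmanager


class StatementTimeout(Exception):
    """Raised when the Python-side timeout expires."""


def _timeout_handler(signum, frame):
    raise StatementTimeout("statement timeout reached")


@contextmanager
def statement_timeout(seconds):
    """Mimic statement_timeout for a block of Python code.

    Arms a one-shot SIGALRM timer; if the block runs longer than
    `seconds`, the handler raises StatementTimeout inside it.
    """
    old_handler = signal.signal(signal.SIGALRM, _timeout_handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)      # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)   # restore previous handler
```

A CPU-bound loop wrapped in `with statement_timeout(0.05):` gets interrupted by the StatementTimeout exception once the timer fires.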

I don't know if there is a better way to implement this. I know we're pushing things beyond the scope of plpython, but any advice is welcome :)

Re: signal handling in plpython

From
Heikki Linnakangas
Date:
On 10/13/2016 08:57 PM, Mario De Frutos Dieguez wrote:
> I come here asking for some advice/help because we're facing some
> unexpected behavior when we want to interrupt functions doing CPU intensive
> operations in plpython.
>
> Our problem is that we're not able to interrupt them when they're making
> CPU intensive operations. For example, when calculating Moran using PySAL,
> the SIGINT handler of Postgres is not able to cancel it.

Python code isn't interruptible, but any queries you run within a python 
function are. So if you have a loop in your function that you know will 
run for a long time, you could issue a dummy "SELECT 1" query every once 
in a while. However, that doesn't help if the long loop is in a library 
function that you have no control over, rather than in the PL/python 
function itself.
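That pattern can be sketched in plain Python (the function and parameter names are mine; in a PL/Python function the `check` callback would be something like `lambda: plpy.execute("SELECT 1")`, the SPI call where Postgres notices a pending cancel):

```python
def long_running_loop(items, process, check, check_interval=1000):
    """Process `items`, calling `check()` every `check_interval` iterations.

    `check` is the cancellation point: in PL/Python it would issue a
    dummy query via plpy.execute(), giving Postgres a chance to service
    a pending query cancel.
    """
    results = []
    for i, item in enumerate(items):
        if i % check_interval == 0:
            check()  # cancellation point
        results.append(process(item))
    return results
```

If `check()` raises (as plpy.execute() would on cancel), the loop aborts at the next checkpoint instead of running to completion.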

It would be nice to have a solution for this in plpython itself, so that 
the query cancel was turned into a Python exception. Patches for that 
would be welcome. I think you could use Py_AddPendingCall() from 
PostgreSQL's signal handler, to schedule a call to a function that in 
turn throws a Python exception. That'll need some changes to 
PostgreSQL's normal signal handlers, like die() and 
StatementCancelHandler() in postgres.c, but it seems doable.

- Heikki




Re: signal handling in plpython

From
Mario De Frutos Dieguez
Date:
Hi!

Thank you very much for your quick response :)

We're looking for a solution at the plpython level. My two proposals are a quick "workaround" that lets us interrupt using custom signal handlers in the Python code at the plpython level. But I'm looking for something more solid, and your proposal looks great (I've been at this for 3 days, hehe). I would LOVE to hear more about it, and it would be great if you could guide me a bit more so I can fully understand it :)

We've been thinking of doing something like the PostGIS handler and having multiple signal handlers: one for Postgres using StatementCancelHandler and one for the Python code. How does that sound?

Thank you again for your time :)

2016-10-14 12:01 GMT+02:00 Heikki Linnakangas <hlinnaka@iki.fi>:
On 10/13/2016 08:57 PM, Mario De Frutos Dieguez wrote:
I come here asking for some advice/help because we're facing some
unexpected behavior when we want to interrupt functions doing CPU intensive
operations in plpython.

Our problem is that we're not able to interrupt them when they're making
CPU intensive operations. For example, when calculating Moran using PySAL,
the SIGINT handler of Postgres is not able to cancel it.

Python code isn't interruptible, but any queries you run within a python function are. So if you have a loop in your function that you know will run for a long time, you could issue a dummy "SELECT 1" query every once in a while. However, that doesn't help, if the long loop is in a library function that you have no control over, rather than the PL/python function itself.

It would be nice to have a solution for this in plpython itself, so that the query cancel was turned into a Python exception. Patches for that would be welcome. I think you could use Py_AddPendingCall() from PostgreSQL's signal handler, to schedule a call to a function that in turn throws a Python exception. That'll need some changes to PostgreSQL's normal signal handlers, like die() and StatementCancelHandler() in postgres.c, but it seems doable.

- Heikki


Re: signal handling in plpython

From
Heikki Linnakangas
Date:
On 10/14/2016 01:57 PM, Mario De Frutos Dieguez wrote:
> Hi!
>
> Thank you very much for your quick response :)
>
> We're looking for a solution at plpython level. My two proposals are a
> quick "workaround" that let us interrupt using custom signal handlers in
> the python code at plpython level. But I'm looking for something more solid
> and your proposal, I've been doing this for 3 days hehe, looks great and I
> would LOVE to hear more about it and if you can't guide me a bit more in
> order to fully understand it :)
>
> We've been thinking to make something like the PostGIS handler
> <https://github.com/postgis/postgis/blob/98b2cdb872b5b5bd65606f5bce334d2477b2afc5/postgis/postgis_module.c#L128>
> and
> have multiple signal handlers: one for the Postgres using
> StatementCancelHandler and one for the python code. How does it sound?

Hmm, yeah, I guess that would work too. Or maybe not, if the library 
gets preloaded with shared_preload_libraries, because the signal handler 
is then overridden in backends. (And the "coreIntHandler" will continue 
to point to postmaster's signal handler, which is also wrong.)

Overall, changing signal handlers in an extension seems pretty fragile. 
I think adding some kind of a hook mechanism to the backend would be 
much more robust. An extension that wants to get called from the 
query-cancel signal handler would register a callback, and we would call 
the callbacks from the signal handlers in postgres.c.

Aside from SIGINT, a query cancel can also be triggered by a "recovery 
conflict", from RecoveryConflictInterrupt(). With an explicit hook, we 
can ensure that the hook gets called in all the situations that set 
QueryCancelPending. And you'll want to handle ProcDiePending, too.

- Heikki




Re: signal handling in plpython

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> It would be nice to have a solution for this in plpython itself, so that 
> the query cancel was turned into a Python exception. Patches for that 
> would be welcome. I think you could use Py_AddPendingCall() from 
> PostgreSQL's signal handler, to schedule a call to a function that in 
> turn throws a Python exception.

Py_AddPendingCall is safe to call from a signal handler?  That would
be ... quite remarkable.  I should think it'd at least need to do a
malloc().  Also, seeing that it's documented as part of Python threading,
I wonder whether we can call it at all without the backend becoming
multithreaded.

> That'll need some changes to 
> PostgreSQL's normal signal handlers, like die() and 
> StatementCancelHandler() in postgres.c, but it seems doable.

I'm not in principle averse to letting extensions get control in
the signal handlers, but I'm afraid that any such feature would
mostly act as a magnet for unsafe coding.
        regards, tom lane



Re: signal handling in plpython

From
Tom Lane
Date:
I wrote:
> Py_AddPendingCall is safe to call from a signal handler?  That would
> be ... quite remarkable.

I think that a much safer way to proceed would be, rather than asking
"how can I mess with the signal handlers", asking "how can I make my
python code act like it is sprinkled with CHECK_FOR_INTERRUPTS calls".

After some perusing of the Python docs I think it might be possible to
do this by setting up a trace function (cf Py_tracefunc()) that returns
a Python error condition if InterruptPending && (QueryCancelPending ||
ProcDiePending) is true.
        regards, tom lane



Re: signal handling in plpython

From
Heikki Linnakangas
Date:
On 10/14/2016 04:05 PM, Tom Lane wrote:
> I wrote:
>> Py_AddPendingCall is safe to call from a signal handler?  That would
>> be ... quite remarkable.

Yes, I believe it is. That's pretty much the raison d'être for 
Py_AddPendingCall(). I believe the Python interpreter itself implements 
signal handlers that way: if you call Python's 
signal.signal(SIGINT, my_signal_handler) to set a "signal handler", 
my_signal_handler() won't be called from the actual signal handler. The 
actual signal handler just schedules the call with Py_AddPendingCall(), 
and the next time the Python interpreter is in a suitable place, i.e. 
not in the middle of an atomic operation or holding a lock, it calls 
my_signal_handler().

https://github.com/python/cpython/blob/4b71e63b0616aa2a44c9b13675e4c8e3c0157481/Python/ceval.c#L422
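That deferred delivery is easy to observe from Python itself (this sketch assumes a Unix platform): the C-level handler only sets a flag, and the registered Python handler runs later, in the main thread, between bytecodes:

```python
import os
import signal

events = []


def handler(signum, frame):
    # Runs between bytecode instructions in the main thread, not inside
    # the actual C signal handler: CPython's C handler only sets a flag
    # and the eval loop services it at a safe point.
    events.append("handler ran")


signal.signal(signal.SIGUSR1, handler)
os.kill(os.getpid(), signal.SIGUSR1)  # send the signal to ourselves

for _ in range(1000):
    pass  # plain bytecode; gives the eval loop a chance to run the handler

events.append("after kill")
```

By the time the loop has executed, `events` contains "handler ran" followed by "after kill", even though no handler ran inside the C signal handler itself.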

> I think that a much safer way to proceed would be, rather than asking
> "how can I mess with the signal handlers", asking "how can I make my
> python code act like it is sprinkled with CHECK_FOR_INTERRUPTS calls".
>
> After some perusing of the Python docs I think it might be possible to
> do this by setting up a trace function (cf Py_tracefunc()) that returns
> a Python error condition if InterruptPending && (QueryCancelPending ||
> ProcDiePending) is true.

I think Py_AddPendingCall() is more or less implemented by sprinkling 
calls similar to CHECK_FOR_INTERRUPTS, that check for "any pending 
calls?", over the Python interpreter code.

- Heikki




Re: signal handling in plpython

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 10/14/2016 04:05 PM, Tom Lane wrote:
>> I wrote:
>>> Py_AddPendingCall is safe to call from a signal handler?  That would
>>> be ... quite remarkable.

> Yes, I believe it is.

> https://github.com/python/cpython/blob/4b71e63b0616aa2a44c9b13675e4c8e3c0157481/Python/ceval.c#L422

I don't know whether to laugh or cry, but that code is a joke.  Just
silently fail if you can't get the lock?
        regards, tom lane



Re: signal handling in plpython

From
Mario De Frutos Dieguez
Date:
Hi!

Following your ideas I've made a test [here], only in plpython, and it seems to work pretty well. I still have to do more testing and run the Postgres regression tests too.

This ad-hoc solution could be enough for now (we don't use shared_preload_libraries, as Heikki pointed out), because next week we need to be able to interrupt plpython functions.

But I would REALLY LOVE to implement a proper solution for this case, adding hooks in Postgres for extensions as Heikki proposes, or any other approach. I'm really enjoying working on the Postgres internals,
and at least I'd like to finish this patch.

Any suggestions on what I need to read, similar cases, advice, etc.?

Again, thank you very much for your time and your invaluable help.

Mario



2016-10-14 15:22 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us>:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 10/14/2016 04:05 PM, Tom Lane wrote:
>> I wrote:
>>> Py_AddPendingCall is safe to call from a signal handler?  That would
>>> be ... quite remarkable.

> Yes, I believe it is.

> https://github.com/python/cpython/blob/4b71e63b0616aa2a44c9b13675e4c8e3c0157481/Python/ceval.c#L422

I don't know whether to laugh or cry, but that code is a joke.  Just
silently fail if you can't get the lock?

                        regards, tom lane

Re: signal handling in plpython

From
Heikki Linnakangas
Date:

On 14 October 2016 16:22:12 EEST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 10/14/2016 04:05 PM, Tom Lane wrote:
>>> I wrote:
>>>> Py_AddPendingCall is safe to call from a signal handler?  That would
>>>> be ... quite remarkable.
>
>> Yes, I believe it is.
>
>> https://github.com/python/cpython/blob/4b71e63b0616aa2a44c9b13675e4c8e3c0157481/Python/ceval.c#L422
>
>I don't know whether to laugh or cry, but that code is a joke.  Just
>silently fail if you can't get the lock?

Heh, ok, let me rephrase: I believe it's *intended* to be callable from a signal handler :). Whether it actually works is another question. Perhaps there are some mitigating conditions there, I don't know.

For our use case, it's actually not too bad if Py_AddPendingCall gives up and does nothing. Then the python function will simply not be interrupted until the next SPI call, which is the current situation anyway.

- Heikki



Re: signal handling in plpython

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 14 October 2016 16:22:12 EEST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't know whether to laugh or cry, but that code is a joke.  Just
>> silently fail if you can't get the lock?

> Heh, ok, let me rephrase: I believe it's *intended* to be callable from a signal handler :). Whether it actually
> works is another question. Perhaps there's some mitigating conditions there, I don't know.

> For our use case, it's actually not too bad if Py_AddPendingCall gives up and does nothing. Then the python function
> will simply not be interrupted until next SPI call, which is the current situation anyway.

I dunno.  If the failure were very low-probability, you could maybe live
with that behavior, but I'm not sure it is.  Presumably the Python
interpreter loop is taking that lock once per statement (at least), so
that it can tell if there's something to do.  That'd suggest that the
fraction of time in which the lock is held is not negligible.
        regards, tom lane



Re: signal handling in plpython

From
Andres Freund
Date:
On 2016-10-14 13:50:35 -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On 14 October 2016 16:22:12 EEST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I don't know whether to laugh or cry, but that code is a joke.  Just
> >> silently fail if you can't get the lock?
> 
> > Heh, ok, let me rephrase: I believe it's *intended* to be callable from a signal handler :). Whether it actually
> > works is another question. Perhaps there's some mitigating conditions there, I don't know.
> 
> > For our use case, it's actually not too bad if Py_AddPendingCall gives up and does nothing. Then the python
> > function will simply not be interrupted until next SPI call, which is the current situation anyway.
> 
> I dunno.  If the failure were very low-probability, you could maybe live
> with that behavior, but I'm not sure it is.  Presumably the Python
> interpreter loop is taking that lock once per statement (at least), so
> that it can tell if there's something to do.  That'd suggest that the
> fraction of time in which the lock is held is not negligible.

Since that's how python's signal handling works, I'm not sure that's
really our problem.

Greetings,

Andres Freund



Re: signal handling in plpython

From
Robert Haas
Date:
On Fri, Oct 14, 2016 at 10:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 14 October 2016 16:22:12 EEST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't know whether to laugh or cry, but that code is a joke.  Just
>>> silently fail if you can't get the lock?
>
>> Heh, ok, let me rephrase: I believe it's *intended* to be callable from a signal handler :). Whether it actually
>> works is another question. Perhaps there's some mitigating conditions there, I don't know.
>
>> For our use case, it's actually not too bad if Py_AddPendingCall gives up and does nothing. Then the python function
>> will simply not be interrupted until next SPI call, which is the current situation anyway.
>
> I dunno.  If the failure were very low-probability, you could maybe live
> with that behavior, but I'm not sure it is.  Presumably the Python
> interpreter loop is taking that lock once per statement (at least), so
> that it can tell if there's something to do.  That'd suggest that the
> fraction of time in which the lock is held is not negligible.

I'm not sure that kibitzing the way the Python developers chose to
handle this is very helpful.  Our job is to use the APIs they've exposed,
not second-guess how they implemented them.  The comment suggests that
the Python team thought that this would be reliable enough to be
acceptable, and I think we should assume they're right.  If they're
wrong, they can fix that in a future release and our use of that API
will work that much better.  Sitting on our hands gets us nowhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: signal handling in plpython

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Oct 14, 2016 at 10:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I dunno.  If the failure were very low-probability, you could maybe live
>> with that behavior, but I'm not sure it is.  Presumably the Python
>> interpreter loop is taking that lock once per statement (at least), so
>> that it can tell if there's something to do.  That'd suggest that the
>> fraction of time in which the lock is held is not negligible.

> I'm not sure that kibitzing the way the Python developers chose to
> handle this is very helpful.  Our job to use the APIs they've exposed,
> not second-guess how they implemented them.  The comment suggests that
> the Python team thought that this would be reliable enough to be
> acceptable, and I think we should assume they're right.

Well, the comment implies strongly that they expect it to be used in
situations where the signal handler would execute on a different thread
from the python interpreter loop.  So the proposed Postgres usage is
really not within the intended scope of use of the function.

> Sitting on our hands gets us nowhere.

I'm not sure where I said to sit on our hands.  I pointed to the Python
trace callback as a likely implementation that would not suffer from this
problem --- and wouldn't require us to invent safe ways to install
extension callback hooks in our signal handlers, which is not a trivial
problem either.
        regards, tom lane