Thread: Plpython crashing the backend in one easy step

Plpython crashing the backend in one easy step

From
Bradley McLean
Date:
I'm going to look into this, but thought I'd share the misery:

CREATE FUNCTION crash() RETURNS varchar AS '
plpy.execute("syntax error")
' language 'plpython';

SELECT crash();

-Brad


Re: Plpython crashing the backend in one easy step

From
Bradley McLean
Date:
I need some expert guidance here.  Suppose you have:

CREATE FUNCTION crash() RETURNS varchar AS '
plpy.execute("syntax error")
' language 'plpython';

Here are three possible behaviors:

(A)

a123=# select crash();
ERROR:  parser: parse error at or near "syntax"
ERROR:  plpython: Call of function `__plpython_procedure_crash_41133' failed.
plpy.SPIError: Unknown error in PLy_spi_execute_query.
FATAL 2:  elog: error during error recovery, giving up!
server closed the connection unexpectedlyThis probably means the server terminated abnormallybefore or while processing
therequest.
 
The connection to the server was lost. Attempting reset: Failed.
!#

(B)

a123=# select crash();
ERROR:  parser: parse error at or near "syntax"
a123=#

(C)

a123=# select crash();
ERROR:  parser: parse error at or near "syntax"
ERROR:  plpython: Call of function `__plpython_procedure_crash_41133' failed.
plpy.SPIError: Unknown error in PLy_spi_execute_query.
a123=#

Option (A) is the current code.

Fixing this happens near line 2290 (could be off a bit, I have some other
patches in this file), in the first if clause in PLy_spi_execute_query.

The DECLARE_EXC, SAVE_EXC, TRAP_EXC, RESTORE_EXC, RERAISE_EXC macros
are wrappers around sigsetjmp and longjmp, and are used to intercept
the elog calls occurring when plpython calls spi.

In Option (A), we return NULL, which causes the next level of code
to call elog.  Elog notices that we're already in an Error state, and
shuts down the backend.

In Option (B), I replace the 'return NULL' with a RERAISE_EXC, which
allows elog to function normally, albeit without any cleanup of the
plpython environment or useful messages.

In Option (C), I set the global "InError" flag to false, and then
return NULL, causing all of the error messages to come out and
plpython to clean up gracefully, no backend crash.  However, this
seems to be an unprecedented approach, and I could be missing
something big.

There's probably an Option (D) that I'm overlooking.

HELP!  (thanks)

-Brad


Re: Plpython crashing the backend in one easy step

From
Tom Lane
Date:
Bradley McLean <brad@bradm.net> writes:
> In Option (C), I set the global "InError" flag to false, and then
> return NULL, causing all of the error messages to come out and
> plpython to clean up gracefully, no backend crash.  However, this
> seems to be an unprecedented approach, and I could be missing
> something big.

Yes, as in "it's totally unsafe".  Suppressing an elog(ERROR) is 
a *big* no-no at present, because way too much stuff relies on
post-abort cleanup to clean up whatever problem is being reported.
You cannot allow the transaction to continue after the error, and
you most certainly mustn't cavalierly reset the error handling state.

The only things you should be doing with longjmp trapping are
(a) doing any cleanup that Python itself has to have before you
re-propagate the longjmp, or
(b) issuing elog(NOTICE) to help identify the error location
before you re-propagate the longjmp.

plpgsql contains an example of doing (b).

Not propagating the longjmp, which is what the code seems to be doing
at present, is not acceptable.  I had not looked at this code closely,
but as-is it is a huge reliability hazard.  I will insist on removing
plpython from 7.2 entirely if this can't be fixed before release.
        regards, tom lane


Re: Plpython crashing the backend in one easy step

From
Bradley McLean
Date:
Thanks, I assumed something like that.

I can very quickly provide at implementation that meets
these criteria (It's my option "b"), but it may be
less informative in terms of messages.

I'll have a look to see what I can do to supplement
the messages.

I also need to check if there's a memory leak after I
fix it.

I was aware of the example in plpgsql.

-Brad


* Tom Lane (tgl@sss.pgh.pa.us) [011113 17:08]:
> 
> Yes, as in "it's totally unsafe".  Suppressing an elog(ERROR) is 
> a *big* no-no at present, because way too much stuff relies on
> post-abort cleanup to clean up whatever problem is being reported.
> You cannot allow the transaction to continue after the error, and
> you most certainly mustn't cavalierly reset the error handling state.
> 
> The only things you should be doing with longjmp trapping are
> (a) doing any cleanup that Python itself has to have before you
> re-propagate the longjmp, or
> (b) issuing elog(NOTICE) to help identify the error location
> before you re-propagate the longjmp.
> 
> plpgsql contains an example of doing (b).
> 
> Not propagating the longjmp, which is what the code seems to be doing
> at present, is not acceptable.  I had not looked at this code closely,
> but as-is it is a huge reliability hazard.  I will insist on removing
> plpython from 7.2 entirely if this can't be fixed before release.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly


Re: Plpython crashing the backend in one easy step - fix

From
Bradley McLean
Date:
Okay, the attached patch contains the following:

- Kevin Jacob's patch to make plpython worthy of a 'trusted' language
by eliminating the ability to arbitrarily read OS files.  He also did
a bunch of C declaration cleanups.  (His patch has yet to appear on
either hackers or patches, and I believe is also a prerequisite for
plpython in 7.2 final).

- Changed every place that catches elog() longjmps to re-propogate.
- Added code to keep track of the current plpython function and add
it to the elog messages
- #ifdef 0 around some items that appear to be not in use.
- changes to the expected output of the regression tests, because:

Tom's requirement that we always reraise the longjmp is at odds with
the original design of plpython, which attempted to allow a plpython
function to use Python's exception handling to recover when an
embedded SPI call failed.  There is no way to do this that I can
see without halting the propogation of the longjmp.

Thus, the semantics of 'execute' calls from plpython have been changed:
should the backend throw an error, the plpython function will be
summarily aborted.  Previously, it could attempt to catch a python
exception.

Note that this does create some redundant exception handling paths
in upper level methods within this module that are no longer used.
I have not yet removed them, but will happily do so (in a careful
deliberate manner) once sure that there is no alternative way to
restore the python level exception handling.

Let me know if this isn't what's needed to get this up to snuff
for 7.2.

-Brad

* Tom Lane (tgl@sss.pgh.pa.us) [011113 16:49]:
>
> Yes, as in "it's totally unsafe".  Suppressing an elog(ERROR) is
> a *big* no-no at present, because way too much stuff relies on
> post-abort cleanup to clean up whatever problem is being reported.
> You cannot allow the transaction to continue after the error, and
> you most certainly mustn't cavalierly reset the error handling state.
>
> The only things you should be doing with longjmp trapping are
> (a) doing any cleanup that Python itself has to have before you
> re-propagate the longjmp, or
> (b) issuing elog(NOTICE) to help identify the error location
> before you re-propagate the longjmp.
>
> plpgsql contains an example of doing (b).
>
> Not propagating the longjmp, which is what the code seems to be doing
> at present, is not acceptable.  I had not looked at this code closely,
> but as-is it is a huge reliability hazard.  I will insist on removing
> plpython from 7.2 entirely if this can't be fixed before release.
>
>             regards, tom lane

Attachment

Re: Plpython crashing the backend in one easy step - fix

From
Tom Lane
Date:
Bradley McLean <brad@bradm.net> writes:
> Okay, the attached patch contains the following:

I have checked over and applied this patch.

It appeared that you incorporated the earlier version of Kevin's patch
rather than the later one.  I made the further changes indicated by the
attached diff, which I think are all the differences between Kevin's
two submissions, but I'd appreciate it if both of you would review
what's now in CVS and confirm it's okay.  (There'll probably be a beta3
release later today, so you can look at that if it's easier than CVS.)

One thing I noticed while running the selftest on a LinuxPPC box is
that the "feature" test output differed:

--- feature.expected    Fri May 25 11:48:33 2001
+++ feature.output    Fri Nov 16 13:00:15 2001
@@ -29,7 +29,7 @@(1 row)SELECT import_fail();
-NOTICE:  ('import socket failed -- untrusted dynamic module: _socket',)
+NOTICE:  ('import socket failed -- untrusted dynamic module: socket',)    import_fail     -------------------- failed
asexpected
 

I assume you guys both get the "expected" output?  Perhaps this should
be noted as a possible platform discrepancy.

> Tom's requirement that we always reraise the longjmp is at odds with
> the original design of plpython, which attempted to allow a plpython
> function to use Python's exception handling to recover when an
> embedded SPI call failed.  There is no way to do this that I can
> see without halting the propogation of the longjmp.
> Thus, the semantics of 'execute' calls from plpython have been changed:
> should the backend throw an error, the plpython function will be
> summarily aborted.  Previously, it could attempt to catch a python
> exception.
> Note that this does create some redundant exception handling paths
> in upper level methods within this module that are no longer used.  
> I have not yet removed them, but will happily do so (in a careful
> deliberate manner) once sure that there is no alternative way to
> restore the python level exception handling.

I would suggest leaving it as-is for now.  Sooner or later we will
probably try to reduce the severity of most errors, and at that
point it'd become worthwhile to re-enable error trapping in plpython.
        regards, tom lane


*** plpython.c~    Fri Nov 16 12:19:38 2001
--- plpython.c    Fri Nov 16 12:24:43 2001
***************
*** 292,297 ****
--- 292,298 ----     "binascii",     "calendar",     "cmath",
+     "codecs",     "errno",     "marshal",     "math",
***************
*** 325,330 ****
--- 326,332 ----     "hexrevision",     "maxint",     "maxunicode",
+     "platform",     "version",     "version_info" };


Re: Plpython crashing the backend in one easy step - fix

From
Bradley McLean
Date:
* Kevin Jacobs (jacobs@penguin.theopalgroup.com) [011116 13:15]:
> On Fri, 16 Nov 2001, Tom Lane wrote:
> > --- feature.expected    Fri May 25 11:48:33 2001
> > +++ feature.output    Fri Nov 16 13:00:15 2001
> > @@ -29,7 +29,7 @@
> >  (1 row)
> >
> >  SELECT import_fail();
> > -NOTICE:  ('import socket failed -- untrusted dynamic module: _socket',)
> > +NOTICE:  ('import socket failed -- untrusted dynamic module: socket',)
> >      import_fail
> >  --------------------
> >   failed as expected
> >
> > I assume you guys both get the "expected" output?  Perhaps this should
> > be noted as a possible platform discrepancy.
> 
> This diff is most likely a platform issue where the module printed can
> change depending on the version of Python installed.  Its annoying, but
> harmless.

I can confirm that:  It occurs on stock python 1.5.2 (default on even the
latest RH 7.2 linux).  Betweeen 1.5 and 2.0 the low level code was moved
from socket to _socket to make it possible to include a high level python
wrapper called socket.

I've reviewed the merge from CVS and it appears correct.  Thanks!

-Brad


Re: Plpython crashing the backend in one easy step - fix

From
Tom Lane
Date:
Bradley McLean <brad@bradm.net> writes:
> I can confirm that:  It occurs on stock python 1.5.2 (default on even the
> latest RH 7.2 linux).

Bingo --- python 1.5.2 is what's installed on that machine.  Seems
to pass the self-test fine other than this one item.
        regards, tom lane


Re: Plpython crashing the backend in one easy step - fix

From
Kevin Jacobs
Date:
On Fri, 16 Nov 2001, Tom Lane wrote:
> --- feature.expected    Fri May 25 11:48:33 2001
> +++ feature.output    Fri Nov 16 13:00:15 2001
> @@ -29,7 +29,7 @@
>  (1 row)
>
>  SELECT import_fail();
> -NOTICE:  ('import socket failed -- untrusted dynamic module: _socket',)
> +NOTICE:  ('import socket failed -- untrusted dynamic module: socket',)
>      import_fail
>  --------------------
>   failed as expected
>
> I assume you guys both get the "expected" output?  Perhaps this should
> be noted as a possible platform discrepancy.

This diff is most likely a platform issue where the module printed can
change depending on the version of Python installed.  Its annoying, but
harmless.

-Kevin

--
Kevin Jacobs
The OPAL Group - Enterprise Systems Architect
Voice: (216) 986-0710 x 19         E-mail: jacobs@theopalgroup.com
Fax:   (216) 986-0714              WWW:    http://www.theopalgroup.com




Re: Plpython crashing the backend in one easy step

From
Bruce Momjian
Date:
Has this been addressed?

---------------------------------------------------------------------------

> Bradley McLean <brad@bradm.net> writes:
> > In Option (C), I set the global "InError" flag to false, and then
> > return NULL, causing all of the error messages to come out and
> > plpython to clean up gracefully, no backend crash.  However, this
> > seems to be an unprecedented approach, and I could be missing
> > something big.
> 
> Yes, as in "it's totally unsafe".  Suppressing an elog(ERROR) is 
> a *big* no-no at present, because way too much stuff relies on
> post-abort cleanup to clean up whatever problem is being reported.
> You cannot allow the transaction to continue after the error, and
> you most certainly mustn't cavalierly reset the error handling state.
> 
> The only things you should be doing with longjmp trapping are
> (a) doing any cleanup that Python itself has to have before you
> re-propagate the longjmp, or
> (b) issuing elog(NOTICE) to help identify the error location
> before you re-propagate the longjmp.
> 
> plpgsql contains an example of doing (b).
> 
> Not propagating the longjmp, which is what the code seems to be doing
> at present, is not acceptable.  I had not looked at this code closely,
> but as-is it is a huge reliability hazard.  I will insist on removing
> plpython from 7.2 entirely if this can't be fixed before release.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026