Thread: cache lookup failed from empty plpythonu function
I've found a bug in plpythonu resulting in a "cache lookup" failure.
Here's the SQL to reproduce (thanks Andres):

CREATE EXTENSION plpythonu;
CREATE OR REPLACE FUNCTION t() RETURNS trigger AS 'pass' LANGUAGE 'plpythonu';
CREATE TABLE a();
CREATE TABLE b();
CREATE TRIGGER check_quota AFTER INSERT ON a
  FOR EACH ROW EXECUTE PROCEDURE t();
INSERT INTO a DEFAULT VALUES;
DROP TABLE a;
CREATE TRIGGER check_quota AFTER INSERT ON b
  FOR EACH ROW EXECUTE PROCEDURE t();
INSERT INTO b DEFAULT VALUES;
--ERROR:  cache lookup failed for relation 20200
--Time: 0.570 ms
-- cleanup with DROP TABLE b; DROP FUNCTION t();

Logs show:

ERROR:  XX000: cache lookup failed for relation 3694414
LOCATION:  PLy_procedure_argument_valid, plpython.c:1469

Andres (in IRC #postgresql) said:

<andres> hm, I think I see the problem. The caching there looks pretty
         broken to me.
<andres> i.e. the caching doesn't work *at all* if a function is used
         as a trigger for more than one table and this is just fallout
         of that.
<andres> Looks like a typo to me, procedure_get should use the table's
         oid instead of fn_oid when doing lookups in PLy_trigger_cache.

Tested on:

PostgreSQL 9.1.7 on x86_64-unknown-linux-gnu, compiled by gcc
(Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

It was probably also present in the stable version as of one year ago
(see https://github.com/Vizzuality/cartodb/issues/173).

--strk;
On 2013-01-24 15:03:36 +0100, Sandro Santilli wrote:
> I've found a bug in plpythonu resulting in a "cache lookup" failure.
> Here's the SQL to reproduce (thanks Andres):
>
> CREATE EXTENSION plpythonu;
> CREATE OR REPLACE FUNCTION t() RETURNS trigger AS 'pass' LANGUAGE 'plpythonu';
> CREATE TABLE a();
> CREATE TABLE b();
> CREATE TRIGGER check_quota AFTER INSERT ON a
>   FOR EACH ROW EXECUTE PROCEDURE t();
> INSERT INTO a DEFAULT VALUES;
> DROP TABLE a;
> CREATE TRIGGER check_quota AFTER INSERT ON b
>   FOR EACH ROW EXECUTE PROCEDURE t();
> INSERT INTO b DEFAULT VALUES;
> --ERROR:  cache lookup failed for relation 20200
> --Time: 0.570 ms
> -- cleanup with DROP TABLE b; DROP FUNCTION t();
>
> Logs show:
>
> ERROR:  XX000: cache lookup failed for relation 3694414
> LOCATION:  PLy_procedure_argument_valid, plpython.c:1469
>
> Andres (in IRC #postgresql) said:
>
> <andres> hm, I think I see the problem. The caching there looks pretty
>          broken to me.
> <andres> i.e. the caching doesn't work *at all* if a function is used
>          as a trigger for more than one table and this is just fallout
>          of that.
> <andres> Looks like a typo to me, procedure_get should use the table's
>          oid instead of fn_oid when doing lookups in PLy_trigger_cache.

ISTM the caching code for plpythonu trigger functions has been broken
for some time. The bug seems to be that PLy_procedure_get looks up the
function in a separate cache for trigger functions (PLy_trigger_cache)
but still only uses the function oid for lookup. So the same function
can be looked up for two tables, and when the cached entry originally
refers to the old table that has been dropped, we see the above error.

The fix seems to be to make PLy_trigger_cache have a key of (reloid,
fn_oid) instead of just fn_oid.

Other suggestions?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-01-24 15:03:36 +0100, Sandro Santilli wrote:
> ISTM the caching code for plpythonu trigger functions has been broken
> for some time. The bug seems to be that PLy_procedure_get looks up the
> function in a separate cache for trigger functions (PLy_trigger_cache)
> but still only uses the function oid for lookup.
> So the same function can be looked up for two tables, and when the cached
> entry originally refers to the old table that has been dropped, we see
> the above error.

> The fix seems to be to make PLy_trigger_cache have a key of (reloid,
> fn_oid) instead of just fn_oid.

If there's anything in either the cache entry itself or the function's
other persistent state that is dependent on which table it's attached
to, then yeah, obviously. But another conceivable solution would be to
not store any such information.

In a quick look, it seems like the "PLyTypeInfo result" field might
depend on which table the trigger is attached to, so unless we can
get along without that, we'll need to expand the hash key.

Actually I'm wondering a bit why there are two hash tables at all.
Seems like one table with a hash key of (fnoid, reloid) could be
used, with the convention of reloid = 0 for non-trigger calls.

			regards, tom lane
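For illustration, a minimal sketch of the single-hash-table scheme Tom
describes, assuming a dynahash-style HTAB. The struct and field names
here are illustrative, not the actual plpython definitions:

/* Sketch only: a combined cache key for both plain and trigger calls. */
typedef struct PLyProcedureKey
{
    Oid     fn_oid;     /* the function's pg_proc OID */
    Oid     fn_rel;     /* relation the trigger is attached to,
                         * or InvalidOid (0) for a plain call */
} PLyProcedureKey;

typedef struct PLyProcedureEntry
{
    PLyProcedureKey key;    /* hash key; must be first for dynahash */
    PLyProcedure   *proc;   /* compiled procedure, or NULL */
} PLyProcedureEntry;

A lookup would then fill in both OIDs before calling hash_search(), so
the same function attached to two different tables gets two independent
cache entries, and dropping one table cannot invalidate the other's
entry.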
On 2013-01-24 11:40:33 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-01-24 15:03:36 +0100, Sandro Santilli wrote:
> > ISTM the caching code for plpythonu trigger functions has been broken
> > for some time. The bug seems to be that PLy_procedure_get looks up the
> > function in a separate cache for trigger functions (PLy_trigger_cache)
> > but still only uses the function oid for lookup.
> > So the same function can be looked up for two tables, and when the cached
> > entry originally refers to the old table that has been dropped, we see
> > the above error.
>
> > The fix seems to be to make PLy_trigger_cache have a key of (reloid,
> > fn_oid) instead of just fn_oid.
>
> If there's anything in either the cache entry itself or the function's
> other persistent state that is dependent on which table it's attached
> to, then yeah, obviously. But another conceivable solution would be to
> not store any such information.
>
> In a quick look, it seems like the "PLyTypeInfo result" field might
> depend on which table the trigger is attached to, so unless we can
> get along without that, we'll need to expand the hash key.

It looks to me like getting rid of that requirement would be too much
code for something back-patchable. Although I dislike seeing code like

    /* If the pg_proc tuple has changed, it's not valid */
    if (!(proc->fn_xmin == HeapTupleHeaderGetXmin(procTup->t_data) &&
          ItemPointerEquals(&proc->fn_tid, &procTup->t_self)))
        return false;

in a PL. I have to admit, though, that I don't immediately know how I
would prevent constant recompilation in a PL handler without such a
hack.

> Actually I'm wondering a bit why there are two hash tables at all.
> Seems like one table with a hash key of (fnoid, reloid) could be
> used, with the convention of reloid = 0 for non-trigger calls.

Seems fine to me. Currently there seems to be no point in having two
distinct hashes at all, given that they have exactly the same key and
that trigger functions cannot be called plainly anyway. Unless somebody
screams, that will mean some stuff defined in the plpython headers
changes, but that doesn't seem like a legitimate concern to me.

Tom, any other committer: I am happy to provide a <= 9.1 version of the
patch, because plpython has been noticeably restructured in 9.1=>9.2,
but I am not sure if that helps you at all.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> Tom, any other committer: I am happy to provide a <= 9.1 version of the
> patch, because plpython has been noticeably restructured in 9.1=>9.2,
> but I am not sure if that helps you at all.

We need to back-patch to any branch where this is broken. If the
patch would change materially before 9.2, then yeah it would help if
you could create/test a version for that too.

			regards, tom lane
On 2013-01-24 11:40:33 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-01-24 15:03:36 +0100, Sandro Santilli wrote:
> > ISTM the caching code for plpythonu trigger functions has been broken
> > for some time. The bug seems to be that PLy_procedure_get looks up the
> > function in a separate cache for trigger functions (PLy_trigger_cache)
> > but still only uses the function oid for lookup.
> > So the same function can be looked up for two tables, and when the cached
> > entry originally refers to the old table that has been dropped, we see
> > the above error.
>
> > The fix seems to be to make PLy_trigger_cache have a key of (reloid,
> > fn_oid) instead of just fn_oid.
>
> If there's anything in either the cache entry itself or the function's
> other persistent state that is dependent on which table it's attached
> to, then yeah, obviously. But another conceivable solution would be to
> not store any such information.
>
> In a quick look, it seems like the "PLyTypeInfo result" field might
> depend on which table the trigger is attached to, so unless we can
> get along without that, we'll need to expand the hash key.
>
> Actually I'm wondering a bit why there are two hash tables at all.
> Seems like one table with a hash key of (fnoid, reloid) could be
> used, with the convention of reloid = 0 for non-trigger calls.

It's slightly more complex than just making it one hash table with an
extended key. When validating a trigger function we don't have a
relation to do the cache lookup with. I chose to handle that case by
not doing a cache lookup at all, which imo is a sensible choice. I am
less sure about whether it's nicer to make that decision in
PLy_procedure_get as I have, or whether it would have been better to
make PLy_procedure_create public and directly use it in
plpython_validator.

The bug got introduced in 46211da1b84bc3537e799ee1126098e71c2428e8,
which interestingly says "Using HTABs instead of Python dictionaries
makes error recovery easier, and allows for procedures to be cached
based on their OIDs, not their names." - but the caching seems to
always have used ("%u_%u", fn_oid, tgreloid) as its key, which explains
why the bug didn't exist back then.

Anyway, as far as I can see only 9.1 upwards are affected.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
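A rough sketch of the behaviour Andres describes, with hypothetical
variable names (this is not necessarily the committed code): when a
trigger function is validated at CREATE FUNCTION time there is no
relation OID to key the cache on, so the lookup is bypassed entirely.

static PLyProcedure *
PLy_procedure_get(Oid fn_oid, Oid fn_rel, bool is_trigger)
{
    /*
     * During validation of a trigger function, fn_rel is InvalidOid, so
     * there is no sensible (fn_oid, fn_rel) key; skip the cache and
     * build a throwaway procedure instead.
     */
    bool    use_cache = !(is_trigger && fn_rel == InvalidOid);

    ...
    if (use_cache)
        entry = hash_search(PLy_procedure_cache, &key, HASH_ENTER, &found);
    ...
}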
Andres Freund <andres@2ndquadrant.com> writes:
> It's slightly more complex than just making it one hash table with an
> extended key. When validating a trigger function we don't have a
> relation to do the cache lookup with. I chose to handle that case by
> not doing a cache lookup at all, which imo is a sensible
> choice.

Seems fair. However ... why is it safe for PLy_procedure_create to be
using the same name for multiple instances of a trigger function?
Should we not be including the rel OID when building the procName
string?

			regards, tom lane
On 2013-01-25 14:51:39 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > It's slightly more complex than just making it one hash table with an
> > extended key. When validating a trigger function we don't have a
> > relation to do the cache lookup with. I chose to handle that case by
> > not doing a cache lookup at all, which imo is a sensible
> > choice.
>
> Seems fair. However ... why is it safe for PLy_procedure_create to be
> using the same name for multiple instances of a trigger function?
> Should we not be including the rel OID when building the procName
> string?

I don't think it's a problem; given the way python works I am pretty
sure it will result in independent functions.

Each PLy_procedure_compile will run the source code in a copy of
PLy_interp_globals, therefore the independent compilations shouldn't
affect each other.

I am not sure why it builds the call to the function via eval'ing a
"$funcname()" instead of using the result of PyRun_String, which will
return a reference to the function, but that's an independent issue.

Now I think an argument can be made that it would be nicer for
debugging purposes to have clearly distinguishable function names, but
I personally never needed it, and it probably wouldn't be something to
backpatch. People might rely on those function names.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
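In Python C-API terms, the isolation Andres is relying on amounts to
roughly the following; this is a sketch under those assumptions, not
the actual plpython source, and crafted_src is a hypothetical name for
the generated "def ..." text:

/* Each compilation executes the generated source in its own copy of
 * the interpreter globals, so two functions compiled under the same
 * Python-level name cannot clobber each other. */
PyObject *globals = PyDict_Copy(PLy_interp_globals);    /* fresh namespace */
PyObject *rv = PyRun_String(crafted_src, Py_file_input, globals, globals);

Py_XDECREF(rv);
/* the callable is later looked up and eval'ed within these same globals */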
On 2013-01-25 21:07:51 +0100, Andres Freund wrote:
> On 2013-01-25 14:51:39 -0500, Tom Lane wrote:
> > Seems fair. However ... why is it safe for PLy_procedure_create to be
> > using the same name for multiple instances of a trigger function?
> > Should we not be including the rel OID when building the procName
> > string?
>
> I don't think it's a problem; given the way python works I am pretty
> sure it will result in independent functions.
>
> Each PLy_procedure_compile will run the source code in a copy of
> PLy_interp_globals, therefore the independent compilations shouldn't
> affect each other.
>
> I am not sure why it builds the call to the function via eval'ing a
> "$funcname()" instead of using the result of PyRun_String, which will
> return a reference to the function, but that's an independent issue.
>
> Now I think an argument can be made that it would be nicer for
> debugging purposes to have clearly distinguishable function names, but
> I personally never needed it, and it probably wouldn't be something to
> backpatch. People might rely on those function names.

Just for fun I tried how easy it is to have some fun with the function
name. Guess what happens when the attached sql file is executed ;)

It quite possibly doesn't work if copied from here, but for quicker
viewing:

CREATE OR REPLACE FUNCTION "foo():
pass

import os
os._exit(1)

def "() RETURNS void LANGUAGE plpythonu AS $$pass$$;

Yep:

LOG:  server process (PID 29317) exited with exit code 1
DETAIL:  Failed process was running: CREATE OR REPLACE FUNCTION "foo():
	pass

	import os
	os._exit(1)

	def "() RETURNS void LANGUAGE plpythonu AS $$pass$$;
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.

Anyway, back to more sensible things ;)

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> It quite possibly doesn't work if copied from here, but for quicker
> viewing:
>
> CREATE OR REPLACE FUNCTION "foo():
> pass
>
> import os
> os._exit(1)
>
> def "() RETURNS void LANGUAGE plpythonu AS $$pass$$;
>
> Yep:
> LOG:  server process (PID 29317) exited with exit code 1

Ooops. Good thing plpython is already superuser-only, or this would be
a security problem. So should we stop including the SQL function name
in the python name?

			regards, tom lane
On 2013-01-25 15:40:46 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > It quite possibly doesn't work if copied from here, but for quicker
> > viewing:
> >
> > CREATE OR REPLACE FUNCTION "foo():
> > pass
> >
> > import os
> > os._exit(1)
> >
> > def "() RETURNS void LANGUAGE plpythonu AS $$pass$$;
> >
> > Yep:
> > LOG:  server process (PID 29317) exited with exit code 1
>
> Ooops. Good thing plpython is already superuser-only, or this would be
> a security problem. So should we stop including the SQL function name
> in the python name?

I don't think it's worth it, and having something recognizable in there
is useful in (python) backtraces. I am pretty sure I could do the same
in the function source with some trickery, even without the function
being executed, just because python also executes stuff during parsing,
so I don't think it would buy enough.

We probably should care about it if there were a trusted plpython (so
you couldn't cause an incomplete function to be called or whatever),
but I don't see that happening anytime soon.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> The bug got introduced in 46211da1b84bc3537e799ee1126098e71c2428e8,
> which interestingly says "Using HTABs instead of Python dictionaries
> makes error recovery easier, and allows for procedures to be cached
> based on their OIDs, not their names." - but the caching seems to
> always have used ("%u_%u", fn_oid, tgreloid) as its key, which explains
> why the bug didn't exist back then.

> Anyway, as far as I can see only 9.1 upwards are affected.

Patch applied with some editorialization, mostly cosmetic but one thing
not so much: you introduced uninitialized-variable hazards into
PLy_procedure_get, since if use_cache is false then found, entry, and
proc wouldn't get set up. Neither version of gcc I tried complained
about this, and I assume yours didn't either, which is a bit
annoying/scary. I wonder whether the "volatile" marker prevents that?

BTW, it seems to me that there is another bug in PLy_procedure_get,
which is that it's not thinking about recursive use of a procedure.
Consider

	call f
		... somebody updates f's pg_proc tuple
		... recursively call f

At this point PLy_procedure_get will decide the cache entry is obsolete
and happily trash the procedure data that the outer call is still using.
We ran into this long ago with plpgsql and fixed it by adding
reference-counting logic. Probably the same needs to be done here.

There are probably not that many people trying to use plpython functions
in such a way (note the recursion would have to happen via a SPI query,
not directly within Python). But still...

			regards, tom lane
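For illustration, the reference-counting scheme Tom alludes to could
look roughly like this; the field and helper names are hypothetical,
not taken from the plpgsql or plpython sources:

typedef struct PLyProcedure
{
    /* ... existing fields ... */
    int     refcount;   /* active executions, plus one pin for the cache */
} PLyProcedure;

/* On call entry: pin the procedure, so a recursive lookup that decides
 * the cache entry is stale cannot free it out from under the outer call. */
proc->refcount++;

/* On call exit, and when the cache evicts a stale entry: unpin, and
 * physically free only once nobody references the procedure anymore. */
if (--proc->refcount == 0)
    PLy_procedure_delete(proc);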
On 2013-01-25 17:07:59 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > The bug got introduced in 46211da1b84bc3537e799ee1126098e71c2428e8,
> > which interestingly says "Using HTABs instead of Python dictionaries
> > makes error recovery easier, and allows for procedures to be cached
> > based on their OIDs, not their names." - but the caching seems to
> > always have used ("%u_%u", fn_oid, tgreloid) as its key, which
> > explains why the bug didn't exist back then.
>
> > Anyway, as far as I can see only 9.1 upwards are affected.
>
> Patch applied with some editorialization, mostly cosmetic but one thing
> not so much: you introduced uninitialized-variable hazards into
> PLy_procedure_get, since if use_cache is false then found, entry, and
> proc wouldn't get set up.

Gah. Now I am pissed at myself. I thought I had initialized 'found' to
false, which should be enough.

> Neither version of gcc I tried complained about this, and I assume yours
> didn't either, which is a bit annoying/scary. I wonder whether the
> "volatile" marker prevents that?

I think the control flow is just too complex for gcc; it probably
cannot trace anything across a setjmp(), although it theoretically
should be possible given that the failure is in the branch that is
executed unconditionally.

Given that the real culprit is 'found', I don't think it's the
volatiles.

> BTW, it seems to me that there is another bug in PLy_procedure_get,
> which is that it's not thinking about recursive use of a procedure.
> Consider
>
> 	call f
> 		... somebody updates f's pg_proc tuple
> 		... recursively call f
>
> At this point PLy_procedure_get will decide the cache entry is obsolete
> and happily trash the procedure data that the outer call is still using.
> We ran into this long ago with plpgsql and fixed it by adding
> reference-counting logic. Probably the same needs to be done here.
>
> There are probably not that many people trying to use plpython functions
> in such a way (note the recursion would have to happen via a SPI query,
> not directly within Python). But still...

I am afraid you're right. It seems ugly that all PLs have to reinvent
that kind of cache + invalidation logic. I am e.g. not sure they go far
enough in invalidating stuff like parameter types & io functions.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-01-25 17:07:59 -0500, Tom Lane wrote:
>> Neither version of gcc I tried complained about this, and I assume yours
>> didn't either, which is a bit annoying/scary. I wonder whether the
>> "volatile" marker prevents that?

> I think the control flow is just too complex for gcc; it probably
> cannot trace anything across a setjmp(), although it theoretically
> should be possible given that the failure is in the branch that is
> executed unconditionally.

> Given that the real culprit is 'found', I don't think it's the
> volatiles.

Yeah, but if it had warned about the volatiles you'd probably not have
missed the more significant issue. I think the reason 'found' wasn't
complained of is that its address gets passed to another function ---
IME that seems to shut off any initialization warning, even if the
compiler can see that the function call might not happen. But I'd have
thought it'd warn about the volatiles.

Anyway, not our problem.

			regards, tom lane
>> Consider
>>
>> 	call f
>> 		... somebody updates f's pg_proc tuple
>> 		... recursively call f
>>
>> At this point PLy_procedure_get will decide the cache entry is obsolete
>> and happily trash the procedure data that the outer call is still using.
>> We ran into this long ago with plpgsql and fixed it by adding
>> reference-counting logic. Probably the same needs to be done here.

> I am afraid you're right. It seems ugly that all PLs have to reinvent
> that kind of cache + invalidation logic. I am e.g. not sure they go far
> enough in invalidating stuff like parameter types & io functions.

Pardon for coming late to the conversation. First of all, thanks for
fixing the trigger function caching bug; it truly was a silly one.

The problem with reentrant calls of PLy_procedure_get and freeing an
in-use PLyProcedure struct looks real enough. It took me some time to
make it actually fail, though, as even a piece of code like this:

create or replace function nasty() returns void language plpythonu as $$
if 'executed' in GD:
    return
GD['executed'] = True
plpy.execute('alter function nasty() cost 10')
plpy.execute("select nasty()")
plpy.error('oops')
$$;
select nasty();

was not bombing out, because after the reentrant call of
PLy_procedure_get freed the PLyProcedure struct, it allocated another
one for itself that ended up in the exact same position in memory... I
ended up adding a dummy PLy_malloc somewhere and got a segfault when
plpy.error tried to access the current function name to add it to the
errcontext.

We could add a refcount to PLyProcedure, or reuse the
PLy_execution_contexts scheme that's already in place. When
PLy_procedure_get decides it needs to recreate a cached function, it
could walk the execution-contexts stack looking for a matching pointer.
I'm not yet sure what it should do in that case; the only provision we
have for freeing PLyProcedure structs is when they get evicted from the
hash. Perhaps a refcount is the only way to handle it sanely.

By the way, I agree that the caching in PL/Python (and PLs in general)
is ugly and ad hoc. Having to share memory management with the embedded
PL doesn't help, though.

Cheers,
Jan
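A sketch of the stack-walking variant Jan mentions, assuming the
execution-context list links frames through a next pointer and records
the procedure each frame is running; the helper name is hypothetical:

/* Is this procedure still being executed by any frame on the
 * execution-context stack?  If so, it must not be freed when the cache
 * decides its entry is stale. */
static bool
PLy_procedure_in_use(PLyProcedure *proc)
{
    PLyExecutionContext *context;

    for (context = PLy_execution_contexts; context != NULL;
         context = context->next)
    {
        if (context->curr_proc == proc)
            return true;
    }
    return false;
}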