Thread: cache lookup failed from empty plpythonu function
I've found a bug in plpythonu resulting in a "cache lookup" failure.
Here's the SQL to reproduce (thanks Andres):

CREATE EXTENSION plpythonu;
CREATE OR REPLACE FUNCTION t() RETURNS trigger AS 'pass' LANGUAGE 'plpythonu';
CREATE TABLE a();
CREATE TABLE b();
CREATE TRIGGER check_quota AFTER INSERT ON a
  FOR EACH ROW EXECUTE PROCEDURE t();
INSERT INTO a DEFAULT VALUES;
DROP TABLE a;
CREATE TRIGGER check_quota AFTER INSERT ON b
  FOR EACH ROW EXECUTE PROCEDURE t();
INSERT INTO b DEFAULT VALUES;
--ERROR:  cache lookup failed for relation 20200
--Time: 0.570 ms
-- cleanup with DROP TABLE b; DROP FUNCTION t();

Logs show:

ERROR:  XX000: cache lookup failed for relation 3694414
LOCATION:  PLy_procedure_argument_valid, plpython.c:1469

Andres (in IRC #postgresql) said:

<andres> hm, I think I see the problem. The caching there looks pretty
         broken to me.
<andres> i.e. the caching doesn't work *at all* if a function is used
         as a trigger for more than one table and this is just fallout
         of that.
<andres> Looks like a typo to me, procedure_get should use the table's
         oid instead of fn_oid when doing lookups in PLy_trigger_cache.

Tested on:

PostgreSQL 9.1.7 on x86_64-unknown-linux-gnu, compiled by gcc
(Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

It was probably also present in the stable version as of one year ago
(see https://github.com/Vizzuality/cartodb/issues/173).

--strk;
On 2013-01-24 15:03:36 +0100, Sandro Santilli wrote:
> I've found a bug in plpythonu resulting in a "cache lookup" failure.
> Here's the SQL to reproduce (thanks Andres):
>
> CREATE EXTENSION plpythonu;
> CREATE OR REPLACE FUNCTION t() RETURNS trigger AS 'pass' LANGUAGE 'plpythonu';
> CREATE TABLE a();
> CREATE TABLE b();
> CREATE TRIGGER check_quota AFTER INSERT ON a
>   FOR EACH ROW EXECUTE PROCEDURE t();
> INSERT INTO a DEFAULT VALUES;
> DROP TABLE a;
> CREATE TRIGGER check_quota AFTER INSERT ON b
>   FOR EACH ROW EXECUTE PROCEDURE t();
> INSERT INTO b DEFAULT VALUES;
> --ERROR:  cache lookup failed for relation 20200
> --Time: 0.570 ms
> -- cleanup with DROP TABLE b; DROP FUNCTION t();
>
> Logs show:
>
> ERROR:  XX000: cache lookup failed for relation 3694414
> LOCATION:  PLy_procedure_argument_valid, plpython.c:1469
>
> Andres (in IRC #postgresql) said:
>
> <andres> hm, I think I see the problem. The caching there looks pretty
>          broken to me.
> <andres> i.e. the caching doesn't work *at all* if a function is used
>          as a trigger for more than one table and this is just fallout
>          of that.
> <andres> Looks like a typo to me, procedure_get should use the table's
>          oid instead of fn_oid when doing lookups in PLy_trigger_cache.

ISTM the caching code for plpythonu trigger functions has been broken
for some time. The bug seems to be that PLy_procedure_get looks up the
function in a separate cache for trigger functions (PLy_trigger_cache)
but still only uses the function oid for lookup. So the same function
can be looked up for two tables, and when the cached entry originally
refers to the old table that has been dropped, we see the above error.

The fix seems to be to make PLy_trigger_cache have a key of (reloid,
fn_oid) instead of just fn_oid.

Other suggestions?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-01-24 15:03:36 +0100, Sandro Santilli wrote:
> ISTM the caching code for plpythonu trigger functions has been broken
> for some time. The bug seems to be that PLy_procedure_get looks up the
> function in a separate cache for trigger functions (PLy_trigger_cache)
> but still only uses the function oid for lookup.
> So the same function can be looked up for two tables, and when the cached
> entry originally refers to the old table that has been dropped, we see
> the above error.

> The fix seems to be to make PLy_trigger_cache have a key of (reloid,
> fn_oid) instead of just fn_oid.

If there's anything in either the cache entry itself or the function's
other persistent state that is dependent on which table it's attached
to, then yeah, obviously. But another conceivable solution would be to
not store any such information.

In a quick look, it seems like the "PLyTypeInfo result" field might
depend on which table the trigger is attached to, so unless we can
get along without that, we'll need to expand the hash key.

Actually I'm wondering a bit why there are two hash tables at all.
Seems like one table with a hash key of (fnoid, reloid) could be
used, with the convention of reloid = 0 for non-trigger calls.

			regards, tom lane
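For illustration, a minimal sketch of the single-hash-table scheme Tom
describes, assuming a dynahash-style HTAB. The struct and field names
here are illustrative, not the actual plpython definitions:

/* Sketch only: a combined cache key for both plain and trigger calls. */
typedef struct PLyProcedureKey
{
    Oid     fn_oid;     /* the function's pg_proc OID */
    Oid     fn_rel;     /* relation the trigger is attached to,
                         * or InvalidOid (0) for a plain call */
} PLyProcedureKey;

typedef struct PLyProcedureEntry
{
    PLyProcedureKey key;    /* hash key; must be first for dynahash */
    PLyProcedure   *proc;   /* compiled procedure, or NULL */
} PLyProcedureEntry;

A lookup would then fill in both OIDs before calling hash_search(), so
the same function attached to two different tables gets two independent
cache entries, and dropping one table cannot invalidate the other's
entry.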
On 2013-01-24 11:40:33 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-01-24 15:03:36 +0100, Sandro Santilli wrote:
> > ISTM the caching code for plpythonu trigger functions has been broken
> > for some time. The bug seems to be that PLy_procedure_get looks up the
> > function in a separate cache for trigger functions (PLy_trigger_cache)
> > but still only uses the function oid for lookup.
> > So the same function can be looked up for two tables, and when the cached
> > entry originally refers to the old table that has been dropped, we see
> > the above error.
>
> > The fix seems to be to make PLy_trigger_cache have a key of (reloid,
> > fn_oid) instead of just fn_oid.
>
> If there's anything in either the cache entry itself or the function's
> other persistent state that is dependent on which table it's attached
> to, then yeah, obviously. But another conceivable solution would be to
> not store any such information.
>
> In a quick look, it seems like the "PLyTypeInfo result" field might
> depend on which table the trigger is attached to, so unless we can
> get along without that, we'll need to expand the hash key.

It looks to me like getting rid of that requirement would be too much
code for something back-patchable. Although I dislike seeing code like

    /* If the pg_proc tuple has changed, it's not valid */
    if (!(proc->fn_xmin == HeapTupleHeaderGetXmin(procTup->t_data) &&
          ItemPointerEquals(&proc->fn_tid, &procTup->t_self)))
        return false;

in a PL. I have to admit, though, that I don't immediately know how I
would prevent constant recompilation in a PL handler without such a
hack.

> Actually I'm wondering a bit why there are two hash tables at all.
> Seems like one table with a hash key of (fnoid, reloid) could be
> used, with the convention of reloid = 0 for non-trigger calls.

Seems fine to me. Currently there seems to be no point in having two
distinct hashes at all, given that they have exactly the same key and
that trigger functions cannot be called plainly anyway. Unless somebody
screams, that will mean some stuff defined in the plpython headers
changes, but that doesn't seem like a legitimate concern to me.

Tom, any other committer: I am happy to provide a <= 9.1 version of the
patch, because plpython has been noticeably restructured in 9.1=>9.2,
but I am not sure if that helps you at all.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> Tom, any other committer: I am happy to provide a <= 9.1 version of the
> patch, because plpython has been noticeably restructured in 9.1=>9.2,
> but I am not sure if that helps you at all.

We need to back-patch to any branch where this is broken. If the
patch would change materially before 9.2, then yeah it would help if
you could create/test a version for that too.

			regards, tom lane
On 2013-01-24 11:40:33 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-01-24 15:03:36 +0100, Sandro Santilli wrote:
> > ISTM the caching code for plpythonu trigger functions has been broken
> > for some time. The bug seems to be that PLy_procedure_get looks up the
> > function in a separate cache for trigger functions (PLy_trigger_cache)
> > but still only uses the function oid for lookup.
> > So the same function can be looked up for two tables, and when the cached
> > entry originally refers to the old table that has been dropped, we see
> > the above error.
>
> > The fix seems to be to make PLy_trigger_cache have a key of (reloid,
> > fn_oid) instead of just fn_oid.
>
> If there's anything in either the cache entry itself or the function's
> other persistent state that is dependent on which table it's attached
> to, then yeah, obviously. But another conceivable solution would be to
> not store any such information.
>
> In a quick look, it seems like the "PLyTypeInfo result" field might
> depend on which table the trigger is attached to, so unless we can
> get along without that, we'll need to expand the hash key.
>
> Actually I'm wondering a bit why there are two hash tables at all.
> Seems like one table with a hash key of (fnoid, reloid) could be
> used, with the convention of reloid = 0 for non-trigger calls.

It's slightly more complex than just making it one hash table with an
extended key. When validating a trigger function we don't have a
relation to do the cache lookup with. I chose to handle that case by
not doing a cache lookup at all, which imo is a sensible choice. I am
less sure about whether it's nicer to make that decision in
PLy_procedure_get as I have, or whether it would have been better to
make PLy_procedure_create public and directly use it in
plpython_validator.

The bug got introduced in 46211da1b84bc3537e799ee1126098e71c2428e8,
which interestingly says "Using HTABs instead of Python dictionaries
makes error recovery easier, and allows for procedures to be cached
based on their OIDs, not their names." - but the caching seems to
always have used ("%u_%u", fn_oid, tgreloid) as its key, which explains
why the bug didn't exist back then.

Anyway, as far as I can see only 9.1 upwards are affected.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
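A rough sketch of the behaviour Andres describes, with hypothetical
variable names (this is not necessarily the committed code): when a
trigger function is validated at CREATE FUNCTION time there is no
relation OID to key the cache on, so the lookup is bypassed entirely.

static PLyProcedure *
PLy_procedure_get(Oid fn_oid, Oid fn_rel, bool is_trigger)
{
    /*
     * During validation of a trigger function, fn_rel is InvalidOid, so
     * there is no sensible (fn_oid, fn_rel) key; skip the cache and
     * build a throwaway procedure instead.
     */
    bool    use_cache = !(is_trigger && fn_rel == InvalidOid);

    ...
    if (use_cache)
        entry = hash_search(PLy_procedure_cache, &key, HASH_ENTER, &found);
    ...
}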
Andres Freund <andres@2ndquadrant.com> writes:
> It's slightly more complex than just making it one hash table with an
> extended key. When validating a trigger function we don't have a
> relation to do the cache lookup with. I chose to handle that case by
> not doing a cache lookup at all, which imo is a sensible
> choice.

Seems fair. However ... why is it safe for PLy_procedure_create to be
using the same name for multiple instances of a trigger function?
Should we not be including the rel OID when building the procName
string?

			regards, tom lane
On 2013-01-25 14:51:39 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > It's slightly more complex than just making it one hash table with an
> > extended key. When validating a trigger function we don't have a
> > relation to do the cache lookup with. I chose to handle that case by
> > not doing a cache lookup at all, which imo is a sensible
> > choice.
>
> Seems fair. However ... why is it safe for PLy_procedure_create to be
> using the same name for multiple instances of a trigger function?
> Should we not be including the rel OID when building the procName
> string?

I don't think it's a problem; given the way python works I am pretty
sure it will result in independent functions.

Each PLy_procedure_compile will run the source code in a copy of
PLy_interp_globals, therefore the independent compilations shouldn't
affect each other.

I am not sure why it builds the call to the function via eval'ing a
"$funcname()" instead of using the result of PyRun_String, which will
return a reference to the function, but that's an independent issue.

Now I think an argument can be made that it would be nicer for
debugging purposes to have clearly distinguishable function names, but
I personally never needed it, and it probably wouldn't be something to
backpatch. People might rely on those function names.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
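In Python C-API terms, the isolation Andres is relying on amounts to
roughly the following; this is a sketch under those assumptions, not
the actual plpython source, and crafted_src is a hypothetical name for
the generated "def ..." text:

/* Each compilation executes the generated source in its own copy of
 * the interpreter globals, so two functions compiled under the same
 * Python-level name cannot clobber each other. */
PyObject *globals = PyDict_Copy(PLy_interp_globals);    /* fresh namespace */
PyObject *rv = PyRun_String(crafted_src, Py_file_input, globals, globals);

Py_XDECREF(rv);
/* the callable is later looked up and eval'ed within these same globals */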
On 2013-01-25 21:07:51 +0100, Andres Freund wrote:
> On 2013-01-25 14:51:39 -0500, Tom Lane wrote:
> > Seems fair. However ... why is it safe for PLy_procedure_create to be
> > using the same name for multiple instances of a trigger function?
> > Should we not be including the rel OID when building the procName
> > string?
>
> I don't think it's a problem; given the way python works I am pretty
> sure it will result in independent functions.
>
> Each PLy_procedure_compile will run the source code in a copy of
> PLy_interp_globals, therefore the independent compilations shouldn't
> affect each other.
>
> I am not sure why it builds the call to the function via eval'ing a
> "$funcname()" instead of using the result of PyRun_String, which will
> return a reference to the function, but that's an independent issue.
>
> Now I think an argument can be made that it would be nicer for
> debugging purposes to have clearly distinguishable function names, but
> I personally never needed it, and it probably wouldn't be something to
> backpatch. People might rely on those function names.

Just for fun I tried how easy it is to have some fun with the function
name. Guess what happens when the attached sql file is executed ;)

It quite possibly doesn't work if copied from here, but for quicker
viewing:

CREATE OR REPLACE FUNCTION "foo():
pass

import os
os._exit(1)

def "() RETURNS void LANGUAGE plpythonu AS $$pass$$;

Yep:

LOG:  server process (PID 29317) exited with exit code 1
DETAIL:  Failed process was running: CREATE OR REPLACE FUNCTION "foo():
	pass

	import os
	os._exit(1)

	def "() RETURNS void LANGUAGE plpythonu AS $$pass$$;
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.

Anyway, back to more sensible things ;)

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> It quite possibly doesn't work if copied from here, but for quicker
> viewing:
>
> CREATE OR REPLACE FUNCTION "foo():
> pass
>
> import os
> os._exit(1)
>
> def "() RETURNS void LANGUAGE plpythonu AS $$pass$$;
>
> Yep:
> LOG:  server process (PID 29317) exited with exit code 1

Ooops. Good thing plpython is already superuser-only, or this would be
a security problem. So should we stop including the SQL function name
in the python name?

			regards, tom lane
On 2013-01-25 15:40:46 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > It quite possibly doesn't work if copied from here, but for quicker
> > viewing:
> >
> > CREATE OR REPLACE FUNCTION "foo():
> > pass
> >
> > import os
> > os._exit(1)
> >
> > def "() RETURNS void LANGUAGE plpythonu AS $$pass$$;
> >
> > Yep:
> > LOG:  server process (PID 29317) exited with exit code 1
>
> Ooops. Good thing plpython is already superuser-only, or this would be
> a security problem. So should we stop including the SQL function name
> in the python name?

I don't think it's worth it, and having something recognizable in there
is useful in (python) backtraces. I am pretty sure I could do the same
in the function source with some trickery, even without the function
being executed, just because python also executes stuff during parsing,
so I don't think it would buy enough.

We probably should care about it if there were a trusted plpython (so
you couldn't cause an incomplete function to be called or whatever),
but I don't see that happening anytime soon.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> The bug got introduced in 46211da1b84bc3537e799ee1126098e71c2428e8,
> which interestingly says "Using HTABs instead of Python dictionaries
> makes error recovery easier, and allows for procedures to be cached
> based on their OIDs, not their names." - but the caching seems to
> always have used ("%u_%u", fn_oid, tgreloid) as its key, which explains
> why the bug didn't exist back then.

> Anyway, as far as I can see only 9.1 upwards are affected.

Patch applied with some editorialization, mostly cosmetic but one thing
not so much: you introduced uninitialized-variable hazards into
PLy_procedure_get, since if use_cache is false then found, entry, and
proc wouldn't get set up. Neither version of gcc I tried complained
about this, and I assume yours didn't either, which is a bit
annoying/scary. I wonder whether the "volatile" marker prevents that?

BTW, it seems to me that there is another bug in PLy_procedure_get,
which is that it's not thinking about recursive use of a procedure.
Consider

	call f
		... somebody updates f's pg_proc tuple
		... recursively call f

At this point PLy_procedure_get will decide the cache entry is obsolete
and happily trash the procedure data that the outer call is still using.
We ran into this long ago with plpgsql and fixed it by adding
reference-counting logic. Probably the same needs to be done here.

There are probably not that many people trying to use plpython functions
in such a way (note the recursion would have to happen via a SPI query,
not directly within Python). But still...

			regards, tom lane
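For illustration, the reference-counting scheme Tom alludes to could
look roughly like this; the field and helper names are hypothetical,
not taken from the plpgsql or plpython sources:

typedef struct PLyProcedure
{
    /* ... existing fields ... */
    int     refcount;   /* active executions, plus one pin for the cache */
} PLyProcedure;

/* On call entry: pin the procedure, so a recursive lookup that decides
 * the cache entry is stale cannot free it out from under the outer call. */
proc->refcount++;

/* On call exit, and when the cache evicts a stale entry: unpin, and
 * physically free only once nobody references the procedure anymore. */
if (--proc->refcount == 0)
    PLy_procedure_delete(proc);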
On 2013-01-25 17:07:59 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > The bug got introduced in 46211da1b84bc3537e799ee1126098e71c2428e8,
> > which interestingly says "Using HTABs instead of Python dictionaries
> > makes error recovery easier, and allows for procedures to be cached
> > based on their OIDs, not their names." - but the caching seems to
> > always have used ("%u_%u", fn_oid, tgreloid) as its key, which
> > explains why the bug didn't exist back then.
>
> > Anyway, as far as I can see only 9.1 upwards are affected.
>
> Patch applied with some editorialization, mostly cosmetic but one thing
> not so much: you introduced uninitialized-variable hazards into
> PLy_procedure_get, since if use_cache is false then found, entry, and
> proc wouldn't get set up.

Gah. Now I am pissed at myself. I thought I had initialized 'found' to
false, which should be enough.

> Neither version of gcc I tried complained about this, and I assume yours
> didn't either, which is a bit annoying/scary. I wonder whether the
> "volatile" marker prevents that?

I think the control flow is just too complex for gcc; it probably
cannot trace anything across a setjmp(), although it theoretically
should be possible given that the failure is in the branch that is
executed unconditionally.

Given that the real culprit is 'found', I don't think it's the
volatiles.

> BTW, it seems to me that there is another bug in PLy_procedure_get,
> which is that it's not thinking about recursive use of a procedure.
> Consider
>
> 	call f
> 		... somebody updates f's pg_proc tuple
> 		... recursively call f
>
> At this point PLy_procedure_get will decide the cache entry is obsolete
> and happily trash the procedure data that the outer call is still using.
> We ran into this long ago with plpgsql and fixed it by adding
> reference-counting logic. Probably the same needs to be done here.
>
> There are probably not that many people trying to use plpython functions
> in such a way (note the recursion would have to happen via a SPI query,
> not directly within Python). But still...

I am afraid you're right. It seems ugly that all PLs have to reinvent
that kind of cache + invalidation logic. I am e.g. not sure they go far
enough in invalidating stuff like parameter types & io functions.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-01-25 17:07:59 -0500, Tom Lane wrote:
>> Neither version of gcc I tried complained about this, and I assume yours
>> didn't either, which is a bit annoying/scary. I wonder whether the
>> "volatile" marker prevents that?

> I think the control flow is just too complex for gcc; it probably
> cannot trace anything across a setjmp(), although it theoretically
> should be possible given that the failure is in the branch that is
> executed unconditionally.

> Given that the real culprit is 'found', I don't think it's the
> volatiles.

Yeah, but if it had warned about the volatiles you'd probably not have
missed the more significant issue. I think the reason 'found' wasn't
complained of is that its address gets passed to another function ---
IME that seems to shut off any initialization warning, even if the
compiler can see that the function call might not happen. But I'd have
thought it'd warn about the volatiles.

Anyway, not our problem.

			regards, tom lane
>> Consider
>>
>> 	call f
>> 		... somebody updates f's pg_proc tuple
>> 		... recursively call f
>>
>> At this point PLy_procedure_get will decide the cache entry is obsolete
>> and happily trash the procedure data that the outer call is still using.
>> We ran into this long ago with plpgsql and fixed it by adding
>> reference-counting logic. Probably the same needs to be done here.

> I am afraid you're right. It seems ugly that all PLs have to reinvent
> that kind of cache + invalidation logic. I am e.g. not sure they go far
> enough in invalidating stuff like parameter types & io functions.

Pardon for coming late to the conversation. First of all, thanks for
fixing the trigger function caching bug; it truly was a silly one.

The problem with reentrant calls of PLy_procedure_get and freeing an
in-use PLyProcedure struct looks real enough. It took me some time to
make it actually fail, though, as even a piece of code like this:

create or replace function nasty() returns void language plpythonu as $$
if 'executed' in GD:
    return
GD['executed'] = True
plpy.execute('alter function nasty() cost 10')
plpy.execute("select nasty()")
plpy.error('oops')
$$;
select nasty();

was not bombing out, because after the reentrant call of
PLy_procedure_get freed the PLyProcedure struct, it allocated another
one for itself that ended up in the exact same position in memory... I
ended up adding a dummy PLy_malloc somewhere and got a segfault when
plpy.error tried to access the current function name to add it to the
errcontext.

We could add a refcount to PLyProcedure, or reuse the
PLy_execution_contexts scheme that's already in place. When
PLy_procedure_get decides it needs to recreate a cached function, it
could walk the execution-contexts stack looking for a matching pointer.
I'm not yet sure what it should do in that case; the only provision we
have for freeing PLyProcedure structs is when they get evicted from the
hash. Perhaps a refcount is the only way to handle it sanely.

By the way, I agree that the caching in PL/Python (and PLs in general)
is ugly and ad hoc. Having to share memory management with the embedded
PL doesn't help, though.

Cheers,
Jan
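A sketch of the stack-walking variant Jan mentions, assuming the
execution-context list links frames through a next pointer and records
the procedure each frame is running; the helper name is hypothetical:

/* Is this procedure still being executed by any frame on the
 * execution-context stack?  If so, it must not be freed when the cache
 * decides its entry is stale. */
static bool
PLy_procedure_in_use(PLyProcedure *proc)
{
    PLyExecutionContext *context;

    for (context = PLy_execution_contexts; context != NULL;
         context = context->next)
    {
        if (context->curr_proc == proc)
            return true;
    }
    return false;
}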