Thread: pl/python improvements
Hi,

no, no patch(es) yet. I'm going through plpython.c trying as best I can to improve things there. I'll have a patch (or patches) ready for the January commitfest, but I thought I'd open up a discussion already to spare me having to redo features because the way I attacked the problem is a dead end. Be warned, this ended up being a very long email...

The code is on https://github.com/wulczer/postgres, in the plpython branch. I'll be rebasing it regularly, so don't be surprised by commit hashes changing.

Currently the improvements are:

* added a validator, so syntax errors are caught on function creation, and you don't pay compile overhead when first calling the function
* use HTABs instead of Python dictionaries to cache procedures, maintaining two separate HTABs for procedures and triggers (this way you can use the procedure's OID as the hash key)
* get rid of the global PLy_error_in_progress; hopefully things that cause errors are reported immediately
* use palloc(TopMemoryContext) instead of malloc() for things that need to live until the end of the session
* execute SPI calls in a subtransaction, report errors back to Python as exceptions that can be caught etc.
* do not prefix error messages with "PL/Python", just use the error string as errmsg
* if a function has composite type inputs, check if the underlying types have not changed when calling the function and recompile it if necessary (fixes a TODO item)
* support tracebacks and make them look just like Python tracebacks (as much as possible, that is)
* fix plpy.error and plpy.fatal to actually raise a plpy.Error/plpy.Fatal exception that can be caught etc.
* remove volatile modifiers from variables (AIUI they are completely unnecessary anywhere in plpython.c)
* general small refactorings, removing unnecessary code, splitting big procedures into smaller ones

Planned improvements include:

* explicit subxact starting, so you can issue consecutive plpy.execute calls atomically
* split plpython.c into smaller modules, like plpython_traceback.c, plpython_type_conversion.c, plpython_plpy.c etc.
* variadic argument handling (is this possible for a non-plpgsql PL?)
* multiple OUT parameters support (a TODO item, I think "table function support" meant to say that)
* add a DB-API interface on top of SPI (a TODO item)

Not sure if I'll make it in time with all of them; I have put them in more-or-less priority order.

Now that you're excited, the questions.

Q1

To check for composite types changing after the procedure I/O functions have been cached, I store the relid of the base relation for the type (typeidTypeRelid(desc->tdtypeid), where desc is the type's TupleDesc), as well as the xmin and tid of the relation's tuple from pg_class. Then when the function is invoked, for each composite parameter I fetch the relation's tuple from pg_class again and compare xmin and tid. This means that functions with composite parameters are slower, because there's a syscache search for each parameter for each invocation. I roughly measured the slowdown for a function with one composite param to be 4%.
OTOH this works now:

CREATE TABLE employee (
    name text,
    basesalary integer,
    bonus integer
);
INSERT INTO employee VALUES ('John', 100, 10), ('Mary', 200, 10);

CREATE OR REPLACE FUNCTION test_composite_table_input(e employee) RETURNS integer AS $$
return e['basesalary'] + e['bonus']
$$ LANGUAGE plpythonu;

SELECT name, test_composite_table_input(employee.*) FROM employee;

ALTER TABLE employee DROP bonus;
ALTER TABLE employee ADD bonus integer;
UPDATE employee SET bonus = 10;

-- will recompile the function and recache I/O functions
SELECT name, test_composite_table_input(employee.*) FROM employee;

PL/Perl has the same problem, BTW.

The question is: am I going the right way checking if the composite type is up-to-date? The TODO item says: (is this possible, easy, quick?). My answer is: I guess, kinda, dunno.

Q2

The format of error messages changes, because the "PL/Python" prefix gets removed (I want raise plpy.Error("foo") to result in "ERROR: plpy.Error: foo", not in "PL/Python: plpy.Error: foo") and because of other things. For instance, since plpy.error(msg) now just raises a plpy.Error exception, you will get "ERROR: plpy.Error: msg" and not just "ERROR: msg". This potentially breaks clients parsing error messages, but I don't buy that argument really. Still, is it acceptable?

Q3

Tracebacks! The way I implemented them (based on the patch linked from the TODO item) they look like this:

CREATE FUNCTION nested_error() RETURNS text AS '
def fun1():
    plpy.error("boom")

def fun2():
    fun1()

def fun3():
    fun2()

fun3()
return "not reached"
' LANGUAGE plpythonu;

SELECT nested_error();
ERROR:  plpy.Error: boom
DETAIL:  Traceback (most recent call last):
  PL/Python function "nested_error", line 10, in <module>
    fun3()
  PL/Python function "nested_error", line 8, in fun3
    fun2()
  PL/Python function "nested_error", line 5, in fun2
    fun1()
  PL/Python function "nested_error", line 2, in fun1
    plpy.error("boom")
CONTEXT:  PL/Python function "nested_error"

Tracebacks are added to every error, even if their depth is 1, making them a bit redundant, so you get things like:

CREATE FUNCTION sql_syntax_error() RETURNS text AS
'plpy.execute("syntax error")'
LANGUAGE plpythonu;

SELECT sql_syntax_error();
ERROR:  plpy.SPIError: syntax error at or near "syntax"
DETAIL:  Traceback (most recent call last):
  PL/Python function "sql_syntax_error", line 1, in <module>
    plpy.execute("syntax error")
CONTEXT:  PL/Python function "sql_syntax_error"

Notice that they look almost exactly like a traceback Python would give you. I had to cheat there a bit, to hide the ugly generated Python function name and remove the topmost frame from the traceback that was pointing at the toplevel interpreter module.

Another oddity is that Python tracebacks have line numbers, so I am storing the entire text of the function in the cache to print the corresponding line of each frame, just like the real Python traceback.py module. I guess that's negligible overhead? Or should I refetch prosrc every time a traceback is printed? Note that traceback computation costs some cycles, but I think it's OK, since we're not optimizing for that case, obviously.

The question is: is this how you'd like plpython tracebacks to look?

Q4

Splitting into smaller modules - should I do it? I find a ~4k LOC module to be too big, but then again, if I do the split, the resulting diff will be difficult to review. Or won't it?

Q5

Passing function arguments as globals is IMHO ugly.
Not sure what the TODO item wants to say by "Passing them as globals means functions cannot be called recursively." I tried and recursion worked just fine. But it's still quite ugly IMHO... the problem is that I don't know how to fix it. Creating a Python function with a signature mimicking that of the SQL function is tricky because of unnamed arguments. Also, passing arguments as real arguments would break code that uses the 'global' trick suggested in the manual. Even though the manual says you should not rely on it, I still feel bad about breaking people's code. Does anyone have any bright ideas on how to improve argument passing in plpython?

Q6 (last one)

Does anyone have any more things on her/his PL/Python Christmas wishlist? plpy.make_coffee()?

Thanks for reading all the way here,
Jan
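As a minimal sketch of what the exception behaviour described above is meant to allow (assuming the proposed plpy.Error class and the existing plpy.SPIError, plus the employee table from the Q1 example; the class names and namespaces are still up for discussion in this thread):

CREATE OR REPLACE FUNCTION error_handling_sketch() RETURNS text AS $$
# Sketch only: under the proposal, plpy.error() raises plpy.Error and a
# failed SPI call raises plpy.SPIError, so both can be handled in Python.
try:
    plpy.execute("insert into employee values ('Bob', 300, 10)")
except plpy.SPIError:
    # the SPI call ran in its own subtransaction, so we can keep going
    return "SPI call failed"

try:
    plpy.error("boom")
except plpy.Error:
    return "caught plpy.error()"

return "not reached"
$$ LANGUAGE plpythonu;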
On Tuesday 07 December 2010 20:17:57 Jan Urbański wrote:
> Hi,
>
> no, no patch(es) yet. I'm going through plpython.c trying as best I can
> to improve things there. I'll have a patch (or patches) ready for the
> January commitfest, but I thought I'd open up a discussion already to
> spare me having to redo features because the way I attacked the problem
> is a dead end. Be warned, this ended up being a very long email...

Nice to see improvements there.

> Currently the improvements are:
> * execute SPI calls in a subtransaction, report errors back to Python
> as exceptions that can be caught etc.

You're doing that unconditionally? I think the performance impact of this will be too severe... Subtransactions can get quite expensive - especially in the face of HS.

> * remove volatile modifiers from variables (AIUI they are completely
> unnecessary anywhere in plpython.c)

I haven't read the code recently, but they might be needed for local variables changed inside a PG_TRY catching an error and also accessed in the CATCH block or afterwards if you're not rethrowing. See the man page of longjmp.

> * explicit subxact starting, so you can issue consecutive plpy.execute
> calls atomically

That likely obsoletes my comment from above.

> * split plpython.c into smaller modules, like plpython_traceback.c,
> plpython_type_conversion.c, plpython_plpy.c etc.
> * variadic argument handling (is this possible for a non-plpgsql PL?)
> * multiple OUT parameters support (a TODO item, I think "table function
> support" meant to say that)
> * add a DB-API interface on top of SPI (a TODO item)

I am not hot on that, as the DB-API is simply too unspecified in too many areas (prepared statements, anyone?).

Thanks,

Andres
On 07/12/10 21:33, Andres Freund wrote:
> On Tuesday 07 December 2010 20:17:57 Jan Urbański wrote:
>> * execute SPI calls in a subtransaction, report errors back to Python
>> as exceptions that can be caught etc.
> You're doing that unconditionally? I think the performance impact of this
> will be too severe... Subtransactions can get quite expensive - especially
> in the face of HS.

See http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg163004.html :(

>> * explicit subxact starting, so you can issue consecutive plpy.execute
>> calls atomically
> That likely obsoletes my comment from above.

Not really, I'm afraid :( You can pay the subxact overhead once with explicit subxact starting, but you always have to pay it...

with plpy.subxact():
    plpy.execute("select 1")
    plpy.execute("select 2")
    plpy.execute("select 3")

will start only 1 subxact, as opposed to 3.

Cheers,
Jan
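A slightly fuller sketch of the atomic behaviour this is aiming for, assuming the plpy.subxact() context manager behaves as proposed here and a hypothetical table t with a single integer column: if any statement inside the block fails, the whole block is rolled back and the error surfaces as a catchable SPIError.

CREATE OR REPLACE FUNCTION atomic_inserts() RETURNS text AS $$
# Sketch only: plpy.subxact() is the API proposed in this thread, and the
# table "t" (one integer column) is assumed to exist.
try:
    with plpy.subxact():
        plpy.execute("insert into t values (1)")
        plpy.execute("insert into t values (2)")
        plpy.execute("insert into t values ('oops')")  # fails
except plpy.SPIError:
    # the subtransaction rolled back, so neither 1 nor 2 was inserted
    return "rolled back"
return "committed"
$$ LANGUAGE plpythonu;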
On tis, 2010-12-07 at 20:17 +0100, Jan Urbański wrote:
> no, no patch(es) yet. I'm going through plpython.c trying as best I can
> to improve things there. I'll have a patch (or patches) ready for the
> January commitfest, but I thought I'd open up a discussion already to
> spare me having to redo features because the way I attacked the problem
> is a dead end. Be warned, this ended up being a very long email...

This all sounds very nice (especially since I had many of the same items on my todo list to tackle after Christmas or so). But I think this would be massively simpler if you created a separate patch/branch/email thread for each feature or change.

> The code is on https://github.com/wulczer/postgres, in the plpython
> branch. I'll be rebasing it regularly, so don't be surprised by commit
> hashes changing.

I think rebasing published repositories isn't encouraged.
On 12/07/2010 04:50 PM, Peter Eisentraut wrote:
>
>> The code is on https://github.com/wulczer/postgres, in the plpython
>> branch. I'll be rebasing it regularly, so don't be surprised by commit
>> hashes changing.
> I think rebasing published repositories isn't encouraged.

Indeed. See <http://progit.org/book/ch3-6.html> for example, which says:

    *Do not rebase commits that you have pushed to a public repository.*
    If you follow that guideline, you’ll be fine. If you don’t, people
    will hate you, and you’ll be scorned by friends and family.

cheers

andrew
On 07/12/10 23:00, Andrew Dunstan wrote:
> On 12/07/2010 04:50 PM, Peter Eisentraut wrote:
>>
>>> The code is on https://github.com/wulczer/postgres, in the plpython
>>> branch. I'll be rebasing it regularly, so don't be surprised by commit
>>> hashes changing.
>> I think rebasing published repositories isn't encouraged.
>
> Indeed. See <http://progit.org/book/ch3-6.html> for example, which says:
>
>     *Do not rebase commits that you have pushed to a public repository.*
>     If you follow that guideline, you’ll be fine. If you don’t, people
>     will hate you, and you’ll be scorned by friends and family.

If someone is interested in merging from that branch, I can refrain from rebasing (just tell me). But I'm really using it just as a patch queue, and pushing to github is a way of having an online backup for the changes, so it's not really a "public repository". It's meant for people who want to try out the code as it stands today, or see the changes that have been made, and don't care about having to rebase their checkout themselves.

I find developing that way much easier than merging from the master branch. If I get conflicts I get them on a single commit of mine, not on the merge commit, which makes them easier to resolve. See https://github.com/diaspora/diaspora/wiki/Git-Workflow for an example of that kind of workflow.

Peter suggested having a mail/patch per feature, and the way I intend to do that is instead of having a dozen branches, have one and after I'm done rebase it interactively to produce incremental patches that apply to master, each one implementing one feature.

Cheers,
Jan
On tis, 2010-12-07 at 23:56 +0100, Jan Urbański wrote:
> Peter suggested having a mail/patch per feature, and the way I intend
> to do that is instead of having a dozen branches, have one and after
> I'm done rebase it interactively to produce incremental patches that
> apply to master, each one implementing one feature.

Fair enough if you want to do it that way, but I'd encourage you to just send in any self-contained features/changes that you have finished.
On 08/12/10 22:41, Peter Eisentraut wrote:
> On tis, 2010-12-07 at 23:56 +0100, Jan Urbański wrote:
>> Peter suggested having a mail/patch per feature, and the way I intend
>> to do that is instead of having a dozen branches, have one and after
>> I'm done rebase it interactively to produce incremental patches that
>> apply to master, each one implementing one feature.
>
> Fair enough if you want to do it that way, but I'd encourage you to just
> send in any self-contained features/changes that you have finished.

I finished work on the PL/Python improvements. There are 9 separate patches, with some interdependencies. The features implemented are:

* general refactoring, dropping the global error state, using dynahash instead of PyDicts for procedure caching, etc.
* a validator function
* executing SPI calls in subtransactions
* starting explicit subtransactions to have atomic behaviour of multiple SPI calls
* providing custom exceptions for SPI errors, so you can catch only UniqueViolations and not have to muck around with SQLCODE (a rough sketch follows at the end of this mail)
* invalidating functions accepting composite types if the type changes after the function has been defined
* traceback support
* table functions (ability to return RECORD and SETOF RECORD)
* using custom parsers for datatypes + an example application with dict<->hstore parsing

The dependencies are:

refactor -> validator -> SPI in subxacts -> explicit subxacts -> custom exceptions for SPI -> invalidate composites -> tracebacks -> table functions -> custom parsers

so everything depends on the refactoring patch, and the SPI changes are incremental.

I will generate these patches and send them in in separate threads/add them to the commitfest app before Christmas (I hope). Meanwhile the code is available at https://github.com/wulczer/postgres. You will find 10 branches there; 9 correspond to these features, and the "plpython" branch is the sum of them all.

From now on I will stop rebasing these branches, and instead just commit fixes to the appropriate branch and merge upwards (from refactor to everything, from spi-in-subxacts to explicit-subxacts and to custom-spi-exceptions, etc.). I'll also try to independently maintain the plpython branch with the sum to make sure nothing has been left out as the feature branches get merged into mainline. Here's to hoping that this will not turn into a conflict hell.

Cheers,
Jan
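As a rough sketch of the custom-SPI-exceptions feature from the list above, assuming the generated UniqueViolation class is reachable from the plpy module (the exact namespace isn't spelled out in this mail) and a hypothetical users table:

CREATE TABLE users (name text PRIMARY KEY);

CREATE OR REPLACE FUNCTION add_user(username text) RETURNS text AS $$
# Sketch only: UniqueViolation is one of the condition-name exceptions
# generated from utils/errcodes.h; all of them inherit from plpy.SPIError.
plan = plpy.prepare("insert into users values ($1)", ["text"])
try:
    plpy.execute(plan, [username])
except plpy.UniqueViolation:
    return "already there"
except plpy.SPIError:
    return "some other SPI failure"
return "inserted"
$$ LANGUAGE plpythonu;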
On Thu, Dec 23, 2010 at 04:08, Jan Urbański <wulczer@wulczer.org> wrote:
> * providing custom exceptions for SPI errors, so you can catch only
> UniqueViolations and not have to muck around with SQLCODE

py-postgresql already has a mapping from error codes to Python exceptions. I think it makes sense to re-use that, instead of inventing new names.
https://github.com/jwp/py-postgresql/blob/v1.1/postgresql/exceptions.py

It also follows the Python convention of ending exception classes with "Error", so instead of UniqueViolation they have UniqueError, and instead of InvalidTextRepresentation they have TextRepresentationError.

> Meanwhile the code
> is available at https://github.com/wulczer/postgres. You will find 10
> branches there; 9 correspond to these features, and the "plpython"
> branch is the sum of them all.

I tried building the plpython branch, but got an unrelated error. I didn't investigate further for now...

make[3]: Entering directory `/home/marti/src/postgresql-py/src/backend/bootstrap'
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv -I. -I. -I../../../src/include -D_GNU_SOURCE -c -o bootparse.o bootparse.c
bootparse.y: In function ‘boot_yyparse’:
bootparse.y:224:16: error: too few arguments to function ‘heap_create’
../../../src/include/catalog/heap.h:37:17: note: declared here
bootparse.y:249:16: error: too few arguments to function ‘heap_create_with_catalog’
../../../src/include/catalog/heap.h:48:12: note: declared here
make[3]: *** [bootparse.o] Error 1

Regards,
Marti
On 23/12/10 12:16, Marti Raudsepp wrote:
> On Thu, Dec 23, 2010 at 04:08, Jan Urbański <wulczer@wulczer.org> wrote:
>> * providing custom exceptions for SPI errors, so you can catch only
>> UniqueViolations and not have to muck around with SQLCODE
>
> py-postgresql already has a mapping from error codes to Python
> exceptions. I think it makes sense to re-use that, instead of
> inventing new names.
> https://github.com/jwp/py-postgresql/blob/v1.1/postgresql/exceptions.py
>
> It also follows the Python convention of ending exception classes with
> "Error", so instead of UniqueViolation they have UniqueError, and instead
> of InvalidTextRepresentation they have TextRepresentationError.

Oh, didn't know that. I see that it does some more fancy things, like defining an inheritance hierarchy for these exceptions and adding some more into the mix.

The names I used are not really invented, they're just plpgsql condition names from http://www.postgresql.org/docs/current/static/errcodes-appendix.html with underscores changed to camel case. Also, since they're autogenerated from utils/errcodes.h they don't have any hierarchy; they just all inherit from SPIError. Sticking "Error" onto every one of them would result in things like SubstringErrorError, so I'm not really sold on that. Basically I think more PL/Python users will be familiar with condition names as used in pl/pgsql than with the names from py-postgresql.

>> Meanwhile the code
>> is available at https://github.com/wulczer/postgres. You will find 10
>> branches there; 9 correspond to these features, and the "plpython"
>> branch is the sum of them all.
>
> I tried building the plpython branch, but got an unrelated error. I
> didn't investigate further for now...
>
> make[3]: Entering directory `/home/marti/src/postgresql-py/src/backend/bootstrap'
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
> -Wendif-labels -fno-strict-aliasing -fwrapv -I. -I. -I../../../src/include
> -D_GNU_SOURCE -c -o bootparse.o bootparse.c
> bootparse.y: In function ‘boot_yyparse’:
> bootparse.y:224:16: error: too few arguments to function ‘heap_create’
> ../../../src/include/catalog/heap.h:37:17: note: declared here
> bootparse.y:249:16: error: too few arguments to function ‘heap_create_with_catalog’
> ../../../src/include/catalog/heap.h:48:12: note: declared here
> make[3]: *** [bootparse.o] Error 1

I'm pretty sure it'll go away if you do make maintainer-clean or git clean -dfx after switching to the plpython branch.

Cheers,
Jan
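For illustration only, the naming rule described above (condition names with underscores changed to camel case) amounts to roughly this; plain Python, not the actual generation code:

def condition_to_exception_name(condition):
    # "unique_violation" -> "UniqueViolation"
    # "invalid_text_representation" -> "InvalidTextRepresentation"
    return "".join(part.capitalize() for part in condition.split("_"))

print(condition_to_exception_name("unique_violation"))
print(condition_to_exception_name("invalid_text_representation"))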
On Dec 23, 2010, at 3:38 AM, Jan Urbański wrote:
> Oh, didn't know that. I see that it does some more fancy things, like
> defining an inheritance hierarchy for these exceptions and adding some
> more into the mix.

Right, there were some cases that appeared to benefit from larger buckets than what the existing code classes provided. Also, some of the exceptions in there are strictly for py-postgresql/client-side things.

> The names I used are not really invented, they're just plpgsql condition
> names from
> http://www.postgresql.org/docs/current/static/errcodes-appendix.html
> with underscores changed to camel case. Also, since they're
> autogenerated from utils/errcodes.h they don't have any hierarchy; they
> just all inherit from SPIError.

For the backend setting, I think this is quite appropriate. However, for pg-python, I had mixed feelings about this as I wanted to be able to leverage py-postgresql's hierarchy, but still have the projects independent. I ended up punting on this one by using a single error class, and forcing the user to compare the codes. =(

> Sticking "Error" onto every one of them would result in things like
> SubstringErrorError, so I'm not really sold on that.

There was some creativity applied to the names in postgresql.exceptions to accommodate things like that. (Like no redundant "Error".)

> Basically I think
> more PL/Python users will be familiar with condition names as used
> in pl/pgsql than with the names from py-postgresql.

I think that's a fair assumption. In fact, I think that might make a good TODO for py-postgresql/pg-python: provide a plpgsql-code-name to exception class mapping.

cheers, jwp