Thread: pl/python improvements

pl/python improvements

From
Jan Urbański
Date:
Hi,

no, no patch(es) yet. I'm going through plpython.c trying as best I can
to improve things there. I'll have a patch (or patches) ready for the
January commitfest, but I thought I'd open up a discussion already to
spare me having to redo features because the way I attacked the problem
is a dead end. Be warned, this ended up being a very long email...

The code is on https://github.com/wulczer/postgres, in the plpython
branch. I'll be rebasing it regularly, so don't be surprised by commit
hashes changing.

Currently the improvements are:
 * added a validator, so syntax errors are caught on function creation
and you don't pay compile overhead when first calling the function (a
minimal sketch follows after this list)
 * use HTABs instead of Python dictionaries to cache procedures,
maintaining two separate HTABs for procedures and triggers (this way you
can use the procedure's OID as the hash key)
 * get rid of the global PLy_error_in_progress; hopefully things that
cause errors are reported immediately
 * use palloc(TopMemoryContext) instead of malloc() for things that need
to live until the end of the session
 * execute SPI calls in a subtransaction, report errors back to Python
as exceptions that can be caught etc.
 * do not prefix error messages with "PL/Python", just use the error
string as errmsg
 * if a function has composite type inputs, check whether the underlying
types have changed when calling the function and recompile it if
necessary (fixes a TODO item)
 * support tracebacks and make them look just like Python tracebacks (as
much as possible, that is)
 * fix plpy.error and plpy.fatal to actually raise a
plpy.Error/plpy.Fatal exception that can be caught etc.
 * remove volatile modifiers from variables (AIUI they are completely
unnecessary anywhere in plpython.c)
 * general small refactorings, removing unnecessary code, splitting big
procedures into smaller ones
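
To make the validator item concrete, here is a minimal sketch of the
kind of mistake it is meant to catch (the function is made up; the point
is only that the Python syntax error gets reported at CREATE FUNCTION
time instead of at the first call):

CREATE FUNCTION broken() RETURNS void AS $$
if True          # missing colon: a Python syntax error
    pass
$$ LANGUAGE plpythonu;
-- with the validator this fails right away; without it, CREATE FUNCTION
-- succeeds and the error only shows up on the first SELECT broken()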

Planned improvements include:
 * explicit subxact starting, so you can issue consecutive plpy.execute
calls atomically
 * split plpython.c into smaller modules, like plpython_traceback.c,
plpython_type_conversion.c, plpython_plpy.c etc.
 * variadic argument handling (is this possible for a non-plpgsql PL?)
 * multiple OUT parameters support (a TODO item, I think "table function
support" meant to say that)
 * add a DB-API interface on top of SPI (a TODO item)

Not sure if I'll make it in time with all of them; I have put them in
more-or-less priority order.

Now that you're excited, the questions.

Q1

To check for composite types changing after the procedure I/O functions
have been cached, I store the relid of the base relation for the type
(typeidTypeRelid(desc->tdtypeid), where desc is the type's TupleDesc),
as well as the xmin and tid of the relation's tuple from pg_class. Then
when the function is invoked, for each composite parameter I fetch the
relation's tuple from pg_class again and compare xmin and tid. This
means that functions with composite parameters are slower, because
there's a syscache search for each parameter for each invocation. I
roughly measured the slowdown for a function with one composite param to
be 4%. OTOH this works now:

CREATE TABLE employee (
    name text,
    basesalary integer,
    bonus integer
);

INSERT INTO employee VALUES ('John', 100, 10), ('Mary', 200, 10);

CREATE OR REPLACE FUNCTION test_composite_table_input(e employee)
RETURNS integer AS $$
return e['basesalary'] + e['bonus']
$$ LANGUAGE plpythonu;

SELECT name, test_composite_table_input(employee.*) FROM employee;

ALTER TABLE employee DROP bonus;

ALTER TABLE employee ADD bonus integer;
UPDATE employee SET bonus = 10;

-- will recompile the function and recache I/O functions
SELECT name, test_composite_table_input(employee.*) FROM employee;


PL/Perl has the same problem, BTW. The question is: am I going about
this the right way, checking whether the composite type is up to date?
The TODO item says:
(is this possible, easy, quick?). My answer is: I guess, kinda, dunno.

Q2

The format of error messages changes, because the "PL/Python" prefix
gets removed (I want raise plpy.Error("foo") to result in "ERROR:
plpy.Error: foo", not in "PL/Python: plpy.Error: foo") and because of
other things. Like for instance since plpy.error(msg) now just raises a
plpy.Error exception, you will get "ERROR: plpy.Error: msg" and not just
"ERROR: msg". This potentially breaks clients parsing error messages,
but I don't buy that argument really. Still, is it acceptable?
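
To illustrate the change described above with a made-up function (the
output format shown is the intended one, not verbatim patched output):

CREATE FUNCTION report_error() RETURNS void AS $$
plpy.error("msg")   # with the patch this simply raises plpy.Error("msg")
$$ LANGUAGE plpythonu;

SELECT report_error();
-- old format:  ERROR:  msg
-- new format:  ERROR:  plpy.Error: msg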

Q3

Tracebacks! The way I implemented them (based on the patch linked from
the TODO item) they look like this:

CREATE FUNCTION nested_error() RETURNS text AS
'def fun1():
    plpy.error("boom")

def fun2():
    fun1()

def fun3():
    fun2()

fun3()
return "not reached"
' LANGUAGE plpythonu;
SELECT nested_error();
ERROR:  plpy.Error: boom
DETAIL:  Traceback (most recent call last):
  PL/Python function "nested_error", line 10, in <module>
    fun3()
  PL/Python function "nested_error", line 8, in fun3
    fun2()
  PL/Python function "nested_error", line 5, in fun2
    fun1()
  PL/Python function "nested_error", line 2, in fun1
    plpy.error("boom")
CONTEXT:  PL/Python function "nested_error"

Tracebacks are added to every error, even if their depth is 1, making
them a bit redundant, so you get things like:

CREATE FUNCTION sql_syntax_error() RETURNS text AS
'plpy.execute("syntax error")'
LANGUAGE plpythonu;
SELECT sql_syntax_error();
ERROR:  plpy.SPIError: syntax error at or near "syntax"
DETAIL:  Traceback (most recent call last):
  PL/Python function "sql_syntax_error", line 1, in <module>
    plpy.execute("syntax error")
CONTEXT:  PL/Python function "sql_syntax_error"

Notice that they look almost exactly like a traceback Python would give
you. I had to cheat there a bit, to hide the ugly generated Python
function name and remove the topmost frame from the traceback that was
pointing at the toplevel interpreter module.
Another oddity is that Python tracebacks have line numbers, so I am
storing the entire text of the function in the cache to print the
corresponding line of each frame, just like the real Python traceback.py
module. I guess that's negligible overhead? Or should I refetch prosrc
every time a traceback is printed? Note that traceback computation costs
some cycles, but I think it's OK, since we're not optimizing for that
case, obviously. The question is: is this how you'd like plpython
tracebacks to look?

Q4

Splitting into smaller modules - should I do it? I find a ~4k loc module
to be too big, but then again, if I do the split, the resulting diff
will be difficult to review. Or won't it?

Q5

Passing function arguments as globals is IMHO ugly. Not sure what the
TODO item means by "Passing them as globals means functions cannot be
called recursively." I tried, and recursion worked just fine. But it's
still quite ugly IMHO... the problem is that I don't know how to fix it.
Creating a Python function with a signature mimicking the one of the SQL
function is tricky because of unnamed arguments. Also, passing arguments
as real arguments would break code that uses the 'global' trick
suggested in the manual. Even though the manual says you should not rely
on it, I still feel bad about breaking people's code. Does anyone have
any bright ideas on how to improve argument passing in plpython?
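
For reference, this is (roughly) the 'global' trick from the manual that
real argument passing would break; pystrip is the example function the
documentation uses:

CREATE FUNCTION pystrip(x text) RETURNS text AS $$
global x       # without this, the assignment below makes x local to the
               # generated function and reading it raises UnboundLocalError
x = x.strip()
return x
$$ LANGUAGE plpythonu;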

Q6 (last one)

Does anyone have any more items on their PL/Python Christmas wishlist?
plpy.make_coffee()?

Thanks for reading all the way here,
Jan


Re: pl/python improvements

From
Andres Freund
Date:
On Tuesday 07 December 2010 20:17:57 Jan Urbański wrote:
> Hi,
>
> no, no patch(es) yet. I'm going through plpython.c trying as best I can
> to improve things there. I'll have a patch (or patches) ready for the
> January commitfest, but I thought I'd open up a discussion already to
> spare me having to redo features because the way I attacked the problem
> is a dead end. Be warned, this ended up being a very long email...
Nice to see improvements there.

> Currently the improvements are:
>  * execute SPI calls in a subtransaction, report errors back to Python
> as exceptions that can be caught etc.
You're doing that unconditionally? I think the performance impact of this will
be too severe... Subtransactions can get quite expensive - especially in the
face of HS.

>  * remove volatile modifiers from variables (AIUI they are completely
> not necessary anywhere in plpython.c)
I haven't read the code recently, but they might be needed for local variables
changed inside a PG_TRY catching an error and also accessed in the CATCH block
or afterwards if you're not rethrowing.

See the man page of longjmp.

>  * explicit subxact starting, so you can issue consecutive plpy.execute
> calls atomically
That likely obsoletes my comment from above.

>  * split plpython.c into smaller modules, like plpython_traceback.c,
> plpython_type_conversion.c, plpython_plpy.c etc.
>  * variadic argument handling (is this possible for a non-plpgsql pl?)
>  * multiple OUT parameters support (a TODO item, I think "table function
> support" meant to say that)
>  * add a DB-API interface on top of SPI (a TODO item)
I am not hot on that, as DB-API is simply too underspecified in too many areas
(prepared statements, anyone?).

Thanks,

Andres


Re: pl/python improvements

From
Jan Urbański
Date:
On 07/12/10 21:33, Andres Freund wrote:
> On Tuesday 07 December 2010 20:17:57 Jan Urbański wrote:
>>  * execute SPI calls in a subtransaction, report errors back to Python
>> as exceptions that can be caught etc.
> You're doing that unconditionally? I think the performance impact of this will
> be too severe... Subtransactions can get quite expensive - especially in the 
> face of HS.

See
http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg163004.html :(

>>  * explicit subxact starting, so you can issue consecutive plpy.execute
>> calls atomically
> That likely obsoletes my comment from above.

Not really, I'm afraid :( You can pay the subxact overhead once with
explicit subxact starting, but you always have to pay it...

with plpy.subxact():
    plpy.execute("select 1")
    plpy.execute("select 2")
    plpy.execute("select 3")

will start only 1 subxact, as opposed to 3.
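
Together with the SPI errors being raised as catchable exceptions, the
intended net effect would be something like this (a sketch of the
proposed plpy.subxact interface; the accounts table and the function
itself are made up):

CREATE FUNCTION transfer() RETURNS void AS $$
try:
    with plpy.subxact():
        plpy.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
        plpy.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
except plpy.SPIError:
    plpy.notice("transfer failed, both updates were rolled back together")
$$ LANGUAGE plpythonu;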

Cheers,
Jan


Re: pl/python improvements

From
Peter Eisentraut
Date:
On tis, 2010-12-07 at 20:17 +0100, Jan Urbański wrote:
> no, no patch(es) yet. I'm going through plpython.c trying as best I can
> to improve things there. I'll have a patch (or patches) ready for the
> January commitfest, but I thought I'd open up a discussion already to
> spare me having to redo features because the way I attacked the problem
> is a dead end. Be warned, this ended up being a very long email...

This all sounds very nice (especially since I had many of the same items
on my todo list to tackle after Christmas or so).  But I think this
would be massively simpler if you created a separate patch/branch/email
thread for each feature or change.

> The code is on https://github.com/wulczer/postgres, in the plpython
> branch. I'll be rebasing it regularly, so don't be surprised by commit
> hashes changing.

I think rebasing published repositories isn't encouraged.




Re: pl/python improvements

From
Andrew Dunstan
Date:

On 12/07/2010 04:50 PM, Peter Eisentraut wrote:
>
>> The code is on https://github.com/wulczer/postgres, in the plpython
>> branch. I'll be rebasing it regularly, so don't be surprised by commit
>> hashes changing.
> I think rebasing published repositories isn't encouraged.


Indeed. See <http://progit.org/book/ch3-6.html> for example, which says:
   *Do not rebase commits that you have pushed to a public repository.*

   If you follow that guideline, you’ll be fine. If you don’t, people
   will hate you, and you’ll be scorned by friends and family.


cheers

andrew



Re: pl/python improvements

From
Jan Urbański
Date:
On 07/12/10 23:00, Andrew Dunstan wrote:
> On 12/07/2010 04:50 PM, Peter Eisentraut wrote:
>>
>>> The code is on https://github.com/wulczer/postgres, in the plpython
>>> branch. I'll be rebasing it regularly, so don't be surprised by commit
>>> hashes changing.
>> I think rebasing published repositories isn't encouraged.
> 
> Indeed. See <http://progit.org/book/ch3-6.html> for example, which says:
> 
>    *Do not rebase commits that you have pushed to a public repository.*
> 
>    If you follow that guideline, you’ll be fine. If you don’t, people
>    will hate you, and you’ll be scorned by friends and family.

If someone is interested in merging from that branch, I can refrain from
rebasing (just tell me). But I'm really using it just as a patch queue
and pushing to github is a way of having an online backup for the changes,
so it's not really a "public repository". It's meant for people who want
to try out the code as it stands today, or see the changes that have
been made and don't care about having to rebase their checkout themselves.

I find developing that way much easier than merging from the master
branch. If I get conflicts I get them on a single commit of mine, not on
the merge commit, which makes them easier to resolve. See
https://github.com/diaspora/diaspora/wiki/Git-Workflow for an example of
that kind of workflow.

Peter suggested having a mail/patch per feature and the way I intend to
do that is instead of having a dozen branches, have one and after I'm
done rebase it interactively to produce incremental patches that apply
to master, each one implementing one feature.

Cheers,
Jan


Re: pl/python improvements

From
Peter Eisentraut
Date:
On tis, 2010-12-07 at 23:56 +0100, Jan Urbański wrote:
> Peter suggested having a mail/patch per feature and the way I intend
> to do that is instead of having a dozen branches, have one and after
> I'm done rebase it interactively to produce incremental patches that
> apply to master, each one implementing one feature.

Fair enough if you want to do it that way, but I'd encourage you to just
send in any self-contained features/changes that you have finished.



Re: pl/python improvements

From
Jan Urbański
Date:
On 08/12/10 22:41, Peter Eisentraut wrote:
> On tis, 2010-12-07 at 23:56 +0100, Jan Urbański wrote:
>> Peter suggested having a mail/patch per feature and the way I intend
>> to do that is instead of having a dozen branches, have one and after
>> I'm done rebase it interactively to produce incremental patches that
>> apply to master, each one implementing one feature.
> 
> Fair enough if you want to do it that way, but I'd encourage you to just
> send in any self-contained features/changes that you have finished.

I finished work on the PL/Python improvements. There are 9 separate patches,
with some interdependencies.

The features implemented are:
 * general refactoring, dropping the global error state, using dynahash
instead of PyDicts for procedure caching, etc.
 * a validator function
 * executing SPI calls in subtransactions
 * starting explicit subtransactions to have atomic behaviour of
multiple SPI calls
 * providing custom exceptions for SPI errors, so you can catch only
UniqueViolations and not have to muck around with SQLCODE
 * invalidate functions accepting composite types if the type changes
after the function has been defined
 * traceback support
 * table functions (ability to return RECORD and SETOF RECORD; see the
sketch after this list)
 * using custom parsers for datatypes + an example application with
dict<->hstore parsing
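
As a rough sketch of the table function item above (a hypothetical
example; the exact calling convention is illustrative, assuming returned
dicts map to record columns by name):

CREATE FUNCTION employees() RETURNS SETOF record AS $$
return [{"name": "John", "salary": 100},
        {"name": "Mary", "salary": 200}]
$$ LANGUAGE plpythonu;

SELECT * FROM employees() AS t(name text, salary integer);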

The dependencies are:

refactor -> validator
         -> SPI in subxacts
            -> explicit subxacts
               -> custom exceptions for SPI
         -> invalidate composites
         -> tracebacks
         -> table functions
         -> custom parsers

so everything depends on the refactoring patch, and the SPI changes are
incremental.

I will generate these patches, send them in separate threads, and add
them to the commitfest app before Christmas (I hope). Meanwhile the code
is available at https://github.com/wulczer/postgres. You will find 10
branches there, 9 correspond to these features, and the "plpython"
branch is the sum of them all.

From now on I will stop rebasing these branches, and instead just commit
fixes to the appropriate branch and merge upwards (from refactor to
everything, from spi-in-subxacts to explicit-subxacts and to
custom-spi-exceptions, etc). I'll also try to independently maintain the
plpython branch with the sum to make sure nothing has been left out as
the feature branches get merged into mainline.

Here's to hoping that this will not turn into a conflict hell.

Cheers,
Jan


Re: pl/python improvements

From
Marti Raudsepp
Date:
On Thu, Dec 23, 2010 at 04:08, Jan Urbański <wulczer@wulczer.org> wrote:
>  * providing custom exceptions for SPI errors, so you can catch only
> UniqueViolations and not have to muck around with SQLCODE

py-postgresql already has a mapping from error codes to Python
exceptions. I think it makes sense to re-use that, instead of
inventing new names.
https://github.com/jwp/py-postgresql/blob/v1.1/postgresql/exceptions.py

It also follows the Python convention of ending exception class names
with "Error": instead of UniqueViolation they have UniqueError, and
instead of InvalidTextRepresentation they have TextRepresentationError.

> Meanwhile the code
> is available at https://github.com/wulczer/postgres. You will find 10
> branches there, 9 correspond to these features, and the "plpython"
> branch is the sum of them all.

I tried building the plpython branch, but got an unrelated error. I
didn't investigate further for now...

make[3]: Entering directory
`/home/marti/src/postgresql-py/src/backend/bootstrap'
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing
-fwrapv -I. -I. -I../../../src/include -D_GNU_SOURCE   -c -o
bootparse.o bootparse.c
bootparse.y: In function ‘boot_yyparse’:
bootparse.y:224:16: error: too few arguments to function ‘heap_create’
../../../src/include/catalog/heap.h:37:17: note: declared here
bootparse.y:249:16: error: too few arguments to function
‘heap_create_with_catalog’
../../../src/include/catalog/heap.h:48:12: note: declared here
make[3]: *** [bootparse.o] Error 1


Regards,
Marti


Re: pl/python improvements

From
Jan Urbański
Date:
On 23/12/10 12:16, Marti Raudsepp wrote:
> On Thu, Dec 23, 2010 at 04:08, Jan Urbański <wulczer@wulczer.org> wrote:
>>  * providing custom exceptions for SPI errors, so you can catch only
>> UniqueViolations and not have to muck around with SQLCODE
> 
> py-postgresql already has a mapping from error codes to Python
> exceptions. I think it makes sense to re-use that, instead of
> inventing new names.
> https://github.com/jwp/py-postgresql/blob/v1.1/postgresql/exceptions.py
> 
> It also follows the Python convention of ending exception classes with
> "Error", so instead of UniqueViolation they have UniqueError, instead
> of InvalidTextRepresentation, they have TextRepresentationError

Oh, didn't know that. I see that it does some more fancy things, like
defining an inheritance hierarchy for these exceptions and adding some
more into the mix.

The names I used are not really invented, they're just plpgsql condition
names from
http://www.postgresql.org/docs/current/static/errcodes-appendix.html
with underscores changed to camel case. Also, since they're
autogenerated from utils/errcodes.h they don't have any hierarchy, they
just all inherit from SPIError.

Sticking "Error" to every one of them will result in things like
SubstringErrorError, so I'm not really sold on that. Basically I think
more PL/Python users will be familiar with condition names as you use
them in pl/pgsql than with the names from py-postgresql.
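
To make that concrete, catching one of the autogenerated exceptions
would look roughly like this (assuming the classes end up accessible as
plpy attributes; where exactly they live is not settled here, and the
names table is made up):

CREATE FUNCTION insert_name(name text) RETURNS boolean AS $$
plan = plpy.prepare("INSERT INTO names VALUES ($1)", ["text"])
try:
    plpy.execute(plan, [name])
    return True
except plpy.UniqueViolation:
    return False   # duplicate key, caught without checking SQLCODE by hand
$$ LANGUAGE plpythonu;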

>> Meanwhile the code
>> is available at https://github.com/wulczer/postgres. You will find 10
>> branches there, 9 correspond to these features, and the "plpython"
>> branch is the sum of them all.
> 
> I tried building the plpython branch, but got an unrelated error. I
> didn't investigate further for now...
> 
> make[3]: Entering directory
> `/home/marti/src/postgresql-py/src/backend/bootstrap'
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing
> -fwrapv -I. -I. -I../../../src/include -D_GNU_SOURCE   -c -o
> bootparse.o bootparse.c
> bootparse.y: In function ‘boot_yyparse’:
> bootparse.y:224:16: error: too few arguments to function ‘heap_create’
> ../../../src/include/catalog/heap.h:37:17: note: declared here
> bootparse.y:249:16: error: too few arguments to function
> ‘heap_create_with_catalog’
> ../../../src/include/catalog/heap.h:48:12: note: declared here
> make[3]: *** [bootparse.o] Error 1

I'm pretty sure it'll go away if you do make maintainer-clean or git
clean -dfx after switching to the plpython branch.

Cheers,
Jan


Re: pl/python improvements

From
James William Pye
Date:
On Dec 23, 2010, at 3:38 AM, Jan Urbański wrote:
> Oh, didn't know that. I see that it does some more fancy things, like
> defining an inheritance hierarchy for these exceptions and adding some
> more into the mix.

Right, there were some cases that appeared to benefit from larger
buckets than what the existing code classes provided. Also, some of the
exceptions in there are strictly for py-postgresql/client-side things.

> The names I used are not really invented, they're just plpgsql condition
> names from
> http://www.postgresql.org/docs/current/static/errcodes-appendix.html
> with underscores changed to camel case. Also, since they're
> autogenerated from utils/errcodes.h they don't have any hierarchy, they
> just all inherit from SPIError.

For the backend setting, I think this is quite appropriate.

However, for pg-python, I had mixed feelings about this as I wanted to
be able to leverage py-postgresql's hierarchy, but still have the projects
independent. I ended up punting on this one by using a single error class,
and forcing the user to compare the codes. =(

> Sticking "Error" to every one of them will result in things like
> SubstringErrorError, so I'm not really sold on that.

There was some creativity applied to the names in postgresql.exceptions
to accommodate things like that (like avoiding a redundant "Error").

> Basically I think
> more PL/Python users will be familiar with condition names as you use
> them in pl/pgsql than with the names from py-postgresql.

I think that's a fair assumption. In fact, that might make a good
TODO for py-postgresql/pg-python: provide a mapping from plpgsql
condition names to exception classes.

cheers, jwp