Thread: plpython SPI cursors
Hi, attached is a patch implementing the usage of SPI cursors in PL/Python. Currently when trying to process a large table in PL/Python you have slurp it all into memory (that's what plpy.execute does). This patch allows reading the result set in smaller chunks, using a SPI cursor behind the scenes. Example usage: cursor = plpy.cursor("select a, b from hugetable") for row in cursor: plpy.info("a is %s and b is %s" % (row['a'], row['b'])) The patch itself is simple, but there's a lot of boilerplate dedicated to opening a subtransaction and handling prepared plans. I'd like to do some refactoring of they way PL/Python uses SPI to reduce the amount of boilerplate needed, but that'll come as a separate patch (just before the patch to split plpython.c in smaller chunks). This feature has been sponsored by Nomao. Cheers, Jan PS: I already added it to the November CF. J
Attachment
On 11-10-15 07:28 PM, Jan Urbański wrote: > Hi, > > attached is a patch implementing the usage of SPI cursors in PL/Python. > Currently when trying to process a large table in PL/Python you have > slurp it all into memory (that's what plpy.execute does). > > J I found a few bugs (see my testing section below) that will need fixing + a few questions about the code Overview & Feature Review ----------- This patch adds cursor support to plpython. SPI cursors will allow a plpython function to read process a large results set without having to read it all into memory at once. This is a good thing. Without this patch I think you could accomplish the same with using SQL DECLARE CURSOR and SQL fetch. This feature allows you to use a python cursor as an iterator resulting in much cleaner python code than the SQL FETCH approach. I think the feature is worth having Usability Review ---------------------- The patch adds the user methods cursor=plpy.cursor(query_or_plan) cursor.fetch(100) cursor.close() Do we like the name plpy.cursor or would we rather call it something like plpy.execute_cursor(...) or plpy.cursor_open(...) or plpy.create_cursor(...) Since we will be mostly stuck with the API once we release 9.2 this is worth some opinions on. I like cursor() but if anyone disagrees now is the time. This patch does not provide a wrapper around SPI_cursor_move. The patch is useful without that and I don't see anything that preculdes someone else adding that later if they see a need. Documentation Review --------------------- The patch includes documentation updates that describes the new feature. The Database Access page doesn't provide a API style list of database access functions like the plperl http://www.postgresql.org/docs/9.1/interactive/plperl-builtins.html page does. I think the organization of the perl page is clearer than the python one and we should think about a doing some documentaiton refactoring. That should be done as a seperate patch and shouldn't be a barrier to committing this one. Code Review ---------------- in PLy_cursor_plan line 4080 + PG_TRY(); + { + Portal portal; + char *volatile nulls; + volatile int j; + + if (nargs > 0) + nulls = palloc(nargs * sizeof(char)); + else + nulls = NULL; + + for (j = 0; j < nargs; j++) + { + PyObject *elem; I am probably not seeing a code path or misunderstanding something about the setjmp/longjump usages but I don't see why nulls and j need to be volatile here. line 444 PLy_cursor(PyObject *self, PyObject *args) + { + char *query; + PyObject *plan; + PyObject *planargs = NULL; + + if (PyArg_ParseTuple(args, "s", &query)) + return PLy_cursor_query(query); + Should query be freed with PyMem_free() Testing ----------- I tested both python 2.6 and 3 on a Linux system create or replace function x() returns text as $$ cur=None try: with plpy.subtransaction(): cur=plpy.cursor('select generate_series(1,1000)') rows=cur.fetch(10); plpy.execute('selectf()') except plpy.SPIError: rows=cur.fetch(10); return rows[0]['generate_series'] return 'none' $$ language plpythonu; select x(); crashes the backend test=# select x(); The connection to the server was lost. Attempting reset: LOG: server process (PID 3166) was terminated by signal 11: Segmentation fault The below test gives me a strange error message: create or replace function x1() returns text as $$ plan=None try: with plpy.subtransaction(): plpy.execute('CREATE TEMP TABLE z AS select generate_series(1,1000)') plan=plpy.prepare('select* FROM z') plpy.execute('select * FROM does_not_exist') except plpy.SPIError, e: cur=plpy.cursor(plan) rows=cur.fetch(10) return rows[0]['generate_series'] return '1' $$ language plpythonu; select x1(); test=# select x1() test-# ; ERROR: TypeError: Expected sequence of 82187072 arguments, got 0: <NULL> CONTEXT: Traceback (most recent call last): PL/Python function "x1", line 9, in <module> cur=plpy.cursor(plan) PL/Python function "x1" STATEMENT: select x1() I was expecting an error from the function just a bit more useful one. Performance ------------------- I did not do any specific performance testing but I don't see this patch as having any impact to performance
On 20/11/11 19:14, Steve Singer wrote: > On 11-10-15 07:28 PM, Jan Urbański wrote: >> Hi, >> >> attached is a patch implementing the usage of SPI cursors in PL/Python. > I found a few bugs (see my testing section below) that will need fixing > + a few questions about the code Hi Steve, thanks a lot for the review, I'll investigate the errors you were getting and post a follow-up. Good catch on trying cursors with explicit subtransactions, I didn't think about how they would interact. Cheers, Jan
On 20/11/11 19:14, Steve Singer wrote: > On 11-10-15 07:28 PM, Jan Urbański wrote: >> Hi, >> >> attached is a patch implementing the usage of SPI cursors in PL/Python. >> Currently when trying to process a large table in PL/Python you have >> slurp it all into memory (that's what plpy.execute does). >> >> J > > I found a few bugs (see my testing section below) that will need fixing > + a few questions about the code Responding now to all questions and attaching a revised patch based on your comments. > Do we like the name plpy.cursor or would we rather call it something like > plpy.execute_cursor(...) or plpy.cursor_open(...) or > plpy.create_cursor(...) > Since we will be mostly stuck with the API once we release 9.2 this is > worth > some opinions on. I like cursor() but if anyone disagrees now is the time. We use plpy.subtransaction() to create Subxact objects, so I though plpy.cursor() would be most appropriate. > This patch does not provide a wrapper around SPI_cursor_move. The patch > is useful without that and I don't see anything that preculdes someone else > adding that later if they see a need. My idea is to add keyword arguments to plpy.cursor() that will allow you to decide whether you want a scrollable cursor and after that provide a move() method. > The patch includes documentation updates that describes the new feature. > The Database Access page doesn't provide a API style list of database > access > functions like the plperl > http://www.postgresql.org/docs/9.1/interactive/plperl-builtins.html page > does. I think the organization of the perl page is > clearer than the python one and we should think about a doing some > documentaiton refactoring. That should be done as a seperate patch and > shouldn't be a barrier to committing this one. Yeah, the PL/Python docs are a bit chaotic right now. I haven't yet summoned force to overhaul them. > in PLy_cursor_plan line 4080 > + PG_TRY(); > + { > + Portal portal; > + char *volatile nulls; > + volatile int j; > I am probably not seeing a code path or misunderstanding something > about the setjmp/longjump usages but I don't see why nulls and j need to be > volatile here. It looked like you could drop volatile there (and in PLy_spi_execute_plan, where this is copied from (did I mention there's quite some code duplication in PL/Python?)) but digging in git I found this commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2789b7278c11785750dd9d2837856510ffc67000 that added the original volatile qualification, so I guess there's a reason. > line 444 > PLy_cursor(PyObject *self, PyObject *args) > + { > + char *query; > + PyObject *plan; > + PyObject *planargs = NULL; > + > + if (PyArg_ParseTuple(args, "s", &query)) > + return PLy_cursor_query(query); > + > > Should query be freed with PyMem_free() No, PyArg_ParseTuple returns a string on the stack, I check that repeatedly creating a cursor with a plan argument does not leak memory and that adding PyMem_Free there promptly leads to a segfault. > I tested both python 2.6 and 3 on a Linux system > > [test cases demonstrating bugs] Turns out it's a really bad idea to store pointers to Portal structures, because they get invalidated by the subtransaction abort hooks. I switched to storing the cursor name and looking it up in the appropriate hash table every time it's used. The examples you sent (which I included as regression tests) now cause a ValueError to be raised with a message stating that the cursor has been created in an aborted subtransaction. Not sure about the wording of the error message, though. Thanks again for the review! Cheers, Jan
Attachment
On 11-11-23 01:58 PM, Jan Urbański wrote: <blockquote cite="mid:4ECD426F.5060702@wulczer.org" type="cite">On 20/11/11 19:14,Steve Singer wrote: <br /><blockquote type="cite">On 11-10-15 07:28 PM, Jan Urbański wrote: <br /><blockquote type="cite">Hi,<br /><br /> attached is a patch implementing the usage of SPI cursors in PL/Python. <br /> Currently whentrying to process a large table in PL/Python you have <br /> slurp it all into memory (that's what plpy.execute does).<br /><br /> J <br /></blockquote><br /> I found a few bugs (see my testing section below) that will need fixing <br/> + a few questions about the code <br /></blockquote><br /> Responding now to all questions and attaching a revisedpatch based on your comments. <br /><br /></blockquote><br /> I've looked over the revised version of the patch andit now seems fine.<br /><br /> Ready for committer.<br /><br /><br /><br /><blockquote cite="mid:4ECD426F.5060702@wulczer.org"type="cite"><blockquote type="cite">Do we like the name plpy.cursor or would we rathercall it something like <br /> plpy.execute_cursor(...) or plpy.cursor_open(...) or <br /> plpy.create_cursor(...) <br/> Since we will be mostly stuck with the API once we release 9.2 this is <br /> worth <br /> some opinions on. I likecursor() but if anyone disagrees now is the time. <br /></blockquote><br /> We use plpy.subtransaction() to create Subxactobjects, so I though plpy.cursor() would be most appropriate. <br /><br /><blockquote type="cite">This patch doesnot provide a wrapper around SPI_cursor_move. The patch <br /> is useful without that and I don't see anything that preculdessomeone else <br /> adding that later if they see a need. <br /></blockquote><br /> My idea is to add keyword argumentsto plpy.cursor() that will allow you to decide whether you want a scrollable cursor and after that provide a move()method. <br /><br /><blockquote type="cite">The patch includes documentation updates that describes the new feature.<br /> The Database Access page doesn't provide a API style list of database <br /> access <br /> functions likethe plperl <br /><a class="moz-txt-link-freetext" href="http://www.postgresql.org/docs/9.1/interactive/plperl-builtins.html">http://www.postgresql.org/docs/9.1/interactive/plperl-builtins.html</a> page<br /> does. I think the organization of the perl page is <br /> clearer than the python one and we should think abouta doing some <br /> documentaiton refactoring. That should be done as a seperate patch and <br /> shouldn't be a barrierto committing this one. <br /></blockquote><br /> Yeah, the PL/Python docs are a bit chaotic right now. I haven'tyet summoned force to overhaul them. <br /><br /><blockquote type="cite">in PLy_cursor_plan line 4080 <br /> + PG_TRY();<br /> + { <br /> + Portal portal; <br /> + char *volatile nulls; <br /> + volatile int j; <br /></blockquote><br/><blockquote type="cite">I am probably not seeing a code path or misunderstanding something <br /> aboutthe setjmp/longjump usages but I don't see why nulls and j need to be <br /> volatile here. <br /></blockquote><br />It looked like you could drop volatile there (and in PLy_spi_execute_plan, where this is copied from (did I mention there'squite some code duplication in PL/Python?)) but digging in git I found this commit: <br /><br /><a class="moz-txt-link-freetext" href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2789b7278c11785750dd9d2837856510ffc67000">http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2789b7278c11785750dd9d2837856510ffc67000</a><br /><br/> that added the original volatile qualification, so I guess there's a reason. <br /><br /><blockquote type="cite">line444 <br /> PLy_cursor(PyObject *self, PyObject *args) <br /> + { <br /> + char *query; <br /> + PyObject*plan; <br /> + PyObject *planargs = NULL; <br /> + <br /> + if (PyArg_ParseTuple(args, "s", &query)) <br />+ return PLy_cursor_query(query); <br /> + <br /><br /> Should query be freed with PyMem_free() <br /></blockquote><br/> No, PyArg_ParseTuple returns a string on the stack, I check that repeatedly creating a cursor with aplan argument does not leak memory and that adding PyMem_Free there promptly leads to a segfault. <br /><br /><br /><blockquotetype="cite">I tested both python 2.6 and 3 on a Linux system <br /><br /> [test cases demonstrating bugs] <br/></blockquote><br /> Turns out it's a really bad idea to store pointers to Portal structures, because they get invalidatedby the subtransaction abort hooks. <br /><br /> I switched to storing the cursor name and looking it up in theappropriate hash table every time it's used. The examples you sent (which I included as regression tests) now cause aValueError to be raised with a message stating that the cursor has been created in an aborted subtransaction. <br /><br/> Not sure about the wording of the error message, though. <br /><br /> Thanks again for the review! <br /><br /> Cheers,<br /> Jan <br /><pre wrap=""> <fieldset class="mimeAttachmentHeader"></fieldset> </pre></blockquote><br />
On lör, 2011-11-26 at 11:21 -0500, Steve Singer wrote: > I've looked over the revised version of the patch and it now seems > fine. > > Ready for committer. I can take it from here.
On ons, 2011-11-23 at 19:58 +0100, Jan Urbański wrote: > On 20/11/11 19:14, Steve Singer wrote: > > On 11-10-15 07:28 PM, Jan Urbański wrote: > >> Hi, > >> > >> attached is a patch implementing the usage of SPI cursors in PL/Python. > >> Currently when trying to process a large table in PL/Python you have > >> slurp it all into memory (that's what plpy.execute does). > >> > >> J > > > > I found a few bugs (see my testing section below) that will need fixing > > + a few questions about the code > > Responding now to all questions and attaching a revised patch based on > your comments. Committed Please refresh the other patch.
On 05/12/11 18:58, Peter Eisentraut wrote: > On ons, 2011-11-23 at 19:58 +0100, Jan Urbański wrote: >> On 20/11/11 19:14, Steve Singer wrote: >> Responding now to all questions and attaching a revised patch based on >> your comments. > > Committed > > Please refresh the other patch. Great, thanks! I'll try to send an updated version of the other patch this evening. Cheers, Jan
Jan Urbański wrote: > On 05/12/11 18:58, Peter Eisentraut wrote: > > On ons, 2011-11-23 at 19:58 +0100, Jan Urba?ski wrote: > >> On 20/11/11 19:14, Steve Singer wrote: > >> Responding now to all questions and attaching a revised patch based on > >> your comments. > > > > Committed > > > > Please refresh the other patch. > > Great, thanks! > > I'll try to send an updated version of the other patch this evening. I assume this is _not_ related to this TODO item: Add a DB-API compliant interface on top of the SPI interface -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 05/12/11 19:12, Bruce Momjian wrote: > Jan Urbański wrote: >> On 05/12/11 18:58, Peter Eisentraut wrote: >>> On ons, 2011-11-23 at 19:58 +0100, Jan Urba?ski wrote: >>>> On 20/11/11 19:14, Steve Singer wrote: >>>> Responding now to all questions and attaching a revised patch based on >>>> your comments. >>> >>> Committed >>> >>> Please refresh the other patch. >> >> Great, thanks! >> >> I'll try to send an updated version of the other patch this evening. > > I assume this is _not_ related to this TODO item: > > Add a DB-API compliant interface on top of the SPI interface No, not related.
On mån, 2011-12-05 at 13:12 -0500, Bruce Momjian wrote: > Jan Urbański wrote: > > On 05/12/11 18:58, Peter Eisentraut wrote: > > > On ons, 2011-11-23 at 19:58 +0100, Jan Urba?ski wrote: > > >> On 20/11/11 19:14, Steve Singer wrote: > > >> Responding now to all questions and attaching a revised patch based on > > >> your comments. > > > > > > Committed > > > > > > Please refresh the other patch. > > > > Great, thanks! > > > > I'll try to send an updated version of the other patch this evening. > > I assume this is _not_ related to this TODO item: > > Add a DB-API compliant interface on top of the SPI interface No, but this is: http://petereisentraut.blogspot.com/2011/11/plpydbapi-db-api-for-plpython.html