Thread: WIP patch: convert SQL-language functions to return tuplestores
We have an open TODO item to support SQL-language functions that return the output of a RETURNING clause attached to an INSERT/UPDATE/DELETE query within the function. This is something that was left undone in the 8.2 development cycle after this thread analyzed the problem:
http://archives.postgresql.org/pgsql-hackers/2006-10/msg00665.php
The basic conclusion was that the only sane way to do it is to have the SQL function execute the DML command to completion and then return the emitted rows in a tuplestore. Which is fine, but it would be pretty messy to do unless we make set-returning SQL functions return tuplestores all the time, and there was worry that that might lose performance compared to the existing value-per-call protocol.

Attached is a draft patch that converts set-returning SQL functions to return tuplestores. It's pretty incomplete --- it doesn't actually add the RETURNING feature, and there's a lot of ugly stuff to clean up --- but it passes regression tests and it's close enough for performance testing. What I find is that the performance hit doesn't seem too bad. The test case I'm using looks like this:

	regression=# create function foo(int) returns setof int as
	'select generate_series(1,$1)' language sql;
	CREATE FUNCTION
	regression=# select count(*) from (select foo(NNN)) ss;

This example is chosen with malice aforethought to stress the tuplestore performance as much as possible. The internal generate_series() call is about as cheap a set-generating function as possible, and it returns through the value-per-call mechanism, so there is no added tuplestore in the way. In the outer query we again avoid the tuplestore that would be created by nodeFunctionscan.c, and we use an upper count(*) to avoid shipping all the rows to the client. I should note also that the function is intentionally declared VOLATILE to prevent its being inlined into the calling query.
What I find on my development machine is that CVS HEAD processes this query at about 1.33 microsec/row. With the attached patch, the speed is about 1.0 usec/row if the tuplestore stays within work_mem; about 1.3 usec/row if it spills "to disk" but doesn't overflow available kernel disk cache; and about 1.56 usec/row in cases considerably larger than available RAM, when we actually have to write the data to disk and read it back. This is on my development workstation, which is a dual 2.8GHz Xeon EM64T with your typical junk consumer-grade single ATA disk drive, running Fedora 9. (BTW, the test seems to be mostly CPU-bound even when spilling to disk.)

So I'm concluding that we can easily afford to switch to tuplestore-always operation, especially if we are willing to put any effort into tuplestore optimization. (I note that the current tuplestore code writes 24 bytes per row for this example, which is a shade on the high side for only 4 bytes payload. It looks like it would be pretty easy to knock 10 bytes off that for a 40% savings in I/O volume.)

I'm putting up this patch mostly so that anyone who's worried about the performance issue can do their own tests. It's definitely not meant for style or completeness critiques ;-)

BTW, the patch also removes the existing limitation of not being able to call set-returning plpgsql functions in a SELECT targetlist...

regards, tom lane
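[Editor's note: the two protocols being compared can be sketched in Python — a hypothetical illustration, not PostgreSQL's C implementation. A generator models the existing value-per-call protocol; building the full list first models tuplestore-always operation.]

```python
# Hypothetical sketch of the two set-returning protocols under discussion.

def value_per_call(n):
    """Value-per-call: hand back one row per invocation, like generate_series."""
    for i in range(1, n + 1):
        yield i

def tuplestore(n):
    """Tuplestore-always: run the function to completion, then return all rows."""
    return list(value_per_call(n))

# The caller sees the same rows either way; only the intermediate
# buffering (and hence memory/disk traffic) differs.
assert list(value_per_call(5)) == [1, 2, 3, 4, 5]
assert tuplestore(5) == [1, 2, 3, 4, 5]
```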
Re: WIP patch: convert SQL-language functions to return tuplestores
From: Martijn van Oosterhout
On Sun, Oct 26, 2008 at 09:49:49PM -0400, Tom Lane wrote:
> So I'm concluding that we can easily afford to switch to tuplestore-always
> operation, especially if we are willing to put any effort into tuplestore
> optimization. (I note that the current tuplestore code writes 24 bytes
> per row for this example, which is a shade on the high side for only 4 bytes
> payload. It looks like it would be pretty easy to knock 10 bytes off that
> for a 40% savings in I/O volume.)

I thought that the bad case for a tuplestore was if the set returning function was expensive and the user used it with a LIMIT clause. In the tuplestore case you evaluate everything then throw it away.

Your test cases, if you append LIMIT 1 to all of them, how do the timings compare then?

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
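[Editor's note: Martijn's LIMIT concern can be modeled with a hedged Python sketch — the function names and the cost counter are invented for illustration. A lazy producer computes one row for a LIMIT 1; a materialized one pays for every row before the limit applies.]

```python
calls = {"n": 0}  # counts how many rows were actually computed

def expensive_row(i):
    calls["n"] += 1
    return i * i

def lazy_srf(n):
    """Value-per-call style: rows are computed only as the caller pulls them."""
    for i in range(n):
        yield expensive_row(i)

def materialized_srf(n):
    """Tuplestore style: every row is computed before the first is returned."""
    return list(lazy_srf(n))

# "LIMIT 1" against the lazy version computes a single row...
calls["n"] = 0
next(iter(lazy_srf(1000)))
assert calls["n"] == 1

# ...but against the materialized version it computes all 1000 first.
calls["n"] = 0
next(iter(materialized_srf(1000)))
assert calls["n"] == 1000
```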
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sun, Oct 26, 2008 at 09:49:49PM -0400, Tom Lane wrote:
>> So I'm concluding that we can easily afford to switch to tuplestore-always
>> operation, especially if we are willing to put any effort into tuplestore
>> optimization.

> I thought that the bad case for a tuplestore was if the set returning
> function was expensive and the user used it with a LIMIT clause. In the
> tuplestore case you evaluate everything then throw it away.

I'm not terribly excited by that example --- but in any case, the real solution to any problem that involves communication between function and calling query is to make sure that the function can get inlined into the query. That was an option we didn't have back in 8.2; but it's there now. My test case deliberately disables that optimization ...

regards, tom lane
>> I thought that the bad case for a tuplestore was if the set returning
>> function was expensive and the user used it with a LIMIT clause. In the
>> tuplestore case you evaluate everything then throw it away.
>
> I'm not terribly excited by that example --- but in any case, the real
> solution to any problem that involves communication between function and
> calling query is to make sure that the function can get inlined into the
> query. That was an option we didn't have back in 8.2; but it's there
> now. My test case deliberately disables that optimization ...

I'm pretty excited by that example. LIMIT/OFFSET is really useful as a way of paginating query results for display on a web page (show results 1-100, 101-200, etc), and I use it on potentially expensive SRFs just as I do on tables and views.

I agree that inlining is a better solution when it's possible, but who is to say that's always the case? Of course if it's PL/pgsql with RETURN QUERY the way forward is fairly obvious, but what if it isn't that simple?

...Robert
"Robert Haas" <robertmhaas@gmail.com> writes:
>>> I thought that the bad case for a tuplestore was if the set returning
>>> function was expensive and the user used it with a LIMIT clause. In the
>>> tuplestore case you evaluate everything then throw it away.
>>
>> I'm not terribly excited by that example --- but in any case, the real
>> solution to any problem that involves communication between function and
>> calling query is to make sure that the function can get inlined into the
>> query. That was an option we didn't have back in 8.2; but it's there
>> now. My test case deliberately disables that optimization ...

> I'm pretty excited by that example. LIMIT/OFFSET is really useful as
> a way of paginating query results for display on a web page (show
> results 1-100, 101-200, etc), and I use it on potentially expensive
> SRFs just as I do on tables and views.

I suspect it doesn't help you as much as you think. It's always been the case that SRFs in FROM-items are fed through a tuplestore, and so are plpgsql SRF results. The only place where you could win with an outside-the-function LIMIT in existing releases is if (1) it's a SQL-language function and (2) you call it in the SELECT targetlist, ie

	SELECT foo(...) LIMIT n;

It seems to me that if you have a situation where you are really depending on the performance of such a construct, you could push the LIMIT into the function:

	CREATE FUNCTION foo(..., n bigint) RETURNS SETOF whatever AS
	$$ SELECT ... LIMIT $something $$ LANGUAGE sql;

This would likely actually give you *better* performance, since the plan for the function's SELECT would be generated with awareness that it was going to be LIMIT'ed.

So my feeling is that people are obsessing about a corner case and losing sight of the fact that this patch appears to be a performance boost in more-typical cases ... not to mention the new features it enables.

regards, tom lane
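[Editor's note: the pushed-down LIMIT can be modeled the same way — a hypothetical Python sketch with an invented cost counter. The limited variant only ever computes the rows it was asked for, and a real planner would additionally plan the inner query knowing the limit.]

```python
evals = {"n": 0}

def expensive(i):
    evals["n"] += 1
    return i * i

def foo():
    """SRF with no limit knob: always computes every row."""
    return [expensive(i) for i in range(1000)]

def foo_limited(n):
    """Same SRF with the LIMIT pushed inside as a parameter."""
    return [expensive(i) for i in range(n)]

evals["n"] = 0
rows = foo()[:100]          # outside-the-function LIMIT: 1000 evaluations
assert evals["n"] == 1000

evals["n"] = 0
rows = foo_limited(100)     # pushed-down LIMIT: only 100 evaluations
assert evals["n"] == 100
```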
Tom Lane <tgl@sss.pgh.pa.us> writes:
> "Robert Haas" <robertmhaas@gmail.com> writes:
>> I'm pretty excited by that example. LIMIT/OFFSET is really useful as
>> a way of paginating query results for display on a web page (show
>> results 1-100, 101-200, etc), and I use it on potentially expensive
>> SRFs just as I do on tables and views.
>
> I suspect it doesn't help you as much as you think. It's always been
> the case that SRFs in FROM-items are fed through a tuplestore, and so
> are plpgsql SRF results.

I always thought we considered that a bug though. It sure would be nice if we could generate results as needed instead of having to generate them in advance and store all of them.

In particular I fear there are a lot of places that use functions where we might expect them to use views. They're never going to get really good plans but it would be nice if we could at least avoid the extra materialize steps.

Now your patch isn't affecting that one way or the other but does it rule it out forever?

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!
On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote:
> So I'm concluding that we can easily afford to switch to
> tuplestore-always operation, especially if we are willing to put any
> effort into tuplestore optimization. (I note that the current
> tuplestore code writes 24 bytes per row for this example, which is a
> shade on the high side for only 4 bytes payload. It looks like it
> would be pretty easy to knock 10 bytes off that for a 40% savings in
> I/O volume.)

That seems like an important, possibly more important, change.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I suspect it doesn't help you as much as you think. It's always been
>> the case that SRFs in FROM-items are fed through a tuplestore, and so
>> are plpgsql SRF results.

> I always thought we considered that a bug though. It sure would be nice if we
> could generate results as needed instead of having to generate them in advance
> and store all of them.

I suppose, but short of a fundamental rethink of how PL functions work that's not going to happen. There's also the whole issue of when do side-effects happen (such as before/after statement triggers).

> In particular I fear there are a lot of places that use functions where we
> might expect them to use views. They're never going to get really good plans
> but it would be nice if we could at least avoid the extra materialize steps.

Agreed, but I think the fundamental solution there, for simple-select functions, is inlining.

> Now your patch isn't affecting that one way or the other but does it rule it
> out forever?

I think the PL side of the problem is the hard part --- if we knew how to solve these issues for plpgsql then SQL functions would surely be easy.

regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote:
>> So I'm concluding that we can easily afford to switch to
>> tuplestore-always operation, especially if we are willing to put any
>> effort into tuplestore optimization. (I note that the current
>> tuplestore code writes 24 bytes per row for this example, which is a
>> shade on the high side for only 4 bytes payload. It looks like it
>> would be pretty easy to knock 10 bytes off that for a 40% savings in
>> I/O volume.)

> That seems like an important, possibly more important, change.

Yeah, seeing that both WITH and window functions will be stressing tuplestore performance, anything we can save there is probably worth the trouble.

regards, tom lane
On Tue, Oct 28, 2008 at 09:28:38AM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote:
>>> So I'm concluding that we can easily afford to switch to
>>> tuplestore-always operation, especially if we are willing to put any
>>> effort into tuplestore optimization. (I note that the current
>>> tuplestore code writes 24 bytes per row for this example, which is a
>>> shade on the high side for only 4 bytes payload. It looks like it
>>> would be pretty easy to knock 10 bytes off that for a 40% savings in
>>> I/O volume.)
>
>> That seems like an important, possibly more important, change.
>
> Yeah, seeing that both WITH and window functions will be stressing
> tuplestore performance, anything we can save there is probably worth the
> trouble.
>
> regards, tom lane

The pre-sort for index builds would also benefit from this change.

Ken
I wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote:
>>> ... effort into tuplestore optimization. (I note that the current
>>> tuplestore code writes 24 bytes per row for this example, which is a
>>> shade on the high side for only 4 bytes payload. It looks like it
>>> would be pretty easy to knock 10 bytes off that for a 40% savings in
>>> I/O volume.)

>> That seems like an important, possibly more important, change.

> Yeah, seeing that both WITH and window functions will be stressing
> tuplestore performance, anything we can save there is probably worth the
> trouble.

Six of the ten bytes I was speaking of are alignment padding, which can be removed with some relatively simple hacking in tuplestore.c (I think the same might apply in tuplesort.c BTW).

The other place I was looking at is that currently, a SRF's result tuplestore is always built with randomAccess = true, which causes the tuplestore to write a trailing length word on each tuple, to support the possibility of being asked to read backwards. Of course, 99% of the time it never will be asked to do so. So we ought to try to suppress that overhead. I can see two ways we might do this:

1. Stop treating nodeFunctionscan.c as supporting backwards scan. This would be an extremely localized change (about two lines ;-)) but the downside is that calling a function scan in a SCROLL CURSOR would result in an extra Materialize node getting stuck in front of the Functionscan node to support backwards scanning.

2. Propagate the Functionscan node's EXEC_FLAG_BACKWARD flag bit to the called function. I'd be inclined to do this by adding it as an additional bit in the rsinfo->allowedModes field.

In either case we would need cooperation from called SRFs to get the optimization to happen --- the tuplestore_begin_heap calls are not consolidated into one place (which maybe was a mistake) and they're all passing constant TRUE for randomAccess. That coding is safe but would need to change to reflect whichever policy we adopt. (Note: one advantage of approach #1 is that if anyone is mistakenly passing randomAccess = FALSE, it would become safe, which it isn't now.)

I'm of mixed mind about which way to go. I'm not sure that optimizing scroll cursors on functions is something worth worrying about. However, if we do #1 now then we are probably locked into that approach and could not change to #2 in the future --- we'd have to worry about breaking third-party SRFs that are passing randomAccess = FALSE.

Comments?

regards, tom lane
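[Editor's note: the trailing-length-word trade-off can be sketched in Python — a toy on-disk layout for illustration, not tuplestore.c's actual record format. Each tuple is written with a leading length word, and randomAccess additionally writes a trailing copy so a reader can step backwards.]

```python
import struct

def write_tuples(payloads, random_access):
    """Write each tuple as <length><payload>, plus a trailing <length>
    copy when random_access is requested (enabling backward reads)."""
    buf = bytearray()
    for p in payloads:
        buf += struct.pack("<I", len(p)) + p
        if random_access:
            buf += struct.pack("<I", len(p))
    return bytes(buf)

def read_last_backwards(buf):
    """Read the final tuple by consulting its trailing length word."""
    (tail_len,) = struct.unpack_from("<I", buf, len(buf) - 4)
    start = len(buf) - 4 - tail_len
    return buf[start:start + tail_len]

fwd_only   = write_tuples([b"abcd"] * 3, random_access=False)
scrollable = write_tuples([b"abcd"] * 3, random_access=True)

# The trailing word costs 4 bytes per tuple -- pure overhead for the
# 99% of callers that never scan backwards.
assert len(scrollable) - len(fwd_only) == 3 * 4
assert read_last_backwards(scrollable) == b"abcd"
```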
>> I always thought we considered that a bug though. It sure would be nice if we
>> could generate results as needed instead of having to generate them in advance
>> and store all of them.

> I suppose, but short of a fundamental rethink of how PL functions work
> that's not going to happen. There's also the whole issue of when do
> side-effects happen (such as before/after statement triggers).

For PL/pgsql, I think it might be possible to execute a function to precisely the point where you have generated a sufficient number of records. In other words, when someone asks for a tuple, you start executing the function until a tuple pops out, and then save the execution context until someone asks for another. Conceivably you can push LIMIT and WHERE clauses down into any RETURN QUERY statements executed, as well. Maybe that qualifies as a fundamental rethink, though, and we can worry about how to suppress the tuplestore in that case when and if someone is prepared to implement it. For other procedural languages, you would need support from the executor for that PL, which in most cases will probably be lacking.

<thinks a little more> In fact, I suspect that you would gain a lot by optimizing specifically for the case of a PL/pgsql function of the form: (1) execute 0 or more statements that may or may not have side effects but do not return any tuples, (2) execute exactly 1 RETURN QUERY statement, and (3) implicit or explicit RETURN. I suspect that's a very common usage pattern, and it wouldn't require being able to save the entire execution context at an arbitrary point.

(I agree that BEFORE/AFTER statement triggers are a problem here but I'm not sure that they are an insoluble one, and I'd hate for that to be the thing that kills this type of optimization. Even if you implemented a full-blown partial-execution model, it would be reasonable to always run any particular INSERT/UPDATE/DELETE to completion. It's really SELECT that is the problem.)

>> In particular I fear there are a lot of places that use functions where we
>> might expect them to use views. They're never going to get really good plans
>> but it would be nice if we could at least avoid the extra materialize steps.

> Agreed, but I think the fundamental solution there, for simple-select
> functions, is inlining.

+1. Upthread passing LIMIT and OFFSET clauses into the SRF as parameters was suggested, but that's really intractable for real-world use where you are also applying WHERE clauses to the SRF results.

...Robert
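[Editor's note: the specific function shape described above — side-effect statements, then exactly one RETURN QUERY — maps naturally onto a Python generator, which saves its execution context between rows for free. This is only an illustrative sketch of the proposed behavior, not anything PostgreSQL does today.]

```python
def report():
    # (1) zero or more statements with side effects but no result rows
    audit = ["report started"]
    # (2) exactly one RETURN QUERY, streamed a row at a time
    for row in range(100):
        yield row          # execution suspends here between rows
    # (3) implicit RETURN when the loop ends

g = report()
assert next(g) == 0   # runs until the first row pops out, then freezes
assert next(g) == 1   # resumes exactly where it left off; rows 2..99
                      # are never computed unless the caller asks
```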
Hi,

In the python language, functions that lazily return collections are called generators and use the yield keyword instead of return.
http://www.python.org/doc/2.5.2/tut/node11.html#SECTION00111000000000000000000

Maybe having such a concept in PostgreSQL would allow the user to choose between current behavior (materializing) and lazy computing, with a new internal API to get done in the executor maybe.

	CREATE FUNCTION mygenerator()
	  returns setof integer
	  language PLPGSQL
	AS $f$
	BEGIN
	  FOR v_foo IN SELECT foo FROM table LOOP
	    YIELD my_expensive_function(v_foo);
	  END LOOP;
	  RETURN;
	END;
	$f$;

At the plain SQL level, we could expose this with a new function parameter, GENERATOR maybe?

	CREATE FUNCTION my_generator_example(integer, integer)
	  returns setof integer
	  generator
	  language SQL
	$f$
	  SELECT generate_series($1, $2);
	$f$;

Maybe we should prefer to add the GENERATOR (or LAZY or whatever sounds good for a native English speaker) parameter to PL functions instead of providing YIELD, having RETURN doing YIELD in this case.

On Tuesday 28 October 2008, Tom Lane wrote:
> I suppose, but short of a fundamental rethink of how PL functions work
> that's not going to happen. There's also the whole issue of when do
> side-effects happen (such as before/after statement triggers).

Would it be possible to forbid "generators" when used in those cases?

> Agreed, but I think the fundamental solution there, for simple-select
> functions, is inlining.

Would it be possible to maintain current behavior with the ROWS estimator for functions, even when inlining, as a way to trick the planner when you can't feed it good enough stats?

> I think the PL side of the problem is the hard part --- if we knew how
> to solve these issues for plpgsql then SQL functions would surely be
> easy.

What about this python idea of GENERATORS and the YIELD control for lazy evaluation of functions?

-- 
dim
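[Editor's note: for reference, the hypothetical mygenerator() above rendered as an actual Python generator; my_expensive_function is a stand-in for real per-row work.]

```python
def my_expensive_function(foo):
    return foo * 10          # stand-in for real per-row work

def mygenerator(table):
    # FOR v_foo IN SELECT foo FROM table LOOP ... YIELD ... END LOOP
    for v_foo in table:
        yield my_expensive_function(v_foo)

gen = mygenerator([1, 2, 3])
assert next(gen) == 10       # only the first row has been computed so far
assert list(gen) == [20, 30] # remaining rows, computed on demand
```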
2008/10/28 Dimitri Fontaine <dfontaine@hi-media.com>:
> Hi,
>
> In the python language, functions that lazily return collections are called
> generators and use the yield keyword instead of return.
> http://www.python.org/doc/2.5.2/tut/node11.html#SECTION00111000000000000000000
>
> Maybe having such a concept in PostgreSQL would allow the user to choose
> between current behavior (materializing) and lazy computing, with a new
> internal API to get done in the executor maybe.

Lazy computing is a good idea, but I am afraid it would be really very hard to implement. You have to store the current state somewhere, stop execution, return back from the node, and then be able to restore the PL state and continue processing. I can't see it happening without thread support.

Regards
Pavel Stehule

> CREATE FUNCTION mygenerator()
>   returns setof integer
>   language PLPGSQL
> AS $f$
> BEGIN
>   FOR v_foo IN SELECT foo FROM table LOOP
>     YIELD my_expensive_function(v_foo);
>   END LOOP;
>   RETURN;
> END;
> $f$;
>
> At the plain SQL level, we could expose this with a new function parameter,
> GENERATOR maybe?
>
> CREATE FUNCTION my_generator_example(integer, integer)
>   returns setof integer
>   generator
>   language SQL
> $f$
>   SELECT generate_series($1, $2);
> $f$;
>
> Maybe we should prefer to add the GENERATOR (or LAZY or whatever sounds good
> for a native English speaker) parameter to PL functions instead of
> providing YIELD, having RETURN doing YIELD in this case.
>
> On Tuesday 28 October 2008, Tom Lane wrote:
>> I suppose, but short of a fundamental rethink of how PL functions work
>> that's not going to happen. There's also the whole issue of when do
>> side-effects happen (such as before/after statement triggers).
>
> Would it be possible to forbid "generators" when used in those cases?
>
>> Agreed, but I think the fundamental solution there, for simple-select
>> functions, is inlining.
>
> Would it be possible to maintain current behavior with the ROWS estimator for
> functions, even when inlining, as a way to trick the planner when you can't
> feed it good enough stats?
>
>> I think the PL side of the problem is the hard part --- if we knew how
>> to solve these issues for plpgsql then SQL functions would surely be
>> easy.
>
> What about this python idea of GENERATORS and the YIELD control for lazy
> evaluation of functions?
> -- 
> dim
On Tuesday 28 October 2008, Pavel Stehule wrote:
> 2008/10/28 Dimitri Fontaine <dfontaine@hi-media.com>:
>> Hi,
>>
>> In the python language, functions that lazily return collections are
>> called generators and use the yield keyword instead of return.
>> http://www.python.org/doc/2.5.2/tut/node11.html#SECTION00111000000000000000000
>>
>> Maybe having such a concept in PostgreSQL would allow the user to choose
>> between current behavior (materializing) and lazy computing, with a new
>> internal API to get done in the executor maybe.
>
> Lazy computing is a good idea, but I am afraid it would be really very
> hard to implement. You have to store the current state somewhere, stop
> execution, return back from the node, and then be able to restore the
> PL state and continue processing. I can't see it happening without
> thread support.

I'm not sure I understand what the current situation is, then. Reading this commit message of Tom's:

	Extend ExecMakeFunctionResult() to support set-returning functions
	that return via a tuplestore instead of value-per-call ...
	For the moment, SQL functions still do things the old way.

http://git.postgresql.org/?p=postgresql.git;a=commit;h=6d5301be5ece6394433d73288e0fafaed6326485

I had the impression we already have a lazy implementation, this value-per-call returning code path, which still exists for SQL functions.

>> CREATE FUNCTION my_generator_example(integer, integer)
>>   returns setof integer
>>   generator
>>   language SQL
>> $f$
>>   SELECT generate_series($1, $2);
>> $f$;

So my idea would be to have the SQL function behavior choose to return values either via tuplestore or via value-per-call, depending on the user setting "generator" or "lazy". Done this way, the user could also choose for the function to be lazy or to use a tuplestore whatever the language in which it's written. Current behaviour would then mean the default depends on the language, lazy for SQL and tuplestore for PL/pgSQL.

Well, it will have to be documented, whatever the final choice is.

Is it possible? A good idea?

-- 
dim
> So my idea would be to have the SQL function behavior choose to return values
> either via tuplestore or via value-per-call, depending on the user
> setting "generator" or "lazy".
> Done this way, the user could also choose for the function to be lazy or to
> use a tuplestore whatever the language in which it's written.

The problem is not the general PostgreSQL executor, but whatever body of code executes PL/pgsql functions (and other PL languages). It does not, as I understand it, support freezing execution of the function midway through and picking up again later. I haven't looked at the code, but based on previous experience, that could turn out to be a pretty major refactoring. I suspect it would be worthwhile and quite welcome to many users - but I doubt very much that it would be easy.

...Robert
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> Done this way, the user could also choose for the function to be lazy or to
> use a tuplestore whatever the language in which it's written.

The odds of this ever happening for any of the PLs are not distinguishable from zero. It might be nice to have, but the amount of work involved would be incredibly out of proportion to the benefits --- even assuming that it's technically possible at all, which I rather doubt for the PLs that depend on language interpreters that aren't under our control. So the fact that it's possible for SQL-language functions is an idiosyncrasy of that language, not something we should cram into the general CREATE FUNCTION syntax in the vain hope that having syntax might cause an implementation to appear someday.

Therefore, if we were going to expose a knob to the user to control this behavior, I'd be inclined to make it a part of the language-specific syntax of SQL function bodies. We could take a hint from the (underdocumented) #option syntax in plpgsql, something like

	CREATE FUNCTION foo(...) RETURNS SETOF something AS $$
	  #option lazy
	  SELECT ...
	$$ LANGUAGE SQL;

Mind you, I'm not exactly *for* this, because I think it will result in making functions.c a whole lot more complex and hard to maintain than it needs to be, in exchange for a behavior that I don't believe is especially useful in most cases, and can easily be worked around when it is useful. But if people are going to be sticky about the point, something like this might be a workable compromise.

regards, tom lane
On Wednesday 29 October 2008, Tom Lane wrote:
> So the fact that it's possible for SQL-language functions is an
> idiosyncrasy of that language, not something we should cram into the
> general CREATE FUNCTION syntax in the vain hope that having syntax
> might cause an implementation to appear someday.

Ok, that confirms that lazy evaluation and call-per-value are distinct things, for once, and that what you were after was not an easy syntax bit. :)

> Therefore, if we were going to expose a knob to the user to control this
> behavior, I'd be inclined to make it a part of the language-specific
> syntax of SQL function bodies.

How would we support the option for SQL functions?

> Mind you, I'm not exactly *for* this, because I think it will result
> in making functions.c a whole lot more complex and hard to maintain
> than it needs to be, in exchange for a behavior that I don't believe
> is especially useful in most cases, and can easily be worked around
> when it is useful.

From what I understand, the lazy evaluation of functions is not seen as easy to work around by the people asking for it.

> But if people are going to be sticky about the
> point, something like this might be a workable compromise.

What's against PLpgSQL implementing a YIELD statement? Wouldn't it be simpler to integrate for both hackers and users? This would maybe even allow a new API in the executor for this, and each PL would be free to add support for it when it best suits them.

Maybe that's exactly what you're calling "a whole lot more complex and hard to maintain than it needs to be", though.

Regards,
-- 
dim
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> On Wednesday 29 October 2008, Tom Lane wrote:
>> So the fact that it's possible for SQL-language functions is an
>> idiosyncrasy of that language, not something we should cram into the
>> general CREATE FUNCTION syntax in the vain hope that having syntax
>> might cause an implementation to appear someday.

> Ok, that confirms that lazy evaluation and call-per-value are distinct things,
> for once, and that what you were after was not an easy syntax bit. :)

Well, call-per-value is *necessary* for lazy evaluation, but it's not *sufficient*. You need a function implementation that can suspend and resume execution, and that's difficult in general.

>> Therefore, if we were going to expose a knob to the user to control this
>> behavior, I'd be inclined to make it a part of the language-specific
>> syntax of SQL function bodies.

> How would we support the option for SQL functions?

Well, we'd just tweak how the executor gets called inside functions.c. The main problem is that we'd have to have two different sets of behavior there, depending on whether we are trying to evaluate commands a row at a time or all at once, plus interlocks to disallow cases like using LAZY with a RETURNING query. It's certainly possible but I believe it will make functions.c a lot longer and uglier than it would be without it.

>> Mind you, I'm not exactly *for* this, because I think it will result
>> in making functions.c a whole lot more complex and hard to maintain
>> than it needs to be, in exchange for a behavior that I don't believe
>> is especially useful in most cases, and can easily be worked around
>> when it is useful.

> From what I understand, the lazy evaluation of functions is not seen as easy
> to work around by the people asking for it.

Nobody has refuted the argument that sticking a LIMIT into the function would accomplish the same result.

> What's against PLpgSQL implementing a YIELD statement?

Feel free to try it, if you want. When you get done you might have some grasp of why it'll be nearly impossible for PLs that we don't control the entire implementation of.

regards, tom lane
Le mercredi 29 octobre 2008, Tom Lane a écrit : > Well, call-per-value is *necessary* for lazy evaluation, but it's not > *sufficient*. You need a function implementation that can suspend and > resume execution, and that's difficult in general. Ok, I think I begin to understand how things are tied together. Thanks again for your patience explaining :) > Well, we'd just tweak how the executor gets called inside functions.c. > The main problem is that we'd have to have two different sets of > behavior there, depending on whether we are trying to evaluate commands > a row at a time or all at once, plus interlocks to disallow cases like > using LAZY with a RETURNING query. It's certainly possible but I > believe it will make functions.c a lot longer and uglier than it would > be without it. And I fail to see how the user would control which behavior will get chosen, which I think was part of the "going further with your ideas" sub thread. > Nobody has refuted the argument that sticking a LIMIT into the function > would accomplish the same result. Fair enough. > > What's against PLpgSQL implementing a YIELD statement? > > Feel free to try it, if you want. Hehe, not this year. But being able to ask questions and get clarifications from hackers certainly is a step in this direction. Feeling ready and organizing one's time around it is the next :) > When you get done you might have some > grasp of why it'll be nearly impossible for PLs that we don't control > the entire implementation of. Hence the YIELD / new API idea, with the LAZY property which would be optional for PLs and only implemented in plpgsql (and maybe plpython, as python supports the generator functions concept) first. Maybe having optional features for PLs has not yet been done? But again, I was started in this only by misunderstanding your call here: > I think the PL side of the problem is the hard part --- if we knew how> to solve these issues for plpgsql then SQL functionswould surely be> easy. 
I'm not being sticky about the feature request, just struggling to understand the issues at hand correctly, recognizing that an easy choice of EAGER or LAZY function evaluation would be great as a user, even if unsupported in a number of PLs. Regards, -- dim
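The suspend-and-resume requirement Tom describes is exactly what generator support gives a language, which is why plpython comes up above. As a rough illustration (in Python, which has generator functions; this is not PostgreSQL code), compare a lazy value-per-call producer with an eager, materialize-everything one:

```python
# Sketch of the two evaluation strategies under discussion: a generator
# suspends at each yield and resumes on demand (value-per-call / LAZY),
# while materializing runs the whole computation up front and buffers
# every row, like filling a tuplestore (EAGER).

def lazy_rows(n):
    """Value-per-call: execution is suspended at each yield and
    resumed only when the caller asks for the next row."""
    for i in range(1, n + 1):
        yield i

def eager_rows(n):
    """Materialize: produce every row before returning anything."""
    return [i for i in range(1, n + 1)]

# A caller that stops early only pays for what it consumes in lazy mode:
gen = lazy_rows(10**9)                        # no rows computed yet
first_three = [next(gen) for _ in range(3)]   # computes just 3 rows
# eager_rows(10**9) would have to build the entire list first.
```

This is the behavior a hypothetical YIELD statement would have to provide inside a PL, and why it is hard for PLs whose interpreters PostgreSQL does not control.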
Dimitri Fontaine <dfontaine@hi-media.com> writes: > And I fail to see how the user would control which behavior will get chosen, Oh, I'm sorry, I didn't realize you misunderstood my syntax example. I was suggesting that the SQL function manager recognize some optional non-SQL keywords at the start of a SQL function body, along the lines of CREATE FUNCTION foo(...) RETURNS SETOF something AS $$ #option eager SELECT ... $$ LANGUAGE SQL; versus CREATE FUNCTION foo(...) RETURNS SETOF something AS $$ #option lazy SELECT ... $$ LANGUAGE SQL; (I'm not wedded to this particular spelling of it, but there is precedent in plpgsql.) Now of course the bigger problem with either this syntax or yours is that attaching such a property to a function is arguably the Wrong Thing in the first place. Which one is the best way is likely to depend on the calling query more than it does on the function. However, I see no solution to that problem except function inlining; and if the function gets inlined then all this discussion is moot anyhow. regards, tom lane
On Wednesday, October 29, 2008, Tom Lane wrote: > Now of course the bigger problem with either this syntax or yours is > that attaching such a property to a function is arguably the Wrong Thing > in the first place. Which one is the best way is likely to depend on > the calling query more than it does on the function. Let the planner figure this out, and add in some starting cost considerations too maybe? That sounds even better, yes. > However, I see no > solution to that problem except function inlining; and if the function > gets inlined then all this discussion is moot anyhow. How to inline PL functions? -- dim
Dimitri Fontaine <dfontaine@hi-media.com> writes: > Le mercredi 29 octobre 2008, Tom Lane a écrit : >> However, I see no >> solution to that problem except function inlining; and if the function >> gets inlined then all this discussion is moot anyhow. > How to inline PLs functions? All of this is pie-in-the-sky for PL functions, and I think properly so: the whole reason for supporting PLs is to enable doing things that SQL does poorly or not at all. So expecting SQL to interoperate very closely with them seems impossible, or at least unreasonably limiting. The real issue at hand is what to do with SQL-language functions. I'm currently going to have a look at just what it would take to support both lazy and eager evaluation in functions.c (independently of what syntax, if any, we settle on to expose the choice to the user). If it's either really awful or really easy we should know that before arguing further. regards, tom lane
On Wed, 2008-10-29 at 11:58 -0400, Tom Lane wrote: > Dimitri Fontaine <dfontaine@hi-media.com> writes: > > And I fail to see how the user would control which behavior will get chosen, > > Oh, I'm sorry, I didn't realize you misunderstood my syntax example. > I was suggesting that the SQL function manager recognize some optional > non-SQL keywords at the start of a SQL function body, along the lines of > > CREATE FUNCTION foo(...) RETURNS SETOF something AS $$ > #option eager > SELECT ... $$ LANGUAGE SQL; > > versus > > CREATE FUNCTION foo(...) RETURNS SETOF something AS $$ > #option lazy > SELECT ... $$ LANGUAGE SQL; > > (I'm not wedded to this particular spelling of it, but there is > precedent in plpgsql.) > > Now of course the bigger problem with either this syntax or yours is > that attaching such a property to a function is arguably the Wrong Thing > in the first place. Which one is the best way is likely to depend on > the calling query more than it does on the function. However, I see no > solution to that problem except function inlining; and if the function > gets inlined then all this discussion is moot anyhow. I have some vague ideas about extending SET-returning functions to NODE-returning functions, which will have some extra methods (for OO languages like python) or extra functions (for C, somewhat similar to how AGGREGATE functions are defined) to interact with the planner/optimiser, so that the planner can ask the function instance things like "can you do fast start" or "how many rows for these args" during planning and also can advise the function about strategies once the plan is chosen. That would be something which could be very useful for SQL/MED implementation as well. ----------------- Hannu -- ------------------------------------------ Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
> All of this is pie-in-the-sky for PL functions, and I think properly so: > the whole reason for supporting PLs is to enable doing things that SQL > does poorly or not at all. So expecting SQL to interoperate very > closely with them seems impossible, or at least unreasonably limiting. > The real issue at hand is what to do with SQL-language functions. > > I'm currently going to have a look at just what it would take to support > both lazy and eager evaluation in functions.c (independently of what > syntax, if any, we settle on to expose the choice to the user). If it's > either really awful or really easy we should know that before arguing > further. It occurs to me that for PL/perl and similar one could design an interface that is similar to the one that is used for C functions - that is, function is invoked multiple times, returns one value per call, and is given a place to stash its state across calls. For example, for PL/perl, you could pass a mutable empty hash reference on the first call and then pass the same hash reference back on each subsequent call. That wouldn't require being able to freeze/thaw the whole state, just being able to maintain the contents of that hash reference across calls. It would probably be a lot more difficult to make something like this work usefully for PL/pgsql, which as a language is rather underpowered (nonetheless I use it heavily; it's awesome for the things it is good at), but I suspect it could be applied to Python, PHP, etc. pretty easily. So that's at least three ways you can evaluate the function: generate the whole thing in one fell swoop, single function call but with lazy execution, or value-per-call mode. I'm guessing someone could dream up other possibilities as well. Now, who's volunteering to implement?:-) ...Robert
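Robert's proposed convention — the function is invoked once per row and handed a mutable container that survives between calls, standing in for the Perl hash reference — can be sketched as follows. This is a Python model for illustration only; the function names and the DONE sentinel are invented, not any real PL or PostgreSQL API:

```python
# Model of a value-per-call protocol with cross-call state. The
# "executor" passes the same mutable dict to every invocation; the
# function stashes its position there instead of needing its whole
# interpreter state frozen and thawed.

DONE = object()   # sentinel returned when the set is exhausted

def series_func(state, n):
    """Returns one value per call; 'state' persists across calls."""
    i = state.get('i', 0) + 1
    if i > n:
        return DONE
    state['i'] = i
    return i

def run_value_per_call(func, *args):
    """What the calling executor would do: repeat until DONE."""
    state, rows = {}, []
    while True:
        v = func(state, *args)
        if v is DONE:
            break
        rows.append(v)
    return rows
```

Note that the function body must be written "inside out" — it cannot simply loop and emit — which is exactly the awkwardness that makes this style a poor fit for a RETURN QUERY-centric language like PL/pgSQL.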
2008/10/30 Robert Haas <robertmhaas@gmail.com>: >> All of this is pie-in-the-sky for PL functions, and I think properly so: >> the whole reason for supporting PLs is to enable doing things that SQL >> does poorly or not at all. So expecting SQL to interoperate very >> closely with them seems impossible, or at least unreasonably limiting. >> The real issue at hand is what to do with SQL-language functions. >> >> I'm currently going to have a look at just what it would take to support >> both lazy and eager evaluation in functions.c (independently of what >> syntax, if any, we settle on to expose the choice to the user). If it's >> either really awful or really easy we should know that before arguing >> further. > > It occurs to me that for PL/perl and similar one could design an > interface that is similar to the one that is used for C functions - > that is, function is invoked multiple times, returns one value per > call, and is given a place to stash its state across calls. For > example, for PL/perl, you could pass a mutable empty hash reference on > the first call and then pass the same hash reference back on each > subsequent call. That wouldn't require being able to freeze/thaw the > whole state, just being able to maintain the contents of that hash > reference across calls. > > It would probably be a lot more difficult to make something like this > work usefully for PL/pgsql, which as a language is rather underpowered > (nonetheless I use it heavily; it's awesome for the things it is good > at), but I suspect it could be applied to Python, PHP, etc. pretty > easily. > > So that's at least three ways you can evaluate the function: generate > the whole thing in one fell swoop, single function call but with lazy > execution, or value-per-call mode. I'm guessing someone could dream > up other possibilities as well. Now, who's volunteering to implement? > :-) > With session variables we could implement SRF functions in plpgsql like the current C SRF functions.
Like create or replace function foo(....) returns record as $$ #option with_srf_context(datatype of srf context) begin return row(...); end; $$ language plpgsql; I think it is implementable, but it's not very efficient - there is a lot of initialisation code. But this technique is used for aggregation functions without problems. I believe it would not be fast, but it could be useful for very large datasets, where current SRF functions would fail. regards Pavel Stehule > ...Robert >
I wrote: > I'm currently going to have a look at just what it would take to support > both lazy and eager evaluation in functions.c (independently of what > syntax, if any, we settle on to expose the choice to the user). If it's > either really awful or really easy we should know that before arguing > further. Attached is a draft patch that allows SQL functions to return sets using either value-per-call or materialize mode. It does not expose any control to the user; for the moment, the choice is driven by whether the call site is ExecMakeFunctionResult (which prefers value-per-call) or ExecMakeTableFunctionResult (which prefers materialize). I estimate that functions.c is two or three hundred lines longer than it would be if we stripped the value-per-call support and simplified the logic down to what I had in my prior patch. Which is not all that much in the big scheme of things, so I'll withdraw my argument for simplifying. I'm not sure if it's worth adding a control knob or not --- it's still true that materialize is faster on a tuple-by-tuple basis, but whether the difference is all that significant for nontrivial queries is debatable. Anyway I don't really want to work on that point right now. The next step is to make it actually support RETURNING queries, and if I don't get on with that I won't finish it before commitfest. regards, tom lane
> With session variables we could implement srf function in plpgsql like > current C srf function. Like > > create or replace function foo(....) > returns record as $$ > #option with_srf_context(datatype of srf context) > begin > return row(...); > end; > $$ language plpgsql; Oh, sure - but what you can do with this will be somewhat limited compared to a Perl hash reference off which you can chain any arbitrary data structure with ease. I'd want to see an actual use case for this before anyone bothered implementing it. I was actually thinking one way to do it would be to extend the variable declaration syntax so that you could declare n>=0 variables as SRF context variables, which I think is nicer, but even that I think is of limited usefulness. I think the biggest value of PL/pgsql is the ability to RETURN QUERY, and I think the ability to push a lazy execution model down into that subordinate query is where the win is. That case won't be helped at all by this sort of alternate calling convention - in fact it'll be nearly impossible to even do that at all with this type of execution model. ...Robert
2008/10/30 Robert Haas <robertmhaas@gmail.com>: >> With session variables we could implement srf function in plpgsql like >> current C srf function. Like >> >> create or replace function foo(....) >> returns record as $$ >> #option with_srf_context(datatype of srf context) >> begin >> return row(...); >> end; >> $$ language plpgsql; > > Oh, sure - but what you can do with this will be somewhat limited > compared to a Perl hash reference off which you can chain any > arbitrary data structure with ease. I'd want to see an actual use > case for this before anyone bothered implementing it. I was actually > thinking one way to do it would be to extend the variable declaration > syntax so that you could declare n>=0 variables as SRF context > variables, which I think is nicer, but even that I think is of limited > usefulness. I think the biggest value of PL/pgsql is the ability to > RETURN QUERY, and I think the ability to push a lazy execution model > down into that subordinate query is where the win is. That case won't > be helped at all by this sort of alternate calling convention - in > fact it'll be nearly impossible to even do that at all with this type > of execution model. > RETURN QUERY should be implemented for the lazy execution model. And it should be fast and not too difficult. Pavel > ...Robert >
"Pavel Stehule" <pavel.stehule@gmail.com> writes: > RETURN QUERY should be implemented for lazy execution model. And it > should be fast and not to much dificult. Really? Consider what happens if it's inside a loop, or an exception block, or any other nesting construct. regards, tom lane
2008/10/31 Tom Lane <tgl@sss.pgh.pa.us>: > "Pavel Stehule" <pavel.stehule@gmail.com> writes: >> RETURN QUERY should be implemented for the lazy execution model. And it >> should be fast and not too difficult. > > Really? Consider what happens if it's inside a loop, or an exception > block, or any other nesting construct. > true, I forgot - RETURN QUERY need not be the last statement. But if we had some RETURN FINAL QUERY, I believe it should be possible - we would only need to call plpgsql in two modes - standard, and returning the result of a final query. regards Pavel Stehule > regards, tom lane >
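Pavel's RETURN FINAL QUERY idea amounts to: run the procedural part of the function eagerly, but hand the final query back unexecuted so the caller can pull its rows lazily. A rough sketch in Python for illustration only (none of this is plpgsql syntax, and the function here is invented):

```python
# Model of "RETURN FINAL QUERY": all procedural statements execute
# eagerly when the function is called, but the designated final query
# is returned as a lazy iterator rather than being materialized.

def srf_with_final_query(n):
    # ...arbitrary procedural setup runs eagerly here...
    threshold = n // 2
    # The "final query" is handed back unexecuted; rows are produced
    # only as the caller demands them.
    return (i for i in range(n) if i >= threshold)

rows = srf_with_final_query(10)   # setup has run; no rows produced yet
first = next(rows)                # rows now stream on demand
```

Because only the single trailing query is lazy, this sidesteps Tom's objection: no suspension is ever needed inside loops or exception blocks, since everything before the final query has already run to completion.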
With the attached patch, SQL functions support returning the results of INSERT/UPDATE/DELETE RETURNING clauses. An INSERT/UPDATE/DELETE statement is always executed to completion before returning (including firing any related triggers or rules), so we always materialize the RETURNING output. When the function result comes from a SELECT, we continue to use value-per-call mode when being called from ExecMakeFunctionResult (thus preserving the prior behavior if the calling query doesn't execute to completion). When called from ExecMakeTableFunctionResult, we materialize the output and return the tuplestore in one call. This is also the same behavior as before, since ExecMakeTableFunctionResult would've filled a tuplestore anyway; but it's noticeably faster because a lot of call/return and executor entry overhead is eliminated. This is code-complete but I haven't touched the docs yet. Barring objections, I plan to commit when I've finished fixing the docs. regards, tom lane
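The dispatch rule the patch implements can be modeled roughly like this (Python for illustration, not PostgreSQL internals): DML with RETURNING always runs to completion and is buffered, while a SELECT called in value-per-call context streams its rows:

```python
# Model of the final patch's behavior: INSERT/UPDATE/DELETE RETURNING
# output is always materialized (the statement must execute to
# completion, triggers and rules included, before any row is returned),
# whereas a SELECT streams lazily when the call site prefers
# value-per-call and materializes otherwise.

def execute_set_function(stmt_kind, rows_iter, value_per_call):
    if stmt_kind in ('INSERT', 'UPDATE', 'DELETE'):
        # RETURNING output: run to completion, buffer everything
        return list(rows_iter)
    if value_per_call:
        return rows_iter          # stream rows on demand to the caller
    return list(rows_iter)        # table-function context: materialize

dml = execute_set_function('INSERT', iter([1, 2, 3]), value_per_call=True)
# materialized into a list despite value-per-call being requested
sel = execute_set_function('SELECT', iter([1, 2, 3]), value_per_call=True)
# still a lazy iterator; nothing has been consumed yet
```

The two SELECT branches correspond to the ExecMakeFunctionResult and ExecMakeTableFunctionResult call sites described in the message above.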
On Friday, October 31, 2008, Tom Lane wrote: > With the attached patch, SQL functions support returning the results of > INSERT/UPDATE/DELETE RETURNING clauses. Thanks for your work and for having considered user whining in passing! :) -- dim