Thread: Passing arguments to views
I've written some complicated queries that I'd like to save inside the server and then call from my clients using a short name. For queries that require no external information, views are perfect. For queries that *do* require external information (like a search date range), I've used functions (written in the SQL language, because I'm just writing plain SQL queries, using $1, $2 tokens for passed-in arguments).

When I use these functions, I'm typically joining the results of the function with other tables. Since much of the work is being done inside the function, the planner can't use both the inside-function and outside-function query information when generating a query plan. Instead, it has to do Function Scans (planning and executing the SQL inside the function at each execution, I'm assuming) and then manipulate the output.

Ideally, I'd like to be able to write queries that contain $n tokens that will be substituted at execution time, save them on the server, and let the query planner plan the whole query before it's executed. Basically, writing views with arguments. For example, a "sales_figures" view that requires start_date and end_date parameters could be used like this:

    CREATE VIEW sales_figures($1, $2) AS
        SELECT ... FROM ... WHERE purchase_date BETWEEN $1 AND $2;

    SELECT * FROM sales_figures('2005-08-22', '2005-09-14') sf
        JOIN customers c ON (sf.customer_id = c.customer_id)

What do you think? Is this an interesting feature? Is this the right way to go about it, or should I try to get the planner to see through SQL function boundaries (e.g., enhance the function system instead of enhancing the view system)? Would this be a good project for a newbie to the code?

I can see that the syntax used above would be problematic: how would it distinguish that from a call to a sales_figures() function? Any comments about alternative syntax would be welcome, too!

Thanks!

- Chris
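The function-based workaround described above looks roughly like this (a sketch only; the sales and customers tables and their columns are hypothetical, invented for illustration):

```sql
-- Hypothetical tables, invented for illustration.
CREATE TABLE customers (customer_id integer PRIMARY KEY, name text);
CREATE TABLE sales (customer_id integer, purchase_date date, amount numeric);

-- The query is wrapped in a SQL-language function so clients can call
-- it by a short name, with the date range passed as $1/$2.
CREATE FUNCTION sales_figures(date, date) RETURNS SETOF sales AS
    'SELECT * FROM sales WHERE purchase_date BETWEEN $1 AND $2'
    LANGUAGE sql STABLE;

-- Joining against the function: the planner treats sales_figures()
-- as a black box and emits a Function Scan node for it.
SELECT *
FROM sales_figures('2005-08-22', '2005-09-14') sf
JOIN customers c ON (sf.customer_id = c.customer_id);
```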
Chris Campbell <chris@bignerdranch.com> writes:

> What do you think? Is this an interesting feature? Is this the right
> way to go about it, or should I try to get the planner to see through
> SQL function boundaries

The "right" way to go about this in the original abstract set-theoretic mindset of SQL is to code the view to retrieve all the rows and then apply further WHERE clause restrictions to the results of the view.

So for example this:

> CREATE VIEW sales_figures($1, $2) AS
>     SELECT ... FROM ... WHERE purchase_date BETWEEN $1 AND $2;

Becomes:

    CREATE VIEW sales_figures AS SELECT ... FROM ...

And then you query it with:

    SELECT * FROM sales_figures WHERE purchase_date BETWEEN $1 AND $2

sales_figures could have any number of joins and complex WHERE clauses built in. It could even be an aggregate grouped by some column (like purchase_date).

This relies on the SQL optimizer to push the WHERE clause down into the view to the appropriate depth. Postgres isn't always capable of doing so, but it does a pretty decent job.

-- greg
On Feb 2, 2006, at 23:33, Greg Stark wrote:

> The "right" way to go about this in the original abstract
> set-theoretic mindset of SQL is to code the view to retrieve all the
> rows and then apply further WHERE clause restrictions to the results
> of the view.
>
> So for example this:
>
>> CREATE VIEW sales_figures($1, $2) AS
>>     SELECT ... FROM ... WHERE purchase_date BETWEEN $1 AND $2;
>
> Becomes:
>
>     CREATE VIEW sales_figures AS SELECT ... FROM ...
>
> And then you query it with
>
>     SELECT * FROM sales_figures WHERE purchase_date BETWEEN $1 AND $2

That was a very simplistic example and didn't illustrate my point -- I apologize. I was trying to think of something succinct and illustrative for a quick mailing list post but came up short.

Maybe a better example would be a situation where you want to do substitutions in places other than the WHERE clause? There's no way to "push" that out to the calling query. But even in this simple case, the easier-to-grok syntax of making a view look like a function (and codifying the options for restricting the results as arguments to the view) is a nice win in terms of readability and maintainability.

I was hoping that people would overlook my bad example because they've had the need for a "view with arguments" tool in their own work, and the conversation would just be about how it could be implemented. :) I'll try to distill a better example from some of the projects I'm working on.

Thanks!

- Chris
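One case where the parameter could not be pushed out of the view as a WHERE clause (an invented example, using the hypothetical "view with arguments" syntax from the first message) is a parameter that appears in the SELECT list or GROUP BY rather than as a row filter:

```sql
-- Hypothetical syntax; $1 controls the date_trunc granularity
-- ('day', 'month', ...), so it is not a restriction on rows.
CREATE VIEW sales_by_period($1) AS
    SELECT date_trunc($1, purchase_date) AS period, sum(amount) AS total
    FROM sales
    GROUP BY date_trunc($1, purchase_date);

-- No WHERE clause added by the caller can express this; the argument
-- changes the grouping itself.
SELECT * FROM sales_by_period('month');
```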
Chris Campbell schrieb:
...
> That was a very simplistic example and didn't illustrate my point -- I
> apologize. I was trying to think of something succinct and
> illustrative for a quick mailing list post but came up short.
>
> Maybe a better example would be a situation where you want to do
> substitutions in places other than the WHERE clause? There's no way to
> "push" that out to the calling query. But even in this simple case,
> the easier-to-grok syntax of making a view look like a function (and
> codifying the options for restricting the results as arguments to the
> view) is a nice win in terms of readability and maintainability.

Well, if the view does not suit your needs, why don't you use a set-returning function instead? Inside it you can do all the magic you want and still use it similarly to a table or view.

Regards
Tino
On Feb 3, 2006, at 02:09, Tino Wildenhain wrote:

> Well, if the view does not suit your needs, why don't you use a
> set-returning function instead? Inside it you can do all the magic
> you want and still use it similarly to a table or view.

That's what I'm currently doing (as explained in the first message in the thread). But the function is a "black box" to the planner when the query is executed -- I'd like the planner to be able to combine the query inside the function with the outer calling query and plan it as one big query. Like it does with views. Thus, "views with arguments."

We're certainly not deficient in this area (set-returning functions fill the need quite well), but a feature like this would go even further in terms of ease-of-use and performance. Benefits of "views with arguments" versus functions:

* Better query execution performance, because the planner can plan the whole query (rewriting the original query to replace references to the view with the view's definition -- this is how views work today)

* PostgreSQL-tracked dependencies: views create dependencies on the relations they reference -- functions do not

* Don't have to manually maintain a composite type for the return value

Basically, better performance and easier administration.

Thanks!

- Chris
On Fri, Feb 03, 2006 at 08:33:23AM -0500, Chris Campbell wrote:
> On Feb 3, 2006, at 02:09, Tino Wildenhain wrote:
>
>> Well, if the view does not suit your needs, why don't you use a
>> set-returning function instead? Inside it you can do all the magic
>> you want and still use it similarly to a table or view.
>
> That's what I'm currently doing (as explained in the first message in
> the thread). But the function is a "black box" to the planner when
> the query is executed -- I'd like the planner to be able to combine
> the query inside the function with the outer calling query and plan
> it as one big query. Like it does with views. Thus, "views with
> arguments."
>
> We're certainly not deficient in this area (set-returning functions
> fill the need quite well), but a feature like this would go even
> further in terms of ease-of-use and performance.

Hmm, we actually do inline SQL functions under certain situations, but only for "simple" queries (see inline_function in optimizer/util/clauses.c). One approach would be to expand that function to inline more complicated things.

> Benefits of "views with arguments" versus functions:
>
> * Better query execution performance because the planner can plan the
>   whole query (rewriting the original query to replace references to
>   the view with the view's definition -- this is how views work today)

Well, the inlining would achieve the same effect.

> * PostgreSQL-tracked dependencies: views create dependencies on the
>   relations they reference -- functions do not

Technically a bug. We should be tracking dependencies for functions anyway.

> * Don't have to manually maintain a composite type for the return value

This is a good point. Though with syntactic sugar you could work around this too...

Have a nice day,

-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration.
> A patent is a tool for doing 5% of the work and then sitting around
> waiting for someone else to do the other 95% so you can sue them.
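To make the inlining distinction concrete (my own examples, not from the source tree): inline_function can fold a simple scalar SQL function into the calling expression, while a set-returning body currently remains an opaque Function Scan:

```sql
-- A trivial scalar SQL function: a candidate for inlining, so a call
-- like "WHERE price > with_tax(100)" can be planned as a plain
-- arithmetic expression.
CREATE FUNCTION with_tax(numeric) RETURNS numeric AS
    'SELECT $1 * 1.08' LANGUAGE sql IMMUTABLE;

-- A set-returning SQL function over a hypothetical sales table: today
-- this is planned as a Function Scan rather than being expanded into
-- a subquery that the planner could flatten.
CREATE FUNCTION recent_sales(date) RETURNS SETOF sales AS
    'SELECT * FROM sales WHERE purchase_date >= $1' LANGUAGE sql STABLE;
```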
On Feb 3, 2006, at 08:50, Martijn van Oosterhout wrote:

> Hmm, we actually do inline SQL functions under certain situations, but
> only for "simple" queries (see inline_function in
> optimizer/util/clauses.c). One approach would be to expand that
> function to inline more complicated things.
>
>> * Better query execution performance because the planner can plan the
>> whole query (rewriting the original query to replace references to
>> the view with the view's definition -- this is how views work today)
>
> Well, the inlining would achieve the same effect.

So you think approaching it from the "beefing up functions" side would be better than the "beefing up views" side?

>> * PostgreSQL-tracked dependencies: views create dependencies on the
>> relations they reference -- functions do not
>
> Technically a bug. We should be tracking dependencies for functions
> anyway.

Well, it's impossible to track dependencies for all functions, since they're just strings (or compiled code in shared libraries) until they're executed. But maybe SQL language functions could be special-cased?

Do you think it would be easier to add dependency tracking for functions, or would it be easier to implement this functionality using the more-restrictive-language but better-dependency-tracking view system? When you add dependencies for things that didn't have dependencies before (like non-SQL functions), you create all sorts of backwards-compatibility problems due to the ordering that things need to be dumped and created, and circular dependencies. For example, this works:

    CREATE FUNCTION foo(INTEGER) RETURNS INTEGER AS
        'BEGIN RETURN bar($1 - 1); END;' LANGUAGE plpgsql;

    CREATE FUNCTION bar(INTEGER) RETURNS INTEGER AS
        'BEGIN IF $1 < 0 THEN RETURN $1; ELSE RETURN foo($1); END IF; END;'
        LANGUAGE plpgsql;

But it wouldn't work if PostgreSQL tracked and enforced dependencies. It could probably be done with SQL-language functions only.
I don't know if we'd want to add dependency tracking for functions if it only works for SQL-language functions, though.

> This is a good point. Though with syntactic sugar you could work
> around this too...

Basically, how views do it? :) By auto-creating a table with the proper columns (for a function, that would be an auto-created type).

I'm looking for a function/view hybrid, taking features from each. It seems to me that views have most of the features I want (only missing the ability to pass arguments), so it's a shorter distance to the goal than by starting with functions.

Thanks!

- Chris
On Fri, Feb 03, 2006 at 09:18:51AM -0500, Chris Campbell wrote:

> So you think approaching it from the "beefing up functions" side
> would be better than the "beefing up views" side?

Well, yes. I think you're underestimating the issues with trying to extend views.

>> Technically a bug. We should be tracking dependencies for functions
>> anyway.
>
> Well, it's impossible to track dependencies for all functions, since
> they're just strings (or compiled code in shared libraries) until

<snip>

Sure, for most procedural languages you can't do much. But we do do syntax checking already, and checking that the necessary functions exist can be considered part of that. It's not terribly important though.

> Basically, how views do it? :) By auto-creating a table with the
> proper columns (for a function, that would be an auto-created type).
>
> I'm looking for a function/view hybrid, taking features from each. It
> seems to me that views have most of the features I want (only missing
> the ability to pass arguments), so it's a shorter distance to the
> goal than by starting with functions.

Ok, here's the deal. A view is nothing more than a RULE. Creating a view does this automatically:

    CREATE RULE blah AS ON SELECT TO myview DO INSTEAD <select statement>;

Now, say you wanted to add parameters to this: would you restrict it to SELECT rules? What about UPDATE or DELETE rules?

    UPDATE myview(3,4) SET ...;

The other issue is that currently you can tell from looking at a statement whether something is a function or a table (is it followed by an open parenthesis?). With the above change you can't anymore, which might mean you can't have functions and tables with the same names because they'd be ambiguous.

On the whole, I think allowing the server to inline SRFs would be a far better way to go...

Have a nice day,

-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration.
> A patent is a tool for doing 5% of the work and then sitting around
> waiting for someone else to do the other 95% so you can sue them.
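To make the view-as-rule point concrete, the pair below is approximately what CREATE VIEW does internally (myview and mytable are placeholder names; the real implementation names the rule "_RETURN"):

```sql
-- CREATE VIEW myview AS SELECT a, b FROM mytable;
-- is roughly equivalent to:
CREATE TABLE myview (a integer, b text);    -- relation with the right columns
CREATE RULE "_RETURN" AS ON SELECT TO myview
    DO INSTEAD SELECT a, b FROM mytable;    -- rewrite SELECTs on myview
```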
On Feb 3, 2006, at 10:25, Martijn van Oosterhout wrote:

> Sure, for most procedural languages you can't do much. But we do do
> syntax checking already, and checking that the necessary functions
> exist can be considered part of that. It's not terribly important
> though.

Dependency tracking needs to be more than a "best effort." If you can't do it accurately and completely, then I don't think it's worth doing at all. But I agree with you that syntax checking is probably sufficient. We don't need true dependency tracking for functions.

The only failure case (where syntax checking won't help you) is deleting (or renaming, or modifying) a table that a function was using. If you were to run or re-define the function, you'd learn about the missing (or renamed, or modified) table, whereas a dependency-tracking system would prevent you from making the changes to the referenced table in the first place.

> Ok, here's the deal. A view is nothing more than a RULE. Creating a
> view does this automatically

Technically, it's a table and a rule, both of which are created by the CREATE VIEW command. We were talking about syntactic sugar, and CREATE VIEW is syntactic sugar for doing a CREATE TABLE and a CREATE RULE. That was my comparison. I'm aware of how views work.

Here's the deal: I want to beef up rules versus beefing up functions. Maybe that's not the way to go; I'm enjoying this discussion and your insights.

> CREATE RULE blah AS ON SELECT TO myview DO INSTEAD <select statement>;
>
> Now, say you wanted to add parameters to this, would you restrict it
> to SELECT rules? What about UPDATE or DELETE rules?

I don't see a huge use case for anything but SELECT rules, but I think it could be extended to any rule type.
Maybe the CREATE RULE syntax could be something like:

    CREATE RULE blah AS ON SELECT(INTEGER, INTEGER, DATE) TO myview
        DO INSTEAD SELECT * FROM sale WHERE sale_date = $3;

> The other issue is that currently you can tell from looking at a
> statement whether something is a function or a table (is it followed
> by an open parenthesis?). With the above change you can't anymore,
> which might mean you can't have functions and tables with the same
> names because they'd be ambiguous.

Right. I said that my example syntax was deficient in this regard in the first message in this thread, and I solicited ideas for a better (unambiguous) syntax. I'm sure we would be able to come up with something. Maybe using square brackets instead of parentheses? Curly braces? "myview->(1, 2, 3)" notation? Since views are tables (when parsing the query, at least) we'd have to allow this syntax for any table reference, but throw an error (or silently discard the arguments) if the table didn't have a rule matching the argument types.

> On the whole, I think allowing the server to inline SRFs would be a
> far better way to go...

Maybe, but the highly-structured view definition syntax and everything that comes with it (dependency tracking, primarily) is so tasty. I think a little grammar hacking and a couple of extra columns in pg_rewrite (nargs and argtypes) would get us most of the way there.

I would much rather put more stringent requirements on the programmer when defining his query (like a view), versus letting him submit any old string as a function (like a function). The database can do so much more when it's able to work with a better representation of the computation.

At the core, I want query rewriting with arguments. That sounds like a better fit for views/rules than functions, so that's why I keep coming back to it.

Thanks!

- Chris
Chris Campbell <chris@bignerdranch.com> writes:

> I want to beef up rules versus beefing up functions.

Martijn didn't present a very convincing argument why this is a bad idea, but I agree with him that it is. The problem to me is that a "view with arguments" is fundamentally wrong --- a view is a virtual table, and there is no such thing as a table with arguments. The whole notion distorts the relational data model beyond recognition.

The SRF concept captures what you want a whole lot better. If the implementation isn't up to snuff, we should improve it, not warp other pieces of the system.

Martijn mentioned the idea of inlining SQL functions that return sets --- this is something I've toyed with too, but not got round to looking at seriously. AFAICS it would accomplish everything that you could do with parameters in ON SELECT rules, considering the existing restrictions on what can be in an ON SELECT rule. And it wouldn't require any new concepts at all, just a few(?) pages of code.

As for the dependency issue, one man's bug is another man's feature. I think the fact that we don't track the internal dependencies of functions is not all bad. We've certainly seen plenty of complaints about how you can't easily change tables that a view is depending on because the view dependencies block it...

			regards, tom lane
On Feb 3, 2006, at 11:21, Tom Lane wrote:

> The SRF concept captures what you want a whole lot better. If the
> implementation isn't up to snuff, we should improve it, not warp other
> pieces of the system.

Point taken. The rewriting concept is what I'm after; if that can be done pre-planning with SQL functions, I'm all for it. I just thought that since rules already do rewriting, that's the best thing to start building on.

> Martijn mentioned the idea of inlining SQL functions that return sets
> --- this is something I've toyed with too, but not got round to
> looking at seriously. AFAICS it would accomplish everything that you
> could do with parameters in ON SELECT rules, considering the existing
> restrictions on what can be in an ON SELECT rule. And it wouldn't
> require any new concepts at all, just a few(?) pages of code.

True, as long as there's a hook to do the inlining/rewriting before the query's planned. I guess we can see function calls at the parse stage, check to see if they're SQL functions or not, grab the prosrc, do the substitution, then re-parse?

I guess I can live without the dependency tracking. I can always dump and reload my database to re-parse all the functions. Maybe we could have a RELOAD FUNCTION command that would just re-parse an existing function, so I don't have to dump and reload?

What about auto-creating a composite type for the function's return type based on the query definition? (Like how CREATE VIEW creates an appropriate table definition.) Do you see a way for CREATE FUNCTION to do that? The problem is that you have to specify a return type in CREATE FUNCTION. Maybe an extension to CREATE FUNCTION as a shorthand for set-returning SQL functions? Like:

    CREATE SQL FUNCTION sales_figures(DATE) AS
        SELECT ... FROM ... WHERE sale_date <= $1;

It would (1) automatically create a composite type (newtype) for the return value, and (2) do a CREATE FUNCTION sales_figures(DATE) RETURNS SETOF newtype AS '...' LANGUAGE sql.
How much do I have to justify a patch for non-standard "RELOAD FUNCTION" and "CREATE SQL FUNCTION" commands (as described) in the grammar? :) Thanks! - Chris
Chris Campbell <chris@bignerdranch.com> writes:

> True, as long as there's a hook to do the inlining/rewriting before
> the query's planned. I guess we can see function calls at the parse
> stage, check to see if they're SQL functions or not, grab the prosrc,
> do the substitution, then re-parse?

pull_up_subqueries in prepjointree.c would be the appropriate place, I think: if it's an RTE_FUNCTION RTE, look to see if the function is SQL and has the other needed properties; if so, replace it by an RTE_SUBQUERY RTE with the correct subquery, then recurse to try to flatten the subquery. (Note: I'm in the middle of hacking that code to flatten UNION subqueries, so you might want to wait till I commit before starting on a patch ;-))

> I guess I can live without the dependency tracking. I can always dump
> and reload my database to re-parse all the functions. Maybe we could
> have a RELOAD FUNCTION command that would just re-parse an existing
> function, so I don't have to dump and reload?

Hm? I don't understand why you think this is needed.

> What about auto-creating a composite type for the function's return
> type based on the query definition?

Can't get excited about this --- although I don't have any fundamental objection either. Choosing a name for such a type might be a bit of a problem (I don't think you can just use the function name, as that would preclude overloading).

> Maybe an extension to CREATE FUNCTION as a shorthand for
> set-returning SQL functions?

It would be surprising if this form of CREATE FUNCTION defaulted to assuming SETOF when other forms don't, so I don't like the proposal as written. Also, the syntax you suggest has no place to put function attributes like VOLATILE.
Note that you can already do

    regression=# create function fooey(int, out k1 int, out k2 int)
    returns setof record as
    regression-# $$ select unique1, unique2 from tenk1 where thousand = $1 $$ language sql;
    CREATE FUNCTION
    regression=# select * from fooey(44);
      k1  |  k2
    ------+------
     7044 |  562
     5044 |  692
     1044 |  789
     4044 | 1875
     3044 | 3649
     2044 | 4063
     8044 | 6124
     6044 | 6451
     9044 | 6503
       44 | 7059
    (10 rows)

    regression=#

I'm not convinced that the incremental advantage of not having to write out the function output column types is worth introducing an inconsistent variant of CREATE FUNCTION. Some indeed would argue that that's not an advantage at all ;-)

			regards, tom lane
On Feb 3, 2006, at 12:27, Tom Lane wrote:

>> I guess I can live without the dependency tracking. I can always dump
>> and reload my database to re-parse all the functions. Maybe we could
>> have a RELOAD FUNCTION command that would just re-parse an existing
>> function, so I don't have to dump and reload?
>
> Hm? I don't understand why you think this is needed.

Consider function foo() that references table bar. When you CREATE FUNCTION foo() ... AS 'SELECT ... FROM bar' you get an error message if bar doesn't exist. If it does exist, CREATE FUNCTION succeeds. If you later DROP bar, you're not informed that function foo() was referencing it. You only find that out if you redefine foo() (using CREATE OR REPLACE FUNCTION and passing in the same definition, which fails) or if you try to run foo() (and the query fails). If functions had true dependency tracking, then you couldn't DROP bar due to foo()'s dependency on it, unless you did a DROP CASCADE and were alerted that foo() was dropped as well.

I'm fine with those limitations. I can confirm that none of my functions reference tables that don't exist by doing a CREATE OR REPLACE FUNCTION to reload each function. A pg_dump/pg_restore would accomplish this, but it would be nice to have a "RELOAD FUNCTION" (or "REPARSE"? or "VERIFY"?) command that would just re-parse the function's source code (like CREATE FUNCTION does) and spit out errors if the function is referencing relations that don't exist. Just as a way to confirm that the table modification I just performed didn't break any functions. On-demand dependency checking, in a way.
> Note that you can already do
>
> regression=# create function fooey(int, out k1 int, out k2 int)
> returns setof record as
> regression-# $$ select unique1, unique2 from tenk1 where thousand = $1 $$ language sql;
> CREATE FUNCTION
> regression=# select * from fooey(44);
>   k1  |  k2
> ------+------
>  7044 |  562
>  5044 |  692
>  1044 |  789
>  4044 | 1875
>  3044 | 3649
>  2044 | 4063
>  8044 | 6124
>  6044 | 6451
>  9044 | 6503
>    44 | 7059
> (10 rows)
>
> regression=#

Learn something new every day. I'm still using 7.4 for most of my day job, and I can't do this without supplying a column definition list:

    ERROR: a column definition list is required for functions returning "record"

I hereby withdraw my proposal for "CREATE SQL FUNCTION."

Thanks!

- Chris
Chris Campbell <chris@bignerdranch.com> writes:

> I'm fine with those limitations. I can confirm that none of my
> functions reference tables that don't exist by doing a CREATE OR
> REPLACE FUNCTION to reload each function. A pg_dump/pg_restore would
> accomplish this, but it would be nice to have a "RELOAD FUNCTION" (or
> "REPARSE"? or "VERIFY"?) command that would just re-parse the
> function's source code (like CREATE FUNCTION does) and spit out
> errors if the function is referencing relations that don't exist.

This is putting way too much trust in the behavior of a PL-language-specific verifier function. Anyway, you can do what you want today:

    select fmgr_sql_validator(oid) from pg_proc where prolang = 14;

(Generalizing this to work for any language is left as an exercise for the reader...)

			regards, tom lane
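One way to take up that exercise (a sketch only; validate_all_functions is a hypothetical helper, and it assumes each PL records its validator in pg_language.lanvalidator, which is what the validator column is for):

```sql
-- Run each language's validator over every function in the public
-- schema; a validator raises an error for a broken function body.
CREATE FUNCTION validate_all_functions() RETURNS void AS '
DECLARE
    r RECORD;
BEGIN
    FOR r IN
        SELECT p.oid AS fnoid, l.lanvalidator
        FROM pg_proc p
        JOIN pg_language l ON l.oid = p.prolang
        WHERE l.lanvalidator <> 0
          AND p.pronamespace = (SELECT oid FROM pg_namespace
                                WHERE nspname = ''public'')
    LOOP
        EXECUTE ''SELECT '' || r.lanvalidator::regproc
                || ''('' || r.fnoid || '')'';
    END LOOP;
END;
' LANGUAGE plpgsql;
```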
Tom,

> As for the dependency issue, one man's bug is another man's feature.
> I think the fact that we don't track the internal dependencies of
> functions is not all bad. We've certainly seen plenty of complaints
> about how you can't easily change tables that a view is depending on
> because the view dependencies block it...

I'd agree with this. I write about 150,000 lines of function code a year, and if I had to rebuild all of the cascading functions every time I change a table the way I have to with views, it would probably add 20% to my overall application development time.

BTW, the other thing that we're still TODOing on SRFs (as far as I know) is finding ways to change the row estimate for an SRF. It's still a flat 1000 in the code, which can cause a lot of bad query plans. I proposed a year ago that, as a first step, we allow the function owner to assign a static estimate variable to the function (e.g. "average rows returned = 5"). This doesn't solve the whole problem of SRF estimates, but it would be a significant step forward in being able to use them in queries.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Tom Lane wrote:

> Chris Campbell <chris@bignerdranch.com> writes:
>
>> True, as long as there's a hook to do the inlining/rewriting before
>> the query's planned. I guess we can see function calls at the parse
>> stage, check to see if they're SQL functions or not, grab the prosrc,
>> do the substitution, then re-parse?
>
> pull_up_subqueries in prepjointree.c would be the appropriate place
> I think: if it's an RTE_FUNCTION RTE, look to see if function is SQL
> and has the other needed properties, if so replace it by an
> RTE_SUBQUERY RTE with the correct subquery, then recurse to try to
> flatten the subquery. (Note: I'm in the middle of hacking that code
> to flatten UNION subqueries, so you might want to wait till I commit
> before starting on a patch ;-))

If we are talking about inserting the function definition into the query as a subquery and then letting the parser treat it as a subquery, then I see no reason to use either the existing function or view subsystems. It sounds more like we are discussing a macro language.

    CREATE MACRO foo(bar, baz) AS
        $$ select a from b where b > bar and b < baz $$;

Then when you query

    SELECT * FROM foo(1,7) AS f WHERE f % 7 = 3

you get a macro expansion as such:

    SELECT * FROM (select a from b where b > 1 and b < 7) AS f WHERE f % 7 = 3

Then whatever optimizations the query planner can manage against a subquery will work for macros as well.

Thoughts?
Josh Berkus <josh@agliodbs.com> writes:

> BTW, the other thing that we're still TODOing on SRFs (as far as I
> know) is finding ways to change the row estimate for an SRF. It's
> still a flat 1000 in the code, which can cause a lot of bad query
> plans. I proposed a year ago that, as a first step, we allow the
> function owner to assign a static estimate variable to the function
> (e.g. "average rows returned = 5"). This doesn't solve the whole
> problem of SRF estimates but it would be a significant step forward
> in being able to use them in queries.

The inlining thing would solve that much better, at least for the cases where the function can be inlined. I'm not sure how we can improve the situation for things like looping plpgsql functions --- the function owner probably can't write down a hard estimate for those either, in most cases.

			regards, tom lane
Mark Dilger <pgsql@markdilger.com> writes:

> If we are talking about inserting the function definition into the
> query as a subquery and then letting the parser treat it as a
> subquery, then I see no reason to use either the existing function or
> view subsystems. It sounds more like we are discussing a macro
> language.

Which is pretty much what a SQL function is already. I don't see a need to invent a separate concept. To the extent that macros have different semantics than functions (eg, multiple evaluation of arguments) the differences are generally not improvements IMHO ...

			regards, tom lane
Josh Berkus wrote:

> Tom,
>
>> As for the dependency issue, one man's bug is another man's feature.
>> I think the fact that we don't track the internal dependencies of
>> functions is not all bad. We've certainly seen plenty of complaints
>> about how you can't easily change tables that a view is depending on
>> because the view dependencies block it...
>
> I'd agree with this. I write about 150,000 lines of function code a
> year, and if I had to rebuild all of the cascading functions every
> time I change a table the way I have to with views, it would probably
> add 20% to my overall application development time.
>
> BTW, the other thing that we're still TODOing on SRFs (as far as I
> know) is finding ways to change the row estimate for an SRF. It's
> still a flat 1000 in the code, which can cause a lot of bad query
> plans. I proposed a year ago that, as a first step, we allow the
> function owner to assign a static estimate variable to the function
> (e.g. "average rows returned = 5"). This doesn't solve the whole
> problem of SRF estimates but it would be a significant step forward
> in being able to use them in queries.

This would only seem to work for trivial functions. Most functions that I write are themselves dependent on underlying tables, and without any idea how many rows are in the tables, and without any idea of the statistical distribution of those rows, I can't really say anything like "average rows returned = 5".

What I have wanted for some time is a function pairing system. For each set-returning function F() I create, I would have the option of creating a statistics function S() which returns a single integer representing the guess of how many rows will be returned. S() would be called by the planner, and the return value of S() would be used to decide the plan. S() would need access to the table statistics information.
I imagine that the system would want to prevent S() from running queries, and only allow it to call certain defined table statistics functions and some internal math functions, thereby avoiding any infinite recursion in the planner. (If S() ran any queries, those queries would go yet again to the planner, and on down the infinite recursion you might go.)

Of course, some (possibly most) people could choose not to write an S() for their F(), and the default of 1000 rows would continue to be used. As such, this new extension to the system would be backwards compatible with functions which don't have an S() defined.

Thoughts?
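A hypothetical surface syntax for such a pairing (purely illustrative; none of this exists, and the table and function names are invented) might look like:

```sql
-- F(): the set-returning function itself (orders is a hypothetical table).
CREATE FUNCTION matching_orders(int) RETURNS SETOF orders AS
    'SELECT * FROM orders WHERE customer_id = $1' LANGUAGE sql STABLE;

-- S(): a planner-time estimator returning an expected row count.
-- A real S() would consult the table statistics functions; the
-- constant here is only a placeholder.
CREATE FUNCTION matching_orders_est(int) RETURNS integer AS
    'SELECT 5' LANGUAGE sql;

-- Hypothetical pairing command (does not exist):
-- ALTER FUNCTION matching_orders(int) ESTIMATOR matching_orders_est;
```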
Mark,

> This would only seem to work for trivial functions. Most functions that
> I write are themselves dependent on underlying tables, and without any
> idea how many rows are in the tables, and without any idea of the
> statistical distribution of those rows, I can't really say anything like
> "average rows returned = 5".
>
> What I have wanted for some time is a function pairing system. For each
> set returning function F() I create, I would have the option of creating
> a statistics function S() which returns a single integer which
> represents the guess of how many rows will be returned. S() would be
> called by the planner, and the return value of S() would be used to
> decide the plan. S() would need access to the table statistics
> information. I imagine that the system would want to prevent S() from
> running queries, and only allow it to call certain defined table
> statistics functions and some internal math functions, thereby avoiding
> any infinite recursion in the planner. (If S() ran any queries, those
> queries would go yet again to the planner, and on down the infinite
> recursion you might go.)
>
> Of course, some (possibly most) people could choose not to write an S()
> for their F(), and the default of 1000 rows would continue to be used.
> As such, this new extension to the system would be backwards compatible
> to functions which don't have an S() defined.

I think this is a fine idea, and I think I endorsed it the first time. However, even a static "function returns #" would be better than what we have now, and I think the S() method could take quite a bit of engineering to work out (for example, what if F() is being called in a JOIN or correlated subquery?). So I'm worried that shooting for the S() idea only could result in us not doing *anything* for several more versions.

What I'd like to do is implement the constant method for 8.2, and work on doing the S() method later on. Does that make sense?
--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Tom Lane wrote:
> Mark Dilger <pgsql@markdilger.com> writes:
>
>> If we are talking about inserting the function definition into the
>> query as a subquery and then letting the parser treat it as a
>> subquery, then I see no reason to use either the existing function or
>> view subsystems. It sounds more like we are discussing a macro
>> language.
>
> Which is pretty much what a SQL function is already. I don't see a need
> to invent a separate concept. To the extent that macros have different
> semantics than functions (eg, multiple evaluation of arguments) the
> differences are generally not improvements IMHO ...
>
> 			regards, tom lane

I have numerous times run EXPLAIN ANALYZE on my queries with SQL functions embedded and gotten different (far worse) results than if I manually inline the function following the macro expansion idea above. That has led me to wish that postgres would inline it for me. That doesn't prove that the macro idea is needed; it might be that the SQL function system needs more work. (In fact, I haven't done this since 8.0.3, so I'm not sure that 8.1 even does a bad job anymore.)
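To illustrate the kind of manual inlining I mean, using the sales_figures example from earlier in the thread (table and column names are made up for the example):

    -- Wrapped in a SQL function, the planner treats the call as an
    -- opaque Function Scan and joins against its output:
    CREATE FUNCTION sales_figures(start_date date, end_date date)
    RETURNS SETOF sales AS $$
        SELECT * FROM sales WHERE purchase_date BETWEEN $1 AND $2;
    $$ LANGUAGE SQL STABLE;

    SELECT *
    FROM sales_figures('2005-08-22', '2005-09-14') sf
    JOIN customers c ON (sf.customer_id = c.customer_id);

    -- Manually inlined as a subquery, the planner can push conditions
    -- around and choose a join strategy using the tables' statistics:
    SELECT *
    FROM (SELECT * FROM sales
          WHERE purchase_date BETWEEN '2005-08-22' AND '2005-09-14') sf
    JOIN customers c ON (sf.customer_id = c.customer_id);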
Josh Berkus wrote:
> Mark,
>
>> This would only seem to work for trivial functions. Most functions that
>> I write are themselves dependent on underlying tables, and without any
>> idea how many rows are in the tables, and without any idea of the
>> statistical distribution of those rows, I can't really say anything like
>> "average rows returned = 5".
>>
>> What I have wanted for some time is a function pairing system. For each
>> set returning function F() I create, I would have the option of creating
>> a statistics function S() which returns a single integer which
>> represents the guess of how many rows will be returned. S() would be
>> called by the planner, and the return value of S() would be used to
>> decide the plan. S() would need access to the table statistics
>> information. I imagine that the system would want to prevent S() from
>> running queries, and only allow it to call certain defined table
>> statistics functions and some internal math functions, thereby avoiding
>> any infinite recursion in the planner. (If S() ran any queries, those
>> queries would go yet again to the planner, and on down the infinite
>> recursion you might go.)
>>
>> Of course, some (possibly most) people could choose not to write an S()
>> for their F(), and the default of 1000 rows would continue to be used.
>> As such, this new extension to the system would be backwards compatible
>> to functions which don't have an S() defined.
>
> I think this is a fine idea, and I think I endorsed it the first time.
> However, even a static "function returns #" would be better than what we
> have now, and I think the S() method could take quite a bit of engineering
> to work out (for example, what if F() is being called in a JOIN or
> correlated subquery?). So I'm worried that shooting for the S() idea only
> could result in us not doing *anything* for several more versions.
>
> What I'd like to do is implement the constant method for 8.2, and work on
> doing the S() method later on. Does that make sense?

I have no objections to implementing the constant method sooner than the full version. It might be useful to implement it as a subsyntax of the full version, though, in preparation for later expansion. For instance, if there were a plstats language, you could limit its parser to just functions like:

    CREATE FUNCTION S() RETURNS INTEGER PROVIDES FOR F() AS $$
        RETURN 5;
    $$ LANGUAGE PLSTATS;

Then the language could be expanded later to allow calls to the table statistics functions.
Josh Berkus <josh@agliodbs.com> writes:
> What I'd like to do is implement the constant method for 8.2, and work on
> doing the S() method later on. Does that make sense?

I'm not thrilled with putting in a stopgap that we will have to support forever. The constant method is *clearly* inadequate for many (probably most, IMHO) practical cases. Where do you see it being of use?

W.r.t. the estimator function method, the concern about recursion seems misplaced. Such an estimator presumably wouldn't invoke the associated function itself. I'm more concerned about coming up with a usable API for such things. Our existing mechanisms for estimating operator selectivities require access to internal planner data structures, which makes it pretty much impossible to write them in anything but C. We'd need something cleaner to have a feature I'd want to export for general use.

			regards, tom lane
Tom,

>> What I'd like to do is implement the constant method for 8.2, and work
>> on doing the S() method later on. Does that make sense?
>
> I'm not thrilled with putting in a stopgap that we will have to support
> forever. The constant method is *clearly* inadequate for many (probably
> most IMHO) practical cases. Where do you see it being of use?

Well, mostly for the real-world use cases where I've run into SRF estimate issues, which have mostly been SRFs which return one row.

> W.R.T. the estimator function method, the concern about recursion seems
> misplaced. Such an estimator presumably wouldn't invoke the associated
> function itself.

No, but if you're calling the S() estimator in the context of performing a join, what do you supply for parameters?

> I'm more concerned about coming up with a usable API
> for such things. Our existing mechanisms for estimating operator
> selectivities require access to internal planner data structures, which
> makes it pretty much impossible to write them in anything but C. We'd
> need something cleaner to have a feature I'd want to export for general
> use.

Yes -- we need to support the simplest case, which is functions that return either (a) a fixed number of rows, or (b) a fixed multiple of the number of rows passed to the function. These simple cases should be easy to build. For more complex estimation, I personally don't see a problem with forcing people to hack it in C.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
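For case (a), the constant method might look something like this. The ROWS clause here is hypothetical syntax for the proposal, not anything that exists today:

    -- Hypothetical "constant method": let the function owner declare a
    -- static row estimate for the planner to use instead of the flat 1000.
    CREATE FUNCTION top_regions() RETURNS SETOF regions AS $$
        SELECT * FROM regions ORDER BY total_sales DESC LIMIT 5;
    $$ LANGUAGE SQL
      ROWS 5;   -- hypothetical clause: "this function returns about 5 rows"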
Josh Berkus <josh@agliodbs.com> writes:
>> I'm not thrilled with putting in a stopgap that we will have to support
>> forever. The constant method is *clearly* inadequate for many (probably
>> most IMHO) practical cases. Where do you see it being of use?
>
> Well, mostly for the real-world use cases where I've run into SRF estimate
> issues, which have mostly been SRFs which return one row.

Well, if they're certain to return one row, you can just declare them as not SETOF, no? Since 8.1 we do get that case right:

    regression=# explain select * from cos(0);
                             QUERY PLAN
    --------------------------------------------------------
     Function Scan on cos  (cost=0.00..0.01 rows=1 width=8)
    (1 row)

> No, but if you're calling the S() estimator in the context of performing a
> join, what do you supply for parameters?

Exactly my point about the API problem. I'm not sure that joins matter, but the function parameters sure do, and those might not be simple constants.

			regards, tom lane
Josh Berkus wrote:
> Tom,
>
>>> What I'd like to do is implement the constant method for 8.2, and work
>>> on doing the S() method later on. Does that make sense?
>>
>> I'm not thrilled with putting in a stopgap that we will have to support
>> forever. The constant method is *clearly* inadequate for many (probably
>> most IMHO) practical cases. Where do you see it being of use?
>
> Well, mostly for the real-world use cases where I've run into SRF estimate
> issues, which have mostly been SRFs which return one row.
>
>> W.R.T. the estimator function method, the concern about recursion seems
>> misplaced. Such an estimator presumably wouldn't invoke the associated
>> function itself.
>
> No, but if you're calling the S() estimator in the context of performing a
> join, what do you supply for parameters?

I've been thinking about this more, and now I don't see why this is an issue. When the planner estimates how many rows will be returned from a subquery that is being used within a join, it can't know which "parameters" to use either. (Parameters being whatever conditions the subquery will pivot upon which are the result of some other part of the execution of the full query.) So it seems to me that function S() is at no more of a disadvantage than the planner.

If I defined a function S(a integer, b integer) which provides an estimate for the function F(a integer, b integer), then S(null, null) could be called when the planner can't know what a and b are. S could then still make use of the table statistics to provide some sort of estimate. Of course, this would mean that functions S() cannot be defined strict.

>> I'm more concerned about coming up with a usable API
>> for such things. Our existing mechanisms for estimating operator
>> selectivities require access to internal planner data structures, which
>> makes it pretty much impossible to write them in anything but C. We'd
>> need something cleaner to have a feature I'd want to export for general
>> use.
>
> Yes -- we need to support the simplest case, which is functions that return
> either (a) a fixed number of rows, or (b) a fixed multiple of the number
> of rows passed to the function. These simple cases should be easy to
> build. For more complex estimation, I personally don't see a problem with
> forcing people to hack it in C.

Could we provide table statistics access functions in whatever higher-level language S() is written in, or is there something fundamentally squirrelly about the statistics that would make this impossible? Also, since we haven't nailed down a language for S(), if we allowed any of sql, plpgsql, plperl, plpython, etc., then we would need access methods for each, which would place a burden on all PLs, right? That argument isn't strong enough to make me lean either way; it's just an observation.
Mark Dilger wrote:
> I've been thinking about this more, and now I don't see why this is an
> issue. When the planner estimates how many rows will be returned from a
> subquery that is being used within a join, it can't know which
> "parameters" to use either. (Parameters being whatever conditions the
> subquery will pivot upon which are the result of some other part of the
> execution of the full query.) So it seems to me that function S() is at
> no more of a disadvantage than the planner.
>
> If I defined a function S(a integer, b integer) which provides an
> estimate for the function F(a integer, b integer), then S(null, null)
> could be called when the planner can't know what a and b are. S could
> then still make use of the table statistics to provide some sort of
> estimate. Of course, this would mean that functions S() cannot be
> defined strict.

Ok, null probably isn't a good value. F(null, null) could be the call being made, so S(null, null) would mean "F is being passed nulls" rather than "we don't know what F's arguments are yet". The returned estimate might be quite different for these two cases. You could have:

    F(a integer, b integer)
    S(a integer, a_is_known boolean, b integer, b_is_known boolean)

But I'm not fond of the verbosity of doubling the argument list. Since some arguments might be known while others still are not, I don't think a single boolean all_arguments_are_known argument is sufficient.
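As a sketch of how the doubled signature might read in practice (hypothetical feature; the pairing with F() and the branch values below are purely illustrative):

    -- Hypothetical estimator paired with F(a integer, b integer).
    -- Each argument is doubled with an *_is_known flag so the planner can
    -- distinguish "argument is NULL" from "argument not yet known".
    CREATE FUNCTION S(a integer, a_is_known boolean,
                      b integer, b_is_known boolean)
    RETURNS integer AS $$
        SELECT CASE
            WHEN a_is_known AND b_is_known THEN 5    -- exact inputs: tight guess
            WHEN a_is_known OR  b_is_known THEN 50   -- partial information
            ELSE 1000                                -- fall back to the default
        END;
    $$ LANGUAGE SQL;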