Thread: Functionscan estimates
Folks,

I'm wondering if it might be useful to be able to add an estimated
selectivity to a function definition for purposes of query estimation.
Currently function scans automatically return a flat default of 1000
estimated rows. It seems like the DBA ought to be able to ALTER FUNCTION
and give it a row estimate for planning purposes.

Thoughts?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
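For concreteness, here is the current behavior and a sketch of the
proposed knob. The EXPLAIN output is abbreviated, and the ROWS clause is
hypothetical syntax for this proposal, not something the server accepts
today:

    -- The planner assumes a flat 1000 rows for any function scan:
    EXPLAIN SELECT * FROM generate_series(1, 5) AS g;
    --  Function Scan on generate_series g  (cost=... rows=1000 width=4)

    -- Hypothetical: let the DBA override the default for one function.
    ALTER FUNCTION generate_series(integer, integer) ROWS 5;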
On Fri, Apr 08, 2005 at 03:15:50PM -0700, Josh Berkus wrote:
>
> I'm wondering if it might be useful to be able to add an estimated
> selectivity to a function definition for purposes of query estimation.
> Currently function scans automatically return a flat default of 1000
> estimated rows. It seems like the DBA ought to be able to ALTER
> FUNCTION and give it a row estimate for planning purposes.

About a month ago I mentioned that I'd find that useful. In a followup,
Christopher Kings-Lynne brought up the idea of a GUC variable that could
give hints about the expected row count.

http://archives.postgresql.org/pgsql-hackers/2005-03/msg00146.php
http://archives.postgresql.org/pgsql-hackers/2005-03/msg00153.php

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
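The GUC approach would presumably look something like this (the variable
name is made up for illustration, and my_srf and big_table are
placeholders); note that a single setting has to cover every SRF in the
query:

    -- Hypothetical GUC: hint the expected row count before running
    -- the query, then clear it.
    SET srf_estimated_rows = 50;
    SELECT * FROM my_srf() AS s JOIN big_table b ON b.id = s.id;
    RESET srf_estimated_rows;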
On Fri, Apr 08, 2005 at 04:38:20PM -0600, Michael Fuhr wrote:
> On Fri, Apr 08, 2005 at 03:15:50PM -0700, Josh Berkus wrote:
> >
> > I'm wondering if it might be useful to be able to add an estimated
> > selectivity to a function definition for purposes of query
> > estimation. Currently function scans automatically return a flat
> > default of 1000 estimated rows. It seems like the DBA ought to be
> > able to ALTER FUNCTION and give it a row estimate for planning
> > purposes.
>
> About a month ago I mentioned that I'd find that useful. In a
> followup, Christopher Kings-Lynne brought up the idea of a GUC
> variable that could give hints about the expected row count.

That seems pretty limited ... what happens if the query contains more
than one SRF?

Maybe we could issue some sort of special call to the function (say,
with some boolean in the call info struct) on which it returns planning
data; the planner could then call the function itself. The hard part
would be figuring out how to do it without breaking backwards
compatibility with functions that don't know how to handle that. (And
how to do it in plpgsql.)

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"The main human characteristic is foolishness" (Augusto Monterroso)
Alvaro, Michael,

> > About a month ago I mentioned that I'd find that useful. In a
> > followup, Christopher Kings-Lynne brought up the idea of a GUC
> > variable that could give hints about the expected row count.
>
> That seems pretty limited ... what happens if the query contains more
> than one SRF?

Yeah, I'd see that as a pretty bad idea too. I don't want to tell the
planner how many rows I expect "all functions" to return; I want to tell
it how many *one particular* function will return.

> Maybe we could issue some sort of special call to the function (say,
> with some boolean in the call info struct) on which it returns
> planning data; the planner could then call the function itself. The
> hard part would be figuring out how to do it without breaking
> backwards compatibility with functions that don't know how to handle
> that. (And how to do it in plpgsql.)

Or in pl/perl, or pl/python, or plsh ... doesn't sound feasible.

My solution would be a lot simpler, since we could simply populate
pg_proc.proestrows with "1000" by default if not changed by the DBA. In
an even better world, we could tie it to a table, saying that, for
example, proestrows = my_table*0.02.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
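Spelled out, the table-tied form might read like this (the proestrows
column and this ALTER syntax are hypothetical, and top_customers and
my_table are placeholders):

    -- Hypothetical: estimate the SRF's output as 2% of my_table's
    -- current row count, as tracked by the planner's table statistics.
    ALTER FUNCTION top_customers() SET ESTIMATED ROWS (my_table * 0.02);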
On Fri, Apr 08, 2005 at 04:04:27PM -0700, Josh Berkus wrote:
> My solution would be a lot simpler, since we could simply populate
> pg_proc.proestrows with "1000" by default if not changed by the DBA.
> In an even better world, we could tie it to a table, saying that, for
> example, proestrows = my_table*0.02.

The problem with that approach is that it can't differ depending on the
arguments to the function, so it too seems limited to me. Ideally an
estimator would be able to peek at other table statistics and do some
computation with them, just like other nodes are able to.

Another idea would be to have an estimator function
(pg_proc.proestimator) for each regular function. The estimator would be
a very cheap function to be called with the same arguments, and it would
return the estimated number of tuples the other function would return.
The default estimator could be "return 1000".

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"A wizard is never late, Frodo Baggins, nor is he early. He arrives
precisely when he means to." (Gandalf, in LoTR FoTR)
Not too many releases ago, there were several columns in pg_proc that
were intended to support estimation of the runtime cost and number of
result rows of set-returning functions. I believe in fact that these
were the remains of Joe Hellerstein's thesis on expensive-function
evaluation, and are exactly what he was talking about here:

http://archives.postgresql.org/pgsql-hackers/2002-06/msg00085.php

But with all due respect to Joe, I think the reason that stuff got
trimmed is that it didn't work very well. In most cases it's *hard* to
write an estimator for a SRF. Let's see you produce one for dblink()
for instance ...

			regards, tom lane
> My solution would be a lot simpler, since we could simply populate
> pg_proc.proestrows with "1000" by default if not changed by the DBA.
> In an even better world, we could tie it to a table, saying that, for
> example, proestrows = my_table*0.02.

What if the estimated row count is a function of a parameter? Say a
function takes as a parameter:

- a number to use in a LIMIT;
- an array from which to generate a set of values (an array -> set
  returning function).

In all those cases a fixed number is no use. I'd suggest two solutions:

- The ideal solution, which is impossible to do: the function tells the
  planner about its stats by looking at its parameters.

- A solution that would be possible to do: pg_proc.proestrows is... the
  name of another function, defined by the user, which takes the exact
  same parameters as the set-returning function we're talking about,
  and which returns estimates.

For instance, in pseudo-SQL (icount() is from contrib/intarray):

    CREATE FUNCTION int_array_srf( INTEGER[] ) RETURNS SETOF INTEGER
    LANGUAGE plpgsql AS $$
    BEGIN
        FOR _i IN 1..icount($1) LOOP
            RETURN NEXT $1[_i];
        END LOOP;
        RETURN;
    END;
    $$;

In the two cases above, this would give:

    CREATE FUNCTION int_array_srf_estimator( INTEGER[] ) RETURNS INTEGER
    LANGUAGE plpgsql AS $$
    BEGIN
        RETURN icount( $1 );
    END;
    $$;

    -- hypothetical syntax for attaching the estimator:
    ALTER FUNCTION int_array_srf SET ESTIMATOR int_array_srf_estimator;

Another interesting case would be the famous "top N by category" case,
where we use an SRF to emulate an index skip scan. Say we have a table
categories and a table users, each user having columns category_id and
score, and we want the N users with the best score in each category:

    CREATE FUNCTION top_n_by_category( INTEGER ) RETURNS SETOF users
    LANGUAGE plpgsql AS $$
    DECLARE
        _cat_id INTEGER;
        _n ALIAS FOR $1;
        _user users%ROWTYPE;
    BEGIN
        FOR _cat_id IN SELECT category_id FROM categories LOOP
            FOR _user IN
                SELECT * FROM users
                WHERE category_id = _cat_id
                ORDER BY score DESC
                LIMIT _n
            LOOP
                RETURN NEXT _user;
            END LOOP;
        END LOOP;
        RETURN;
    END;
    $$;

    CREATE FUNCTION top_n_by_category_estimator( INTEGER ) RETURNS INTEGER
    LANGUAGE plpgsql AS $$
    BEGIN
        -- N rows per category, times the planner's row estimate
        -- for the categories table:
        RETURN $1 * (SELECT reltuples::integer FROM pg_class
                     WHERE relname = 'categories');
    END;
    $$;

    -- hypothetical syntax again:
    ALTER FUNCTION top_n_by_category SET ESTIMATOR top_n_by_category_estimator;

Got it? The array SRF case would be extremely useful, as this type of
function is generally used to join against other tables, and having
estimates is useful for that. The top-N case would be useless if we're
just returning the rows from the function directly, but would be useful
if we join them against other tables.

This sounds pretty simple, powerful, and versatile.

Additionally, in some cases (most notably the array case) it's easy to
estimate the statistics on the returned values, because they're all in
the array already, so the mechanism could be extended to have a way of
returning a pseudo pg_stats for a set-returning function.

For instance, say you have an SRF which returns N random rows from a
table. It could have an estimator which would return a rowcount of N,
and a statistics estimator which would return the stats rows for the
source table, appropriately modified. This sounds harder to do.

What do you think?
> But with all due respect to Joe, I think the reason that stuff got
> trimmed is that it didn't work very well. In most cases it's
> *hard* to write an estimator for a SRF. Let's see you produce
> one for dblink() for instance ...

Good one... Well, in some cases it'll be impossible, but suppose I have
a function get_id_for_something() which just grabs an ID using dblink:
then I know it returns one row, and pg would be interested in that
information too!
On Sat, Apr 09, 2005 at 12:00:56AM -0400, Tom Lane wrote:
> Not too many releases ago, there were several columns in pg_proc that
> were intended to support estimation of the runtime cost and number of
> result rows of set-returning functions. I believe in fact that these
> were the remains of Joe Hellerstein's thesis on expensive-function
> evaluation, and are exactly what he was talking about here:
> http://archives.postgresql.org/pgsql-hackers/2002-06/msg00085.php
>
> But with all due respect to Joe, I think the reason that stuff got
> trimmed is that it didn't work very well. In most cases it's
> *hard* to write an estimator for a SRF. Let's see you produce
> one for dblink() for instance ...

Actually, if the remote database supported a way to get a row estimate
from the query passed to dblink, it would be trivial, since you'd just
pass that back. In fact, having such a function
(estimate_rows_for_sql(text)) would probably be very useful to functions
that wanted to support returning a row estimate.

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
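On a server new enough to have EXPLAIN (FORMAT JSON), such a helper is a
few lines of plpgsql; this is only a sketch of the idea, not anything
that existed when this thread was written:

    -- Sketch: return the planner's row estimate for an arbitrary query.
    -- The query is only planned (via EXPLAIN), never executed.
    CREATE FUNCTION estimate_rows_for_sql(query text) RETURNS bigint
    LANGUAGE plpgsql AS $$
    DECLARE
        plan json;
    BEGIN
        EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
        -- The top-level "Plan Rows" is the estimate for the whole query.
        RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
    END;
    $$;

    -- Usage:
    SELECT estimate_rows_for_sql('SELECT * FROM pg_class');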
"Jim C. Nasby" <decibel@decibel.org> writes: > On Sat, Apr 09, 2005 at 12:00:56AM -0400, Tom Lane wrote: >> But with all due respect to Joe, I think the reason that stuff got >> trimmed is that it didn't work very well. In most cases it's >> *hard* to write an estimator for a SRF. Let's see you produce >> one for dblink() for instance ... > Actually, if the remote database supported a way to get a rows estimate > from the query passed to db_link, it would be trivial, since you'd just > pass that back. This assumes that (1) you have the complete query argument at the time of estimation, and (2) it's OK to contact the remote database and do an EXPLAIN at that time. Both of these seem pretty shaky assumptions. The larger point is that writing an estimator for an SRF is frequently a task about as difficult as writing the SRF itself, and sometimes much *more* difficult due to lack of information. I don't foresee a whole lot of use of an estimator hook designed as proposed here. In particular, if the API is such that we can only use the estimator when all the function arguments are plan-time constants, it's not going to be very helpful. regards, tom lane
Tom Lane wrote:
> Not too many releases ago, there were several columns in pg_proc that
> were intended to support estimation of the runtime cost and number of
> result rows of set-returning functions. I believe in fact that these
> were the remains of Joe Hellerstein's thesis on expensive-function
> evaluation

FYI, Hellerstein's thesis on xfunc optimization is available here:

ftp://ftp.cs.wisc.edu/pub/tech-reports/reports/1996/tr1304.ps.Z

There's also a paper on this subject by Hellerstein that was published
in Transactions on Database Systems:

http://www.cs.berkeley.edu/~jmh/miscpapers/todsxfunc.pdf

I haven't had a chance to digest either one yet, but it might be worth
a look.

-Neil
Tom Lane wrote:
> The larger point is that writing an estimator for an SRF is frequently
> a task about as difficult as writing the SRF itself

True, although I think this doesn't necessarily kill the idea. If
writing an estimator for a given SRF is too difficult, the user is no
worse off than they are today. Hopefully there would be a fairly large
class of SRFs for which writing an estimator would be relatively simple,
and result in improved planner behavior.

> I don't foresee a whole lot of use of an estimator hook designed as
> proposed here. In particular, if the API is such that we can only
> use the estimator when all the function arguments are plan-time
> constants, it's not going to be very helpful.

Yes :( One approach might be to break the function's domain into pieces
and have the estimator function calculate the estimated result set size
for each piece. So, given a trivial function like:

    foo(int):
        if $1 < 10 then produce 100 rows
        else produce 10000 rows

If the planner has encoded the distribution of input tuples to the
function as a histogram, it could invoke the SRF's estimator function
for the boundary values of each histogram bucket, and use that to get an
idea of the function's likely result set size at runtime.

And yes, the idea as sketched is totally unworkable :) For one thing,
the difficulty of doing this grows rapidly as the number of arguments to
the function increases. But perhaps there is some variant of this idea
that might work...

Another thought is that the estimator could provide information on the
cost of evaluating the function, the number of tuples produced by the
function, and even the distribution of those tuples.

BTW, why is this on -performance? It should be on -hackers.

-Neil
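In concrete terms the pair might look like this; the function bodies are
just a sketch, and the protocol by which the planner would call the
estimator at histogram-bucket boundaries is entirely hypothetical:

    CREATE FUNCTION foo(int) RETURNS SETOF int
    LANGUAGE plpgsql AS $$
    BEGIN
        IF $1 < 10 THEN
            RETURN QUERY SELECT generate_series(1, 100);
        ELSE
            RETURN QUERY SELECT generate_series(1, 10000);
        END IF;
    END;
    $$;

    -- The matching estimator mirrors foo's branching but is cheap:
    CREATE FUNCTION foo_estimator(int) RETURNS bigint
    LANGUAGE sql AS
    'SELECT (CASE WHEN $1 < 10 THEN 100 ELSE 10000 END)::bigint';

The planner, holding a histogram of the input column, would then
evaluate foo_estimator at each bucket boundary and weight the results by
the bucket frequencies.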
People:

(HACKERS: Please read this entire thread at
http://archives.postgresql.org/pgsql-performance/2005-04/msg00179.php
Sorry for crossing this over.)

> > The larger point is that writing an estimator for an SRF is
> > frequently a task about as difficult as writing the SRF itself
>
> True, although I think this doesn't necessarily kill the idea. If
> writing an estimator for a given SRF is too difficult, the user is no
> worse off than they are today. Hopefully there would be a fairly large
> class of SRFs for which writing an estimator would be relatively
> simple, and result in improved planner behavior.

For that matter, even supplying an estimate constant would be a vast
improvement over current functionality. I would suggest, in fact, that
we allow the use of either a constant number or an estimator function in
that column. Among other things, this would allow implementing the
constant number right now and the use of an estimating function later,
in case we can do the one but not the other for 8.1.

To be more sophisticated about the estimator function, it could take a
subset of the main function's arguments, based on $1 numbering, for
example:

    CREATE FUNCTION some_func ( INT, TEXT, TEXT, INT, INT ) ...

    ALTER FUNCTION some_func WITH ESTIMATOR some_func_est( $4, $5 );

This would make it easier to write estimators that work for several
functions. Estimators would be a special type of function which would
take any params and RETURN ESTIMATOR, which would be implicitly castable
from some general numeric type (like INT or FLOAT).

> > I don't foresee a whole lot of use of an estimator hook designed as
> > proposed here. In particular, if the API is such that we can only
> > use the estimator when all the function arguments are plan-time
> > constants, it's not going to be very helpful.

Actually, 95% of the time I use SRFs they are accepting constants and
not row references. And I use a lot of SRFs.

> Yes :( One approach might be to break the function's domain into
> pieces and have the estimator function calculate the estimated result
> set size for each piece. So, given a trivial function like:
>
>     foo(int):
>         if $1 < 10 then produce 100 rows
>         else produce 10000 rows
>
> If the planner has encoded the distribution of input tuples to the
> function as a histogram, it could invoke the SRF's estimator function
> for the boundary values of each histogram bucket, and use that to get
> an idea of the function's likely result set size at runtime.
>
> And yes, the idea as sketched is totally unworkable :) For one thing,
> the difficulty of doing this grows rapidly as the number of arguments
> to the function increases. But perhaps there is some variant of this
> idea that might work...
>
> Another thought is that the estimator could provide information on the
> cost of evaluating the function, the number of tuples produced by the
> function, and even the distribution of those tuples.

Another possibility would be to support default values for all estimator
functions, and have functions called in row context be passed DEFAULT,
thus leaving it up to the estimator writer to supply median values for
context cases. Or to simply take the "first" values and use those.

While none of these possibilities is ideal, they are an improvement over
the current flat 1000 estimate. As I said, even the ability to set a
per-function flat constant estimate would be an improvement.

> BTW, why is this on -performance? It should be on -hackers.
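Side by side, the two proposed forms would read something like this (all
of the syntax, including the ESTIMATOR pseudo-type, is hypothetical):

    -- Form 1: a flat per-function constant, implementable right away:
    ALTER FUNCTION some_func WITH ESTIMATOR 40;

    -- Form 2: a shared estimator fed only the arguments that drive
    -- the row count:
    CREATE FUNCTION some_func_est ( INT, INT ) RETURNS ESTIMATOR
    LANGUAGE sql AS 'SELECT ($1 * $2)::ESTIMATOR';

    ALTER FUNCTION some_func WITH ESTIMATOR some_func_est( $4, $5 );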
'Cause I spend more time reading -performance, and I started the thread.
Crossed over now.

--
Josh Berkus
Aglio Database Solutions
San Francisco