Re: [PERFORM] Functionscan estimates

From: Josh Berkus

(HACKERS: Please read this entire thread at
http://archives.postgresql.org/pgsql-performance/2005-04/msg00179.php
Sorry for the cross-post.)

> > The larger point is that writing an estimator for an SRF is frequently a
> > task about as difficult as writing the SRF itself
>
> True, although I think this doesn't necessarily kill the idea. If
> writing an estimator for a given SRF is too difficult, the user is no
> worse off than they are today. Hopefully there would be a fairly large
> class of SRFs for which writing an estimator would be relatively simple,
> and result in improved planner behavior.

For that matter, even supplying a constant estimate would be a vast
improvement over current functionality.  I would suggest, in fact, that we
allow either a constant number or an estimator function in that column.
Among other things, this would let us implement the constant number right
now and add the estimator function later, in case we can do one but not the
other for 8.1.
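
For instance, the constant case might look like this (totally hypothetical
syntax, and the number is made up):

ALTER FUNCTION some_func WITH ESTIMATOR 500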

To be more sophisticated about the estimator function, it could take a subset
of the main function's arguments, based on $1 numbering.  For example:
CREATE FUNCTION some_func ( INT, TEXT, TEXT, INT, INT ) ...
ALTER FUNCTION some_func WITH ESTIMATOR some_func_est( $4, $5 )

This would make it easier to write estimators that work for several
functions.   Estimators would be a special type of function which would take
any params and RETURN ESTIMATOR, a type implicitly castable from some
general numeric type (like INT or FLOAT).
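
As a rough illustration, such an estimator could be written today as an
ordinary SQL function returning FLOAT, since the ESTIMATOR pseudo-type is
only a proposal (the threshold and row counts below are invented):

CREATE FUNCTION some_func_est ( INT, INT ) RETURNS FLOAT AS '
    SELECT (CASE WHEN $2 - $1 < 100 THEN 100.0 ELSE 10000.0 END)::FLOAT;
' LANGUAGE SQL IMMUTABLE;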

> > I don't foresee a whole lot of use of an estimator hook designed as
> > proposed here.  In particular, if the API is such that we can only
> > use the estimator when all the function arguments are plan-time
> > constants, it's not going to be very helpful.

Actually, 95% of the time I use SRFs, they take constants and not row
references.  And I use a lot of SRFs.

>
> Yes :( One approach might be to break the function's domain into pieces
> and have the estimator function calculate the estimated result set size
> for each piece. So, given a trivial function like:
>
> foo(int):
>     if $1 < 10 then produce 100 rows
>     else produce 10000 rows
>
> If the planner has encoded the distribution of input tuples to the
> function as a histogram, it could invoke the SRF's estimator function
> for the boundary values of each histogram bucket, and use that to get an
> idea of the function's likely result set size at runtime.
>
> And yes, the idea as sketched is totally unworkable :) For one thing,
> the difficulty of doing this grows rapidly as the number of arguments to
> the function increases. But perhaps there is some variant of this idea
> that might work...
>
> Another thought is that the estimator could provide information on the
> cost of evaluating the function, the number of tuples produced by the
> function, and even the distribution of those tuples.
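
To make that sketch concrete: the per-value half of it is writable today,
while the histogram loop would have to live in the planner.  A runnable
version of foo's estimator (row counts taken from the quoted example):

CREATE FUNCTION foo_est ( INT ) RETURNS FLOAT AS '
    SELECT (CASE WHEN $1 < 10 THEN 100.0 ELSE 10000.0 END)::FLOAT;
' LANGUAGE SQL IMMUTABLE;

The planner would then, hypothetically, call foo_est() once per histogram
bucket boundary and weight the results by bucket frequency.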

Another possibility would be to support default values for all estimator
function parameters, and to pass DEFAULT when the function is called in row
context, leaving it up to the estimator writer to supply median values for
those cases.  Or to simply take the "first" values and use those.
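
A sketch of the DEFAULT variant (hypothetical name and numbers; the
planner-side calling convention does not exist):

CREATE FUNCTION med_est ( INT DEFAULT 50 ) RETURNS FLOAT AS '
    SELECT (CASE WHEN $1 < 10 THEN 100.0 ELSE 10000.0 END)::FLOAT;
' LANGUAGE SQL IMMUTABLE;

In row context the planner would call med_est() with the argument omitted,
so the default (a median input value) drives the estimate.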

While none of these possibilities is ideal, they are all an improvement over
the current "flat 1000" estimate.   As I said, even the ability to set a
per-function flat constant estimate would be an improvement.

> BTW, why is this on -performance? It should be on -hackers.

'cause I spend more time reading -performance, and I started the thread.
Crossed over now.

--
Josh Berkus
Aglio Database Solutions
San Francisco