Thread: Function execution costs 'n all that

Function execution costs 'n all that

From: Tom Lane
So I've been working on the scheme I suggested a few days ago of
representing "equivalence classes" of variables explicitly, and avoiding
the current ad-hocery of generating and then removing redundant clauses
in favor of generating only the ones we want in the first place.  Any
clause that looks like an equijoin gets sent to the EquivalenceClass
machinery by distribute_qual_to_rels, and not put into the
restrictlist/joinlist data structure at all.  Then we make passes over
the EquivalenceClass lists at appropriate times to generate the clauses
we want.  This is turning over well enough now to pass the regression
tests, but I noticed that one query in opr_sanity got a whole lot slower
than before.  The query is

SELECT p1.opcname, p1.opcfamily
FROM pg_opclass AS p1
WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS p2
                 WHERE p2.amopfamily = p1.opcfamily
                   AND binary_coercible(p1.opcintype, p2.amoplefttype));

and investigation showed that the plan changed from (8.2 and before)
 Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
   Filter: (NOT (subplan))
   SubPlan
     ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
           Filter: ((amopfamily = $0) AND binary_coercible($1, amoplefttype))

to
 Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
   Filter: (NOT (subplan))
   SubPlan
     ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
           Filter: (binary_coercible($1, amoplefttype) AND (amopfamily = $0))

thus resulting in many more calls of binary_coercible() which is a
pretty expensive function.  This is not too surprising: the clause
p2.amopfamily = p1.opcfamily got diverted through the EquivalenceClass
code for just long enough to end up behind the other one in the table's
restrictlist :-(

In short, this approach results in a whole lot less stability in the
order in which WHERE clauses are evaluated.  That might be a killer
objection to the whole thing, but on the other hand we've never made
any strong promises about WHERE evaluation order.

Instead, I'm thinking it might be time to re-introduce some notion of
function execution cost into the system, and make use of that info to
sort WHERE clauses into a reasonable execution order.  This example
would be fixed with even a very stupid rule-of-thumb about SQL functions
being more expensive than C functions, but if we're going to go to the
trouble it seems like it'd be a good idea to provide a way to label
user-defined functions with execution costs.

Would a simple constant value be workable, or do we need some more
complex model (and if so what)?

Comments?
        regards, tom lane


Re: Function execution costs 'n all that

From: Heikki Linnakangas
Tom Lane wrote:
> Instead, I'm thinking it might be time to re-introduce some notion of
> function execution cost into the system, and make use of that info to
> sort WHERE clauses into a reasonable execution order.  

That sounds like a straightforward idea.

> This example
> would be fixed with even a very stupid rule-of-thumb about SQL functions
> being more expensive than C functions, but if we're going to go to the
> trouble it seems like it'd be a good idea to provide a way to label
> user-defined functions with execution costs.

Agreed.

> Would a simple constant value be workable, or do we need some more
> complex model (and if so what)?

A simple constant would probably be enough. If we want anything fancier 
than that, it should be up to the author of the function to define the 
cost model as well. I'm envisioning that you could attach a custom cost 
function to a user-defined function which would return the estimated CPU 
cost. And # of rows returned for a set-returning function.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Function execution costs 'n all that

From: Tom Lane
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> Would a simple constant value be workable, or do we need some more
>> complex model (and if so what)?

> A simple constant would probably be enough. If we want anything fancier 
> than that, it should be up to the author of the function to define the 
> cost model as well. I'm envisioning that you could attach a custom cost 
> function to a user-defined function which would return the estimated CPU 
> cost. And # of rows returned for a set-returning function.

But what will such an estimation function work on?  In general the
planner does not have the value(s) that will be passed to the actual
function at runtime.  It might have expressions or estimates but
those data structures are certainly not something we could pass to
non-C-coded functions.  Are we willing to restrict these functions
to being coded in C, as selectivity estimation functions are?

If we go this route it seems like we'll need four more columns in
pg_proc (cost estimation function OID, rowcount estimation function OID,
fallback cost constant, fallback rowcount constant).
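
Sketched as catalog columns (names invented here purely for illustration),
that would amount to something like:

    procostfunc   regproc,   -- cost estimation function, or 0 if none
    prorowsfunc   regproc,   -- rowcount estimation function, or 0 if none
    procost       float4,    -- fallback cost constant
    prorows       float4     -- fallback rowcount constant (set-returning functions only)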

BTW, I'm thinking that a "cost constant" probably ought to be measured
in units of cpu_operator_cost.  The default for built-in functions would
thus be 1, at least till such time as someone wants to refine the
estimates.  We'd probably want the default for PL and SQL functions to
be 10 or 100 or so.
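
To put numbers on that (a back-of-the-envelope illustration; 0.0025 is just
the stock cpu_operator_cost default):

    per-call charge          = cost constant * cpu_operator_cost
    built-in C function:       1   * 0.0025  = 0.0025
    default PL/SQL function:   100 * 0.0025  = 0.25

so a qual applying the PL function to 10,000 rows would be charged roughly
2500 cost units versus 25 for the C-function case -- plenty of spread to
order WHERE clauses sensibly.
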
        regards, tom lane


Re: Function execution costs 'n all that

From: Heikki Linnakangas
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> A simple constant would probably be enough. If we want anything fancier 
>> than that, it should be up to the author of the function to define the 
>> cost model as well. I'm envisioning that you could attach a custom cost 
>> function to a user-defined function which would return the estimated CPU 
>> cost. And # of rows returned for a set-returning function.
> 
> But what will such an estimation function work on?  In general the
> planner does not have the value(s) that will be passed to the actual
> function at runtime.  It might have expressions or estimates but
> those data structures are certainly not something we could pass to
> non-C-coded functions.   Are we willing to restrict these functions
> to being coded in C, as selectivity estimation functions are?

Yeah, I don't know. If the planner knows the actual value, that would 
certainly be the easiest for the UDF writer to work with. Anything more 
than that gets really complicated.

> If we go this route it seems like we'll need four more columns in
> pg_proc (cost estimation function OID, rowcount estimation function OID,
> fallback cost constant, fallback rowcount constant).

What would the fallbacks be for?

> BTW, I'm thinking that a "cost constant" probably ought to be measured
> in units of cpu_operator_cost.  The default for built-in functions would
> thus be 1, at least till such time as someone wants to refine the
> estimates.  We'd probably want the default for PL and SQL functions to
> be 10 or 100 or so.

Agreed.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Function execution costs 'n all that

From: Tom Lane
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> If we go this route it seems like we'll need four more columns in
>> pg_proc (cost estimation function OID, rowcount estimation function OID,
>> fallback cost constant, fallback rowcount constant).

> What would the fallbacks be for?

By "fallback" I meant "this is what to use if no estimation function is
provided".

I'm envisioning that the CREATE FUNCTION syntax would add optional
clauses
    COST function-name-or-numeric-constant
    ROWS function-name-or-numeric-constant

that would be used to fill these columns.
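
Purely for illustration (the function and the figure 100 are invented, and
only the numeric-constant form is shown), the clause might read:

    CREATE FUNCTION slow_match(haystack text, needle text) RETURNS boolean AS $$
        SELECT $1 ILIKE '%' || $2 || '%';
    $$ LANGUAGE sql
      COST 100;    -- roughly 100x a built-in operator

The function-name alternative would instead name a (presumably C-language)
estimator to be consulted at plan time.
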
        regards, tom lane


Re: Function execution costs 'n all that

From: Richard Troy
On Mon, 15 Jan 2007, Tom Lane wrote:

> So I've been working on the scheme I suggested a few days ago of
> representing "equivalence classes" of variables explicitly, and avoiding
> the current ad-hocery of generating and then removing redundant clauses
> in favor of generating only the ones we want in the first place.  Any
> clause that looks like an equijoin gets sent to the EquivalenceClass
> machinery by distribute_qual_to_rels, and not put into the
> restrictlist/joinlist data structure at all.  Then we make passes over
> the EquivalenceClass lists at appropriate times to generate the clauses
> we want.  This is turning over well enough now to pass the regression
> tests,

That was quick...

> In short, this approach results in a whole lot less stability in the
> order in which WHERE clauses are evaluated.  That might be a killer
> objection to the whole thing, but on the other hand we've never made
> any strong promises about WHERE evaluation order.

Showing my ignorance here, but I've never been a fan of "syntax based
optimization," though it is better than no optimization. If people are
counting on order for optimization, then, hmmm... If you can provide a way
to at least _try_ to do better, then don't worry about it. It will improve
with time.

> Instead, I'm thinking it might be time to re-introduce some notion of
> function execution cost into the system, and make use of that info to
> sort WHERE clauses into a reasonable execution order.

Ingres did/does it that way, IIRC. It's a solid strategy.

>  This example
> would be fixed with even a very stupid rule-of-thumb about SQL functions
> being more expensive than C functions, but if we're going to go to the
> trouble it seems like it'd be a good idea to provide a way to label
> user-defined functions with execution costs.
>
> Would a simple constant value be workable, or do we need some more
> complex model (and if so what)?

Ingres would, if I'm not mistaken, gain through historical use through
histograms. Short of that, you've got classes of functions, aggregations,
for example, and there's sure to be missing information to make a great
decision at planning time. However, I take it that the cost here is
primarily CPU and not I/O. I therefore propose that the engine evaluate -
benchmark, if you will - all functions as they are ingested, or
vacuum-like at some later date (when valid data for testing may exist),
and assign a cost relative to what it already knows - the built-ins, for
example. Doing so could allow this strategy to be functional in short
order and be improved with time so all the work doesn't have to be
implemented on day 1. And, DBA/sys-admin tweaking can always be done by
updating the catalogues.

HTH,
Richard

-- 
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy@ScienceTools.com, http://ScienceTools.com/



Re: Function execution costs 'n all that

From: Neil Conway
On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
> I therefore propose that the engine evaluate -
> benchmark, if you will - all functions as they are ingested, or
> vacuum-like at some later date (when valid data for testing may exist),
> and assign a cost relative to what it already knows - the built-ins, for
> example.

That seems pretty unworkable. It is unsafe, for one: evaluating a
function may have side effects (inside or outside the database), so the
DBMS cannot just invoke user-defined functions at whim. Also, the
relationship between a function's arguments and its performance will
often be highly complex -- it would be very difficult, not to mention
computationally infeasible, to reconstruct that relationship
automatically, especially without any real knowledge about the
function's behavior.

-Neil




Re: Function execution costs 'n all that

From: Richard Troy
On Mon, 15 Jan 2007, Neil Conway wrote:
> On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
> > I therefore propose that the engine evaluate -
> > benchmark, if you will - all functions as they are ingested, or
> > vacuum-like at some later date (when valid data for testing may exist),
> > and assign a cost relative to what it already knows - the built-ins, for
> > example.
>
> That seems pretty unworkable. It is unsafe, for one: evaluating a
> function may have side effects (inside or outside the database), so the
> DBMS cannot just invoke user-defined functions at whim. Also, the
> relationship between a function's arguments and its performance will
> often be highly complex -- it would be very difficult, not to mention
> computationally infeasible, to reconstruct that relationship
> automatically, especially without any real knowledge about the
> function's behavior.
>
> -Neil

Hi Neil,

Tom had already proposed:
>
> I'm envisioning that the CREATE FUNCTION syntax would add optional
> clauses
>
>                COST function-name-or-numeric-constant
>                ROWS function-name-or-numeric-constant
>
> that would be used to fill these columns.

I was considering these ideas in the mix; let the user provide either a
numeric or a function, the distinction here being that instead of running
that function at planning-time, it could be run "off-line", so to speak.

Richard

-- 
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy@ScienceTools.com, http://ScienceTools.com/



Re: Function execution costs 'n all that

From: Brian Hurt
Neil Conway wrote:
> On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
>> I therefore propose that the engine evaluate -
>> benchmark, if you will - all functions as they are ingested, or
>> vacuum-like at some later date (when valid data for testing may exist),
>> and assign a cost relative to what it already knows - the built-ins, for
>> example.
>
> That seems pretty unworkable. It is unsafe, for one: evaluating a
> function may have side effects (inside or outside the database), so the
> DBMS cannot just invoke user-defined functions at whim. Also, the
> relationship between a function's arguments and its performance will
> often be highly complex -- it would be very difficult, not to mention
> computationally infeasible, to reconstruct that relationship
> automatically, especially without any real knowledge about the
> function's behavior.

Non-developer here, but we use a lot of plpgsql functions at work.  And
the functions we use fall into two broad, ill-defined categories-
"expensive" functions and "cheap" functions.  What I'd like as a user is
some way to tell the planner "this function is expensive- prefer plans
which call this function less even if they're otherwise more expensive"
or "this function is cheap, prefer plans that are otherwise less
expensive even if they call this function more often".  Precise cost
estimates aren't that important, IMHO.

Brian

Re: Function execution costs 'n all that

From: Tom Lane
Brian Hurt <bhurt@janestcapital.com> writes:
> Non-developer here, but we use a lot of plpgsql functions at work.  And 
> the functions we use fall into two broad, ill-defined categories- 
> "expensive" functions and "cheap" functions.  What I'd like as a user is 
> some way to tell the planner "this function is expensive- prefer plans 
> which call this function less even if they're otherwise more expensive" 
> or "this function is cheap, prefer plans that are otherwise less 
> expensive even if they call this function more often".  Precise cost 
> estimates aren't that important, IMHO.

Right, so a plain constant cost would be plenty for your situation.

I suspect there's an 80/20 rule at work here --- the estimator-function
side of this will take most of the effort to design/implement, but not
get used nearly as much as the plain-constant form ... maybe we should
just do the constant for starters and see how many people really want to
write C-code estimators ...
        regards, tom lane


Re: Function execution costs 'n all that

From: Neil Conway
On Mon, 2007-01-15 at 15:05 -0500, Tom Lane wrote:
> maybe we should just do the constant for starters and see how many
> people really want to write C-code estimators ...

+1

BTW, your proposal would still pushdown all qualifiers, right?
Hellerstein's xfunc work discusses situations in which it makes sense to
pullup expensive qualifiers above joins, for example, in order to reduce
the number of tuples the qualifier is applied to. Unfortunately, this
would probably increase the optimizer's search space by a fairly
significant degree, so it might need to be controlled by a GUC variable,
or only applied when the estimated cost of applying a qualifier is
particularly large relative to the total estimated cost of the plan.

-Neil




Re: Function execution costs 'n all that

From: Tom Lane
Neil Conway <neilc@samurai.com> writes:
> BTW, your proposal would still pushdown all qualifiers, right?

Yeah, I have no intention of readopting xfunc in the near future ...
especially seeing that it's possible for the user to force that
sort of thing if he really has to.
    SELECT * FROM (SELECT ... OFFSET 0) ss
      WHERE expensive_function(...);
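
Spelled out with invented names (an orders/customers join and an expensive
slow_check() predicate), that fence idiom looks like the following; the
OFFSET 0 keeps the subquery from being pulled up, so the expensive qual is
applied only to the rows the inner query lets through:

    SELECT *
    FROM (SELECT o.*
          FROM orders o
          JOIN customers c ON c.id = o.customer_id
          WHERE c.region = 'EU'
          OFFSET 0) ss
    WHERE slow_check(ss.payload);
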
        regards, tom lane


Re: Function execution costs 'n all that

From: Mark Cave-Ayland
On Mon, 2007-01-15 at 15:05 -0500, Tom Lane wrote:
> Brian Hurt <bhurt@janestcapital.com> writes:
> > Non-developer here, but we use a lot of plpgsql functions at work.  And 
> > the functions we use fall into two broad, ill-defined categories- 
> > "expensive" functions and "cheap" functions.  What I'd like as a user is 
> > some way to tell the planner "this function is expensive- prefer plans 
> > which call this function less even if they're otherwise more expensive" 
> > or "this function is cheap, prefer plans that are otherwise less 
> > expensive even if they call this function more often".  Precise cost 
> > estimates aren't that important, IMHO.
> 
> Right, so a plain constant cost would be plenty for your situation.
> 
> I suspect there's an 80/20 rule at work here --- the estimator-function
> side of this will take most of the effort to design/implement, but not
> get used nearly as much as the plain-constant form ... maybe we should
> just do the constant for starters and see how many people really want to
> write C-code estimators ...
> 
>             regards, tom lane


Hi Tom et al,

Having worked with stored procedures on large datasets for reporting, I
would say that it would be useful to have a non-constant estimator for
the number of rows, whereas a single CPU cost constant should be fine.
Where I have struggled with this has been joining onto slightly more
exotic queries when doing large scale data processing as part of a
custom report or an application upgrade.

Using PL/PGSQL I would find it useful to have access to the constants
passed into a function to be used to help provide a row count estimate
(typically useful for passing in table/column names), e.g.

SELECT * FROM my_func('my_table1') AS t1, my_table2 AS t2
    WHERE t1.id = t2.id;


CREATE FUNCTION my_func(text) AS $$
...
$$ LANGUAGE 'plpgsql' COST 1.0 ROWS my_func_row_cost;


In my cost function, I could then estimate the number of rows using
something like below, where all constants are passed into the cost
function as parameters, e.g.:


CREATE FUNCTION my_func_row_cost(text) RETURNS bigint AS $$
DECLARE
    foo bigint;
BEGIN
    -- count the rows of the table whose name was passed in as a constant
    EXECUTE 'SELECT COUNT(*) FROM ' || quote_ident($1) INTO foo;
    RETURN foo;
END;
$$ LANGUAGE 'plpgsql';


In the case where a parameter was not a constant but a column name, then
it would be reasonable in my mind to simply replace that parameter with
NULL when passing to the row cost function, e.g.


SELECT * FROM my_table1
    WHERE my_table1.junk = (SELECT my_func(my_table1.name));


In this case, the text parameter passed to my_func_row_cost would be
replaced by NULL to indicate that it was non-constant.

Of course, even with constants passed as input, it still may not be
possible to calculate the number of rows that will be returned - it could
be the case that the only parameter passed to the cost function has been
converted to NULL because it is a column name. Perhaps in this case it
would be useful to specify that returning NULL from my_func_row_cost means
"I can't return anything meaningful, so use the fallback values".


Kind regards,

Mark.




Re: Function execution costs 'n all that

From: Gregory Stark
> Tom Lane wrote:
> > Instead, I'm thinking it might be time to re-introduce some notion of
> > function execution cost into the system, and make use of that info to
> > sort WHERE clauses into a reasonable execution order.

I imagine you've thought of this already but just in case, the cost of the
function call has to be combined with the selectivity to get this right. If
you can do an expensive but very selective clause first and save 100 cheap
calls that almost always return true it may still be worthwhile.
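
For reference, the usual textbook answer (it falls out of the
expensive-predicate literature mentioned upthread) is to fold the two
numbers into a single rank and, assuming independent clauses and
short-circuit AND evaluation, apply clauses in ascending rank order:

    rank(clause) = (selectivity(clause) - 1) / per-tuple-cost(clause)

With selectivity 0.5 at cost 1 (rank -0.5) against selectivity 0.01 at cost
100 (rank -0.0099), the cheap clause still goes first: 1 + 0.5*100 = 51 per
input row versus 100 + 0.01*1 = 100.01 the other way around -- though this
of course leans on both the cost and the selectivity estimates being
trustworthy.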

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com



Re: Function execution costs 'n all that

From: Tom Lane
Gregory Stark <gsstark@mit.edu> writes:
> I imagine you've thought of this already but just in case, the cost of the
> function call has to be combined with the selectivity to get this right. If
> you can do an expensive but very selective clause first and save 100 cheap
> calls that almost always return true it may still be worthwhile.

I've thought of it, but I haven't figured out a reasonable algorithm for
ordering the clauses in view of that.  Have you?
        regards, tom lane


Re: Function execution costs 'n all that

From: Gregory Stark
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <gsstark@mit.edu> writes:
>> I imagine you've thought of this already but just in case, the cost of the
>> function call has to be combined with the selectivity to get this right. If
>> you can do an expensive but very selective clause first and save 100 cheap
>> calls that almost always return true it may still be worthwhile.
>
> I've thought of it, but I haven't figured out a reasonable algorithm for
> ordering the clauses in view of that.  Have you?

Hum, I hadn't tried. Now that I think about it it's certainly not obvious.

And picturing the possible failure modes I would rather it execute cheap
expressions more often than necessary than call some user-defined perl
function that could be doing i/o or involve waiting on other resources any
more than absolutely necessary. So I guess what you originally described is
safest.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Function execution costs 'n all that

From: Ron Mayer
Tom Lane wrote:
> 
> BTW, I'm thinking that a "cost constant" probably ought to be measured
> in units of cpu_operator_cost.  The default for built-in functions would
> thus be 1, at least till such time as someone wants to refine the
> estimates.  We'd probably want the default for PL and SQL functions to
> be 10 or 100 or so.

Any chance that costs could eventually change to real-world units?

It's hard for me to guess how many cpu_operator_cost units
something might take; but relatively easy for me to measure
or estimate in fractions-of-a-seconds how long something takes.

I could imagine having the other planner costs be measured in seconds
too - perhaps with the goal of eventually writing some auto-config
code that tries to measure values like cpu_tuple_cost on a given
piece of hardware.



Re: Function execution costs 'n all that

From: Tom Lane
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> Tom Lane wrote:
>> BTW, I'm thinking that a "cost constant" probably ought to be measured
>> in units of cpu_operator_cost.

> Any chance that costs could eventually change to real-world units?

Define "real world units".

If you like you can try to adjust things so that cpu_operator_cost bears
some relation to elapsed-msec-on-your-own-machine, and scale everything
else accordingly; indeed that's why we added seq_page_cost in 8.2.
But it's awfully hard to see the numbers being very portable.
        regards, tom lane


Re: Function execution costs 'n all that

From: Jeff Davis
On Mon, 2007-01-15 at 13:54 -0500, Neil Conway wrote:
> On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
> > I therefore propose that the engine evaluate -
> > benchmark, if you will - all functions as they are ingested, or
> > vacuum-like at some later date (when valid data for testing may exist),
> > and assign a cost relative to what it already knows - the built-ins, for
> > example.
> 
> That seems pretty unworkable. It is unsafe, for one: evaluating a
> function may have side effects (inside or outside the database), so the

Would any form of cost estimate have meaning if the function has side
effects? If it's a volatile function, doesn't that mean that the planner
can't avoid or favor executing it?

Regards,
    Jeff Davis



Re: Function execution costs 'n all that

From: Tom Lane
Jeff Davis <pgsql@j-davis.com> writes:
> Would any form of cost estimate have meaning if the function has side
> effects? If it's a volatile function, doesn't that mean that the planner
> can't avoid or favor executing it?

No, not really.  If the function is down inside a sub-select or
something like that, the number of executions could depend on any number
of factors (like whether we put it on the inside or outside of a join)
--- and if it's expensive then we will and should try to arrange the
query to minimize the number of executions.  We're certainly not going
to drop back to all-plain-nestloops just because the query contains one
volatile function.

(Now mind you, a smart user will probably avoid using a volatile
function like that, but I don't think it's an adequate argument for
saying that we don't need cost information.)
        regards, tom lane


Re: Function execution costs 'n all that

From: Tom Lane
Neil Conway <neilc@samurai.com> writes:
> On Mon, 2007-01-15 at 15:05 -0500, Tom Lane wrote:
>> maybe we should just do the constant for starters and see how many
>> people really want to write C-code estimators ...

> +1

It seemed like that was the list's consensus, so I'll go off and do the
simple-constant case.  We can always extend it later.

For reference, the plan is to add these options to CREATE/ALTER
FUNCTION:
    COST float-constant        (defaults to 1)
    ROWS float-constant        (defaults to 1000)

feeding into two float4 columns added to pg_proc.  ROWS is only
allowed/meaningful for set-returning functions.  COST is implicitly
scaled by cpu_operator_cost.
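
As a minimal sketch of how those options would be used (object names
invented; anything left unlabeled keeps the defaults, so existing behavior
is unchanged):

    -- retroactively mark an existing plpgsql validator as expensive
    ALTER FUNCTION check_invoice(integer) COST 500;

    -- a set-returning SQL function with an explicit rowcount hint
    CREATE FUNCTION recent_ids(days integer) RETURNS SETOF integer AS $$
        SELECT id FROM events WHERE created > now() - $1 * interval '1 day';
    $$ LANGUAGE sql
      COST 10
      ROWS 200;
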
        regards, tom lane


Re: Function execution costs 'n all that

From: Tom Lane
I complained about how:
> The query is

> SELECT p1.opcname, p1.opcfamily
> FROM pg_opclass AS p1
> WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS p2
>                  WHERE p2.amopfamily = p1.opcfamily
>                    AND binary_coercible(p1.opcintype, p2.amoplefttype));

> and investigation showed that the plan changed from (8.2 and before)

>  Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
>            Filter: ((amopfamily = $0) AND binary_coercible($1, amoplefttype))

> to

>  Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
>            Filter: (binary_coercible($1, amoplefttype) AND (amopfamily = $0))

Now that some function-cost smarts are in there, I expected to see the
plan go back to the first case, but what I actually see in CVS HEAD is
 Seq Scan on pg_opclass p1  (cost=0.00..660.35 rows=51 width=68)
   Filter: (NOT (subplan))
   SubPlan
     ->  Bitmap Heap Scan on pg_amop p2  (cost=4.29..8.60 rows=2 width=0)
           Recheck Cond: (amopfamily = $0)
           Filter: binary_coercible($1, amoplefttype)
           ->  Bitmap Index Scan on pg_amop_fam_strat_index  (cost=0.00..4.29 rows=5 width=0)
                 Index Cond: (amopfamily = $0)

The reason this happens is that cost_qual_eval charges the entire cost of
evaluating all the arms of an AND, even though we'll drop out as soon as
something returns FALSE; and so the planner is led to avoid the seqscan
because it now appears to have a high filter-condition evaluation cost,
in favor of a plan that will evaluate the filter condition many fewer
times.  In reality those two plans will call binary_coercible() exactly
the same number of times, and so this is a bogus reason to switch.
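
In other words, writing c(x) for estimated per-tuple cost and s(x) for
estimated selectivity, the difference is roughly:

    charged by cost_qual_eval:  c(amopfamily = $0) + c(binary_coercible($1, amoplefttype))
    actually paid at runtime:   c(amopfamily = $0) + s(amopfamily = $0) * c(binary_coercible($1, amoplefttype))

so the seqscan's filter looks more expensive to the planner than it really is.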

I'm kind of inclined to leave it alone though, because the second plan
seems a bit more "failsafe".  To do anything differently, we'd have to
order the qual conditions the way we expect to execute them before
any use of cost_qual_eval, which sounds expensive; and as noted in
an upthread discussion with Greg, relying on the correctness of *both*
cost and selectivity estimates seems a tad fragile.

Comments?
        regards, tom lane


Re: Function execution costs 'n all that

From: Mark Dilger
Tom Lane wrote:
> Would a simple constant value be workable, or do we need some more
> complex model (and if so what)?

Consider:

ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03;

ANALYZE myfunc(text,text) ON (SELECT myfunc(mt.a,mt.b) FROM mytable mt)
    WITH RATIO 1.071;

ANALYZE myfunc(text,text) ON (
    SELECT myfunc(mt.a,mt.b) FROM mytable mt
    UNION
    SELECT myfunc(ot.a,ot.b) FROM othertable ot
) WITH RATIO 0.5;

These commands could turn on function timing for the lifespan of the query, with 
statistics gathered about the given function's runtimes.  The "WITH RATIO" 
clause would be there to translate runtimes (in milliseconds) into units of 
cpu_operator_cost.  The "WITH RATIO" clause could be optional, with a default 
ratio taken from the postgresql.conf file, if any exists, and finally defaulting 
to a hardcoded "reasonable" value.  Users would be well advised to adopt a 
consistent policy regarding system load at the time that various analyze 
functions are run.

If the function has side effects, it would be the user's responsibility to not 
analyze the function unless those side effects are acceptable.  The user can 
only analyze those queries that the user has permissions to run, so there 
shouldn't be any additional ability to generate side-effects beyond what the 
user already has permission to do.

The syntax might need some adjusting to make the parser happy and to avoid new 
reserved words.  The syntax used above is just an example.

It seems to me that the above system would work perfectly well for collecting 
the number of rows returned from a set returning function, not just the run times.

mark


Re: Function execution costs 'n all that

From: Tom Lane
Mark Dilger <pgsql@markdilger.com> writes:
> Tom Lane wrote:
>> Would a simple constant value be workable, or do we need some more
>> complex model (and if so what)?

> Consider:
> ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03;
> ...
> It seems to me that the above system would work perfectly well for
> collecting the number of rows returned from a set returning function,
> not just the run times.

I don't really think that data collection is the bottleneck.  If a
constant estimate isn't good enough for you, then you need some kind of
model of how the runtime or number of rows varies with the function's
inputs ... and I hardly see how something like the above is likely to
figure out how to fit a good model.  Or at least, if you think it can,
then you skipped all the interesting bits.

One other point is that we already know that sampling overhead and
measurement error are significant problems when trying to measure
intervals on the order of one Plan-node execution.  I'm afraid that
would get a great deal worse if we try to use a similar approach to
timing individual function calls.
        regards, tom lane


Re: Function execution costs 'n all that

From: Mark Dilger
Tom Lane wrote:
> Mark Dilger <pgsql@markdilger.com> writes:
>> Tom Lane wrote:
>>> Would a simple constant value be workable, or do we need some more
>>> complex model (and if so what)?
> 
>> Consider:
>> ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03;
>> ...
>> It seems to me that the above system would work perfectly well for
>> collecting the number of rows returned from a set returning function,
>> not just the run times.
> 
> I don't really think that data collection is the bottleneck.

Ahh, I'm not just thinking about data collection.  I'm thinking about usability 
for non-hackers who know enough plpgsql to write a function and then want to 
train the system to plan for it appropriately.  It's a much easier task for a 
novice user to say "go away and figure out how expensive this thing is" than for 
a novice to think about things like statistical variance, etc.  We don't demand 
that users have that kind of knowledge to write queries or run analyze on 
tables, so why would they need that kind of knowledge to write a function?

> If a
> constant estimate isn't good enough for you, then you need some kind of
> model of how the runtime or number of rows varies with the function's
> inputs ... and I hardly see how something like the above is likely to
> figure out how to fit a good model.  Or at least, if you think it can,
> then you skipped all the interesting bits.

I am (perhaps naively) imagining that the user will train the database over the 
same query as the one that will actually get used most often in production.  In 
the case that the query modifies the table, the user could train the database 
over a copy of that table.  The data collected by the analyze phase would just 
be constant stuff like average and stddev.  That would make the job of the 
planner / cost estimator easier, right?  It could treat the function as a 
constant cost function.

> One other point is that we already know that sampling overhead and
> measurement error are significant problems when trying to measure
> intervals on the order of one Plan-node execution.  I'm afraid that
> would get a great deal worse if we try to use a similar approach to
> timing individual function calls.

The query could be run with the arguments passed to "myfunc" being recorded to a 
temporary table.  After the query is complete (and the temporary table 
populated), data from the temp table could be pulled into memory in batches, 
with the "myfunc" run on them again in a tight loop.  The loop itself could be 
timed, rather than each iteration.  The sum of all the timings for the various 
loops would then be the final runtime which would be divided by the total number 
of rows to get the average runtime per call.  The downside is that I don't see 
how you retrieve the standard deviation.  (I also don't know if the planner 
knows how to use standard deviation information, so perhaps this is a non issue.)

A further refinement would be to batch the inputs based on properties of the 
input data.  For text, you could run a batch of short text first, a batch of 
medium second, and a batch of long text last, and use best-fit linear algebra to 
determine the runtime cost vs. input text length function.  I'm not sure how 
such a refinement would be done for fixed size datatypes.  And for some text 
functions the runtime won't vary with length but with some other property anyway.

mark