Thread: Expensive function and the optimizer

Expensive function and the optimizer

From
"Craig A. James"
Date:
I have a function, call it "myfunc()", that is REALLY expensive computationally.  Think of it like, "If you call this function, it's going to telephone the Microsoft Help line and wait in their support queue to get the answer."  Ok, it's not that bad, but it's so bad that the optimizer should ALWAYS consider it last, no matter what.  (Realistically, the function takes 1-2 msec on average, so applying it to 40K rows takes 40-80 seconds.  It's a graph-theory algorithm, known to be NP-complete.)

Is there some way to explain this cost to the optimizer in a permanent way, like when the function is installed? Here's what I get with one critical query (somewhat paraphrased for simplicity):

 explain analyze
   select A.ID
     from A join B ON (A.ID = B.ID)
     where A.row_num >= 0 and A.row_num <= 43477
     and B.ID = 52
     and myfunc(A.FOO, 'FooBar') order by row_num;

                            QUERY PLAN
  ----------------------------------------------------------------------------------
  Nested Loop  (cost=0.00..72590.13 rows=122 width=8)
   ->  Index Scan using i_a_row_num on a  (cost=0.00..10691.35 rows=12222 width=8)
         Index Cond: ((row_num >= 0) AND (row_num <= 43477))
         Filter: myfunc((foo)::text, 'FooBar'::text)
   ->  Index Scan using i_b_id on b  (cost=0.00..5.05 rows=1 width=4)
         Index Cond: ("outer".id = b.id)
         Filter: (id = 52)
  Total runtime: 62592.631 ms
  (8 rows)

Notice the "Filter: myfunc(...)" in the first index scan.  This means Postgres is applying myfunc() to all 43477 rows in the row_num range before the join.  Applying the join to B first would cut this number down from 43477 rows to about 20 rows, making the query time drop from 62 seconds down to a fraction of a second.

Is there any way to give Postgres this information?

The only way I've thought of is something like this:

   select X.id from
     (select A.id, A.foo, A.row_num
        from A join B ON (A.id = B.id)
       where A.row_num >= 0 and A.row_num <= 43477
         and B.id = 52) as X
     where myfunc(X.foo, 'FooBar') order by X.row_num;

I can do this, but it means carefully hand-crafting each query rather than writing a more natural query.

Thanks,
Craig

Re: Expensive function and the optimizer

From
Tom Lane
Date:
"Craig A. James" <cjames@modgraph-usa.com> writes:
> Is there some way to explain this cost to the optimizer in a permanent
> way,

Nope, sorry.  One thing you could do in the particular case at hand is
to rejigger the WHERE clause involving the function so that it requires
values from both tables and therefore can't be applied till after the
join is made.  (If nothing else, give the function an extra dummy
argument that can be passed as a variable from the other table.)
This is an ugly and non-general solution of course.
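[The dummy-argument trick Tom describes might look something like the sketch below.  The wrapper function myfunc2() and its extra parameter are hypothetical names, not anything from the thread; the point is only that referencing B.id in the argument list makes the filter depend on both tables, so the planner cannot evaluate it before the join.]

```sql
-- Sketch: wrap myfunc() in a version that takes a dummy argument.
-- The dummy value is never used; it exists only so the planner sees
-- a dependency on table B and must defer the filter until after the join.
CREATE FUNCTION myfunc2(foo text, pattern text, dummy integer)
RETURNS boolean AS $$
BEGIN
    RETURN myfunc(foo, pattern);  -- dummy is deliberately ignored
END;
$$ LANGUAGE plpgsql;

SELECT A.id
  FROM A JOIN B ON (A.id = B.id)
 WHERE A.row_num >= 0 AND A.row_num <= 43477
   AND B.id = 52
   AND myfunc2(A.foo, 'FooBar', B.id)  -- B.id forces post-join evaluation
 ORDER BY A.row_num;
```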

> The only way I've thought of is something like this:

>    select X.id from
>      (select A.id, A.foo, A.row_num
>         from A join B ON (A.id = B.id)
>        where A.row_num >= 0 and A.row_num <= 43477
>          and B.id = 52) as X
>      where myfunc(X.foo, 'FooBar') order by X.row_num;

As written, that won't work because the planner will happily flatten the
query to the same thing you had before.  You can put an OFFSET 0 into
the sub-select to prevent that from happening, but realize that this
creates a pretty impervious optimization fence ... the side-effects
might be undesirable when you come to look at real queries instead
of toy cases.
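[Concretely, the OFFSET 0 fence applied to Craig's rewritten query would look like this sketch.  OFFSET 0 returns the same rows, but the planner will not flatten a sub-select containing it, so myfunc() is only evaluated against the rows that survive the join:]

```sql
SELECT X.id FROM
  (SELECT A.id, A.foo, A.row_num
     FROM A JOIN B ON (A.id = B.id)
    WHERE A.row_num >= 0 AND A.row_num <= 43477
      AND B.id = 52
   OFFSET 0) AS X              -- optimization fence: blocks flattening
 WHERE myfunc(X.foo, 'FooBar')
 ORDER BY X.row_num;
```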

            regards, tom lane