Thread: Function is called multiple times in subselect

Function is called multiple times in subselect

From
Chris Campbell
Date:
pgsql-bugs:

I wrote a plpgsql function that does a fairly expensive calculation on
its input, and want to peform a query that:

1. Selects certain rows from a table, then
2. Calls my function on the selected rows

So I wrote a query and used a subselect to first select the rows, and
then used the outer select to call my function on each of the selected
rows and return the result.

I referenced the result of my inner select's calculation multiple times
in the outer select, and found that my function is called once for *each
reference* to its result in the outer select, rather than once for each
row of the inner select.

Here's a simple example:

     CREATE FUNCTION square_it(INTEGER) RETURNS INTEGER AS '
     DECLARE
        i ALIAS FOR $1;
     BEGIN
        RAISE NOTICE ''square_it(%)'', i;
        RETURN (i * i);
     END;
     ' LANGUAGE 'plpgsql';

     CREATE TABLE foo (i INTEGER);

     INSERT INTO foo (i) VALUES (1);
     INSERT INTO foo (i) VALUES (2);
     INSERT INTO foo (i) VALUES (3);
     INSERT INTO foo (i) VALUES (4);

     SELECT query.i,
            query.squared AS test1,
            query.squared + 1 AS test2,
            query.squared + 2 AS test3,
            query.squared + 3 AS test4
     FROM (
         SELECT i,
                square_it(i) AS squared
         FROM foo
     ) query;

When I run it, I expect to see 4 lines of output, and I expect that
square_it() will have been called 4 times (once for each row). However,
it is actually called *4 times for each row* because I reference
"query.squared" 4 times in the outer select.

     NOTICE:  square_it(1)
     NOTICE:  square_it(1)
     NOTICE:  square_it(1)
     NOTICE:  square_it(1)
     NOTICE:  square_it(2)
     NOTICE:  square_it(2)
     NOTICE:  square_it(2)
     NOTICE:  square_it(2)
     NOTICE:  square_it(3)
     NOTICE:  square_it(3)
     NOTICE:  square_it(3)
     NOTICE:  square_it(3)
     NOTICE:  square_it(4)
     NOTICE:  square_it(4)
     NOTICE:  square_it(4)
     NOTICE:  square_it(4)
      i | test1 | test2 | test3 | test4
     ---+-------+-------+-------+-------
      1 |     1 |     2 |     3 |     4
      2 |     4 |     5 |     6 |     7
      3 |     9 |    10 |    11 |    12
      4 |    16 |    17 |    18 |    19
     (4 rows)

I don't think this should be happening (PostgreSQL 7.4.1). I think it
should be saving the result of the calculation in the resulting rows
from the innery query. In my case, that means my query takes 4 times
longer than it should. And when it's a query that takes a nontrivial
amount of time to execute, that's harsh. Any ideas?

Here's the query plan:

                                            QUERY PLAN

-------------------------------------------------------------------------------------------------
  Seq Scan on foo  (cost=0.00..37.50 rows=1000 width=4) (actual
time=3.203..4.384 rows=4 loops=1)
  Total runtime: 4.742 ms

Thanks!

- Chris

Re: Function is called multiple times in subselect

From
Tom Lane
Date:
Chris Campbell <chris@bignerdranch.com> writes:
> I don't think this should be happening (PostgreSQL 7.4.1).

Sorry, this isn't a bug.  Flattening the subquery is generally desirable
behavior.

I think the easiest way to force the subquery not to be flattened is to
insert a LIMIT or OFFSET clause, viz

    SELECT ... FROM (SELECT ... OFFSET 0) query;

            regards, tom lane

Re: Function is called multiple times in subselect

From
"Alex J. Avriette"
Date:
On Thu, Mar 11, 2004 at 11:02:21PM -0500, Tom Lane wrote:
> Chris Campbell <chris@bignerdranch.com> writes:
> > I don't think this should be happening (PostgreSQL 7.4.1).
>
> Sorry, this isn't a bug.  Flattening the subquery is generally desirable
> behavior.

And if you are concerned about the query's cost vs its actual output
(you weren't clear in your original message), there is the 'iscachable'
pragma for functions.

Alex

--
alex@posixnap.net
Alex J. Avriette, Unix Systems Gladiator
Liberty in your lifetime: http://www.freestateproject.org/

Re: Function is called multiple times in subselect

From
Tom Lane
Date:
"Alex J. Avriette" <alex@posixnap.net> writes:
> And if you are concerned about the query's cost vs its actual output
> (you weren't clear in your original message), there is the 'iscachable'
> pragma for functions.

I think that's unrelated to Chris' problem.

One of the reasons we renamed that pragma to "immutable" is that too
many people were interpreting it as suggesting that Postgres would do
some kind of caching of the function's results.  There is no such cache.

            regards, tom lane

Re: Function is called multiple times in subselect

From
Chris Campbell
Date:
Alex J. Avriette wrote:

> And if you are concerned about the query's cost vs its actual output
> (you weren't clear in your original message), there is the 'iscachable'
> pragma for functions.

I was concerned that my function was being called 4 times for each row
of the result. The cost of the function call is quite high, so the cost
of the query was 4 times what I expected.

So my concern was the query's cost.

     SELECT query.i,
            query.squared AS test1,
            query.squared + 1 AS test2,
            query.squared + 2 AS test3,
            query.squared + 3 AS test4
     FROM (
         SELECT i,
                square_it(i) AS squared
         FROM foo
         OFFSET 0
     ) query;

As Tom explained to me, the optimizer flattens the subselect, so it then
becomes:

     SELECT i,
            square_it(i) AS test1,
            square_it(i) + 1 AS test2,
            square_it(i) + 2 AS test3,
            square_it(i) + 3 AS test4
     FROM   foo;

Thus, each reference to query.squared in the outer query is replaced
with a call to the function.

In this trivial example, that's not really a performance issue. But my
real-world function calculates about 8 values and returns them as a
record, which I then pick apart in the outer query. When the optimizer
flattens the subquery and each reference to the function result is
replaced with a call to the function, it makes for a very expensive query.

     CREATE TYPE patient_balances_type AS (patient_id INTEGER,
                                           account_id INTEGER,
                                           date DATE,
                                           due_now INTEGER,
                                           future_due INTEGER,
                                           copay_balance INTEGER,
                                           expected_insurance INTEGER,
                                           total_balance INTEGER,
                                           contract_amount INTEGER,
                                           real_due_now INTEGER,
                                           real_future_due INTEGER);

     CREATE OR REPLACE FUNCTION patient_balances(INTEGER, INTEGER, DATE)
     RETURNS patient_balances_type AS '...' LANGUAGE 'plpgsql';

  My actual query looked something like:

         SELECT  query.appointment_id AS appointment_id,
                 query.date,
                 query.start_time,
                 query.duration,
                 query.patient_id,
                 (query.bal).total_balance,
                 (query.bal).expected_insurance,
                 (query.bal).future_due,
                 (query.bal).due_now
         FROM
         (
             SELECT appt.appointment_id,
                    appt.start_time,
                    appt.duration,
                    appt.date,
                    p.patient_id,
                    p.account_id,
                    patient_balances(p.patient_id,
                                     p.account_id,
                                     d.system_date) AS bal
             FROM
                    patients AS p
                    JOIN appointments appt ON
                        (p.patient_id = appt.patient_id)
                    JOIN system_date d ON
                        (appt.date = d.system_date)
         ) query;

When that subquery was flattened, each reference to query.bal was
replaced with a call to patient_balances(). Which was pretty expensive.

Tom's suggestion off using OFFSET 0 to cause the optimizer not to
flatten the subquery is exactly what I was looking for.

Is there a better way to be picking apart the result of my function than
putting it in a subquery, since the subquery will be flattened by
default? There are a bunch of places I've done this that I need to go
back to now and add an OFFSET 0 because I didn't realize the optimizer
would negate my cleverly crafted record-dissecting subselect. :)

Or is there a way to advise the optimizer of the cost of my function, so
that it will choose to not flatten the subquery (since the total cost of
doing that will be higher)? Or should the optimizer be assuming that
function calls are fairly expensive by default, and not flattening
subqueries that have function calls? So it's not a bug, but maybe a
feature request? :)

Thanks!

- Chris

Re: Function is called multiple times in subselect

From
Tom Lane
Date:
Chris Campbell <chris@bignerdranch.com> writes:
> Or is there a way to advise the optimizer of the cost of my function, so
> that it will choose to not flatten the subquery (since the total cost of
> doing that will be higher)?

Not at present.  AFAIR, we have removed all traces of Joe Hellerstein's
thesis work on optimizing expensive functions ;-).  The thesis work
itself wasn't bad, but it depended on knowing a great deal about the
behavior of arbitrary functions, which isn't a very tenable assumption
in the real world.

> Or should the optimizer be assuming that
> function calls are fairly expensive by default,

Certainly not.  Remember that every comparison and arithmetic operator
is a function call under the hood.

We could possibly think about a simple binary classification "cheap" vs
"expensive", with the consequence being that the optimizer would try to
avoid duplicating expressions containing any "expensive" functions.
I'm not at all sure what it would take to implement that, though.

            regards, tom lane

Re: Function is called multiple times in subselect

From
"Alex J. Avriette"
Date:
On Fri, Mar 12, 2004 at 09:41:39AM -0500, Tom Lane wrote:

> > And if you are concerned about the query's cost vs its actual output
> > (you weren't clear in your original message), there is the 'iscachable'
> > pragma for functions.

> One of the reasons we renamed that pragma to "immutable" is that too
> many people were interpreting it as suggesting that Postgres would do
> some kind of caching of the function's results.  There is no such cache.

Well, I'm glad you told me that. However, I think that it could still be
useful. This allows indices to be built upon the results of functions. In
one case this proved to be tremendously fast compared to using lower(foo)
dynamically. lower(foo) will always return the same value, as would the
original poster's.

But I digress.

Alex

--
alex@posixnap.net
Alex J. Avriette, Unix Systems Gladiator
"Shut down some of the bullshit the government is spending money on and use it
to buy all the Microsoft stock. If we did that, we could ... just bronze Gates,
turn him into a statue, and stick him in front of the Commerce Department." - Scott McNealy