stable function optimizations, revisited - Mailing list pgsql-performance

From Phil Frost
Subject stable function optimizations, revisited
Date
Msg-id 20060512175908.GA30515@unununium.org
List pgsql-performance
I have recently been encountering a number of significant performance
problems related to stable functions being called multiple times when I
believe they could be called just once. Searching the ML archives, I see
I'm not the first:

<http://archives.postgresql.org/pgsql-hackers/2003-04/msg00890.php>
<http://archives.postgresql.org/pgsql-performance/2006-01/msg00140.php>

and so on. None of them seemed to resolve into a plan of action or an
elegant workaround. It is mentioned that "stable" was added to allow
such functions to be used for index scans, but I could not tell whether
the other optimizations I would like are impossible, or possible but
unlikely ever to be implemented.

I have several examples of queries in which eliminating extra calls to a
stable function would result in very significant performance gains. All
of these cases were found while developing a real application, and
although I've simplified them to be more readable, they are not
contrived.

Problem 1: creating a view with multiple columns calculated from a
function.

    create table sale(saleid serial, total numeric);
    create function cost_of_sale(sale.saleid%type) returns numeric stable
    language sql as $$
      -- calculates the cost of purchasing the things sold in a sale
      -- takes considerable time to calculate
    $$;
    create view convenient_view_on_sale as
        select *,
        cost_of_sale(saleid) as cost,
        total - cost_of_sale(saleid) as profit,
        case when total != 0 then (total - cost_of_sale(saleid)) / total * 100 end as margin
        from sale;

Executing "select * from convenient_view_on_sale limit 1" will execute
cost_of_sale thrice. However, from the definition of stable, we know it
could have been called just once. As cost_of_sale takes hundreds of ms
to execute while the rest of the query is extremely simple, additional
calls in effect multiply the total execution time.
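One way to observe the repeated evaluation is to give cost_of_sale a
stand-in body that logs each call (the constant return value here is
purely hypothetical, not the real calculation):

    create function cost_of_sale(integer) returns numeric stable
    language plpgsql as $$
    begin
        raise notice 'cost_of_sale(%) called', $1;
        return 100.0;  -- stand-in for the expensive calculation
    end;
    $$;

Selecting a single row from the view then emits the notice three times,
once per reference to the function in the view definition.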

Nonsolution 1a: moving function to a subselect:

    create view convenient_view_on_sale as
        select *,
        total - cost as profit,
        case when total != 0 then (total-cost) / total * 100 end as margin
        from (select *, cost_of_sale(saleid) as cost from sale) as subq;

The query planner will eliminate the subselect, and cost_of_sale will
still be executed thrice. I can observe no change in behaviour
whatsoever with this view definition.

PS: I wonder what the behaviour would be if I explicitly inlined
cost_of_sale here?
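To make the question concrete: if cost_of_sale were, say, a plain SQL
aggregate over a hypothetical purchase table, manual inlining would mean
repeating the correlated subquery once per derived column:

    create view convenient_view_on_sale as
        select *,
            (select sum(price) from purchase p where p.saleid = s.saleid) as cost,
            total - (select sum(price) from purchase p where p.saleid = s.saleid) as profit
        from sale s;

Presumably the planner would still evaluate each copy of the subquery
independently, so nothing would be saved.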

Nonsolution 1b: preventing optimization of the subselect with "offset 0"

    create view convenient_view_on_sale as
        select *,
        total - cost as profit,
        case when total != 0 then (total-cost) / total * 100 end as margin
        from (select *, cost_of_sale(saleid) as cost from sale offset 0) as subq;

This helps in the case of a "select *"; the subquery will not be
eliminated due to the "offset 0", and cost_of_sale will be executed only
once. However, it will always be executed, even if none of the
cost-related columns are selected. For example,
"select saleid from convenient_view_on_sale limit 1" will execute
cost_of_sale once, although it need not have been executed at all.

Problem 1 has a workaround: perform the dependent calculations (profit
and margin in this case) on the client, or in a stored procedure. This
is often inconvenient, but it works.
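As a sketch of the stored-procedure route (the function name and OUT
parameters here are hypothetical), a single function can call
cost_of_sale once and return all the derived columns together:

    create function sale_financials(s sale, out cost numeric, out profit numeric, out margin numeric)
    stable language plpgsql as $$
    begin
        cost := cost_of_sale(s.saleid);       -- the expensive call, made once
        profit := s.total - cost;
        if s.total != 0 then
            margin := profit / s.total * 100; -- same formula as the view's margin column
        end if;
    end;
    $$;

Called once per sale, e.g. from application code, this pays for
cost_of_sale only once per row.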

Problem 2: expensive functions returning composite types.

Consider that the purchases for a sale might not yet have been made, so
the exact cost cannot be known, but a guess can be made based on current
prices. cost_of_sale might be updated to reflect this:

    create function cost_of_sale(sale.saleid%type, out cost numeric, out estimated bool)
    stable language sql as $$ ... $$;

    create view convenient_view_on_sale as
    select *, cost_of_sale(saleid) from sale;

Note that in many cases, calculating "cost" and "estimated" together
takes just as long as calculating either one. This is why both are
returned from the same function.

Now, I use python as a client, in particular the psycopg2 module. When I
do something such as "select cost from convenient_view_on_sale", the
values returned for the cost column (a composite type (numeric, bool))
are strings. Perhaps this is an issue with psycopg2, but as a user it is
very annoying, since I cannot really get at the components of the
composite type without reimplementing pg's parser. Granted, I could
probably do it simply in a way that would work most of the time, but I
feel it would be error prone, and I'd rather not.

Thus, I seek a way to get the components of the cost column as top-level
columns. For example, I try "select (cost).cost, (cost).estimated", but
this now executes cost_of_sale twice, doubling the time of my query.
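A variant of the "offset 0" trick from nonsolution 1b seems to apply
here as well: compute the composite once inside a fenced subselect, then
expand its fields in the outer query:

    create view convenient_view_on_sale as
        select saleid, total, (c).cost, (c).estimated
        from (select *, cost_of_sale(saleid) as c from sale offset 0) as subq;

As with nonsolution 1b, this should evaluate cost_of_sale only once per
row, at the price of always evaluating it even when neither cost column
is actually selected.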

Since stable functions are the most common in my experience, and I have
quite a number of them that perform complex, slow queries, I'd really
like to see optimizations in this area. Until such a time, I would very
much appreciate any workaround suggestions.
