stable function optimizations, revisited - Mailing list pgsql-performance
From: Phil Frost
Subject: stable function optimizations, revisited
Msg-id: 20060512175908.GA30515@unununium.org
List: pgsql-performance
I have recently been encountering a number of significant performance problems related to stable functions being called multiple times when I believe they could be called just once. Searching the ML archives, I see I'm not the first:

<http://archives.postgresql.org/pgsql-hackers/2003-04/msg00890.php>
<http://archives.postgresql.org/pgsql-performance/2006-01/msg00140.php>

and so on. None of them seemed to resolve to a plan of action or elegant workaround. It is mentioned that "stable" was added to allow such functions to be used for index scans, but I could not tell whether the other optimizations I would like are impossible, or possible but unlikely ever to be implemented.

I have several examples of queries in which eliminating extra calls to a stable function would result in very significant performance gains. All of these cases were found while developing a real application, and although I've simplified them to be more readable, they are not contrived.

Problem 1: creating a view with multiple columns calculated from a function.

    create table sale(saleid serial, total numeric);

    create function cost_of_sale(sale.saleid%type) returns numeric stable as $$
        -- calculates the cost of purchasing the things sold in a sale
        -- takes considerable time to calculate
    $$;

    create view convenient_view_on_sale as
        select *,
            cost_of_sale(saleid) as cost,
            total - cost_of_sale(saleid) as profit,
            case when total != 0
                then (total - cost_of_sale(saleid)) / total * 100
            end as margin
        from sale;

Executing "select * from convenient_view_on_sale limit 1" will execute cost_of_sale thrice. However, from the definition of stable, we know it could have been called just once. As cost_of_sale takes hundreds of ms to execute while the rest of the query is extremely simple, the additional calls in effect multiply the total execution time.
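The optimization I'm hoping for is essentially per-statement memoization of the stable function. A minimal client-side sketch in Python of what that would look like (the function body and its fixed return value are hypothetical stand-ins for the expensive query):

```python
import functools

@functools.lru_cache(maxsize=None)
def cost_of_sale(saleid):
    """Stand-in for the expensive stable function. The cache plays the
    role the planner could play: within one statement, each distinct
    argument causes only one evaluation of the underlying function."""
    cost_of_sale.calls += 1
    return 150  # hypothetical fixed cost, for illustration only

cost_of_sale.calls = 0

# Three references to cost_of_sale(1), as in the view's
# cost, profit, and margin columns:
total = 200
cost = cost_of_sale(1)
profit = total - cost_of_sale(1)
margin = (total - cost_of_sale(1)) / total * 100

print(cost_of_sale.calls)  # prints 1: the function body ran once, not thrice
```

With the cache, the three column expressions cost one evaluation instead of three, which is exactly the saving the "stable" contract seems to permit.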
Nonsolution 1a: moving the function to a subselect:

    create view convenient_view_on_sale as
        select *,
            total - cost as profit,
            case when total != 0 then (total - cost) / total * 100 end as margin
        from (select *, cost_of_sale(saleid) as cost from sale) as subq;

The query planner will eliminate the subselect, and cost_of_sale will still be executed thrice. I can observe no change in behaviour whatsoever with this view definition. PS: I wonder what the behaviour would be if I explicitly inlined cost_of_sale here?

Nonsolution 1b: preventing optimization of the subselect with "offset 0":

    create view convenient_view_on_sale as
        select *,
            total - cost as profit,
            case when total != 0 then (total - cost) / total * 100 end as margin
        from (select *, cost_of_sale(saleid) as cost from sale offset 0) as subq;

This helps in the case of a "select *": the subquery will not be eliminated due to the "offset 0", and cost_of_sale will be executed only once. However, it will always be executed, even if none of the cost-related columns are selected. For example, "select saleid from convenient_view_on_sale limit 1" will execute cost_of_sale once, although it could have not been executed at all.

Problem 1 has a workaround: perform the dependent calculations (profit and margin in this case) on the client, or in a stored procedure. This is often inconvenient, but it works.

Problem 2: expensive functions returning composite types.

Consider that the purchases for a sale might not yet have been made, so the exact cost can not be known, but a guess can be made based on the current prices. cost_of_sale might be updated to reflect this:

    create function cost_of_sale(sale.saleid%type, out cost numeric, out estimated bool)
        stable as $$ ... $$;

    create view convenient_view_on_sale as
        select *, cost_of_sale(saleid) from sale;

Note that in many cases, calculating "cost" and "estimated" together takes just as long as calculating either one alone. This is why both are returned from the same function.
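The client-side workaround for Problem 1 can be sketched in Python. Only cost is selected from the database (so cost_of_sale runs once per row), and profit and margin are derived locally; the row dict below is a hypothetical stand-in for a psycopg2 result row:

```python
from decimal import Decimal

def derive_columns(row):
    """Given a row dict containing 'total' and 'cost', compute the
    dependent columns client-side instead of in the view."""
    total, cost = row["total"], row["cost"]
    profit = total - cost
    # Mirror the SQL CASE: margin is None (NULL) when total = 0.
    margin = (profit / total * Decimal(100)) if total != 0 else None
    return {**row, "profit": profit, "margin": margin}

# Hypothetical row as it might come back from
# "select saleid, total, cost_of_sale(saleid) as cost from sale":
row = {"saleid": 1, "total": Decimal("200.00"), "cost": Decimal("150.00")}
print(derive_columns(row))
```

The trade-off is the inconvenience noted above: every client (or a wrapper stored procedure) has to repeat the profit/margin arithmetic instead of reading it from the view.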
Now, I use python as a client, in particular the psycopg2 module. When I do something such as "select cost from convenient_view_on_sale", the values returned for the cost column (a composite type (numeric, bool)) are strings. Perhaps this is an issue with psycopg2, but as a user, this is very annoying, since I can not really get at the components of the composite type without reimplementing pg's parser. Granted, I could probably do it simply in a way that would work most of the time, but I feel it would be error prone, and I'd rather not.

Thus, I seek a way to get the components of the cost column as top-level columns. For example, I try "select (cost).cost, (cost).estimated", but this now executes cost_of_sale twice, doubling the time of my query.

Since stable functions are the most common in my experience, and I have quite a number of them that perform complex, slow queries, I'd really like to see optimizations in this area. Until such a time, I would very much appreciate any workaround suggestions.
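As a stopgap for the composite-as-string problem, a minimal Python parser for this particular two-field case is sketched below. It assumes neither field needs quoting, which holds for numeric and bool output but not for composite types in general (pg's full quoting rules are exactly what I'd rather not reimplement):

```python
from decimal import Decimal

def parse_cost(text):
    """Parse a composite literal like '(12.34,t)' returned as a string
    for the (numeric, bool) output of cost_of_sale.
    Minimal sketch: no support for quoted or escaped fields."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("not a composite literal: %r" % text)
    cost_s, estimated_s = text[1:-1].split(",")
    # An empty field inside the parentheses represents SQL NULL.
    cost = Decimal(cost_s) if cost_s else None
    estimated = (estimated_s == "t") if estimated_s else None
    return cost, estimated

print(parse_cost("(12.34,t)"))  # (Decimal('12.34'), True)
```

This avoids the "select (cost).cost, (cost).estimated" form, so cost_of_sale runs only once per row, at the price of fragile string handling on the client.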