Thread: Suboptimal evaluation of CASE expressions
Hi,

when dealing with a bug in the postgresql-plr interface I think I found a
suboptimal method to process CASE statements. First, the problem: I'm using
the Debian packaged version of PL/R version 0.6.2-2 (Debian testing) and
found a problem calculating the median of a set of values that contains only
NULL values. The problem becomes clear if you look at the following example:

-------------------------------------------------------------------------
$ psql -t test

create table plrtest(
    nonnull     numeric not null,
    mightbenull numeric,
    flag        int);

insert into plrtest values(42.0, 42.0, 1);
insert into plrtest values(17.0, 17.0, 1);
insert into plrtest values(23.0, NULL, 0);
insert into plrtest values(4711.0, 4711.0, 1);
insert into plrtest values(174.0, NULL, 0);

CREATE OR REPLACE FUNCTION plr_call_handler()
RETURNS LANGUAGE_HANDLER
AS '$libdir/plr' LANGUAGE C;

CREATE LANGUAGE plr HANDLER plr_call_handler;

CREATE OR REPLACE FUNCTION r_median(_numeric)
RETURNS numeric
AS 'median(arg1)' LANGUAGE 'plr';

CREATE OR REPLACE FUNCTION plr_array_accum (_numeric, numeric)
RETURNS numeric[]
AS '$libdir/plr', 'plr_array_accum' LANGUAGE 'C';

CREATE AGGREGATE median (
    sfunc     = plr_array_accum,
    basetype  = numeric,
    stype     = _numeric,
    finalfunc = r_median
);

SELECT median(nonnull) from plrtest;
42

SELECT median(mightbenull) from plrtest;
42

SELECT median(nonnull), median(mightbenull) from plrtest where flag = 0;
ERROR:  R interpreter expression evaluation error
DETAIL:  Error in median(arg1) : need numeric data
CONTEXT:  In PL/R function r_median
-------------------------------------------------------------------------

I would expect NULL as the result of the last query.
So I thought I would check in a CASE statement whether the column contains
only NULL values, using max(mightbenull), like this:

# SELECT CASE WHEN max(mightbenull) IS NULL THEN 0
         ELSE median(mightbenull) END
  from plrtest where flag = 0;
ERROR:  R interpreter expression evaluation error
DETAIL:  Error in median(arg1) : need numeric data
CONTEXT:  In PL/R function r_median

The problem I want to discuss here is the following: usually in programming
languages only one branch of an IF-THEN-ELSE statement is evaluated. But
here *both* branches are evaluated (obviously, because of the error that
occurs). Even if we forget that my goal was to circumvent the error by some
hack: if there is some kind of complex query in the ELSE branch, its
evaluation would just cost extra processing time with no need. I would
regard this as a bug.

Kind regards

        Andreas.

PS: Please CC me. I'm not subscribed.

-- 
http://fam-tille.de
On Tue, Apr 11, 2006 at 04:43:33PM +0200, Andreas Tille wrote:
> Hi,
>
> when dealing with a bug in the postgresql-plr interface I think
> I found a suboptimal method to process CASE statements. First
> to the problem:

<snip>

> SELECT median(nonnull), median(mightbenull) from plrtest where flag = 0;
> ERROR:  R interpreter expression evaluation error
> DETAIL:  Error in median(arg1) : need numeric data
> CONTEXT:  In PL/R function r_median

Because there were no non-null rows, the system passed a NULL to the
final func. Seems you have two ways of dealing with this: mark the
finalfunc as STRICT so the system won't call it with NULL, or give the
aggregate an INITCOND which is an empty array. This would also avoid
the NULL.

> I would expect NULL as result of the last query.
>
> So I thought I will verify in a CASE statement whether there
> are only NULL values in the column by max(mightbenull) like this:

<snip>

> The problem I want to discuss here is the following: Usually in
> programming languages only one branch of the IF-THEN-ELSE statement
> will be calculated. But here *both* branches are calculated
> (obviously because of the error that occurs). If we just forget

Usually in programming languages, but not in SQL.

> that my goal was to circumvent the error by some hack, I think
> if there is some kind of complex query in the ELSE branch that
> calculation would just cost extra processing time with no need.
> I would regard this as a bug.

The problem in your example is that you're using aggregates in the CASE
statement, which means that as each row is processed, the aggregates
need to be calculated. It can't shortcut, because if it first calculated
the max() and then the median() it would have to evaluate the entire
query twice.

In the general case, PostgreSQL *may* avoid calculating redundant
clauses if it doesn't need to, but you can't rely on it. Fixing your
underlying issue with the aggregate should solve everything for you.
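[Editor's note: for concreteness, Martijn's two suggestions could look roughly like this, sketched against the definitions from the original post; this has not been tested against PL/R 0.6.2.]

```sql
-- Option 1: recreate the finalfunc as STRICT.  A STRICT function is
-- never called with a NULL argument; the result is simply NULL, which
-- is what you'd expect from median() over zero non-null values.
CREATE OR REPLACE FUNCTION r_median(_numeric)
RETURNS numeric
AS 'median(arg1)' LANGUAGE 'plr' STRICT;

-- Option 2: give the aggregate an empty array as initial state, so the
-- finalfunc always receives an array, never NULL.  (The aggregate has
-- to be dropped and recreated; CREATE AGGREGATE has no OR REPLACE.)
DROP AGGREGATE median (numeric);
CREATE AGGREGATE median (
    sfunc     = plr_array_accum,
    basetype  = numeric,
    stype     = _numeric,
    finalfunc = r_median,
    initcond  = '{}'
);
```

Note that with option 2 the R median() still runs, just on an empty vector, so option 1 is probably the simpler fix here.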
Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Martijn van Oosterhout <kleptog@svana.org> writes:
> The problem in your example is that you're using aggregates in the case
> statement.

Yeah. The aggregate results are all computed before we start to evaluate
the SELECT output list --- the fact that the aggregate is referenced
within a CASE doesn't save you if the aggregate's finalfunc fails.

We could maybe change things so that the finalfunc isn't run unless the
result value is actually demanded in the SELECT list or HAVING clause,
but for 99.99% of applications checking that would be a waste of cycles,
so I'm disinclined to do it. As Martijn said, really you want to fix
the finalfunc so that it behaves sanely in corner cases. An aggregate
that fails on zero rows needs work, period.

			regards, tom lane
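[Editor's note: Tom's point can be reproduced without PL/R at all. In this sketch (table name and data invented for illustration), the guard would steer evaluation into the THEN arm, yet the query still fails, because aggregate arguments are evaluated row by row during the aggregation pass, before the CASE is ever considered.]

```sql
CREATE TEMP TABLE t (f1 int);
INSERT INTO t VALUES (0);

-- max(f1) = 0 is true here, so a lazy evaluator would return -1.
-- But min()'s argument 1/f1 is computed for every input row while
-- the aggregates are being accumulated, so PostgreSQL raises
-- "division by zero" before the CASE expression runs at all.
SELECT CASE WHEN max(f1) = 0 THEN -1
            ELSE min(1 / f1)
       END
FROM t;
```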
On Tue, 11 Apr 2006, Martijn van Oosterhout wrote:

> Because there were no non-null rows, the system passed a NULL to the
> final func. Seems you have two ways of dealing with this. Mark the
> finalfunc as STRICT so the system won't call it with NULL. Or give the
> aggregate an INITCOND which is an empty array. This would also avoid
> the NULL.

Ah. Thanks, this might help with the original problem.

> The problem in your example is that you're using aggregates in the case
> statement. Which means that as each row is processed, the aggregates
> need to be calculated. It can't shortcut because if it first calculated
> the max() and then the median() it would have to evaluate the entire
> query twice.

Ah, this sounds reasonable. So my assumption might have been wrong.

> In the general case, PostgreSQL *may* avoid calculating redundant
> clauses if it doesn't need to, but you can't rely on it.

Just theoretically speaking: wouldn't it make sense to enforce avoiding
this calculation?

> Fixing your underlying issue with the aggregate should solve everything
> for you.

Sure. I hope that I was able to trigger some ideas about optimisation
anyway.

Thanks for the quick help

        Andreas.

-- 
http://fam-tille.de
On Tue, 11 Apr 2006, Tom Lane wrote:

> We could maybe change things so that the finalfunc isn't run unless the
> result value is actually demanded in the SELECT list or HAVING clause,
> but for 99.99% of applications checking that would be a waste of cycles,
> so I'm disinclined to do it.

I'm lacking experience here, so I perfectly trust your judgement that the
default case should stay as it is. The question is whether adding an
option to change the default might make sense.

> As Martijn said, really you want to fix
> the finalfunc so that it behaves sanely in corner cases. An aggregate
> that fails on zero rows needs work, period.

Fully ACK. As I hopefully made clear, I just used it as a sign / proof
that something works differently than I would regard as reasonable
(before I understood the problem with the aggregate).

Kind regards

        Andreas.

-- 
http://fam-tille.de
On Tue, Apr 11, 2006 at 11:22:53PM +0200, Andreas Tille wrote:
> On Tue, 11 Apr 2006, Tom Lane wrote:
>
> >We could maybe change things so that the finalfunc isn't run unless the
> >result value is actually demanded in the SELECT list or HAVING clause,
> >but for 99.99% of applications checking that would be a waste of cycles,
> >so I'm disinclined to do it.
>
> I'm lacking experience here so I perfectly trust you that keeping
> the default case as it is. The question is, whether adding an
> option to change the default might make sense.

Can you give an example of a simple case where PostgreSQL doesn't do
this already? For the really obvious cases without aggregates, it works
already:

test=# select case when true then 5 else 1/0 end;
 case
------
    5
(1 row)

test=# select case when false then 5 else 1/0 end;
ERROR:  division by zero

What we're saying is that as long as the SQL standard doesn't require
it, we're not going to write large chunks of code to avoid a small
amount of processing that nobody is going to notice anyway. I.e. you
can't *rely* on this behaviour, but improvement is merely an
optimisation, not a feature or a bug fix.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
Andreas Tille <tillea@rki.de> writes: > I'm lacking experience here so I perfectly trust you that keeping > the default case as it is. The question is, whether adding an > option to change the default might make sense. I don't think so. The current API contract for aggregate functions is that the finalfunc will be called exactly once per aggregate occurrence. Changing that to say "maybe you'll get called and maybe you won't" does not strike me as making life easier for aggregate authors, rather the reverse --- for example, the finalfunc might need to clean up some working state. regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Tue, Apr 11, 2006 at 11:22:53PM +0200, Andreas Tille wrote:
>> I'm lacking experience here so I perfectly trust you that keeping
>> the default case as it is. The question is, whether adding an
>> option to change the default might make sense.

> Can you give an example of a simple case where PostgreSQL doesn't do
> this already.

I think we might be confusing each other with varying meanings for the
word "case" ;-). The facts as I see them are:

1. The CASE expression does indeed not evaluate unneeded subexpressions.

2. However, aggregate functions are evaluated in a separate pass before
we start to evaluate the SELECT's output list (or the HAVING clause if
any). So you cannot use a CASE to suppress evaluation of an aggregate's
finalfunc ... much less its state transition function.

3. There are other situations where a CASE might "not work" to suppress
contained evaluations. For instance, this example is pretty misleading:

> test=# select case when true then 5 else 1/0 end;
>  case
> ------
>     5
> (1 row)

> test=# select case when false then 5 else 1/0 end;
> ERROR:  division by zero

A counterexample is:

regression=# select f1, case when true then 5 else 1/0 end from int4_tbl;
     f1      | case
-------------+------
           0 |    5
      123456 |    5
     -123456 |    5
  2147483647 |    5
 -2147483647 |    5
(5 rows)

regression=# select f1, case when f1 <> 42 then 5 else 1/0 end from int4_tbl;
ERROR:  division by zero
regression=#

The reason the latter fails is that constant-folding encounters the 1/0
before we actually start to run the SELECT. The first three examples
work only because the WHEN clause is a plan-time constant and so the
constant folder never reaches the ELSE clause.

I'm not really inclined to remove the constant folder just to make the
world safe for silly examples like this, so the bottom line is that you
have to be aware of there being multiple passes of evaluation.

			regards, tom lane
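[Editor's note: a footnote not from the thread itself. When an ELSE arm contains a division that constant folding might trip over, wrapping the divisor in NULLIF sidesteps the error, since x / NULL yields NULL rather than raising anything. This sketch reuses Tom's int4_tbl from the regression database.]

```sql
-- NULLIF(0, 0) is NULL, and 1 / NULL is NULL, so neither the constant
-- folder nor the executor can hit "division by zero" here; the query
-- simply returns NULL in the ELSE arm instead of failing.
SELECT f1,
       CASE WHEN f1 <> 42 THEN 5
            ELSE 1 / NULLIF(0, 0)
       END
FROM int4_tbl;
```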