Thread: Suboptimal evaluation of CASE expressions

Suboptimal evaluation of CASE expressions

From
Andreas Tille
Date:
Hi,

when dealing with a bug in the postgresql-plr interface I think
I found a suboptimal method to process CASE statements.  First
to the problem:

I'm using the Debian packaged version of PLR version 0.6.2-2 (Debian
testing) and found a problem calculating median from a set of
values that contain only NULL values.  The problem becomes clear
if you look at the following example:

-------------------------------------------------------------------------
$ psql -t test

create table plrtest( nonnull numeric not null, mightbenull numeric, flag int);
insert into plrtest values(42.0, 42.0, 1);
insert into plrtest values(17.0, 17.0, 1);
insert into plrtest values(23.0, NULL, 0 );
insert into plrtest values(4711.0, 4711.0, 1);
insert into plrtest values(174.0, NULL, 0);

CREATE OR REPLACE FUNCTION plr_call_handler()
RETURNS LANGUAGE_HANDLER
AS '\$libdir/plr' LANGUAGE C;

CREATE LANGUAGE plr HANDLER plr_call_handler;

create or replace function r_median(_numeric) returns numeric as 'median(arg1)' 
language 'plr';

CREATE OR REPLACE FUNCTION r_median(_numeric) returns numeric as '  median(arg1)
' language 'plr';

CREATE OR REPLACE FUNCTION plr_array_accum (_numeric, numeric)
RETURNS numeric[]
AS '\$libdir/plr','plr_array_accum'
LANGUAGE 'C';


CREATE AGGREGATE median (  sfunc = plr_array_accum,  basetype = numeric,  stype = _numeric,  finalfunc = r_median
);

SELECT median(nonnull) from plrtest;     42

SELECT median(mightbenull) from plrtest;     42

SELECT median(nonnull), median(mightbenull) from plrtest where flag = 0;
ERROR:  R interpreter expression evaluation error
DETAIL:  Error in median(arg1) : need numeric data
CONTEXT:  In PL/R function r_median

-------------------------------------------------------------------------

I would expect NULL as result of the last query.

So I thought I will verify in a CASE statement whether there
are only NULL values in the column by max(mightbenull) like this:

# SELECT CASE WHEN max(mightbenull) IS NULL THEN 0 ELSE median(mightbenull) END from plrtest where flag = 0;
ERROR:  R interpreter expression evaluation error
DETAIL:  Error in median(arg1) : need numeric data
CONTEXT:  In PL/R function r_median

The problem I want to discuss here is the following:  Usually in
programming languages only one branch of the IF-THEN-ELSE statement
will be calculated.  But here *both* branches are calculated
(obviousely because of the error that occures).  If we just forget
that my goal was to circumvent the error by some hack, I think
if there is some kind of complex query in the ELSE branche that
calculation would just cost extra processing time with no need.
I would regard this as a bug.

Kind regards
      Andreas.

PS: Please CC me. I'm not subscribed.
-- 
http://fam-tille.de


Re: Suboptimal evaluation of CASE expressions

From
Martijn van Oosterhout
Date:
On Tue, Apr 11, 2006 at 04:43:33PM +0200, Andreas Tille wrote:
> Hi,
>
> when dealing with a bug in the postgresql-plr interface I think
> I found a suboptimal method to process CASE statements.  First
> to the problem:

<snip>
> SELECT median(nonnull), median(mightbenull) from plrtest where flag = 0;
> ERROR:  R interpreter expression evaluation error
> DETAIL:  Error in median(arg1) : need numeric data
> CONTEXT:  In PL/R function r_median

Because there were no non-null rows, the system passed a NULL to the
final func. Seems you have two ways of dealing with this. Mark the
finalfunc as STRICT so the system won't call it with NULL. Or give the
agrregate an INITCOND which is an empty array. This would also avoid
the NULL.

> I would expect NULL as result of the last query.
>
> So I thought I will verify in a CASE statement whether there
> are only NULL values in the column by max(mightbenull) like this:

<snip>

> The problem I want to discuss here is the following:  Usually in
> programming languages only one branch of the IF-THEN-ELSE statement
> will be calculated.  But here *both* branches are calculated
> (obviousely because of the error that occures).  If we just forget

Usually in programming languages, but not in SQL.

> that my goal was to circumvent the error by some hack, I think
> if there is some kind of complex query in the ELSE branche that
> calculation would just cost extra processing time with no need.
> I would regard this as a bug.

The problem in your example is that you're using aggrgates in the case
statement. Which means that as each row is processed, the aggregates
need to be calculated. It can't shortcut because if it first calculated
the max() and then the median() it would have to evaluate the entire
query twice.

In the general case, PostgreSQL *may* avoid calculating redundant
clauses if it doesn't need to, but you can't rely on it.

Fixing your underlying issue with the aggregate should solve everything
for you.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Suboptimal evaluation of CASE expressions

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> The problem in your example is that you're using aggrgates in the case
> statement.

Yeah.  The aggregate results are all computed before we start to evaluate
the SELECT output list --- the fact that the aggregate is referenced
within a CASE doesn't save you if the aggregate's finalfunc fails.

We could maybe change things so that the finalfunc isn't run unless the
result value is actually demanded in the SELECT list or HAVING clause,
but for 99.99% of applications checking that would be a waste of cycles,
so I'm disinclined to do it.  As Martijn said, really you want to fix
the finalfunc so that it behaves sanely in corner cases.  An aggregate
that fails on zero rows needs work, period.
        regards, tom lane


Re: Suboptimal evaluation of CASE expressions

From
Andreas Tille
Date:
On Tue, 11 Apr 2006, Martijn van Oosterhout wrote:

> Because there were no non-null rows, the system passed a NULL to the
> final func. Seems you have two ways of dealing with this. Mark the
> finalfunc as STRICT so the system won't call it with NULL. Or give the
> agrregate an INITCOND which is an empty array. This would also avoid
> the NULL.

Ah.  Thanks, this might help for the original problem.

> The problem in your example is that you're using aggrgates in the case
> statement. Which means that as each row is processed, the aggregates
> need to be calculated. It can't shortcut because if it first calculated
> the max() and then the median() it would have to evaluate the entire
> query twice.

A this sounds be reasonable.  So my assumption might have been wrong.

> In the general case, PostgreSQL *may* avoid calculating redundant
> clauses if it doesn't need to, but you can't rely on it.

Just theoretically spoken: Woouldn't it make sense to enforce to
avoid this calculation.

> Fixing your underlying issue with the aggregate should solve everything
> for you.

Sure.  I hope that I was able to trigger some ideas about optimisation
anyway.

Thanks for the quick help
         Andreas.

-- 
http://fam-tille.de


Re: Suboptimal evaluation of CASE expressions

From
Andreas Tille
Date:
On Tue, 11 Apr 2006, Tom Lane wrote:

> We could maybe change things so that the finalfunc isn't run unless the
> result value is actually demanded in the SELECT list or HAVING clause,
> but for 99.99% of applications checking that would be a waste of cycles,
> so I'm disinclined to do it.

I'm lacking experience here so I perfectly trust you that keeping
the default case as it is.  The question is, whether adding an
option to change the default might make sense.

> As Martijn said, really you want to fix
> the finalfunc so that it behaves sanely in corner cases.  An aggregate
> that fails on zero rows needs work, period.

Fully ACK.  As I hopefully made clear I just used it as a sign /
proof, that something works differently than I would regard reasonable
(before I understand the problem with the aggregate).

Kind regards
          Andreas.

-- 
http://fam-tille.de


Re: Suboptimal evaluation of CASE expressions

From
Martijn van Oosterhout
Date:
On Tue, Apr 11, 2006 at 11:22:53PM +0200, Andreas Tille wrote:
> On Tue, 11 Apr 2006, Tom Lane wrote:
>
> >We could maybe change things so that the finalfunc isn't run unless the
> >result value is actually demanded in the SELECT list or HAVING clause,
> >but for 99.99% of applications checking that would be a waste of cycles,
> >so I'm disinclined to do it.
>
> I'm lacking experience here so I perfectly trust you that keeping
> the default case as it is.  The question is, whether adding an
> option to change the default might make sense.

Can you give an example of a simple case where PostgreSQL doesn't do
this already. For the really obvious cases without aggregates, it works
already:

test=# select case when true then 5 else 1/0 end;case
------   5
(1 row)

test=# select case when false then 5 else 1/0 end;
ERROR:  division by zero

What we're saying is that as long as the SQL standard doesn't require
it, we're not going to write large chunks of code to avoid a small
amount of processing that nobody is going to notice anyway. i.e. you
can't *rely* on this behaviour, but improvement is merely an
optimisation, not a feature or a bug fix.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Suboptimal evaluation of CASE expressions

From
Tom Lane
Date:
Andreas Tille <tillea@rki.de> writes:
> I'm lacking experience here so I perfectly trust you that keeping
> the default case as it is.  The question is, whether adding an
> option to change the default might make sense.

I don't think so.  The current API contract for aggregate functions is
that the finalfunc will be called exactly once per aggregate occurrence.
Changing that to say "maybe you'll get called and maybe you won't" does
not strike me as making life easier for aggregate authors, rather the
reverse --- for example, the finalfunc might need to clean up some
working state.
        regards, tom lane


Re: Suboptimal evaluation of CASE expressions

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> --f2QGlHpHGjS2mn6Y
> On Tue, Apr 11, 2006 at 11:22:53PM +0200, Andreas Tille wrote:
>> I'm lacking experience here so I perfectly trust you that keeping
>> the default case as it is.  The question is, whether adding an
>> option to change the default might make sense.

> Can you give an example of a simple case where PostgreSQL doesn't do
> this already.

I think we might be confusing each other with varying meanings for the
word "case" ;-).  The facts as I see them are:

1. The CASE expression does indeed not evaluate unneeded subexpressions.

2. However, aggregate functions are evaluated in a separate pass before
we start to evaluate the SELECT's output list (or the HAVING clause if
any).  So you cannot use a CASE to suppress evaluation of an aggregate's
finalfunc ... much less its state transition function.

3. There are other situations where a CASE might "not work" to suppress
contained evaluations.  For instance, this example is pretty misleading:

> test=3D# select case when true then 5 else 1/0 end;
>  case=20
> ------
>     5
> (1 row)

> test=3D# select case when false then 5 else 1/0 end;
> ERROR:  division by zero

A counterexample is:

regression=# select f1, case when true then 5 else 1/0 end from int4_tbl;    f1      | case
-------------+------          0 |    5     123456 |    5    -123456 |    5 2147483647 |    5-2147483647 |    5
(5 rows)

regression=# select f1, case when f1 <> 42 then 5 else 1/0 end from int4_tbl;
ERROR:  division by zero
regression=#

The reason the latter fails is that constant-folding encounters the 1/0
before we actually start to run the SELECT.  The first three examples
work only because the WHEN clause is a plan-time constant and so the
constant folder never reaches the ELSE clause.

I'm not really inclined to remove the constant folder just to make the
world safe for silly examples like this, so the bottom line is that you
have to be aware of there being multiple passes of evaluation.
        regards, tom lane