Thread: Invalid optimization of VOLATILE function in WHERE clause?

Invalid optimization of VOLATILE function in WHERE clause?

From
Florian.Schoppmann@emc.com (Florian Schoppmann)
Date:
Hi all,

In PostgreSQL 9.1 and 9.2 (possibly also in earlier versions), the query

--8<--
WITH source AS (   SELECT i FROM generate_series(1,10) AS i
)
SELECT   i
FROM   source, (       SELECT           count(*) AS _n       FROM source   ) AS _stats
WHERE   random() < 5::DOUBLE PRECISION/_n;
-->8--

translates into the following query plan:

--8<--Nested Loop  (cost=35.00..65.03 rows=1000 width=4)  CTE source    ->  Function Scan on generate_series i
(cost=0.00..10.00rows=1000        width=4)  ->  Aggregate  (cost=25.00..25.02 rows=1 width=0)        Filter: (random()
<(5::double precision / (count(*))::double        precision))        ->  CTE Scan on source  (cost=0.00..20.00
rows=1000width=0)  ->  CTE Scan on source  (cost=0.00..20.00 rows=1000 width=4)
 
-->8--

In other words, the query either gives exactly 0 or 10 rows, and both
cases happen with probability 0.5. Naturally, I would have expected
instead that each row is sampled independently with probability 0.5.

Since random() is volatile, so is the whole where-expression. So I
wonder why the condition is pushed down to the lowest level, given that
this changes results. Is this behavior correct, i.e., specified
somewhere? Or is this a bug?

Florian




Re: Invalid optimization of VOLATILE function in WHERE clause?

From
Tom Lane
Date:
Florian.Schoppmann@emc.com (Florian Schoppmann) writes:
> In PostgreSQL 9.1 and 9.2 (possibly also in earlier versions), the query

> --8<--
> WITH source AS (
>     SELECT i FROM generate_series(1,10) AS i
> )
> SELECT
>     i
> FROM
>     source, (
>         SELECT
>             count(*) AS _n
>         FROM source
>     ) AS _stats
> WHERE
>     random() < 5::DOUBLE PRECISION/_n;
> -->8--

[ doesn't do what you think it should ]

I can't get excited about this.  Any time you put a volatile function
into WHERE, you're playing with fire.  The docs warn against it:
http://www.postgresql.org/docs/9.2/static/sql-expressions.html#SYNTAX-EXPRESS-EVAL

To do what you want, I'd suggest wrapping the join into a sub-select
with an "OFFSET 0" clause, which will serve as an optimization fence
that prevents the random() call from being pushed down.
        regards, tom lane



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
Merlin Moncure
Date:
On Wed, Sep 19, 2012 at 9:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Florian.Schoppmann@emc.com (Florian Schoppmann) writes:
>> In PostgreSQL 9.1 and 9.2 (possibly also in earlier versions), the query
>
>> --8<--
>> WITH source AS (
>>     SELECT i FROM generate_series(1,10) AS i
>> )
>> SELECT
>>     i
>> FROM
>>     source, (
>>         SELECT
>>             count(*) AS _n
>>         FROM source
>>     ) AS _stats
>> WHERE
>>     random() < 5::DOUBLE PRECISION/_n;
>> -->8--
>
> [ doesn't do what you think it should ]
>
> I can't get excited about this.  Any time you put a volatile function
> into WHERE, you're playing with fire.  The docs warn against it:
> http://www.postgresql.org/docs/9.2/static/sql-expressions.html#SYNTAX-EXPRESS-EVAL
>
> To do what you want, I'd suggest wrapping the join into a sub-select
> with an "OFFSET 0" clause, which will serve as an optimization fence
> that prevents the random() call from being pushed down.

I like the more standard CTE approach to optimization fencing where it works:

postgres=# WITH source AS (   SELECT i, random() r   FROM generate_series(1,10) AS i
)
SELECT   i
FROM   source, (       SELECT           count(*) AS _n       FROM source   ) AS _stats
WHERE   r < 5::DOUBLE PRECISION/_n;

merlin



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
Robert Haas
Date:
On Wed, Sep 19, 2012 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Florian.Schoppmann@emc.com (Florian Schoppmann) writes:
>> In PostgreSQL 9.1 and 9.2 (possibly also in earlier versions), the query
>
>> --8<--
>> WITH source AS (
>>     SELECT i FROM generate_series(1,10) AS i
>> )
>> SELECT
>>     i
>> FROM
>>     source, (
>>         SELECT
>>             count(*) AS _n
>>         FROM source
>>     ) AS _stats
>> WHERE
>>     random() < 5::DOUBLE PRECISION/_n;
>> -->8--
>
> [ doesn't do what you think it should ]
>
> I can't get excited about this.  Any time you put a volatile function
> into WHERE, you're playing with fire.  The docs warn against it:
> http://www.postgresql.org/docs/9.2/static/sql-expressions.html#SYNTAX-EXPRESS-EVAL
>
> To do what you want, I'd suggest wrapping the join into a sub-select
> with an "OFFSET 0" clause, which will serve as an optimization fence
> that prevents the random() call from being pushed down.

You've repeatedly objected to complaints on pgsql-performance on the
grounds that WITH is an optimization fence.  It seems awfully
inconsistent to turn around and say, oh, sometimes it's not a fence
after all.  It seems that users may not rely on WITH either to do the
optimizations necessary to have good performance or to fail to do
optimizations that lead to wrong results.  Ouch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Sep 19, 2012 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> To do what you want, I'd suggest wrapping the join into a sub-select
>> with an "OFFSET 0" clause, which will serve as an optimization fence
>> that prevents the random() call from being pushed down.

> You've repeatedly objected to complaints on pgsql-performance on the
> grounds that WITH is an optimization fence.  It seems awfully
> inconsistent to turn around and say, oh, sometimes it's not a fence
> after all.

Huh?  The join in question is not inside a WITH.  If it were, that
would work too, as noted by Merlin.
        regards, tom lane



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
Robert Haas
Date:
On Wed, Sep 19, 2012 at 12:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Sep 19, 2012 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> To do what you want, I'd suggest wrapping the join into a sub-select
>>> with an "OFFSET 0" clause, which will serve as an optimization fence
>>> that prevents the random() call from being pushed down.
>
>> You've repeatedly objected to complaints on pgsql-performance on the
>> grounds that WITH is an optimization fence.  It seems awfully
>> inconsistent to turn around and say, oh, sometimes it's not a fence
>> after all.
>
> Huh?  The join in question is not inside a WITH.  If it were, that
> would work too, as noted by Merlin.

Oh, hmm.  I see now: the problem isn't that random() is being pushed
into the WITH, it's that it's being pushed into the join.  Sorry, I
should have read that more carefully.

It still seems like awfully weird behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> It still seems like awfully weird behavior.

Why?  The WHERE condition relates only to the output of the _stats
subquery, so why shouldn't it be evaluated there, rather than after
the join?

In a green field I might agree that we should de-optimize such cases,
but the problem with doing so is that it would totally destroy
performance for cases in which a user has defined a function that's
actually stable or immutable but they forgot to mark it so.  If
VOLATILE weren't the default marking, such a change wouldn't be so
problematic ... but it is.  Given that the behavior has been like
this since the late stone age, I'm not inclined to change it.
        regards, tom lane



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
Robert Haas
Date:
On Wed, Sep 19, 2012 at 1:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> It still seems like awfully weird behavior.
>
> Why?  The WHERE condition relates only to the output of the _stats
> subquery, so why shouldn't it be evaluated there, rather than after
> the join?

Because my mental model (and apparently that of the OP) is that the
WHERE clause gets evaluated separately for each row.  Obviously in
many cases that can be optimized without changing the results, but not
in this case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> It still seems like awfully weird behavior.
> 
> Why?  The WHERE condition relates only to the output of the _stats
> subquery, so why shouldn't it be evaluated there, rather than
> after the join?

In another thread, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's easier to understand why this is if you realize that SQL has
> a very clear model of a "pipeline" of query execution. 
> Conceptually, what happens is:
> 
> 1. Form the cartesian product of the tables listed in FROM (ie,
> all combinations of rows).
> 
> 2. Apply the WHERE condition to each row from 1, and drop rows
> that don't pass it.
People expect that the results will be consistent with this model,
even if the implementation is optimized "under the covers".  I think
correct semantics should trump performance here.
-Kevin



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
"ktm@rice.edu"
Date:
On Wed, Sep 19, 2012 at 02:39:12PM -0500, Kevin Grittner wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> It still seems like awfully weird behavior.
> > 
> > Why?  The WHERE condition relates only to the output of the _stats
> > subquery, so why shouldn't it be evaluated there, rather than
> > after the join?
> 
> In another thread, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > It's easier to understand why this is if you realize that SQL has
> > a very clear model of a "pipeline" of query execution. 
> > Conceptually, what happens is:
> > 
> > 1. Form the cartesian product of the tables listed in FROM (ie,
> > all combinations of rows).
> > 
> > 2. Apply the WHERE condition to each row from 1, and drop rows
> > that don't pass it.
>  
> People expect that the results will be consistent with this model,
> even if the implementation is optimized "under the covers".  I think
> correct semantics should trump performance here.
>  
> -Kevin
> 

It seems like this is what happens here except that the function is
evaluated once for the WHERE and not once per ROW. Both of these meet
the criterion for 2 above and Tom's earlier comments both hold.

Regards,
Ken



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
"Kevin Grittner"
Date:
"ktm@rice.edu" <ktm@rice.edu> wrote:
> On Wed, Sep 19, 2012 at 02:39:12PM -0500, Kevin Grittner wrote:
>> In another thread, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> 2. Apply the WHERE condition to each row from 1, and drop rows
>>> that don't pass it.
>>  
>> People expect that the results will be consistent with this
>> model, even if the implementation is optimized "under the
>> covers".  I think correct semantics should trump performance
>> here.
> It seems like this is what happens here except that the function
> is evaluated once for the WHERE and not once per ROW. Both of
> these meet the criterion for 2 above and Tom's earlier comments
> both hold.
There really needs to be some way to specify that when an expression
is evaluated for each row in a set, a function used within that
expression is not optimized away for some rows.  Fortunately we have
a way:
http://www.postgresql.org/docs/9.2/interactive/sql-createfunction.html
| VOLATILE indicates that the function value can change even within
| a single table scan, so no optimizations can be made. Relatively
| few database functions are volatile in this sense; some examples
| are random(), [...]
The behavior in the OP's query would certainly be sane if the
function were not VOLATILE; as it is, I have a hard time seeing this
as anything but a bug.

There is a workaround, if you don't mind ugly:
CREATE FUNCTION random_really_i_mean_it(dummy int) RETURNS double precision LANGUAGE plpgsql VOLATILE
AS $$
BEGIN -- no need to reference dummy parameter RETURN random();
END;
$$;

WITH source AS (   SELECT i FROM generate_series(1,10) AS i
)
SELECT   i
FROM   source, (       SELECT           count(*) AS _n       FROM source   ) AS _stats
WHERE   random_really_i_mean_it(i) < 5::DOUBLE PRECISION/_n;
-Kevin



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
"Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> There is a workaround, if you don't mind ugly:
Or, better:
WITH source AS (   SELECT i, random() AS r FROM generate_series(1,10) AS i
)
SELECT   i
FROM   source, (       SELECT           count(*) AS _n       FROM source   ) AS _stats
WHERE   r < 5::DOUBLE PRECISION/_n;
-Kevin



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
"David Johnston"
Date:
> -----Original Message-----
> There really needs to be some way to specify that when an expression is
> evaluated for each row in a set, a function used within that expression is
not
> optimized away for some rows.  Fortunately we have a way:
> 
> http://www.postgresql.org/docs/9.2/interactive/sql-createfunction.html
> 
> | VOLATILE indicates that the function value can change even within a
> | single table scan, so no optimizations can be made. Relatively few
> | database functions are volatile in this sense; some examples are
> | random(), [...]
> 
> The behavior in the OP's query would certainly be sane if the function
were
> not VOLATILE; as it is, I have a hard time seeing this as anything but a
bug.

What are the arguments against adding a 4th identifier - call it PER_ROW for
this argument?  The main reason VOLATILE is broken is that it is the default
and in order to minimize beginner's penalty it is not treated as such in
some situations.  The new one could behave just like VOLATILE but would
never be optimized away and would always evaluate once for each row in its
context.  

Then the question is whether you write a new "random()" function or break
backwards compatibility and alter the existing version.

David J.





Re: Invalid optimization of VOLATILE function in WHERE clause?

From
"Kevin Grittner"
Date:
"David Johnston" <polobo@yahoo.com> wrote:
>> | VOLATILE indicates that the function value can change even
>> | within a single table scan, so no optimizations can be made.
>> | Relatively few database functions are volatile in this sense;
>> | some examples are random(), [...]
> What are the arguments against adding a 4th identifier - call it
> PER_ROW for this argument?  The main reason VOLATILE is broken is
> that it is the default and in order to minimize beginner's penalty
> it is not treated as such in some situations.  The new one could
> behave just like VOLATILE but would never be optimized away and
> would always evaluate once for each row in its context.  
So how would you document that?  It sounds like the proposed level
would behave exactly as the VOLATILE level is currently documented
to behave; so I guess we could shift the documentation of VOLATILE
to PER_ROW (or whatever).  How would you then describe the behavior
of VOLATILE?
-Kevin



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
"David Johnston"
Date:
> -----Original Message-----
> >> | VOLATILE indicates that the function value can change even within a
> >> | single table scan, so no optimizations can be made.
> >> | Relatively few database functions are volatile in this sense; some
> >> | examples are random(), [...]
> 
> > What are the arguments against adding a 4th identifier - call it
> > PER_ROW for this argument?  The main reason VOLATILE is broken is that
> > it is the default and in order to minimize beginner's penalty it is
> > not treated as such in some situations.  The new one could behave just
> > like VOLATILE but would never be optimized away and would always
> > evaluate once for each row in its context.
> 
> So how would you document that?  It sounds like the proposed level would
> behave exactly as the VOLATILE level is currently documented to behave; so
I
> guess we could shift the documentation of VOLATILE to PER_ROW (or
> whatever).  How would you then describe the behavior of VOLATILE?
> 

I'm not sure but however we would describe it we might as well make the
change now regardless of whether another level is added.

The main distinguishing characteristic is that VOLATILE is not guaranteed to
evaluate once-per-row if it is not dependent upon particular values within a
given row.  

VOLATILE: "A Volatile function used in an ORDER BY or WHERE clause without
referencing any columns from the query itself (i.e., no parameters or all
constants) will be evaluated a single time and the result treated as a
constant (i.e., all rows will have identical values) for that part of the
query."

PER_ROW: "A per_row function will be evaluated once for every row that is
visible to the function and will be treated as a virtual column of said
relation with each "cell" having an its own value as a result of the
function call."

Using random() as an example of the two possible behaviors should further
clarify the differences quite nicely.

Quick pass - hopefully, a) this inspires someone else, and b) this is the
correct understanding in the first place.

David J.





Re: Invalid optimization of VOLATILE function in WHERE clause?

From
"Kevin Grittner"
Date:
"David Johnston" <polobo@yahoo.com> wrote:
> VOLATILE: "A Volatile function used in an ORDER BY or WHERE clause
> without referencing any columns from the query itself (i.e., no
> parameters or all constants) will be evaluated a single time and
> the result treated as a constant (i.e., all rows will have
> identical values) for that part of the query."
I hope you're wrong about the ORDER BY part of that.  A quick test
confirms that it works in ORDER BY, at least for some cases.  If
there are any exceptions to that, I would sure like to know about
it -- and really soon.
select * from generate_series(1, 10000) s(n) order by random() limit 10;
-Kevin



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
"David Johnston"
Date:
> -----Original Message-----
> From: Kevin Grittner [mailto:Kevin.Grittner@wicourts.gov]
> Sent: Wednesday, September 19, 2012 5:51 PM
> To: ktm@rice.edu; David Johnston
> Cc: 'Florian Schoppmann'; 'Robert Haas'; pgsql-hackers@postgresql.org;
'Tom
> Lane'
> Subject: RE: [HACKERS] Invalid optimization of VOLATILE function in WHERE
> clause?
> 
> "David Johnston" <polobo@yahoo.com> wrote:
> 
> > VOLATILE: "A Volatile function used in an ORDER BY or WHERE clause
> > without referencing any columns from the query itself (i.e., no
> > parameters or all constants) will be evaluated a single time and the
> > result treated as a constant (i.e., all rows will have identical
> > values) for that part of the query."
> 
> I hope you're wrong about the ORDER BY part of that.  A quick test
confirms
> that it works in ORDER BY, at least for some cases.  If there are any
> exceptions to that, I would sure like to know about it -- and really soon.
> 
> select * from generate_series(1, 10000) s(n)
>   order by random() limit 10;
> 
> -Kevin

I'd rather have someone who knows the code assert one way or the other; I
tossed it in there because I thought I've seen people complain that random()
doesn't work as expected with ORDER BY but that may just be faulty memory.
It may or may not depend on whether LIMIT/OFFSET are involved...?  Used in
the SELECT-list it gets evaluated for each row and I guess the ORDER BY
could have that behavior as well (I would expect it to anyway), so is it
strictly limited to WHERE clause evaluation that this discrepancy manifests?

David J.







Re: Invalid optimization of VOLATILE function in WHERE clause?

From
Florian.Schoppmann@emc.com (Florian Schoppmann)
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Florian.Schoppmann@emc.com (Florian Schoppmann) writes:
> > [VOLATILE function in WHERE clause *does* get optimized]
> 
> I can't get excited about this.  Any time you put a volatile function
> into WHERE, you're playing with fire.  The docs warn against it:
> http://www.postgresql.org/docs/9.2/static/sql-expressions.html#SYNTAX-
> EXPRESS-EVAL

All this section tells me is that one cannot rely on the evaluation
order in expressions, and that side effects are dangerous in WHERE and
HAVING clauses.

I do not read in this section that VOLATILE functions are unsafe per se.
After all, a VOLATILE function is not required to have side effects
(suppose, e.g., somebody implemented a true random nubmer generator).

However,
<http://www.postgresql.org/docs/9.2/interactive/sql-createfunction.html>
is actually very clear in its wording:

| VOLATILE indicates that the function value can change even within a
| single table scan, so no optimizations can be made.

I therefore tend to see the behavior as a bug. I concede though that
there is also good reason to keep the current behavior (VOLATILE is the
default for UDFs, etc.). But then I think the documentation needs to be
changed, and it has to be made explicit where the optimizer may change
semantics and what, on the other hand, is defined behavior.

E.g., users should need to know if the following rewrite causes defined
or undefined behavior. Does it give correct results by accident, or are
correct results guaranteed?

--8<--
WITH source AS (   SELECT i FROM generate_series(1,10) AS i
)
SELECT   i
FROM   source AS _stats
WHERE   random() < 5::DOUBLE PRECISION               / (SELECT count(*) FROM source);
-->8--

> To do what you want, I'd suggest wrapping the join into a sub-select
> with an "OFFSET 0" clause, which will serve as an optimization fence
> that prevents the random() call from being pushed down.

My interpretation so far is that VOLATILE functions in a WHERE clause
can always be avoided: E.g., move the condition to the SELECT list and
embed in an outer query that then filters on the condition column. Or is
my assumption wrong, and the optimizer could theoretically interfere
even here?

Florian




Re: Invalid optimization of VOLATILE function in WHERE clause?

From
Merlin Moncure
Date:
On Wed, Sep 19, 2012 at 2:39 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> It still seems like awfully weird behavior.
>>
>> Why?  The WHERE condition relates only to the output of the _stats
>> subquery, so why shouldn't it be evaluated there, rather than
>> after the join?
>
> In another thread, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's easier to understand why this is if you realize that SQL has
>> a very clear model of a "pipeline" of query execution.
>> Conceptually, what happens is:
>>
>> 1. Form the cartesian product of the tables listed in FROM (ie,
>> all combinations of rows).
>>
>> 2. Apply the WHERE condition to each row from 1, and drop rows
>> that don't pass it.
>
> People expect that the results will be consistent with this model,
> even if the implementation is optimized "under the covers".  I think
> correct semantics should trump performance here.

Hm, I bet it's possible (although probably not easy) to deduce
volatility from the function body...maybe through the validator. If
you could do that (perhaps warning in cases where you can't), then the
performance regression-inducing-argument (which I agree with) becomes
greatly ameliorated.

merlin



Re: Invalid optimization of VOLATILE function in WHERE clause?

From
Merlin Moncure
Date:
On Thu, Sep 20, 2012 at 9:24 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> Hm, I bet it's possible (although probably not easy) to deduce
>> volatility from the function body...maybe through the validator.
>> If you could do that (perhaps warning in cases where you can't),
>> then the performance regression-inducing-argument (which I agree
>> with) becomes greatly ameliorated.
>
> For about the last 10 years the Wisconsin Courts have been parsing
> SQL code to generate Java query classes, including "stored
> procedures", and determining information like this.  For example, we
> set a readOnly property for the query class by examining the
> statements in the procedure and the readOnly status of each called
> procedure.  It wasn't that hard for us, and I'm not sure what would
> make much it harder in PostgreSQL, if we can do it where a parse
> tree for the function is handy.

hm, what do you do about 'after the fact' changes to things the
procedure body is pointing to?  what would the server have to do?

merlin