Thread: Invalid optimization of VOLATILE function in WHERE clause?
From: Florian.Schoppmann@emc.com (Florian Schoppmann)
Hi all,

In PostgreSQL 9.1 and 9.2 (possibly also in earlier versions), the query

--8<--
WITH source AS (
    SELECT i FROM generate_series(1,10) AS i
)
SELECT
    i
FROM
    source, (
        SELECT
            count(*) AS _n
        FROM source
    ) AS _stats
WHERE
    random() < 5::DOUBLE PRECISION/_n;
-->8--

translates into the following query plan:

--8<--
Nested Loop  (cost=35.00..65.03 rows=1000 width=4)
  CTE source
    ->  Function Scan on generate_series i  (cost=0.00..10.00 rows=1000 width=4)
  ->  Aggregate  (cost=25.00..25.02 rows=1 width=0)
        Filter: (random() < (5::double precision / (count(*))::double precision))
        ->  CTE Scan on source  (cost=0.00..20.00 rows=1000 width=0)
  ->  CTE Scan on source  (cost=0.00..20.00 rows=1000 width=4)
-->8--

In other words, the query gives either exactly 0 or exactly 10 rows, and both cases happen with probability 0.5. Naturally, I would have expected instead that each row is sampled independently with probability 0.5.

Since random() is volatile, so is the whole WHERE expression. So I wonder why the condition is pushed down to the lowest level, given that this changes results. Is this behavior correct, i.e., specified somewhere? Or is this a bug?

Florian
Florian.Schoppmann@emc.com (Florian Schoppmann) writes:
> In PostgreSQL 9.1 and 9.2 (possibly also in earlier versions), the query

> --8<--
> WITH source AS (
>     SELECT i FROM generate_series(1,10) AS i
> )
> SELECT
>     i
> FROM
>     source, (
>         SELECT
>             count(*) AS _n
>         FROM source
>     ) AS _stats
> WHERE
>     random() < 5::DOUBLE PRECISION/_n;
> -->8--

[ doesn't do what you think it should ]

I can't get excited about this.  Any time you put a volatile function into WHERE, you're playing with fire.  The docs warn against it:
http://www.postgresql.org/docs/9.2/static/sql-expressions.html#SYNTAX-EXPRESS-EVAL

To do what you want, I'd suggest wrapping the join into a sub-select with an "OFFSET 0" clause, which will serve as an optimization fence that prevents the random() call from being pushed down.

            regards, tom lane
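A sketch of the OFFSET 0 rewrite Tom suggests (the exact query is not spelled out in the thread, so the formulation below is an assumption):

-- Sketch, not from the original mail: the OFFSET 0 sub-select acts as an
-- optimization fence, so the planner cannot push the volatile qual down
-- into the _stats aggregate; random() is evaluated once per joined row.
WITH source AS (
    SELECT i FROM generate_series(1,10) AS i
)
SELECT i
FROM (
    SELECT i, _n
    FROM source, (SELECT count(*) AS _n FROM source) AS _stats
    OFFSET 0  -- optimization fence
) AS fenced
WHERE random() < 5::DOUBLE PRECISION/_n;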
On Wed, Sep 19, 2012 at 9:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Florian.Schoppmann@emc.com (Florian Schoppmann) writes:
>> In PostgreSQL 9.1 and 9.2 (possibly also in earlier versions), the query
>
>> --8<--
>> WITH source AS (
>>     SELECT i FROM generate_series(1,10) AS i
>> )
>> SELECT
>>     i
>> FROM
>>     source, (
>>         SELECT
>>             count(*) AS _n
>>         FROM source
>>     ) AS _stats
>> WHERE
>>     random() < 5::DOUBLE PRECISION/_n;
>> -->8--
>
> [ doesn't do what you think it should ]
>
> I can't get excited about this.  Any time you put a volatile function
> into WHERE, you're playing with fire.  The docs warn against it:
> http://www.postgresql.org/docs/9.2/static/sql-expressions.html#SYNTAX-EXPRESS-EVAL
>
> To do what you want, I'd suggest wrapping the join into a sub-select
> with an "OFFSET 0" clause, which will serve as an optimization fence
> that prevents the random() call from being pushed down.

I like the more standard CTE approach to optimization fencing where it works:

postgres=# WITH source AS (
    SELECT i, random() r FROM generate_series(1,10) AS i
)
SELECT i
FROM source, (SELECT count(*) AS _n FROM source) AS _stats
WHERE r < 5::DOUBLE PRECISION/_n;

merlin
On Wed, Sep 19, 2012 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Florian.Schoppmann@emc.com (Florian Schoppmann) writes:
>> In PostgreSQL 9.1 and 9.2 (possibly also in earlier versions), the query
>
>> --8<--
>> WITH source AS (
>>     SELECT i FROM generate_series(1,10) AS i
>> )
>> SELECT
>>     i
>> FROM
>>     source, (
>>         SELECT
>>             count(*) AS _n
>>         FROM source
>>     ) AS _stats
>> WHERE
>>     random() < 5::DOUBLE PRECISION/_n;
>> -->8--
>
> [ doesn't do what you think it should ]
>
> I can't get excited about this.  Any time you put a volatile function
> into WHERE, you're playing with fire.  The docs warn against it:
> http://www.postgresql.org/docs/9.2/static/sql-expressions.html#SYNTAX-EXPRESS-EVAL
>
> To do what you want, I'd suggest wrapping the join into a sub-select
> with an "OFFSET 0" clause, which will serve as an optimization fence
> that prevents the random() call from being pushed down.

You've repeatedly objected to complaints on pgsql-performance on the grounds that WITH is an optimization fence. It seems awfully inconsistent to turn around and say, oh, sometimes it's not a fence after all.

It seems that users may not rely on WITH either to do the optimizations necessary to have good performance or to fail to do optimizations that lead to wrong results. Ouch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Sep 19, 2012 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> To do what you want, I'd suggest wrapping the join into a sub-select
>> with an "OFFSET 0" clause, which will serve as an optimization fence
>> that prevents the random() call from being pushed down.

> You've repeatedly objected to complaints on pgsql-performance on the
> grounds that WITH is an optimization fence.  It seems awfully
> inconsistent to turn around and say, oh, sometimes it's not a fence
> after all.

Huh?  The join in question is not inside a WITH.  If it were, that would work too, as noted by Merlin.

            regards, tom lane
On Wed, Sep 19, 2012 at 12:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Sep 19, 2012 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> To do what you want, I'd suggest wrapping the join into a sub-select
>>> with an "OFFSET 0" clause, which will serve as an optimization fence
>>> that prevents the random() call from being pushed down.
>
>> You've repeatedly objected to complaints on pgsql-performance on the
>> grounds that WITH is an optimization fence.  It seems awfully
>> inconsistent to turn around and say, oh, sometimes it's not a fence
>> after all.
>
> Huh?  The join in question is not inside a WITH.  If it were, that
> would work too, as noted by Merlin.

Oh, hmm. I see now: the problem isn't that random() is being pushed into the WITH, it's that it's being pushed into the join. Sorry, I should have read that more carefully.

It still seems like awfully weird behavior.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> It still seems like awfully weird behavior.

Why?  The WHERE condition relates only to the output of the _stats subquery, so why shouldn't it be evaluated there, rather than after the join?

In a green field I might agree that we should de-optimize such cases, but the problem with doing so is that it would totally destroy performance for cases in which a user has defined a function that's actually stable or immutable but they forgot to mark it so.  If VOLATILE weren't the default marking, such a change wouldn't be so problematic ... but it is.

Given that the behavior has been like this since the late stone age, I'm not inclined to change it.

            regards, tom lane
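A hypothetical illustration of the mislabeling hazard Tom describes (the function name and body are invented here, not from the thread): an unlabeled function defaults to VOLATILE even when its body is in fact constant, so de-optimizing all volatile quals would penalize exactly these functions.

-- Hypothetical example: effectively immutable, but with no explicit label
-- it defaults to VOLATILE and the planner must re-evaluate it per row.
CREATE FUNCTION default_sample_fraction() RETURNS double precision
LANGUAGE sql AS $$ SELECT 0.5::double precision $$;

-- The explicit label restores the planner's freedom to fold it to a constant:
ALTER FUNCTION default_sample_fraction() IMMUTABLE;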
On Wed, Sep 19, 2012 at 1:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> It still seems like awfully weird behavior.
>
> Why?  The WHERE condition relates only to the output of the _stats
> subquery, so why shouldn't it be evaluated there, rather than after
> the join?

Because my mental model (and apparently that of the OP) is that the WHERE clause gets evaluated separately for each row. Obviously in many cases that can be optimized without changing the results, but not in this case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> It still seems like awfully weird behavior.
>
> Why?  The WHERE condition relates only to the output of the _stats
> subquery, so why shouldn't it be evaluated there, rather than
> after the join?

In another thread, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's easier to understand why this is if you realize that SQL has
> a very clear model of a "pipeline" of query execution.
> Conceptually, what happens is:
>
> 1. Form the cartesian product of the tables listed in FROM (ie,
> all combinations of rows).
>
> 2. Apply the WHERE condition to each row from 1, and drop rows
> that don't pass it.

People expect that the results will be consistent with this model, even if the implementation is optimized "under the covers". I think correct semantics should trump performance here.

-Kevin
On Wed, Sep 19, 2012 at 02:39:12PM -0500, Kevin Grittner wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> It still seems like awfully weird behavior.
> >
> > Why?  The WHERE condition relates only to the output of the _stats
> > subquery, so why shouldn't it be evaluated there, rather than
> > after the join?
>
> In another thread, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > It's easier to understand why this is if you realize that SQL has
> > a very clear model of a "pipeline" of query execution.
> > Conceptually, what happens is:
> >
> > 1. Form the cartesian product of the tables listed in FROM (ie,
> > all combinations of rows).
> >
> > 2. Apply the WHERE condition to each row from 1, and drop rows
> > that don't pass it.
>
> People expect that the results will be consistent with this model,
> even if the implementation is optimized "under the covers". I think
> correct semantics should trump performance here.
>
> -Kevin

It seems like this is what happens here, except that the function is evaluated once for the WHERE clause and not once per row. Both behaviors meet the criterion in step 2 above, and Tom's earlier comments hold in both cases.

Regards,
Ken
"ktm@rice.edu" <ktm@rice.edu> wrote: > On Wed, Sep 19, 2012 at 02:39:12PM -0500, Kevin Grittner wrote: >> In another thread, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> 2. Apply the WHERE condition to each row from 1, and drop rows >>> that don't pass it. >> >> People expect that the results will be consistent with this >> model, even if the implementation is optimized "under the >> covers". I think correct semantics should trump performance >> here. > It seems like this is what happens here except that the function > is evaluated once for the WHERE and not once per ROW. Both of > these meet the criterion for 2 above and Tom's earlier comments > both hold. There really needs to be some way to specify that when an expression is evaluated for each row in a set, a function used within that expression is not optimized away for some rows. Fortunately we have a way: http://www.postgresql.org/docs/9.2/interactive/sql-createfunction.html | VOLATILE indicates that the function value can change even within | a single table scan, so no optimizations can be made. Relatively | few database functions are volatile in this sense; some examples | are random(), [...] The behavior in the OP's query would certainly be sane if the function were not VOLATILE; as it is, I have a hard time seeing this as anything but a bug. There is a workaround, if you don't mind ugly: CREATE FUNCTION random_really_i_mean_it(dummy int) RETURNS double precision LANGUAGE plpgsql VOLATILE AS $$ BEGIN -- no need to reference dummy parameter RETURN random(); END; $$; WITH source AS ( SELECT i FROM generate_series(1,10) AS i ) SELECT i FROM source, ( SELECT count(*) AS _n FROM source ) AS _stats WHERE random_really_i_mean_it(i) < 5::DOUBLE PRECISION/_n; -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > There is a workaround, if you don't mind ugly: Or, better: WITH source AS ( SELECT i, random() AS r FROM generate_series(1,10) AS i ) SELECT i FROM source, ( SELECT count(*) AS _n FROM source ) AS _stats WHERE r < 5::DOUBLE PRECISION/_n; -Kevin
> -----Original Message-----
> There really needs to be some way to specify that when an expression is
> evaluated for each row in a set, a function used within that expression is not
> optimized away for some rows. Fortunately we have a way:
>
> http://www.postgresql.org/docs/9.2/interactive/sql-createfunction.html
>
> | VOLATILE indicates that the function value can change even within a
> | single table scan, so no optimizations can be made. Relatively few
> | database functions are volatile in this sense; some examples are
> | random(), [...]
>
> The behavior in the OP's query would certainly be sane if the function were
> not VOLATILE; as it is, I have a hard time seeing this as anything but a bug.

What are the arguments against adding a fourth identifier - call it PER_ROW for the sake of argument? The main reason VOLATILE is broken is that it is the default, and in order to minimize the beginner's penalty it is not treated as such in some situations. The new one could behave just like VOLATILE but would never be optimized away and would always evaluate once for each row in its context.

Then the question is whether you write a new "random()" function or break backwards compatibility and alter the existing version.

David J.
"David Johnston" <polobo@yahoo.com> wrote: >> | VOLATILE indicates that the function value can change even >> | within a single table scan, so no optimizations can be made. >> | Relatively few database functions are volatile in this sense; >> | some examples are random(), [...] > What are the arguments against adding a 4th identifier - call it > PER_ROW for this argument? The main reason VOLATILE is broken is > that it is the default and in order to minimize beginner's penalty > it is not treated as such in some situations. The new one could > behave just like VOLATILE but would never be optimized away and > would always evaluate once for each row in its context. So how would you document that? It sounds like the proposed level would behave exactly as the VOLATILE level is currently documented to behave; so I guess we could shift the documentation of VOLATILE to PER_ROW (or whatever). How would you then describe the behavior of VOLATILE? -Kevin
> -----Original Message-----
> >> | VOLATILE indicates that the function value can change even within a
> >> | single table scan, so no optimizations can be made.
> >> | Relatively few database functions are volatile in this sense; some
> >> | examples are random(), [...]
>
> > What are the arguments against adding a fourth identifier - call it
> > PER_ROW for the sake of argument? The main reason VOLATILE is broken
> > is that it is the default and in order to minimize the beginner's
> > penalty it is not treated as such in some situations. The new one
> > could behave just like VOLATILE but would never be optimized away
> > and would always evaluate once for each row in its context.
>
> So how would you document that? It sounds like the proposed level would
> behave exactly as the VOLATILE level is currently documented to behave; so I
> guess we could shift the documentation of VOLATILE to PER_ROW (or
> whatever). How would you then describe the behavior of VOLATILE?

I'm not sure, but however we would describe it we might as well make the change now, regardless of whether another level is added. The main distinguishing characteristic is that VOLATILE is not guaranteed to evaluate once per row if it is not dependent upon particular values within a given row.

VOLATILE: "A volatile function used in an ORDER BY or WHERE clause without referencing any columns from the query itself (i.e., no parameters or all constants) will be evaluated a single time and the result treated as a constant (i.e., all rows will have identical values) for that part of the query."

PER_ROW: "A per-row function will be evaluated once for every row that is visible to the function and will be treated as a virtual column of said relation, with each "cell" having its own value as a result of the function call."

Using random() as an example of the two possible behaviors should further clarify the differences quite nicely.

Quick pass - hopefully, a) this inspires someone else, and b) this is the correct understanding in the first place.

David J.
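To make the proposal concrete, a purely hypothetical sketch of where the new label would slot into CREATE FUNCTION (PER_ROW is an invented keyword that exists in no PostgreSQL release):

-- Hypothetical syntax only; PER_ROW is a proposal, not a real keyword.
CREATE FUNCTION sample_noise() RETURNS double precision
LANGUAGE sql
PER_ROW   -- would force one evaluation per visible row, never constant-folded
AS $$ SELECT random() $$;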
"David Johnston" <polobo@yahoo.com> wrote: > VOLATILE: "A Volatile function used in an ORDER BY or WHERE clause > without referencing any columns from the query itself (i.e., no > parameters or all constants) will be evaluated a single time and > the result treated as a constant (i.e., all rows will have > identical values) for that part of the query." I hope you're wrong about the ORDER BY part of that. A quick test confirms that it works in ORDER BY, at least for some cases. If there are any exceptions to that, I would sure like to know about it -- and really soon. select * from generate_series(1, 10000) s(n) order by random() limit 10; -Kevin
> -----Original Message-----
> From: Kevin Grittner [mailto:Kevin.Grittner@wicourts.gov]
> Sent: Wednesday, September 19, 2012 5:51 PM
> To: ktm@rice.edu; David Johnston
> Cc: 'Florian Schoppmann'; 'Robert Haas'; pgsql-hackers@postgresql.org; 'Tom Lane'
> Subject: RE: [HACKERS] Invalid optimization of VOLATILE function in WHERE clause?
>
> "David Johnston" <polobo@yahoo.com> wrote:
>
> > VOLATILE: "A Volatile function used in an ORDER BY or WHERE clause
> > without referencing any columns from the query itself (i.e., no
> > parameters or all constants) will be evaluated a single time and the
> > result treated as a constant (i.e., all rows will have identical
> > values) for that part of the query."
>
> I hope you're wrong about the ORDER BY part of that. A quick test confirms
> that it works in ORDER BY, at least for some cases. If there are any
> exceptions to that, I would sure like to know about it -- and really soon.
>
> select * from generate_series(1, 10000) s(n)
> order by random() limit 10;
>
> -Kevin

I'd rather have someone who knows the code assert one way or the other; I tossed it in there because I thought I'd seen people complain that random() doesn't work as expected with ORDER BY, but that may just be faulty memory. It may or may not depend on whether LIMIT/OFFSET are involved...?

Used in the SELECT list it gets evaluated for each row, and I would expect ORDER BY to behave the same way. So is this discrepancy strictly limited to WHERE clause evaluation?

David J.
Re: Invalid optimization of VOLATILE function in WHERE clause?

From: Florian.Schoppmann@emc.com (Florian Schoppmann)
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Florian.Schoppmann@emc.com (Florian Schoppmann) writes:
>> [VOLATILE function in WHERE clause *does* get optimized]
>
> I can't get excited about this. Any time you put a volatile function
> into WHERE, you're playing with fire. The docs warn against it:
> http://www.postgresql.org/docs/9.2/static/sql-expressions.html#SYNTAX-EXPRESS-EVAL

All this section tells me is that one cannot rely on the evaluation order in expressions, and that side effects are dangerous in WHERE and HAVING clauses. I do not read in this section that VOLATILE functions are unsafe per se. After all, a VOLATILE function is not required to have side effects (suppose, e.g., somebody implemented a true random number generator).

However, <http://www.postgresql.org/docs/9.2/interactive/sql-createfunction.html> is actually very clear in its wording:

| VOLATILE indicates that the function value can change even within a
| single table scan, so no optimizations can be made.

I therefore tend to see the behavior as a bug. I concede though that there is also good reason to keep the current behavior (VOLATILE is the default for UDFs, etc.). But then I think the documentation needs to be changed, and it has to be made explicit where the optimizer may change semantics and what, on the other hand, is defined behavior. E.g., users need to know whether the following rewrite causes defined or undefined behavior. Does it give correct results by accident, or are correct results guaranteed?

--8<--
WITH source AS (
    SELECT i FROM generate_series(1,10) AS i
)
SELECT i
FROM source AS _stats
WHERE random() < 5::DOUBLE PRECISION / (SELECT count(*) FROM source);
-->8--

> To do what you want, I'd suggest wrapping the join into a sub-select
> with an "OFFSET 0" clause, which will serve as an optimization fence
> that prevents the random() call from being pushed down.

My interpretation so far is that VOLATILE functions in a WHERE clause can always be avoided: E.g., move the condition to the SELECT list and embed it in an outer query that then filters on the condition column. Or is my assumption wrong, and could the optimizer theoretically interfere even here?

Florian
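A sketch of the SELECT-list rewrite Florian describes (the exact query is not in his mail, so this formulation is an assumption; whether the optimizer is guaranteed to leave it alone is precisely the question he raises):

-- Assumed rewrite, not from the thread: compute the volatile condition as a
-- per-row column, then filter on that column in an outer query.
WITH source AS (
    SELECT i FROM generate_series(1,10) AS i
)
SELECT i
FROM (
    SELECT i,
           random() < 5::DOUBLE PRECISION / (SELECT count(*) FROM source) AS keep
    FROM source
) AS sampled
WHERE keep;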
On Wed, Sep 19, 2012 at 2:39 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> It still seems like awfully weird behavior.
>>
>> Why?  The WHERE condition relates only to the output of the _stats
>> subquery, so why shouldn't it be evaluated there, rather than
>> after the join?
>
> In another thread, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's easier to understand why this is if you realize that SQL has
>> a very clear model of a "pipeline" of query execution.
>> Conceptually, what happens is:
>>
>> 1. Form the cartesian product of the tables listed in FROM (ie,
>> all combinations of rows).
>>
>> 2. Apply the WHERE condition to each row from 1, and drop rows
>> that don't pass it.
>
> People expect that the results will be consistent with this model,
> even if the implementation is optimized "under the covers". I think
> correct semantics should trump performance here.

Hm, I bet it's possible (although probably not easy) to deduce volatility from the function body...maybe through the validator. If you could do that (perhaps warning in cases where you can't), then the performance regression-inducing argument (which I agree with) becomes greatly ameliorated.

merlin
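No such validator-based inference exists; as a rough manual substitute (an aside, not from the thread), the pg_proc catalog makes it easy to list user-defined functions still carrying the default VOLATILE label so they can be reviewed by hand:

-- Sketch: find user-defined functions marked VOLATILE (provolatile = 'v'),
-- i.e. candidates for an explicit STABLE or IMMUTABLE label.
SELECT n.nspname AS schema, p.proname AS function
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE p.provolatile = 'v'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');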
On Thu, Sep 20, 2012 at 9:24 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> Hm, I bet it's possible (although probably not easy) to deduce
>> volatility from the function body...maybe through the validator.
>> If you could do that (perhaps warning in cases where you can't),
>> then the performance regression-inducing argument (which I agree
>> with) becomes greatly ameliorated.
>
> For about the last 10 years the Wisconsin Courts have been parsing
> SQL code to generate Java query classes, including "stored
> procedures", and determining information like this. For example, we
> set a readOnly property for the query class by examining the
> statements in the procedure and the readOnly status of each called
> procedure. It wasn't that hard for us, and I'm not sure what would
> make it much harder in PostgreSQL, if we can do it where a parse
> tree for the function is handy.

hm, what do you do about 'after the fact' changes to things the procedure body is pointing to? what would the server have to do?

merlin
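A hypothetical illustration of the hazard Merlin raises (all object names invented): any volatility deduced from a function's body goes stale the moment a callee is redefined, so the server would have to track dependencies and re-derive it.

-- Hypothetical objects, for illustration only.
CREATE FUNCTION g() RETURNS int LANGUAGE sql STABLE AS $$ SELECT 1 $$;

-- Suppose the server inferred "f is stable" because its body only calls g():
CREATE FUNCTION f() RETURNS int LANGUAGE sql AS $$ SELECT g() $$;

-- That inference is invalidated by an after-the-fact change to g():
CREATE OR REPLACE FUNCTION g() RETURNS int LANGUAGE sql VOLATILE
AS $$ SELECT (random() * 10)::int $$;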