Thread: Unexpected sort order.
Shouldn't the results of this query shown here been sorted by "b" rather than by "a"? I would have thought since "order by b" is in the outer sql statement it would have been the one the final result gets ordered by. li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a) as x orderby b; a | b ---+---- 0 | 8 1 | 10 3 | 4 4 | 8 5 | 1 5 | 9 6 | 4 6 | 5 8 | 4 9 | 0 (10 rows) Changing the constant from 10 to 11 in either but not both of the places produces results I would have expected; as do many other ways of rewriting the query. Unless I'm missing something, it seems the way I wrote the query creates some confusion of which of the two similar expressions with random() it's sorting by.
On Mon, 2006-11-27 at 12:44 -0800, Ron Mayer wrote: > Shouldn't the results of this query shown here been sorted by "b" rather than by "a"? > > I would have thought since "order by b" is in the outer sql statement it would have > been the one the final result gets ordered by. > > li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a) as xorder by b; > a | b > ---+---- > 0 | 8 > 1 | 10 > 3 | 4 > 4 | 8 > 5 | 1 > 5 | 9 > 6 | 4 > 6 | 5 > 8 | 4 > 9 | 0 > (10 rows) > > > Changing the constant from 10 to 11 in either but not both of the > places produces results I would have expected; as do many other ways of > rewriting the query. > > Unless I'm missing something, it seems the way I wrote the query creates > some confusion of which of the two similar expressions with random() > it's sorting by. It looks like a planner bug. Below are two plans; the first fails and the second succeeds. That leads me to believe it's a planner bug, but what seems strangest to me is that it does order by a, and not by some new evaluation of (random()*10). => explain select * from (select (random()*10)::int as a, (random ()*10)::int as b from generate_series(1,10) order by a) as x order by b; QUERY PLAN ------------------------------------------------------------------------------ Sort (cost=77.33..79.83 rows=1000 width=0) Sort Key: ((random() * 10::double precision))::integer -> Function Scan on generate_series (cost=0.00..27.50 rows=1000 width=0) (3 rows) Time: 0.584 ms => explain select * from (select (random()*10)::int as a, (random ()*11)::int as b from generate_series(1,10) order by a) as x order by b; QUERY PLAN ------------------------------------------------------------------------------------ Sort (cost=139.66..142.16 rows=1000 width=8) Sort Key: x.b -> Sort (cost=77.33..79.83 rows=1000 width=0) Sort Key: ((random() * 10::double precision))::integer -> Function Scan on generate_series (cost=0.00..27.50 rows=1000 width=0) (5 rows) You can apparently get the correct behavior on almost any kind of rewriting of the query, including the mere addition of a "DESC" onto the end. However, the query also fails if you nest it as another subselect, like so: => select a,b from (select a,b from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a) as x) as y order by b; Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Mon, 2006-11-27 at 12:44 -0800, Ron Mayer wrote: >> Shouldn't the results of this query shown here been sorted by "b" rather than by "a"? >> li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a) asx order by b; > It looks like a planner bug. It looks to me like the planner thinks that order by a and order by b are equivalent because the expressions are equal(); hence it discards what it thinks is a redundant second sort step. I suppose we could add a check for whether the sort expression contains volatile functions before believing this, but I'm having a hard time believing that there are any real-world cases where the check wouldn't be a waste of cycles. What's the use-case for sorting by a volatile expression in the first place? regards, tom lane
Jeff Davis wrote: > On Mon, 2006-11-27 at 12:44 -0800, Ron Mayer wrote: >> Shouldn't the results of this query shown here been sorted by "b" rather than by "a"? >> >> I would have thought since "order by b" is in the outer sql statement it would have >> been the one the final result gets ordered by. >> >> li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a) asx order by b; >> a | b >> ---+---- >> 0 | 8 >> 1 | 10 >> 3 | 4 >> 4 | 8 >> 5 | 1 >> 5 | 9 >> 6 | 4 >> 6 | 5 >> 8 | 4 >> 9 | 0 >> (10 rows) >>... > > It looks like a planner bug. > > Below are two plans; the first fails and the second succeeds. That leads > me to believe it's a planner bug, but what seems strangest to me is that > it does order by a, and not by some new evaluation of (random()*10). > Yeah, looks that way to me too. So how would I report it. Ccing the bugs list? Guess it can't hurt.
Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> On Mon, 2006-11-27 at 12:44 -0800, Ron Mayer wrote: >>> li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a) asx order by b; > >> It looks like a planner bug. > > It looks to me like the planner thinks that order by a and order by b > are equivalent because the expressions are equal(); hence it discards > what it thinks is a redundant second sort step. > > I suppose we could add a check for whether the sort expression contains > volatile functions before believing this, but I'm having a hard time > believing that there are any real-world cases where the check wouldn't > be a waste of cycles. Would it be a smaller waste of cycles and still avoid the problem if the planner blindly kept only the second sort step rather than the first one when it sees these redundant steps? Or would that get other cases wrong? > What's the use-case for sorting by a volatile > expression in the first place? > There was no use-case I had in mind when I reported it. The order just surprised me so I thought I'd post it here. If I wanted to make up a possible use case - hmm, perhaps random sampling - but surely there would be better ways of doing that. So nope, no real-world use cases I can make up - just a odd result on a rather weirdly written query. None of my real applications would care if it's not fixed.
I wrote: > It looks to me like the planner thinks that order by a and order by b > are equivalent because the expressions are equal(); hence it discards > what it thinks is a redundant second sort step. > ... What's the use-case for sorting by a volatile > expression in the first place? It may be worth pointing out that there are related gotchas without bothering with anything as complicated as a sub-select. Consider select random() from foo order by 1; select random() from foo order by random(); Are these the same, or not? If you experiment you'll find out that Postgres treats them the same --- random() is evaluated only once per row of foo, and you get output that is sorted. Arguably for the second case there should be two evaluations of random() per row, and you should get output that appears randomly ordered (because the sort key and the output value will be uncorrelated). If you do select random() from foo order by random()+1; then you do get two evaluations and random-looking output. I'd be the first to admit that these various behaviors "just grew" rather than being intentionally designed; no one has been thinking about volatility in sort keys. The question remains whether it is worth expending development effort and planning cycles to have a more consistent definition. What's the use-case? regards, tom lane
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: > Tom Lane wrote: >> It looks to me like the planner thinks that order by a and order by b >> are equivalent because the expressions are equal(); hence it discards >> what it thinks is a redundant second sort step. > Would it be a smaller waste of cycles and still avoid the problem > if the planner blindly kept only the second sort step rather than > the first one when it sees these redundant steps? Or would that > get other cases wrong? I was fuzzing the explanation a bit --- there really isn't any place that we could simply reverse the logic and get the other behavior. The real issue is that the planner's "PathKey" representation of sort ordering is actually incapable of distinguishing whether the sub-query is sorted by a or by b: in either case the PathKeyItem will contain the expression "(random()*10)::int". So when the upper query tries to decide whether the lower query is already sorted the way it wants, it'll come out with a match. We surely don't want to discard the optimization of avoiding redundant sorts of subquery outputs, so the only way to "fix" this would be a fundamental redesign of the PathKey mechanism to special-case volatile expressions somehow. I'm resistant to doing that without a fairly solid use-case for sorting by volatile expressions ... regards, tom lane
On Mon, 2006-11-27 at 17:05 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > On Mon, 2006-11-27 at 12:44 -0800, Ron Mayer wrote: > >> Shouldn't the results of this query shown here been sorted by "b" rather than by "a"? > > >> li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a) asx order by b; > > > It looks like a planner bug. > > It looks to me like the planner thinks that order by a and order by b > are equivalent because the expressions are equal(); hence it discards > what it thinks is a redundant second sort step. > > I suppose we could add a check for whether the sort expression contains > volatile functions before believing this, but I'm having a hard time > believing that there are any real-world cases where the check wouldn't > be a waste of cycles. What's the use-case for sorting by a volatile > expression in the first place? The only use case that I can think of is avoiding surprise during testing. random() and generate_series() are probably used rarely in real applications, as with any other volatile function aside from the sequence functions. However, they're frequently used when developing and testing applications. The only reason I mention this is because it might not always be so obvious when it's a sorting problem. If you do a GROUP BY, and it's grouping by the wrong column, I could see how that could be very confusing[1]. Granted, this is all for hypothetical, contrived testing scenarios. So it's not very compelling if it requires significant work to implement. Regards, Jeff Davis [1] This result certainly doesn't make much sense, although I suppose the query doesn't either: => select sum(a) as aa,b from (select distinct (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10)) x group by b; aa | b ----+---- 1 | 3 1 | 5 2 | 4 7 | 8 7 | 10 8 | 6 8 | 10 9 | 1 9 | 6 9 | 8 (10 rows)