Thread: Unexpected sort order.

Unexpected sort order.

From
Ron Mayer
Date:
Shouldn't the results of this query shown here been sorted by "b" rather than by "a"?

I would have thought since "order by b" is in the outer sql statement it would have
been the one the final result gets ordered by.

li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a) as x
orderby b; 
 a | b
---+----
 0 |  8
 1 | 10
 3 |  4
 4 |  8
 5 |  1
 5 |  9
 6 |  4
 6 |  5
 8 |  4
 9 |  0
(10 rows)


Changing the constant from 10 to 11 in either but not both of the
places produces results I would have expected; as do many other ways of
rewriting the query.

Unless I'm missing something, it seems the way I wrote the query creates
some confusion of which of the two similar expressions with random()
it's sorting by.

Re: Unexpected sort order.

From
Jeff Davis
Date:
On Mon, 2006-11-27 at 12:44 -0800, Ron Mayer wrote:
> Shouldn't the results of this query shown here been sorted by "b" rather than by "a"?
>
> I would have thought since "order by b" is in the outer sql statement it would have
> been the one the final result gets ordered by.
>
> li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a) as
xorder by b; 
>  a | b
> ---+----
>  0 |  8
>  1 | 10
>  3 |  4
>  4 |  8
>  5 |  1
>  5 |  9
>  6 |  4
>  6 |  5
>  8 |  4
>  9 |  0
> (10 rows)
>
>
> Changing the constant from 10 to 11 in either but not both of the
> places produces results I would have expected; as do many other ways of
> rewriting the query.
>
> Unless I'm missing something, it seems the way I wrote the query creates
> some confusion of which of the two similar expressions with random()
> it's sorting by.

It looks like a planner bug.

Below are two plans; the first fails and the second succeeds. That leads
me to believe it's a planner bug, but what seems strangest to me is that
it does order by a, and not by some new evaluation of (random()*10).

=> explain select * from (select (random()*10)::int as a, (random
()*10)::int as b from generate_series(1,10) order by a) as x order by b;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=77.33..79.83 rows=1000 width=0)
   Sort Key: ((random() * 10::double precision))::integer
   ->  Function Scan on generate_series  (cost=0.00..27.50 rows=1000
width=0)
(3 rows)

Time: 0.584 ms

=> explain select * from (select (random()*10)::int as a, (random
()*11)::int as b from generate_series(1,10) order by a) as x order by b;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Sort  (cost=139.66..142.16 rows=1000 width=8)
   Sort Key: x.b
   ->  Sort  (cost=77.33..79.83 rows=1000 width=0)
         Sort Key: ((random() * 10::double precision))::integer
         ->  Function Scan on generate_series  (cost=0.00..27.50
rows=1000 width=0)
(5 rows)

You can apparently get the correct behavior on almost any kind of
rewriting of the query, including the mere addition of a "DESC" onto the
end.

However, the query also fails if you nest it as another subselect, like
so:

=> select a,b from (select a,b from (select (random()*10)::int as a,
(random()*10)::int as b from generate_series(1,10) order by a) as x) as
y order by b;

Regards,
    Jeff Davis


Re: Unexpected sort order.

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2006-11-27 at 12:44 -0800, Ron Mayer wrote:
>> Shouldn't the results of this query shown here been sorted by "b" rather than by "a"?

>> li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a)
asx order by b; 

> It looks like a planner bug.

It looks to me like the planner thinks that order by a and order by b
are equivalent because the expressions are equal(); hence it discards
what it thinks is a redundant second sort step.

I suppose we could add a check for whether the sort expression contains
volatile functions before believing this, but I'm having a hard time
believing that there are any real-world cases where the check wouldn't
be a waste of cycles.  What's the use-case for sorting by a volatile
expression in the first place?

            regards, tom lane

Re: Unexpected sort order (suspected bug)

From
Ron Mayer
Date:
Jeff Davis wrote:
> On Mon, 2006-11-27 at 12:44 -0800, Ron Mayer wrote:
>> Shouldn't the results of this query shown here been sorted by "b" rather than by "a"?
>>
>> I would have thought since "order by b" is in the outer sql statement it would have
>> been the one the final result gets ordered by.
>>
>> li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a)
asx order by b; 
>>  a | b
>> ---+----
>>  0 |  8
>>  1 | 10
>>  3 |  4
>>  4 |  8
>>  5 |  1
>>  5 |  9
>>  6 |  4
>>  6 |  5
>>  8 |  4
>>  9 |  0
>> (10 rows)
>>...
>
> It looks like a planner bug.
>
> Below are two plans; the first fails and the second succeeds. That leads
> me to believe it's a planner bug, but what seems strangest to me is that
> it does order by a, and not by some new evaluation of (random()*10).
>

Yeah, looks that way to me too.

So how would I report it.  Ccing the bugs list?  Guess it can't hurt.

Re: Unexpected sort order.

From
Ron Mayer
Date:
Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> On Mon, 2006-11-27 at 12:44 -0800, Ron Mayer wrote:
>>> li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a)
asx order by b; 
>
>> It looks like a planner bug.
>
> It looks to me like the planner thinks that order by a and order by b
> are equivalent because the expressions are equal(); hence it discards
> what it thinks is a redundant second sort step.
>
> I suppose we could add a check for whether the sort expression contains
> volatile functions before believing this, but I'm having a hard time
> believing that there are any real-world cases where the check wouldn't
> be a waste of cycles.

Would it be a smaller waste of cycles and still avoid the problem
if the planner blindly kept only the second sort step rather than
the first one when it sees these redundant steps?  Or would that
get other cases wrong?

> What's the use-case for sorting by a volatile
> expression in the first place?
>

There was no use-case I had in mind when I reported it.
The order just surprised me so I thought I'd post it here.


If I wanted to make up a possible use case - hmm, perhaps random
sampling - but surely there would be better ways of doing that.
So nope, no real-world use cases I can make up - just a odd
result on a rather weirdly written query.   None of my real
applications would care if it's not fixed.

Re: Unexpected sort order.

From
Tom Lane
Date:
I wrote:
> It looks to me like the planner thinks that order by a and order by b
> are equivalent because the expressions are equal(); hence it discards
> what it thinks is a redundant second sort step.
> ... What's the use-case for sorting by a volatile
> expression in the first place?

It may be worth pointing out that there are related gotchas without
bothering with anything as complicated as a sub-select.  Consider

    select random() from foo order by 1;
    select random() from foo order by random();

Are these the same, or not?  If you experiment you'll find out that
Postgres treats them the same --- random() is evaluated only once per
row of foo, and you get output that is sorted.  Arguably for the
second case there should be two evaluations of random() per row, and you
should get output that appears randomly ordered (because the sort key
and the output value will be uncorrelated).  If you do

    select random() from foo order by random()+1;

then you do get two evaluations and random-looking output.

I'd be the first to admit that these various behaviors "just grew"
rather than being intentionally designed; no one has been thinking
about volatility in sort keys.  The question remains whether it is
worth expending development effort and planning cycles to have a more
consistent definition.  What's the use-case?

            regards, tom lane

Re: Unexpected sort order.

From
Tom Lane
Date:
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> Tom Lane wrote:
>> It looks to me like the planner thinks that order by a and order by b
>> are equivalent because the expressions are equal(); hence it discards
>> what it thinks is a redundant second sort step.

> Would it be a smaller waste of cycles and still avoid the problem
> if the planner blindly kept only the second sort step rather than
> the first one when it sees these redundant steps?  Or would that
> get other cases wrong?

I was fuzzing the explanation a bit --- there really isn't any place
that we could simply reverse the logic and get the other behavior.
The real issue is that the planner's "PathKey" representation of sort
ordering is actually incapable of distinguishing whether the sub-query
is sorted by a or by b: in either case the PathKeyItem will contain the
expression "(random()*10)::int".  So when the upper query tries to
decide whether the lower query is already sorted the way it wants,
it'll come out with a match.  We surely don't want to discard the
optimization of avoiding redundant sorts of subquery outputs, so the
only way to "fix" this would be a fundamental redesign of the PathKey
mechanism to special-case volatile expressions somehow.  I'm resistant
to doing that without a fairly solid use-case for sorting by volatile
expressions ...

            regards, tom lane

Re: Unexpected sort order.

From
Jeff Davis
Date:
On Mon, 2006-11-27 at 17:05 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > On Mon, 2006-11-27 at 12:44 -0800, Ron Mayer wrote:
> >> Shouldn't the results of this query shown here been sorted by "b" rather than by "a"?
>
> >> li=# select * from (select (random()*10)::int as a, (random()*10)::int as b from generate_series(1,10) order by a)
asx order by b; 
>
> > It looks like a planner bug.
>
> It looks to me like the planner thinks that order by a and order by b
> are equivalent because the expressions are equal(); hence it discards
> what it thinks is a redundant second sort step.
>
> I suppose we could add a check for whether the sort expression contains
> volatile functions before believing this, but I'm having a hard time
> believing that there are any real-world cases where the check wouldn't
> be a waste of cycles.  What's the use-case for sorting by a volatile
> expression in the first place?

The only use case that I can think of is avoiding surprise during
testing. random() and generate_series() are probably used rarely in real
applications, as with any other volatile function aside from the
sequence functions. However, they're frequently used when developing and
testing applications.

The only reason I mention this is because it might not always be so
obvious when it's a sorting problem. If you do a GROUP BY, and it's
grouping by the wrong column, I could see how that could be very
confusing[1].

Granted, this is all for hypothetical, contrived testing scenarios. So
it's not very compelling if it requires significant work to implement.

Regards,
    Jeff Davis

[1] This result certainly doesn't make much sense, although I suppose
the query doesn't either:

=> select sum(a) as aa,b from (select distinct (random()*10)::int as a,
(random()*10)::int as b from generate_series(1,10)) x group by b;
 aa | b
----+----
  1 |  3
  1 |  5
  2 |  4
  7 |  8
  7 | 10
  8 |  6
  8 | 10
  9 |  1
  9 |  6
  9 |  8
(10 rows)