Thread: DISTINCT ON changes sort order

DISTINCT ON changes sort order

From
Alexander Reichstadt
Date:
Hi,

following a query:
SELECT DISTINCT ON (msgid) msgid FROM (SELECT refid_messages as msgid FROM messagehistorywithcontent WHERE 1=1 AND  (lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle], ' ')) LIKE '%gg%') ORDER BY messagekind DESC) as foo

This query rearranges the sort order though.

When I only execute the inner SELECT I get this:

53
29
46
46
51
52
53
29
46
47
48
48
49
49
49
49
49
49
49
49
49
49
49
49
49
49
49
49
49
50
49
49
46
46
46
46
43
43
43
43
43
4
(Ignore that the last entry is 4, it's a copy-paste error and should be 43 as well. Anyway.....)


The order is correct. Now from the outer SELECT I would expect then to get:
53
29
46
51
52
53
46
.
.
.
.
43

But this is not the case. 43 is the id of the only record with messagekind 'AM' where all other have messagekind 'PD'. Yet the order in which the full query returns the results is:
29
43
46
47
.
.
.

Which is wrong. I can't figure out why this is wrong, but from toying around I found that depending on wether I use DISTINCT msgid or DISTINCT ON (msgid) msgid I get different results. Still, both results are wrong.

The 'must sort by what you distinct on' behavior gives me a total headache. What does one have to do to sort by a criterion and at the same time not use that criterion in the distinct clause?

Thank you
A. Reichtadt

Re: DISTINCT ON changes sort order

From
Shaun Thomas
Date:
On 04/24/2013 09:03 AM, Alexander Reichstadt wrote:

> The order is correct. Now from the outer SELECT I would expect then to get:
> 53
> 29
> 46

Please re-read the manual on DISTINCT ON.

"SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of
each set of rows where the given expressions evaluate to equal. The
DISTINCT ON expressions are interpreted using the same rules as for
ORDER BY (see above). Note that the "first row" of each set is
unpredictable unless ORDER BY is used to ensure that the desired row
appears first."

You're running into the "unpredictable" part in that stanza. The
distinct on may change the query plan, which will result in a different
ordering of tuples than your inner clause, unless you used an
optimization fence.

Use an order-by clause. It's good practice to choose the column you want
so that it always comes first, and eliminates other candidates.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: DISTINCT ON changes sort order

From
Kevin Grittner
Date:
Alexander Reichstadt <lxr@mac.com> wrote:

> SELECT
>   DISTINCT ON (msgid)
>     msgid
>   FROM (
>          SELECT refid_messages as msgid
>            FROM messagehistorywithcontent
>            WHERE 1=1
>              AND (lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle], ' ')) LIKE '%gg%')
>            ORDER BY messagekind DESC
>        ) as foo

> [does not return results in the order of in subquery]

> Which is wrong.

No, your query is underspecified.  The subquery in the FROM clause
is an input relation to the top-level query, which then does a form
of aggregation on that.  There are no guarantees that a SELECT
statement will return values in an order based on any of its input
relations, unless you specify it that way.

All the top-level SELECT sees is a list of msgid values, and you
are asking it to eliminate duplicates to return a set of distinct
values.  It is doing exactly that in the way which it estimates
will be the fastest.  If you actually want something different from
that, you must specify it.

I'm not any more sure than the planner what it is that you *really*
want, but maybe this?:

SELECT msgid
  FROM (
         SELECT distinct messagekind, refid_messages as msgid
           FROM messagehistorywithcontent
           WHERE (lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle], ' ')) LIKE '%gg%')
       ) as foo
  ORDER BY
    messagekind DESC,
    msgid;

If that doesn't give you what you're after, we need a better
explanation of what you want.  A self-contained test case, where
you create a table and load it with data and show a query, would be
best.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: DISTINCT ON changes sort order

From
Alexander Reichstadt
Date:
I think I solved it:

SELECT * FROM (SELECT DISTINCT ON(refid_messages) refid_messages as msgid, * FROM messagehistorywithcontent WHERE   (lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle], ' ')) LIKE '%gg%') ORDER BY refid_messages DESC) as foo ORDER BY messagekind ASC

Thanks
Alex


On 24.04.2013, at 16:03, Alexander Reichstadt wrote:

Hi,

following a query:
SELECT DISTINCT ON (msgid) msgid FROM (SELECT refid_messages as msgid FROM messagehistorywithcontent WHERE 1=1 AND  (lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle], ' ')) LIKE '%gg%') ORDER BY messagekind DESC) as foo

This query rearranges the sort order though.

When I only execute the inner SELECT I get this:

53
29
46
46
51
52
53
29
46
47
48
48
49
49
49
49
49
49
49
49
49
49
49
49
49
49
49
49
49
50
49
49
46
46
46
46
43
43
43
43
43
4
(Ignore that the last entry is 4, it's a copy-paste error and should be 43 as well. Anyway.....)


The order is correct. Now from the outer SELECT I would expect then to get:
53
29
46
51
52
53
46
.
.
.
.
43

But this is not the case. 43 is the id of the only record with messagekind 'AM' where all other have messagekind 'PD'. Yet the order in which the full query returns the results is:
29
43
46
47
.
.
.

Which is wrong. I can't figure out why this is wrong, but from toying around I found that depending on wether I use DISTINCT msgid or DISTINCT ON (msgid) msgid I get different results. Still, both results are wrong.

The 'must sort by what you distinct on' behavior gives me a total headache. What does one have to do to sort by a criterion and at the same time not use that criterion in the distinct clause?

Thank you
A. Reichtadt


Re: DISTINCT ON changes sort order

From
Tom Lane
Date:
Alexander Reichstadt <info@apfeltaste.net> writes:
> I think I solved it:
> SELECT * FROM (SELECT DISTINCT ON(refid_messages) refid_messages as msgid, * FROM messagehistorywithcontent WHERE
(lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle],' ')) LIKE '%gg%') ORDER BY refid_messages
DESC)as foo ORDER BY messagekind ASC 

This is not really a solution, unless you don't care which of the rows
with a particular refid_messages value comes out.  Usually, use of
"DISTINCT ON (x)" should be accompanied by "ORDER BY x, y", where y is
what you want to determine which row of a given x value comes out.
If you write it like this, you get the row with smallest y for each
x; or you could write "ORDER BY x, y DESC" to get the row with largest
y.  See the DISTINCT ON example in the SELECT reference page.

            regards, tom lane