Thread: DISTINCT ON changes sort order
Hi,
(Ignore that the last entry is 4, it's a copy-paste error and should be 43 as well. Anyway.....)
following a query:
SELECT DISTINCT ON (msgid) msgid FROM (SELECT refid_messages as msgid FROM messagehistorywithcontent WHERE 1=1 AND (lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle], ' ')) LIKE '%gg%') ORDER BY messagekind DESC) as foo
This query rearranges the sort order though.
When I only execute the inner SELECT I get this:
53 |
29 |
46 |
46 |
51 |
52 |
53 |
29 |
46 |
47 |
48 |
48 |
49 |
49 |
49 |
49 |
49 |
49 |
49 |
49 |
49 |
49 |
49 |
49 |
49 |
49 |
49 |
49 |
49 |
50 |
49 |
49 |
46 |
46 |
46 |
46 |
43 |
43 |
43 |
43 |
43 |
4 |
The order is correct. Now from the outer SELECT I would expect then to get:
53
29
46
51
52
53
46
.
.
.
.
43
But this is not the case. 43 is the id of the only record with messagekind 'AM' where all other have messagekind 'PD'. Yet the order in which the full query returns the results is:
29
43
46
47
.
.
.
Which is wrong. I can't figure out why this is wrong, but from toying around I found that depending on wether I use DISTINCT msgid or DISTINCT ON (msgid) msgid I get different results. Still, both results are wrong.
The 'must sort by what you distinct on' behavior gives me a total headache. What does one have to do to sort by a criterion and at the same time not use that criterion in the distinct clause?
Thank you
A. Reichtadt
On 04/24/2013 09:03 AM, Alexander Reichstadt wrote: > The order is correct. Now from the outer SELECT I would expect then to get: > 53 > 29 > 46 Please re-read the manual on DISTINCT ON. "SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first." You're running into the "unpredictable" part in that stanza. The distinct on may change the query plan, which will result in a different ordering of tuples than your inner clause, unless you used an optimization fence. Use an order-by clause. It's good practice to choose the column you want so that it always comes first, and eliminates other candidates. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 sthomas@optionshouse.com ______________________________________________ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
Alexander Reichstadt <lxr@mac.com> wrote: > SELECT > DISTINCT ON (msgid) > msgid > FROM ( > SELECT refid_messages as msgid > FROM messagehistorywithcontent > WHERE 1=1 > AND (lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle], ' ')) LIKE '%gg%') > ORDER BY messagekind DESC > ) as foo > [does not return results in the order of in subquery] > Which is wrong. No, your query is underspecified. The subquery in the FROM clause is an input relation to the top-level query, which then does a form of aggregation on that. There are no guarantees that a SELECT statement will return values in an order based on any of its input relations, unless you specify it that way. All the top-level SELECT sees is a list of msgid values, and you are asking it to eliminate duplicates to return a set of distinct values. It is doing exactly that in the way which it estimates will be the fastest. If you actually want something different from that, you must specify it. I'm not any more sure than the planner what it is that you *really* want, but maybe this?: SELECT msgid FROM ( SELECT distinct messagekind, refid_messages as msgid FROM messagehistorywithcontent WHERE (lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle], ' ')) LIKE '%gg%') ) as foo ORDER BY messagekind DESC, msgid; If that doesn't give you what you're after, we need a better explanation of what you want. A self-contained test case, where you create a table and load it with data and show a query, would be best. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I think I solved it:
SELECT * FROM (SELECT DISTINCT ON(refid_messages) refid_messages as msgid, * FROM messagehistorywithcontent WHERE (lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle], ' ')) LIKE '%gg%') ORDER BY refid_messages DESC) as foo ORDER BY messagekind ASC
Thanks
Alex
On 24.04.2013, at 16:03, Alexander Reichstadt wrote:
Hi,following a query:SELECT DISTINCT ON (msgid) msgid FROM (SELECT refid_messages as msgid FROM messagehistorywithcontent WHERE 1=1 AND (lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle], ' ')) LIKE '%gg%') ORDER BY messagekind DESC) as fooThis query rearranges the sort order though.When I only execute the inner SELECT I get this:(Ignore that the last entry is 4, it's a copy-paste error and should be 43 as well. Anyway.....)
53 29 46 46 51 52 53 29 46 47 48 48 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 50 49 49 46 46 46 46 43 43 43 43 43 4The order is correct. Now from the outer SELECT I would expect then to get:53294651525346....43But this is not the case. 43 is the id of the only record with messagekind 'AM' where all other have messagekind 'PD'. Yet the order in which the full query returns the results is:29434647...Which is wrong. I can't figure out why this is wrong, but from toying around I found that depending on wether I use DISTINCT msgid or DISTINCT ON (msgid) msgid I get different results. Still, both results are wrong.The 'must sort by what you distinct on' behavior gives me a total headache. What does one have to do to sort by a criterion and at the same time not use that criterion in the distinct clause?Thank youA. Reichtadt
Alexander Reichstadt <info@apfeltaste.net> writes: > I think I solved it: > SELECT * FROM (SELECT DISTINCT ON(refid_messages) refid_messages as msgid, * FROM messagehistorywithcontent WHERE (lower(ARRAY_TO_STRING(ARRAY[login,firstname,lastname,content,msgtitle],' ')) LIKE '%gg%') ORDER BY refid_messages DESC)as foo ORDER BY messagekind ASC This is not really a solution, unless you don't care which of the rows with a particular refid_messages value comes out. Usually, use of "DISTINCT ON (x)" should be accompanied by "ORDER BY x, y", where y is what you want to determine which row of a given x value comes out. If you write it like this, you get the row with smallest y for each x; or you could write "ORDER BY x, y DESC" to get the row with largest y. See the DISTINCT ON example in the SELECT reference page. regards, tom lane