Re: [SQL] Confusion about DISTINCT. - Mailing list pgsql-sql
From | Tom Lane |
---|---|
Subject | Re: [SQL] Confusion about DISTINCT. |
Date | |
Msg-id | 10784.925057363@sss.pgh.pa.us Whole thread Raw |
In response to | Confusion about DISTINCT. (Chris Bitmead <chris.bitmead@bigfoot.com>) |
Responses |
Re: [SQL] Confusion about DISTINCT.
|
List | pgsql-sql |
Chris Bitmead <chris.bitmead@bigfoot.com> writes: > I'm a bit confused about DISTINCT. Firstly, what does DISTINCT mean when > you don't use "ON". Means the equality test is taken across all columns of the output. "ON" restricts the test for "duplicate row" to look at just some of the columns (thus creating the issue of which tuple out of a group of "duplicates" gets through the filter to determine the values of the other columns). > Like this... > SELECT DISTINCT category.image FROM story, category* WHERE > story.category = category.oid ORDER BY datetime DESC; > The DISTINCT here has no effect on the output other than changing the > ordering. There are duplicates! This is a known bug, or at least IMHO it's a bug; it's triggered by your use of ORDER BY with DISTINCT. You need to sayORDER BY category.image, datetime DESC to make the above example work. The reason is that DISTINCT is implemented by a sort followed by an adjacent-duplicate-elimination pass (just like "sort | uniq" in Unix shell programming). When you put on an explicit ORDER BY clause, you override the sort order that the DISTINCT wants, and so the duplicate filter doesn't necessarily spot the duplicates. You have to make sure to sort in an order that will keep the unwanted duplicates together, ie, list all the DISTINCT columns first in the ORDER BY. A straightforward solution would be to apply the user-specified ORDER BY sort *after* the DISTINCT-generated sort and dup-filter steps. I used to think this was the right fix. However, this would break that nifty technique of letting a user-specified ORDER BY resolve the ambiguity of DISTINCT ON --- the user ordering has to happen before the dup-filter for that to work. Alternatively, the system could alter the user's ORDER BY to ensure the DISTINCT columns are the primary sort keys --- ie, silently change your ORDER BY in the above example. I don't think I like that too much either. For one thing, people would complain that the resulting order wasn't what they asked for. For another, it seems possible that there are applications where applying the dup-filter to columns that aren't the primary keys might be useful. Let's see, if you didSELECT DISTINCT category.image FROM story, category* WHEREstory.category= category.oid ORDER BY datetime DESC, category.image; then you'd get each image listed only once *per date* (I think). Doesn't that sound like it could be a useful behavior? So right at the moment, I think the system ought not tinker with the user's ORDER BY. But it probably should emit a warning message if the ORDER BY fails to mention all the columns being DISTINCTed on. If you aren't sorting by a DISTINCT column at all, you won't even get reproducible results, let alone desirable ones. > 2nd question: Why does the following query result in duplicates even > though I use DISTINCT ON? Same deal. regards, tom lane