Re: [SQL] Confusion about DISTINCT. - Mailing list pgsql-sql
From | Bruce Momjian |
---|---|
Subject | Re: [SQL] Confusion about DISTINCT. |
Date | |
Msg-id | 199905101625.MAA06754@candle.pha.pa.us Whole thread Raw |
In response to | Re: [SQL] Confusion about DISTINCT. (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-sql |
Nice summary of the issue. > Chris Bitmead <chris.bitmead@bigfoot.com> writes: > > I'm a bit confused about DISTINCT. Firstly, what does DISTINCT mean when > > you don't use "ON". > > Means the equality test is taken across all columns of the output. > "ON" restricts the test for "duplicate row" to look at just some > of the columns (thus creating the issue of which tuple out of a > group of "duplicates" gets through the filter to determine the values > of the other columns). > > > Like this... > > SELECT DISTINCT category.image FROM story, category* WHERE > > story.category = category.oid ORDER BY datetime DESC; > > The DISTINCT here has no effect on the output other than changing the > > ordering. There are duplicates! > > This is a known bug, or at least IMHO it's a bug; it's triggered by your > use of ORDER BY with DISTINCT. You need to say > ORDER BY category.image, datetime DESC > to make the above example work. > > The reason is that DISTINCT is implemented by a sort followed by an > adjacent-duplicate-elimination pass (just like "sort | uniq" in Unix > shell programming). When you put on an explicit ORDER BY clause, > you override the sort order that the DISTINCT wants, and so the > duplicate filter doesn't necessarily spot the duplicates. You have > to make sure to sort in an order that will keep the unwanted duplicates > together, ie, list all the DISTINCT columns first in the ORDER BY. > > A straightforward solution would be to apply the user-specified ORDER BY > sort *after* the DISTINCT-generated sort and dup-filter steps. I used > to think this was the right fix. However, this would break that nifty > technique of letting a user-specified ORDER BY resolve the ambiguity of > DISTINCT ON --- the user ordering has to happen before the dup-filter > for that to work. > > Alternatively, the system could alter the user's ORDER BY to ensure the > DISTINCT columns are the primary sort keys --- ie, silently change your > ORDER BY in the above example. I don't think I like that too much > either. For one thing, people would complain that the resulting order > wasn't what they asked for. For another, it seems possible that there > are applications where applying the dup-filter to columns that aren't > the primary keys might be useful. Let's see, if you did > SELECT DISTINCT category.image FROM story, category* WHERE > story.category = category.oid ORDER BY datetime DESC, category.image; > then you'd get each image listed only once *per date* (I think). > Doesn't that sound like it could be a useful behavior? > > So right at the moment, I think the system ought not tinker with the > user's ORDER BY. But it probably should emit a warning message if the > ORDER BY fails to mention all the columns being DISTINCTed on. If you > aren't sorting by a DISTINCT column at all, you won't even get > reproducible results, let alone desirable ones. > > > 2nd question: Why does the following query result in duplicates even > > though I use DISTINCT ON? > > Same deal. > > regards, tom lane > > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026