Re: [SQL] Confusion about DISTINCT. - Mailing list pgsql-sql

From Bruce Momjian
Subject Re: [SQL] Confusion about DISTINCT.
Date
Msg-id 199905101625.MAA06754@candle.pha.pa.us
Whole thread Raw
In response to Re: [SQL] Confusion about DISTINCT.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-sql

Nice summary of the issue.


> Chris Bitmead <chris.bitmead@bigfoot.com> writes:
> > I'm a bit confused about DISTINCT. Firstly, what does DISTINCT mean when
> > you don't use "ON".
> 
> Means the equality test is taken across all columns of the output.
> "ON" restricts the test for "duplicate row" to look at just some
> of the columns (thus creating the issue of which tuple out of a
> group of "duplicates" gets through the filter to determine the values
> of the other columns).
> 
> > Like this...
> > SELECT DISTINCT category.image FROM story, category* WHERE
> > story.category = category.oid ORDER BY datetime DESC;
> > The DISTINCT here has no effect on the output other than changing the
> > ordering. There are duplicates!
> 
> This is a known bug, or at least IMHO it's a bug; it's triggered by your
> use of ORDER BY with DISTINCT.  You need to say
>     ORDER BY category.image, datetime DESC 
> to make the above example work.
> 
> The reason is that DISTINCT is implemented by a sort followed by an
> adjacent-duplicate-elimination pass (just like "sort | uniq" in Unix
> shell programming).  When you put on an explicit ORDER BY clause,
> you override the sort order that the DISTINCT wants, and so the
> duplicate filter doesn't necessarily spot the duplicates.  You have
> to make sure to sort in an order that will keep the unwanted duplicates
> together, ie, list all the DISTINCT columns first in the ORDER BY.
> 
> A straightforward solution would be to apply the user-specified ORDER BY
> sort *after* the DISTINCT-generated sort and dup-filter steps.  I used
> to think this was the right fix.  However, this would break that nifty
> technique of letting a user-specified ORDER BY resolve the ambiguity of
> DISTINCT ON --- the user ordering has to happen before the dup-filter
> for that to work.
> 
> Alternatively, the system could alter the user's ORDER BY to ensure the
> DISTINCT columns are the primary sort keys --- ie, silently change your
> ORDER BY in the above example.  I don't think I like that too much
> either.  For one thing, people would complain that the resulting order
> wasn't what they asked for.  For another, it seems possible that there
> are applications where applying the dup-filter to columns that aren't
> the primary keys might be useful.  Let's see, if you did
>     SELECT DISTINCT category.image FROM story, category* WHERE
>     story.category = category.oid ORDER BY datetime DESC, category.image;
> then you'd get each image listed only once *per date* (I think).
> Doesn't that sound like it could be a useful behavior?
> 
> So right at the moment, I think the system ought not tinker with the
> user's ORDER BY.  But it probably should emit a warning message if the
> ORDER BY fails to mention all the columns being DISTINCTed on.  If you
> aren't sorting by a DISTINCT column at all, you won't even get
> reproducible results, let alone desirable ones.
> 
> > 2nd question: Why does the following query result in duplicates even
> > though I use DISTINCT ON?
> 
> Same deal.
> 
>             regards, tom lane
> 
> 


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


pgsql-sql by date:

Previous
From: "Matthias Seiferth"
Date:
Subject: Retrieving column names and table names of a database
Next
From: Tom Lane
Date:
Subject: Re: [SQL] Simple Optimization Problem