Re: [SQL] Confusion about DISTINCT. - Mailing list pgsql-sql

From Tom Lane
Subject Re: [SQL] Confusion about DISTINCT.
Date
Msg-id 10784.925057363@sss.pgh.pa.us
Whole thread Raw
In response to Confusion about DISTINCT.  (Chris Bitmead <chris.bitmead@bigfoot.com>)
Responses Re: [SQL] Confusion about DISTINCT.
List pgsql-sql
Chris Bitmead <chris.bitmead@bigfoot.com> writes:
> I'm a bit confused about DISTINCT. Firstly, what does DISTINCT mean when
> you don't use "ON".

Means the equality test is taken across all columns of the output.
"ON" restricts the test for "duplicate row" to look at just some
of the columns (thus creating the issue of which tuple out of a
group of "duplicates" gets through the filter to determine the values
of the other columns).

> Like this...
> SELECT DISTINCT category.image FROM story, category* WHERE
> story.category = category.oid ORDER BY datetime DESC;
> The DISTINCT here has no effect on the output other than changing the
> ordering. There are duplicates!

This is a known bug, or at least IMHO it's a bug; it's triggered by your
use of ORDER BY with DISTINCT.  You need to sayORDER BY category.image, datetime DESC 
to make the above example work.

The reason is that DISTINCT is implemented by a sort followed by an
adjacent-duplicate-elimination pass (just like "sort | uniq" in Unix
shell programming).  When you put on an explicit ORDER BY clause,
you override the sort order that the DISTINCT wants, and so the
duplicate filter doesn't necessarily spot the duplicates.  You have
to make sure to sort in an order that will keep the unwanted duplicates
together, ie, list all the DISTINCT columns first in the ORDER BY.

A straightforward solution would be to apply the user-specified ORDER BY
sort *after* the DISTINCT-generated sort and dup-filter steps.  I used
to think this was the right fix.  However, this would break that nifty
technique of letting a user-specified ORDER BY resolve the ambiguity of
DISTINCT ON --- the user ordering has to happen before the dup-filter
for that to work.

Alternatively, the system could alter the user's ORDER BY to ensure the
DISTINCT columns are the primary sort keys --- ie, silently change your
ORDER BY in the above example.  I don't think I like that too much
either.  For one thing, people would complain that the resulting order
wasn't what they asked for.  For another, it seems possible that there
are applications where applying the dup-filter to columns that aren't
the primary keys might be useful.  Let's see, if you didSELECT DISTINCT category.image FROM story, category*
WHEREstory.category= category.oid ORDER BY datetime DESC, category.image;
 
then you'd get each image listed only once *per date* (I think).
Doesn't that sound like it could be a useful behavior?

So right at the moment, I think the system ought not tinker with the
user's ORDER BY.  But it probably should emit a warning message if the
ORDER BY fails to mention all the columns being DISTINCTed on.  If you
aren't sorting by a DISTINCT column at all, you won't even get
reproducible results, let alone desirable ones.

> 2nd question: Why does the following query result in duplicates even
> though I use DISTINCT ON?

Same deal.
        regards, tom lane


pgsql-sql by date:

Previous
From: "Nigel Tamplin"
Date:
Subject: Relating 1 table to another.
Next
From: Tom Lane
Date:
Subject: Re: [SQL] Relating 1 table to another.