Re: ORDER BY and DISTINCT ON - Mailing list pgsql-hackers

From Tom Lane
Subject Re: ORDER BY and DISTINCT ON
Date
Msg-id 12581.1071500003@sss.pgh.pa.us
Whole thread Raw
In response to Re: ORDER BY and DISTINCT ON  (Bruno Wolff III <bruno@wolff.to>)
List pgsql-hackers
Bruno Wolff III <bruno@wolff.to> writes:
> Specifically the interpretation I think makes sense is that
> SELECT DISTINCT ON (a, b, c) * FROM tablename ORDER BY d, e, f
> should be treated as equvialent to
> SELECT * FROM
>   (SELECT DISTINCT ON (a, b, c) FROM tablename ORDER BY a, b, c, d, e, f) AS t
>   ORDER BY d, e, f

Right, that's the same thing Neil said elsewhere in the thread.  I guess
I agree that this would be cleaner, and also that it doesn't seem like a
high priority problem.

BTW, you could imagine implementing DISTINCT and DISTINCT ON by a hash
table, if you didn't expect too many distinct rows.  The hash key would
be the DISTINCT columns, and in the DISTINCT ON case, when you get a new
row with DISTINCT columns matching an existing entry, you compare the
two rows using the ORDER BY key columns to decide which row to keep.  In
this formulation it's fairly obvious that the DISTINCT and ORDER BY keys
don't need to overlap at all.  This technique does not require presorted
input (good) but also delivers unsorted output (bad).  It could still be
a big win if you expect the DISTINCT filtering to reduce the number of
rows substantially, because you'd end up sorting a much smaller number
of rows.  I didn't get around to trying to implement this when I was
doing hash aggregation for 7.4, but it seems like a reasonable item for
the TODO list.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Bruno Wolff III
Date:
Subject: Re: ORDER BY and DISTINCT ON
Next
From: Andrew Dunstan
Date:
Subject: Re: [PATCHES] fork/exec patch