Tom Lane <tgl@sss.pgh.pa.us> writes:
> Greg Stark <gsstark@mit.edu> writes:
> > There is a simple case that isn't being handled. A straight DISTINCT is
> > exactly equivalent to a GROUP BY on all the columns listed. Right now it's
> > still doing a Sort+Unique.
>
> True. The DISTINCT logic is so intertwined with ORDER BY that I
> couldn't see an easy way to consider a hash-based alternative in the
> planner. I think it would require querytree changes propagating all the
> way back to the parser to make this feasible at all, and then you'd
> still need to think about how to make an honest cost-estimate comparison
> of the alternatives (preferably without planning the whole query twice).
> Anyone want to dig into it?
So my thinking is that DISTINCT and DISTINCT ON are really just special cases
of GROUP BY. instead of having separate code paths for them I would suggest
rewriting them to GROUP BY, using a "first()" aggregate function for the
DISTINCT ON case. Then let the normal logic handle it.
The only disadvantage is that if there is an index it's possible to implement
DISTINCT ON more efficiently than arbitrary aggregate functions. Is that the
case currently for the Unique node?
If we have a way of marking arbitrary aggregate functions as only requiring
the first or last record then we could notice that the only aggregate in
rewritten DISTINCT ON queries is first() which only requires the first record
of each value and optimize the index scan. As a bonus this would solve the
min()/max() issue as well.
--
greg