Sean Shanny <shannyconsulting@earthlink.net> writes:
> Sort of piggybacking on this thread but why the suggestion to drop the
> use of DISTINCT in 7.4?
Because the 7.4 planner can decide for itself whether DISTINCT'ifying
the sub-select output is the best way to proceed or not. There is more
than one good way to do an "IN sub-SELECT" operation, and the 7.4
planner knows several. (Pre-7.4 planners didn't know any :-( ... but
I digress.) When you write "foo IN (SELECT DISTINCT ...)", the DISTINCT
doesn't change the semantics at all, it just adds overhead.
In fact it's worse than that: if the planner decides that the best way
to proceed is to make the subselect output unique, it will throw another
layer of sort/unique processing on top of what you did. So writing
DISTINCT is actually a pessimization in 7.4.
> Example:
> -> HashAggregate (cost=1020025.13..1020025.13 rows=51 width=4)
> (actual time=954049.317..954450.065 rows=82208 loops=1)
> -> Subquery Scan "IN_subquery" (cost=983429.20..1020025.00
> rows=51 width=4) (actual time=856641.244..953639.116 rows=82208 loops=1)
> -> Unique (cost=983429.20..1020024.49 rows=51 width=4)
> (actual time=856641.230..952939.539 rows=82208 loops=1)
> -> Sort (cost=983429.20..1001726.84 rows=7319058
> width=4) (actual time=856641.215..906429.835 rows=11067735 loops=1)
> Sort Key: useragent_key
> -> Index Scan using
The sort/unique steps are coming from the DISTINCT. The HashAggregate
step is the planner making sure the output rows are distinct :-(
I just a couple days ago added some logic to CVS tip to notice that the
sub-select has a DISTINCT clause, and not add unnecessary unique-ifying
processing on top of it. So in 7.5, writing a DISTINCT clause will
amount to forcing a particular query plan, which might or might not be
the best thing but hopefully won't be too terrible. But in 7.4 it has
nothing to recommend it ...
regards, tom lane