Home > mailing lists

Re: Query performance question on a large table - Mailing list pgsql-general

From	Tom Lane
Subject	Re: Query performance question on a large table
Date	January 7, 2004 06:32:18
Msg-id	16879.1073460682@sss.pgh.pa.us Whole thread Raw
In response to	Re: Query performance question on a large table (Sean Shanny <shannyconsulting@earthlink.net>)
Responses	Re: Query performance question on a large table
List	pgsql-general

Tree view

Sean Shanny <shannyconsulting@earthlink.net> writes:
> Sort of piggybacking on this thread but why the suggestion to drop the
> use of DISTINCT in 7.4?

Because the 7.4 planner can decide for itself whether DISTINCT'ifying
the sub-select output is the best way to proceed or not.  There is more
than one good way to do an "IN sub-SELECT" operation, and the 7.4
planner knows several.  (Pre-7.4 planners didn't know any :-( ... but
I digress.)  When you write "foo IN (SELECT DISTINCT ...)", the DISTINCT
doesn't change the semantics at all, it just adds overhead.

In fact it's worse than that: if the planner decides that the best way
to proceed is to make the subselect output unique, it will throw another
layer of sort/unique processing on top of what you did.  So writing
DISTINCT is actually a pessimization in 7.4.

>  Example:

>    ->  HashAggregate  (cost=1020025.13..1020025.13 rows=51 width=4)
> (actual time=954049.317..954450.065 rows=82208 loops=1)
>          ->  Subquery Scan "IN_subquery"  (cost=983429.20..1020025.00
> rows=51 width=4) (actual time=856641.244..953639.116 rows=82208 loops=1)
>                ->  Unique  (cost=983429.20..1020024.49 rows=51 width=4)
> (actual time=856641.230..952939.539 rows=82208 loops=1)
>                      ->  Sort  (cost=983429.20..1001726.84 rows=7319058
> width=4) (actual time=856641.215..906429.835 rows=11067735 loops=1)
>                            Sort Key: useragent_key
>                            ->  Index Scan using

The sort/unique steps are coming from the DISTINCT.  The HashAggregate
step is the planner making sure the output rows are distinct :-(

I just a couple days ago added some logic to CVS tip to notice that the
sub-select has a DISTINCT clause, and not add unnecessary unique-ifying
processing on top of it.  So in 7.5, writing a DISTINCT clause will
amount to forcing a particular query plan, which might or might not be
the best thing but hopefully won't be too terrible.  But in 7.4 it has
nothing to recommend it ...

            regards, tom lane

pgsql-general by date:

From: "Chris Travers"
Date: 07 January 2004, 06:31:55
Subject: Re: problems with transaction blocks

From: Együd Csaba
Date: 07 January 2004, 06:44:51
Subject: Re: Query performance question on a large table

Re: Query performance question on a large table - Mailing list pgsql-general

Previous

Next