Home > mailing lists

Re: Differences in identical queries - Mailing list pgsql-general

From	Rob Schall
Subject	Re: Differences in identical queries
Date	March 2, 2007 17:12:13
Msg-id	45E868EC.3050101@callone.net Whole thread Raw
In response to	Re: Differences in identical queries (Richard Huxton <dev@archonet.com>)
List	pgsql-general

Tree view

The final answer wound up being....
  ALTER TABLE current ALTER COLUMN orignum SET STATISTICS 100;
and then an ANALYZE current;

Then the queries started running like a champ (split seconds, not minutes).

Thanks Richard!
Rob


Richard Huxton wrote:
> Rob Schall wrote:
>> When I reanalyzed the anitmp table with just the 4 entries (2 with
>> istf=true and 2 with istf=false), both queries then ran the same
>> way/time.
>> So it would appear, if you want to do a join or a subselect (IN), then
>> the number of items if will be comparing it to must be less than a
>> certain number.
>
> I'm still not sure how you're concluding that.
>
> > In my case, the full amount of false's that fail is
>> actually 40. So if you have a table of 2 million items (in current) and
>> want to join it to a table of 40 items, then it must do the longer
>> hashed join version and not the nested loop.
>>
>> Below are the results with the anitmp table with 42 items. 40 set as
>> false, and 2 set as true. Is there a way to rewrite my query to have it
>> run as quick as the others?
>
>> calldetail=> EXPLAIN ANALYZE SELECT current.* FROM current JOIN anitmp
>> ON current.orignum=anitmp.ani AND istf=false;
>>                                                             QUERY PLAN
>>
-----------------------------------------------------------------------------------------------------------------------------------
>>
>>  Hash Join  (cost=1.63..1031833.26 rows=256563 width=108) (actual
>> time=1889.469..155380.749 rows=653 loops=1)
>
> OK - so here's the root of the problem. The planner thinks it'll get
> back 256,563 rows but actually gets 653. If you actually got more than
> 200,000 rows back then a seq-scan on current might well make sense.
>
>>    Hash Cond: ("outer".orignum = "inner".ani)
>>    ->  Seq Scan on current  (cost=0.00..920527.00 rows=10873900
>> width=108) (actual time=670.402..136192.991 rows=10681150 loops=1)
>>    ->  Hash  (cost=1.52..1.52 rows=41 width=8) (actual time=0.187..0.187
>> rows=0 loops=1)
>>          ->  Seq Scan on anitmp  (cost=0.00..1.52 rows=41 width=8)
>> (actual time=0.014..0.108 rows=40 loops=1)
>>                Filter: (istf = false)
>
> Hmm - what sort of distribution of values do you have in "orignum" -
> telephone numbers, so presumably they're quite distinct.
>
> I'd be tempted to up the statistics on that column, reanalyse both
> tables and see what happens.
>   ALTER TABLE current ALTER COLUMN orignum SET STATISTICS=100;
> You can set values up to 1000, start at 100 and step up. Not checked
> the syntax on that last statement btw.
>
> You can see the before and after effects by looking at:
> SELECT * FROM pg_stats WHERE tablename='current' AND attname='orignum';
>

pgsql-general by date:

From: Laurent ROCHE
Date: 02 March 2007, 16:55:00
Subject: Re : pg_dump, serial

From: Tom Lane
Date: 02 March 2007, 17:18:25
Subject: Re: Re : pg_dump, serial

Re: Differences in identical queries - Mailing list pgsql-general

Previous

Next