Re: Differences in identical queries - Mailing list pgsql-general

From Richard Huxton
Subject Re: Differences in identical queries
Date
Msg-id 45E844FE.6050802@archonet.com
Whole thread Raw
In response to Re: Differences in identical queries  (Rob Schall <rschall@callone.net>)
Responses Re: Differences in identical queries  (Rob Schall <rschall@callone.net>)
List pgsql-general
Rob Schall wrote:
> When I reanalyzed the anitmp table with just the 4 entries (2 with
> istf=true and 2 with istf=false), both queries then ran the same way/time.
> So it would appear, if you want to do a join or a subselect (IN), then
> the number of items if will be comparing it to must be less than a
> certain number.

I'm still not sure how you're concluding that.

 > In my case, the full amount of false's that fail is
> actually 40. So if you have a table of 2 million items (in current) and
> want to join it to a table of 40 items, then it must do the longer
> hashed join version and not the nested loop.
>
> Below are the results with the anitmp table with 42 items. 40 set as
> false, and 2 set as true. Is there a way to rewrite my query to have it
> run as quick as the others?

> calldetail=> EXPLAIN ANALYZE SELECT current.* FROM current JOIN anitmp
> ON current.orignum=anitmp.ani AND istf=false;
>                                                             QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=1.63..1031833.26 rows=256563 width=108) (actual
> time=1889.469..155380.749 rows=653 loops=1)

OK - so here's the root of the problem. The planner thinks it'll get
back 256,563 rows but actually gets 653. If you actually got more than
200,000 rows back then a seq-scan on current might well make sense.

>    Hash Cond: ("outer".orignum = "inner".ani)
>    ->  Seq Scan on current  (cost=0.00..920527.00 rows=10873900
> width=108) (actual time=670.402..136192.991 rows=10681150 loops=1)
>    ->  Hash  (cost=1.52..1.52 rows=41 width=8) (actual time=0.187..0.187
> rows=0 loops=1)
>          ->  Seq Scan on anitmp  (cost=0.00..1.52 rows=41 width=8)
> (actual time=0.014..0.108 rows=40 loops=1)
>                Filter: (istf = false)

Hmm - what sort of distribution of values do you have in "orignum" -
telephone numbers, so presumably they're quite distinct.

I'd be tempted to up the statistics on that column, reanalyse both
tables and see what happens.
   ALTER TABLE current ALTER COLUMN orignum SET STATISTICS=100;
You can set values up to 1000, start at 100 and step up. Not checked the
syntax on that last statement btw.

You can see the before and after effects by looking at:
SELECT * FROM pg_stats WHERE tablename='current' AND attname='orignum';

--
   Richard Huxton
   Archonet Ltd

pgsql-general by date:

Previous
From: Rob Schall
Date:
Subject: Re: Differences in identical queries
Next
From: Vincenzo Romano
Date:
Subject: psql client quits after 1st command