estimates for nested loop very wrong? - Mailing list pgsql-sql

From joostje@komputilo.org
Subject estimates for nested loop very wrong?
Date
Msg-id 20030410112144.GA22716@co.uea.org
Whole thread Raw
Responses Re: estimates for nested loop very wrong?  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
Re: estimates for nested loop very wrong?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-sql
Hi,

When JOINing two tabels (one with 23 entries, one with 2.5e6 entries),
psql estimates the cost of the nested loop method way to high, causing
it to use Hash Join, even though Hash Join actually takes 30 seconds,
agianst 0.020 seconds for Nested Loop.

It really puzzles me why the estimate for the Nested Loop is so bad,
as it looks like a rather easy thing to estimate...

Below:
table db   has 2.5e6 entries, column "id" has rather evenly distributed values.          column id has a (btree)
index.
table tmp1 has 23 entries, column "v0" has all unique entries.
psql version: 7.2.1

Here is the query twice, once with enable_hashjoin ON, second time
with enable_hashjoin OFF, to force psql to use Nested Loop

ueadb=> explain analyse select id, var, val from db, tmp1 where id=tmp1.v0;
NOTICE:  QUERY PLAN:

Hash Join  (cost=1.29..67863.71 rows=61140 width=38) (actual time=4475.26..32442.99 rows=756 loops=1) ->  Seq Scan on
db (cost=0.00..54498.12 rows=2520012 width=31) (actual time=0.07..29170.62 rows=2520012 loops=1) ->  Hash
(cost=1.23..1.23rows=23 width=7) (actual time=0.25..0.25 rows=0 loops=1)       ->  Seq Scan on tmp1  (cost=0.00..1.23
rows=23width=7) (actual time=0.17..0.22 rows=23 loops=1)
 
Total runtime: 32443.78 msec


--Setting hashjoin off, forcing psql to use the Nested Loop
ueadb=> set enable_hashjoin = off;


ueadb=> explain analyse select id, var, val from db, tmp1 where id=tmp1.v0;
NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..208256.60 rows=61140 width=38) (actual time=0.92..18.49 rows=756 loops=1) ->  Seq Scan on tmp1
(cost=0.00..1.23 rows=23 width=7) (actual time=0.24..0.39 rows=23 loops=1) ->  Index Scan using db_id_idx on db
(cost=0.00..9021.35rows=2658 width=31) (actual time=0.32..0.69 rows=33 loops=23)
 
Total runtime: 19.20 msec




I guess I'll be doing my queries with enable_hashjoin OFF, but is there anythign I'm
doing wrong?
(Apart from maybe uzing psql 7.2.1 -- would 7.3 be smarter here?)



pgsql-sql by date:

Previous
From: Olleg Samojlov
Date:
Subject: Re: create databases with plpgsql
Next
From: "Ries van Twisk"
Date:
Subject: Re: The need to know if a field is using/connected to a sequence