Re: [HACKERS] Why is that so slow? - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] Why is that so slow?
Date
Msg-id 3107.920676281@sss.pgh.pa.us
Whole thread Raw
In response to Re: [HACKERS] Why is that so slow?  (Tatsuo Ishii <t-ishii@sra.co.jp>)
Responses Re: [HACKERS] Why is that so slow?
List pgsql-hackers
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> postal=> explain select * from postal,prefecture where city ~ '^aaa' and postal.pid = prefecture.pid;
> NOTICE:  QUERY PLAN:
> 
> Nested Loop  (cost=98.90 size=1 width=100)
>   ->  Seq Scan on prefecture  (cost=2.55 size=47 width=26)
>   ->  Index Scan using pidindex on postal  (cost=2.05 size=1 width=74)
> 
> This is so slooow. Can anybody explain this? Am I missing something?

and later:
> I had defined a btree index on pid and it has 2000
> duplicate entries in average! After I removed the index, the query
> runs unbelievably fast! Now explain shows:

> Nested Loop  (cost=933.82 size=1 width=100)
> -> Index Scan using cityindex on postal  (cost=931.77 size=1 width=74)
> -> Index Scan using prefpidindex on prefecture  (cost=2.05 size=47 width=26)

Hmm.  Removal of the index is just a hack --- the system should have
been smart enough not to use it.  It looks like the system chose the
first plan shown above because it thought that selecting postal entries
matching a particular pid value would on average match only one postal
tuple (note the "size" fields, which are estimates of the numbers of
resulting tuples).  But in reality, each scan produced 2000 matching
entries on average, according to your second message --- and each of
those entries had to be tested to see if it had the right city name.
So, very slow.

The question I have is why didn't the system realize that there would be
lots of matches on pid?  The "dispersion" statistics it keeps ought to
have given it a clue that this approach wouldn't be very selective.

The second example is fast because the scan over postal looking for city
name matches finds only one match, so prefecture is scanned only once.
However the cost estimates there also look bogus --- the system is again
mis-guessing how many entries will be selected.  It seems to think that
all 47 prefecture entries will be matched by a scan for a specific pid.
So, bogus dispersion values again (or bad use of them).

Something is fishy here.  Have you done a "vacuum analyze" since loading
the data in these tables?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Michael Davis
Date:
Subject: permissions problem
Next
From: Michael Davis
Date:
Subject: RE: [GENERAL] permissions problem