Re: Use of index in 7.0 vs 6.5 - Mailing list pgsql-sql

From Tom Lane
Subject Re: Use of index in 7.0 vs 6.5
Date
Msg-id 21530.959235278@sss.pgh.pa.us
Whole thread Raw
In response to Use of index in 7.0 vs 6.5  (Ryan Bradetich <ryan_bradetich@hp.com>)
Responses RE: Use of index in 7.0 vs 6.5  ("Hiroshi Inoue" <Inoue@tpf.co.jp>)
List pgsql-sql
Ryan Bradetich <ryan_bradetich@hp.com> writes:
> procman=# explain select count(catagory) from medusa where host_id = 404
> and catagory like 'A%';

> Here is my analysis of the stastics (based on the examples in the
> archive).

> The most common value host_id in the table is 446 with row fraction of
> ~ 2.8%.  The estimated number of rows in the index is 6704.  This
> table has 4,630,229 entries in the table.
> I do not understand why the planner would choose a seqscan over the
> index scan because 6704/4,630,229 is ~ 0.15%.

I see at least part of the answer.  The 'rows=' number in the EXPLAIN
output is the planner's estimate of the net number of rows out after
applying all available WHERE restrictions.  In this case we've got a
clause "host_id = 404" which can be used with the index on host_id,
and then we have another clause "catagory like 'A%'" which will be
applied on-the-fly to the tuples returned by the indexscan.  The
rows number tells you about the estimated number of rows out after
that second filter step.  However, the cost of the indexscan depends
mainly on the number of tuples that have to be fetched, and that is
determined by the selectivity of just the "host_id = 404" clause.

I made a dummy table with the schema you showed and then inserted
the statistics you reported into the system tables (who's afraid
of "update pg_class ..." ;-)).  If I didn't mess up, you should
be able to reproduce these EXPLAIN results:

set enable_seqscan = off;
explain select count(catagory) from medusa where host_id = 404;
NOTICE:  QUERY PLAN:

Aggregate  (cost=206943.69..206943.69 rows=1 width=12) ->  Index Scan using medusa_host_id_key on medusa
(cost=0.00..206781.97rows=64690 width=12)
 

set enable_seqscan = on;
explain select count(catagory) from medusa where host_id = 404;
NOTICE:  QUERY PLAN:

Aggregate  (cost=178115.59..178115.59 rows=1 width=12) ->  Seq Scan on medusa  (cost=0.00..177953.86 rows=64690
width=12)


This shows that the planner is actually estimating that the indexscan
will fetch about 64690 rows (of which it guesses only 6704 will remain
after applying the catagory clause, but that's not really relevant to
the cost estimate).  Since there are 120076 pages in the table, that
would mean pretty nearly one separate page fetch for each retrieved
tuple, if the matching tuples are randomly distributed --- and that
very probably *would* take more time than reading the whole table
sequentially.  So the planner's chain of logic holds up if all these
assumptions are correct.

Since you find that in reality the indexscan method is very quick,
I'm guessing that there are actually fairly few tuples matching
host_id = 404.  Could you run a quick "select count(*)" to check?

This seems to point up (once again) the deficiency of assuming that
the most-common value in the table is a good guide to the frequency
of typical values.  You showed that host_id = 446 occurs in 2.8% of
the rows in this table; a search for 446 very probably would be faster
as a seqscan than as an indexscan (you might care to try it and see).
But that's probably a statistical outlier that's not got much to do
with the frequency of typical values in the table.

The only really good answer to this problem is to collect more-detailed
statistics in VACUUM ANALYZE, which I hope to see us doing in a release
or so.  In the meantime I am wondering about deliberately skewing the
cost model in favor of indexscans, because I sure haven't heard many
complaints about erroneous selection of indexscans...

One way to put a thumb on the scales is to reduce the value of the SET
variable random_page_cost.  The default value is 4.0, which seems to
correspond more or less to reality, but reducing it to 3 or so would
shift the planner pretty nicely in the direction of indexscans.
        regards, tom lane


pgsql-sql by date:

Previous
From: Ryan Bradetich
Date:
Subject: Re: Use of index in 7.0 vs 6.5
Next
From: "Hiroshi Inoue"
Date:
Subject: RE: Use of index in 7.0 vs 6.5