index v. seqscan for certain values - Mailing list pgsql-performance

From Jeremy Dunn
Subject index v. seqscan for certain values
Date
Msg-id 000901c420a4$7b733650$4f01a8c0@jeremydunn
Whole thread Raw
Responses Re: index v. seqscan for certain values
Re: index v. seqscan for certain values
Re: index v. seqscan for certain values
List pgsql-performance
I've searched the archives and can't find an answer to this seemingly simple question.  Apologies if it's too common.
 
The table in question has ~1.3M rows.  It has 85 columns, 5 of which have single-column indexes.
 
The column in question (CID) has 183 distinct values.  For these values, the largest has ~38,000 rows, and the smallest has 1 row.  About 30 values have < 100 rows, and about 10 values have > 20,000 rows.
 
The database is 7.2.3 running on RedHat 7.1. (we are in process of upgrading to PG 7.4.2)    All of the query plan options are enabled, and the cpu costs are set to the default values. ( cpu_tuple_cost is 0.01, cpu_index_tuple_cost is 0.001).  The database is VACUUM'd every night.
 
The problem:
A simply query:
    select count(*) from xxx where CID=<smalval>
where <smalval> is a CID value which has relatively few rows, returns a plan using the index on that column.
 
   explain analyze select count(*) from xxx where cid=869366;
   Aggregate  (cost=19136.33..19136.33 rows=1 width=0) (actual time=78.49..78.49 rows=1 loops=1)
     ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..19122.21 rows=5648 width=0) (actual time=63.40..78.46 rows=1 loops=1)
   Total runtime: 78.69 msec
 
The same plan is true for values which have up to about 20,000 rows:
 
   explain analyze select count(*) from xxx where cid=6223341;
   Aggregate  (cost=74384.19..74384.19 rows=1 width=0) (actual time=11614.89..11614.89 rows=1 loops=1)
     ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..74329.26 rows=21974 width=0) (actual time=35.75..11582.10 rows=20114 loops=1)
   Total runtime: 11615.05 msec
However for the values that have > 20,000 rows, the plan changes to a sequential scan, which is proportionately much slower.
 
   explain analyze select count(*) from xxx where cid=7191032;
   Aggregate  (cost=97357.61..97357.61 rows=1 width=0) (actual time=46427.81..46427.82 rows=1 loops=1)
    ->   Seq Scan on xxx (cost=0.00..97230.62 rows=50792 width=0) (actual time=9104.45..46370.27 rows=37765 loops=1)
    Total runtime: 46428.00 msec
 
 
The question: why does the planner consider a sequential scan to be better for these top 10 values?  In terms of elapsed time it is more than twice as slow, proportionate to an index scan for the same number of rows.
 
What I tried:
 
A) alter table xxx alter column cid set statistics 500;   
    analyze xxx;
This does not affect the results.
 
B)  dropped/rebuilt the index, with no improvement.
 
C) decreasing cpu_index_tuple_cost by a factor of up to 1000, with no success
 
D) force an index scan for the larger values by using a very high value for cpu_tuple_cost (e.g. .5) but this doesn't seem like a wise thing to do.
 
Your thoughts appreciated in advance!
 
- Jeremy 
 
7+ years experience in Oracle performance-tuning
relatively new to postgresql

pgsql-performance by date:

Previous
From: "Shea,Dan [CIS]"
Date:
Subject: Re: Deleting certain duplicates
Next
From: Bill Moran
Date:
Subject: Re: index v. seqscan for certain values