Re: index v. seqscan for certain values - Mailing list pgsql-performance

From Bill Moran
Subject Re: index v. seqscan for certain values
Date
Msg-id 407ABF2B.80306@potentialtech.com
Whole thread Raw
In response to index v. seqscan for certain values  ("Jeremy Dunn" <jdunn@autorevenue.com>)
Responses Re: index v. seqscan for certain values
List pgsql-performance
Quick bit of input, since you didn't mention it.

How often do you run ANALYZE?  I found it interesting that a database I
was doing tests on sped up by a factor of 20 after ANALYZE.  If your
data changes a lot, you should probably schedule ANALYZE to run with
VACUUM.

Jeremy Dunn wrote:
> I've searched the archives and can't find an answer to this seemingly
> simple question.  Apologies if it's too common.
>
> The table in question has ~1.3M rows.  It has 85 columns, 5 of which
> have single-column indexes.
>
> The column in question (CID) has 183 distinct values.  For these values,
> the largest has ~38,000 rows, and the smallest has 1 row.  About 30
> values have < 100 rows, and about 10 values have > 20,000 rows.
>
> The database is 7.2.3 running on RedHat 7.1. (we are in process of
> upgrading to PG 7.4.2)    All of the query plan options are enabled, and
> the cpu costs are set to the default values. ( cpu_tuple_cost is 0.01,
> cpu_index_tuple_cost is 0.001).  The database is VACUUM'd every night.
>
> The problem:
> A simply query:
>     select count(*) from xxx where CID=<smalval>
> where <smalval> is a CID value which has relatively few rows, returns a
> plan using the index on that column.
>
>    explain analyze select count(*) from xxx where cid=869366;
>    Aggregate  (cost=19136.33..19136.33 rows=1 width=0) (actual
> time=78.49..78.49 rows=1 loops=1)
>      ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..19122.21
> rows=5648 width=0) (actual time=63.40..78.46 rows=1 loops=1)
>    Total runtime: 78.69 msec
>
> The same plan is true for values which have up to about 20,000 rows:
>
>    explain analyze select count(*) from xxx where cid=6223341;
>    Aggregate  (cost=74384.19..74384.19 rows=1 width=0) (actual
> time=11614.89..11614.89 rows=1 loops=1)
>      ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..74329.26
> rows=21974 width=0) (actual time=35.75..11582.10 rows=20114 loops=1)
>    Total runtime: 11615.05 msec
> However for the values that have > 20,000 rows, the plan changes to a
> sequential scan, which is proportionately much slower.
>
>    explain analyze select count(*) from xxx where cid=7191032;
>    Aggregate  (cost=97357.61..97357.61 rows=1 width=0) (actual
> time=46427.81..46427.82 rows=1 loops=1)
>     ->   Seq Scan on xxx (cost=0.00..97230.62 rows=50792 width=0)
> (actual time=9104.45..46370.27 rows=37765 loops=1)
>     Total runtime: 46428.00 msec
>
>
> The question: why does the planner consider a sequential scan to be
> better for these top 10 values?  In terms of elapsed time it is more
> than twice as slow, proportionate to an index scan for the same number
> of rows.
>
> What I tried:
>
> A) alter table xxx alter column cid set statistics 500;
>     analyze xxx;
> This does not affect the results.
>
> B)  dropped/rebuilt the index, with no improvement.
>
> C) decreasing cpu_index_tuple_cost by a factor of up to 1000, with no
> success
>
> D) force an index scan for the larger values by using a very high value
> for cpu_tuple_cost (e.g. .5) but this doesn't seem like a wise thing to do.
>
> Your thoughts appreciated in advance!
>
> - Jeremy
>
> 7+ years experience in Oracle performance-tuning
> relatively new to postgresql


--
Bill Moran
Potential Technologies
http://www.potentialtech.com


pgsql-performance by date:

Previous
From: "Jeremy Dunn"
Date:
Subject: index v. seqscan for certain values
Next
From: "Jeremy Dunn"
Date:
Subject: Re: index v. seqscan for certain values