Re: index v. seqscan for certain values - Mailing list pgsql-performance

From Jeremy Dunn
Subject Re: index v. seqscan for certain values
Date
Msg-id 001d01c420c1$0ed23290$4f01a8c0@jeremydunn
Whole thread Raw
In response to Re: index v. seqscan for certain values  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: index v. seqscan for certain values
Re: index v. seqscan for certain values
List pgsql-performance
> "Jeremy Dunn" <jdunn@autorevenue.com> writes:
> > The question: why does the planner consider a sequential scan to be
> > better for these top 10 values?
>
> At some point a seqscan *will* be better.  In the limit, if
> the key being sought is common enough to occur on every page
> of the table, it's certain that a seqscan will require less
> I/O than an indexscan (because reading the index isn't
> actually saving you any heap fetches). In practice the
> breakeven point is less than that because Unix kernels are
> better at handling sequential than random access.
>
> Your gripe appears to be basically that the planner's idea of
> the breakeven point is off a bit.  It looks to me like it's
> within about a factor of 2 of being right, though, which is
> not all that bad when it's using generic cost parameters.

Agreed.  However, given that count(*) is a question that can be answered
_solely_ using the index (without reference to the actual data blocks),
I'd expect that the break-even point would be considerably higher than
the < 3% (~38,000 / ~1.3M) I'm currently getting.  Does PG not use
solely the index in this situation??

> > A) alter table xxx alter column cid set statistics 500;
> >     analyze xxx;
> > This does not affect the results.
>
> It probably improved the accuracy of the row count estimates,
> no? The estimate you show for cid=7191032 is off by more than
> 25% (37765 vs 50792), which seems like a lot of error for one
> of the most common values in the table.  (I hope that was
> with default stats target and not 500.)  That leads directly
> to a 25% overestimate of the cost of an indexscan, while
> having IIRC no impact on the cost of a seqscan. Since the
> cost ratio was more than 25%, this didn't change the selected
> plan, but you want to fix that error as best you can before
> you move on to tweaking cost parameters.

Actually it made them worse!  Yes, this was the default statistics (10).
When I just tried it again with a value of 300, analyze, then run the
query, I get a *worse* result for an estimate.  I don't understand this.


   alter table xxx alter column cid set statistics 300;
   analyze emailrcpts;
   set random_page_cost to 2;
   explain analyze select count(*) from xxx where cid=7191032;

   Aggregate  (cost=20563.28..20563.28 rows=1 width=0) (actual
time=7653.90..7653.90 rows=1 loops=1)
  ->  Index Scan using xxx_cid on xxx  (cost=0.00..20535.82 rows=10983
width=0) (actual time=72.24..7602.38 rows=37765 loops=1)
   Total runtime: 7654.14 msec

Now it estimates I have only 10,983 rows (~3x too low) instead of the
old estimate 50,792 (1.3x too high).  Why is that ??

Anyway, a workable solution seems to be using a lower value for
Random_Page_Cost.  Thanks to everyone who replied with this answer.

> Also it is likely appropriate to increase
> effective_cache_size, which is awfully small in the default
> configuration.  I'd set that to something related to your
> available RAM before trying to home in on a suitable random_page_cost.

We have ours set to the default value of 1000, which does seem low for a
system with 1GB of RAM.  We'll up this once we figure out what's
available.  Then tweak the Random_Page_Cost appropriately at that point.

I'd still like to understand the strangeness above, if anyone can shed
light.

- Jeremy


pgsql-performance by date:

Previous
From: Pailloncy Jean-Gérard
Date:
Subject: Re: Index Backward Scan fast / Index Scan slow ! (Modifié par Pailloncy Jean-Gérard)
Next
From: Bruno Wolff III
Date:
Subject: Re: index v. seqscan for certain values