Re: index v. seqscan for certain values - Mailing list pgsql-performance
From: Jeremy Dunn
Subject: Re: index v. seqscan for certain values
Date:
Msg-id: 001d01c420c1$0ed23290$4f01a8c0@jeremydunn
In response to: Re: index v. seqscan for certain values (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: index v. seqscan for certain values; Re: index v. seqscan for certain values
List: pgsql-performance
> "Jeremy Dunn" <jdunn@autorevenue.com> writes:
> > The question: why does the planner consider a sequential scan to be
> > better for these top 10 values?
>
> At some point a seqscan *will* be better.  In the limit, if the key
> being sought is common enough to occur on every page of the table,
> it's certain that a seqscan will require less I/O than an indexscan
> (because reading the index isn't actually saving you any heap
> fetches).  In practice the breakeven point is less than that because
> Unix kernels are better at handling sequential than random access.
>
> Your gripe appears to be basically that the planner's idea of the
> breakeven point is off a bit.  It looks to me like it's within about
> a factor of 2 of being right, though, which is not all that bad when
> it's using generic cost parameters.

Agreed.  However, given that count(*) is a question that can be
answered _solely_ using the index (without reference to the actual data
blocks), I'd expect the break-even point to be considerably higher than
the < 3% (~38,000 / ~1.3M) I'm currently getting.  Does PG not use
solely the index in this situation??

> > A) alter table xxx alter column cid set statistics 500;
> >    analyze xxx;
> >
> > This does not affect the results.
>
> It probably improved the accuracy of the row count estimates, no?
> The estimate you show for cid=7191032 is off by more than 25% (37765
> vs 50792), which seems like a lot of error for one of the most common
> values in the table.  (I hope that was with default stats target and
> not 500.)  That leads directly to a 25% overestimate of the cost of
> an indexscan, while having IIRC no impact on the cost of a seqscan.
> Since the cost ratio was more than 25%, this didn't change the
> selected plan, but you want to fix that error as best you can before
> you move on to tweaking cost parameters.

Actually it made them worse!  Yes, this was with the default statistics
target (10).  When I just tried it again with a value of 300, ran
analyze, and then ran the query, I got a *worse* estimate.  I don't
understand this.

    alter table xxx alter column cid set statistics 300;
    analyze emailrcpts;
    set random_page_cost to 2;
    explain analyze select count(*) from xxx where cid=7191032;

    Aggregate  (cost=20563.28..20563.28 rows=1 width=0) (actual time=7653.90..7653.90 rows=1 loops=1)
      ->  Index Scan using xxx_cid on xxx  (cost=0.00..20535.82 rows=10983 width=0) (actual time=72.24..7602.38 rows=37765 loops=1)
    Total runtime: 7654.14 msec

Now it estimates I have only 10,983 rows (~3x too low) instead of the
old estimate of 50,792 (1.3x too high).  Why is that??  (See the
pg_stats check sketched below.)

Anyway, a workable solution seems to be using a lower value for
random_page_cost.  Thanks to everyone who replied with this answer.

> Also it is likely appropriate to increase effective_cache_size, which
> is awfully small in the default configuration.  I'd set that to
> something related to your available RAM before trying to home in on a
> suitable random_page_cost.

We have ours set to the default value of 1000, which does seem low for
a system with 1GB of RAM.  We'll raise it once we figure out how much
memory is actually available, and then tweak random_page_cost
appropriately at that point.  (See the settings sketch below.)

I'd still like to understand the strangeness above, if anyone can shed
light.

- Jeremy
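One way to see what the planner now believes about the cid column after
the larger statistics target is to query the pg_stats system view.  A
minimal sketch, assuming the same obfuscated names (table xxx, column
cid) used above:

    select n_distinct, most_common_vals, most_common_freqs
    from pg_stats
    where tablename = 'xxx' and attname = 'cid';

If 7191032 no longer shows up in most_common_vals after the re-analyze,
the planner estimates its frequency from n_distinct rather than from a
measured per-value frequency, which could account for the rows=10983
guess; that is an assumption about this particular case, not something
the plan output above actually shows.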
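A sketch of the session-level settings Tom suggests, assuming
effective_cache_size is expressed in 8kB disk pages (as in PostgreSQL
releases of this vintage) and that roughly half of the 1GB of RAM is
available for caching; both numbers are placeholders to adjust once the
real figures are known:

    -- 64000 pages * 8kB = ~512MB, about half of 1GB of RAM (assumed)
    set effective_cache_size = 64000;
    -- lower random_page_cost to reflect mostly-cached data
    set random_page_cost = 2;
    explain analyze select count(*) from xxx where cid = 7191032;

The same values can be set permanently in postgresql.conf instead of
per session.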