More thoughts about planner's cost estimates - Mailing list pgsql-hackers
From:     Tom Lane
Subject:  More thoughts about planner's cost estimates
Date:
Msg-id:   14821.1149104747@sss.pgh.pa.us
List:     pgsql-hackers
I've been thinking about the planner's costing problems again,
particularly in connection with Philippe Lang's complaint here:
http://archives.postgresql.org/pgsql-general/2006-05/msg01526.php

Investigation showed that the planner is much too optimistic about the
usefulness of a multi-index BitmapAnd plan, and since that comparison is
just a cost-estimate comparison, it implies we are underestimating the
cost of an index scan.  A typical example from his results is

  ->  BitmapAnd  (cost=12.30..12.30 rows=1 width=0) (actual time=0.306..0.306 rows=0 loops=13628)
        ->  Bitmap Index Scan on lw_id_workflow  (cost=0.00..2.02 rows=7 width=0) (actual time=0.009..0.009 rows=7 loops=13628)
              Index Cond: (lw.id_workflow = "outer".id)
        ->  Bitmap Index Scan on lw_ordre  (cost=0.00..10.03 rows=1437 width=0) (actual time=0.293..0.293 rows=1714 loops=13628)
              Index Cond: (ordre = $4)

There are two variables involved here: the cost of touching an index page
and the cost of processing an index tuple.  Given the two comparable data
points above, we can solve for those numbers, and it turns out that the
real cost ratio on Philippe's machine is about 50 to 1.  Which says that
for him, cpu_index_tuple_cost plus cpu_operator_cost should be around
0.02, nearly an order of magnitude higher than their current default
values (0.001 and 0.0025 respectively).

In general it seems to me that for CPU-bound databases, the default
values of the cpu_xxx_cost variables are too low.  I am tempted to raise
the default value of cpu_index_tuple_cost to 0.005, which would be half
of cpu_tuple_cost and twice cpu_operator_cost; that seems more in line
with my feel for the relative costs of things.  For a CPU-bound database
all three would need to go up more than that.  But rather than telling
people to manipulate all three of these variables individually, I think
it might also be a good idea to provide a new GUC variable named
something like "cpu_speed_scale" that would just be a multiplier for the
other variables.  It would default to 1.0 and our standard advice for
CPU-bound databases would be "decrease random_page_cost to 1.0 and raise
cpu_speed_scale to 10.0 or so".  Seems cleaner than telling people to
muck with three or so individual values.

Another thing that's bothering me is that the index access cost
computation (in genericcostestimate) is looking sillier and sillier:

    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of the
     * index is a metadata page, and we should figure on fetching that
     * plus a pro-rated fraction of the remaining pages.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        numIndexPages += 1;             /* count the metapage too */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;

    /*
     * Compute the index access cost.
     *
     * Disk cost: our generic assumption is that the index pages will be
     * read sequentially, so they have cost 1.0 each, not random_page_cost.
     */
    *indexTotalCost = numIndexPages;

There are several things wrong with this, at least in its application to
btrees.  It's not counting descent of the btree (ie, touches of the root
page and intermediate-level pages).  On the other hand it's surely
overcharging for metapage touches.  As of CVS tip we cache the metapage
in the relcache, so it's probably fair to disregard that cost altogether.
And on the third hand, when we need to retrieve multiple leaf pages it's
over-optimistic to assume that those'll be purely sequential fetches.
(That would be approximately true in a freshly-built btree, but not in
one that's suffered any amount of page splitting or recycling.)
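To make concrete what the quoted code charges, here is a minimal
standalone sketch (not from the message above and not PostgreSQL source)
of the same arithmetic, using an assumed index of 10,000 pages and
3,000,000 tuples: for a single-entry fetch the pro-rated leaf fraction is
negligible, the metapage adds one, and ceil() rounds the total up to the
"current fixed 2.0" referred to below, with each page charged at
sequential rather than random cost.

    /*
     * Sketch only, not PostgreSQL source: the disk-cost arithmetic from
     * the genericcostestimate excerpt above, applied to a hypothetical
     * index of 10,000 pages and 3,000,000 tuples.
     */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double  index_pages = 10000.0;      /* assumed index size */
        double  index_tuples = 3000000.0;   /* assumed index entries */
        double  numIndexTuples = 1.0;       /* fetching a single entry */
        double  numIndexPages;

        if (index_pages > 1 && index_tuples > 0)
        {
            numIndexPages = (numIndexTuples / index_tuples) * (index_pages - 1);
            numIndexPages += 1;             /* count the metapage too */
            numIndexPages = ceil(numIndexPages);
        }
        else
            numIndexPages = 1.0;

        /* each page is charged 1.0 (sequential), not random_page_cost */
        printf("estimated index disk cost = %.1f\n", numIndexPages);
        return 0;
    }

Under the model proposed below, the same single-entry fetch would instead
be charged random_page_cost for its one leaf page and nothing for the
metapage.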
I've desisted from touching this mainly because the obvious sanity
adjustments, such as adding something for tree descent and charging
random_page_cost not 1.0 for leaf page touches, would increase the
estimated cost of index usage, which seemed like not the direction we
wanted to go in.  So I figured that the errors were more or less
cancelling each other.  But the addition of bitmap index scans is now
exposing the weaknesses, so we need to face up to the problem.

I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent.  I
justify the latter on the grounds that upper tree levels will tend to
stay in cache because they're touched so often.  random_page_cost per
leaf page may be an overestimate, since sometimes logically adjacent leaf
pages will be physically adjacent too, but not charging for tree descent
should help to cancel that out.  With this model, the disk cost to fetch
a single index entry will be estimated as random_page_cost (default 4.0)
rather than the current fixed 2.0.  This shouldn't hurt things too much
for simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change.  And it will increase the
costs for bitmap scans that cross many index pages, which is what we need
in light of Philippe's numbers.

Now we have seen a lot of cases in which indexscans were being
drastically overestimated, so increasing the cost estimate even more may
seem like a horrid idea.  But I believe that most or all of these cases
were ones where the same index was being visited many times, and so the
real estimation failure is not accounting for cache effects across
multiple indexscans.  Rather than being afraid to raise the cost estimate
for an indexscan in isolation, I think it's time to bite the bullet and
do something about that.

The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned.  If we
did have that number, I think it'd be reasonable to use the Mackert and
Lohman approximation already used in cost_index to estimate the total
number of index leaf pages fetched over N scans, and then divide by N to
get our per-scan estimated cost.  But because the planner builds cost
estimates bottom-up, in most scenarios we don't have any idea whether an
indexscan plan node will be iterated once or many times in the finished
plan.

However there is one case where we do have some information: when
computing a best_inner_indexscan() for a nestloop join, it'd be
reasonable to use the estimated size of the outer relation as the loop
count.  And this is exactly the case where we are falling down in
practice: the complaints we've seen are mostly about overestimating the
cost of a nestloop-with-inner-index-scan.  So I think handling just this
case would be a big step forward, even if it's not a theoretically pure
general solution.
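As an illustration of that last idea, here is a rough sketch of
amortizing index leaf-page fetches over a known loop count with the
Mackert-Lohman approximation mentioned above.  The loop count of 13628
comes from the plan fragment quoted earlier; the index size, leaf pages
per scan, and buffer-space figure are assumptions for illustration, and
none of this is an actual patch --- just the shape of the computation.

    /*
     * Sketch only: per-scan index disk cost when the loop count is known,
     * using the Mackert & Lohman page-fetch approximation that cost_index
     * already applies to heap pages.  index_leaf_pages, pages_per_scan,
     * and cache_pages are assumed values for illustration.
     */
    #include <math.h>
    #include <stdio.h>

    /*
     * Mackert & Lohman: distinct pages touched by "requests" fetches
     * against a pool of "pages" pages, given "cache_pages" of buffer space.
     */
    static double
    pages_fetched_ML(double requests, double pages, double cache_pages)
    {
        double  T = (pages > 1) ? pages : 1.0;
        double  b = (cache_pages > 1) ? cache_pages : 1.0;
        double  PF;

        if (T <= b)
        {
            PF = 2.0 * T * requests / (2.0 * T + requests);
            if (PF > T)
                PF = T;
        }
        else
        {
            double  lim = 2.0 * T * b / (2.0 * T - b);

            if (requests <= lim)
                PF = 2.0 * T * requests / (2.0 * T + requests);
            else
                PF = b + (requests - lim) * (T - b) / T;
        }
        return ceil(PF);
    }

    int
    main(void)
    {
        double  random_page_cost = 4.0;
        double  index_leaf_pages = 5000.0;  /* assumed index size */
        double  pages_per_scan = 5.0;       /* assumed leaf pages per scan */
        double  cache_pages = 10000.0;      /* assumed buffer space */
        double  loop_count = 13628.0;       /* outer rel size, per the plan above */

        double  total_pages = pages_fetched_ML(pages_per_scan * loop_count,
                                               index_leaf_pages, cache_pages);

        /* charge random_page_cost per leaf page, amortized over all loops */
        printf("per-scan cost, no amortization: %.2f\n",
               pages_per_scan * random_page_cost);
        printf("per-scan cost, amortized:       %.2f\n",
               total_pages * random_page_cost / loop_count);
        return 0;
    }

With the repeated scans sharing the index's leaf pages in cache, the
amortized per-scan estimate comes out far below the single-scan charge,
which is the effect the nestloop case needs.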
Thoughts, gripes, better ideas?

            regards, tom lane