More thoughts about planner's cost estimates (pgsql-hackers)

I've been thinking about the planner's costing problems again,
particularly in connection with Philippe Lang's complaint here:
http://archives.postgresql.org/pgsql-general/2006-05/msg01526.php
Investigation showed that the planner is much too optimistic about the
usefulness of a multi-index BitmapAnd plan, and since that comparison is
just a cost-estimate comparison, it implies we are underestimating the
cost of an index scan.  A typical example from his results is
   ->  BitmapAnd  (cost=12.30..12.30 rows=1 width=0) (actual time=0.306..0.306 rows=0 loops=13628)
         ->  Bitmap Index Scan on lw_id_workflow  (cost=0.00..2.02 rows=7 width=0) (actual time=0.009..0.009 rows=7 loops=13628)
               Index Cond: (lw.id_workflow = "outer".id)
         ->  Bitmap Index Scan on lw_ordre  (cost=0.00..10.03 rows=1437 width=0) (actual time=0.293..0.293 rows=1714 loops=13628)
               Index Cond: (ordre = $4)
 

There are two variables involved here: the cost of touching an index page
and the cost of processing an index tuple.  Given the two comparable data
points above, we can solve for those numbers, and it turns out that the
real cost ratio on Philippe's machine is about 50 to 1.  Which says that
for him, cpu_index_tuple_cost plus cpu_operator_cost should be around
0.02, nearly an order of magnitude higher than their current default
values (0.001 and 0.0025 respectively).

In general it seems to me that for CPU-bound databases, the default values
of the cpu_xxx_cost variables are too low.  I am tempted to raise the
default value of cpu_index_tuple_cost to 0.005, which would be half of
cpu_tuple_cost and twice cpu_operator_cost; that seems more in line with
my feel for the relative costs of things.  For a CPU-bound database all
three would need to go up more than that.  But rather than telling people
to manipulate all three of these variables individually, I think it might
also be a good idea to provide a new GUC variable named something like
"cpu_speed_scale" that would just be a multiplier for the other variables.
It would default to 1.0 and our standard advice for CPU-bound databases
would be "decrease random_page_cost to 1.0 and raise cpu_speed_scale to
10.0 or so".  Seems cleaner than telling people to muck with three or so
individual values.
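
To make that concrete, here is a minimal sketch of how such a multiplier might be folded into the costing code.  This is purely hypothetical: cpu_speed_scale is a proposal, not an existing GUC, and none of this is committed code.

    /*
     * Hypothetical sketch only -- cpu_speed_scale is a proposal, not an
     * existing GUC.  It would default to 1.0 and act as a common multiplier
     * for the three CPU cost parameters wherever the planner consumes them.
     */
    extern double cpu_tuple_cost;        /* existing GUC, default 0.01 */
    extern double cpu_index_tuple_cost;  /* existing GUC, default 0.001 */
    extern double cpu_operator_cost;     /* existing GUC, default 0.0025 */

    double cpu_speed_scale = 1.0;        /* proposed new GUC */

    static double
    scaled_cpu_index_tuple_cost(void)
    {
        return cpu_index_tuple_cost * cpu_speed_scale;
    }

    /*
     * ... and similarly for cpu_tuple_cost and cpu_operator_cost.  The
     * standard CPU-bound advice then becomes "random_page_cost = 1.0,
     * cpu_speed_scale = 10.0" instead of bumping three settings separately.
     */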

Another thing that's bothering me is that the index access cost computation
(in genericcostestimate) is looking sillier and sillier:
    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of the index is
     * a metadata page, and we should figure on fetching that plus a pro-rated
     * fraction of the remaining pages.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        numIndexPages += 1;        /* count the metapage too */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;

    /*
     * Compute the index access cost.
     *
     * Disk cost: our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     */
    *indexTotalCost = numIndexPages;

There are several things wrong with this, at least in its application to
btrees.  It's not counting descent of the btree (ie, touches of the root
page and intermediate-level pages).  On the other hand it's surely
overcharging for metapage touches.  As of CVS tip we cache the metapage in
the relcache, so it's probably fair to disregard that cost altogether.
And on the third hand, when we need to retrieve multiple leaf pages it's
over-optimistic to assume that those'll be purely sequential fetches.
(That would be approximately true in a freshly-built btree, but not in one
that's suffered any amount of page splitting or recycling.)

I've desisted from touching this mainly because the obvious sanity
adjustments, such as adding something for tree descent and charging
random_page_cost not 1.0 for leaf page touches, would increase the
estimated cost of index usage, which seemed like not the direction we
wanted to go in.  So I figured that the errors were more or less
cancelling each other.  But the addition of bitmap index scans is now
exposing the weaknesses, so we need to face up to the problem.

I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent.  I
justify the latter on the grounds that upper tree levels will tend to stay
in cache because they're touched so often.  random_page_cost per leaf page
may be an overestimate, since sometimes logically adjacent leaf pages will
be physically adjacent too, but not charging for tree descent should help
to cancel that out.  With this model, the disk cost to fetch a single
index entry will be estimated as random_page_cost (default 4.0) rather
than the current fixed 2.0.  This shouldn't hurt things too much for
simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change.  And it will increase the costs
for bitmap scans that cross many index pages, which is what we need in
light of Philippe's numbers.
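
A minimal sketch of that proposed disk-cost rule, written as a standalone function rather than a patch to genericcostestimate (the function name and argument list are mine, not anything in the source tree):

    #include <math.h>

    /*
     * Proposed model sketched above: pro-rate leaf pages against the total
     * number of index entries, ignore the metapage and upper-level descent
     * (assumed to stay in cache), and charge random_page_cost per leaf page.
     */
    static double
    proposed_index_disk_cost(double numIndexTuples,   /* entries expected to be visited */
                             double indexTuples,      /* total entries in the index */
                             double indexPages,       /* total pages in the index */
                             double random_page_cost) /* GUC, default 4.0 */
    {
        double      numIndexPages;

        if (indexPages > 1 && indexTuples > 0)
        {
            /* "- 1" keeps the metapage out of the pro-rating base */
            numIndexPages = ceil((numIndexTuples / indexTuples) *
                                 (indexPages - 1));
            if (numIndexPages < 1.0)
                numIndexPages = 1.0;
        }
        else
            numIndexPages = 1.0;

        /* random_page_cost per leaf page, nothing for metapage or descent */
        return numIndexPages * random_page_cost;
    }

For a single-entry fetch this yields random_page_cost (4.0 by default) as described, versus the 2.0 the current code arrives at (one pro-rated leaf page plus the metapage, at 1.0 each).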

Now we have seen a lot of cases in which indexscans were being drastically
overestimated, so increasing the cost estimate even more may seem like a
horrid idea.  But I believe that most or all of these cases were ones
where the same index was being visited many times, and so the real
estimation failure is not accounting for cache effects across multiple
indexscans.  Rather than being afraid to raise the cost estimate for an
indexscan in isolation, I think it's time to bite the bullet and do
something about that.

The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned.  If we did
have that number, I think it'd be reasonable to use the Mackert and Lohman
approximation already used in cost_index to estimate the total number of
index leaf pages fetched over N scans, and then divide by N to get our
per-scan estimated cost.  But because the planner builds cost estimates
bottom-up, in most scenarios we don't have any idea whether an indexscan
plan node will be iterated once or many times in the finished plan.
However there is one case where we do have some information: when
computing a best_inner_indexscan() for a nestloop join, it'd be reasonable
to use the estimated size of the outer relation as the loop count.
And this is exactly the case where we are falling down in practice:
the complaints we've seen are mostly about overestimating the cost of a
nestloop-with-inner-index-scan.  So I think handling just this case would
be a big step forward, even if it's not a theoretically pure general
solution.
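
To illustrate the shape of that computation, here is a sketch (not existing code; the function names and parameters are mine) of the Mackert-Lohman page-fetch approximation applied to an inner indexscan's leaf pages, with the outer relation's estimated row count supplying the loop count:

    #include <math.h>

    /*
     * Mackert-Lohman approximation: estimated number of distinct page fetches
     * when tuples_fetched tuples are retrieved in random order from an object
     * of total_pages pages, given cache_pages of buffer space.  This is the
     * same formula cost_index already uses for heap pages; applying it to
     * index leaf pages across repeated scans is the idea sketched above.
     */
    static double
    mackert_lohman_pages(double tuples_fetched, double total_pages,
                         double cache_pages)
    {
        double      T = (total_pages > 1.0) ? total_pages : 1.0;
        double      b = (cache_pages > 1.0) ? cache_pages : 1.0;
        double      pages_fetched;

        if (T <= b)
        {
            pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
            if (pages_fetched > T)
                pages_fetched = T;
        }
        else
        {
            double      lim = (2.0 * T * b) / (2.0 * T - b);

            if (tuples_fetched <= lim)
                pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
            else
                pages_fetched = b + (tuples_fetched - lim) * (T - b) / T;
        }
        return ceil(pages_fetched);
    }

    /*
     * Per-scan leaf-page estimate for an inner indexscan expected to be
     * repeated loop_count times: estimate total fetches over all scans,
     * then divide by the number of scans.  For best_inner_indexscan,
     * loop_count would be the estimated size of the outer relation.
     */
    static double
    leaf_pages_per_scan(double tuples_per_scan, double leaf_pages,
                        double cache_pages, double loop_count)
    {
        double      total_pages_fetched =
            mackert_lohman_pages(tuples_per_scan * loop_count,
                                 leaf_pages, cache_pages);

        return total_pages_fetched / loop_count;
    }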

Thoughts, gripes, better ideas?
        regards, tom lane

