Re: More thoughts about planner's cost estimates - Mailing list pgsql-hackers

From Mark Kirkwood
Subject Re: More thoughts about planner's cost estimates
Msg-id 447F9737.10702@paradise.net.nz
In response to More thoughts about planner's cost estimates  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: More thoughts about planner's cost estimates  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:

> Another thing that's bothering me is that the index access cost computation
> (in genericcostestimate) is looking sillier and sillier:
> 
>     /*
>      * Estimate the number of index pages that will be retrieved.
>      *
>      * For all currently-supported index types, the first page of the index is
>      * a metadata page, and we should figure on fetching that plus a pro-rated
>      * fraction of the remaining pages.
>      */
>     if (index->pages > 1 && index->tuples > 0)
>     {
>         numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
>         numIndexPages += 1;        /* count the metapage too */
>         numIndexPages = ceil(numIndexPages);
>     }
>     else
>         numIndexPages = 1.0;
> 
>     /*
>      * Compute the index access cost.
>      *
>      * Disk cost: our generic assumption is that the index pages will be read
>      * sequentially, so they have cost 1.0 each, not random_page_cost.
>      */
>     *indexTotalCost = numIndexPages;
> 
> There are several things wrong with this, at least in its application to
> btrees.  It's not counting descent of the btree (ie, touches of the root
> page and intermediate-level pages).  On the other hand it's surely
> overcharging for metapage touches.  As of CVS tip we cache the metapage in
> the relcache, so it's probably fair to disregard that cost altogether.
> And on the third hand, when we need to retrieve multiple leaf pages it's
> over-optimistic to assume that those'll be purely sequential fetches.
> (That would be approximately true in a freshly-built btree, but not in one
> that's suffered any amount of page splitting or recycling.)
> 
> 
> I am inclined to think that a reasonable model is to continue to estimate
> the number of index leaf pages touched as above (pro-rating against the
> total number of index entries), to charge random_page_cost per leaf page
> touched, and not to count anything for metapage or tree descent.  I
> justify the latter on the grounds that upper tree levels will tend to stay
> in cache because they're touched so often.  random_page_cost per leaf page
> may be an overestimate, since sometimes logically adjacent leaf pages will
> be physically adjacent too, but not charging for tree descent should help
> to cancel that out.  With this model, the disk cost to fetch a single
> index entry will be estimated as random_page_cost (default 4.0) rather
> than the current fixed 2.0.  This shouldn't hurt things too much for
> simple indexscans --- especially since anyone running with a reduced
> random_page_cost won't see as much change.  And it will increase the costs
> for bitmap scans that cross many index pages, which is what we need in
> light of Philippe's numbers.
> 

This sounds good to me, although the 2.0 -> 4.0 cost jump may cause (more) 
cases where plans that used an index scan in pre-8.2 versions switch to 
some other access method in 8.2. Perhaps a note recommending that people 
re-test existing applications could go in the 8.2 release notes?
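
For what it's worth, my reading of the proposed model is something like the 
fragment below (a sketch only, reusing the variable names from the quoted 
genericcostestimate code, not an actual patch):

    if (index->pages > 1 && index->tuples > 0)
        numIndexPages = ceil((numIndexTuples / index->tuples) * index->pages);
    else
        numIndexPages = 1.0;

    /* charge each leaf page as a random fetch; nothing for metapage or descent */
    *indexTotalCost = numIndexPages * random_page_cost;

i.e. the structural changes are just dropping the +1 for the metapage and 
multiplying by random_page_cost instead of assuming sequential 1.0-cost reads.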

> Now we have seen a lot of cases in which indexscans were being drastically
> overestimated, so increasing the cost estimate even more may seem like a
> horrid idea.  But I believe that most or all of these cases were ones
> where the same index was being visited many times, and so the real
> estimation failure is not accounting for cache effects across multiple
> indexscans.  Rather than being afraid to raise the cost estimate for an
> indexscan in isolation, I think it's time to bite the bullet and do
> something about that.
> 
> The big difficulty in modeling cache effects from multiple scans is that
> we usually don't know how many times the index will be scanned.  If we did
> have that number, I think it'd be reasonable to use the Mackert and Lohman
> approximation already used in cost_index to estimate the total number of
> index leaf pages fetched over N scans, and then divide by N to get our
> per-scan estimated cost.  But because the planner builds cost estimates
> bottom-up, in most scenarios we don't have any idea whether an indexscan
> plan node will be iterated once or many times in the finished plan.
> However there is one case where we do have some information: when
> computing a best_inner_indexscan() for a nestloop join, it'd be reasonable
> to use the estimated size of the outer relation as the loop count.
> And this is exactly the case where we are falling down in practice:
> the complaints we've seen are mostly about overestimating the cost of a
> nestloop-with-inner-index-scan.  So I think handling just this case would
> be a big step forward, even if it's not a theoretically pure general
> solution.
> 
>

If I understand correctly, this is the case where we normally see folks 
needing to add several 'set enable_xxx=off' statements to get the nested 
loop plan that *actually* works best :-). So, this also looks good!
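
Just to check that I follow the amortization idea, a hypothetical helper 
along these lines (using the Mackert-Lohman formula already quoted in the 
cost_index comments, but applied to the index: T = index leaf pages, N = 
index tuples, s = selectivity of a single scan, b = effective cache size 
in pages, loop_count = estimated outer relation size):

    static double
    amortized_index_pages(double T, double N, double s, double b,
                          double loop_count)
    {
        /* total index tuples fetched over all scans of the inner index */
        double      Ns = N * s * loop_count;
        double      pages_fetched;

        if (T <= b)
            pages_fetched = Min((2.0 * T * Ns) / (2.0 * T + Ns), T);
        else
        {
            double      lim = (2.0 * T * b) / (2.0 * T - b);

            if (Ns <= lim)
                pages_fetched = (2.0 * T * Ns) / (2.0 * T + Ns);
            else
                pages_fetched = b + (Ns - lim) * (T - b) / T;
        }

        /* divide the total back out to get a per-scan estimate */
        return ceil(pages_fetched) / loop_count;
    }

and best_inner_indexscan would pass the outer relation's estimated row 
count as loop_count, as you suggest.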

Cheers

Mark


