Re: slower merge join on sorted data chosen over nested loop - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: slower merge join on sorted data chosen over nested loop
Date:
Msg-id: 5933.1128979407@sss.pgh.pa.us
In response to: slower merge join on sorted data chosen over nested loop  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List: pgsql-hackers
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> 10/06/05 9:28 PM >>>
>> There's a known issue that the planner tends to overestimate the cost of
>> inner-index-scan nestloops, because it doesn't allow for the strong
>> caching effects associated with repeated scans of the same index (in
>> particular, that all the upper index levels tend to stay in cache).
>> See the archives for past inconclusive discussions about how to fix
>> this.

> I spent a few hours trying different searches in the archives, and
> found three very interesting threads on the topic.  All were from
> 2003.  Should I keep digging for more recent threads, or would
> these probably represent the current state of the issue?

No, I don't recall that anything much has happened with it.

The place where the rubber hits the road is this passage in
genericcostestimate():
   /*
    * Estimate the number of index pages that will be retrieved.
    *
    * For all currently-supported index types, the first page of the index
    * is a metadata page, and we should figure on fetching that plus a
    * pro-rated fraction of the remaining pages.
    */
   if (index->pages > 1 && index->tuples > 0)
   {
       numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
       numIndexPages += 1;        /* count the metapage too */
       numIndexPages = ceil(numIndexPages);
   }
   else
       numIndexPages = 1.0;

   /*
    * Compute the index access cost.
    *
    * Disk cost: our generic assumption is that the index pages will be read
    * sequentially, so they have cost 1.0 each, not random_page_cost.
    */
   *indexTotalCost = numIndexPages;
 

An important point here: in all the currently supported index types,
that last assumption is ridiculously optimistic.  Except maybe in a
freshly-built btree, logically consecutive index pages won't be
physically adjacent, and so an honest cost accounting would require
charging random_page_cost per page.  However, that's not the direction
we want the estimate to go in :-(
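
For illustration, an "honest" version of that disk-cost line, reusing the
numIndexPages computed above and the random_page_cost setting, would read
something like this (a sketch, not a proposal):

   /* hypothetical: charge full random cost for each index page fetched */
   *indexTotalCost = numIndexPages * random_page_cost;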

We could do something based on effective_cache_size to estimate the
probability that the index page is already in cache and hence need not
be re-read.  The main trouble with this is estimating what fraction of
the effective_cache_size can fairly be assumed to be populated by each
index --- without some global knowledge about what other indexes are in
heavy use, it's difficult to see how to get there from here.  (OTOH,
one could also level the same charge against the existing uses of
effective_cache_size... maybe we could redefine it as cache available
per query, or something like that, to put part of the problem on the
DBA's shoulders.)
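
To make that concrete, here's a rough standalone sketch of one way such an
adjustment might look.  The assumed_index_share parameter --- the fraction
of effective_cache_size we credit to this one index --- is exactly the
number we have no principled way to pick, so treat all the names here as
illustration, not actual planner code:

   /*
    * Hypothetical sketch: discount index page fetches by the estimated
    * probability that a page is already in cache.
    */
   static double
   cache_adjusted_index_pages(double numIndexPages,   /* pages we'd fetch */
                              double indexPages,      /* total pages in index */
                              double effective_cache_size,
                              double assumed_index_share)
   {
       double   cache_pages = effective_cache_size * assumed_index_share;
       double   hit_fraction = 0.0;

       if (indexPages > 0)
           hit_fraction = cache_pages / indexPages;
       if (hit_fraction > 1.0)
           hit_fraction = 1.0;

       /* only the pages we expect to miss in cache cost a read */
       return numIndexPages * (1.0 - hit_fraction);
   }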

One simple tweak we could make is to not count the index metapage in the
pages-to-be-fetched estimate, which could be justified on the grounds
that the metapage is almost certain to be in cache.  (Note that the
existing estimation technique is already logically wrong for btrees,
because it's considering only the metapage and the leaf page(s) that
need to be visited, and not the internal pages that need to be descended
through to get to the right leaf.  However, again we can say that the
upper internal pages are probably in cache and so it's fair not to count
them as needing to be read.)
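
In code terms the tweak amounts to just dropping the "+1" from the passage
quoted above, i.e. something like (sketch only):

   if (index->pages > 1 && index->tuples > 0)
   {
       numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
       /* no "+1" here: assume the metapage is already in cache */
       numIndexPages = ceil(numIndexPages);
   }
   else
       numIndexPages = 1.0;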

Whether this tweak would make the overall behavior better or worse is
really hard to say.  It'd be interesting to look at some test cases
though.
        regards, tom lane

