Re: slower merge join on sorted data chosen over nested loop - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: slower merge join on sorted data chosen over nested loop |
Date | |
Msg-id | 5933.1128979407@sss.pgh.pa.us |
In response to | slower merge join on sorted data chosen over nested loop ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>) |
List | pgsql-hackers |
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> 10/06/05 9:28 PM >>> >> There's a known issue that the planner tends to overestimate the cost of >> inner-index-scan nestloops, because it doesn't allow for the strong >> caching effects associated with repeated scans of the same index (in >> particular, that all the upper index levels tend to stay in cache). >> See the archives for past inconclusive discussions about how to fix >> this. > I spent a few hours trying different searches in the archives, and > found three very interesting threads on the topic. All were from > 2003. Should I keep digging for more recent threads, or would > these probably represent the current state of the issue? No, I don't recall that anything much has happened with it. The place where the rubber hits the road is this passage in genericcostestimate(): /* * Estimate the number of index pages that will be retrieved. * * For all currently-supported index types,the first page of the index * is a metadata page, and we should figure on fetching that plus a * pro-rated fractionof the remaining pages. */ if (index->pages > 1 && index->tuples > 0) { numIndexPages = (numIndexTuples/ index->tuples) * (index->pages - 1); numIndexPages += 1; /* count the metapage too */ numIndexPages = ceil(numIndexPages); } else numIndexPages = 1.0; /* * Compute the index access cost. * * Disk cost: our generic assumption is that the index pages will be read * sequentially, so they have cost 1.0 each, not random_page_cost. */ *indexTotalCost = numIndexPages; An important point here: in all the currently supported index types, that last assumption is ridiculously optimistic. Except maybe in a freshly-built btree, logically consecutive index pages won't be physically adjacent, and so a honest cost accounting would require charging random_page_cost per page. However, that's not the direction we want the estimate to go in :-( We could do something based on effective_cache_size to estimate the probability that the index page is already in cache and hence need not be re-read. The main trouble with this is estimating what fraction of the effective_cache_size can fairly be assumed to be populated by each index --- without some global knowledge about what other indexes are in heavy use, it's difficult to see how to get there from here. (OTOH, one could also level the same charge against the existing uses of effective_cache_size... maybe we could redefine it as cache available per query, or something like that, to put part of the problem on the DBA's shoulders.) One simple tweak we could make is to not count the index metapage in the pages-to-be-fetched estimate, which could be justified on the grounds that the metapage is almost certain to be in cache. (Note that the existing estimation technique is already logically wrong for btrees, because it's considering only the metapage and the leaf page(s) that need to be visited, and not the internal pages that need to be descended through to get to the right leaf. However, again we can say that the upper internal pages are probably in cache and so it's fair not to count them as needing to be read.) Whether this tweak would make the overall behavior better or worse is really hard to say. It'd be interesting to look at some test cases though. regards, tom lane