Thread: tweaking costs to favor nestloop
Hi,

I'm running PG 7.3.2 on a dual P3 1 GHz with 4 GB RAM and a 5-disk hardware RAID 5, on Debian Linux, kernel 2.4.21-rc3. I'm unable to tweak the various _cost settings in such a way that the attached query will use the right plan. The attachment contains the relevant config file settings, the table definitions, and EXPLAIN ANALYZE output with some enable_* settings turned off.

Could anyone tell me what I'm doing wrong in the query itself, or which config file settings I should change, to let the planner choose the fastest plan by itself?

Regards,

Vincent van Leeuwen
Media Design - http://www.mediadesign.nl/
Attachment
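For concreteness, the kind of experiment described above (re-running the query with individual plan types disabled and comparing the EXPLAIN ANALYZE output) looks roughly like this. The real query, schema, and settings are in the attachment; the stand-in query below is a guess built from the table and column names that come up later in the thread:

    -- Hypothetical stand-in for the attached query; reportuser, auth and
    -- idreporter appear later in the thread, the column list is invented.
    EXPLAIN ANALYZE
    SELECT r.*, a.*
      FROM reportuser r
      JOIN auth a ON a.id = r.idreporter;

    -- Turn off one join method at a time and re-run to compare plans.
    -- enable_* settings are session-local, so this is safe to experiment with.
    SET enable_hashjoin = off;
    EXPLAIN ANALYZE
    SELECT r.*, a.*
      FROM reportuser r
      JOIN auth a ON a.id = r.idreporter;

    RESET enable_hashjoin;   -- restore the default afterwards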
Vincent van Leeuwen <pgsql.spam@vinz.nl> writes:
> I'm unable to tweak the various _cost settings in such a way that the
> attached query will use the right plan.

You aren't going to be able to. You've already overshot a reasonable random_page_cost setting --- to judge by the relative actual costs of the merge and hash joins, a value somewhere around 3 is appropriate for your setup. (Assuming I did the math right --- if you set it to 3, do you get a ratio of merge and hash estimated costs that agrees with the ratio of actual runtimes?)

The problem here is that the costing of the repeated inner index scans isn't realistic: 35417 probes into "auth" are clearly taking much less than 35417 times what a single probe could be expected to take. We talked about how repeated scans would win from caching of the upper btree levels, but I think there's more to it than that. It occurs to me that the probes you are making are probably not random and uncorrelated: they are driven by the values of reportuser.idreporter. Is it fair to guess that most of the reportuser rows link to just a small fraction of the total auth population? If so, the caching could be eliminating most of the reads, not just the upper btree levels, because we're mostly hitting only small parts of the index and the auth table.

I'm beginning to think that the only reasonable way to model this is to cost the entire nestloop join as a unit, so that we have access to statistics about the outer table as well as the indexed table. That would give us a shot at estimating how much of the index is likely to get touched.

As of 7.3, I think all you can do is force the nestloop by disabling the other two join types.

			regards, tom lane
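In 7.3 terms, the two suggestions above amount to something like this (a sketch only; the stand-in query borrows the reportuser/auth names from this message, and the column names are guesses):

    -- Check the random_page_cost estimate: re-plan at 3 and compare the
    -- new merge/hash cost ratio against the ratio of actual runtimes.
    SET random_page_cost = 3;
    EXPLAIN ANALYZE
    SELECT r.*, a.*
      FROM reportuser r
      JOIN auth a ON a.id = r.idreporter;

    -- The 7.3-era workaround: force the nestloop by disabling the other
    -- two join types for this session.
    SET enable_mergejoin = off;
    SET enable_hashjoin = off;
    EXPLAIN ANALYZE
    SELECT r.*, a.*
      FROM reportuser r
      JOIN auth a ON a.id = r.idreporter;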
On 2003-06-11 16:17:53 -0400, Tom Lane wrote:
> You aren't going to be able to. You've already overshot a reasonable
> random_page_cost setting --- to judge by the relative actual costs of
> the merge and hash joins, a value somewhere around 3 is appropriate
> for your setup.

Well, random_page_cost is where it is right now because it seems to give the best results for a number of other queries. Specifically, 1.25 seems to be the sweet spot where a number of queries that were using seqscans but should use indexscans started to use indexscans. Tweaking cpu_index_tuple_cost by rather large margins didn't seem to have any effect on the calculated costs. Going back to a setting of 3 will hurt overall performance, unless we can still get those other queries to use the right plan by tweaking other config parameters. How did you calculate the value of 3?

Another problem we've noticed is that on an idle database certain queries are better off using an indexscan than a seqscan, which is what the planner already wanted to do. But when the load on the database gets a lot higher, indexscans are consistently slower than seqscans (same query, same parameters). So we had to fiddle a bit to favor seqscans for those queries: we set cpu_operator_cost a lot lower to favor a seqscan+sort over a (reverse, if I remember right) indexscan.

> It occurs to me that the probes you are making are probably not random
> and uncorrelated: they are driven by the values of
> reportuser.idreporter. Is it fair to guess that most of the reportuser
> rows link to just a small fraction of the total auth population? If
> so, the caching could be eliminating most of the reads, not just the
> upper btree levels, because we're mostly hitting only small parts of
> the index and the auth table.

Exactly. I think the 'auth' table is already completely in the kernel filesystem cache to begin with, and probably largely in shared_buffers too, since it's a small table that gets hit a lot -- especially on its primary key, which is what we use here.

> As of 7.3, I think all you can do is force the nestloop by disabling
> the other two join types.

Does 7.4 already have changes in this area that will affect this query?

Vincent van Leeuwen
Media Design - http://www.mediadesign.nl/
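In SET form, the tuning described above comes down to something like this; only the 1.25 is an actual value from the thread, the cpu_operator_cost figure is purely illustrative:

    -- Sketch of the settings described above. random_page_cost = 1.25 is
    -- the value given in the thread; the cpu_operator_cost value is made
    -- up ("a lot lower" than the 0.0025 default) for illustration.
    SET random_page_cost = 1.25;    -- sweet spot found for the indexscan queries
    SET cpu_operator_cost = 0.001;  -- lowered to favor seqscan+sort over an
                                    -- indexscan on the loaded-database queries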
Vincent van Leeuwen <pgsql.spam@vinz.nl> writes:
> How did you calculate the value of 3?

Estimated cost of an indexscan is approximately proportional to random_page_cost, but the cost of a seqscan isn't affected by it. You had a hash join plan that used two seqscans (so its estimated cost is unaffected by random_page_cost) plus a merge join plan that had one indexscan input. I just extrapolated the change in the indexscan cost needed to make the ratio of total costs agree with reality.

This is a pretty rough calculation, of course, but I don't believe small values of random_page_cost except in situations where all your data is buffered in RAM. It's really easy to get led down the garden path by small test cases that get fully buffered (especially when you repeat them over and over), and to pick cost values that will not reflect reality in a production environment. I can't say whether that actually happened to you, but it's something to be on your guard about.

> Another problem we've noticed is that on an idle database certain
> queries are better off using an indexscan than a seqscan ... But when
> the load on the database gets a lot higher, indexscans are
> consistently slower than seqscans (same query, same parameters).

See above. Increasing load reduces the chances that any one query will find its data already buffered, since there's more competition for the available buffer space.

> Does 7.4 already have changes in this area that will affect this query?

No.

			regards, tom lane
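Worked through with made-up numbers (none of these come from the actual EXPLAIN output, which is in the attachment), the extrapolation looks roughly like this, assuming the indexscan dominates the merge join's estimated cost:

    -- Illustrative numbers only. Suppose at random_page_cost = 1.25:
    --   hash join:  estimated cost 10000, actual runtime  5 s  (two seqscans)
    --   merge join: estimated cost 12000, actual runtime 12 s  (one indexscan)
    -- The hash join estimate won't move with random_page_cost; the merge
    -- join's indexscan cost scales with it. To make the estimated ratio
    -- match the runtime ratio (12/5 = 2.4), the merge estimate should be
    -- about 10000 * 2.4 = 24000, i.e. double its current 12000, so:
    SELECT 1.25 * (10000 * (12.0 / 5.0) / 12000) AS implied_random_page_cost;
    -- => 2.5, i.e. "somewhere around 3"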
Tom-

FWIW, these are the same kind of numbers I'm seeing for the project I'm working on, i.e. nested loop estimates at 0.00-3.01 but reality much closer to 0.2. I agree that it probably makes sense to take the correlation of both tables into account for nested-loop joins.

On Wed, Jun 11, 2003 at 04:17:53PM -0400, Tom Lane wrote:
> I'm beginning to think that the only reasonable way to model this is to
> cost the entire nestloop join as a unit, so that we have access to
> statistics about the outer table as well as the indexed table. That
> would give us a shot at estimating how much of the index is likely to
> get touched.

--
Jim C. Nasby (aka Decibel!)    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"