Thread: Plan stability versus near-exact ties in cost estimates
So after committing the latest round of parameterized-plan hacking, I was dismayed to see the buildfarm breaking out in pink, thanks to some of the members producing a different plan than expected for one test query. I eventually managed to reproduce that (on the fourth machine I tried locally), and after some quality time with gdb I understand what is happening, to wit: the two alternative plans have exactly the same cost so far as our cost model is concerned.

On my main development machine, the two plans look like this to add_path:

$13 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x40193c08,
  param_info = 0x40194458, rows = 5, startup_cost = 0,
  total_cost = 47.628499999999988, pathkeys = 0x0}
$16 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x40193c08,
  param_info = 0x40194458, rows = 5, startup_cost = 0,
  total_cost = 47.628499999999981, pathkeys = 0x0}

so it picks the second one on the basis that its total_cost is better at the sixteenth decimal place. On the other machine, the same two paths look like this:

$12 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x895b9e0,
  param_info = 0x895c198, rows = 5, startup_cost = 0,
  total_cost = 47.578499999999977, pathkeys = 0x0}

Breakpoint 2, add_path (parent_rel=0x895b9e0, new_path=0x895c208)

$15 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x895b9e0,
  param_info = 0x895c198, rows = 5, startup_cost = 0,
  total_cost = 47.578499999999977, pathkeys = 0x0}

and add_path is coded to arbitrarily keep the first one when two paths are exactly the same on all its preference measures.

Now, the fact that the two machines get different costs at the third decimal place isn't very interesting here; that's a pretty standard cross-platform difference arising from different MAXALIGN values.
The important point is that the total_costs of the two paths are exactly the same on one machine, and on the other one different only by a microscopic amount that probably arises from a slightly different floating-point calculation sequence chosen by a different compiler. So, as our code stands, we're inevitably going to have very platform- and compiler-specific decisions as to which plan to prefer. I'm a bit surprised that it's taken us this long to trip over this type of situation, because it's surely not specific to parameterized paths.

We could deal with this either by giving up on showing the selected plan in the regression test, or by creating multiple "expected" files, but neither of those alternatives is very appealing.

The idea that I'm toying with is to try to make the choice a bit less platform-specific, by removing the exact cost test that add_path uses as its last-ditch comparison step, essentially this:

    /*
     * Same pathkeys and outer rels, and fuzzily
     * the same cost, so keep just one; to decide
     * which, first check rows and then do an
     * exact cost comparison.
     */
    if (new_path->rows < old_path->rows)
        remove_old = true;      /* new dominates old */
-   else if (new_path->rows > old_path->rows)
-       accept_new = false;     /* old dominates new */
-   else if (compare_path_costs(new_path, old_path,
-                               TOTAL_COST) < 0)
-       remove_old = true;      /* new dominates old */
    else
        accept_new = false;     /* old equals or dominates new */

This change would mean that, when two paths have the same pathkeys, parameterization, and rowcount, and fuzzily the same cost, we arbitrarily keep the first-submitted one rather than looking at low-order digits of the costs. Since the order in which different paths are generated is a property of our code and not platform-specific, this should eliminate platform dependencies in cases where two paths are essentially identical to the cost model.
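In code form, the tie-break being proposed would reduce to something like the standalone sketch below (hypothetical names, not the real add_path in src/backend/optimizer/util/pathnode.c): once pathkeys, parameterization, and fuzzy cost have all been found equal, only the row count is consulted, so a genuine tie keeps the incumbent, first-submitted path.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the relevant Path fields. */
typedef struct Path
{
    double      rows;
    double      total_cost;
} Path;

/*
 * After the proposed change: decide whether new_path displaces old_path,
 * given that pathkeys, parameterization, and fuzzy total cost have
 * already been found equal.  total_cost is deliberately not examined
 * here, so an exact tie keeps the first-submitted path.
 */
static bool
new_path_dominates(const Path *new_path, const Path *old_path)
{
    if (new_path->rows < old_path->rows)
        return true;            /* new dominates old */
    return false;               /* old equals or dominates new */
}
```

With the two total_costs from the gdb dump above (differing only at the sixteenth decimal place) and rows = 5 on both sides, this returns false, i.e. the path submitted first survives, on every platform.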
A variant idea would be to replace the exact cost comparison with a second round of fuzzy cost comparison, but with a much tighter fuzz factor, maybe 1e-6 instead of 0.01.

Now, neither of these fixes is perfect: what they would do is remove platform-specific instability where the costs are basically equal, and add some more in the range where the costs differ by almost exactly the fuzz factor. But the behavior near that point is platform-specific already, just not quite as much, and it's surely something we're unlikely to trip over in the regression tests.

Thoughts, better ideas?

			regards, tom lane
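The variant idea can be sketched as follows (illustrative code, not the planner's actual comparison routine; the fuzz values are the ones floated above, expressed as multipliers):

```c
/*
 * Fuzzy cost comparison, roughly in the planner's style: cost1 is
 * considered cheaper than cost2 only if it is smaller by more than the
 * given proportional fuzz.  Returns -1, 0, or +1.
 */
static int
compare_costs_fuzzily(double cost1, double cost2, double fuzz_factor)
{
    if (cost1 > cost2 * fuzz_factor)
        return +1;              /* cost1 fuzzily greater */
    if (cost2 > cost1 * fuzz_factor)
        return -1;              /* cost1 fuzzily smaller */
    return 0;                   /* fuzzily the same */
}
```

Under the existing 1% fuzz (factor 1.01) the two buildfarm costs above compare equal; under a second-round 1e-6 fuzz (factor 1.000001) they are still equal, so the platform-specific sixteenth-digit difference no longer decides anything.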
On 4/19/12 5:39 PM, Tom Lane wrote:
> Now, neither of these fixes is perfect: what they would do is remove
> platform-specific instability from where the costs are basically equal
> and add some more in the range where the costs differ by almost exactly
> the fuzz factor. But the behavior near that point is platform-specific
> already, just not quite as much, and it's surely something we're
> unlikely to trip over in the regression tests.

I can't help but think of complaints we get from users regarding plan stability, even though this is a case of taking that to an extreme. Because this case is extreme (changing plans due to 1e-16 of difference) it's fairly easy to fix this specific case. In order to get 9.2 out the door, maybe fixing just this case is the right thing to do. But ISTM this is just an example of a bigger problem.

One of the complaints we've seen is that the planner will sometimes choose a plan that has a marginally lower cost (where marginally in this case is significantly more than 1e-16 ;) even though that plan will perform really poorly if the stats are off. I have wondered if that could be addressed by introducing the concept of an error range to each plan. My idea is that each node would predict how much the cost estimate would change if the stats were off by some amount. If two plans are close to the same cost, you would want to choose the plan that had the lower error range, trading off a small amount of theoretical performance for less risk of getting a horrible plan if the stats assumptions proved to be wrong.

I believe that would fix this specific case because even though two plans might come out with a nearly identical cost, it is unlikely that they would also have a nearly identical error range.
--
Jim C. Nasby, Database Architect   jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
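A rough illustration of the error-range idea (entirely hypothetical; nothing like this exists in the planner, and the struct and function names are invented):

```c
#include <stdbool.h>

/* Hypothetical: a cost estimate carried together with an error range. */
typedef struct CostWithError
{
    double      cost;           /* estimated total cost */
    double      error;          /* how far off-stats could push it */
} CostWithError;

/*
 * Prefer plan a over plan b: when the costs are close (within
 * close_factor of each other), fall back to the smaller error range,
 * trading a little theoretical performance for robustness against
 * bad statistics.
 */
static bool
prefer_a(CostWithError a, CostWithError b, double close_factor)
{
    if (a.cost * close_factor < b.cost)
        return true;            /* a clearly cheaper */
    if (b.cost * close_factor < a.cost)
        return false;           /* b clearly cheaper */
    return a.error < b.error;   /* near-tie: pick the safer plan */
}
```

The design choice here is that the error range only acts as a tie-breaker within the closeness band; a clearly cheaper plan still wins regardless of how risky its estimate is.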
Jim Nasby <jim@nasby.net> writes:
> [ add some error ranges to cost estimates ]
> I believe that would fix this specific case because even though two
> plans might come out with a nearly identical cost, it is unlikely that
> they would also have a nearly identical error range.

Actually, I think that *wouldn't* fix this specific case --- did you look at the details? The two formulations of the plan are really pretty nearly equivalent; you can do the two nestloops in either order and it's not clear it'll make much difference.

I'm suspicious that the addition of parameterized planning might open up more scope for this type of thing, even though in principle it was always possible.

			regards, tom lane
On Thu, Apr 19, 2012 at 11:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A variant idea would be to replace the exact cost comparison with a
> second round of fuzzy cost comparison, but with a much tighter fuzz
> factor, maybe 1e-6 instead of 0.01.

The fuzz factor is a better idea, IMHO. I would like to see that as a user-settable parameter.

Jim is right that plan stability is a wide problem, which could be addressed by setting a higher fuzz factor when plan instability is observed. Instability normally occurs for queries near a decision point in the cost models, which can fluctuate as exact stats change. For those queries, it would be much better to address the instability directly via a fuzz factor than to attempt to fiddle with the enable_* parameters as is now common practice.

Doc entry for the parameter...

plan_choice_factor (float) - plans that vary in cost by less than this factor are considered equal by the planner, leading to a consistent choice of the first derived plans even across different types of hardware. This has a tendency to favour indexed plans when they are available. This is designed to address problems related to plan stability and may not produce desired results if used as a hinting/tweaking mechanism.

--
Simon Riggs                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane wrote:
> So after committing the latest round of parameterized-plan hacking,
> I was dismayed to see the buildfarm breaking out in pink, thanks to
> some of the members producing a different plan than expected for one
> test query. I eventually managed to reproduce that (on the fourth
> machine I tried locally), and after some quality time with gdb
> I understand what is happening, to wit: the two alternative plans have
> exactly the same cost so far as our cost model is concerned.

> The idea that I'm toying with is to try to make the choice a bit less
> platform-specific, by removing the exact cost test that add_path uses
> as its last-ditch comparison step, essentially this:
>
>     /*
>      * Same pathkeys and outer rels, and fuzzily
>      * the same cost, so keep just one; to decide
>      * which, first check rows and then do an
>      * exact cost comparison.
>      */
>     if (new_path->rows < old_path->rows)
>         remove_old = true;      /* new dominates old */
> -   else if (new_path->rows > old_path->rows)
> -       accept_new = false;     /* old dominates new */
> -   else if (compare_path_costs(new_path, old_path,
> -                               TOTAL_COST) < 0)
> -       remove_old = true;      /* new dominates old */
>     else
>         accept_new = false;     /* old equals or dominates new */
>
> This change would mean that, when two paths have the same pathkeys,
> parameterization, and rowcount, and fuzzily the same cost, that we
> arbitrarily keep the first-submitted one rather than looking at low
> order digits of the costs. Since the order in which different paths
> are generated is a property of our code and not platform-specific,
> this should eliminate platform dependencies in cases where two paths
> are essentially identical to the cost model.
>
> A variant idea would be to replace the exact cost comparison with a
> second round of fuzzy cost comparison, but with a much tighter fuzz
> factor, maybe 1e-6 instead of 0.01.
>
> Now, neither of these fixes is perfect: what they would do is remove
> platform-specific instability from where the costs are basically equal
> and add some more in the range where the costs differ by almost exactly
> the fuzz factor. But the behavior near that point is platform-specific
> already, just not quite as much, and it's surely something we're
> unlikely to trip over in the regression tests.

What if you remove the exact cost comparison, but leave the part where old dominates new based on rows? That should not add any new instabilities compared to the code as it is. Of course, if the row count estimates are subject to the same problem, it would solve nothing. Perhaps it is worth trying on the buildfarm.

Yours,
Laurenz Albe
On 19 April 2012 23:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The idea that I'm toying with is to try to make the choice a bit less
> platform-specific, by removing the exact cost test that add_path uses
> as its last-ditch comparison step, essentially this:
>
>     /*
>      * Same pathkeys and outer rels, and fuzzily
>      * the same cost, so keep just one; to decide
>      * which, first check rows and then do an
>      * exact cost comparison.
>      */
>     if (new_path->rows < old_path->rows)
>         remove_old = true;      /* new dominates old */
> -   else if (new_path->rows > old_path->rows)
> -       accept_new = false;     /* old dominates new */
> -   else if (compare_path_costs(new_path, old_path,
> -                               TOTAL_COST) < 0)
> -       remove_old = true;      /* new dominates old */
>     else
>         accept_new = false;     /* old equals or dominates new */
>
> This change would mean that, when two paths have the same pathkeys,
> parameterization, and rowcount, and fuzzily the same cost, that we
> arbitrarily keep the first-submitted one rather than looking at low
> order digits of the costs. Since the order in which different paths
> are generated is a property of our code and not platform-specific,
> this should eliminate platform dependencies in cases where two paths
> are essentially identical to the cost model.

I think that the fuzz factor here is large enough that people would notice, and it would inevitably lead to questions like "why did it choose this plan whose cost is 127 over this one whose cost is 126?".

> A variant idea would be to replace the exact cost comparison with a
> second round of fuzzy cost comparison, but with a much tighter fuzz
> factor, maybe 1e-6 instead of 0.01.

That seems like a better approach. Given the accuracy of floating-point numbers, you could comfortably reduce that down to 1e-12 or even less, which would be less user-visible but still remove any platform-specific instability.

Regards,
Dean
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> What if you remove the exact cost comparison, but leave the part where
> old dominates new based on rows?

Um, that is what the proposed patch does.

			regards, tom lane
Tom Lane wrote:
>> What if you remove the exact cost comparison, but leave the part where
>> old dominates new based on rows?
>
> Um, that is what the proposed patch does.

I was referring to the first two lines that the patch removes. I guess I don't understand why they should go.

Anyway, I fail to see how this patch could introduce new platform-specific instabilities, so it seems safe to me.

Yours,
Laurenz Albe
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> This change would mean that, when two paths have the same pathkeys,
> parameterization, and rowcount, and fuzzily the same cost, that we
> arbitrarily keep the first-submitted one rather than looking at low
> order digits of the costs.

+1 on this approach from me.

> Since the order in which different paths
> are generated is a property of our code and not platform-specific,
> this should eliminate platform dependencies in cases where two paths
> are essentially identical to the cost model.

And the above is why.

> A variant idea would be to replace the exact cost comparison with a
> second round of fuzzy cost comparison, but with a much tighter fuzz
> factor, maybe 1e-6 instead of 0.01.

Not impressed with this idea- the notion that our model is good enough to produce valid values out to that many digits is, well, unlikely.

I have to disagree about users noticing this and complaining about it too, to be honest; that strikes me as very unlikely. For starters, they'd have to be debugging the planner sufficiently to see that there are two nearly-identical plans under consideration and that we picked one over the other based on which came first.

	Thanks,

		Stephen
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> Tom Lane wrote:
>> Um, that is what the proposed patch does.

> I was referring to the first two lines that the patch removes.
> I guess I don't understand why they should go.

What we'd have left after the proposed removal is

    if (new_path->rows < old_path->rows)
        remove_old = true;      /* new dominates old */
    else
        accept_new = false;     /* old equals or dominates new */

There's no need to make more than one test on the rows values.

			regards, tom lane
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> A variant idea would be to replace the exact cost comparison with a
>> second round of fuzzy cost comparison, but with a much tighter fuzz
>> factor, maybe 1e-6 instead of 0.01.

> Not impressed with this idea- the notion that our model is good enough
> to produce valid values out to that many digits is, well, unlikely.

> I have to disagree about users noticing this and complaining about it
> too, to be honest, that strikes me as very unlikely.. For starters,
> they'd have to be debugging the planner sufficiently to see that there
> are two nearly-identical plans under consideration and that we picked
> one over the other based on which came first..

Yeah, I'm pretty dubious about that too. If there is really a reason to care which one gets picked, it must be that the actual difference in cost is much more than 1%. In which case, the appropriate fix is in the cost estimates, not in the details of how add_path resolves near-ties.

			regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, Apr 19, 2012 at 11:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A variant idea would be to replace the exact cost comparison with a
>> second round of fuzzy cost comparison, but with a much tighter fuzz
>> factor, maybe 1e-6 instead of 0.01.

> The fuzz factor is a better idea, IMHO. I would like to see that as a
> user set parameter.

I can see the case for exposing the fuzz factor for the existing round of fuzzy cost comparisons (in fact there's been a comment there for years questioning whether we should do so). Essentially the point of that would be to let the user trade off planning speed against resolution, because a larger fuzz factor would result in more plans getting discarded by add_path as being indistinguishably different in cost; so it would go faster, but you'd probably get a plan that was only within so many percent of being the best available. (However, I'd like to see somebody do some experimentation to show that this actually behaves usefully before we invent Yet Another GUC. In particular, it's not clear whether the error would accumulate unpleasantly over many levels of joining if you tried to use a large fuzz factor.)

If we have a second round of fuzzy comparisons, it would only be to dodge platform-specific roundoff differences, so I would think that locking its fuzz factor to 1e-6 or 1e-10 or so would be plenty good enough.

> Jim is right that plan stability is a wide problem, which could be
> addressed by setting a higher fuzz factor when plan instability is
> observed.

User-visible plan instability (ie, instability on a given installation) usually arises from changes in the contents of pg_statistic. I am really dubious that changing the fuzz factor would help that in any useful way. It's possible, I guess, but I'd want to see some experimental proof before we advertise it as being helpful for that.

			regards, tom lane
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> A variant idea would be to replace the exact cost comparison with a
>> second round of fuzzy cost comparison, but with a much tighter fuzz
>> factor, maybe 1e-6 instead of 0.01.

> Not impressed with this idea- the notion that our model is good enough
> to produce valid values out to that many digits is, well, unlikely.

While I remain convinced of the abstract truth of that position, I've committed a patch that uses a second round of fuzzy comparison. It turned out that simply removing the exact comparison as per my first proposal resulted in a surprisingly large number of changes in the regression test results; apparently, a lot more cases than we realized have multiple plans with indistinguishable costs. I didn't feel like going through and validating the test result changes, and besides, this could have resulted in changes in behavior-in-the-field that people would complain about.

			regards, tom lane
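Schematically, the committed behavior composes the two fuzzy rounds like this (a sketch only; the tight factor shown is illustrative, and the real logic lives in add_path in src/backend/optimizer/util/pathnode.c):

```c
#include <stdbool.h>

#define LOOSE_FUZZ  1.01          /* existing ~1% cost-competitor band */
#define TIGHT_FUZZ  1.000001      /* second round: dodges platform roundoff */

/* Proportional fuzzy comparison: -1, 0, or +1. */
static int
fuzzy_cmp(double c1, double c2, double fuzz)
{
    if (c1 > c2 * fuzz)
        return +1;
    if (c2 > c1 * fuzz)
        return -1;
    return 0;
}

/*
 * Tie-break for paths already found loosely equal in cost, with equal
 * pathkeys, parameterization, and rows: a second, tighter fuzzy round
 * replaces the old exact comparison, so a one-ulp cost difference no
 * longer lets the new path displace the old one.
 */
static bool
new_wins_tiebreak(double new_cost, double old_cost)
{
    return fuzzy_cmp(new_cost, old_cost, TIGHT_FUZZ) < 0;
}
```

The effect is that a sixteenth-decimal-place difference falls inside the tight band and the first-submitted path is kept, while a genuinely cheaper path still wins.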
Tom Lane wrote:
>>> Um, that is what the proposed patch does.

>> I was referring to the first two lines that the patch removes.
>> I guess I don't understand why they should go.

> What we'd have left after the proposed removal is
>
>     if (new_path->rows < old_path->rows)
>         remove_old = true;      /* new dominates old */
>     else
>         accept_new = false;     /* old equals or dominates new */
>
> There's no need to make more than one test on the rows values.

I see, thanks for explaining.

Yours,
Laurenz Albe