Thread: Plan stability versus near-exact ties in cost estimates

Plan stability versus near-exact ties in cost estimates

From
Tom Lane
Date:
So after committing the latest round of parameterized-plan hacking,
I was dismayed to see the buildfarm breaking out in pink, thanks to
some of the members producing a different plan than expected for one
test query.  I eventually managed to reproduce that (on the fourth
machine I tried locally), and after some quality time with gdb
I understand what is happening, to wit: the two alternative plans have
exactly the same cost so far as our cost model is concerned.  On my
main development machine, the two plans look like this to add_path:

$13 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x40193c08, param_info = 0x40194458, rows = 5, startup_cost = 0, total_cost = 47.628499999999988, pathkeys = 0x0}

$16 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x40193c08, param_info = 0x40194458, rows = 5, startup_cost = 0, total_cost = 47.628499999999981, pathkeys = 0x0}

so it picks the second one on the basis that its total_cost is better at
the sixteenth decimal place.  On the other machine, the same two paths
look like this:

$12 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x895b9e0, param_info = 0x895c198, rows = 5, startup_cost = 0, total_cost = 47.578499999999977, pathkeys = 0x0}

Breakpoint 2, add_path (parent_rel=0x895b9e0, new_path=0x895c208)
$15 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x895b9e0, param_info = 0x895c198, rows = 5, startup_cost = 0, total_cost = 47.578499999999977, pathkeys = 0x0}

and add_path is coded to arbitrarily keep the first one when two
paths are exactly the same on all its preference measures.

Now, the fact that the two machines get different costs at the third
decimal place isn't very interesting here; that's a pretty standard
cross-platform difference arising from different MAXALIGN values.
The important point is that the total_costs of the two paths are
exactly the same on one machine, and on the other one different only
by a microscopic amount that probably arises from a slightly different
floating-point calculation sequence chosen by a different compiler.

So, as our code stands, we're inevitably going to have very platform-
and compiler-specific decisions as to which plan to prefer.  I'm a bit
surprised that it's taken us this long to trip over this type of
situation, because it's surely not specific to parameterized paths.

We could deal with this either by giving up on showing the selected
plan in the regression test, or by creating multiple "expected" files,
but neither of those alternatives is very appealing.

The idea that I'm toying with is to try to make the choice a bit less
platform-specific, by removing the exact cost test that add_path uses
as its last-ditch comparison step, essentially this:
                               /*
                                * Same pathkeys and outer rels, and fuzzily
                                * the same cost, so keep just one; to decide
                                * which, first check rows and then do an
                                * exact cost comparison.
                                */
                               if (new_path->rows < old_path->rows)
                                   remove_old = true;  /* new dominates old */
-                               else if (new_path->rows > old_path->rows)
-                                   accept_new = false; /* old dominates new */
-                               else if (compare_path_costs(new_path, old_path,
-                                                      TOTAL_COST) < 0)
-                                   remove_old = true;  /* new dominates old */
                               else
                                   accept_new = false; /* old equals or dominates new */

This change would mean that, when two paths have the same pathkeys,
parameterization, and rowcount, and fuzzily the same cost, that we
arbitrarily keep the first-submitted one rather than looking at low
order digits of the costs.  Since the order in which different paths
are generated is a property of our code and not platform-specific,
this should eliminate platform dependencies in cases where two paths
are essentially identical to the cost model.

A variant idea would be to replace the exact cost comparison with a
second round of fuzzy cost comparison, but with a much tighter fuzz
factor, maybe 1e-6 instead of 0.01.

Now, neither of these fixes is perfect: what they would do is remove
platform-specific instability from where the costs are basically equal
and add some more in the range where the costs differ by almost exactly
the fuzz factor.  But the behavior near that point is platform-specific
already, just not quite as much, and it's surely something we're
unlikely to trip over in the regression tests.

Thoughts, better ideas?
        regards, tom lane


Re: Plan stability versus near-exact ties in cost estimates

From
Jim Nasby
Date:
On 4/19/12 5:39 PM, Tom Lane wrote:
> Now, neither of these fixes is perfect: what they would do is remove
> platform-specific instability from where the costs are basically equal
> and add some more in the range where the costs differ by almost exactly
> the fuzz factor.  But the behavior near that point is platform-specific
> already, just not quite as much, and it's surely something we're
> unlikely to trip over in the regression tests.

I can't help but think of complaints we get from users regarding plan
stability, even though this is a case of taking that to an extreme.
Because this case is extreme (changing plans due to 1e-16 of difference)
it's fairly easy to fix this specific case.  In order to get 9.2 out the
door maybe fixing just this case is the right thing to do.  But ISTM
this is just an example of a bigger problem.

One of the complaints we've seen is that the planner will sometimes
choose a plan that has a marginally lower cost (where marginally in this
case is significantly more than 1e-16 ;) even though that plan will
perform really poorly if the stats are off.  I have wondered if that
could be addressed by introducing the concept of an error range to each
plan.  My idea is that each node would predict how much the cost
estimate would change if the stats were off by some amount.  If two
plans are close to the same cost, you would want to choose the plan that
had the lower error range, trading off a small amount of theoretical
performance for less risk of getting a horrible plan if the stats
assumptions proved to be wrong.

I believe that would fix this specific case because even though two
plans might come out with a nearly identical cost it is unlikely that
they would also have a nearly identical error range.
-- 
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


Re: Plan stability versus near-exact ties in cost estimates

From
Tom Lane
Date:
Jim Nasby <jim@nasby.net> writes:
> [ add some error ranges to cost estimates ]

> I believe that would fix this specific case because even though two
> plans might come out with a nearly identical cost it is unlikely that
> they would also have a nearly identical error range.

Actually, I think that *wouldn't* fix this specific case --- did you
look at the details?  The two formulations of the plan are really pretty
nearly equivalent; you can do the two nestloops in either order and it's
not clear it'll make much difference.  I'm suspicious that the addition
of parameterized planning might open up more scope for this type of
thing, even though in principle it was always possible.
        regards, tom lane


Re: Plan stability versus near-exact ties in cost estimates

From
Simon Riggs
Date:
On Thu, Apr 19, 2012 at 11:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> A variant idea would be to replace the exact cost comparison with a
> second round of fuzzy cost comparison, but with a much tighter fuzz
> factor, maybe 1e-6 instead of 0.01.

The fuzz factor is a better idea, IMHO. I would like to see that as a
user-settable parameter.

Jim is right that plan stability is a wide problem, which could be
addressed by setting a higher fuzz factor when plan instability is
observed. Instability normally occurs for queries near a decision
point in the cost models, which can fluctuate as exacts stats change.
For those queries, it would be much better to address the instability
directly via a fuzz factor than to attempt to fiddle with the enable_*
parameters as is now common practice.

Doc entry for the parameter...

plan_choice_factor (float) - plans that vary in cost by less than this
factor are considered equal by the planner, leading to a consistent
choice of the first derived plans even across different types of
hardware. This has a tendency to favour indexed plans when they are
available. This is designed to address problems related to plan
stability and may not produce desired results if used as a
hinting/tweaking mechanism.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Plan stability versus near-exact ties in cost estimates

From
"Albe Laurenz"
Date:
Tom Lane wrote:
> So after committing the latest round of parameterized-plan hacking,
> I was dismayed to see the buildfarm breaking out in pink, thanks to
> some of the members producing a different plan than expected for one
> test query.  I eventually managed to reproduce that (on the fourth
> machine I tried locally), and after some quality time with gdb
> I understand what is happening, to wit: the two alternative plans have
> exactly the same cost so far as our cost model is concerned.

> The idea that I'm toying with is to try to make the choice a bit less
> platform-specific, by removing the exact cost test that add_path uses
> as its last-ditch comparison step, essentially this:
>
>                                 /*
>                                  * Same pathkeys and outer rels, and fuzzily
>                                  * the same cost, so keep just one; to decide
>                                  * which, first check rows and then do an
>                                  * exact cost comparison.
>                                  */
>                                 if (new_path->rows < old_path->rows)
>                                     remove_old = true;  /* new dominates old */
> -                               else if (new_path->rows > old_path->rows)
> -                                   accept_new = false; /* old dominates new */
> -                               else if (compare_path_costs(new_path, old_path,
> -                                                      TOTAL_COST) < 0)
> -                                   remove_old = true;  /* new dominates old */
>                                 else
>                                     accept_new = false; /* old equals or dominates new */
>
> This change would mean that, when two paths have the same pathkeys,
> parameterization, and rowcount, and fuzzily the same cost, that we
> arbitrarily keep the first-submitted one rather than looking at low
> order digits of the costs.  Since the order in which different paths
> are generated is a property of our code and not platform-specific,
> this should eliminate platform dependencies in cases where two paths
> are essentially identical to the cost model.
>
> A variant idea would be to replace the exact cost comparison with a
> second round of fuzzy cost comparison, but with a much tighter fuzz
> factor, maybe 1e-6 instead of 0.01.
>
> Now, neither of these fixes is perfect: what they would do is remove
> platform-specific instability from where the costs are basically equal
> and add some more in the range where the costs differ by almost exactly
> the fuzz factor.  But the behavior near that point is platform-specific
> already, just not quite as much, and it's surely something we're
> unlikely to trip over in the regression tests.

What if you remove the exact cost comparison, but leave the part where
old dominates new based on rows?
That should not add any new instabilities compared to the code as it is.
Of course, if the row count estimates are subject to the same problem,
it would solve nothing.  Perhaps it is worth trying on the buildfarm.

Yours,
Laurenz Albe


Re: Plan stability versus near-exact ties in cost estimates

From
Dean Rasheed
Date:
On 19 April 2012 23:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The idea that I'm toying with is to try to make the choice a bit less
> platform-specific, by removing the exact cost test that add_path uses
> as its last-ditch comparison step, essentially this:
>
>                                /*
>                                 * Same pathkeys and outer rels, and fuzzily
>                                 * the same cost, so keep just one; to decide
>                                 * which, first check rows and then do an
>                                 * exact cost comparison.
>                                 */
>                                if (new_path->rows < old_path->rows)
>                                    remove_old = true;  /* new dominates old */
> -                               else if (new_path->rows > old_path->rows)
> -                                   accept_new = false; /* old dominates new */
> -                               else if (compare_path_costs(new_path, old_path,
> -                                                      TOTAL_COST) < 0)
> -                                   remove_old = true;  /* new dominates old */
>                                else
>                                    accept_new = false; /* old equals or dominates new */
>
> This change would mean that, when two paths have the same pathkeys,
> parameterization, and rowcount, and fuzzily the same cost, that we
> arbitrarily keep the first-submitted one rather than looking at low
> order digits of the costs.  Since the order in which different paths
> are generated is a property of our code and not platform-specific,
> this should eliminate platform dependencies in cases where two paths
> are essentially identical to the cost model.
>

I think that the fuzz factor here is large enough that people would
notice, and it would inevitably lead to questions like "why did it
choose this plan whose cost is 127 over this one whose cost is 126?".


> A variant idea would be to replace the exact cost comparison with a
> second round of fuzzy cost comparison, but with a much tighter fuzz
> factor, maybe 1e-6 instead of 0.01.
>

That seems like a better approach. Given the accuracy of floating
point numbers, you could comfortably reduce that down to 1e-12 or even
less, which would be less user-visible but still remove any
platform-specific instability.

Regards,
Dean


Re: Plan stability versus near-exact ties in cost estimates

From
Tom Lane
Date:
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> What if you remove the exact cost comparison, but leave the part where
> old dominates new based on rows?

Um, that is what the proposed patch does.
        regards, tom lane


Re: Plan stability versus near-exact ties in cost estimates

From
"Albe Laurenz"
Date:
Tom Lane wrote:
>> What if you remove the exact cost comparison, but leave the part
where
>> old dominates new based on rows?
>
> Um, that is what the proposed patch does.

I was referring to the first two lines that the patch removes.
I guess I don't understand why they should go.

Anyway, I fail to see how this patch could introduce new
platform-specific instabilities, so it seems safe to me.

Yours,
Laurenz Albe


Re: Plan stability versus near-exact ties in cost estimates

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> This change would mean that, when two paths have the same pathkeys,
> parameterization, and rowcount, and fuzzily the same cost, that we
> arbitrarily keep the first-submitted one rather than looking at low
> order digits of the costs.

+1 on this approach from me.

> Since the order in which different paths
> are generated is a property of our code and not platform-specific,
> this should eliminate platform dependencies in cases where two paths
> are essentially identical to the cost model.

And the above is why.

> A variant idea would be to replace the exact cost comparison with a
> second round of fuzzy cost comparison, but with a much tighter fuzz
> factor, maybe 1e-6 instead of 0.01.

Not impressed with this idea- the notion that our model is good enough
to produce valid values out to that many digits is, well, unlikely.

I have to disagree about users noticing this and complaining about it
too, to be honest; that strikes me as very unlikely.  For starters,
they'd have to be debugging the planner sufficiently to see that there
are two nearly-identical plans under consideration and that we picked
one over the other based on which came first.

Thanks,
    Stephen

Re: Plan stability versus near-exact ties in cost estimates

From
Tom Lane
Date:
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> Tom Lane wrote:
>> Um, that is what the proposed patch does.

> I was referring to the first two lines that the patch removes.
> I guess I don't understand why they should go.

What we'd have left after the proposed removal is
                           if (new_path->rows < old_path->rows)
                               remove_old = true;  /* new dominates old */
                           else
                               accept_new = false; /* old equals or dominates new */

There's no need to make more than one test on the rows values.
        regards, tom lane


Re: Plan stability versus near-exact ties in cost estimates

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> A variant idea would be to replace the exact cost comparison with a
>> second round of fuzzy cost comparison, but with a much tighter fuzz
>> factor, maybe 1e-6 instead of 0.01.

> Not impressed with this idea- the notion that our model is good enough
> to produce valid values out to that many digits is, well, unlikely.

> I have to disagree about users noticing this and complaining about it
> too, to be honest; that strikes me as very unlikely.  For starters,
> they'd have to be debugging the planner sufficiently to see that there
> are two nearly-identical plans under consideration and that we picked
> one over the other based on which came first.

Yeah, I'm pretty dubious about that too.  If there is really a reason
to care which one gets picked, it must be that the actual difference in
cost is much more than 1%.  In which case, the appropriate fix is in the
cost estimates, not in the details of how add_path resolves near-ties.
        regards, tom lane


Re: Plan stability versus near-exact ties in cost estimates

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, Apr 19, 2012 at 11:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A variant idea would be to replace the exact cost comparison with a
>> second round of fuzzy cost comparison, but with a much tighter fuzz
>> factor, maybe 1e-6 instead of 0.01.

> The fuzz factor is a better idea, IMHO. I would like to see that as a
> user set parameter.

I can see the case for exposing the fuzz factor for the existing round
of fuzzy cost comparisons (in fact there's been a comment there for
years questioning whether we should do so).  Essentially the point of
that would be to let the user trade off planning speed against
resolution, because a larger fuzz factor would result in more plans
getting discarded by add_path as being indistinguishably different in
cost, so it would go faster but you'd probably get a plan that was only
within so many percent of being the best available.

(However, I'd like to see somebody do some experimentation to show
that this actually behaves usefully before we invent Yet Another GUC.
In particular it's not clear whether the error would accumulate
unpleasantly over many levels of joining, if you tried to use a large
fuzz factor.)

If we have a second round of fuzzy comparisons, it would only be to
dodge platform-specific roundoff differences, so I would think that
locking its fuzz factor to 1e-6 or 1e-10 or so would be plenty good
enough.

> Jim is right that plan stability is a wide problem, which could be
> addressed by setting a higher fuzz factor when plan instability is
> observed.

User-visible plan instability (ie, instability on a given installation)
usually arises from changes in the contents of pg_statistic.  I am
really dubious that changing the fuzz factor would help that in any
useful way.  It's possible I guess, but I'd want to see some
experimental proof before we advertise it as being helpful for that.
        regards, tom lane


Re: Plan stability versus near-exact ties in cost estimates

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> A variant idea would be to replace the exact cost comparison with a
>> second round of fuzzy cost comparison, but with a much tighter fuzz
>> factor, maybe 1e-6 instead of 0.01.

> Not impressed with this idea- the notion that our model is good enough
> to produce valid values out to that many digits is, well, unlikely.

While I remain convinced of the abstract truth of that position, I've
committed a patch that uses a second round of fuzzy comparison.  It
turned out that simply removing the exact comparison as per my first
proposal resulted in a surprisingly large number of changes in the
regression test results; apparently, a lot more cases than we realized
have multiple plans with indistinguishable costs.  I didn't feel like
going through and validating the test result changes, and besides that
this could have resulted in changes in behavior-in-the-field that people
would complain about.
        regards, tom lane


Re: Plan stability versus near-exact ties in cost estimates

From
"Albe Laurenz"
Date:
Tom Lane wrote:
>>> Um, that is what the proposed patch does.

>> I was referring to the first two lines that the patch removes.
>> I guess I don't understand why they should go.

> What we'd have left after the proposed removal is
>
>                            if (new_path->rows < old_path->rows)
>                                  remove_old = true;  /* new dominates old */
>                            else
>                                  accept_new = false; /* old equals or dominates new */
>
> There's no need to make more than one test on the rows values.

I see, thanks for explaining.

Yours,
Laurenz Albe