Plan stability versus near-exact ties in cost estimates - Mailing list pgsql-hackers

So after committing the latest round of parameterized-plan hacking,
I was dismayed to see the buildfarm breaking out in pink, thanks to
some of the members producing a different plan than expected for one
test query.  I eventually managed to reproduce that (on the fourth
machine I tried locally), and after some quality time with gdb
I understand what is happening, to wit: the two alternative plans have
exactly the same cost so far as our cost model is concerned.  On my
main development machine, the two plans look like this to add_path:

$13 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x40193c08,  param_info = 0x40194458, rows = 5, startup_cost
=0,  total_cost = 47.628499999999988, pathkeys = 0x0}
 

$16 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x40193c08,  param_info = 0x40194458, rows = 5, startup_cost
=0,  total_cost = 47.628499999999981, pathkeys = 0x0}
 

so it picks the second one on the basis that its total_cost is better at
the sixteenth decimal place.  On the other machine, the same two paths
look like this:

$12 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x895b9e0,  param_info = 0x895c198, rows = 5, startup_cost =
0, total_cost = 47.578499999999977, pathkeys = 0x0}
 

Breakpoint 2, add_path (parent_rel=0x895b9e0, new_path=0x895c208)
$15 = {type = T_NestPath, pathtype = T_NestLoop, parent = 0x895b9e0,  param_info = 0x895c198, rows = 5, startup_cost =
0, total_cost = 47.578499999999977, pathkeys = 0x0}
 

and add_path is coded to arbitrarily keep the first one when two
paths are exactly the same on all its preference measures.

Now, the fact that the two machines get different costs at the third
decimal place isn't very interesting here; that's a pretty standard
cross-platform difference arising from different MAXALIGN values.
The important point is that the total_costs of the two paths are
exactly the same on one machine, and on the other one different only
by a microscopic amount that probably arises from a slightly different
floating-point calculation sequence chosen by a different compiler.

So, as our code stands, we're inevitably going to have very platform-
and compiler-specific decisions as to which plan to prefer.  I'm a bit
surprised that it's taken us this long to trip over this type of
situation, because it's surely not specific to parameterized paths.

We could deal with this either by giving up on showing the selected
plan in the regression test, or by creating multiple "expected" files,
but neither of those alternatives is very appealing.

The idea that I'm toying with is to try to make the choice a bit less
platform-specific, by removing the exact cost test that add_path uses
as its last-ditch comparison step, essentially this:
                               /*                                * Same pathkeys and outer rels, and fuzzily
                   * the same cost, so keep just one; to decide                                * which, first check
rowsand then do an                                * exact cost comparison.                                */
                  if (new_path->rows < old_path->rows)                                   remove_old = true;  /* new
dominatesold */
 
-                               else if (new_path->rows > old_path->rows)
-                                   accept_new = false; /* old dominates new */
-                               else if (compare_path_costs(new_path, old_path,
-                                                      TOTAL_COST) < 0)
-                                   remove_old = true;  /* new dominates old */                               else
                            accept_new = false; /* old equals or dominates new */
 

This change would mean that, when two paths have the same pathkeys,
parameterization, and rowcount, and fuzzily the same cost, that we
arbitrarily keep the first-submitted one rather than looking at low
order digits of the costs.  Since the order in which different paths
are generated is a property of our code and not platform-specific,
this should eliminate platform dependencies in cases where two paths
are essentially identical to the cost model.

A variant idea would be to replace the exact cost comparison with a
second round of fuzzy cost comparison, but with a much tighter fuzz
factor, maybe 1e-6 instead of 0.01.

Now, neither of these fixes is perfect: what they would do is remove
platform-specific instability from where the costs are basically equal
and add some more in the range where the costs differ by almost exactly
the fuzz factor.  But the behavior near that point is platform-specific
already, just not quite as much, and it's surely something we're
unlikely to trip over in the regression tests.

Thoughts, better ideas?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Dimitri Fontaine
Date:
Subject: Re: Bug #6593, extensions, and proposed new patch policy
Next
From: Jim Nasby
Date:
Subject: Re: Plan stability versus near-exact ties in cost estimates