Planner making bad choice in alternative subplan decision

From: David Rowley
This follows on from what I mentioned in [1]. I'll reiterate below:

Really the topic is much broader than what I wrote above, but the
example case I have is around subplan costing that results in the
alternative subplan code making a bad choice.

So the example case is:

create table t (a int, b int);
insert into t select x,1 from generate_Series(1,10000)x;
create index t_b_idx on t (b);
analyze t;

The query is:

select * from t where exists (select 1 from t t2 where t.a=t2.b) or a < 0;

Here the planner has 3 choices:

1. Use a non-hashed subplan and use the t_b_idx to find if there's a
match for the EXISTS, or;
2. Use a hashed subplan and just scan all of t and put that in a hash
table to probe when scanning the outer query, or;
3. Use a seq-scan on the inner table and fetch 1 row matching t.a=t2.b
for each outer row.

#3 is not going to work out well here since t.a has the values 1 to
10000 and t2.b has just one distinct value, which is 1. For 9999 of
the outer rows we'd have to seq-scan the entire t2 just to find out
that no matching row exists. That would be a terrible plan if the
planner went with it.

Unfortunately, that's the exact plan the planner goes with :-(

Let me explain:

The subquery planning that's going on here pretty much amounts to
planning the following. I'll use PREPARE and a generic plan to
simulate how the planner deals with unknown parameter values in the
subquery. I'll add a LIMIT 1 since the subquery_planner will be
passed a 1.0 tuple_fraction when planning an EXISTS_SUBLINK.

set plan_cache_mode = force_generic_plan;
prepare q1(int) as select b from t where b = $1 limit 1;
postgres=# explain (analyze, costs off, timing off) execute q1(1);
                 QUERY PLAN
---------------------------------------------
 Limit (actual rows=1 loops=1)
   ->  Seq Scan on t (actual rows=1 loops=1)
         Filter: (b = $1)
 Planning Time: 0.035 ms
 Execution Time: 0.089 ms
(5 rows)

So here the planner has to plan with an unknown parameter value.
var_eq_non_const() comes along and sees the stats say there's only 1
distinct value in the "b" column, so the selectivity is 1.0 / 1.0,
aka 1.0.  That works out well when I pass the value 1 to EXECUTE, but
if I pass in anything else, then:

postgres=# explain (analyze, costs off, timing off) execute q1(0);
                 QUERY PLAN
---------------------------------------------
 Limit (actual rows=0 loops=1)
   ->  Seq Scan on t (actual rows=0 loops=1)
         Filter: (b = $1)
         Rows Removed by Filter: 10000
 Planning Time: 0.017 ms
 Execution Time: 1.850 ms
(6 rows)

We run through the entire table to find nothing.

I removed the costs here to trim down the output, but the row estimate
is 10000 in both cases.

The reason the alternative subplan code picks the non-hashed plan is
that the estimated number of rows for the subplan is 10000, and we
apply the scan cost in cost_subplan() as:

/* we only need to fetch 1 tuple; clamp to avoid zero divide */
sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows);

On my machine the seqscan of the 10000 rows costs 180, so
cost_subplan() costs the subplan at 180 / 10000 per evaluation.
That's slightly more favourable than the hashed subplan, since that
plans the seqscan plus the cost of hashing. The qual-less scan in
this case costs just 155, but the cost of hashing, coincidentally,
takes the startup cost to 180.  In the very recently committed
fix_alternative_subplan() function, by the time we account for the
hash probe cost, the hashed subplan will cost 205.  That code ends up
choosing the non-hashed plan, aka the dreaded #3 case from above.
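
Putting those numbers side by side (roughly, just restating the costs
above):

  non-hashed: 180 / 10000 = 0.018 per outer row, so ~180 in total over
              the 10000 outer rows
  hashed:     155 for the scan + hashing = 180 startup, ~205 once the
              probe costs are added

so on estimated cost the non-hashed subplan looks (just) cheaper, even
though at execution time it's the disastrous one.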

Fixing it:

I'm not quite sure which part of the planner to blame for this poor
choice.  Is it the fact that the planner picked such a risky seqscan
path to fetch 1 row from the table? Or is it cost_subplan(), for
dividing the total cost for the EXISTS_SUBLINK by the row estimate?
Certainly if we change that to do the same as ALL_SUBLINK, then that
would fix this particular problem, but it might penalise other cases
incorrectly.
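
For reference, the relevant part of cost_subplan() looks roughly like
this (paraphrased from memory from
src/backend/optimizer/plan/subselect.c, so treat it as a sketch rather
than an exact quote):

if (subplan->subLinkType == EXISTS_SUBLINK)
{
    /* we only need to fetch 1 tuple; clamp to avoid zero divide */
    sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows);
}
else if (subplan->subLinkType == ALL_SUBLINK ||
         subplan->subLinkType == ANY_SUBLINK)
{
    /* assume we need 50% of the tuples */
    sp_cost.per_tuple += 0.50 * plan_run_cost;

    /* also charge a cpu_operator_cost per row examined */
    sp_cost.per_tuple += 0.50 * plan->plan_rows * cpu_operator_cost;
}
else
{
    /* assume we need all tuples */
    sp_cost.per_tuple += plan_run_cost;
}

So "do the same as ALL_SUBLINK" would mean charging 50% of the
subplan's run cost per evaluation instead of the run cost divided by
the row estimate.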

I'm leaning more towards the fact that the planner picked the seqscan
path in the first place. Unfortunately, I don't have any great ideas on
how to fix that without fudging the costs a bit. Maybe we can make
var_eq_non_const() divide down the selectivity by:

ndistinct = Max(get_variable_numdistinct(vardata, &isdefault),
DEFAULT_NUM_DISTINCT);

instead of :

ndistinct = get_variable_numdistinct(vardata, &isdefault);

But that's going to make the row estimate in this case 50 rows (10000
/ 200), for a subquery that the stats say can only return either 10000
or 0 rows. Is that too weird?
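
To be concrete, the idea is something like the following in
var_eq_non_const() (the surrounding lines are paraphrased from memory;
only the Max() is new):

selec = 1.0 - stats->stanullfrac;

/*
 * Clamp ndistinct so that a very small ndistinct estimate can't drive
 * the selectivity for an unknown parameter value all the way up to ~1.0.
 */
ndistinct = Max(get_variable_numdistinct(vardata, &isdefault),
                DEFAULT_NUM_DISTINCT);
if (ndistinct > 1)
    selec /= ndistinct;

In the example above that gives 10000 * (1.0 / 200) = 50 estimated rows.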

FWIW, the execution time for #3 is 7.4 seconds on my machine. Using
the hashed subplan (#2) takes 3.7 milliseconds, i.e. about 2000 times
faster, if you'd prefer that.

Anyone else have any thoughts on this?

David

(I've copied in Tom since he's the one I mentioned this to in [1] and
whom I promised another thread for this.)

[1] https://www.postgresql.org/message-id/CAApHDvqeh8JEqMjpCFTgHD_zu2S03nOVh2srejd+sNLza8M+mg@mail.gmail.com



Re: Planner making bad choice in alternative subplan decision

From: David Rowley
On Mon, 28 Sep 2020 at 13:22, David Rowley <dgrowleyml@gmail.com> wrote:
> I'm leaning more towards the fact that the planner picked the seqscan
> path in the first place. Unfortunately, I don't have any great ideas on
> how to fix that without fudging the costs a bit. Maybe we can make
> var_eq_non_const() divide down the selectivity by:
>
> ndistinct = Max(get_variable_numdistinct(vardata, &isdefault),
> DEFAULT_NUM_DISTINCT);
>
> instead of :
>
> ndistinct = get_variable_numdistinct(vardata, &isdefault);
>
> But that's going to make the row estimate in this case 50 rows (10000
> / 200), for a subquery that the stats say can only return either 10000
> or 0 rows. Is that too weird?

I figured it was easier to have a discussion around a patch, so, for
discussion purposes, I wrote one and attached it.

The patch is along the lines of what I mentioned above, so it aims to
be a more generic fix for this problem rather than something that only
changes EXISTS-type subqueries.

It affects both of the examples that I mentioned in my previous email,
so we now get an index scan in the following case:

postgres=# set plan_cache_mode = force_generic_plan;
SET
postgres=# prepare q1(int) as select b from t where b = $1 limit 1;
PREPARE
postgres=# explain (analyze, costs off, timing off) execute q1(1);
                            QUERY PLAN
------------------------------------------------------------------
 Limit (actual rows=1 loops=1)
   ->  Index Only Scan using t_b_idx on t (actual rows=1 loops=1)
         Index Cond: (b = $1)
         Heap Fetches: 0
 Planning Time: 0.166 ms
 Execution Time: 1.772 ms
(6 rows)

and we get a hashed subplan in the other case I mentioned (which does
win out over the Index Scan subplan in fix_alternative_subplan()).

While it is great to fix both those cases, this patch could have
repercussions later in planning. The selectivity estimate for these
parameter equality scans is now going to be at most 1.0 / 200.0, where
before it could have been 1.0 given an ndistinct estimate of 1.
There's probably more smarts that could be added, like checking if the
parameter is from a Var and if there's a foreign key to verify that
we'll never pass in a parameter value that's not in the table. I'm not
sure how hard that will be exactly. I see we do at least pass down the
PlannerInfo into the operator's oprcode function, which is a good
start, but I suspect it'll still be tricky to do this efficiently.

Any opinions on this?

David

Attachment

Re: Planner making bad choice in alternative subplan decision

From: Tom Lane
David Rowley <dgrowleyml@gmail.com> writes:
> Any opinions on this?

This patch scares the heck out of me.  It's a pretty much unprincipled
change in a fundamental selectivity estimator, which is going to affect
all sorts of queries not only the particular case you have in mind.
There's no reason to think that the outcome will be positive for other
cases, either.

The idea I'd had was to adjust make_subplan and cost_subplan to estimate
EXISTS cases on the basis of either 50% retrieval (same as ANY/ALL) or
maybe full retrieval if you want to be pessimistic.  I've not had time
to try it out though.

            regards, tom lane



Re: Planner making bad choice in alternative subplan decision

From: David Rowley
Thanks for chipping in here.

On Tue, 29 Sep 2020 at 12:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > Any opinions on this?
>
> This patch scares the heck out of me.  It's a pretty much unprincipled
> change in a fundamental selectivity estimator, which is going to affect
> all sorts of queries not only the particular case you have in mind.
> There's no reason to think that the outcome will be positive for other
> cases, either.

Hmm. Yeah, I understand your concerns. The reason I thought this might
be an okay route to fix the problem was the comment above the current
code, which says "Is that a good idea?".

I think I've mentioned somewhere on the list the idea of a risk-based
costing model, where we'd tag some sort of risk factor onto a Path and
have add_path() consider the risk as well as the cost, perhaps with
some GUCs to help weight the decision in a certain direction.  The
primary case for this is how we multiply selectivities of quals
assuming no correlation. With each multiplication, we increase the
chance of being wrong, which would increase the risk score. The way
this problem tends to come out and bite people is that we end up with
a selectivity so low we think there's 1 row, so we end up doing some
subsequent join as a non-parameterised nested loop join, but it turns
out 1 million rows match and someone has to wait a long time for their
query to finish. A risk+cost based planner would see that that's risky
and maybe consider hashing that 1 row and doing a hash join instead.
Hashing 1 row is pretty cheap, not much more expensive than nested
looping, but if it blows up, the explosion is contained.

Anyway, perhaps it's better to fix the more general case I mention
above one day, when we have some sort of risk factor in the costing
model, and just assign a higher risk to the seqscan path.

> The idea I'd had was to adjust make_subplan and cost_subplan to estimate
> EXISTS cases on the basis of either 50% retrieval (same as ANY/ALL) or
> maybe full retrieval if you want to be pessimistic.  I've not had time
> to try it out though.

Yeah, I can look at that again if you think it's more reasonable. It
was the first place I landed when looking for a fix, until I discovered
the problem was more generic than just subplans.

David



Re: Planner making bad choice in alternative subplan decision

From: David Rowley
On Tue, 29 Sep 2020 at 12:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > Any opinions on this?
>
> This patch scares the heck out of me.  It's a pretty much unprincipled
> change in a fundamental selectivity estimator, which is going to affect
> all sorts of queries not only the particular case you have in mind.
> There's no reason to think that the outcome will be positive for other
> cases, either.
>
> The idea I'd had was to adjust make_subplan and cost_subplan to estimate
> EXISTS cases on the basis of either 50% retrieval (same as ANY/ALL) or
> maybe full retrieval if you want to be pessimistic.  I've not had time
> to try it out though.

Here's a patch which does that.  There are no regression test failures.
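
In short, the change is to cost EXISTS_SUBLINK the same way as the
ALL/ANY sublinks in cost_subplan(), i.e. roughly along these lines
(an illustrative sketch only; see the attachment for the real thing):

if (subplan->subLinkType == EXISTS_SUBLINK ||
    subplan->subLinkType == ALL_SUBLINK ||
    subplan->subLinkType == ANY_SUBLINK)
{
    /* assume we need to fetch 50% of the tuples */
    sp_cost.per_tuple += 0.50 * plan_run_cost;
}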

I'm trying to anticipate areas where this could cause a regression.  I
think generally a hashed subplan should win in just about all cases
where we look up all of the items in the table. The place where it
could cause a regression is where we have to load way more into the
hash table than we're going to look up. Assuming roughly the same cost
for the hashed and the non-hashed subplan, the crossover point will be
about 2 lookups (2 x 50%).  I do wonder if 50% is a bit too big a
number. We did ask the planner for a fast-startup plan, so there is
perhaps some reasoning to make the cost multiplier somewhere between
50% and 1 / nrows. I'm just not sure what it should be. There's very
little to go on cost-wise, or even heuristically, so consider this
still a discussion-level patch.
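
(To spell out the crossover arithmetic: if the subplan's run cost is C,
the hashed version pays roughly C once to build the hash table, while
the non-hashed version pays roughly 0.5 * C per outer row, so the two
break even at about 2 outer-row lookups.)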

David

Attachment

Re: Planner making bad choice in alternative subplan decision

From: Tom Lane
David Rowley <dgrowleyml@gmail.com> writes:
> On Tue, 29 Sep 2020 at 12:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The idea I'd had was to adjust make_subplan and cost_subplan to estimate
>> EXISTS cases on the basis of either 50% retrieval (same as ANY/ALL) or
>> maybe full retrieval if you want to be pessimistic.  I've not had time
>> to try it out though.

> Here's a patch which does that.  There are no regression test failures.

(1) IMO, make_subplan should be adjusted in the same way; or at least,
if it isn't, the comments therein need to be changed to explain why
it's okay for it to be out of sync with cost_subplan.

(2) Does this fix the plan choice problem you started with?

> I'm trying to anticipate areas where this could cause a regression.  I
> think generally a hashed subplan should win in just about all cases
> where we look up all of the items in the table. The place where it
> could cause a regression is where we have to load way more into the
> hash table than we're going to look up. Assuming roughly the same cost
> for the hashed and the non-hashed subplan, the crossover point will be
> about 2 lookups (2 x 50%).  I do wonder if 50% is a bit too big a
> number. We did ask the planner for a fast-startup plan, so there is
> perhaps some reasoning to make the cost multiplier somewhere between
> 50% and 1 / nrows. I'm just not sure what it should be. There's very
> little to go on cost-wise, or even heuristically, so consider this
> still a discussion-level patch.

One way to get some data is to see what the threshold multiplier is
to change the plan choice in an EXISTS that is currently planned
wrongly.

            regards, tom lane



Re: Planner making bad choice in alternative subplan decision

From: David Rowley
On Wed, 30 Sep 2020 at 04:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > On Tue, 29 Sep 2020 at 12:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> The idea I'd had was to adjust make_subplan and cost_subplan to estimate
> >> EXISTS cases on the basis of either 50% retrieval (same as ANY/ALL) or
> >> maybe full retrieval if you want to be pessimistic.  I've not had time
> >> to try it out though.
>
> > Here's a patch which does that.  There are no regression test failures.
>
> (1) IMO, make_subplan should be adjusted in the same way; or at least,
> if it isn't, the comments therein need to be changed to explain why
> it's okay for it to be out of sync with cost_subplan.

All make_subplan() seems to be doing there is deciding on the
tuple_fraction to pass to the planner. Perhaps it's worth putting a
note there to mention why cost_subplan() does it differently, but that
might be better explained in cost_subplan() itself.
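
For reference, the tuple_fraction logic in make_subplan() is roughly
the following (paraphrased from memory, so a sketch rather than an
exact quote):

if (subLinkType == EXISTS_SUBLINK)
    tuple_fraction = 1.0;   /* just like a LIMIT 1 */
else if (subLinkType == ALL_SUBLINK ||
         subLinkType == ANY_SUBLINK)
    tuple_fraction = 0.5;   /* 50% */
else
    tuple_fraction = 0.0;   /* default behavior */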

> (2) Does this fix the plan choice problem you started with?

Yes

> > I'm trying to anticipate areas where this could cause a regression.  I
> > think generally a hashed subplan should win in just about all cases
> > where we look up all of the items in the table. The place where it
> > could cause a regression is where we have to load way more into the
> > hash table than we're going to look up. Assuming roughly the same cost
> > for the hashed and the non-hashed subplan, the crossover point will be
> > about 2 lookups (2 x 50%).  I do wonder if 50% is a bit too big a
> > number. We did ask the planner for a fast-startup plan, so there is
> > perhaps some reasoning to make the cost multiplier somewhere between
> > 50% and 1 / nrows. I'm just not sure what it should be. There's very
> > little to go on cost-wise, or even heuristically, so consider this
> > still a discussion-level patch.
>
> One way to get some data is to see what the threshold multiplier is
> to change the plan choice in an EXISTS that is currently planned
> wrongly.

It'll choose the hashed plan if we divide by just half the rows.

I've input the costs into the attached spreadsheet. Cell D2 is what
I'm proposing to change. The patch effectively changes it to =F2*0.5,
which makes the cost of the non-hashed plan 850,000 (vs 195 for the
hashed plan). Changing it to =F2/C2*2 (i.e. dividing by half the rows)
would cause the planner to choose the hashed plan.
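
Working backwards from those numbers, roughly:

  subplan run cost (F2)   ~170
  =F2*0.5    -> ~85 per outer row    -> ~850,000 over 10000 outer rows
  =F2/C2*2   -> ~0.034 per outer row -> ~340 over 10000 outer rows

so at the divide-by-half-the-rows multiplier the non-hashed plan's ~340
no longer beats the hashed plan's 195, and the hashed plan wins.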

David

Attachment