BUG #15947: Worse plan is chosen after giving the planner more freedom (partitionwise join)

From: PG Bug reporting form <noreply@postgresql.org>

The following bug has been logged on the website:

Bug reference:      15947
Logged by:          Yaroslav Schekin
Email address:      ladayaroslav@yandex.ru
PostgreSQL version: 11.5
Operating system:   Any
Description:

After creating the tables below:
-----
CREATE TABLE sg (
    id bigint NOT NULL,
    sc_fk bigint,
    geo_id bigint,
    sl smallint NOT NULL,
    a date NOT NULL,
    o boolean NOT NULL
)
PARTITION BY RANGE (o, sl, a);

CREATE TABLE sg_19_01_d PARTITION OF sg
    FOR VALUES FROM (false, '5', '2019-01-01') TO (false, '5', '2019-02-01');
CREATE TABLE sg_19_02_d PARTITION OF sg
    FOR VALUES FROM (false, '5', '2019-02-01') TO (false, '5', '2019-03-01');

CREATE TABLE sc (
    id bigint,
    a date NOT NULL,
    sl smallint NOT NULL,
    o boolean NOT NULL
)
PARTITION BY RANGE (o, sl, a);

CREATE TABLE sc_19_01_d PARTITION OF sc
    FOR VALUES FROM (false, '5', '2019-01-01') TO (false, '5', '2019-02-01');
CREATE TABLE sc_19_02_d PARTITION OF sc
    FOR VALUES FROM (false, '5', '2019-02-01') TO (false, '5', '2019-03-01');

INSERT INTO sg_19_01_d(id, sc_fk, geo_id, sl, a, o)
SELECT n, n, 0, 5, '2019-01-01', false
  FROM generate_series(1, 1000) AS g(n);

INSERT INTO sg_19_02_d(id, sc_fk, geo_id, sl, a, o)
SELECT n, n, 0, 5, '2019-02-01', false
  FROM generate_series(1, 1000) AS g(n);

INSERT INTO sc_19_01_d(id, a, sl, o)
SELECT n, '2019-01-01', 5, false
  FROM generate_series(1, 1000) AS g(n);

INSERT INTO sc_19_02_d(id, a, sl, o)
SELECT n, '2019-02-01', 5, false
  FROM generate_series(1, 1000) AS g(n);

ANALYZE sg_19_01_d, sg_19_02_d, sc_19_01_d, sc_19_02_d;
-----
I'm trying the following query:

EXPLAIN
SELECT COUNT(*)
  FROM sc
 WHERE EXISTS (
       SELECT 1
         FROM sg
        WHERE sc.id = sg.sc_fk
          AND sc.a = sg.a
          AND sc.o = sg.o
          AND sc.sl = sg.sl
       );

This produces a plan with the following cost estimate (top node):
-- Aggregate  (cost=147.25..147.26 rows=1 width=8)

But after:
SET enable_partitionwise_join = true;

The new plan is more expensive:
-- Aggregate  (cost=175.00..175.01 rows=1 width=8)

This shouldn't be happening, right?


On 2019-08-11 00:28, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      15947
> ...
> I'm trying the following query:
> 
> EXPLAIN
> SELECT COUNT(*)
>   FROM sc
>  WHERE EXISTS (
>        SELECT 1
>          FROM sg
>         WHERE sc.id = sg.sc_fk
>           AND sc.a = sg.a
>           AND sc.o = sg.o
>           AND sc.sl = sg.sl
>        );
> 
> This produces a plan with the following cost estimate (top node):
> -- Aggregate  (cost=147.25..147.26 rows=1 width=8)
> 
> But after:
> SET enable_partitionwise_join = true;
> 
> The new plan is more expensive:
> -- Aggregate  (cost=175.00..175.01 rows=1 width=8)
> 
> This shouldn't be happening, right?

Why not?

The 'worse plan' consistently executes faster.

I don't think the cost units of different queries can/should be compared 
as if they are the same.
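
To compare the two plans on something other than cost units, run both
variants under EXPLAIN ANALYZE and look at the actual times of the top
nodes; a sketch, reusing the query from the report:

-----
SET enable_partitionwise_join = false;
EXPLAIN ANALYZE
SELECT COUNT(*)
  FROM sc
 WHERE EXISTS (
       SELECT 1
         FROM sg
        WHERE sc.id = sg.sc_fk
          AND sc.a = sg.a
          AND sc.o = sg.o
          AND sc.sl = sg.sl
       );

SET enable_partitionwise_join = true;
-- run the same EXPLAIN ANALYZE again and compare the "actual time"
-- figures, not the cost numbers
-----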

PG Bug reporting form <noreply@postgresql.org> writes:
> After creating the tables below:
> ...
> ANALYZE sg_19_01_d, sg_19_02_d, sc_19_01_d, sc_19_02_d;

Hm, you made a tactical error there: you should have done

ANALYZE sg, sc;

ie analyze the parent partitioned tables not the partitions.
As you have it, we never compute inherited stats across the
whole partitioned tables, which leads to wrong estimates
about the join size and hence cost:

 Aggregate  (cost=147.25..147.26 rows=1 width=8) (actual time=9.934..9.934 rows=1 loops=1)
   ->  Hash Join  (cost=80.00..146.62 rows=250 width=0) (actual time=6.204..9.632 rows=2000 loops=1)
   ...

After analyzing the parents it's much better:

 Aggregate  (cost=195.00..195.01 rows=1 width=8) (actual time=6.199..6.200 rows=1 loops=1)
   ->  Hash Semi Join  (cost=88.00..190.00 rows=2000 width=0) (actual time=2.586..5.885 rows=2000 loops=1)
   ...

The size and cost estimates for partitionwise-join paths are made by
adding up the per-partition sizes/costs, so that those are a lot
closer to being correct even without any parent-level stats:

 Aggregate  (cost=175.00..175.01 rows=1 width=8) (actual time=4.562..4.563 rows=1 loops=1)
   ->  Append  (cost=39.00..170.00 rows=2000 width=0) (actual time=1.105..4.327 rows=2000 loops=1)
   ...

So *if* you have parent-level stats in place, this example works well:
the partitionwise join is estimated as cheaper than the other way, and
that estimate is correct, and all is great.
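
(For anyone checking their own setup: parent-level statistics appear in
the pg_stats view with inherited = true, so their presence is easy to
verify. A quick check, using the tables from this report:)

-----
-- these rows exist only once the parents themselves have been analyzed
SELECT tablename, attname, inherited, n_distinct
  FROM pg_stats
 WHERE tablename IN ('sg', 'sc')
   AND inherited
 ORDER BY tablename, attname;
-----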

But when you don't, why doesn't it seize on the incorrectly-estimated-
as-cheaper non-partitioned join?  The answer seems to be that
apply_scanjoin_target_to_paths throws away all the non-partitioned
paths, if the join is partitioned.  It claims that

     * If the rel is partitioned, we want to drop its existing paths and
     * generate new ones.  This function would still be correct if we kept the
     * existing paths: we'd modify them to generate the correct target above
     * the partitioning Append, and then they'd compete on cost with paths
     * generating the target below the Append.  However, in our current cost
     * model the latter way is always the same or cheaper cost, so modifying
     * the existing paths would just be useless work.

but evidently that's not really true when considering partitioned vs
non-partitioned joins.  It holds only if cost estimates that are made
in very different ways are comparable.

I'm inclined to think that the partitioned estimates are more trustworthy
than the non-partitioned ones, so maybe we should leave things as they
stand for now.  Still, this is another point that's causing me to form
an increasingly hardened conviction that apply_scanjoin_target_to_paths
needs to be nuked from orbit.  It's been nothing but trouble since it
was committed, which is unsurprising considering how many planner
structural conventions it tramples on.  I'm not very sure what a better
way would look like though :-(.

Another point that this example raises is that letting autoanalyze
ignore partition parent tables is not just a minor problem, but
a potentially serious bug, causing very poor plan choices to be
made for any nontrivial queries involving partitioned tables.

I don't see us doing anything about either of these points in the
very short term, and certainly not back-patching any changes into
released branches.  But we ought to think about fixes going forward.
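
(In the meantime, the workaround is to keep analyzing the partition
parents explicitly; a sketch for this example's tables:)

-----
-- run periodically, e.g. from cron, since autoanalyze will not
-- process the partitioned parents for you:
ANALYZE sg, sc;
-----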

            regards, tom lane


> Another point that this example raises is that letting autoanalyze
> ignore partition parent tables is not just a minor problem, but
> a potentially serious bug, causing very poor plan choices to be
> made for any nontrivial queries involving partitioned tables.

I ran into this issue frequently with pg_partman, well before native partitioning. It's why I initially had the internal maintenance process run an ANALYZE on the parent table whenever a new child table was created; otherwise constraint exclusion didn't always work right. But that started causing problems with the maintenance runs themselves: on really large partition sets, maintenance could sometimes take a very long time, potentially holding locks that blocked normal table usage. So as of PG11 I no longer run the analyze by default, though it can still be turned on.

If there were some way to have autoanalyze kick in on the parent whenever a new child table is created in a set, that would probably help greatly. It could run with the same priority as a normal autoanalyze, so that incoming queries requiring a stronger lock could cancel it (assuming that's how that works?) and it could resume later. But something really does need to be done here to keep set-wide stats current so that the exclusion/pruning process works better.
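
(For illustration only, a rough sketch of that kind of hook as it could
be approximated today with an event trigger; untested, the function and
trigger names are made up, and the ANALYZE here runs in the foreground
rather than with autoanalyze's priority:)

-----
CREATE OR REPLACE FUNCTION analyze_parent_on_new_partition()
RETURNS event_trigger LANGUAGE plpgsql AS $$
DECLARE
    obj    record;
    parent regclass;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        IF obj.command_tag = 'CREATE TABLE' THEN
            -- if the new table is a partition, refresh the parent's
            -- inherited ("set-wide") statistics
            SELECT inhparent::regclass INTO parent
              FROM pg_inherits
             WHERE inhrelid = obj.objid;
            IF parent IS NOT NULL THEN
                EXECUTE format('ANALYZE %s', parent);
            END IF;
        END IF;
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER refresh_parent_stats
    ON ddl_command_end
    EXECUTE PROCEDURE analyze_parent_on_new_partition();
-----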

Keith Fiske