plan_rows confusion with parallel queries

From
Tomas Vondra
Date:
Hi,

while eye-balling some explain plans for parallel queries, I got a bit 
confused by the row count estimates. I wonder whether I'm alone.

Consider for example a simple seq scan query, which in non-parallel 
explain looks like this:
                              QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on tables t  (cost=0.00..16347.60 rows=317160 width=356)
                       (actual rows=317160 loops=1)
 Planning time: 0.173 ms
 Execution time: 47.707 ms
(3 rows)

but a parallel plan looks like this:
                             QUERY PLAN
---------------------------------------------------------------------
 Gather  (cost=0.00..14199.10 rows=317160 width=356)
         (actual rows=317160 loops=1)
   Workers Planned: 3
   Workers Launched: 3
   ->  Parallel Seq Scan on tables t  (cost=... rows=102310 width=356)
                                      (actual rows=79290 loops=4)
 Planning time: 0.209 ms
 Execution time: 150.812 ms
(6 rows)


Now, for actual rows we can simply do 79290 * 4 = 317160, and we get the 
correct number of rows produced by the plan (i.e. matching the 
non-parallel query).

But for the estimate, it doesn't work like that:
    102310 * 4 = 409240

which is ~30% above the estimate shown on the Gather node. It's clear 
why this is happening - when computing plan_rows, we don't count the 
leader as a full worker, but instead use this:
    leader_contribution = 1.0 - (0.3 * path->parallel_workers);

so with 3 workers, the leader is only worth ~0.1 of a worker:
    102310 * 3.1 = 317161
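
To make the arithmetic explicit, here's a minimal standalone sketch of 
how I believe the per-worker estimate is derived (only the 
leader_contribution formula quoted above comes from cost_seqscan; the 
rest is my own reconstruction and may not match the real code exactly):

    #include <stdio.h>

    /*
     * Sketch: the leader counts as a fractional worker, and the total
     * row estimate is divided by (workers + leader_contribution).
     */
    static double
    per_worker_rows(double total_rows, int parallel_workers)
    {
        double parallel_divisor = parallel_workers;
        double leader_contribution = 1.0 - (0.3 * parallel_workers);

        if (leader_contribution > 0)
            parallel_divisor += leader_contribution;

        return total_rows / parallel_divisor;
    }

    int
    main(void)
    {
        /* 317160 total rows, 3 planned workers => divisor 3.1 => ~102310 */
        printf("per-worker estimate: %.0f\n", per_worker_rows(317160, 3));
        return 0;
    }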

It's fairly easy to spot this here, because the Gather node is right 
above the Parallel Seq Scan, and the values in the Gather are accurate. 
But in many plans the Gather will not be immediately above the node 
(e.g. there may be a parallel aggregate in between).

Of course, the fact that we use the planned number of workers when 
computing plan_rows but the actual number of workers for the actually 
produced rows makes this even more confusing.

BTW is it really a good idea to use nloops to track the number of 
workers executing a given node? How will that work once we get parallel 
nested loops and index scans?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: plan_rows confusion with parallel queries

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> while eye-balling some explain plans for parallel queries, I got a bit
> confused by the row count estimates. I wonder whether I'm alone.

I got confused by that a minute ago, so no you're not alone.  The problem
is even worse in join cases.  For example:
 Gather  (cost=34332.00..53265.35 rows=100 width=8)
   Workers Planned: 2
   ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
         Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
         ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
               ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
               ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
         ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
               ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)

There are actually 1000000 rows in pp, and none in pp1.  I'm not bothered
particularly by the nonzero estimate for pp1, because I know where that
came from, but I'm not very happy that nowhere here does it look like
it's estimating a million-plus rows going into the join.
        regards, tom lane



Re: plan_rows confusion with parallel queries

From
Tomas Vondra
Date:
On 11/02/2016 09:00 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> while eye-balling some explain plans for parallel queries, I got a bit
>> confused by the row count estimates. I wonder whether I'm alone.
>
> I got confused by that a minute ago, so no you're not alone.  The problem
> is even worse in join cases.  For example:
>
>  Gather  (cost=34332.00..53265.35 rows=100 width=8)
>    Workers Planned: 2
>    ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>          Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>          ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>                ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>                ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>          ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>                ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)
>
> There are actually 1000000 rows in pp, and none in pp1.  I'm not bothered
> particularly by the nonzero estimate for pp1, because I know where that
> came from, but I'm not very happy that nowhere here does it look like
> it's estimating a million-plus rows going into the join.
>

Yeah. I wonder how tools visualizing explain plans are going to compute 
time spent in a given node (i.e. excluding the time spent in child 
nodes), or expected cost of that node.

So far we could do something like
    self_time = total_time - child_node_time * nloops

and for example in this plan it's pretty clear we spend ~130ms in the 
Aggregate:
                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=17140.50..17140.51 rows=1 width=8)
            (actual time=306.675..306.675 rows=1 loops=1)
   ->  Seq Scan on tables  (cost=0.00..16347.60 rows=317160 width=0)
                           (actual time=0.188..170.993 rows=317160 loops=1)
 Planning time: 0.201 ms
 Execution time: 306.860 ms
(4 rows)

But in parallel plans it can easily happen that
    child_node_time * nloops > total_time

Consider for example this parallel plan:
                                QUERY PLAN
----------------------------------------------------------------------------
 Finalize Aggregate  (cost=15455.19..15455.20 rows=1 width=8)
                     (actual time=107.636..107.636 rows=1 loops=1)
   ->  Gather  (cost=15454.87..15455.18 rows=3 width=8)
               (actual time=107.579..107.629 rows=4 loops=1)
         Workers Planned: 3
         Workers Launched: 3
         ->  Partial Aggregate  (cost=14454.87..14454.88 rows=1 ...)
                                (actual time=103.895..103.895 rows=1 loops=4)
               ->  Parallel Seq Scan on tables
                   (cost=0.00..14199.10 rows=102310 width=0)
                   (actual time=0.059..59.217 rows=79290 loops=4)
 Planning time: 0.052 ms
 Execution time: 109.250 ms
(8 rows)
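
To make the breakage concrete, here's a trivial sketch of the naive 
self-time computation a visualization tool might do (my own toy example, 
not code from any existing tool), using the numbers from the two plans 
above:

    #include <stdio.h>

    /*
     * Naive "self time": node total time minus (child per-loop time *
     * child nloops).  Fine for the serial plan, nonsensical for the
     * parallel one because nloops counts workers there.
     */
    static double
    self_time(double node_total_ms, double child_per_loop_ms, int child_nloops)
    {
        return node_total_ms - child_per_loop_ms * child_nloops;
    }

    int
    main(void)
    {
        /* serial: Aggregate 306.675 ms, Seq Scan 170.993 ms, loops=1 */
        printf("serial Aggregate:  %8.3f ms\n", self_time(306.675, 170.993, 1));

        /* parallel: Gather finished in 107.629 ms (loops=1), but its
         * Partial Aggregate child reports 103.895 ms per loop, loops=4 */
        printf("parallel Gather:   %8.3f ms\n", self_time(107.629, 103.895, 4));
        return 0;
    }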

Reading explains for parallel plans will always be complicated, but 
perhaps overloading nloops like this makes it even more complicated?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: plan_rows confusion with parallel queries

From
Tomas Vondra
Date:
On 11/02/2016 11:56 PM, Tomas Vondra wrote:
> On 11/02/2016 09:00 PM, Tom Lane wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> while eye-balling some explain plans for parallel queries, I got a bit
>>> confused by the row count estimates. I wonder whether I'm alone.
>>
>> I got confused by that a minute ago, so no you're not alone.  The problem
>> is even worse in join cases.  For example:
>>
>>  Gather  (cost=34332.00..53265.35 rows=100 width=8)
>>    Workers Planned: 2
>>    ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>>          Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>>          ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>>                ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>>                ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>>          ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>>                ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)
>>
>> There are actually 1000000 rows in pp, and none in pp1.  I'm not bothered
>> particularly by the nonzero estimate for pp1, because I know where that
>> came from, but I'm not very happy that nowhere here does it look like
>> it's estimating a million-plus rows going into the join.
>>

Although - it is estimating 1M rows, but only "per worker" estimates are 
shown, and because there are 2 workers planned it says 1M/2.4 which is 
the 416k. I agree it's a bit unclear, but at least it's consistent with 
how we treat loops (i.e. that the numbers are per loop).

But there's more fun with joins - consider for example this simple join:
                               QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=19515.96..43404.82 rows=96957 width=12)
         (actual time=295.167..746.312 rows=99999 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Hash Join  (cost=18515.96..32709.12 rows=96957 width=12)
                  (actual time=249.281..670.309 rows=33333 loops=3)
         Hash Cond: (t2.a = t1.a)
         ->  Parallel Seq Scan on t2
             (cost=0.00..8591.67 rows=416667 width=8)
             (actual time=0.100..184.315 rows=333333 loops=3)
         ->  Hash  (cost=16925.00..16925.00 rows=96957 width=8)
                   (actual time=246.760..246.760 rows=99999 loops=3)
               Buckets: 131072  Batches: 2  Memory Usage: 2976kB
               ->  Seq Scan on t1
                   (cost=0.00..16925.00 rows=96957 width=8)
                   (actual time=0.065..178.385 rows=99999 loops=3)
                     Filter: (b < 100000)
                     Rows Removed by Filter: 900001
 Planning time: 0.763 ms
 Execution time: 793.653 ms
(13 rows)

Suddenly we don't show per-worker estimates for the hash join - both the 
Hash Join and the Gather have exactly the same cardinality estimate.

Now, let's try forcing Nested Loops and see what happens:
                                QUERY PLAN
-----------------------------------------------------------------------------
 Gather  (cost=1000.42..50559.65 rows=96957 width=12)
         (actual time=0.610..203.694 rows=99999 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Nested Loop  (cost=0.42..39863.95 rows=96957 width=12)
                    (actual time=0.222..182.755 rows=33333 loops=3)
         ->  Parallel Seq Scan on t1
             (cost=0.00..9633.33 rows=40399 width=8)
             (actual time=0.030..40.358 rows=33333 loops=3)
               Filter: (b < 100000)
               Rows Removed by Filter: 300000
         ->  Index Scan using t2_a_idx on t2
             (cost=0.42..0.74 rows=1 width=8)
             (actual time=0.002..0.002 rows=1 loops=99999)
               Index Cond: (a = t1.a)
 Planning time: 0.732 ms
 Execution time: 250.707 ms
(11 rows)

So, different join method but same result - 2 workers, loops=3. But 
let's try with small tables (100k rows instead of 1M rows):
                                  QUERY PLAN
----------------------------------------------------------------------------
 Gather  (cost=0.29..36357.94 rows=100118 width=12)
         (actual time=13.219..589.723 rows=100000 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   ->  Nested Loop  (cost=0.29..36357.94 rows=100118 width=12)
                    (actual time=0.288..442.821 rows=100000 loops=1)
         ->  Seq Scan on t1  (cost=0.00..1444.18 rows=100118 width=8)
                             (actual time=0.148..49.308 rows=100000 loops=1)
         ->  Index Scan using t2_a_idx on t2
             (cost=0.29..0.34 rows=1 width=8)
             (actual time=0.002..0.002 rows=1 loops=100000)
               Index Cond: (a = t1.a)
 Planning time: 0.483 ms
 Execution time: 648.941 ms
(10 rows)

Suddenly, we get nworkers=1 with loops=1 (and not nworkers+1 as before). 
FWIW I've only seen this with force_parallel_mode=on, and the row counts 
are correct, so perhaps that's OK. single_copy seems a bit 
underdocumented, though.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: plan_rows confusion with parallel queries

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 11/02/2016 11:56 PM, Tomas Vondra wrote:
>> On 11/02/2016 09:00 PM, Tom Lane wrote:
>>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>>> while eye-balling some explain plans for parallel queries, I got a bit
>>>> confused by the row count estimates. I wonder whether I'm alone.

>>> I got confused by that a minute ago, so no you're not alone.  The problem
>>> is even worse in join cases.  For example:
>>>  Gather  (cost=34332.00..53265.35 rows=100 width=8)
>>>    Workers Planned: 2
>>>    ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>>>          Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>>>          ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>>>                ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>>>                ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>>>          ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>>>                ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)
>>> There are actually 1000000 rows in pp, and none in pp1.  I'm not bothered
>>> particularly by the nonzero estimate for pp1, because I know where that
>>> came from, but I'm not very happy that nowhere here does it look like
>>> it's estimating a million-plus rows going into the join.

> Although - it is estimating 1M rows, but only "per worker" estimates are
> shown, and because there are 2 workers planned it says 1M/2.4 which is
> the 416k. I agree it's a bit unclear, but at least it's consistent with
> how we treat loops (i.e. that the numbers are per loop).

Well, it's not *that* consistent.  If we were estimating all the numbers
underneath the Gather as being per-worker numbers, that would make some
amount of sense.  But neither the other seqscan, nor the hash on it, nor
the hashjoin's output count are scaled that way.  It's very hard to call
the above display anything but flat-out broken.

> But there's more fun with joins - consider for example this simple join:
> ...
> Suddenly we don't show per-worker estimates for the hash join - both the
> Hash Join and the Gather have exactly the same cardinality estimate.

Yeah.  That doesn't seem to be quite the same problem as in my example,
but it's about as confused.

Maybe we need to bite the bullet and add a "number of workers" field
to the estimated and actual counts.  Not sure how much that helps for
the partial-count-for-the-leader issue, though.
        regards, tom lane



Re: plan_rows confusion with parallel queries

From
Robert Haas
Date:
On Wed, Nov 2, 2016 at 2:42 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> BTW is it really a good idea to use nloops to track the number of workers
> executing a given node? How will that work once we get parallel nested
> loops and index scans?

We already have parallel nested loops with inner index scans.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: plan_rows confusion with parallel queries

From
Robert Haas
Date:
On Wed, Nov 2, 2016 at 4:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> while eye-balling some explain plans for parallel queries, I got a bit
>> confused by the row count estimates. I wonder whether I'm alone.
>
> I got confused by that a minute ago, so no you're not alone.  The problem
> is even worse in join cases.  For example:
>
>  Gather  (cost=34332.00..53265.35 rows=100 width=8)
>    Workers Planned: 2
>    ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>          Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>          ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>                ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>                ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>          ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>                ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)
>
> There are actually 1000000 rows in pp, and none in pp1.  I'm not bothered
> particularly by the nonzero estimate for pp1, because I know where that
> came from, but I'm not very happy that nowhere here does it look like
> it's estimating a million-plus rows going into the join.

I welcome suggestions for improvement, but you will note that if the
row count didn't reflect some kind of guess about the number of rows
that each individual worker will see, the costing would be hopelessly
broken.  The cost needs to reflect a guess about the time the query
will finish, not the total amount of effort expended.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: plan_rows confusion with parallel queries

From
Robert Haas
Date:
On Wed, Nov 2, 2016 at 10:44 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Although - it is estimating 1M rows, but only "per worker" estimates are
> shown, and because there are 2 workers planned it says 1M/2.4 which is the
> 416k. I agree it's a bit unclear, but at least it's consistent with how we
> treat loops (i.e. that the numbers are per loop).

Right.  Which I think was a horrible decision.  I think that it would
be best to change EXPLAIN so that the row counts and costs are never
divided by nloops.  That would be a backward-incompatible change, but
I think it would be worth it.  What you typically want to understand
is the total effort expended in a particular plan node, and the
current system makes that incredibly difficult to understand,
especially because we then round off the row count estimates to the
nearest integer, so that you can't even reverse the division if you
want to (which you always do).
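
To spell out the rounding point with numbers from the plan below (a toy 
example of mine, assuming t2 really has the 1M rows mentioned upthread): 
the Parallel Seq Scan on t2 produced 1000000 rows in total across 
loops=3, but EXPLAIN shows the rounded per-loop value, and multiplying 
that back does not recover the true total:

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        double total_rows = 1000000.0;  /* rows produced across all loops */
        int    nloops = 3;              /* leader + 2 workers */

        /* what EXPLAIN displays: the per-loop value, rounded */
        long   shown = lround(total_rows / nloops);

        printf("displayed: rows=%ld loops=%d\n", shown, nloops);
        printf("reconstructed total: %ld (true total was %.0f)\n",
               shown * nloops, total_rows);
        return 0;
    }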

> But there's more fun with joins - consider for example this simple join:
>
>                                QUERY PLAN
> ------------------------------------------------------------------------------
>  Gather  (cost=19515.96..43404.82 rows=96957 width=12)
>          (actual time=295.167..746.312 rows=99999 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Hash Join  (cost=18515.96..32709.12 rows=96957 width=12)
>                   (actual time=249.281..670.309 rows=33333 loops=3)
>          Hash Cond: (t2.a = t1.a)
>          ->  Parallel Seq Scan on t2
>              (cost=0.00..8591.67 rows=416667 width=8)
>              (actual time=0.100..184.315 rows=333333 loops=3)
>          ->  Hash  (cost=16925.00..16925.00 rows=96957 width=8)
>                    (actual time=246.760..246.760 rows=99999 loops=3)
>                Buckets: 131072  Batches: 2  Memory Usage: 2976kB
>                ->  Seq Scan on t1
>                    (cost=0.00..16925.00 rows=96957 width=8)
>                    (actual time=0.065..178.385 rows=99999 loops=3)
>                      Filter: (b < 100000)
>                      Rows Removed by Filter: 900001
>  Planning time: 0.763 ms
>  Execution time: 793.653 ms
> (13 rows)
>
> Suddenly we don't show per-worker estimates for the hash join - both the
> Hash Join and the Gather have exactly the same cardinality estimate.

I'm not sure why that's happening, but I haven't made any changes to
the costing for a node like hash join.  It doesn't treat the parallel
sequential scan that is coming as its first input any differently than
it would if that were a non-parallel plan.  It's just costing the join
normally, based on an input row count that is lower than what it would
be if it were going to see every row from t2 rather than only some of
them.

> So, different join method but same result - 2 workers, loops=3. But let's
> try with small tables (100k rows instead of 1M rows):
>
>                                   QUERY PLAN
> ----------------------------------------------------------------------------
>  Gather  (cost=0.29..36357.94 rows=100118 width=12)
>          (actual time=13.219..589.723 rows=100000 loops=1)
>    Workers Planned: 1
>    Workers Launched: 1
>    Single Copy: true
>    ->  Nested Loop  (cost=0.29..36357.94 rows=100118 width=12)
>                     (actual time=0.288..442.821 rows=100000 loops=1)
>          ->  Seq Scan on t1  (cost=0.00..1444.18 rows=100118 width=8)
>                       (actual time=0.148..49.308 rows=100000 loops=1)
>          ->  Index Scan using t2_a_idx on t2
>                       (cost=0.29..0.34 rows=1 width=8)
>                       (actual time=0.002..0.002 rows=1 loops=100000)
>                Index Cond: (a = t1.a)
>  Planning time: 0.483 ms
>  Execution time: 648.941 ms
> (10 rows)
>
> Suddenly, we get nworkers=1 with loops=1 (and not nworkers+1 as before).
> FWIW I've only seen this with force_parallel_mode=on, and the row counts are
> correct, so perhaps that's OK. single_copy seems a bit underdocumented,
> though.

This is certainly entirely as expected.  Single-copy means that
there's one process running the non-parallel plan beneath it, and
that's it.  So the Gather is just a pass-through node here, like a
Materialize or Sort: the number of input rows and the number of output
rows have to be the same.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] plan_rows confusion with parallel queries

From
Robert Haas
Date:
On Wed, Nov 2, 2016 at 10:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> I got confused by that a minute ago, so no you're not alone.  The problem
>>>> is even worse in join cases.  For example:
>>>>  Gather  (cost=34332.00..53265.35 rows=100 width=8)
>>>>    Workers Planned: 2
>>>>    ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>>>>          Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>>>>          ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>>>>                ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>>>>                ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>>>>          ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>>>>                ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)
>> Although - it is estimating 1M rows, but only "per worker" estimates are
>> shown, and because there are 2 workers planned it says 1M/2.4 which is
>> the 416k. I agree it's a bit unclear, but at least it's consistent with
>> how we treat loops (i.e. that the numbers are per loop).
>
> Well, it's not *that* consistent.  If we were estimating all the numbers
> underneath the Gather as being per-worker numbers, that would make some
> amount of sense.  But neither the other seqscan, nor the hash on it, nor
> the hashjoin's output count are scaled that way.  It's very hard to call
> the above display anything but flat-out broken.

While investigating why Rushabh Lathia's Gather Merge patch sometimes
fails to pick a Gather Merge plan even when it really ought to do so,
I ran smack into this problem.  I discovered that this is more than a
cosmetic issue.  The costing itself is actually badly broken.  In the
single-table case, when you have just ...

Gather
-> Parallel Seq Scan

...the Parallel Seq Scan node reflects a per-worker row estimate, and
the Gather node reflects a total row estimate.  But in the join case,
as shown above, the Gather thinks that the total number of rows which
it will produce is equal to the number that will be produced by one
single worker, which is crap, and the cost of doing the join in
parallel is based on the per-worker rather than the total number,
which is crappier.  The difference in cost between the Gather and the
underlying join in the above example is exactly 1010, namely 1000 for
parallel_setup_cost and 100 tuples at 0.1 per tuple, even though 100
is the number of tuples per-worker, not the total number.  That's
really not good.  I probably should have realized this when I looked
at this thread the first time, but I somehow got it into my head that
this was just a complaint about the imperfections of the display
(which is indeed imperfect) and failed to realize that the same report
was also pointing to an actual costing bug.  I apologize for that.
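
For the record, the arithmetic behind that "exactly 1010" (a sketch 
assuming the default parallel_setup_cost = 1000 and parallel_tuple_cost 
= 0.1 were in effect):

    #include <stdio.h>

    int
    main(void)
    {
        double join_total_cost = 52255.35;  /* Hash Join in the plan above */
        double rows = 100;                  /* per-worker count, not total */
        double parallel_setup_cost = 1000.0;
        double parallel_tuple_cost = 0.1;

        /* Gather cost = subpath cost + setup cost + per-tuple cost * rows */
        printf("gather cost: %.2f\n",
               join_total_cost + parallel_setup_cost
               + parallel_tuple_cost * rows);
        /* prints 53265.35, matching the Gather node in the plan */
        return 0;
    }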

The reason why this is happening is that final_cost_nestloop(),
final_cost_hashjoin(), and final_cost_mergejoin() don't care a whit
about whether the path they are generating is partial.  They apply the
row estimate for the joinrel itself to every such path generated for
the join, except for parameterized paths which are a special case.  I
think this generally has the effect of discouraging parallel joins,
because the inflated row count also inflates the join cost.  I think
the right thing to do is probably to scale the row count estimate for
the joinrel's partial paths by the leader_contribution value computed
in cost_seqscan.
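
If I'm reading that right, the adjustment amounts to something like the 
following (a sketch of the idea only, not the actual patch; the function 
name is mine):

    #include <stdio.h>

    /*
     * Idea: when costing a *partial* join path, scale the joinrel's row
     * estimate down by the same divisor cost_seqscan uses, i.e. the
     * number of workers plus the leader's fractional contribution.
     */
    static double
    partial_join_rows(double joinrel_rows, int parallel_workers)
    {
        double parallel_divisor = parallel_workers;
        double leader_contribution = 1.0 - (0.3 * parallel_workers);

        if (leader_contribution > 0)
            parallel_divisor += leader_contribution;

        return joinrel_rows / parallel_divisor;
    }

    int
    main(void)
    {
        /*
         * With the hash join example upthread (joinrel estimated at 96957
         * rows, 2 planned workers): the partial join path would be costed
         * for ~96957 / 2.4 = ~40399 rows, and the Gather on top would
         * report the full 96957.
         */
        printf("partial path rows: %.0f\n", partial_join_rows(96957, 2));
        return 0;
    }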

Despite my general hatred of back-patching things that cause plan
changes, I'm inclined to think the fix for this should be back-patched
to 9.6, because this is really a brown-paper-bag bug.  If the
consensus is otherwise I will of course defer to that consensus.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] plan_rows confusion with parallel queries

From
Robert Haas
Date:
On Wed, Jan 11, 2017 at 1:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, it's not *that* consistent.  If we were estimating all the numbers
>> underneath the Gather as being per-worker numbers, that would make some
>> amount of sense.  But neither the other seqscan, nor the hash on it, nor
>> the hashjoin's output count are scaled that way.  It's very hard to call
>> the above display anything but flat-out broken.
>
> While investigating why Rushabh Lathia's Gather Merge patch sometimes
> fails to pick a Gather Merge plan even when it really ought to do so,
> I ran smack into this problem.  I discovered that this is more than a
> cosmetic issue.  The costing itself is actually badly broken.
>
> The reason why this is happening is that final_cost_nestloop(),
> final_cost_hashjoin(), and final_cost_mergejoin() don't care a whit
> about whether the path they are generating is partial.  They apply the
> row estimate for the joinrel itself to every such path generated for
> the join, except for parameterized paths which are a special case.  I
> think this generally has the effect of discouraging parallel joins,
> because the inflated row count also inflates the join cost.  I think
> the right thing to do is probably to scale the row count estimate for
> the joinrel's partial paths by the leader_contribution value computed
> in cost_seqscan.
>
> Despite my general hatred of back-patching things that cause plan
> changes, I'm inclined to think the fix for this should be back-patched
> to 9.6, because this is really a brown-paper-bag bug.  If the
> consensus is otherwise I will of course defer to that consensus.

And here is a patch which seems to fix the problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] plan_rows confusion with parallel queries

From
Robert Haas
Date:
On Wed, Jan 11, 2017 at 4:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> While investigating why Rushabh Lathia's Gather Merge patch sometimes
>> fails to pick a Gather Merge plan even when it really ought to do so,
>> I ran smack into this problem.  I discovered that this is more than a
>> cosmetic issue.  The costing itself is actually badly broken.
>>
>> The reason why this is happening is that final_cost_nestloop(),
>> final_cost_hashjoin(), and final_cost_mergejoin() don't care a whit
>> about whether the path they are generating is partial.  They apply the
>> row estimate for the joinrel itself to every such path generated for
>> the join, except for parameterized paths which are a special case.  I
>> think this generally has the effect of discouraging parallel joins,
>> because the inflated row count also inflates the join cost.  I think
>> the right thing to do is probably to scale the row count estimate for
>> the joinrel's partial paths by the leader_contribution value computed
>> in cost_seqscan.
>>
>> Despite my general hatred of back-patching things that cause plan
>> changes, I'm inclined to think the fix for this should be back-patched
>> to 9.6, because this is really a brown-paper-bag bug.  If the
>> consensus is otherwise I will of course defer to that consensus.
>
> And here is a patch which seems to fix the problem.

Since nobody seems to have any comment here, I've committed and
back-patched this to 9.6.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company