Re: Re: fix cost subqueryscan wrong parallel cost - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Re: fix cost subqueryscan wrong parallel cost
Date
Msg-id CA+TgmoaxZjK2+X0EOzoDDjhQodv8n2JanmEo5UKA3PQcd6RjzA@mail.gmail.com
Whole thread Raw
In response to Re: Re: fix cost subqueryscan wrong parallel cost  ("David G. Johnston" <david.g.johnston@gmail.com>)
Responses Re: Re: fix cost subqueryscan wrong parallel cost  ("David G. Johnston" <david.g.johnston@gmail.com>)
Re: Re: fix cost subqueryscan wrong parallel cost  (Richard Guo <guofenglinux@gmail.com>)
List pgsql-hackers
On Fri, Apr 22, 2022 at 11:55 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> On Wed, Apr 20, 2022 at 11:38 PM bucoo@sohu.com <bucoo@sohu.com> wrote:
>>
>> > > for now fuction cost_subqueryscan always using *total* rows even parallel
>> > > path. like this:
>> > >
>> > > Gather (rows=30000)
>> > >   Workers Planned: 2
>> > >   ->  Subquery Scan  (rows=30000) -- *total* rows, should be equal subpath
>> > >         ->  Parallel Seq Scan  (rows=10000)
>> >
>> > OK, that's bad.
>
> I don't understand how that plan shape is possible.  Gather requires a parallel aware subpath, so said subpath can be
executedmultiple times in parallel, and subquery isn't.  If there is parallelism happening within a subquery the
resultsare consolidated using Append or Gather first - and the output rows of that path entry (all subpaths of Subquery
havethe same ->row value per set_subquery_size_estimates), become the input tuples for Subquery, to which it then
appliesits selectivity multiplier and stores the final result in baserel->rows; which the costing code then examines
whencosting the RTE_SUBQUERY path entry. 

Gather doesn't require a parallel aware subpath, just a parallel-safe
subpath. In a case like this, the parallel seq scan will divide the
rows from the underlying relation across the three processes executing
it. Each process will pass the rows it receives through its own copy
of the subquery scan. Then, the Gather node will collect all the rows
from all the workers to produce the final result.

It's an extremely important feature of parallel query that the
parallel-aware node doesn't have to be immediately beneath the Gather.
You need to have a parallel-aware node in there someplace, but it
could be separated from the gather by any number of levels e.g.

Gather
-> Nested Loop
  -> Nested Loop
    -> Nested Loop
       -> Parallel Seq Scan
       -> Index Scan
     -> Index Scan
   -> Index Scan

You can stick as many parameterized index scans in there as you like
and you still only need one parallel-aware node at the bottom. Once
the parallel seq scan divides up the rows across workers, each worker
can perform the index lookups for the rows that it receives without
any coordination with other workers. It neither knows nor cares that
this is happening in the midst of an operation that is parallel
overall; all the nested loops and index scans work just as they would
in a non-parallel plan. The only things that needed to know about
parallelism are the operation that is dividing up the rows among
workers (here, the parallel seq scan) and the operation that is
gathering up all the results produced by individual workers (here, the
gather node).

--
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Przemysław Sztoch
Date:
Subject: [PATCH] Completed unaccent dictionary with many missing characters
Next
From: vignesh C
Date:
Subject: Re: Handle infinite recursion in logical replication setup