Re: A reloption for partitioned tables - parallel_workers - Mailing list pgsql-hackers

From: Amit Langote
Subject: Re: A reloption for partitioned tables - parallel_workers
Msg-id: CA+HiwqEZ5pKh0gmkqchutCdgdi+3g2G_0bmnjvJK9VwJnT-bVA@mail.gmail.com
In response to: Re: A reloption for partitioned tables - parallel_workers (Laurenz Albe <laurenz.albe@cybertec.at>)
List: pgsql-hackers
Hi,

On Tue, Feb 16, 2021 at 1:06 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> On Mon, 2021-02-15 at 17:53 +0900, Amit Langote wrote:
> > On Mon, Feb 15, 2021 at 5:28 PM Seamus Abshere <seamus@abshere.net> wrote:
> > > It turns out parallel_workers may be a useful reloption for certain uses of partitioned tables,
> > >  at least if they're made up of fancy column store partitions (see
> > >  https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520ed55%40www.fastmail.com).
> > > Would somebody tell me what I'm doing wrong? I would love to submit a patch but I'm stuck:
> >
> > You may see by inspecting the callers of compute_parallel_worker()
> > that it never gets called on a partitioned table, only its leaf
> > partitions.  Maybe you could try calling compute_parallel_worker()
> > somewhere in add_paths_to_append_rel(), which has this code to figure
> > out parallel_workers to use for a parallel Append path for a given
> > partitioned table:
> >
> >         /* Find the highest number of workers requested for any subpath. */
> >         foreach(lc, partial_subpaths)
> >         {
> >             Path       *path = lfirst(lc);
> >
> >             parallel_workers = Max(parallel_workers, path->parallel_workers);
> >         }
> >         Assert(parallel_workers > 0);
> >
> >         /*
> >          * If the use of parallel append is permitted, always request at least
> >          * log2(# of children) workers.  We assume it can be useful to have
> >          * extra workers in this case because they will be spread out across
> >          * the children.  The precise formula is just a guess, but we don't
> >          * want to end up with a radically different answer for a table with N
> >          * partitions vs. an unpartitioned table with the same data, so the
> >          * use of some kind of log-scaling here seems to make some sense.
> >          */
> >         if (enable_parallel_append)
> >         {
> >             parallel_workers = Max(parallel_workers,
> >                                    fls(list_length(live_childrels)));
> >             parallel_workers = Min(parallel_workers,
> >                                    max_parallel_workers_per_gather);
> >         }
> >         Assert(parallel_workers > 0);
> >
> > Note that the 'rel' in this code refers to the partitioned table for
> > which an Append path is being considered, so compute_parallel_worker()
> > using that 'rel' would use the partitioned table's
> > rel_parallel_workers as you are trying to do.
>
> Note that there is a second chunk of code quite like that one a few
> lines down from there that would also have to be modified.
>
> I am +1 on allowing to override the degree of parallelism on a parallel
> append.  If "parallel_workers" on the partitioned table is an option for
> that, it might be a simple solution.  On the other hand, perhaps it would
> be less confusing to have a different storage parameter name rather than
> having "parallel_workers" do double duty.
>
> Also, since there is a design rule that storage parameters can only be used
> on partitions, we would have to change that - is that a problem for anybody?

I am not aware of a rule that suggests that parallel_workers is always
interpreted using storage-level considerations.  If that is indeed a
popular interpretation at this point, then yes, we should be open to
considering a new name for the parameter that this patch wants to add.

Maybe parallel_append_workers?  Perhaps not a bad fit for this patch,
but considering that we may want to expand support for cross-partition
parallelism to operations other than querying, maybe something else
would be better?

This reminds me of something I forgot to mention in my review of the
patch -- it should also update the documentation of parallel_workers
on the CREATE TABLE page to mention that it is interpreted a bit
differently for partitioned tables than for regular, storage-bearing
relations.  Workers specified for a partitioned table would be
distributed by the executor over its partitions, whereas for
storage-bearing relations the distribution of the specified workers
is controlled by the table AM using storage-level considerations.
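
For example, with a patch along these lines applied, the intended
usage would look roughly like this (a hypothetical sketch; the table
is made up, and the planner would still cap the final worker count at
max_parallel_workers_per_gather):

create table meas (a int) partition by hash (a);
create table meas0 partition of meas for values with (modulus 4, remainder 0);
create table meas1 partition of meas for values with (modulus 4, remainder 1);
create table meas2 partition of meas for values with (modulus 4, remainder 2);
create table meas3 partition of meas for values with (modulus 4, remainder 3);
-- ask for up to 8 workers for the parallel Append over meas, to be
-- spread across the partition scans
alter table meas set (parallel_workers = 8);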

> There is another related consideration that doesn't need to be addressed
> by this patch, but that is somewhat related: if the executor prunes some
> partitions, the degree of parallelism is unaffected, right?

That's correct.  The number of workers actually launched can be lower
than the number planned, but that has nothing to do with executor
pruning.

> So if the planner decides to use 24 workers for 25 partitions, and the
> executor discards all but one of these partition scans, we would end up
> with 24 workers scanning a single partition.

I do remember pondering this when testing my patches to improve the
performance of executing a generic plan to scan a partitioned table
where runtime pruning is possible.  Here is an example:

create table hp (a int) partition by hash (a);
select 'create table hp' || i || ' partition of hp for values with (modulus 100, remainder ' || i || ');' from generate_series(0, 99) i;
\gexec
insert into hp select generate_series(0, 1000000);
alter table hp set (parallel_workers = 16);
set plan_cache_mode to force_generic_plan;
set max_parallel_workers_per_gather to 16;
prepare q as select * from hp where a = $1;

explain (analyze, verbose) execute q (1);
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..14426.50 rows=5675 width=4) (actual time=2.370..25.002 rows=1 loops=1)
   Output: hp.a
   Workers Planned: 16
   Workers Launched: 7
   ->  Parallel Append  (cost=0.00..12859.00 rows=400 width=4) (actual time=0.006..0.384 rows=0 loops=8)
         Worker 0:  actual time=0.001..0.001 rows=0 loops=1
         Worker 1:  actual time=0.001..0.001 rows=0 loops=1
         Worker 2:  actual time=0.001..0.001 rows=0 loops=1
         Worker 3:  actual time=0.001..0.001 rows=0 loops=1
         Worker 4:  actual time=0.001..0.001 rows=0 loops=1
         Worker 5:  actual time=0.001..0.001 rows=0 loops=1
         Worker 6:  actual time=0.001..0.001 rows=0 loops=1
         Subplans Removed: 99
         ->  Parallel Seq Scan on public.hp40 hp_1  (cost=0.00..126.50 rows=33 width=4) (actual time=0.041..3.060 rows=1 loops=1)
               Output: hp_1.a
               Filter: (hp_1.a = $1)
               Rows Removed by Filter: 9813
 Planning Time: 5.543 ms
 Execution Time: 25.139 ms
(19 rows)

deallocate q;
set max_parallel_workers_per_gather to 0;
prepare q as select * from hp where a = $1;
explain (analyze, verbose) execute q (1);
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..18754.88 rows=5675 width=4) (actual time=0.029..2.474 rows=1 loops=1)
   Subplans Removed: 99
   ->  Seq Scan on public.hp40 hp_1  (cost=0.00..184.25 rows=56 width=4) (actual time=0.028..2.471 rows=1 loops=1)
         Output: hp_1.a
         Filter: (hp_1.a = $1)
         Rows Removed by Filter: 9813
 Planning Time: 2.231 ms
 Execution Time: 2.535 ms
(8 rows)

Comparing the Execution Times above, it's clear that Gather and
workers are pure overhead in this case.

That said, in cases where one expects runtime pruning to be useful,
the plan itself is very unlikely to be parallelized in the first
place, for example when the individual partition scans are Index
Scans:

deallocate q;
create index on hp (a);
alter table hp set (parallel_workers = 16);
analyze;
set max_parallel_workers_per_gather to 16;
prepare q as select * from hp where a = $1;
explain (analyze, verbose) execute q (1);
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..430.75 rows=100 width=4) (actual time=0.043..0.046 rows=1 loops=1)
   Subplans Removed: 99
   ->  Index Only Scan using hp40_a_idx on public.hp40 hp_1  (cost=0.29..4.30 rows=1 width=4) (actual time=0.042..0.044 rows=1 loops=1)
         Output: hp_1.a
         Index Cond: (hp_1.a = $1)
         Heap Fetches: 0
 Planning Time: 13.769 ms
 Execution Time: 0.115 ms
(8 rows)

> I am not sure how that could be improved.

The planner currently ignores the possibility of runtime pruning when
assigning a cost to the Append path, so fixing that would be a good
start.  There are efforts underway in that direction, such as [1].

-- 
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/32/2829/


