Re: A reloption for partitioned tables - parallel_workers - Mailing list pgsql-hackers
From: Amit Langote
Subject: Re: A reloption for partitioned tables - parallel_workers
Msg-id: CA+HiwqEZ5pKh0gmkqchutCdgdi+3g2G_0bmnjvJK9VwJnT-bVA@mail.gmail.com
In response to: Re: A reloption for partitioned tables - parallel_workers (Laurenz Albe <laurenz.albe@cybertec.at>)
Responses: Re: A reloption for partitioned tables - parallel_workers
List: pgsql-hackers
Hi,

On Tue, Feb 16, 2021 at 1:06 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> On Mon, 2021-02-15 at 17:53 +0900, Amit Langote wrote:
> > On Mon, Feb 15, 2021 at 5:28 PM Seamus Abshere <seamus@abshere.net> wrote:
> > > It turns out parallel_workers may be a useful reloption for certain uses of partitioned tables,
> > > at least if they're made up of fancy column store partitions (see
> > > https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520ed55%40www.fastmail.com).
> > > Would somebody tell me what I'm doing wrong? I would love to submit a patch but I'm stuck:
> >
> > You may see by inspecting the callers of compute_parallel_worker()
> > that it never gets called on a partitioned table, only its leaf
> > partitions.  Maybe you could try calling compute_parallel_worker()
> > somewhere in add_paths_to_append_rel(), which has this code to figure
> > out parallel_workers to use for a parallel Append path for a given
> > partitioned table:
> >
> >     /* Find the highest number of workers requested for any subpath. */
> >     foreach(lc, partial_subpaths)
> >     {
> >         Path       *path = lfirst(lc);
> >
> >         parallel_workers = Max(parallel_workers, path->parallel_workers);
> >     }
> >     Assert(parallel_workers > 0);
> >
> >     /*
> >      * If the use of parallel append is permitted, always request at least
> >      * log2(# of children) workers.  We assume it can be useful to have
> >      * extra workers in this case because they will be spread out across
> >      * the children.  The precise formula is just a guess, but we don't
> >      * want to end up with a radically different answer for a table with N
> >      * partitions vs. an unpartitioned table with the same data, so the
> >      * use of some kind of log-scaling here seems to make some sense.
> >      */
> >     if (enable_parallel_append)
> >     {
> >         parallel_workers = Max(parallel_workers,
> >                                fls(list_length(live_childrels)));
> >         parallel_workers = Min(parallel_workers,
> >                                max_parallel_workers_per_gather);
> >     }
> >     Assert(parallel_workers > 0);
> >
> > Note that the 'rel' in this code refers to the partitioned table for
> > which an Append path is being considered, so compute_parallel_worker()
> > using that 'rel' would use the partitioned table's
> > rel_parallel_workers as you are trying to do.
>
> Note that there is a second chunk of code quite like that one a few
> lines down from there that would also have to be modified.
>
> I am +1 on allowing to override the degree of parallelism on a parallel
> append.  If "parallel_workers" on the partitioned table is an option for
> that, it might be a simple solution.  On the other hand, perhaps it would
> be less confusing to have a different storage parameter name rather than
> having "parallel_workers" do double duty.
>
> Also, since there is a design rule that storage parameters can only be used
> on partitions, we would have to change that - is that a problem for anybody?

I am not aware of a rule that suggests that parallel_workers is always
interpreted using storage-level considerations.  If that is indeed a
popular interpretation at this point, then yes, we should be open to
considering a new name for the parameter that this patch wants to add.

Maybe parallel_append_workers?  Perhaps not a bad idea in this patch's
case, but considering that we may want to expand the support of
cross-partition parallelism to operations other than querying, maybe
something else?

This reminds me of something I forgot to mention in my review of the
patch -- it should also update the documentation of parallel_workers on
the CREATE TABLE page to mention that it will be interpreted a bit
differently for partitioned tables than for regular storage-bearing
relations.
Workers specified for partitioned tables would be distributed by the
executor over its partitions, unlike with storage-bearing relations,
where the distribution of specified workers is controlled by the AM
using storage-level considerations.

> There is another related consideration that doesn't need to be addressed
> by this patch, but that is somewhat related: if the executor prunes some
> partitions, the degree of parallelism is unaffected, right?

That's correct.  The number of launched workers can turn out to be less
than the number planned, but that has nothing to do with executor
pruning.

> So if the planner decides to use 24 workers for 25 partitions, and the
> executor discards all but one of these partition scans, we would end up
> with 24 workers scanning a single partition.

I do remember pondering this when testing my patches to improve the
performance of executing a generic plan to scan a partitioned table
where runtime pruning is possible.  Here is an example:

create table hp (a int) partition by hash (a);
select 'create table hp' || i || ' partition of hp for values with (modulus 100, remainder ' || i || ');' from generate_series(0, 99) i;
\gexec
insert into hp select generate_series(0, 1000000);
alter table hp set (parallel_workers = 16);
set plan_cache_mode to force_generic_plan;
set max_parallel_workers_per_gather to 16;
prepare q as select * from hp where a = $1;
explain (analyze, verbose) execute q (1);

                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..14426.50 rows=5675 width=4) (actual time=2.370..25.002 rows=1 loops=1)
   Output: hp.a
   Workers Planned: 16
   Workers Launched: 7
   ->  Parallel Append  (cost=0.00..12859.00 rows=400 width=4) (actual time=0.006..0.384 rows=0 loops=8)
         Worker 0:  actual time=0.001..0.001 rows=0 loops=1
         Worker 1:  actual time=0.001..0.001 rows=0 loops=1
         Worker 2:  actual time=0.001..0.001 rows=0 loops=1
         Worker 3:  actual time=0.001..0.001 rows=0 loops=1
         Worker 4:  actual time=0.001..0.001 rows=0 loops=1
         Worker 5:  actual time=0.001..0.001 rows=0 loops=1
         Worker 6:  actual time=0.001..0.001 rows=0 loops=1
         Subplans Removed: 99
         ->  Parallel Seq Scan on public.hp40 hp_1  (cost=0.00..126.50 rows=33 width=4) (actual time=0.041..3.060 rows=1 loops=1)
               Output: hp_1.a
               Filter: (hp_1.a = $1)
               Rows Removed by Filter: 9813
 Planning Time: 5.543 ms
 Execution Time: 25.139 ms
(19 rows)

deallocate q;
set max_parallel_workers_per_gather to 0;
prepare q as select * from hp where a = $1;
explain (analyze, verbose) execute q (1);

                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..18754.88 rows=5675 width=4) (actual time=0.029..2.474 rows=1 loops=1)
   Subplans Removed: 99
   ->  Seq Scan on public.hp40 hp_1  (cost=0.00..184.25 rows=56 width=4) (actual time=0.028..2.471 rows=1 loops=1)
         Output: hp_1.a
         Filter: (hp_1.a = $1)
         Rows Removed by Filter: 9813
 Planning Time: 2.231 ms
 Execution Time: 2.535 ms
(8 rows)

Comparing the Execution Times above, it's clear that the Gather and its
workers are pure overhead in this case.  Then again, in cases where one
expects runtime pruning to be useful, the plan itself is very unlikely
to be parallelized -- for example, when the individual partition scans
are Index Scans:
deallocate q;
create index on hp (a);
alter table hp set (parallel_workers = 16);
analyze;
set max_parallel_workers_per_gather to 16;
prepare q as select * from hp where a = $1;
explain (analyze, verbose) execute q (1);

                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..430.75 rows=100 width=4) (actual time=0.043..0.046 rows=1 loops=1)
   Subplans Removed: 99
   ->  Index Only Scan using hp40_a_idx on public.hp40 hp_1  (cost=0.29..4.30 rows=1 width=4) (actual time=0.042..0.044 rows=1 loops=1)
         Output: hp_1.a
         Index Cond: (hp_1.a = $1)
         Heap Fetches: 0
 Planning Time: 13.769 ms
 Execution Time: 0.115 ms
(8 rows)

> I am not sure how that could be improved.

The planner currently ignores the possibility of runtime pruning when
assigning costs to the Append path, so fixing that would be a good
start.  There are efforts underway for that, such as [1].

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/32/2829/