Thread: Determine parallel-safety of partition relations for Inserts

Determine parallel-safety of partition relations for Inserts

From

Amit Kapila

Date:

15 January 2021, 13:20:46

While reviewing parallel insert [1] (Insert into .... Select) and
parallel copy patches [2], it came to my notice that both the patches
traverse the entire partition hierarchy to determine parallel-safety
of partitioned relations. This is required because before considering
the Insert or Copy can be considered for parallelism, we need to
determine whether it is safe to do so. We need to check for each
partition because any of the partitions can have some parallel-unsafe
index expression, constraint, etc. We do a similar thing for Selects
in standard_planner.

The plain Select case for partitioned tables was simpler because we
anyway loop through all the partitions in set_append_rel_size() and we
determine parallel-safety of each partition at that time but the same
is not true for Inserts.

For Inserts, currently, we only open the partition table when we are
about to insert into that partition. During ExecInsert, we find out
the partition matching the partition-key value and then lock if it is
not already locked. In this patch, we need to open each partition at
the planning time to determine its parallel-safety.

This will surely increase planning time but the execution is reduced
to an extent due to parallelism that it won't matter for either of the
cases if we see just total time. For example, see the latest results
for parallel inserts posted by Haiying Tang [3]. There might be an
impact when Selects can't be parallelized due to the small size of the
Select-table but we still have to traverse all the partitions to
determine parallel-safety but not sure how much it is compared to
overall time. I guess we need to find the same but apart from that can
anyone think of a better way to determine parallel-safety of
partitioned relation for Inserts?

Thoughts?

Note: I have kept a few people in Cc who are either directly involved
in this work or work regularly in the partitioning related work just
in the hope that might help in moving the discussion forward.

[1] - https://www.postgresql.org/message-id/CAJcOf-dF9ohqub_D805k57Y_AuDLeAQfvtaax9SpwjTSEVdiXg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm32c7kWpZm9UkS5P%2BShsfRhyMTWKvFHyn9O53WZWvO2iA%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/b54f2e306780449093c311118cd8a04e%40G08CNEXMBPEKD05.g08.fujitsu.local

--
With Regards,
Amit Kapila.

Re: Determine parallel-safety of partition relations for Inserts

From

Ashutosh Bapat

Date:

15 January 2021, 15:23:03

On Fri, Jan 15, 2021 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> While reviewing parallel insert [1] (Insert into .... Select) and
> parallel copy patches [2], it came to my notice that both the patches
> traverse the entire partition hierarchy to determine parallel-safety
> of partitioned relations. This is required because before considering
> the Insert or Copy can be considered for parallelism, we need to
> determine whether it is safe to do so. We need to check for each
> partition because any of the partitions can have some parallel-unsafe
> index expression, constraint, etc. We do a similar thing for Selects
> in standard_planner.
>
> The plain Select case for partitioned tables was simpler because we
> anyway loop through all the partitions in set_append_rel_size() and we
> determine parallel-safety of each partition at that time but the same
> is not true for Inserts.
>
> For Inserts, currently, we only open the partition table when we are
> about to insert into that partition. During ExecInsert, we find out
> the partition matching the partition-key value and then lock if it is
> not already locked. In this patch, we need to open each partition at
> the planning time to determine its parallel-safety.

We don't want to open the partitions where no rows will be inserted.

>
> This will surely increase planning time but the execution is reduced
> to an extent due to parallelism that it won't matter for either of the
> cases if we see just total time. For example, see the latest results
> for parallel inserts posted by Haiying Tang [3]. There might be an
> impact when Selects can't be parallelized due to the small size of the
> Select-table but we still have to traverse all the partitions to
> determine parallel-safety but not sure how much it is compared to
> overall time. I guess we need to find the same but apart from that can
> anyone think of a better way to determine parallel-safety of
> partitioned relation for Inserts?

In case of SELECT we open only those partitions which surive pruning.
So those are the ones which will definitely required to be scanned. We
perform parallelism checks only on those partitions. The actual check
isn't much costly.

>
> Thoughts?
>
> Note: I have kept a few people in Cc who are either directly involved
> in this work or work regularly in the partitioning related work just
> in the hope that might help in moving the discussion forward.

Since you brought up comparison between SELECT and INSERT, "pruning"
partitions based on the values being INSERTed might help. It should be
doable in case of INSERT ... SELECT where we need to prune partitions
based on the clauses of SELECT. Doable with some little effort in case
of VALUEs and COPY.

Second possibility is to open partitions only when the estimated
number of rows to be inserted goes beyond a certain value.

Third idea is to use something similar to parallel append where
individual partitions are scanned sequentially but multiple partitions
are scanned in parallel. When a row is inserted into a non-yet-opened
partition, allocate one/more backends to insert into partitions which
do not allow parallelism, otherwise continue to use a common pool of
parallel workers for insertion. This means the same thread performing
select may not perform insert. So some complications will be involved.

-- 
Best Wishes,
Ashutosh Bapat

Re: Determine parallel-safety of partition relations for Inserts

From

Amit Kapila

Date:

15 January 2021, 16:02:10

On Fri, Jan 15, 2021 at 5:53 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Fri, Jan 15, 2021 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > While reviewing parallel insert [1] (Insert into .... Select) and
> > parallel copy patches [2], it came to my notice that both the patches
> > traverse the entire partition hierarchy to determine parallel-safety
> > of partitioned relations. This is required because before considering
> > the Insert or Copy can be considered for parallelism, we need to
> > determine whether it is safe to do so. We need to check for each
> > partition because any of the partitions can have some parallel-unsafe
> > index expression, constraint, etc. We do a similar thing for Selects
> > in standard_planner.
> >
> > The plain Select case for partitioned tables was simpler because we
> > anyway loop through all the partitions in set_append_rel_size() and we
> > determine parallel-safety of each partition at that time but the same
> > is not true for Inserts.
> >
> > For Inserts, currently, we only open the partition table when we are
> > about to insert into that partition. During ExecInsert, we find out
> > the partition matching the partition-key value and then lock if it is
> > not already locked. In this patch, we need to open each partition at
> > the planning time to determine its parallel-safety.
>
> We don't want to open the partitions where no rows will be inserted.
>
> >
> > This will surely increase planning time but the execution is reduced
> > to an extent due to parallelism that it won't matter for either of the
> > cases if we see just total time. For example, see the latest results
> > for parallel inserts posted by Haiying Tang [3]. There might be an
> > impact when Selects can't be parallelized due to the small size of the
> > Select-table but we still have to traverse all the partitions to
> > determine parallel-safety but not sure how much it is compared to
> > overall time. I guess we need to find the same but apart from that can
> > anyone think of a better way to determine parallel-safety of
> > partitioned relation for Inserts?
>
> In case of SELECT we open only those partitions which surive pruning.
> So those are the ones which will definitely required to be scanned. We
> perform parallelism checks only on those partitions. The actual check
> isn't much costly.
>
> >
> > Thoughts?
> >
> > Note: I have kept a few people in Cc who are either directly involved
> > in this work or work regularly in the partitioning related work just
> > in the hope that might help in moving the discussion forward.
>
> Since you brought up comparison between SELECT and INSERT, "pruning"
> partitions based on the values being INSERTed might help. It should be
> doable in case of INSERT ... SELECT where we need to prune partitions
> based on the clauses of SELECT. Doable with some little effort in case
> of VALUEs and COPY.
>

I don't think we should try pruning in this patch as that is a
separate topic and even after pruning the same problem can happen when
we are not able to prune partitions in the table where we want to
Insert.

> Second possibility is to open partitions only when the estimated
> number of rows to be inserted goes beyond a certain value.
>

Yeah, this option has merits in the sense that we will pay the cost to
traverse partitions only when the parallel plan is possible. If you
see the 0001 patch in email [1], it tries to determine parallel-safety
(which in turn will traverse all partition tables) in standard_planner
where we determine the parallel-safety for the Query tree. Now, if we
have to postpone it for partitioned tables then we need to determine
it at the places where we create_modifytable_path if the
current_rel->pathlist has some parallel_safe paths. And which will
also mean that we need to postpone generating gather node as well till
that time because Insert can be parallel-unsafe.

This will have one disadvantage over the current approach implemented
by the patch which is that we might end up spending a lot of time in
the optimizer to create partial paths and later (while determining
parallel-safety of Insert) find that none of them is required.

If we decide to postpone the parallel-safety checking for Inserts then
why not we check parallel-safety for all sorts of Inserts at that
point. That can at least simplify the patch.

> Third idea is to use something similar to parallel append where
> individual partitions are scanned sequentially but multiple partitions
> are scanned in parallel. When a row is inserted into a non-yet-opened
> partition, allocate one/more backends to insert into partitions which
> do not allow parallelism, otherwise continue to use a common pool of
> parallel workers for insertion. This means the same thread performing
> select may not perform insert. So some complications will be involved.
>

We want to do this for Inserts where only Select can be parallel and
Inserts will always be done by the leader backend. This is actually
the case we first want to implement.
--
With Regards,
Amit Kapila.

Re: Determine parallel-safety of partition relations for Inserts

From

Amit Langote

Date:

15 January 2021, 16:15:40

On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> We want to do this for Inserts where only Select can be parallel and
> Inserts will always be done by the leader backend. This is actually
> the case we first want to implement.

Sorry, I haven't looked at the linked threads and the latest patches
there closely enough yet, so I may be misreading this, but if the
inserts will always be done by the leader backend as you say, then why
does the planner need to be checking the parallel safety of the
*target* table's expressions?

-- 
Amit Langote
EDB: http://www.enterprisedb.com

RE: Determine parallel-safety of partition relations for Inserts

From

"tsunakawa.takay@fujitsu.com"

Date:

15 January 2021, 17:04:53

From: Amit Kapila <amit.kapila16@gmail.com>
> This will surely increase planning time but the execution is reduced
> to an extent due to parallelism that it won't matter for either of the
> cases if we see just total time. For example, see the latest results
> for parallel inserts posted by Haiying Tang [3]. There might be an
> impact when Selects can't be parallelized due to the small size of the
> Select-table but we still have to traverse all the partitions to
> determine parallel-safety but not sure how much it is compared to
> overall time. I guess we need to find the same but apart from that can
> anyone think of a better way to determine parallel-safety of
> partitioned relation for Inserts?

Three solutions(?) quickly come to my mind:


(1) Have the user specify whether they want to parallelize DML
Oracle [1] and SQL Server [2] take this approach.  Oracle disables parallel DML execution by default.  The reason is
describedas "This mode is required because parallel DML and serial DML have different locking, transaction, and disk
spacerequirements and parallel DML is disabled for a session by default."  To enable parallel DML in a session or in a
specificstatement, you need to run either of the following:
 

  ALTER SESSION ENABLE PARALLEL DML;
  INSERT /*+ ENABLE_PARALLEL_DML */ …

Besides, the user has to specify a parallel hint in a DML statement, or specify the parallel attribute in CREATE or
ALTERTABLE.
 

SQL Server requires a TABLOCK hint to be specified in the INSERT SELECT statement like this:

  INSERT INTO Sales.SalesHistory WITH (TABLOCK)  (target columns...) SELECT ...;


(2) Postpone the parallel safety check after the planner finds a worthwhile parallel query plan
I'm not sure if the current planner code allows this easily...


(3) Record the parallel safety in system catalog
Add a column like relparallel in pg_class that indicates the parallel safety of the relation.  planner just checks the
valueinstead of doing heavy work for every SQL statement.  That column's value is modified whenever a relation
alterationis made that affects the parallel safety, such as adding a domain column and CHECK constraint.  In case of a
partitionedrelation, the parallel safety attributes of all its descendant relations are merged.  For example, if a
partitionbecomes parallel-unsafe, the ascendant partitioned tables also become parallel-unsafe.
 

But... developing such code would be burdonsome and bug-prone?


I'm inclined to propose (1).  Parallel DML would be something that a limited people run in limited circumstances (data
loadingin data warehouse and batch processing in OLTP systems by the DBA or data administrator), so I think it's
legitimateto require explicit specification of parallelism.
 

As an aside, (1) and (2) has a potential problem with memory consumption.  Opening relations bloat CacheMemoryContext
withrelcaches and catcaches, and closing relations does not free the (all) memory.  But I don't think it could really
becomea problem in practice, because parallel DML would be run in limited number of concurrent sessions.
 


[1]
Types of Parallelism

https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-D8290A02-BE5F-436A-B814-D6FD71CEE81F

[2]
INSERT (Transact-SQL)
https://docs.microsoft.com/en-us/sql/t-sql/statements/insert-transact-sql?view=sql-server-ver15#best-practices


Regards
Takayuki Tsunakawa

RE: Determine parallel-safety of partition relations for Inserts

From

"tsunakawa.takay@fujitsu.com"

Date:

15 January 2021, 17:26:24

From: Amit Langote <amitlangote09@gmail.com>
> Sorry, I haven't looked at the linked threads and the latest patches
> there closely enough yet, so I may be misreading this, but if the
> inserts will always be done by the leader backend as you say, then why
> does the planner need to be checking the parallel safety of the
> *target* table's expressions?

Yeah, I also wanted to confirm this next - that is, whether the current patch allows the SELECT operation to execute in
parallelbut the INSERT operation serially.  Oracle allows it; it even allows the user to specify a hint after INSERT
andSELECT separately.  Even if INSERT in INSERT SELECT can't be run in parallel, it is useful for producing summary
data,such as aggregating large amounts of IoT sensor data in parallel and inserting the small amount of summary data
serially.


Regards
Takayuki Tsunakawa

Re: Determine parallel-safety of partition relations for Inserts

From

Amit Kapila

Date:

16 January 2021, 08:04:40

On Fri, Jan 15, 2021 at 6:45 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > We want to do this for Inserts where only Select can be parallel and
> > Inserts will always be done by the leader backend. This is actually
> > the case we first want to implement.
>
> Sorry, I haven't looked at the linked threads and the latest patches
> there closely enough yet, so I may be misreading this, but if the
> inserts will always be done by the leader backend as you say, then why
> does the planner need to be checking the parallel safety of the
> *target* table's expressions?
>

The reason is that once we enter parallel-mode we can't allow
parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We
enter the parallel-mode at the beginning of the statement execution,
see ExecutePlan(). So, the Insert will be performed in parallel-mode
even though it happens in the leader backend. It is not possible that
we finish getting all the tuples from the gather node first and then
start inserting. Even, if we somehow find something to make this work
anyway the checks being discussed will be required to make inserts
parallel (where inserts will be performed by workers) which is
actually the next patch in the thread I mentioned in the previous
email.

Does this answer your question?

-- 
With Regards,
Amit Kapila.

Re: Determine parallel-safety of partition relations for Inserts

From

Amit Kapila

Date:

16 January 2021, 08:49:40

On Fri, Jan 15, 2021 at 7:35 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Amit Kapila <amit.kapila16@gmail.com>
> > This will surely increase planning time but the execution is reduced
> > to an extent due to parallelism that it won't matter for either of the
> > cases if we see just total time. For example, see the latest results
> > for parallel inserts posted by Haiying Tang [3]. There might be an
> > impact when Selects can't be parallelized due to the small size of the
> > Select-table but we still have to traverse all the partitions to
> > determine parallel-safety but not sure how much it is compared to
> > overall time. I guess we need to find the same but apart from that can
> > anyone think of a better way to determine parallel-safety of
> > partitioned relation for Inserts?
>
> Three solutions(?) quickly come to my mind:
>
>
> (1) Have the user specify whether they want to parallelize DML
> Oracle [1] and SQL Server [2] take this approach.  Oracle disables parallel DML execution by default.  The reason is
describedas "This mode is required because parallel DML and serial DML have different locking, transaction, and disk
spacerequirements and parallel DML is disabled for a session by default."  To enable parallel DML in a session or in a
specificstatement, you need to run either of the following: 
>
>   ALTER SESSION ENABLE PARALLEL DML;
>   INSERT /*+ ENABLE_PARALLEL_DML */ …
>
> Besides, the user has to specify a parallel hint in a DML statement, or specify the parallel attribute in CREATE or
ALTERTABLE. 
>
> SQL Server requires a TABLOCK hint to be specified in the INSERT SELECT statement like this:
>
>   INSERT INTO Sales.SalesHistory WITH (TABLOCK)  (target columns...) SELECT ...;
>

I think it would be good if the parallelism works by default when
required but I guess if we want to use something on these lines then
we can always check if the parallel_workers option is non-zero for a
relation (with RelationGetParallelWorkers). So users can always say
Alter Table <tbl_name> Set (parallel_workers = 0) if they don't want
to enable write parallelism for tbl and if someone is bothered that
this might impact Selects as well because the same option is used to
compute the number of workers for it then we can invent a second
option parallel_dml_workers or something like that.

>
> (2) Postpone the parallel safety check after the planner finds a worthwhile parallel query plan
> I'm not sure if the current planner code allows this easily...
>

I think it is possible but it has a bit of disadvantage as well as
mentioned in response to Ashutosh's email [1].

>
> (3) Record the parallel safety in system catalog
> Add a column like relparallel in pg_class that indicates the parallel safety of the relation.  planner just checks
thevalue instead of doing heavy work for every SQL statement.  That column's value is modified whenever a relation
alterationis made that affects the parallel safety, such as adding a domain column and CHECK constraint.  In case of a
partitionedrelation, the parallel safety attributes of all its descendant relations are merged.  For example, if a
partitionbecomes parallel-unsafe, the ascendant partitioned tables also become parallel-unsafe. 
>
> But... developing such code would be burdonsome and bug-prone?
>
>
> I'm inclined to propose (1).  Parallel DML would be something that a limited people run in limited circumstances
(dataloading in data warehouse and batch processing in OLTP systems by the DBA or data administrator), so I think it's
legitimateto require explicit specification of parallelism. 
>
> As an aside, (1) and (2) has a potential problem with memory consumption.
>

I can see the memory consumption argument for (2) because we might end
up generating parallel paths (partial paths) for reading the table but
don't see how it applies to (1)?

[1] - https://www.postgresql.org/message-id/CAA4eK1J80Rzn4M-A5sfkmJ8NjgTxbaC8UWVaNHK6%2B2BCYYv2Nw%40mail.gmail.com

--
With Regards,
Amit Kapila.

Re: Determine parallel-safety of partition relations for Inserts

From

Amit Langote

Date:

17 January 2021, 14:15:29

On Sat, Jan 16, 2021 at 2:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jan 15, 2021 at 6:45 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > We want to do this for Inserts where only Select can be parallel and
> > > Inserts will always be done by the leader backend. This is actually
> > > the case we first want to implement.
> >
> > Sorry, I haven't looked at the linked threads and the latest patches
> > there closely enough yet, so I may be misreading this, but if the
> > inserts will always be done by the leader backend as you say, then why
> > does the planner need to be checking the parallel safety of the
> > *target* table's expressions?
> >
>
> The reason is that once we enter parallel-mode we can't allow
> parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We
> enter the parallel-mode at the beginning of the statement execution,
> see ExecutePlan(). So, the Insert will be performed in parallel-mode
> even though it happens in the leader backend. It is not possible that
> we finish getting all the tuples from the gather node first and then
> start inserting. Even, if we somehow find something to make this work
> anyway the checks being discussed will be required to make inserts
> parallel (where inserts will be performed by workers) which is
> actually the next patch in the thread I mentioned in the previous
> email.
>
> Does this answer your question?

Yes, thanks for the explanation.  I kind of figured that doing the
insert part itself in parallel using workers would be a part of the
end goal of this work, although that didn't come across immediately.

It's a bit unfortunate that the parallel safety check of the
individual partitions cannot be deferred until it's known that a given
partition will be affected by the command at all.  Will we need
fundamental changes to how parallel query works to make that possible?
 If so, have such options been considered in these projects?  If such
changes are not possible in the short term, like for v14, we should at
least try to make sure that the eager checking of all partitions is
only performed if using parallelism is possible at all.

I will try to take a look at the patches themselves to see if there's
something I know that will help.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

RE: Determine parallel-safety of partition relations for Inserts

From

"tsunakawa.takay@fujitsu.com"

Date:

18 January 2021, 03:38:38

From: Amit Kapila <amit.kapila16@gmail.com>
> I think it would be good if the parallelism works by default when
> required but I guess if we want to use something on these lines then
> we can always check if the parallel_workers option is non-zero for a
> relation (with RelationGetParallelWorkers). So users can always say
> Alter Table <tbl_name> Set (parallel_workers = 0) if they don't want
> to enable write parallelism for tbl and if someone is bothered that
> this might impact Selects as well because the same option is used to
> compute the number of workers for it then we can invent a second
> option parallel_dml_workers or something like that.

Yes, if we have to require some specification to enable parallel DML, I agree that parallel query and parall DML can be
separatelyenabled.  With that said, I'm not sure if the user, and PG developers, want to allow specifying degree of
parallelismfor DML.
 


> > As an aside, (1) and (2) has a potential problem with memory consumption.
> >
> 
> I can see the memory consumption argument for (2) because we might end
> up generating parallel paths (partial paths) for reading the table but
> don't see how it applies to (1)?

I assumed that we would still open all partitions for parallel safety check in (1) and (2).  In (1), parallel safety
checkis done only when parallel DML is explicitly enabled by the user.  Just opening partitions keeps
CacheMemoryContextbloated even after they are closed.
 


Regards
Takayuki Tsunakawa

Re: Determine parallel-safety of partition relations for Inserts

From

Amit Kapila

Date:

18 January 2021, 06:04:38

On Sun, Jan 17, 2021 at 4:45 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Sat, Jan 16, 2021 at 2:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Jan 15, 2021 at 6:45 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > We want to do this for Inserts where only Select can be parallel and
> > > > Inserts will always be done by the leader backend. This is actually
> > > > the case we first want to implement.
> > >
> > > Sorry, I haven't looked at the linked threads and the latest patches
> > > there closely enough yet, so I may be misreading this, but if the
> > > inserts will always be done by the leader backend as you say, then why
> > > does the planner need to be checking the parallel safety of the
> > > *target* table's expressions?
> > >
> >
> > The reason is that once we enter parallel-mode we can't allow
> > parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We
> > enter the parallel-mode at the beginning of the statement execution,
> > see ExecutePlan(). So, the Insert will be performed in parallel-mode
> > even though it happens in the leader backend. It is not possible that
> > we finish getting all the tuples from the gather node first and then
> > start inserting. Even, if we somehow find something to make this work
> > anyway the checks being discussed will be required to make inserts
> > parallel (where inserts will be performed by workers) which is
> > actually the next patch in the thread I mentioned in the previous
> > email.
> >
> > Does this answer your question?
>
> Yes, thanks for the explanation.  I kind of figured that doing the
> insert part itself in parallel using workers would be a part of the
> end goal of this work, although that didn't come across immediately.
>
> It's a bit unfortunate that the parallel safety check of the
> individual partitions cannot be deferred until it's known that a given
> partition will be affected by the command at all.  Will we need
> fundamental changes to how parallel query works to make that possible?
>  If so, have such options been considered in these projects?
>

I think it is quite fundamental to how parallel query works and we
might not be able to change it due to various reasons like (a) it will
end up generating a lot of paths in optimizer when it is not safe to
do so and in the end, we won't use it. (b) If after inserting into a
few partitions we came to know that the next partition we are going to
insert has some parallel-unsafe constraints then we need to give up
the execution and restart the statement by again trying to first plan
it by having non-parallel paths. Now, we can optimize this by
retaining both parallel and non-parallel plans such that if we fail to
execute parallel-plan we can use a non-parallel plan to execute the
statement but still that doesn't seem like an advisable approach.

The extra time spent in optimizer will pay-off well by the parallel
execution. As pointer earlier, you can see one of the results shared
on the other thread [1]. The cases where it might not get the benefit
(say when the underlying plan is non-parallel) can have some impact
but still, we have not tested that in detail. The ideas we have
discussed so far to address that are (a) postpone parallel-safety
checks for partitions till there are some underneath partial paths
(from which parallel paths can be generated) but that has some
down-side in that we will end up generating partial paths when that is
really not required, (b) have a rel option like parallel_dml_workers
or use existing option parallel_workers to allow considering parallel
insert for a relation. Any better ideas?

>  If such
> changes are not possible in the short term, like for v14, we should at
> least try to make sure that the eager checking of all partitions is
> only performed if using parallelism is possible at all.
>

As of now, we do first check if it is safe to generate a parallel plan
for underlying select (in Insert into .... Select ..) and then perform
parallel-safety checks for the DML. We can postpone it further as
suggested above in (a).

> I will try to take a look at the patches themselves to see if there's
> something I know that will help.
>

Thank you. It will be really helpful if you can do that.

[1] - https://www.postgresql.org/message-id/b54f2e306780449093c311118cd8a04e%40G08CNEXMBPEKD05.g08.fujitsu.local

-- 
With Regards,
Amit Kapila.

Re: Determine parallel-safety of partition relations for Inserts

From

Amit Kapila

Date:

18 January 2021, 06:17:06

On Mon, Jan 18, 2021 at 6:08 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Amit Kapila <amit.kapila16@gmail.com>
> > I think it would be good if the parallelism works by default when
> > required but I guess if we want to use something on these lines then
> > we can always check if the parallel_workers option is non-zero for a
> > relation (with RelationGetParallelWorkers). So users can always say
> > Alter Table <tbl_name> Set (parallel_workers = 0) if they don't want
> > to enable write parallelism for tbl and if someone is bothered that
> > this might impact Selects as well because the same option is used to
> > compute the number of workers for it then we can invent a second
> > option parallel_dml_workers or something like that.
>
> Yes, if we have to require some specification to enable parallel DML, I agree that parallel query and parall DML can
beseparately enabled.  With that said, I'm not sure if the user, and PG developers, want to allow specifying degree of
parallelismfor DML. 
>

We already allow users to specify the degree of parallelism for all
the parallel operations via guc's max_parallel_maintenance_workers,
max_parallel_workers_per_gather, then we have a reloption
parallel_workers and vacuum command has the parallel option where
users can specify the number of workers that can be used for
parallelism. The parallelism considers these as hints but decides
parallelism based on some other parameters like if there are that many
workers available, etc. Why the users would expect differently for
parallel DML?

>
> > > As an aside, (1) and (2) has a potential problem with memory consumption.
> > >
> >
> > I can see the memory consumption argument for (2) because we might end
> > up generating parallel paths (partial paths) for reading the table but
> > don't see how it applies to (1)?
>
> I assumed that we would still open all partitions for parallel safety check in (1) and (2).  In (1), parallel safety
checkis done only when parallel DML is explicitly enabled by the user.  Just opening partitions keeps
CacheMemoryContextbloated even after they are closed. 
>

Which memory specific to partitions are you referring to here and does
that apply to the patch being discussed?

--
With Regards,
Amit Kapila.

RE: Determine parallel-safety of partition relations for Inserts

From

"tsunakawa.takay@fujitsu.com"

Date:

18 January 2021, 07:57:06

From: Amit Kapila <amit.kapila16@gmail.com>
> We already allow users to specify the degree of parallelism for all
> the parallel operations via guc's max_parallel_maintenance_workers,
> max_parallel_workers_per_gather, then we have a reloption
> parallel_workers and vacuum command has the parallel option where
> users can specify the number of workers that can be used for
> parallelism. The parallelism considers these as hints but decides
> parallelism based on some other parameters like if there are that many
> workers available, etc. Why the users would expect differently for
> parallel DML?

I agree that the user would want to specify the degree of parallelism of DML, too.  My simple (probably silly) question
was,in INSERT SELECT,
 

* If the target table has 10 partitions and the source table has 100 partitions, how would the user want to specify
parameters?

* If the source and target tables have the same number of partitions, and the user specified different values to
parallel_workersand parallel_dml_workers, how many parallel workers would run?
 

* What would the query plan be like?  Something like below?  Can we easily support this sort of nested thing?

Gather
  Workers Planned: <parallel_dml_workers>
  Insert
    Gather
      Workers Planned: <parallel_workers>
      Parallel Seq Scan


> Which memory specific to partitions are you referring to here and does
> that apply to the patch being discussed?

Relation cache and catalog cache, which are not specific to partitions.  This patch's current parallel safety check
opensand closes all descendant partitions of the target table.  That leaves those cache entries in CacheMemoryContext
afterthe SQL statement ends.  But as I said, we can consider it's not a serious problem in this case because the
parallelDML would be executed in limited number of concurrent sessions.  I just touched on the memory consumption issue
forcompleteness in comparison with (3).
 


Regards
Takayuki Tsunakawa

Re: Determine parallel-safety of partition relations for Inserts

From

Amit Kapila

Date:

18 January 2021, 11:33:55

On Mon, Jan 18, 2021 at 10:27 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Amit Kapila <amit.kapila16@gmail.com>
> > We already allow users to specify the degree of parallelism for all
> > the parallel operations via guc's max_parallel_maintenance_workers,
> > max_parallel_workers_per_gather, then we have a reloption
> > parallel_workers and vacuum command has the parallel option where
> > users can specify the number of workers that can be used for
> > parallelism. The parallelism considers these as hints but decides
> > parallelism based on some other parameters like if there are that many
> > workers available, etc. Why the users would expect differently for
> > parallel DML?
>
> I agree that the user would want to specify the degree of parallelism of DML, too.  My simple (probably silly)
questionwas, in INSERT SELECT,
 
>
> * If the target table has 10 partitions and the source table has 100 partitions, how would the user want to specify
parameters?
>
> * If the source and target tables have the same number of partitions, and the user specified different values to
parallel_workersand parallel_dml_workers, how many parallel workers would run?
 
>

Good question. I think if we choose to have a separate parameter for
DML, it can probably a boolean to just indicate whether to enable
parallel DML for a specified table and use the parallel_workers
specified in the table used in SELECT.

-- 
With Regards,
Amit Kapila.

RE: Determine parallel-safety of partition relations for Inserts

From

"tsunakawa.takay@fujitsu.com"

Date:

18 January 2021, 11:37:26

From: Amit Kapila <amit.kapila16@gmail.com>
> Good question. I think if we choose to have a separate parameter for
> DML, it can probably a boolean to just indicate whether to enable
> parallel DML for a specified table and use the parallel_workers
> specified in the table used in SELECT.

Agreed.


Regards
Takayuki Tsunakawa

RE: Determine parallel-safety of partition relations for Inserts

From

"Hou, Zhijie"

Date:

28 January 2021, 14:30:43

> From: Amit Kapila <amit.kapila16@gmail.com>
> > Good question. I think if we choose to have a separate parameter for
> > DML, it can probably a boolean to just indicate whether to enable
> > parallel DML for a specified table and use the parallel_workers
> > specified in the table used in SELECT.
> 
> Agreed.

Hi

I have an issue about the parameter for DML.

If we define the parameter as a tableoption.

For a partitioned table, if we set the parent table's parallel_dml=on,
and set one of its partition parallel_dml=off, it seems we can not divide the plan for the separate table.

For this case, should we just check the parent's parameter ?

Best regards,
houzj

Re: Determine parallel-safety of partition relations for Inserts

From

Amit Kapila

Date:

28 January 2021, 15:28:14

On Thu, Jan 28, 2021 at 5:00 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> > From: Amit Kapila <amit.kapila16@gmail.com>
> > > Good question. I think if we choose to have a separate parameter for
> > > DML, it can probably a boolean to just indicate whether to enable
> > > parallel DML for a specified table and use the parallel_workers
> > > specified in the table used in SELECT.
> >
> > Agreed.
>
> Hi
>
> I have an issue about the parameter for DML.
>
> If we define the parameter as a tableoption.
>
> For a partitioned table, if we set the parent table's parallel_dml=on,
> and set one of its partition parallel_dml=off, it seems we can not divide the plan for the separate table.
>
> For this case, should we just check the parent's parameter ?
>

I think so. IIUC, this means the Inserts where target table is parent
table, those will just check the option of the parent table and then
ignore the option value for child tables whereas we will consider
childrel's option for Inserts where target table is one of the child
table, right?

-- 
With Regards,
Amit Kapila.

RE: Determine parallel-safety of partition relations for Inserts

From

"Hou, Zhijie"

Date:

29 January 2021, 04:01:01

> > Hi
> >
> > I have an issue about the parameter for DML.
> >
> > If we define the parameter as a tableoption.
> >
> > For a partitioned table, if we set the parent table's parallel_dml=on,
> > and set one of its partition parallel_dml=off, it seems we can not divide
> the plan for the separate table.
> >
> > For this case, should we just check the parent's parameter ?
> >
> 
> I think so. IIUC, this means the Inserts where target table is parent table,
> those will just check the option of the parent table and then ignore the
> option value for child tables whereas we will consider childrel's option
> for Inserts where target table is one of the child table, right?
> 

Yes, I think we can just check the target table itself.

Best regards,
houzj

RE: Determine parallel-safety of partition relations for Inserts

From

"Hou, Zhijie"

Date:

29 January 2021, 09:43:14

Hi,

Attatching v1 patches, introducing options which let user manually control whether to use parallel dml.

About the patch:
1. add a new guc option: enable_parallel_dml (boolean)
2. add a new tableoption: parallel_dml (boolean)

   The default of both is off(false).

User can set enable_parallel_dml in postgresql.conf or seesion to enable parallel dml.
If user want to choose some specific to use parallel insert, they can set table.parallel_dml to on.

Some attention：
(1)
Currently if guc option enable_parallel_dml is set to on but table's parallel_dml is off,
planner still do not consider parallel for dml.

In this way, If user want to use parallel dml , they have to set enable_parallel_dml=on and set parallel_dml = on.
If someone dislike this, I think we can also let tableoption. parallel_dml's default value to on ,with this user only
need
to set enable_parallel_dml=on

(2)
For the parallel_dml.
If target table is partitioned, and it's parallel_dml is set to on, planner will ignore
The option value of it's child. 
This is beacause we can not divide dml plan to separate table, so we just check the target table itself.


Thoughts and comments are welcome.

Best regards,
houzj

Attachment

RE: Determine parallel-safety of partition relations for Inserts

From

"Hou, Zhijie"

Date:

29 January 2021, 09:49:05

> Hi,
> 
> Attatching v1 patches, introducing options which let user manually control
> whether to use parallel dml.
> 
> About the patch:
> 1. add a new guc option: enable_parallel_dml (boolean) 2. add a new
> tableoption: parallel_dml (boolean)
> 
>    The default of both is off(false).
> 
> User can set enable_parallel_dml in postgresql.conf or seesion to enable
> parallel dml.
> If user want to choose some specific to use parallel insert, they can set
> table.parallel_dml to on.
> 
> Some attention：
> (1)
> Currently if guc option enable_parallel_dml is set to on but table's
> parallel_dml is off, planner still do not consider parallel for dml.
> 
> In this way, If user want to use parallel dml , they have to set
> enable_parallel_dml=on and set parallel_dml = on.
> If someone dislike this, I think we can also let tableoption. parallel_dml's
> default value to on ,with this user only need to set enable_parallel_dml=on
> 
> (2)
> For the parallel_dml.
> If target table is partitioned, and it's parallel_dml is set to on, planner
> will ignore The option value of it's child.
> This is beacause we can not divide dml plan to separate table, so we just
> check the target table itself.
> 
> 
> Thoughts and comments are welcome.

The patch is based on v13 patch(parallel insert select) in [1]

[1] https://www.postgresql.org/message-id/CAJcOf-ejV8iU%2BYpuV4qbYEY-2vCG1QF2g3Gxn%3DZ%2BPyUH_5f84A%40mail.gmail.com

Best regards,
houzj

Re: Determine parallel-safety of partition relations for Inserts

From

Greg Nancarrow

Date:

01 February 2021, 05:37:43

On Fri, Jan 29, 2021 at 5:44 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
>
> Attatching v1 patches, introducing options which let user manually control whether to use parallel dml.
>
> About the patch:
> 1. add a new guc option: enable_parallel_dml (boolean)
> 2. add a new tableoption: parallel_dml (boolean)
>
>    The default of both is off(false).
>
> User can set enable_parallel_dml in postgresql.conf or seesion to enable parallel dml.
> If user want to choose some specific to use parallel insert, they can set table.parallel_dml to on.
>
> Some attention：
> (1)
> Currently if guc option enable_parallel_dml is set to on but table's parallel_dml is off,
> planner still do not consider parallel for dml.
>
> In this way, If user want to use parallel dml , they have to set enable_parallel_dml=on and set parallel_dml = on.
> If someone dislike this, I think we can also let tableoption. parallel_dml's default value to on ,with this user only
need
> to set enable_parallel_dml=on
>
> (2)
> For the parallel_dml.
> If target table is partitioned, and it's parallel_dml is set to on, planner will ignore
> The option value of it's child.
> This is beacause we can not divide dml plan to separate table, so we just check the target table itself.
>
>
> Thoughts and comments are welcome.
>

Personally, I think a table's "parallel_dml" option should be ON by default.
It's annoying to have to separately enable it for each and every table
being used, when I think the need to turn it selectively OFF should be
fairly rare.
And I'm not sure that "parallel_dml" is the best name for the table
option - because it sort of implies parallel dml WILL be used - but
that isn't true, it depends on other factors too.
So I think (to be consistent with other table option naming) it would
have to be something like "parallel_dml_enabled".
I'm still looking at the patches, but have so far noticed that there
are some issues in the documentation updates (grammatical issues and
consistency issues with current documentation), and also some of the
explanations are not clear. I guess I can address these separately.

Regards,
Greg Nancarrow
Fujitsu Australia

RE: Determine parallel-safety of partition relations for Inserts

From

"Hou, Zhijie"

Date:

01 February 2021, 08:02:18

Hi greg,

Thanks for the review !

> Personally, I think a table's "parallel_dml" option should be ON by default.
> It's annoying to have to separately enable it for each and every table being
> used, when I think the need to turn it selectively OFF should be fairly
> rare.

Yes, I agreed.
Changed.

> And I'm not sure that "parallel_dml" is the best name for the table option
> - because it sort of implies parallel dml WILL be used - but that isn't
> true, it depends on other factors too.
> So I think (to be consistent with other table option naming) it would have
> to be something like "parallel_dml_enabled".

Agreed.
Changed to parallel_dml_enabled.

Attatching v2 patch which addressed the comments above.

Some further refactor:

Introducing a new function is_parallel_possible_for_modify() which decide whether to do safety check.

IMO, It seems more readable to extract all the check that we can do before the safety-check and put them
in the new function.

Please consider it for further review.

Best regards,
houzj

Attachment

Re: Determine parallel-safety of partition relations for Inserts

From

Greg Nancarrow

Date:

02 February 2021, 02:55:40

On Mon, Feb 1, 2021 at 4:02 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Attatching v2 patch which addressed the comments above.
>
> Some further refactor:
>
> Introducing a new function is_parallel_possible_for_modify() which decide whether to do safety check.
>
> IMO, It seems more readable to extract all the check that we can do before the safety-check and put them
> in the new function.
>
> Please consider it for further review.
>

I've updated your v2 patches and altered some comments and
documentation changes (but made no code changes) - please compare
against your v2 patches, and see whether you agree with the changes to
the wording.
In the documentation, you will also notice that in your V2 patch, it
says that the "parallel_dml_enabled" table option defaults to false.
As it actually defaults to true, I changed that in the documentation
too.

Regards,
Greg Nancarrow
Fujitsu Australia

Attachment

Re: Determine parallel-safety of partition relations for Inserts

From

Zhihong Yu

Date:

02 February 2021, 03:04:11

Hi,

For v3_0003-reloption-parallel_dml-src.patch :

+ * Check if parallel_dml_enabled is enabled for the target table,
+ * if not, skip the safety checks and return PARALLEL_UNSAFE.

Looks like the return value is true / false. So the above comment should be adjusted.

+ if (!RelationGetParallelDML(rel, true))
+ {
+ table_close(rel, NoLock);
+ return false;
+ }
+
+ table_close(rel, NoLock);

Since the rel would always be closed, it seems the return value from RelationGetParallelDML() can be assigned to a variable, followed by call to table_close(), then the return statement.

Cheers

On Mon, Feb 1, 2021 at 3:56 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Mon, Feb 1, 2021 at 4:02 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Attatching v2 patch which addressed the comments above.
>
> Some further refactor:
>
> Introducing a new function is_parallel_possible_for_modify() which decide whether to do safety check.
>
> IMO, It seems more readable to extract all the check that we can do before the safety-check and put them
> in the new function.
>
> Please consider it for further review.
>

I've updated your v2 patches and altered some comments and
documentation changes (but made no code changes) - please compare
against your v2 patches, and see whether you agree with the changes to
the wording.
In the documentation, you will also notice that in your V2 patch, it
says that the "parallel_dml_enabled" table option defaults to false.
As it actually defaults to true, I changed that in the documentation
too.

Regards,
Greg Nancarrow
Fujitsu Australia

RE: Determine parallel-safety of partition relations for Inserts

From

"Hou, Zhijie"

Date:

02 February 2021, 04:54:37

> > IMO, It seems more readable to extract all the check that we can do
> > before the safety-check and put them in the new function.
> >
> > Please consider it for further review.
> >
> 
> I've updated your v2 patches and altered some comments and documentation
> changes (but made no code changes) - please compare against your v2 patches,
> and see whether you agree with the changes to the wording.
> In the documentation, you will also notice that in your V2 patch, it says
> that the "parallel_dml_enabled" table option defaults to false.
> As it actually defaults to true, I changed that in the documentation too.

Hi greg,

Thanks a lot for the document update, LGTM.

Attaching v4 patches with the changes:
 * fix some typos in the code.
 * store partitioned reloption in a separate struct PartitionedOptions, 
   making it more standard and easier to expand in the future.

Please consider it for further review.

Best regards,
Houzj

Attachment

RE: Determine parallel-safety of partition relations for Inserts

From

"Hou, Zhijie"

Date:

03 February 2021, 04:01:05

Hi,

Attaching v5 patches with the changes:
  * rebase the code on the greg's latest parallel insert patch
  * fix some code style.

Please consider it for further review.

Best regards,
Houzj

Attachment

RE: Determine parallel-safety of partition relations for Inserts

From

"Hou, Zhijie"

Date:

03 February 2021, 04:02:45

> For v3_0003-reloption-parallel_dml-src.patch :
> +   table_close(rel, NoLock);

> Since the rel would always be closed, it seems the return value from RelationGetParallelDML() can be assigned to a
variable,followed by call to table_close(), then the return statement.
 

Thanks for the comment. Fixed in the latest patch.

Best regards,
houzj

RE: Determine parallel-safety of partition relations for Inserts

From

"Huang, Qiuyan"

Date:

05 February 2021, 05:54:32

Hi,

I notice the comment 5) about is_parallel_possible_for_modify() seems to be inaccurate in clauses.c.
    *  5) the reloption parallel_dml_enabled is not set for the target table

Because you have set parallel_dml_enabled to 'true' as default. 
    {
        {
            "parallel_dml_enabled",
            "Enables \"parallel dml\" feature for this table",
            RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
            ShareUpdateExclusiveLock
        },
        true
    }

So even user doesn't set parallel_dml_enabled explicit, its value is 'on', Parallel is also possible for this rel(when
enable_parallel_dmlis on). 
 

Maybe we can just comment like this or a better one you'd like if you agree with above:
    *  5) the reloption parallel_dml_enabled is set to off

Regards
Huang

> -----Original Message-----
> From: Hou, Zhijie <houzj.fnst@cn.fujitsu.com>
> Sent: Wednesday, February 3, 2021 9:01 AM
> To: Greg Nancarrow <gregn4422@gmail.com>
> Cc: Amit Kapila <amit.kapila16@gmail.com>; PostgreSQL Hackers
> <pgsql-hackers@lists.postgresql.org>; vignesh C <vignesh21@gmail.com>;
> Amit Langote <amitlangote09@gmail.com>; David Rowley
> <dgrowleyml@gmail.com>; Tom Lane <tgl@sss.pgh.pa.us>; Tsunakawa,
> Takayuki/綱川 貴之 <tsunakawa.takay@fujitsu.com>
> Subject: RE: Determine parallel-safety of partition relations for Inserts
> 
> Hi,
> 
> Attaching v5 patches with the changes:
>   * rebase the code on the greg's latest parallel insert patch
>   * fix some code style.
> 
> Please consider it for further review.
> 
> Best regards,
> Houzj
> 
> 
> 
>

RE: Determine parallel-safety of partition relations for Inserts

From

"Hou, Zhijie"

Date:

18 February 2021, 06:24:48

Hi,

Attaching v6 patches with the changes:
  * rebase the code on the greg's latest parallel insert patch.
  * fix some code comment.
  * add some test to cover the partitioned table.

Please consider it for further review.

Best regards,
Houzj