Thread: Parallel INSERT SELECT take 2
This is another try of [1].

BACKGROUND
========================================

We want to realize parallel INSERT SELECT in the following steps:
1) INSERT + parallel SELECT
2) Parallel INSERT + parallel SELECT

Below are example use cases. We don't expect high concurrency or an empty data source.
* Data loading (ETL or ELT) into an analytics database, typically a data warehouse.
* Batch processing in an OLTP database.

PROBLEMS
========================================

(1) The overhead of checking parallel safety could be large
We have to check the target table and its child partitions for parallel safety. That is, we make sure those relations don't have parallel-unsafe domains, constraints, indexes, or triggers.

What we should check is the relations into which the statement actually inserts data. However, the planner does not know which relations will actually be inserted into, so it has to check all descendant partitions of the target table. When the target table has many partitions, this overhead could be unacceptable compared to the benefit gained from parallelism.

(2) There's no mechanism for parallel workers to assign an XID
Parallel workers need the XID of the current (sub)transaction when actually inserting a tuple (i.e., calling heap_insert()). When the leader has not gotten the XID yet, a worker may have to assign a new XID and communicate it to the leader and the other workers, so that all parallel processes use the same XID.

SOLUTION TO (1)
========================================

The candidate ideas are:

1) Caching the result of the parallel-safety check
The planner stores the result of checking parallel safety for each relation in the relcache, or in some purpose-built hash table in shared memory.

The problems are:

* Even if the target relation turns out to be parallel safe by looking at those data structures, we cannot assume that remains true until the SQL statement finishes. For instance, other sessions might add a parallel-unsafe index to one of its descendant relations. Another example: when the user changes the parallel safety of indexes or triggers by running ALTER FUNCTION on the underlying index AM function or trigger function, the relcache entry of the table or index is not invalidated, so the correct parallel safety is not maintained in the cache. In that case, when the executor encounters a parallel-unsafe object, it can change the cached state to parallel unsafe and error out.

* Fast access cannot be ensured. With the relcache, the first access in each session has to undergo the overhead of the parallel-safety check. With a hash table in shared memory, the number of relations stored there would be limited, so the first access after database startup, or after a hash table entry has been evicted, similarly experiences slowness.

* With a new hash table, some lwlock for concurrent access must be added, which can have an adverse effect on performance.

2) Enabling users to declare that the table allows parallel data modification
Add a table property that represents the parallel safety of the table for DML statement execution. Users specify it as follows:

  CREATE TABLE table_name (...) PARALLEL { UNSAFE | RESTRICTED | SAFE };
  ALTER TABLE table_name PARALLEL { UNSAFE | RESTRICTED | SAFE };

This property is recorded in pg_class's relparallel column as 'u', 'r', or 's', just like pg_proc's proparallel. The default is UNSAFE.

The planner assumes that the table, its descendant partitions, and their ancillary objects all have the specified parallel safety or a safer one. The user is responsible for its correctness. If the parallel processes find an object that is less safe than the assumed parallel safety during statement execution, they throw an ERROR and abort the statement execution.

The objects that relate to the parallel safety of a DML target table are as follows:
* Column default expression
* DOMAIN type CHECK expression
* CHECK constraints on columns
* Partition key
* Partition key support function
* Index expression
* Index predicate
* Index AM function
* Operator function
* Trigger function

When the parallel safety of some of these objects is changed, it's costly to reflect it on the parallel safety of the tables that depend on them, so we don't do that. Instead, we provide a utility function pg_get_parallel_safety('table_name') that returns records of (objid, classid, parallel_safety) representing the parallel safety of the objects that determine the parallel safety of the specified table. The function only outputs objects that are not parallel safe; otherwise, it would consume excessive memory while accumulating the output. The user can use this function to identify problematic objects when a parallel DML fails or is not parallelized in the expected manner.
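For example, the intended workflow with this property would look like the following (parallel_unsafe_func() here is just a hypothetical user-defined function used as a column default, and the table name is made up):

  -- the table defaults to PARALLEL UNSAFE
  CREATE TABLE orders (id bigint DEFAULT parallel_unsafe_func(), note text);

  -- list the objects that currently keep the table from being parallel safe
  SELECT objid, classid, parallel_safety FROM pg_get_parallel_safety('orders');

  -- after fixing or verifying the reported objects, the user declares the table
  -- safe, which lets the planner consider parallel plans for INSERT SELECT into it
  ALTER TABLE orders PARALLEL SAFE;

As described above, the ALTER itself takes the declaration on trust; a wrong declaration is only detected with an ERROR during parallel execution.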
How does the executor detect parallel-unsafe objects? There are two ways:

1) At loading time
When the executor loads the definition of objects (tables, constraints, indexes, triggers, etc.) during the first access to them after session start, or after their eviction by an sinval message, it checks their parallel safety. This is a legitimate way, but it may need much code. Also, it might overlook necessary code changes without careful inspection.

2) At function execution time
All related objects come down to some function execution. So, add a parallel-safety check there when in a parallel worker. If the current process is a parallel worker and the function is parallel unsafe, error out with ereport(ERROR). This approach eliminates the oversight of parallel-safety checks, with the additional bonus of a tiny code change!

The place would be FunctionCallInvoke(). It's a macro in fmgr.h now. Perhaps we should make it a function in fmgr.c, so that fmgr.h does not have to include header files for parallelism-related definitions.

We have to evaluate the performance effect of converting FunctionCallInvoke() into a function and adding an if statement there, because it's a relatively low-level function.
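For reference, the flag such a check would consult is the function's declared parallel safety, i.e. what pg_proc.proparallel already records ('u', 'r', or 's'). Users can inspect it themselves; for instance, to see the declared safety of the trigger functions on a given target table (the table name is just a placeholder):

  SELECT p.proname, p.proparallel
    FROM pg_trigger t
    JOIN pg_proc p ON p.oid = t.tgfoid
   WHERE t.tgrelid = 'target_table'::regclass
     AND NOT t.tgisinternal;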
SOLUTION TO (2)
========================================

1) Make it possible for workers to assign an XID and share it among the parallel processes
The problems are:

* Tuple visibility
If the worker that acquires the XID writes some row, and another worker reads that row before it gets to see the XID information, the latter worker won't treat such a row as written by its own transaction. For instance, suppose the worker (w-1) that acquires the XID (501) deletes the tuple (CTID: 0, 2). Now, if another worker (w-2) reads that tuple (CTID: 0, 2), it would consider that the tuple is still visible to its snapshot; but if w-2 knew that 501 is its own XID, it would have considered the tuple as deleted (not visible). I think this can happen when multiple updates to the same row happen and new rows get added to a new page.

* The implementation seems complex
When the DML is run inside a deeply nested subtransaction and the parent transactions have not allocated their XIDs yet, the worker needs to allocate the XIDs for its parents. That indeterminate number of XIDs must be stored in shared memory. The stack of TransactionState structures must also be passed.

Also, TransactionIdIsCurrentTransactionId() uses an array, ParallelCurrentXids, in which parallel workers receive sub-committed XIDs from the leader. This needs to be reconsidered.

2) The leader assigns an XID before entering parallel mode and passes it to workers
This is what was done in [1].

The problem is that the XID would not be used if the data source (SELECT query) returns no valid rows. This is a waste of an XID. However, the data source should rarely be empty when this feature is used. As the following Oracle manual says, parallel DML will be used in data analytics and OLTP batch jobs. There should be plenty of source data in those scenarios.

When to Use Parallel DML
https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-18B2AF09-C548-48DE-A794-86224111549F
--------------------------------------------------
Several scenarios where parallel DML is used include:
Refreshing Tables in a Data Warehouse System
Creating Intermediate Summary Tables
Using Scoring Tables
Updating Historical Tables
Running Batch Jobs
--------------------------------------------------

CONCLUSION
========================================

(1) The overhead of checking parallel safety could be large
We're inclined to go with solution 2, because it doesn't have a big problem. However, we'd like to try to present some more analysis on solution 1 in this thread.

Regarding how to check parallel safety in the executor, I prefer the simpler way of adding a check at function execution. If it turns out to have an intolerable performance problem, we can choose the other approach.

(2) There's no mechanism for parallel workers to assign an XID
We'd like to adopt solution 2 because it will not really be a big issue in the assumed use cases. The implementation is very easy and does not look strange.

Of course, any better-looking idea would be much appreciated. (But a simple one, or at least one that is not unnecessarily complex, is desired.)

[1] Parallel INSERT (INTO ... SELECT ...)
https://www.postgresql.org/message-id/flat/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV=qpFJrR3AcrTS3g@mail.gmail.com

Regards
Takayuki Tsunakawa
> BACKGROUND
> ========================================
>
> We want to realize parallel INSERT SELECT in the following steps:
> 1) INSERT + parallel SELECT
> 2) Parallel INSERT + parallel SELECT
>
> Below are example use cases. We don't expect high concurrency or an empty
> data source.
> * Data loading (ETL or ELT) into an analytics database, typically a data warehouse.
> * Batch processing in an OLTP database.
>
> 2) Enabling users to declare that the table allows parallel data modification
> Add a table property that represents the parallel safety of the table for DML
> statement execution. Users specify it as follows:
>
> CREATE TABLE table_name (...) PARALLEL { UNSAFE | RESTRICTED | SAFE };
> ALTER TABLE table_name PARALLEL { UNSAFE | RESTRICTED | SAFE };
>
> This property is recorded in pg_class's relparallel column as 'u', 'r', or 's',
> just like pg_proc's proparallel. The default is UNSAFE.
>
> The planner assumes that the table, its descendant partitions, and their
> ancillary objects all have the specified parallel safety or a safer one. The user
> is responsible for its correctness. If the parallel processes find an object that
> is less safe than the assumed parallel safety during statement execution, they
> throw an ERROR and abort the statement execution.
>
> When the parallel safety of some of these objects is changed, it's costly to
> reflect it on the parallel safety of the tables that depend on them, so we don't
> do that. Instead, we provide a utility function pg_get_parallel_safety('table_name')
> that returns records of (objid, classid, parallel_safety) representing the
> parallel safety of the objects that determine the parallel safety of the
> specified table. The function only outputs objects that are not parallel safe;
> otherwise, it would consume excessive memory while accumulating the output.
> The user can use this function to identify problematic objects when a parallel
> DML fails or is not parallelized in the expected manner.
>
> How does the executor detect parallel-unsafe objects? There are two ways:
>
> 1) At loading time
> ...
> 2) At function execution time
> All related objects come down to some function execution. So, add a
> parallel-safety check there when in a parallel worker. If the current process is
> a parallel worker and the function is parallel unsafe, error out with
> ereport(ERROR). This approach eliminates the oversight of parallel-safety
> checks, with the additional bonus of a tiny code change!
>
> The place would be FunctionCallInvoke(). It's a macro in fmgr.h now. Perhaps
> we should make it a function in fmgr.c, so that fmgr.h does not have to include
> header files for parallelism-related definitions.
>
> We have to evaluate the performance effect of converting FunctionCallInvoke()
> into a function and adding an if statement there, because it's a relatively
> low-level function.

Based on the above, we plan to move forward with approach 2) (the declarative idea).

Attaching the POC patchset, which includes the following:

0001: provide a utility function pg_get_parallel_safety('table_name').
The function returns records of (objid, classid, parallel_safety) that represent the parallel safety of the objects that determine the parallel safety of the specified table.
Note: the function only outputs objects that are not parallel safe.
(Thanks a lot for Greg's previous work; most of the safety check code here is based on it.)

0002: allow users to run "ALTER TABLE ... PARALLEL SAFE/UNSAFE/RESTRICTED".
Add a proparallel column in pg_class and allow users to change it (a brief usage sketch follows this list).

0003: detect parallel-unsafe objects in the executor.
Currently we choose to check a function's parallel safety at function execution time. We add the safety check at FunctionCallInvoke(), but it may be better to check in fmgr_info_cxt_security; we are still discussing this in another thread [1].
TODO: we currently skip checking built-in functions' parallel safety, because we lack the information about built-in functions' parallel safety; we cannot access pg_proc.proparallel at such a low level because it could result in infinite recursion. Adding a parallel property to fmgrBuiltin would enlarge the frequently accessed fmgr_builtins and lock down the value of the parallel-safety flag. The solution is still under discussion. Suggestions and comments are welcome.

0004: fix some mislabeled functions in the test cases.
Since we check the parallel safety of functions at a low level, we found that some functions marked as parallel unsafe are executed in parallel mode in the regression tests when setting force_parallel_mode=regress. After checking, these functions are parallel safe, so we plan to fix those functions' parallel labels.
Note: we plan to submit 0004 as a separate patch, see [2]; I post 0004 here just to prevent some test case failures.
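As a rough usage sketch of 0002, together with a 0004-style relabeling (the table and function names below are made up for illustration):

  -- 0002: declare the table's parallel safety for DML
  ALTER TABLE measurement PARALLEL SAFE;

  -- 0004-style fix: a function that is actually parallel safe but was
  -- mislabeled can be corrected with the existing ALTER FUNCTION syntax
  ALTER FUNCTION my_check_fn(integer) PARALLEL SAFE;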
The above are the POC patches; they may be imperfect for now, and I am still working on improving them.
Suggestions and comments about the design or the code are very welcome and appreciated.

Best regards,
houzj
> 0003: detect parallel-unsafe objects in the executor.
> Currently we choose to check a function's parallel safety at function execution
> time. We add the safety check at FunctionCallInvoke(), but it may be better to
> check in fmgr_info_cxt_security; we are still discussing this in another thread [1].
> ...
> 0004: fix some mislabeled functions in the test cases.
> ...
> Note: we plan to submit 0004 as a separate patch, see [2]; I post 0004 here just
> to prevent some test case failures.

Sorry, I forgot to attach the discussion links [1] and [2].

[1] https://www.postgresql.org/message-id/756027.1619012086%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/OS0PR01MB571637085C0D3AFC3AB3600194479%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
houzj
On Thu, Apr 22, 2021 at 4:51 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> > 2) Enabling users to declare that the table allows parallel data modification
> > Add a table property that represents the parallel safety of the table for DML
> > statement execution. Users specify it as follows:
> >
> > CREATE TABLE table_name (...) PARALLEL { UNSAFE | RESTRICTED | SAFE };
> > ALTER TABLE table_name PARALLEL { UNSAFE | RESTRICTED | SAFE };
> > ...
>
> Based on the above, we plan to move forward with approach 2) (the declarative idea).

IIUC, the declarative behaviour idea attributes parallel safe/unsafe/restricted tags to each table, with the default being unsafe. Does it mean that for a parallel unsafe table no parallel selects or inserts (maybe updates too) will be picked? Or is it only the parallel inserts? If both parallel inserts and selects are affected, will existing tables need to be adjusted to set the parallel safety tags while migrating?

Another point: what does it mean for a table to be parallel restricted? What should happen if it is present in a query together with other parallel safe tables?

I may be wrong here: IIUC, the main problem we are trying to solve with the declarative approach is to let the user decide parallel safety for partitioned tables, as it may be costlier for postgres to determine it. For normal tables we can perform parallel safety checks without incurring much cost. So, I think we should restrict the declarative approach to only partitioned tables?

While reading the design, I came across this "erroring out during execution of a query when a parallel unsafe function is detected". If this is correct, isn't it warranting users to run pg_get_parallel_safety to know the parallel unsafe objects, set parallel safety on all of them if possible, and otherwise disable parallelism to run the query? Isn't this burdensome? Instead, how about postgres retrying the query upon detecting the error that came from a parallel unsafe function during execution, disabling parallelism and running the query? I think this kind of query retry feature can be built outside of core postgres, but IMO it will be good to have inside (of course configurable). IIRC, the Teradata database has a Query Retry feature.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
> > Based on the above, we plan to move forward with approach 2) (the declarative idea).
>
> IIUC, the declarative behaviour idea attributes parallel safe/unsafe/restricted
> tags to each table, with the default being unsafe. Does it mean that for a
> parallel unsafe table no parallel selects or inserts (maybe updates too) will be
> picked? Or is it only the parallel inserts? If both parallel inserts and selects
> are affected, will existing tables need to be adjusted to set the parallel safety
> tags while migrating?

Thanks for looking into this.

The parallel attributes of a table describe its parallel safety when the user performs data-modification operations on it. So, it only limits the use of parallel plans for INSERT/UPDATE/DELETE.

> Another point: what does it mean for a table to be parallel restricted?
> What should happen if it is present in a query together with other parallel safe tables?

If a table is parallel restricted, it means the table contains some parallel restricted objects (such as parallel restricted functions in index expressions). In the planner, it means a parallel insert plan will not be chosen, but it can use a parallel select (with serial insert).

> I may be wrong here: IIUC, the main problem we are trying to solve with the
> declarative approach is to let the user decide parallel safety for partitioned
> tables, as it may be costlier for postgres to determine it. For normal tables we
> can perform parallel safety checks without incurring much cost. So, I think we
> should restrict the declarative approach to only partitioned tables?

Yes, we are trying to avoid overhead when checking parallel safety.

The cost to check all the partitions' parallel safety is the biggest one. Another is the safety check of index expressions. Currently, for INSERT, the planner does not open the target table's IndexInfo and does not parse the expressions of the indexes. We would need to parse the expressions in the planner if we want to do a parallel safety check for them, which can bring some overhead (the executor will open the index and do the parse again). So, we plan to skip all of the extra checks and let the user take responsibility for the safety.

Of course, maybe we can try to pass the IndexInfo to the executor, but that needs some further refactoring and I will take a look into it.

> While reading the design, I came across this "erroring out during execution of a
> query when a parallel unsafe function is detected". If this is correct, isn't it
> warranting users to run pg_get_parallel_safety to know the parallel unsafe
> objects, set parallel safety on all of them if possible, and otherwise disable
> parallelism to run the query? Isn't this burdensome?

How about: if parallel unsafe objects are detected in the executor, then alter the table to parallel unsafe internally? Then the user does not need to alter it manually.

> Instead, how about
> postgres retrying the query upon detecting the error that came from a parallel
> unsafe function during execution, disabling parallelism and running the query?
> I think this kind of query retry feature can be built outside of core postgres,
> but IMO it will be good to have inside (of course configurable). IIRC, the
> Teradata database has a Query Retry feature.

Thanks for the suggestion. The retry query feature sounds like a good idea to me. OTOH, it sounds more like an independent feature which parallel select can also benefit from. I think maybe we can try to achieve it after we commit the parallel insert?

Best regards,
houzj
On Mon, Apr 26, 2021 at 7:00 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> > Instead, how about
> > postgres retrying the query upon detecting the error that came from a parallel
> > unsafe function during execution, disabling parallelism and running the query?
> > I think this kind of query retry feature can be built outside of core postgres,
> > but IMO it will be good to have inside (of course configurable). IIRC, the
> > Teradata database has a Query Retry feature.
>
> Thanks for the suggestion.
> The retry query feature sounds like a good idea to me.
> OTOH, it sounds more like an independent feature which parallel select can also benefit from.
>

+1. I also think retrying a query on an error is not related to this feature and should be built separately if required.

--
With Regards,
Amit Kapila.
On Mon, Apr 26, 2021 at 7:00 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> > > Based on the above, we plan to move forward with approach 2) (the declarative idea).
> >
> > IIUC, the declarative behaviour idea attributes parallel safe/unsafe/restricted
> > tags to each table, with the default being unsafe. Does it mean that for a
> > parallel unsafe table no parallel selects or inserts (maybe updates too) will be
> > picked? Or is it only the parallel inserts? If both parallel inserts and selects
> > are affected, will existing tables need to be adjusted to set the parallel safety
> > tags while migrating?
>
> Thanks for looking into this.

Thanks for the responses.

> The parallel attributes of a table describe its parallel safety when the user
> performs data-modification operations on it. So, it only limits the use of
> parallel plans for INSERT/UPDATE/DELETE.

In that case, isn't it better to use the terminology "PARALLEL DML SAFE/UNSAFE/RESTRICTED" in the code and docs? This way, it will be clear that these tags don't affect parallel selects.

> > Another point: what does it mean for a table to be parallel restricted?
> > What should happen if it is present in a query together with other parallel safe tables?
>
> If a table is parallel restricted, it means the table contains some parallel
> restricted objects (such as parallel restricted functions in index expressions).
> In the planner, it means a parallel insert plan will not be chosen, but it can
> use a parallel select (with serial insert).

Makes sense. I assume that when there is a parallel restricted function associated with a table, the current design doesn't force the planner to choose parallel select; it is left up to it.

> > I may be wrong here: IIUC, the main problem we are trying to solve with the
> > declarative approach is to let the user decide parallel safety for partitioned
> > tables, as it may be costlier for postgres to determine it. For normal tables we
> > can perform parallel safety checks without incurring much cost. So, I think we
> > should restrict the declarative approach to only partitioned tables?
>
> Yes, we are trying to avoid overhead when checking parallel safety.
> The cost to check all the partitions' parallel safety is the biggest one.
> Another is the safety check of index expressions.
> Currently, for INSERT, the planner does not open the target table's IndexInfo
> and does not parse the expressions of the indexes. We would need to parse the
> expressions in the planner if we want to do a parallel safety check for them,
> which can bring some overhead (the executor will open the index and do the parse again).
> So, we plan to skip all of the extra checks and let the user take responsibility
> for the safety.
> Of course, maybe we can try to pass the IndexInfo to the executor, but that needs
> some further refactoring and I will take a look into it.

Will the planner parse and check parallel safety of index (where clause) expressions in case of SELECTs? I'm not sure of this. But if it does, maybe we could do the same thing for parallel DML as well for normal tables? What is the overhead of parsing index expressions? If the cost is heavy for checking index expression parallel safety in the case of normal tables, then the current design, i.e. attributing a parallel safety tag to all the tables, makes sense.

I was actually thinking that we will have the declarative approach only for partitioned tables, as that is the main problem we are trying to solve with this design.
Something like: users will run pg_get_parallel_safety to see the parallel unsafe objects associated with a partitioned table, by looking at all of its partitions, and be able to set a parallel dml safety tag on only partitioned tables.

> > While reading the design, I came across this "erroring out during execution of a
> > query when a parallel unsafe function is detected". If this is correct, isn't it
> > warranting users to run pg_get_parallel_safety to know the parallel unsafe
> > objects, set parallel safety on all of them if possible, and otherwise disable
> > parallelism to run the query? Isn't this burdensome?
>
> How about: if parallel unsafe objects are detected in the executor, then alter
> the table to parallel unsafe internally? Then the user does not need to alter it manually.

I don't think this is a good idea, because if there are multiple tables involved in the query, do you alter all the tables? Usually, we error out on finding the first such unsafe object.

> > Instead, how about
> > postgres retrying the query upon detecting the error that came from a parallel
> > unsafe function during execution, disabling parallelism and running the query?
> > I think this kind of query retry feature can be built outside of core postgres,
> > but IMO it will be good to have inside (of course configurable). IIRC, the
> > Teradata database has a Query Retry feature.
>
> Thanks for the suggestion.
> The retry query feature sounds like a good idea to me.
> OTOH, it sounds more like an independent feature which parallel select can also benefit from.
> I think maybe we can try to achieve it after we commit the parallel insert?

Yeah, it will be a separate thing altogether.

> 0001: provide a utility function pg_get_parallel_safety('table_name').
>
> The function returns records of (objid, classid, parallel_safety) that represent
> the parallel safety of the objects that determine the parallel safety of the
> specified table.
> Note: the function only outputs objects that are not parallel safe.

If it returns only parallel "unsafe" objects and not "safe" or "restricted" objects, how about naming it pg_get_table_parallel_unsafe_objects("table_name")? This way we could get rid of parallel_safety in the output record. If at all users want to see parallel restricted or safe objects, we can also have the counterparts pg_get_table_parallel_safe_objects and pg_get_table_parallel_restricted_objects. Of course, we can caution the user that execution of these functions might take longer and "might consume excessive memory while accumulating the output".

Otherwise, we can have a single function pg_get_parallel_safety("table_name" IN, "parallel_safety" IN, "objid" OUT, "classid" OUT)? If required, we could name it pg_get_parallel_safety_of_table_objects.

Thoughts?
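Just to make the two alternatives concrete (neither function exists yet, so both calls below are only sketches of the naming suggestions above, with a made-up table name and an illustrative 'u' filter):

  -- alternative A: one function per safety class, no parallel_safety output column
  SELECT objid, classid FROM pg_get_table_parallel_unsafe_objects('part_tab');

  -- alternative B: a single function that takes the wanted safety level as input
  SELECT objid, classid FROM pg_get_parallel_safety('part_tab', 'u');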
Although I have not looked at the patches, a few questions on the pg_get_parallel_safety function:
1) Will it parse all the expressions for the objects that are listed under "The objects that relate to the parallel safety of a DML target table are as follows:" in the upthread?
2) How will it behave if a partitioned table is passed to it? Will it recurse into all the partitions?
3) How will it behave if a foreign table is passed to it? Will it error out?

In general:
1) Is ALTER SET PARALLEL SAFETY on a partitioned table allowed? If yes, will it be set based on all the partitions' parallel safety?
2) How will users have to decide on the parallel safety of a foreign table, or a partitioned table with foreign partitions? Or is it that we set these tables parallel unsafe and don't do parallel inserts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
> > The parallel attributes of a table describe its parallel safety when the user
> > performs data-modification operations on it. So, it only limits the use of
> > parallel plans for INSERT/UPDATE/DELETE.
>
> In that case, isn't it better to use the terminology "PARALLEL DML
> SAFE/UNSAFE/RESTRICTED" in the code and docs? This way, it will be clear that
> these tags don't affect parallel selects.

Makes sense; I recall I have heard a similar suggestion before. If there are no other objections, I plan to change the command to PARALLEL DML in my next version of the patches; a sketch of the adjusted syntax is below.
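For clarity, the renamed commands would presumably look something like the following (the exact spelling is not settled yet, so treat this only as a sketch of the planned change):

  CREATE TABLE table_name (...) PARALLEL DML { UNSAFE | RESTRICTED | SAFE };
  ALTER TABLE table_name PARALLEL DML { UNSAFE | RESTRICTED | SAFE };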
> > > I may be wrong here: IIUC, the main problem we are trying to solve
> > > with the declarative approach is to let the user decide parallel
> > > safety for partitioned tables, as it may be costlier for postgres to
> > > determine it. For normal tables we can perform parallel safety checks
> > > without incurring much cost. So, I think we should restrict the
> > > declarative approach to only partitioned tables?
> >
> > Yes, we are trying to avoid overhead when checking parallel safety.
> > The cost to check all the partitions' parallel safety is the biggest one.
> > Another is the safety check of index expressions.
> > Currently, for INSERT, the planner does not open the target table's IndexInfo
> > and does not parse the expressions of the indexes. We would need to parse the
> > expressions in the planner if we want to do a parallel safety check for them,
> > which can bring some overhead (the executor will open the index and do the parse again).
> > So, we plan to skip all of the extra checks and let the user take responsibility
> > for the safety.
> > Of course, maybe we can try to pass the IndexInfo to the executor, but that needs
> > some further refactoring and I will take a look into it.
>
> Will the planner parse and check parallel safety of index (where
> clause) expressions in case of SELECTs? I'm not sure of this. But if it does, maybe
> we could do the same thing for parallel DML as well for normal tables?

The planner does not check the index expressions, because the expressions will not be used in a SELECT. I think an index expression is only used when a tuple is inserted into the index.

> What is the overhead of parsing index expressions? If the cost is heavy for
> checking index expression parallel safety in the case of normal tables, then the
> current design, i.e. attributing a parallel safety tag to all the tables, makes sense.

Currently, index expressions and predicates are stored in text format. We need to use stringToNode(expression/predicate) to parse them. Some committers think doing this twice does not look good, unless we find some way to pass the parsed info to the executor to avoid the second parse.

> I was actually thinking that we will have the declarative approach only for
> partitioned tables, as that is the main problem we are trying to solve with this
> design. Something like: users will run pg_get_parallel_safety to see the parallel
> unsafe objects associated with a partitioned table, by looking at all of its
> partitions, and be able to set a parallel dml safety tag on only partitioned tables.

We originally wanted to make the declarative approach apply to both normal and partitioned tables. In this way, it will not bring any overhead to the planner and looks consistent. But, do you think we should put some really cheap safety check in the planner?

> > 0001: provide a utility function pg_get_parallel_safety('table_name').
> >
> > The function returns records of (objid, classid, parallel_safety) that
> > represent the parallel safety of the objects that determine the parallel
> > safety of the specified table.
> > Note: the function only outputs objects that are not parallel safe.
>
> If it returns only parallel "unsafe" objects and not "safe" or "restricted" objects,
> how about naming it pg_get_table_parallel_unsafe_objects("table_name")?

Currently, the function returns both unsafe and restricted objects (I consider restricted as also not safe), because we thought users only care about the objects that affect the use of a parallel plan.

> This way we could get rid of parallel_safety in the output record. If at all users
> want to see parallel restricted or safe objects, we can also have the
> counterparts pg_get_table_parallel_safe_objects and
> pg_get_table_parallel_restricted_objects. Of course, we can caution the user
> that execution of these functions might take longer and "might consume
> excessive memory while accumulating the output".
>
> Otherwise, we can have a single function pg_get_parallel_safety("table_name"
> IN, "parallel_safety" IN, "objid" OUT, "classid" OUT)? If required, we could
> name it pg_get_parallel_safety_of_table_objects.
>
> Thoughts?

I am not sure users will want to get the safe objects; do you have some use cases? If users do not care about the safe objects, I think they can use "SELECT * FROM pg_get_parallel_safety() WHERE parallel_safety = 'specified safety'" to get the specified objects.

> Although I have not looked at the patches, a few questions on the
> pg_get_parallel_safety function:
> 1) Will it parse all the expressions for the objects that are listed under "The
> objects that relate to the parallel safety of a DML target table are as follows:"
> in the upthread?

Yes. But some parsed expressions (such as a domain type's expression) can be found in the typcache; we just check the safety of these already-parsed expressions.

> 2) How will it behave if a partitioned table is passed to it? Will it recurse into
> all the partitions?

Yes, because both the parent table and the child tables will be inserted into, and the parallel-related objects on them will be executed. If users want to make sure the parallel insert succeeds, they need to check all the objects.

> 3) How will it behave if a foreign table is passed to it? Will it error out?

It currently does not error out. It will also check the objects on it and return the not-safe objects. Note: I consider the foreign table itself as a parallel restricted object, because it does not support a parallel insert FDW API for now.

> In general:
> 1) Is ALTER SET PARALLEL SAFETY on a partitioned table allowed?

Yes.

> If yes, will it be set based on all the partitions' parallel safety?

Yes. But the ALTER PARALLEL command itself does not check all the partitions' safety flags. The function pg_get_parallel_safety can return all the partitions' not-safe objects; the user should set the parallel safety based on the function's result.

> 2) How will users have to decide on the parallel safety of a foreign table, or a
> partitioned table with foreign partitions? Or is it that we set these tables
> parallel unsafe and don't do parallel inserts?

A foreign table itself is considered parallel restricted, because we do not support a parallel insert FDW API for now.

Best regards,
houzj
On Mon, Apr 26, 2021 at 4:56 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> > > The parallel attributes of a table describe its parallel safety when the user
> > > performs data-modification operations on it. So, it only limits the use of
> > > parallel plans for INSERT/UPDATE/DELETE.
> >
> > In that case, isn't it better to use the terminology "PARALLEL DML
> > SAFE/UNSAFE/RESTRICTED" in the code and docs? This way, it will be clear that
> > these tags don't affect parallel selects.
>
> Makes sense; I recall I have heard a similar suggestion before.
> If there are no other objections, I plan to change the command to PARALLEL DML
> in my next version of the patches.

+1 from me. Let's hear from others.

> > Will the planner parse and check parallel safety of index (where
> > clause) expressions in case of SELECTs? I'm not sure of this. But if it does, maybe
> > we could do the same thing for parallel DML as well for normal tables?
>
> The planner does not check the index expressions, because the expressions will
> not be used in a SELECT. I think an index expression is only used when a tuple
> is inserted into the index.

Oh.

> > What is the overhead of parsing index expressions? If the cost is heavy for
> > checking index expression parallel safety in the case of normal tables, then the
> > current design, i.e. attributing a parallel safety tag to all the tables, makes sense.
>
> Currently, index expressions and predicates are stored in text format.
> We need to use stringToNode(expression/predicate) to parse them.
> Some committers think doing this twice does not look good,
> unless we find some way to pass the parsed info to the executor to avoid the second parse.

How much extra cost is added if we do stringToNode twice? Maybe we can check a few sample test cases by just having stringToNode(expression/predicate) in the planner and seeing whether it adds much to the total execution time of the query.
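For instance, a sample test could be as simple as timing a bulk INSERT into a table with a non-trivial expression index, with and without the extra stringToNode() call in the planner (the table and expressions here are made up, and psql's \timing is just one way to measure):

  CREATE TABLE t (c1 int, c2 text);
  CREATE INDEX t_expr_idx ON t (lower(c2), (c1 % 997));
  \timing on
  INSERT INTO t SELECT g, md5(g::text) FROM generate_series(1, 1000000) g;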
> > I was actually thinking that we will have the declarative approach only for
> > partitioned tables, as that is the main problem we are trying to solve with this
> > design. Something like: users will run pg_get_parallel_safety to see the parallel
> > unsafe objects associated with a partitioned table, by looking at all of its
> > partitions, and be able to set a parallel dml safety tag on only partitioned tables.
>
> We originally wanted to make the declarative approach apply to both normal and partitioned tables.
> In this way, it will not bring any overhead to the planner and looks consistent.
> But, do you think we should put some really cheap safety check in the planner?

I still feel that why shouldn't we limit the declarative approach to only partitioned tables? And for normal tables, possibly with a minimal cost(??), the server can do the safety checking. I know this feels a little inconsistent. In the planner we will have different paths like: if (partitioned_table) { check the parallel safety tag associated with the table } else { perform the parallel safety checks on the associated objects }. Others may have better thoughts on this.

> > > 0001: provide a utility function pg_get_parallel_safety('table_name').
> > >
> > > The function returns records of (objid, classid, parallel_safety) that
> > > represent the parallel safety of the objects that determine the parallel
> > > safety of the specified table.
> > > Note: the function only outputs objects that are not parallel safe.
> >
> > If it returns only parallel "unsafe" objects and not "safe" or "restricted" objects,
> > how about naming it pg_get_table_parallel_unsafe_objects("table_name")?
>
> Currently, the function returns both unsafe and restricted objects (I consider
> restricted as also not safe), because we thought users only care about the
> objects that affect the use of a parallel plan.

Hm.

> > This way we could get rid of parallel_safety in the output record. If at all users
> > want to see parallel restricted or safe objects, we can also have the
> > counterparts pg_get_table_parallel_safe_objects and
> > pg_get_table_parallel_restricted_objects. Of course, we can caution the user
> > that execution of these functions might take longer and "might consume
> > excessive memory while accumulating the output".
> >
> > Otherwise, we can have a single function pg_get_parallel_safety("table_name"
> > IN, "parallel_safety" IN, "objid" OUT, "classid" OUT)? If required, we could
> > name it pg_get_parallel_safety_of_table_objects.
> >
> > Thoughts?
>
> I am not sure users will want to get the safe objects; do you have some use cases?
> If users do not care about the safe objects, I think they can use
> "SELECT * FROM pg_get_parallel_safety() WHERE parallel_safety = 'specified safety'"
> to get the specified objects.

I don't know any practical scenarios, but if I'm a user, at least I will be tempted to see the parallel safe objects associated with a particular table along with the unsafe and restricted ones. Others may have better thoughts on this.

> > Although I have not looked at the patches, a few questions on the
> > pg_get_parallel_safety function:
> > 1) Will it parse all the expressions for the objects that are listed under "The
> > objects that relate to the parallel safety of a DML target table are as follows:"
> > in the upthread?
>
> Yes. But some parsed expressions (such as a domain type's expression) can be
> found in the typcache; we just check the safety of these already-parsed expressions.

Oh.

> > 2) How will it behave if a partitioned table is passed to it? Will it recurse into
> > all the partitions?
>
> Yes, because both the parent table and the child tables will be inserted into,
> and the parallel-related objects on them will be executed. If users want to make
> sure the parallel insert succeeds, they need to check all the objects.

Then, running pg_get_parallel_safety will have some overhead if there are many partitions associated with a table. And this is the overhead the planner would have had to incur without the declarative approach, which we are trying to avoid with this design.

> > 3) How will it behave if a foreign table is passed to it? Will it error out?
>
> It currently does not error out.
> It will also check the objects on it and return the not-safe objects.
> Note: I consider the foreign table itself as a parallel restricted object, because
> it does not support a parallel insert FDW API for now.
>
> > 2) How will users have to decide on the parallel safety of a foreign table, or a
> > partitioned table with foreign partitions? Or is it that we set these tables
> > parallel unsafe and don't do parallel inserts?
>
> A foreign table itself is considered parallel restricted,
> because we do not support a parallel insert FDW API for now.

Maybe the ALTER TABLE ... SET PARALLEL on a foreign table should always default to parallel restricted and emit a warning saying the reason?

> > In general:
> > 1) Is ALTER SET PARALLEL SAFETY on a partitioned table allowed?
>
> Yes.
> > If yes, will it be set based on all the partitions' parallel safety?
>
> Yes.
> But the ALTER PARALLEL command itself does not check all the partitions' safety flags.
> The function pg_get_parallel_safety can return all the partitions' not-safe objects;
> the user should set the parallel safety based on the function's result.

I'm thinking that when users say ALTER TABLE partitioned_table SET PARALLEL TO 'safe';, we check all the partitions' and their associated objects' parallel safety? Only if all are parallel safe do we set partitioned_table as parallel safe. What should happen if the parallel safety of any of the associated objects/partitions changes after setting the partitioned_table safety?

My understanding was that the command ALTER TABLE ... SET PARALLEL TO 'safe' will check the parallel safety of all the objects associated with the table. If the objects are all parallel safe, then the table will be set to safe. If at least one object is parallel unsafe or restricted, then the command will fail. I was also thinking about how the design will cope with situations such as the parallel safety of any of the associated objects changing after setting the table to parallel safe: the planner would have relied on the outdated parallel safety of the table and chosen parallel inserts, and the executor would catch such situations. Looks like my understanding was wrong.

So, the ALTER TABLE ... SET PARALLEL TO command just sets the target table's safety and doesn't bother with what the associated objects' safety is. It just believes the user. If at all there are any parallel unsafe objects, they will be caught by the executor. Just like setting the parallel safety of functions/aggregates, the docs caution users about accidentally/intentionally tagging parallel unsafe functions/aggregates as parallel safe.

Note: by the table objects I mean the ones that are listed under "The objects that relate to the parallel safety of a DML target table are as follows:" in the upthread.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
> > Currently, index expressions and predicates are stored in text format.
> > We need to use stringToNode(expression/predicate) to parse them.
> > Some committers think doing this twice does not look good, unless we
> > find some way to pass the parsed info to the executor to avoid the second parse.
>
> How much extra cost is added if we do stringToNode twice?
> Maybe we can check a few sample test cases by just having
> stringToNode(expression/predicate) in the planner and seeing whether it adds much
> to the total execution time of the query.

OK, I will do some tests on it.

> > Yes, because both the parent table and the child tables will be inserted into,
> > and the parallel-related objects on them will be executed. If users want to make
> > sure the parallel insert succeeds, they need to check all the objects.
>
> Then, running pg_get_parallel_safety will have some overhead if there are
> many partitions associated with a table. And this is the overhead the planner
> would have had to incur without the declarative approach, which we are trying
> to avoid with this design.

Yes, I think putting such overhead in a separate function is better than putting it in a common path (the planner).

> > A foreign table itself is considered parallel restricted, because we
> > do not support a parallel insert FDW API for now.
>
> Maybe the ALTER TABLE ... SET PARALLEL on a foreign table should always default to
> parallel restricted and emit a warning saying the reason?

Thanks for the comment, I agree. I will change this in the next version of the patches.

> > But the ALTER PARALLEL command itself does not check all the partitions'
> > safety flags.
> > The function pg_get_parallel_safety can return all the partitions' not-safe
> > objects; the user should set the parallel safety based on the function's result.
>
> I'm thinking that when users say ALTER TABLE partitioned_table SET PARALLEL
> TO 'safe';, we check all the partitions' and their associated objects' parallel
> safety? Only if all are parallel safe do we set partitioned_table as parallel safe.
> What should happen if the parallel safety of any of the associated
> objects/partitions changes after setting the partitioned_table safety?

Currently, nothing happens if any of the associated objects/partitions changes after setting the partitioned table's safety. This is because we do not have a really cheap way to catch the change. The existing relcache does not work, because ALTER FUNCTION does not invalidate the relcache of the relations the function belongs to. And it would bring some other overhead (locking, systable scans, ...) to find the tables the objects belong to.

> My understanding was that the command ALTER TABLE ... SET PARALLEL TO
> 'safe' will check the parallel safety of all the objects associated with the
> table. If the objects are all parallel safe, then the table will be set to safe.
> If at least one object is parallel unsafe or restricted, then the command will fail.

I think this idea makes sense; some details of the design can be improved. I agree with you that we can try to check all the partitions' and their associated objects' parallel safety at ALTER PARALLEL time. Because it's a separate new command, adding some overhead to it does not seem too bad. If there are no other objections, I plan to add the safety check to the ALTER PARALLEL command.

> I was also thinking about how the design will cope with situations such as the
> parallel safety of any of the associated objects changing after setting the table
> to parallel safe.
> The planner would have relied on the outdated parallel safety of the table and
> chosen parallel inserts, and the executor would catch such situations.
> Looks like my understanding was wrong.

Currently, we assume the user is responsible for its correctness. From our research, when the parallel safety of some of these objects is changed, it's costly to reflect it on the parallel safety of the tables that depend on them (we would need to scan pg_depend, pg_inherits, pg_index, ... to find the target tables).

> So, the ALTER TABLE ... SET PARALLEL TO command just sets the target table's
> safety and doesn't bother with what the associated objects' safety is.
> It just believes the user. If at all there are any parallel unsafe objects, they
> will be caught by the executor. Just like setting the parallel safety of the
> functions/aggregates, the docs caution users about accidentally/intentionally
> tagging parallel unsafe functions/aggregates as parallel safe.

Yes; thanks for looking into this.

Best regards,
houzj
On Tue, Apr 27, 2021 at 10:51 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I still feel that why shouldn't we limit the declarative approach to
> only partitioned tables? And for normal tables, possibly with a
> minimal cost(??), the server can do the safety checking. I know this
> feels a little inconsistent. In the planner we will have different
> paths like: if (partitioned_table) { check the parallel safety tag
> associated with the table } else { perform the parallel safety checks
> on the associated objects }.
>

Personally I think the simplest and best approach is just to do it consistently, using the declarative approach across all table types.

>
> Then, running pg_get_parallel_safety will have some overhead if
> there are many partitions associated with a table. And this is the
> overhead the planner would have had to incur without the declarative
> approach, which we are trying to avoid with this design.
>

The big difference is that pg_get_parallel_safety() is intended to be used during development, not as part of runtime parallel-safety checks (which are avoided using the declarative approach). So there is no runtime overhead associated with pg_get_parallel_safety().

>
> I'm thinking that when users say ALTER TABLE partitioned_table SET
> PARALLEL TO 'safe';, we check all the partitions' and their associated
> objects' parallel safety? Only if all are parallel safe do we set
> partitioned_table as parallel safe. What should happen if the parallel
> safety of any of the associated objects/partitions changes after
> setting the partitioned_table safety?
>

With the declarative approach, there is no parallel-safety checking on either CREATE or ALTER when the parallel safety is declared/set. It's up to the user to get it right. If it's actually wrong, it will be detected at runtime.

Regards,
Greg Nancarrow
Fujitsu Australia
On Tue, Apr 27, 2021 at 7:45 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Tue, Apr 27, 2021 at 10:51 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > I still feel that why we shouldn't limit the declarative approach to
> > only partitioned tables? And for normal tables, possibly with a
> > minimal cost(??), the server can do the safety checking. I know this
> > feels a little inconsistent. In the planner we will have different
> > paths like: if (partitioned_table) { check the parallel safety tag
> > associated with the table } else { perform the parallel safety of the
> > associated objects }.
>
> Personally I think the simplest and best approach is just do it
> consistently, using the declarative approach across all table types.

Yeah, if we decide to go with a declarative approach then that sounds like a better approach.

> > Then, running the pg_get_parallel_safety will have some overhead if
> > there are many partitions associated with a table. And, this is the
> > overhead planner would have had to incur without the declarative
> > approach which we are trying to avoid with this design.
>
> The big difference is that pg_get_parallel_safety() is intended to be
> used during development, not as part of runtime parallel-safety checks
> (which are avoided using the declarative approach).
> So there is no runtime overhead associated with pg_get_parallel_safety().
>
> > I'm thinking that when users say ALTER TABLE partioned_table SET
> > PARALLEL TO 'safe';, we check all the partitions' and their associated
> > objects' parallel safety? If all are parallel safe, then only we set
> > partitioned_table as parallel safe. What should happen if the parallel
> > safety of any of the associated objects/partitions changes after
> > setting the partitioned_table safety?
>
> With the declarative approach, there is no parallel-safety checking on
> either the CREATE/ALTER when the parallel-safety is declared/set.
> It's up to the user to get it right. If it's actually wrong, it will
> be detected at runtime.

OTOH, even if we want to verify at DDL time, we won't be able to maintain that at a later point in time, say, if the user changes the parallel-safety of some function used in a check constraint. I think the function pg_get_parallel_safety() will help the user decide whether they can declare a table parallel-safe. Now, it is quite possible that the user later changes the parallel-safe property of some function; that should then be caught at runtime.

--
With Regards,
Amit Kapila.
On Tue, Apr 27, 2021 at 7:39 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> > I'm thinking that when users say ALTER TABLE partioned_table SET PARALLEL
> > TO 'safe';, we check all the partitions' and their associated objects' parallel
> > safety? If all are parallel safe, then only we set partitioned_table as parallel safe.
> > What should happen if the parallel safety of any of the associated
> > objects/partitions changes after setting the partitioned_table safety?
>
> Currently, nothing happens if any of the associated objects/partitions changes after
> setting the partitioned table's safety, because we do not have a really cheap way to
> catch the change. The existing relcache does not work because ALTER FUNCTION does not
> invalidate the relcache entries of the relations the function belongs to, and finding
> the tables the changed objects belong to would bring some other overhead
> (locking, systable scans, ...).

Makes sense. Anyway, the user is responsible for such changes; otherwise the executor can catch them at run time, and if not, users will see unintended consequences.

> > My understanding was that: the command ALTER TABLE ... SET PARALLEL TO
> > 'safe' will check the parallel safety of all the objects associated with the
> > table. If the objects are all parallel safe, then the table will be set to safe. If at
> > least one object is parallel unsafe or restricted, then the command will fail.
>
> I think this idea makes sense, and some details of the design can be improved.
> I agree with you that we can try to check all the partitions' and their associated
> objects' parallel safety when running ALTER ... PARALLEL.
> Because it's a separate new command, adding some overhead to it seems not too bad.
> If there are no other objections, I plan to add a safety check to the
> ALTER ... PARALLEL command.

Maybe we can make the parallel safety check of the associated objects/partitions optional for CREATE/ALTER DDLs, with the default being no checks performed. Both Greg and Amit agree that we don't have to perform any parallel safety checks during CREATE/ALTER DDLs.

> > I am also thinking about how the design will cope with situations such as the parallel
> > safety of any of the associated objects changing after setting the table to
> > parallel safe. The planner would have relied on the outdated parallel safety of
> > the table and chosen parallel inserts, and the executor will catch such situations.
> > Looks like my understanding was wrong.
>
> Currently, we assume the user is responsible for its correctness, because, from our
> research, when the parallel safety of some of these objects is changed, it's costly
> to reflect that on the parallel safety of the tables that depend on them
> (we need to scan pg_depend, pg_inherits, pg_index, ... to find the target tables).

Agree.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 27, 2021 at 7:45 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Tue, Apr 27, 2021 at 10:51 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > I still feel that why we shouldn't limit the declarative approach to
> > only partitioned tables? And for normal tables, possibly with a
> > minimal cost(??), the server can do the safety checking. I know this
> > feels a little inconsistent. In the planner we will have different
> > paths like: if (partitioned_table) { check the parallel safety tag
> > associated with the table } else { perform the parallel safety of the
> > associated objects }.
>
> Personally I think the simplest and best approach is just do it
> consistently, using the declarative approach across all table types.

Agree.

> > Then, running the pg_get_parallel_safety will have some overhead if
> > there are many partitions associated with a table. And, this is the
> > overhead planner would have had to incur without the declarative
> > approach which we are trying to avoid with this design.
>
> The big difference is that pg_get_parallel_safety() is intended to be
> used during development, not as part of runtime parallel-safety checks
> (which are avoided using the declarative approach).
> So there is no runtime overhead associated with pg_get_parallel_safety().

Yes, while we avoid the runtime overhead, we run the risk that the parallel safety of any of the underlying objects/functions/partitions has changed. This risk will anyway be unavoidable with the declarative approach.

> > I'm thinking that when users say ALTER TABLE partioned_table SET
> > PARALLEL TO 'safe';, we check all the partitions' and their associated
> > objects' parallel safety? If all are parallel safe, then only we set
> > partitioned_table as parallel safe. What should happen if the parallel
> > safety of any of the associated objects/partitions changes after
> > setting the partitioned_table safety?
>
> With the declarative approach, there is no parallel-safety checking on
> either the CREATE/ALTER when the parallel-safety is declared/set.
> It's up to the user to get it right. If it's actually wrong, it will
> be detected at runtime.

As I said upthread, we can provide the parallel safety check of the associated objects/partitions as an option, with the default being false (no checks). I'm not sure if this is a good thing to do at all. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 27, 2021 at 8:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Apr 27, 2021 at 7:45 AM Greg Nancarrow <gregn4422@gmail.com> wrote: > > > > On Tue, Apr 27, 2021 at 10:51 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > > I still feel that why we shouldn't limit the declarative approach to > > > only partitioned tables? And for normal tables, possibly with a > > > minimal cost(??), the server can do the safety checking. I know this > > > feels a little inconsistent. In the planner we will have different > > > paths like: if (partitioned_table) { check the parallel safety tag > > > associated with the table } else { perform the parallel safety of the > > > associated objects }. > > > > > > > Personally I think the simplest and best approach is just do it > > consistently, using the declarative approach across all table types. > > > > Yeah, if we decide to go with a declarative approach then that sounds > like a better approach. Agree. > > > Then, running the pg_get_parallel_safety will have some overhead if > > > there are many partitions associated with a table. And, this is the > > > overhead planner would have had to incur without the declarative > > > approach which we are trying to avoid with this design. > > > > > > > The big difference is that pg_get_parallel_safety() is intended to be > > used during development, not as part of runtime parallel-safety checks > > (which are avoided using the declarative approach). > > So there is no runtime overhead associated with pg_get_parallel_safety(). > > > > > > > > I'm thinking that when users say ALTER TABLE partioned_table SET > > > PARALLEL TO 'safe';, we check all the partitions' and their associated > > > objects' parallel safety? If all are parallel safe, then only we set > > > partitioned_table as parallel safe. What should happen if the parallel > > > safety of any of the associated objects/partitions changes after > > > setting the partitioned_table safety? > > > > > > > With the declarative approach, there is no parallel-safety checking on > > either the CREATE/ALTER when the parallel-safety is declared/set. > > It's up to the user to get it right. If it's actually wrong, it will > > be detected at runtime. > > > > OTOH, even if we want to verify at DDL time, we won't be able to > maintain it at the later point of time say if user changed the > parallel-safety of some function used in check constraint. I think the > function pg_get_parallel_safety() will help the user to decide whether > it can declare table parallel-safe. Now, it is quite possible that the > user can later change the parallel-safe property of some function then > that should be caught at runtime. Yeah, this is an unavoidable problem with the declarative approach. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
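The scenario everyone is accepting here can be sketched as follows. The object names are hypothetical, and the behaviour shown is simply the one described in this thread: the declaration is trusted at plan time, and a violation is only caught (if at all) while the statement runs.

  -- Declared safe by the user after checking the associated objects.
  ALTER TABLE orders PARALLEL SAFE;

  -- Later, someone downgrades a trigger function used by the table.
  ALTER FUNCTION orders_audit_trg() PARALLEL UNSAFE;

  -- The table is still marked safe, so the planner may still choose a
  -- parallel plan; per the design above, the mismatch would only surface
  -- at execution time, or lead to unintended behaviour if undetected.
  INSERT INTO orders SELECT * FROM orders_staging;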
Hi,

Attaching new version patches with the following changes:

0002:
1) Change "ALTER/CREATE TABLE PARALLEL SAFE" to "ALTER/CREATE TABLE PARALLEL DML SAFE".
2) Disallow temp/foreign tables from being marked parallel safe.

0003:
1) Temporarily, add the parallel-safety check of built-in functions by adding a member for proparallel in FmgrBuiltin.
To avoid enlarging the FmgrBuiltin struct, change the existing bool members strict and retset into one member of type char, and represent the original values with some bit flags.

Note: this will lock down the parallel property of built-in functions, but I think the parallel safety of a built-in function is tied to its C code; users should not change the property unless they change the code. So, I think it might be better to disallow changing parallel safety for built-in functions. Thoughts?

I have not added the parallel safety check to the ALTER/CREATE TABLE PARALLEL DML SAFE command. I think it seems better to add it after some more discussion.

Best regards,
houzj
Attachment
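For reference, a rough sketch of what the renamed syntax and the new temp/foreign-table restriction in 0002 would look like in use; the table names are made up, and the details follow only the description in the mail above, not the patch text itself:

  CREATE TABLE sales (id int, amount numeric) PARALLEL DML SAFE;
  ALTER TABLE sales PARALLEL DML RESTRICTED;

  -- Per 0002, temporary and foreign tables can no longer be declared safe,
  -- so something like the following would now be rejected:
  CREATE TEMPORARY TABLE sales_tmp (id int) PARALLEL DML SAFE;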
On Wed, Apr 28, 2021 at 12:44 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > 0003: > 1) Temporarily, add the check of built-in function by adding a member for proparallel in FmgrBuiltin. > To avoid enlarging FmgrBuiltin struct , change the existing bool members strict and and retset into > one member of type char, and represent the original values with some bit flags. > I was thinking that it would be better to replace the two bool members with one "unsigned char" member for the bitflags for strict and retset, and another "char" member for parallel. The struct would still remain the same size as it originally was, and you wouldn't need to convert between bit settings and char ('u'/'r'/'s') each time a built-in function was checked for parallel-safety in fmgr_info(). > Note: this will lock down the parallel property of built-in function, but, I think the parallel safety of built-in function > is related to the C code, user should not change the property of it unless they change its code. So, I think it might be > better to disallow changing parallel safety for built-in functions, Thoughts ? > I'd vote for disallowing it (unless someone can justify why it currently is allowed). > I have not added the parallel safety check in ALTER/CREATE table PARALLEL DML SAFE command. > I think it seems better to add it after some more discussion. > I'd vote for not adding such a check (as this is a declaration). Some additional comments: 1) In patch 0002 comment, it says: This property is recorded in pg_class's relparallel column as 'u', 'r', or 's', just like pg_proc's proparallel. The default is UNSAFE. It should say "relparalleldml" column. 2) With the patches applied, I seem to be getting a couple of errors when running "make installcheck-world" with force_parallel_mode=regress in effect. Can you please try it? Regards, Greg Nancarrow Fujitsu Australia
On Mon, Apr 12, 2021 at 6:52 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > > SOLUTION TO (1) > ======================================== > > The candidate ideas are: > > 1) Caching the result of parallel-safety check > The planner stores the result of checking parallel safety for each relation in relcache, or some purpose-built hash tablein shared memory. > > The problems are: > > * Even if the target relation turns out to be parallel safe by looking at those data structures, we cannot assume it remainstrue until the SQL statement finishes. For instance, other sessions might add a parallel-unsafe index to its descendantrelations. Other examples include that when the user changes the parallel safety of indexes or triggers by runningALTER FUNCTION on the underlying index AM function or trigger function, the relcache entry of the table or index isnot invalidated, so the correct parallel safety is not maintained in the cache. > In that case, when the executor encounters a parallel-unsafe object, it can change the cached state as being parallel-unsafeand error out. > > * Can't ensure fast access. With relcache, the first access in each session has to undergo the overhead of parallel-safetycheck. With a hash table in shared memory, the number of relations stored there would be limited, so thefirst access after database startup or the hash table entry eviction similarly experiences slowness. > > * With a new hash table, some lwlock for concurrent access must be added, which can have an adverse effect on performance. > > > 2) Enabling users to declare that the table allows parallel data modification > Add a table property that represents parallel safety of the table for DML statement execution. Users specify it as follows: > > CREATE TABLE table_name (...) PARALLEL { UNSAFE | RESTRICTED | SAFE }; > ALTER TABLE table_name PARALLEL { UNSAFE | RESTRICTED | SAFE }; > > This property is recorded in pg_class's relparallel column as 'u', 'r', or 's', just like pg_proc's proparallel. The defaultis UNSAFE. > So, in short, if we need to go with any sort of solution with caching, we can't avoid (a) locking all the partitions (b) getting an error while execution because at a later point user has altered the parallel-safe property of a relation. We can't avoid locking all the partitions because while we are executing the statement, the user can change the parallel-safety for one of the partitions by changing a particular partition and if we didn't have a lock on that partition, it will lead to an error during execution. Now, here, one option could be that we document this point and then don't take lock on any of the partitions except for root table. So, the design would be simpler, that we either cache the parallel-safe in relcache or shared hash table and just lock the parent table and perform all parallel-safety checks for the first time. I think if we want to go with the behavior that we will error out during statement execution if any parallel-safe property is changed at run-time, it is better to go with the declarative approach. In the declarative approach, at least the user will be responsible for taking any such decision and the chances of toggling the parallel-safe property will be less. To aid users, as suggested, we can provide a function to determine parallel-safety of relation for DML operations. Now, in the declarative approach, we can either go with whatever the user has mentioned or we can do some checks during DDL to determine the actual parallel-safety. 
I think even if try to determine parallel-safety during DDL it will be quite tricky in some cases, like when a user performs Alter Function to change parallel-safety of the function used in some constraint for the table or if the user changes parallel-safety of one of the partition then we need to traverse the partition hierarchy upwards which doesn't seem advisable. So, I guess it is better to go with whatever the user has mentioned but if you or others feel we can have some sort of parallel-safety checks during DDL as well. > The planner assumes that all of the table, its descendant partitions, and their ancillary objects have the specified parallelsafety or safer one. The user is responsible for its correctness. If the parallel processes find an object thatis less safer than the assumed parallel safety during statement execution, it throws an ERROR and abort the statementexecution. > > The objects that relate to the parallel safety of a DML target table are as follows: > > * Column default expression > * DOMAIN type CHECK expression > * CHECK constraints on column > * Partition key > * Partition key support function > * Index expression > * Index predicate > * Index AM function > * Operator function > * Trigger function > > When the parallel safety of some of these objects is changed, it's costly to reflect it on the parallel safety of tablesthat depend on them. So, we don't do it. Instead, we provide a utility function pg_get_parallel_safety('table_name')that returns records of (objid, classid, parallel_safety) that represent the parallelsafety of objects that determine the parallel safety of the specified table. The function only outputs objects thatare not parallel safe. > So, users need to check count(*) for this to determine parallel-safety? How about if we provide a wrapper function on top of this function or a separate function that returns char to indicate whether it is safe, unsafe, or restricted to perform a DML operation on the table? > How does the executor detect parallel unsafe objects? There are two ways: > > 1) At loading time > When the executor loads the definition of objects (tables, constraints, index, triggers, etc.) during the first accessto them after session start or their eviction by sinval message, it checks the parallel safety. > > This is a legitimate way, but may need much code. Also, it might overlook necessary code changes without careful inspection. > If we want to go with a declarative approach, then I think we should try to do this because it will be almost free in some cases and we can detect error early. For example, when we decide to insert in a partition that is declared unsafe whereas the root (partitioned) table is declared safe. -- With Regards, Amit Kapila.
On Mon, May 10, 2021 at 11:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Now, in the declarative approach, we can either go with whatever the > user has mentioned or we can do some checks during DDL to determine > the actual parallel-safety. I think even if try to determine > parallel-safety during DDL it will be quite tricky in some cases, like > when a user performs Alter Function to change parallel-safety of the > function used in some constraint for the table or if the user changes > parallel-safety of one of the partition then we need to traverse the > partition hierarchy upwards which doesn't seem advisable. So, I guess > it is better to go with whatever the user has mentioned but if you or > others feel we can have some sort of parallel-safety checks during DDL > as well. IMHO, it makes sense to go with what the user has declared to avoid complexity. And, I don't see any problem with that. > > The planner assumes that all of the table, its descendant partitions, and their ancillary objects have the specifiedparallel safety or safer one. The user is responsible for its correctness. If the parallel processes find an objectthat is less safer than the assumed parallel safety during statement execution, it throws an ERROR and abort the statementexecution. > > > > The objects that relate to the parallel safety of a DML target table are as follows: > > > > * Column default expression > > * DOMAIN type CHECK expression > > * CHECK constraints on column > > * Partition key > > * Partition key support function > > * Index expression > > * Index predicate > > * Index AM function > > * Operator function > > * Trigger function > > > > When the parallel safety of some of these objects is changed, it's costly to reflect it on the parallel safety of tablesthat depend on them. So, we don't do it. Instead, we provide a utility function pg_get_parallel_safety('table_name')that returns records of (objid, classid, parallel_safety) that represent the parallelsafety of objects that determine the parallel safety of the specified table. The function only outputs objects thatare not parallel safe. > > > > So, users need to check count(*) for this to determine > parallel-safety? How about if we provide a wrapper function on top of > this function or a separate function that returns char to indicate > whether it is safe, unsafe, or restricted to perform a DML operation > on the table? Such wrapper function make sense. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
> > > When the parallel safety of some of these objects is changed, it's costly to > reflect it on the parallel safety of tables that depend on them. So, we don't do > it. Instead, we provide a utility function pg_get_parallel_safety('table_name') > that returns records of (objid, classid, parallel_safety) that represent the > parallel safety of objects that determine the parallel safety of the specified > table. The function only outputs objects that are not parallel safe. > > > > > > > So, users need to check count(*) for this to determine > > parallel-safety? How about if we provide a wrapper function on top of > > this function or a separate function that returns char to indicate > > whether it is safe, unsafe, or restricted to perform a DML operation > > on the table? > > Such wrapper function make sense. Thanks for the suggestion, and I agree. I will add another wrapper function and post new version patches soon. Best regards, houzj
> > > So, users need to check count(*) for this to determine
> > > parallel-safety? How about if we provide a wrapper function on top
> > > of this function or a separate function that returns char to
> > > indicate whether it is safe, unsafe, or restricted to perform a DML
> > > operation on the table?
> >
> > Such wrapper function make sense.
>
> Thanks for the suggestion, and I agree.
> I will add another wrapper function and post new version patches soon.

Attaching new version patches with the following changes:

0001
Add a new function pg_get_max_parallel_hazard('table_name') that returns a char ('s', 'u', 'r') indicating whether it is safe, unsafe, or restricted to perform DML on the table.

0003
Temporarily, I removed the safety check for functions in the executor, because we are trying to post the safety check as a separate patch which can help detect parallel-unsafe functions in parallel mode, and that approach is still under discussion[1]. Comments and suggestions are welcome either in that thread[1] or here.

[1] https://www.postgresql.org/message-id/OS0PR01MB571646637784DAF1DD4C8BE994539%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
houzj
Attachment
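A quick usage sketch of the new wrapper, with a hypothetical table name; note that in this patch version the argument is still a text name (it is switched to regclass a little later in the thread):

  -- Returns a single char summarising the worst DML hazard for the table:
  --   's' = safe, 'r' = restricted, 'u' = unsafe
  SELECT pg_get_max_parallel_hazard('sales');

  -- Drill down with the detailed function only when the quick check
  -- reports something other than 's'.
  SELECT * FROM pg_get_parallel_safety('sales');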
On Tue, May 11, 2021 at 6:11 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > > > > So, users need to check count(*) for this to determine > > > > parallel-safety? How about if we provide a wrapper function on top > > > > of this function or a separate function that returns char to > > > > indicate whether it is safe, unsafe, or restricted to perform a DML > > > > operation on the table? > > > > > > Such wrapper function make sense. > > > > Thanks for the suggestion, and I agree. > > I will add another wrapper function and post new version patches soon. > > Attaching new version patches with the following changes: > > 0001 > Add a new function pg_get_max_parallel_hazard('table_name') returns char('s', 'u', 'r') > which indicate whether it is safe, unsafe, or restricted to perform a DML. Thanks for the patches. I think we should have the table name as regclass type for pg_get_max_parallel_hazard? See, pg_relation_size, pg_table_size, pg_filenode_relation and so on. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Tue, May 11, 2021 at 10:41 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > Attaching new version patches with the following changes: > > 0001 > Add a new function pg_get_max_parallel_hazard('table_name') returns char('s', 'u', 'r') > which indicate whether it is safe, unsafe, or restricted to perform a DML. > Currently the 1st patch actually also contains the changes for the Parallel-SELECT-for-INSERT functionality. This is not obvious. So I think this should be split out into a separate patch (i.e. the minor planner update and related planner comment changes, is_parallel_allowed_for_modify() function, max_parallel_hazard() update, XID changes). Also, the regression tests' "serial_schedule" file has been removed since you posted the v3-POC patches, so you need to remove updates for that from your 3rd patch. How about reorganisation of the patches like the following? 0001: CREATE ALTER TABLE PARALLEL DML 0002: parallel-SELECT-for-INSERT (planner changes, max_parallel_hazard() update, XID changes) 0003: pg_get_parallel_safety() 0004: regression test updates Regards, Greg Nancarrow Fujitsu Australia
>
> Currently the 1st patch actually also contains the changes for the
> Parallel-SELECT-for-INSERT functionality. This is not obvious.
> So I think this should be split out into a separate patch (i.e. the minor planner
> update and related planner comment changes,
> is_parallel_allowed_for_modify() function, max_parallel_hazard() update, XID
> changes).
> Also, the regression tests' "serial_schedule" file has been removed since you
> posted the v3-POC patches, so you need to remove updates for that from your
> 3rd patch.

Thanks for the comments; I have posted new version patches with these changes.

> How about reorganisation of the patches like the following?
> 0001: CREATE ALTER TABLE PARALLEL DML
> 0002: parallel-SELECT-for-INSERT (planner changes,
> max_parallel_hazard() update, XID changes)
> 0003: pg_get_parallel_safety()
> 0004: regression test updates

Thanks, it looks good and I reorganized the latest patchset in this way.

Attaching new version patches with the following changes:

0003
Change the functions' argument type to regclass.

0004
Remove updates for "serial_schedule".

Best regards,
houzj
Attachment
> > > > > So, users need to check count(*) for this to determine > > > > > parallel-safety? How about if we provide a wrapper function on > > > > > top of this function or a separate function that returns char to > > > > > indicate whether it is safe, unsafe, or restricted to perform a > > > > > DML operation on the table? > > > > > > > > Such wrapper function make sense. > > > > > > Thanks for the suggestion, and I agree. > > > I will add another wrapper function and post new version patches soon. > > > > Attaching new version patches with the following changes: > > > > 0001 > > Add a new function pg_get_max_parallel_hazard('table_name') returns > > char('s', 'u', 'r') which indicate whether it is safe, unsafe, or restricted to > perform a DML. > > Thanks for the patches. I think we should have the table name as regclass type > for pg_get_max_parallel_hazard? See, pg_relation_size, pg_table_size, > pg_filenode_relation and so on. Thanks for the comment. I have changed the type to regclass in the latest patchset. Best regards, houzj
On Fri, May 14, 2021 at 6:24 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > Thanks for the comments, I have posted new version patches with this change. > > > How about reorganisation of the patches like the following? > > 0001: CREATE ALTER TABLE PARALLEL DML > > 0002: parallel-SELECT-for-INSERT (planner changes, > > max_parallel_hazard() update, XID changes) > > 0003: pg_get_parallel_safety() > > 0004: regression test updates > > Thanks, it looks good and I reorganized the latest patchset in this way. > > Attaching new version patches with the following change. > > 0003 > Change functions arg type to regclass. > > 0004 > remove updates for "serial_schedule". > I've got some comments for the V4 set of patches: (0001) (i) Patch comment needs a little updating (suggested change is below): Enable users to declare a table's parallel data-modification safety (SAFE/RESTRICTED/UNSAFE). Add a table property that represents parallel safety of a table for DML statement execution. It may be specified as follows: CREATE TABLE table_name PARALLEL DML { UNSAFE | RESTRICTED | SAFE }; ALTER TABLE table_name PARALLEL DML { UNSAFE | RESTRICTED | SAFE }; This property is recorded in pg_class's relparallel column as 'u', 'r', or 's', just like pg_proc's proparallel. The default is UNSAFE. The planner assumes that all of the table, its descendant partitions, and their ancillary objects have, at worst, the specified parallel safety. The user is responsible for its correctness. --- NOTE: The following sentence was removed from the original V4 0001 patch comment (since this version of the patch is not doing runtime parallel-safety checks on functions):. If the parallel processes find an object that is less safer than the assumed parallel safety during statement execution, it throws an ERROR and abort the statement execution. (ii) Update message to say "a foreign ...": BEFORE: + errmsg("cannot support parallel data modification on foreign or temporary table"))); AFTER: + errmsg("cannot support parallel data modification on a foreign or temporary table"))); (iii) strVal() macro already casts to "Value *", so the cast can be removed from the following: + char *parallel = strVal((Value *) def); (0003) (i) Suggested updates to the patch comment: Provide a utility function "pg_get_parallel_safety(regclass)" that returns records of (objid, classid, parallel_safety) for all parallel unsafe/restricted table-related objects from which the table's parallel DML safety is determined. The user can use this information during development in order to accurately declare a table's parallel DML safety. or to identify any problematic objects if a parallel DML fails or behaves unexpectedly. When the use of an index-related parallel unsafe/restricted function is detected, both the function oid and the index oid are returned. Provide a utility function "pg_get_max_parallel_hazard(regclass)" that returns the worst parallel DML safety hazard that can be found in the given relation. Users can use this function to do a quick check without caring about specific parallel-related objects. Regards, Greg Nancarrow Fujitsu Australia
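Given the (objid, classid, parallel_safety) record shape described above, one way a user could make the output readable is to feed it through pg_describe_object(); this is only a sketch, assuming the function exposes exactly those output column names and the hypothetical table "sales":

  SELECT s.parallel_safety,
         pg_describe_object(s.classid, s.objid, 0) AS object
  FROM pg_get_parallel_safety('sales'::regclass) AS s;

  -- A row might, for example, point at a parallel-restricted trigger
  -- function, or at both an index and the unsafe function used in its
  -- expression.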
From: Greg Nancarrow <gregn4422@gmail.com> Sent: Wednesday, May 19, 2021 7:55 PM > > On Fri, May 14, 2021 at 6:24 PM houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > > > Thanks for the comments, I have posted new version patches with this > change. > > > > > How about reorganisation of the patches like the following? > > > 0001: CREATE ALTER TABLE PARALLEL DML > > > 0002: parallel-SELECT-for-INSERT (planner changes, > > > max_parallel_hazard() update, XID changes) > > > 0003: pg_get_parallel_safety() > > > 0004: regression test updates > > > > Thanks, it looks good and I reorganized the latest patchset in this way. > > > > Attaching new version patches with the following change. > > > > 0003 > > Change functions arg type to regclass. > > > > 0004 > > remove updates for "serial_schedule". > > > > I've got some comments for the V4 set of patches: > > (0001) > > (i) Patch comment needs a little updating (suggested change is below): > > Enable users to declare a table's parallel data-modification safety > (SAFE/RESTRICTED/UNSAFE). > > Add a table property that represents parallel safety of a table for > DML statement execution. > It may be specified as follows: > > CREATE TABLE table_name PARALLEL DML { UNSAFE | RESTRICTED | SAFE }; > ALTER TABLE table_name PARALLEL DML { UNSAFE | RESTRICTED | SAFE }; > > This property is recorded in pg_class's relparallel column as 'u', > 'r', or 's', just like pg_proc's proparallel. > The default is UNSAFE. > > The planner assumes that all of the table, its descendant partitions, > and their ancillary objects have, > at worst, the specified parallel safety. The user is responsible for > its correctness. > > --- > > NOTE: The following sentence was removed from the original V4 0001 > patch comment (since this version of the patch is not doing runtime > parallel-safety checks on functions):. > > If the parallel processes > find an object that is less safer than the assumed parallel safety during > statement execution, it throws an ERROR and abort the statement execution. > > > (ii) Update message to say "a foreign ...": > > BEFORE: > + errmsg("cannot support parallel data modification on foreign or > temporary table"))); > > AFTER: > + errmsg("cannot support parallel data modification on a foreign or > temporary table"))); > > > (iii) strVal() macro already casts to "Value *", so the cast can be > removed from the following: > > + char *parallel = strVal((Value *) def); > > > (0003) > > (i) Suggested updates to the patch comment: > > Provide a utility function "pg_get_parallel_safety(regclass)" that > returns records of > (objid, classid, parallel_safety) for all parallel unsafe/restricted > table-related objects > from which the table's parallel DML safety is determined. The user can > use this information > during development in order to accurately declare a table's parallel > DML safety. or to > identify any problematic objects if a parallel DML fails or behaves > unexpectedly. > > When the use of an index-related parallel unsafe/restricted function > is detected, both the > function oid and the index oid are returned. > > Provide a utility function "pg_get_max_parallel_hazard(regclass)" that > returns the worst > parallel DML safety hazard that can be found in the given relation. > Users can use this > function to do a quick check without caring about specific > parallel-related objects. Thanks for the comments and your descriptions looks good. Attaching v5 patchset with all these changes. Best regards, houzj
Attachment
On Mon, May 24, 2021 at 3:15 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > > Thanks for the comments and your descriptions looks good. > Attaching v5 patchset with all these changes. > A few other minor things I noticed: (1) error message wording when declaring a table SAFE for parallel DML src/backend/commands/tablecmds.c Since data modification for the RELKIND_FOREIGN_TABLE and RELPERSISTENCE_TEMP types are allowed in the parallel-restricted case (i.e. leader may modify in parallel mode) I'm thinking it may be better to use wording like: "cannot support foreign or temporary table data modification by parallel workers" instead of "cannot support parallel data modification on a foreign or temporary table" There are TWO places where this error message is used. (What do you think?) (2) Minor formatting issue src/backend/optimizer/util/clauses.c static safety_object *make_safety_object(Oid objid, Oid classid, char proparallel) should be: static safety_object * make_safety_object(Oid objid, Oid classid, char proparallel) (3) Minor formatting issue src/backend/utils/cache/typcache.c List *GetDomainConstraints(Oid type_id) should be: List * GetDomainConstraints(Oid type_id) Regards, Greg Nancarrow Fujitsu Australia
From: Greg Nancarrow <gregn4422@gmail.com> Sent: Friday, May 28, 2021 4:42 PM > On Mon, May 24, 2021 at 3:15 PM houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > > > > > Thanks for the comments and your descriptions looks good. > > Attaching v5 patchset with all these changes. > > > > A few other minor things I noticed: > > (1) error message wording when declaring a table SAFE for parallel DML > > src/backend/commands/tablecmds.c > > Since data modification for the RELKIND_FOREIGN_TABLE and > RELPERSISTENCE_TEMP types are allowed in the parallel-restricted case (i.e. > leader may modify in parallel mode) I'm thinking it may be better to use > wording like: > > "cannot support foreign or temporary table data modification by parallel > workers" > > instead of > > "cannot support parallel data modification on a foreign or temporary table" > > There are TWO places where this error message is used. > > (What do you think?) I think your change looks good. I used your msg in the latest patchset. > (2) Minor formatting issue > > src/backend/optimizer/util/clauses.c > > static safety_object *make_safety_object(Oid objid, Oid classid, char > proparallel) > > should be: > > static safety_object * > make_safety_object(Oid objid, Oid classid, char proparallel) Changed. > (3) Minor formatting issue > > src/backend/utils/cache/typcache.c > > > List *GetDomainConstraints(Oid type_id) > > should be: > > List * > GetDomainConstraints(Oid type_id) Changed. Attaching v6 patchset. And I registered it in CF https://commitfest.postgresql.org/33/3143/, comments are welcome. Best regards, houzj
Attachment
On Mon, May 31, 2021 at 3:34 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > > Attaching v6 patchset. > And I registered it in CF https://commitfest.postgresql.org/33/3143/, > comments are welcome. > The latest patchset has some documentation updates. I'd like to suggest a couple of documentation tweaks (this is mainly just minor word changes and some extra details): (1) doc/src/sgml/ref/create_foreign_table.sgml doc/src/sgml/ref/create_table.sgml PARALLEL DML UNSAFE indicates that the data in the table can't be modified in parallel mode, and this forces a serial execution plan for DML statements operating on the table. This is the default. PARALLEL DML RESTRICTED indicates that the data in the table can be modified in parallel mode, but the modification is restricted to the parallel group leader. PARALLEL DML SAFE indicates that the data in the table can be modified in parallel mode without restriction. Note that PostgreSQL currently does not support data modification by parallel workers. Tables should be labeled parallel dml unsafe/restricted if any parallel unsafe/restricted function could be executed when modifying the data in the table (e.g., functions in triggers/index expressions/constraints etc.). To assist in correctly labeling the parallel DML safety level of a table, PostgreSQL provides some utility functions that may be used during application development. Refer to pg_get_parallel_safety() and pg_get_max_parallel_hazard() for more information. (2) doc/src/sgml/func.sgml (i) pg_get_parallel_safety Returns a row containing enough information to uniquely identify the parallel unsafe/restricted table-related objects from which the table's parallel DML safety is determined. The user can use this information during development in order to accurately declare a table's parallel DML safety, or to identify any problematic objects if parallel DML fails or behaves unexpectedly. Note that when the use of an object-related parallel unsafe/restricted function is detected, both the function OID and the object OID are returned. classid is the OID of the system catalog containing the object; objid is the OID of the object itself. (ii) pg_get_max_parallel_hazard Returns the worst parallel DML safety hazard that can be found in the given relation: s safe r restricted u unsafe Users can use this function to do a quick check without caring about specific parallel-related objects. --- Also, shouldn't support for "Parallel" be added for table output in PSQL? (e.g. \dt+) Regards, Greg Nancarrow Fujitsu Australia
From: Greg Nancarrow <gregn4422@gmail.com> > On Mon, May 31, 2021 at 3:34 PM houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > > > > > Attaching v6 patchset. > > And I registered it in CF https://commitfest.postgresql.org/33/3143/, > > comments are welcome. > > > > The latest patchset has some documentation updates. I'd like to suggest a > couple of documentation tweaks (this is mainly just minor word changes and > some extra details): > > (1) > doc/src/sgml/ref/create_foreign_table.sgml > doc/src/sgml/ref/create_table.sgml > > PARALLEL DML UNSAFE indicates that the data in the table can't be modified in > parallel mode, and this forces a serial execution plan for DML statements > operating on the table. This is the default. PARALLEL DML RESTRICTED > indicates that the data in the table can be modified in parallel mode, but the > modification is restricted to the parallel group leader. PARALLEL DML SAFE > indicates that the data in the table can be modified in parallel mode without > restriction. Note that PostgreSQL currently does not support data > modification by parallel workers. > > Tables should be labeled parallel dml unsafe/restricted if any parallel > unsafe/restricted function could be executed when modifying the data in the > table (e.g., functions in triggers/index expressions/constraints etc.). > > To assist in correctly labeling the parallel DML safety level of a table, > PostgreSQL provides some utility functions that may be used during > application development. Refer to pg_get_parallel_safety() and > pg_get_max_parallel_hazard() for more information. > > > (2) doc/src/sgml/func.sgml > > (i) pg_get_parallel_safety > Returns a row containing enough information to uniquely identify the parallel > unsafe/restricted table-related objects from which the table's parallel DML > safety is determined. The user can use this information during development in > order to accurately declare a table's parallel DML safety, or to identify any > problematic objects if parallel DML fails or behaves unexpectedly. Note that > when the use of an object-related parallel unsafe/restricted function is > detected, both the function OID and the object OID are returned. classid is the > OID of the system catalog containing the object; objid is the OID of the object > itself. > > (ii) pg_get_max_parallel_hazard > Returns the worst parallel DML safety hazard that can be found in the given > relation: > s safe > r restricted > u unsafe > Users can use this function to do a quick check without caring about specific > parallel-related objects. Thanks for looking into the doc change, I think your change looks better and have merged it in the attached patch. > Also, shouldn't support for "Parallel" be added for table output in PSQL? (e.g. > \dt+) Yeah, I think we should add it and I added it in the attached 0001 patch. Best regards, houzj
Attachment
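For illustration only, roughly what the psql addition might look like; the column name, position and values here are made up, since they are whatever the 0001 patch actually implements:

  =# \dt+ sales
                               List of relations
   Schema | Name  | Type  | Owner    | Parallel DML | Persistence | Size    | Description
  --------+-------+-------+----------+--------------+-------------+---------+-------------
   public | sales | table | postgres | safe         | permanent   | 0 bytes |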
From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> > Thanks for looking into the doc change, I think your change looks better and > have merged it in the attached patch. > > > Also, shouldn't support for "Parallel" be added for table output in PSQL? (e.g. > > \dt+) > > Yeah, I think we should add it and I added it in the attached 0001 patch. Oops, forgot to update the regression test in contrib/. Attaching new version patchset with this fix. Best regards, houzj
Attachment
Hi,

When testing the patches, I found some issues in the 0003 and 0004 patches. Attaching a new version patchset which fixes these issues.

0003
* Don't check the parallel safety of a partition's default column expressions.
* Rename some function/variable names to be consistent with existing code.
* Remove some unused function parameters.
* Fix a max_hazard overwrite issue.
* Add some code comments and adjust some code formatting.

0004
* Remove some unrelated comments in the regression test.
* Add 'begin;' and 'rollback;' in insert_parallel.sql.

Best regards,
houzj
Attachment
From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> > Hi, > > When testing the patch, I found some issues in the 0003,0004 patch. > Attaching new version patchset which fix these issues. > > 0003 > * don't check parallel safety of partition's default column expression. > * rename some function/variable names to be consistent with existing code. > * remove some unused function parameters. > * fix a max_hazard overwrite issue. > * add some code comments and adjust some code forms. > > 0004 > * Remove some unrelated comments in the regression test. > * add the 'begin;', 'rollback;' in insert_parallel.sql. Through further review and thanks for greg-san's suggestions, I attached a new version patchset with some minor change in 0001,0003 and 0004. 0001. * fix a typo in variable name. * add a TODO in patch comment about updating the version number when branch PG15. 0003 * fix a 'git apply white space' warning. * Remove some unnecessary if condition. * add some code comments above the safety check function. * Fix some typo. 0004 * add a testcase to test ALTER PARALLEL DML UNSAFE/RESTRICTED. Best regards, houzj
Attachment
On Thu, Jun 10, 2021 at 11:26 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > Through further review and thanks for greg-san's suggestions, > I attached a new version patchset with some minor change in 0001,0003 and 0004. > > 0001. > * fix a typo in variable name. > * add a TODO in patch comment about updating the version number when branch PG15. > > 0003 > * fix a 'git apply white space' warning. > * Remove some unnecessary if condition. > * add some code comments above the safety check function. > * Fix some typo. > > 0004 > * add a testcase to test ALTER PARALLEL DML UNSAFE/RESTRICTED. > Thanks, those updates addressed most of what I was going to comment on for the v9 patches. Some additional comments on the v10 patches: (1) I noticed some functions in the 0003 patch have no function header: make_safety_object parallel_hazard_walker target_rel_all_parallel_hazard_recurse (2) I found the "recurse_partition" parameter of the target_rel_all_parallel_hazard_recurse() function a bit confusing, because the function recursively checks partitions without looking at that flag. How about naming it "is_partition"? (3) The names of the utility functions don't convey that they operate on tables. How about: pg_get_parallel_safety() -> pg_get_table_parallel_safety() pg_get_max_parallel_hazard() -> pg_get_table_max_parallel_hazard() or pg_get_rel_xxxxx()? What do you think? (4) I think that some of the tests need parallel dml settings to match their expected output: (i) +-- Test INSERT with underlying query - and RETURNING (no projection) +-- (should create a parallel plan; parallel SELECT) -> but creates a serial plan (so needs to set parallel dml safe, so a parallel plan is created) (ii) +-- Parallel INSERT with unsafe column default, should not use a parallel plan +-- +alter table testdef parallel dml safe; -> should set "unsafe" not "safe" (iii) +-- Parallel INSERT with restricted column default, should use parallel SELECT +-- +explain (costs off) insert into testdef(a,b,d) select a,a*2,a*8 from test_data; -> should use "alter table testdef parallel dml restricted;" before the explain (iv) +-- +-- Parallel INSERT with restricted and unsafe column defaults, should not use a parallel plan +-- +explain (costs off) insert into testdef(a,d) select a,a*8 from test_data; -> should use "alter table testdef parallel dml unsafe;" before the explain Regards, Greg Nancarrow Fujitsu Australia
On Thursday, June 10, 2021 1:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > On Thu, Jun 10, 2021 at 11:26 AM houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > > > Through further review and thanks for greg-san's suggestions, I > > attached a new version patchset with some minor change in 0001,0003 and > 0004. > > > > 0001. > > * fix a typo in variable name. > > * add a TODO in patch comment about updating the version number when > branch PG15. > > > > 0003 > > * fix a 'git apply white space' warning. > > * Remove some unnecessary if condition. > > * add some code comments above the safety check function. > > * Fix some typo. > > > > 0004 > > * add a testcase to test ALTER PARALLEL DML UNSAFE/RESTRICTED. > > > > Thanks, those updates addressed most of what I was going to comment on > for the v9 patches. > > Some additional comments on the v10 patches: > > (1) I noticed some functions in the 0003 patch have no function header: > > make_safety_object > parallel_hazard_walker > target_rel_all_parallel_hazard_recurse Thanks, added. > (2) I found the "recurse_partition" parameter of the > target_rel_all_parallel_hazard_recurse() function a bit confusing, because the > function recursively checks partitions without looking at that flag. How about > naming it "is_partition"? Yeah, it looks better. Changed. > (3) The names of the utility functions don't convey that they operate on tables. > > How about: > > pg_get_parallel_safety() -> pg_get_table_parallel_safety() > pg_get_max_parallel_hazard() -> pg_get_table_max_parallel_hazard() > > or pg_get_rel_xxxxx()? > > What do you think? I changed it like the following: pg_get_parallel_safety -> pg_get_table_parallel_dml_safety pg_get_max_parallel_hazard -> pg_get_table_max_parallel_dml_hazard > (4) I think that some of the tests need parallel dml settings to match their > expected output: > > (i) > +-- Test INSERT with underlying query - and RETURNING (no projection) > +-- (should create a parallel plan; parallel SELECT) > > -> but creates a serial plan (so needs to set parallel dml safe, so a > parallel plan is created) Changed. > (ii) > +-- Parallel INSERT with unsafe column default, should not use a > +parallel plan > +-- > +alter table testdef parallel dml safe; > > -> should set "unsafe" not "safe" I thought the testcase about table 'testdef' is to test if the planner is able to check whether it has parallel unsafe or restricted column default expression, because column default expression will be merged into select part in planner. So, It seems we don't need to change the table's parallel safety for these cases ? > (iii) > +-- Parallel INSERT with restricted column default, should use parallel > +SELECT > +-- > +explain (costs off) insert into testdef(a,b,d) select a,a*2,a*8 from > +test_data; > > -> should use "alter table testdef parallel dml restricted;" before the > -> explain > > (iv) > +-- > +-- Parallel INSERT with restricted and unsafe column defaults, should > not use a parallel plan > +-- > +explain (costs off) insert into testdef(a,d) select a,a*8 from > +test_data; > > -> should use "alter table testdef parallel dml unsafe;" before the > -> explain I addressed most of the comments and rebased the patch. Besides, I changed the following things: * I removed the safety check for index-am function as discussed[1]. * change version 140000 to 150000 Attach new version patchset for further review. [1] https://www.postgresql.org/message-id/164474.1623652853%40sss.pgh.pa.us Best regards, houzj
Attachment
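As a concrete illustration of the testdef discussion above: which default expressions actually get pulled into the plan depends on the columns omitted from the INSERT, since (as noted above) defaults are merged into the SELECT part by the planner. Below is a self-contained sketch, assuming the usual pg_proc markings of nextval() (parallel unsafe) and random() (parallel restricted); the table and sequence names are invented and the expected plans are just those stated in the test comments above:

  CREATE SEQUENCE demo_seq;
  CREATE TABLE testdef_demo (
      a int,
      b int,
      c int    DEFAULT (random() * 10)::int,  -- parallel-restricted default
      d bigint DEFAULT nextval('demo_seq')    -- parallel-unsafe default
  );
  ALTER TABLE testdef_demo PARALLEL DML SAFE;

  -- Omitting only c pulls in just the restricted default; per the test
  -- comments above, a parallel SELECT plan is still expected.
  EXPLAIN (COSTS OFF)
  INSERT INTO testdef_demo(a, b, d)
  SELECT i, i*2, i*8 FROM generate_series(1, 1000) i;

  -- Omitting d pulls in the unsafe nextval() default, which is expected to
  -- force a serial plan regardless of the table's declared setting.
  EXPLAIN (COSTS OFF)
  INSERT INTO testdef_demo(a, b, c)
  SELECT i, i*2, i*4 FROM generate_series(1, 1000) i;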
On Tuesday, July 6, 2021 11:42 AM houzj.fnst@fujitsu.com wrote:
>
> I addressed most of the comments and rebased the patch.
> Besides, I changed the following things:
> * I removed the safety check for index-am function as discussed[1].
> * change version 140000 to 150000
>
> Attach new version patchset for further review.
>
> [1]
> https://www.postgresql.org/message-id/164474.1623652853%40sss.pgh.pa.us

Attaching a rebased patchset which fixes some compile warnings and errors due to the recent commit 2ed532.

Best regards,
Hou zhijie
Attachment
On Monday, July 12, 2021 4:01 PM <houzj.fnst@fujitsu.com> wrote: > Attach a rebased patchset which fix some compile warnings and errors due to > recent commit 2ed532. Attach rebased patches. Best regards Houzj
Attachment
On Tue, Jul 20, 2021 at 11:47 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > Attach rebased patches. > Just letting you know that CRLFs are in the patch comments for the 0001 and 0003 patches. (It doesn't affect patch application) Regards, Greg Nancarrow Fujitsu Australia
On Tue, Jul 20, 2021 at 11:47 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > > Attach rebased patches. > There's a failure in the "triggers" tests and the cfbot is not happy. Attaching an updated set of patches with a minor update to the expected test results to fix this. Also removed some CRLFs in some of the patch comments. No other changes. Regards, Greg Nancarrow Fujitsu Australia
Attachment
Hi,

Based on the discussion in another thread[1], we plan to change the design as follows: allow users to specify a parallel-safety option for both partitioned and non-partitioned relations, but for non-partitioned relations, if the user doesn't specify it, it would be computed automatically. If the user has specified the parallel-safety option for a non-partitioned relation, then we would use that instead of computing the value ourselves. With this approach, it will be more convenient for users to get the benefit of parallel SELECT for INSERT.

Since most of the technical discussion happened in the other thread, I posted the new version of the patch, including the new design, to that thread[2]. Comments are also welcome there.

[1] https://www.postgresql.org/message-id/CAA4eK1%2BMQnm6RkqooHA7R-y7riRa84qsh5j3FZDScw71m_n4OA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/OS0PR01MB5716DB1E3F723F86314D080094F09%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
houzj