Re: Parallel INSERT SELECT take 2 - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Parallel INSERT SELECT take 2
Date
Msg-id CALj2ACUBg+G486TsdCq0gXenaT5f6Cb5Qu9F=ih2+vpPTM7kpw@mail.gmail.com
In response to RE: Parallel INSERT SELECT take 2  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
List pgsql-hackers
On Mon, Apr 26, 2021 at 7:00 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> > > Based on above, we plan to move forward with the approach 2) (declarative
> > > idea).
> >
> > IIUC, the declarative behaviour idea attributes parallel safe/unsafe/restricted
> > tags to each table with default being the unsafe. Does it mean for a parallel
> > unsafe table, no parallel selects, inserts (maybe updates) will be picked up? Or
> > is it only the parallel inserts? If both parallel inserts, selects will be picked, then
> > the existing tables need to be adjusted to set the parallel safety tags while
> > migrating?
>
> Thanks for looking into this.

Thanks for the responses.

> The parallel attributes in the table mean the parallel safety when the user does some data-modification operations on it.
> So, it only limits the use of parallel plans when using INSERT/UPDATE/DELETE.

In that case, isn't it better to use the terminology "PARALLEL DML
SAFE/UNSAFE/RESTRICTED" in the code and docs? This way, it will be
clear that these tags don't affect parallel selects.
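
For illustration, a declarative tag with that terminology might be spelled something like the following (the syntax here is purely a hypothetical sketch; the actual grammar in the patch set may differ):

```sql
-- Hypothetical spelling of the declarative parallel DML tag;
-- not the committed grammar, just to show the three safety levels.
ALTER TABLE orders SET PARALLEL DML SAFE;
ALTER TABLE audit_log SET PARALLEL DML UNSAFE;
ALTER TABLE measurements SET PARALLEL DML RESTRICTED;
```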

> > Another point, what does it mean a table being parallel restricted?
> > What should happen if it is present in a query of other parallel safe tables?
>
> If a table is parallel restricted, it means the table contains some parallel restricted objects (such as: parallel restricted functions in index expressions).

> And in planner, it means the parallel insert plan will not be chosen, but it can use parallel select (with serial insert).

Makes sense. I assume that when there is a parallel restricted
function associated with a table, the current design doesn't force
the planner to choose parallel select; it is left up to the planner.

> > I may be wrong here: IIUC, the main problem we are trying to solve with the
> > declarative approach is to let the user decide parallel safety for partition tables
> > as it may be costlier for postgres to determine it. And for the normal tables we
> > can perform parallel safety checks without incurring much cost. So, I think we
> > should restrict the declarative approach to only partitioned tables?
>
> Yes, we are trying to avoid overhead when checking parallel safety.
> The cost to check all the partition's parallel safety is the biggest one.
> Another is the safety check of index's expression.
> Currently, for INSERT, the planner does not open the target table's indexinfo and does not
> parse the expression of the index. We need to parse the expression in the planner if we want
> to do the parallel safety check for it, which can bring some overhead (it will open the index and do the parse in the executor again).
> So, we plan to skip all of the extra checks and let the user take responsibility for the safety.
> Of course, maybe we can try to pass the indexinfo to the executor but it needs some further refactoring and I will take a look into it.

Will the planner parse and check parallel safety of index (WHERE
clause) expressions in the case of SELECTs? I'm not sure of this. But
if it does, maybe we could do the same thing for parallel DML as well
for normal tables? What is the overhead of parsing index expressions?
If the cost of checking index expressions' parallel safety is heavy in
the case of normal tables, then the current design, i.e. attributing a
parallel safety tag to all the tables, makes sense.

I was actually thinking that we will have the declarative approach
only for partitioned tables as it is the main problem we are trying to
solve with this design. Something like: users will run
pg_get_parallel_safety to see the parallel unsafe objects associated
with a partitioned table by looking at all of its partitions and be
able to set a parallel dml safety tag to only partitioned tables.
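
As a rough sketch of that workflow (pg_get_parallel_safety is the function proposed in this thread; the table name and the ALTER spelling below are hypothetical):

```sql
-- Step 1: list the not-parallel-safe objects found across all
-- partitions of the partitioned table (output columns as described
-- in the 0001 patch).
SELECT objid, classid, parallel_safety
FROM pg_get_parallel_safety('part_tbl');

-- Step 2: once satisfied with (or having fixed) the reported objects,
-- declare the table's parallel DML safety (hypothetical syntax).
ALTER TABLE part_tbl SET PARALLEL DML SAFE;
```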

> > While reading the design, I came across this "erroring out during execution of a
> > query when a parallel unsafe function is detected". If this is correct, isn't it
> > warranting users to run pg_get_parallel_safety to know the parallel unsafe
> > objects, set parallel safety to all of them if possible, otherwise disable
> > parallelism to run the query? Isn't this burdensome?
>
> How about:
> If parallel unsafe objects are detected in the executor, then alter the table to parallel unsafe internally.
> So, the user does not need to alter it manually.

I don't think this is a good idea, because, if there are multiple
tables involved in the query, do you alter all the tables? Usually, we
error out on finding the first such unsafe object.

> > Instead, how about
> > postgres retries the query upon detecting the error that came from a parallel
> > unsafe function during execution, disable parallelism and run the query? I think
> > this kind of retry query feature can be built outside of the core postgres, but
> > IMO it will be good to have inside (of course configurable). IIRC, the Teradata
> > database has a Query Retry feature.
> >
>
> Thanks for the suggestion.
> The retry query feature sounds like a good idea to me.
> OTOH, it sounds more like an independent feature from which parallel select can also benefit.
> I think maybe we can try to achieve it after we commit the parallel insert ?

Yeah, it will be a separate thing altogether.

>0001: provide a utility function pg_get_parallel_safety('table_name').
>
>The function returns records of (objid, classid, parallel_safety) that represent
>the parallel safety of objects that determine the parallel safety of the specified table.
>Note: The function only outputs objects that are not parallel safe.

If it returns only parallel "unsafe" objects and not "safe" or
"restricted" objects, how about naming it
pg_get_table_parallel_unsafe_objects("table_name")? This way we could
get rid of the parallel_safety column in the output record. If at all users want
to see parallel restricted or safe objects, we can also have the
counterparts pg_get_table_parallel_safe_objects and
pg_get_table_parallel_restricted_objects. Of course, we can caution
the user that execution of these functions might take longer and
"might consume excessive memory while accumulating the output".

Otherwise, we can have a single function
pg_get_parallel_safety("table_name" IN, "parallel_safety" IN, "objid"
OUT, "classid" OUT)? If required, we could name it
pg_get_parallel_safety_of_table_objects.

Thoughts?
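
To make the comparison concrete, the two alternatives would be used roughly like this (both signatures are proposals under discussion, not committed APIs; the table name is made up):

```sql
-- Alternative 1: one function per safety class, so the output needs
-- no parallel_safety column.
SELECT objid, classid
FROM pg_get_table_parallel_unsafe_objects('part_tbl');

-- Alternative 2: a single function that takes the safety class as an
-- input parameter.
SELECT objid, classid
FROM pg_get_parallel_safety('part_tbl', 'unsafe');
```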

Although I have not looked at the patches, a few questions on the
pg_get_parallel_safety function:
1) Will it parse all the expressions for the objects that are listed
under "The objects that relate to the parallel safety of a DML target
table are as follows:" in the upthread?
2) How will it behave if a partitioned table is passed to it? Will it
recurse for all the partitions?
3) How will it behave if a foreign table is passed to it? Will it error out?

In general:
1) Is ALTER SET PARALLEL SAFETY on a partitioned table allowed? If
yes, will it be set based on all the partitions' parallel safety?
2) How will users have to decide on parallel safety of a foreign table
or a partitioned table with foreign partitions? Or is it that we set
these tables parallel unsafe and don't do parallel inserts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


