Re: Parallel INSERT SELECT take 2 - Mailing list pgsql-hackers

From: Bharath Rupireddy
Subject: Re: Parallel INSERT SELECT take 2
Msg-id: CALj2ACUBg+G486TsdCq0gXenaT5f6Cb5Qu9F=ih2+vpPTM7kpw@mail.gmail.com
In response to: RE: Parallel INSERT SELECT take 2 ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
Responses: RE: Parallel INSERT SELECT take 2
List: pgsql-hackers
On Mon, Apr 26, 2021 at 7:00 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> > Based on above, we plan to move forward with the approach 2) (declarative
> > idea).
> >
> > IIUC, the declarative behaviour idea attributes parallel safe/unsafe/restricted
> > tags to each table with the default being unsafe. Does it mean for a parallel
> > unsafe table, no parallel selects, inserts (maybe updates) will be picked up? Or
> > is it only the parallel inserts? If both parallel inserts and selects will be picked, then
> > the existing tables need to be adjusted to set the parallel safety tags while
> > migrating?
>
> Thanks for looking into this.

Thanks for the responses.

> The parallel attributes in a table mean the parallel safety when the user does some data-modification operations on it.
> So, it only limits the use of a parallel plan when using INSERT/UPDATE/DELETE.

In that case, isn't it better to use the terminology "PARALLEL DML
SAFE/UNSAFE/RESTRICTED" in the code and docs? This way, it will be
clear that these tags don't affect parallel selects.

> > Another point, what does it mean for a table to be parallel restricted?
> > What should happen if it is present in a query with other parallel safe tables?
>
> If a table is parallel restricted, it means the table contains some parallel restricted objects (such as parallel restricted functions in index expressions).
> And in the planner, it means a parallel insert plan will not be chosen, but it can use parallel select (with serial insert).

Makes sense. I assume that when there is a parallel restricted
function associated with a table, the current design doesn't enforce
the planner to choose parallel select and it is left up to it.
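Just to make sure we're talking about the same kind of object, here's
a small example of my own (not taken from the patches): an index
expression that uses a parallel restricted function, which under the
proposed design would make the table parallel restricted for DML:

-- a function that is safe to use in an index expression (IMMUTABLE)
-- but marked parallel restricted
CREATE FUNCTION add_one(x int) RETURNS int
    LANGUAGE sql IMMUTABLE PARALLEL RESTRICTED
    AS $$ SELECT x + 1 $$;

CREATE TABLE t (a int);
-- the index expression calls the parallel restricted function
CREATE INDEX t_add_one_idx ON t (add_one(a));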
> > I may be wrong here: IIUC, the main problem we are trying to solve with the
> > declarative approach is to let the user decide parallel safety for partitioned tables
> > as it may be costlier for postgres to determine it. And for the normal tables we
> > can perform parallel safety checks without incurring much cost. So, I think we
> > should restrict the declarative approach to only partitioned tables?
>
> Yes, we are trying to avoid overhead when checking parallel safety.
> The cost to check all the partitions' parallel safety is the biggest one.
> Another is the safety check of an index's expression.
> Currently, for INSERT, the planner does not open the target table's indexinfo and does not
> parse the expression of the index. We need to parse the expression in the planner if we want
> to do a parallel safety check for it, which can bring some overhead (it will open the index and do the parse in the executor again).
> So, we plan to skip all of the extra checks and let the user take responsibility for the safety.
> Of course, maybe we can try to pass the indexinfo to the executor, but it needs some further refactoring and I will take a look into it.

Will the planner parse and check parallel safety of index (where
clause) expressions in case of SELECTs? I'm not sure of this. But if
it does, maybe we could do the same thing for parallel DML as well for
normal tables? What is the overhead of parsing index expressions? If
the cost is heavy for checking index expressions' parallel safety in
case of normal tables, then the current design, i.e. attributing a
parallel safety tag to all the tables, makes sense.

I was actually thinking that we will have the declarative approach
only for partitioned tables as it is the main problem we are trying to
solve with this design. Something like: users will run
pg_get_parallel_safety to see the parallel unsafe objects associated
with a partitioned table by looking at all of its partitions and be
able to set a parallel dml safety tag to only partitioned tables.
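Roughly the workflow I have in mind (only a sketch: the
pg_get_parallel_safety function and its (objid, classid,
parallel_safety) output are from the proposed patches, but the table
name and the ALTER TABLE syntax below are just my assumptions, not
what the patches implement):

-- list the not-parallel-safe objects behind a partitioned table
-- ('sales_parted' is a made-up partitioned table)
SELECT * FROM pg_get_parallel_safety('sales_parted');

-- after reviewing or fixing the reported objects, the user declares
-- the table's parallel DML safety explicitly (assumed syntax)
ALTER TABLE sales_parted PARALLEL DML SAFE;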
> > While reading the design, I came across this "erroring out during execution of a
> > query when a parallel unsafe function is detected". If this is correct, isn't it
> > warranting users to run pg_get_parallel_safety to know the parallel unsafe
> > objects, set parallel safety to all of them if possible, otherwise disable
> > parallelism to run the query? Isn't this burdensome?
>
> How about:
> If detecting parallel unsafe objects in the executor, then alter the table to parallel unsafe internally.
> So, the user does not need to alter it manually.

I don't think this is a good idea, because, if there are multiple
tables involved in the query, do you alter all the tables? Usually, we
error out on finding the first such unsafe object.

> > Instead, how about
> > postgres retries the query upon detecting the error that came from a parallel
> > unsafe function during execution, disables parallelism and runs the query? I think
> > this kind of retry query feature can be built outside of the core postgres, but
> > IMO it will be good to have inside (of course configurable). IIRC, the Teradata
> > database has a Query Retry feature.
>
> Thanks for the suggestion.
> The retry query feature sounds like a good idea to me.
> OTOH, it sounds more like an independent feature which parallel select can also benefit from.
> I think maybe we can try to achieve it after we commit the parallel insert?

Yeah, it will be a separate thing altogether.

> 0001: provide a utility function pg_get_parallel_safety('table_name').
>
> The function returns records of (objid, classid, parallel_safety) that represent
> the parallel safety of objects that determine the parallel safety of the specified table.
> Note: The function only outputs objects that are not parallel safe.

If it returns only parallel "unsafe" objects and not "safe" or
"restricted" objects, how about naming it
pg_get_table_parallel_unsafe_objects("table_name")? This way we could
get rid of parallel_safety in the output record. If at all users want
to see parallel restricted or safe objects, we can also have the
counterparts pg_get_table_parallel_safe_objects and
pg_get_table_parallel_restricted_objects. Of course, we can caution
the user that execution of these functions might take longer and
"might consume excessive memory while accumulating the output".
Otherwise, we can have a single function
pg_get_parallel_safety("table_name" IN, "parallel_safety" IN, "objid"
OUT, "classid" OUT)? If required, we could name it
pg_get_parallel_safety_of_table_objects. Thoughts?

Although I have not looked at the patches, a few questions on the
pg_get_parallel_safety function:
1) Will it parse all the expressions for the objects that are listed
under "The objects that relate to the parallel safety of a DML target
table are as follows:" in the upthread?
2) How will it behave if a partitioned table is passed to it? Will it
recurse into all the partitions?
3) How will it behave if a foreign table is passed to it? Will it
error out?

In general:
1) Is ALTER SET PARALLEL SAFETY on a partitioned table allowed? If
yes, will it be set based on all the partitions' parallel safety?
2) How will users have to decide on the parallel safety of a foreign
table or a partitioned table with foreign partitions? Or is it that we
set these tables parallel unsafe and don't do parallel inserts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com