Re: Parallel INSERT SELECT take 2 - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Parallel INSERT SELECT take 2
Date
Msg-id CALj2ACXEtN1anW81atg=AZY9_j5hVJR9i6=JNWt1=T+0poy9OA@mail.gmail.com
Whole thread Raw
In response to RE: Parallel INSERT SELECT take 2  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
Responses RE: Parallel INSERT SELECT take 2  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
List pgsql-hackers
On Thu, Apr 22, 2021 at 4:51 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> > BACKGROUND
> > ========================================
> >
> > We want to realize parallel INSERT SELECT in the following steps:
> > 1) INSERT + parallel SELECT
> > 2) Parallel INSERT + parallel SELECT
> >
> > Below are example use cases.  We don't expect high concurrency or an empty
> > data source.
> > * Data loading (ETL or ELT) into an analytics database, typically a data ware
> > house.
> > * Batch processing in an OLTP database.
> > 2) Enabling users to declare that the table allows parallel data modification Add
> > a table property that represents parallel safety of the table for DML statement
> > execution.  Users specify it as follows:
> >
> > CREATE TABLE table_name (...) PARALLEL { UNSAFE | RESTRICTED | SAFE };
> >     ALTER TABLE table_name PARALLEL { UNSAFE | RESTRICTED | SAFE };
> >
> > This property is recorded in pg_class's relparallel column as 'u', 'r', or 's', just
> > like pg_proc's proparallel.  The default is UNSAFE.
> >
> > The planner assumes that all of the table, its descendant partitions, and their
> > ancillary objects have the specified parallel safety or safer one.  The user is
> > responsible for its correctness.  If the parallel processes find an object that is
> > less safer than the assumed parallel safety during statement execution, it
> > throws an ERROR and abort the statement execution.
> >
> > When the parallel safety of some of these objects is changed, it's costly to
> > reflect it on the parallel safety of tables that depend on them.  So, we don't do
> > it.  Instead, we provide a utility function pg_get_parallel_safety('table_name')
> > that returns records of (objid, classid, parallel_safety) that represent the
> > parallel safety of objects that determine the parallel safety of the specified
> > table.  The function only outputs objects that are not parallel safe.  Otherwise,
> > it will consume excessive memory while accumulating the output.  The user
> > can use this function to identify problematic objects when a parallel DML fails
> > or is not parallelized in an expected manner.
> >
> > How does the executor detect parallel unsafe objects?  There are two ways:
> >
> > 1) At loading time
> > ...
> > 2) At function execution time
> > All related objects come down to some function execution.  So, add a parallel
> > safety check there when in a parallel worker.  If the current process is a parallel
> > worker and the function is parallel unsafe, error out with ereport(ERROR).  This
> > approach eliminates the oversight of parallel safety check with the additional
> > bonus of tiny code change!
> >
> > The place would be FunctionCallInvoke().  It's a macro in fmgr.h now.  Perhaps
> > we should make it a function in fmgr.c, so that fmgr.h does not have to include
> > header files for parallelism-related definitions.
> >
> > We have to evaluate the performance effect of converting FunctionCallInvoke()
> > into a function and adding an if statement there, because it's a relatively
> > low-level function.
>
> Based on above, we plan to move forward with the apporache 2) (declarative idea).

IIUC, the declarative behaviour idea attributes parallel
safe/unsafe/restricted tags to each table with default being the
unsafe. Does it mean for a parallel unsafe table, no parallel selects,
inserts (may be updates) will be picked up? Or is it only the parallel
inserts? If both parallel inserts, selects will be picked, then the
existing tables need to be adjusted to set the parallel safety tags
while migrating?

Another point, what does it mean a table being parallel restricted?
What should happen if it is present in a query of other parallel safe
tables?

I may be wrong here: IIUC, the main problem we are trying to solve
with the declarative approach is to let the user decide parallel
safety for partition tables as it may be costlier for postgres to
determine it. And for the normal tables we can perform parallel safety
checks without incurring much cost. So, I think we should restrict the
declarative approach to only partitioned tables?

While reading the design, I came across this "erroring out during
execution of a query when a parallel unsafe function is detected". If
this is correct, isn't it warranting users to run
pg_get_parallel_safety to know the parallel unsafe objects, set
parallel safety to all of them if possible, otherwise disable
parallelism to run the query? Isn't this burdensome? Instead, how
about postgres retries the query upon detecting the error that came
from a parallel unsafe function during execution, disable parallelism
and run the query? I think this kind of retry query feature can be
built outside of the core postgres, but IMO it will be good to have
inside (of course configurable). IIRC, the Teradata database has a
Query Retry feature.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: A test for replay of regression tests
Next
From: Robert Haas
Date:
Subject: Re: decoupling table and index vacuum