RE: Parallel INSERT SELECT take 2 - Mailing list pgsql-hackers

From houzj.fnst@fujitsu.com
Subject RE: Parallel INSERT SELECT take 2
Date
Msg-id OS0PR01MB5716BAD92BE8C55CD8AE18ED94459@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to RE: Parallel INSERT SELECT take 2  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
List pgsql-hackers
> > BACKGROUND
> > ========================================
> >
> > We want to realize parallel INSERT SELECT in the following steps:
> > 1) INSERT + parallel SELECT
> > 2) Parallel INSERT + parallel SELECT
> >
> > Below are example use cases.  We don't expect high concurrency or an
> > empty data source.
> > * Data loading (ETL or ELT) into an analytics database, typically a
> > data ware house.
> > * Batch processing in an OLTP database.
> > 2) Enabling users to declare that the table allows parallel data
> > modification Add a table property that represents parallel safety of
> > the table for DML statement execution.  Users specify it as follows:
> >
> > CREATE TABLE table_name (...) PARALLEL { UNSAFE | RESTRICTED | SAFE };
> >     ALTER TABLE table_name PARALLEL { UNSAFE | RESTRICTED | SAFE };
> >
> > This property is recorded in pg_class's relparallel column as 'u',
> > 'r', or 's', just like pg_proc's proparallel.  The default is UNSAFE.
> >
> > The planner assumes that all of the table, its descendant partitions,
> > and their ancillary objects have the specified parallel safety or
> > safer one.  The user is responsible for its correctness.  If the
> > parallel processes find an object that is less safer than the assumed
> > parallel safety during statement execution, it throws an ERROR and abort the
> statement execution.
> >
> > When the parallel safety of some of these objects is changed, it's
> > costly to reflect it on the parallel safety of tables that depend on
> > them.  So, we don't do it.  Instead, we provide a utility function
> > pg_get_parallel_safety('table_name')
> > that returns records of (objid, classid, parallel_safety) that
> > represent the parallel safety of objects that determine the parallel
> > safety of the specified table.  The function only outputs objects that
> > are not parallel safe.  Otherwise, it will consume excessive memory
> > while accumulating the output.  The user can use this function to
> > identify problematic objects when a parallel DML fails or is not parallelized in
> an expected manner.
> >
> > How does the executor detect parallel unsafe objects?  There are two ways:
> >
> > 1) At loading time
> > ...
> > 2) At function execution time
> > All related objects come down to some function execution.  So, add a
> > parallel safety check there when in a parallel worker.  If the current
> > process is a parallel worker and the function is parallel unsafe,
> > error out with ereport(ERROR).  This approach eliminates the oversight
> > of parallel safety check with the additional bonus of tiny code change!
> >
> > The place would be FunctionCallInvoke().  It's a macro in fmgr.h now.
> > Perhaps we should make it a function in fmgr.c, so that fmgr.h does
> > not have to include header files for parallelism-related definitions.
> >
> > We have to evaluate the performance effect of converting
> > FunctionCallInvoke() into a function and adding an if statement there,
> > because it's a relatively low-level function.
>
> Based on above, we plan to move forward with the apporache 2) (declarative
> idea).
>
> Attatching the POC patchset which including the following:
>
> 0001: provide a utility function pg_get_parallel_safety('table_name').
>
>   The function returns records of (objid, classid, parallel_safety) that represent
>   the parallel safety of objects that determine the parallel safety of the
> specified table.
>   Note: The function only outputs objects that are not parallel safe.
>   (Thanks a lot for greg's previous work, most of the safety check code here is
> based on it)
>
> 0002: allow user use "ALTER TABLE PARALLEL SAFE/UNSAFE/RESTRICTED".
>
>   Add proparallel column in pg_class and allow use to change its.
>
> 0003: detect parallel unsafe objects in executor.
>
>   Currently we choose to check function's parallel safety at function execution
> time.
>   We add safety check at FunctionCallInvoke(), but it may be better to check in
> fmgr_info_cxt_security.
>   we are still discussing it in another thread[1].
>
>   TODO: we currently skip checking built-in function's parallel safety, because
> we lack the information about built-in
>   function's parallel safety, we cannot access pg_proc.proparallel in a low level
> because it could result in infinite recursion.
>   Adding parallel property in fmgrBuiltin will enlarge the frequently accessed
> fmgr_builtins and lock down the value of the
>   parallel-safety flag. The solution is still under discussion. Suggestions and
> comments are welcome.
>
> 0004: fix some mislabeled function in testcase
>
>   Since we check parallel safety of function at a low level, we found some
> functions marked as parallel unsafe will be
>   executed in parallel mode in regression test when setting
> force_parallel_mode=regress. After checking, these functions
>   are parallel safe, So , we plan to fix these function's parallel label.
>   Note: we plan to take 0004 as a separate patch , see[2], I post 0004 here just
> to prevent some testcase failures.
>
> The above are the POC patches, it could be imperfect for now and I am still
> working on improving it.
> Suggestions and comments about the design or code are very welcome and
> appreciated.

Sorry, I forgot to attach the discussion link about [1] and [2].

[1]
https://www.postgresql.org/message-id/756027.1619012086%40sss.pgh.pa.us

[2]
https://www.postgresql.org/message-id/OS0PR01MB571637085C0D3AFC3AB3600194479%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
houzj




pgsql-hackers by date:

Previous
From: Masahiro Ikeda
Date:
Subject: Re: wal stats questions
Next
From: Peter Geoghegan
Date:
Subject: Getting rid of freezing and hint bits by eagerly vacuuming aborted xacts (was: decoupling table and index vacuum)