RE: Parallel INSERT SELECT take 2 - Mailing list pgsql-hackers
From | houzj.fnst@fujitsu.com |
---|---|
Subject | RE: Parallel INSERT SELECT take 2 |
Date | |
Msg-id | OS0PR01MB5716BAD92BE8C55CD8AE18ED94459@OS0PR01MB5716.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | RE: Parallel INSERT SELECT take 2 ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>) |
List | pgsql-hackers |
> > BACKGROUND > > ======================================== > > > > We want to realize parallel INSERT SELECT in the following steps: > > 1) INSERT + parallel SELECT > > 2) Parallel INSERT + parallel SELECT > > > > Below are example use cases. We don't expect high concurrency or an > > empty data source. > > * Data loading (ETL or ELT) into an analytics database, typically a > > data ware house. > > * Batch processing in an OLTP database. > > 2) Enabling users to declare that the table allows parallel data > > modification Add a table property that represents parallel safety of > > the table for DML statement execution. Users specify it as follows: > > > > CREATE TABLE table_name (...) PARALLEL { UNSAFE | RESTRICTED | SAFE }; > > ALTER TABLE table_name PARALLEL { UNSAFE | RESTRICTED | SAFE }; > > > > This property is recorded in pg_class's relparallel column as 'u', > > 'r', or 's', just like pg_proc's proparallel. The default is UNSAFE. > > > > The planner assumes that all of the table, its descendant partitions, > > and their ancillary objects have the specified parallel safety or > > safer one. The user is responsible for its correctness. If the > > parallel processes find an object that is less safer than the assumed > > parallel safety during statement execution, it throws an ERROR and abort the > statement execution. > > > > When the parallel safety of some of these objects is changed, it's > > costly to reflect it on the parallel safety of tables that depend on > > them. So, we don't do it. Instead, we provide a utility function > > pg_get_parallel_safety('table_name') > > that returns records of (objid, classid, parallel_safety) that > > represent the parallel safety of objects that determine the parallel > > safety of the specified table. The function only outputs objects that > > are not parallel safe. Otherwise, it will consume excessive memory > > while accumulating the output. The user can use this function to > > identify problematic objects when a parallel DML fails or is not parallelized in > an expected manner. > > > > How does the executor detect parallel unsafe objects? There are two ways: > > > > 1) At loading time > > ... > > 2) At function execution time > > All related objects come down to some function execution. So, add a > > parallel safety check there when in a parallel worker. If the current > > process is a parallel worker and the function is parallel unsafe, > > error out with ereport(ERROR). This approach eliminates the oversight > > of parallel safety check with the additional bonus of tiny code change! > > > > The place would be FunctionCallInvoke(). It's a macro in fmgr.h now. > > Perhaps we should make it a function in fmgr.c, so that fmgr.h does > > not have to include header files for parallelism-related definitions. > > > > We have to evaluate the performance effect of converting > > FunctionCallInvoke() into a function and adding an if statement there, > > because it's a relatively low-level function. > > Based on above, we plan to move forward with the apporache 2) (declarative > idea). > > Attatching the POC patchset which including the following: > > 0001: provide a utility function pg_get_parallel_safety('table_name'). > > The function returns records of (objid, classid, parallel_safety) that represent > the parallel safety of objects that determine the parallel safety of the > specified table. > Note: The function only outputs objects that are not parallel safe. > (Thanks a lot for greg's previous work, most of the safety check code here is > based on it) > > 0002: allow user use "ALTER TABLE PARALLEL SAFE/UNSAFE/RESTRICTED". > > Add proparallel column in pg_class and allow use to change its. > > 0003: detect parallel unsafe objects in executor. > > Currently we choose to check function's parallel safety at function execution > time. > We add safety check at FunctionCallInvoke(), but it may be better to check in > fmgr_info_cxt_security. > we are still discussing it in another thread[1]. > > TODO: we currently skip checking built-in function's parallel safety, because > we lack the information about built-in > function's parallel safety, we cannot access pg_proc.proparallel in a low level > because it could result in infinite recursion. > Adding parallel property in fmgrBuiltin will enlarge the frequently accessed > fmgr_builtins and lock down the value of the > parallel-safety flag. The solution is still under discussion. Suggestions and > comments are welcome. > > 0004: fix some mislabeled function in testcase > > Since we check parallel safety of function at a low level, we found some > functions marked as parallel unsafe will be > executed in parallel mode in regression test when setting > force_parallel_mode=regress. After checking, these functions > are parallel safe, So , we plan to fix these function's parallel label. > Note: we plan to take 0004 as a separate patch , see[2], I post 0004 here just > to prevent some testcase failures. > > The above are the POC patches, it could be imperfect for now and I am still > working on improving it. > Suggestions and comments about the design or code are very welcome and > appreciated. Sorry, I forgot to attach the discussion link about [1] and [2]. [1] https://www.postgresql.org/message-id/756027.1619012086%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/OS0PR01MB571637085C0D3AFC3AB3600194479%40OS0PR01MB5716.jpnprd01.prod.outlook.com Best regards, houzj
pgsql-hackers by date: