RE: Parallel Inserts in CREATE TABLE AS - Mailing list pgsql-hackers

From Hou, Zhijie
Subject RE: Parallel Inserts in CREATE TABLE AS
Date
Msg-id b7be257dbc784001af779e8d75bea5c2@G08CNEXMBPEKD05.g08.fujitsu.local
Whole thread Raw
In response to Re: Parallel Inserts in CREATE TABLE AS  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Parallel Inserts in CREATE TABLE AS  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
> Attaching v21 patch set, which has following changes:
> 1) 0001 - changed fpes->ins_cmd_type ==
> PARALLEL_INSERT_CMD_CREATE_TABLE_AS to fpes->ins_cmd_type !=
> PARALLEL_INSERT_CMD_UNDEF
> 2) 0002 - reworded the commit message.
> 3) 0003 - added cmin, xmin test case to one of the parallel insert cases
> to ensure leader and worker insert the tuples in the same xact and replaced
> memory usage output in numbers like 25kB to NkB to make the tests stable.
> 4) 0004 - updated one of the test output to be in NkB and made the assertion
> in SetParallelInsertState to be not under an if condition.
> 
> There's one open point [1] on selective skipping of error "cannot insert
> tuples in a parallel worker" in heap_prepare_insert(), thoughts are
> welcome.
> 
> Please consider the v21 patch set for further review.

Hi,

I took a look into the new patch and have some comments.

1.
+    /*
+     * Do not consider tuple cost in case of we intend to perform parallel
+     * inserts by workers. We would have turned on the ignore flag in
+     * apply_scanjoin_target_to_paths before generating Gather path for the
+     * upper level SELECT part of the query.
+     */
+    if ((root->parse->parallelInsCmdTupleCostOpt &
+         PARALLEL_INSERT_SELECT_QUERY) &&
+        (root->parse->parallelInsCmdTupleCostOpt &
+         PARALLEL_INSERT_CAN_IGN_TUP_COST))

Can we just check PARALLEL_INSERT_CAN_IGN_TUP_COST here ?
IMO, PARALLEL_INSERT_CAN_IGN_TUP_COST will be set only when PARALLEL_INSERT_SELECT_QUERY is set.


2.
+static void
+ParallelInsCmdEstimate(ParallelContext *pcxt, ParallelInsertCmdKind ins_cmd,
+                       void *ins_info)
...
+        info = (ParallelInsertCTASInfo *) ins_info;
+        intoclause_str = nodeToString(info->intoclause);
+        intoclause_len = strlen(intoclause_str) + 1;

+static void
+SaveParallelInsCmdInfo(ParallelContext *pcxt, ParallelInsertCmdKind ins_cmd,
+                       void *ins_info)
...
+        info = (ParallelInsertCTASInfo *)ins_info;
+        intoclause_str = nodeToString(info->intoclause);
+        intoclause_len = strlen(intoclause_str) + 1;
+        intoclause_space = shm_toc_allocate(pcxt->toc, intoclause_len);

I noticed the above code will call nodeToString and strlen twice which seems unnecessary.
Do you think it's better to store the result of nodetostring and strlen first and pass them when used ?

3.
+    if (node->need_to_scan_locally || node->nworkers_launched == 0)
+    {
+        EState       *estate = node->ps.state;
+        TupleTableSlot *outerTupleSlot;
+
+        for(;;)
+        {
+            /* Install our DSA area while executing the plan. */
+            estate->es_query_dsa =
+                    node->pei ? node->pei->area : NULL;
...
+            node->ps.state->es_processed++;
+        }

How about use the variables estate like 'estate-> es_processed++;'
Instead of node->ps.state->es_processed++;

Best regards,
houzj






pgsql-hackers by date:

Previous
From: "tsunakawa.takay@fujitsu.com"
Date:
Subject: RE: Disable WAL logging to speed up data loading
Next
From: Thomas Munro
Date:
Subject: Re: pg_preadv() and pg_pwritev()