Re: Parallel INSERT (INTO ... SELECT ...) - Mailing list pgsql-hackers

From vignesh C
Subject Re: Parallel INSERT (INTO ... SELECT ...)
Date
Msg-id CALDaNm3SG=D=2UHtOtzAG8i+AbUxsX6Z371c-bJ-X9h7xw5CYg@mail.gmail.com
Whole thread Raw
In response to Re: Parallel INSERT (INTO ... SELECT ...)  (Greg Nancarrow <gregn4422@gmail.com>)
Responses Re: Parallel INSERT (INTO ... SELECT ...)  (Greg Nancarrow <gregn4422@gmail.com>)
Re: Parallel INSERT (INTO ... SELECT ...)  (Greg Nancarrow <gregn4422@gmail.com>)
List pgsql-hackers
On Fri, Dec 11, 2020 at 4:30 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Posting an updated set of patches to address recent feedback:
>
> - Removed conditional-locking code used in parallel-safety checking
> code (Tsunakawa-san feedback). It turns out that for the problem test
> case, no parallel-safety checking should be occurring that locks
> relations because those inserts are specifying VALUES, not an
> underlying SELECT, so the parallel-safety checking code was updated to
> bail out early if no underlying SELECT is specified for the INSERT. No
> change to the test code was required.
> - Added comment to better explain the reason for treating "INSERT ...
> ON CONFLICT ... DO UPDATE" as parallel-unsafe (Dilip)
> - Added assertion to heap_prepare_insert() (Amit)
> - Updated error handling for NULL index_expr_item case (Vignesh)

Thanks Greg for fixing and posting a new patch.
Few comments:
+-- Test INSERT with underlying query.
+-- (should create plan with parallel SELECT, Gather parent node)
+--
+explain(costs off) insert into para_insert_p1 select unique1,
stringu1 from tenk1;
+               QUERY PLAN
+----------------------------------------
+ Insert on para_insert_p1
+   ->  Gather
+         Workers Planned: 4
+         ->  Parallel Seq Scan on tenk1
+(4 rows)
+
+insert into para_insert_p1 select unique1, stringu1 from tenk1;
+select count(*), sum(unique1) from para_insert_p1;
+ count |   sum
+-------+----------
+ 10000 | 49995000
+(1 row)
+

For one of the test you can validate that the same transaction has
been used by all the parallel workers, you could use something like
below to validate:
SELECT COUNT(*) FROM (SELECT DISTINCT cmin,xmin FROM  para_insert_p1) as dt;

Few includes are not required:
 #include "executor/nodeGather.h"
+#include "executor/nodeModifyTable.h"
 #include "executor/nodeSubplan.h"
 #include "executor/tqueue.h"
 #include "miscadmin.h"
@@ -60,6 +61,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
        GatherState *gatherstate;
        Plan       *outerNode;
        TupleDesc       tupDesc;
+       Index           varno;

This include is not required in nodeModifyTable.c

+#include "catalog/index.h"
+#include "catalog/indexing.h"

@@ -43,7 +49,11 @@
 #include "parser/parse_agg.h"
 #include "parser/parse_coerce.h"
 #include "parser/parse_func.h"
+#include "parser/parsetree.h"
+#include "partitioning/partdesc.h"
+#include "rewrite/rewriteHandler.h"
 #include "rewrite/rewriteManip.h"
+#include "storage/lmgr.h"
 #include "tcop/tcopprot.h"

The includes indexing.h, rewriteHandler.h & lmgr.h is not required in clauses.c

There are few typos:
+        table and populate it can use a parallel plan. Another
exeption is the command
+        <literal>INSERT INTO ... SELECT ...</literal> which can use a
parallel plan for
+        the underlying <literal>SELECT</literal> part of the query.

exeption should be exception

+ /*
+ * For the number of workers to use for a parallel
+ * INSERT/UPDATE/DELETE, it seems resonable to use the same number
+ * of workers as estimated for the underlying query.
+ */
+ parallelModifyWorkers = path->parallel_workers;
resonable should be reasonable

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW
Next
From: Peter Smith
Date:
Subject: Re: Single transaction in the tablesync worker?