Parallel INSERT (INTO ... SELECT ...) - Mailing list pgsql-hackers

Hi Hackers,

Following on from Dilip Kumar's POC patch for allowing parallelism of
the SELECT part of "INSERT INTO ... SELECT ...", I have attached a POC
patch for allowing parallelism of both the INSERT and SELECT parts,
where it can be allowed.
For cases where it can't be allowed (e.g. INSERT into a table with
foreign keys, or INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE
...") it at least allows parallelism of the SELECT part.
Obviously I've had to update the planner and executor and
parallel-worker code to make this happen, hopefully not breaking too
many things along the way.

Examples with patch applied:


(1) non-parallel:

test=# explain analyze insert into primary_tbl select * from third_tbl;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Insert on primary_tbl  (cost=0.00..154.99 rows=9999 width=12) (actual
time=108.445..108.446 rows=0 loops=1)
   ->  Seq Scan on third_tbl  (cost=0.00..154.99 rows=9999 width=12)
(actual time=0.009..5.282 rows=9999 loops=1)
 Planning Time: 0.132 ms
 Execution Time: 108.596 ms
(4 rows)


(2) parallel:

test=# explain analyze insert into primary_tbl select * from third_tbl;
                                                           QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=0.00..16.00 rows=9999 width=12) (actual
time=69.870..70.310 rows=0 loops=1)
   Workers Planned: 5
   Workers Launched: 5
   ->  Parallel Insert on primary_tbl  (cost=0.00..16.00 rows=500
width=12) (actual time=59.948..59.949 rows=0 loops=6)
         ->  Parallel Seq Scan on third_tbl  (cost=0.00..80.00
rows=2500 width=12) (actual time=0.014..0.922 rows=1666 loops=6)
 Planning Time: 0.121 ms
 Execution Time: 70.438 ms
(7 rows)


(3) parallel select only (insert into table with foreign key)

test=# explain analyze insert into secondary_tbl select * from third_tbl;
                                                           QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
 Insert on secondary_tbl  (cost=0.00..80.00 rows=9999 width=12)
(actual time=33.864..33.926 rows=0 loops=1)
   ->  Gather  (cost=0.00..80.00 rows=9999 width=12) (actual
time=0.451..5.201 rows=9999 loops=1)
         Workers Planned: 4
         Workers Launched: 4
         ->  Parallel Seq Scan on third_tbl  (cost=0.00..80.00
rows=2500 width=12) (actual time=0.013..0.717 rows=2000 loops=5)
 Planning Time: 0.127 ms
 Trigger for constraint secondary_tbl_index_fkey: time=331.834 calls=9999
 Execution Time: 367.342 ms
(8 rows)


Known issues/TODOs:
- Currently only for "INSERT INTO ... SELECT ...". To support "INSERT
INTO ... VALUES ..." would need additional Table AM functions for
dividing up the INSERT work amongst the workers (currently only exists
for scans).
- When INSERTs are made parallel, currently the reported row-count in
the "INSERT 0 <row-count>" status only reflects the rows that the
leader has processed (not the workers) - so it is obviously less than
the actual number of rows inserted.
- Functions relating to computing the number of parallel workers for
an INSERT, and the cost of an INSERT, need work.
- "force_parallel_mode" handling was updated so that it only affects
SELECT (not INSERT) - can't allow it for INSERT because we're only
supporting "INSERT INTO .. SELECT ..." and don't support other types
of INSERTs, and also can't allow attempted parallel UPDATEs resulting
from "INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE" etc.


Thoughts and feedback?

Regards,
Greg Nancarrow
Fujitsu Australia

Attachment

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Parallel Full Hash Join
Next
From: "Hou, Zhijie"
Date:
Subject: Use appendStringInfoString and appendPQExpBufferStr where possible