Re: Parallel Inserts in CREATE TABLE AS - Mailing list pgsql-hackers
| From | Bharath Rupireddy |
| --- | --- |
| Subject | Re: Parallel Inserts in CREATE TABLE AS |
| Date | |
| Msg-id | CALj2ACXmbka1P5pxOV2vU-Go3UPTtsPqZXE8nKW1mE49MQcZtw@mail.gmail.com |
| In response to | Re: Parallel Inserts in CREATE TABLE AS (vignesh C <vignesh21@gmail.com>) |
| List | pgsql-hackers |
On Tue, Jan 5, 2021 at 10:08 AM vignesh C <vignesh21@gmail.com> wrote:
> On Mon, Jan 4, 2021 at 3:07 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Wed, Dec 30, 2020 at 5:28 PM vignesh C <vignesh21@gmail.com> wrote:
> > > Few comments:
> > > -    /*
> > > -     * To allow parallel inserts, we need to ensure that they are safe to be
> > > -     * performed in workers. We have the infrastructure to allow parallel
> > > -     * inserts in general except for the cases where inserts generate a new
> > > -     * CommandId (eg. inserts into a table having a foreign key column).
> > > -     */
> > > -    if (IsParallelWorker())
> > > -        ereport(ERROR,
> > > -                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
> > > -                 errmsg("cannot insert tuples in a parallel worker")));
> > >
> > > Is it possible to add a check if it is a CTAS insert here, as we do not
> > > support inserts in parallel workers from other commands as of now?
> >
> > Currently, there's no global variable with which we can selectively skip
> > this error in the case of parallel insertion in CTAS. How about having a
> > variable in one of the worker global contexts, setting it when parallel
> > insertion is chosen for CTAS, and using it in heap_prepare_insert() to
> > skip the above error? Eventually, we can remove this restriction
> > entirely once we fully allow parallelism for INSERT INTO SELECT,
> > CTAS, and COPY.
> >
> > Thoughts?
>
> Yes, I felt that the leader can store the command as CTAS and the
> leader/worker can use it to check and throw an error. A similar
> change can be used for the parallel insert patches, and once all the
> patches are committed, we can remove it eventually.

We can skip the error "cannot insert tuples in a parallel worker" in
heap_prepare_insert() selectively for each kind of parallel insertion,
and eventually we can remove that error once all the parallel-insertion
patches are committed. The main problem is that heap_prepare_insert()
needs to know that we are coming from a CTAS (or some other command's)
parallel insertion, while at the same time we don't want to alter the
table_tuple_insert()/heap_prepare_insert() API, because this change
will be removed eventually. We can achieve this in the ways below:

1) Add a backend global variable, set it before each
table_tuple_insert() in intorel_receive(), and use it in
heap_prepare_insert() to skip the error.

2) Add a variable to the MyBgworkerEntry structure, set it before each
table_tuple_insert() in intorel_receive() or in ParallelQueryMain() if
we are in CTAS parallel insertion, and use it in heap_prepare_insert()
to skip the error.

3) Currently, we pass table insert options to
table_tuple_insert()/heap_prepare_insert() as a bitmap of the values
below. We could add something like
#define PARALLEL_INSERTION_CMD_CTAS 0x0010, set it before each
table_tuple_insert() in intorel_receive(), use it in
heap_prepare_insert() to skip the error, and then unset it (a rough
sketch is shown after the quoted comments below).

/* "options" flag bits for table_tuple_insert */
/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
#define TABLE_INSERT_SKIP_FSM   0x0002
#define TABLE_INSERT_FROZEN     0x0004
#define TABLE_INSERT_NO_LOGICAL 0x0008

IMO either 2 or 3 would be fine. Thoughts?

> > > +    Oid         objectid;   /* workers to open relation/table. */
> > > +    /* Number of tuples inserted by all the workers. */
> > > +    pg_atomic_uint64    processed;
> > >
> > > We can just mention relation instead of relation/table.
> >
> > I will modify it in the next patch set.
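Here is a rough, untested sketch of option 3; the flag value, field
names, and exact call sites are only illustrative, not a final patch:

/* tableam.h: claim the next free bit in the insert-options bitmap */
#define PARALLEL_INSERTION_CMD_CTAS 0x0010

/* createas.c, intorel_receive(): pass the flag for each CTAS insert */
table_tuple_insert(myState->rel, slot,
                   myState->output_cid,
                   myState->ti_options | PARALLEL_INSERTION_CMD_CTAS,
                   myState->bistate);

/* heapam.c, heap_prepare_insert(): skip the error only for CTAS workers */
if (IsParallelWorker() &&
    (options & PARALLEL_INSERTION_CMD_CTAS) == 0)
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
             errmsg("cannot insert tuples in a parallel worker")));

Passing the flag inline in the call keeps it scoped to each
table_tuple_insert(), so no explicit unset would be needed.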
> >
> > > +select explain_pictas(
> > > +'create table parallel_write as select length(stringu1) from tenk1;');
> > > +                      explain_pictas
> > > +----------------------------------------------------------
> > > + Gather (actual rows=N loops=N)
> > > +   Workers Planned: 4
> > > +   Workers Launched: N
> > > +   ->  Create parallel_write
> > > +   ->  Parallel Seq Scan on tenk1 (actual rows=N loops=N)
> > > +(5 rows)
> > > +
> > > +select count(*) from parallel_write;
> > >
> > > Can we include selection of cmin, xmin in one of the tests to verify
> > > that it uses the same transaction id in the parallel workers,
> > > something like:
> > > select distinct(cmin,xmin) from parallel_write;
> >
> > This is not possible since cmin and xmin are dynamic; we can not use
> > them in test cases. I think it's not necessary to check whether the
> > leader and workers are in the same txn or not, since we are not
> > creating a new txn. All the txn state from the leader is serialized in
> > SerializeTransactionState and restored in
> > StartParallelWorkerTransaction.
>
> I had seen in your patch that you serialize and use the same
> transaction, but it will be good if you can have at least one test
> case to validate that the leader and worker both use the same
> transaction. To solve the problem that you are facing where cmin and
> xmin are dynamic, you can check the distinct count by using something
> like below:
> SELECT COUNT(*) FROM (SELECT DISTINCT cmin, xmin FROM t1) as dt;

Thanks. So the expectation is that the above query should always return
1 if both the leader and workers share the same txn. I will add this to
one of the test cases in the next version of the patch set.
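For example, using the parallel_write table from the test above, the
new test could look something like this (the expected output assumes
the leader and workers do share one transaction):

SELECT COUNT(*) FROM (SELECT DISTINCT cmin, xmin FROM parallel_write) AS dt;
 count
-------
     1
(1 row)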
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com