Re: Parallel Inserts in CREATE TABLE AS - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Parallel Inserts in CREATE TABLE AS
Date
Msg-id CALj2ACVufcSF4KG5M7g+hP91S+FeOrhO4U8v=B_tFk8on9ufZA@mail.gmail.com
Whole thread Raw
In response to Re: Parallel Inserts in CREATE TABLE AS  (Luc Vlaming <luc@swarm64.com>)
Responses Re: Parallel Inserts in CREATE TABLE AS  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Re: Parallel Inserts in CREATE TABLE AS  (Luc Vlaming <luc@swarm64.com>)
List pgsql-hackers
On Mon, Jan 4, 2021 at 4:22 PM Luc Vlaming <luc@swarm64.com> wrote:
> Sorry it took so long to get back to reviewing this.

Thanks for the comments.

> wrt v18-0001....patch:
>
> +               /*
> +                * If the worker is for parallel insert in CTAS, then use the proper
> +                * dest receiver.
> +                */
> +               intoclause = (IntoClause *) stringToNode(intoclausestr);
> +               receiver = CreateIntoRelDestReceiver(intoclause);
> +               ((DR_intorel *)receiver)->is_parallel_worker = true;
> +               ((DR_intorel *)receiver)->object_id = fpes->objectid;
> I would move this into a function called e.g.
> GetCTASParallelWorkerReceiver so that the details wrt CTAS can be put in
> createas.c.
> I would then also split up intorel_startup into intorel_leader_startup
> and intorel_worker_startup, and in GetCTASParallelWorkerReceiver set
> self->pub.rStartup to intorel_worker_startup.

My intention was to not add any new APIs to the dest receiver. I simply made the changes in intorel_startup, in which for workers it just does the minimalistic work and exit from it. In the leader most of the table creation and sanity check is kept untouched. Please have a look at the v19 patch posted upthread [1].

> +       volatile pg_atomic_uint64       *processed;
> why is it volatile?

Intention is to always read from the actual memory location. I referred it from the way pg_atomic_fetch_add_u64_impl, pg_atomic_compare_exchange_u64_impl, pg_atomic_init_u64_impl and their u32 counterparts use pass the parameter as volatile pg_atomic_uint64 *ptr.

> +                       if (isctas)
> +                       {
> +                               intoclause = ((DR_intorel *) node->dest)->into;
> +                               objectid = ((DR_intorel *) node->dest)->object_id;
> +                       }
> Given that you extract them each once and then pass them directly into
> the parallel-worker, can't you instead pass in the destreceiver and
> leave that logic to ExecInitParallelPlan?

That's changed entirely in the v19 patch set posted upthread [1]. Please have a look. I didn't pass the dest receiver, to keep the API generic, I passed parallel insert command type and a void * ptr which points to insertion command because the information we pass to workers depends on the insertion command (for instance, the information needed by workers is for CTAS into clause and object id and for Refresh Mat View object id).

>
> +                                       if (IS_PARALLEL_CTAS_DEST(gstate->dest) &&
> +                                               ((DR_intorel *) gstate->dest)->into->rel &&
> +                                               ((DR_intorel *) gstate->dest)->into->rel->relname)
> why would rel and relname not be there? if no rows have been inserted?
> because it seems from the intorel_startup function that that would be
> set as soon as startup was done, which i assume (wrongly?) is always done?

Actually, that into clause rel variable is always being set in the gram.y for CTAS, Create Materialized View and SELECT INTO (because qualified_name non-terminal is not optional). My bad. I just added it as a sanity check. Actually, it's not required.

create_as_target:
            qualified_name opt_column_list table_access_method_clause
            OptWith OnCommitOption OptTableSpace
                {
                    $$ = makeNode(IntoClause);
                    $$->rel = $1;
create_mv_target:
            qualified_name opt_column_list table_access_method_clause opt_reloptions OptTableSpace
            {
                $$ = makeNode(IntoClause);
                $$->rel = $1;
into_clause:
            INTO OptTempTableName
            {
                $$ = makeNode(IntoClause);
               $$->rel = $2;

I will change the below code:
+                    if (GetParallelInsertCmdType(gstate->dest) ==
+                        PARALLEL_INSERT_CMD_CREATE_TABLE_AS &&
+                        ((DR_intorel *) gstate->dest)->into &&
+                        ((DR_intorel *) gstate->dest)->into->rel &&
+                        ((DR_intorel *) gstate->dest)->into->rel->relname)
+                    {

to:
+                    if (GetParallelInsertCmdType(gstate->dest) ==
+                        PARALLEL_INSERT_CMD_CREATE_TABLE_AS)
+                    {

I will update this in the next version of the patch set.

>
> +        * In case if no workers were launched, allow the leader to insert entire
> +        * tuples.
> what does "entire tuples" mean? should it maybe be "all tuples"?

Yeah, noticed that while working on the v19 patch set. Please have a look at the v19 patch posted upthread [1].

> ================
> wrt v18-0003....patch:
>
> not sure if it is needed, but i was wondering if we would want more
> tests with multiple gather nodes existing? caused e.g. by using CTE's,
> valid subquery's (like the one test you have, but without the group
> by/having)?

I'm not sure if we can have CTAS/CMV/SELECT INTO in CTEs like WITH, WITH RECURSIVE and I don't see that any of the WITH clause processing hits createas.c functions. So, IMHO, we don't need to add them. Please let me know if there are any specific use cases you have in mind.

For instance, I tried to cover Init/Sub Plan and Subquery cases with:

below case has multiple Gather, Init Plan:
+-- parallel inserts must occur, as there is init plan that gets executed by
+-- each parallel worker
+select explain_pictas(
+'create table parallel_write as select two col1,
+    (select two from (select * from tenk2) as tt limit 1) col2
+    from tenk1  where tenk1.four = 3;');

below case has Gather, Sub Plan:
+-- parallel inserts must not occur, as there is sub plan that gets executed by
+-- the Gather node in leader
+select explain_pictas(
+'create table parallel_write as select two col1,
+    (select tenk1.two from generate_series(1,1)) col2
+    from tenk1  where tenk1.four = 3;');

For multiple Gather node cases, I covered them with the Union All/Append cases in the 0004 patch. Please have a look.

pgsql-hackers by date:

Previous
From: "Joel Jacobson"
Date:
Subject: Re: Test harness for regex code (to allow importing Tcl's test suite)
Next
From: Önder Kalacı
Date:
Subject: Re: row filtering for logical replication