Re: Parallel Inserts in CREATE TABLE AS - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Parallel Inserts in CREATE TABLE AS
Date
Msg-id CAFiTN-u_PEEYke66jMjLdXrcwgnmg-0nUG5oG-XtOqdQ1_vYVw@mail.gmail.com
Whole thread Raw
In response to Re: Parallel Inserts in CREATE TABLE AS  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Parallel Inserts in CREATE TABLE AS
List pgsql-hackers
On Wed, Dec 30, 2020 at 10:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, 30 Dec 2020 at 10:47 AM, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>>
>> On Wed, Dec 30, 2020 at 10:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> > I have completed reviewing 0001, I don't have more comments, just one
>> > question.  Soon I will review the remaining patches.
>>
>> Thanks.
>>
>> > +    /* If parallel inserts are to be allowed, set a few extra information. */
>> > +    if (myState->is_parallel)
>> > +    {
>> > +        myState->object_id = intoRelationAddr.objectId;
>> > +
>> > +        /*
>> > +         * We don't need to skip contacting FSM while inserting tuples for
>> > +         * parallel mode, while extending the relations, workers instead of
>> > +         * blocking on a page while another worker is inserting, can check the
>> > +         * FSM for another page that can accommodate the tuples. This results
>> > +         * in major benefit for parallel inserts.
>> > +         */
>> > +        myState->ti_options = 0;
>> >
>> > Is there any performance data for this or just theoretical analysis?
>>
>> I have seen that we don't get much performance with the skip fsm
>> option, though I don't have the data to back it up. I'm planning to
>> run performance tests after the patches 0001, 0002 and 0003 get
>> reviewed. I will capture the data at that time. Hope that's fine.
>
>
> Yeah that’s fine
>

Some comments in 0002

1.
+/*
+ * Information sent to the planner from CTAS to account for the cost
+ * calculations in cost_gather. We need to do this because, no tuples will be
+ * received by the Gather node if the workers insert the tuples in parallel.
+ */
+typedef enum CTASParallelInsertOpt
+{
+ CTAS_PARALLEL_INS_UNDEF = 0, /* undefined */
+ CTAS_PARALLEL_INS_SELECT = 1 << 0, /* turn on this before planning */
+ /*
+ * Turn on this while planning for upper Gather path to ignore parallel
+ * tuple cost in cost_gather.
+ */
+ CTAS_PARALLEL_INS_TUP_COST_CAN_IGN = 1 << 1,
+ /* Turn on this after the cost is ignored. */
+ CTAS_PARALLEL_INS_TUP_COST_IGNORED = 1 << 2
+} CTASParallelInsertOpt;


I don't like the naming of these flags.  Especially no need to define
CTAS_PARALLEL_INS_UNDEF, we can directl use 0
for that purpose instead of giving some weird name.  So I suggest
first, just get rid of CTAS_PARALLEL_INS_UNDEF.

2.
+ /*
+ * Turn on a flag to ignore parallel tuple cost by the Gather path in
+ * cost_gather if the SELECT is for CTAS and we are generating an upper
+ * level Gather path.
+ */
+ bool ignore = ignore_parallel_tuple_cost(root);
+
  generate_useful_gather_paths(root, rel, false);

+ /*
+ * Reset the ignore flag, in case we turned it on but
+ * generate_useful_gather_paths returned without reaching cost_gather.
+ * If we reached cost_gather, we would have been reset it there.
+ */
+ if (ignore && (root->parse->CTASParallelInsInfo &
+ CTAS_PARALLEL_INS_TUP_COST_CAN_IGN))
+ {
+ root->parse->CTASParallelInsInfo &=
+ ~CTAS_PARALLEL_INS_TUP_COST_CAN_IGN;
+ }

I think th way we are using these cost ignoring flag, doesn't look clean.

I mean first, CTAS_PARALLEL_INS_SELECT is set if it is coming from
CTAS and then ignore_parallel_tuple_cost will
set the CTAS_PARALLEL_INS_TUP_COST_CAN_IGN if it satisfies certain
condition which is fine.  Now, internally cost
gather will add CTAS_PARALLEL_INS_TUP_COST_IGNORED and remove
CTAS_PARALLEL_INS_TUP_COST_CAN_IGN and if
CTAS_PARALLEL_INS_TUP_COST_CAN_IGN is not removed then we will remove
it outside.  Why do we need to remove
CTAS_PARALLEL_INS_TUP_COST_CAN_IGN flag at all?

3.
+ if (tuple_cost_flags && gstate->ps.ps_ProjInfo)
+ Assert(!(*tuple_cost_flags & CTAS_PARALLEL_INS_TUP_COST_IGNORED));

Instead of adding Assert inside an IF statement, you can convert whole
statement as an assert.  Lets not add unnecessary
if in the release mode.

4.
+ if ((root->parse->CTASParallelInsInfo & CTAS_PARALLEL_INS_SELECT) &&
+ (root->parse->CTASParallelInsInfo &
+ CTAS_PARALLEL_INS_TUP_COST_CAN_IGN))
+ {
+ ignore_tuple_cost = true;
+ root->parse->CTASParallelInsInfo &=
+ ~CTAS_PARALLEL_INS_TUP_COST_CAN_IGN;
+ root->parse->CTASParallelInsInfo |= CTAS_PARALLEL_INS_TUP_COST_IGNORED;
+ }
+
+ if (!ignore_tuple_cost)
+ run_cost += parallel_tuple_cost * path->path.rows;

Changes this to (if, else) as shown below, because if it goes to the
IF part then ignore_tuple_cost will always be true
so no need to have an extra if check.

if ((root->parse->CTASParallelInsInfo & CTAS_PARALLEL_INS_SELECT) &&
(root->parse->CTASParallelInsInfo &
CTAS_PARALLEL_INS_TUP_COST_CAN_IGN))
{
ignore_tuple_cost = true;
root->parse->CTASParallelInsInfo &=
~CTAS_PARALLEL_INS_TUP_COST_CAN_IGN;
root->parse->CTASParallelInsInfo |= CTAS_PARALLEL_INS_TUP_COST_IGNORED;
}
else
run_cost += parallel_tuple_cost * path->path.rows;

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Denis Smirnov
Date:
Subject: Re: PoC Refactor AM analyse API
Next
From: David Fetter
Date:
Subject: popcount