Thread: Parallel Inserts in CREATE TABLE AS
The idea of this patch is to allow the leader and each worker to insert tuples in parallel if the SELECT part of the CTAS is parallelizable. Along with the parallel inserts, if the CTAS code path is also allowed to do table_multi_insert() [1], the gain we achieve is as follows:
For a table with 2 integer columns and 100 million tuples (more test results are at [2]), the exec time on HEAD is 120 sec, whereas with the parallelism patch proposed here plus the multi insert patch [1], with 3 workers and leader participation, the exec time is 22 sec (5.45X). With the current CTAS code, which does single tuple inserts (see intorel_receive()), the performance gain is limited to ~1.7X with parallelism, because the workers contend more for locks on buffer pages while extending the table. So the maximum benefit we can get for CTAS is with both parallelism and multi tuple inserts.
The design:
In createas.c, let the planner know that the SELECT is from a CTAS, so that it can set the number of tuples transferred from the workers to the Gather node to 0. With this change, there is a better chance that the planner chooses a parallel plan. After planning, check in createas.c whether the top plan node is a Gather and, if so, mark a parallelism flag in the CTAS dest receiver. Pass the into clause, object id and command id from the leader to the workers, so that each worker can create its own CTAS dest receiver. The leader inserts its share of tuples if instructed to do so, and so do the workers. Each worker atomically writes its number of inserted tuples into a shared memory variable; the leader combines this with its own count and reports the total to the client.
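For concreteness, the post-planning check boils down to roughly the shape below. This is only a sketch of what the proposed patch does in createas.c; the helper name here is made up, and the DR_intorel is_parallel flag it feeds is an addition from the patch, not existing core API.

    #include "postgres.h"
    #include "nodes/plannodes.h"

    /*
     * Sketch: after pg_plan_query(), decide whether the workers themselves
     * may run the CTAS dest receiver.  The plan must be topped by a Gather
     * whose child is parallel aware and parallel safe.
     */
    static bool
    ctas_select_is_parallel(PlannedStmt *plannedstmt)
    {
        Plan       *top = plannedstmt->planTree;

        return plannedstmt->parallelModeNeeded &&
            top != NULL &&
            IsA(top, Gather) &&
            outerPlan(top) != NULL &&
            outerPlan(top)->parallel_aware &&
            outerPlan(top)->parallel_safe;
    }

If this returns true, createas.c sets the parallelism flag on the dest receiver (in the patch, ((DR_intorel *) dest)->is_parallel = true), and the into clause is shipped to the workers so each one can build its own receiver.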
Below things are still pending. Thoughts are most welcome:
1. How better we can lift the "cannot insert tuples in a parallel worker" restriction from heap_prepare_insert() for only CTAS cases, or for that matter parallel copy? How about having a variable in any of the worker global contexts and using that? Of course, we can remove this restriction entirely in case we fully allow parallelism for INSERT INTO SELECT, CTAS, and COPY.
2. How to represent the parallel insert for CTAS in explain plans? The explain of a CTAS shows the plan for only the SELECT part. How about having some textual info along with the Gather node? For example:
-----------------------------------------------------------------------------
 Gather  (cost=1000.00..108738.90 rows=0 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on t_test  (cost=0.00..106748.00 rows=4954 width=8)
         Filter: (many < 10000)
3. Need to restrict parallel inserts if CTAS tries to create temp/global tables, as the workers will not have access to those tables. Need to analyze whether to allow parallelism if CTAS has prepared statements or with no data.
4. Need to stop unnecessary parallel shared state such as the tuple queue being created and shared to the workers.
Thoughts?
Credits:
1. Thanks to Dilip Kumar for the main design idea and the discussions. Thanks to Vignesh for the discussions.
2. Patch development and testing are by me.
3. Thanks to the authors of table_multi_insert() in CTAS patch [1].
[1] - For table_multi_insert() in CTAS, I used an in-progress patch available at https://www.postgresql.org/message-id/CAEET0ZG31mD5SWjTYsAt0JTLReOejPvusJorZ3kGZ1%3DN1AC-Fw%40mail.gmail.com
[2] - Table with 2 integer columns, 100 million tuples, with leader participation, with the default postgresql.conf file. All readings are triplets of the form (workers, exec time in sec, improvement).
case 1: no multi inserts - (0,120,1X),(1,91,1.32X),(2,75,1.6X),(3,67,1.79X),(4,72,1.66X),(5,77,1.56X),(6,83,1.44X)
case 2: with multi inserts - (0,59,1X),(1,32,1.84X),(2,28,2.1X),(3,25,2.36X),(4,23,2.56X),(5,22,2.68X),(6,22,2.68X)
case 3: same table but unlogged with multi inserts - (0,50,1X),(1,28,1.78X),(2,25,2X),(3,22,2.27X),(4,21,2.38X),(5,21,2.38X),(6,20,2.5X)
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
> [1] - For table_multi_insert() in CTAS, I used an in-progress patch available at https://www.postgresql.org/message-id/CAEET0ZG31mD5SWjTYsAt0JTLReOejPvusJorZ3kGZ1%3DN1AC-Fw%40mail.gmail.com
> [2] - Table with 2 integer columns, 100 million tuples, with leader participation, with default postgresql.conf file. All readings are of triplet form - (workers, exec time in sec, improvement).
> case 1: no multi inserts - (0,120,1X),(1,91,1.32X),(2,75,1.6X),(3,67,1.79X),(4,72,1.66X),(5,77,1.56X),(6,83,1.44X)
> case 2: with multi inserts - (0,59,1X),(1,32,1.84X),(2,28,2.1X),(3,25,2.36X),(4,23,2.56X),(5,22,2.68X),(6,22,2.68X)
> case 3: same table but unlogged with multi inserts - (0,50,1X),(1,28,1.78X),(2,25,2X),(3,22,2.27X),(4,21,2.38X),(5,21,2.38X),(6,20,2.5X)
>

I feel this enhancement could give good improvement, +1 for this.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Hi,

On 2020-09-23 17:20:20 +0530, Bharath Rupireddy wrote:
> The idea of this patch is to allow the leader and each worker insert the
> tuples in parallel if the SELECT part of the CTAS is parallelizable.

Cool!

> The design:

I think it'd be good if you could explain a bit more why you think this is
safe to do in the way you have done it.

E.g. from a quick scroll through the patch, there's not even a comment
explaining that the only reason there doesn't need to be code dealing with
xid assignment is that we already did the catalog changes to create the
table.

But how does that work for SELECT INTO? Are you prohibiting that? ...

> Pass the into clause, object id, command id from the leader to
> workers, so that each worker can create its own CTAS dest
> receiver. Leader inserts it's share of tuples if instructed to do, and
> so are workers. Each worker writes atomically it's number of inserted
> tuples into a shared memory variable, the leader combines this with
> it's own number of inserted tuples and shares to the client.
>
> Below things are still pending. Thoughts are most welcome:
> 1. How better we can lift the "cannot insert tuples in a parallel worker"
> from heap_prepare_insert() for only CTAS cases or for that matter parallel
> copy? How about having a variable in any of the worker global contexts and
> use that? Of course, we can remove this restriction entirely in case we
> fully allow parallelism for INSERT INTO SELECT, CTAS, and COPY.

I have mentioned before that I think it'd be good if we changed the insert
APIs to have a more 'scan' like structure. I am thinking of something like

TableInsertScan* table_begin_insert(Relation);
table_tuple_insert(TableInsertScan *is, other, args);
table_multi_insert(TableInsertScan *is, other, args);
table_end_insert(TableInsertScan *);

that'd then replace the BulkInsertStateData logic we have right now. But
more importantly it'd allow an AM to optimize operations across multiple
inserts, which is important for column stores.

And for the purpose of your question, we could then have a
table_insert_allow_parallel(TableInsertScan *);
or an additional arg to table_begin_insert().

> 3. Need to restrict parallel inserts, if CTAS tries to create temp/global
> tables as the workers will not have access to those tables. Need to analyze
> whether to allow parallelism if CTAS has prepared statements or with no
> data.

In which case does CTAS not create a table? You definitely need to ensure
that the table is created before your workers are started, and it needs to
be in a different CommandId.

Greetings,

Andres Freund
Thanks Andres for the comments. On Thu, Sep 24, 2020 at 8:11 AM Andres Freund <andres@anarazel.de> wrote: > > > The design: > > I think it'd be good if you could explain a bit more why you think this > safe to do in the way you have done it. > > E.g. from a quick scroll through the patch, there's not even a comment > explaining that the only reason there doesn't need to be code dealing > with xid assignment because we already did the catalog changes to create > the table. > Yes we do a bunch of catalog changes related to the created new table. We will have both the txn id and command id assigned when catalogue changes are being made. But, right after the table is created in the leader, the command id is incremented (CommandCounterIncrement() is called from create_ctas_internal()) whereas the txn id remains the same. The new command id is marked as GetCurrentCommandId(true); in intorel_startup, then the parallel mode is entered. The txn id and command id are serialized into parallel DSM, they are then available to all parallel workers. This is discussed in [1]. Few changes I have to make in the parallel worker code: set currentCommandIdUsed = true;, may be via a common API SetCurrentCommandIdUsedForWorker() proposed in [1] and remove the extra command id sharing from the leader to workers. I will add a few comments in the upcoming patch related to the above info. > > But how does that work for SELECT INTO? Are you prohibiting > that? ... > In case of SELECT INTO, a new table gets created and I'm not prohibiting the parallel inserts and I think we don't need to. Thoughts? > > > Below things are still pending. Thoughts are most welcome: > > 1. How better we can lift the "cannot insert tuples in a parallel worker" > > from heap_prepare_insert() for only CTAS cases or for that matter parallel > > copy? How about having a variable in any of the worker global contexts and > > use that? Of course, we can remove this restriction entirely in case we > > fully allow parallelism for INSERT INTO SELECT, CTAS, and COPY. > > And for the purpose of your question, we could then have a > table_insert_allow_parallel(TableInsertScan *); > or an additional arg to table_begin_insert(). > Removing "cannot insert tuples in a parallel worker" restriction from heap_prepare_insert() is a common problem for parallel inserts in general, i.e. parallel inserts in CTAS, parallel INSERT INTO SELECTs[1] and parallel copy[2]. It will be good if a common solution is agreed. > > > 3. Need to restrict parallel inserts, if CTAS tries to create temp/global > > tables as the workers will not have access to those tables. Need to analyze > > whether to allow parallelism if CTAS has prepared statements or with no > > data. > > In which case does CTAS not create a table? AFAICS, the table gets created in all the cases but the insertion of the data gets skipped if the user specifies "with no data" option in which case the select part is not even planned, and so the parallelism will also not be picked. > > You definitely need to > ensure that the table is created before your workers are started, and > there needs to be in a different CommandId. > Yeah, this is already being done. Table gets created in the leader(intorel_startup which gets called from dest->rStartup(dest in standard_ExecutorRun()) before entering the parallel mode. 
[1] https://www.postgresql.org/message-id/CAJcOf-fn1nhEtaU91NvRuA3EbvbJGACMd4_c%2BUu3XU5VMv37Aw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAA4eK1%2BkpddvvLxWm4BuG_AhVvYz8mKAEa7osxp_X0d4ZEiV%3Dg%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
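The command-id helper mentioned above is essentially a one-line addition to xact.c. A sketch of what [1] proposes (currentCommandId and currentCommandIdUsed are the statics xact.c already maintains, so such a function would have to live there; treat the exact name and placement as assumptions from that thread):

    /*
     * Mark the command id restored from the parallel DSM as used in a
     * worker, so the worker's inserts are attributed to the same command
     * as the leader's, without the worker assigning its own xid/cid.
     */
    void
    SetCurrentCommandIdUsedForWorker(void)
    {
        Assert(IsParallelWorker() &&
               !currentCommandIdUsed &&
               currentCommandId != InvalidCommandId);

        currentCommandIdUsed = true;
    }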
On Mon, Sep 28, 2020 at 3:58 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Thanks Andres for the comments. > > On Thu, Sep 24, 2020 at 8:11 AM Andres Freund <andres@anarazel.de> wrote: > > > > > The design: > > > > I think it'd be good if you could explain a bit more why you think this > > safe to do in the way you have done it. > > > > E.g. from a quick scroll through the patch, there's not even a comment > > explaining that the only reason there doesn't need to be code dealing > > with xid assignment because we already did the catalog changes to create > > the table. > > > > Yes we do a bunch of catalog changes related to the created new table. > We will have both the txn id and command id assigned when catalogue > changes are being made. But, right after the table is created in the > leader, the command id is incremented (CommandCounterIncrement() is > called from create_ctas_internal()) whereas the txn id remains the > same. The new command id is marked as GetCurrentCommandId(true); in > intorel_startup, then the parallel mode is entered. The txn id and > command id are serialized into parallel DSM, they are then available > to all parallel workers. This is discussed in [1]. > > Few changes I have to make in the parallel worker code: set > currentCommandIdUsed = true;, may be via a common API > SetCurrentCommandIdUsedForWorker() proposed in [1] and remove the > extra command id sharing from the leader to workers. > > I will add a few comments in the upcoming patch related to the above info. > Yes, that would be good. > > > > But how does that work for SELECT INTO? Are you prohibiting > > that? ... > > > > In case of SELECT INTO, a new table gets created and I'm not > prohibiting the parallel inserts and I think we don't need to. > So, in this case, also do we ensure that table is created before we launch the workers. If so, I think you can explain in comments about it and what you need to do that to ensure the same. While skimming through the patch, a small thing I noticed: + /* + * SELECT part of the CTAS is parallelizable, so we can make + * each parallel worker insert the tuples that are resulted + * in it's execution into the target table. + */ + if (!is_matview && + IsA(plan->planTree, Gather)) + ((DR_intorel *) dest)->is_parallel = true; + I am not sure at this stage if this is the best way to make CTAS as parallel but if so, then probably you can expand the comments a bit to say why you consider only Gather node (and that too when it is the top-most node) and why not another parallel node like GatherMerge? > Thoughts? > > > > > > Below things are still pending. Thoughts are most welcome: > > > 1. How better we can lift the "cannot insert tuples in a parallel worker" > > > from heap_prepare_insert() for only CTAS cases or for that matter parallel > > > copy? How about having a variable in any of the worker global contexts and > > > use that? Of course, we can remove this restriction entirely in case we > > > fully allow parallelism for INSERT INTO SELECT, CTAS, and COPY. > > > > And for the purpose of your question, we could then have a > > table_insert_allow_parallel(TableInsertScan *); > > or an additional arg to table_begin_insert(). > > > > Removing "cannot insert tuples in a parallel worker" restriction from > heap_prepare_insert() is a common problem for parallel inserts in > general, i.e. parallel inserts in CTAS, parallel INSERT INTO > SELECTs[1] and parallel copy[2]. It will be good if a common solution > is agreed. 
> Right, for now, I think you can simply remove that check from the code instead of just commenting it. We will see if there is a better check/Assert we can add there. -- With Regards, Amit Kapila.
On Tue, Oct 6, 2020 at 10:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Yes we do a bunch of catalog changes related to the created new table. > > We will have both the txn id and command id assigned when catalogue > > changes are being made. But, right after the table is created in the > > leader, the command id is incremented (CommandCounterIncrement() is > > called from create_ctas_internal()) whereas the txn id remains the > > same. The new command id is marked as GetCurrentCommandId(true); in > > intorel_startup, then the parallel mode is entered. The txn id and > > command id are serialized into parallel DSM, they are then available > > to all parallel workers. This is discussed in [1]. > > > > Few changes I have to make in the parallel worker code: set > > currentCommandIdUsed = true;, may be via a common API > > SetCurrentCommandIdUsedForWorker() proposed in [1] and remove the > > extra command id sharing from the leader to workers. > > > > I will add a few comments in the upcoming patch related to the above info. > > > > Yes, that would be good. > Added comments. > > > > But how does that work for SELECT INTO? Are you prohibiting > > > that? ... > > > > > > > In case of SELECT INTO, a new table gets created and I'm not > > prohibiting the parallel inserts and I think we don't need to. > > > > So, in this case, also do we ensure that table is created before we > launch the workers. If so, I think you can explain in comments about > it and what you need to do that to ensure the same. > For SELECT INTO, the table gets created by the leader in create_ctas_internal(), then ExecInitParallelPlan() gets called which launches the workers and then the leader(if asked to do so) and the workers insert the rows. So, we don't need to do any extra work to ensure the table gets created before the workers start inserting tuples. > > While skimming through the patch, a small thing I noticed: > + /* > + * SELECT part of the CTAS is parallelizable, so we can make > + * each parallel worker insert the tuples that are resulted > + * in it's execution into the target table. > + */ > + if (!is_matview && > + IsA(plan->planTree, Gather)) > + ((DR_intorel *) dest)->is_parallel = true; > + > > I am not sure at this stage if this is the best way to make CTAS as > parallel but if so, then probably you can expand the comments a bit to > say why you consider only Gather node (and that too when it is the > top-most node) and why not another parallel node like GatherMerge? > If somebody expects to preserve the order of the tuples that are coming from GatherMerge node of the select part in CTAS or SELECT INTO while inserting, now if parallelism is allowed, that may not be the case i.e. the order of insertion of tuples may vary. I'm not quite sure, if someone wants to use order by in the select parts of CTAS or SELECT INTO in a real world use case. Thoughts? > > Right, for now, I think you can simply remove that check from the code > instead of just commenting it. We will see if there is a better > check/Assert we can add there. > Done. I also worked on some of the open points I listed earlier in my mail. > > 3. Need to restrict parallel inserts, if CTAS tries to create temp/global tables as the workers will not have access tothose tables. > Done. > > Need to analyze whether to allow parallelism if CTAS has prepared statements or with no data. > For prepared statements, the parallelism will not be picked and so is parallel insertion. 
For CTAS with no data option case the select part is not even planned, and so the parallelism will also not be picked. > > 4. Need to stop unnecessary parallel shared state such as tuple queue being created and shared to workers. > Done. I'm listing the things that are still pending. 1. How to represent the parallel insert for CTAS in explain plans? The explain CTAS shows the plan for only the SELECT part. How about having some textual info along with the Gather node? I'm not quite sure on this point, any suggestions are welcome. 2. Addition of new test cases. Testing with more scenarios and different data sets, sizes, tablespaces, select into. Analysis on the 2 mismatches in write_parallel.sql regression test. Attaching v2 patch, thoughts and comments are welcome. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
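For point 4 above, the relevant change in ExecInitParallelPlan() amounts to skipping the tuple queue setup when the workers write directly into the table. A fragment-level sketch, following the patch snippet quoted later in this thread (intoclausestr is the serialized IntoClause, NULL for a plain parallel SELECT):

    /*
     * Set up the tuple queues that the workers will write into, but only
     * when the workers actually return tuples to the leader.  For parallel
     * CTAS inserts each worker feeds its own dest receiver, so the queues
     * would be pure overhead.
     */
    if (intoclausestr == NULL)
        pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
    else
        pei->tqueue = NULL;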
On Wed, Oct 14, 2020 at 2:46 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Tue, Oct 6, 2020 at 10:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > While skimming through the patch, a small thing I noticed: > > + /* > > + * SELECT part of the CTAS is parallelizable, so we can make > > + * each parallel worker insert the tuples that are resulted > > + * in it's execution into the target table. > > + */ > > + if (!is_matview && > > + IsA(plan->planTree, Gather)) > > + ((DR_intorel *) dest)->is_parallel = true; > > + > > > > I am not sure at this stage if this is the best way to make CTAS as > > parallel but if so, then probably you can expand the comments a bit to > > say why you consider only Gather node (and that too when it is the > > top-most node) and why not another parallel node like GatherMerge? > > > > If somebody expects to preserve the order of the tuples that are > coming from GatherMerge node of the select part in CTAS or SELECT INTO > while inserting, now if parallelism is allowed, that may not be the > case i.e. the order of insertion of tuples may vary. I'm not quite > sure, if someone wants to use order by in the select parts of CTAS or > SELECT INTO in a real world use case. Thoughts? > I think there is no reason why one can't use ORDER BY in the statements we are talking about here. But, I think we can't enable parallelism for GatherMerge is because for that node we always need to fetch the data in the leader backend to perform the final merge phase. So, I was expecting a small comment saying something on those lines. > > > > > Need to analyze whether to allow parallelism if CTAS has prepared statements or with no data. > > > > For prepared statements, the parallelism will not be picked and so is > parallel insertion. > Hmm, I am not sure what makes you say this statement. The parallelism is enabled for prepared statements since commit 57a6a72b6b. > > I'm listing the things that are still pending. > > 1. How to represent the parallel insert for CTAS in explain plans? The > explain CTAS shows the plan for only the SELECT part. How about having > some textual info along with the Gather node? I'm not quite sure on > this point, any suggestions are welcome. > I am also not sure about this point because we don't display anything for the DDL part in explain. Can you propose by showing some example of what you have in mind? -- With Regards, Amit Kapila.
>
> > If somebody expects to preserve the order of the tuples that are
> > coming from GatherMerge node of the select part in CTAS or SELECT INTO
> > while inserting, now if parallelism is allowed, that may not be the
> > case i.e. the order of insertion of tuples may vary. I'm not quite
> > sure, if someone wants to use order by in the select parts of CTAS or
> > SELECT INTO in a real world use case. Thoughts?
> >
>
> I think there is no reason why one can't use ORDER BY in the
> statements we are talking about here. But, I think we can't enable
> parallelism for GatherMerge is because for that node we always need to
> fetch the data in the leader backend to perform the final merge phase.
> So, I was expecting a small comment saying something on those lines.
>
Sure, I will add comments in the upcoming patch.
>
> > For prepared statements, the parallelism will not be picked and so is
> > parallel insertion.
>
> Hmm, I am not sure what makes you say this statement. The parallelism
> is enabled for prepared statements since commit 57a6a72b6b.
>
Thanks for letting me know this. I misunderstood the parallelism for prepared statements. I verified with a proper use case, where I had a prepared statement and a CTAS having EXECUTE; in this case too, parallelism is picked and parallel insertion happened with the patch proposed in this thread. Do we have any problems if we allow parallel insertion for these cases?
PREPARE myselect AS SELECT * FROM t1;
EXPLAIN ANALYZE CREATE TABLE t1_test AS EXECUTE myselect;
> > 1. How to represent the parallel insert for CTAS in explain plans? The
> > explain CTAS shows the plan for only the SELECT part. How about having
> > some textual info along with the Gather node? I'm not quite sure on
> > this point, any suggestions are welcome.
>
> I am also not sure about this point because we don't display anything
> for the DDL part in explain. Can you propose by showing some example
> of what you have in mind?
>
I thought we could have something like this.
-----------------------------------------------------------------------------
Gather (cost=1000.00..108738.90 rows=0 width=8)
Workers Planned: 2 Parallel Insert on t_test1
-> Parallel Seq Scan on t_test (cost=0.00..106748.00 rows=4954 width=8)
Filter: (many < 10000)
-----------------------------------------------------------------------------
On Thu, Oct 15, 2020 at 9:14 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Oct 14, 2020 at 6:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > For prepared statements, the parallelism will not be picked and so is
> > > parallel insertion.
> >
> > Hmm, I am not sure what makes you say this statement. The parallelism
> > is enabled for prepared statements since commit 57a6a72b6b.
> >
>
> Thanks for letting me know this. I misunderstood the parallelism for
> prepared statements. Now, I verified with a proper use case (see below),
> where I had a prepared statement, CTAS having EXECUTE, in this case too
> parallelism is picked and parallel insertion happened with the patch
> proposed in this thread. Do we have any problems if we allow parallel
> insertion for these cases?
>
> PREPARE myselect AS SELECT * FROM t1;
> EXPLAIN ANALYZE CREATE TABLE t1_test AS EXECUTE myselect;
>
> I think the commit 57a6a72b6b has not added any test cases, isn't it good
> to add one in prepare.sql or select_parallel.sql?
>

I am not sure if it is worth as this is not functionality which is too
complex or there are many chances of getting it broken.

> > > 1. How to represent the parallel insert for CTAS in explain plans? The
> > > explain CTAS shows the plan for only the SELECT part. How about having
> > > some textual info along with the Gather node? I'm not quite sure on
> > > this point, any suggestions are welcome.
> >
> > I am also not sure about this point because we don't display anything
> > for the DDL part in explain. Can you propose by showing some example
> > of what you have in mind?
> >
>
> I thought we could have something like this.
> -----------------------------------------------------------------------------
> Gather (cost=1000.00..108738.90 rows=0 width=8)
>   Workers Planned: 2 Parallel Insert on t_test1
>   ->  Parallel Seq Scan on t_test (cost=0.00..106748.00 rows=4954 width=8)
>         Filter: (many < 10000)
> -----------------------------------------------------------------------------
>

maybe something like below:
Gather (cost=1000.00..108738.90 rows=0 width=8)
  ->  Create t_test1
      ->  Parallel Seq Scan on t_test

I don't know what is the best thing to do here. I think for the temporary
purpose you can keep something like above, then once the patch is matured
we can take a separate opinion for this.

--
With Regards,
Amit Kapila.
On 14.10.20 11:16, Bharath Rupireddy wrote: > On Tue, Oct 6, 2020 at 10:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >>> Yes we do a bunch of catalog changes related to the created new table. >>> We will have both the txn id and command id assigned when catalogue >>> changes are being made. But, right after the table is created in the >>> leader, the command id is incremented (CommandCounterIncrement() is >>> called from create_ctas_internal()) whereas the txn id remains the >>> same. The new command id is marked as GetCurrentCommandId(true); in >>> intorel_startup, then the parallel mode is entered. The txn id and >>> command id are serialized into parallel DSM, they are then available >>> to all parallel workers. This is discussed in [1]. >>> >>> Few changes I have to make in the parallel worker code: set >>> currentCommandIdUsed = true;, may be via a common API >>> SetCurrentCommandIdUsedForWorker() proposed in [1] and remove the >>> extra command id sharing from the leader to workers. >>> >>> I will add a few comments in the upcoming patch related to the above info. >>> >> >> Yes, that would be good. >> > > Added comments. > >> >>>> But how does that work for SELECT INTO? Are you prohibiting >>>> that? ... >>>> >>> >>> In case of SELECT INTO, a new table gets created and I'm not >>> prohibiting the parallel inserts and I think we don't need to. >>> >> >> So, in this case, also do we ensure that table is created before we >> launch the workers. If so, I think you can explain in comments about >> it and what you need to do that to ensure the same. >> > > For SELECT INTO, the table gets created by the leader in > create_ctas_internal(), then ExecInitParallelPlan() gets called which > launches the workers and then the leader(if asked to do so) and the > workers insert the rows. So, we don't need to do any extra work to > ensure the table gets created before the workers start inserting > tuples. > >> >> While skimming through the patch, a small thing I noticed: >> + /* >> + * SELECT part of the CTAS is parallelizable, so we can make >> + * each parallel worker insert the tuples that are resulted >> + * in it's execution into the target table. >> + */ >> + if (!is_matview && >> + IsA(plan->planTree, Gather)) >> + ((DR_intorel *) dest)->is_parallel = true; >> + >> >> I am not sure at this stage if this is the best way to make CTAS as >> parallel but if so, then probably you can expand the comments a bit to >> say why you consider only Gather node (and that too when it is the >> top-most node) and why not another parallel node like GatherMerge? >> > > If somebody expects to preserve the order of the tuples that are > coming from GatherMerge node of the select part in CTAS or SELECT INTO > while inserting, now if parallelism is allowed, that may not be the > case i.e. the order of insertion of tuples may vary. I'm not quite > sure, if someone wants to use order by in the select parts of CTAS or > SELECT INTO in a real world use case. Thoughts? > >> >> Right, for now, I think you can simply remove that check from the code >> instead of just commenting it. We will see if there is a better >> check/Assert we can add there. >> > > Done. > > I also worked on some of the open points I listed earlier in my mail. > >> >> 3. Need to restrict parallel inserts, if CTAS tries to create temp/global tables as the workers will not have access tothose tables. >> > > Done. > >> >> Need to analyze whether to allow parallelism if CTAS has prepared statements or with no data. 
>> > > For prepared statements, the parallelism will not be picked and so is > parallel insertion. > For CTAS with no data option case the select part is not even planned, > and so the parallelism will also not be picked. > >> >> 4. Need to stop unnecessary parallel shared state such as tuple queue being created and shared to workers. >> > > Done. > > I'm listing the things that are still pending. > > 1. How to represent the parallel insert for CTAS in explain plans? The > explain CTAS shows the plan for only the SELECT part. How about having > some textual info along with the Gather node? I'm not quite sure on > this point, any suggestions are welcome. > 2. Addition of new test cases. Testing with more scenarios and > different data sets, sizes, tablespaces, select into. Analysis on the > 2 mismatches in write_parallel.sql regression test. > > Attaching v2 patch, thoughts and comments are welcome. > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com > Hi, Really looking forward to this ending up in postgres as I think it's a very nice improvement. Whilst reviewing your patch I was wondering: is there a reason you did not introduce a batch insert in the destreceiver for the CTAS? For me this makes a huge difference in ingest speed as otherwise the inserts do not really scale so well as lock contention start to be a big problem. If you like I can make a patch to introduce this on top? Kind regards, Luc Swarm64
On Fri, Oct 16, 2020 at 11:33 AM Luc Vlaming <luc@swarm64.com> wrote: > > Really looking forward to this ending up in postgres as I think it's a > very nice improvement. > > Whilst reviewing your patch I was wondering: is there a reason you did > not introduce a batch insert in the destreceiver for the CTAS? For me > this makes a huge difference in ingest speed as otherwise the inserts do > not really scale so well as lock contention start to be a big problem. > If you like I can make a patch to introduce this on top? > Thanks for your interest. You are right, we can get maximum improvement if we have multi inserts in destreceiver for the CTAS on the similar lines to COPY FROM command. I specified this point in my first mail [1]. You may want to take a look at an already existing patch [2] for multi inserts, I think there are some review comments to be addressed in that patch. I would love to see the multi insert patch getting revived. [1] - https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAEET0ZG31mD5SWjTYsAt0JTLReOejPvusJorZ3kGZ1%3DN1AC-Fw%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On 16.10.20 08:23, Bharath Rupireddy wrote: > On Fri, Oct 16, 2020 at 11:33 AM Luc Vlaming <luc@swarm64.com> wrote: >> >> Really looking forward to this ending up in postgres as I think it's a >> very nice improvement. >> >> Whilst reviewing your patch I was wondering: is there a reason you did >> not introduce a batch insert in the destreceiver for the CTAS? For me >> this makes a huge difference in ingest speed as otherwise the inserts do >> not really scale so well as lock contention start to be a big problem. >> If you like I can make a patch to introduce this on top? >> > > Thanks for your interest. You are right, we can get maximum > improvement if we have multi inserts in destreceiver for the CTAS on > the similar lines to COPY FROM command. I specified this point in my > first mail [1]. You may want to take a look at an already existing > patch [2] for multi inserts, I think there are some review comments to > be addressed in that patch. I would love to see the multi insert patch > getting revived. > > [1] - https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com > [2] - https://www.postgresql.org/message-id/CAEET0ZG31mD5SWjTYsAt0JTLReOejPvusJorZ3kGZ1%3DN1AC-Fw%40mail.gmail.com > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com > Sorry had not seen that pointer in your first email. I'll first finish some other patches I'm working on and then I'll try to revive that patch. Thanks for the pointers. Kind regards, Luc Swarm64
>
> > > > 1. How to represent the parallel insert for CTAS in explain plans? The
> > > > explain CTAS shows the plan for only the SELECT part. How about having
> > > > some textual info along with the Gather node? I'm not quite sure on
> > > > this point, any suggestions are welcome.
> > >
> > > I am also not sure about this point because we don't display anything
> > > for the DDL part in explain. Can you propose by showing some example
> > > of what you have in mind?
> >
> > I thought we could have something like this.
> > -----------------------------------------------------------------------------
> > Gather (cost=1000.00..108738.90 rows=0 width=8)
> > Workers Planned: 2 Parallel Insert on t_test1
> > -> Parallel Seq Scan on t_test (cost=0.00..106748.00 rows=4954 width=8)
> > Filter: (many < 10000)
> > -----------------------------------------------------------------------------
>
> maybe something like below:
> Gather (cost=1000.00..108738.90 rows=0 width=8)
> -> Create t_test1
> -> Parallel Seq Scan on t_test
>
> I don't know what is the best thing to do here. I think for the
> temporary purpose you can keep something like above then once the
> patch is matured then we can take a separate opinion for this.
>
QUERY PLAN
---------------------------------------------------------------------------------
Gather (actual time=970.524..972.913 rows=0 loops=1)
-> Create t1_test
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on t1 (actual time=0.028..86.623 rows=333333 loops=3)
Planning Time: 0.049 ms
Execution Time: 973.733 ms
>
> I think there is no reason why one can't use ORDER BY in the
> statements we are talking about here. But, I think we can't enable
> parallelism for GatherMerge is because for that node we always need to
> fetch the data in the leader backend to perform the final merge phase.
> So, I was expecting a small comment saying something on those lines.
>
Added comments.
On Mon, Oct 19, 2020 at 10:47 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Attaching v3 patch herewith. > > I'm done with all the open points in my list. Please review the v3 patch and provide comments. > Attaching v4 patch, rebased on the latest master 68b1a4877e. Also, added this feature to commitfest - https://commitfest.postgresql.org/31/2841/ With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Hi,

I'm very interested in this feature, and I'm looking at the patch; here are some comments.

1.
+		if (!TupIsNull(outerTupleSlot))
+		{
+			(void) node->ps.dest->receiveSlot(outerTupleSlot, node->ps.dest);
+			node->ps.state->es_processed++;
+		}
+
+		if(TupIsNull(outerTupleSlot))
+			break;
+	}

How about the following style:

		if (TupIsNull(outerTupleSlot))
			break;

		(void) node->ps.dest->receiveSlot(outerTupleSlot, node->ps.dest);
		node->ps.state->es_processed++;

which looks cleaner.

2.
+
+	if (into != NULL &&
+		IsA(into, IntoClause))
+	{

The check can be replaced by ISCTAS(into).

3.
+	/*
+	 * For parallelizing inserts in CTAS i.e. making each
+	 * parallel worker inerst it's tuples, we must send
+	 * information such as intoclause(for each worker

'inerst' looks like a typo (insert).

4.
+	/* Estimate space for into clause for CTAS. */
+	if (ISCTAS(planstate->intoclause))
+	{
+		intoclausestr = nodeToString(planstate->intoclause);
+		shm_toc_estimate_chunk(&pcxt->estimator, strlen(intoclausestr) + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
...
+	if (intoclausestr != NULL)
+	{
+		char *shmptr = (char *)shm_toc_allocate(pcxt->toc,
+												strlen(intoclausestr) + 1);
+		strcpy(shmptr, intoclausestr);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, shmptr);
+	}

The code here calls strlen(intoclausestr) two times. The existing code in
ExecInitParallelPlan stores the strlen in a variable, so how about the
following style:

	intoclause_len = strlen(intoclausestr);
	...
	/* Store serialized intoclause. */
	intoclause_space = shm_toc_allocate(pcxt->toc, intoclause_len + 1);
	memcpy(intoclause_space, intoclausestr, intoclause_len + 1);
	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space);

like the existing code in ExecInitParallelPlan.

5.
+	if (intoclausestr != NULL)
+	{
+		char *shmptr = (char *)shm_toc_allocate(pcxt->toc,
+												strlen(intoclausestr) + 1);
+		strcpy(shmptr, intoclausestr);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, shmptr);
+	}
+
 	/* Set up the tuple queues that the workers will write into. */
-	pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
+	if (intoclausestr == NULL)
+		pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);

The two checks on intoclausestr can be combined like:

	if (intoclausestr != NULL)
	{
		...
	}
	else
	{
		...
	}

Best regards,
houzj
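For completeness, the worker side of the serialization being reviewed above is the mirror image: look up the flattened IntoClause in the DSM and rebuild a worker-local dest receiver from it. A sketch only; the helper name is made up for illustration, and PARALLEL_KEY_INTO_CLAUSE is the key the patch defines in execParallel.c (the value below is illustrative):

    #include "postgres.h"
    #include "commands/createas.h"
    #include "nodes/primnodes.h"
    #include "nodes/readfuncs.h"
    #include "storage/shm_toc.h"
    #include "tcop/dest.h"

    /* Illustrative key value; the patch defines its own. */
    #define PARALLEL_KEY_INTO_CLAUSE    UINT64CONST(0xE000000000000010)

    /*
     * Sketch: called from the worker's parallel query entry point.  Returns
     * NULL when the query is not a parallel CTAS insert (no into clause was
     * stored in the DSM).
     */
    static DestReceiver *
    worker_ctas_dest_receiver(shm_toc *toc)
    {
        char       *intoclausestr;
        IntoClause *intoclause;

        intoclausestr = shm_toc_lookup(toc, PARALLEL_KEY_INTO_CLAUSE, true);
        if (intoclausestr == NULL)
            return NULL;

        /* Rebuild the IntoClause node tree and make a local CTAS receiver. */
        intoclause = (IntoClause *) stringToNode(intoclausestr);
        return CreateIntoRelDestReceiver(intoclause);
    }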
On Tue, Nov 24, 2020 at 4:43 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > I'm very interested in this feature, > and I'm looking at the patch, here are some comments. > Thanks for the review. > > How about the following style: > > if(TupIsNull(outerTupleSlot)) > Break; > > (void) node->ps.dest->receiveSlot(outerTupleSlot, node->ps.dest); > node->ps.state->es_processed++; > > Which looks cleaner. > Done. > > The check can be replaced by ISCTAS(into). > Done. > > 'inerst' looks like a typo (insert). > Corrected. > > The code here call strlen(intoclausestr) for two times, > After checking the existing code in ExecInitParallelPlan, > It used to store the strlen in a variable. > > So how about the following style: > > intoclause_len = strlen(intoclausestr); > ... > /* Store serialized intoclause. */ > intoclause_space = shm_toc_allocate(pcxt->toc, intoclause_len + 1); > memcpy(shmptr, intoclausestr, intoclause_len + 1); > shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space); > Done. > > The two check about intoclausestr seems can be combined like: > > if (intoclausestr != NULL) > { > ... > } > else > { > ... > } > Done. Attaching v5 patch. Please consider it for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Hi,

I have an issue about the following code:

		econtext = node->ps.ps_ExprContext;
		ResetExprContext(econtext);

+		if (ISCTAS(node->ps.intoclause))
+		{
+			ExecParallelInsertInCTAS(node);
+			return NULL;
+		}

		/* If no projection is required, we're done. */
		if (node->ps.ps_ProjInfo == NULL)
			return slot;

		/*
		 * Form the result tuple using ExecProject(), and return it.
		 */
		econtext->ecxt_outertuple = slot;
		return ExecProject(node->ps.ps_ProjInfo);

It seems the projection will be skipped.
Is this because projection is not required in this case?
(I'm not very familiar with where the projection will be.)

If projection is not required here, shall we add some comments here?

Best regards,
houzj
On Thu, Nov 26, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > Hi, > > I have an issue about the following code: > > econtext = node->ps.ps_ExprContext; > ResetExprContext(econtext); > > + if (ISCTAS(node->ps.intoclause)) > + { > + ExecParallelInsertInCTAS(node); > + return NULL; > + } > > /* If no projection is required, we're done. */ > if (node->ps.ps_ProjInfo == NULL) > return slot; > > /* > * Form the result tuple using ExecProject(), and return it. > */ > econtext->ecxt_outertuple = slot; > return ExecProject(node->ps.ps_ProjInfo); > > It seems the projection will be skipped. > Is this because projection is not required in this case ? > (I'm not very familiar with where the projection will be.) > For parallel inserts in CTAS, I don't think we need to project the tuples being returned from the underlying plan nodes, and also we have nothing to project from the Gather node further up. The required projection will happen while the tuples are being returned from the underlying nodes and the projected tuples are being directly fed to CTAS's dest receiver intorel_receive(), from there into the created table. We don't need ExecProject again in ExecParallelInsertInCTAS(). For instance, projection will always be done when the tuple is being returned from an underlying sequential scan node(see ExecScan() --> ExecProject() and this is true for both leader and workers. In both leader and workers, we are just calling CTAS's dest receiver intorel_receive(). Thoughts? > > If projection is not required here, shall we add some comments here? > If the above point looks okay, I can add a comment. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Hi , > On Thu, Nov 26, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> > wrote: > > > > Hi, > > > > I have an issue about the following code: > > > > econtext = node->ps.ps_ExprContext; > > ResetExprContext(econtext); > > > > + if (ISCTAS(node->ps.intoclause)) > > + { > > + ExecParallelInsertInCTAS(node); > > + return NULL; > > + } > > > > /* If no projection is required, we're done. */ > > if (node->ps.ps_ProjInfo == NULL) > > return slot; > > > > /* > > * Form the result tuple using ExecProject(), and return it. > > */ > > econtext->ecxt_outertuple = slot; > > return ExecProject(node->ps.ps_ProjInfo); > > > > It seems the projection will be skipped. > > Is this because projection is not required in this case ? > > (I'm not very familiar with where the projection will be.) > > > > For parallel inserts in CTAS, I don't think we need to project the tuples > being returned from the underlying plan nodes, and also we have nothing > to project from the Gather node further up. The required projection will > happen while the tuples are being returned from the underlying nodes and > the projected tuples are being directly fed to CTAS's dest receiver > intorel_receive(), from there into the created table. We don't need > ExecProject again in ExecParallelInsertInCTAS(). > > For instance, projection will always be done when the tuple is being returned > from an underlying sequential scan node(see ExecScan() --> > ExecProject() and this is true for both leader and workers. In both leader > and workers, we are just calling CTAS's dest receiver intorel_receive(). > > Thoughts? I took a deep look at the projection logic. In most cases, you are right that Gather node does not need projection. In some rare cases, such as Subplan (or initplan I guess). The projection will happen in Gather node. The example: Create table test(i int); Create table test2(a int, b int); insert into test values(generate_series(1,10000000,1)); insert into test2 values(generate_series(1,1000,1), generate_series(1,1000,1)); postgres=# explain(verbose, costs off) select test.i,(select i from (select * from test2) as tt limit 1) from test wheretest.i < 2000; QUERY PLAN ---------------------------------------- Gather Output: test.i, (SubPlan 1) Workers Planned: 2 -> Parallel Seq Scan on public.test Output: test.i Filter: (test.i < 2000) SubPlan 1 -> Limit Output: (test.i) -> Seq Scan on public.test2 Output: test.i In this case, projection is necessary, because the subplan will be executed in projection. If skipped, the table created will loss some data. Best regards, houzj
On Thu, Nov 26, 2020 at 12:15 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > I took a deep look at the projection logic. > In most cases, you are right that Gather node does not need projection. > > In some rare cases, such as Subplan (or initplan I guess). > The projection will happen in Gather node. > > The example: > > Create table test(i int); > Create table test2(a int, b int); > insert into test values(generate_series(1,10000000,1)); > insert into test2 values(generate_series(1,1000,1), generate_series(1,1000,1)); > > postgres=# explain(verbose, costs off) select test.i,(select i from (select * from test2) as tt limit 1) from test wheretest.i < 2000; > QUERY PLAN > ---------------------------------------- > Gather > Output: test.i, (SubPlan 1) > Workers Planned: 2 > -> Parallel Seq Scan on public.test > Output: test.i > Filter: (test.i < 2000) > SubPlan 1 > -> Limit > Output: (test.i) > -> Seq Scan on public.test2 > Output: test.i > > In this case, projection is necessary, > because the subplan will be executed in projection. > > If skipped, the table created will loss some data. > Thanks a lot for the use case. Yes with the current patch table will lose data related to the subplan. On analyzing further, I think we can not allow parallel inserts in the cases when the Gather node has some projections to do. Because the workers can not perform that projection. So, having ps_ProjInfo in the Gather node is an indication for us to disable parallel inserts and only the leader can do the insertions after the Gather node does the required projections. Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
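The guard discussed above is small. A sketch of the extra condition, matching the ps_ProjInfo check that IsParallelInsertInCTASAllowed() grows in the next patch version (the helper name here is made up; only the condition is the point):

    #include "postgres.h"
    #include "executor/execnodes.h"

    /*
     * Sketch: parallel insert is only allowed when the top node is a Gather
     * that has no projection of its own.  A projection above the workers
     * (e.g. a SubPlan, as in the example above) can only run in the leader,
     * so in that case the leader alone inserts, after projecting.
     */
    static bool
    gather_allows_parallel_ctas_insert(PlanState *ps)
    {
        return ps != NULL &&
            IsA(ps, GatherState) &&
            ps->ps_ProjInfo == NULL;
    }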
Hi, > > I took a deep look at the projection logic. > > In most cases, you are right that Gather node does not need projection. > > > > In some rare cases, such as Subplan (or initplan I guess). > > The projection will happen in Gather node. > > > > The example: > > > > Create table test(i int); > > Create table test2(a int, b int); > > insert into test values(generate_series(1,10000000,1)); > > insert into test2 values(generate_series(1,1000,1), > > generate_series(1,1000,1)); > > > > postgres=# explain(verbose, costs off) select test.i,(select i from > (select * from test2) as tt limit 1) from test where test.i < 2000; > > QUERY PLAN > > ---------------------------------------- > > Gather > > Output: test.i, (SubPlan 1) > > Workers Planned: 2 > > -> Parallel Seq Scan on public.test > > Output: test.i > > Filter: (test.i < 2000) > > SubPlan 1 > > -> Limit > > Output: (test.i) > > -> Seq Scan on public.test2 > > Output: test.i > > > > In this case, projection is necessary, because the subplan will be > > executed in projection. > > > > If skipped, the table created will loss some data. > > > > Thanks a lot for the use case. Yes with the current patch table will lose > data related to the subplan. On analyzing further, I think we can not allow > parallel inserts in the cases when the Gather node has some projections > to do. Because the workers can not perform that projection. So, having > ps_ProjInfo in the Gather node is an indication for us to disable parallel > inserts and only the leader can do the insertions after the Gather node > does the required projections. > > Thoughts? > Agreed. 2. @@ -166,6 +228,16 @@ ExecGather(PlanState *pstate) { ParallelContext *pcxt; + /* + * Take the necessary information to be passed to workers for + * parallel inserts in CTAS. + */ + if (ISCTAS(node->ps.intoclause)) + { + node->ps.lefttree->intoclause = node->ps.intoclause; + node->ps.lefttree->objectid = node->ps.objectid; + } + /* Initialize, or re-initialize, shared state needed by workers. */ if (!node->pei) node->pei = ExecInitParallelPlan(node->ps.lefttree, I found the code pass intoclause and objectid to Gather node's lefttree. Is it necessary? It seems only Gather node will use the information. Best regards, houzj
On Fri, Nov 27, 2020 at 11:57 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > Thanks a lot for the use case. Yes with the current patch table will lose > > data related to the subplan. On analyzing further, I think we can not allow > > parallel inserts in the cases when the Gather node has some projections > > to do. Because the workers can not perform that projection. So, having > > ps_ProjInfo in the Gather node is an indication for us to disable parallel > > inserts and only the leader can do the insertions after the Gather node > > does the required projections. > > > > Thoughts? > > > > Agreed. > Thanks! I will add/modify IsParallelInsertInCTASAllowed() to return false in this case. > > 2. > @@ -166,6 +228,16 @@ ExecGather(PlanState *pstate) > { > ParallelContext *pcxt; > > + /* > + * Take the necessary information to be passed to workers for > + * parallel inserts in CTAS. > + */ > + if (ISCTAS(node->ps.intoclause)) > + { > + node->ps.lefttree->intoclause = node->ps.intoclause; > + node->ps.lefttree->objectid = node->ps.objectid; > + } > + > /* Initialize, or re-initialize, shared state needed by workers. */ > if (!node->pei) > node->pei = ExecInitParallelPlan(node->ps.lefttree, > > I found the code pass intoclause and objectid to Gather node's lefttree. > Is it necessary? It seems only Gather node will use the information. > I am passing the required information from the up to here through PlanState structure. Since the Gather node's leftree is also a PlanState structure variable, here I just assigned them to pass that information to ExecInitParallelPlan(). With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On 25-11-2020 03:40, Bharath Rupireddy wrote: > On Tue, Nov 24, 2020 at 4:43 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: >> >> I'm very interested in this feature, >> and I'm looking at the patch, here are some comments. >> > > Thanks for the review. > >> >> How about the following style: >> >> if(TupIsNull(outerTupleSlot)) >> Break; >> >> (void) node->ps.dest->receiveSlot(outerTupleSlot, node->ps.dest); >> node->ps.state->es_processed++; >> >> Which looks cleaner. >> > > Done. > >> >> The check can be replaced by ISCTAS(into). >> > > Done. > >> >> 'inerst' looks like a typo (insert). >> > > Corrected. > >> >> The code here call strlen(intoclausestr) for two times, >> After checking the existing code in ExecInitParallelPlan, >> It used to store the strlen in a variable. >> >> So how about the following style: >> >> intoclause_len = strlen(intoclausestr); >> ... >> /* Store serialized intoclause. */ >> intoclause_space = shm_toc_allocate(pcxt->toc, intoclause_len + 1); >> memcpy(shmptr, intoclausestr, intoclause_len + 1); >> shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space); >> > > Done. > >> >> The two check about intoclausestr seems can be combined like: >> >> if (intoclausestr != NULL) >> { >> ... >> } >> else >> { >> ... >> } >> > > Done. > > Attaching v5 patch. Please consider it for further review. > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com > Disclaimer: I have by no means throughly reviewed all the involved parts and am probably missing quite a bit of context so if I understood parts wrong or they have been discussed before then I'm sorry. Most notably the whole situation about the command-id is still elusive for me and I can really not judge yet anything related to that. IMHO The patch makes that we now have the gather do most of the CTAS work, which seems unwanted. For the non-ctas insert/update case it seems that a modifytable node exists to actually do the work. What I'm wondering is if it is maybe not better to introduce a CreateTable node as well? This would have several merits: - the rowcount of that node would be 0 for the parallel case, and non-zero for the serial case. Then the gather ndoe and the Query struct don't have to know about CTAS for the most part, removing e.g. the case distinctions in cost_gather. - the inserted rows can now be accounted in this new node instead of the parallel executor state, and this node can also do its own DSM intializations - the generation of a partial variants of the CreateTable node can now be done in the optimizer instead of the ExecCreateTableAs which IMHO is a more logical place to make these kind of decisions. which then also makes it potentially play nicer with costs and the like. - the explain code can now be in its own place instead of part of the gather node - IIUC it would allow the removal of the code to only launch parallel workers if its not CTAS, which IMHO would be quite a big benefit. Thoughts? Some small things I noticed while going through the patch: - Typo for the comment about "inintorel_startup" which should be intorel_startup - if (node->nworkers_launched == 0 && !node->need_to_scan_locally) can be changed into if (node->nworkers_launched == 0 because either way it'll be true. Regards, Luc Swarm64
On Fri, Nov 27, 2020 at 1:07 PM Luc Vlaming <luc@swarm64.com> wrote: > > Disclaimer: I have by no means throughly reviewed all the involved parts > and am probably missing quite a bit of context so if I understood parts > wrong or they have been discussed before then I'm sorry. Most notably > the whole situation about the command-id is still elusive for me and I > can really not judge yet anything related to that. > > IMHO The patch makes that we now have the gather do most of the CTAS > work, which seems unwanted. For the non-ctas insert/update case it seems > that a modifytable node exists to actually do the work. What I'm > wondering is if it is maybe not better to introduce a CreateTable node > as well? > This would have several merits: > - the rowcount of that node would be 0 for the parallel case, and > non-zero for the serial case. Then the gather ndoe and the Query struct > don't have to know about CTAS for the most part, removing e.g. the case > distinctions in cost_gather. > - the inserted rows can now be accounted in this new node instead of the > parallel executor state, and this node can also do its own DSM > intializations > - the generation of a partial variants of the CreateTable node can now > be done in the optimizer instead of the ExecCreateTableAs which IMHO is > a more logical place to make these kind of decisions. which then also > makes it potentially play nicer with costs and the like. > - the explain code can now be in its own place instead of part of the > gather node > - IIUC it would allow the removal of the code to only launch parallel > workers if its not CTAS, which IMHO would be quite a big benefit. > > Thoughts? > If I'm not wrong, I think currently we have no exec nodes for DDLs. I'm not sure whether we would like to introduce one for this. And also note that, both CTAS and CREATE MATERIALIZED VIEW(CMV) are handled with the same code, so if we have CreateTable as the new node, then do we also want to have another node or a generic node name? The main design idea of the patch proposed in this thread is that pushing the dest receiver down to the workers if the SELECT part of the CTAS or CMV is parallelizable. And also, for CTAS or CMV we do not do any planning as such, but the planner is just influenced to take into consideration that there are no tuples to transfer from the workers to Gather node which may make the planner choose parallelism for SELECT part. So, the planner work for CTAS or CMV is very minimal. I also have the idea of extending this design (if accepted) to REFRESH MATERIALIZED VIEW after some analysis. I may be wrong above, other hackers may have better opinions. > > Some small things I noticed while going through the patch: > - Typo for the comment about "inintorel_startup" which should be > intorel_startup > Corrected. > > - if (node->nworkers_launched == 0 && !node->need_to_scan_locally) > > can be changed into > if (node->nworkers_launched == 0 > because either way it'll be true. > Yes, !node->need_to_scan_locally is not necessary, we need to set it to true if there are no workers launched. I removed !node->need_to_scan_locally check from the if clause. > On Fri, Nov 27, 2020 at 11:57 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > > Thanks a lot for the use case. Yes with the current patch table will lose > > > data related to the subplan. On analyzing further, I think we can not allow > > > parallel inserts in the cases when the Gather node has some projections > > > to do. 
Because the workers can not perform that projection. So, having > > > ps_ProjInfo in the Gather node is an indication for us to disable parallel > > > inserts and only the leader can do the insertions after the Gather node > > > does the required projections. > > > > > > Thoughts? > > > > Agreed. > > Thanks! I will add/modify IsParallelInsertInCTASAllowed() to return > false in this case. > Modified. Attaching v6 patch that has the above review comments addressed. Please review it further. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
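To make the "push the dest receiver down to the workers" idea above concrete, here is a minimal sketch of the worker-side choice as it could look inside ParallelQueryMain() in execParallel.c. The field fpes->intoclause is an assumption standing in for however the patch ships the IntoClause (plus object id and command id) through shared memory; only CreateIntoRelDestReceiver() and ExecParallelGetReceiver() are existing functions.

DestReceiver *receiver;

if (fpes->intoclause != NULL)
{
    /* Parallel CTAS: this worker writes its tuples straight into the
     * target table instead of shipping them to the leader. */
    receiver = CreateIntoRelDestReceiver(fpes->intoclause);
}
else
{
    /* Regular parallel query: tuples go back through the tuple queue
     * to the leader's Gather node. */
    receiver = ExecParallelGetReceiver(seg, toc);
}

With that receiver in place the rest of the worker code path stays the same: the worker runs its plan fragment with ExecutorRun() and the receiver decides where each tuple lands.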
On Mon, Nov 30, 2020 at 10:43 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Fri, Nov 27, 2020 at 1:07 PM Luc Vlaming <luc@swarm64.com> wrote: > > > > Disclaimer: I have by no means throughly reviewed all the involved parts > > and am probably missing quite a bit of context so if I understood parts > > wrong or they have been discussed before then I'm sorry. Most notably > > the whole situation about the command-id is still elusive for me and I > > can really not judge yet anything related to that. > > > > IMHO The patch makes that we now have the gather do most of the CTAS > > work, which seems unwanted. For the non-ctas insert/update case it seems > > that a modifytable node exists to actually do the work. What I'm > > wondering is if it is maybe not better to introduce a CreateTable node > > as well? > > This would have several merits: > > - the rowcount of that node would be 0 for the parallel case, and > > non-zero for the serial case. Then the gather ndoe and the Query struct > > don't have to know about CTAS for the most part, removing e.g. the case > > distinctions in cost_gather. > > - the inserted rows can now be accounted in this new node instead of the > > parallel executor state, and this node can also do its own DSM > > intializations > > - the generation of a partial variants of the CreateTable node can now > > be done in the optimizer instead of the ExecCreateTableAs which IMHO is > > a more logical place to make these kind of decisions. which then also > > makes it potentially play nicer with costs and the like. > > - the explain code can now be in its own place instead of part of the > > gather node > > - IIUC it would allow the removal of the code to only launch parallel > > workers if its not CTAS, which IMHO would be quite a big benefit. > > > > Thoughts? > > > > If I'm not wrong, I think currently we have no exec nodes for DDLs. > I'm not sure whether we would like to introduce one for this. > Yeah, I am also not in favor of having an executor node for CTAS but OTOH, I also don't like the way you have jammed the relevant information in generic PlanState. How about keeping it in GatherState and initializing it in ExecCreateTableAs() after the executor start. You are already doing special treatment for the Gather node in ExecCreateTableAs (via IsParallelInsertInCTASAllowed) so we can as well initialize the required information in GatherState in ExecCreateTableAs. I think that might help in reducing the special treatment for intoclause at different places. Few other assorted comments: ========================= 1. +/* + * IsParallelInsertInCTASAllowed --- determine whether or not parallel + * insertion is possible. + */ +bool IsParallelInsertInCTASAllowed(IntoClause *into, QueryDesc *queryDesc) +{ .. .. if (ps && IsA(ps, GatherState) && !ps->ps_ProjInfo && + plannedstmt->parallelModeNeeded && + plannedstmt->planTree && + IsA(plannedstmt->planTree, Gather) && + plannedstmt->planTree->lefttree && + plannedstmt->planTree->lefttree->parallel_aware && + plannedstmt->planTree->lefttree->parallel_safe) + { + /* + * Since there are no rows that are transferred from workers to + * Gather node, so we set it to 0 to be visible in explain + * plans. Note that we would have already accounted this for + * cost calculations in cost_gather(). + */ + plannedstmt->planTree->plan_rows = 0; This looks a bit odd. 
The function name 'IsParallelInsertInCTASAllowed' suggests that it just checks whether parallelism is allowed but it is internally changing the plan_rows. It might be better to do this separately if the parallelism is allowed. 2. static void ExecShutdownGatherWorkers(GatherState *node); - +static void ExecParallelInsertInCTAS(GatherState *node); Spurious line removal. 3. /* Wait for the parallel workers to finish. */ + if (node->nworkers_launched > 0) + { + ExecShutdownGatherWorkers(node); + + /* + * Add up the total tuples inserted by all workers, to the tuples + * inserted by the leader(if any). This will be shared to client. + */ + node->ps.state->es_processed += pg_atomic_read_u64(node->pei->processed); + } The comment and code appear a bit misleading as the function seems to shutdown the workers rather than waiting for them to finish. How about using something like below: /* * Next, accumulate buffer and WAL usage. (This must wait for the workers * to finish, or we might get incomplete data.) */ if (nworkers > 0) { int i; /* Wait for all vacuum workers to finish */ WaitForParallelWorkersToFinish(lps->pcxt); for (i = 0; i < lps->pcxt->nworkers_launched; i++) InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]); } This is how it works for parallel vacuum. 4. + + /* + * Make the number of tuples that are transferred from workers to gather + * node zero as each worker parallelly insert the tuples that are resulted + * from its chunk of plan execution. This change may make the parallel + * plan cheap among all other plans, and influence the planner to consider + * this parallel plan. + */ + if (!(root->parse->isForCTAS && + root->query_level == 1)) + run_cost += parallel_tuple_cost * path->path.rows; The above comment doesn't seem to convey what it intends to convey. How about changing it slightly as: "We don't compute the parallel_tuple_cost for CTAS because the number of tuples that are transferred from workers to the gather node is zero as each worker parallelly inserts the tuples that are resulted from its chunk of plan execution. This change may make the parallel plan cheap among all other plans, and influence the planner to consider this parallel plan." Then, we can also have an Assert for path->path.rows to zero for the CTAS case. 5. + /* Prallel inserts in CTAS related info is specified below. */ + IntoClause *intoclause; + Oid objectid; + DestReceiver *dest; } PlanState; Typo. /Prallel/Parallel 6. Currently, it seems the plan look like: Gather (actual time=970.524..972.913 rows=0 loops=1) -> Create t1_test Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on t1 (actual time=0.028..86.623 rows=333333 loops=3) I would prefer it to be: Gather (actual time=970.524..972.913 rows=0 loops=1) Workers Planned: 2 Workers Launched: 2 -> Create t1_test -> Parallel Seq Scan on t1 (actual time=0.028..86.623 rows=333333 loops=3) This way it looks like the writing part is done below the Gather node and also it will match the Parallel Insert patch of Greg. -- With Regards, Amit Kapila.
Thanks Amit for the review comments. On Sat, Dec 5, 2020 at 4:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > If I'm not wrong, I think currently we have no exec nodes for DDLs. > > I'm not sure whether we would like to introduce one for this. > > Yeah, I am also not in favor of having an executor node for CTAS but > OTOH, I also don't like the way you have jammed the relevant > information in generic PlanState. How about keeping it in GatherState > and initializing it in ExecCreateTableAs() after the executor start. > You are already doing special treatment for the Gather node in > ExecCreateTableAs (via IsParallelInsertInCTASAllowed) so we can as > well initialize the required information in GatherState in > ExecCreateTableAs. I think that might help in reducing the special > treatment for intoclause at different places. > Done. Added required info to GatherState node. While this reduced the changes at many other places, but had to pass the into clause and object id to ExecInitParallelPlan() as we do not send GatherState node to it. Hope that's okay. > > Few other assorted comments: > ========================= > 1. > This looks a bit odd. The function name > 'IsParallelInsertInCTASAllowed' suggests that it just checks whether > parallelism is allowed but it is internally changing the plan_rows. It > might be better to do this separately if the parallelism is allowed. > Changed. > > 2. > static void ExecShutdownGatherWorkers(GatherState *node); > - > +static void ExecParallelInsertInCTAS(GatherState *node); > > Spurious line removal. > Corrected. > > 3. > The comment and code appear a bit misleading as the function seems to > shutdown the workers rather than waiting for them to finish. How about > using something like below: > > /* > * Next, accumulate buffer and WAL usage. (This must wait for the workers > * to finish, or we might get incomplete data.) > */ > if (nworkers > 0) > { > int i; > > /* Wait for all vacuum workers to finish */ > WaitForParallelWorkersToFinish(lps->pcxt); > > for (i = 0; i < lps->pcxt->nworkers_launched; i++) > InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]); > } > > This is how it works for parallel vacuum. > Done. > > 4. > The above comment doesn't seem to convey what it intends to convey. > How about changing it slightly as: "We don't compute the > parallel_tuple_cost for CTAS because the number of tuples that are > transferred from workers to the gather node is zero as each worker > parallelly inserts the tuples that are resulted from its chunk of plan > execution. This change may make the parallel plan cheap among all > other plans, and influence the planner to consider this parallel > plan." > Changed. > > Then, we can also have an Assert for path->path.rows to zero for the CTAS case. > We can not have Assert(path->path.rows == 0), because we are not changing this parameter upstream in or before the planning phase. We are just skipping to take it into account for CTAS. We may have to do extra checks over different places in case we have to make planner path->path.rows to 0 for CTAS. IMHO, that's not necessary. We can just skip taking this value in cost_gather. Thoughts? > > 5. > + /* Prallel inserts in CTAS related info is specified below. */ > + IntoClause *intoclause; > + Oid objectid; > + DestReceiver *dest; > } PlanState; > > Typo. /Prallel/Parallel > Corrected. > > 6. 
> Currently, it seems the plan look like: > Gather (actual time=970.524..972.913 rows=0 loops=1) > -> Create t1_test > Workers Planned: 2 > Workers Launched: 2 > -> Parallel Seq Scan on t1 (actual time=0.028..86.623 rows=333333 loops=3) > > I would prefer it to be: > Gather (actual time=970.524..972.913 rows=0 loops=1) > Workers Planned: 2 > Workers Launched: 2 > -> Create t1_test > -> Parallel Seq Scan on t1 (actual time=0.028..86.623 rows=333333 loops=3) > > This way it looks like the writing part is done below the Gather node > and also it will match the Parallel Insert patch of Greg. > Done. Attaching v7 patch. Please review it further. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
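For reference, the leader-side accounting after switching to the parallel-vacuum style wait suggested in comment 3 above would look roughly like the sketch below. node->pei->processed is the shared, atomically updated insert counter named in the posted hunks; the exact code in v7 may differ in details.

if (node->nworkers_launched > 0)
{
    /* Wait for the workers first, or the shared counter could be read
     * before every worker has published its final count. */
    WaitForParallelWorkersToFinish(node->pei->pcxt);

    /* Fold the workers' inserted-tuple counts into the count reported
     * to the client. */
    node->ps.state->es_processed +=
        pg_atomic_read_u64(node->pei->processed);
}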
Thanks for the comments. On Mon, Dec 7, 2020 at 8:56 AM Zhihong Yu <zyu@yugabyte.com> wrote: > > > + (void) SetCurrentCommandIdUsedForWorker(); > + myState->output_cid = GetCurrentCommandId(false); > > SetCurrentCommandIdUsedForWorker already has void as return type. The '(void)' is not needed. > Removed. > > + * rd_createSubid is marked invalid, otherwise, the table is > + * not allowed to extend by the workers. > > nit: to extend by the workers -> to be extended by the workers > Changed. > > For IsParallelInsertInCTASAllowed, logic is inside 'if (IS_CTAS(into))' block. > You can return false when (!IS_CTAS(into)) - this would save some indentation for the body. > Done. > > + if (rel && rel->relpersistence != RELPERSISTENCE_TEMP) > + allowed = true; > > Similarly, when the above condition doesn't hold, you can return false directly - reducing the next if condition to 'if(queryDesc)'. > Done. > > The composite condition is negated. Maybe you can write without negation: > Done. > > + * Write out the number of tuples this worker has inserted. Leader will use > + * it to inform to the end client. > > 'inform to the end client' -> 'inform the end client' (without to) > Changed. Attaching v8 patch. Consider this for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi + /* + * Flag to let the planner know that the SELECT query is for CTAS. This is + * used to calculate the tuple transfer cost from workers to gather node(in + * case parallelism kicks in for the SELECT part of the CTAS), to zero as + * each worker will insert its share of tuples in parallel. + */ + if (IsParallelInsertInCTASAllowed(into, NULL)) + query->isForCTAS = true; + /* + * We do not compute the parallel_tuple_cost for CTAS because the number of + * tuples that are transferred from workers to the gather node is zero as + * each worker, in parallel, inserts the tuples that are resulted from its + * chunk of plan execution. This change may make the parallel plan cheap + * among all other plans, and influence the planner to consider this + * parallel plan. + */ + if (!(root->parse->isForCTAS && + root->query_level == 1)) + run_cost += parallel_tuple_cost * path->path.rows; I noticed that the parallel_tuple_cost will still be ignored, When Gather is not the top node. Example: Create table test(i int); insert into test values(generate_series(1,10000000,1)); explain create table ntest3 as select * from test where i < 200 limit 10000; QUERY PLAN ------------------------------------------------------------------------------- Limit (cost=1000.00..97331.33 rows=1000 width=4) -> Gather (cost=1000.00..97331.33 rows=1000 width=4) Workers Planned: 2 -> Parallel Seq Scan on test (cost=0.00..96331.33 rows=417 width=4) Filter: (i < 200) The isForCTAS will be true because [create table as], the query_level is always 1 because there is no subquery. So even if gather is not the top node, parallel cost will still be ignored. Is that works as expected ? Best regards, houzj
On Mon, Dec 7, 2020 at 11:32 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > Hi > > + /* > + * Flag to let the planner know that the SELECT query is for CTAS. This is > + * used to calculate the tuple transfer cost from workers to gather node(in > + * case parallelism kicks in for the SELECT part of the CTAS), to zero as > + * each worker will insert its share of tuples in parallel. > + */ > + if (IsParallelInsertInCTASAllowed(into, NULL)) > + query->isForCTAS = true; > > > + /* > + * We do not compute the parallel_tuple_cost for CTAS because the number of > + * tuples that are transferred from workers to the gather node is zero as > + * each worker, in parallel, inserts the tuples that are resulted from its > + * chunk of plan execution. This change may make the parallel plan cheap > + * among all other plans, and influence the planner to consider this > + * parallel plan. > + */ > + if (!(root->parse->isForCTAS && > + root->query_level == 1)) > + run_cost += parallel_tuple_cost * path->path.rows; > > I noticed that the parallel_tuple_cost will still be ignored, > When Gather is not the top node. > > Example: > Create table test(i int); > insert into test values(generate_series(1,10000000,1)); > explain create table ntest3 as select * from test where i < 200 limit 10000; > > QUERY PLAN > ------------------------------------------------------------------------------- > Limit (cost=1000.00..97331.33 rows=1000 width=4) > -> Gather (cost=1000.00..97331.33 rows=1000 width=4) > Workers Planned: 2 > -> Parallel Seq Scan on test (cost=0.00..96331.33 rows=417 width=4) > Filter: (i < 200) > > > The isForCTAS will be true because [create table as], the > query_level is always 1 because there is no subquery. > So even if gather is not the top node, parallel cost will still be ignored. > > Is that works as expected ? > I don't think that is expected and is not the case without this patch. The cost shouldn't be changed for existing cases where the write is not pushed to workers. -- With Regards, Amit Kapila.
On Mon, Dec 7, 2020 at 2:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 7, 2020 at 11:32 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > Hi > > > > + /* > > + * Flag to let the planner know that the SELECT query is for CTAS. This is > > + * used to calculate the tuple transfer cost from workers to gather node(in > > + * case parallelism kicks in for the SELECT part of the CTAS), to zero as > > + * each worker will insert its share of tuples in parallel. > > + */ > > + if (IsParallelInsertInCTASAllowed(into, NULL)) > > + query->isForCTAS = true; > > > > > > + /* > > + * We do not compute the parallel_tuple_cost for CTAS because the number of > > + * tuples that are transferred from workers to the gather node is zero as > > + * each worker, in parallel, inserts the tuples that are resulted from its > > + * chunk of plan execution. This change may make the parallel plan cheap > > + * among all other plans, and influence the planner to consider this > > + * parallel plan. > > + */ > > + if (!(root->parse->isForCTAS && > > + root->query_level == 1)) > > + run_cost += parallel_tuple_cost * path->path.rows; > > > > I noticed that the parallel_tuple_cost will still be ignored, > > When Gather is not the top node. > > > > Example: > > Create table test(i int); > > insert into test values(generate_series(1,10000000,1)); > > explain create table ntest3 as select * from test where i < 200 limit 10000; > > > > QUERY PLAN > > ------------------------------------------------------------------------------- > > Limit (cost=1000.00..97331.33 rows=1000 width=4) > > -> Gather (cost=1000.00..97331.33 rows=1000 width=4) > > Workers Planned: 2 > > -> Parallel Seq Scan on test (cost=0.00..96331.33 rows=417 width=4) > > Filter: (i < 200) > > > > > > The isForCTAS will be true because [create table as], the > > query_level is always 1 because there is no subquery. > > So even if gather is not the top node, parallel cost will still be ignored. > > > > Is that works as expected ? > > > > I don't think that is expected and is not the case without this patch. > The cost shouldn't be changed for existing cases where the write is > not pushed to workers. > Thanks for pointing that out. Yes it should not change for the cases where parallel inserts will not be picked later. Any better suggestions on how to make the planner consider that the CTAS might choose parallel inserts later at the same time avoiding the above issue in case it doesn't? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 7, 2020 at 3:44 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Mon, Dec 7, 2020 at 2:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Dec 7, 2020 at 11:32 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > > > + if (!(root->parse->isForCTAS && > > > + root->query_level == 1)) > > > + run_cost += parallel_tuple_cost * path->path.rows; > > > > > > I noticed that the parallel_tuple_cost will still be ignored, > > > When Gather is not the top node. > > > > > > Example: > > > Create table test(i int); > > > insert into test values(generate_series(1,10000000,1)); > > > explain create table ntest3 as select * from test where i < 200 limit 10000; > > > > > > QUERY PLAN > > > ------------------------------------------------------------------------------- > > > Limit (cost=1000.00..97331.33 rows=1000 width=4) > > > -> Gather (cost=1000.00..97331.33 rows=1000 width=4) > > > Workers Planned: 2 > > > -> Parallel Seq Scan on test (cost=0.00..96331.33 rows=417 width=4) > > > Filter: (i < 200) > > > > > > > > > The isForCTAS will be true because [create table as], the > > > query_level is always 1 because there is no subquery. > > > So even if gather is not the top node, parallel cost will still be ignored. > > > > > > Is that works as expected ? > > > > > > > I don't think that is expected and is not the case without this patch. > > The cost shouldn't be changed for existing cases where the write is > > not pushed to workers. > > > > Thanks for pointing that out. Yes it should not change for the cases > where parallel inserts will not be picked later. > > Any better suggestions on how to make the planner consider that the > CTAS might choose parallel inserts later at the same time avoiding the > above issue in case it doesn't? > What is the need of checking query_level when 'isForCTAS' is set only when Gather is a top-node? -- With Regards, Amit Kapila.
On Mon, Dec 7, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > What is the need of checking query_level when 'isForCTAS' is set only > when Gather is a top-node? > isForCTAS is getting set before pg_plan_query() which is being used in cost_gather(). We will not have a Gather node by then and hence will not pass queryDesc to IsParallelInsertInCTASAllowed(into, NULL) while setting isForCTAS to true. Intention to check query_level == 1 in cost_gather is to consider for only top level query not for other sub queries. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 7, 2020 at 4:20 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Mon, Dec 7, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > What is the need of checking query_level when 'isForCTAS' is set only > > when Gather is a top-node? > > > > isForCTAS is getting set before pg_plan_query() which is being used in > cost_gather(). We will not have a Gather node by then and hence will > not pass queryDesc to IsParallelInsertInCTASAllowed(into, NULL) while > setting isForCTAS to true. > IsParallelInsertInCTASAllowed() seems to be returning false if queryDesc is NULL, so won't isForCTAS be always set to false? I think I am missing something here. -- With Regards, Amit Kapila.
On Mon, Dec 7, 2020 at 5:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 7, 2020 at 4:20 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Mon, Dec 7, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > What is the need of checking query_level when 'isForCTAS' is set only > > > when Gather is a top-node? > > > > > > > isForCTAS is getting set before pg_plan_query() which is being used in > > cost_gather(). We will not have a Gather node by then and hence will > > not pass queryDesc to IsParallelInsertInCTASAllowed(into, NULL) while > > setting isForCTAS to true. > > > > IsParallelInsertInCTASAllowed() seems to be returning false if > queryDesc is NULL, so won't isForCTAS be always set to false? I think > I am missing something here. > My bad. I utterly missed this, sorry for the confusion. My intention to have IsParallelInsertInCTASAllowed() is for two purposes. 1. when called before planning without queryDesc, it should return true if IS_CTAS(into) is true and is not a temporary table. 2. when called after planning with a non-null queryDesc, along with 1) checks, it should also perform the Gather State checks and return accordingly. I have corrected it in v9 patch. Please have a look. > > > > The isForCTAS will be true because [create table as], the > > > query_level is always 1 because there is no subquery. > > > So even if gather is not the top node, parallel cost will still be ignored. > > > > > > Is that works as expected ? > > > > > > > I don't think that is expected and is not the case without this patch. > > The cost shouldn't be changed for existing cases where the write is > > not pushed to workers. > > > > Thanks for pointing that out. Yes it should not change for the cases > where parallel inserts will not be picked later. > > Any better suggestions on how to make the planner consider that the > CTAS might choose parallel inserts later at the same time avoiding the > above issue in case it doesn't? > I'm not quite sure how to address this. Can we not allow the planner to consider that the select is for CTAS and check only after the planning is done for the Gather node and other checks? This is simple to do, but we might miss some parallel plans for the SELECTs because the planner would have already considered the tuple transfer cost from workers to Gather wrongly because of which that parallel plan would have become costlier compared to non parallel plans. IMO, we can do this since it also keeps the existing behaviour of the planner i.e. when the planner is planning for SELECTs it doesn't know that it is doing it for CTAS. Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Dec 7, 2020 at 7:04 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Mon, Dec 7, 2020 at 5:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Dec 7, 2020 at 4:20 PM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > On Mon, Dec 7, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > What is the need of checking query_level when 'isForCTAS' is set only > > > > when Gather is a top-node? > > > > > > > > > > isForCTAS is getting set before pg_plan_query() which is being used in > > > cost_gather(). We will not have a Gather node by then and hence will > > > not pass queryDesc to IsParallelInsertInCTASAllowed(into, NULL) while > > > setting isForCTAS to true. > > > > > > > IsParallelInsertInCTASAllowed() seems to be returning false if > > queryDesc is NULL, so won't isForCTAS be always set to false? I think > > I am missing something here. > > > > My bad. I utterly missed this, sorry for the confusion. > > My intention to have IsParallelInsertInCTASAllowed() is for two > purposes. 1. when called before planning without queryDesc, it should > return true if IS_CTAS(into) is true and is not a temporary table. 2. > when called after planning with a non-null queryDesc, along with 1) > checks, it should also perform the Gather State checks and return > accordingly. > > I have corrected it in v9 patch. Please have a look. > > > > > > > The isForCTAS will be true because [create table as], the > > > > query_level is always 1 because there is no subquery. > > > > So even if gather is not the top node, parallel cost will still be ignored. > > > > > > > > Is that works as expected ? > > > > > > > > > > I don't think that is expected and is not the case without this patch. > > > The cost shouldn't be changed for existing cases where the write is > > > not pushed to workers. > > > > > > > Thanks for pointing that out. Yes it should not change for the cases > > where parallel inserts will not be picked later. > > > > Any better suggestions on how to make the planner consider that the > > CTAS might choose parallel inserts later at the same time avoiding the > > above issue in case it doesn't? > > > > I'm not quite sure how to address this. Can we not allow the planner > to consider that the select is for CTAS and check only after the > planning is done for the Gather node and other checks? This is simple > to do, but we might miss some parallel plans for the SELECTs because > the planner would have already considered the tuple transfer cost from > workers to Gather wrongly because of which that parallel plan would > have become costlier compared to non parallel plans. IMO, we can do > this since it also keeps the existing behaviour of the planner i.e. > when the planner is planning for SELECTs it doesn't know that it is > doing it for CTAS. Thoughts? > I have done some initial review and I have a few comments. @@ -328,6 +316,15 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt, query = linitial_node(Query, rewritten); Assert(query->commandType == CMD_SELECT); + /* + * Flag to let the planner know that the SELECT query is for CTAS. This + * is used to calculate the tuple transfer cost from workers to gather + * node(in case parallelism kicks in for the SELECT part of the CTAS), + * to zero as each worker will insert its share of tuples in parallel. 
+ */ + if (IsParallelInsertInCTASAllowed(into, NULL)) + query->isForCTAS = true; + /* plan the query */ plan = pg_plan_query(query, pstate->p_sourcetext, CURSOR_OPT_PARALLEL_OK, params); @@ -350,6 +347,15 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt, /* call ExecutorStart to prepare the plan for execution */ ExecutorStart(queryDesc, GetIntoRelEFlags(into)); + /* + * If SELECT part of the CTAS is parallelizable, then make each + * parallel worker insert the tuples that are resulted in its execution + * into the target table. We need plan state to be initialized by the + * executor to decide whether to allow parallel inserts or not. + */ + if (IsParallelInsertInCTASAllowed(into, queryDesc)) + SetCTASParallelInsertState(queryDesc); Once we have called IsParallelInsertInCTASAllowed and set the query->isForCTAS flag then why we are calling this again? —— --- + */ + if (!(root->parse->isForCTAS && + root->query_level == 1)) + run_cost += parallel_tuple_cost * path->path.rows; From this check, it appeared that the lower level gather will also get influenced by this, consider this -> NLJ -> Gather -> Parallel Seq Scan -> Index Scan This condition is only checking that it should be a top-level query and it should be under CTAS then this will impact all the gather nodes as shown in the above example. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 7, 2020 at 7:04 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > I'm not quite sure how to address this. Can we not allow the planner > to consider that the select is for CTAS and check only after the > planning is done for the Gather node and other checks? > IIUC, you are saying that we should not influence the cost of gather node even when the insertion would be done by workers? I think that should be our fallback option anyway but that might miss some paths to be considered parallel where the cost becomes more due to parallel_tuple_cost (aka tuple transfer cost). I think the idea is we can avoid the tuple transfer cost only when Gather is the top node because only at that time we can push insertion down, right? How about if we have some way to detect the same before calling generate_useful_gather_paths()? I think when we are calling apply_scanjoin_target_to_paths() in grouping_planner(), if the query_level is 1, it is for CTAS, and it doesn't have a chance to create UPPER_REL (doesn't have grouping, order, limit, etc clause) then we can probably assume that the Gather will be top_node. I am not sure about this but I think it is worth exploring. -- With Regards, Amit Kapila.
> > I'm not quite sure how to address this. Can we not allow the planner > > to consider that the select is for CTAS and check only after the > > planning is done for the Gather node and other checks? > > > > IIUC, you are saying that we should not influence the cost of gather node > even when the insertion would be done by workers? I think that should be > our fallback option anyway but that might miss some paths to be considered > parallel where the cost becomes more due to parallel_tuple_cost (aka tuple > transfer cost). I think the idea is we can avoid the tuple transfer cost > only when Gather is the top node because only at that time we can push > insertion down, right? How about if we have some way to detect the same > before calling generate_useful_gather_paths()? I think when we are calling > apply_scanjoin_target_to_paths() in grouping_planner(), if the > query_level is 1, it is for CTAS, and it doesn't have a chance to create > UPPER_REL (doesn't have grouping, order, limit, etc clause) then we can > probably assume that the Gather will be top_node. I am not sure about this > but I think it is worth exploring. > I took a look at the parallel insert patch and have the same idea. https://commitfest.postgresql.org/31/2844/ * Consider generating Gather or Gather Merge paths. We must only do this * if the relation is parallel safe, and we don't do it for child rels to * avoid creating multiple Gather nodes within the same plan. We must do * this after all paths have been generated and before set_cheapest, since * one of the generated paths may turn out to be the cheapest one. */ if (rel->consider_parallel && !IS_OTHER_REL(rel)) generate_useful_gather_paths(root, rel, false); IMO Gatherpath created here seems the right one which can possible ignore parallel cost if in CTAS. But We need check the following parse option which will create path to be the parent of Gatherpath here. if (root->parse->rowMarks) if (limit_needed(root->parse)) if (root->parse->sortClause) if (root->parse->distinctClause) if (root->parse->hasWindowFuncs) if (root->parse->groupClause || root->parse->groupingSets || root->parse->hasAggs || root->root->hasHavingQual) Best regards, houzj
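Putting the checks listed above together, a hypothetical helper (the name, and using Query->havingQual in place of PlannerInfo->hasHavingQual, are assumptions for illustration) that grouping_planner() could consult before generate_useful_gather_paths() to decide whether a Gather built for the top-level rel is guaranteed to stay the top plan node:

static bool
ctas_gather_would_be_top_node(Query *parse)
{
    /* Any of these forces an upper node above the scan/join Gather. */
    if (parse->rowMarks ||
        parse->limitCount ||
        parse->limitOffset ||
        parse->sortClause ||
        parse->distinctClause ||
        parse->hasWindowFuncs ||
        parse->groupClause ||
        parse->groupingSets ||
        parse->hasAggs ||
        parse->havingQual)
        return false;

    return true;
}

Whether the raw limit fields or limit_needed() is the right test, and whether the call belongs in apply_scanjoin_target_to_paths() as suggested earlier, is exactly what the 0002 patch has to settle.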
On Tue, Dec 8, 2020 at 6:24 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > I'm not quite sure how to address this. Can we not allow the planner > > > to consider that the select is for CTAS and check only after the > > > planning is done for the Gather node and other checks? > > > > > > > IIUC, you are saying that we should not influence the cost of gather node > > even when the insertion would be done by workers? I think that should be > > our fallback option anyway but that might miss some paths to be considered > > parallel where the cost becomes more due to parallel_tuple_cost (aka tuple > > transfer cost). I think the idea is we can avoid the tuple transfer cost > > only when Gather is the top node because only at that time we can push > > insertion down, right? How about if we have some way to detect the same > > before calling generate_useful_gather_paths()? I think when we are calling > > apply_scanjoin_target_to_paths() in grouping_planner(), if the > > query_level is 1, it is for CTAS, and it doesn't have a chance to create > > UPPER_REL (doesn't have grouping, order, limit, etc clause) then we can > > probably assume that the Gather will be top_node. I am not sure about this > > but I think it is worth exploring. > > > > I took a look at the parallel insert patch and have the same idea. > https://commitfest.postgresql.org/31/2844/ > > * Consider generating Gather or Gather Merge paths. We must only do this > * if the relation is parallel safe, and we don't do it for child rels to > * avoid creating multiple Gather nodes within the same plan. We must do > * this after all paths have been generated and before set_cheapest, since > * one of the generated paths may turn out to be the cheapest one. > */ > if (rel->consider_parallel && !IS_OTHER_REL(rel)) > generate_useful_gather_paths(root, rel, false); > > IMO Gatherpath created here seems the right one which can possible ignore parallel cost if in CTAS. > But We need check the following parse option which will create path to be the parent of Gatherpath here. > > if (root->parse->rowMarks) > if (limit_needed(root->parse)) > if (root->parse->sortClause) > if (root->parse->distinctClause) > if (root->parse->hasWindowFuncs) > if (root->parse->groupClause || root->parse->groupingSets || root->parse->hasAggs || root->root->hasHavingQual) > Thanks Amit and Hou. I will look into these areas and get back soon. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 8, 2020 at 6:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Tue, Dec 8, 2020 at 6:24 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > > > I'm not quite sure how to address this. Can we not allow the planner > > > > to consider that the select is for CTAS and check only after the > > > > planning is done for the Gather node and other checks? > > > > > > > > > > IIUC, you are saying that we should not influence the cost of gather node > > > even when the insertion would be done by workers? I think that should be > > > our fallback option anyway but that might miss some paths to be considered > > > parallel where the cost becomes more due to parallel_tuple_cost (aka tuple > > > transfer cost). I think the idea is we can avoid the tuple transfer cost > > > only when Gather is the top node because only at that time we can push > > > insertion down, right? How about if we have some way to detect the same > > > before calling generate_useful_gather_paths()? I think when we are calling > > > apply_scanjoin_target_to_paths() in grouping_planner(), if the > > > query_level is 1, it is for CTAS, and it doesn't have a chance to create > > > UPPER_REL (doesn't have grouping, order, limit, etc clause) then we can > > > probably assume that the Gather will be top_node. I am not sure about this > > > but I think it is worth exploring. > > > > > > > I took a look at the parallel insert patch and have the same idea. > > https://commitfest.postgresql.org/31/2844/ > > > > * Consider generating Gather or Gather Merge paths. We must only do this > > * if the relation is parallel safe, and we don't do it for child rels to > > * avoid creating multiple Gather nodes within the same plan. We must do > > * this after all paths have been generated and before set_cheapest, since > > * one of the generated paths may turn out to be the cheapest one. > > */ > > if (rel->consider_parallel && !IS_OTHER_REL(rel)) > > generate_useful_gather_paths(root, rel, false); > > > > IMO Gatherpath created here seems the right one which can possible ignore parallel cost if in CTAS. > > But We need check the following parse option which will create path to be the parent of Gatherpath here. > > > > if (root->parse->rowMarks) > > if (limit_needed(root->parse)) > > if (root->parse->sortClause) > > if (root->parse->distinctClause) > > if (root->parse->hasWindowFuncs) > > if (root->parse->groupClause || root->parse->groupingSets || root->parse->hasAggs || root->root->hasHavingQual) > > > > Thanks Amit and Hou. I will look into these areas and get back soon. > It might be better to split the patch for this such that in the base patch, we won't consider anything special for gather costing w.r.t CTAS and in the next patch, we consider all the checks discussed above. -- With Regards, Amit Kapila.
On Tue, Dec 8, 2020 at 6:24 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > I'm not quite sure how to address this. Can we not allow the planner > > > to consider that the select is for CTAS and check only after the > > > planning is done for the Gather node and other checks? > > > > > > > IIUC, you are saying that we should not influence the cost of gather node > > even when the insertion would be done by workers? I think that should be > > our fallback option anyway but that might miss some paths to be considered > > parallel where the cost becomes more due to parallel_tuple_cost (aka tuple > > transfer cost). I think the idea is we can avoid the tuple transfer cost > > only when Gather is the top node because only at that time we can push > > insertion down, right? How about if we have some way to detect the same > > before calling generate_useful_gather_paths()? I think when we are calling > > apply_scanjoin_target_to_paths() in grouping_planner(), if the > > query_level is 1, it is for CTAS, and it doesn't have a chance to create > > UPPER_REL (doesn't have grouping, order, limit, etc clause) then we can > > probably assume that the Gather will be top_node. I am not sure about this > > but I think it is worth exploring. > > > > I took a look at the parallel insert patch and have the same idea. > https://commitfest.postgresql.org/31/2844/ > > * Consider generating Gather or Gather Merge paths. We must only do this > * if the relation is parallel safe, and we don't do it for child rels to > * avoid creating multiple Gather nodes within the same plan. We must do > * this after all paths have been generated and before set_cheapest, since > * one of the generated paths may turn out to be the cheapest one. > */ > if (rel->consider_parallel && !IS_OTHER_REL(rel)) > generate_useful_gather_paths(root, rel, false); > > IMO Gatherpath created here seems the right one which can possible ignore parallel cost if in CTAS. > But We need check the following parse option which will create path to be the parent of Gatherpath here. > > if (root->parse->rowMarks) > if (limit_needed(root->parse)) > if (root->parse->sortClause) > if (root->parse->distinctClause) > if (root->parse->hasWindowFuncs) > if (root->parse->groupClause || root->parse->groupingSets || root->parse->hasAggs || root->root->hasHavingQual) > Yeah, and as I pointed earlier, along with this we also need to consider that the RelOptInfo must be the final target(top level rel). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 9, 2020 at 10:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Dec 8, 2020 at 6:24 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > > > I'm not quite sure how to address this. Can we not allow the planner > > > > to consider that the select is for CTAS and check only after the > > > > planning is done for the Gather node and other checks? > > > > > > > > > > IIUC, you are saying that we should not influence the cost of gather node > > > even when the insertion would be done by workers? I think that should be > > > our fallback option anyway but that might miss some paths to be considered > > > parallel where the cost becomes more due to parallel_tuple_cost (aka tuple > > > transfer cost). I think the idea is we can avoid the tuple transfer cost > > > only when Gather is the top node because only at that time we can push > > > insertion down, right? How about if we have some way to detect the same > > > before calling generate_useful_gather_paths()? I think when we are calling > > > apply_scanjoin_target_to_paths() in grouping_planner(), if the > > > query_level is 1, it is for CTAS, and it doesn't have a chance to create > > > UPPER_REL (doesn't have grouping, order, limit, etc clause) then we can > > > probably assume that the Gather will be top_node. I am not sure about this > > > but I think it is worth exploring. > > > > > > > I took a look at the parallel insert patch and have the same idea. > > https://commitfest.postgresql.org/31/2844/ > > > > * Consider generating Gather or Gather Merge paths. We must only do this > > * if the relation is parallel safe, and we don't do it for child rels to > > * avoid creating multiple Gather nodes within the same plan. We must do > > * this after all paths have been generated and before set_cheapest, since > > * one of the generated paths may turn out to be the cheapest one. > > */ > > if (rel->consider_parallel && !IS_OTHER_REL(rel)) > > generate_useful_gather_paths(root, rel, false); > > > > IMO Gatherpath created here seems the right one which can possible ignore parallel cost if in CTAS. > > But We need check the following parse option which will create path to be the parent of Gatherpath here. > > > > if (root->parse->rowMarks) > > if (limit_needed(root->parse)) > > if (root->parse->sortClause) > > if (root->parse->distinctClause) > > if (root->parse->hasWindowFuncs) > > if (root->parse->groupClause || root->parse->groupingSets || root->parse->hasAggs || root->root->hasHavingQual) > > > > Yeah, and as I pointed earlier, along with this we also need to > consider that the RelOptInfo must be the final target(top level rel). > Attaching v10 patch set that includes the change suggested above for ignoring parallel tuple cost and also few more test cases. I split the patch as per Amit's suggestion. v10-0001 contains parallel inserts code without planner tuple cost changes and test cases. v10-0002 has required changes for ignoring planner tuple cost calculations. Please review it further. After the review and addressing all the comments, I plan to make some code common so that it can be used for Parallel Inserts in REFRESH MATERIALIZED VIEW. Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Dec 10, 2020 at 7:48 AM Zhihong Yu <zyu@yugabyte.com> wrote: > + if (!OidIsValid(col->collOid) && > + type_is_collatable(col->typeName->typeOid)) > + ereport(ERROR, > ... > + attrList = lappend(attrList, col); > > Should attrList be freed when ereport is called ? > I think that's not necessary since we are going to throw an error anyways. And also that this is not a new code added as part of this feature, it is an existing code adjusted for parallel inserts. On looking further in the code base there are many places where we don't free up the lists before throwing errors. errmsg("column privileges are only valid for relations"))); errmsg("check constraint \"%s\" already exists", errmsg("name or argument lists may not contain nulls"))); elog(ERROR, "no tlist entry for key %d", keyresno); > + query->CTASParallelInsInfo &= CTAS_PARALLEL_INS_UNDEF; > > Since CTAS_PARALLEL_INS_UNDEF is 0, isn't the above equivalent to assigning the value of 0 ? > Yeah both are equivalent. For now I will keep it that way, I will change it in the next version of the patch. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Hi

+ allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
+ plannedstmt->parallelModeNeeded &&
+ plannedstmt->planTree &&
+ IsA(plannedstmt->planTree, Gather) &&
+ plannedstmt->planTree->lefttree &&
+ plannedstmt->planTree->lefttree->parallel_aware &&
+ plannedstmt->planTree->lefttree->parallel_safe;

I noticed this checks both IsA(ps, GatherState) and IsA(plannedstmt->planTree, Gather). Does it mean it is possible that IsA(ps, GatherState) is true but IsA(plannedstmt->planTree, Gather) is false? I did some tests but did not find such a case.

Best regards,
houzj
On Thu, Dec 10, 2020 at 3:59 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > Hi > > + allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo && > + plannedstmt->parallelModeNeeded && > + plannedstmt->planTree && > + IsA(plannedstmt->planTree, Gather) && > + plannedstmt->planTree->lefttree && > + plannedstmt->planTree->lefttree->parallel_aware && > + plannedstmt->planTree->lefttree->parallel_safe; > > I noticed it check both IsA(ps, GatherState) and IsA(plannedstmt->planTree, Gather). > Does it mean it is possible that IsA(ps, GatherState) is true but IsA(plannedstmt->planTree, Gather) is false ? > > I did some test but did not find a case like that. > This seems like an extra check. Apart from that if we combine 0001 and 0002 there should be an additional protection so that it should not happen that in cost_gather we have ignored the parallel tuple cost and now we are rejecting the parallel insert. Probably we should add an assert. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 10, 2020 at 4:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > + allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
> > + plannedstmt->parallelModeNeeded &&
> > + plannedstmt->planTree &&
> > + IsA(plannedstmt->planTree, Gather) &&
> > + plannedstmt->planTree->lefttree &&
> > + plannedstmt->planTree->lefttree->parallel_aware &&
> > + plannedstmt->planTree->lefttree->parallel_safe;
> >
> > I noticed it check both IsA(ps, GatherState) and IsA(plannedstmt->planTree, Gather).
> > Does it mean it is possible that IsA(ps, GatherState) is true but IsA(plannedstmt->planTree, Gather) is false ?
> >
> > I did some test but did not find a case like that.
> >
>
> This seems like an extra check. Apart from that if we combine 0001
> and 0002 there should be an additional protection so that it should
> not happen that in cost_gather we have ignored the parallel tuple cost
> and now we are rejecting the parallel insert. Probably we should add
> an assert.
Yeah, it's an extra check. I don't think we need the extra IsA(plannedstmt->planTree, Gather) check; the GatherState check is enough. I verified it as follows: the gather state is allocated and initialized with the plan tree in ExecInitGather, and those are exactly what we are checking here, so there is no chance that the plan state is a GatherState while the plan tree is not a Gather. I will remove the IsA(plannedstmt->planTree, Gather) check in the next version of the patch set.

Breakpoint 4, ExecInitGather (node=0x5647f98ae994 <ExecCheckRTEPerms+131>, estate=0x1ca8, eflags=730035099) at nodeGather.c:61
(gdb) p gatherstate
$10 = (GatherState *) 0x5647fac83850
(gdb) p gatherstate->ps.plan
$11 = (Plan *) 0x5647fac918a0
Breakpoint 1, IsParallelInsertInCTASAllowed (into=0x5647fac97580, queryDesc=0x5647fac835e0) at createas.c:663
663 {
(gdb) p ps
$13 = (PlanState *) 0x5647fac83850
(gdb) p ps->plan
$14 = (Plan *) 0x5647fac918a0
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
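The gdb session above can also be stated in code: once the top-level planstate is known to be a GatherState, its plan pointer is necessarily the Gather the planner built, so the plan-tree IsA() test can never fail on its own. An illustration only, not patch code:

if (IsA(ps, GatherState))
{
    /* castNode() would error out if ps->plan were anything other than a
     * Gather; ExecInitGather() guarantees it never is, which is why the
     * separate IsA(plannedstmt->planTree, Gather) test adds nothing. */
    Gather *gather = castNode(Gather, ps->plan);

    Assert((Plan *) gather == plannedstmt->planTree);
}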
On Thu, Dec 10, 2020 at 5:00 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Dec 10, 2020 at 4:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > + allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo && > > > + plannedstmt->parallelModeNeeded && > > > + plannedstmt->planTree && > > > + IsA(plannedstmt->planTree, Gather) && > > > + plannedstmt->planTree->lefttree && > > > + plannedstmt->planTree->lefttree->parallel_aware && > > > + plannedstmt->planTree->lefttree->parallel_safe; > > > > > > I noticed it check both IsA(ps, GatherState) and IsA(plannedstmt->planTree, Gather). > > > Does it mean it is possible that IsA(ps, GatherState) is true but IsA(plannedstmt->planTree, Gather) is false ? > > > > > > I did some test but did not find a case like that. > > > > > > > This seems like an extra check. Apart from that if we combine 0001 > > and 0002 there should be an additional protection so that it should > > not happen that in cost_gather we have ignored the parallel tuple cost > > and now we are rejecting the parallel insert. Probably we should add > > an assert. > > Yeah it's an extra check. I don't think we need that extra check IsA(plannedstmt->planTree, Gather). GatherState checkis enough. I verified it as follows: the gatherstate will be allocated and initialized with the plan tree in ExecInitGatherwhich are the ones we are checking here. So, there is no chance that the plan state is GatherState and theplan tree will not be Gather. I will remove IsA(plannedstmt->planTree, Gather) check in the next version of the patchset. > > Breakpoint 4, ExecInitGather (node=0x5647f98ae994 <ExecCheckRTEPerms+131>, estate=0x1ca8, eflags=730035099) at nodeGather.c:61 > (gdb) p gatherstate > $10 = (GatherState *) 0x5647fac83850 > (gdb) p gatherstate->ps.plan > $11 = (Plan *) 0x5647fac918a0 > > Breakpoint 1, IsParallelInsertInCTASAllowed (into=0x5647fac97580, queryDesc=0x5647fac835e0) at createas.c:663 > 663 { > (gdb) p ps > $13 = (PlanState *) 0x5647fac83850 > (gdb) p ps->plan > $14 = (Plan *) 0x5647fac918a0 > Hope you did not miss the second part of my comment " > Apart from that if we combine 0001 > and 0002 there should be additional protection so that it should > not happen that in cost_gather we have ignored the parallel tuple cost > and now we are rejecting the parallel insert. Probably we should add > an assert. " -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 10, 2020 at 5:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > + allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo && > > > > + plannedstmt->parallelModeNeeded && > > > > + plannedstmt->planTree && > > > > + IsA(plannedstmt->planTree, Gather) && > > > > + plannedstmt->planTree->lefttree && > > > > + plannedstmt->planTree->lefttree->parallel_aware && > > > > + plannedstmt->planTree->lefttree->parallel_safe; > > > > > > > > I noticed it check both IsA(ps, GatherState) and IsA(plannedstmt->planTree, Gather). > > > > Does it mean it is possible that IsA(ps, GatherState) is true but IsA(plannedstmt->planTree, Gather) is false ? > > > > > > > > I did some test but did not find a case like that. > > > > > > > This seems like an extra check. Apart from that if we combine 0001 > > > and 0002 there should be an additional protection so that it should > > > not happen that in cost_gather we have ignored the parallel tuple cost > > > and now we are rejecting the parallel insert. Probably we should add > > > an assert. > > > > Yeah it's an extra check. I don't think we need that extra check IsA(plannedstmt->planTree, Gather). GatherState checkis enough. I verified it as follows: the gatherstate will be allocated and initialized with the plan tree in ExecInitGatherwhich are the ones we are checking here. So, there is no chance that the plan state is GatherState and theplan tree will not be Gather. I will remove IsA(plannedstmt->planTree, Gather) check in the next version of the patchset. > > > > Breakpoint 4, ExecInitGather (node=0x5647f98ae994 <ExecCheckRTEPerms+131>, estate=0x1ca8, eflags=730035099) at nodeGather.c:61 > > (gdb) p gatherstate > > $10 = (GatherState *) 0x5647fac83850 > > (gdb) p gatherstate->ps.plan > > $11 = (Plan *) 0x5647fac918a0 > > > > Breakpoint 1, IsParallelInsertInCTASAllowed (into=0x5647fac97580, queryDesc=0x5647fac835e0) at createas.c:663 > > 663 { > > (gdb) p ps > > $13 = (PlanState *) 0x5647fac83850 > > (gdb) p ps->plan > > $14 = (Plan *) 0x5647fac918a0 > > > Hope you did not miss the second part of my comment > " > > Apart from that if we combine 0001 > > and 0002 there should be additional protection so that it should > > not happen that in cost_gather we have ignored the parallel tuple cost > > and now we are rejecting the parallel insert. Probably we should add > > an assert. > " IIUC, we need to set a flag in cost_gather(in 0002 patch) whenever we ignore the parallel tuple cost and while checking to allow or disallow parallel inserts in IsParallelInsertInCTASAllowed(), we need to add an assert something like Assert(cost_ignored_in_cost_gather && allow) before return allow; This assertion fails 1) either if we have not ignored the cost but allowing parallel inserts 2) or we ignored the cost but not allowing parallel inserts. 1) seems to be fine, we can go ahead and perform parallel inserts. 2) is the concern that the planner would have wrongly chosen the parallel plan, but in this case also isn't it better to go ahead with the parallel plan instead of failing the query? + /* + * We allow parallel inserts by the workers only if the Gather node has + * no projections to perform and if the upper node is Gather. In case, + * the Gather node has projections, which is possible if there are any + * subplans in the query, the workers can not do those projections. And + * when the upper node is GatherMerge, then the leader has to perform + * the final phase i.e. merge the results by workers. 
+ */ + allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo && + plannedstmt->parallelModeNeeded && + plannedstmt->planTree && + plannedstmt->planTree->lefttree && + plannedstmt->planTree->lefttree->parallel_aware && + plannedstmt->planTree->lefttree->parallel_safe; + + return allow; + } With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 10, 2020 at 7:20 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > On Thu, Dec 10, 2020 at 5:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > + allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo && > > > > > + plannedstmt->parallelModeNeeded && > > > > > + plannedstmt->planTree && > > > > > + IsA(plannedstmt->planTree, Gather) && > > > > > + plannedstmt->planTree->lefttree && > > > > > + plannedstmt->planTree->lefttree->parallel_aware && > > > > > + plannedstmt->planTree->lefttree->parallel_safe; > > > > > > > > > > I noticed it check both IsA(ps, GatherState) and IsA(plannedstmt->planTree, Gather). > > > > > Does it mean it is possible that IsA(ps, GatherState) is true but IsA(plannedstmt->planTree, Gather) is false ? > > > > > > > > > > I did some test but did not find a case like that. > > > > > > > > > This seems like an extra check. Apart from that if we combine 0001 > > > > and 0002 there should be an additional protection so that it should > > > > not happen that in cost_gather we have ignored the parallel tuple cost > > > > and now we are rejecting the parallel insert. Probably we should add > > > > an assert. > > > > > > Yeah it's an extra check. I don't think we need that extra check IsA(plannedstmt->planTree, Gather). GatherState checkis enough. I verified it as follows: the gatherstate will be allocated and initialized with the plan tree in ExecInitGatherwhich are the ones we are checking here. So, there is no chance that the plan state is GatherState and theplan tree will not be Gather. I will remove IsA(plannedstmt->planTree, Gather) check in the next version of the patchset. > > > > > > Breakpoint 4, ExecInitGather (node=0x5647f98ae994 <ExecCheckRTEPerms+131>, estate=0x1ca8, eflags=730035099) at nodeGather.c:61 > > > (gdb) p gatherstate > > > $10 = (GatherState *) 0x5647fac83850 > > > (gdb) p gatherstate->ps.plan > > > $11 = (Plan *) 0x5647fac918a0 > > > > > > Breakpoint 1, IsParallelInsertInCTASAllowed (into=0x5647fac97580, queryDesc=0x5647fac835e0) at createas.c:663 > > > 663 { > > > (gdb) p ps > > > $13 = (PlanState *) 0x5647fac83850 > > > (gdb) p ps->plan > > > $14 = (Plan *) 0x5647fac918a0 > > > > > Hope you did not miss the second part of my comment > > " > > > Apart from that if we combine 0001 > > > and 0002 there should be additional protection so that it should > > > not happen that in cost_gather we have ignored the parallel tuple cost > > > and now we are rejecting the parallel insert. Probably we should add > > > an assert. > > " > > IIUC, we need to set a flag in cost_gather(in 0002 patch) whenever we > ignore the parallel tuple cost and while checking to allow or disallow > parallel inserts in IsParallelInsertInCTASAllowed(), we need to add an > assert something like Assert(cost_ignored_in_cost_gather && allow) > before return allow; > > This assertion fails 1) either if we have not ignored the cost but > allowing parallel inserts 2) or we ignored the cost but not allowing > parallel inserts. > > 1) seems to be fine, we can go ahead and perform parallel inserts. 2) > is the concern that the planner would have wrongly chosen the parallel > plan, but in this case also isn't it better to go ahead with the > parallel plan instead of failing the query? > > + /* > + * We allow parallel inserts by the workers only if the Gather node has > + * no projections to perform and if the upper node is Gather. 
In case, > + * the Gather node has projections, which is possible if there are any > + * subplans in the query, the workers can not do those projections. And > + * when the upper node is GatherMerge, then the leader has to perform > + * the final phase i.e. merge the results by workers. > + */ > + allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo && > + plannedstmt->parallelModeNeeded && > + plannedstmt->planTree && > + plannedstmt->planTree->lefttree && > + plannedstmt->planTree->lefttree->parallel_aware && > + plannedstmt->planTree->lefttree->parallel_safe; > + > + return allow; > + } I added the assertion into the 0002 patch so that it fails when the planner ignores parallel tuple cost and may choose parallel plan but later we don't allow parallel inserts. make check and make check-world passeses without any assertion failures. Attaching v11 patch set. Please review it further. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
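As a reader's aid, the protection discussed in the message above can be summarised by the following hedged sketch. The flag and helper names are assumptions made for illustration, not the actual v11 code, which wires the flag through cost_gather() in the 0002 patch.

#include "postgres.h"

/*
 * Hypothetical illustration of the invariant discussed above.  The planner
 * side would record that cost_gather() skipped the parallel tuple cost for a
 * CTAS query; the CTAS code then verifies that a plan produced under that
 * assumption is never rejected for parallel insert.
 */
static void
ctas_check_tuple_cost_invariant(bool tuple_cost_ignored, bool allow)
{
    /*
     * This fails only in case (2) above: the cost was ignored, so the
     * planner may have preferred the parallel plan, yet the parallel insert
     * is being rejected.  Case (1) -- cost not ignored but inserts allowed --
     * is fine and is deliberately not asserted against.
     */
    Assert(!tuple_cost_ignored || allow);
}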
Hi

> Attaching v11 patch set. Please review it further.

Currently with the patch, we can allow parallel CTAS when the top node is Gather.
When the top node is Append and Gather is a sub-node of Append, I think we can still enable
parallel CTAS by pushing the parallel CTAS down to the sub-node Gather, such as:

Append
------>Gather
--------->Create table
------------->Seqscan
------>Gather
--------->Create table
------------->Seqscan

And the use case seems common to me, such as:
select * from A where xxx union all select * from B where xxx;

I attach a WIP patch which just shows the possibility of this feature.
The patch is based on the latest v11 patch.

What do you think?

Best regards,
houzj
Attachment
On Mon, Dec 14, 2020 at 4:06 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Dec 10, 2020 at 7:20 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Dec 10, 2020 at 5:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > + allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo && > > > > > > + plannedstmt->parallelModeNeeded && > > > > > > + plannedstmt->planTree && > > > > > > + IsA(plannedstmt->planTree, Gather) && > > > > > > + plannedstmt->planTree->lefttree && > > > > > > + plannedstmt->planTree->lefttree->parallel_aware && > > > > > > + plannedstmt->planTree->lefttree->parallel_safe; > > > > > > > > > > > > I noticed it check both IsA(ps, GatherState) and IsA(plannedstmt->planTree, Gather). > > > > > > Does it mean it is possible that IsA(ps, GatherState) is true but IsA(plannedstmt->planTree, Gather) is false? > > > > > > > > > > > > I did some test but did not find a case like that. > > > > > > > > > > > This seems like an extra check. Apart from that if we combine 0001 > > > > > and 0002 there should be an additional protection so that it should > > > > > not happen that in cost_gather we have ignored the parallel tuple cost > > > > > and now we are rejecting the parallel insert. Probably we should add > > > > > an assert. > > > > > > > > Yeah it's an extra check. I don't think we need that extra check IsA(plannedstmt->planTree, Gather). GatherStatecheck is enough. I verified it as follows: the gatherstate will be allocated and initialized with the plan treein ExecInitGather which are the ones we are checking here. So, there is no chance that the plan state is GatherStateand the plan tree will not be Gather. I will remove IsA(plannedstmt->planTree, Gather) check in the next versionof the patch set. > > > > > > > > Breakpoint 4, ExecInitGather (node=0x5647f98ae994 <ExecCheckRTEPerms+131>, estate=0x1ca8, eflags=730035099) at nodeGather.c:61 > > > > (gdb) p gatherstate > > > > $10 = (GatherState *) 0x5647fac83850 > > > > (gdb) p gatherstate->ps.plan > > > > $11 = (Plan *) 0x5647fac918a0 > > > > > > > > Breakpoint 1, IsParallelInsertInCTASAllowed (into=0x5647fac97580, queryDesc=0x5647fac835e0) at createas.c:663 > > > > 663 { > > > > (gdb) p ps > > > > $13 = (PlanState *) 0x5647fac83850 > > > > (gdb) p ps->plan > > > > $14 = (Plan *) 0x5647fac918a0 > > > > > > > Hope you did not miss the second part of my comment > > > " > > > > Apart from that if we combine 0001 > > > > and 0002 there should be additional protection so that it should > > > > not happen that in cost_gather we have ignored the parallel tuple cost > > > > and now we are rejecting the parallel insert. Probably we should add > > > > an assert. > > > " > > > > IIUC, we need to set a flag in cost_gather(in 0002 patch) whenever we > > ignore the parallel tuple cost and while checking to allow or disallow > > parallel inserts in IsParallelInsertInCTASAllowed(), we need to add an > > assert something like Assert(cost_ignored_in_cost_gather && allow) > > before return allow; > > > > This assertion fails 1) either if we have not ignored the cost but > > allowing parallel inserts 2) or we ignored the cost but not allowing > > parallel inserts. > > > > 1) seems to be fine, we can go ahead and perform parallel inserts. 2) > > is the concern that the planner would have wrongly chosen the parallel > > plan, but in this case also isn't it better to go ahead with the > > parallel plan instead of failing the query? 
> > > > + /* > > + * We allow parallel inserts by the workers only if the Gather node has > > + * no projections to perform and if the upper node is Gather. In case, > > + * the Gather node has projections, which is possible if there are any > > + * subplans in the query, the workers can not do those projections. And > > + * when the upper node is GatherMerge, then the leader has to perform > > + * the final phase i.e. merge the results by workers. > > + */ > > + allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo && > > + plannedstmt->parallelModeNeeded && > > + plannedstmt->planTree && > > + plannedstmt->planTree->lefttree && > > + plannedstmt->planTree->lefttree->parallel_aware && > > + plannedstmt->planTree->lefttree->parallel_safe; > > + > > + return allow; > > + } > > I added the assertion into the 0002 patch so that it fails when the > planner ignores parallel tuple cost and may choose parallel plan but > later we don't allow parallel inserts. make check and make check-world > passeses without any assertion failures. > > Attaching v11 patch set. Please review it further. I can see a lot of unrelated changes in 0002, or you have done a lot of code refactoring especially in createas.c file. If it is intended refactoring then please move the refactoring to a separate patch so that the patch is readable. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 14, 2020 at 6:08 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > Currently with the patch, we can allow parallel CTAS when topnode is Gather. > When top-node is Append and Gather is the sub-node of Append, I think we can still enable > Parallel CTAS by pushing Parallel CTAS down to the sub-node Gather, such as: > > Append > ------>Gather > --------->Create table > ------------->Seqscan > ------>Gather > --------->create table > ------------->Seqscan > > And the use case seems common to me, such as: > select * from A where xxx union all select * from B where xxx; Thanks for the append use case. Here's my analysis on pushing parallel inserts down even in case the top node is Append. For union cases which need to remove duplicate tuples, we can't push the inserts or CTAS dest receiver down. If I'm not wrong, Append node is not doing duplicate removal(??), I saw that it's the HashAggregate node (which is the top node that removes the duplicate tuples). And also for except/except all/intersect/intersect all cases we receive HashSetOp nodes on top of Append. So for both cases, our check for Gather or Append at the top node is enough to detect this to not allow parallel inserts. For union all: case 1: We can push the CTAS dest receiver to each Gather node Append ->Gather ->Parallel Seq Scan ->Gather ->Parallel Seq Scan ->Gather ->Parallel Seq Scan case 2: We can still push the CTAS dest receiver to each Gather node. Non-Gather nodes will do inserts as they do now i.e. by sending tuples to Append and from there to CTAS dest receiver. Append ->Gather ->Parallel Seq Scan ->Seq Scan / Join / any other non-Gather node ->Gather ->Parallel Seq Scan ->Seq Scan / Join / any other non-Gather node case 3: We can push the CTAS dest receiver to Gather Gather ->Parallel Append ->Parallel Seq Scan ->Parallel Seq Scan case 4: We can push the CTAS dest receiver to Gather Gather ->Parallel Append ->Parallel Seq Scan ->Parallel Seq Scan ->Seq Scan / Join / any other non-Gather node Please let me know if I'm missing any other possible use case. Thoughts? > I attach a WIP patch which just show the possibility of this feature. > The patch is based on the latest v11-patch. > > What do you think? As suggested by Amit earlier, I kept the 0001 patch(so far) such that it doesn't have the code to influence the planner to consider parallel tuple cost as 0. It works on the plan whatever gets generated and decides to allow parallel inserts or not. And in the 0002 patch, I added the code for influencing the planner to consider parallel tuple cost as 0. Maybe we can have a 0003 patch for tests alone. Once we are okay with the above analysis and use cases, we can incorporate the Append changes to respective patches. Hope that's okay. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 15, 2020 at 2:06 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Mon, Dec 14, 2020 at 6:08 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > Currently with the patch, we can allow parallel CTAS when topnode is Gather. > > When top-node is Append and Gather is the sub-node of Append, I think we can still enable > > Parallel CTAS by pushing Parallel CTAS down to the sub-node Gather, such as: > > > > Append > > ------>Gather > > --------->Create table > > ------------->Seqscan > > ------>Gather > > --------->create table > > ------------->Seqscan > > > > And the use case seems common to me, such as: > > select * from A where xxx union all select * from B where xxx; > > Thanks for the append use case. > > Here's my analysis on pushing parallel inserts down even in case the > top node is Append. > > For union cases which need to remove duplicate tuples, we can't push > the inserts or CTAS dest receiver down. If I'm not wrong, Append node > is not doing duplicate removal(??), I saw that it's the HashAggregate > node (which is the top node that removes the duplicate tuples). And > also for except/except all/intersect/intersect all cases we receive > HashSetOp nodes on top of Append. So for both cases, our check for > Gather or Append at the top node is enough to detect this to not allow > parallel inserts. > > For union all: > case 1: We can push the CTAS dest receiver to each Gather node > Append > ->Gather > ->Parallel Seq Scan > ->Gather > ->Parallel Seq Scan > ->Gather > ->Parallel Seq Scan > > case 2: We can still push the CTAS dest receiver to each Gather node. > Non-Gather nodes will do inserts as they do now i.e. by sending tuples > to Append and from there to CTAS dest receiver. > Append > ->Gather > ->Parallel Seq Scan > ->Seq Scan / Join / any other non-Gather node > ->Gather > ->Parallel Seq Scan > ->Seq Scan / Join / any other non-Gather node > > case 3: We can push the CTAS dest receiver to Gather > Gather > ->Parallel Append > ->Parallel Seq Scan > ->Parallel Seq Scan > > case 4: We can push the CTAS dest receiver to Gather > Gather > ->Parallel Append > ->Parallel Seq Scan > ->Parallel Seq Scan > ->Seq Scan / Join / any other non-Gather node > > Please let me know if I'm missing any other possible use case. > > Thoughts? Your analysis looks right to me. > > I attach a WIP patch which just show the possibility of this feature. > > The patch is based on the latest v11-patch. > > > > What do you think? > > As suggested by Amit earlier, I kept the 0001 patch(so far) such that > it doesn't have the code to influence the planner to consider parallel > tuple cost as 0. It works on the plan whatever gets generated and > decides to allow parallel inserts or not. And in the 0002 patch, I > added the code for influencing the planner to consider parallel tuple > cost as 0. Maybe we can have a 0003 patch for tests alone. Yeah, that makes sense and it will be easy for the review. > Once we are okay with the above analysis and use cases, we can > incorporate the Append changes to respective patches. > > Hope that's okay. Make sense to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
> Thanks for the append use case. > > Here's my analysis on pushing parallel inserts down even in case the top > node is Append. > > For union cases which need to remove duplicate tuples, we can't push the > inserts or CTAS dest receiver down. If I'm not wrong, Append node is not > doing duplicate removal(??), I saw that it's the HashAggregate node (which > is the top node that removes the duplicate tuples). And also for > except/except all/intersect/intersect all cases we receive HashSetOp nodes > on top of Append. So for both cases, our check for Gather or Append at the > top node is enough to detect this to not allow parallel inserts. > > For union all: > case 1: We can push the CTAS dest receiver to each Gather node Append > ->Gather > ->Parallel Seq Scan > ->Gather > ->Parallel Seq Scan > ->Gather > ->Parallel Seq Scan > > case 2: We can still push the CTAS dest receiver to each Gather node. > Non-Gather nodes will do inserts as they do now i.e. by sending tuples to > Append and from there to CTAS dest receiver. > Append > ->Gather > ->Parallel Seq Scan > ->Seq Scan / Join / any other non-Gather node > ->Gather > ->Parallel Seq Scan > ->Seq Scan / Join / any other non-Gather node > > case 3: We can push the CTAS dest receiver to Gather Gather > ->Parallel Append > ->Parallel Seq Scan > ->Parallel Seq Scan > > case 4: We can push the CTAS dest receiver to Gather Gather > ->Parallel Append > ->Parallel Seq Scan > ->Parallel Seq Scan > ->Seq Scan / Join / any other non-Gather node > > Please let me know if I'm missing any other possible use case. > > Thoughts? Yes, The analysis looks right to me. > As suggested by Amit earlier, I kept the 0001 patch(so far) such that it > doesn't have the code to influence the planner to consider parallel tuple > cost as 0. It works on the plan whatever gets generated and decides to allow > parallel inserts or not. And in the 0002 patch, I added the code for > influencing the planner to consider parallel tuple cost as 0. Maybe we can > have a 0003 patch for tests alone. > > Once we are okay with the above analysis and use cases, we can incorporate > the Append changes to respective patches. > > Hope that's okay. A little explanation about how to push down the ctas info in append. 1. about how to ignore tuple cost in this case. IMO, it create gather path under append like the following: query_planner -make_one_rel --set_base_rel_sizes ---set_rel_size ----set_append_rel_size (*) -----set_rel_size ------set_subquery_pathlist -------subquery_planner --------grouping_planner ---------apply_scanjoin_target_to_paths ----------generate_useful_gather_paths set_append_rel_size seems the right place where we can check and set a flag to ignore tuple cost later. We can set the flag for two cases when there is no parent path will be created(such as : limit,sort,distinct...): i) query_level is 1 ii) query_level > 1 and we have set the flag in the parent_root. The case ii) is to check append under append: Append ->Append ->Gather ->Other plan 2.about how to push ctas info down. We traversing the whole plans tree, and we only care Append and Gather type. Gather: It set the ctas dest info and returned true at once if the gathernode does not have projection. Append: It will recursively traversing the subplan of Appendnode and will reture true if one of the subplan can be parallel. 
+PushDownCTASParallelInsertState(DestReceiver *dest, PlanState *ps)
+{
+	bool	parallel = false;
+
+	if(ps == NULL)
+		return parallel;
+
+	if(IsA(ps, AppendState))
+	{
+		AppendState *aps = (AppendState *) ps;
+		for(int i = 0; i < aps->as_nplans; i++)
+		{
+			parallel |= PushDownCTASParallelInsertState(dest, aps->appendplans[i]);
+		}
+	}
+	else if(IsA(ps, GatherState) && !ps->ps_ProjInfo)
+	{
+		GatherState *gstate = (GatherState *) ps;
+		parallel = true;
+
+		((DR_intorel *) dest)->is_parallel = true;
+		gstate->dest = dest;
+		ps->plan->plan_rows = 0;
+	}
+
+	return parallel;
+}

Best regards,
houzj
Attachment
> From: Hou, Zhijie [mailto:houzj.fnst@cn.fujitsu.com] > Sent: Tuesday, December 15, 2020 7:30 PM > To: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> > Cc: Amit Kapila <amit.kapila16@gmail.com>; Luc Vlaming <luc@swarm64.com>; > PostgreSQL-development <pgsql-hackers@postgresql.org>; Zhihong Yu > <zyu@yugabyte.com>; Dilip Kumar <dilipbalaut@gmail.com> > Subject: RE: Parallel Inserts in CREATE TABLE AS > > > Thanks for the append use case. > > > > Here's my analysis on pushing parallel inserts down even in case the > > top node is Append. > > > > For union cases which need to remove duplicate tuples, we can't push > > the inserts or CTAS dest receiver down. If I'm not wrong, Append node > > is not doing duplicate removal(??), I saw that it's the HashAggregate > > node (which is the top node that removes the duplicate tuples). And > > also for except/except all/intersect/intersect all cases we receive > > HashSetOp nodes on top of Append. So for both cases, our check for > > Gather or Append at the top node is enough to detect this to not allow > parallel inserts. > > > > For union all: > > case 1: We can push the CTAS dest receiver to each Gather node Append > > ->Gather > > ->Parallel Seq Scan > > ->Gather > > ->Parallel Seq Scan > > ->Gather > > ->Parallel Seq Scan > > > > case 2: We can still push the CTAS dest receiver to each Gather node. > > Non-Gather nodes will do inserts as they do now i.e. by sending tuples > > to Append and from there to CTAS dest receiver. > > Append > > ->Gather > > ->Parallel Seq Scan > > ->Seq Scan / Join / any other non-Gather node > > ->Gather > > ->Parallel Seq Scan > > ->Seq Scan / Join / any other non-Gather node > > > > case 3: We can push the CTAS dest receiver to Gather Gather > > ->Parallel Append > > ->Parallel Seq Scan > > ->Parallel Seq Scan > > > > case 4: We can push the CTAS dest receiver to Gather Gather > > ->Parallel Append > > ->Parallel Seq Scan > > ->Parallel Seq Scan > > ->Seq Scan / Join / any other non-Gather node > > > > Please let me know if I'm missing any other possible use case. > > > > Thoughts? > > > Yes, The analysis looks right to me. > > > > As suggested by Amit earlier, I kept the 0001 patch(so far) such that > > it doesn't have the code to influence the planner to consider parallel > > tuple cost as 0. It works on the plan whatever gets generated and > > decides to allow parallel inserts or not. And in the 0002 patch, I > > added the code for influencing the planner to consider parallel tuple > > cost as 0. Maybe we can have a 0003 patch for tests alone. > > > > Once we are okay with the above analysis and use cases, we can > > incorporate the Append changes to respective patches. > > > > Hope that's okay. > > A little explanation about how to push down the ctas info in append. > > 1. about how to ignore tuple cost in this case. > IMO, it create gather path under append like the following: > query_planner > -make_one_rel > --set_base_rel_sizes > ---set_rel_size > ----set_append_rel_size (*) > -----set_rel_size > ------set_subquery_pathlist > -------subquery_planner > --------grouping_planner > ---------apply_scanjoin_target_to_paths > ----------generate_useful_gather_paths > > set_append_rel_size seems the right place where we can check and set a flag > to ignore tuple cost later. > We can set the flag for two cases when there is no parent path will be > created(such as : limit,sort,distinct...): > i) query_level is 1 > ii) query_level > 1 and we have set the flag in the parent_root. 
> > The case ii) is to check append under append: > Append > ->Append > ->Gather > ->Other plan > > 2.about how to push ctas info down. > > We traversing the whole plans tree, and we only care Append and Gather type. > Gather: It set the ctas dest info and returned true at once if the gathernode > does not have projection. > Append: It will recursively traversing the subplan of Appendnode and will > reture true if one of the subplan can be parallel. > > +PushDownCTASParallelInsertState(DestReceiver *dest, PlanState *ps) { > + bool parallel = false; > + > + if(ps == NULL) > + return parallel; > + > + if(IsA(ps, AppendState)) > + { > + AppendState *aps = (AppendState *) ps; > + for(int i = 0; i < aps->as_nplans; i++) > + { > + parallel |= > PushDownCTASParallelInsertState(dest, aps->appendplans[i]); > + } > + } > + else if(IsA(ps, GatherState) && !ps->ps_ProjInfo) > + { > + GatherState *gstate = (GatherState *) ps; > + parallel = true; > + > + ((DR_intorel *) dest)->is_parallel = true; > + gstate->dest = dest; > + ps->plan->plan_rows = 0; > + } > + > + return parallel; > +} So sorry for my miss, my last patch has some mistakes. Attatch the new one. Best regards, houzj
Attachment
On Tue, Dec 15, 2020 at 5:48 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > A little explanation about how to push down the ctas info in append. > > > > 1. about how to ignore tuple cost in this case. > > IMO, it create gather path under append like the following: > > query_planner > > -make_one_rel > > --set_base_rel_sizes > > ---set_rel_size > > ----set_append_rel_size (*) > > -----set_rel_size > > ------set_subquery_pathlist > > -------subquery_planner > > --------grouping_planner > > ---------apply_scanjoin_target_to_paths > > ----------generate_useful_gather_paths > > > > set_append_rel_size seems the right place where we can check and set a flag > > to ignore tuple cost later. > > We can set the flag for two cases when there is no parent path will be > > created(such as : limit,sort,distinct...): > > i) query_level is 1 > > ii) query_level > 1 and we have set the flag in the parent_root. > > > > The case ii) is to check append under append: > > Append > > ->Append > > ->Gather > > ->Other plan > > > > 2.about how to push ctas info down. > > > > We traversing the whole plans tree, and we only care Append and Gather type. > > Gather: It set the ctas dest info and returned true at once if the gathernode > > does not have projection. > > Append: It will recursively traversing the subplan of Appendnode and will > > reture true if one of the subplan can be parallel. > > > > +PushDownCTASParallelInsertState(DestReceiver *dest, PlanState *ps) { > > + bool parallel = false; > > + > > + if(ps == NULL) > > + return parallel; > > + > > + if(IsA(ps, AppendState)) > > + { > > + AppendState *aps = (AppendState *) ps; > > + for(int i = 0; i < aps->as_nplans; i++) > > + { > > + parallel |= > > PushDownCTASParallelInsertState(dest, aps->appendplans[i]); > > + } > > + } > > + else if(IsA(ps, GatherState) && !ps->ps_ProjInfo) > > + { > > + GatherState *gstate = (GatherState *) ps; > > + parallel = true; > > + > > + ((DR_intorel *) dest)->is_parallel = true; > > + gstate->dest = dest; > > + ps->plan->plan_rows = 0; > > + } > > + > > + return parallel; > > +} > > So sorry for my miss, my last patch has some mistakes. > Attatch the new one. Thanks for the append patches. Basically your changes look good to me. I'm merging them to the original patch set and adding the test cases to cover these cases. I will post the updated patch set soon. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 15, 2020 at 5:53 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > I'm merging them to the original patch set and adding the test cases > to cover these cases. I will post the updated patch set soon. Attaching v12 patch set. 0001 - parallel inserts without tuple cost enforcement. 0002 - enforce planner for parallel tuple cost 0003 - test cases Please review it further. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Dec 16, 2020 at 12:06 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Tue, Dec 15, 2020 at 5:53 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > I'm merging them to the original patch set and adding the test cases > > to cover these cases. I will post the updated patch set soon. > > Attaching v12 patch set. > > 0001 - parallel inserts without tuple cost enforcement. > 0002 - enforce planner for parallel tuple cost > 0003 - test cases > > Please review it further. > I think it will be clean to implement the parallel CTAS when a top-level node is the gather node. Basically, the idea is that whenever we get the gather on the top which doesn't have any projection then we can push down the dest receiver directly to the worker. I agree that append is an exception that doesn't do any extra processing other than appending the results, So IMHO it would be better that in the first part we parallelize the plan where gather node on top. I see that we have already worked on the patch where the append node is on top so I would suggest that we can keep that part in a separate patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi

The cfbot seems to complain about the testcase:

Command exited with code 1
perl dumpregr.pl
=== $path ===
diff -w -U3 C:/projects/postgresql/src/test/regress/expected/write_parallel.out C:/projects/postgresql/src/test/regress/results/write_parallel.out
--- C:/projects/postgresql/src/test/regress/expected/write_parallel.out	2020-12-21 01:41:17.745091500 +0000
+++ C:/projects/postgresql/src/test/regress/results/write_parallel.out	2020-12-21 01:47:20.375514800 +0000
@@ -1204,7 +1204,7 @@
          ->  Gather (actual rows=2 loops=1)
                Workers Planned: 3
                Workers Launched: 3
-               ->  Parallel Seq Scan on temp2 (actual rows=0 loops=4)
+               ->  Parallel Seq Scan on temp2 (actual rows=1 loops=4)
                      Filter: (col2 < 3)
                      Rows Removed by Filter: 1
 (14 rows)
@@ -1233,7 +1233,7 @@
          ->  Gather (actual rows=2 loops=1)
                Workers Planned: 3
                Workers Launched: 3
-               ->  Parallel Seq Scan on temp2 (actual rows=0 loops=4)
+               ->  Parallel Seq Scan on temp2 (actual rows=1 loops=4)
                      Filter: (col2 < 3)
                      Rows Removed by Filter: 1
 (14 rows)

Best regards,
houzj
On Fri, Dec 18, 2020 at 10:08 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think it will be clean to implement the parallel CTAS when a > top-level node is the gather node. Basically, the idea is that > whenever we get the gather on the top which doesn't have any > projection then we can push down the dest receiver directly to the > worker. I agree that append is an exception that doesn't do any extra > processing other than appending the results, So IMHO it would be > better that in the first part we parallelize the plan where gather > node on top. I see that we have already worked on the patch where the > append node is on top so I would suggest that we can keep that part in > a separate patch. Thanks! I rearranged the patches to keep the append part separate in the 0004 patch. Attaching v13 patch set: 0001 - parallel inserts in ctas without planner enforcement for tuple cost calculation 0002 - planner enforcement for tuple cost calculation 0003 - tests 0004 - enabling parallel inserts for Append cases, related planner enforcement code and tests. Please consider these patches for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Dec 21, 2020 at 8:16 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > The cfbost seems complains about the testcase: > > Command exited with code 1 > perl dumpregr.pl > === $path ===\ndiff -w -U3 C:/projects/postgresql/src/test/regress/expected/write_parallel.out C:/projects/postgresql/src/test/regress/results/write_parallel.out > --- C:/projects/postgresql/src/test/regress/expected/write_parallel.out 2020-12-21 01:41:17.745091500 +0000 > +++ C:/projects/postgresql/src/test/regress/results/write_parallel.out 2020-12-21 01:47:20.375514800 +0000 > @@ -1204,7 +1204,7 @@ > -> Gather (actual rows=2 loops=1) > Workers Planned: 3 > Workers Launched: 3 > - -> Parallel Seq Scan on temp2 (actual rows=0 loops=4) > + -> Parallel Seq Scan on temp2 (actual rows=1 loops=4) > Filter: (col2 < 3) > Rows Removed by Filter: 1 > (14 rows) > @@ -1233,7 +1233,7 @@ > -> Gather (actual rows=2 loops=1) > Workers Planned: 3 > Workers Launched: 3 > - -> Parallel Seq Scan on temp2 (actual rows=0 loops=4) > + -> Parallel Seq Scan on temp2 (actual rows=1 loops=4) > Filter: (col2 < 3) > Rows Removed by Filter: 1 > (14 rows) Thanks! Looks like the explain analyze test case outputs can be unstable because we may not get the requested number of workers always. The comment before explain_parallel_append function in partition_prune.sql explains it well. Solution is to have a function similar to explain_parallel_append, say explain_parallel_inserts in write_parallel.sql and use that for all explain analyze cases. This will make the results consistent. Thoughts? If okay, I will update the test cases and post new patches. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 22, 2020 at 12:32 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > On Mon, Dec 21, 2020 at 8:16 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > The cfbost seems complains about the testcase: > > > > Command exited with code 1 > > perl dumpregr.pl > > === $path ===\ndiff -w -U3 C:/projects/postgresql/src/test/regress/expected/write_parallel.out C:/projects/postgresql/src/test/regress/results/write_parallel.out > > --- C:/projects/postgresql/src/test/regress/expected/write_parallel.out 2020-12-21 01:41:17.745091500 +0000 > > +++ C:/projects/postgresql/src/test/regress/results/write_parallel.out 2020-12-21 01:47:20.375514800 +0000 > > @@ -1204,7 +1204,7 @@ > > -> Gather (actual rows=2 loops=1) > > Workers Planned: 3 > > Workers Launched: 3 > > - -> Parallel Seq Scan on temp2 (actual rows=0 loops=4) > > + -> Parallel Seq Scan on temp2 (actual rows=1 loops=4) > > Filter: (col2 < 3) > > Rows Removed by Filter: 1 > > (14 rows) > > @@ -1233,7 +1233,7 @@ > > -> Gather (actual rows=2 loops=1) > > Workers Planned: 3 > > Workers Launched: 3 > > - -> Parallel Seq Scan on temp2 (actual rows=0 loops=4) > > + -> Parallel Seq Scan on temp2 (actual rows=1 loops=4) > > Filter: (col2 < 3) > > Rows Removed by Filter: 1 > > (14 rows) > > Thanks! Looks like the explain analyze test case outputs can be > unstable because we may not get the requested number of workers > always. The comment before explain_parallel_append function in > partition_prune.sql explains it well. > > Solution is to have a function similar to explain_parallel_append, say > explain_parallel_inserts in write_parallel.sql and use that for all > explain analyze cases. This will make the results consistent. > Thoughts? If okay, I will update the test cases and post new patches. Attaching v14 patch set that has above changes. Please consider this for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Dec 22, 2020 at 2:16 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Tue, Dec 22, 2020 at 12:32 PM Bharath Rupireddy > Attaching v14 patch set that has above changes. Please consider this > for further review. > Few comments: In the below case, should create be above Gather? postgres=# explain create table t7 as select * from t6; QUERY PLAN ------------------------------------------------------------------- Gather (cost=0.00..9.17 rows=0 width=4) Workers Planned: 2 -> Create t7 -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) (4 rows) Can we change it to something like: ------------------------------------------------------------------- Create t7 -> Gather (cost=0.00..9.17 rows=0 width=4) Workers Planned: 2 -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) (4 rows) You could change intoclause_len = strlen(intoclausestr) to strlen(intoclausestr) + 1 and use intoclause_len in the remaining places. We can avoid the +1 in the other places. + /* Estimate space for into clause for CTAS. */ + if (IS_CTAS(intoclause) && OidIsValid(objectid)) + { + intoclausestr = nodeToString(intoclause); + intoclause_len = strlen(intoclausestr); + shm_toc_estimate_chunk(&pcxt->estimator, intoclause_len + 1); + shm_toc_estimate_keys(&pcxt->estimator, 1); + } Can we use node->nworkers_launched == 0 in place of node->need_to_scan_locally, that way the setting and resetting of node->need_to_scan_locally can be removed. Unless need_to_scan_locally is needed in any of the functions that gets called. + /* Enable leader to insert in case no parallel workers were launched. */ + if (node->nworkers_launched == 0) + node->need_to_scan_locally = true; + + /* + * By now, for parallel workers (if launched any), would have started their + * work i.e. insertion to target table. In case the leader is chosen to + * participate for parallel inserts in CTAS, then finish its share before + * going to wait for the parallel workers to finish. + */ + if (node->need_to_scan_locally) + { Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 24, 2020 at 10:25 AM vignesh C <vignesh21@gmail.com> wrote: > > On Tue, Dec 22, 2020 at 2:16 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Tue, Dec 22, 2020 at 12:32 PM Bharath Rupireddy > > Attaching v14 patch set that has above changes. Please consider this > > for further review. > > > > Few comments: > In the below case, should create be above Gather? > postgres=# explain create table t7 as select * from t6; > QUERY PLAN > ------------------------------------------------------------------- > Gather (cost=0.00..9.17 rows=0 width=4) > Workers Planned: 2 > -> Create t7 > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) > (4 rows) > > Can we change it to something like: > ------------------------------------------------------------------- > Create t7 > -> Gather (cost=0.00..9.17 rows=0 width=4) > Workers Planned: 2 > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) > (4 rows) > I think it is better to have it in a way as in the current patch because that reflects that we are performing insert/create below Gather which is the purpose of this patch. I think this is similar to what the Parallel Insert patch [1] has for a similar plan. [1] - https://commitfest.postgresql.org/31/2844/ -- With Regards, Amit Kapila.
On Thu, Dec 24, 2020 at 10:25 AM vignesh C <vignesh21@gmail.com> wrote: > You could change intoclause_len = strlen(intoclausestr) to > strlen(intoclausestr) + 1 and use intoclause_len in the remaining > places. We can avoid the +1 in the other places. > + /* Estimate space for into clause for CTAS. */ > + if (IS_CTAS(intoclause) && OidIsValid(objectid)) > + { > + intoclausestr = nodeToString(intoclause); > + intoclause_len = strlen(intoclausestr); > + shm_toc_estimate_chunk(&pcxt->estimator, intoclause_len + 1); > + shm_toc_estimate_keys(&pcxt->estimator, 1); > + } Done. > Can we use node->nworkers_launched == 0 in place of > node->need_to_scan_locally, that way the setting and resetting of > node->need_to_scan_locally can be removed. Unless need_to_scan_locally > is needed in any of the functions that gets called. > + /* Enable leader to insert in case no parallel workers were launched. */ > + if (node->nworkers_launched == 0) > + node->need_to_scan_locally = true; > + > + /* > + * By now, for parallel workers (if launched any), would have > started their > + * work i.e. insertion to target table. In case the leader is chosen to > + * participate for parallel inserts in CTAS, then finish its > share before > + * going to wait for the parallel workers to finish. > + */ > + if (node->need_to_scan_locally) > + { need_to_scan_locally is being set in ExecGather() even if nworkers_launched > 0 it can still be true, so I think we can not remove need_to_scan_locally in ExecParallelInsertInCTAS. Attaching v15 patch set for further review. Note that the change is only in 0001 patch, other patches remain unchanged from v14. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
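For readers following along, the point about need_to_scan_locally in the message above can be condensed into the following sketch. The helper is hypothetical and only restates the quoted fragment; it is not code from the patch.

#include "postgres.h"
#include "nodes/execnodes.h"

/*
 * Hypothetical helper, for illustration only: decide whether the leader
 * should insert tuples itself.  need_to_scan_locally may already have been
 * set to true by ExecGather() even when nworkers_launched > 0, so it cannot
 * simply be replaced by a nworkers_launched == 0 test; the zero-worker case
 * merely forces it to true.
 */
static bool
ctas_leader_should_insert(GatherState *node)
{
    /* Enable the leader to insert when no parallel workers were launched. */
    if (node->nworkers_launched == 0)
        node->need_to_scan_locally = true;

    return node->need_to_scan_locally;
}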
On Thu, Dec 24, 2020 at 11:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Dec 24, 2020 at 10:25 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Dec 22, 2020 at 2:16 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > On Tue, Dec 22, 2020 at 12:32 PM Bharath Rupireddy
> > > Attaching v14 patch set that has above changes. Please consider this
> > > for further review.
> > >
> >
> > Few comments:
> > In the below case, should create be above Gather?
> > postgres=# explain create table t7 as select * from t6;
> > QUERY PLAN
> > -------------------------------------------------------------------
> > Gather (cost=0.00..9.17 rows=0 width=4)
> > Workers Planned: 2
> > -> Create t7
> > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4)
> > (4 rows)
> >
> > Can we change it to something like:
> > -------------------------------------------------------------------
> > Create t7
> > -> Gather (cost=0.00..9.17 rows=0 width=4)
> > Workers Planned: 2
> > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4)
> > (4 rows)
> >
>
> I think it is better to have it in a way as in the current patch
> because that reflects that we are performing insert/create below
> Gather which is the purpose of this patch. I think this is similar to
> what the Parallel Insert patch [1] has for a similar plan.
>
>
> [1] - https://commitfest.postgresql.org/31/2844/
>
Another thing I felt is that the Gather node will actually do the insert operation, while the table creation itself is done earlier. Should we change "Create table" to "Insert table", something like below:
QUERY PLAN
-------------------------------------------------------------------
Gather (cost=0.00..9.17 rows=0 width=4)
Workers Planned: 2
-> Insert table2 (instead of Create table2)
-> Parallel Seq Scan on table1 (cost=0.00..9.17 rows=417 width=4)
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Dec 25, 2020 at 7:12 AM vignesh C <vignesh21@gmail.com> wrote: > On Thu, Dec 24, 2020 at 11:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Dec 24, 2020 at 10:25 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > On Tue, Dec 22, 2020 at 2:16 PM Bharath Rupireddy > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > On Tue, Dec 22, 2020 at 12:32 PM Bharath Rupireddy > > > > Attaching v14 patch set that has above changes. Please consider this > > > > for further review. > > > > > > > > > > Few comments: > > > In the below case, should create be above Gather? > > > postgres=# explain create table t7 as select * from t6; > > > QUERY PLAN > > > ------------------------------------------------------------------- > > > Gather (cost=0.00..9.17 rows=0 width=4) > > > Workers Planned: 2 > > > -> Create t7 > > > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) > > > (4 rows) > > > > > > Can we change it to something like: > > > ------------------------------------------------------------------- > > > Create t7 > > > -> Gather (cost=0.00..9.17 rows=0 width=4) > > > Workers Planned: 2 > > > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) > > > (4 rows) > > > > > > > I think it is better to have it in a way as in the current patch > > because that reflects that we are performing insert/create below > > Gather which is the purpose of this patch. I think this is similar to > > what the Parallel Insert patch [1] has for a similar plan. > > > > > > [1] - https://commitfest.postgresql.org/31/2844/ > > > > Also another thing that I felt was that actually the Gather nodes will actually do the insert operation, the Create tablewill be done earlier itself. Should we change Create table to Insert table something like below: > QUERY PLAN > ------------------------------------------------------------------- > Gather (cost=0.00..9.17 rows=0 width=4) > Workers Planned: 2 > -> Insert table2 (instead of Create table2) > -> Parallel Seq Scan on table1 (cost=0.00..9.17 rows=417 width=4) IMO, showing Insert under Gather makes sense if the query is INSERT INTO SELECT as it's in the other patch [1]. Since here it is a CTAS query, so having Create under Gather looks fine to me. This way we can also distinguish the EXPLAINs of parallel inserts in INSERT INTO SELECT and CTAS. And also, some might wonder that Create under Gather means that each parallel worker is creating the table, it's actually not the creation of the table that's parallelized but it's insertion. If required, we can clarify it in CTAS docs with a sample EXPLAIN. I have not yet added docs related to allowing parallel inserts in CTAS. Shall I add a para saying when parallel inserts can be picked and how the sample EXPLAIN looks? Thoughts? [1] - https://commitfest.postgresql.org/31/2844/ With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Fri, Dec 25, 2020 at 9:54 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Fri, Dec 25, 2020 at 7:12 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, Dec 24, 2020 at 11:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Dec 24, 2020 at 10:25 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > On Tue, Dec 22, 2020 at 2:16 PM Bharath Rupireddy > > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > > > On Tue, Dec 22, 2020 at 12:32 PM Bharath Rupireddy > > > > > Attaching v14 patch set that has above changes. Please consider this > > > > > for further review. > > > > > > > > > > > > > Few comments: > > > > In the below case, should create be above Gather? > > > > postgres=# explain create table t7 as select * from t6; > > > > QUERY PLAN > > > > ------------------------------------------------------------------- > > > > Gather (cost=0.00..9.17 rows=0 width=4) > > > > Workers Planned: 2 > > > > -> Create t7 > > > > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) > > > > (4 rows) > > > > > > > > Can we change it to something like: > > > > ------------------------------------------------------------------- > > > > Create t7 > > > > -> Gather (cost=0.00..9.17 rows=0 width=4) > > > > Workers Planned: 2 > > > > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) > > > > (4 rows) > > > > > > > > > > I think it is better to have it in a way as in the current patch > > > because that reflects that we are performing insert/create below > > > Gather which is the purpose of this patch. I think this is similar to > > > what the Parallel Insert patch [1] has for a similar plan. > > > > > > > > > [1] - https://commitfest.postgresql.org/31/2844/ > > > > > > > Also another thing that I felt was that actually the Gather nodes will actually do the insert operation, the Create tablewill be done earlier itself. Should we change Create table to Insert table something like below: > > QUERY PLAN > > ------------------------------------------------------------------- > > Gather (cost=0.00..9.17 rows=0 width=4) > > Workers Planned: 2 > > -> Insert table2 (instead of Create table2) > > -> Parallel Seq Scan on table1 (cost=0.00..9.17 rows=417 width=4) > > IMO, showing Insert under Gather makes sense if the query is INSERT > INTO SELECT as it's in the other patch [1]. Since here it is a CTAS > query, so having Create under Gather looks fine to me. This way we can > also distinguish the EXPLAINs of parallel inserts in INSERT INTO > SELECT and CTAS. I don't think that is a problem because now also if we EXPLAIN CTAS it will appear like we are executing the select query because that is what we are planning for only the select part. So now if we are including the INSERT in the planning and pushing the insert under the gather then it will make more sense to show INSERT instead of showing CREATE. Let's see what others think. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Dec 25, 2020 at 9:54 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Fri, Dec 25, 2020 at 7:12 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, Dec 24, 2020 at 11:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Dec 24, 2020 at 10:25 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > On Tue, Dec 22, 2020 at 2:16 PM Bharath Rupireddy > > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > > > On Tue, Dec 22, 2020 at 12:32 PM Bharath Rupireddy > > > > > Attaching v14 patch set that has above changes. Please consider this > > > > > for further review. > > > > > > > > > > > > > Few comments: > > > > In the below case, should create be above Gather? > > > > postgres=# explain create table t7 as select * from t6; > > > > QUERY PLAN > > > > ------------------------------------------------------------------- > > > > Gather (cost=0.00..9.17 rows=0 width=4) > > > > Workers Planned: 2 > > > > -> Create t7 > > > > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) > > > > (4 rows) > > > > > > > > Can we change it to something like: > > > > ------------------------------------------------------------------- > > > > Create t7 > > > > -> Gather (cost=0.00..9.17 rows=0 width=4) > > > > Workers Planned: 2 > > > > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) > > > > (4 rows) > > > > > > > > > > I think it is better to have it in a way as in the current patch > > > because that reflects that we are performing insert/create below > > > Gather which is the purpose of this patch. I think this is similar to > > > what the Parallel Insert patch [1] has for a similar plan. > > > > > > > > > [1] - https://commitfest.postgresql.org/31/2844/ > > > > > > > Also another thing that I felt was that actually the Gather nodes will actually do the insert operation, the Create tablewill be done earlier itself. Should we change Create table to Insert table something like below: > > QUERY PLAN > > ------------------------------------------------------------------- > > Gather (cost=0.00..9.17 rows=0 width=4) > > Workers Planned: 2 > > -> Insert table2 (instead of Create table2) > > -> Parallel Seq Scan on table1 (cost=0.00..9.17 rows=417 width=4) > > IMO, showing Insert under Gather makes sense if the query is INSERT > INTO SELECT as it's in the other patch [1]. Since here it is a CTAS > query, so having Create under Gather looks fine to me. This way we can > also distinguish the EXPLAINs of parallel inserts in INSERT INTO > SELECT and CTAS. > Right, IIRC, we have done the way it is in the patch for convenience and to move forward with it and come back to it later once all other parts of the patch are good. > And also, some might wonder that Create under Gather means that each > parallel worker is creating the table, it's actually not the creation > of the table that's parallelized but it's insertion. If required, we > can clarify it in CTAS docs with a sample EXPLAIN. I have not yet > added docs related to allowing parallel inserts in CTAS. Shall I add a > para saying when parallel inserts can be picked and how the sample > EXPLAIN looks? Thoughts? > Yeah, I don't see any problem with it, and maybe we can move Explain related code to a separate patch. The reason is we don't display DDL part without parallelism and this might need a separate discussion. -- With Regards, Amit Kapila.
On Fri, Dec 25, 2020 at 10:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Dec 25, 2020 at 9:54 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Fri, Dec 25, 2020 at 7:12 AM vignesh C <vignesh21@gmail.com> wrote: > > > On Thu, Dec 24, 2020 at 11:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Thu, Dec 24, 2020 at 10:25 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > On Tue, Dec 22, 2020 at 2:16 PM Bharath Rupireddy > > > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > > > > > On Tue, Dec 22, 2020 at 12:32 PM Bharath Rupireddy > > > > > > Attaching v14 patch set that has above changes. Please consider this > > > > > > for further review. > > > > > > > > > > > > > > > > Few comments: > > > > > In the below case, should create be above Gather? > > > > > postgres=# explain create table t7 as select * from t6; > > > > > QUERY PLAN > > > > > ------------------------------------------------------------------- > > > > > Gather (cost=0.00..9.17 rows=0 width=4) > > > > > Workers Planned: 2 > > > > > -> Create t7 > > > > > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) > > > > > (4 rows) > > > > > > > > > > Can we change it to something like: > > > > > ------------------------------------------------------------------- > > > > > Create t7 > > > > > -> Gather (cost=0.00..9.17 rows=0 width=4) > > > > > Workers Planned: 2 > > > > > -> Parallel Seq Scan on t6 (cost=0.00..9.17 rows=417 width=4) > > > > > (4 rows) > > > > > > > > > > > > > I think it is better to have it in a way as in the current patch > > > > because that reflects that we are performing insert/create below > > > > Gather which is the purpose of this patch. I think this is similar to > > > > what the Parallel Insert patch [1] has for a similar plan. > > > > > > > > > > > > [1] - https://commitfest.postgresql.org/31/2844/ > > > > > > > > > > Also another thing that I felt was that actually the Gather nodes will actually do the insert operation, the Createtable will be done earlier itself. Should we change Create table to Insert table something like below: > > > QUERY PLAN > > > ------------------------------------------------------------------- > > > Gather (cost=0.00..9.17 rows=0 width=4) > > > Workers Planned: 2 > > > -> Insert table2 (instead of Create table2) > > > -> Parallel Seq Scan on table1 (cost=0.00..9.17 rows=417 width=4) > > > > IMO, showing Insert under Gather makes sense if the query is INSERT > > INTO SELECT as it's in the other patch [1]. Since here it is a CTAS > > query, so having Create under Gather looks fine to me. This way we can > > also distinguish the EXPLAINs of parallel inserts in INSERT INTO > > SELECT and CTAS. > > > > Right, IIRC, we have done the way it is in the patch for convenience > and to move forward with it and come back to it later once all other > parts of the patch are good. > > > And also, some might wonder that Create under Gather means that each > > parallel worker is creating the table, it's actually not the creation > > of the table that's parallelized but it's insertion. If required, we > > can clarify it in CTAS docs with a sample EXPLAIN. I have not yet > > added docs related to allowing parallel inserts in CTAS. Shall I add a > > para saying when parallel inserts can be picked and how the sample > > EXPLAIN looks? Thoughts? > > > > Yeah, I don't see any problem with it, and maybe we can move Explain > related code to a separate patch. 
The reason is we don't display DDL > part without parallelism and this might need a separate discussion. > This makes sense to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 24, 2020 at 1:07 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Dec 24, 2020 at 10:25 AM vignesh C <vignesh21@gmail.com> wrote: > > You could change intoclause_len = strlen(intoclausestr) to > > strlen(intoclausestr) + 1 and use intoclause_len in the remaining > > places. We can avoid the +1 in the other places. > > + /* Estimate space for into clause for CTAS. */ > > + if (IS_CTAS(intoclause) && OidIsValid(objectid)) > > + { > > + intoclausestr = nodeToString(intoclause); > > + intoclause_len = strlen(intoclausestr); > > + shm_toc_estimate_chunk(&pcxt->estimator, intoclause_len + 1); > > + shm_toc_estimate_keys(&pcxt->estimator, 1); > > + } > > Done. > > > Can we use node->nworkers_launched == 0 in place of > > node->need_to_scan_locally, that way the setting and resetting of > > node->need_to_scan_locally can be removed. Unless need_to_scan_locally > > is needed in any of the functions that gets called. > > + /* Enable leader to insert in case no parallel workers were launched. */ > > + if (node->nworkers_launched == 0) > > + node->need_to_scan_locally = true; > > + > > + /* > > + * By now, for parallel workers (if launched any), would have > > started their > > + * work i.e. insertion to target table. In case the leader is chosen to > > + * participate for parallel inserts in CTAS, then finish its > > share before > > + * going to wait for the parallel workers to finish. > > + */ > > + if (node->need_to_scan_locally) > > + { > > need_to_scan_locally is being set in ExecGather() even if > nworkers_launched > 0 it can still be true, so I think we can not > remove need_to_scan_locally in ExecParallelInsertInCTAS. > > Attaching v15 patch set for further review. Note that the change is > only in 0001 patch, other patches remain unchanged from v14. I have reviewed part of v15-0001 patch, I have a few comments, I will continue to review this. 1. @@ -763,18 +763,34 @@ GetCurrentCommandId(bool used) /* this is global to a transaction, not subtransaction-local */ if (used) { - /* - * Forbid setting currentCommandIdUsed in a parallel worker, because - * we have no provision for communicating this back to the leader. We - * could relax this restriction when currentCommandIdUsed was already - * true at the start of the parallel operation. - */ - Assert(!IsParallelWorker()); + /* + * This is a temporary hack for all common parallel insert cases i.e. + * insert into, ctas, copy from. To be changed later. In a parallel + * worker, set currentCommandIdUsed to true only if it was not set to + * true at the start of the parallel operation (by way of + * SetCurrentCommandIdUsedForWorker()). We have to do this because + * GetCurrentCommandId(true) may be called from anywhere, especially + * for parallel inserts, within parallel worker. + */ + Assert(!(IsParallelWorker() && !currentCommandIdUsed)); Why is this temporary hack? and what is the plan for removing this hack? 2. +/* + * ChooseParallelInsertsInCTAS --- determine whether or not parallel + * insertion is possible, if yes set the parallel insert state i.e. push down + * the dest receiver to the Gather nodes. + */ +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) +{ + if (!IS_CTAS(into)) + return; When will this hit? The functtion name suggest that it is from CTAS but now you have a check that if it is not for CTAS then return, can you add the comment that when do you expect this case? 
Also the function name should start in a new line i.e void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) 3. +/* + * ChooseParallelInsertsInCTAS --- determine whether or not parallel + * insertion is possible, if yes set the parallel insert state i.e. push down + * the dest receiver to the Gather nodes. + */ Push down to the Gather nodes? I think the right statement will be push down below the Gather node. 4. intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) { DR_intorel *myState = (DR_intorel *) self; + if (myState->is_parallel_worker) + { + /* In the worker */ + SetCurrentCommandIdUsedForWorker(); + myState->output_cid = GetCurrentCommandId(false); + } + else { non-parallel worker code } } I think instead of moving all the code related to non-parallel worker in the else we can do better. This will avoid unnecessary code movement. 4. intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) { DR_intorel *myState = (DR_intorel *) self; -- Comment ->in parallel worker we don't need to crease dest recv blah blah + if (myState->is_parallel_worker) { --parallel worker handling-- return; } --non-parallel worker code stay right there, instead of moving to else 5. +/* + * ChooseParallelInsertsInCTAS --- determine whether or not parallel + * insertion is possible, if yes set the parallel insert state i.e. push down + * the dest receiver to the Gather nodes. + */ +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) +{ From function name and comments it appeared that this function will return boolean saying whether Parallel insert should be selected or not. I think name/comment should be better for this 6. /* + * For parallelizing inserts in CTAS i.e. making each parallel worker + * insert the tuples, we must send information such as into clause (for + * each worker to build separate dest receiver), object id (for each + * worker to open the created table). Comment is saying we need to pass object id but the code under this comment is not doing so. 7. + /* + * Since there are no rows that are transferred from workers to Gather + * node, so we set it to 0 to be visible in estimated row count of + * explain plans. + */ + queryDesc->planstate->plan->plan_rows = 0; This seems a bit hackies Why it is done after the planning, I mean plan must know that it is returning a 0 rows? 8. + char *intoclause_space = shm_toc_allocate(pcxt->toc, + intoclause_len); + memcpy(intoclause_space, intoclausestr, intoclause_len); + shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space); One blank line between variable declaration and next code segment, take care at other places as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 24, 2020 at 1:07 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Dec 24, 2020 at 10:25 AM vignesh C <vignesh21@gmail.com> wrote: > > You could change intoclause_len = strlen(intoclausestr) to > > strlen(intoclausestr) + 1 and use intoclause_len in the remaining > > places. We can avoid the +1 in the other places. > > + /* Estimate space for into clause for CTAS. */ > > + if (IS_CTAS(intoclause) && OidIsValid(objectid)) > > + { > > + intoclausestr = nodeToString(intoclause); > > + intoclause_len = strlen(intoclausestr); > > + shm_toc_estimate_chunk(&pcxt->estimator, intoclause_len + 1); > > + shm_toc_estimate_keys(&pcxt->estimator, 1); > > + } > > Done. > > > Can we use node->nworkers_launched == 0 in place of > > node->need_to_scan_locally, that way the setting and resetting of > > node->need_to_scan_locally can be removed. Unless need_to_scan_locally > > is needed in any of the functions that gets called. > > + /* Enable leader to insert in case no parallel workers were launched. */ > > + if (node->nworkers_launched == 0) > > + node->need_to_scan_locally = true; > > + > > + /* > > + * By now, for parallel workers (if launched any), would have > > started their > > + * work i.e. insertion to target table. In case the leader is chosen to > > + * participate for parallel inserts in CTAS, then finish its > > share before > > + * going to wait for the parallel workers to finish. > > + */ > > + if (node->need_to_scan_locally) > > + { > > need_to_scan_locally is being set in ExecGather() even if > nworkers_launched > 0 it can still be true, so I think we can not > remove need_to_scan_locally in ExecParallelInsertInCTAS. > > Attaching v15 patch set for further review. Note that the change is > only in 0001 patch, other patches remain unchanged from v14. > +-- parallel inserts must occur +select explain_pictas( +'create table parallel_write as select length(stringu1) from tenk1;'); +select count(*) from parallel_write; +drop table parallel_write; We can change comment "parallel inserts must occur" like "parallel insert must be selected for CTAS on normal table" +-- parallel inserts must occur +select explain_pictas( +'create unlogged table parallel_write as select length(stringu1) from tenk1;'); +select count(*) from parallel_write; +drop table parallel_write; We can change comment "parallel inserts must occur" like "parallel insert must be selected for CTAS on unlogged table" Similar comment need to be handled in other places also. 
+create function explain_pictas(text) returns setof text +language plpgsql as +$$ +declare + ln text; +begin + for ln in + execute format('explain (analyze, costs off, summary off, timing off) %s', + $1) + loop + ln := regexp_replace(ln, 'Workers Launched: \d+', 'Workers Launched: N'); + ln := regexp_replace(ln, 'actual rows=\d+ loops=\d+', 'actual rows=N loops=N'); + ln := regexp_replace(ln, 'Rows Removed by Filter: \d+', 'Rows Removed by Filter: N'); + return next ln; + end loop; +end; +$$; The above function is the same as the function present in partition_prune.sql: create function explain_parallel_append(text) returns setof text language plpgsql as $$ declare ln text; begin for ln in execute format('explain (analyze, costs off, summary off, timing off) %s', $1) loop ln := regexp_replace(ln, 'Workers Launched: \d+', 'Workers Launched: N'); ln := regexp_replace(ln, 'actual rows=\d+ loops=\d+', 'actual rows=N loops=N'); ln := regexp_replace(ln, 'Rows Removed by Filter: \d+', 'Rows Removed by Filter: N'); return next ln; end loop; end; $$; If possible, try to make a common function for both and use it. + if (intoclausestr && OidIsValid(objectid)) + fpes->objectid = objectid; + else + fpes->objectid = InvalidOid; Here the OidIsValid(objectid) check is not required; intoclausestr will be set only if objectid is valid. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
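A minimal sketch of the simplification suggested above, reusing the patch's own variables (fpes, intoclausestr, objectid); this is an illustration rather than the patch text:

/* intoclausestr is only set when a valid object id is passed, so the
 * OidIsValid() test adds nothing. */
fpes->objectid = intoclausestr ? objectid : InvalidOid;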
On Sat, Dec 26, 2020 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have reviewed part of v15-0001 patch, I have a few comments, I will > continue to review this. Thanks a lot. > 1. > Why is this temporary hack? and what is the plan for removing this hack? The changes in xact.c, xact.h and heapam.c are common to all the parallel insert patches - COPY, INSERT INTO SELECT. That was the initial comment, I forgot to keep it in sync with the other patches. Now, I used the comment from INSERT INTO SELECT patch. IIRC, the plan was to have these code in all the parallel inserts patch, whichever gets to review and commit first, others will update their patches accordingly. > 2. > +/* > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel > + * insertion is possible, if yes set the parallel insert state i.e. push down > + * the dest receiver to the Gather nodes. > + */ > +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) > +{ > + if (!IS_CTAS(into)) > + return; > > When will this hit? The functtion name suggest that it is from CTAS > but now you have a check that if it is > not for CTAS then return, can you add the comment that when do you > expect this case? Yes it will hit for explain cases, but I choose to remove this and check outside in the explain something like: if (into) ChooseParallelInsertsInCTAS() > Also the function name should start in a new line > i.e > void > ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) Ah, missed that. Modified now. > 3. > +/* > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel > + * insertion is possible, if yes set the parallel insert state i.e. push down > + * the dest receiver to the Gather nodes. > + */ > > Push down to the Gather nodes? I think the right statement will be > push down below the Gather node. Modified. > 4. > intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) > { > DR_intorel *myState = (DR_intorel *) self; > > -- Comment ->in parallel worker we don't need to crease dest recv blah blah > + if (myState->is_parallel_worker) > { > --parallel worker handling-- > return; > } > > --non-parallel worker code stay right there, instead of moving to else Done. > 5. > +/* > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel > + * insertion is possible, if yes set the parallel insert state i.e. push down > + * the dest receiver to the Gather nodes. > + */ > +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) > +{ > > From function name and comments it appeared that this function will > return boolean saying whether > Parallel insert should be selected or not. I think name/comment > should be better for this Yeah that function can still return void because no point in returning bool there, since the intention is to see if parallel inserts can be performed, if yes, set the state otherwise exit. I changed the function name to TryParallelizingInsertsInCTAS(). Let me know your suggestions if that doesn't work out. > 6. > /* > + * For parallelizing inserts in CTAS i.e. making each parallel worker > + * insert the tuples, we must send information such as into clause (for > + * each worker to build separate dest receiver), object id (for each > + * worker to open the created table). > > Comment is saying we need to pass object id but the code under this > comment is not doing so. Improved the comment. > 7. 
> + /* > + * Since there are no rows that are transferred from workers to Gather > + * node, so we set it to 0 to be visible in estimated row count of > + * explain plans. > + */ > + queryDesc->planstate->plan->plan_rows = 0; > > This seems a bit hackies Why it is done after the planning, I mean > plan must know that it is returning a 0 rows? This exists to show up the estimated row count(in case of EXPLAIN CTAS without ANALYZE) in the output. For EXPLAIN ANALYZE CTAS actual tuples are shown correctly as 0 because Gather doesn't receive any tuples. if (es->costs) { if (es->format == EXPLAIN_FORMAT_TEXT) { appendStringInfo(es->str, " (cost=%.2f..%.2f rows=%.0f width=%d)", plan->startup_cost, plan->total_cost, plan->plan_rows, plan->plan_width); Since it's an estimated row count(which may not be always correct), we will let the EXPLAIN plan show that and I think we can remove that part. Thoughts? I removed it in v6 patch set. > 8. > + char *intoclause_space = shm_toc_allocate(pcxt->toc, > + intoclause_len); > + memcpy(intoclause_space, intoclausestr, intoclause_len); > + shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space); > > One blank line between variable declaration and next code segment, > take care at other places as well. Done. I'm attaching the v16 patch set. Please note that I added the documentation saying that parallel insertions can happen and a sample output of the explain to 0003 patch as discussed in [1]. But I didn't move the explain output related code to a separate patch because it's a small snippet in explain.c. I hope that's okay. [1] - https://www.postgresql.org/message-id/CAA4eK1JqwXGYoGa1%2B3-f0T50dBGufvKaKQOee_AfFhygZ6QKtA%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sat, Dec 26, 2020 at 9:20 PM vignesh C <vignesh21@gmail.com> wrote: > +-- parallel inserts must occur > +select explain_pictas( > +'create table parallel_write as select length(stringu1) from tenk1;'); > +select count(*) from parallel_write; > +drop table parallel_write; > > We can change comment "parallel inserts must occur" like "parallel > insert must be selected for CTAS on normal table" > > +-- parallel inserts must occur > +select explain_pictas( > +'create unlogged table parallel_write as select length(stringu1) from tenk1;'); > +select count(*) from parallel_write; > +drop table parallel_write; > > We can change comment "parallel inserts must occur" like "parallel > insert must be selected for CTAS on unlogged table" > Similar comment need to be handled in other places also. I think the existing comments look fine. The info such as the table type and whether the query is CTAS or CMV is visible by looking at the test case itself. What I wanted the comments to convey is whether we support parallel inserts or not, and if not, why, so that it is easy to read. I tried to keep them as succinct as possible. > If possible, try to make a common function for both and use it. Yes, you are right. The function explain_pictas is the same as explain_parallel_append from partition_prune.sql. It's a test function, and I also see that we have serial_schedule and parallel_schedule, which means that these sql files can run in any order. I'm not quite sure whether we can have it in a common test sql file and use it across other test sql files. AFAICS, I didn't find any function being used in such a manner. Thoughts? > + if (intoclausestr && OidIsValid(objectid)) > + fpes->objectid = objectid; > + else > + fpes->objectid = InvalidOid; > Here the OidIsValid(objectid) check is not required; intoclausestr will be > set only if objectid is valid. Removed the OidIsValid check in the latest v16 patch set posted upthread. Please have a look. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Sun, Dec 27, 2020 at 2:20 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Sat, Dec 26, 2020 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have reviewed part of v15-0001 patch, I have a few comments, I will > > continue to review this. > > Thanks a lot. > > > 1. > > Why is this temporary hack? and what is the plan for removing this hack? > > The changes in xact.c, xact.h and heapam.c are common to all the > parallel insert patches - COPY, INSERT INTO SELECT. That was the > initial comment, I forgot to keep it in sync with the other patches. > Now, I used the comment from INSERT INTO SELECT patch. IIRC, the plan > was to have these code in all the parallel inserts patch, whichever > gets to review and commit first, others will update their patches > accordingly. > > > 2. > > +/* > > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel > > + * insertion is possible, if yes set the parallel insert state i.e. push down > > + * the dest receiver to the Gather nodes. > > + */ > > +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) > > +{ > > + if (!IS_CTAS(into)) > > + return; > > > > When will this hit? The functtion name suggest that it is from CTAS > > but now you have a check that if it is > > not for CTAS then return, can you add the comment that when do you > > expect this case? > > Yes it will hit for explain cases, but I choose to remove this and > check outside in the explain something like: > if (into) > ChooseParallelInsertsInCTAS() > > > Also the function name should start in a new line > > i.e > > void > > ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) > > Ah, missed that. Modified now. > > > 3. > > +/* > > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel > > + * insertion is possible, if yes set the parallel insert state i.e. push down > > + * the dest receiver to the Gather nodes. > > + */ > > > > Push down to the Gather nodes? I think the right statement will be > > push down below the Gather node. > > Modified. > > > 4. > > intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) > > { > > DR_intorel *myState = (DR_intorel *) self; > > > > -- Comment ->in parallel worker we don't need to crease dest recv blah blah > > + if (myState->is_parallel_worker) > > { > > --parallel worker handling-- > > return; > > } > > > > --non-parallel worker code stay right there, instead of moving to else > > Done. > > > 5. > > +/* > > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel > > + * insertion is possible, if yes set the parallel insert state i.e. push down > > + * the dest receiver to the Gather nodes. > > + */ > > +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) > > +{ > > > > From function name and comments it appeared that this function will > > return boolean saying whether > > Parallel insert should be selected or not. I think name/comment > > should be better for this > > Yeah that function can still return void because no point in returning > bool there, since the intention is to see if parallel inserts can be > performed, if yes, set the state otherwise exit. I changed the > function name to TryParallelizingInsertsInCTAS(). Let me know your > suggestions if that doesn't work out. > > > 6. > > /* > > + * For parallelizing inserts in CTAS i.e. 
making each parallel worker > > + * insert the tuples, we must send information such as into clause (for > > + * each worker to build separate dest receiver), object id (for each > > + * worker to open the created table). > > > > Comment is saying we need to pass object id but the code under this > > comment is not doing so. > > Improved the comment. > > > 7. > > + /* > > + * Since there are no rows that are transferred from workers to Gather > > + * node, so we set it to 0 to be visible in estimated row count of > > + * explain plans. > > + */ > > + queryDesc->planstate->plan->plan_rows = 0; > > > > This seems a bit hackies Why it is done after the planning, I mean > > plan must know that it is returning a 0 rows? > > This exists to show up the estimated row count(in case of EXPLAIN CTAS > without ANALYZE) in the output. For EXPLAIN ANALYZE CTAS actual tuples > are shown correctly as 0 because Gather doesn't receive any tuples. > if (es->costs) > { > if (es->format == EXPLAIN_FORMAT_TEXT) > { > appendStringInfo(es->str, " (cost=%.2f..%.2f rows=%.0f width=%d)", > plan->startup_cost, plan->total_cost, > plan->plan_rows, plan->plan_width); > > Since it's an estimated row count(which may not be always correct), we > will let the EXPLAIN plan show that and I think we can remove that > part. Thoughts? > > I removed it in v6 patch set. > > > 8. > > + char *intoclause_space = shm_toc_allocate(pcxt->toc, > > + intoclause_len); > > + memcpy(intoclause_space, intoclausestr, intoclause_len); > > + shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space); > > > > One blank line between variable declaration and next code segment, > > take care at other places as well. > > Done. > > I'm attaching the v16 patch set. Please note that I added the > documentation saying that parallel insertions can happen and a sample > output of the explain to 0003 patch as discussed in [1]. But I didn't > move the explain output related code to a separate patch because it's > a small snippet in explain.c. I hope that's okay. > > [1] - https://www.postgresql.org/message-id/CAA4eK1JqwXGYoGa1%2B3-f0T50dBGufvKaKQOee_AfFhygZ6QKtA%40mail.gmail.com > Thanks for working on this, I will have a look at the updated patches soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, Dec 27, 2020 at 2:28 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Sat, Dec 26, 2020 at 9:20 PM vignesh C <vignesh21@gmail.com> wrote:
> > +-- parallel inserts must occur
> > +select explain_pictas(
> > +'create table parallel_write as select length(stringu1) from tenk1;');
> > +select count(*) from parallel_write;
> > +drop table parallel_write;
> >
> > We can change comment "parallel inserts must occur" like "parallel
> > insert must be selected for CTAS on normal table"
> >
> > +-- parallel inserts must occur
> > +select explain_pictas(
> > +'create unlogged table parallel_write as select length(stringu1) from tenk1;');
> > +select count(*) from parallel_write;
> > +drop table parallel_write;
> >
> > We can change comment "parallel inserts must occur" like "parallel
> > insert must be selected for CTAS on unlogged table"
> > Similar comment need to be handled in other places also.
>
> I think the existing comments look fine. The info such as the table type
> and whether the query is CTAS or CMV is visible by looking at the test case
> itself. What I wanted the comments to convey is whether we support parallel
> inserts or not, and if not, why, so that it is easy to read. I tried to keep
> them as succinct as possible.
>
I saw a few inconsistencies in the patch:
+-- parallel inserts must occur
+select explain_pictas(
+'create table parallel_write as select length(stringu1) from tenk1;');
+ explain_pictas
+-- parallel inserts must not occur as the table is temporary
+select explain_pictas(
+'create temporary table parallel_write as select length(stringu1) from tenk1;');
+ explain_pictas
+-- parallel inserts must occur, as there is init plan that gets executed by
+-- each parallel worker
+select explain_pictas(
+'create table parallel_write as select two col1,
+ (select two from (select * from tenk2) as tt limit 1) col2
+ from tenk1 where tenk1.four = 3;');
+ explain_pictas
+-- must occur
+set enable_nestloop to off;
+set enable_mergejoin to on;
+set enable_mergejoin to off;
+set enable_hashjoin to on;
+select explain_pictas(
On Mon, Dec 28, 2020 at 10:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Thanks for working on this, I will have a look at the updated patches soon. Attaching v17 patch set after addressing comments raised in other threads. Please consider this patch set for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Dec 28, 2020 at 1:16 AM Zhihong Yu <zyu@yugabyte.com> wrote: > For v16-0002-Tuple-Cost-Adjustment-for-Parallel-Inserts-in-CTAS.patch: > > + if (ignore && > + (root->parse->CTASParallelInsInfo & > + CTAS_PARALLEL_INS_TUP_COST_CAN_IGN)) > > I wonder why CTAS_PARALLEL_INS_TUP_COST_CAN_IGN is checked again in the above if since when ignore_parallel_tuple_costreturns true, CTAS_PARALLEL_INS_TUP_COST_CAN_IGN is set already. Sometimes, we may set the flag CTAS_PARALLEL_INS_TUP_COST_CAN_IGN before generate_useful_gather_paths, but the generate_useful_gather_paths can return without reaching cost_gather where we reset. The generate_useful_gather_paths can return without reaching cost_gather, in following case if (rel->partial_pathlist == NIL) return; So, for such cases, I'm resetting it here. > + * In this function we only care Append and Gather nodes. > > 'care' -> 'care about' Done. > + for (int i = 0; i < aps->as_nplans; i++) > + { > + parallel |= PushDownCTASParallelInsertState(dest, > + aps->appendplans[i], > + gather_exists); > > It seems the loop termination condition can include parallel since we can come out of the loop once parallel is true. No, we can not come out of the for loop if parallel is true, because our intention there is to look for all the child/sub plans under Append, and push the inserts to the Gather nodes wherever possible. > + if (!allow && tuple_cost_flags && gather_exists) > > As the above code shows, gather_exists is only checked when allow is false. Yes, if at least one gather node exists under the Append for which the planner would have ignored the tuple cost, and now if we don't allow parallel inserts, we should assert that the parallelism is not picked because of wrong parallel tuple cost enforcement. > + * We set the flag for two cases when there is no parent path will > + * be created(such as : limit,sort,distinct...): > > Please correct the grammar : there are two verbs following 'when' Done. > For set_append_rel_size: > > + { > + root->parse->CTASParallelInsInfo |= > + CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND; > + } > + } > + > + if (root->parse->CTASParallelInsInfo & > + CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND) > + { > + root->parse->CTASParallelInsInfo &= > + ~CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND; > > In the if block for childrel->rtekind == RTE_SUBQUERY, CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND maybe set. Why is it clearedimmediately after ? Thanks for pointing that out. It's a miss, intention is to reset it after set_rel_size(). Corrected in the v17 patch. > + /* Set to this in case tuple cost needs to be ignored for Append cases. */ > + CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND = 1 << 3 > > Since each CTAS_PARALLEL_INS_ flag is a bit, maybe it's better to use 'turn on' or similar term in the comment. Because'set to' normally means assignment. Done. All the above comments are addressed in the v17 patch set posted upthread. Please have a look. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
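A condensed sketch of the flow described above; ignore_parallel_tuple_cost() and the CTAS_PARALLEL_INS_* bits come from the proposed patch, not from core:

bool ignore = ignore_parallel_tuple_cost(root);

generate_useful_gather_paths(root, rel, false);

/* cost_gather() clears the bit when it actually ignores the tuple cost; if
 * generate_useful_gather_paths() returned early (for example when
 * rel->partial_pathlist == NIL), clear it here so it cannot leak into the
 * costing of later, unrelated paths. */
if (ignore &&
    (root->parse->CTASParallelInsInfo & CTAS_PARALLEL_INS_TUP_COST_CAN_IGN))
    root->parse->CTASParallelInsInfo &= ~CTAS_PARALLEL_INS_TUP_COST_CAN_IGN;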
On Mon, Dec 28, 2020 at 11:24 AM vignesh C <vignesh21@gmail.com> wrote: > Test comments are detailed in a few cases and in few others it is not detailed for similar kinds of parallelism selected tests. I felt we could make the test comments consistent across the file. Modified the test case description in the v17 patch set posted upthread. Please have a look. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 30, 2020 at 5:22 AM Zhihong Yu <zyu@yugabyte.com> wrote: > w.r.t. v17-0004-Enable-CTAS-Parallel-Inserts-For-Append.patch > > + * Push the dest receiver to Gather node when it is either at the top of the > + * plan or under top Append node unless it does not have any projections to do. > > I think the 'unless' should be 'if'. As can be seen from the body of the method: > > + if (!ps->ps_ProjInfo) > + { > + GatherState *gstate = (GatherState *) ps; > + > + parallel = true; Thanks. Modified it in the 0004 patch. Attaching v18 patch set. Note that no change in 0001 to 0003 patches from v17. Please consider v18 patch set for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Dec 28, 2020 at 10:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Dec 27, 2020 at 2:20 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Sat, Dec 26, 2020 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > I have reviewed part of v15-0001 patch, I have a few comments, I will > > > continue to review this. > > > > Thanks a lot. > > > > > 1. > > > Why is this temporary hack? and what is the plan for removing this hack? > > > > The changes in xact.c, xact.h and heapam.c are common to all the > > parallel insert patches - COPY, INSERT INTO SELECT. That was the > > initial comment, I forgot to keep it in sync with the other patches. > > Now, I used the comment from INSERT INTO SELECT patch. IIRC, the plan > > was to have these code in all the parallel inserts patch, whichever > > gets to review and commit first, others will update their patches > > accordingly. > > > > > 2. > > > +/* > > > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel > > > + * insertion is possible, if yes set the parallel insert state i.e. push down > > > + * the dest receiver to the Gather nodes. > > > + */ > > > +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) > > > +{ > > > + if (!IS_CTAS(into)) > > > + return; > > > > > > When will this hit? The functtion name suggest that it is from CTAS > > > but now you have a check that if it is > > > not for CTAS then return, can you add the comment that when do you > > > expect this case? > > > > Yes it will hit for explain cases, but I choose to remove this and > > check outside in the explain something like: > > if (into) > > ChooseParallelInsertsInCTAS() > > > > > Also the function name should start in a new line > > > i.e > > > void > > > ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) > > > > Ah, missed that. Modified now. > > > > > 3. > > > +/* > > > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel > > > + * insertion is possible, if yes set the parallel insert state i.e. push down > > > + * the dest receiver to the Gather nodes. > > > + */ > > > > > > Push down to the Gather nodes? I think the right statement will be > > > push down below the Gather node. > > > > Modified. > > > > > 4. > > > intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) > > > { > > > DR_intorel *myState = (DR_intorel *) self; > > > > > > -- Comment ->in parallel worker we don't need to crease dest recv blah blah > > > + if (myState->is_parallel_worker) > > > { > > > --parallel worker handling-- > > > return; > > > } > > > > > > --non-parallel worker code stay right there, instead of moving to else > > > > Done. > > > > > 5. > > > +/* > > > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel > > > + * insertion is possible, if yes set the parallel insert state i.e. push down > > > + * the dest receiver to the Gather nodes. > > > + */ > > > +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc) > > > +{ > > > > > > From function name and comments it appeared that this function will > > > return boolean saying whether > > > Parallel insert should be selected or not. I think name/comment > > > should be better for this > > > > Yeah that function can still return void because no point in returning > > bool there, since the intention is to see if parallel inserts can be > > performed, if yes, set the state otherwise exit. I changed the > > function name to TryParallelizingInsertsInCTAS(). 
Let me know your > > suggestions if that doesn't work out. > > > > > 6. > > > /* > > > + * For parallelizing inserts in CTAS i.e. making each parallel worker > > > + * insert the tuples, we must send information such as into clause (for > > > + * each worker to build separate dest receiver), object id (for each > > > + * worker to open the created table). > > > > > > Comment is saying we need to pass object id but the code under this > > > comment is not doing so. > > > > Improved the comment. > > > > > 7. > > > + /* > > > + * Since there are no rows that are transferred from workers to Gather > > > + * node, so we set it to 0 to be visible in estimated row count of > > > + * explain plans. > > > + */ > > > + queryDesc->planstate->plan->plan_rows = 0; > > > > > > This seems a bit hackies Why it is done after the planning, I mean > > > plan must know that it is returning a 0 rows? > > > > This exists to show up the estimated row count(in case of EXPLAIN CTAS > > without ANALYZE) in the output. For EXPLAIN ANALYZE CTAS actual tuples > > are shown correctly as 0 because Gather doesn't receive any tuples. > > if (es->costs) > > { > > if (es->format == EXPLAIN_FORMAT_TEXT) > > { > > appendStringInfo(es->str, " (cost=%.2f..%.2f rows=%.0f width=%d)", > > plan->startup_cost, plan->total_cost, > > plan->plan_rows, plan->plan_width); > > > > Since it's an estimated row count(which may not be always correct), we > > will let the EXPLAIN plan show that and I think we can remove that > > part. Thoughts? > > > > I removed it in v6 patch set. > > > > > 8. > > > + char *intoclause_space = shm_toc_allocate(pcxt->toc, > > > + intoclause_len); > > > + memcpy(intoclause_space, intoclausestr, intoclause_len); > > > + shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space); > > > > > > One blank line between variable declaration and next code segment, > > > take care at other places as well. > > > > Done. > > > > I'm attaching the v16 patch set. Please note that I added the > > documentation saying that parallel insertions can happen and a sample > > output of the explain to 0003 patch as discussed in [1]. But I didn't > > move the explain output related code to a separate patch because it's > > a small snippet in explain.c. I hope that's okay. > > > > [1] - https://www.postgresql.org/message-id/CAA4eK1JqwXGYoGa1%2B3-f0T50dBGufvKaKQOee_AfFhygZ6QKtA%40mail.gmail.com > > > > Thanks for working on this, I will have a look at the updated patches soon. I have completed reviewing 0001, I don't have more comments, just one question. Soon I will review the remaining patches. + /* If parallel inserts are to be allowed, set a few extra information. */ + if (myState->is_parallel) + { + myState->object_id = intoRelationAddr.objectId; + + /* + * We don't need to skip contacting FSM while inserting tuples for + * parallel mode, while extending the relations, workers instead of + * blocking on a page while another worker is inserting, can check the + * FSM for another page that can accommodate the tuples. This results + * in major benefit for parallel inserts. + */ + myState->ti_options = 0; Is there any performance data for this or just theoretical analysis? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 30, 2020 at 10:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have completed reviewing 0001, I don't have more comments, just one > question. Soon I will review the remaining patches. Thanks. > + /* If parallel inserts are to be allowed, set a few extra information. */ > + if (myState->is_parallel) > + { > + myState->object_id = intoRelationAddr.objectId; > + > + /* > + * We don't need to skip contacting FSM while inserting tuples for > + * parallel mode, while extending the relations, workers instead of > + * blocking on a page while another worker is inserting, can check the > + * FSM for another page that can accommodate the tuples. This results > + * in major benefit for parallel inserts. > + */ > + myState->ti_options = 0; > > Is there any performance data for this or just theoretical analysis? I have seen that we don't get much performance with the skip fsm option, though I don't have the data to back it up. I'm planning to run performance tests after the patches 0001, 0002 and 0003 get reviewed. I will capture the data at that time. Hope that's fine. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 30, 2020 at 10:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, 30 Dec 2020 at 10:47 AM, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: >> >> On Wed, Dec 30, 2020 at 10:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> > I have completed reviewing 0001, I don't have more comments, just one >> > question. Soon I will review the remaining patches. >> >> Thanks. >> >> > + /* If parallel inserts are to be allowed, set a few extra information. */ >> > + if (myState->is_parallel) >> > + { >> > + myState->object_id = intoRelationAddr.objectId; >> > + >> > + /* >> > + * We don't need to skip contacting FSM while inserting tuples for >> > + * parallel mode, while extending the relations, workers instead of >> > + * blocking on a page while another worker is inserting, can check the >> > + * FSM for another page that can accommodate the tuples. This results >> > + * in major benefit for parallel inserts. >> > + */ >> > + myState->ti_options = 0; >> > >> > Is there any performance data for this or just theoretical analysis? >> >> I have seen that we don't get much performance with the skip fsm >> option, though I don't have the data to back it up. I'm planning to >> run performance tests after the patches 0001, 0002 and 0003 get >> reviewed. I will capture the data at that time. Hope that's fine. > > > Yeah that’s fine > Some comments in 0002 1. +/* + * Information sent to the planner from CTAS to account for the cost + * calculations in cost_gather. We need to do this because, no tuples will be + * received by the Gather node if the workers insert the tuples in parallel. + */ +typedef enum CTASParallelInsertOpt +{ + CTAS_PARALLEL_INS_UNDEF = 0, /* undefined */ + CTAS_PARALLEL_INS_SELECT = 1 << 0, /* turn on this before planning */ + /* + * Turn on this while planning for upper Gather path to ignore parallel + * tuple cost in cost_gather. + */ + CTAS_PARALLEL_INS_TUP_COST_CAN_IGN = 1 << 1, + /* Turn on this after the cost is ignored. */ + CTAS_PARALLEL_INS_TUP_COST_IGNORED = 1 << 2 +} CTASParallelInsertOpt; I don't like the naming of these flags. Especially no need to define CTAS_PARALLEL_INS_UNDEF, we can directl use 0 for that purpose instead of giving some weird name. So I suggest first, just get rid of CTAS_PARALLEL_INS_UNDEF. 2. + /* + * Turn on a flag to ignore parallel tuple cost by the Gather path in + * cost_gather if the SELECT is for CTAS and we are generating an upper + * level Gather path. + */ + bool ignore = ignore_parallel_tuple_cost(root); + generate_useful_gather_paths(root, rel, false); + /* + * Reset the ignore flag, in case we turned it on but + * generate_useful_gather_paths returned without reaching cost_gather. + * If we reached cost_gather, we would have been reset it there. + */ + if (ignore && (root->parse->CTASParallelInsInfo & + CTAS_PARALLEL_INS_TUP_COST_CAN_IGN)) + { + root->parse->CTASParallelInsInfo &= + ~CTAS_PARALLEL_INS_TUP_COST_CAN_IGN; + } I think th way we are using these cost ignoring flag, doesn't look clean. I mean first, CTAS_PARALLEL_INS_SELECT is set if it is coming from CTAS and then ignore_parallel_tuple_cost will set the CTAS_PARALLEL_INS_TUP_COST_CAN_IGN if it satisfies certain condition which is fine. Now, internally cost gather will add CTAS_PARALLEL_INS_TUP_COST_IGNORED and remove CTAS_PARALLEL_INS_TUP_COST_CAN_IGN and if CTAS_PARALLEL_INS_TUP_COST_CAN_IGN is not removed then we will remove it outside. Why do we need to remove CTAS_PARALLEL_INS_TUP_COST_CAN_IGN flag at all? 3. 
+ if (tuple_cost_flags && gstate->ps.ps_ProjInfo) + Assert(!(*tuple_cost_flags & CTAS_PARALLEL_INS_TUP_COST_IGNORED)); Instead of adding Assert inside an IF statement, you can convert whole statement as an assert. Lets not add unnecessary if in the release mode. 4. + if ((root->parse->CTASParallelInsInfo & CTAS_PARALLEL_INS_SELECT) && + (root->parse->CTASParallelInsInfo & + CTAS_PARALLEL_INS_TUP_COST_CAN_IGN)) + { + ignore_tuple_cost = true; + root->parse->CTASParallelInsInfo &= + ~CTAS_PARALLEL_INS_TUP_COST_CAN_IGN; + root->parse->CTASParallelInsInfo |= CTAS_PARALLEL_INS_TUP_COST_IGNORED; + } + + if (!ignore_tuple_cost) + run_cost += parallel_tuple_cost * path->path.rows; Changes this to (if, else) as shown below, because if it goes to the IF part then ignore_tuple_cost will always be true so no need to have an extra if check. if ((root->parse->CTASParallelInsInfo & CTAS_PARALLEL_INS_SELECT) && (root->parse->CTASParallelInsInfo & CTAS_PARALLEL_INS_TUP_COST_CAN_IGN)) { ignore_tuple_cost = true; root->parse->CTASParallelInsInfo &= ~CTAS_PARALLEL_INS_TUP_COST_CAN_IGN; root->parse->CTASParallelInsInfo |= CTAS_PARALLEL_INS_TUP_COST_IGNORED; } else run_cost += parallel_tuple_cost * path->path.rows; -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
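For point 3 above, a possible folded form (a sketch; it relies on && short-circuiting so *tuple_cost_flags is only dereferenced when the pointer is non-NULL):

Assert(!(tuple_cost_flags && gstate->ps.ps_ProjInfo &&
         (*tuple_cost_flags & CTAS_PARALLEL_INS_TUP_COST_IGNORED)));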
On Wed, Dec 30, 2020 at 10:47 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Dec 30, 2020 at 10:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have completed reviewing 0001, I don't have more comments, just one > > question. Soon I will review the remaining patches. > > Thanks. > > > + /* If parallel inserts are to be allowed, set a few extra information. */ > > + if (myState->is_parallel) > > + { > > + myState->object_id = intoRelationAddr.objectId; > > + > > + /* > > + * We don't need to skip contacting FSM while inserting tuples for > > + * parallel mode, while extending the relations, workers instead of > > + * blocking on a page while another worker is inserting, can check the > > + * FSM for another page that can accommodate the tuples. This results > > + * in major benefit for parallel inserts. > > + */ > > + myState->ti_options = 0; > > > > Is there any performance data for this or just theoretical analysis? > > I have seen that we don't get much performance with the skip fsm > option, though I don't have the data to back it up. I'm planning to > run performance tests after the patches 0001, 0002 and 0003 get > reviewed. I will capture the data at that time. Hope that's fine. > When you run the performance tests, you can try to capture and publish the relation size and the number of pages that get created for the base table and the CTAS table; you can use something like SELECT relpages FROM pg_class WHERE relname = 'tablename' and SELECT pg_total_relation_size('tablename'), just to make sure that there is no significant difference between the base table and the CTAS table. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 30, 2020 at 9:25 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Dec 30, 2020 at 5:22 AM Zhihong Yu <zyu@yugabyte.com> wrote: > > w.r.t. v17-0004-Enable-CTAS-Parallel-Inserts-For-Append.patch > > > > + * Push the dest receiver to Gather node when it is either at the top of the > > + * plan or under top Append node unless it does not have any projections to do. > > > > I think the 'unless' should be 'if'. As can be seen from the body of the method: > > > > + if (!ps->ps_ProjInfo) > > + { > > + GatherState *gstate = (GatherState *) ps; > > + > > + parallel = true; > > Thanks. Modified it in the 0004 patch. Attaching v18 patch set. Note > that no change in 0001 to 0003 patches from v17. > > Please consider v18 patch set for further review. > Few comments: - /* - * To allow parallel inserts, we need to ensure that they are safe to be - * performed in workers. We have the infrastructure to allow parallel - * inserts in general except for the cases where inserts generate a new - * CommandId (eg. inserts into a table having a foreign key column). - */ - if (IsParallelWorker()) - ereport(ERROR, - (errcode(ERRCODE_INVALID_TRANSACTION_STATE), - errmsg("cannot insert tuples in a parallel worker"))); Is it possible to add a check if it is a CTAS insert here as we do not support insert in parallel workers from others as of now. + Oid objectid; /* workers to open relation/table. */ + /* Number of tuples inserted by all the workers. */ + pg_atomic_uint64 processed; We can just mention relation instead of relation/table. +select explain_pictas( +'create table parallel_write as select length(stringu1) from tenk1;'); + explain_pictas +---------------------------------------------------------- + Gather (actual rows=N loops=N) + Workers Planned: 4 + Workers Launched: N + -> Create parallel_write + -> Parallel Seq Scan on tenk1 (actual rows=N loops=N) +(5 rows) + +select count(*) from parallel_write; Can we include selection of cmin, xmin for one of the test to verify that it uses the same transaction id in the parallel workers something like: select distinct(cmin,xmin) from parallel_write; Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
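A sketch of the first comment above; IsParallelInsertionAllowedInWorker() is a hypothetical helper name standing in for "this worker was launched for a supported parallel insert such as CTAS":

/* Keep rejecting inserts in parallel workers unless this particular insert
 * is one that is explicitly supported (hypothetical helper below). */
if (IsParallelWorker() && !IsParallelInsertionAllowedInWorker())
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
             errmsg("cannot insert tuples in a parallel worker")));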
Thanks for the comments. How about naming like below more generically and placing them in parallel.h so that it will also be used for refresh materialized view? +typedef enum ParallelInsertTupleCostOpt +{ + PINS_SELECT_QUERY = 1 << 0, /* turn on this before planning */ + /* + * Turn on this while planning for upper Gather path to ignore parallel + * tuple cost in cost_gather. + */ + PINS_CAN_IGN_TUP_COST = 1 << 1, + /* Turn on this after the cost is ignored. */ + PINS_TUP_COST_IGNORED = 1 << 2 My plan was to get the main design idea of pushing the dest receiver to gather reviewed and once agreed, then I thought of making few functions common and place them in parallel.h and parallel.c so that they can be used for Parallel Inserts in REFRESH MATERIALIZED VIEW because the same design idea can be applied there as well. For instance my thoughts are: add the below structures, functions and other macros to parallel.h and parallel.c: typedef enum ParallelInsertKind { PINS_UNDEF = 0, PINS_CREATE_TABLE_AS, PINS_REFRESH_MAT_VIEW } ParallelInsertKind; typedef struct ParallelInsertCTASInfo { IntoClause *intoclause; Oid objectid; } ParallelInsertCTASInfo; typedef struct ParallelInsertRMVInfo { Oid objectid; } ParallelInsertRMVInfo; ExecInitParallelPlan(PlanState *planstate, EState *estate, Bitmapset *sendParams, int nworkers, - int64 tuples_needed) + int64 tuples_needed, ParallelInsertKind pinskind, + void *pinsinfo) Change ExecParallelInsertInCTAS to +static void +ExecParallelInsert(GatherState *node) +{ Change SetCTASParallelInsertState to +void +SetParallelInsertState(QueryDesc *queryDesc) Change IsParallelInsertionAllowedInCTAS to +bool +IsParallelInsertionAllowed(ParallelInsertKind pinskind, IntoClause *into) +{ Thoughts? If okay, I can work on these points and add a new patch into the patch set that will have changes for parallel inserts in REFRESH MATERIALIZED VIEW. On Wed, Dec 30, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Some comments in 0002 > > 1. > +/* > + * Information sent to the planner from CTAS to account for the cost > + * calculations in cost_gather. We need to do this because, no tuples will be > + * received by the Gather node if the workers insert the tuples in parallel. > + */ > +typedef enum CTASParallelInsertOpt > +{ > + CTAS_PARALLEL_INS_UNDEF = 0, /* undefined */ > + CTAS_PARALLEL_INS_SELECT = 1 << 0, /* turn on this before planning */ > + /* > + * Turn on this while planning for upper Gather path to ignore parallel > + * tuple cost in cost_gather. > + */ > + CTAS_PARALLEL_INS_TUP_COST_CAN_IGN = 1 << 1, > + /* Turn on this after the cost is ignored. */ > + CTAS_PARALLEL_INS_TUP_COST_IGNORED = 1 << 2 > +} CTASParallelInsertOpt; > > > I don't like the naming of these flags. Especially no need to define > CTAS_PARALLEL_INS_UNDEF, we can directl use 0 > for that purpose instead of giving some weird name. So I suggest > first, just get rid of CTAS_PARALLEL_INS_UNDEF. +1. I will change it in the next version of the patch. > 2. > + /* > + * Turn on a flag to ignore parallel tuple cost by the Gather path in > + * cost_gather if the SELECT is for CTAS and we are generating an upper > + * level Gather path. > + */ > + bool ignore = ignore_parallel_tuple_cost(root); > + > generate_useful_gather_paths(root, rel, false); > > + /* > + * Reset the ignore flag, in case we turned it on but > + * generate_useful_gather_paths returned without reaching cost_gather. > + * If we reached cost_gather, we would have been reset it there. 
> + */ > + if (ignore && (root->parse->CTASParallelInsInfo & > + CTAS_PARALLEL_INS_TUP_COST_CAN_IGN)) > + { > + root->parse->CTASParallelInsInfo &= > + ~CTAS_PARALLEL_INS_TUP_COST_CAN_IGN; > + } > > I think th way we are using these cost ignoring flag, doesn't look clean. > > I mean first, CTAS_PARALLEL_INS_SELECT is set if it is coming from > CTAS and then ignore_parallel_tuple_cost will > set the CTAS_PARALLEL_INS_TUP_COST_CAN_IGN if it satisfies certain > condition which is fine. Now, internally cost > gather will add CTAS_PARALLEL_INS_TUP_COST_IGNORED and remove > CTAS_PARALLEL_INS_TUP_COST_CAN_IGN and if > CTAS_PARALLEL_INS_TUP_COST_CAN_IGN is not removed then we will remove > it outside. Why do we need to remove > CTAS_PARALLEL_INS_TUP_COST_CAN_IGN flag at all? Yes we don't need to remove the CTAS_PARALLEL_INS_TUP_COST_CAN_IGN flag. I will change it in the next version. > 3. > + if (tuple_cost_flags && gstate->ps.ps_ProjInfo) > + Assert(!(*tuple_cost_flags & CTAS_PARALLEL_INS_TUP_COST_IGNORED)); > > Instead of adding Assert inside an IF statement, you can convert whole > statement as an assert. Lets not add unnecessary > if in the release mode. +1. I will change it in the version. > 4. > + if ((root->parse->CTASParallelInsInfo & CTAS_PARALLEL_INS_SELECT) && > + (root->parse->CTASParallelInsInfo & > + CTAS_PARALLEL_INS_TUP_COST_CAN_IGN)) > + { > + ignore_tuple_cost = true; > + root->parse->CTASParallelInsInfo &= > + ~CTAS_PARALLEL_INS_TUP_COST_CAN_IGN; > + root->parse->CTASParallelInsInfo |= CTAS_PARALLEL_INS_TUP_COST_IGNORED; > + } > + > + if (!ignore_tuple_cost) > + run_cost += parallel_tuple_cost * path->path.rows; > > Changes this to (if, else) as shown below, because if it goes to the > IF part then ignore_tuple_cost will always be true > so no need to have an extra if check. > > if ((root->parse->CTASParallelInsInfo & CTAS_PARALLEL_INS_SELECT) && > (root->parse->CTASParallelInsInfo & > CTAS_PARALLEL_INS_TUP_COST_CAN_IGN)) > { > ignore_tuple_cost = true; > root->parse->CTASParallelInsInfo &= > ~CTAS_PARALLEL_INS_TUP_COST_CAN_IGN; > root->parse->CTASParallelInsInfo |= CTAS_PARALLEL_INS_TUP_COST_IGNORED; > } > else > run_cost += parallel_tuple_cost * path->path.rows; +1. I will change it in the next version. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 30, 2020 at 7:47 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Thanks for the comments. > > How about naming like below more generically and placing them in > parallel.h so that it will also be used for refresh materialized view? > > +typedef enum ParallelInsertTupleCostOpt > +{ > + PINS_SELECT_QUERY = 1 << 0, /* turn on this before planning */ > + /* > + * Turn on this while planning for upper Gather path to ignore parallel > + * tuple cost in cost_gather. > + */ > + PINS_CAN_IGN_TUP_COST = 1 << 1, > + /* Turn on this after the cost is ignored. */ > + PINS_TUP_COST_IGNORED = 1 << 2 > > My plan was to get the main design idea of pushing the dest receiver > to gather reviewed and once agreed, then I thought of making few > functions common and place them in parallel.h and parallel.c so that > they can be used for Parallel Inserts in REFRESH MATERIALIZED VIEW > because the same design idea can be applied there as well. I think instead of PINS_* we can name PARALLEL_INSERT_* other than that I am fine with the name. > For instance my thoughts are: add the below structures, functions and > other macros to parallel.h and parallel.c: > typedef enum ParallelInsertKind > { > PINS_UNDEF = 0, > PINS_CREATE_TABLE_AS, > PINS_REFRESH_MAT_VIEW > } ParallelInsertKind; > > typedef struct ParallelInsertCTASInfo > { > IntoClause *intoclause; > Oid objectid; > } ParallelInsertCTASInfo; > > typedef struct ParallelInsertRMVInfo > { > Oid objectid; > } ParallelInsertRMVInfo; > > ExecInitParallelPlan(PlanState *planstate, EState *estate, > Bitmapset *sendParams, int nworkers, > - int64 tuples_needed) > + int64 tuples_needed, ParallelInsertKind pinskind, > + void *pinsinfo) > > Change ExecParallelInsertInCTAS to > > +static void > +ExecParallelInsert(GatherState *node) > +{ > > Change SetCTASParallelInsertState to > +void > +SetParallelInsertState(QueryDesc *queryDesc) > > Change IsParallelInsertionAllowedInCTAS to > > +bool > +IsParallelInsertionAllowed(ParallelInsertKind pinskind, IntoClause *into) > +{ > > Thoughts? > I haven’t thought about these structures yet but yeah making them generic will be good. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 30, 2020 at 5:26 PM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, Dec 30, 2020 at 10:47 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Wed, Dec 30, 2020 at 10:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > I have completed reviewing 0001, I don't have more comments, just one > > > question. Soon I will review the remaining patches. > > > > Thanks. > > > > > + /* If parallel inserts are to be allowed, set a few extra information. */ > > > + if (myState->is_parallel) > > > + { > > > + myState->object_id = intoRelationAddr.objectId; > > > + > > > + /* > > > + * We don't need to skip contacting FSM while inserting tuples for > > > + * parallel mode, while extending the relations, workers instead of > > > + * blocking on a page while another worker is inserting, can check the > > > + * FSM for another page that can accommodate the tuples. This results > > > + * in major benefit for parallel inserts. > > > + */ > > > + myState->ti_options = 0; > > > > > > Is there any performance data for this or just theoretical analysis? > > > > I have seen that we don't get much performance with the skip fsm > > option, though I don't have the data to back it up. I'm planning to > > run performance tests after the patches 0001, 0002 and 0003 get > > reviewed. I will capture the data at that time. Hope that's fine. > > > > When you run the performance tests, you can try to capture and publish > relation size & the number of pages that are getting created for base > table and the CTAS table, you can use something like SELECT relpages > FROM pg_class WHERE relname = 'tablename & SELECT > pg_total_relation_size('tablename'). Just to make sure that there is > no significant difference between the base table and CTAS table. I can do that, I'm sure the number of pages will be equal or little more, since I observed this for parallel copy. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 30, 2020 at 5:28 PM vignesh C <vignesh21@gmail.com> wrote: > Few comments: > - /* > - * To allow parallel inserts, we need to ensure that they are safe to be > - * performed in workers. We have the infrastructure to allow parallel > - * inserts in general except for the cases where inserts generate a new > - * CommandId (eg. inserts into a table having a foreign key column). > - */ > - if (IsParallelWorker()) > - ereport(ERROR, > - (errcode(ERRCODE_INVALID_TRANSACTION_STATE), > - errmsg("cannot insert tuples in a > parallel worker"))); > > Is it possible to add a check if it is a CTAS insert here as we do not > support insert in parallel workers from others as of now. Currently, there's no global variable in which we can selectively skip this in case of parallel insertion in CTAS. How about having a variable in any of the worker global contexts, set that when parallel insertion is chosen for CTAS and use that in heap_prepare_insert() to skip the above error? Eventually, we can remove this restriction entirely in case we fully allow parallelism for INSERT INTO SELECT, CTAS, and COPY. Thoughts? > + Oid objectid; /* workers to > open relation/table. */ > + /* Number of tuples inserted by all the workers. */ > + pg_atomic_uint64 processed; > > We can just mention relation instead of relation/table. I will modify it in the next patch set. > +select explain_pictas( > +'create table parallel_write as select length(stringu1) from tenk1;'); > + explain_pictas > +---------------------------------------------------------- > + Gather (actual rows=N loops=N) > + Workers Planned: 4 > + Workers Launched: N > + -> Create parallel_write > + -> Parallel Seq Scan on tenk1 (actual rows=N loops=N) > +(5 rows) > + > +select count(*) from parallel_write; > > Can we include selection of cmin, xmin for one of the test to verify > that it uses the same transaction id in the parallel workers > something like: > select distinct(cmin,xmin) from parallel_write; This is not possible since cmin and xmin are dynamic, we can not use them in test cases. I think it's not necessary to check whether the leader and workers are in the same txn or not, since we are not creating a new txn. All the txn state from the leader is serialized in SerializeTransactionState and restored in StartParallelWorkerTransaction. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On 30-12-2020 04:55, Bharath Rupireddy wrote: > On Wed, Dec 30, 2020 at 5:22 AM Zhihong Yu <zyu@yugabyte.com> wrote: >> w.r.t. v17-0004-Enable-CTAS-Parallel-Inserts-For-Append.patch >> >> + * Push the dest receiver to Gather node when it is either at the top of the >> + * plan or under top Append node unless it does not have any projections to do. >> >> I think the 'unless' should be 'if'. As can be seen from the body of the method: >> >> + if (!ps->ps_ProjInfo) >> + { >> + GatherState *gstate = (GatherState *) ps; >> + >> + parallel = true; > > Thanks. Modified it in the 0004 patch. Attaching v18 patch set. Note > that no change in 0001 to 0003 patches from v17. > > Please consider v18 patch set for further review. > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com > Hi, Sorry it took so long to get back to reviewing this. wrt v18-0001....patch: + /* + * If the worker is for parallel insert in CTAS, then use the proper + * dest receiver. + */ + intoclause = (IntoClause *) stringToNode(intoclausestr); + receiver = CreateIntoRelDestReceiver(intoclause); + ((DR_intorel *)receiver)->is_parallel_worker = true; + ((DR_intorel *)receiver)->object_id = fpes->objectid; I would move this into a function called e.g. GetCTASParallelWorkerReceiver so that the details wrt CTAS can be put in createas.c. I would then also split up intorel_startup into intorel_leader_startup and intorel_worker_startup, and in GetCTASParallelWorkerReceiver set self->pub.rStartup to intorel_worker_startup. + volatile pg_atomic_uint64 *processed; why is it volatile? + if (isctas) + { + intoclause = ((DR_intorel *) node->dest)->into; + objectid = ((DR_intorel *) node->dest)->object_id; + } Given that you extract them each once and then pass them directly into the parallel-worker, can't you instead pass in the destreceiver and leave that logic to ExecInitParallelPlan? + if (IS_PARALLEL_CTAS_DEST(gstate->dest) && + ((DR_intorel *) gstate->dest)->into->rel && + ((DR_intorel *) gstate->dest)->into->rel->relname) why would rel and relname not be there? if no rows have been inserted? because it seems from the intorel_startup function that that would be set as soon as startup was done, which i assume (wrongly?) is always done? + * In case if no workers were launched, allow the leader to insert entire + * tuples. what does "entire tuples" mean? should it maybe be "all tuples"? ================ wrt v18-0002....patch: It looks like this introduces a state machine that goes like: - starts at CTAS_PARALLEL_INS_UNDEF - possibly moves to CTAS_PARALLEL_INS_SELECT - CTAS_PARALLEL_INS_TUP_COST_CAN_IGN can be added - if both were added at some stage, we can go to CTAS_PARALLEL_INS_TUP_COST_IGNORED and ignore the costs what i'm wondering is why you opted to put logic around generate_useful_gather_paths and in cost_gather when to me it seems more logical to put it in create_gather_path? i'm probably missing something there? ================ wrt v18-0003....patch: not sure if it is needed, but i was wondering if we would want more tests with multiple gather nodes existing? caused e.g. by using CTE's, valid subquery's (like the one test you have, but without the group by/having)? Kind regards, Luc
Hi > ================ > wrt v18-0002....patch: > > It looks like this introduces a state machine that goes like: > - starts at CTAS_PARALLEL_INS_UNDEF > - possibly moves to CTAS_PARALLEL_INS_SELECT > - CTAS_PARALLEL_INS_TUP_COST_CAN_IGN can be added > - if both were added at some stage, we can go to > CTAS_PARALLEL_INS_TUP_COST_IGNORED and ignore the costs > > what i'm wondering is why you opted to put logic around > generate_useful_gather_paths and in cost_gather when to me it seems more > logical to put it in create_gather_path? i'm probably missing something > there? IMO, The reason is we want to make sure we only ignore the cost when Gather is the top node. And it seems the generate_useful_gather_paths called in apply_scanjoin_target_to_paths is the right place which can onlycreate top node Gather. So we change the flag in apply_scanjoin_target_to_paths around generate_useful_gather_paths to identify the top node. Best regards, houzj
On 04-01-2021 12:16, Hou, Zhijie wrote: > Hi > >> ================ >> wrt v18-0002....patch: >> >> It looks like this introduces a state machine that goes like: >> - starts at CTAS_PARALLEL_INS_UNDEF >> - possibly moves to CTAS_PARALLEL_INS_SELECT >> - CTAS_PARALLEL_INS_TUP_COST_CAN_IGN can be added >> - if both were added at some stage, we can go to >> CTAS_PARALLEL_INS_TUP_COST_IGNORED and ignore the costs >> >> what i'm wondering is why you opted to put logic around >> generate_useful_gather_paths and in cost_gather when to me it seems more >> logical to put it in create_gather_path? i'm probably missing something >> there? > > IMO, The reason is we want to make sure we only ignore the cost when Gather is the top node. > And it seems the generate_useful_gather_paths called in apply_scanjoin_target_to_paths is the right place which can onlycreate top node Gather. > So we change the flag in apply_scanjoin_target_to_paths around generate_useful_gather_paths to identify the top node. > > > Best regards, > houzj > > Hi, I was wondering actually if we need the state machine. Reason is that as AFAICS the code could be placed in create_gather_path, where you can also check if it is a top gather node, whether the dest receiver is the right type, etc? To me that seems like a nicer solution as its makes that all logic that decides whether or not a parallel CTAS is valid is in a single place instead of distributed over various places. Kind regards, Luc
On Thu, Dec 31, 2020 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > How about naming like below more generically and placing them in > > parallel.h so that it will also be used for refresh materialized view? > > > > +typedef enum ParallelInsertTupleCostOpt > > +{ > > + PINS_SELECT_QUERY = 1 << 0, /* turn on this before planning */ > > + /* > > + * Turn on this while planning for upper Gather path to ignore parallel > > + * tuple cost in cost_gather. > > + */ > > + PINS_CAN_IGN_TUP_COST = 1 << 1, > > + /* Turn on this after the cost is ignored. */ > > + PINS_TUP_COST_IGNORED = 1 << 2 > > > > My plan was to get the main design idea of pushing the dest receiver > > to gather reviewed and once agreed, then I thought of making few > > functions common and place them in parallel.h and parallel.c so that > > they can be used for Parallel Inserts in REFRESH MATERIALIZED VIEW > > because the same design idea can be applied there as well. > > I think instead of PINS_* we can name PARALLEL_INSERT_* other than > that I am fine with the name. Done. > > > For instance my thoughts are: add the below structures, functions and > > other macros to parallel.h and parallel.c: > > typedef enum ParallelInsertKind > > { > > PINS_UNDEF = 0, > > PINS_CREATE_TABLE_AS, > > PINS_REFRESH_MAT_VIEW > > } ParallelInsertKind; > > > > typedef struct ParallelInsertCTASInfo > > { > > IntoClause *intoclause; > > Oid objectid; > > } ParallelInsertCTASInfo; > > > > typedef struct ParallelInsertRMVInfo > > { > > Oid objectid; > > } ParallelInsertRMVInfo; > > > > ExecInitParallelPlan(PlanState *planstate, EState *estate, > > Bitmapset *sendParams, int nworkers, > > - int64 tuples_needed) > > + int64 tuples_needed, ParallelInsertKind pinskind, > > + void *pinsinfo) > > > > Change ExecParallelInsertInCTAS to > > > > +static void > > +ExecParallelInsert(GatherState *node) > > +{ > > > > Change SetCTASParallelInsertState to > > +void > > +SetParallelInsertState(QueryDesc *queryDesc) > > > > Change IsParallelInsertionAllowedInCTAS to > > > > +bool > > +IsParallelInsertionAllowed(ParallelInsertKind pinskind, IntoClause *into) > > +{ > > > > Thoughts? > > > > I haven’t thought about these structures yet but yeah making them > generic will be good. Attaching v19 patch set. It has following changes: 1) generic code which can easily be extended to parallel inserts in Refresh Materialized View, parallelizing Copy To command 2) addressing the review comments received so far. Once these patches are reviewed and get to the commit stage, I can post a separate patch (probably in a separate thread) for parallel inserts in Refresh Materialized View based on this patch set. Please review the v19 patch set further. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
> Sorry it took so long to get back to reviewing this.
Thanks for the comments.
> wrt v18-0001....patch:
>
> + /*
> + * If the worker is for parallel insert in CTAS, then use the proper
> + * dest receiver.
> + */
> + intoclause = (IntoClause *) stringToNode(intoclausestr);
> + receiver = CreateIntoRelDestReceiver(intoclause);
> + ((DR_intorel *)receiver)->is_parallel_worker = true;
> + ((DR_intorel *)receiver)->object_id = fpes->objectid;
> I would move this into a function called e.g.
> GetCTASParallelWorkerReceiver so that the details wrt CTAS can be put in
> createas.c.
> I would then also split up intorel_startup into intorel_leader_startup
> and intorel_worker_startup, and in GetCTASParallelWorkerReceiver set
> self->pub.rStartup to intorel_worker_startup.
My intention was to not add any new APIs to the dest receiver. I simply made the changes in intorel_startup, in which workers just do the minimal work and exit early. In the leader, most of the table creation and sanity checks are kept untouched. Please have a look at the v19 patch posted upthread [1].
> + volatile pg_atomic_uint64 *processed;
> why is it volatile?
The intention is to always read from the actual memory location. I took it from the way pg_atomic_fetch_add_u64_impl, pg_atomic_compare_exchange_u64_impl, pg_atomic_init_u64_impl and their u32 counterparts declare the parameter as volatile pg_atomic_uint64 *ptr.
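For reference, the shared counter is driven with the plain pg_atomic API, roughly as in the sketch below (fpes is the shared fixed parallel state the patch extends; total_processed and leader_processed are only placeholders, not names from the patch):
/* leader, while initializing the shared memory area */
pg_atomic_init_u64(&fpes->processed, 0);
/* each worker, after it finishes inserting its share of tuples */
pg_atomic_add_fetch_u64(&fpes->processed, queryDesc->estate->es_processed);
/* leader, combining the totals before reporting the row count to the client */
total_processed = pg_atomic_read_u64(&fpes->processed) + leader_processed;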
> + if (isctas)
> + {
> + intoclause = ((DR_intorel *) node->dest)->into;
> + objectid = ((DR_intorel *) node->dest)->object_id;
> + }
> Given that you extract them each once and then pass them directly into
> the parallel-worker, can't you instead pass in the destreceiver and
> leave that logic to ExecInitParallelPlan?
That's changed entirely in the v19 patch set posted upthread [1]. Please have a look. I didn't pass the dest receiver; to keep the API generic, I passed the parallel insert command type and a void * pointer to the insertion command information, because what we pass to the workers depends on the command (for instance, CTAS needs the into clause and the object id, whereas Refresh Materialized View needs only the object id).
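To make that concrete, the leader-side call could look roughly like the sketch below (only an illustration of the proposed API, not the exact patch code; sendParams, nworkers and tuples_needed stand for whatever the existing ExecInitParallelPlan call already passes):
ParallelInsertCTASInfo info;

info.intoclause = ((DR_intorel *) node->dest)->into;
info.objectid = ((DR_intorel *) node->dest)->object_id;

/* existing arguments unchanged; the last two are the proposed additions */
pei = ExecInitParallelPlan(outerPlanState(node), estate, sendParams,
                           nworkers, tuples_needed,
                           PARALLEL_INSERT_CMD_CREATE_TABLE_AS, &info);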
>
> + if (IS_PARALLEL_CTAS_DEST(gstate->dest) &&
> + ((DR_intorel *) gstate->dest)->into->rel &&
> + ((DR_intorel *) gstate->dest)->into->rel->relname)
> why would rel and relname not be there? if no rows have been inserted?
> because it seems from the intorel_startup function that that would be
> set as soon as startup was done, which i assume (wrongly?) is always done?
Actually, that into clause rel variable is always being set in the gram.y for CTAS, Create Materialized View and SELECT INTO (because the qualified_name non-terminal is not optional). My bad. I just added it as a sanity check. Actually, it's not required.
create_as_target:
qualified_name opt_column_list table_access_method_clause
OptWith OnCommitOption OptTableSpace
{
$$ = makeNode(IntoClause);
$$->rel = $1;
create_mv_target:
qualified_name opt_column_list table_access_method_clause opt_reloptions OptTableSpace
{
$$ = makeNode(IntoClause);
$$->rel = $1;
into_clause:
INTO OptTempTableName
{
$$ = makeNode(IntoClause);
$$->rel = $2;
I will change the below code:
+ if (GetParallelInsertCmdType(gstate->dest) ==
+ PARALLEL_INSERT_CMD_CREATE_TABLE_AS &&
+ ((DR_intorel *) gstate->dest)->into &&
+ ((DR_intorel *) gstate->dest)->into->rel &&
+ ((DR_intorel *) gstate->dest)->into->rel->relname)
+ {
to:
+ if (GetParallelInsertCmdType(gstate->dest) ==
+ PARALLEL_INSERT_CMD_CREATE_TABLE_AS)
+ {
I will update this in the next version of the patch set.
> + * In case if no workers were launched, allow the leader to insert entire
> + * tuples.
> what does "entire tuples" mean? should it maybe be "all tuples"?
Yeah, noticed that while working on the v19 patch set. Please have a look at the v19 patch posted upthread [1].
> ================
> wrt v18-0003....patch:
>
> not sure if it is needed, but i was wondering if we would want more
> tests with multiple gather nodes existing? caused e.g. by using CTE's,
> valid subquery's (like the one test you have, but without the group
> by/having)?
I'm not sure if we can have CTAS/CMV/SELECT INTO in CTEs like WITH, WITH RECURSIVE and I don't see that any of the WITH clause processing hits createas.c functions. So, IMHO, we don't need to add them. Please let me know if there are any specific use cases you have in mind.
For instance, I tried to cover Init/Sub Plan and Subquery cases with:
below case has multiple Gather, Init Plan:
+-- parallel inserts must occur, as there is init plan that gets executed by
+-- each parallel worker
+select explain_pictas(
+'create table parallel_write as select two col1,
+ (select two from (select * from tenk2) as tt limit 1) col2
+ from tenk1 where tenk1.four = 3;');
below case has Gather, Sub Plan:
+-- parallel inserts must not occur, as there is sub plan that gets executed by
+-- the Gather node in leader
+select explain_pictas(
+'create table parallel_write as select two col1,
+ (select tenk1.two from generate_series(1,1)) col2
+ from tenk1 where tenk1.four = 3;');
For multiple Gather node cases, I covered them with the Union All/Append cases in the 0004 patch. Please have a look.
[1] - https://www.postgresql.org/message-id/CALj2ACWth7mVQtqdYJwSn1mNmaHwxNE7YSYxRSLmfkqxRk%2Bzmg%40mail.gmail.com
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 4, 2021 at 5:44 PM Luc Vlaming <luc@swarm64.com> wrote: > On 04-01-2021 12:16, Hou, Zhijie wrote: > >> ================ > >> wrt v18-0002....patch: > >> > >> It looks like this introduces a state machine that goes like: > >> - starts at CTAS_PARALLEL_INS_UNDEF > >> - possibly moves to CTAS_PARALLEL_INS_SELECT > >> - CTAS_PARALLEL_INS_TUP_COST_CAN_IGN can be added > >> - if both were added at some stage, we can go to > >> CTAS_PARALLEL_INS_TUP_COST_IGNORED and ignore the costs > >> > >> what i'm wondering is why you opted to put logic around > >> generate_useful_gather_paths and in cost_gather when to me it seems more > >> logical to put it in create_gather_path? i'm probably missing something > >> there? > > > > IMO, The reason is we want to make sure we only ignore the cost when Gather is the top node. > > And it seems the generate_useful_gather_paths called in apply_scanjoin_target_to_paths is the right place which can onlycreate top node Gather. > > So we change the flag in apply_scanjoin_target_to_paths around generate_useful_gather_paths to identify the top node. Right. We wanted to ignore parallel tuple cost for only the upper Gather path. > I was wondering actually if we need the state machine. Reason is that as > AFAICS the code could be placed in create_gather_path, where you can > also check if it is a top gather node, whether the dest receiver is the > right type, etc? To me that seems like a nicer solution as its makes > that all logic that decides whether or not a parallel CTAS is valid is > in a single place instead of distributed over various places. IMO, we can't determine the fact that we are going to generate the top Gather path in create_gather_path. To decide on whether or not the top Gather path generation, I think it's not only required to check the root->query_level == 1 but we also need to rely on from where generate_useful_gather_paths gets called. For instance, for query_level 1, generate_useful_gather_paths gets called from 2 places in apply_scanjoin_target_to_paths. Likewise, create_gather_path also gets called from many places. IMO, the current way i.e. setting flag it in apply_scanjoin_target_to_paths and ignoring based on that in cost_gather seems safe. I may be wrong. Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 4, 2021 at 7:02 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > > + if (IS_PARALLEL_CTAS_DEST(gstate->dest) && > > + ((DR_intorel *) gstate->dest)->into->rel && > > + ((DR_intorel *) gstate->dest)->into->rel->relname) > > why would rel and relname not be there? if no rows have been inserted? > > because it seems from the intorel_startup function that that would be > > set as soon as startup was done, which i assume (wrongly?) is always done? > > Actually, that into clause rel variable is always being set in the gram.y for CTAS, Create Materialized View and SELECTINTO (because qualified_name non-terminal is not optional). My bad. I just added it as a sanity check. Actually, it'snot required. > > create_as_target: > qualified_name opt_column_list table_access_method_clause > OptWith OnCommitOption OptTableSpace > { > $$ = makeNode(IntoClause); > $$->rel = $1; > create_mv_target: > qualified_name opt_column_list table_access_method_clause opt_reloptions OptTableSpace > { > $$ = makeNode(IntoClause); > $$->rel = $1; > into_clause: > INTO OptTempTableName > { > $$ = makeNode(IntoClause); > $$->rel = $2; > > I will change the below code: > + if (GetParallelInsertCmdType(gstate->dest) == > + PARALLEL_INSERT_CMD_CREATE_TABLE_AS && > + ((DR_intorel *) gstate->dest)->into && > + ((DR_intorel *) gstate->dest)->into->rel && > + ((DR_intorel *) gstate->dest)->into->rel->relname) > + { > > to: > + if (GetParallelInsertCmdType(gstate->dest) == > + PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > + { > > I will update this in the next version of the patch set. Attaching v20 patch set that has above change in 0001 patch, note that 0002 to 0004 patches have no changes from v19. Please consider the v20 patch set for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Jan 4, 2021 at 3:07 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Dec 30, 2020 at 5:28 PM vignesh C <vignesh21@gmail.com> wrote: > > Few comments: > > - /* > > - * To allow parallel inserts, we need to ensure that they are safe to be > > - * performed in workers. We have the infrastructure to allow parallel > > - * inserts in general except for the cases where inserts generate a new > > - * CommandId (eg. inserts into a table having a foreign key column). > > - */ > > - if (IsParallelWorker()) > > - ereport(ERROR, > > - (errcode(ERRCODE_INVALID_TRANSACTION_STATE), > > - errmsg("cannot insert tuples in a > > parallel worker"))); > > > > Is it possible to add a check if it is a CTAS insert here as we do not > > support insert in parallel workers from others as of now. > > Currently, there's no global variable in which we can selectively skip > this in case of parallel insertion in CTAS. How about having a > variable in any of the worker global contexts, set that when parallel > insertion is chosen for CTAS and use that in heap_prepare_insert() to > skip the above error? Eventually, we can remove this restriction > entirely in case we fully allow parallelism for INSERT INTO SELECT, > CTAS, and COPY. > > Thoughts? Yes, I felt that the leader can store the command as CTAS and the leader/worker can use it to check and throw an error. The similar change can be used for the parallel insert patches and once all the patches are committed, we can remove it eventually. > > > + Oid objectid; /* workers to > > open relation/table. */ > > + /* Number of tuples inserted by all the workers. */ > > + pg_atomic_uint64 processed; > > > > We can just mention relation instead of relation/table. > > I will modify it in the next patch set. > > > +select explain_pictas( > > +'create table parallel_write as select length(stringu1) from tenk1;'); > > + explain_pictas > > +---------------------------------------------------------- > > + Gather (actual rows=N loops=N) > > + Workers Planned: 4 > > + Workers Launched: N > > + -> Create parallel_write > > + -> Parallel Seq Scan on tenk1 (actual rows=N loops=N) > > +(5 rows) > > + > > +select count(*) from parallel_write; > > > > Can we include selection of cmin, xmin for one of the test to verify > > that it uses the same transaction id in the parallel workers > > something like: > > select distinct(cmin,xmin) from parallel_write; > > This is not possible since cmin and xmin are dynamic, we can not use > them in test cases. I think it's not necessary to check whether the > leader and workers are in the same txn or not, since we are not > creating a new txn. All the txn state from the leader is serialized in > SerializeTransactionState and restored in > StartParallelWorkerTransaction. > I had seen in your patch that you serialize and use the same transaction, but it will be good if you can have at least one test case to validate that the leader and worker both use the same transaction. To solve the problem that you are facing where cmin and xmin are dynamic, you can check the distinct count by using something like below: SELECT COUNT(*) FROM (SELECT DISTINCT cmin,xmin FROM t1) as dt; Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On 04-01-2021 14:32, Bharath Rupireddy wrote: > On Mon, Jan 4, 2021 at 4:22 PM Luc Vlaming <luc@swarm64.com > <mailto:luc@swarm64.com>> wrote: > > Sorry it took so long to get back to reviewing this. > > Thanks for the comments. > > > wrt v18-0001....patch: > > > > + /* > > + * If the worker is for parallel insert in CTAS, then > use the proper > > + * dest receiver. > > + */ > > + intoclause = (IntoClause *) stringToNode(intoclausestr); > > + receiver = CreateIntoRelDestReceiver(intoclause); > > + ((DR_intorel *)receiver)->is_parallel_worker = true; > > + ((DR_intorel *)receiver)->object_id = fpes->objectid; > > I would move this into a function called e.g. > > GetCTASParallelWorkerReceiver so that the details wrt CTAS can be put in > > createas.c. > > I would then also split up intorel_startup into intorel_leader_startup > > and intorel_worker_startup, and in GetCTASParallelWorkerReceiver set > > self->pub.rStartup to intorel_worker_startup. > > My intention was to not add any new APIs to the dest receiver. I simply > made the changes in intorel_startup, in which for workers it just does > the minimalistic work and exit from it. In the leader most of the table > creation and sanity check is kept untouched. Please have a look at the > v19 patch posted upthread [1]. > Looks much better, really nicely abstracted away in the v20 patch. > > + volatile pg_atomic_uint64 *processed; > > why is it volatile? > > Intention is to always read from the actual memory location. I referred > it from the way pg_atomic_fetch_add_u64_impl, > pg_atomic_compare_exchange_u64_impl, pg_atomic_init_u64_impl and their > u32 counterparts use pass the parameter as volatile pg_atomic_uint64 *ptr. > Okay I had not seen this syntax before for atomics with the volatile keyword but its apparently how the atomics abstraction works in postgresql. > > + if (isctas) > > + { > > + intoclause = ((DR_intorel *) > node->dest)->into; > > + objectid = ((DR_intorel *) > node->dest)->object_id; > > + } > > Given that you extract them each once and then pass them directly into > > the parallel-worker, can't you instead pass in the destreceiver and > > leave that logic to ExecInitParallelPlan? > > That's changed entirely in the v19 patch set posted upthread [1]. Please > have a look. I didn't pass the dest receiver, to keep the API generic, I > passed parallel insert command type and a void * ptr which points to > insertion command because the information we pass to workers depends on > the insertion command (for instance, the information needed by workers > is for CTAS into clause and object id and for Refresh Mat View object id). > > > > > + if > (IS_PARALLEL_CTAS_DEST(gstate->dest) && > > + ((DR_intorel *) > gstate->dest)->into->rel && > > + ((DR_intorel *) > gstate->dest)->into->rel->relname) > > why would rel and relname not be there? if no rows have been inserted? > > because it seems from the intorel_startup function that that would be > > set as soon as startup was done, which i assume (wrongly?) is always > done? > > Actually, that into clause rel variable is always being set in the > gram.y for CTAS, Create Materialized View and SELECT INTO (because > qualified_name non-terminal is not optional). My bad. I just added it as > a sanity check. Actually, it's not required. 
> > create_as_target: > *qualified_name* opt_column_list table_access_method_clause > OptWith OnCommitOption OptTableSpace > { > $$ = makeNode(IntoClause); > * $$->rel = $1;* > create_mv_target: > *qualified_name* opt_column_list table_access_method_clause > opt_reloptions OptTableSpace > { > $$ = makeNode(IntoClause); > * $$->rel = $1;* > into_clause: > INTO OptTempTableName > { > $$ = makeNode(IntoClause); > * $$->rel = $2;* > > I will change the below code: > + if (GetParallelInsertCmdType(gstate->dest) == > + PARALLEL_INSERT_CMD_CREATE_TABLE_AS && > + ((DR_intorel *) gstate->dest)->into && > + ((DR_intorel *) gstate->dest)->into->rel && > + ((DR_intorel *) gstate->dest)->into->rel->relname) > + { > > to: > + if (GetParallelInsertCmdType(gstate->dest) == > + PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > + { > > I will update this in the next version of the patch set. > Thanks > > > > + * In case if no workers were launched, allow the leader to > insert entire > > + * tuples. > > what does "entire tuples" mean? should it maybe be "all tuples"? > > Yeah, noticed that while working on the v19 patch set. Please have a > look at the v19 patch posted upthread [1]. > > > ================ > > wrt v18-0003....patch: > > > > not sure if it is needed, but i was wondering if we would want more > > tests with multiple gather nodes existing? caused e.g. by using CTE's, > > valid subquery's (like the one test you have, but without the group > > by/having)? > > I'm not sure if we can have CTAS/CMV/SELECT INTO in CTEs like WITH, WITH > RECURSIVE and I don't see that any of the WITH clause processing hits > createas.c functions. So, IMHO, we don't need to add them. Please let me > know if there are any specific use cases you have in mind. > > For instance, I tried to cover Init/Sub Plan and Subquery cases with: > > below case has multiple Gather, Init Plan: > +-- parallel inserts must occur, as there is init plan that gets executed by > +-- each parallel worker > +select explain_pictas( > +'create table parallel_write as select two col1, > + (select two from (select * from tenk2) as tt limit 1) col2 > + from tenk1 where tenk1.four = 3;'); > > below case has Gather, Sub Plan: > +-- parallel inserts must not occur, as there is sub plan that gets > executed by > +-- the Gather node in leader > +select explain_pictas( > +'create table parallel_write as select two col1, > + (select tenk1.two from generate_series(1,1)) col2 > + from tenk1 where tenk1.four = 3;'); > > For multiple Gather node cases, I covered them with the Union All/Append > cases in the 0004 patch. Please have a look. > Right, had not reviewed part 4 yet. My bad. > [1] - > https://www.postgresql.org/message-id/CALj2ACWth7mVQtqdYJwSn1mNmaHwxNE7YSYxRSLmfkqxRk%2Bzmg%40mail.gmail.com > <https://www.postgresql.org/message-id/CALj2ACWth7mVQtqdYJwSn1mNmaHwxNE7YSYxRSLmfkqxRk%2Bzmg%40mail.gmail.com> > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com <http://www.enterprisedb.com> Kind regards, Luc
On 05-01-2021 04:59, Bharath Rupireddy wrote: > On Mon, Jan 4, 2021 at 7:02 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: >> >>> + if (IS_PARALLEL_CTAS_DEST(gstate->dest) && >>> + ((DR_intorel *) gstate->dest)->into->rel && >>> + ((DR_intorel *) gstate->dest)->into->rel->relname) >>> why would rel and relname not be there? if no rows have been inserted? >>> because it seems from the intorel_startup function that that would be >>> set as soon as startup was done, which i assume (wrongly?) is always done? >> >> Actually, that into clause rel variable is always being set in the gram.y for CTAS, Create Materialized View and SELECTINTO (because qualified_name non-terminal is not optional). My bad. I just added it as a sanity check. Actually, it'snot required. >> >> create_as_target: >> qualified_name opt_column_list table_access_method_clause >> OptWith OnCommitOption OptTableSpace >> { >> $$ = makeNode(IntoClause); >> $$->rel = $1; >> create_mv_target: >> qualified_name opt_column_list table_access_method_clause opt_reloptions OptTableSpace >> { >> $$ = makeNode(IntoClause); >> $$->rel = $1; >> into_clause: >> INTO OptTempTableName >> { >> $$ = makeNode(IntoClause); >> $$->rel = $2; >> >> I will change the below code: >> + if (GetParallelInsertCmdType(gstate->dest) == >> + PARALLEL_INSERT_CMD_CREATE_TABLE_AS && >> + ((DR_intorel *) gstate->dest)->into && >> + ((DR_intorel *) gstate->dest)->into->rel && >> + ((DR_intorel *) gstate->dest)->into->rel->relname) >> + { >> >> to: >> + if (GetParallelInsertCmdType(gstate->dest) == >> + PARALLEL_INSERT_CMD_CREATE_TABLE_AS) >> + { >> >> I will update this in the next version of the patch set. > > Attaching v20 patch set that has above change in 0001 patch, note that > 0002 to 0004 patches have no changes from v19. Please consider the v20 > patch set for further review. > > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com > Hi, Reviewing further v20-0001: I would still opt for moving the code for the parallel worker into a separate function, and then setting rStartup of the dest receiver to that function in ExecParallelGetInsReceiver, as its completely independent code. Just a matter of style I guess. Maybe I'm not completely following why but afaics we want parallel inserts in various scenarios, not just CTAS? I'm asking because code like + if (fpes->ins_cmd_type == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) + pg_atomic_add_fetch_u64(&fpes->processed, queryDesc->estate->es_processed); seems very specific to CTAS. For now that seems fine but I suppose that would be generalized soon after? Basically I would have expected the if to compare against PARALLEL_INSERT_CMD_UNDEF. Apart from these small things v20-0001 looks (very) good to me. v20-0002: will reply on the specific mail-thread about the state machine v20-0003 and v20-0004: looks good to me. Kind regards, Luc
On 04-01-2021 14:53, Bharath Rupireddy wrote: > On Mon, Jan 4, 2021 at 5:44 PM Luc Vlaming <luc@swarm64.com> wrote: >> On 04-01-2021 12:16, Hou, Zhijie wrote: >>>> ================ >>>> wrt v18-0002....patch: >>>> >>>> It looks like this introduces a state machine that goes like: >>>> - starts at CTAS_PARALLEL_INS_UNDEF >>>> - possibly moves to CTAS_PARALLEL_INS_SELECT >>>> - CTAS_PARALLEL_INS_TUP_COST_CAN_IGN can be added >>>> - if both were added at some stage, we can go to >>>> CTAS_PARALLEL_INS_TUP_COST_IGNORED and ignore the costs >>>> >>>> what i'm wondering is why you opted to put logic around >>>> generate_useful_gather_paths and in cost_gather when to me it seems more >>>> logical to put it in create_gather_path? i'm probably missing something >>>> there? >>> >>> IMO, The reason is we want to make sure we only ignore the cost when Gather is the top node. >>> And it seems the generate_useful_gather_paths called in apply_scanjoin_target_to_paths is the right place which can onlycreate top node Gather. >>> So we change the flag in apply_scanjoin_target_to_paths around generate_useful_gather_paths to identify the top node. > > Right. We wanted to ignore parallel tuple cost for only the upper Gather path. > >> I was wondering actually if we need the state machine. Reason is that as >> AFAICS the code could be placed in create_gather_path, where you can >> also check if it is a top gather node, whether the dest receiver is the >> right type, etc? To me that seems like a nicer solution as its makes >> that all logic that decides whether or not a parallel CTAS is valid is >> in a single place instead of distributed over various places. > > IMO, we can't determine the fact that we are going to generate the top > Gather path in create_gather_path. To decide on whether or not the top > Gather path generation, I think it's not only required to check the > root->query_level == 1 but we also need to rely on from where > generate_useful_gather_paths gets called. For instance, for > query_level 1, generate_useful_gather_paths gets called from 2 places > in apply_scanjoin_target_to_paths. Likewise, create_gather_path also > gets called from many places. IMO, the current way i.e. setting flag > it in apply_scanjoin_target_to_paths and ignoring based on that in > cost_gather seems safe. > > I may be wrong. Thoughts? > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com > So the way I understand it the requirements are: - it needs to be the top-most gather - it should not do anything with the rows after the gather node as this would make the parallel inserts conceptually invalid. Right now we're trying to judge what might be added on-top that could change the rows by inspecting all parts of the root object that would cause anything to be added, and add a little statemachine to track the state of that knowledge. To me this has the downside that the list in HAS_PARENT_PATH_GENERATING_CLAUSE has to be exhaustive, and we need to make sure it stays up-to-date, which could result in regressions if not tracked carefully. Personally I would therefore go for a design which is safe in the sense that regressions are not as easily introduced. IMHO that could be done by inspecting the planned query afterwards, and then judging whether or not the parallel inserts are actually the right thing to do. 
Another way to create more safety against regressions would be to add an assert upon execution of the query that if we do parallel inserts that only a subset of allowed nodes exists above the gather node. Some (not extremely fact checked) approaches as food for thought: 1. Plan the query as normal, and then afterwards look at the resulting plan to see if there are only nodes that are ok between the gather node and the top node, which afaics would only be things like append nodes. Which would mean two things: - at the end of subquery_planner before the final_rel is fetched, we add another pass like the grouping_planner called e.g. parallel_modify_planner or so, which traverses the query plan and checks if the inserts would indeed be executed parallel, and if so sets the cost of the gather to 0. - we always keep around the best gathered partial path, or the partial path itself. 2. Generate both gather paths: one with zero cost for the inserts and one with costs. the one with zero costs would however be kept separately and added as prime candidate for the final rel. then we can check in the subquery_planner if the final candidate is different and then choose. Kind regards, Luc
On Tue, Jan 5, 2021 at 10:08 AM vignesh C <vignesh21@gmail.com> wrote: > On Mon, Jan 4, 2021 at 3:07 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Wed, Dec 30, 2020 at 5:28 PM vignesh C <vignesh21@gmail.com> wrote: > > > Few comments: > > > - /* > > > - * To allow parallel inserts, we need to ensure that they are safe to be > > > - * performed in workers. We have the infrastructure to allow parallel > > > - * inserts in general except for the cases where inserts generate a new > > > - * CommandId (eg. inserts into a table having a foreign key column). > > > - */ > > > - if (IsParallelWorker()) > > > - ereport(ERROR, > > > - (errcode(ERRCODE_INVALID_TRANSACTION_STATE), > > > - errmsg("cannot insert tuples in a > > > parallel worker"))); > > > > > > Is it possible to add a check if it is a CTAS insert here as we do not > > > support insert in parallel workers from others as of now. > > > > Currently, there's no global variable in which we can selectively skip > > this in case of parallel insertion in CTAS. How about having a > > variable in any of the worker global contexts, set that when parallel > > insertion is chosen for CTAS and use that in heap_prepare_insert() to > > skip the above error? Eventually, we can remove this restriction > > entirely in case we fully allow parallelism for INSERT INTO SELECT, > > CTAS, and COPY. > > > > Thoughts? > > Yes, I felt that the leader can store the command as CTAS and the > leader/worker can use it to check and throw an error. The similar > change can be used for the parallel insert patches and once all the > patches are committed, we can remove it eventually. We can skip the error "cannot insert tuples in a parallel worker" in heap_prepare_insert() selectively for each parallel insertion and eventually we can remove that error after all the parallel insertion related patches are committed. The main problem is that we should be knowing in heap_prepare_insert() that we are coming from parallel insertion for CTAS, or some other command at the same time we don't want to alter the table_tuple_insert()/heap_prepare_insert() API because this change will be removed eventually. We can achieve this in below ways: 1) Add a backend global variable, set it before each table_tuple_insert() in intorel_receive() and use that in heap_prepare_insert() to skip the error. 2) Add a variable to MyBgworkerEntry structure, set it before each table_tuple_insert() in intorel_receive() or in ParallelQueryMain() if we are for CTAS parallel insertion and use that in heap_prepare_insert() to skip the error. 3) Currently, we pass table insert options to table_tuple_insert()/heap_prepare_insert(), which is a bitmap of below values. We could also add something like #define PARALLEL_INSERTION_CMD_CTAS 0x000F, set it before each table_tuple_insert() in intorel_receive() and use that in heap_prepare_insert() to skip the error, then unset it. /* "options" flag bits for table_tuple_insert */ /* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */ #define TABLE_INSERT_SKIP_FSM 0x0002 #define TABLE_INSERT_FROZEN 0x0004 #define TABLE_INSERT_NO_LOGICAL 0x0008 IMO either 2 or 3 would be fine. Thoughts? > > > + Oid objectid; /* workers to > > > open relation/table. */ > > > + /* Number of tuples inserted by all the workers. */ > > > + pg_atomic_uint64 processed; > > > > > > We can just mention relation instead of relation/table. > > > > I will modify it in the next patch set. 
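Coming back to the heap_prepare_insert() change, option 3 above might end up looking roughly like the sketch below (the flag name and the bit value 0x0010 are placeholders chosen only to avoid colliding with the existing TABLE_INSERT_* bits, not names from the patch):
/* hypothetical extra bit for the "options" passed to table_tuple_insert() */
#define TABLE_INSERT_PARALLEL_CTAS 0x0010

/* in heap_prepare_insert(), error out only when the caller has not marked
 * the insert as a supported parallel CTAS insert */
if (IsParallelWorker() && (options & TABLE_INSERT_PARALLEL_CTAS) == 0)
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
             errmsg("cannot insert tuples in a parallel worker")));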
> > > > > +select explain_pictas( > > > +'create table parallel_write as select length(stringu1) from tenk1;'); > > > + explain_pictas > > > +---------------------------------------------------------- > > > + Gather (actual rows=N loops=N) > > > + Workers Planned: 4 > > > + Workers Launched: N > > > + -> Create parallel_write > > > + -> Parallel Seq Scan on tenk1 (actual rows=N loops=N) > > > +(5 rows) > > > + > > > +select count(*) from parallel_write; > > > > > > Can we include selection of cmin, xmin for one of the test to verify > > > that it uses the same transaction id in the parallel workers > > > something like: > > > select distinct(cmin,xmin) from parallel_write; > > > > This is not possible since cmin and xmin are dynamic, we can not use > > them in test cases. I think it's not necessary to check whether the > > leader and workers are in the same txn or not, since we are not > > creating a new txn. All the txn state from the leader is serialized in > > SerializeTransactionState and restored in > > StartParallelWorkerTransaction. > > > > I had seen in your patch that you serialize and use the same > transaction, but it will be good if you can have at least one test > case to validate that the leader and worker both use the same > transaction. To solve the problem that you are facing where cmin and > xmin are dynamic, you can check the distinct count by using something > like below: > SELECT COUNT(*) FROM (SELECT DISTINCT cmin,xmin FROM t1) as dt; Thanks. So, the expectation is that the above query should always return 1 if both leader and workers shared the same txn. I will add this to one of the test cases in the next version of the patch set. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jan 5, 2021 at 12:43 PM Luc Vlaming <luc@swarm64.com> wrote: > > On 04-01-2021 14:32, Bharath Rupireddy wrote: > > On Mon, Jan 4, 2021 at 4:22 PM Luc Vlaming <luc@swarm64.com > > <mailto:luc@swarm64.com>> wrote: > > > Sorry it took so long to get back to reviewing this. > > > > Thanks for the comments. > > > > > wrt v18-0001....patch: > > > > > > + /* > > > + * If the worker is for parallel insert in CTAS, then > > use the proper > > > + * dest receiver. > > > + */ > > > + intoclause = (IntoClause *) stringToNode(intoclausestr); > > > + receiver = CreateIntoRelDestReceiver(intoclause); > > > + ((DR_intorel *)receiver)->is_parallel_worker = true; > > > + ((DR_intorel *)receiver)->object_id = fpes->objectid; > > > I would move this into a function called e.g. > > > GetCTASParallelWorkerReceiver so that the details wrt CTAS can be put in > > > createas.c. > > > I would then also split up intorel_startup into intorel_leader_startup > > > and intorel_worker_startup, and in GetCTASParallelWorkerReceiver set > > > self->pub.rStartup to intorel_worker_startup. > > > > My intention was to not add any new APIs to the dest receiver. I simply > > made the changes in intorel_startup, in which for workers it just does > > the minimalistic work and exit from it. In the leader most of the table > > creation and sanity check is kept untouched. Please have a look at the > > v19 patch posted upthread [1]. > > > > Looks much better, really nicely abstracted away in the v20 patch. > > > > + volatile pg_atomic_uint64 *processed; > > > why is it volatile? > > > > Intention is to always read from the actual memory location. I referred > > it from the way pg_atomic_fetch_add_u64_impl, > > pg_atomic_compare_exchange_u64_impl, pg_atomic_init_u64_impl and their > > u32 counterparts use pass the parameter as volatile pg_atomic_uint64 *ptr. But in your case, I do not understand the intention that where do you think that the compiler can optimize it and read the old value? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 05-01-2021 11:32, Dilip Kumar wrote: > On Tue, Jan 5, 2021 at 12:43 PM Luc Vlaming <luc@swarm64.com> wrote: >> >> On 04-01-2021 14:32, Bharath Rupireddy wrote: >>> On Mon, Jan 4, 2021 at 4:22 PM Luc Vlaming <luc@swarm64.com >>> <mailto:luc@swarm64.com>> wrote: >>> > Sorry it took so long to get back to reviewing this. >>> >>> Thanks for the comments. >>> >>> > wrt v18-0001....patch: >>> > >>> > + /* >>> > + * If the worker is for parallel insert in CTAS, then >>> use the proper >>> > + * dest receiver. >>> > + */ >>> > + intoclause = (IntoClause *) stringToNode(intoclausestr); >>> > + receiver = CreateIntoRelDestReceiver(intoclause); >>> > + ((DR_intorel *)receiver)->is_parallel_worker = true; >>> > + ((DR_intorel *)receiver)->object_id = fpes->objectid; >>> > I would move this into a function called e.g. >>> > GetCTASParallelWorkerReceiver so that the details wrt CTAS can be put in >>> > createas.c. >>> > I would then also split up intorel_startup into intorel_leader_startup >>> > and intorel_worker_startup, and in GetCTASParallelWorkerReceiver set >>> > self->pub.rStartup to intorel_worker_startup. >>> >>> My intention was to not add any new APIs to the dest receiver. I simply >>> made the changes in intorel_startup, in which for workers it just does >>> the minimalistic work and exit from it. In the leader most of the table >>> creation and sanity check is kept untouched. Please have a look at the >>> v19 patch posted upthread [1]. >>> >> >> Looks much better, really nicely abstracted away in the v20 patch. >> >>> > + volatile pg_atomic_uint64 *processed; >>> > why is it volatile? >>> >>> Intention is to always read from the actual memory location. I referred >>> it from the way pg_atomic_fetch_add_u64_impl, >>> pg_atomic_compare_exchange_u64_impl, pg_atomic_init_u64_impl and their >>> u32 counterparts use pass the parameter as volatile pg_atomic_uint64 *ptr. > > But in your case, I do not understand the intention that where do you > think that the compiler can optimize it and read the old value? > It can not and should not. I had just only seen so far c++ atomic variables and not a (postgres-specific?) c atomic variable which apparently requires the volatile keyword. My stupidity ;) Cheers, Luc
On Tue, Jan 5, 2021 at 1:00 PM Luc Vlaming <luc@swarm64.com> wrote: > Reviewing further v20-0001: > > I would still opt for moving the code for the parallel worker into a > separate function, and then setting rStartup of the dest receiver to > that function in ExecParallelGetInsReceiver, as its completely > independent code. Just a matter of style I guess. If we were to have a intorel_startup_worker and assign it to self->pub.rStartup, 1) we can do it in the CreateIntoRelDestReceiver, we have to pass a parameter to CreateIntoRelDestReceiver as an indication of parallel worker, which requires code changes in places wherever CreateIntoRelDestReceiver is used. 2) we can also assign intorel_startup_worker after CreateIntoRelDestReceiver in ExecParallelGetInsReceiver, but that doesn't look good to me. 3) we can duplicate CreateIntoRelDestReceiver and have a CreateIntoRelParallelDestReceiver with the only change being that self->pub.rStartup = intorel_startup_worker; IMHO, the way it is currently, looks good. Anyways, I'm open to changing that if we agree on any of the above 3 ways. If we were to do any of the above, then we might have to do the same thing for other commands Refresh Materialized View or Copy To where we can parallelize. Thoughts? > Maybe I'm not completely following why but afaics we want parallel > inserts in various scenarios, not just CTAS? I'm asking because code like > + if (fpes->ins_cmd_type == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > + pg_atomic_add_fetch_u64(&fpes->processed, > queryDesc->estate->es_processed); > seems very specific to CTAS. For now that seems fine but I suppose that > would be generalized soon after? Basically I would have expected the if > to compare against PARALLEL_INSERT_CMD_UNDEF. After this patch is reviewed and goes for commit, then the next thing I plan to do is to allow parallel inserts in Refresh Materialized View and it can be used for that. I think the processed variable can also be used for parallel inserts in INSERT INTO SELECT [1] as well. Currently, I'm keeping it for CTAS, maybe later (after this is committed) it can be generalized. Thoughts? [1] - https://www.postgresql.org/message-id/CAA4eK1LMmz58ej5BgVLJ8VsUGd%3D%2BKcaA8X%3DkStORhxpfpODOxg%40mail.gmail.com > Apart from these small things v20-0001 looks (very) good to me. > v20-0003 and v20-0004: > looks good to me. Thanks. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
For v20-0001-Parallel-Inserts-in-CREATE-TABLE-AS.patch :
ParallelInsCmdEstimate :
+ Assert(pcxt && ins_info &&
+ (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS));
+
+ if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS)
Since the if condition is covered by the assertion, I wonder why the if check is still needed.
Similar comment for SaveParallelInsCmdFixedInfo and SaveParallelInsCmdInfo
On Wed, Jan 6, 2021 at 8:19 AM Zhihong Yu <zyu@yugabyte.com> wrote:
> For v20-0001-Parallel-Inserts-in-CREATE-TABLE-AS.patch :
>
> ParallelInsCmdEstimate :
>
> + Assert(pcxt && ins_info &&
> + (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS));
> +
> + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS)
>
> Since the if condition is covered by the assertion, I wonder why the if check is still needed.
>
> Similar comment for SaveParallelInsCmdFixedInfo and SaveParallelInsCmdInfo
Thanks.
The idea is to have an assertion with all the expected ins_cmd types, and
then later to have selective handling for different ins_cmds. For
example, if we add (in future) parallel insertion in Refresh
Materialized View, then the code in those functions will be something
like:
+static void
+ParallelInsCmdEstimate(ParallelContext *pcxt, ParallelInsertCmdKind ins_cmd,
+ void *ins_info)
+{
+ Assert(pcxt && ins_info &&
+ (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS ||
+ ins_cmd == PARALLEL_INSERT_CMD_REFRESH_MAT_VIEW));
+
+ if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS)
+ {
+
+ }
+ else if (ins_cmd == PARALLEL_INSERT_CMD_REFRESH_MAT_VIEW)
+ {
+
+ }
Similarly for other functions as well.
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 6, 2021 at 9:23 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>

+/*
+ * List the commands here for which parallel insertions are possible.
+ */
+typedef enum ParallelInsertCmdKind
+{
+ PARALLEL_INSERT_CMD_UNDEF = 0,
+ PARALLEL_INSERT_CMD_CREATE_TABLE_AS
+} ParallelInsertCmdKind;

I see there is some code that is generic for CTAS and INSERT INTO SELECT *, so is it possible to take out that common code to a separate base patch? Later both CTAS and INSERT INTO SELECT * can expand that for their usage.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
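For illustration, a rough sketch of what such a common base piece could look like; only ParallelInsertCmdKind, ParallelInsCmdEstimate and SaveParallelInsCmdInfo appear in the posted patches, while the header name, the include and the overall layout are hypothetical and only meant to show the shape of the split:

/*
 * Hypothetical parallel_insert.h carved out as a base patch that both
 * CTAS and INSERT INTO SELECT could build on (guard macro omitted).
 */
#include "access/parallel.h"

typedef enum ParallelInsertCmdKind
{
	PARALLEL_INSERT_CMD_UNDEF = 0,
	PARALLEL_INSERT_CMD_CREATE_TABLE_AS
} ParallelInsertCmdKind;

/* command-specific info is passed through an opaque pointer */
extern void ParallelInsCmdEstimate(ParallelContext *pcxt,
								   ParallelInsertCmdKind ins_cmd,
								   void *ins_info);
extern void SaveParallelInsCmdInfo(ParallelContext *pcxt,
								   ParallelInsertCmdKind ins_cmd,
								   void *ins_info);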
> > For v20-0001-Parallel-Inserts-in-CREATE-TABLE-AS.patch : > > > > ParallelInsCmdEstimate : > > > > + Assert(pcxt && ins_info && > > + (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS)); > > + > > + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > > > > Sinc the if condition is covered by the assertion, I wonder why the if > check is still needed. > > > > Similar comment for SaveParallelInsCmdFixedInfo and > > SaveParallelInsCmdInfo > > Thanks. > > The idea is to have assertion with all the expected ins_cmd types, and then > later to have selective handling for different ins_cmds. For example, if > we add (in future) parallel insertion in Refresh Materialized View, then > the code in those functions will be something > like: > > +static void > +ParallelInsCmdEstimate(ParallelContext *pcxt, ParallelInsertCmdKind > ins_cmd, > + void *ins_info) > +{ > + Assert(pcxt && ins_info && > + (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS || > + (ins_cmd == PARALLEL_INSERT_CMD_REFRESH_MAT_VIEW)); > + > + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > + { > + > + } > + else if (ns_cmd == PARALLEL_INSERT_CMD_REFRESH_MAT_VIEW) > + { > + > + } > > Similarly for other functions as well. I think it makes sense. And if the check about ' ins_cmd == xxx1 || ins_cmd == xxx2' may be used in some places, How about define a generic function with some comment to mention the purpose. An example in INSERT INTO SELECT patch: +/* + * IsModifySupportedInParallelMode + * + * Indicates whether execution of the specified table-modification command + * (INSERT/UPDATE/DELETE) in parallel-mode is supported, subject to certain + * parallel-safety conditions. + */ +static inline bool +IsModifySupportedInParallelMode(CmdType commandType) +{ + /* Currently only INSERT is supported */ + return (commandType == CMD_INSERT); +} Best regards, houzj
On Wed, Jan 6, 2021 at 10:05 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> The plan sounds good.
>
> Before the second command type is added, can you leave out the 'if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS)' and keep the pair of curlies?
>
> You can add the if condition back when the second command type is added.

Thanks. IMO, an empty pair of curlies is not a good idea. Having if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) doesn't harm anything.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 6, 2021 at 11:06 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > For v20-0001-Parallel-Inserts-in-CREATE-TABLE-AS.patch : > > > > > > ParallelInsCmdEstimate : > > > > > > + Assert(pcxt && ins_info && > > > + (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS)); > > > + > > > + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > > > > > > Sinc the if condition is covered by the assertion, I wonder why the if > > check is still needed. > > > > > > Similar comment for SaveParallelInsCmdFixedInfo and > > > SaveParallelInsCmdInfo > > > > Thanks. > > > > The idea is to have assertion with all the expected ins_cmd types, and then > > later to have selective handling for different ins_cmds. For example, if > > we add (in future) parallel insertion in Refresh Materialized View, then > > the code in those functions will be something > > like: > > > > +static void > > +ParallelInsCmdEstimate(ParallelContext *pcxt, ParallelInsertCmdKind > > ins_cmd, > > + void *ins_info) > > +{ > > + Assert(pcxt && ins_info && > > + (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS || > > + (ins_cmd == PARALLEL_INSERT_CMD_REFRESH_MAT_VIEW)); > > + > > + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > > + { > > + > > + } > > + else if (ns_cmd == PARALLEL_INSERT_CMD_REFRESH_MAT_VIEW) > > + { > > + > > + } > > > > Similarly for other functions as well. > > I think it makes sense. > > And if the check about ' ins_cmd == xxx1 || ins_cmd == xxx2' may be used in some places, > How about define a generic function with some comment to mention the purpose. > > An example in INSERT INTO SELECT patch: > +/* > + * IsModifySupportedInParallelMode > + * > + * Indicates whether execution of the specified table-modification command > + * (INSERT/UPDATE/DELETE) in parallel-mode is supported, subject to certain > + * parallel-safety conditions. > + */ > +static inline bool > +IsModifySupportedInParallelMode(CmdType commandType) > +{ > + /* Currently only INSERT is supported */ > + return (commandType == CMD_INSERT); > +} The intention of assert is to verify that those functions are called for appropriate commands such as CTAS, Refresh Mat View and so on with correct parameters. I really don't think so we can replace the assert with a function like above, in the release mode assertion will always be true. In a way, that assertion is for only debugging purposes. And I also think that when we as the callers know when to call those new functions, we can even remove the assertions, if they are really a problem here. Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
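For reference, the reason an Assert cannot stand in for a runtime check here is that it compiles away entirely in builds without assertions enabled; roughly (simplified from src/include/c.h):

#ifdef USE_ASSERT_CHECKING
#define Assert(condition) \
	do { \
		if (!(condition)) \
			ExceptionalCondition(#condition, "FailedAssertion", \
								 __FILE__, __LINE__); \
	} while (0)
#else
#define Assert(condition)	((void) true)
#endif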
On Wed, Jan 6, 2021 at 10:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jan 6, 2021 at 9:23 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > +/* > + * List the commands here for which parallel insertions are possible. > + */ > +typedef enum ParallelInsertCmdKind > +{ > + PARALLEL_INSERT_CMD_UNDEF = 0, > + PARALLEL_INSERT_CMD_CREATE_TABLE_AS > +} ParallelInsertCmdKind; > > I see there is some code that is generic for CTAS and INSERT INTO > SELECT *, So is it > possible to take out that common code to a separate base patch? Later > both CTAS and INSERT INTO SELECT * can expand > that for their usage. I currently see the common code for parallel inserts i.e. insert into selects, copy, ctas/create mat view/refresh mat view is the code in - heapam.c, xact.c and xact.h. I can make a separate patch if required for these changes alone. Thoughts? IIRC parallel inserts in insert into select and copy don't use the design idea of pushing the dest receiver down to Gather. Whereas ctas/create mat view, refresh mat view, copy to can use the idea of pushing the dest receiver to Gather and can easily extend on the patches I made here. Is there anything else do you feel that we can have in common? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
> > I think it makes sense. > > > > And if the check about ' ins_cmd == xxx1 || ins_cmd == xxx2' may be > > used in some places, How about define a generic function with some comment > to mention the purpose. > > > > An example in INSERT INTO SELECT patch: > > +/* > > + * IsModifySupportedInParallelMode > > + * > > + * Indicates whether execution of the specified table-modification > > +command > > + * (INSERT/UPDATE/DELETE) in parallel-mode is supported, subject to > > +certain > > + * parallel-safety conditions. > > + */ > > +static inline bool > > +IsModifySupportedInParallelMode(CmdType commandType) { > > + /* Currently only INSERT is supported */ > > + return (commandType == CMD_INSERT); } > > The intention of assert is to verify that those functions are called for > appropriate commands such as CTAS, Refresh Mat View and so on with correct > parameters. I really don't think so we can replace the assert with a function > like above, in the release mode assertion will always be true. In a way, > that assertion is for only debugging purposes. And I also think that when > we as the callers know when to call those new functions, we can even remove > the assertions, if they are really a problem here. Thoughts? Hi Thanks for the explanation. If the check about command type is only used in assert, I think you are right. I suggested a new function because I guess the check can be used in some other places. Such as: + /* Okay to parallelize inserts, so mark it. */ + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) + ((DR_intorel *) dest)->is_parallel = true; + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) + ((DR_intorel *) dest)->is_parallel = false; Or + if (fpes->ins_cmd_type == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) + pg_atomic_add_fetch_u64(&fpes->processed, queryDesc->estate->es_processed); If you think the above code will extend the ins_cmd type check in the future, the generic function may make sense. Best regards, houzj
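If the same ins_cmd check does start showing up in more call sites, a helper along the lines of the IsModifySupportedInParallelMode example above could be used; a sketch only, the function name here is made up and only CTAS exists in the current patch set:

static inline bool
IsParallelInsertCmd(ParallelInsertCmdKind ins_cmd)
{
	/* Currently only CREATE TABLE AS supports parallel inserts. */
	return (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS);
}

Call sites would then read, for example, if (IsParallelInsertCmd(fpes->ins_cmd_type)).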
On Wed, Jan 6, 2021 at 11:26 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Jan 6, 2021 at 10:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Jan 6, 2021 at 9:23 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > +/* > > + * List the commands here for which parallel insertions are possible. > > + */ > > +typedef enum ParallelInsertCmdKind > > +{ > > + PARALLEL_INSERT_CMD_UNDEF = 0, > > + PARALLEL_INSERT_CMD_CREATE_TABLE_AS > > +} ParallelInsertCmdKind; > > > > I see there is some code that is generic for CTAS and INSERT INTO > > SELECT *, So is it > > possible to take out that common code to a separate base patch? Later > > both CTAS and INSERT INTO SELECT * can expand > > that for their usage. > > I currently see the common code for parallel inserts i.e. insert into > selects, copy, ctas/create mat view/refresh mat view is the code in - > heapam.c, xact.c and xact.h. I can make a separate patch if required > for these changes alone. Thoughts? I just saw this structure (ParallelInsertCmdKind) where it is defining the ParallelInsertCmdKind and also usage is different based on the command type. So I think the code which is defining the generic code e.g. this structure and other similar code can go to the first patch and we can build the remaining patch atop that patch. But if you think this is just this structure and not much code is common then we can let it be. > IIRC parallel inserts in insert into select and copy don't use the > design idea of pushing the dest receiver down to Gather. Whereas > ctas/create mat view, refresh mat view, copy to can use the idea of > pushing the dest receiver to Gather and can easily extend on the > patches I made here. > > Is there anything else do you feel that we can have in common? Nothing specific. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 6, 2021 at 11:30 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > I think it makes sense. > > > > > > And if the check about ' ins_cmd == xxx1 || ins_cmd == xxx2' may be > > > used in some places, How about define a generic function with some comment > > to mention the purpose. > > > > > > An example in INSERT INTO SELECT patch: > > > +/* > > > + * IsModifySupportedInParallelMode > > > + * > > > + * Indicates whether execution of the specified table-modification > > > +command > > > + * (INSERT/UPDATE/DELETE) in parallel-mode is supported, subject to > > > +certain > > > + * parallel-safety conditions. > > > + */ > > > +static inline bool > > > +IsModifySupportedInParallelMode(CmdType commandType) { > > > + /* Currently only INSERT is supported */ > > > + return (commandType == CMD_INSERT); } > > > > The intention of assert is to verify that those functions are called for > > appropriate commands such as CTAS, Refresh Mat View and so on with correct > > parameters. I really don't think so we can replace the assert with a function > > like above, in the release mode assertion will always be true. In a way, > > that assertion is for only debugging purposes. And I also think that when > > we as the callers know when to call those new functions, we can even remove > > the assertions, if they are really a problem here. Thoughts? > Hi > > Thanks for the explanation. > > If the check about command type is only used in assert, I think you are right. > I suggested a new function because I guess the check can be used in some other places. > Such as: > > + /* Okay to parallelize inserts, so mark it. */ > + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > + ((DR_intorel *) dest)->is_parallel = true; > > + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > + ((DR_intorel *) dest)->is_parallel = false; We need to know exactly what is the command in above place, to dereference and mark is_parallel to true, because is_parallel is being added to the respective structures, not to the generic _DestReceiver structure. So, in future the above code becomes something like below: + /* Okay to parallelize inserts, so mark it. */ + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) + ((DR_intorel *) dest)->is_parallel = true; + else if (ins_cmd == PARALLEL_INSERT_CMD_REFRESH_MAT_VIEW) + ((DR_transientrel *) dest)->is_parallel = true; + else if (ins_cmd == PARALLEL_INSERT_CMD_COPY_TO) + ((DR_copy *) dest)->is_parallel = true; In the below place, instead of new function, I think we can just have something like if (fpes->ins_cmd_type != PARALLEL_INSERT_CMD_UNDEF) > Or > > + if (fpes->ins_cmd_type == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > + pg_atomic_add_fetch_u64(&fpes->processed, queryDesc->estate->es_processed); > > If you think the above code will extend the ins_cmd type check in the future, the generic function may make sense. We can also change below to fpes->ins_cmd_type != PARALLEL_INSERT_CMD_UNDEF. + if (fpes->ins_cmd_type == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) + receiver = ExecParallelGetInsReceiver(toc, fpes); If okay, I will modify it in the next version of the patch. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On 05-01-2021 13:57, Bharath Rupireddy wrote: > On Tue, Jan 5, 2021 at 1:00 PM Luc Vlaming <luc@swarm64.com> wrote: >> Reviewing further v20-0001: >> >> I would still opt for moving the code for the parallel worker into a >> separate function, and then setting rStartup of the dest receiver to >> that function in ExecParallelGetInsReceiver, as its completely >> independent code. Just a matter of style I guess. > > If we were to have a intorel_startup_worker and assign it to > self->pub.rStartup, 1) we can do it in the CreateIntoRelDestReceiver, > we have to pass a parameter to CreateIntoRelDestReceiver as an > indication of parallel worker, which requires code changes in places > wherever CreateIntoRelDestReceiver is used. 2) we can also assign > intorel_startup_worker after CreateIntoRelDestReceiver in > ExecParallelGetInsReceiver, but that doesn't look good to me. 3) we > can duplicate CreateIntoRelDestReceiver and have a > CreateIntoRelParallelDestReceiver with the only change being that > self->pub.rStartup = intorel_startup_worker; > > IMHO, the way it is currently, looks good. Anyways, I'm open to > changing that if we agree on any of the above 3 ways. The current way is good enough, it was a suggestion as personally I find it hard to read to have two completely separate code paths in the same function. If any I would opt for something like 3) where there's a CreateIntoRelParallelDestReceiver which calls CreateIntoRelDestReceiver and then overrides rStartup to intorel_startup_worker. Then no callsites have to change except the ones that are for parallel workers. > > If we were to do any of the above, then we might have to do the same > thing for other commands Refresh Materialized View or Copy To where we > can parallelize. > > Thoughts? > >> Maybe I'm not completely following why but afaics we want parallel >> inserts in various scenarios, not just CTAS? I'm asking because code like >> + if (fpes->ins_cmd_type == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) >> + pg_atomic_add_fetch_u64(&fpes->processed, >> queryDesc->estate->es_processed); >> seems very specific to CTAS. For now that seems fine but I suppose that >> would be generalized soon after? Basically I would have expected the if >> to compare against PARALLEL_INSERT_CMD_UNDEF. > > After this patch is reviewed and goes for commit, then the next thing > I plan to do is to allow parallel inserts in Refresh Materialized View > and it can be used for that. I think the processed variable can also > be used for parallel inserts in INSERT INTO SELECT [1] as well. > Currently, I'm keeping it for CTAS, maybe later (after this is > committed) it can be generalized. > > Thoughts? Sounds good > > [1] - https://www.postgresql.org/message-id/CAA4eK1LMmz58ej5BgVLJ8VsUGd%3D%2BKcaA8X%3DkStORhxpfpODOxg%40mail.gmail.com > >> Apart from these small things v20-0001 looks (very) good to me. >> v20-0003 and v20-0004: >> looks good to me. > > Thanks. > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com > Kind regards, Luc
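A minimal sketch of the wrapper described above, assuming intorel_startup_worker has been split out as a separate startup callback; neither function exists in exactly this form in the posted patch:

/* Assumed to exist: the worker-specific startup callback discussed above. */
static void intorel_startup_worker(DestReceiver *self, int operation,
								   TupleDesc typeinfo);

DestReceiver *
CreateIntoRelParallelDestReceiver(IntoClause *intoClause)
{
	DestReceiver *dest = CreateIntoRelDestReceiver(intoClause);

	/* Only difference from the serial receiver: worker-specific startup. */
	dest->rStartup = intorel_startup_worker;

	return dest;
}

With this, only the call sites that set up parallel workers would need to change, as noted above.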
> > > > + /* Okay to parallelize inserts, so mark it. */ > > + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > > + ((DR_intorel *) dest)->is_parallel = true; > > > > + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > > + ((DR_intorel *) dest)->is_parallel = false; > > We need to know exactly what is the command in above place, to dereference > and mark is_parallel to true, because is_parallel is being added to the > respective structures, not to the generic _DestReceiver structure. So, in > future the above code becomes something like below: > > + /* Okay to parallelize inserts, so mark it. */ > + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > + ((DR_intorel *) dest)->is_parallel = true; > + else if (ins_cmd == PARALLEL_INSERT_CMD_REFRESH_MAT_VIEW) > + ((DR_transientrel *) dest)->is_parallel = true; > + else if (ins_cmd == PARALLEL_INSERT_CMD_COPY_TO) > + ((DR_copy *) dest)->is_parallel = true; > > In the below place, instead of new function, I think we can just have > something like if (fpes->ins_cmd_type != PARALLEL_INSERT_CMD_UNDEF) > > > Or > > > > + if (fpes->ins_cmd_type == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > > + pg_atomic_add_fetch_u64(&fpes->processed, > > + queryDesc->estate->es_processed); > > > > If you think the above code will extend the ins_cmd type check in the > future, the generic function may make sense. > > We can also change below to fpes->ins_cmd_type != > PARALLEL_INSERT_CMD_UNDEF. > > + if (fpes->ins_cmd_type == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > + receiver = ExecParallelGetInsReceiver(toc, fpes); > > If okay, I will modify it in the next version of the patch. Yes, that looks good to me. Best regards, houzj
On Tue, Jan 5, 2021 at 1:25 PM Luc Vlaming <luc@swarm64.com> wrote: > >>>> wrt v18-0002....patch: > >>>> > >>>> It looks like this introduces a state machine that goes like: > >>>> - starts at CTAS_PARALLEL_INS_UNDEF > >>>> - possibly moves to CTAS_PARALLEL_INS_SELECT > >>>> - CTAS_PARALLEL_INS_TUP_COST_CAN_IGN can be added > >>>> - if both were added at some stage, we can go to > >>>> CTAS_PARALLEL_INS_TUP_COST_IGNORED and ignore the costs > >>>> > >>>> what i'm wondering is why you opted to put logic around > >>>> generate_useful_gather_paths and in cost_gather when to me it seems more > >>>> logical to put it in create_gather_path? i'm probably missing something > >>>> there? > >>> > >>> IMO, The reason is we want to make sure we only ignore the cost when Gather is the top node. > >>> And it seems the generate_useful_gather_paths called in apply_scanjoin_target_to_paths is the right place which canonly create top node Gather. > >>> So we change the flag in apply_scanjoin_target_to_paths around generate_useful_gather_paths to identify the top node. > > > > Right. We wanted to ignore parallel tuple cost for only the upper Gather path. > > > >> I was wondering actually if we need the state machine. Reason is that as > >> AFAICS the code could be placed in create_gather_path, where you can > >> also check if it is a top gather node, whether the dest receiver is the > >> right type, etc? To me that seems like a nicer solution as its makes > >> that all logic that decides whether or not a parallel CTAS is valid is > >> in a single place instead of distributed over various places. > > > > IMO, we can't determine the fact that we are going to generate the top > > Gather path in create_gather_path. To decide on whether or not the top > > Gather path generation, I think it's not only required to check the > > root->query_level == 1 but we also need to rely on from where > > generate_useful_gather_paths gets called. For instance, for > > query_level 1, generate_useful_gather_paths gets called from 2 places > > in apply_scanjoin_target_to_paths. Likewise, create_gather_path also > > gets called from many places. IMO, the current way i.e. setting flag > > it in apply_scanjoin_target_to_paths and ignoring based on that in > > cost_gather seems safe. > > > > I may be wrong. Thoughts? > > So the way I understand it the requirements are: > - it needs to be the top-most gather > - it should not do anything with the rows after the gather node as this > would make the parallel inserts conceptually invalid. Right. > Right now we're trying to judge what might be added on-top that could > change the rows by inspecting all parts of the root object that would > cause anything to be added, and add a little statemachine to track the > state of that knowledge. To me this has the downside that the list in > HAS_PARENT_PATH_GENERATING_CLAUSE has to be exhaustive, and we need to > make sure it stays up-to-date, which could result in regressions if not > tracked carefully. Right. Any new clause that will be added which generates an upper path in grouping_planner after apply_scanjoin_target_to_paths also needs to be added to HAS_PARENT_PATH_GENERATING_CLAUSE. Otherwise, we might ignore the parallel tuple cost because of which the parallel plan may be chosen and we go for parallel inserts only when the top node is Gather. I don't think any new clause that will be added generates a new upper Gather node in grouping_planner after apply_scanjoin_target_to_paths. 
> Personally I would therefore go for a design which is safe in the sense > that regressions are not as easily introduced. IMHO that could be done > by inspecting the planned query afterwards, and then judging whether or > not the parallel inserts are actually the right thing to do. The 0001 patch does that. It doesn't have any influence on the planner for parallel tuple cost calculation, it just looks at the generated plan and decides on parallel inserts. Having said that, we might miss parallel plans even though we know that there will not be tuples transferred from workers to Gather. So, 0002 patch adds the code for influencing the planner for parallel tuple cost. > Another way to create more safety against regressions would be to add an > assert upon execution of the query that if we do parallel inserts that > only a subset of allowed nodes exists above the gather node. Yes, we already do this. Please have a look at SetParallelInsertState() in the 0002 patch. The idea is that in any case, if the planner ignored the tuple cost, but we later not allow parallel inserts either due to the upper node is not Gather or Gather with projections. The assertion fails. So, in case any new parent path generating clause is added (apart from the ones that are there in HAS_PARENT_PATH_GENERATING_CLAUSE) and we ignore the tuple cost, then this Assert will catch it. Currently, I couldn't find any assertion failures in my debug build with make check and make check world. + else + { + /* + * Upper Gather node has projections, so parallel insertions are not + * allowed. + */ + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) + ((DR_intorel *) dest)->is_parallel = false; + + gstate->dest = NULL; + + /* + * Before returning, ensure that we have not done wrong parallel tuple + * cost enforcement in the planner. Main reason for this assertion is + * to check if we enforced the planner to ignore the parallel tuple + * cost (with the intention of choosing parallel inserts) due to which + * the parallel plan may have been chosen, but we do not allow the + * parallel inserts now. + * + * If we have correctly ignored parallel tuple cost in the planner + * while creating Gather path, then this assertion failure should not + * occur. In case it occurs, that means the planner may have chosen + * this parallel plan because of our wrong enforcement. So let's try to + * catch that here. + */ + Assert(tuple_cost_opts && !(*tuple_cost_opts & + PARALLEL_INSERT_TUP_COST_IGNORED)); + } With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 7, 2021 at 5:12 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,
> For v20-0002-Tuple-Cost-Adjustment-for-Parallel-Inserts-in-CTAS.patch :
>
> workers to Gather node to 0. With this change, there are chances
> that the planner may choose the parallel plan.
>
> It would be nice if the scenarios where a parallel plan is not chosen are listed.
There are many reasons the planner may not choose a parallel plan for
the select part: for instance, temporary tables, parallel-unsafe
functions, foreign tables, parallelism GUCs that are not set properly,
and so on; see
https://www.postgresql.org/docs/devel/parallel-safety.html. I don't
think we should list all of those scenarios in the commit message.
Having said that, we have extensive comments in the code (especially in
the function SetParallelInsertState) about when parallel inserts are
chosen.
+ * Parallel insertions are possible only if the upper node is Gather.
*/
+ if (!IsA(gstate, GatherState))
return;
+ * Parallelize inserts only when the upper Gather node has no projections.
*/
+ if (!gstate->ps.ps_ProjInfo)
+ {
+ /* Okay to parallelize inserts, so mark it. */
+ if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS)
+ ((DR_intorel *) dest)->is_parallel = true;
+
+ /*
+ * For parallelizing inserts, we must send some information so that the
+ * workers can build their own dest receivers. For CTAS, this info is
+ * into clause, object id (to open the created table).
+ *
+ * Since the required information is available in the dest receiver,
+ * store a reference to it in the Gather state so that it will be used
+ * in ExecInitParallelPlan to pick the information.
+ */
+ gstate->dest = dest;
+ }
+ else
+ {
+ /*
+ * Upper Gather node has projections, so parallel insertions are not
+ * allowed.
+ */
> + if ((root->parse->parallelInsCmdTupleCostOpt &
> + PARALLEL_INSERT_SELECT_QUERY) &&
> + (root->parse->parallelInsCmdTupleCostOpt &
> + PARALLEL_INSERT_CAN_IGN_TUP_COST))
> + {
> + /* We are ignoring the parallel tuple cost, so mark it. */
> + root->parse->parallelInsCmdTupleCostOpt |=
> + PARALLEL_INSERT_TUP_COST_IGNORED;
>
> If I read the code correctly, when both PARALLEL_INSERT_SELECT_QUERY and PARALLEL_INSERT_CAN_IGN_TUP_COST are set, PARALLEL_INSERT_TUP_COST_IGNORED is implied.
>
> Maybe we don't need the PARALLEL_INSERT_TUP_COST_IGNORED enum - the setting (1) of the first two bits should suffice.
The way these flags work is as follows: before planning in CTAS, we
set PARALLEL_INSERT_SELECT_QUERY, before we go for generating upper
gather path we set PARALLEL_INSERT_CAN_IGN_TUP_COST, and when we
actually ignored the tuple cost in cost_gather we set
PARALLEL_INSERT_TUP_COST_IGNORED. There are chances that we set
PARALLEL_INSERT_CAN_IGN_TUP_COST before calling
generate_useful_gather_paths, and the function
generate_useful_gather_paths can return before reaching cost_gather,
see below snippets. So, we need the 3 flags.
void
generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool
override_rows)
{
ListCell *lc;
double rows;
double *rowsp = NULL;
List *useful_pathkeys_list = NIL;
Path *cheapest_partial_path = NULL;
/* If there are no partial paths, there's nothing to do here. */
if (rel->partial_pathlist == NIL)
return;
/* Should we override the rel's rowcount estimate? */
if (override_rows)
rowsp = &rows;
/* generate the regular gather (merge) paths */
generate_gather_paths(root, rel, override_rows);
void
generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
{
Path *cheapest_partial_path;
Path *simple_gather_path;
ListCell *lc;
double rows;
double *rowsp = NULL;
/* If there are no partial paths, there's nothing to do here. */
if (rel->partial_pathlist == NIL)
return;
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
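To restate the three-stage flow above in code form, the flags can be thought of as independent bits in parallelInsCmdTupleCostOpt, along these lines; the bit values here are illustrative, not necessarily what the patch uses:

/* Illustrative bit assignments for parallelInsCmdTupleCostOpt. */
#define PARALLEL_INSERT_SELECT_QUERY		0x01	/* set before planning the CTAS SELECT */
#define PARALLEL_INSERT_CAN_IGN_TUP_COST	0x02	/* set before generating the upper Gather path */
#define PARALLEL_INSERT_TUP_COST_IGNORED	0x04	/* set in cost_gather only if the cost was actually skipped */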
Hi,

Attaching v21 patch set, which has following changes:
1) 0001 - changed fpes->ins_cmd_type == PARALLEL_INSERT_CMD_CREATE_TABLE_AS to fpes->ins_cmd_type != PARALLEL_INSERT_CMD_UNDEF
2) 0002 - reworded the commit message.
3) 0003 - added cmin, xmin test case to one of the parallel insert cases to ensure leader and worker insert the tuples in the same xact and replaced memory usage output in numbers like 25kB to NkB to make the tests stable.
4) 0004 - updated one of the test output to be in NkB and made the assertion in SetParallelInsertState to be not under an if condition.

There's one open point [1] on selective skipping of error "cannot insert tuples in a parallel worker" in heap_prepare_insert(), thoughts are welcome.

Please consider the v21 patch set for further review.

[1] - https://www.postgresql.org/message-id/CALj2ACXmbka1P5pxOV2vU-Go3UPTtsPqZXE8nKW1mE49MQcZtw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
> Attaching v21 patch set, which has following changes:
> 1) 0001 - changed fpes->ins_cmd_type ==
> PARALLEL_INSERT_CMD_CREATE_TABLE_AS to fpes->ins_cmd_type !=
> PARALLEL_INSERT_CMD_UNDEF
> 2) 0002 - reworded the commit message.
> 3) 0003 - added cmin, xmin test case to one of the parallel insert cases
> to ensure leader and worker insert the tuples in the same xact and replaced
> memory usage output in numbers like 25kB to NkB to make the tests stable.
> 4) 0004 - updated one of the test output to be in NkB and made the assertion
> in SetParallelInsertState to be not under an if condition.
>
> There's one open point [1] on selective skipping of error "cannot insert
> tuples in a parallel worker" in heap_prepare_insert(), thoughts are
> welcome.
>
> Please consider the v21 patch set for further review.

Hi,

I took a look into the new patch and have some comments.

1.
+ /*
+ * Do not consider tuple cost in case of we intend to perform parallel
+ * inserts by workers. We would have turned on the ignore flag in
+ * apply_scanjoin_target_to_paths before generating Gather path for the
+ * upper level SELECT part of the query.
+ */
+ if ((root->parse->parallelInsCmdTupleCostOpt &
+ PARALLEL_INSERT_SELECT_QUERY) &&
+ (root->parse->parallelInsCmdTupleCostOpt &
+ PARALLEL_INSERT_CAN_IGN_TUP_COST))

Can we just check PARALLEL_INSERT_CAN_IGN_TUP_COST here? IMO, PARALLEL_INSERT_CAN_IGN_TUP_COST will be set only when PARALLEL_INSERT_SELECT_QUERY is set.

2.
+static void
+ParallelInsCmdEstimate(ParallelContext *pcxt, ParallelInsertCmdKind ins_cmd,
+ void *ins_info)
...
+ info = (ParallelInsertCTASInfo *) ins_info;
+ intoclause_str = nodeToString(info->intoclause);
+ intoclause_len = strlen(intoclause_str) + 1;

+static void
+SaveParallelInsCmdInfo(ParallelContext *pcxt, ParallelInsertCmdKind ins_cmd,
+ void *ins_info)
...
+ info = (ParallelInsertCTASInfo *)ins_info;
+ intoclause_str = nodeToString(info->intoclause);
+ intoclause_len = strlen(intoclause_str) + 1;
+ intoclause_space = shm_toc_allocate(pcxt->toc, intoclause_len);

I noticed the above code will call nodeToString and strlen twice, which seems unnecessary. Do you think it's better to store the results of nodeToString and strlen first and pass them when used?

3.
+ if (node->need_to_scan_locally || node->nworkers_launched == 0)
+ {
+ EState *estate = node->ps.state;
+ TupleTableSlot *outerTupleSlot;
+
+ for(;;)
+ {
+ /* Install our DSA area while executing the plan. */
+ estate->es_query_dsa =
+ node->pei ? node->pei->area : NULL;
...
+ node->ps.state->es_processed++;
+ }

How about using the variable estate, like 'estate->es_processed++;', instead of node->ps.state->es_processed++;?

Best regards,
houzj
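For comment 2 above, one way to do that would be to cache the serialized into clause, for example in the info struct; the two cached fields below are hypothetical and the remaining members are omitted:

typedef struct ParallelInsertCTASInfo
{
	IntoClause *intoclause;		/* as in the posted patch */
	char	   *intoclause_str;	/* cached nodeToString(intoclause) */
	int			intoclause_len;	/* cached strlen(intoclause_str) + 1 */
	/* ... other members as in the patch ... */
} ParallelInsertCTASInfo;

ParallelInsCmdEstimate and SaveParallelInsCmdInfo could then each reuse the cached values instead of serializing the clause twice.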
On Mon, Jan 11, 2021 at 6:37 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > Attaching v21 patch set, which has following changes: > > 1) 0001 - changed fpes->ins_cmd_type == > > PARALLEL_INSERT_CMD_CREATE_TABLE_AS to fpes->ins_cmd_type != > > PARALLEL_INSERT_CMD_UNDEF > > 2) 0002 - reworded the commit message. > > 3) 0003 - added cmin, xmin test case to one of the parallel insert cases > > to ensure leader and worker insert the tuples in the same xact and replaced > > memory usage output in numbers like 25kB to NkB to make the tests stable. > > 4) 0004 - updated one of the test output to be in NkB and made the assertion > > in SetParallelInsertState to be not under an if condition. > > > > There's one open point [1] on selective skipping of error "cannot insert > > tuples in a parallel worker" in heap_prepare_insert(), thoughts are > > welcome. > > > > Please consider the v21 patch set for further review. > > Hi, > > I took a look into the new patch and have some comments. Thanks. > 1. > + /* > + * Do not consider tuple cost in case of we intend to perform parallel > + * inserts by workers. We would have turned on the ignore flag in > + * apply_scanjoin_target_to_paths before generating Gather path for the > + * upper level SELECT part of the query. > + */ > + if ((root->parse->parallelInsCmdTupleCostOpt & > + PARALLEL_INSERT_SELECT_QUERY) && > + (root->parse->parallelInsCmdTupleCostOpt & > + PARALLEL_INSERT_CAN_IGN_TUP_COST)) > > Can we just check PARALLEL_INSERT_CAN_IGN_TUP_COST here ? > IMO, PARALLEL_INSERT_CAN_IGN_TUP_COST will be set only when PARALLEL_INSERT_SELECT_QUERY is set. +1. Changed. > 2. > +static void > +ParallelInsCmdEstimate(ParallelContext *pcxt, ParallelInsertCmdKind ins_cmd, > + void *ins_info) > ... > + info = (ParallelInsertCTASInfo *) ins_info; > + intoclause_str = nodeToString(info->intoclause); > + intoclause_len = strlen(intoclause_str) + 1; > > +static void > +SaveParallelInsCmdInfo(ParallelContext *pcxt, ParallelInsertCmdKind ins_cmd, > + void *ins_info) > ... > + info = (ParallelInsertCTASInfo *)ins_info; > + intoclause_str = nodeToString(info->intoclause); > + intoclause_len = strlen(intoclause_str) + 1; > + intoclause_space = shm_toc_allocate(pcxt->toc, intoclause_len); > > I noticed the above code will call nodeToString and strlen twice which seems unnecessary. > Do you think it's better to store the result of nodetostring and strlen first and pass them when used ? I wanted to keep the API generic, not do nodeToString, strlen outside and pass it to the APIs. I don't think it will add too much function call cost since it's run only in the leader. This way, the code and API looks more readable. Thoughts? > 3. > + if (node->need_to_scan_locally || node->nworkers_launched == 0) > + { > + EState *estate = node->ps.state; > + TupleTableSlot *outerTupleSlot; > + > + for(;;) > + { > + /* Install our DSA area while executing the plan. */ > + estate->es_query_dsa = > + node->pei ? node->pei->area : NULL; > ... > + node->ps.state->es_processed++; > + } > > How about use the variables estate like 'estate-> es_processed++;' > Instead of node->ps.state->es_processed++; +1. Changed. Attaching v22 patch set with changes only in 0001 and 0002. Please consider it for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
On 06-01-2021 09:32, Bharath Rupireddy wrote: > On Tue, Jan 5, 2021 at 1:25 PM Luc Vlaming <luc@swarm64.com> wrote: >>>>>> wrt v18-0002....patch: >>>>>> >>>>>> It looks like this introduces a state machine that goes like: >>>>>> - starts at CTAS_PARALLEL_INS_UNDEF >>>>>> - possibly moves to CTAS_PARALLEL_INS_SELECT >>>>>> - CTAS_PARALLEL_INS_TUP_COST_CAN_IGN can be added >>>>>> - if both were added at some stage, we can go to >>>>>> CTAS_PARALLEL_INS_TUP_COST_IGNORED and ignore the costs >>>>>> >>>>>> what i'm wondering is why you opted to put logic around >>>>>> generate_useful_gather_paths and in cost_gather when to me it seems more >>>>>> logical to put it in create_gather_path? i'm probably missing something >>>>>> there? >>>>> >>>>> IMO, The reason is we want to make sure we only ignore the cost when Gather is the top node. >>>>> And it seems the generate_useful_gather_paths called in apply_scanjoin_target_to_paths is the right place which canonly create top node Gather. >>>>> So we change the flag in apply_scanjoin_target_to_paths around generate_useful_gather_paths to identify the top node. >>> >>> Right. We wanted to ignore parallel tuple cost for only the upper Gather path. >>> >>>> I was wondering actually if we need the state machine. Reason is that as >>>> AFAICS the code could be placed in create_gather_path, where you can >>>> also check if it is a top gather node, whether the dest receiver is the >>>> right type, etc? To me that seems like a nicer solution as its makes >>>> that all logic that decides whether or not a parallel CTAS is valid is >>>> in a single place instead of distributed over various places. >>> >>> IMO, we can't determine the fact that we are going to generate the top >>> Gather path in create_gather_path. To decide on whether or not the top >>> Gather path generation, I think it's not only required to check the >>> root->query_level == 1 but we also need to rely on from where >>> generate_useful_gather_paths gets called. For instance, for >>> query_level 1, generate_useful_gather_paths gets called from 2 places >>> in apply_scanjoin_target_to_paths. Likewise, create_gather_path also >>> gets called from many places. IMO, the current way i.e. setting flag >>> it in apply_scanjoin_target_to_paths and ignoring based on that in >>> cost_gather seems safe. >>> >>> I may be wrong. Thoughts? >> >> So the way I understand it the requirements are: >> - it needs to be the top-most gather >> - it should not do anything with the rows after the gather node as this >> would make the parallel inserts conceptually invalid. > > Right. > >> Right now we're trying to judge what might be added on-top that could >> change the rows by inspecting all parts of the root object that would >> cause anything to be added, and add a little statemachine to track the >> state of that knowledge. To me this has the downside that the list in >> HAS_PARENT_PATH_GENERATING_CLAUSE has to be exhaustive, and we need to >> make sure it stays up-to-date, which could result in regressions if not >> tracked carefully. > > Right. Any new clause that will be added which generates an upper path > in grouping_planner after apply_scanjoin_target_to_paths also needs to > be added to HAS_PARENT_PATH_GENERATING_CLAUSE. Otherwise, we might > ignore the parallel tuple cost because of which the parallel plan may > be chosen and we go for parallel inserts only when the top node is > Gather. 
I don't think any new clause that will be added generates a > new upper Gather node in grouping_planner after > apply_scanjoin_target_to_paths. > >> Personally I would therefore go for a design which is safe in the sense >> that regressions are not as easily introduced. IMHO that could be done >> by inspecting the planned query afterwards, and then judging whether or >> not the parallel inserts are actually the right thing to do. > > The 0001 patch does that. It doesn't have any influence on the planner > for parallel tuple cost calculation, it just looks at the generated > plan and decides on parallel inserts. Having said that, we might miss > parallel plans even though we know that there will not be tuples > transferred from workers to Gather. So, 0002 patch adds the code for > influencing the planner for parallel tuple cost. > Ok. Thanks for the explanation and sorry for the confusion. >> Another way to create more safety against regressions would be to add an >> assert upon execution of the query that if we do parallel inserts that >> only a subset of allowed nodes exists above the gather node. > > Yes, we already do this. Please have a look at > SetParallelInsertState() in the 0002 patch. The idea is that in any > case, if the planner ignored the tuple cost, but we later not allow > parallel inserts either due to the upper node is not Gather or Gather > with projections. The assertion fails. So, in case any new parent path > generating clause is added (apart from the ones that are there in > HAS_PARENT_PATH_GENERATING_CLAUSE) and we ignore the tuple cost, then > this Assert will catch it. Currently, I couldn't find any assertion > failures in my debug build with make check and make check world. > Ok. Seems I missed that assert when reviewing. > + else > + { > + /* > + * Upper Gather node has projections, so parallel insertions are not > + * allowed. > + */ > + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > + ((DR_intorel *) dest)->is_parallel = false; > + > + gstate->dest = NULL; > + > + /* > + * Before returning, ensure that we have not done wrong parallel tuple > + * cost enforcement in the planner. Main reason for this assertion is > + * to check if we enforced the planner to ignore the parallel tuple > + * cost (with the intention of choosing parallel inserts) due to which > + * the parallel plan may have been chosen, but we do not allow the > + * parallel inserts now. > + * > + * If we have correctly ignored parallel tuple cost in the planner > + * while creating Gather path, then this assertion failure should not > + * occur. In case it occurs, that means the planner may have chosen > + * this parallel plan because of our wrong enforcement. So let's try to > + * catch that here. > + */ > + Assert(tuple_cost_opts && !(*tuple_cost_opts & > + PARALLEL_INSERT_TUP_COST_IGNORED)); > + } > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com > Kind regards, Luc
On Mon, Jan 11, 2021 at 8:51 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Attaching v22 patch set with changes only in 0001 and 0002. Please
> consider it for further review.

It seems the v22 patch set was failing in cfbot for one of the unstable test cases. Attaching the v23 patch set with modifications in the 0003 and 0004 patches; no changes to the 0001 and 0002 patches. Hopefully cfbot will be happy with v23. Please consider v23 for further review.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi Bharath,

I'm trying to take some performance measurements on your patch v23. But when I started, I found an issue with unbalanced tuple distribution among workers (99% of tuples read by one worker) under a specific case, which makes the "parallel select" part give no performance gain. Then I found it's not introduced by your patch, because it's also happening in master (HEAD). But I don't know how to deal with it, so I put it here to see if anybody knows what's going wrong with this or has good ideas to deal with this issue.

Here are the conditions to produce the issue:
1. high CPU spec environment (say above 20 processors). On smaller CPUs, it also happens but is not so obvious (40% of tuples on one worker in my tests).
2. query plan is "serial insert + parallel select"; I have reproduced this behavior in CTAS, SELECT INTO, and INSERT INTO SELECT.
3. select part needs to query a large data size (e.g. query 100 million from 200 million).

Given the above, IMHO, I guess it may be caused by the leader's write rate not being able to catch up with the workers' read rate, so the tuples of one worker get blocked in the queue and become more and more.

Below is my test info:
1. test spec environment
CentOS 8.2, 128G RAM, 40 processors, disk SAS

2. test data prepare
create table x(a int, b int, c int);
create index on x(a);
insert into x select generate_series(1,200000000),floor(random()*(10001-1)+1),floor(random()*(10001-1)+1);

3. test execute results
*Patched CTAS*: please look at worker 2, 99% of tuples read by it.
explain analyze verbose create table test(a,b,c) as select a,floor(random()*(10001-1)+1),c from x where b%2=0;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..1942082.77 rows=1000001 width=16) (actual time=0.203..24023.686 rows=100006268 loops=1)
Output: a, floor(((random() * '10000'::double precision) + '1'::double precision)), c
Workers Planned: 4
Workers Launched: 4
-> Parallel Seq Scan on public.x (cost=0.00..1831082.66 rows=250000 width=8) (actual time=0.016..4367.035 rows=20001254 loops=5)
Output: a, c
Filter: ((x.b % 2) = 0)
Rows Removed by Filter: 19998746
Worker 0: actual time=0.016..19.265 rows=94592 loops=1
Worker 1: actual time=0.027..31.422 rows=94574 loops=1
Worker 2: actual time=0.014..21744.549 rows=99627749 loops=1
Worker 3: actual time=0.015..19.347 rows=94586 loops=1
Planning Time: 0.098 ms
Execution Time: 91054.828 ms

*Non-patched CTAS*: please look at worker 0, also 99% of tuples read by it.
explain analyze verbose create table test(a,b,c) as select a,floor(random()*(10001-1)+1),c from x where b%2=0;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..1942082.77 rows=1000001 width=16) (actual time=0.283..19216.157 rows=100003148 loops=1)
Output: a, floor(((random() * '10000'::double precision) + '1'::double precision)), c
Workers Planned: 4
Workers Launched: 4
-> Parallel Seq Scan on public.x (cost=0.00..1831082.66 rows=250000 width=8) (actual time=0.020..4380.360 rows=20000630 loops=5)
Output: a, c
Filter: ((x.b % 2) = 0)
Rows Removed by Filter: 19999370
Worker 0: actual time=0.013..21805.647 rows=99624833 loops=1
Worker 1: actual time=0.016..19.790 rows=94398 loops=1
Worker 2: actual time=0.013..35.340 rows=94423 loops=1
Worker 3: actual time=0.035..19.849 rows=94679 loops=1
Planning Time: 0.083 ms
Execution Time: 91151.097 ms

I'm still working on the performance tests on your patch; if I make some progress, I will post my results here.

Regards,
Tang
>
> Hi Bharath,
>
> I'm trying to take some performance measurements on you patch v23.
> But when I started, I found an issue about the tuples unbalance distribution among workers(99% tuples read by one worker) under specified case which lead the "parallel select" part makes no performance gain.
> Then I find it's not introduced by your patch, because it's also happening in master(HEAD). But I don't know how to deal with it , so I put it here to see if anybody know what's going wrong with this or have good ideas to deal this issue.
>
> Here are the conditions to produce the issue:
> 1. high CPU spec environment(say above 20 processors). In smaller CPU, it also happen but not so obvious(40% tuples on one worker in my tests).
> 2. query plan is "serial insert + parallel select", I have reproduce this behavior in (CTAS, Select into, insert into select).
> 3. select part needs to query large data size(e.g. query 100 million from 200 million).
>
> According to above, IMHO, I guess it may be caused by the leader write rate can't catch the worker read rate, then the tuples of one worker blocked in the queue, become more and more.
>
> Below is my test info:
> 1. test spec environment
> CentOS 8.2, 128G RAM, 40 processors, disk SAS
>
> 2. test data prepare
> create table x(a int, b int, c int);
> create index on x(a);
> insert into x select generate_series(1,200000000),floor(random()*(10001-1)+1),floor(random()*(10001-1)+1);
>
> 3. test execute results
> *Patched CTAS*: please look at worker 2, 99% tuples read by it.
> explain analyze verbose create table test(a,b,c) as select a,floor(random()*(10001-1)+1),c from x where b%2=0;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------
> Gather (cost=1000.00..1942082.77 rows=1000001 width=16) (actual time=0.203..24023.686 rows=100006268 loops=1)
> Output: a, floor(((random() * '10000'::double precision) + '1'::double precision)), c
> Workers Planned: 4
> Workers Launched: 4
> -> Parallel Seq Scan on public.x (cost=0.00..1831082.66 rows=250000 width=8) (actual time=0.016..4367.035 rows=20001254 loops=5)
> Output: a, c
> Filter: ((x.b % 2) = 0)
> Rows Removed by Filter: 19998746
> Worker 0: actual time=0.016..19.265 rows=94592 loops=1
> Worker 1: actual time=0.027..31.422 rows=94574 loops=1
> Worker 2: actual time=0.014..21744.549 rows=99627749 loops=1
> Worker 3: actual time=0.015..19.347 rows=94586 loops=1 Planning Time: 0.098 ms Execution Time: 91054.828 ms
>
> *Non-patched CTAS*: please look at worker 0, also 99% tuples read by it.
> explain analyze verbose create table test(a,b,c) as select a,floor(random()*(10001-1)+1),c from x where b%2=0;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------
> Gather (cost=1000.00..1942082.77 rows=1000001 width=16) (actual time=0.283..19216.157 rows=100003148 loops=1)
> Output: a, floor(((random() * '10000'::double precision) + '1'::double precision)), c
> Workers Planned: 4
> Workers Launched: 4
> -> Parallel Seq Scan on public.x (cost=0.00..1831082.66 rows=250000 width=8) (actual time=0.020..4380.360 rows=20000630 loops=5)
> Output: a, c
> Filter: ((x.b % 2) = 0)
> Rows Removed by Filter: 19999370
> Worker 0: actual time=0.013..21805.647 rows=99624833 loops=1
> Worker 1: actual time=0.016..19.790 rows=94398 loops=1
> Worker 2: actual time=0.013..35.340 rows=94423 loops=1
> Worker 3: actual time=0.035..19.849 rows=94679 loops=1
> Planning Time: 0.083 ms
> Execution Time: 91151.097 ms
>
> I'm still working on the performance tests on your patch, if I make some progress, I will post my results here.
Thanks a lot for the tests. In your test case, parallel insertions are not being picked because the Gather node has some projections (floor(((random() * '10000'::double precision) + '1'::double precision))) to perform. That's expected. Whenever parallel insertions are chosen for CTAS, we should see a "Create target_table" node under the Gather node [1], and the actual row count for the Gather node should be 0 (but in your test it is rows=100006268) in the explain analyze output. Coming to your test case, if it's modified to something like [1], where the Gather node has no projections, then parallel insertions will be chosen (a sketch of such a statement follows the plan below).
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..3846.71 rows=1000 width=12) (actual time=5581.308..5581.379 rows=0 loops=1)
Output: a, b, c
Workers Planned: 1
Workers Launched: 1
-> Create test
-> Parallel Seq Scan on public.x (cost=0.00..2846.71 rows=588 width=12) (actual time=0.014..29.512 rows=50023 loops=2)
Output: a, b, c
Filter: ((x.b % 2) = 0)
Rows Removed by Filter: 49977
Worker 0: actual time=0.015..29.751 rows=49419 loops=1
Planning Time: 1574.584 ms
Execution Time: 6437.562 ms
(12 rows)
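For reference, a statement of roughly the following shape produces a plan like [1]. This is only a sketch, assuming a table x like the one in your setup; the point is that the SELECT list carries the columns through unchanged, so the Gather node has no projections:

explain analyze verbose create table test as select a, b, c from x where b % 2 = 0;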
> Thanks a lot for the tests. In your test case, parallel insertions are not being picked because the Gather node has
> some projections (floor(((random() * '10000'::double precision) + '1'::double precision))) to perform. That's expected.
> Whenever parallel insertions are chosen for CTAS, we should see a "Create target_table" node under the Gather node [1], and
> the actual row count for the Gather node should be 0 (but in your test it is rows=100006268) in the explain analyze output.
> Coming to your test case, if it's modified to something like [1], where the Gather node has no projections,
> then parallel insertions will be chosen.
Thanks for your explanation and test.
Actually, I deliberately made my test case (with projection) pick serial insert so that the unbalanced tuple distribution (99% of the tuples read by one worker) happens.
This issue will lead to a performance regression.
But it's not introduced by your patch, it's happening on master (HEAD).
Do you have any thoughts about this?
>[1] - I did this test on my development system, I will run on some performance system and post my observations.
Thank you, it will be very kind of you to do this.
To reproduce the above issue, you need to use my case (with projection), because it won't occur with "parallel insert".
Regards,
Tang
Hi Bharath,
I chose 5 cases which pick a parallel insert plan in CTAS to measure the patched performance. Each case was run 30 times.
Most of the test executions become faster with this patch.
However, Test NO 4 ("create table xxx as table xxx", sketched after the results table below) shows a performance degradation. I tested various table sizes (2/10/20 million rows); they all show a 6%-10% decline. I think this problem needs some investigation.
Below are my test results. 'Test NO' corresponds to 'Test NO' in the attached test_ctas.sql file.
reg%=(patched-master)/master
Test NO | Test Case |reg% | patched(ms) | master(ms)
--------|--------------------------------|------|--------------|-------------
1 | CTAS select from table | -9% | 16709.50477 | 18370.76660
2 | Append plan | -14% | 16542.97807 | 19305.86600
3 | initial plan under Gather node| -5% | 13374.27187 | 14120.02633
4 | CTAS table | 10% | 20835.48800 | 18986.40350
5 | CTAS select from execute | -6% | 16973.73890 | 18008.59789
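For clarity, the "CTAS table" form in Test NO 4 refers to the CREATE TABLE ... AS TABLE variant; a minimal sketch, using the tenk1 table that appears in the plans below, would be:

create table test as table tenk1;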
About Test NO 4:
In master (HEAD), this test case picks a serial seq scan.
The query plan looks like:
----------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.tenk1 (cost=0.00..444828.12 rows=10000012 width=244) (actual time=0.005..1675.268 rows=10000000 loops=1)
Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
Planning Time: 0.053 ms
Execution Time: 20165.023 ms
With this patch, it will choose a parallel seq scan and parallel insert.
The query plan looks like:
----------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..370828.03 rows=10000012 width=244) (actual time=20428.823..20437.143 rows=0 loops=1)
Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
Workers Planned: 4
Workers Launched: 4
-> Create test
-> Parallel Seq Scan on public.tenk1 (cost=0.00..369828.03 rows=2500003 width=244) (actual time=0.021..411.094 rows=2000000 loops=5)
Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
Worker 0: actual time=0.023..390.856 rows=1858407 loops=1
Worker 1: actual time=0.024..468.587 rows=2264494 loops=1
Worker 2: actual time=0.023..473.170 rows=2286580 loops=1
Worker 3: actual time=0.027..373.727 rows=1853216 loops=1
Planning Time: 0.053 ms
Execution Time: 20437.643 ms
test machine spec:
CentOS 8.2, 128G RAM, 40 processors, disk SAS
Regards,
Tang
Attachment
On Wed, Jan 27, 2021 at 1:25 PM Tang, Haiying <tanghy.fnst@cn.fujitsu.com> wrote: > I choose 5 cases which pick parallel insert plan in CTAS to measure the patched performance. Each case run 30 times. > > Most of the tests execution become faster with this patch. > > However, Test NO 4(create table xxx as table xxx.) appears performance degradation. I tested various table size(2/10/20millions), they all have a 6%-10% declines. I think it may need some check at this problem. > > > > Below are my test results. 'Test NO' is corresponded to 'Test NO' in attached test_ctas.sql file. > > reg%=(patched-master)/master > > Test NO | Test Case |reg% | patched(ms) | master(ms) > > --------|--------------------------------|------|--------------|------------- > > 1 | CTAS select from table | -9% | 16709.50477 | 18370.76660 > > 2 | Append plan | -14% | 16542.97807 | 19305.86600 > > 3 | initial plan under Gather node| -5% | 13374.27187 | 14120.02633 > > 4 | CTAS table | 10% | 20835.48800 | 18986.40350 > > 5 | CTAS select from execute | -6% | 16973.73890 | 18008.59789 > > > > About Test NO 4: > > In master(HEAD), this test case picks serial seq scan. > > query plan likes: > > ---------------------------------------------------------------------------------------------------------------------------------------------------------- > > Seq Scan on public.tenk1 (cost=0.00..444828.12 rows=10000012 width=244) (actual time=0.005..1675.268 rows=10000000 loops=1) > > Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1,stringu2, string4 Planning Time: 0.053 ms Execution Time: 20165.023 ms > > > > With this patch, it will choose parallel seq scan and parallel insert. > > query plan likes: > > ---------------------------------------------------------------------------------------------------------------------------------------------------------- > > Gather (cost=1000.00..370828.03 rows=10000012 width=244) (actual time=20428.823..20437.143 rows=0 loops=1) > > Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1,stringu2, string4 > > Workers Planned: 4 > > Workers Launched: 4 > > -> Create test > > -> Parallel Seq Scan on public.tenk1 (cost=0.00..369828.03 rows=2500003 width=244) (actual time=0.021..411.094 rows=2000000loops=5) > > Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even,stringu1, stringu2, string4 > > Worker 0: actual time=0.023..390.856 rows=1858407 loops=1 > > Worker 1: actual time=0.024..468.587 rows=2264494 loops=1 > > Worker 2: actual time=0.023..473.170 rows=2286580 loops=1 > > Worker 3: actual time=0.027..373.727 rows=1853216 loops=1 Planning Time: 0.053 ms Execution Time: 20437.643ms > > > > test machine spec: > > CentOS 8.2, 128G RAM, 40 processors, disk SAS Thanks a lot for the performance tests and test cases. I will analyze why the performance is degrading one case and respond soon. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 27, 2021 at 1:47 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Jan 27, 2021 at 1:25 PM Tang, Haiying > <tanghy.fnst@cn.fujitsu.com> wrote: > > I choose 5 cases which pick parallel insert plan in CTAS to measure the patched performance. Each case run 30 times. > > > > Most of the tests execution become faster with this patch. > > > > However, Test NO 4(create table xxx as table xxx.) appears performance degradation. I tested various table size(2/10/20millions), they all have a 6%-10% declines. I think it may need some check at this problem. > > > > > > > > Below are my test results. 'Test NO' is corresponded to 'Test NO' in attached test_ctas.sql file. > > > > reg%=(patched-master)/master > > > > Test NO | Test Case |reg% | patched(ms) | master(ms) > > > > --------|--------------------------------|------|--------------|------------- > > > > 1 | CTAS select from table | -9% | 16709.50477 | 18370.76660 > > > > 2 | Append plan | -14% | 16542.97807 | 19305.86600 > > > > 3 | initial plan under Gather node| -5% | 13374.27187 | 14120.02633 > > > > 4 | CTAS table | 10% | 20835.48800 | 18986.40350 > > > > 5 | CTAS select from execute | -6% | 16973.73890 | 18008.59789 > > > > > > > > About Test NO 4: > > > > In master(HEAD), this test case picks serial seq scan. > > > > query plan likes: > > > > ---------------------------------------------------------------------------------------------------------------------------------------------------------- > > > > Seq Scan on public.tenk1 (cost=0.00..444828.12 rows=10000012 width=244) (actual time=0.005..1675.268 rows=10000000 loops=1) > > > > Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even,stringu1, stringu2, string4 Planning Time: 0.053 ms Execution Time: 20165.023 ms > > > > > > > > With this patch, it will choose parallel seq scan and parallel insert. > > > > query plan likes: > > > > ---------------------------------------------------------------------------------------------------------------------------------------------------------- > > > > Gather (cost=1000.00..370828.03 rows=10000012 width=244) (actual time=20428.823..20437.143 rows=0 loops=1) > > > > Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even,stringu1, stringu2, string4 > > > > Workers Planned: 4 > > > > Workers Launched: 4 > > > > -> Create test > > > > -> Parallel Seq Scan on public.tenk1 (cost=0.00..369828.03 rows=2500003 width=244) (actual time=0.021..411.094 rows=2000000loops=5) > > > > Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd,even, stringu1, stringu2, string4 > > > > Worker 0: actual time=0.023..390.856 rows=1858407 loops=1 > > > > Worker 1: actual time=0.024..468.587 rows=2264494 loops=1 > > > > Worker 2: actual time=0.023..473.170 rows=2286580 loops=1 > > > > Worker 3: actual time=0.027..373.727 rows=1853216 loops=1 Planning Time: 0.053 ms Execution Time: 20437.643ms > > > > > > > > test machine spec: > > > > CentOS 8.2, 128G RAM, 40 processors, disk SAS > > Thanks a lot for the performance tests and test cases. I will analyze > why the performance is degrading one case and respond soon. I analyzed performance of parallel inserts in CTAS for different cases with tuple size 32bytes, 59bytes, 241bytes and 1064bytes. We could gain if the tuple sizes are lower. But if the tuple size is larger i..e 1064bytes, there's a regression with parallel inserts. 
Upon further analysis, it turned out that the parallel workers require frequent extra-block addition while concurrently extending the relation (in RelationAddExtraBlocks), and the majority of the time is spent flushing those new empty pages/blocks onto the disk. I saw no regression when I increased (for testing purposes) the rate at which the extra blocks are added in RelationAddExtraBlocks to extraBlocks = Min(1024, lockWaiters * 512); (currently it is extraBlocks = Min(512, lockWaiters * 20);). Increasing the extra-block addition rate is not a practical solution to this problem though.

In an offlist discussion with Robert and Dilip, using fallocate to extend the relation came up as something that may help to extend the relation faster. In this regard, it looks like the AIO/DIO patch set of Andres [1], which involves using fallocate() to extend files, will surely be helpful. Until then, we honestly feel that the parallel inserts in CTAS patch set should be put on hold and revived later.

[1] - https://www.postgresql.org/message-id/flat/20210223100344.llw5an2aklengrmn%40alap3.anarazel.de

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> I analyzed performance of parallel inserts in CTAS for different cases
> with tuple size 32bytes, 59bytes, 241bytes and 1064bytes. We could
> gain if the tuple sizes are lower. But if the tuple size is larger
> i.e. 1064bytes, there's a regression with parallel inserts.

Thanks for the update. BTW, maybe you have some more test cases that can reproduce this regression easily. Can you please share some of the test cases (with big tuple size) with me?

Regards,
Tang
On Fri, Mar 19, 2021 at 12:45 PM tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote: > > From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> > >I analyzed performance of parallel inserts in CTAS for different cases > >with tuple size 32bytes, 59bytes, 241bytes and 1064bytes. We could > >gain if the tuple sizes are lower. But if the tuple size is larger > >i..e 1064bytes, there's a regression with parallel inserts. > > Thanks for the update. > BTW, May be you have some more testcases that can reproduce this regression easily. > Can you please share some of the testcase (with big tuple size) with me. They are pretty simple though. I think someone can also check if the same regression exists for parallel inserts in "INSERT INTO SELECT" patch set as well for larger tuple sizes. [1] DROP TABLE tenk1; CREATE UNLOGGED TABLE tenk1(c1 int, c2 int); INSERT INTO tenk1 values(generate_series(1,100000000), generate_series(1,100000000)); explain analyze verbose create table test as select * from tenk1; DROP TABLE tenk1; CREATE UNLOGGED TABLE tenk1(c1 int, c2 int, c3 varchar(8), c4 varchar(8), c5 varchar(8)); INSERT INTO tenk1 values(generate_series(1,100000000), generate_series(1,100000000), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8))); explain analyze verbose create table test as select * from tenk1; DROP TABLE tenk1; CREATE UNLOGGED TABLE tenk1(c1 bigint, c2 bigint, c3 name, c4 name, c5 name, c6 varchar(8)); INSERT INTO tenk1 values(generate_series(1,100000000), generate_series(1,100000000), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8))); explain analyze verbose create table test as select * from tenk1; DROP TABLE tenk1; CREATE UNLOGGED TABLE tenk1(c1 bigint, c2 bigint, c3 name, c4 name, c5 name, c6 name, c7 name, c8 name, c9 name, c10 name, c11 name, c12 name, c13 name, c14 name, c15 name, c16 name, c17 name, c18 name); INSERT INTO tenk1 values(generate_series(1,10000000), generate_series(1,10000000), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8))); explain analyze verbose create unlogged table test as select * from tenk1; With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
> They are pretty simple though. I think someone can also check if the same
> regression exists for parallel inserts in "INSERT INTO SELECT"
> patch set as well for larger tuple sizes.

Thanks for the reminder. I did some performance tests for parallel inserts in "INSERT INTO SELECT" with the test cases you provided; the regression does not seem to exist in "INSERT INTO SELECT". I will try to test with larger tuple sizes later.

Best regards,
houzj
> > They are pretty simple though. I think someone can also check if the
> > same regression exists for parallel inserts in "INSERT INTO SELECT"
> > patch set as well for larger tuple sizes.
>
> Thanks for the reminder.
> I did some performance tests for parallel inserts in "INSERT INTO SELECT" with
> the test cases you provided; the regression does not seem to exist in
> "INSERT INTO SELECT".

I forgot to share the test results with Parallel CTAS. I tested with the SQL: explain analyze verbose create table test as select * from tenk1;

> CREATE UNLOGGED TABLE tenk1(c1 int, c2 int);
> CREATE UNLOGGED TABLE tenk1(c1 int, c2 int, c3 varchar(8), c4 varchar(8), c5 varchar(8));
> CREATE UNLOGGED TABLE tenk1(c1 bigint, c2 bigint, c3 name, c4 name, c5 name, c6 varchar(8));

I did not see a regression in these cases (low tuple size).

> CREATE UNLOGGED TABLE tenk1(c1 bigint, c2 bigint, c3 name, c4 name, c5 name, c6 name, c7 name, c8 name, c9 name, c10 name, c11 name, c12 name, c13 name, c14 name,
> c15 name, c16 name, c17 name, c18 name);

I can see the degradation in this case. The average test results of CTAS are:
Serial CTAS ----- Execution Time: 80892.240 ms
Parallel CTAS ----- Execution Time: 85725.591 ms
About 6% degradation.

I also tested with the Parallel INSERT patch in this case. (Note: to keep it consistent, I create a new target table (test) before inserting.) The average test results of Parallel INSERT are:
Serial Parallel INSERT ----- Execution Time: 90075.501 ms
Parallel Parallel INSERT ----- Execution Time: 85812.202 ms
No degradation.

Best regards,
houzj
On Fri, Mar 19, 2021 at 4:33 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > In an offlist discussion with Robert and Dilip, using fallocate to > extend the relation may help to extend the relation faster. In regards > to this, it looks like the AIO/DIO patch set of Andres [1] which > involves using fallocate() to extend files will surely be helpful. > Until then, we honestly feel that the parallel inserts in CTAS patch > set be put on hold and revive it later. > Hi, I had partially reviewed some of the patches (first scan) when I was alerted to your post and intention to put the patch on hold. I thought I'd just post the comments I have so far, and you can look at them at a later time when/if you revive the patch. Patch 0001 1) Patch comment Leader inserts its share of tuples if instructed to do, and so are workers should be: Leader inserts its share of tuples if instructed to, and so do the workers. 2) void SetParallelInsertState(ParallelInsertCmdKind ins_cmd, QueryDesc *queryDesc) { GatherState *gstate; DestReceiver *dest; Assert(queryDesc && (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS)); gstate = (GatherState *) queryDesc->planstate; dest = queryDesc->dest; /* * Parallel insertions are not possible either if the upper node is not * Gather or it's a Gather but it have some projections to perform. */ if (!IsA(gstate, GatherState) || gstate->ps.ps_ProjInfo) return; I think it would look better for code to be: dest = queryDesc->dest; /* * Parallel insertions are not possible either if the upper node is not * Gather or it's a Gather but it have some projections to perform. */ if (!IsA(queryDesc->planstate, GatherState) || queryDesc->planstate.ps_ProjInfo) return; gstate = (GatherState *) queryDesc->planstate; 3) src/backend/executor/execParallel.c + pg_atomic_uint64 processed; I am wondering, when there is contention from multiple workers in writing back their processed count, how well does this work? Any performance issues? For the Parallel INSERT patch (which has not yet been committed) it currently uses an array of processed counts for the workers (since # of workers is capped) so there is never any contention related to this. 4) src/backend/executor/execParallel.c You shouldn't use intermingled declarations and code. https://www.postgresql.org/docs/13/source-conventions.html Best to move the uninitialized variable declaration to the top of the block: ParallelInsertCTASInfo *info = NULL; char *intoclause_str = NULL; int intoclause_len; char *intoclause_space = NULL; should be: int intoclause_len; ParallelInsertCTASInfo *info = NULL; char *intoclause_str = NULL; char *intoclause_space = NULL; 5) ExecParallelGetInsReceiver Would look better to have: DR_intorel *receiver; receiver = (DR_intorel *)CreateIntoRelDestReceiver(intoclause); receiver->is_parallel_worker = true; receiver->object_id = fpes->objectid; 6) GetParallelInsertCmdType I think the following would be better: ParallelInsertCmdKind GetParallelInsertCmdType(DestReceiver *dest) { if (dest && dest->mydest == DestIntoRel && ((DR_intorel *) dest)->is_parallel) return PARALLEL_INSERT_CMD_CREATE_TABLE_AS; return PARALLEL_INSERT_CMD_UNDEF; } 7) IsParallelInsertAllowed In the following code: /* Below check may hit in case this function is called from explain.c. */ if (!(into && IsA(into, IntoClause))) return false; If "into" is non-NULL, isn't it guaranteed to point at an IntoClause? I think the code can just be: /* Below check may hit in case this function is called from explain.c. 
*/ if (!into) return false; 8) ExecGather The comments and variable name are likely to cause confusion when the parallel INSERT statement is implemented. Suggest minor change: change: bool perform_parallel_ins = false; to: bool perform_parallel_ins_no_readers = false; change: /* * Do not create tuple queue readers for commands with parallel * insertion. Because the gather node will not receive any * tuples, the workers will insert the tuples into the target * relation. */ to: /* * Do not create tuple queue readers for commands with parallel * insertion that don't additionally return tuples. In this case, * the workers will only insert the tuples into the target * relation and the gather node will not receive any tuples. */ I think some changes in other areas are needed for the same reasons. Patch 0002 1) I noticed that "rows" is not zero (and so is not displayed as 0 in the EXPLAIN output for Gather) for the Gather node when parallel inserts will be used. This doesn't seem to be right. I think that if PARALLEL_INSERT_CAN_IGN_TUP_COST is set, path->rows should be set to 0, and just let existing "run_cost" be evaluated as normal (which will be 0 as path->rows is 0). 2) Is PARALLEL_INSERT_TUP_COST_IGNORED actually needed? Couldn't only PARALLEL_INSERT_CAN_IGN_TUP_COST be used for the purpose of ignoring parallel tuple cost? Regards, Greg Nancarrow Fujitsu Australia
On Fri, Mar 19, 2021 at 11:02 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Jan 27, 2021 at 1:47 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
>
> I analyzed performance of parallel inserts in CTAS for different cases
> with tuple size 32bytes, 59bytes, 241bytes and 1064bytes. We could
> gain if the tuple sizes are lower. But if the tuple size is larger
> i..e 1064bytes, there's a regression with parallel inserts. Upon
> further analysis, it turned out that the parallel workers are
> requiring frequent extra blocks addition while concurrently extending
> the relation(in RelationAddExtraBlocks) and the majority of the time
> spent is going into flushing those new empty pages/blocks onto the
> disk.
>

How have you ensured that the cost is due to the flushing of pages? AFAICS, we don't flush the pages, rather we just write them and then register them to be flushed by the checkpointer. Now it is possible that the checkpointer sync queue gets full and the backend has to write by itself, but have we checked that? I think we can check via wait events; if it is due to flush then we should see a lot of file sync (WAIT_EVENT_DATA_FILE_SYNC) wait events. The other possibility could be that the free pages added to the FSM by one worker are not being used by another worker for some reason. Can we debug and check if the pages added by one worker are being used by another worker?

--
With Regards,
Amit Kapila.
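As a rough way to check this while the CTAS is running, one could repeatedly sample pg_stat_activity from another session. This is only a sketch; per the documentation, WAIT_EVENT_DATA_FILE_SYNC shows up as wait_event 'DataFileSync' (type 'IO'), and waits on the relation extension lock show up as wait_event_type 'Lock' with wait_event 'extend':

SELECT backend_type, wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2, 3
ORDER BY count(*) DESC;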
On Fri, May 21, 2021 at 3:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Mar 19, 2021 at 11:02 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Wed, Jan 27, 2021 at 1:47 PM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > I analyzed performance of parallel inserts in CTAS for different cases > > with tuple size 32bytes, 59bytes, 241bytes and 1064bytes. We could > > gain if the tuple sizes are lower. But if the tuple size is larger > > i..e 1064bytes, there's a regression with parallel inserts. Upon > > further analysis, it turned out that the parallel workers are > > requiring frequent extra blocks addition while concurrently extending > > the relation(in RelationAddExtraBlocks) and the majority of the time > > spent is going into flushing those new empty pages/blocks onto the > > disk. > > > > How you have ensured that the cost is due to the flushing of pages? > AFAICS, we don't flush the pages rather just write them and then > register those to be flushed by checkpointer, now it is possible that > the checkpointer sync queue gets full and the backend has to write by > itself but have we checked that? I think we can check via wait events, > if it is due to flush then we should see a lot of file sync > (WAIT_EVENT_DATA_FILE_SYNC) wait events. The other possibility could > be that the free pages added to FSM by one worker are not being used > by another worker due to some reason. Can we debug and check if the > pages added by one worker are being used by another worker? Thanks! I will work on the above points sometime later. BTW, I forgot to mention one point earlier that we see a benefit without parallelism if only multi inserts are used for CTAS instead of single inserts. See [2] for more testing results. I used "New Table Access Methods for Multi and Single Inserts" patches from [1] for this testing. I think it's a good idea to revisit that work. [1] - https://www.postgresql.org/message-id/CALj2ACXdrOmB6Na9amHWZHKvRT3Z0nwTRsCwoMT-npOBtmXLXg%40mail.gmail.com [2] case 1 - 2 integer(of 4 bytes each) columns, tuple size 32 bytes, 100mn tuples on master - 130sec on master with multi inserts - 105sec, gain - 1.23X on parallel CTAS patch without multi inserts - (2 workers, 82sec, 1.58X), (4 workers, 83sec, 1.56X) on parallel CTAS patch with multi inserts - (2 workers, 45sec, 2.33X, overall gain if seen from master 2.88X), (4 workers, 33sec, 3.18X, overall gain if seen from master 3.9X) case 2 - 2 integer(of 4 bytes each) columns, 3 varchar(8), tuple size 59 bytes, 100mn tuples on master - 185sec on master with multi inserts - 121sec, gain - 1.52X on parallel CTAS patch without multi inserts - (2 workers, 120sec, 1.54X), (4 workers, 123sec, 1.5X) on parallel CTAS patch with multi inserts - (2 workers, 68sec, 1.77X, overall gain if seen from master 2.72X), (4 workers, 61sec, 1.98X, overall gain if seen from master 3.03X) Above two cases are the best cases with tuple size a few bytes where parallel CTAS + multi inserts would give up to 3.9X and 3.03X benefits. 
case 3 - 2 bigint(of 8 bytes each) columns, 3 name(of 64 bytes each) columns, 1 varchar(8), tuple size 241 bytes, 100mn tuples
on master - 367sec
on master with multi inserts - 291sec, gain - 1.26X
on parallel CTAS patch without multi inserts - (2 workers, 334sec, 1.09X), (4 workers, 336sec, 1.09X)
on parallel CTAS patch with multi inserts - (2 workers, 284sec, 1.02X, overall gain if seen from master 1.29X), (4 workers, 278sec, 1.04X, overall gain if seen from master 1.32X)

In the above case, where the tuple size is 241 bytes, we don't gain much.

case 4 - 2 bigint(of 8 bytes each) columns, 16 name(of 64 bytes each) columns, tuple size 1064 bytes, 10mn tuples
on master - 120sec
on master with multi inserts - 115sec, gain - 1.04X
on parallel CTAS patch without multi inserts - (2 workers, 140sec, 0.85X), (4 workers, 142sec, 0.84X)
on parallel CTAS patch with multi inserts - (2 workers, 133sec, 0.86X, overall loss if seen from master 0.9X), (4 workers, 134sec, 0.85X, overall loss if seen from master 0.89X)

In the above case, where the tuple size is 1064 bytes, we gain very little with multi inserts, and with parallel inserts we cause a regression.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Hi Bharath-san, From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Sent: Friday, May 21, 2021 6:49 PM > > On Fri, May 21, 2021 at 3:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Mar 19, 2021 at 11:02 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > On Wed, Jan 27, 2021 at 1:47 PM Bharath Rupireddy > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > > > I analyzed performance of parallel inserts in CTAS for different > > > cases with tuple size 32bytes, 59bytes, 241bytes and 1064bytes. We > > > could gain if the tuple sizes are lower. But if the tuple size is > > > larger i..e 1064bytes, there's a regression with parallel inserts. > > > Upon further analysis, it turned out that the parallel workers are > > > requiring frequent extra blocks addition while concurrently > > > extending the relation(in RelationAddExtraBlocks) and the majority > > > of the time spent is going into flushing those new empty > > > pages/blocks onto the disk. > > > > > > > How you have ensured that the cost is due to the flushing of pages? > > AFAICS, we don't flush the pages rather just write them and then > > register those to be flushed by checkpointer, now it is possible that > > the checkpointer sync queue gets full and the backend has to write by > > itself but have we checked that? I think we can check via wait events, > > if it is due to flush then we should see a lot of file sync > > (WAIT_EVENT_DATA_FILE_SYNC) wait events. The other possibility could > > be that the free pages added to FSM by one worker are not being used > > by another worker due to some reason. Can we debug and check if the > > pages added by one worker are being used by another worker? > > Thanks! I will work on the above points sometime later. I noticed one place which could be one of the reasons that cause the performance degradation. + /* + * We don't need to skip contacting FSM while inserting tuples for + * parallel mode, while extending the relations, workers instead of + * blocking on a page while another worker is inserting, can check the + * FSM for another page that can accommodate the tuples. This results + * in major benefit for parallel inserts. + */ + myState->ti_options = 0; I am not quite sure that disabling the " SKIP FSM " in parallel worker will bring performance gain. In my test environment, if I change this code to use option " TABLE_INSERT_SKIP_FSM ", then there seems no performance degradation . Could you please have a try on it ? (I test with the SQL you provided earlier[1]) [1] https://www.postgresql.org/message-id/CALj2ACWFvNm4d_uqT2iECPqaXZjEd-O%2By8xbghvqXeMLj0pxGw%40mail.gmail.com Best regards, houzj
Bharath-san, all,

Hmm, I didn't experience performance degradation on my poor-man's Linux VM (4 CPU, 4 GB RAM, HDD)...

[benchmark preparation]
autovacuum = off
shared_buffers = 1GB
checkpoint_timeout = 1h
max_wal_size = 8GB
min_wal_size = 8GB
(other settings to enable parallelism)
CREATE UNLOGGED TABLE a (c char(1100));
INSERT INTO a SELECT i FROM generate_series(1, 300000) i;
(the table size is 335 MB)

[benchmark]
CREATE TABLE b AS SELECT * FROM a;
DROP TABLE a;
CHECKPOINT;
(measure only CTAS)

[results]
parallel_leader_participation = off
workers time(ms)
0 3921
2 3290
4 3132
parallel_leader_participation = on
workers time(ms)
2 3266
4 3247

Although this should be a controversial and maybe crazy idea, the following change brought a 4-11% speedup. This is because I thought parallel workers might contend for WAL flush as a result of them using the limited ring buffer and flushing dirty buffers when the ring buffer is filled. Can we take advantage of this?

[GetBulkInsertState]
/* bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);*/
bistate->strategy = NULL;

[results]
parallel_leader_participation = off
workers time(ms)
0 3695 (5% reduction)
2 3135 (4% reduction)
4 2767 (11% reduction)

Regards
Takayuki Tsunakawa
From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
> + /*
> + * We don't need to skip contacting FSM while inserting tuples for
> + * parallel mode, while extending the relations, workers instead of
> + * blocking on a page while another worker is inserting, can check the
> + * FSM for another page that can accommodate the tuples. This results
> + * in major benefit for parallel inserts.
> + */
> + myState->ti_options = 0;
>
> I am not quite sure that disabling the "SKIP FSM" in parallel worker will bring
> performance gain.
> In my test environment, if I change this code to use option
> "TABLE_INSERT_SKIP_FSM", then there
> seems no performance degradation.

+1, probably. Does the code comment represent a situation like this?

1. Worker 1 is inserting into page 1.
2. Worker 2 tries to insert into page 1, but cannot acquire the buffer content lock of page 1 because worker 1 holds it.
3. Worker 2 looks up the FSM to find a page with enough free space.

But isn't the FSM still empty during CTAS?

Regards
Takayuki Tsunakawa
On Tue, May 25, 2021 at 12:05 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> I noticed one place which could be one of the reasons that cause the performance degradation.
>
> + /*
> + * We don't need to skip contacting FSM while inserting tuples for
> + * parallel mode, while extending the relations, workers instead of
> + * blocking on a page while another worker is inserting, can check the
> + * FSM for another page that can accommodate the tuples. This results
> + * in major benefit for parallel inserts.
> + */
> + myState->ti_options = 0;
>
> I am not quite sure that disabling the "SKIP FSM" in parallel worker will bring performance gain.
> In my test environment, if I change this code to use option "TABLE_INSERT_SKIP_FSM", then there
> seems no performance degradation. Could you please have a try on it?
> (I test with the SQL you provided earlier[1])

Thanks for trying that out.

Please see the code around the use_fsm flag in RelationGetBufferForTuple for more understanding of the points below.

What happens if FSM is skipped i.e. myState->ti_options = TABLE_INSERT_SKIP_FSM;?
1) The flag use_fsm will be false in heap_insert->RelationGetBufferForTuple.
2) Each worker initially gets a block and keeps inserting into it until it is full. When the block is full, the worker doesn't look in the FSM (GetPageWithFreeSpace) as use_fsm is false. It directly goes for relation extension and tries to acquire the relation extension lock with LockRelationForExtension. Note that the bulk extension of blocks with RelationAddExtraBlocks is not reached as use_fsm is false.
3) After acquiring the relation extension lock, it adds an extra new block with ReadBufferBI(relation, P_NEW, ...), see the comment "In addition to whatever extension we performed above, we always add at least one block to satisfy our own request." The tuple is inserted into this new block.

Basically, the workers can't look for empty pages among the pages added by other workers; they keep doing the above steps in silos.

What happens if FSM is not skipped i.e. myState->ti_options = 0;?
1) The flag use_fsm will be true in heap_insert->RelationGetBufferForTuple.
2) Each worker initially gets a block and keeps inserting into it until it is full. When the block is full, the worker looks for a page with free space in the FSM (GetPageWithFreeSpace) as use_fsm is true. If it can't find any page with the required amount of free space, it goes for bulk relation extension (RelationAddExtraBlocks) after acquiring the relation extension lock with ConditionalLockRelationForExtension. Then the worker adds extraBlocks = Min(512, lockWaiters * 20); new blocks in RelationAddExtraBlocks and immediately updates the bottom level of the FSM for each block (see the comment around RecordPageWithFreeSpace for why only the bottom level, not the entire FSM tree). After all the blocks are added, it then updates the entire FSM tree with FreeSpaceMapVacuumRange.
4) After the bulk extension, the worker adds another block, see the comment "In addition to whatever extension we performed above, we always add at least one block to satisfy our own request.", and inserts the tuple into this new block.

Basically, the workers can benefit from the bulk extension of the relation and they can always look for empty pages among the pages added by other workers. There are high chances that the blocks will be available after bulk extension. Having said that, if the added extra blocks are consumed by the workers very fast, i.e. if the tuple sizes are big, i.e. very few tuples per page, then the bulk extension too can't help much and there will be more contention on the relation extension lock. Well, one might think to add more blocks at a time, say Min(1024, lockWaiters * 128/256/512) instead of the current extraBlocks = Min(512, lockWaiters * 20);. This will work (i.e. we don't see any regression with parallel inserts in CTAS patches), but it can't be a practical solution, because the total pages for the relation will be more, with many pages having more free space. Furthermore, future sequential scans on that relation might take a lot of time.

If myState->ti_options = TABLE_INSERT_SKIP_FSM; is set in only the one place (within if (myState->is_parallel)), then it will be effective only for the leader, i.e. the leader will not look in the FSM, but all the workers will, because within if (myState->is_parallel_worker) in intorel_startup, myState->ti_options = 0; for workers.

I ran tests with the configuration shown at [1] for case 4 (2 bigint(of 8 bytes each) columns, 16 name(of 64 bytes each) columns, tuple size 1064 bytes, 10mn tuples) with leader participation where I'm seeing the regression:

1) when myState->ti_options = TABLE_INSERT_SKIP_FSM; for both leader and workers, then my results are as follows:
0 workers - 116934.137, 2 workers - 209802.060, 4 workers - 248580.275
2) when myState->ti_options = 0; for both leader and workers, then my results are as follows:
0 workers - 1116184.718, 2 workers - 139798.055, 4 workers - 143022.409

I hope the above explanation and the test results clarify the fact that skipping the FSM doesn't solve the problem. Let me know if anything is not clear or I'm missing something.

[1] postgresql.conf parameters used:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
port = 5440

System Configuration:
RAM: 528GB
Disk Type: SSD
Disk Size: 1.5TB
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
Stepping: 2
CPU MHz: 1064.000
CPU max MHz: 2129.0000
CPU min MHz: 1064.0000
BogoMIPS: 4266.62
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 24576K

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Tue, May 25, 2021 at 1:10 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > Although this should be a controversial and may be crazy idea, the following change brought 4-11% speedup. This is becauseI thought parallel workers might contend for WAL flush as a result of them using the limited ring buffer and flushingdirty buffers when the ring buffer is filled. Can we take advantage of this? > > [GetBulkInsertState] > /* bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);*/ > bistate->strategy = NULL; You are right. If ring buffer(16MB) is not used and shared buffers(1GB) are used instead, in your case since the table size is 335MB and it can fit in the shared buffers, there will not be any or will be very minimal dirty buffer flushing, so there will be more some more speedup. Otherwise, the similar speed up can be observed when the BAS_BULKWRITE is increased a bit from the current 16MB to some other reasonable value. I earlier tried these experiments. Otherwise, as I said in [1], we can also increase the number of extra blocks added at a time, say Min(1024, lockWaiters * 128/256/512) than currently extraBlocks = Min(512, lockWaiters * 20);. This will also give some speedup and we don't see any regression with parallel inserts in CTAS patches. But, I'm not so sure that the hackers will agree any of the above as a practical solution to the "relation extension" problem. [1] https://www.postgresql.org/message-id/CALj2ACVdcrjwHXwvJqT-Fa32vnJEOjteep_3L24X8MK50E7M8w%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Tue, May 25, 2021 at 1:50 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> > > + /* > > + * We don't need to skip contacting FSM while inserting tuples > > for > > + * parallel mode, while extending the relations, workers > > instead of > > + * blocking on a page while another worker is inserting, can > > check the > > + * FSM for another page that can accommodate the tuples. > > This results > > + * in major benefit for parallel inserts. > > + */ > > + myState->ti_options = 0; > > > > I am not quite sure that disabling the " SKIP FSM " in parallel worker will bring > > performance gain. > > In my test environment, if I change this code to use option " > > TABLE_INSERT_SKIP_FSM ", then there > > seems no performance degradation. > > +1, probably. I tried to explain it at [1]. Please have a look. > Does the code comment represent the situation like this? > > 1. Worker 1 is inserting into page 1. > > 2. Worker 2 tries to insert into page 1, but cannot acquire the buffer content lock of page 1 because worker 1 holds it. > > 3. Worker 2 looks up FSM to find a page with enough free space. I tried to explain it at [1]. Please have a look. > But isn't FSM still empty during CTAS? No, FSM will be built on the fly in case if we don't skip the FSM i.e. myState->ti_options = 0, see RelationGetBufferForTuple with use_fsm = true -> GetPageWithFreeSpace -> fsm_search -> fsm_set_and_search -> fsm_readbuf with extend = true. [1] https://www.postgresql.org/message-id/CALj2ACVdcrjwHXwvJqT-Fa32vnJEOjteep_3L24X8MK50E7M8w%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
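To see that the FSM really does get populated on the fly during such a CTAS, one rough check (a sketch, using the contrib module pg_freespacemap; 'test' here stands for whatever target table the CTAS creates) is to inspect the new table's free space map right after the command finishes:

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
-- count pages of the new table that have free space recorded in its FSM
SELECT count(*) AS pages_with_recorded_free_space
FROM pg_freespace('test')
WHERE avail > 0;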
On Fri, May 21, 2021 at 3:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Mar 19, 2021 at 11:02 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Wed, Jan 27, 2021 at 1:47 PM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > I analyzed performance of parallel inserts in CTAS for different cases > > with tuple size 32bytes, 59bytes, 241bytes and 1064bytes. We could > > gain if the tuple sizes are lower. But if the tuple size is larger > > i..e 1064bytes, there's a regression with parallel inserts. Upon > > further analysis, it turned out that the parallel workers are > > requiring frequent extra blocks addition while concurrently extending > > the relation(in RelationAddExtraBlocks) and the majority of the time > > spent is going into flushing those new empty pages/blocks onto the > > disk. > > > > How you have ensured that the cost is due to the flushing of pages? I think I'm wrong to just say the problem is with the flushing of empty pages when bulk extending the relation. I should have said the problem is with the "relation extension lock", but I will hold on to it for a moment until I capture the relation extension lock wait events for the regression causing cases. I will share the information soon. > AFAICS, we don't flush the pages rather just write them and then > register those to be flushed by checkpointer, now it is possible that > the checkpointer sync queue gets full and the backend has to write by > itself but have we checked that? I think we can check via wait events, > if it is due to flush then we should see a lot of file sync > (WAIT_EVENT_DATA_FILE_SYNC) wait events. I will also capture the data file sync events along with relation extension lock wait events. > The other possibility could > be that the free pages added to FSM by one worker are not being used > by another worker due to some reason. Can we debug and check if the > pages added by one worker are being used by another worker? I tried to explain it at [1]. Please have a look. It looks like the burden is more on the "relation extension lock" and the way the extra new blocks are getting added. [1] https://www.postgresql.org/message-id/CALj2ACVdcrjwHXwvJqT-Fa32vnJEOjteep_3L24X8MK50E7M8w%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Wed, May 26, 2021 at 5:28 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Fri, May 21, 2021 at 3:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Mar 19, 2021 at 11:02 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > The other possibility could > > be that the free pages added to FSM by one worker are not being used > > by another worker due to some reason. Can we debug and check if the > > pages added by one worker are being used by another worker? > > I tried to explain it at [1]. Please have a look. > I have read it but I think we should try to ensure practically what is happening because it is possible that first time worker checked in FSM without taking relation extension lock, it didn't find any free page, and then when it tried to acquire the conditional lock, it got the same and just extended the relation by one block. So, in such a case it won't be able to use the newly added pages by another worker. I am not sure any such thing is happening here but I think it is better to verify it in some way. Also, I am not sure if just getting the info about the relation extension lock is sufficient? -- With Regards, Amit Kapila.
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Sent: Wednesday, May 26, 2021 7:22 PM > Thanks for trying that out. > > Please see the code around the use_fsm flag in RelationGetBufferForTuple for > more understanding of the points below. > > What happens if FSM is skipped i.e. myState->ti_options = > TABLE_INSERT_SKIP_FSM;? > 1) The flag use_fsm will be false in heap_insert->RelationGetBufferForTuple. > 2) Each worker initially gets a block and keeps inserting into it until it is full. > When the block is full, the worker doesn't look in FSM GetPageWithFreeSpace > as use_fsm is false. It directly goes for relation extension and tries to acquire > relation extension lock with LockRelationForExtension. Note that the bulk > extension of blocks with RelationAddExtraBlocks is not reached as use_fsm is > false. > 3) After acquiring the relation extension lock, it adds an extra new block with > ReadBufferBI(relation, P_NEW, ...), see the comment "In addition to whatever > extension we performed above, we always add at least one block to satisfy our > own request." The tuple is inserted into this new block. > > Basically, the workers can't look for the empty pages from the pages added by > other workers, they keep doing the above steps in silos. > > What happens if FSM is not skipped i.e. myState->ti_options = 0;? > 1) The flag use_fsm will be true in heap_insert->RelationGetBufferForTuple. > 2) Each worker initially gets a block and keeps inserting into it until it is full. > When the block is full, the worker looks for the page with free space in FSM > GetPageWithFreeSpace as use_fsm is true. > If it can't find any page with the required amount of free space, it goes for bulk > relation extension(RelationAddExtraBlocks) after acquiring relation extension > lock with ConditionalLockRelationForExtension. Then the worker adds > extraBlocks = Min(512, lockWaiters * 20); new blocks in > RelationAddExtraBlocks and immediately updates the bottom level of FSM for > each block (see the comment around RecordPageWithFreeSpace for why only > the bottom level, not the entire FSM tree). After all the blocks are added, then > it updates the entire FSM tree FreeSpaceMapVacuumRange. > 4) After the bulk extension, then the worker adds another block see the > comment "In addition to whatever extension we performed above, we always > add at least one block to satisfy our own request." and inserts tuple into this > new block. > > Basically, the workers can benefit from the bulk extension of the relation and > they always can look for the empty pages from the pages added by other > workers. There are high chances that the blocks will be available after bulk > extension. Having said that, if the added extra blocks are consumed by the > workers so fast i.e. if the tuple sizes are big i.e very less tuples per page, then, > the bulk extension too can't help much and there will be more contention on > the relation extension lock. Well, one might think to add more blocks at a time, > say Min(1024, lockWaiters * 128/256/512) than currently extraBlocks = Min(512, > lockWaiters * 20);. This will work (i.e. we don't see any regression with parallel > inserts in CTAS patches), but it can't be a practical solution. Because the total > pages for the relation will be more with many pages having more free space. > Furthermore, the future sequential scans on that relation might take a lot of > time. 
> > If myState->ti_options = TABLE_INSERT_SKIP_FSM; in only the place(within if > (myState->is_parallel)), then it will be effective for leader i.e. leader will not > look for FSM, but all the workers will, because within if > (myState->is_parallel_worker) in intorel_startup, > myState->ti_options = 0; for workers. > > I ran tests with configuration shown at [1] for the case 4 (2 bigint(of 8 bytes > each) columns, 16 name(of 64 bytes each) columns, tuple size 1064 bytes, 10mn > tuples) with leader participation where I'm seeing regression: > > 1) when myState->ti_options = TABLE_INSERT_SKIP_FSM; for both leader and > workers, then my results are as follows: > 0 workers - 116934.137, 2 workers - 209802.060, 4 workers - 248580.275 > 2) when myState->ti_options = 0; for both leader and workers, then my results > are as follows: > 0 workers - 1116184.718, 2 workers - 139798.055, 4 workers - 143022.409 > I hope the above explanation and the test results should clarify the fact that > skipping FSM doesn't solve the problem. Let me know if anything is not clear or > I'm missing something. Thanks for the explanation. I followed your above test steps and the below configuration, but my test results are a little different from yours. I am not sure the exact reason, maybe because of the hardware.. Test INSERT 10000000 rows((2 bigint(of 8 bytes) 16 name(of 64 bytes each) columns): SERIAL: 22023.631 ms PARALLEL 2 WORKER [NOT SKIP FSM]: 21824.934 ms [SKIP FSM]: 19381.474 ms PARALLEL 4 WORKER [NOT SKIP FSM]: 20481.117 ms [SKIP FSM]: 18381.305 ms I am afraid that the using the FSM seems not get a stable performance gain(at least on my machine), I will take a deep look into this to figure out the difference. A naive idea it that the benefit that bulk extension bring is not much greater than the cost in FSM. Do you have some ideas on it ? My test machine: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 40 On-line CPU(s) list: 0-39 Thread(s) per core: 2 Core(s) per socket: 10 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz Stepping: 7 CPU MHz: 2901.005 CPU max MHz: 3200.0000 CPU min MHz: 1000.0000 BogoMIPS: 4400.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 14080K Best regards, houzj > [1] postgresql.conf parameters used: > shared_buffers = 40GB > max_worker_processes = 32 > max_parallel_maintenance_workers = 24 > max_parallel_workers = 32 > synchronous_commit = off > checkpoint_timeout = 1d > max_wal_size = 24GB > min_wal_size = 15GB > autovacuum = off > port = 5440 > > System Configuration: > RAM: 528GB > Disk Type: SSD > Disk Size: 1.5TB > lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 128 > On-line CPU(s) list: 0-127 > Thread(s) per core: 2 > Core(s) per socket: 8 > Socket(s): 8 > NUMA node(s): 8 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 47 > Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz > Stepping: 2 > CPU MHz: 1064.000 > CPU max MHz: 2129.0000 > CPU min MHz: 1064.0000 > BogoMIPS: 4266.62 > Virtualization: VT-x > L1d cache: 32K > L1i cache: 32K > L2 cache: 256K > L3 cache: 24576K
Thank you for the detailed analysis, I'll look into it too. (The times have changed...) From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> > Well, one might think to add more blocks at a time, say > Min(1024, lockWaiters * 128/256/512) than currently extraBlocks = > Min(512, lockWaiters * 20);. This will work (i.e. we don't see any > regression with parallel inserts in CTAS patches), but it can't be a > practical solution. Because the total pages for the relation will be > more with many pages having more free space. Furthermore, the future > sequential scans on that relation might take a lot of time. > Otherwise, the similar speed up can be observed when the BAS_BULKWRITE > is increased a bit from the current 16MB to some other reasonable > value. I earlier tried these experiments. > > Otherwise, as I said in [1], we can also increase the number of extra > blocks added at a time, say Min(1024, lockWaiters * 128/256/512) than > currently extraBlocks = Min(512, lockWaiters * 20);. This will also > give some speedup and we don't see any regression with parallel > inserts in CTAS patches. > > But, I'm not so sure that the hackers will agree any of the above as a > practical solution to the "relation extension" problem. I think I understand your concern about resource consumption and impact on other concurrently running jobs (OLTP, data analysis.) OTOH, what's the situation like when the user wants to run CTAS, and further, wants to speed it up by using parallelism? isn't it okay to let the (parallel) CTAS use as much as it wants? At least, I think we can provide anothermode for it, like Oracle provides conditional path mode and direct path mode for INSERT and data loading. What do we want to do to maximize parallel CTAS speedup if we were a bit unshackled from the current constraints (alignmentwith existing code, impact on other concurrent workloads)? * Use as many shared buffers as possible to decrease WAL flush. Otherwise, INSERT SELECT may be faster? * Minimize relation extension (= increase the block count per extension) posix_fallocate() would help too. * Allocate added pages among parallel workers, and each worker fills pages to their full capacity. The worker that extended the relation stores the page numbers of added pages in shared memory for parallel execution. Eachworker gets a page from there after waiting for the relation extension lock, instead of using FSM. The last pages that the workers used will be filled halfway, but the amount of unused space should be low compared to thetotal table size. Regards Takayuki Tsunakawa
On Thu, May 27, 2021 at 7:12 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > I followed your above test steps and the below configuration, but my test results are a little different from yours. > I am not sure the exact reason, maybe because of the hardware.. > > Test INSERT 10000000 rows((2 bigint(of 8 bytes) 16 name(of 64 bytes each) columns): > SERIAL: 22023.631 ms > PARALLEL 2 WORKER [NOT SKIP FSM]: 21824.934 ms [SKIP FSM]: 19381.474 ms > PARALLEL 4 WORKER [NOT SKIP FSM]: 20481.117 ms [SKIP FSM]: 18381.305 ms I'm not sure why there's a huge difference in the execution time, on your system it just takes ~20sec whereas on my system(with SSD) it takes ~115 sec. I hope you didn't try creating the unlogged table in CTAS right? Just for reference, the exact use case I tried is at [1]. The configure command I used to build the postgres source code is at [2]. I don't know whether I'm missing something here. [1] case 4 - 2 bigint(of 8 bytes each) columns, 16 name(of 64 bytes each) columns, tuple size 1064 bytes, 10mn tuples DROP TABLE tenk1; CREATE UNLOGGED TABLE tenk1(c1 bigint, c2 bigint, c3 name, c4 name, c5 name, c6 name, c7 name, c8 name, c9 name, c10 name, c11 name, c12 name, c13 name, c14 name, c15 name, c16 name, c17 name, c18 name); INSERT INTO tenk1 values(generate_series(1,10000000), generate_series(1,10000000), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8))); explain analyze verbose create table test as select * from tenk1; [2] ./configure --with-zlib --prefix=$PWD/inst/ --with-openssl --with-readline --with-libxml > war.log && make -j 8 install > war.log 2>&1 & With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Wed, May 26, 2021 at 5:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, May 26, 2021 at 5:28 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Fri, May 21, 2021 at 3:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Mar 19, 2021 at 11:02 AM Bharath Rupireddy > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > > The other possibility could > > > be that the free pages added to FSM by one worker are not being used > > > by another worker due to some reason. Can we debug and check if the > > > pages added by one worker are being used by another worker? > > > > I tried to explain it at [1]. Please have a look. > > > > I have read it but I think we should try to ensure practically what is > happening because it is possible that first time worker checked in FSM > without taking relation extension lock, it didn't find any free page, > and then when it tried to acquire the conditional lock, it got the > same and just extended the relation by one block. So, in such a case > it won't be able to use the newly added pages by another worker. I am > not sure any such thing is happening here but I think it is better to > verify it in some way. Also, I am not sure if just getting the info > about the relation extension lock is sufficient? > One idea to find this out could be that we have three counters for each worker which counts the number of times each worker extended the relation in bulk, the number of times each worker extended the relation by one block, the number of times each worker gets the page from FSM. It might be possible that with this we will be able to figure out why there is a difference between your and Hou-San's results. -- With Regards, Amit Kapila.
On Thu, May 27, 2021 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have read it but I think we should try to ensure practically what is > > happening because it is possible that first time worker checked in FSM > > without taking relation extension lock, it didn't find any free page, > > and then when it tried to acquire the conditional lock, it got the > > same and just extended the relation by one block. So, in such a case > > it won't be able to use the newly added pages by another worker. I am > > not sure any such thing is happening here but I think it is better to > > verify it in some way. Also, I am not sure if just getting the info > > about the relation extension lock is sufficient? > > > > One idea to find this out could be that we have three counters for > each worker which counts the number of times each worker extended the > relation in bulk, the number of times each worker extended the > relation by one block, the number of times each worker gets the page > from FSM. It might be possible that with this we will be able to > figure out why there is a difference between your and Hou-San's > results. Yeah, that helps. And also, the time spent in LockRelationForExtension, ConditionalLockRelationForExtension, GetPageWithFreeSpace and RelationAddExtraBlocks too can give some insight. My plan is to have a patch with above info added in (which I will share it here so that others can test and see the results too) and run the "case 4" where there's a regression seen on my system. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
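A minimal sketch of the kind of instrumentation being planned here, using PostgreSQL's instr_time machinery around the extension-lock call in hio.c; the counter and timer names are placeholders (the actual add-on patch is shared a little later in the thread):

    #include "portability/instr_time.h"

    /* per-backend instrumentation state (placeholder names) */
    static instr_time ext_lock_time;          /* total wait for the relation extension lock */
    static uint64     bulk_extend_count;      /* times the relation was extended in bulk */
    static uint64     one_block_extend_count; /* times it was extended by a single block */
    static uint64     fsm_hit_count;          /* times a free page was obtained from the FSM */

    /* inside RelationGetBufferForTuple(), wrapped around the existing lock call */
    instr_time  start,
                end;

    INSTR_TIME_SET_CURRENT(start);
    LockRelationForExtension(relation, ExclusiveLock);
    INSTR_TIME_SET_CURRENT(end);
    INSTR_TIME_ACCUM_DIFF(ext_lock_time, end, start);

    /* ...bump the three counters at the corresponding sites, and report e.g.: */
    elog(LOG, "extension lock wait: %.3f ms, bulk extends: " UINT64_FORMAT,
         INSTR_TIME_GET_MILLISEC(ext_lock_time), bulk_extend_count);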
On Thu, May 27, 2021 at 7:12 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > I am afraid that the using the FSM seems not get a stable performance gain(at least on my machine), > I will take a deep look into this to figure out the difference. A naive idea it that the benefit that bulk extension > bring is not much greater than the cost in FSM. > Do you have some ideas on it ? I think, if we try what Amit and I said in [1], we should get some insights on whether the bulk relation extension is taking more time or the FSM lookup. I plan to share the testing patch adding the timings and the counters so that you can also test from your end. I hope that's fine with you. [1] - https://www.postgresql.org/message-id/CALj2ACXskhY58%3DFh8TioKLL1DXYkKdyEyWFYykf-6aLJgJ2qmQ%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Thu, May 27, 2021 at 10:16 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, May 27, 2021 at 7:12 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> > I am afraid that the using the FSM seems not get a stable performance gain(at least on my machine),
> > I will take a deep look into this to figure out the difference. A naive idea it that the benefit that bulk extension
> > bring is not much greater than the cost in FSM.
> > Do you have some ideas on it ?
>
> I think, if we try what Amit and I said in [1], we should get some
> insights on whether the bulk relation extension is taking more time or
> the FSM lookup. I plan to share the testing patch adding the timings
> and the counters so that you can also test from your end. I hope
> that's fine with you.

I think some other cause of contention on relation extension locks are
1. CTAS is using a buffer strategy and due to that, it might need to evict out the buffer frequently for getting the new block in. Maybe we can identify by turning off the buffer strategy for CTAS and increasing the shared buffer so that data fits in memory.
2. I think the parallel workers are scanning and producing a lot of tuples in a short time, so the demand for new blocks is much higher than what RelationAddExtraBlocks is able to produce; maybe you can try adding more blocks by increasing the multiplier and see what the impact is.
3. Also try cases where the underlying SELECT query has some complex condition and selects fewer records, say 50%, 40% ... 10%, and see what the numbers are.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
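On point 1: purely as a hypothetical test hack (not a proposal), the BAS_BULKWRITE ring could be dropped right after the CTAS dest receiver sets up its bulk-insert state (the myState->bistate field mentioned later in the thread), so that CTAS competes for shared buffers like an ordinary insert:

    /* hypothetical experiment only: disable the 16MB bulk-write ring for CTAS */
    myState->bistate = GetBulkInsertState();
    FreeAccessStrategy(myState->bistate->strategy);
    myState->bistate->strategy = NULL;  /* NULL => default (non-ring) buffer replacement */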
From: Dilip Kumar <dilipbalaut@gmail.com>
> I think some other cause of contention on relation extension locks are
> 1. CTAS is using a buffer strategy and due to that, it might need to
> evict out the buffer frequently for getting the new block in. Maybe
> we can identify by turning off the buffer strategy for CTAS and
> increasing the shared buffer so that data fits in memory.

Yes, both Bharath-san (on a rich-man's machine) and I (on a poor-man's VM) saw that it's effective. I think we should remove this shackle from CTAS.

The question is why CTAS chose to use the BULKWRITE strategy in the past. We need to know that to make a better decision. I can understand why VACUUM uses a ring buffer, because it should want to act humbly as a background maintenance task so as not to cause trouble to frontend tasks. But why does CTAS have to be humble? If CTAS needs to be modest, why doesn't it use the BULKREAD strategy for its SELECT?

Regards
Takayuki Tsunakawa
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Sent: Thursday, May 27, 2021 12:46 PM
> On Thu, May 27, 2021 at 7:12 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> > I am afraid that the using the FSM seems not get a stable performance
> > gain(at least on my machine), I will take a deep look into this to
> > figure out the difference. A naive idea it that the benefit that bulk extension
> > bring is not much greater than the cost in FSM.
> > Do you have some ideas on it ?
>
> I think, if we try what Amit and I said in [1], we should get some insights on
> whether the bulk relation extension is taking more time or the FSM lookup. I
> plan to share the testing patch adding the timings and the counters so that you
> can also test from your end. I hope that's fine with you.

Sure, it will be nice if we can calculate the exact time. Thanks in advance.

BTW, I checked my test results, I was testing INSERT INTO unlogged table.
I re-tested INSERT INTO a normal (logged) table again, and it seems [SKIP FSM] still looks slightly better.
Although, the 4 workers case still has performance degradation compared to serial case.

SERIAL: 58759.213 ms
PARALLEL 2 WORKER [NOT SKIP FSM]: 68390.221 ms [SKIP FSM]: 58633.924 ms
PARALLEL 4 WORKER [NOT SKIP FSM]: 67448.142 ms [SKIP FSM]: 66,960.305 ms

Best regards,
houzj
On Thu, May 27, 2021 at 12:19 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > BTW, I checked my test results, I was testing INSERT INTO unlogged table. What do you mean by "testing INSERT INTO"? Is it that you are testing the timings for parallel inserts in INSERT INTO ... SELECT command? If so, why should we test parallel inserts in the INSERT INTO ... SELECT command here? The way I test parallel inserts in CTAS is: Apply the latest v23 patch set available at [1]. Run the data preparation sqls from [2]. Enable timing and run the CTAS query from [3]. Run with 0, 2 and 4 workers with leader participation on. [1] - https://www.postgresql.org/message-id/CALj2ACXVWr1o%2BFZrkQt-2GvYfuMQeJjWohajmp62Wr6BU8Y4VA%40mail.gmail.com [2] DROP TABLE tenk1; CREATE UNLOGGED TABLE tenk1(c1 bigint, c2 bigint, c3 name, c4 name, c5 name, c6 name, c7 name, c8 name, c9 name, c10 name, c11 name, c12 name, c13 name, c14 name, c15 name, c16 name, c17 name, c18 name); INSERT INTO tenk1 values(generate_series(1,100000), generate_series(1,10000000), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8)), upper(substring(md5(random()::varchar),2,8))); [3] EXPLAIN ANALYZE VERBOSE CREATE TABLE test AS SELECT * FROM tenk1; With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Sent: Thursday, May 27, 2021 2:59 PM > On Thu, May 27, 2021 at 12:19 PM houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > BTW, I checked my test results, I was testing INSERT INTO unlogged table. > > What do you mean by "testing INSERT INTO"? Is it that you are testing the > timings for parallel inserts in INSERT INTO ... SELECT command? If so, why > should we test parallel inserts in the INSERT INTO ... SELECT command here? Oops, sorry, it's a typo, I actually meant CREATE TABLE AS SELECT. Best regards, houzj
From: Dilip Kumar <dilipbalaut@gmail.com>
Basically you are creating a new table and loading data to it and that means you will be less likely to access those data soon so for such thing spoiling buffer cache may not be a good idea.
--------------------------------------------------

Some people, including me, would say that the table will be accessed soon and that's why the data is loaded quickly during minimal maintenance hours.

--------------------------------------------------
I was just suggesting only for experiments for identifying the root cause.
--------------------------------------------------

I thought this is a good chance to possibly change things for the better (^^).
I guess the user would simply think like this: "I just want to finish CTAS as quickly as possible, so I configured to take advantage of parallelism. I want CTAS to make most use of our resources. Why doesn't Postgres try to limit resource usage (by using the ring buffer) against my will?"

Regards
Takayuki Tsunakawa
On Thu, May 27, 2021 at 12:46 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Dilip Kumar <dilipbalaut@gmail.com>
> Basically you are creating a new table and loading data to it and that means you will be less likely to access those data soon so for such thing spoiling buffer cache may not be a good idea.
> --------------------------------------------------
>
> Some people, including me, would say that the table will be accessed soon and that's why the data is loaded quickly during minimal maintenance hours.
>
> --------------------------------------------------
> I was just suggesting only for experiments for identifying the root cause.
> --------------------------------------------------
>
> I thought this is a good chance to possibly change things for the better (^^).
> I guess the user would simply think like this: "I just want to finish CTAS as quickly as possible, so I configured to take advantage of parallelism. I want CTAS to make most use of our resources. Why doesn't Postgres try to limit resource usage (by using the ring buffer) against my will?"

If the idea is to give the user control of whether or not to use the separate RING BUFFER for bulk inserts/writes, then how about giving it as a rel option? Currently BAS_BULKWRITE (GetBulkInsertState) is being used by CTAS, Refresh Mat View, Table Rewrites (ATRewriteTable) and COPY. Furthermore, we could make the rel option an integer and allow users to provide the size of the ring buffer they want to choose for a particular bulk insert operation (of course with a max limit which is not exceeding the shared buffers or some reasonable amount not exceeding the RAM of the system).

I think we can discuss this in a separate thread and see what other hackers think.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
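If such a rel option were pursued, it would presumably be registered next to the existing heap options in src/backend/access/common/reloptions.c. The entry below is purely hypothetical (name, default and limits are invented for illustration), following the shape of the existing intRelOpts[] entries:

    /* hypothetical addition to the static intRelOpts[] table in reloptions.c */
    {
        {
            "bulk_write_ring_size",     /* invented option name */
            "Ring buffer size (in kB) used for bulk writes into this relation",
            RELOPT_KIND_HEAP,
            ShareUpdateExclusiveLock
        },
        16384,      /* default: the current 16MB BAS_BULKWRITE ring */
        1024,       /* arbitrary minimum for the sketch */
        1048576     /* arbitrary maximum; a real patch would cap it by shared_buffers */
    },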
From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> > Although, the 4 workers case still has performance degradation compared to > serial case. > > SERIAL: 58759.213 ms > PARALLEL 2 WORKER [NOT SKIP FSM]: 68390.221 ms [SKIP FSM]: > 58633.924 ms > PARALLEL 4 WORKER [NOT SKIP FSM]: 67448.142 ms [SKIP FSM]: > 66,960.305 ms Can you see any difference in table sizes? Regards Takayuki Tsunakawa
On Thu, May 27, 2021 at 1:03 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> > > Although, the 4 workers case still has performance degradation compared to > > serial case. > > > > SERIAL: 58759.213 ms > > PARALLEL 2 WORKER [NOT SKIP FSM]: 68390.221 ms [SKIP FSM]: > > 58633.924 ms > > PARALLEL 4 WORKER [NOT SKIP FSM]: 67448.142 ms [SKIP FSM]: > > 66,960.305 ms > > Can you see any difference in table sizes? Also, the number of pages the table occupies in each case along with table size would give more insights. I do as follows to get the number of pages a relation occupies: CREATE EXTENSION pgstattuple; SELECT pg_relpages('test'); With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> I think we can discuss this in a separate thread and see what other
> hackers think.

OK, as long as we don't get stuck in the current direction. (Our goal is not just to avoid degrading performance, but to outperform serial execution, isn't it?)

> If the idea is to give the user control of whether or not to use the
> separate RING BUFFER for bulk inserts/writes, then how about giving it
> as a rel option? Currently BAS_BULKWRITE (GetBulkInsertState) is
> being used by CTAS, Refresh Mat View, Table Rewrites (ATRewriteTable)
> and COPY. Furthermore, we could make the rel option an integer and
> allow users to provide the size of the ring buffer they want to choose
> for a particular bulk insert operation (of course with a max limit
> which is not exceeding the shared buffers or some reasonable amount
> not exceeding the RAM of the system).

I think it's not a table property but an execution property. So, it'd be appropriate to control it with the SET command, just like the DBA sets work_mem and maintenance_work_mem for specific maintenance operations. I'll stop on this here...

Regards
Takayuki Tsunakawa
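If the SET-based approach were taken instead, the knob would look like an ordinary GUC. A purely hypothetical sketch of such an entry in the ConfigureNamesInt[] table in src/backend/utils/misc/guc.c (name, backing variable and limits are all invented here, not from any posted patch):

    /* requires: int bulk_write_ring_size; declared elsewhere (hypothetical) */
    {
        {"bulk_write_ring_size", PGC_USERSET, RESOURCES_MEM,
            gettext_noop("Sets the ring buffer size used by bulk-write operations "
                         "such as CTAS, COPY and REFRESH MATERIALIZED VIEW."),
            NULL,
            GUC_UNIT_KB
        },
        &bulk_write_ring_size,
        16384, 1024, MAX_KILOBYTES,
        NULL, NULL, NULL
    },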
On Thu, May 27, 2021 at 10:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, May 27, 2021 at 10:16 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Thu, May 27, 2021 at 7:12 AM houzj.fnst@fujitsu.com > > <houzj.fnst@fujitsu.com> wrote: > > > I am afraid that the using the FSM seems not get a stable performance gain(at least on my machine), > > > I will take a deep look into this to figure out the difference. A naive idea it that the benefit that bulk extension > > > bring is not much greater than the cost in FSM. > > > Do you have some ideas on it ? > > > > I think, if we try what Amit and I said in [1], we should get some > > insights on whether the bulk relation extension is taking more time or > > the FSM lookup. I plan to share the testing patch adding the timings > > and the counters so that you can also test from your end. I hope > > that's fine with you. > > I think some other cause of contention on relation extension locks are > 1. CTAS is using a buffer strategy and due to that, it might need to > evict out the buffer frequently for getting the new block in. Maybe > we can identify by turning off the buffer strategy for CTAS and > increasing the shared buffer so that data fits in memory. > One more thing to ensure is whether all the workers are using the same access strategy? -- With Regards, Amit Kapila.
On Thu, May 27, 2021 at 2:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think some other cause of contention on relation extension locks are > > 1. CTAS is using a buffer strategy and due to that, it might need to > > evict out the buffer frequently for getting the new block in. Maybe > > we can identify by turning off the buffer strategy for CTAS and > > increasing the shared buffer so that data fits in memory. > > > > One more thing to ensure is whether all the workers are using the same > access strategy? In the Parallel Inserts in CTAS patches, the leader and each worker uses its own ring buffer of 16MB i.e. does myState->bistate = GetBulkInsertState(); separately. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
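For readers following along, what that call hands each participant is its own small ring of shared buffers; the function is essentially the following (heapam.c at the time of this thread), where BAS_BULKWRITE corresponds to the ~16MB ring mentioned above:

    BulkInsertState
    GetBulkInsertState(void)
    {
        BulkInsertState bistate;

        bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
        bistate->strategy = GetAccessStrategy(BAS_BULKWRITE); /* private ring per backend */
        bistate->current_buf = InvalidBuffer;
        return bistate;
    }

So with leader participation and N workers there are N+1 independent rings, none of which is shared between participants.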
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Sent: Thursday, May 27, 2021 3:41 PM
>
> On Thu, May 27, 2021 at 1:03 PM tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> >
> > From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
> > > Although, the 4 workers case still has performance degradation
> > > compared to serial case.
> > >
> > > SERIAL: 58759.213 ms
> > > PARALLEL 2 WORKER [NOT SKIP FSM]: 68390.221 ms [SKIP FSM]:
> > > 58633.924 ms
> > > PARALLEL 4 WORKER [NOT SKIP FSM]: 67448.142 ms [SKIP FSM]:
> > > 66,960.305 ms
> >
> > Can you see any difference in table sizes?
>
> Also, the number of pages the table occupies in each case along with table size
> would give more insights.
>
> I do as follows to get the number of pages a relation occupies:
> CREATE EXTENSION pgstattuple;
> SELECT pg_relpages('test');

It seems the difference between SKIP FSM and NOT SKIP FSM is not big. I tried several times and the average result is almost the same.

pg_relpages
-------------
1428575

pg_relation_size
-------------
11702976512 (11G)

Best regards,
houzj
On Thu, May 27, 2021 at 9:53 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> > One idea to find this out could be that we have three counters for
> > each worker which counts the number of times each worker extended the
> > relation in bulk, the number of times each worker extended the
> > relation by one block, the number of times each worker gets the page
> > from FSM. It might be possible that with this we will be able to
> > figure out why there is a difference between your and Hou-San's
> > results.
>
> Yeah, that helps. And also, the time spent in
> LockRelationForExtension, ConditionalLockRelationForExtension,
> GetPageWithFreeSpace and RelationAddExtraBlocks too can give some
> insight.
>
> My plan is to have a patch with above info added in (which I will
> share it here so that others can test and see the results too) and run
> the "case 4" where there's a regression seen on my system.

I captured below information with the attached patch 0001-test-times-and-block-counts.patch applied on top of CTAS v23 patch set. Testing details are attached in the file named "test".
Total time spent in LockRelationForExtension
Total time spent in GetPageWithFreeSpace
Total time spent in RelationAddExtraBlocks
Total number of times extended the relation in bulk
Total number of times extended the relation by one block
Total number of blocks added in bulk extension
Total number of times getting the page from FSM

Here is a summary of what I observed:
1) The execution time with 2 workers, without TABLE_INSERT_SKIP_FSM (140 sec) is more than with 0 workers (112 sec)
2) The execution time with 2 workers, with TABLE_INSERT_SKIP_FSM (225 sec) is more than with 2 workers, without TABLE_INSERT_SKIP_FSM (140 sec)
3) Majority of the time is going into waiting for relation extension lock in LockRelationForExtension. With 2 workers, without TABLE_INSERT_SKIP_FSM, out of total execution time 140 sec, the time spent in LockRelationForExtension is ~40 sec and the time spent in RelationAddExtraBlocks is ~20 sec. So, ~60 sec are being spent in these two functions. With 2 workers, with TABLE_INSERT_SKIP_FSM, out of total execution time 225 sec, the time spent in LockRelationForExtension is ~135 sec and the time spent in RelationAddExtraBlocks is 0 sec (because we skip FSM, no bulk extend logic applies). So, most of the time is being spent in LockRelationForExtension.

I'm still not sure why the execution time with 0 workers (or serial execution or no parallelism involved) on my testing system is 112 sec compared to 58 sec on Hou-San's system for the same use case. Maybe the testing system I'm using is not of the latest configuration compared to others.

Having said that, I request others to try and see if the same observations (as above) are made on their testing systems for the same use case. If others don't see regression (with just 2 workers) or they observe not much difference with and without TABLE_INSERT_SKIP_FSM, I'm open to changing the parallel inserts in CTAS code to use TABLE_INSERT_SKIP_FSM. In any case, if the observation is that there's a good amount of time being spent in LockRelationForExtension, I'm not sure (at this point) whether we can do something here or just live with it.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> I'm still not sure why the execution time with 0 workers (or serial execution or
> no parallelism involved) on my testing system is 112 sec compared to 58 sec on
> Hou-San's system for the same use case. Maybe the testing system I'm using
> is not of the latest configuration compared to others.

What's the setting of wal_level on your two systems? I thought it could be that you set it to > minimal, while Hou-san set it to minimal. (I forgot the results of 2 and 4 workers, though.)

Regards
Takayuki Tsunakawa
From: Tsunakawa, Takayuki/綱川 貴之 <tsunakawa.takay@fujitsu.com> Sent: Friday, May 28, 2021 8:55 AM
> To: 'Bharath Rupireddy' <bharath.rupireddyforpostgres@gmail.com>; Hou,
> Zhijie/侯 志杰 <houzj.fnst@fujitsu.com>
> Cc: Amit Kapila <amit.kapila16@gmail.com>; Tang, Haiying/唐 海英
> <tanghy.fnst@fujitsu.com>; PostgreSQL-development
> <pgsql-hackers@postgresql.org>; Zhihong Yu <zyu@yugabyte.com>; Luc
> Vlaming <luc@swarm64.com>; Dilip Kumar <dilipbalaut@gmail.com>;
> vignesh C <vignesh21@gmail.com>
> Subject: RE: Parallel Inserts in CREATE TABLE AS
>
> From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> > I'm still not sure why the execution time with 0 workers (or serial
> > execution or no parallelism involved) on my testing system is 112 sec
> > compared to 58 sec on Hou-San's system for the same use case. Maybe
> > the testing system I'm using is not of the latest configuration compared to
> > others.
>
> What's the setting of wal_level on your two systems? I thought it could be
> that you set it to > minimal, while Hou-san set it to minimal. (I forgot the
> results of 2 and 4 workers, though.)

I think I followed the configuration that Bharath-san mentioned. It could be a hardware difference, because I am not using SSD. I will try to test on SSD to see if there is some difference.

I only changed the following configuration:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

Best regards,
houzj
On Thu, May 27, 2021 at 7:37 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, May 27, 2021 at 9:53 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > One idea to find this out could be that we have three counters for > > > each worker which counts the number of times each worker extended the > > > relation in bulk, the number of times each worker extended the > > > relation by one block, the number of times each worker gets the page > > > from FSM. It might be possible that with this we will be able to > > > figure out why there is a difference between your and Hou-San's > > > results. > > > > Yeah, that helps. And also, the time spent in > > LockRelationForExtension, ConditionalLockRelationForExtension, > > GetPageWithFreeSpace and RelationAddExtraBlocks too can give some > > insight. > > > > My plan is to have a patch with above info added in (which I will > > share it here so that others can test and see the results too) and run > > the "case 4" where there's a regression seen on my system. > > I captured below information with the attached patch > 0001-test-times-and-block-counts.patch applied on top of CTAS v23 > patch set. Testing details are attached in the file named "test". > Total time spent in LockRelationForExtension > Total time spent in GetPageWithFreeSpace > Total time spent in RelationAddExtraBlocks > Total number of times extended the relation in bulk > Total number of times extended the relation by one block > Total number of blocks added in bulk extension > Total number of times getting the page from FSM > In your results, the number of pages each process is getting from FSM is not matching with the number of blocks added. I think we need to increment 'fsm_hit_count' in RecordAndGetPageWithFreeSpace as well because that is also called and the process can get a free page via the same. The other thing to check via debugger is when one worker adds the blocks in bulk does another parallel worker gets all those blocks. You can achieve that by allowing one worker (say worker-1) to extend the relation in bulk and then let it wait and allow another worker (say worker-2) to proceed and see if it gets all the pages added by worker-1 from FSM. You need to keep the leader also waiting or not perform any operation. -- With Regards, Amit Kapila.
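For reference, the second lookup path mentioned here is the retry inside RelationGetBufferForTuple() (hio.c); a sketch of counting FSM hits there as well, with the counter name taken from the instrumentation patch under discussion and the local variable names as in hio.c of that era:

    targetBlock = RecordAndGetPageWithFreeSpace(relation, targetBlock,
                                                pageFreeSpace, targetFreeSpace);
    if (targetBlock != InvalidBlockNumber)
        fsm_hit_count++;    /* count free pages obtained via this path too */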
On Fri, May 28, 2021 at 6:24 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> > I'm still not sure why the execution time with 0 workers (or serial execution or
> > no parallelism involved) on my testing system is 112 sec compared to 58 sec on
> > Hou-San's system for the same use case. Maybe the testing system I'm using
> > is not of the latest configuration compared to others.
>
> What's the setting of wal_level on your two systems? I thought it could be that you set it to > minimal, while Hou-san set it to minimal. (I forgot the results of 2 and 4 workers, though.)

Thanks. I was earlier running with default wal_level = replica.

Results on my system, with wal_level = minimal, PSA file "test_results2" for more details:
Without TABLE_INSERT_SKIP_FSM:
0 workers/serial execution - Time: 61875.255 ms (01:01.875)
2 workers - Time: 89227.379 ms (01:29.227)
4 workers - Time: 81484.876 ms (01:21.485)
With TABLE_INSERT_SKIP_FSM:
0 workers/serial execution - Time: 61279.764 ms (01:01.280)
2 workers - Time: 208620.453 ms (03:28.620)
4 workers - Time: 223737.081 ms (03:43.737)

Results on my system, with wal_level = replica, PSA file "test_results1" for more details:
Without TABLE_INSERT_SKIP_FSM:
0 workers/serial execution - Time: 112175.273 ms (01:52.175)
2 workers - Time: 140441.158 ms (02:20.441)
4 workers - Time: 141750.577 ms (02:21.751)
With TABLE_INSERT_SKIP_FSM:
0 workers/serial execution - Time: 112637.906 ms (01:52.638)
2 workers - Time: 225358.287 ms (03:45.358)
4 workers - Time: 242172.600 ms (04:02.173)

Results on Hou-san's system:
SERIAL: 58759.213 ms
PARALLEL 2 WORKER [NOT SKIP FSM]: 68390.221 ms [SKIP FSM]: 58633.924 ms
PARALLEL 4 WORKER [NOT SKIP FSM]: 67448.142 ms [SKIP FSM]: 66,960.305 ms

Majority of the time is being spent in LockRelationForExtension, RelationAddExtraBlocks without TABLE_INSERT_SKIP_FSM and in LockRelationForExtension with TABLE_INSERT_SKIP_FSM. The observations made at [1] still hold true with wal_level = minimal. I request Hou-san to capture the same info with the add-on patch shared earlier. This would help us to be on the same page.

We can further think on:
1) Why so much time is being spent in LockRelationForExtension?
2) Whether to use TABLE_INSERT_SKIP_FSM or not, in other words, whether to take advantage of bulk relation extension or not.
3) If bulk relation extension is to be used i.e. without TABLE_INSERT_SKIP_FSM flag, then whether the blocks being added by one worker are immediately visible to other workers or not after it finishes adding all the blocks.

[1] - https://www.postgresql.org/message-id/CALj2ACV-VToW65BE6ndDEB7S_3qhzQ_BUWtw2q6V88iwTwwPSg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Sent: Thursday, May 27, 2021 10:07 PM > On Thu, May 27, 2021 at 9:53 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > One idea to find this out could be that we have three counters for > > > each worker which counts the number of times each worker extended > > > the relation in bulk, the number of times each worker extended the > > > relation by one block, the number of times each worker gets the page > > > from FSM. It might be possible that with this we will be able to > > > figure out why there is a difference between your and Hou-San's > > > results. > > > > Yeah, that helps. And also, the time spent in > > LockRelationForExtension, ConditionalLockRelationForExtension, > > GetPageWithFreeSpace and RelationAddExtraBlocks too can give some > > insight. > > > > My plan is to have a patch with above info added in (which I will > > share it here so that others can test and see the results too) and run > > the "case 4" where there's a regression seen on my system. > > I captured below information with the attached patch > 0001-test-times-and-block-counts.patch applied on top of CTAS v23 patch set. > Testing details are attached in the file named "test". > Total time spent in LockRelationForExtension Total time spent in > GetPageWithFreeSpace Total time spent in RelationAddExtraBlocks Total > number of times extended the relation in bulk Total number of times extended > the relation by one block Total number of blocks added in bulk extension Total > number of times getting the page from FSM > > Here is a summary of what I observed: > 1) The execution time with 2 workers, without TABLE_INSERT_SKIP_FSM > (140 sec) is more than with 0 workers (112 sec) > 2) The execution time with 2 workers, with TABLE_INSERT_SKIP_FSM (225 > sec) is more than with 2 workers, without TABLE_INSERT_SKIP_FSM (140 > sec) > 3) Majority of the time is going into waiting for relation extension lock in > LockRelationForExtension. With 2 workers, without TABLE_INSERT_SKIP_FSM, > out of total execution time 140 sec, the time spent in LockRelationForExtension > is ~40 sec and the time spent in RelationAddExtraBlocks is ~20 sec. So, ~60 sec > are being spent in these two functions. With 2 workers, with > TABLE_INSERT_SKIP_FSM, out of total execution time 225 sec, the time spent > in LockRelationForExtension is ~135 sec and the time spent in > RelationAddExtraBlocks is 0 sec (because we skip FSM, no bulk extend logic > applies). So, most of the time is being spent in LockRelationForExtension. > > I'm still not sure why the execution time with 0 workers (or serial execution or > no parallelism involved) on my testing system is 112 sec compared to 58 sec on > Hou-San's system for the same use case. Maybe the testing system I'm using is > not of the latest configuration compared to others. > > Having said that, I request others to try and see if the same observations (as > above) are made on their testing systems for the same use case. If others don't > see regression (with just 2 workers) or they observe not much difference with > and without TABLE_INSERT_SKIP_FSM. Thanks for the patch ! I attached my test results. Note I did not change the wal_level to minimal. 
I only changed the following configuration:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

Best regards,
houzj
Attachment
On Fri, May 28, 2021 at 8:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 27, 2021 at 7:37 PM Bharath Rupireddy
> >
> > I captured below information with the attached patch
> > 0001-test-times-and-block-counts.patch applied on top of CTAS v23
> > patch set. Testing details are attached in the file named "test".
> > Total time spent in LockRelationForExtension
> > Total time spent in GetPageWithFreeSpace
> > Total time spent in RelationAddExtraBlocks
> > Total number of times extended the relation in bulk
> > Total number of times extended the relation by one block
> > Total number of blocks added in bulk extension
> > Total number of times getting the page from FSM
> >
> In your results, the number of pages each process is getting from FSM
> is not matching with the number of blocks added. I think we need to
> increment 'fsm_hit_count' in RecordAndGetPageWithFreeSpace as well
> because that is also called and the process can get a free page via
> the same. The other thing to check via debugger is when one worker
> adds the blocks in bulk does another parallel worker gets all those
> blocks. You can achieve that by allowing one worker (say worker-1) to
> extend the relation in bulk and then let it wait and allow another
> worker (say worker-2) to proceed and see if it gets all the pages
> added by worker-1 from FSM. You need to keep the leader also waiting
> or not perform any operation.
>

While looking at results, I have observed one more thing that we are trying to parallelize I/O due to which we might not be seeing benefit in such cases. I think even for non-write queries there won't be any (much) benefit if we can't parallelize CPU usage. Basically, the test you are doing is for statement: explain analyze verbose create table test as select * from tenk1;. Now, in this statement, there is no qualification and still, the Gather node is generated for it, this won't be the case if we check "select * from tenk1". Is it due to the reason that the patch completely ignores the parallel_tuple_cost? But still, it should prefer a serial plan due to parallel_setup_cost, why is that not happening? Anyway, I think we should not parallelize such queries where we can't parallelize CPU usage. Have you tried the cases without changing any of the costings for parallelism?

--
With Regards,
Amit Kapila.
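For context on the costing question, the part of cost_gather() (src/backend/optimizer/path/costsize.c) that charges for parallelism is shown below. If the planner is told that essentially no rows flow up through the Gather, the per-tuple term vanishes, leaving only parallel_setup_cost to be offset by the cheaper parallel subpath, which presumably explains why a large scan still ends up under a Gather even without any qualification.

    /* cost_gather() in costsize.c: parallel setup and communication cost */
    startup_cost += parallel_setup_cost;
    run_cost += parallel_tuple_cost * path->path.rows;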
On Sat, May 29, 2021 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> While looking at results, I have observed one more thing that we are
> trying to parallelize I/O due to which we might not be seeing benefit
> in such cases. I think even for non-write queries there won't be any
> (much) benefit if we can't parallelize CPU usage. Basically, the test
> you are doing is for statement: explain analyze verbose create table
> test as select * from tenk1;. Now, in this statement, there is no
> qualification and still, the Gather node is generated for it, this
> won't be the case if we check "select * from tenk1". Is it due to the
> reason that the patch completely ignores the parallel_tuple_cost? But
> still, it should prefer a serial plan due to parallel_setup_cost, why is
> that not happening? Anyway, I think we should not parallelize such
> queries where we can't parallelize CPU usage. Have you tried the cases
> without changing any of the costings for parallelism?

Hi, I measured the execution timings for parallel inserts in CTAS in cases where the planner chooses parallelism for the SELECT naturally. This means I have used only the 0001 patch from the v23 patch set at [1]. I have not used the 0002 patch that makes parallel_tuple_cost 0.

The query used for all these tests is below. Also, I attached the table creation SQL in the file "test_cases".
EXPLAIN (ANALYZE, VERBOSE) create table test1 as select * from tenk1 t1, tenk2 t2 where t1.c1 = t2.d2;

All the results are of the form (number of workers, exec time in milliseconds).

Test case 1: both tenk1 and tenk2 are tables with 1 integer(of 4 bytes) column, tuple size 28 bytes, 100mn tuples
master: (0, 277886.951 ms), (2, 171183.221 ms), (4, 159703.496 ms)
with parallel inserts CTAS patch: (0, 264709.186 ms), (2, 128354.448 ms), (4, 111533.731 ms)

Test case 2: both tenk1 and tenk2 are tables with 2 integer(of 4 bytes each) columns, 3 varchar(8), tuple size 59 bytes, 100mn tuples
master: (0, 453505.228 ms), (2, 236762.759 ms), (4, 219038.126 ms)
with parallel inserts CTAS patch: (0, 470483.818 ms), (2, 219374.198 ms), (4, 203543.681 ms)

Test case 3: both tenk1 and tenk2 are tables with 2 bigint(of 8 bytes each) columns, 3 name(of 64 bytes each) columns, 1 varchar(8), tuple size 241 bytes, 100mn tuples
master: (0, 1052725.928 ms), (2, 592968.486 ms), (4, 562137.891 ms)
with parallel inserts CTAS patch: (0, 1019086.805 ms), (2, 634448.322 ms), (4, 680793.305 ms)

Test case 4: both tenk1 and tenk2 are tables with 2 bigint(of 8 bytes each) columns, 16 name(of 64 bytes each) columns, tuple size 1064 bytes, 10mn tuples
master: (0, 371931.497 ms), (2, 247206.841 ms), (4, 241959.839 ms)
with parallel inserts CTAS patch: (0, 396342.329 ms), (2, 333860.472 ms), (4, 317895.558 ms)

Observation: parallel insert + parallel select gives a good benefit with smaller tuple sizes (cases 1 and 2). If the tuple size is bigger, serial insert + parallel select fares better (cases 3 and 4).

In the coming days, I will try to work on more performance analysis and clarify some of the points raised upthread.

[1] - https://www.postgresql.org/message-id/CALj2ACXVWr1o%2BFZrkQt-2GvYfuMQeJjWohajmp62Wr6BU8Y4VA%40mail.gmail.com
[2] - postgresql.conf changes I made:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = on
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
wal_level = replica

With Regards,
Bharath Rupireddy.