Thread: INSERT INTO SELECT, Why Parallelism is not selected?

INSERT INTO SELECT, Why Parallelism is not selected?

From
Dilip Kumar
Date:
I have just notice that the parallelism is off even for the select
part of the query mentioned in the $subject.  I see the only reason it
is not getting parallel because we block the parallelism if the query
type is not SELECT.  I don't see any reason for not selecting the
parallelism for this query.  I have quickly hacked the code to enable
the parallelism for this query.  I can see there is a significant
improvement if we can enable the parallelism in this case.  For an
experiment, I have just relaxed a couple of checks, maybe if we think
that it's good to enable the parallelism for this case we can try to
put better checks which are specific for this quey.

No parallel:
postgres[36635]=# explain analyze insert into t2 select * from t where a < 100;
 Insert on t2  (cost=0.00..29742.00 rows=100 width=105) (actual
time=278.300..278.300 rows=0 loops=1)
   ->  Seq Scan on t  (cost=0.00..29742.00 rows=100 width=105) (actual
time=0.061..277.330 rows=99 loops=1)
         Filter: (a < 100)
         Rows Removed by Filter: 999901
 Planning Time: 0.093 ms
 Execution Time: 278.330 ms
(6 rows)

With parallel
 Insert on t2  (cost=1000.00..23460.33 rows=100 width=105) (actual
time=108.410..108.410 rows=0 loops=1)
   ->  Gather  (cost=1000.00..23460.33 rows=100 width=105) (actual
time=0.306..108.973 rows=99 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Seq Scan on t  (cost=0.00..22450.33 rows=42
width=105) (actual time=66.396..101.979 rows=33 loops=3)
               Filter: (a < 100)
               Rows Removed by Filter: 333300
 Planning Time: 0.154 ms
 Execution Time: 110.158 ms
(9 rows)

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Sat, Jul 11, 2020 at 6:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have just notice that the parallelism is off even for the select
> part of the query mentioned in the $subject.  I see the only reason it
> is not getting parallel because we block the parallelism if the query
> type is not SELECT.  I don't see any reason for not selecting the
> parallelism for this query.  I have quickly hacked the code to enable
> the parallelism for this query.  I can see there is a significant
> improvement if we can enable the parallelism in this case.  For an
> experiment, I have just relaxed a couple of checks, maybe if we think
> that it's good to enable the parallelism for this case we can try to
> put better checks which are specific for this quey.
>

+1.  I also don't see any problem with this idea considering we will
find a better way to enable the parallelism for this case because we
can already use parallelism for statements like "create table
<tbl_name> as select ...".  I think we can do more than this by
parallelizing the Insert part of this query as well as we have lifted
group locking restrictions related to RelationExtension and Page lock
in PG13.  It would be really cool to do that unless we see any
fundamental problems with it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jul 11, 2020 at 6:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have just notice that the parallelism is off even for the select
> > part of the query mentioned in the $subject.  I see the only reason it
> > is not getting parallel because we block the parallelism if the query
> > type is not SELECT.  I don't see any reason for not selecting the
> > parallelism for this query.  I have quickly hacked the code to enable
> > the parallelism for this query.  I can see there is a significant
> > improvement if we can enable the parallelism in this case.  For an
> > experiment, I have just relaxed a couple of checks, maybe if we think
> > that it's good to enable the parallelism for this case we can try to
> > put better checks which are specific for this quey.
> >
>
> +1.  I also don't see any problem with this idea considering we will
> find a better way to enable the parallelism for this case because we
> can already use parallelism for statements like "create table
> <tbl_name> as select ...".

Okay, thanks for the feedback.

  I think we can do more than this by
> parallelizing the Insert part of this query as well as we have lifted
> group locking restrictions related to RelationExtension and Page lock
> in PG13.  It would be really cool to do that unless we see any
> fundamental problems with it.

+1, I think it will be cool to support for insert part here as well as
insert part in 'Create Table As Select..' as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Rushabh Lathia
Date:


On Sat, Jul 11, 2020 at 6:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have just notice that the parallelism is off even for the select
part of the query mentioned in the $subject.  I see the only reason it
is not getting parallel because we block the parallelism if the query
type is not SELECT.  I don't see any reason for not selecting the
parallelism for this query.  I have quickly hacked the code to enable
the parallelism for this query.  I can see there is a significant
improvement if we can enable the parallelism in this case.  For an
experiment, I have just relaxed a couple of checks, maybe if we think
that it's good to enable the parallelism for this case we can try to
put better checks which are specific for this quey.


+1 for the idea.  For the given example also it shows a good performance
gain and I also don't any reason on restrict the parallel case for INSERT INTO SELECT.
 
No parallel:
postgres[36635]=# explain analyze insert into t2 select * from t where a < 100;
 Insert on t2  (cost=0.00..29742.00 rows=100 width=105) (actual
time=278.300..278.300 rows=0 loops=1)
   ->  Seq Scan on t  (cost=0.00..29742.00 rows=100 width=105) (actual
time=0.061..277.330 rows=99 loops=1)
         Filter: (a < 100)
         Rows Removed by Filter: 999901
 Planning Time: 0.093 ms
 Execution Time: 278.330 ms
(6 rows)

With parallel
 Insert on t2  (cost=1000.00..23460.33 rows=100 width=105) (actual
time=108.410..108.410 rows=0 loops=1)
   ->  Gather  (cost=1000.00..23460.33 rows=100 width=105) (actual
time=0.306..108.973 rows=99 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Seq Scan on t  (cost=0.00..22450.33 rows=42
width=105) (actual time=66.396..101.979 rows=33 loops=3)
               Filter: (a < 100)
               Rows Removed by Filter: 333300
 Planning Time: 0.154 ms
 Execution Time: 110.158 ms
(9 rows)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


--
Rushabh Lathia

Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Robert Haas
Date:
On Sat, Jul 11, 2020 at 8:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have just notice that the parallelism is off even for the select
> part of the query mentioned in the $subject.  I see the only reason it
> is not getting parallel because we block the parallelism if the query
> type is not SELECT.  I don't see any reason for not selecting the
> parallelism for this query.

There's a relevant comment near the top of heap_prepare_insert().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Wed, Jul 15, 2020 at 12:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jul 11, 2020 at 8:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have just notice that the parallelism is off even for the select
> > part of the query mentioned in the $subject.  I see the only reason it
> > is not getting parallel because we block the parallelism if the query
> > type is not SELECT.  I don't see any reason for not selecting the
> > parallelism for this query.
>
> There's a relevant comment near the top of heap_prepare_insert().
>

I think that is no longer true after commits 85f6b49c2c and 3ba59ccc89
where we have allowed relation extension and page locks to conflict
among group members.  We have accordingly changed comments at a few
places but forgot to update this one.  I will check and see if any
other similar comments are there which needs to be updated.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Wed, Jul 15, 2020 at 8:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 12:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Sat, Jul 11, 2020 at 8:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I have just notice that the parallelism is off even for the select
> > > part of the query mentioned in the $subject.  I see the only reason it
> > > is not getting parallel because we block the parallelism if the query
> > > type is not SELECT.  I don't see any reason for not selecting the
> > > parallelism for this query.
> >
> > There's a relevant comment near the top of heap_prepare_insert().
> >
>
> I think that is no longer true after commits 85f6b49c2c and 3ba59ccc89
> where we have allowed relation extension and page locks to conflict
> among group members.  We have accordingly changed comments at a few
> places but forgot to update this one.  I will check and see if any
> other similar comments are there which needs to be updated.
>

The attached patch fixes the comments.  Let me know if you think I
have missed anything or any other comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Dilip Kumar
Date:
On Thu, Jul 16, 2020 at 8:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 8:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jul 15, 2020 at 12:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Sat, Jul 11, 2020 at 8:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > I have just notice that the parallelism is off even for the select
> > > > part of the query mentioned in the $subject.  I see the only reason it
> > > > is not getting parallel because we block the parallelism if the query
> > > > type is not SELECT.  I don't see any reason for not selecting the
> > > > parallelism for this query.
> > >
> > > There's a relevant comment near the top of heap_prepare_insert().
> > >
> >
> > I think that is no longer true after commits 85f6b49c2c and 3ba59ccc89
> > where we have allowed relation extension and page locks to conflict
> > among group members.  We have accordingly changed comments at a few
> > places but forgot to update this one.  I will check and see if any
> > other similar comments are there which needs to be updated.
> >
>
> The attached patch fixes the comments.  Let me know if you think I
> have missed anything or any other comments.

Your comments look good to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Robert Haas
Date:
On Wed, Jul 15, 2020 at 11:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The attached patch fixes the comments.  Let me know if you think I
> have missed anything or any other comments.

If it's safe now, why not remove the error check?

(Is it safe? Could there be other problems?)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Thu, Jul 16, 2020 at 6:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 11:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The attached patch fixes the comments.  Let me know if you think I
> > have missed anything or any other comments.
>
> If it's safe now, why not remove the error check?
>

I think it is not safe for all kind of Inserts (see my response later
in email), so we need some smarts to identify un-safe inserts before
we can open this check.

> (Is it safe? Could there be other problems?)
>

I think we need to be careful of two things: (a) Do we want to enable
parallel inserts where tuple locks are involved, forex. in statements
like "Insert into primary_tbl Select * from secondary_tbl Where col <
10 For Update;"?  In such statements, I don't see any problem because
each worker will operate on a separate page and even if the leader
already has a lock on the tuple, it will be granted to the worker as
it is taken in the same transaction.  (b) The insert statements that
can generate 'CommandIds' which can happen while insert into tables
with foreign keys, see below test:

CREATE TABLE primary_tbl(index INTEGER PRIMARY KEY, height real, weight real);
insert into primary_tbl values(1, 1.1, 100);
insert into primary_tbl values(2, 1.2, 100);
insert into primary_tbl values(3, 1.3, 100);

CREATE TABLE secondary_tbl(index INTEGER REFERENCES
primary_tbl(index), height real, weight real);

insert into secondary_tbl values(generate_series(1,3),1.2,200);

Here we can't parallelise statements like "insert into secondary_tbl
values(generate_series(1,3),1.2,200);" as they will generate
'CommandIds' for each row insert into table with foreign key.  The new
command id is generated while performing a foreign key check.  Now, it
is a separate question whether generating a command id for each row
insert is required or not but as of now we can't parallelize such
statements.

Do you have something else in mind?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Fri, Jul 17, 2020 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Do you have something else in mind?
>

I am planning to commit the comments change patch attached in the
above email [1] next week sometime (probably Monday or Tuesday) unless
you have something more to add?


[1] -
https://www.postgresql.org/message-id/CAA4eK1%2BRL7c_s%3D%2BTwAE6DJ1MmupbEiGCFLt97US%2BDMm6UxAjTA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Robert Haas
Date:
On Fri, Jul 24, 2020 at 7:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jul 17, 2020 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Do you have something else in mind?
>
> I am planning to commit the comments change patch attached in the
> above email [1] next week sometime (probably Monday or Tuesday) unless
> you have something more to add?

Well, I think the comments could be more clear - for the insert case
specifically - about which cases you think are and are not safe.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Well, I think the comments could be more clear - for the insert case
> specifically - about which cases you think are and are not safe.

Yeah, the proposed comment changes don't actually add much.  Also
please try to avoid inserting non-ASCII   into the source code;
at least in my mail reader, that attachment has some.

            regards, tom lane



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Fri, Jul 24, 2020 at 7:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > Well, I think the comments could be more clear - for the insert case
> > specifically - about which cases you think are and are not safe.
>

Okay, I'll update the patch accordingly.

> Yeah, the proposed comment changes don't actually add much.  Also
> please try to avoid inserting non-ASCII   into the source code;
> at least in my mail reader, that attachment has some.
>

I don't see any non-ASCII characters in the patch.  I have applied and
checked (via vi editor) the patch as well but don't see any non-ASCII
characters.  How can I verify that?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Fri, Jul 24, 2020 at 7:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, the proposed comment changes don't actually add much.  Also
>> please try to avoid inserting non-ASCII   into the source code;
>> at least in my mail reader, that attachment has some.

> I don't see any non-ASCII characters in the patch.  I have applied and
> checked (via vi editor) the patch as well but don't see any non-ASCII
> characters.  How can I verify that?

They're definitely there:

$ od -c 0001-Fix-comments-in-heapam.c.patch
...
0002740   h   e  \n   +  \t       *       l   e   a   d   e   r       c
0002760   a   n       p   e   r   f   o   r   m       t   h   e       i
0003000   n   s   e   r   t   .     302 240   T   h   i   s       r   e
0003020   s   t   r   i   c   t   i   o   n       c   a   n       b   e
0003040       u   p   l   i   f   t   e   d       o   n   c   e       w
0003060   e  \n   +  \t       *       a   l   l   o   w       t   h   e
0003100 302 240   p   l   a   n   n   e   r       t   o       g   e   n
0003120   e   r   a   t   e       p   a   r   a   l   l   e   l       p
0003140   l   a   n   s       f   o   r       i   n   s   e   r   t   s
0003160   .  \n      \t       *   /  \n      \t   i   f       (   I   s
...

I'm not sure if "git diff --check" would whine about this.

            regards, tom lane



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Sat, Jul 25, 2020 at 8:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Fri, Jul 24, 2020 at 7:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Yeah, the proposed comment changes don't actually add much.  Also
> >> please try to avoid inserting non-ASCII   into the source code;
> >> at least in my mail reader, that attachment has some.
>
> > I don't see any non-ASCII characters in the patch.  I have applied and
> > checked (via vi editor) the patch as well but don't see any non-ASCII
> > characters.  How can I verify that?
>
> They're definitely there:
>
> $ od -c 0001-Fix-comments-in-heapam.c.patch
> ...
> 0002740   h   e  \n   +  \t       *       l   e   a   d   e   r       c
> 0002760   a   n       p   e   r   f   o   r   m       t   h   e       i
> 0003000   n   s   e   r   t   .     302 240   T   h   i   s       r   e
> 0003020   s   t   r   i   c   t   i   o   n       c   a   n       b   e
> 0003040       u   p   l   i   f   t   e   d       o   n   c   e       w
> 0003060   e  \n   +  \t       *       a   l   l   o   w       t   h   e
> 0003100 302 240   p   l   a   n   n   e   r       t   o       g   e   n
> 0003120   e   r   a   t   e       p   a   r   a   l   l   e   l       p
> 0003140   l   a   n   s       f   o   r       i   n   s   e   r   t   s
> 0003160   .  \n      \t       *   /  \n      \t   i   f       (   I   s
> ...
>

Thanks, I could see that.

> I'm not sure if "git diff --check" would whine about this.
>

No, "git diff --check" doesn't help.  I have tried pgindent but that
also doesn't help neither was I expecting it to help.  I am still not
able to figure out how I goofed up this but will spend some more time
on this.  In the meantime, I have updated the patch to improve the
comments as suggested by Robert.  Do let me know if you want to
edit/add something more?

-- 
With Regards,
Amit Kapila.

Attachment

Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Sun, Jul 26, 2020 at 4:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 8:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
>
> No, "git diff --check" doesn't help.  I have tried pgindent but that
> also doesn't help neither was I expecting it to help.  I am still not
> able to figure out how I goofed up this but will spend some more time
> on this.
>

I think I have figured out how the patch ended up having  . Some
editors add it when we use two spaces after a period (.) and I could
see that my Gmail client does that for such a pattern.  Normally, I
use one of MSVC, vi, or NetBeans IDE to develop code/patch but
sometimes I check the comments by pasting in Gmail client to find
typos or such and then fix them manually. I guess in this case I have
used Gmail client to write this comment and then copy/pasted it for
the patch. I will be careful not to do this in the future.

-- 
With Regards,
Amit Kapila.



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Robert Haas
Date:
On Sun, Jul 26, 2020 at 7:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> No, "git diff --check" doesn't help.  I have tried pgindent but that
> also doesn't help neither was I expecting it to help.  I am still not
> able to figure out how I goofed up this but will spend some more time
> on this.  In the meantime, I have updated the patch to improve the
> comments as suggested by Robert.  Do let me know if you want to
> edit/add something more?

I still don't agree with this as proposed.

+ * For now, we don't allow parallel inserts of any form not even where the
+ * leader can perform the insert.  This restriction can be uplifted once
+ * we allow the planner to generate parallel plans for inserts.  We can

If I'm understanding this correctly, this logic is completely
backwards. We don't prohibit inserts here because we know the planner
can't generate them. We prohibit inserts here because, if the planner
somehow did generate them, it wouldn't be safe. You're saying that
it's not allowed because we don't try to do it yet, but actually it's
not allowed because we want to make sure that we don't accidentally
try to do it. That's very different.

+ * parallelize inserts unless they generate a new commandid (ex. inserts
+ * into a table having foreign key column) or lock tuples (ex. statements
+ * like Insert .. Select For Update).

I understand the part about generating new command IDs, but not the
part about locking tuples. Why would that be a problem? Can it better
explained here?

Examples in comments are typically introduced with e.g., not ex.

+ * We should be able to parallelize
+ * the later case if we can ensure that no two parallel processes can ever
+ * operate on the same page.

I don't know whether this is talking about two processes operating on
the same page at the same time, or ever within a single query
execution. If it's the former, perhaps we need to explain why that's a
concern for parallel query but not otherwise; if it's the latter, that
seems impossible to guarantee and imagining that we'll ever be able to
do so seems like wishful thinking.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Wed, Jul 29, 2020 at 7:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I still don't agree with this as proposed.
>
> + * For now, we don't allow parallel inserts of any form not even where the
> + * leader can perform the insert.  This restriction can be uplifted once
> + * we allow the planner to generate parallel plans for inserts.  We can
>
> If I'm understanding this correctly, this logic is completely
> backwards. We don't prohibit inserts here because we know the planner
> can't generate them. We prohibit inserts here because, if the planner
> somehow did generate them, it wouldn't be safe. You're saying that
> it's not allowed because we don't try to do it yet, but actually it's
> not allowed because we want to make sure that we don't accidentally
> try to do it. That's very different.
>

Right, so how about something like: "To allow parallel inserts, we
need to ensure that they are safe to be performed in workers.  We have
the infrastructure to allow parallel inserts in general except for the
case where inserts generate a new commandid (eg. inserts into a table
having a foreign key column)."  We can extend this for tuple locking
if required as per the below discussion.  Kindly suggest if you prefer
a different wording here.

>
> + * We should be able to parallelize
> + * the later case if we can ensure that no two parallel processes can ever
> + * operate on the same page.
>
> I don't know whether this is talking about two processes operating on
> the same page at the same time, or ever within a single query
> execution. If it's the former, perhaps we need to explain why that's a
> concern for parallel query but not otherwise;
>

I am talking about the former case and I know that as per current
design it is not possible that two worker processes try to operate on
the same page but I was trying to be pessimistic so that we can ensure
that via some form of Assert.  I don't know whether it is important to
mention this case or not but for the sake of extra safety, I have
mentioned it.

-- 
With Regards,
Amit Kapila.



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Thu, Jul 30, 2020 at 12:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 29, 2020 at 7:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > I still don't agree with this as proposed.
> >
> > + * For now, we don't allow parallel inserts of any form not even where the
> > + * leader can perform the insert.  This restriction can be uplifted once
> > + * we allow the planner to generate parallel plans for inserts.  We can
> >
> > If I'm understanding this correctly, this logic is completely
> > backwards. We don't prohibit inserts here because we know the planner
> > can't generate them. We prohibit inserts here because, if the planner
> > somehow did generate them, it wouldn't be safe. You're saying that
> > it's not allowed because we don't try to do it yet, but actually it's
> > not allowed because we want to make sure that we don't accidentally
> > try to do it. That's very different.
> >
>
> Right, so how about something like: "To allow parallel inserts, we
> need to ensure that they are safe to be performed in workers.  We have
> the infrastructure to allow parallel inserts in general except for the
> case where inserts generate a new commandid (eg. inserts into a table
> having a foreign key column)."  We can extend this for tuple locking
> if required as per the below discussion.  Kindly suggest if you prefer
> a different wording here.
>
> >
> > + * We should be able to parallelize
> > + * the later case if we can ensure that no two parallel processes can ever
> > + * operate on the same page.
> >
> > I don't know whether this is talking about two processes operating on
> > the same page at the same time, or ever within a single query
> > execution. If it's the former, perhaps we need to explain why that's a
> > concern for parallel query but not otherwise;
> >
>
> I am talking about the former case and I know that as per current
> design it is not possible that two worker processes try to operate on
> the same page but I was trying to be pessimistic so that we can ensure
> that via some form of Assert.
>

I think the two worker processes can operate on the same page for a
parallel index scan case but it won't be for same tuple. I am not able
to think of any case where we should be worried about tuple locking
for Insert's case, so we can probably skip writing anything about it
unless someone else can think of such a case.


-- 
With Regards,
Amit Kapila.



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Bharath Rupireddy
Date:
On Tue, Jul 14, 2020 at 1:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > I think we can do more than this by
> > parallelizing the Insert part of this query as well as we have lifted
> > group locking restrictions related to RelationExtension and Page lock
> > in PG13.  It would be really cool to do that unless we see any
> > fundamental problems with it.
>
> +1, I think it will be cool to support for insert part here as well as
> insert part in 'Create Table As Select..' as well.
>

+1 to parallelize inserts. Currently, ExecInsert() and CTAS use
table_tuple_insert(), if we parallelize these parts, each worker will
be inserting it's tuples(one tuple at a time) into the same data page,
until space is available, if not a new data page can be obtained by
any of the worker, others might start inserting into it. This way,
will there be lock contention on data pages?. Do we also need to make
inserts to use table_multi_insert() (like the way "COPY" uses) instead
of table_tuple_insert()?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Tue, Aug 18, 2020 at 1:37 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Jul 14, 2020 at 1:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > I think we can do more than this by
> > > parallelizing the Insert part of this query as well as we have lifted
> > > group locking restrictions related to RelationExtension and Page lock
> > > in PG13.  It would be really cool to do that unless we see any
> > > fundamental problems with it.
> >
> > +1, I think it will be cool to support for insert part here as well as
> > insert part in 'Create Table As Select..' as well.
> >
>
> +1 to parallelize inserts. Currently, ExecInsert() and CTAS use
> table_tuple_insert(), if we parallelize these parts, each worker will
> be inserting it's tuples(one tuple at a time) into the same data page,
> until space is available, if not a new data page can be obtained by
> any of the worker, others might start inserting into it. This way,
> will there be lock contention on data pages?
>

It is possible but we need to check how much that is a bottleneck
because that should not be a big part of the operation. And, it won't
be any worse than inserts via multiple backends. I think it is
important to do that way, otherwise, some of the pages can remain
half-empty.

Right now, the plan for Insert ... Select is like
Insert on <tbl_x>
   ->  Seq Scan on <tbl_y>
         ....

In the above the scan could be index scan as well. What we want is:
Gather
  -> Insert on <tbl_x>
       ->  Seq Scan on <tbl_y>
         ....

>. Do we also need to make
> inserts to use table_multi_insert() (like the way "COPY" uses) instead
> of table_tuple_insert()?
>

I am not sure at this stage but if it turns out to be a big problem
then we might think of inventing some way to allow individual workers
to operate on different pages. I think even without that we should be
able to make a big gain as reads, filtering, etc can be done in
parallel.

-- 
With Regards,
Amit Kapila.



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Thu, Jul 30, 2020 at 6:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 30, 2020 at 12:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jul 29, 2020 at 7:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > I still don't agree with this as proposed.
> > >
> > > + * For now, we don't allow parallel inserts of any form not even where the
> > > + * leader can perform the insert.  This restriction can be uplifted once
> > > + * we allow the planner to generate parallel plans for inserts.  We can
> > >
> > > If I'm understanding this correctly, this logic is completely
> > > backwards. We don't prohibit inserts here because we know the planner
> > > can't generate them. We prohibit inserts here because, if the planner
> > > somehow did generate them, it wouldn't be safe. You're saying that
> > > it's not allowed because we don't try to do it yet, but actually it's
> > > not allowed because we want to make sure that we don't accidentally
> > > try to do it. That's very different.
> > >
> >
> > Right, so how about something like: "To allow parallel inserts, we
> > need to ensure that they are safe to be performed in workers.  We have
> > the infrastructure to allow parallel inserts in general except for the
> > case where inserts generate a new commandid (eg. inserts into a table
> > having a foreign key column)."

Robert, Dilip, do you see any problem if we change the comment on the
above lines? Feel free to suggest if you have something better in
mind.

> >  We can extend this for tuple locking
> > if required as per the below discussion.  Kindly suggest if you prefer
> > a different wording here.
> >

I feel we can leave this based on the reasoning provided below.

> > >
> > > + * We should be able to parallelize
> > > + * the later case if we can ensure that no two parallel processes can ever
> > > + * operate on the same page.
> > >
> > > I don't know whether this is talking about two processes operating on
> > > the same page at the same time, or ever within a single query
> > > execution. If it's the former, perhaps we need to explain why that's a
> > > concern for parallel query but not otherwise;
> > >
> >
> > I am talking about the former case and I know that as per current
> > design it is not possible that two worker processes try to operate on
> > the same page but I was trying to be pessimistic so that we can ensure
> > that via some form of Assert.
> >
>
> I think the two worker processes can operate on the same page for a
> parallel index scan case but it won't be for same tuple. I am not able
> to think of any case where we should be worried about tuple locking
> for Insert's case, so we can probably skip writing anything about it
> unless someone else can think of such a case.
>

-- 
With Regards,
Amit Kapila.



Re: INSERT INTO SELECT, Why Parallelism is not selected?

From
Amit Kapila
Date:
On Wed, Sep 9, 2020 at 10:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 30, 2020 at 6:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jul 30, 2020 at 12:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jul 29, 2020 at 7:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > >
> > > > I still don't agree with this as proposed.
> > > >
> > > > + * For now, we don't allow parallel inserts of any form not even where the
> > > > + * leader can perform the insert.  This restriction can be uplifted once
> > > > + * we allow the planner to generate parallel plans for inserts.  We can
> > > >
> > > > If I'm understanding this correctly, this logic is completely
> > > > backwards. We don't prohibit inserts here because we know the planner
> > > > can't generate them. We prohibit inserts here because, if the planner
> > > > somehow did generate them, it wouldn't be safe. You're saying that
> > > > it's not allowed because we don't try to do it yet, but actually it's
> > > > not allowed because we want to make sure that we don't accidentally
> > > > try to do it. That's very different.
> > > >
> > >
> > > Right, so how about something like: "To allow parallel inserts, we
> > > need to ensure that they are safe to be performed in workers.  We have
> > > the infrastructure to allow parallel inserts in general except for the
> > > case where inserts generate a new commandid (eg. inserts into a table
> > > having a foreign key column)."
>
> Robert, Dilip, do you see any problem if we change the comment on the
> above lines? Feel free to suggest if you have something better in
> mind.
>

Hearing no further comments, I have pushed the changes as discussed above.

-- 
With Regards,
Amit Kapila.