Re: Parallel copy - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Parallel copy
Date
Msg-id CAA4eK1Jq6-=TYTj37XnWioaOkZ+nY0ipAuNyT=6AfU_0iPvqXA@mail.gmail.com
Whole thread Raw
In response to Re: Parallel copy  (Ashutosh Sharma <ashu.coek88@gmail.com>)
Responses Re: Parallel copy
Re: Parallel copy
List pgsql-hackers
On Fri, Jun 12, 2020 at 4:57 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi All,
>
> I've spent little bit of time going through the project discussion that has happened in this email thread and to
startwith I have few questions which I would like to put here: 
>
> Q1) Are we also planning to read the input data in parallel or is it only about performing the multi-insert operation
inparallel? AFAIU, the data reading part will be done by the leader process alone so no parallelism is involved there. 
>

Yes, your understanding is correct.

> Q2) How are we going to deal with the partitioned tables?
>

I haven't studied the patch but my understanding is that we will
support parallel copy for partitioned tables with a few restrictions
as explained in my earlier email [1].  See, Case-2 (b) in the email.

> I mean will there be some worker process dedicated for each partition or how is it?

No, it the split is just based on the input, otherwise each worker
should insert as we would have done without any workers.

> Q3) Incase of toast tables, there is a possibility of having a single tuple in the input file which could be of a
verybig size (probably in GB) eventually resulting in a bigger file size. So, in this case, how are we going to decide
thenumber of worker processes to be launched. I mean, although the file size is big, but the number of tuples to be
processedis just one or few of them, so, can we decide the number of the worker processes to be launched based on the
filesize? 
>

Yeah, such situations would be tricky, so we should have an option for
user to specify the number of workers.

> Q4) Who is going to process constraints (preferably the deferred constraint) that is supposed to be executed at the
COMMITtime? I mean is it the leader process or the worker process or in such cases we won't be choosing the parallelism
atall? 
>

In the first version, we won't do parallelism for this.  Again, see
one of my earlier email [1] where I have explained this and other
cases where we won't be supporting parallel copy.

> Q5) Do we have any risk of table bloating when the data is loaded in parallel. I am just asking this because incase
ofparallelism there would be multiple processes performing bulk insert into a table. There is a chance that the table
filemight get extended even if there is some space into the file being written into, but that space is locked by some
otherworker process and hence that might result in a creation of a new block for that table. Sorry, if I am missing
somethinghere. 
>

Hmm, each worker will operate at page level, after first insertion,
the same worker will try to insert in the same page in which it has
inserted last, so there shouldn't be such a problem.

> Please note that I haven't gone through all the emails in this thread so there is a possibility that I might have
repeatedthe question that has already been raised and answered here. If that is the case, I am sorry for that, but it
wouldbe very helpful if someone could point out that thread so that I can go through it. Thank you. 
>

No problem, I understand sometimes it is difficult to go through each
and every email especially when the discussion is long.  Anyway,
thanks for showing the interest in the patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Next
From: Michael Paquier
Date:
Subject: Re: doc examples for pghandler