Thread: Parallel copy

Parallel copy

From
Amit Kapila
Date:
This work is to parallelize the copy command and in particular "Copy
<table_name> from 'filename' Where <condition>;" command.

Before going into how and what portion of the COPY command's processing
we can parallelize, let us briefly look at the top-level operations we
perform while copying from a file into a table.  We read the file in
64KB chunks, then find the line endings and process that data line by
line, where each line corresponds to one tuple.  We first form the
tuple (as a values/nulls array) from that line, check whether it
satisfies the WHERE condition and, if so, perform constraint checks and
a few other checks, and then finally store it in a local tuple array.
Once we have accumulated 1000 tuples or consumed 64KB (whichever occurs
first), we insert them together via the table_multi_insert API and
then, for each tuple, insert into the index(es) and execute after-row
triggers.
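
To make that flow concrete, here is a rough stand-alone sketch of the
serial loop (plain C, not the actual copy.c code; parse_line(),
check_where() and multi_insert() are just stand-ins for the real
parsing, WHERE evaluation and table_multi_insert steps, and the input
is read line by line for brevity):

/* Simplified, illustrative sketch of the serial COPY FROM flow described
 * above.  Not actual copy.c code: the helpers below are stand-ins. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE    65536       /* we read the file in 64KB chunks */
#define MAX_TUPLES  1000        /* flush after 1000 buffered tuples */

typedef struct { char *line; } Tuple;

static int parse_line(const char *line, Tuple *tup)     /* form values/nulls */
{
    tup->line = strdup(line);
    return 1;
}

static int check_where(const Tuple *tup)                /* WHERE qualification */
{
    (void) tup;
    return 1;
}

static void multi_insert(Tuple *tuples, int ntuples)    /* table_multi_insert stand-in */
{
    for (int i = 0; i < ntuples; i++)
    {
        /* ... insert heap tuple, index entries, AFTER ROW triggers ... */
        free(tuples[i].line);
    }
}

int main(void)
{
    char    line[MAX_LINE];
    Tuple   tuples[MAX_TUPLES];
    int     ntuples = 0;

    while (fgets(line, sizeof(line), stdin) != NULL)
    {
        Tuple   tup;

        if (!parse_line(line, &tup))
            continue;
        if (!check_where(&tup))
        {
            free(tup.line);
            continue;
        }
        tuples[ntuples++] = tup;
        if (ntuples == MAX_TUPLES)  /* or 64KB consumed, whichever is first */
        {
            multi_insert(tuples, ntuples);
            ntuples = 0;
        }
    }
    if (ntuples > 0)
        multi_insert(tuples, ntuples);
    return 0;
}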

So, as we can see, we do a lot of work after reading each 64KB chunk,
and we can read the next chunk only after all the tuples in the
previous chunk have been processed.  This gives us an opportunity to
parallelize the processing of each 64KB chunk.  I think we can do this
in more than one way.

The first idea is that we allocate each chunk to a worker and, once
the worker has finished processing the current chunk, it can start
with the next unprocessed chunk.  Here, we need to see how to handle
the partial tuples at the end or beginning of each chunk.  We can read
the chunks into DSA/DSM instead of into a local buffer for processing.
Alternatively, if we think that accessing shared memory can be costly,
we can read the entire chunk into local memory but copy the partial
tuple at the beginning of a chunk (if any) to a DSA; we mainly need
the partial tuple in the shared memory area.  The worker which has
found the initial part of the partial tuple will be responsible for
processing the entire tuple.  Now, to detect whether there is a
partial tuple at the beginning of a chunk, we always start reading one
byte prior to the start of the current chunk, and if that byte is not
a terminating line byte, we know that the chunk begins with a partial
tuple.  While processing the chunk, we will then ignore this first
line and start after the first terminating line.
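
A small stand-alone sketch of that boundary rule (plain C over an
in-memory buffer; the plain '\n' terminator and the tiny chunk size are
simplifications for illustration only):

/* Chunk-boundary rule sketched above: look one byte before the chunk
 * start to decide whether the chunk begins with a partial line, and if
 * so skip to the byte after the first line terminator. */
#include <stdio.h>
#include <string.h>

/* Return the offset within [start, end) where a worker should begin
 * parsing complete lines. */
static size_t
chunk_parse_start(const char *data, size_t start, size_t end)
{
    size_t  pos = start;

    /* Chunk 0, or the previous byte terminates a line: no partial line. */
    if (start == 0 || data[start - 1] == '\n')
        return start;

    /* Partial line: the worker owning the previous chunk handles it. */
    while (pos < end && data[pos] != '\n')
        pos++;
    return (pos < end) ? pos + 1 : end;
}

int main(void)
{
    const char *data = "aaa,1\nbbb,2\nccc,3\n";
    size_t      chunk_size = 8;     /* pretend chunks are 8 bytes */
    size_t      len = strlen(data);

    for (size_t start = 0; start < len; start += chunk_size)
    {
        size_t  end = (start + chunk_size < len) ? start + chunk_size : len;

        printf("chunk [%zu,%zu) starts parsing at offset %zu\n",
               start, end, chunk_parse_start(data, start, end));
    }
    return 0;
}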

To connect the partial tuple across two consecutive chunks, we need
another data structure (for ease of reference in this email, I will
call it the CTM, for chunk-tuple-map) in shared memory, where we store
some per-chunk information like the chunk number, the DSA location of
that chunk, and a flag which indicates whether the entry can be
freed/reused.  Whenever we encounter a partial tuple at the beginning
of a chunk, we note its chunk number and DSA location in the CTM.
Then, whenever we encounter a partial tuple at the end of a chunk, we
search the CTM for the next chunk number and read from the
corresponding DSA location until we encounter the terminating line
byte.  Once we have read and processed this partial tuple, we can mark
the entry as available for reuse.  There are some loose ends here,
like how many entries we should allocate in this data structure.  That
depends on whether we want to allow a worker to start reading the next
chunk before the partial tuple of its previous chunk has been
processed.  To keep it simple, we can allow a worker to process the
next chunk only once the partial tuple in its previous chunk has been
processed.  That lets us keep the number of CTM entries equal to the
number of workers.  I think we can easily improve on this if we want,
but I don't think it will matter too much, as in most cases, by the
time we have processed the tuples in a chunk, its partial tuple would
already have been consumed by the other worker.
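
To make the CTM idea a bit more concrete, the per-entry bookkeeping
could look roughly like this (plain C; the field names, the fixed entry
count and the raw pointer standing in for a real dsa_pointer are all
assumptions for illustration, and locking around the array is omitted):

/* Illustrative layout for the chunk-tuple-map (CTM) described above. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define CTM_ENTRIES 8                   /* one entry per worker, as proposed */

typedef struct CtmEntry
{
    long        chunk_number;           /* which chunk this entry describes */
    const char *chunk_data;             /* chunk location (stand-in for dsa_pointer) */
    size_t      chunk_len;
    bool        in_use;                 /* false => free for reuse */
} CtmEntry;

static CtmEntry ctm[CTM_ENTRIES];

/* Register the chunk whose beginning holds the tail of a partial line. */
static void
ctm_register(long chunk_number, const char *data, size_t len)
{
    for (int i = 0; i < CTM_ENTRIES; i++)
    {
        if (!ctm[i].in_use)
        {
            ctm[i] = (CtmEntry) {chunk_number, data, len, true};
            return;
        }
    }
}

/* Find the next chunk so a worker can finish a line that spills past its
 * chunk's end; the caller reads up to the first line terminator there and
 * then marks the entry reusable. */
static CtmEntry *
ctm_lookup(long next_chunk_number)
{
    for (int i = 0; i < CTM_ENTRIES; i++)
        if (ctm[i].in_use && ctm[i].chunk_number == next_chunk_number)
            return &ctm[i];
    return NULL;
}

int main(void)
{
    ctm_register(1, "tail-of-line\nmore data...", 25);

    CtmEntry   *e = ctm_lookup(1);

    if (e)
    {
        printf("found chunk %ld: %.12s...\n", e->chunk_number, e->chunk_data);
        e->in_use = false;              /* reusable once the partial line is consumed */
    }
    return 0;
}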

Another approach, which came up during an offlist discussion with
Robert, is that we have one dedicated worker for reading the chunks
from the file; it copies the complete tuples of one chunk into shared
memory and, once that is done, hands that chunk over to another worker
which can process the tuples in that area.  We can imagine that the
reader worker is responsible for forming some sort of work queue that
can be processed by the other workers.  In this idea, we won't get the
benefit of doing the initial tokenization (forming tuple boundaries)
via parallel workers, and we might need some additional memory
processing, as after the reader worker has handed over the initial
shared memory segment, we need to somehow identify tuple boundaries
and then process them.

Another thing we need to figure out is how many workers to use for
the copy command.  I think we can base it on the file size (which
needs some experiments) or maybe on user input.

I think we have two related problems to solve for this: (a) the
relation extension lock (required for extending the relation), which
won't conflict among workers due to group locking; we are working on a
solution for this in another thread [1]; and (b) the use of page locks
in GIN indexes; we can probably disallow parallelism if the table has
a GIN index, which is not a great thing but not bad either.

To be clear, this work is for PG14.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B%2BSs%3DGn1LA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Thomas Munro
Date:
On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> This work is to parallelize the copy command and in particular "Copy
> <table_name> from 'filename' Where <condition>;" command.

Nice project, and a great stepping stone towards parallel DML.

> The first idea is that we allocate each chunk to a worker and once the
> worker has finished processing the current chunk, it can start with
> the next unprocessed chunk.  Here, we need to see how to handle the
> partial tuples at the end or beginning of each chunk.  We can read the
> chunks in dsa/dsm instead of in local buffer for processing.
> Alternatively, if we think that accessing shared memory can be costly
> we can read the entire chunk in local memory, but copy the partial
> tuple at the beginning of a chunk (if any) to a dsa.  We mainly need
> partial tuple in the shared memory area. The worker which has found
> the initial part of the partial tuple will be responsible to process
> the entire tuple. Now, to detect whether there is a partial tuple at
> the beginning of a chunk, we always start reading one byte, prior to
> the start of the current chunk and if that byte is not a terminating
> line byte, we know that it is a partial tuple.  Now, while processing
> the chunk, we will ignore this first line and start after the first
> terminating line.

That's quite similar to the approach I took with a parallel file_fdw
patch[1], which mostly consisted of parallelising the reading part of
copy.c, except that...

> To connect the partial tuple in two consecutive chunks, we need to
> have another data structure (for the ease of reference in this email,
> I call it CTM (chunk-tuple-map)) in shared memory where we store some
> per-chunk information like the chunk-number, dsa location of that
> chunk and a variable which indicates whether we can free/reuse the
> current entry.  Whenever we encounter the partial tuple at the
> beginning of a chunk we note down its chunk number, and dsa location
> in CTM.  Next, whenever we encounter any partial tuple at the end of
> the chunk, we search CTM for next chunk-number and read from
> corresponding dsa location till we encounter terminating line byte.
> Once we have read and processed this partial tuple, we can mark the
> entry as available for reuse.  There are some loose ends here like how
> many entries shall we allocate in this data structure.  It depends on
> whether we want to allow the worker to start reading the next chunk
> before the partial tuple of the previous chunk is processed.  To keep
> it simple, we can allow the worker to process the next chunk only when
> the partial tuple in the previous chunk is processed.  This will allow
> us to keep the entries equal to a number of workers in CTM.  I think
> we can easily improve this if we want but I don't think it will matter
> too much as in most cases by the time we processed the tuples in that
> chunk, the partial tuple would have been consumed by the other worker.

... I didn't use a shm 'partial tuple' exchanging mechanism; I just
had each worker follow the final tuple in its chunk into the next
chunk, and had each worker ignore the first tuple in every chunk after
chunk 0 because it knows someone else is looking after it.  That
means there was some double reading going on near the boundaries,
and considering how much I've been complaining about bogus extra
system calls on this mailing list lately, yeah, your idea of doing a
bit more coordination is better.  If you go this way, you might
at least find the copy.c part of the patch I wrote useful as stand-in
scaffolding code while you prototype the parallel writing side, if you
don't already have something better for this?

> Another approach that came up during an offlist discussion with Robert
> is that we have one dedicated worker for reading the chunks from file
> and it copies the complete tuples of one chunk in the shared memory
> and once that is done, a handover that chunks to another worker which
> can process tuples in that area.  We can imagine that the reader
> worker is responsible to form some sort of work queue that can be
> processed by the other workers.  In this idea, we won't be able to get
> the benefit of initial tokenization (forming tuple boundaries) via
> parallel workers and might need some additional memory processing as
> after reader worker has handed the initial shared memory segment, we
> need to somehow identify tuple boundaries and then process them.

Yeah, I have also wondered about something like this in a slightly
different context.  For parallel query in general, I wondered if there
should be a Parallel Scatter node that can be put on top of any
parallel-safe plan; it would run the plan in a worker process that just
pushes tuples into a single-producer multi-consumer shm queue, and
then other workers read from that whenever they need a tuple.  Hmm,
but for COPY, I suppose you'd want to push the raw lines with minimal
examination, not tuples, into a shm queue, so I guess that's a bit
different.

> Another thing we need to figure out is the how many workers to use for
> the copy command.  I think we can use it based on the file size which
> needs some experiments or may be based on user input.

It seems like we don't even really have a general model for that sort
of thing in the rest of the system yet, and I guess some kind of
fairly dumb explicit system would make sense in the early days...

> Thoughts?

This is cool.

[1] https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com



Re: Parallel copy

From
Amit Kapila
Date:
On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This work is to parallelize the copy command and in particular "Copy
> > <table_name> from 'filename' Where <condition>;" command.
>
> Nice project, and a great stepping stone towards parallel DML.
>

Thanks.

> > The first idea is that we allocate each chunk to a worker and once the
> > worker has finished processing the current chunk, it can start with
> > the next unprocessed chunk.  Here, we need to see how to handle the
> > partial tuples at the end or beginning of each chunk.  We can read the
> > chunks in dsa/dsm instead of in local buffer for processing.
> > Alternatively, if we think that accessing shared memory can be costly
> > we can read the entire chunk in local memory, but copy the partial
> > tuple at the beginning of a chunk (if any) to a dsa.  We mainly need
> > partial tuple in the shared memory area. The worker which has found
> > the initial part of the partial tuple will be responsible to process
> > the entire tuple. Now, to detect whether there is a partial tuple at
> > the beginning of a chunk, we always start reading one byte, prior to
> > the start of the current chunk and if that byte is not a terminating
> > line byte, we know that it is a partial tuple.  Now, while processing
> > the chunk, we will ignore this first line and start after the first
> > terminating line.
>
> That's quiet similar to the approach I took with a parallel file_fdw
> patch[1], which mostly consisted of parallelising the reading part of
> copy.c, except that...
>
> > To connect the partial tuple in two consecutive chunks, we need to
> > have another data structure (for the ease of reference in this email,
> > I call it CTM (chunk-tuple-map)) in shared memory where we store some
> > per-chunk information like the chunk-number, dsa location of that
> > chunk and a variable which indicates whether we can free/reuse the
> > current entry.  Whenever we encounter the partial tuple at the
> > beginning of a chunk we note down its chunk number, and dsa location
> > in CTM.  Next, whenever we encounter any partial tuple at the end of
> > the chunk, we search CTM for next chunk-number and read from
> > corresponding dsa location till we encounter terminating line byte.
> > Once we have read and processed this partial tuple, we can mark the
> > entry as available for reuse.  There are some loose ends here like how
> > many entries shall we allocate in this data structure.  It depends on
> > whether we want to allow the worker to start reading the next chunk
> > before the partial tuple of the previous chunk is processed.  To keep
> > it simple, we can allow the worker to process the next chunk only when
> > the partial tuple in the previous chunk is processed.  This will allow
> > us to keep the entries equal to a number of workers in CTM.  I think
> > we can easily improve this if we want but I don't think it will matter
> > too much as in most cases by the time we processed the tuples in that
> > chunk, the partial tuple would have been consumed by the other worker.
>
> ... I didn't use a shm 'partial tuple' exchanging mechanism, I just
> had each worker follow the final tuple in its chunk into the next
> chunk, and have each worker ignore the first tuple in chunk after
> chunk 0 because it knows someone else is looking after that.  That
> means that there was some double reading going on near the boundaries,
>

Right, and especially if the part in the second chunk is bigger, we
might need to read most of the second chunk.

> and considering how much I've been complaining about bogus extra
> system calls on this mailing list lately, yeah, your idea of doing a
> bit more coordination is a better idea.  If you go this way, you might
> at least find the copy.c part of the patch I wrote useful as stand-in
> scaffolding code in the meantime while you prototype the parallel
> writing side, if you don't already have something better for this?
>

No, I haven't started writing anything yet, but I have some ideas on
how to achieve this.  I quickly skimmed through your patch and I think
it can be used as a starting point, though if we decide to go with
accumulating the partial tuple (or all of the data) in shm, then
things might differ.

> > Another approach that came up during an offlist discussion with Robert
> > is that we have one dedicated worker for reading the chunks from file
> > and it copies the complete tuples of one chunk in the shared memory
> > and once that is done, a handover that chunks to another worker which
> > can process tuples in that area.  We can imagine that the reader
> > worker is responsible to form some sort of work queue that can be
> > processed by the other workers.  In this idea, we won't be able to get
> > the benefit of initial tokenization (forming tuple boundaries) via
> > parallel workers and might need some additional memory processing as
> > after reader worker has handed the initial shared memory segment, we
> > need to somehow identify tuple boundaries and then process them.
>
> Yeah, I have also wondered about something like this in a slightly
> different context.  For parallel query in general, I wondered if there
> should be a Parallel Scatter node, that can be put on top of any
> parallel-safe plan, and it runs it in a worker process that just
> pushes tuples into a single-producer multi-consumer shm queue, and
> then other workers read from that whenever they need a tuple.
>

The idea sounds great, but past experience shows that shoving all
the tuples through a queue might add significant overhead.  However, I
don't know exactly how you are planning to use it.

>  Hmm,
> but for COPY, I suppose you'd want to push the raw lines with minimal
> examination, not tuples, into a shm queue, so I guess that's a bit
> different.
>

Yeah.

> > Another thing we need to figure out is the how many workers to use for
> > the copy command.  I think we can use it based on the file size which
> > needs some experiments or may be based on user input.
>
> It seems like we don't even really have a general model for that sort
> of thing in the rest of the system yet, and I guess some kind of
> fairly dumb explicit system would make sense in the early days...
>

makes sense.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Alastair Turner
Date:
On Fri, 14 Feb 2020 at 11:57, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
 ...
> > Another approach that came up during an offlist discussion with Robert
> > is that we have one dedicated worker for reading the chunks from file
> > and it copies the complete tuples of one chunk in the shared memory
> > and once that is done, a handover that chunks to another worker which
> > can process tuples in that area.  We can imagine that the reader
> > worker is responsible to form some sort of work queue that can be
> > processed by the other workers.  In this idea, we won't be able to get
> > the benefit of initial tokenization (forming tuple boundaries) via
> > parallel workers and might need some additional memory processing as
> > after reader worker has handed the initial shared memory segment, we
> > need to somehow identify tuple boundaries and then process them.

Parsing rows from the raw input (the work done by CopyReadLine()) in a
single process would accommodate line returns in quoted fields.  I don't
think there's a way of getting parallel workers to manage the
in-quote/out-of-quote state required.  A single worker could also process
a stream without having to reread/rewind, so it would be able to process
input from STDIN or PROGRAM sources, making the improvements applicable
to load operations done by third-party tools and scripted \copy in psql.
 
>
...

> > Another thing we need to figure out is the how many workers to use for
> > the copy command.  I think we can use it based on the file size which
> > needs some experiments or may be based on user input.
>
> It seems like we don't even really have a general model for that sort
> of thing in the rest of the system yet, and I guess some kind of
> fairly dumb explicit system would make sense in the early days...
>

makes sense.
The ratio between chunking or line-parsing processes and the parallel
worker pool would vary with the width of the table, the complexity of the
data or file (dates, encoding conversions), the complexity of constraints,
and the acceptable impact of the load.  Being able to control it through
user input would be great.

--
Alastair

Re: Parallel copy

From
Amit Kapila
Date:
On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>
> On Fri, 14 Feb 2020 at 11:57, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> >
>> > On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>  ...
>>
>> > > Another approach that came up during an offlist discussion with Robert
>> > > is that we have one dedicated worker for reading the chunks from file
>> > > and it copies the complete tuples of one chunk in the shared memory
>> > > and once that is done, a handover that chunks to another worker which
>> > > can process tuples in that area.  We can imagine that the reader
>> > > worker is responsible to form some sort of work queue that can be
>> > > processed by the other workers.  In this idea, we won't be able to get
>> > > the benefit of initial tokenization (forming tuple boundaries) via
>> > > parallel workers and might need some additional memory processing as
>> > > after reader worker has handed the initial shared memory segment, we
>> > > need to somehow identify tuple boundaries and then process them.
>
>
> Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns
> in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state
> required.
>

AFAIU, the whole of this in-quote/out-of-quote state is managed inside
CopyReadLineText, which would be done by each of the parallel workers,
something along the lines of what Thomas did in his patch [1].
Basically, we need to invent a mechanism to allocate chunks to
individual workers, and then the whole processing will be done as we
are doing now, except for the special handling of partial tuples which
I have explained in my previous email.  Am I missing something here?

>>
>> >
>
> ...
>>
>>
>> > > Another thing we need to figure out is the how many workers to use for
>> > > the copy command.  I think we can use it based on the file size which
>> > > needs some experiments or may be based on user input.
>> >
>> > It seems like we don't even really have a general model for that sort
>> > of thing in the rest of the system yet, and I guess some kind of
>> > fairly dumb explicit system would make sense in the early days...
>> >
>>
>> makes sense.
>
> The ratio between chunking or line parsing processes and the parallel worker pool would vary with the width of the
> table, complexity of the data or file (dates, encoding conversions), complexity of constraints and acceptable impact of
> the load. Being able to control it through user input would be great.
>

Okay, I think one simple way could be that we compute the number of
workers based on the file size (some experiments are required to
determine the thresholds) unless the user has given an input.  If the
user has provided an input, then we can use that, with
max_parallel_workers as the upper limit.
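
As a strawman for such a heuristic (the thresholds and the function
name below are purely hypothetical, not proposed values):

/* Hypothetical heuristic: derive the number of parallel COPY workers
 * from the input file size unless the user asked for a specific count,
 * capping the result at max_parallel_workers. */
#include <stdio.h>

static int
choose_copy_workers(long long file_size, int requested, int max_parallel_workers)
{
    int     nworkers;

    if (requested > 0)
        nworkers = requested;                   /* explicit user input wins */
    else if (file_size < 16LL * 1024 * 1024)
        nworkers = 0;                           /* small file: stay serial */
    else if (file_size < 1024LL * 1024 * 1024)
        nworkers = 2;
    else
        nworkers = 4;

    return (nworkers > max_parallel_workers) ? max_parallel_workers : nworkers;
}

int main(void)
{
    printf("100MB, no request: %d workers\n",
           choose_copy_workers(100LL * 1024 * 1024, 0, 8));
    printf("2GB, user asked for 16: %d workers\n",
           choose_copy_workers(2048LL * 1024 * 1024, 16, 8));
    return 0;
}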


[1] - https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Alastair Turner
Date:
On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
> >
...
> >
> > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line
> > returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote
> > state required.
 
> >
>
> AFAIU, the whole of this in-quote/out-of-quote state is manged inside
> CopyReadLineText which will be done by each of the parallel workers,
> something on the lines of what Thomas did in his patch [1].
> Basically, we need to invent a mechanism to allocate chunks to
> individual workers and then the whole processing will be done as we
> are doing now except for special handling for partial tuples which I
> have explained in my previous email.  Am, I missing something here?
>
The problem case that I see is the chunk boundary falling in the
middle of a quoted field where
 - The quote opens in chunk 1
 - The quote closes in chunk 2
 - There is an EoL character between the start of chunk 2 and the closing quote

When the worker processing chunk 2 starts, it believes itself to be in
out-of-quote state, so only data between the start of the chunk and
the EoL is regarded as belonging to the partial line. From that point
on the parsing of the rest of the chunk goes off track.
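
Here is a tiny stand-alone example of that failure mode (plain C; the
naive counter below plays the role of a worker that wrongly assumes
out-of-quote state at its chunk boundary):

/* The chunk boundary falls inside a quoted CSV field that contains a
 * newline.  A worker assuming out-of-quote state at its start finds a
 * bogus line end inside the field and mis-tracks quote state for the
 * rest of its chunk. */
#include <stdio.h>
#include <string.h>

/* Count line ends in [start, end), tracking quote state from 'in_quote'. */
static int
count_line_ends(const char *s, size_t start, size_t end, int in_quote)
{
    int     nlines = 0;

    for (size_t i = start; i < end; i++)
    {
        if (s[i] == '"')
            in_quote = !in_quote;
        else if (s[i] == '\n' && !in_quote)
            nlines++;
    }
    return nlines;
}

int main(void)
{
    /* Two records; the first has a quoted field spanning offset 8. */
    const char *csv = "1,\"field\nwith newline\"\n2,plain\n";
    size_t      split = 8;          /* chunk boundary inside the quoted field */
    size_t      len = strlen(csv);

    /* Correct: parse serially, carrying quote state across the boundary. */
    printf("serial parse finds %d line ends\n",
           count_line_ends(csv, 0, len, 0));

    /* Wrong: chunk 2 assumes it starts in out-of-quote state. */
    printf("chunked parse finds %d line ends\n",
           count_line_ends(csv, 0, split, 0) +
           count_line_ends(csv, split, len, 0));
    return 0;
}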

Some of the resulting errors can be avoided by, for instance,
requiring a quote to be preceded by a delimiter or EoL. That answer
fails when fields end with EoL characters, which happens often enough
in the wild.

Recovering from an incorrect in-quote/out-of-quote state assumption at
the start of parsing a chunk just seems like a hole with no bottom. So
it looks to me like it's best done in a single process which can keep
track of that state reliably.

--
Alastair



Re: Parallel copy

From
Amit Kapila
Date:
On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
>
> On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
> > >
> ...
> > >
> > > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line
> > > returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote
> > > state required.
 
> > >
> >
> > AFAIU, the whole of this in-quote/out-of-quote state is manged inside
> > CopyReadLineText which will be done by each of the parallel workers,
> > something on the lines of what Thomas did in his patch [1].
> > Basically, we need to invent a mechanism to allocate chunks to
> > individual workers and then the whole processing will be done as we
> > are doing now except for special handling for partial tuples which I
> > have explained in my previous email.  Am, I missing something here?
> >
> The problem case that I see is the chunk boundary falling in the
> middle of a quoted field where
>  - The quote opens in chunk 1
>  - The quote closes in chunk 2
>  - There is an EoL character between the start of chunk 2 and the closing quote
>
> When the worker processing chunk 2 starts, it believes itself to be in
> out-of-quote state, so only data between the start of the chunk and
> the EoL is regarded as belonging to the partial line. From that point
> on the parsing of the rest of the chunk goes off track.
>
> Some of the resulting errors can be avoided by, for instance,
> requiring a quote to be preceded by a delimiter or EoL. That answer
> fails when fields end with EoL characters, which happens often enough
> in the wild.
>
> Recovering from an incorrect in-quote/out-of-quote state assumption at
> the start of parsing a chunk just seems like a hole with no bottom. So
> it looks to me like it's best done in a single process which can keep
> track of that state reliably.
>

Good point, and I agree with you that having a single process would
avoid any such issues.  However, I will think some more on it, and if
you or anyone else gets an idea on how to deal with this in a
multi-worker system (where we allow each worker to read and process
chunks), then feel free to share your thoughts.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
David Fetter
Date:
On Sat, Feb 15, 2020 at 06:02:06PM +0530, Amit Kapila wrote:
> On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> >
> > On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
> > > >
> > ...
> > > >
> > > > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line
> > > > returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote
> > > > state required.
 
> > > >
> > >
> > > AFAIU, the whole of this in-quote/out-of-quote state is manged inside
> > > CopyReadLineText which will be done by each of the parallel workers,
> > > something on the lines of what Thomas did in his patch [1].
> > > Basically, we need to invent a mechanism to allocate chunks to
> > > individual workers and then the whole processing will be done as we
> > > are doing now except for special handling for partial tuples which I
> > > have explained in my previous email.  Am, I missing something here?
> > >
> > The problem case that I see is the chunk boundary falling in the
> > middle of a quoted field where
> >  - The quote opens in chunk 1
> >  - The quote closes in chunk 2
> >  - There is an EoL character between the start of chunk 2 and the closing quote
> >
> > When the worker processing chunk 2 starts, it believes itself to be in
> > out-of-quote state, so only data between the start of the chunk and
> > the EoL is regarded as belonging to the partial line. From that point
> > on the parsing of the rest of the chunk goes off track.
> >
> > Some of the resulting errors can be avoided by, for instance,
> > requiring a quote to be preceded by a delimiter or EoL. That answer
> > fails when fields end with EoL characters, which happens often enough
> > in the wild.
> >
> > Recovering from an incorrect in-quote/out-of-quote state assumption at
> > the start of parsing a chunk just seems like a hole with no bottom. So
> > it looks to me like it's best done in a single process which can keep
> > track of that state reliably.
> >
> 
> Good point and I agree with you that having a single process would
> avoid any such stuff.   However, I will think some more on it and if
> you/anyone else gets some idea on how to deal with this in a
> multi-worker system (where we can allow each worker to read and
> process the chunk) then feel free to share your thoughts.

I see two pieces of this puzzle: input formats we control, and the
ones we don't.

In the former case, we could encode all fields with base85 (or
something similar that reduces the input alphabet efficiently), then
reserve bytes that denote delimiters of various types. ASCII has
separators for file, group, record, and unit that we could use as
inspiration.
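
To illustrate that point: with a payload alphabet of printable
characters only (which base85 gives us), the ASCII record and unit
separator bytes can never appear inside a field, so splitting the input
needs no quote-state tracking at all.  A rough sketch (the choice of
separators is just for illustration):

/* Record/unit separators outside the encoded-field alphabet make
 * splitting unambiguous, with no in-quote/out-of-quote state needed. */
#include <stdio.h>
#include <string.h>

#define US 0x1F                         /* ASCII unit separator: between fields */
#define RS 0x1E                         /* ASCII record separator: between rows */

int main(void)
{
    /* Two rows of two (already base85-encoded) fields each. */
    const char  input[] = {'a', 'b', US, 'c', 'd', RS,
                           'e', 'f', US, 'g', 'h', RS, '\0'};
    const char *p = input;

    while (*p)
    {
        const char *rs = strchr(p, RS); /* rows end unambiguously at RS */
        int         nfields = 1;

        for (const char *q = p; q < rs; q++)
            if (*q == US)
                nfields++;
        printf("record of %d fields, %d bytes\n", nfields, (int) (rs - p));
        p = rs + 1;
    }
    return 0;
}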

I don't have anything to offer for free-form input other than to agree
that it looks like a hole with no bottom, and maybe we should just
keep that process serial, at least until someone finds a bottom.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Parallel copy

From
Andrew Dunstan
Date:
On 2/15/20 7:32 AM, Amit Kapila wrote:
> On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
>> On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>> ...
>>>> Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line
>>>> returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote
>>>> state required.
 
>>>>
>>> AFAIU, the whole of this in-quote/out-of-quote state is manged inside
>>> CopyReadLineText which will be done by each of the parallel workers,
>>> something on the lines of what Thomas did in his patch [1].
>>> Basically, we need to invent a mechanism to allocate chunks to
>>> individual workers and then the whole processing will be done as we
>>> are doing now except for special handling for partial tuples which I
>>> have explained in my previous email.  Am, I missing something here?
>>>
>> The problem case that I see is the chunk boundary falling in the
>> middle of a quoted field where
>>  - The quote opens in chunk 1
>>  - The quote closes in chunk 2
>>  - There is an EoL character between the start of chunk 2 and the closing quote
>>
>> When the worker processing chunk 2 starts, it believes itself to be in
>> out-of-quote state, so only data between the start of the chunk and
>> the EoL is regarded as belonging to the partial line. From that point
>> on the parsing of the rest of the chunk goes off track.
>>
>> Some of the resulting errors can be avoided by, for instance,
>> requiring a quote to be preceded by a delimiter or EoL. That answer
>> fails when fields end with EoL characters, which happens often enough
>> in the wild.
>>
>> Recovering from an incorrect in-quote/out-of-quote state assumption at
>> the start of parsing a chunk just seems like a hole with no bottom. So
>> it looks to me like it's best done in a single process which can keep
>> track of that state reliably.
>>
> Good point and I agree with you that having a single process would
> avoid any such stuff.   However, I will think some more on it and if
> you/anyone else gets some idea on how to deal with this in a
> multi-worker system (where we can allow each worker to read and
> process the chunk) then feel free to share your thoughts.
>


IIRC, in_quote only matters here in CSV mode (because CSV fields can
have embedded newlines). So why not just forbid parallel copy in CSV
mode, at least for now? I guess it depends on the actual use case. If we
expect to be parallel loading humungous CSVs then that won't fly.


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel copy

From
Amit Kapila
Date:
On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> On 2/15/20 7:32 AM, Amit Kapila wrote:
> > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> >>>
> >> The problem case that I see is the chunk boundary falling in the
> >> middle of a quoted field where
> >>  - The quote opens in chunk 1
> >>  - The quote closes in chunk 2
> >>  - There is an EoL character between the start of chunk 2 and the closing quote
> >>
> >> When the worker processing chunk 2 starts, it believes itself to be in
> >> out-of-quote state, so only data between the start of the chunk and
> >> the EoL is regarded as belonging to the partial line. From that point
> >> on the parsing of the rest of the chunk goes off track.
> >>
> >> Some of the resulting errors can be avoided by, for instance,
> >> requiring a quote to be preceded by a delimiter or EoL. That answer
> >> fails when fields end with EoL characters, which happens often enough
> >> in the wild.
> >>
> >> Recovering from an incorrect in-quote/out-of-quote state assumption at
> >> the start of parsing a chunk just seems like a hole with no bottom. So
> >> it looks to me like it's best done in a single process which can keep
> >> track of that state reliably.
> >>
> > Good point and I agree with you that having a single process would
> > avoid any such stuff.   However, I will think some more on it and if
> > you/anyone else gets some idea on how to deal with this in a
> > multi-worker system (where we can allow each worker to read and
> > process the chunk) then feel free to share your thoughts.
> >
>
>
> IIRC, in_quote only matters here in CSV mode (because CSV fields can
> have embedded newlines).
>

AFAIU, that is correct.

> So why not just forbid parallel copy in CSV
> mode, at least for now? I guess it depends on the actual use case. If we
> expect to be parallel loading humungous CSVs then that won't fly.
>

I am not sure about this part.  However, I guess we should at the very
least have an extendable solution that can deal with CSV; otherwise,
we might end up re-designing everything if someday we want to deal
with CSV.  One naive idea is that in CSV mode we can set things up
slightly differently, such that a worker won't start processing its
chunk unless the previous chunk is completely parsed.  So each
worker would first parse and tokenize the entire chunk and then start
writing it.  This will make the reading/parsing part serialized,
but writes can still be parallel.  Now, I don't know if it is a good
idea to process CSV mode in a different way.
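
To illustrate just the ordering constraint of that naive idea, here is
a stand-alone pthreads demo (made-up chunk counts and sleeps; this is
not a suggestion to use threads in the server, where the workers would
be processes): parsing of chunk i starts only after chunk i-1 has been
fully parsed, while the write phase overlaps freely.

/* Chunk i may be parsed only after chunk i-1 is fully parsed, which
 * serializes the quote-state-sensitive part; writes run in parallel. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NCHUNKS 6

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int  chunks_parsed = 0;          /* chunks 0..chunks_parsed-1 are parsed */

static void *
worker(void *arg)
{
    int     chunk = *(int *) arg;

    /* Wait until the previous chunk has been completely parsed. */
    pthread_mutex_lock(&lock);
    while (chunks_parsed < chunk)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    printf("chunk %d: parsing (serialized)\n", chunk);
    usleep(1000);                       /* pretend to tokenize the chunk */

    pthread_mutex_lock(&lock);
    chunks_parsed = chunk + 1;          /* unblock the next chunk's worker */
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);

    printf("chunk %d: writing tuples (parallel)\n", chunk);
    usleep(5000);                       /* pretend to insert tuples */
    return NULL;
}

int main(void)
{
    pthread_t   threads[NCHUNKS];
    int         ids[NCHUNKS];

    for (int i = 0; i < NCHUNKS; i++)
    {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NCHUNKS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}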

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Ants Aasma
Date:
On Sat, 15 Feb 2020 at 14:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Good point and I agree with you that having a single process would
> avoid any such stuff.   However, I will think some more on it and if
> you/anyone else gets some idea on how to deal with this in a
> multi-worker system (where we can allow each worker to read and
> process the chunk) then feel free to share your thoughts.

I think having a single process handle splitting the input into tuples makes
the most sense. It's possible to parse CSV at multiple GB/s rates [1], and
finding tuple boundaries is a subset of that task.

My first thought for a design would be to have two shared memory ring buffers,
one for data and one for tuple start positions. Reader process reads the CSV
data into the main buffer, finds tuple start locations in there and writes
those to the secondary buffer.

Worker processes claim a chunk of tuple positions from the secondary buffer and
update their "keep this data around" position with the first position. Then
proceed to parse and insert the tuples, updating their position until they find
the end of the last tuple in the chunk.
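
A rough sketch of the shared state that scheme implies (plain C; the
sizes, field names and the single-process demo are assumptions, and all
synchronisation between the reader and workers is omitted):

/* Two rings, as described above: one for raw input bytes and one for
 * tuple start offsets, plus a per-worker "keep this data around"
 * position so the reader knows how far it may overwrite. */
#include <stdint.h>
#include <stdio.h>

#define DATA_RING_SIZE  (8 * 1024 * 1024)   /* raw input bytes */
#define POS_RING_SIZE   (64 * 1024)         /* tuple start offsets */
#define MAX_WORKERS     8

typedef struct ParallelCopyShared
{
    /* Ring 1: raw data, filled by the reader process. */
    char        data[DATA_RING_SIZE];
    uint64_t    data_write_pos;             /* reader's append position */

    /* Ring 2: start offsets of tuples the reader found in 'data'. */
    uint64_t    tuple_start[POS_RING_SIZE];
    uint64_t    pos_write_idx;              /* reader's append index */
    uint64_t    pos_claim_idx;              /* next index a worker may claim */

    /* Oldest data offset each worker still needs; the reader may reuse
     * data-ring space only behind the minimum of these. */
    uint64_t    keep_pos[MAX_WORKERS];
} ParallelCopyShared;

/* The reader can recycle data-ring space up to the minimum keep position. */
static uint64_t
reusable_up_to(const ParallelCopyShared *shared, int nworkers)
{
    uint64_t    min = shared->data_write_pos;

    for (int i = 0; i < nworkers; i++)
        if (shared->keep_pos[i] < min)
            min = shared->keep_pos[i];
    return min;
}

int main(void)
{
    static ParallelCopyShared shared;       /* would live in DSM/DSA in a real patch */

    shared.data_write_pos = 1000;
    shared.keep_pos[0] = 400;
    shared.keep_pos[1] = 650;
    printf("reader may recycle data below offset %llu\n",
           (unsigned long long) reusable_up_to(&shared, 2));
    return 0;
}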

Buffer size and the maximum and minimum chunk sizes could be tunable. Ideally
the buffers would be at least big enough to absorb one of the workers getting
scheduled out for a timeslice, which could mean up to tens of megabytes.

Regards,
Ants Aasma

[1] https://github.com/geofflangdale/simdcsv/



Re: Parallel copy

From
Kyotaro Horiguchi
Date:
At Mon, 17 Feb 2020 16:49:22 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
> > On 2/15/20 7:32 AM, Amit Kapila wrote:
> > > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> > So why not just forbid parallel copy in CSV
 
> > mode, at least for now? I guess it depends on the actual use case. If we
> > expect to be parallel loading humungous CSVs then that won't fly.
> >
> 
> I am not sure about this part.  However, I guess we should at the very
> least have some extendable solution that can deal with csv, otherwise,
> we might end up re-designing everything if someday we want to deal
> with CSV.  One naive idea is that in csv mode, we can set up the
> things slightly differently like the worker, won't start processing
> the chunk unless the previous chunk is completely parsed.  So each
> worker would first parse and tokenize the entire chunk and then start
> writing it.  So, this will make the reading/parsing part serialized,
> but writes can still be parallel.  Now, I don't know if it is a good
> idea to process in a different way for csv mode.

In an extreme case, if we don't see a QUOTE character in a chunk, we
cannot know whether the chunk is in a quoted section or not until all
the preceding chunks have been parsed.  So ultimately we are forced to
parse fully sequentially as long as we allow QUOTE.

On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV,
QUOTE '')" in order to signal that there's no quoted section in the
file then all chunks would be fully concurrently parsable.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Parallel copy

From
Thomas Munro
Date:
On Tue, Feb 18, 2020 at 4:04 AM Ants Aasma <ants@cybertec.at> wrote:
> On Sat, 15 Feb 2020 at 14:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Good point and I agree with you that having a single process would
> > avoid any such stuff.   However, I will think some more on it and if
> > you/anyone else gets some idea on how to deal with this in a
> > multi-worker system (where we can allow each worker to read and
> > process the chunk) then feel free to share your thoughts.
>
> I think having a single process handle splitting the input into tuples makes
> most sense. It's possible to parse csv at multiple GB/s rates [1], finding
> tuple boundaries is a subset of that task.

Yeah, this is compelling.  Even though it has to read the file
serially, the real gains from parallel COPY should come from doing the
real work in parallel: data-type parsing, tuple forming, WHERE clause
filtering, partition routing, buffer management, insertion and
associated triggers, FKs and index maintenance.

The reason I used the other approach for the file_fdw patch is that I
was trying to make it look as much as possible like parallel
sequential scan and not create an extra worker, because I didn't feel
like an FDW should be allowed to do that (what if executor nodes all
over the query tree created worker processes willy-nilly?).  Obviously
it doesn't work correctly for embedded newlines, and even if you
decree that multi-line values aren't allowed in parallel COPY, the
stuff about tuples crossing chunk boundaries is still a bit unpleasant
(whether solved by double reading as I showed, or a bunch of tap
dancing in shared memory) and creates overheads.

> My first thought for a design would be to have two shared memory ring buffers,
> one for data and one for tuple start positions. Reader process reads the CSV
> data into the main buffer, finds tuple start locations in there and writes
> those to the secondary buffer.
>
> Worker processes claim a chunk of tuple positions from the secondary buffer and
> update their "keep this data around" position with the first position. Then
> proceed to parse and insert the tuples, updating their position until they find
> the end of the last tuple in the chunk.

+1.  That sort of two-queue scheme is exactly how I sketched out a
multi-consumer queue for a hypothetical Parallel Scatter node.  It
probably gets a bit trickier when the payload has to be broken up into
fragments to wrap around the "data" buffer N times.



Re: Parallel copy

From
Ants Aasma
Date:
On Tue, 18 Feb 2020 at 04:40, Thomas Munro <thomas.munro@gmail.com> wrote:
> +1.  That sort of two-queue scheme is exactly how I sketched out a
> multi-consumer queue for a hypothetical Parallel Scatter node.  It
> probably gets a bit trickier when the payload has to be broken up into
> fragments to wrap around the "data" buffer N times.

At least for COPY it should be easy enough - it already has to handle reading
data block by block. If a worker updates its position while doing so, the
reader can wrap around the data buffer.

There will be no parallelism while one worker is buffering up a line larger
than the data buffer, but that doesn't seem like a major issue. Once the line
is buffered and insertion begins, the next worker can start buffering the
next tuple.

Regards,
Ants Aasma



Re: Parallel copy

From
Amit Kapila
Date:
On Mon, Feb 17, 2020 at 8:34 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Sat, 15 Feb 2020 at 14:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Good point and I agree with you that having a single process would
> > avoid any such stuff.   However, I will think some more on it and if
> > you/anyone else gets some idea on how to deal with this in a
> > multi-worker system (where we can allow each worker to read and
> > process the chunk) then feel free to share your thoughts.
>
> I think having a single process handle splitting the input into tuples makes
> most sense. It's possible to parse csv at multiple GB/s rates [1], finding
> tuple boundaries is a subset of that task.
>
> My first thought for a design would be to have two shared memory ring buffers,
> one for data and one for tuple start positions. Reader process reads the CSV
> data into the main buffer, finds tuple start locations in there and writes
> those to the secondary buffer.
>
> Worker processes claim a chunk of tuple positions from the secondary buffer and
> update their "keep this data around" position with the first position. Then
> proceed to parse and insert the tuples, updating their position until they find
> the end of the last tuple in the chunk.
>

This is similar to what I also had in mind for this idea.  I had
thought of handing over a complete chunk (64K or whatever we decide).
The one thing that slightly bothers me is that we will add some
additional overhead of copying to and from shared memory, which was
earlier local process memory.  And the tokenization (finding line
boundaries) would be serial.  I think tokenization should be a small
part of the overall work we do during the copy operation, but I will
do some measurements to ascertain the same.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Tue, Feb 18, 2020 at 7:28 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Mon, 17 Feb 2020 16:49:22 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan
> > <andrew.dunstan@2ndquadrant.com> wrote:
> > > On 2/15/20 7:32 AM, Amit Kapila wrote:
> > > > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> > > So why not just forbid parallel copy in CSV
 
> > > mode, at least for now? I guess it depends on the actual use case. If we
> > > expect to be parallel loading humungous CSVs then that won't fly.
> > >
> >
> > I am not sure about this part.  However, I guess we should at the very
> > least have some extendable solution that can deal with csv, otherwise,
> > we might end up re-designing everything if someday we want to deal
> > with CSV.  One naive idea is that in csv mode, we can set up the
> > things slightly differently like the worker, won't start processing
> > the chunk unless the previous chunk is completely parsed.  So each
> > worker would first parse and tokenize the entire chunk and then start
> > writing it.  So, this will make the reading/parsing part serialized,
> > but writes can still be parallel.  Now, I don't know if it is a good
> > idea to process in a different way for csv mode.
>
> In an extreme case, if we didn't see a QUOTE in a chunk, we cannot
> know the chunk is in a quoted section or not, until all the past
> chunks are parsed.  After all we are forced to parse fully
> sequentially as far as we allow QUOTE.
>

Right.  I think the benefits of this as compared to the single-reader
idea would be (a) we can avoid accessing shared memory for most of the
chunk, and (b) for non-CSV mode, even the tokenization (finding line
boundaries) would be parallel.   OTOH, doing the processing
differently for CSV and non-CSV mode might not be good.

> On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV,
> QUOTE '')" in order to signal that there's no quoted section in the
> file then all chunks would be fully concurrently parsable.
>

Yeah, if we can provide such an option, we can probably make parallel
CSV processing equivalent to non-CSV.  However, users might not like
this, as I think in some cases it won't be easy for them to tell
whether the file has quoted fields or not.  I am not very sure of this
point.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Kyotaro Horiguchi
Date:
At Tue, 18 Feb 2020 15:59:36 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Tue, Feb 18, 2020 at 7:28 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > In an extreme case, if we didn't see a QUOTE in a chunk, we cannot
> > know the chunk is in a quoted section or not, until all the past
> > chunks are parsed.  After all we are forced to parse fully
> > sequentially as far as we allow QUOTE.
> >
> 
> Right, I think the benefits of this as compared to single reader idea
> would be (a) we can save accessing shared memory for the most part of
> the chunk (b) for non-csv mode, even the tokenization (finding line
> boundaries) would also be parallel.   OTOH, doing processing
> differently for csv and non-csv mode might not be good.

Agreed.  So I think it's a good compromise.

> > On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV,
> > QUOTE '')" in order to signal that there's no quoted section in the
> > file then all chunks would be fully concurrently parsable.
> >
> 
> Yeah, if we can provide such an option, we can probably make parallel
> csv processing equivalent to non-csv.  However, users might not like
> this as I think in some cases it won't be easier for them to tell
> whether the file has quoted fields or not.  I am not very sure of this
> point.

I'm not sure how large a portion of the usage contains quoted sections,
so I'm not sure how useful it would be.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Parallel copy

From
Ants Aasma
Date:
On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> This is something similar to what I had also in mind for this idea.  I
> had thought of handing over complete chunk (64K or whatever we
> decide).  The one thing that slightly bothers me is that we will add
> some additional overhead of copying to and from shared memory which
> was earlier from local process memory.  And, the tokenization (finding
> line boundaries) would be serial.  I think that tokenization should be
> a small part of the overall work we do during the copy operation, but
> will do some measurements to ascertain the same.

I don't think any extra copying is needed. The reader can directly
fread()/pq_copymsgbytes() into shared memory, and the workers can run
CopyReadLineText() inner loop directly off of the buffer in shared memory.

For serial performance of tokenization into lines, I really think a
SIMD-based approach will be fast enough for quite some time. I hacked up
the code in the simdcsv project to only tokenize on line endings and it
was able to tokenize a CSV file with short lines at 8+ GB/s. There are
going to be many other bottlenecks before this one starts limiting.
Patch attached if you'd like to try that out.

Regards,
Ants Aasma

Attachment

Re: Parallel copy

From
Amit Kapila
Date:
On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This is something similar to what I had also in mind for this idea.  I
> > had thought of handing over complete chunk (64K or whatever we
> > decide).  The one thing that slightly bothers me is that we will add
> > some additional overhead of copying to and from shared memory which
> > was earlier from local process memory.  And, the tokenization (finding
> > line boundaries) would be serial.  I think that tokenization should be
> > a small part of the overall work we do during the copy operation, but
> > will do some measurements to ascertain the same.
>
> I don't think any extra copying is needed.
>

I am talking about access to shared memory instead of the process
local memory.  I understand that an extra copy won't be required.

> The reader can directly
> fread()/pq_copymsgbytes() into shared memory, and the workers can run
> CopyReadLineText() inner loop directly off of the buffer in shared memory.
>

I am slightly confused here.  AFAIU, the for(;;) loop in
CopyReadLineText is about finding the line endings, which we thought
the reader process would do.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Mike Blackwell
Date:
On Sun, Feb 16, 2020 at 12:51 AM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:

> IIRC, in_quote only matters here in CSV mode (because CSV fields can
> have embedded newlines). So why not just forbid parallel copy in CSV
> mode, at least for now? I guess it depends on the actual use case. If we
> expect to be parallel loading humungous CSVs then that won't fly.

Loading large CSV files is pretty common here.  I hope this can be supported.



MIKE BLACKWELL





Re: Parallel copy

From
Ants Aasma
Date:
On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> >
> > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > This is something similar to what I had also in mind for this idea.  I
> > > had thought of handing over complete chunk (64K or whatever we
> > > decide).  The one thing that slightly bothers me is that we will add
> > > some additional overhead of copying to and from shared memory which
> > > was earlier from local process memory.  And, the tokenization (finding
> > > line boundaries) would be serial.  I think that tokenization should be
> > > a small part of the overall work we do during the copy operation, but
> > > will do some measurements to ascertain the same.
> >
> > I don't think any extra copying is needed.
> >
>
> I am talking about access to shared memory instead of the process
> local memory.  I understand that an extra copy won't be required.
>
> > The reader can directly
> > fread()/pq_copymsgbytes() into shared memory, and the workers can run
> > CopyReadLineText() inner loop directly off of the buffer in shared memory.
> >
>
> I am slightly confused here.  AFAIU, the for(;;) loop in
> CopyReadLineText is about finding the line endings which we thought
> that the reader process will do.

Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
currently copies data from cstate->raw_buf to the StringInfo in
cstate->line_buf. In parallel mode it would copy it from the shared data buffer
to local line_buf until it hits the line end found by the data reader. The
amount of copying done is still exactly the same as it is now.

Regards,
Ants Aasma



Re: Parallel copy

From
David Fetter
Date:
On Tue, Feb 18, 2020 at 06:51:29PM +0530, Amit Kapila wrote:
> On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> >
> > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > This is something similar to what I had also in mind for this idea.  I
> > > had thought of handing over complete chunk (64K or whatever we
> > > decide).  The one thing that slightly bothers me is that we will add
> > > some additional overhead of copying to and from shared memory which
> > > was earlier from local process memory.  And, the tokenization (finding
> > > line boundaries) would be serial.  I think that tokenization should be
> > > a small part of the overall work we do during the copy operation, but
> > > will do some measurements to ascertain the same.
> >
> > I don't think any extra copying is needed.
> 
> I am talking about access to shared memory instead of the process
> local memory.  I understand that an extra copy won't be required.

Isn't accessing shared memory from different pieces of execution what
threads were designed to do?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Parallel copy

From
Amit Kapila
Date:
On Tue, Feb 18, 2020 at 8:41 PM David Fetter <david@fetter.org> wrote:
>
> On Tue, Feb 18, 2020 at 06:51:29PM +0530, Amit Kapila wrote:
> > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> > >
> > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > This is something similar to what I had also in mind for this idea.  I
> > > > had thought of handing over complete chunk (64K or whatever we
> > > > decide).  The one thing that slightly bothers me is that we will add
> > > > some additional overhead of copying to and from shared memory which
> > > > was earlier from local process memory.  And, the tokenization (finding
> > > > line boundaries) would be serial.  I think that tokenization should be
> > > > a small part of the overall work we do during the copy operation, but
> > > > will do some measurements to ascertain the same.
> > >
> > > I don't think any extra copying is needed.
> >
> > I am talking about access to shared memory instead of the process
> > local memory.  I understand that an extra copy won't be required.
>
> Isn't accessing shared memory from different pieces of execution what
> threads were designed to do?
>

Sorry, but I don't understand what you mean by the above.  We are
going to use background workers (which are processes) for parallel
workers.  In general, it might not make a big difference to access
shared memory as compared to local memory, especially because the cost
of the other work in the copy is relatively higher.  But still, it is
a point to consider.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> > >
> > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > This is something similar to what I had also in mind for this idea.  I
> > > > had thought of handing over complete chunk (64K or whatever we
> > > > decide).  The one thing that slightly bothers me is that we will add
> > > > some additional overhead of copying to and from shared memory which
> > > > was earlier from local process memory.  And, the tokenization (finding
> > > > line boundaries) would be serial.  I think that tokenization should be
> > > > a small part of the overall work we do during the copy operation, but
> > > > will do some measurements to ascertain the same.
> > >
> > > I don't think any extra copying is needed.
> > >
> >
> > I am talking about access to shared memory instead of the process
> > local memory.  I understand that an extra copy won't be required.
> >
> > > The reader can directly
> > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
> > > CopyReadLineText() inner loop directly off of the buffer in shared memory.
> > >
> >
> > I am slightly confused here.  AFAIU, the for(;;) loop in
> > CopyReadLineText is about finding the line endings which we thought
> > that the reader process will do.
>
> Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
> currently copies data from cstate->raw_buf to the StringInfo in
> cstate->line_buf. In parallel mode it would copy it from the shared data buffer
> to local line_buf until it hits the line end found by the data reader. The
> amount of copying done is still exactly the same as it is now.
>

Yeah, on a broader level it will be something like that, but actual
details might vary during implementation.  BTW, have you given any
thoughts on one other approach I have shared above [1]?  We might not
go with that idea, but it is better to discuss different ideas and
evaluate their pros and cons.

[1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Tue, Feb 18, 2020 at 7:51 PM Mike Blackwell <mike.blackwell@rrd.com> wrote:
> On Sun, Feb 16, 2020 at 12:51 AM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> >
> > IIRC, in_quote only matters here in CSV mode (because CSV fields can
> > have embedded newlines). So why not just forbid parallel copy in CSV
> > mode, at least for now? I guess it depends on the actual use case. If we
> > expect to be parallel loading humungous CSVs then that won't fly.
>
> Loading large CSV files is pretty common here.  I hope this can be supported.


Thank you for your input.  It is important and valuable.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
Ants Aasma
Date:
On Wed, 19 Feb 2020 at 06:22, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants@cybertec.at> wrote:
> >
> > On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> > > >
> > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > This is something similar to what I had also in mind for this idea.  I
> > > > > had thought of handing over complete chunk (64K or whatever we
> > > > > decide).  The one thing that slightly bothers me is that we will add
> > > > > some additional overhead of copying to and from shared memory which
> > > > > was earlier from local process memory.  And, the tokenization (finding
> > > > > line boundaries) would be serial.  I think that tokenization should be
> > > > > a small part of the overall work we do during the copy operation, but
> > > > > will do some measurements to ascertain the same.
> > > >
> > > > I don't think any extra copying is needed.
> > > >
> > >
> > > I am talking about access to shared memory instead of the process
> > > local memory.  I understand that an extra copy won't be required.
> > >
> > > > The reader can directly
> > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
> > > > CopyReadLineText() inner loop directly off of the buffer in shared memory.
> > > >
> > >
> > > I am slightly confused here.  AFAIU, the for(;;) loop in
> > > CopyReadLineText is about finding the line endings which we thought
> > > that the reader process will do.
> >
> > Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
> > currently copies data from cstate->raw_buf to the StringInfo in
> > cstate->line_buf. In parallel mode it would copy it from the shared data buffer
> > to local line_buf until it hits the line end found by the data reader. The
> > amount of copying done is still exactly the same as it is now.
> >
>
> Yeah, on a broader level it will be something like that, but actual
> details might vary during implementation.  BTW, have you given any
> thoughts on one other approach I have shared above [1]?  We might not
> go with that idea, but it is better to discuss different ideas and
> evaluate their pros and cons.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com

It seems to me that at least for the general CSV case the tokenization into
tuples is an inherently serial task. Adding thread synchronization to that path
for coordinating between multiple workers is only going to make it slower. It
may be possible to enforce limitations on the input (e.g. no quotes allowed) or
do some speculative tokenization (e.g. if we encounter quote before newline
assume the chunk started in a quoted section) to make it possible to do the
tokenization in parallel. But given that the simpler and more featured approach
of handling it in a single reader process looks to be fast enough, I don't see
the point. I rather think that the next big step would be to overlap reading
input and tokenization, hopefully by utilizing Andres's work on asyncio.

Regards,
Ants Aasma



Re: Parallel copy

From
Tomas Vondra
Date:
On Wed, Feb 19, 2020 at 11:02:15AM +0200, Ants Aasma wrote:
>On Wed, 19 Feb 2020 at 06:22, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants@cybertec.at> wrote:
>> >
>> > On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > >
>> > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
>> > > >
>> > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > > > > This is something similar to what I had also in mind for this idea.  I
>> > > > > had thought of handing over complete chunk (64K or whatever we
>> > > > > decide).  The one thing that slightly bothers me is that we will add
>> > > > > some additional overhead of copying to and from shared memory which
>> > > > > was earlier from local process memory.  And, the tokenization (finding
>> > > > > line boundaries) would be serial.  I think that tokenization should be
>> > > > > a small part of the overall work we do during the copy operation, but
>> > > > > will do some measurements to ascertain the same.
>> > > >
>> > > > I don't think any extra copying is needed.
>> > > >
>> > >
>> > > I am talking about access to shared memory instead of the process
>> > > local memory.  I understand that an extra copy won't be required.
>> > >
>> > > > The reader can directly
>> > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
>> > > > CopyReadLineText() inner loop directly off of the buffer in shared memory.
>> > > >
>> > >
>> > > I am slightly confused here.  AFAIU, the for(;;) loop in
>> > > CopyReadLineText is about finding the line endings which we thought
>> > > that the reader process will do.
>> >
>> > Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
>> > currently copies data from cstate->raw_buf to the StringInfo in
>> > cstate->line_buf. In parallel mode it would copy it from the shared data buffer
>> > to local line_buf until it hits the line end found by the data reader. The
>> > amount of copying done is still exactly the same as it is now.
>> >
>>
>> Yeah, on a broader level it will be something like that, but actual
>> details might vary during implementation.  BTW, have you given any
>> thoughts on one other approach I have shared above [1]?  We might not
>> go with that idea, but it is better to discuss different ideas and
>> evaluate their pros and cons.
>>
>> [1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com
>
>It seems to be that at least for the general CSV case the tokenization to
>tuples is an inherently serial task. Adding thread synchronization to that path
>for coordinating between multiple workers is only going to make it slower. It
>may be possible to enforce limitations on the input (e.g. no quotes allowed) or
>do some speculative tokenization (e.g. if we encounter quote before newline
>assume the chunk started in a quoted section) to make it possible to do the
>tokenization in parallel. But given that the simpler and more featured approach
>of handling it in a single reader process looks to be fast enough, I don't see
>the point. I rather think that the next big step would be to overlap reading
>input and tokenization, hopefully by utilizing Andres's work on asyncio.
>

I generally agree with the impression that parsing CSV is tricky and
unlikely to benefit from parallelism in general. There may be cases with
restrictions making it easier (e.g. restrictions on the format) but that
might be a bit too complex to start with.

For example, I had an idea to parallelise the parsing by splitting it
into two phases:

1) indexing

Split the CSV file into equally-sized chunks, make each worker just
scan through its chunk and store the positions of delimiters, quotes,
newlines etc. (a rough sketch of this pass is below). This is probably
the most expensive part of the parsing (essentially going char by
char), and we'd speed it up linearly.

2) merge

Combine the information from (1) in a single process, and actually parse
the CSV data - we would not have to inspect each character, because we'd
know positions of interesting chars, so this should be fast. We might
have to recheck some stuff (e.g. escaping) but it should still be much
faster.
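
To illustrate phase (1), a rough sketch of the per-chunk pass (the
struct and function names are made up, and multi-byte encodings and
escape handling are ignored):

#include "postgres.h"

/* Offsets of interesting characters found in one chunk (phase 1 output). */
typedef struct ChunkIndex
{
    int    *delims;        /* caller-allocated, up to len entries each */
    int    *quotes;
    int    *newlines;
    int     ndelims;
    int     nquotes;
    int     nnewlines;
} ChunkIndex;

/*
 * Phase 1: each worker scans only its own chunk and records positions.
 * No decision is made here about what is a field or a record boundary.
 */
static void
index_chunk(const char *chunk, int len, ChunkIndex *ci)
{
    ci->ndelims = ci->nquotes = ci->nnewlines = 0;

    for (int i = 0; i < len; i++)
    {
        char    c = chunk[i];

        if (c == ',')
            ci->delims[ci->ndelims++] = i;
        else if (c == '"')
            ci->quotes[ci->nquotes++] = i;
        else if (c == '\n')
            ci->newlines[ci->nnewlines++] = i;
    }
}

/*
 * Phase 2 (single process): walk the recorded positions from all chunks in
 * order, tracking quoting state, and decide which newlines are real record
 * boundaries and where each field starts and ends - without re-reading
 * every byte.
 */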

But yes, this may be a bit complex and I'm not sure it's worth it.

The one piece of information I'm missing here is at least a very rough
quantification of the individual steps of CSV processing - for example
if parsing takes only 10% of the time, it's pretty pointless to start by
parallelising this part and we should focus on the rest. If it's 50% it
might be a different story. Has anyone done any measurements?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Parallel copy

From
Amit Kapila
Date:
On Wed, Feb 19, 2020 at 4:08 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> The one piece of information I'm missing here is at least a very rough
> quantification of the individual steps of CSV processing - for example
> if parsing takes only 10% of the time, it's pretty pointless to start by
> parallelising this part and we should focus on the rest. If it's 50% it
> might be a different story.
>

Right, this is important information to know.

> Has anyone done any measurements?
>

Not yet, but planning to work on it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
David Fetter
Date:
On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> This work is to parallelize the copy command and in particular "Copy
> <table_name> from 'filename' Where <condition>;" command.

Apropos of the initial parsing issue generally, there's an interesting
approach taken here: https://github.com/robertdavidgraham/wc2

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Feb 20, 2020 at 5:12 AM David Fetter <david@fetter.org> wrote:
>
> On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> > This work is to parallelize the copy command and in particular "Copy
> > <table_name> from 'filename' Where <condition>;" command.
>
> Apropos of the initial parsing issue generally, there's an interesting
> approach taken here: https://github.com/robertdavidgraham/wc2
>

Thanks for sharing.  I might be missing something, but I can't figure
out how this can help here.  Does this in some way help to allow
multiple workers to read and tokenize the chunks?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Tomas Vondra
Date:
On Thu, Feb 20, 2020 at 04:11:39PM +0530, Amit Kapila wrote:
>On Thu, Feb 20, 2020 at 5:12 AM David Fetter <david@fetter.org> wrote:
>>
>> On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
>> > This work is to parallelize the copy command and in particular "Copy
>> > <table_name> from 'filename' Where <condition>;" command.
>>
>> Apropos of the initial parsing issue generally, there's an interesting
>> approach taken here: https://github.com/robertdavidgraham/wc2
>>
>
>Thanks for sharing.  I might be missing something, but I can't figure
>out how this can help here.  Does this in some way help to allow
>multiple workers to read and tokenize the chunks?
>

I think wc2 is showing that maybe instead of parallelizing the
parsing, we might instead try using a different tokenizer/parser and
make the implementation more efficient instead of just throwing more
CPUs at it.

I don't know if our code is similar to what wc does; maybe parsing
CSV is more complicated than what wc does.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Parallel copy

From
David Fetter
Date:
On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
> On Thu, Feb 20, 2020 at 04:11:39PM +0530, Amit Kapila wrote:
> > On Thu, Feb 20, 2020 at 5:12 AM David Fetter <david@fetter.org> wrote:
> > > 
> > > On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> > > > This work is to parallelize the copy command and in particular "Copy
> > > > <table_name> from 'filename' Where <condition>;" command.
> > > 
> > > Apropos of the initial parsing issue generally, there's an interesting
> > > approach taken here: https://github.com/robertdavidgraham/wc2
> > > 
> > 
> > Thanks for sharing.  I might be missing something, but I can't figure
> > out how this can help here.  Does this in some way help to allow
> > multiple workers to read and tokenize the chunks?
> 
> I think the wc2 is showing that maybe instead of parallelizing the
> parsing, we might instead try using a different tokenizer/parser and
> make the implementation more efficient instead of just throwing more
> CPUs on it.

That was what I had in mind.

> I don't know if our code is similar to what wc does, maytbe parsing
> csv is more complicated than what wc does.

CSV parsing differs from wc in that there are more states in the state
machine, but I don't see anything fundamentally different.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Parallel copy

From
Ants Aasma
Date:
On Thu, 20 Feb 2020 at 18:43, David Fetter <david@fetter.org> wrote:
> On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
> > I think the wc2 is showing that maybe instead of parallelizing the
> > parsing, we might instead try using a different tokenizer/parser and
> > make the implementation more efficient instead of just throwing more
> > CPUs on it.
>
> That was what I had in mind.
>
> > I don't know if our code is similar to what wc does, maytbe parsing
> > csv is more complicated than what wc does.
>
> CSV parsing differs from wc in that there are more states in the state
> machine, but I don't see anything fundamentally different.

The trouble with a state machine based approach is that the state
transitions form a dependency chain, which means that at best the
processing rate will be 4-5 cycles per byte (L1 latency to fetch the
next state).

I whipped together a quick prototype that uses SIMD and bitmap
manipulations to do the equivalent of CopyReadLineText() in csv mode
including quotes and escape handling, this runs at 0.25-0.5 cycles per
byte.
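
To give an idea of the general shape (an illustrative sketch only, not
the attached prototype; the helper name is invented):

#include <emmintrin.h>
#include <stdint.h>

/*
 * Classify one 16-byte block with SSE2: bit i of each result is set when
 * byte i of the block is a quote or a newline, respectively.  A prefix-XOR
 * over the quote bitmap gives an "inside quotes" mask, and
 * newline_bits & ~inside_quotes then marks the real record boundaries.
 */
static inline void
classify_block16(const char *p, uint16_t *quote_bits, uint16_t *newline_bits)
{
    __m128i block    = _mm_loadu_si128((const __m128i *) p);
    __m128i quotes   = _mm_cmpeq_epi8(block, _mm_set1_epi8('"'));
    __m128i newlines = _mm_cmpeq_epi8(block, _mm_set1_epi8('\n'));

    *quote_bits   = (uint16_t) _mm_movemask_epi8(quotes);
    *newline_bits = (uint16_t) _mm_movemask_epi8(newlines);
}

The per-byte dependency chain disappears; the only state carried from
one block to the next is whether we ended inside a quoted section.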

Regards,
Ants Aasma

Attachment

Re: Parallel copy

From
Tomas Vondra
Date:
On Fri, Feb 21, 2020 at 02:54:31PM +0200, Ants Aasma wrote:
>On Thu, 20 Feb 2020 at 18:43, David Fetter <david@fetter.org> wrote:
>> On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
>> > I think the wc2 is showing that maybe instead of parallelizing the
>> > parsing, we might instead try using a different tokenizer/parser and
>> > make the implementation more efficient instead of just throwing more
>> > CPUs on it.
>>
>> That was what I had in mind.
>>
>> > I don't know if our code is similar to what wc does, maytbe parsing
>> > csv is more complicated than what wc does.
>>
>> CSV parsing differs from wc in that there are more states in the state
>> machine, but I don't see anything fundamentally different.
>
>The trouble with a state machine based approach is that the state
>transitions form a dependency chain, which means that at best the
>processing rate will be 4-5 cycles per byte (L1 latency to fetch the
>next state).
>
>I whipped together a quick prototype that uses SIMD and bitmap
>manipulations to do the equivalent of CopyReadLineText() in csv mode
>including quotes and escape handling, this runs at 0.25-0.5 cycles per
>byte.
>

Interesting. How does that compare to what we currently have?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel copy

From
Robert Haas
Date:
On Tue, Feb 18, 2020 at 6:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I am talking about access to shared memory instead of the process
> local memory.  I understand that an extra copy won't be required.

You make it sound like there is some performance penalty for accessing
shared memory, but I don't think that's true. It's true that
*contended* access to shared memory can be slower, because if multiple
processes are trying to access the same memory, and especially if
multiple processes are trying to write the same memory, then the cache
lines have to be shared and that has a cost. However, I don't think
that would create any noticeable effect in this case. First, there's
presumably only one writer process. Second, you wouldn't normally have
multiple readers working on the same part of the data at the same
time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Andres Freund
Date:
Hi,

On 2020-02-19 11:38:45 +0100, Tomas Vondra wrote:
> I generally agree with the impression that parsing CSV is tricky and
> unlikely to benefit from parallelism in general. There may be cases with
> restrictions making it easier (e.g. restrictions on the format) but that
> might be a bit too complex to start with.
> 
> For example, I had an idea to parallelise the planning by splitting it
> into two phases:

FWIW, I think we ought to rewrite our COPY parsers before we go for
complex schemes. They're way slower than a decent green-field
CSV/... parser.


> The one piece of information I'm missing here is at least a very rough
> quantification of the individual steps of CSV processing - for example
> if parsing takes only 10% of the time, it's pretty pointless to start by
> parallelising this part and we should focus on the rest. If it's 50% it
> might be a different story. Has anyone done any measurements?

Not recently, but I'm pretty sure that I've observed CSV parsing to be
way more than 10%.

Greetings,

Andres Freund



Re: Parallel copy

From
Tomas Vondra
Date:
On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
>Hi,
>
>On 2020-02-19 11:38:45 +0100, Tomas Vondra wrote:
>> I generally agree with the impression that parsing CSV is tricky and
>> unlikely to benefit from parallelism in general. There may be cases with
>> restrictions making it easier (e.g. restrictions on the format) but that
>> might be a bit too complex to start with.
>>
>> For example, I had an idea to parallelise the planning by splitting it
>> into two phases:
>
>FWIW, I think we ought to rewrite our COPY parsers before we go for
>complex schemes. They're way slower than a decent green-field
>CSV/... parser.
>

Yep, that's quite possible.

>
>> The one piece of information I'm missing here is at least a very rough
>> quantification of the individual steps of CSV processing - for example
>> if parsing takes only 10% of the time, it's pretty pointless to start by
>> parallelising this part and we should focus on the rest. If it's 50% it
>> might be a different story. Has anyone done any measurements?
>
>Not recently, but I'm pretty sure that I've observed CSV parsing to be
>way more than 10%.
>

Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
so I still think we need to do some measurements first. I'm willing to
do that, but (a) I doubt I'll have time for that until after 2020-03,
and (b) it'd be good to agree on some set of typical CSV files.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel copy

From
Amit Kapila
Date:
On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> >Hi,
> >
> >> The one piece of information I'm missing here is at least a very rough
> >> quantification of the individual steps of CSV processing - for example
> >> if parsing takes only 10% of the time, it's pretty pointless to start by
> >> parallelising this part and we should focus on the rest. If it's 50% it
> >> might be a different story. Has anyone done any measurements?
> >
> >Not recently, but I'm pretty sure that I've observed CSV parsing to be
> >way more than 10%.
> >
>
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first.
>

Agreed.

> I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.
>

Right, I don't know what is the best way to define that.  I can think
of the below tests.

1. A table with 10 columns (with datatypes as integers, date, text).
It has one index (unique/primary). Load with 1 million rows (basically
the data should be probably 5-10 GB).
2. A table with 10 columns (with datatypes as integers, date, text).
It has three indexes, one index can be (unique/primary). Load with 1
million rows (basically the data should be probably 5-10 GB).
3. A table with 10 columns (with datatypes as integers, date, text).
It has three indexes, one index can be (unique/primary). It has before
and after triggers. Load with 1 million rows (basically the data
should be probably 5-10 GB).
4. A table with 10 columns (with datatypes as integers, date, text).
It has five or six indexes, one index can be (unique/primary). Load
with 1 million rows (basically the data should be probably 5-10 GB).

Among all these tests, we can check how much time we spend in
reading and parsing the csv files vs. the rest of the execution.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Alastair Turner
Date:
On Wed, 26 Feb 2020 at 10:54, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
...
> >
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first.
> >
>
> Agreed.
>
> > I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
> >
>
> Right, I don't know what is the best way to define that.  I can think
> of the below tests.
>
> 1. A table with 10 columns (with datatypes as integers, date, text).
> It has one index (unique/primary). Load with 1 million rows (basically
> the data should be probably 5-10 GB).
> 2. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). Load with 1
> million rows (basically the data should be probably 5-10 GB).
> 3. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). It has before
> and after trigeers. Load with 1 million rows (basically the data
> should be probably 5-10 GB).
> 4. A table with 10 columns (with datatypes as integers, date, text).
> It has five or six indexes, one index can be (unique/primary). Load
> with 1 million rows (basically the data should be probably 5-10 GB).
>
> Among all these tests, we can check how much time did we spend in
> reading, parsing the csv files vs. rest of execution?

That's a good set of tests of what happens after the parse. Two
simpler test runs may provide useful baselines - no
constraints/indexes with all columns varchar and no
constraints/indexes with columns correctly typed.

For testing the impact of various parts of the parse process, my idea would be:
 - A base dataset with 10 columns including int, date and text. One
text field quoted and containing both delimiters and line terminators
 - A derivative to measure just line parsing - strip the quotes around
the text field and quote the whole row as one text field
 - A derivative to measure the impact of quoted fields - clean up the
text field so it doesn't require quoting
 - A derivative to measure the impact of row length - run ten rows
together to make 100 column rows, but only a tenth as many rows

If that sounds reasonable, I'll try to knock up a generator.

--
Alastair



Re: Parallel copy

From
Ants Aasma
Date:
On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first. I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.

I agree that getting a nice varied dataset would be nice. Including
things like narrow integer only tables, strings with newlines and
escapes in them, extremely wide rows.

I tried to capture a quick profile just to see what it looks like.
Grabbed a random open data set from the web, about 800MB of narrow
rows CSV [1].

Script:
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Profile:
# Samples: 59K of event 'cycles:u'
# Event count (approx.): 57644269486
#
# Overhead  Command   Shared Object       Symbol
# ........  ........  ..................
.......................................
#
    18.24%  postgres  postgres            [.] CopyReadLine
     9.23%  postgres  postgres            [.] NextCopyFrom
     8.87%  postgres  postgres            [.] NextCopyFromRawFields
     5.82%  postgres  postgres            [.] pg_verify_mbstr_len
     5.45%  postgres  postgres            [.] pg_strtoint32
     4.16%  postgres  postgres            [.] heap_fill_tuple
     4.03%  postgres  postgres            [.] heap_compute_data_size
     3.83%  postgres  postgres            [.] CopyFrom
     3.78%  postgres  postgres            [.] AllocSetAlloc
     3.53%  postgres  postgres            [.] heap_form_tuple
     2.96%  postgres  postgres            [.] InputFunctionCall
     2.89%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned_erms
     1.82%  postgres  libc-2.30.so        [.] __strlen_avx2
     1.72%  postgres  postgres            [.] AllocSetReset
     1.72%  postgres  postgres            [.] RelationPutHeapTuple
     1.47%  postgres  postgres            [.] heap_prepare_insert
     1.31%  postgres  postgres            [.] heap_multi_insert
     1.25%  postgres  postgres            [.] textin
     1.24%  postgres  postgres            [.] int4in
     1.05%  postgres  postgres            [.] tts_buffer_heap_clear
     0.85%  postgres  postgres            [.] pg_any_to_server
     0.80%  postgres  postgres            [.] pg_comp_crc32c_sse42
     0.77%  postgres  postgres            [.] cstring_to_text_with_len
     0.69%  postgres  postgres            [.] AllocSetFree
     0.60%  postgres  postgres            [.] appendBinaryStringInfo
     0.55%  postgres  postgres            [.] tts_buffer_heap_materialize.part.0
     0.54%  postgres  postgres            [.] palloc
     0.54%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned
     0.51%  postgres  postgres            [.] palloc0
     0.51%  postgres  postgres            [.] pg_encoding_max_length
     0.48%  postgres  postgres            [.] enlargeStringInfo
     0.47%  postgres  postgres            [.] ExecStoreVirtualTuple
     0.45%  postgres  postgres            [.] PageAddItemExtended

So that confirms that the parsing is a huge chunk of overhead with
current splitting into lines being the largest portion. Amdahl's law
says that splitting into tuples needs to be made fast before
parallelizing makes any sense.

Regards,
Ants Aasma

[1]
https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip



Re: Parallel copy

From
Dilip Kumar
Date:
On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be nice. Including
> things like narrow integer only tables, strings with newlines and
> escapes in them, extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like.
> Grabbed a random open data set from the web, about 800MB of narrow
> rows CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead  Command   Shared Object       Symbol
> # ........  ........  ..................
> .......................................
> #
>     18.24%  postgres  postgres            [.] CopyReadLine
>      9.23%  postgres  postgres            [.] NextCopyFrom
>      8.87%  postgres  postgres            [.] NextCopyFromRawFields
>      5.82%  postgres  postgres            [.] pg_verify_mbstr_len
>      5.45%  postgres  postgres            [.] pg_strtoint32
>      4.16%  postgres  postgres            [.] heap_fill_tuple
>      4.03%  postgres  postgres            [.] heap_compute_data_size
>      3.83%  postgres  postgres            [.] CopyFrom
>      3.78%  postgres  postgres            [.] AllocSetAlloc
>      3.53%  postgres  postgres            [.] heap_form_tuple
>      2.96%  postgres  postgres            [.] InputFunctionCall
>      2.89%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned_erms
>      1.82%  postgres  libc-2.30.so        [.] __strlen_avx2
>      1.72%  postgres  postgres            [.] AllocSetReset
>      1.72%  postgres  postgres            [.] RelationPutHeapTuple
>      1.47%  postgres  postgres            [.] heap_prepare_insert
>      1.31%  postgres  postgres            [.] heap_multi_insert
>      1.25%  postgres  postgres            [.] textin
>      1.24%  postgres  postgres            [.] int4in
>      1.05%  postgres  postgres            [.] tts_buffer_heap_clear
>      0.85%  postgres  postgres            [.] pg_any_to_server
>      0.80%  postgres  postgres            [.] pg_comp_crc32c_sse42
>      0.77%  postgres  postgres            [.] cstring_to_text_with_len
>      0.69%  postgres  postgres            [.] AllocSetFree
>      0.60%  postgres  postgres            [.] appendBinaryStringInfo
>      0.55%  postgres  postgres            [.] tts_buffer_heap_materialize.part.0
>      0.54%  postgres  postgres            [.] palloc
>      0.54%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned
>      0.51%  postgres  postgres            [.] palloc0
>      0.51%  postgres  postgres            [.] pg_encoding_max_length
>      0.48%  postgres  postgres            [.] enlargeStringInfo
>      0.47%  postgres  postgres            [.] ExecStoreVirtualTuple
>      0.45%  postgres  postgres            [.] PageAddItemExtended
>
> So that confirms that the parsing is a huge chunk of overhead with
> current splitting into lines being the largest portion. Amdahl's law
> says that splitting into tuples needs to be made fast before
> parallelizing makes any sense.
>

I have run a very simple case on a table with 2 indexes and I can see
that a lot of time is spent in index insertion.  I agree that a good
amount of time is spent in tokenizing, but it is not very large
compared to index insertion.

I have expanded the time spent in the CopyFrom function from my perf
report and pasted it here.  We can see that a lot of time is spent in
ExecInsertIndexTuples (77%).  I agree that we need to further evaluate
how much of that is I/O vs. CPU operations.  But the point I want to
make is that it's not true for all cases that parsing takes the
maximum amount of time.

   - 99.50% CopyFrom
      - 82.90% CopyMultiInsertInfoFlush
         - 82.85% CopyMultiInsertBufferFlush
            + 77.68% ExecInsertIndexTuples
            + 3.74% table_multi_insert
            + 0.89% ExecClearTuple
      - 12.54% NextCopyFrom
         - 7.70% NextCopyFromRawFields
            - 5.72% CopyReadLine
                 3.96% CopyReadLineText
               + 1.49% pg_any_to_server
              1.86% CopyReadAttributesCSV
         + 3.68% InputFunctionCall
      + 2.11% ExecMaterializeSlot
      + 0.94% MemoryContextReset

My test:
-- Prepare:
CREATE TABLE t (a int, b int, c varchar);
insert into t select i,i, 'aaaaaaaaaaaaaaaaaaaaaaaa' from
generate_series(1,10000000) as i;
copy t to '/home/dilipkumar/a.csv'  WITH (FORMAT 'csv', HEADER true);
truncate table t;
create index idx on t(a);
create index idx1 on t(c);

-- Test CopyFrom and measure with perf:
copy t from '/home/dilipkumar/a.csv'  WITH (FORMAT 'csv', HEADER true);

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
On Wed, Feb 26, 2020 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> > >Hi,
> > >
> > >> The one piece of information I'm missing here is at least a very rough
> > >> quantification of the individual steps of CSV processing - for example
> > >> if parsing takes only 10% of the time, it's pretty pointless to start by
> > >> parallelising this part and we should focus on the rest. If it's 50% it
> > >> might be a different story. Has anyone done any measurements?
> > >
> > >Not recently, but I'm pretty sure that I've observed CSV parsing to be
> > >way more than 10%.
> > >
> >
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first.
> >
>
> Agreed.
>
> > I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
> >
>
> Right, I don't know what is the best way to define that.  I can think
> of the below tests.
>
> 1. A table with 10 columns (with datatypes as integers, date, text).
> It has one index (unique/primary). Load with 1 million rows (basically
> the data should be probably 5-10 GB).
> 2. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). Load with 1
> million rows (basically the data should be probably 5-10 GB).
> 3. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). It has before
> and after trigeers. Load with 1 million rows (basically the data
> should be probably 5-10 GB).
> 4. A table with 10 columns (with datatypes as integers, date, text).
> It has five or six indexes, one index can be (unique/primary). Load
> with 1 million rows (basically the data should be probably 5-10 GB).
>

I have tried to capture the execution time taken for 3 scenarios which I felt could give a fair idea:
Test1 (Table with 3 indexes and 1 trigger)
Test2 (Table with 2 indexes)
Test3 (Table without indexes/triggers)

I have captured the following details:
File Read Time - time taken to read the file, in the CopyGetData function.
Read Line Time - time taken to read a line, in the NextCopyFrom function (read time & tokenize time excluded).
Tokenize Time - time taken to tokenize the contents, in the NextCopyFromRawFields function.
Data Execution Time - the remaining execution time out of the total time.

The execution breakdown for the tests is given below (all times in seconds):
Test                            Total Time   File Read   Read Line   Tokenize   Data Execution
Test1 (3 indexes + 1 trigger)     1693.369       0.256      34.173      5.578         1653.362
Test2 (2 indexes)                  736.096       0.288      39.762      6.525          689.521
Test3 (no indexes/triggers)        112.060       0.266      39.189      6.433           66.172

Steps for the scenarios:
Test1(Table with 3 indexes and 1 trigger):
CREATE TABLE census2 (year int,age int,ethnic int,sex int,area text,count text);
CREATE TABLE census3(year int,age int,ethnic int,sex int,area text,count text);

CREATE INDEX idx1_census2 on census2(year);
CREATE INDEX idx2_census2 on census2(age);
CREATE INDEX idx3_census2 on census2(ethnic);

CREATE or REPLACE FUNCTION census2_afterinsert()
RETURNS TRIGGER
AS $$
BEGIN
  INSERT INTO census3  SELECT * FROM census2 limit 1;
  RETURN NEW;
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER census2_trigger AFTER INSERT  ON census2 FOR EACH ROW EXECUTE PROCEDURE census2_afterinsert();
COPY census2 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Test2 (Table with 2 indexes):
CREATE TABLE census1 (year int,age int,ethnic int,sex int,area text,count text);
CREATE INDEX idx1_census1 on census1(year);
CREATE INDEX idx2_census1 on census1(age);
COPY census1 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Test3 (Table without indexes/triggers):
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Note: The Data8277.csv used was the same data that Ants Aasma had used.

From the above results we could infer that Read Line will have to be
done sequentially. The Read Line time takes about 2.01%, 5.40% and
34.97% of the total time. I feel we will be able to parallelise the
remaining phases of the copy. The performance improvement will vary
based on the scenario (indexes/triggers); it will be proportionate to
the number of indexes and triggers. Read Line can also be parallelised
in txt format (non-csv). I feel parallelising copy could give a
significant improvement in quite a few scenarios.

Further, I'm planning to see how the execution breaks down for a toast
table. I'm also planning to run the tests on a RAM disk, where I will
place the data on the RAM disk so that we can further eliminate the
I/O cost.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
vignesh C
Date:
On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be nice. Including
> things like narrow integer only tables, strings with newlines and
> escapes in them, extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like.
> Grabbed a random open data set from the web, about 800MB of narrow
> rows CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead  Command   Shared Object       Symbol
> # ........  ........  ..................
> .......................................
> #
>     18.24%  postgres  postgres            [.] CopyReadLine
>      9.23%  postgres  postgres            [.] NextCopyFrom
>      8.87%  postgres  postgres            [.] NextCopyFromRawFields
>      5.82%  postgres  postgres            [.] pg_verify_mbstr_len
>      5.45%  postgres  postgres            [.] pg_strtoint32
>      4.16%  postgres  postgres            [.] heap_fill_tuple
>      4.03%  postgres  postgres            [.] heap_compute_data_size
>      3.83%  postgres  postgres            [.] CopyFrom
>      3.78%  postgres  postgres            [.] AllocSetAlloc
>      3.53%  postgres  postgres            [.] heap_form_tuple
>      2.96%  postgres  postgres            [.] InputFunctionCall
>      2.89%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned_erms
>      1.82%  postgres  libc-2.30.so        [.] __strlen_avx2
>      1.72%  postgres  postgres            [.] AllocSetReset
>      1.72%  postgres  postgres            [.] RelationPutHeapTuple
>      1.47%  postgres  postgres            [.] heap_prepare_insert
>      1.31%  postgres  postgres            [.] heap_multi_insert
>      1.25%  postgres  postgres            [.] textin
>      1.24%  postgres  postgres            [.] int4in
>      1.05%  postgres  postgres            [.] tts_buffer_heap_clear
>      0.85%  postgres  postgres            [.] pg_any_to_server
>      0.80%  postgres  postgres            [.] pg_comp_crc32c_sse42
>      0.77%  postgres  postgres            [.] cstring_to_text_with_len
>      0.69%  postgres  postgres            [.] AllocSetFree
>      0.60%  postgres  postgres            [.] appendBinaryStringInfo
>      0.55%  postgres  postgres            [.] tts_buffer_heap_materialize.part.0
>      0.54%  postgres  postgres            [.] palloc
>      0.54%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned
>      0.51%  postgres  postgres            [.] palloc0
>      0.51%  postgres  postgres            [.] pg_encoding_max_length
>      0.48%  postgres  postgres            [.] enlargeStringInfo
>      0.47%  postgres  postgres            [.] ExecStoreVirtualTuple
>      0.45%  postgres  postgres            [.] PageAddItemExtended
>
> So that confirms that the parsing is a huge chunk of overhead with
> current splitting into lines being the largest portion. Amdahl's law
> says that splitting into tuples needs to be made fast before
> parallelizing makes any sense.
>

I had taken a perf report with the same test data that you had used; I was getting the following results:
.....
+   99.61%     0.00%  postgres  postgres            [.] PortalRun
+   99.61%     0.00%  postgres  postgres            [.] PortalRunMulti
+   99.61%     0.00%  postgres  postgres            [.] PortalRunUtility
+   99.61%     0.00%  postgres  postgres            [.] ProcessUtility
+   99.61%     0.00%  postgres  postgres            [.] standard_ProcessUtility
+   99.61%     0.00%  postgres  postgres            [.] DoCopy
+   99.30%     0.94%  postgres  postgres            [.] CopyFrom
+   51.61%     7.76%  postgres  postgres            [.] NextCopyFrom
+   23.66%     0.01%  postgres  postgres            [.] CopyMultiInsertInfoFlush
+   23.61%     0.28%  postgres  postgres            [.] CopyMultiInsertBufferFlush
+   21.99%     1.02%  postgres  postgres            [.] NextCopyFromRawFields
+   19.79%     0.01%  postgres  postgres            [.] table_multi_insert
+   19.32%     3.00%  postgres  postgres            [.] heap_multi_insert
+   18.27%     2.44%  postgres  postgres            [.] InputFunctionCall
+   15.19%     0.89%  postgres  postgres            [.] CopyReadLine
+   13.05%     0.18%  postgres  postgres            [.] ExecMaterializeSlot
+   13.00%     0.55%  postgres  postgres            [.] tts_buffer_heap_materialize
+   12.31%     1.77%  postgres  postgres            [.] heap_form_tuple
+   10.43%     0.45%  postgres  postgres            [.] int4in
+   10.18%     8.92%  postgres  postgres            [.] CopyReadLineText 
......

In my results I observed that the execution of table_multi_insert was
nearly 20%. Also, I feel that once we have made a few tuples from
CopyReadLine, the parallel workers should be able to start consuming
and processing the data; we need not wait for the complete
tokenisation to be finished. Once a few tuples are tokenised, the
parallel workers should start consuming the data in parallel while
tokenisation continues simultaneously. In this way, once the copy is
done in parallel, the total execution time should be the CopyReadLine
time + a delta of processing time.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
vignesh C
Date:
I have got the execution breakdown for a few scenarios with a normal disk and a RAM disk.

Execution breakup in normal disk (all times in seconds):

Test                            Total Time   File Read   CopyReadLine   Remaining Execution   Read Line %
Test1 (3 indexes + 1 trigger)     2099.017       0.311         10.256              2088.450         0.49%
Test2 (2 indexes)                  657.994       0.303         10.171               647.520         1.55%
Test3 (no index, no trigger)       112.410       0.296         10.996               101.118         9.78%
Test4 (toast)                      360.028       1.430         46.556               312.042        12.93%

Execution breakup in RAM disk (all times in seconds):

Test                            Total Time   File Read   CopyReadLine   Remaining Execution   Read Line %
Test1 (3 indexes + 1 trigger)     1571.558       0.259          6.986              1564.313         0.44%
Test2 (2 indexes)                  369.942       0.263          6.848               362.831         1.85%
Test3 (no index, no trigger)        54.077       0.239          6.805                47.033        12.58%
Test4 (toast)                       96.323       0.918         26.603                68.802        27.62%

Steps for the scenarios:
Test1(Table with 3 indexes and 1 trigger):
CREATE TABLE census2 (year int,age int,ethnic int,sex int,area text,count text);
CREATE TABLE census3(year int,age int,ethnic int,sex int,area text,count text);

CREATE INDEX idx1_census2 on census2(year);
CREATE INDEX idx2_census2 on census2(age);
CREATE INDEX idx3_census2 on census2(ethnic);

CREATE or REPLACE FUNCTION census2_afterinsert()
RETURNS TRIGGER
AS $$
BEGIN
  INSERT INTO census3  SELECT * FROM census2 limit 1;
  RETURN NEW;
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER census2_trigger AFTER INSERT  ON census2 FOR EACH ROW EXECUTE PROCEDURE census2_afterinsert();
COPY census2 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Test2 (Table with 2 indexes):
CREATE TABLE census1 (year int,age int,ethnic int,sex int,area text,count text);
CREATE INDEX idx1_census1 on census1(year);
CREATE INDEX idx2_census1 on census1(age);
COPY census1 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Test3 (Table without indexes/triggers):
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

A random open data set from the web, about 800MB of narrow-row CSV [1], was used in the above tests - the same one that Ants Aasma had used.

Test4 (Toast table):
CREATE TABLE indtoasttest(descr text, cnt int DEFAULT 0, f1 text, f2 text);
alter table indtoasttest alter column f1 set storage external;
alter table indtoasttest alter column f2 set storage external;
inserted 262144 records
copy indtoasttest to '/mnt/magnetic/vignesh.c/postgres/toast_data3.csv'  WITH (FORMAT 'csv', HEADER true);

CREATE TABLE indtoasttest1(descr text, cnt int DEFAULT 0, f1 text, f2 text);
alter table indtoasttest1 alter column f1 set storage external;
alter table indtoasttest1 alter column f2 set storage external;
copy indtoasttest1 from '/mnt/magnetic/vignesh.c/postgres/toast_data3.csv'  WITH (FORMAT 'csv', HEADER true);

We could infer that Read line Time cannot be parallelized, this is mainly because if the data has quote present we will not be able to differentiate if it is part of previous record or it is part of current record. The rest of the execution time can be parallelized. Read line Time takes about 0.5%, 1.5%, 9.8% & 12.9% of the total time. We could parallelize the remaining  phases of the copy. The performance improvement will vary based on the scenario(indexes/triggers), it will be proportionate to the number of indexes and triggers. Read line can also be parallelized in txt format(non csv). We feel parallelize copy could give significant improvement in many scenarios.

The patch which was used to capture the execution time breakup is attached for reference.

Thoughts?

Regards,
Vignesh

Attachment

Re: Parallel copy

From
vignesh C
Date:
On Thu, Mar 12, 2020 at 6:39 PM vignesh C <vignesh21@gmail.com> wrote:
>

Existing copy code flow:  Copy supports the copy operation from csv,
text & binary format files. For the csv & text formats, the server
reads a 64KB chunk (or less, if the remaining contents of the input
file are smaller). It then forms one tuple of data from that chunk and
processes it. If the chunk still has unprocessed data after that
tuple, the server tries to generate the next tuple from the remaining
unprocessed data. If it is not able to generate a complete tuple from
the unprocessed data, it reads a further 64KB (or whatever remains in
the file) and continues. This process is repeated till the complete
file is processed. For binary format files the flow is slightly
different: the server reads the number of columns that are present,
then for each column reads the column size followed by the actual
column contents, and then processes the tuple so formed. This is
repeated for all the remaining tuples in the binary file. The tuple
processing flow is the same in all the formats. Currently all the
operations happen sequentially. This project will parallelize the copy
operation.
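
To make that flow concrete, here is a rough standalone sketch of the
text/csv loop, using plain stdio and a dummy process_line() in place
of the real tuple-forming and insert path (none of these names are the
actual copy.c functions):

#include <stdio.h>
#include <string.h>

#define RAW_BUF_SIZE (64 * 1024)

/* stand-in for "form the tuple, run checks, insert" */
static void
process_line(const char *line, size_t len)
{
    (void) line;
    (void) len;
}

/*
 * Toy model of the serial text/csv flow: read 64KB (or whatever is
 * left), peel off complete lines, keep the trailing partial line and
 * refill.  Lines longer than the buffer are not handled here; the real
 * code grows a separate line buffer for that case.
 */
static void
copy_text_serial(FILE *fp)
{
    char    buf[RAW_BUF_SIZE];
    size_t  filled = 0;
    size_t  got;

    do
    {
        got = fread(buf + filled, 1, sizeof(buf) - filled, fp);
        filled += got;

        char   *start = buf;
        char   *nl;

        while ((nl = memchr(start, '\n', filled - (start - buf))) != NULL)
        {
            process_line(start, (size_t) (nl - start));
            start = nl + 1;
        }

        filled -= (size_t) (start - buf);
        memmove(buf, start, filled);
    } while (got > 0);

    if (filled > 0)
        process_line(buf, filled);      /* last line had no newline */
}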

I'm planning to do the POC of parallel copy with the below design:
Proposed Syntax:
COPY table_name FROM 'copy_file' WITH (FORMAT 'format', PARALLEL 'workers');
Users can specify the number of workers to be used for copying the
data in parallel; here 'workers' is the number of worker processes
used for the parallel copy operation in addition to the leader. The
leader is responsible for reading the data from the input file and
generating work for the workers. The leader will start a transaction
and share it with the workers. All workers will be using the same
transaction to insert the records. Leader will create a circular queue
and share it across the workers. The circular queue will be present in
DSM. Leader will be using a fixed size queue to share the contents
between the leader and the workers. Currently we will have 100
elements present in the queue. This will be created before the workers
are started and shared with the workers. The data structures that are
required by the parallel workers will be initialized by the leader,
the size required in dsm will be calculated and the necessary keys
will be loaded in the DSM. The specified number of workers will then
be launched. Leader will read the table data from the file and copy
the contents to the queue element by element. Each element in the
queue will have 64K size DSA. This DSA will be used to store tuple
contents from the file. The leader will try to copy as much content as
possible within one 64K DSA queue element. We intend to store at least
one tuple in each queue element. There are some cases where the 64K
space may not be enough to store a single tuple. Mostly in cases where
the table has toast data present and the single tuple can be more than
64K size. In these scenarios we will extend the DSA space accordingly.
We cannot change the size of the dsm once the workers are launched.
Whereas in case of DSA we can free the dsa pointer and reallocate the
dsa pointer based on the memory size required. This is the very reason
for choosing DSA over DSM for storing the data that must be inserted
into the relation. The leader will keep loading data into the queue
till the queue becomes full. The leader switches its role to that of a
worker either when the queue is full or when the complete file has
been processed. Once the queue is full, the leader acts as a worker
until at least 25% of the elements in the queue have been consumed by
the workers; once there is at least 25% free space in the queue, it
switches back to being the leader again. This process of filling the
queue is continued by the leader until the whole file is processed,
after which the leader waits until the workers finish processing the
queue elements. The copy-from functionality is also used during initdb
operations, where the copy is performed in single (non-parallel) mode,
and the user can of course still choose to run in non-parallel mode.
In the non-parallel case, memory allocation will happen using palloc
instead of DSM/DSA, and most of the flow will be the same in both the
parallel and non-parallel cases.
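
Roughly, the shared state I have in mind could look like the sketch
below. This is only illustrative: the struct and field names are made
up and a plain uint64_t stands in for dsa_pointer.

#include <stdint.h>

#define PARALLEL_COPY_QUEUE_SIZE 100            /* queue elements */
#define DATA_BLOCK_SIZE          (64 * 1024)    /* default DSA chunk size */

/* One slot of the circular queue; lives in the DSM segment. */
typedef struct ParallelCopyQueueElem
{
    uint64_t    data;       /* dsa_pointer to a 64KB (or enlarged) chunk */
    uint32_t    data_len;   /* bytes of tuple data stored in that chunk */
    uint32_t    n_tuples;   /* number of complete tuples in the chunk */
    uint8_t     state;      /* EMPTY / FILLED / BEING_PROCESSED */
} ParallelCopyQueueElem;

/* Fixed-size header in DSM, set up by the leader before the workers start. */
typedef struct ParallelCopyShared
{
    uint32_t    head;       /* next element the leader will fill */
    uint32_t    tail;       /* next element a worker will pick up */
    uint8_t     leader_acting_as_worker;    /* set while the queue is full */
    /* shared transaction/snapshot state, error info etc. omitted */
    ParallelCopyQueueElem queue[PARALLEL_COPY_QUEUE_SIZE];
} ParallelCopyShared;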

We had a couple of options for the way in which queue elements can be stored.
Option 1:  Each element (DSA chunk) will contain tuples such that each
tuple will be preceded by the length of the tuple.  So the tuples will
be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2,
tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only
tuples (tuple-1), (tuple-2), .....  And we will have a second
ring-buffer which contains a start-offset or length of each tuple. The
old design used to generate one tuple of data and process tuple by
tuple. In the new design, the server will generate multiple tuples of
data per queue element. The worker will then process data tuple by
tuple. As we are processing the data tuple by tuple, I felt both of
the options are almost the same. However Design1 was chosen over
Design 2 as we can save up on some space that was required by another
variable in each element of the queue.
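
As a sketch, the two element layouts would be filled roughly as below
(illustrative helper names only, not patch code):

#include <stdint.h>
#include <string.h>

/*
 * Option 1: tuples packed into the element as (length, payload) pairs.
 * Returns the new write offset, or 0 if the tuple does not fit.
 */
static size_t
append_option1(char *elem, size_t elem_size, size_t used,
               const char *tuple, uint32_t len)
{
    if (used + sizeof(uint32_t) + len > elem_size)
        return 0;
    memcpy(elem + used, &len, sizeof(uint32_t));
    memcpy(elem + used + sizeof(uint32_t), tuple, len);
    return used + sizeof(uint32_t) + len;
}

/*
 * Option 2: the element holds only raw tuple bytes; tuple boundaries
 * go into a separate offsets ring shared with the workers (the caller
 * makes sure that ring has room).
 */
static size_t
append_option2(char *elem, size_t elem_size, size_t used,
               const char *tuple, uint32_t len,
               uint32_t *offsets, uint32_t *n_offsets)
{
    if (used + len > elem_size)
        return 0;
    offsets[(*n_offsets)++] = (uint32_t) used;  /* start of this tuple */
    memcpy(elem + used, tuple, len);
    return used + len;
}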

The parallel workers will read the tuples from the queue and perform
all of the following operations: a) where clause handling, b) convert
the tuple to columns, c) add default/null values for the missing
columns that are not present in that record, d) find the partition if
it is a partitioned table, e) before row insert triggers and
constraints, f) insertion of the data. The rest of the flow is the
same as the existing code.

Enhancements after POC is done:
Initially we plan to use the number of workers the user has specified;
later we will do some experiments and think of an approach to choose
the worker count automatically after processing sample contents from
the file.
Initially we plan to use 100 elements in the queue; later we will
experiment to find the right queue size once the basic patch is ready.
Initially we plan to start the transaction in the leader and share it
with the workers; later we will change this such that the first
process that performs an insert operation starts the transaction and
shares it with the rest.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Ants Aasma
Date:
On Tue, 7 Apr 2020 at 08:24, vignesh C <vignesh21@gmail.com> wrote:
> Leader will create a circular queue
> and share it across the workers. The circular queue will be present in
> DSM. Leader will be using a fixed size queue to share the contents
> between the leader and the workers. Currently we will have 100
> elements present in the queue. This will be created before the workers
> are started and shared with the workers. The data structures that are
> required by the parallel workers will be initialized by the leader,
> the size required in dsm will be calculated and the necessary keys
> will be loaded in the DSM. The specified number of workers will then
> be launched. Leader will read the table data from the file and copy
> the contents to the queue element by element. Each element in the
> queue will have 64K size DSA. This DSA will be used to store tuple
> contents from the file. The leader will try to copy as much content as
> possible within one 64K DSA queue element. We intend to store at least
> one tuple in each queue element. There are some cases where the 64K
> space may not be enough to store a single tuple. Mostly in cases where
> the table has toast data present and the single tuple can be more than
> 64K size. In these scenarios we will extend the DSA space accordingly.
> We cannot change the size of the dsm once the workers are launched.
> Whereas in case of DSA we can free the dsa pointer and reallocate the
> dsa pointer based on the memory size required. This is the very reason
> for choosing DSA over DSM for storing the data that must be inserted
> into the relation.

I think the element based approach and requirement that all tuples fit
into the queue makes things unnecessarily complex. The approach I
detailed earlier allows for tuples to be bigger than the buffer. In
that case a worker will claim the long tuple from the ring queue of
tuple start positions, and starts copying it into its local line_buf.
This can wrap around the buffer multiple times until the next start
position shows up. At that point this worker can proceed with
inserting the tuple and the next worker will claim the next tuple.

This way nothing needs to be resized, there is no risk of a file with
huge tuples running the system out of memory because each element will
be reallocated to be huge and the number of elements is not something
that has to be tuned.
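
For illustration, copying one claimed tuple out of a fixed-size ring
into the worker's local buffer could look like the sketch below
(hypothetical names; it covers the single-wrap case, while a tuple
larger than the whole ring would have to be copied incrementally in
step with the producer):

#include <stddef.h>
#include <string.h>

/*
 * Copy "len" bytes starting at ring offset "start" (which may wrap past
 * the physical end of the ring) into the worker's local line buffer.
 * The ring's read position is only advanced once this copy is done.
 */
static void
copy_tuple_from_ring(const char *ring, size_t ring_size,
                     size_t start, size_t len, char *line_buf)
{
    size_t  pos = start % ring_size;
    size_t  first = ring_size - pos;    /* bytes until the physical end */

    if (len <= first)
        memcpy(line_buf, ring + pos, len);
    else
    {
        /* tuple wraps: copy the tail of the ring, then the head */
        memcpy(line_buf, ring + pos, first);
        memcpy(line_buf + first, ring, len - first);
    }
}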

> We had a couple of options for the way in which queue elements can be stored.
> Option 1:  Each element (DSA chunk) will contain tuples such that each
> tuple will be preceded by the length of the tuple.  So the tuples will
> be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2,
> tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only
> tuples (tuple-1), (tuple-2), .....  And we will have a second
> ring-buffer which contains a start-offset or length of each tuple. The
> old design used to generate one tuple of data and process tuple by
> tuple. In the new design, the server will generate multiple tuples of
> data per queue element. The worker will then process data tuple by
> tuple. As we are processing the data tuple by tuple, I felt both of
> the options are almost the same. However Design1 was chosen over
> Design 2 as we can save up on some space that was required by another
> variable in each element of the queue.

With option 1 it's not possible to read input data into shared memory
and there needs to be an extra memcpy in the time critical sequential
flow of the leader. With option 2 data could be read directly into the
shared memory buffer. With future async io support, reading and
looking for tuple boundaries could be performed concurrently.


Regards,
Ants Aasma
Cybertec



Re: Parallel copy

From
Amit Kapila
Date:
On Tue, Apr 7, 2020 at 7:08 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 7 Apr 2020 at 08:24, vignesh C <vignesh21@gmail.com> wrote:
> > Leader will create a circular queue
> > and share it across the workers. The circular queue will be present in
> > DSM. Leader will be using a fixed size queue to share the contents
> > between the leader and the workers. Currently we will have 100
> > elements present in the queue. This will be created before the workers
> > are started and shared with the workers. The data structures that are
> > required by the parallel workers will be initialized by the leader,
> > the size required in dsm will be calculated and the necessary keys
> > will be loaded in the DSM. The specified number of workers will then
> > be launched. Leader will read the table data from the file and copy
> > the contents to the queue element by element. Each element in the
> > queue will have 64K size DSA. This DSA will be used to store tuple
> > contents from the file. The leader will try to copy as much content as
> > possible within one 64K DSA queue element. We intend to store at least
> > one tuple in each queue element. There are some cases where the 64K
> > space may not be enough to store a single tuple. Mostly in cases where
> > the table has toast data present and the single tuple can be more than
> > 64K size. In these scenarios we will extend the DSA space accordingly.
> > We cannot change the size of the dsm once the workers are launched.
> > Whereas in case of DSA we can free the dsa pointer and reallocate the
> > dsa pointer based on the memory size required. This is the very reason
> > for choosing DSA over DSM for storing the data that must be inserted
> > into the relation.
>
> I think the element based approach and requirement that all tuples fit
> into the queue makes things unnecessarily complex. The approach I
> detailed earlier allows for tuples to be bigger than the buffer. In
> that case a worker will claim the long tuple from the ring queue of
> tuple start positions, and starts copying it into its local line_buf.
> This can wrap around the buffer multiple times until the next start
> position shows up. At that point this worker can proceed with
> inserting the tuple and the next worker will claim the next tuple.
>

IIUC, with a fixed size buffer, parallelism might take a bit of a hit
because the reader process won't be able to continue until the worker
has copied the data from the shared buffer to its local buffer.  I
think somewhat more leader-worker coordination will be required with a
fixed buffer size. However, as you pointed out, we can't allow it to
grow to the maximum size possible for all tuples as that might require
a lot of memory.  One idea could be that we allow an enlarged
allocation for the first such tuple, and then, if any other
element/chunk in the queue requires more memory than the default 64KB,
we always fall back to reusing the memory we allocated for that first
chunk.  This will allow us to avoid using extra memory for more than
one tuple and won't hurt parallelism much, as in many cases not all
tuples will be so large.
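
As a rough sketch of that fallback idea (names invented here, and a
real version would of course need to serialize access to the single
enlarged buffer):

#include <stdlib.h>

#define DEFAULT_CHUNK_SIZE (64 * 1024)

static char   *oversize_buf = NULL;     /* the one enlarged allocation */
static size_t  oversize_len = 0;

/*
 * Return a buffer big enough for "needed" bytes.  Normal tuples use the
 * element's own fixed 64KB chunk; anything larger falls back to a single
 * shared enlarged buffer, grown as required and reused for every
 * oversized tuple.
 */
static char *
get_chunk_space(char *elem_chunk, size_t needed)
{
    if (needed <= DEFAULT_CHUNK_SIZE)
        return elem_chunk;

    if (needed > oversize_len)
    {
        char   *newbuf = realloc(oversize_buf, needed);

        if (newbuf == NULL)
            return NULL;        /* caller reports out of memory */
        oversize_buf = newbuf;
        oversize_len = needed;
    }
    return oversize_buf;
}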

I think in the proposed approach a queue element is nothing but a way
to divide the work among workers based on size rather than on the
number of tuples.  Say we try to divide the work among workers based
on start offsets; that can be more tricky, because it could lead
either to a lot of contention if we choose, say, one offset per worker
(basically copy the data for one tuple, process it and then pick the
next tuple), or to an unequal division of work because some tuples can
be smaller and others can be bigger.  I guess division based on size
would be a better idea. OTOH, I see the advantage of your approach as
well and I will think more on it.

>
> > We had a couple of options for the way in which queue elements can be stored.
> > Option 1:  Each element (DSA chunk) will contain tuples such that each
> > tuple will be preceded by the length of the tuple.  So the tuples will
> > be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2,
> > tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only
> > tuples (tuple-1), (tuple-2), .....  And we will have a second
> > ring-buffer which contains a start-offset or length of each tuple. The
> > old design used to generate one tuple of data and process tuple by
> > tuple. In the new design, the server will generate multiple tuples of
> > data per queue element. The worker will then process data tuple by
> > tuple. As we are processing the data tuple by tuple, I felt both of
> > the options are almost the same. However Design1 was chosen over
> > Design 2 as we can save up on some space that was required by another
> > variable in each element of the queue.
>
> With option 1 it's not possible to read input data into shared memory
> and there needs to be an extra memcpy in the time critical sequential
> flow of the leader. With option 2 data could be read directly into the
> shared memory buffer. With future async io support, reading and
> looking for tuple boundaries could be performed concurrently.
>

Yeah, option-2 sounds better.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Robert Haas
Date:
On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote:
> I think the element based approach and requirement that all tuples fit
> into the queue makes things unnecessarily complex. The approach I
> detailed earlier allows for tuples to be bigger than the buffer. In
> that case a worker will claim the long tuple from the ring queue of
> tuple start positions, and starts copying it into its local line_buf.
> This can wrap around the buffer multiple times until the next start
> position shows up. At that point this worker can proceed with
> inserting the tuple and the next worker will claim the next tuple.
>
> This way nothing needs to be resized, there is no risk of a file with
> huge tuples running the system out of memory because each element will
> be reallocated to be huge and the number of elements is not something
> that has to be tuned.

+1. This seems like the right way to do it.

> > We had a couple of options for the way in which queue elements can be stored.
> > Option 1:  Each element (DSA chunk) will contain tuples such that each
> > tuple will be preceded by the length of the tuple.  So the tuples will
> > be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2,
> > tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only
> > tuples (tuple-1), (tuple-2), .....  And we will have a second
> > ring-buffer which contains a start-offset or length of each tuple. The
> > old design used to generate one tuple of data and process tuple by
> > tuple. In the new design, the server will generate multiple tuples of
> > data per queue element. The worker will then process data tuple by
> > tuple. As we are processing the data tuple by tuple, I felt both of
> > the options are almost the same. However Design1 was chosen over
> > Design 2 as we can save up on some space that was required by another
> > variable in each element of the queue.
>
> With option 1 it's not possible to read input data into shared memory
> and there needs to be an extra memcpy in the time critical sequential
> flow of the leader. With option 2 data could be read directly into the
> shared memory buffer. With future async io support, reading and
> looking for tuple boundaries could be performed concurrently.

But option 2 still seems significantly worse than your proposal above, right?

I really think we don't want a single worker in charge of finding
tuple boundaries for everybody. That adds a lot of unnecessary
inter-process communication and synchronization. Each process should
just get the next tuple starting after where the last one ended, and
then advance the end pointer so that the next process can do the same
thing. Vignesh's proposal involves having a leader process that has to
switch roles - he picks an arbitrary 25% threshold - and if it doesn't
switch roles at the right time, performance will be impacted. If the
leader doesn't get scheduled in time to refill the queue before it
runs completely empty, workers will have to wait. Ants's scheme avoids
that risk: whoever needs the next tuple reads the next line. There's
no need to ever wait for the leader because there is no leader.

I think it's worth enumerating some of the other ways that a project
in this area can fail to achieve good speedups, so that we can try to
avoid those that are avoidable and be aware of the others:

- If we're unable to supply data to the COPY process as fast as the
workers could load it, then speed will be limited at that point. We
know reading the file from disk is pretty fast compared to what a
single process can do. I'm not sure we've tested what happens with a
network socket. It will depend on the network speed some, but it might
be useful to know how many MB/s we can pump through over a UNIX
socket.

- The portion of the time that is used to split the lines is not
easily parallelizable. That seems to be a fairly small percentage for
a reasonably wide table, but it looks significant (13-18%) for a
narrow table. Such cases will gain less performance and be limited to
a smaller number of workers. I think we also need to be careful about
files whose lines are longer than the size of the buffer. If we're not
careful, we could get a significant performance drop-off in such
cases. We should make sure to pick an algorithm that seems like it
will handle such cases without serious regressions and check that a
file composed entirely of such long lines is handled reasonably
efficiently.

- There could be index contention. Let's suppose that we can read data
super fast and break it up into lines super fast. Maybe the file we're
reading is fully RAM-cached and the lines are long. Now all of the
backends are inserting into the indexes at the same time, and they
might be trying to insert into the same pages. If so, lock contention
could become a factor that hinders performance.

- There could also be similar contention on the heap. Say the tuples
are narrow, and many backends are trying to insert tuples into the
same heap page at the same time. This would lead to many lock/unlock
cycles. This could be avoided if the backends avoid targeting the same
heap pages, but I'm not sure there's any reason to expect that they
would do so unless we make some special provision for it.

- These problems could also arise with respect to TOAST table
insertions, either on the TOAST table itself or on its index. This
would only happen if the table contains a lot of toastable values, but
that could be the case: imagine a table with a bunch of columns each
of which contains a long string that isn't very compressible.

- What else? I bet the above list is not comprehensive.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Ants Aasma
Date:
On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
> - If we're unable to supply data to the COPY process as fast as the
> workers could load it, then speed will be limited at that point. We
> know reading the file from disk is pretty fast compared to what a
> single process can do. I'm not sure we've tested what happens with a
> network socket. It will depend on the network speed some, but it might
> be useful to know how many MB/s we can pump through over a UNIX
> socket.

This raises a good point. If at some point we want to minimize the
amount of memory copies then we might want to allow for RDMA to
directly write incoming network traffic into a distributing ring
buffer, which would include the protocol level headers. But at this
point we are so far off from network reception becoming a bottleneck
that I don't think it's worth holding anything up just because a
design doesn't allow for zero-copy transfers.

> - The portion of the time that is used to split the lines is not
> easily parallelizable. That seems to be a fairly small percentage for
> a reasonably wide table, but it looks significant (13-18%) for a
> narrow table. Such cases will gain less performance and be limited to
> a smaller number of workers. I think we also need to be careful about
> files whose lines are longer than the size of the buffer. If we're not
> careful, we could get a significant performance drop-off in such
> cases. We should make sure to pick an algorithm that seems like it
> will handle such cases without serious regressions and check that a
> file composed entirely of such long lines is handled reasonably
> efficiently.

I don't have a proof, but my gut feel tells me that it's fundamentally
impossible to ingest csv without a serial line-ending/comment
tokenization pass. The current line splitting algorithm is terrible.
I'm currently working with some scientific data where on ingestion
CopyReadLineText() is about 25% on profiles. I prototyped a
replacement that can do ~8GB/s on narrow rows, more on wider ones.

For rows that are consistently wider than the input buffer I think
parallelism will still give a win - the serial phase is just memcpy
through a ringbuffer, after which a worker goes away to perform the
actual insert, letting the next worker read the data. The memcpy is
already happening today, CopyReadLineText() copies the input buffer
into a StringInfo, so the only extra work is synchronization between
leader and worker.

> - There could be index contention. Let's suppose that we can read data
> super fast and break it up into lines super fast. Maybe the file we're
> reading is fully RAM-cached and the lines are long. Now all of the
> backends are inserting into the indexes at the same time, and they
> might be trying to insert into the same pages. If so, lock contention
> could become a factor that hinders performance.

Different data distribution strategies can have an effect on that.
Dealing out input data in larger or smaller chunks will have a
considerable effect on contention, btree page splits and all kinds of
things. I think the common theme would be a push to increase chunk
size to reduce contention.

> - There could also be similar contention on the heap. Say the tuples
> are narrow, and many backends are trying to insert tuples into the
> same heap page at the same time. This would lead to many lock/unlock
> cycles. This could be avoided if the backends avoid targeting the same
> heap pages, but I'm not sure there's any reason to expect that they
> would do so unless we make some special provision for it.

I thought there already was a provision for that. Am I mis-remembering?

> - What else? I bet the above list is not comprehensive.

I think parallel copy patch needs to concentrate on splitting input
data to workers. After that any performance issues would be basically
the same as a normal parallel insert workload. There may well be
bottlenecks there, but those could be tackled independently.

Regards,
Ants Aasma
Cybertec



Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Apr 9, 2020 at 1:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote:
> >
> > With option 1 it's not possible to read input data into shared memory
> > and there needs to be an extra memcpy in the time critical sequential
> > flow of the leader. With option 2 data could be read directly into the
> > shared memory buffer. With future async io support, reading and
> > looking for tuple boundaries could be performed concurrently.
>
> But option 2 still seems significantly worse than your proposal above, right?
>
> I really think we don't want a single worker in charge of finding
> tuple boundaries for everybody. That adds a lot of unnecessary
> inter-process communication and synchronization. Each process should
> just get the next tuple starting after where the last one ended, and
> then advance the end pointer so that the next process can do the same
> thing. Vignesh's proposal involves having a leader process that has to
> switch roles - he picks an arbitrary 25% threshold - and if it doesn't
> switch roles at the right time, performance will be impacted. If the
> leader doesn't get scheduled in time to refill the queue before it
> runs completely empty, workers will have to wait. Ants's scheme avoids
> that risk: whoever needs the next tuple reads the next line. There's
> no need to ever wait for the leader because there is no leader.
>

Hmm, I think in his scheme also there is a single reader process.  See
the email above [1] where he described how it should work.  I think
the difference is in the division of work.  AFAIU, in Ants's scheme,
the worker needs to pick the work from a tuple_offset queue, whereas
in Vignesh's scheme it will be based on size (each worker will get
probably 64KB of work).  I think in his scheme the main thing to work
out is how many tuple offsets should be assigned to each worker in one
go, so that we don't unnecessarily add contention for finding the work
unit.  I think we need to find the right balance between size and
number of tuples.  I am trying to consider size here because larger
tuples will probably require more time, as we need to allocate more
space for them and they probably also require more processing time.
One way to achieve that could be that each worker tries to claim 500
tuples (or some other threshold number), but if their size is greater
than 64K (or some other threshold size) then the worker tries with a
smaller number of tuples (such that the size of the chunk of tuples is
less than the threshold size).
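
Something like the sketch below is what I have in mind; the 500-tuple
and 64K thresholds are just the example numbers above, and the rest of
the names are made up:

#include <stdint.h>

#define MAX_TUPLES_PER_CLAIM    500
#define MAX_BYTES_PER_CLAIM     (64 * 1024)

/*
 * Given the sizes of the tuples currently queued, decide how many this
 * worker should claim in one go: stop at 500 tuples or once the claimed
 * chunk would exceed 64KB, whichever comes first (but always take at
 * least one tuple so that a single huge tuple still makes progress).
 */
static uint32_t
tuples_to_claim(const uint32_t *tuple_sizes, uint32_t navail)
{
    uint64_t    bytes = 0;
    uint32_t    n = 0;

    while (n < navail && n < MAX_TUPLES_PER_CLAIM)
    {
        if (n > 0 && bytes + tuple_sizes[n] > MAX_BYTES_PER_CLAIM)
            break;
        bytes += tuple_sizes[n];
        n++;
    }
    return n;
}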

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Apr 9, 2020 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Apr 9, 2020 at 1:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote:
> > >
> > > With option 1 it's not possible to read input data into shared memory
> > > and there needs to be an extra memcpy in the time critical sequential
> > > flow of the leader. With option 2 data could be read directly into the
> > > shared memory buffer. With future async io support, reading and
> > > looking for tuple boundaries could be performed concurrently.
> >
> > But option 2 still seems significantly worse than your proposal above, right?
> >
> > I really think we don't want a single worker in charge of finding
> > tuple boundaries for everybody. That adds a lot of unnecessary
> > inter-process communication and synchronization. Each process should
> > just get the next tuple starting after where the last one ended, and
> > then advance the end pointer so that the next process can do the same
> > thing. Vignesh's proposal involves having a leader process that has to
> > switch roles - he picks an arbitrary 25% threshold - and if it doesn't
> > switch roles at the right time, performance will be impacted. If the
> > leader doesn't get scheduled in time to refill the queue before it
> > runs completely empty, workers will have to wait. Ants's scheme avoids
> > that risk: whoever needs the next tuple reads the next line. There's
> > no need to ever wait for the leader because there is no leader.
> >
>
> Hmm, I think in his scheme also there is a single reader process.  See
> the email above [1] where he described how it should work.
>

oops, I forgot to specify the link to the email.  See
https://www.postgresql.org/message-id/CANwKhkO87A8gApobOz_o6c9P5auuEG1W2iCz0D5CfOeGgAnk3g%40mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Apr 9, 2020 at 3:55 AM Ants Aasma <ants@cybertec.at> wrote:
>
> On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
>
> > - The portion of the time that is used to split the lines is not
> > easily parallelizable. That seems to be a fairly small percentage for
> > a reasonably wide table, but it looks significant (13-18%) for a
> > narrow table. Such cases will gain less performance and be limited to
> > a smaller number of workers. I think we also need to be careful about
> > files whose lines are longer than the size of the buffer. If we're not
> > careful, we could get a significant performance drop-off in such
> > cases. We should make sure to pick an algorithm that seems like it
> > will handle such cases without serious regressions and check that a
> > file composed entirely of such long lines is handled reasonably
> > efficiently.
>
> I don't have a proof, but my gut feel tells me that it's fundamentally
> impossible to ingest csv without a serial line-ending/comment
> tokenization pass.
>

I think even if we try to do it via multiple workers it might not be
better.  In such a scheme, every worker needs to update the end
boundaries and the next worker has to keep checking whether the
previous one has updated the end pointer.  I think this can add a
significant synchronization effort for cases where tuples are of 100
or so bytes, which will be a common case.

> The current line splitting algorithm is terrible.
> I'm currently working with some scientific data where on ingestion
> CopyReadLineText() is about 25% on profiles. I prototyped a
> replacement that can do ~8GB/s on narrow rows, more on wider ones.
>

Good to hear.  I think that will be a good project on its own and that
might give a boost to parallel copy as with that we can further reduce
the non-parallelizable work unit.

> For rows that are consistently wider than the input buffer I think
> parallelism will still give a win - the serial phase is just memcpy
> through a ringbuffer, after which a worker goes away to perform the
> actual insert, letting the next worker read the data. The memcpy is
> already happening today, CopyReadLineText() copies the input buffer
> into a StringInfo, so the only extra work is synchronization between
> leader and worker.
>
>
> > - There could also be similar contention on the heap. Say the tuples
> > are narrow, and many backends are trying to insert tuples into the
> > same heap page at the same time. This would lead to many lock/unlock
> > cycles. This could be avoided if the backends avoid targeting the same
> > heap pages, but I'm not sure there's any reason to expect that they
> > would do so unless we make some special provision for it.
>
> I thought there already was a provision for that. Am I mis-remembering?
>

Copy uses heap_multi_insert to insert a batch of tuples, and I think
each batch should ideally use a different page; mostly it will be a
new page. So I am not sure whether this will be a problem at all, or a
problem severe enough to need special handling.  But if it turns out
to be a problem, we definitely need some better way to deal with it.

> > - What else? I bet the above list is not comprehensive.
>
> I think parallel copy patch needs to concentrate on splitting input
> data to workers. After that any performance issues would be basically
> the same as a normal parallel insert workload. There may well be
> bottlenecks there, but those could be tackled independently.
>

I agree.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Dilip Kumar
Date:
On Thu, Apr 9, 2020 at 1:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote:
> > I think the element based approach and requirement that all tuples fit
> > into the queue makes things unnecessarily complex. The approach I
> > detailed earlier allows for tuples to be bigger than the buffer. In
> > that case a worker will claim the long tuple from the ring queue of
> > tuple start positions, and starts copying it into its local line_buf.
> > This can wrap around the buffer multiple times until the next start
> > position shows up. At that point this worker can proceed with
> > inserting the tuple and the next worker will claim the next tuple.
> >
> > This way nothing needs to be resized, there is no risk of a file with
> > huge tuples running the system out of memory because each element will
> > be reallocated to be huge and the number of elements is not something
> > that has to be tuned.
>
> +1. This seems like the right way to do it.
>
> > > We had a couple of options for the way in which queue elements can be stored.
> > > Option 1:  Each element (DSA chunk) will contain tuples such that each
> > > tuple will be preceded by the length of the tuple.  So the tuples will
> > > be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2,
> > > tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only
> > > tuples (tuple-1), (tuple-2), .....  And we will have a second
> > > ring-buffer which contains a start-offset or length of each tuple. The
> > > old design used to generate one tuple of data and process tuple by
> > > tuple. In the new design, the server will generate multiple tuples of
> > > data per queue element. The worker will then process data tuple by
> > > tuple. As we are processing the data tuple by tuple, I felt both of
> > > the options are almost the same. However Design1 was chosen over
> > > Design 2 as we can save up on some space that was required by another
> > > variable in each element of the queue.
> >
> > With option 1 it's not possible to read input data into shared memory
> > and there needs to be an extra memcpy in the time critical sequential
> > flow of the leader. With option 2 data could be read directly into the
> > shared memory buffer. With future async io support, reading and
> > looking for tuple boundaries could be performed concurrently.
>
> But option 2 still seems significantly worse than your proposal above, right?
>
> I really think we don't want a single worker in charge of finding
> tuple boundaries for everybody. That adds a lot of unnecessary
> inter-process communication and synchronization. Each process should
> just get the next tuple starting after where the last one ended, and
> then advance the end pointer so that the next process can do the same
> thing. Vignesh's proposal involves having a leader process that has to
> switch roles - he picks an arbitrary 25% threshold - and if it doesn't
> switch roles at the right time, performance will be impacted. If the
> leader doesn't get scheduled in time to refill the queue before it
> runs completely empty, workers will have to wait. Ants's scheme avoids
> that risk: whoever needs the next tuple reads the next line. There's
> no need to ever wait for the leader because there is no leader.

I agree that if the leader switches the role, then it is possible that
sometimes the leader might not produce the work before the queue is
empty.  OTOH, the problem with the approach you are suggesting is that
the work will be generated on-demand, i.e. there is no specific
process who is generating the data while workers are busy inserting
the data.  So IMHO, if we have a specific leader process, there will
always be work available for all the workers.  I agree that we need to
find the correct point at which the leader should work as a worker.
One idea could be that when the queue is full and there is no space to
push more work to the queue, the leader himself processes that work.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Robert Haas
Date:
On Thu, Apr 9, 2020 at 7:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I agree that if the leader switches the role, then it is possible that
> sometimes the leader might not produce the work before the queue is
> empty.  OTOH, the problem with the approach you are suggesting is that
> the work will be generated on-demand, i.e. there is no specific
> process who is generating the data while workers are busy inserting
> the data.

I think you have a point. The way I think things could go wrong if we
don't have a leader is if it tends to happen that everyone wants new
work at the same time. In that case, everyone will wait at once,
whereas if there is a designated process that aggressively queues up
work, we could perhaps avoid that. Note that you really have to have
the case where everyone wants new work at the exact same moment,
because otherwise they just all take turns finding work for
themselves, and everything is fine, because nobody's waiting for
anybody else to do any work, so everyone is always making forward
progress.

Now on the other hand, if we do have a leader, and for some reason
it's slow in responding, everyone will have to wait. That could happen
either because the leader also has other responsibilities, like
reading data or helping with the main work when the queue is full, or
just because the system is really busy and the leader doesn't get
scheduled on-CPU for a while. I am inclined to think that's likely to
be a more serious problem.

The thing is, the problem of everyone needing new work at the same
time can't really keep on repeating. Say that everyone finishes
processing their first chunk at the same time. Now everyone needs a
second chunk, and in a leaderless system, they must take turns getting
it. So they will go in some order. The ones who go later will
presumably also finish later, so the end times for the second and
following chunks will be scattered. You shouldn't get repeated
pile-ups with everyone finishing at the same time, because each time
it happens, it will force a little bit of waiting that will spread
things out. If they clump up again, that will happen again, but it
shouldn't happen every time.

But in the case where there is a leader, I don't think there's any
similar protection. Suppose we go with the design Vignesh proposes
where the leader switches to processing chunks when the queue is more
than 75% full. If the leader has a "hiccup" where it gets swapped out
or is busy with processing a chunk for a longer-than-normal time, all
of the other processes have to wait for it. Now we can probably tune
this to some degree by adjusting the queue size and fullness
thresholds, but the optimal values for those parameters might be quite
different on different systems, depending on load, I/O performance,
CPU architecture, etc. If there's a system or configuration where the
leader tends not to respond fast enough, it will probably just keep
happening, because nothing in the algorithm will tend to shake it out
of that bad pattern.

I'm not 100% certain that my analysis here is right, so it will be
interesting to hear from other people. However, as a general rule, I
think we want to minimize the amount of work that can only be done by
one process (the leader) and maximize the amount that can be done by
any process with whichever one is available taking on the job. In the
case of COPY FROM STDIN, the reads from the network socket can only be
done by the one process connected to it. In the case of COPY from a
file, even that could be rotated around, if all processes open the
file individually and seek to the appropriate offset.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Andres Freund
Date:
Hi,

On April 9, 2020 4:01:43 AM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Thu, Apr 9, 2020 at 3:55 AM Ants Aasma <ants@cybertec.at> wrote:
>>
>> On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com>
>wrote:
>>
>> > - The portion of the time that is used to split the lines is not
>> > easily parallelizable. That seems to be a fairly small percentage
>for
>> > a reasonably wide table, but it looks significant (13-18%) for a
>> > narrow table. Such cases will gain less performance and be limited
>to
>> > a smaller number of workers. I think we also need to be careful
>about
>> > files whose lines are longer than the size of the buffer. If we're
>not
>> > careful, we could get a significant performance drop-off in such
>> > cases. We should make sure to pick an algorithm that seems like it
>> > will handle such cases without serious regressions and check that a
>> > file composed entirely of such long lines is handled reasonably
>> > efficiently.
>>
>> I don't have a proof, but my gut feel tells me that it's
>fundamentally
>> impossible to ingest csv without a serial line-ending/comment
>> tokenization pass.

I can't quite see a way either. But even if it were, I have a hard time seeing parallelizing that path as the right
thing.


>I think even if we try to do it via multiple workers it might not be
>better.  In such a scheme,  every worker needs to update the end
>boundaries and the next worker to keep a check if the previous has
>updated the end pointer.  I think this can add a significant
>synchronization effort for cases where tuples are of 100 or so bytes
>which will be a common case.

It seems like it'd also have terrible caching and instruction level
parallelism behavior. By constantly switching the process that
analyzes boundaries, the current data will have to be brought into
l1/register, rather than staying there.

I'm fairly certain that we do *not* want to distribute input data
between processes on a single tuple basis. Probably not even below a
few hundred kb. If there's any sort of natural clustering in the
loaded data - extremely common, think timestamps - splitting on a
granular basis will make indexing much more expensive. And have a lot
more contention.


>> The current line splitting algorithm is terrible.
>> I'm currently working with some scientific data where on ingestion
>> CopyReadLineText() is about 25% on profiles. I prototyped a
>> replacement that can do ~8GB/s on narrow rows, more on wider ones.

We should really replace the entire copy parsing code. It's terrible.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Parallel copy

From
Robert Haas
Date:
On Thu, Apr 9, 2020 at 2:55 PM Andres Freund <andres@anarazel.de> wrote:
> I'm fairly certain that we do *not* want to distribute input data
> between processes on a single tuple basis. Probably not even below a
> few hundred kb. If there's any sort of natural clustering in the
> loaded data - extremely common, think timestamps - splitting on a
> granular basis will make indexing much more expensive. And have a lot
> more contention.

That's a fair point. I think the solution ought to be that once any
process starts finding line endings, it continues until it's grabbed
at least a certain amount of data for itself. Then it stops and lets
some other process grab a chunk of data.
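
As a sketch of that rule (the function name and MIN_CLAIM_BYTES value
are made up; a few hundred KB is just an example threshold):

#include <stddef.h>
#include <string.h>

#define MIN_CLAIM_BYTES (256 * 1024)

/*
 * Starting at *split_pos, find line endings until at least
 * MIN_CLAIM_BYTES of data (or everything read so far) has been claimed
 * for this process.  Returns the number of bytes claimed and advances
 * *split_pos; the caller would then hand the "splitter" role to whoever
 * needs data next.
 */
static size_t
claim_lines(const char *buf, size_t buf_len, size_t *split_pos)
{
    size_t  start = *split_pos;
    size_t  pos = start;

    while (pos < buf_len && pos - start < MIN_CLAIM_BYTES)
    {
        const char *nl = memchr(buf + pos, '\n', buf_len - pos);

        if (nl == NULL)
            break;              /* incomplete line; stop at last boundary */
        pos = (size_t) (nl - buf) + 1;
    }
    *split_pos = pos;
    return pos - start;
}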

Or are you arguing that there should be only one process that's
allowed to find line endings for the entire duration of the load?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Andres Freund
Date:
Hi,

On April 9, 2020 12:29:09 PM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
>On Thu, Apr 9, 2020 at 2:55 PM Andres Freund <andres@anarazel.de>
>wrote:
>> I'm fairly certain that we do *not* want to distribute input data
>between processes on a single tuple basis. Probably not even below a
>few hundred kb. If there's any sort of natural clustering in the loaded
>data - extremely common, think timestamps - splitting on a granular
>basis will make indexing much more expensive. And have a lot more
>contention.
>
>That's a fair point. I think the solution ought to be that once any
>process starts finding line endings, it continues until it's grabbed
>at least a certain amount of data for itself. Then it stops and lets
>some other process grab a chunk of data.
>
>Or are you arguing that there should be only one process that's
>allowed to find line endings for the entire duration of the load?

I've not yet read the whole thread. So I'm probably restating ideas.

Imo, yes, there should be only one process doing the chunking. For
ilp, cache efficiency, but also because the leader is the only process
with access to the network socket. It should load input data into one
large buffer that's shared across processes. There should be a
separate ringbuffer with tuple/partial tuple (for huge tuples)
offsets. Worker processes should grab large chunks of offsets from the
offset ringbuffer. If the ringbuffer is not full, the worker chunks
should be reduced in size.

Given that everything stalls if the leader doesn't accept further
input data, as well as when there are no available split chunks, it
doesn't seem like a good idea to have the leader do other work.


I don't think optimizing/targeting copy from local files, where
multiple processes could read, is useful. COPY STDIN is the only thing
that practically matters.

Andres


--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Parallel copy

From
Robert Haas
Date:
On Thu, Apr 9, 2020 at 4:00 PM Andres Freund <andres@anarazel.de> wrote:
> I've not yet read the whole thread. So I'm probably restating ideas.

Yeah, but that's OK.

> Imo, yes, there should be only one process doing the chunking. For
> ilp, cache efficiency, but also because the leader is the only
> process with access to the network socket. It should load input data
> into one large buffer that's shared across processes. There should
> be a separate ringbuffer with tuple/partial tuple (for huge tuples)
> offsets. Worker processes should grab large chunks of offsets from
> the offset ringbuffer. If the ringbuffer is not full, the worker
> chunks should be reduced in size.

My concern here is that it's going to be hard to avoid processes going
idle. If the leader does nothing at all once the ring buffer is full,
it's wasting time that it could spend processing a chunk. But if it
picks up a chunk, then it might not get around to refilling the buffer
before other processes are idle with no work to do.

Still, it might be the case that having the process that is reading
the data also find the line endings is so fast that it makes no sense
to split those two tasks. After all, whoever just read the data must
have it in cache, and that helps a lot.

> Given that everything stalls if the leader doesn't accept further
> input data, as well as when there are no available split chunks, it
> doesn't seem like a good idea to have the leader do other work.
>
> I don't think optimizing/targeting copy from local files, where
> multiple processes could read, is useful. COPY STDIN is the only
> thing that practically matters.

Yeah, I think Amit has been thinking primarily in terms of COPY from
files, and I've been encouraging him to at least consider the STDIN
case. But I think you're right, and COPY FROM STDIN should be the
design center for this feature.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Andres Freund
Date:
Hi,

On 2020-04-10 07:40:06 -0400, Robert Haas wrote:
> On Thu, Apr 9, 2020 at 4:00 PM Andres Freund <andres@anarazel.de> wrote:
> > Imo, yes, there should be only one process doing the chunking.
> > For ilp, cache efficiency, but also because the leader is the only
> > process with access to the network socket. It should load input
> > data into one large buffer that's shared across processes. There
> > should be a separate ringbuffer with tuple/partial tuple (for huge
> > tuples) offsets. Worker processes should grab large chunks of
> > offsets from the offset ringbuffer. If the ringbuffer is not full,
> > the worker chunks should be reduced in size.
> 
> My concern here is that it's going to be hard to avoid processes going
> idle. If the leader does nothing at all once the ring buffer is full,
> it's wasting time that it could spend processing a chunk. But if it
> picks up a chunk, then it might not get around to refilling the buffer
> before other processes are idle with no work to do.

An idle process doesn't cost much. Processes that use CPU inefficiently
however...


> Still, it might be the case that having the process that is reading
> the data also find the line endings is so fast that it makes no sense
> to split those two tasks. After all, whoever just read the data must
> have it in cache, and that helps a lot.

Yea. And if it's not fast enough to split lines, then we have a problem
regardless of which process does the splitting.

Greetings,

Andres Freund



Re: Parallel copy

From
Robert Haas
Date:
On Fri, Apr 10, 2020 at 2:26 PM Andres Freund <andres@anarazel.de> wrote:
> > Still, it might be the case that having the process that is reading
> > the data also find the line endings is so fast that it makes no sense
> > to split those two tasks. After all, whoever just read the data must
> > have it in cache, and that helps a lot.
>
> Yea. And if it's not fast enough to split lines, then we have a problem
> regardless of which process does the splitting.

Still, if the reader does the splitting, then you don't need as much
IPC, right? The shared memory data structure is just a ring of bytes,
and whoever reads from it is responsible for the rest.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Andres Freund
Date:
Hi,

On 2020-04-13 14:13:46 -0400, Robert Haas wrote:
> On Fri, Apr 10, 2020 at 2:26 PM Andres Freund <andres@anarazel.de> wrote:
> > > Still, it might be the case that having the process that is reading
> > > the data also find the line endings is so fast that it makes no sense
> > > to split those two tasks. After all, whoever just read the data must
> > > have it in cache, and that helps a lot.
> >
> > Yea. And if it's not fast enough to split lines, then we have a problem
> > regardless of which process does the splitting.
> 
> Still, if the reader does the splitting, then you don't need as much
> IPC, right? The shared memory data structure is just a ring of bytes,
> and whoever reads from it is responsible for the rest.

I don't think so. If only one process does the splitting, the
exclusively locked section is just popping off a bunch of offsets of the
ring. And that could fairly easily be done with atomic ops (since what
we need is basically a single producer multiple consumer queue, which
can be done lock free fairly easily ). Whereas in the case of each
process doing the splitting, the exclusively locked part is splitting
along lines - which takes considerably longer than just popping off a
few offsets.
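
For illustration, a minimal single-producer/multiple-consumer offset
ring along those lines could look like this (C11 atomics, made-up
names; it assumes the producer separately checks that the ring is not
full before pushing):

#include <stdatomic.h>
#include <stdint.h>

#define OFFSET_RING_SIZE 8192           /* power of two */

typedef struct OffsetRing
{
    _Atomic uint64_t write_pos;         /* advanced only by the splitter */
    _Atomic uint64_t read_pos;          /* claimed from by the workers */
    uint64_t         offsets[OFFSET_RING_SIZE];
} OffsetRing;

/* Splitter: publish one tuple start offset. */
static void
ring_push(OffsetRing *r, uint64_t tuple_offset)
{
    uint64_t w = atomic_load_explicit(&r->write_pos, memory_order_relaxed);

    r->offsets[w % OFFSET_RING_SIZE] = tuple_offset;
    /* release: the offset must be visible before the new write_pos */
    atomic_store_explicit(&r->write_pos, w + 1, memory_order_release);
}

/*
 * Worker: claim up to "want" consecutive offsets in one atomic step.
 * Returns the number claimed; *first is the ring index of the first one.
 */
static uint32_t
ring_claim(OffsetRing *r, uint32_t want, uint64_t *first)
{
    for (;;)
    {
        uint64_t rd = atomic_load_explicit(&r->read_pos, memory_order_relaxed);
        uint64_t wr = atomic_load_explicit(&r->write_pos, memory_order_acquire);
        uint64_t avail = wr - rd;
        uint64_t take = (avail < want) ? avail : want;

        if (take == 0)
            return 0;           /* nothing queued yet */
        if (atomic_compare_exchange_weak_explicit(&r->read_pos, &rd, rd + take,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
        {
            *first = rd;        /* offsets[rd .. rd+take-1] are ours */
            return (uint32_t) take;
        }
        /* another worker got there first; retry */
    }
}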

Greetings,

Andres Freund



Re: Parallel copy

From
Robert Haas
Date:
On Mon, Apr 13, 2020 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think so. If only one process does the splitting, the
> exclusively locked section is just popping off a bunch of offsets of the
> ring. And that could fairly easily be done with atomic ops (since what
> we need is basically a single producer multiple consumer queue, which
> can be done lock free fairly easily ). Whereas in the case of each
> process doing the splitting, the exclusively locked part is splitting
> along lines - which takes considerably longer than just popping off a
> few offsets.

Hmm, that does seem believable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Kuntal Ghosh
Date:
Hello,

I was going through some literature on parsing CSV files in a fully
parallelized way and found (from [1]) an interesting approach
implemented in the open-source project ParaText[2]. The algorithm
follows a two-phase approach: the first pass identifies the adjusted
chunks in parallel by exploiting the simplicity of CSV formats and the
second phase processes complete records within each adjusted chunk by
one of the available workers. Here is the sketch:

1. Each worker scans a distinct fixed sized chunk of the CSV file and
collects the following three stats from the chunk:
a) number of quotes
b) position of the first new line after even number of quotes
c) position of the first new line after odd number of quotes
2. Once stats from all the chunks are collected, the leader identifies
the adjusted chunk boundaries by iterating over the stats linearly:
- For the k-th chunk, the leader adds up the number of quotes in the first k-1 chunks.
- If the number is even, then the k-th chunk does not start in the
middle of a quoted field, and the first newline after an even number
of quotes (the second collected information) is the first record
delimiter in this chunk.
- Otherwise, if the number is odd, the first newline after an odd
number of quotes (the third collected information) is the first record
delimiter.
- The end position of the adjusted chunk is obtained based on the
starting position of the next adjusted chunk.
3. Once the boundaries of the chunks are determined (forming adjusted
chunks), individual worker may take up one adjusted chunk and process
the tuples independently.
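
As a simplified sketch of phases 1 and 2 (illustrative names only; it
ignores escaping and the other CSV subtleties discussed below):

#include <stddef.h>
#include <stdint.h>

typedef struct ChunkStats
{
    uint64_t    n_quotes;       /* quote characters in the chunk */
    int64_t     first_nl_even;  /* first newline after an even number of quotes, or -1 */
    int64_t     first_nl_odd;   /* first newline after an odd number of quotes, or -1 */
} ChunkStats;

/* Phase 1: each worker scans its fixed-size chunk independently. */
static void
collect_chunk_stats(const char *chunk, size_t len, ChunkStats *st)
{
    uint64_t    quotes = 0;

    st->first_nl_even = st->first_nl_odd = -1;
    for (size_t i = 0; i < len; i++)
    {
        if (chunk[i] == '"')
            quotes++;
        else if (chunk[i] == '\n')
        {
            if (quotes % 2 == 0 && st->first_nl_even < 0)
                st->first_nl_even = (int64_t) i;
            else if (quotes % 2 == 1 && st->first_nl_odd < 0)
                st->first_nl_odd = (int64_t) i;
        }
    }
    st->n_quotes = quotes;
}

/*
 * Phase 2 (serial): for each chunk, pick the offset of its first record
 * delimiter based on the quote parity of everything before the chunk.
 * first_delim[k] == -1 means no record ends inside chunk k.
 */
static void
adjust_boundaries(const ChunkStats *stats, int nchunks, int64_t *first_delim)
{
    uint64_t    quotes_before = 0;

    for (int k = 0; k < nchunks; k++)
    {
        first_delim[k] = (quotes_before % 2 == 0) ? stats[k].first_nl_even
                                                  : stats[k].first_nl_odd;
        quotes_before += stats[k].n_quotes;
    }
}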

Although this approach parses the CSV in parallel, it requires two
scans of the CSV file. So, given a system with a spinning hard disk and
small RAM, as per my understanding, the algorithm will perform very
poorly. But, if we use this algorithm to parse a CSV file on a
multi-core system with a large RAM, the performance might be improved
significantly [1].

Hence, I was trying to think whether we can leverage this idea for
implementing parallel COPY in PG. We can design an algorithm similar
to parallel hash-join where the workers pass through different phases.
1. Phase 1 - Read fixed size chunks in parallel, store the chunks and
the small stats about each chunk in the shared memory. If the shared
memory is full, go to phase 2.
2. Phase 2 - Allow a single worker to process the stats and decide the
actual chunk boundaries so that no tuple spans across two different
chunks. Go to phase 3.
3. Phase 3 - Each worker picks one adjusted chunk, and parses and
processes the tuples from it. Once done with one chunk, it picks the next one
and so on.
4. If there are still some unread contents, go back to phase 1.

We can probably use separate workers for phase 1 and phase 3 so that
they can work concurrently.

Advantages:
1. Each worker spends some significant time in each phase. Gets
benefit of the instruction cache - at least in phase 1.
2. It also has the same advantage of parallel hash join - fast workers
get to work more.
3. We can extend this solution for reading data from STDIN. Of course,
the phase 1 and phase 2 must be performed by the leader process who
can read from the socket.

Disadvantages:
1. Surely doesn't work if we don't have enough shared memory.
2. Probably, this approach is just impractical for PG due to certain
limitations.

Thoughts?

[1] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/chunker-sigmod19.pdf
[2] ParaText. https://github.com/wiseio/paratext.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Ants Aasma
Date:
On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> 1. Each worker scans a distinct fixed sized chunk of the CSV file and
> collects the following three stats from the chunk:
> a) number of quotes
> b) position of the first new line after even number of quotes
> c) position of the first new line after odd number of quotes
> 2. Once stats from all the chunks are collected, the leader identifies
> the adjusted chunk boundaries by iterating over the stats linearly:
> - For the k-th chunk, the leader adds the number of quotes in k-1 chunks.
> - If the number is even, then the k-th chunk does not start in the
> middle of a quoted field, and the first newline after an even number
> of quotes (the second collected information) is the first record
> delimiter in this chunk.
> - Otherwise, if the number is odd, the first newline after an odd
> number of quotes (the third collected information) is the first record
> delimiter.
> - The end position of the adjusted chunk is obtained based on the
> starting position of the next adjusted chunk.

The trouble is that, at least with current coding, the number of
quotes in a chunk can depend on whether the chunk started in a quote
or not. That's because escape characters only count inside quotes. See
for example the following csv:

foo,\"bar
baz",\"xyz"

This currently parses as one line and the number of parsed quotes
doesn't change if you add a quote in front.

But the general approach of doing the tokenization in parallel and
then a serial pass over the tokenization would still work. The quote
counting and newline finding just has to be done for both the
starting-in-quote and the not-starting-in-quote cases.
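
For concreteness, a rough sketch (illustrative only, assuming a
backslash escape that counts only inside quotes, as in the example
above) of scanning one chunk under both starting-state assumptions:

#include <stdint.h>
#include <stddef.h>

typedef struct ChunkScan
{
    int64_t first_nl[2];   /* first record-ending newline assuming the chunk
                            * starts outside [0] / inside [1] a quoted field;
                            * -1 if none */
    int     end_inside[2]; /* ending state for each assumption */
} ChunkScan;

void
scan_chunk(const char *buf, size_t len, ChunkScan *out)
{
    for (int start_inside = 0; start_inside <= 1; start_inside++)
    {
        int     inside = start_inside;
        int64_t first_nl = -1;

        for (size_t i = 0; i < len; i++)
        {
            char    c = buf[i];

            if (inside && c == '\\' && i + 1 < len)
                i++;                /* escapes only count inside quotes */
            else if (c == '"')
                inside = !inside;
            else if (c == '\n' && !inside && first_nl < 0)
                first_nl = (int64_t) i;
        }
        out->first_nl[start_inside] = first_nl;
        out->end_inside[start_inside] = inside;
    }
}

The serial pass then chains the chunks in order: the actual starting
state of chunk k+1 is the end state of chunk k under chunk k's actual
starting state, and that same choice selects which first_nl to use as
the record boundary.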

Using phases doesn't look like the correct approach - the tokenization
can be prepared just in time for the serial pass and processing the
chunk can proceed immediately after. This could all be done by having
the data in a single ringbuffer with a processing pipeline where one
process does the reading, then workers grab tokenization chunks as
they become available, then one process handles determining the chunk
boundaries, after which the chunks are processed.

But I still don't think this is something to worry about for the first
version. Just a better line splitting algorithm should go a looong way
in feeding a large number of workers, even when inserting to an
unindexed unlogged table. If we get the SIMD line splitting in, it
will be enough to overwhelm most I/O subsystems available today.

Regards,
Ants Aasma



Re: Parallel copy

From
Ants Aasma
Date:
On Mon, 13 Apr 2020 at 23:16, Andres Freund <andres@anarazel.de> wrote:
> > Still, if the reader does the splitting, then you don't need as much
> > IPC, right? The shared memory data structure is just a ring of bytes,
> > and whoever reads from it is responsible for the rest.
>
> I don't think so. If only one process does the splitting, the
> exclusively locked section is just popping off a bunch of offsets of the
> ring. And that could fairly easily be done with atomic ops (since what
> we need is basically a single producer multiple consumer queue, which
> can be done lock free fairly easily ). Whereas in the case of each
> process doing the splitting, the exclusively locked part is splitting
> along lines - which takes considerably longer than just popping off a
> few offsets.

I see the benefit of having one process responsible for splitting as
being able to run ahead of the workers to queue up work when many of
them need new data at the same time. I don't think the locking
benefits of a ring are important in this case. At current rather
conservative chunk sizes we are looking at ~100k chunks per second at
best, normal locking should be perfectly adequate. And chunk size can
easily be increased. I see the main value in it being simple.

But there is a point that having a layer of indirection instead of a
linear buffer allows for some workers to fall behind, either because
the kernel scheduled them out for a time slice, because they need to
do I/O, or because inserting some tuple hit a unique conflict and
needs to wait for a tx to complete or abort to resolve. With a ring
buffer, reading has to wait on the slowest worker reading its chunk.
Having workers copy the data to a local buffer as the first step would
reduce the probability of hitting any issues. But still, at GB/s
rates, hiding a 10ms timeslice of delay would need 10's of megabytes
of buffer.

FWIW. I think just increasing the buffer is good enough - the CPUs
processing this workload are likely to have tens to hundreds of
megabytes of cache on board.



Re: Parallel copy

From
Amit Kapila
Date:
On Wed, Apr 15, 2020 at 1:10 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> Hence, I was trying to think whether we can leverage this idea for
> implementing parallel COPY in PG. We can design an algorithm similar
> to parallel hash-join where the workers pass through different phases.
> 1. Phase 1 - Read fixed size chunks in parallel, store the chunks and
> the small stats about each chunk in the shared memory. If the shared
> memory is full, go to phase 2.
> 2. Phase 2 - Allow a single worker to process the stats and decide the
> actual chunk boundaries so that no tuple spans across two different
> chunks. Go to phase 3.
>
> 3. Phase 3 - Each worker picks one adjusted chunk, parse and process
> tuples from the same. Once done with one chunk, it picks the next one
> and so on.
>
> 4. If there are still some unread contents, go back to phase 1.
>
> We can probably use separate workers for phase 1 and phase 3 so that
> they can work concurrently.
>
> Advantages:
> 1. Each worker spends some significant time in each phase. Gets
> benefit of the instruction cache - at least in phase 1.
> 2. It also has the same advantage of parallel hash join - fast workers
> get to work more.
> 3. We can extend this solution for reading data from STDIN. Of course,
> the phase 1 and phase 2 must be performed by the leader process who
> can read from the socket.
>
> Disadvantages:
> 1. Surely doesn't work if we don't have enough shared memory.
> 2. Probably, this approach is just impractical for PG due to certain
> limitations.
>

As I understand this, it needs to parse the lines twice (the second
time in phase-3), and till the first two phases are over, we can't
start the tuple processing work which is done in phase-3.  So even if
the tokenization is done a bit faster, we will lose some time on
processing the tuples, which might not be an overall win and in fact
can be worse as compared to the single reader approach being
discussed.  Now, if the work done in tokenization were a major (or
significant) portion of the copy then thinking of such a technique
might be useful, but that is not the case as seen in the data shared
above in this thread (the tokenize time is much less than the data
processing time).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Robert Haas
Date:
On Wed, Apr 15, 2020 at 7:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> As I understand this, it needs to parse the lines twice (second time
> in phase-3) and till the first two phases are over, we can't start the
> tuple processing work which is done in phase-3.  So even if the
> tokenization is done a bit faster but we will lose some on processing
> the tuples which might not be an overall win and in fact, it can be
> worse as compared to the single reader approach being discussed.
> Now, if the work done in tokenization is a major (or significant)
> portion of the copy then thinking of such a technique might be useful
> but that is not the case as seen in the data shared above (the
> tokenize time is very less as compared to data processing time) in
> this email.

It seems to me that a good first step here might be to forget about
parallelism for a minute and just write a patch to make the line
splitting as fast as possible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Kuntal Ghosh
Date:
On Wed, Apr 15, 2020 at 2:15 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > 1. Each worker scans a distinct fixed sized chunk of the CSV file and
> > collects the following three stats from the chunk:
> > a) number of quotes
> > b) position of the first new line after even number of quotes
> > c) position of the first new line after odd number of quotes
> > 2. Once stats from all the chunks are collected, the leader identifies
> > the adjusted chunk boundaries by iterating over the stats linearly:
> > - For the k-th chunk, the leader adds the number of quotes in k-1 chunks.
> > - If the number is even, then the k-th chunk does not start in the
> > middle of a quoted field, and the first newline after an even number
> > of quotes (the second collected information) is the first record
> > delimiter in this chunk.
> > - Otherwise, if the number is odd, the first newline after an odd
> > number of quotes (the third collected information) is the first record
> > delimiter.
> > - The end position of the adjusted chunk is obtained based on the
> > starting position of the next adjusted chunk.
>
> The trouble is that, at least with current coding, the number of
> quotes in a chunk can depend on whether the chunk started in a quote
> or not. That's because escape characters only count inside quotes. See
> for example the following csv:
>
> foo,\"bar
> baz",\"xyz"
>
> This currently parses as one line and the number of parsed quotes
> doesn't change if you add a quote in front.
>
> But the general approach of doing the tokenization in parallel and
> then a serial pass over the tokenization would still work. The quote
> counting and new line finding just has to be done for both starting in
> quote and not starting in quote case.
>
Yeah, right.

> Using phases doesn't look like the correct approach - the tokenization
> can be prepared just in time for the serial pass and processing the
> chunk can proceed immediately after. This could all be done by having
> the data in a single ringbuffer with a processing pipeline where one
> process does the reading, then workers grab tokenization chunks as
> they become available, then one process handles determining the chunk
> boundaries, after which the chunks are processed.
>
I was thinking from this point of view - the sooner we introduce
parallelism in the process, the greater the benefits. Probably there
isn't any way to avoid a single pass over the data (phase 2 in the
above case) to tokenise the chunks. So yeah, if the reading and
tokenisation phase doesn't take much time, parallelising it will
just be overkill. As pointed out by Andres and you, using a lock-free
circular buffer implementation sounds like the way to go forward.
AFAIK, a FIFO circular queue with a CAS-based implementation suffers
from two problems: 1. (as pointed out by you) slow workers may block
producers; 2. since it doesn't partition the queue among the workers,
it does not achieve good locality and cache-friendliness, which limits
scalability on NUMA systems.

> But I still don't think this is something to worry about for the first
> version. Just a better line splitting algorithm should go a looong way
> in feeding a large number of workers, even when inserting to an
> unindexed unlogged table. If we get the SIMD line splitting in, it
> will be enough to overwhelm most I/O subsystems available today.
>
Yeah. Parsing text is a great use case for data parallelism, which can
be achieved with SIMD instructions. Consider processing 8-bit ASCII
characters in a 512-bit SIMD word. A lot of code and complexity from
CopyReadLineText will surely go away. And further (I'm not sure on
this point), if we can use the schema of the table, perhaps JIT can
generate machine code to read fields efficiently based on their
types.
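
As a rough illustration of the idea (SSE2 plus a GCC/Clang builtin,
just to keep the sketch short; an AVX-512 variant would do the same on
64 bytes per iteration, and quote handling is deliberately left out):

#include <emmintrin.h>      /* SSE2 intrinsics */
#include <stddef.h>

/* Return the offset of the first '\n' in buf, or len if there is none. */
size_t
find_newline_sse2(const char *buf, size_t len)
{
    const __m128i nl = _mm_set1_epi8('\n');
    size_t      i = 0;

    for (; i + 16 <= len; i += 16)
    {
        __m128i     chunk = _mm_loadu_si128((const __m128i *) (buf + i));
        int         mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, nl));

        if (mask != 0)
            return i + __builtin_ctz(mask);     /* lowest set bit = first match */
    }

    /* scalar tail */
    for (; i < len; i++)
        if (buf[i] == '\n')
            return i;
    return len;
}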


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Andres Freund
Date:
On 2020-04-15 10:12:14 -0400, Robert Haas wrote:
> On Wed, Apr 15, 2020 at 7:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > As I understand this, it needs to parse the lines twice (second time
> > in phase-3) and till the first two phases are over, we can't start the
> > tuple processing work which is done in phase-3.  So even if the
> > tokenization is done a bit faster but we will lose some on processing
> > the tuples which might not be an overall win and in fact, it can be
> > worse as compared to the single reader approach being discussed.
> > Now, if the work done in tokenization is a major (or significant)
> > portion of the copy then thinking of such a technique might be useful
> > but that is not the case as seen in the data shared above (the
> > tokenize time is very less as compared to data processing time) in
> > this email.
> 
> It seems to me that a good first step here might be to forget about
> parallelism for a minute and just write a patch to make the line
> splitting as fast as possible.

+1

Compared to all the rest of the effort during COPY, a fast "split rows"
implementation should not be a bottleneck anymore.



Re: Parallel copy

From
Andres Freund
Date:
Hi,

On 2020-04-15 20:36:39 +0530, Kuntal Ghosh wrote:
> I was thinking from this point of view - the sooner we introduce
> parallelism in the process, the greater the benefits.

I don't really agree. Sure, that's true from a theoretical perspective,
but the incremental gains may be very small, and the cost in complexity
very high. If we can get single threaded splitting of rows to be >4GB/s,
which should very well be attainable, the rest of the COPY work is going
to dominate the time.  We shouldn't add complexity to parallelize more
of the line splitting, caring too much about scalable datastructures,
etc when the bottleneck after some straightforward optimization is
usually still in the parallelized part.

I'd expect that for now we'd likely hit scalability issues in other
parts of the system first (e.g. extension locks, buffer mapping).

Greetings,

Andres Freund



Re: Parallel copy

From
Andres Freund
Date:
Hi,

On 2020-04-15 12:05:47 +0300, Ants Aasma wrote:
> I see the benefit of having one process responsible for splitting as
> being able to run ahead of the workers to queue up work when many of
> them need new data at the same time.

Yea, I agree.


> I don't think the locking benefits of a ring are important in this
> case. At current rather conservative chunk sizes we are looking at
> ~100k chunks per second at best, normal locking should be perfectly
> adequate. And chunk size can easily be increased. I see the main value
> in it being simple.

I think the locking benefits of not needing to hold a lock *while*
splitting (as we'd need in some proposal floated earlier) is likely to
already be beneficial. I don't think we need to worry about lock
scalability protecting the queue of already split data, for now.

I don't think we really want to have a much larger chunk size,
btw. It makes it more likely for the data handed to different workers
to take an uneven amount of time to process.


> But there is a point that having a layer of indirection instead of a
> linear buffer allows for some workers to fall behind.

Yea. It'd probably make sense to read the input data into an array of
evenly sized blocks, and have the datastructure (still think a
ringbuffer makes sense) of split boundaries point into those entries. If
we don't require the input blocks to be in-order in that array, we can
reuse blocks therein that are fully processed, even if "earlier" data in
the input has not yet been fully processed.


> With a ring buffer reading has to wait on the slowest worker reading
> its chunk.

To be clear, I was only thinking of using a ringbuffer to indicate split
boundaries. And that workers would just pop entries from it before they
actually process the data (stored outside of the ringbuffer). Since the
split boundaries will always be read in order by workers, and the
entries will be tiny, there's no need to avoid copying out entries.


So basically what I was thinking we *eventually* may want (I'd forgo some
of this initially) is something like:

struct InputBlock
{
    uint32 unprocessed_chunk_parts;
    uint32 following_block;
    char data[INPUT_BLOCK_SIZE];
};

// array of input data, with > 2*nworkers entries
InputBlock *input_blocks;

struct ChunkedInputBoundary
{
    uint32 firstblock;
    uint32 startoff;
};

struct ChunkedInputBoundaries
{
    uint32 read_pos;
    uint32 write_end;
    ChunkedInputBoundary ring[RINGSIZE];
};

Where the leader would read data into InputBlocks with
unprocessed_chunk_parts == 0. Then it'd split the read input data into
chunks (presumably with chunk size << input block size), putting
identified chunks into ChunkedInputBoundaries. For each
ChunkedInputBoundary it'd increment the unprocessed_chunk_parts of each
InputBlock containing parts of the chunk.  For chunks across >1
InputBlocks each InputBlock's following_block would be set accordingly.

Workers would just pop an entry from the ringbuffer (making that entry
reusable), and process the chunk. The underlying data would not be
copied out of the InputBlocks, but obviously readers would need to take
care to handle InputBlock boundaries. Whenever a chunk is fully read, or
when crossing a InputBlock boundary, the InputBlock's
unprocessed_chunk_parts would be decremented.

Recycling of InputBlocks could probably just be an occasional linear
search for buffers with unprocessed_chunk_parts == 0.
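
To make the worker side concrete, a rough standalone sketch (C11
atomics standing in for pg_atomic_* / proper locking, and waiting /
error handling elided; an illustration of the idea, not something I'd
commit):

#include <stdatomic.h>
#include <stdint.h>

#define INPUT_BLOCK_SIZE 65536
#define RINGSIZE 1024

typedef struct InputBlock
{
    atomic_uint unprocessed_chunk_parts;    /* chunk parts still pointing here */
    uint32_t    following_block;            /* continuation block for spanning chunks */
    char        data[INPUT_BLOCK_SIZE];
} InputBlock;

typedef struct ChunkedInputBoundary
{
    uint32_t    firstblock;
    uint32_t    startoff;
} ChunkedInputBoundary;

typedef struct ChunkedInputBoundaries
{
    atomic_uint read_pos;       /* next entry for a worker to claim */
    atomic_uint write_end;      /* advanced by the leader after filling entries */
    ChunkedInputBoundary ring[RINGSIZE];
} ChunkedInputBoundaries;

/*
 * Claim the next split boundary.  Copying the tiny entry out makes the
 * ring slot immediately reusable by the leader.  Returns 0 if nothing is
 * queued yet (a real version would sleep on a condition variable).
 */
int
claim_next_chunk(ChunkedInputBoundaries *ring, ChunkedInputBoundary *out)
{
    unsigned int pos = atomic_load(&ring->read_pos);

    do
    {
        if (pos >= atomic_load(&ring->write_end))
            return 0;
    } while (!atomic_compare_exchange_weak(&ring->read_pos, &pos, pos + 1));

    *out = ring->ring[pos % RINGSIZE];
    return 1;
}

/*
 * Called whenever a worker finishes with the part of a chunk that lives in
 * one InputBlock (chunk ended, or it continues in following_block).  The
 * leader recycles blocks whose count has dropped back to zero.
 */
void
release_block_part(InputBlock *blocks, uint32_t blockno)
{
    atomic_fetch_sub(&blocks[blockno].unprocessed_chunk_parts, 1);
}

On the leader side the equivalent would be to bump
unprocessed_chunk_parts once per InputBlock a chunk touches before
publishing the entry by advancing write_end.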


Something roughly like this should not be too complicated to
implement. Unless extremely unlucky (very wide input data spanning many
InputBlocks), a straggling reader would not prevent global progress; it'd
just prevent reuse of the InputBlocks holding data for its chunk (normally
that'd be two InputBlocks, not more).


> Having workers copy the data to a local buffer as the first
> step would reduce the probability of hitting any issues. But still, at
> GB/s rates, hiding a 10ms timeslice of delay would need 10's of
> megabytes of buffer.

Yea. Given the likelihood of blocking on resources (reading in index
data, writing out dirty buffers for reclaim, row locks for uniqueness
checks, extension locks, ...), as well as non uniform per-row costs
(partial indexes, index splits, ...) I think we ought to try to cope
well with that. IMO/IME it'll be common to see stalls that are much
longer than 10ms for processes that do COPY, even when the system is not
overloaded.


> FWIW. I think just increasing the buffer is good enough - the CPUs
> processing this workload are likely to have tens to hundreds of
> megabytes of cache on board.

It'll not necessarily be a cache shared between leader / workers though,
and some of the cache-to-cache transfers will be more expensive even within
a socket (between core complexes for AMD, multi-chip processors for
Intel).

Greetings,

Andres Freund



Re: Parallel copy

From
Kuntal Ghosh
Date:
On Wed, Apr 15, 2020 at 10:45 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-04-15 20:36:39 +0530, Kuntal Ghosh wrote:
> > I was thinking from this point of view - the sooner we introduce
> > parallelism in the process, the greater the benefits.
>
> I don't really agree. Sure, that's true from a theoretical perspective,
> but the incremental gains may be very small, and the cost in complexity
> very high. If we can get single threaded splitting of rows to be >4GB/s,
> which should very well be attainable, the rest of the COPY work is going
> to dominate the time.  We shouldn't add complexity to parallelize more
> of the line splitting, caring too much about scalable datastructures,
> etc when the bottleneck after some straightforward optimization is
> usually still in the parallelized part.
>
> I'd expect that for now we'd likely hit scalability issues in other
> parts of the system first (e.g. extension locks, buffer mapping).
>
Got your point. In this particular case, a single producer is fast
enough (or probably we can make it fast enough) to generate enough
chunks for multiple consumers so that they don't stay idle and wait
for work.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Wed, Apr 15, 2020 at 11:49 PM Andres Freund <andres@anarazel.de> wrote:
>
> To be clear, I was only thinking of using a ringbuffer to indicate split
> boundaries. And that workers would just pop entries from it before they
> actually process the data (stored outside of the ringbuffer). Since the
> split boundaries will always be read in order by workers, and the
> entries will be tiny, there's no need to avoid copying out entries.
>

I think the binary mode processing will be slightly different because,
unlike the text and csv formats, the data is stored in Length, Value
format for each column and there are no line markers.  I don't think
there will be a big difference, but we still need to keep track
somewhere of the format of the data in the ring buffers.  Basically, we
can copy the data in Length, Value format and, once the writers know
about the format, they will parse the data appropriately.  We currently
also have a different way of parsing the binary format, see
NextCopyFrom.  I think we need to be careful about avoiding duplicate
work as much as possible.
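
Just to illustrate the shape of the data a binary-mode splitter would
have to deal with (per the documented COPY BINARY layout: per tuple a
16-bit field count, then a 32-bit length followed by the value for each
column, lengths of -1 meaning NULL, all in network byte order; rough
sketch only, names are made up):

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>      /* ntohs/ntohl */

/*
 * Returns the number of bytes occupied by the next tuple in buf, or 0 if
 * the buffer does not yet hold a complete tuple (or holds the end-of-data
 * marker, a field count of -1).
 */
size_t
binary_tuple_length(const char *buf, size_t avail)
{
    uint16_t    nfields;
    size_t      off;

    if (avail < sizeof(uint16_t))
        return 0;
    memcpy(&nfields, buf, sizeof(uint16_t));
    nfields = ntohs(nfields);
    if (nfields == 0xFFFF)          /* -1: end-of-data marker */
        return 0;
    off = sizeof(uint16_t);

    for (uint16_t i = 0; i < nfields; i++)
    {
        int32_t     len;

        if (avail - off < sizeof(int32_t))
            return 0;
        memcpy(&len, buf + off, sizeof(int32_t));
        len = (int32_t) ntohl((uint32_t) len);
        off += sizeof(int32_t);

        if (len > 0)
        {
            if (avail - off < (size_t) len)
                return 0;
            off += (size_t) len;
        }
        /* len == -1 means NULL, nothing to skip */
    }
    return off;
}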

Apart from this, we have analyzed the other cases, as mentioned below,
where we need to decide whether we can allow parallelism for the copy
command.
Case-1:
Do we want to enable parallelism for a copy when transition tables are
involved?  Basically, during the copy, we capture tuples in
transition tables for certain cases, like when an after statement
trigger accesses the same relation on which we have a trigger.  See
the example below [1].  We decide this in the function
MakeTransitionCaptureState.  For such cases, we collect minimal tuples
in the tuple store after processing them so that after statement
triggers can access them later.  Now, if we want to enable parallelism
for such cases, we instead need to store and access tuples from a
shared tuple store (sharedtuplestore.c/sharedtuplestore.h).  However,
it doesn't have the facility to store tuples in-memory, so we always
need to store and access them from a file, which could be costly,
unless we also add a way to keep minimal tuples in shared memory up to
work_mem and only then spill to the shared tuple store.  It is
possible to do all or part of this work to enable parallel copy for
such cases, but I am not sure if it is worth it.  We can decide not to
enable parallelism for such cases and allow it later if we see demand
for it; that will also help us avoid introducing additional
work/complexity in the first version of the patch.

Case-2:
The Single Insertion mode (CIM_SINGLE) is performed in various
scenarios, and whether we can allow parallelism for those depends on
the case, as discussed below:
a. When there are BEFORE/INSTEAD OF triggers on the table.  We don't
allow multi-inserts in such cases because such triggers might query
the table we're inserting into and act differently if the tuples that
have already been processed and prepared for insertion are not there.
Now, if we allow parallelism with such triggers, the behavior would
depend on whether the parallel worker has already inserted that
particular row or not.  I guess such trigger functions should ideally
be marked as parallel-unsafe.  So, in short, whether to allow
parallelism in this case depends upon the parallel-safety marking of
the trigger function.
b. For partitioned tables, we can't support multi-inserts when there
are any statement-level insert triggers.  This is because, as of now,
we expect that any before row insert and statement-level insert
triggers are on the same relation.  Now, there is no harm in allowing
parallelism for such cases, but it depends upon whether we have the
infrastructure (basically, allowing tuples to be collected in a shared
tuple store) to support statement-level insert triggers.
c. For inserts into foreign tables.  We can't allow parallelism in
this case because each worker needs to establish the FDW connection
and operate in a separate transaction.  Unless we have a capability to
provide a two-phase commit protocol for "Transactions involving
multiple postgres foreign servers" (which is being discussed in a
separate thread [2]), we can't allow this.
d. If there are volatile default expressions or the where clause
contains a volatile expression.  Here, we can check whether the
expressions are parallel-safe; if so, we can allow parallelism (a
rough sketch of such a gate follows below).
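
To show how these checks might hang together (purely illustrative,
assuming the usual backend headers; the volatility check is just a
stand-in for going through the real parallel-safety machinery, and the
function and parameter names are made up):

static bool
copy_allows_parallelism(Relation rel, List *defexprs, Node *whereClause)
{
    TriggerDesc *trigdesc = rel->trigdesc;

    /* (a) BEFORE/INSTEAD OF row triggers: rely on their safety marking */
    if (trigdesc &&
        (trigdesc->trig_insert_before_row || trigdesc->trig_insert_instead_row))
        return false;       /* or: check the trigger functions' markings */

    /* (c) foreign tables need per-worker FDW connections/transactions */
    if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
        return false;

    /* (d) volatile default expressions or WHERE clause */
    if (contain_volatile_functions((Node *) defexprs) ||
        contain_volatile_functions(whereClause))
        return false;

    return true;
}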

Case-3:
In the copy command, for performing foreign key checks, we take a KEY
SHARE lock on primary key table rows, which in turn increments the
command counter and updates the snapshot.  Now, as we share the
snapshots at the beginning of the command, we can't allow them to be
changed later.  So, unless we do something special for it, I think we
can't allow parallelism in such cases.

I couldn't think of many problems if we allow parallelism in such
cases.  One inconsistency, if we allow FK checks via workers, would be
that at the end of COPY the value of the command counter will not be
what we expect, as we wouldn't have accounted for the increments done
by workers.  Now, if the COPY is being done in a transaction, it will
not assign the correct values to the next commands.  Also, for
executing deferred triggers, we use the transaction snapshot, so if
anything changes the snapshot via parallel workers, ideally the
changed snapshot should be synced back from the worker.

Now, the other concern could be that different workers can try to
acquire a KEY SHARE lock on the same tuples, which they will be able
to acquire due to group locking or otherwise, but I don't see any
problem with that.

I am not sure if the above leads to any user-visible problem, but I
might be missing something here.  I think if we can think of any real
problems, we can try to design a better solution to address them.

Case-4:
For deferred triggers, it seems we record the CTIDs of tuples (via
ExecARInsertTriggers->AfterTriggerSaveEvent) and then execute deferred
triggers at transaction end using AfterTriggerFireDeferred, or at the
end of the statement.  The challenge in allowing parallelism for such
cases is that we need to capture the CTID events in shared memory.
For that, we would either need to invent new infrastructure for event
capture in shared memory, which would be a huge task on its own, or
pass the CTIDs back via shared memory and have the leader add them to
the event queues; but in that case I think we need to ensure the order
of the CTIDs (basically, they should be in the same order in which we
processed them).

[1] -
create or replace function dump_insert() returns trigger language plpgsql as
$$
  begin
    raise notice 'trigger = %, new table = %',
                 TG_NAME,
                 (select string_agg(new_table::text, ', ' order by a)
from new_table);
    return null;
  end;
$$;

create table test (a int);
create trigger trg1_test  after insert on test referencing new table
as new_table  for each statement execute procedure dump_insert();
copy test (a) from stdin;
1
2
3
\.

[2] - https://www.postgresql.org/message-id/20191206.173215.1818665441859410805.horikyota.ntt%40gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Robert Haas
Date:
I wonder why you're still looking at this instead of looking at just
speeding up the current code, especially the line splitting, per
previous discussion. And then coming back to study this issue more
after that's done.

On Mon, May 11, 2020 at 8:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Apart from this, we have analyzed the other cases as mentioned below
> where we need to decide whether we can allow parallelism for the copy
> command.
> Case-1:
> Do we want to enable parallelism for a copy when transition tables are
> involved?

I think it would be OK not to support this.

> Case-2:
> a. When there are BEFORE/INSTEAD OF triggers on the table.
> b. For partitioned tables, we can't support multi-inserts when there
> are any statement-level insert triggers.
> c. For inserts into foreign tables.
> d. If there are volatile default expressions or the where clause
> contains a volatile expression.  Here, we can check if the expression
> is parallel-safe, then we can allow parallelism.

This all sounds fine.

> Case-3:
> In copy command, for performing foreign key checks, we take KEY SHARE
> lock on primary key table rows which inturn will increment the command
> counter and updates the snapshot.  Now, as we share the snapshots at
> the beginning of the command, we can't allow it to be changed later.
> So, unless we do something special for it, I think we can't allow
> parallelism in such cases.

This sounds like much more of a problem to me; it'd be a significant
restriction that would kick in in routine cases where the user isn't
doing anything particularly exciting. The command counter presumably
only needs to be updated once per command, so maybe we could do that
before we start parallelism. However, I think we would need to have
some kind of dynamic memory structure to which new combo CIDs can be
added by any member of the group, and then discovered by other members
of the group later. At the end of the parallel operation, the leader
must discover any combo CIDs added by others to that table before
destroying it, even if it has no immediate use for the information. We
can't allow a situation where the group members have inconsistent
notions of which combo CIDs exist or what their mappings are, and if
KEY SHARE locks are being taken, new combo CIDs could be created.

> Case-4:
> For Deferred Triggers, it seems we record CTIDs of tuples (via
> ExecARInsertTriggers->AfterTriggerSaveEvent) and then execute deferred
> triggers at transaction end using AfterTriggerFireDeferred or at end
> of the statement.

I think this could be left for the future.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Amit Kapila
Date:
On Mon, May 11, 2020 at 11:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I wonder why you're still looking at this instead of looking at just
> speeding up the current code, especially the line splitting,
>

Because the line splitting is just 1-2% of the overall work in common
cases.  See the data shared by Vignesh for various workloads [1].  The
time it takes is approximately in the range of 0.5-12%, and for cases
like a table with a few indexes, it is not more than 1-2%.

[1] - https://www.postgresql.org/message-id/CALDaNm3r8cPsk0Vo_-6AXipTrVwd0o9U2S0nCmRdku1Dn-Tpqg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Mon, May 11, 2020 at 11:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > Case-3:
> > In copy command, for performing foreign key checks, we take KEY SHARE
> > lock on primary key table rows which inturn will increment the command
> > counter and updates the snapshot.  Now, as we share the snapshots at
> > the beginning of the command, we can't allow it to be changed later.
> > So, unless we do something special for it, I think we can't allow
> > parallelism in such cases.
>
> This sounds like much more of a problem to me; it'd be a significant
> restriction that would kick in routine cases where the user isn't
> doing anything particularly exciting. The command counter presumably
> only needs to be updated once per command, so maybe we could do that
> before we start parallelism. However, I think we would need to have
> some kind of dynamic memory structure to which new combo CIDs can be
> added by any member of the group, and then discovered by other members
> of the group later. At the end of the parallel operation, the leader
> must discover any combo CIDs added by others to that table before
> destroying it, even if it has no immediate use for the information. We
> can't allow a situation where the group members have inconsistent
> notions of which combo CIDs exist or what their mappings are, and if
> KEY SHARE locks are being taken, new combo CIDs could be created.
>

AFAIU, we don't generate combo CIDs for this case.  See below code in
heap_lock_tuple():

/*
* Store transaction information of xact locking the tuple.
*
* Note: Cmax is meaningless in this context, so don't set it; this avoids
* possibly generating a useless combo CID.  Moreover, if we're locking a
* previously updated tuple, it's important to preserve the Cmax.
*
* Also reset the HOT UPDATE bit, but only if there's no update; otherwise
* we would break the HOT chain.
*/
tuple->t_data->t_infomask &= ~HEAP_XMAX_BITS;
tuple->t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
tuple->t_data->t_infomask |= new_infomask;
tuple->t_data->t_infomask2 |= new_infomask2;

I don't understand why we need to do something special for combo CIDs
if they are not generated during this operation?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Robert Haas
Date:
On Tue, May 12, 2020 at 1:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I don't understand why we need to do something special for combo CIDs
> if they are not generated during this operation?

Hmm. Well I guess if they're not being generated then we don't need to
do anything about them, but I still think we should try to work around
having to disable parallelism for a table which is referenced by
foreign keys.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Amit Kapila
Date:
On Thu, May 14, 2020 at 12:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 1:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I don't understand why we need to do something special for combo CIDs
> > if they are not generated during this operation?
>
> Hmm. Well I guess if they're not being generated then we don't need to
> do anything about them, but I still think we should try to work around
> having to disable parallelism for a table which is referenced by
> foreign keys.
>

Okay, just to be clear, we want to allow parallelism for a table that
has foreign keys.  Basically, a parallel copy should work while
loading data into tables having FK references.

To support that, we need to consider a few things.
a. Currently, we increment the command counter each time we take a key
share lock on a tuple during trigger execution.  I am really not sure
whether this is required during Copy command execution or we can just
increment it once for the copy.  If we need to increment the command
counter just once for the copy command, then for the parallel copy we
can ensure that we do it just once at the end of the parallel copy;
but if not, then we might need some special handling.

b.  Another point is that after inserting rows we record the CTIDs of
the tuples in the event queue, and then once all tuples are processed
we call the FK trigger for each CTID.  Now, with parallelism, the FK
checks will be processed once a worker has processed one chunk.  I
don't see any problem with it, but still, this will be a bit different
from what we do in the serial case.  Do you see any problem with this?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Dilip Kumar
Date:
On Thu, May 14, 2020 at 11:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 14, 2020 at 12:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 1:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I don't understand why we need to do something special for combo CIDs
> > > if they are not generated during this operation?
> >
> > Hmm. Well I guess if they're not being generated then we don't need to
> > do anything about them, but I still think we should try to work around
> > having to disable parallelism for a table which is referenced by
> > foreign keys.
> >
>
> Okay, just to be clear, we want to allow parallelism for a table that
> has foreign keys.  Basically, a parallel copy should work while
> loading data into tables having FK references.
>
> To support that, we need to consider a few things.
> a. Currently, we increment the command counter each time we take a key
> share lock on a tuple during trigger execution.  I am really not sure
> if this is required during Copy command execution or we can just
> increment it once for the copy.   If we need to increment the command
> counter just once for copy command then for the parallel copy we can
> ensure that we do it just once at the end of the parallel copy but if
> not then we might need some special handling.
>
> b.  Another point is that after inserting rows we record CTIDs of the
> tuples in the event queue and then once all tuples are processed we
> call FK trigger for each CTID.  Now, with parallelism, the FK checks
> will be processed once the worker processed one chunk.  I don't see
> any problem with it but still, this will be a bit different from what
> we do in serial case.  Do you see any problem with this?

IMHO, it should not be a problem because, without parallelism also,
we trigger the foreign key check when we detect EOF and the end of
data from STDIN.  And with parallel workers as well, a worker will
assume that it has completed all its work and can go for the foreign
key check only after the leader receives EOF and the end of data from
STDIN.

The only difference is that each worker is not waiting for all the
data (from all workers) to get inserted before checking the
constraint.  Moreover, we are not supporting external triggers with
the parallel copy; otherwise, we might have to worry that those
triggers could do something on the primary table before we check the
constraint.  I am not sure if there are any other factors that I am
missing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Robert Haas
Date:
On Thu, May 14, 2020 at 2:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> To support that, we need to consider a few things.
> a. Currently, we increment the command counter each time we take a key
> share lock on a tuple during trigger execution.  I am really not sure
> if this is required during Copy command execution or we can just
> increment it once for the copy.   If we need to increment the command
> counter just once for copy command then for the parallel copy we can
> ensure that we do it just once at the end of the parallel copy but if
> not then we might need some special handling.

My sense is that it would be a lot more sensible to do it at the
*beginning* of the parallel operation. Once we do it once, we
shouldn't ever do it again; that's how it works now. Deferring it
until later seems much more likely to break things.

> b.  Another point is that after inserting rows we record CTIDs of the
> tuples in the event queue and then once all tuples are processed we
> call FK trigger for each CTID.  Now, with parallelism, the FK checks
> will be processed once the worker processed one chunk.  I don't see
> any problem with it but still, this will be a bit different from what
> we do in serial case.  Do you see any problem with this?

I think there could be some problems here. For instance, suppose that
there are two entries for different workers for the same CTID. If the
leader were trying to do all the work, they'd be handled
consecutively. If they were from completely unrelated processes,
locking would serialize them. But group locking won't, so there you
have an issue, I think. Also, it's not ideal from a work-distribution
perspective: one worker could finish early and be unable to help the
others.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Amit Kapila
Date:
On Fri, May 15, 2020 at 1:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, May 14, 2020 at 2:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > To support that, we need to consider a few things.
> > a. Currently, we increment the command counter each time we take a key
> > share lock on a tuple during trigger execution.  I am really not sure
> > if this is required during Copy command execution or we can just
> > increment it once for the copy.   If we need to increment the command
> > counter just once for copy command then for the parallel copy we can
> > ensure that we do it just once at the end of the parallel copy but if
> > not then we might need some special handling.
>
> My sense is that it would be a lot more sensible to do it at the
> *beginning* of the parallel operation. Once we do it once, we
> shouldn't ever do it again; that's how it works now. Deferring it
> until later seems much more likely to break things.
>

AFAIU, we always increment the command counter after executing the
command.  Why do we want to do it differently here?

> > b.  Another point is that after inserting rows we record CTIDs of the
> > tuples in the event queue and then once all tuples are processed we
> > call FK trigger for each CTID.  Now, with parallelism, the FK checks
> > will be processed once the worker processed one chunk.  I don't see
> > any problem with it but still, this will be a bit different from what
> > we do in serial case.  Do you see any problem with this?
>
> I think there could be some problems here. For instance, suppose that
> there are two entries for different workers for the same CTID.
>

First, let me clarify: the CTIDs I have used in my email are for the
table into which insertion is happening, which means the FK table.  So,
in such a case, we can't have the same CTIDs queued for different
workers.  Basically, we use the CTID to fetch the row from the FK table
later and form a query to lock (in KEY SHARE mode) the corresponding
tuple in the PK table.  Now, it is possible that two different workers
try to lock the same row of the PK table.  I am not clear what problem
group locking can have in this case because these are non-conflicting
locks.  Can you please elaborate a bit more?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Robert Haas
Date:
On Fri, May 15, 2020 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > My sense is that it would be a lot more sensible to do it at the
> > *beginning* of the parallel operation. Once we do it once, we
> > shouldn't ever do it again; that's how it works now. Deferring it
> > until later seems much more likely to break things.
>
> AFAIU, we always increment the command counter after executing the
> command.  Why do we want to do it differently here?

Hmm, now I'm starting to think that I'm confused about what is under
discussion here. Which CommandCounterIncrement() are we talking about
here?

> First, let me clarify the CTID I have used in my email are for the
> table in which insertion is happening which means FK table.  So, in
> such a case, we can't have the same CTIDs queued for different
> workers.  Basically, we use CTID to fetch the row from FK table later
> and form a query to lock (in KEY SHARE mode) the corresponding tuple
> in PK table.  Now, it is possible that two different workers try to
> lock the same row of PK table.  I am not clear what problem group
> locking can have in this case because these are non-conflicting locks.
> Can you please elaborate a bit more?

I'm concerned about two workers trying to take the same lock at the
same time. If that's prevented by the buffer locking then I think it's
OK, but if it's prevented by a heavyweight lock then it's not going to
work in this case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Amit Kapila
Date:
On Fri, May 15, 2020 at 6:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > My sense is that it would be a lot more sensible to do it at the
> > > *beginning* of the parallel operation. Once we do it once, we
> > > shouldn't ever do it again; that's how it works now. Deferring it
> > > until later seems much more likely to break things.
> >
> > AFAIU, we always increment the command counter after executing the
> > command.  Why do we want to do it differently here?
>
> Hmm, now I'm starting to think that I'm confused about what is under
> discussion here. Which CommandCounterIncrement() are we talking about
> here?
>

The one we do after executing a non-readonly command.  Let me try to
explain by example:

CREATE TABLE tab_fk_referenced_chk(refindex INTEGER PRIMARY KEY,
height real, weight real);
insert into tab_fk_referenced_chk values( 1, 1.1, 100);
CREATE TABLE tab_fk_referencing_chk(index INTEGER REFERENCES
tab_fk_referenced_chk(refindex), height real, weight real);

COPY tab_fk_referencing_chk(index,height,weight) FROM stdin WITH(
DELIMITER ',');
1,1.1,100
1,2.1,200
1,3.1,300
\.

In the above case, even though we are executing a single command from
the user's perspective, the currentCommandId will be four after the
command.  One increment is for the copy command and the other three
increments are for locking a tuple in the PK table
(tab_fk_referenced_chk) for each of the three tuples in the FK table
(tab_fk_referencing_chk).  Now, for parallel workers, it is
(theoretically) possible that the three tuples are processed by three
different workers, which don't get synced as of now.  The question was:
do we see any kind of problem with this, and if so, can we just sync it
up at the end of parallelism?

> > First, let me clarify the CTID I have used in my email are for the
> > table in which insertion is happening which means FK table.  So, in
> > such a case, we can't have the same CTIDs queued for different
> > workers.  Basically, we use CTID to fetch the row from FK table later
> > and form a query to lock (in KEY SHARE mode) the corresponding tuple
> > in PK table.  Now, it is possible that two different workers try to
> > lock the same row of PK table.  I am not clear what problem group
> > locking can have in this case because these are non-conflicting locks.
> > Can you please elaborate a bit more?
>
> I'm concerned about two workers trying to take the same lock at the
> same time. If that's prevented by the buffer locking then I think it's
> OK, but if it's prevented by a heavyweight lock then it's not going to
> work in this case.
>

We do take the buffer lock in exclusive mode before trying to acquire
the KEY SHARE lock on the tuple, so two workers shouldn't try to
acquire it at the same time.  I think your concern is that if, in some
case, two workers try to acquire a heavyweight lock like a tuple lock
to perform the operation, it will create a problem because group
locking will allow an operation that should not have been allowed.
But I don't think anything of that sort is feasible in the COPY
operation, and if it is, then we probably need to carefully block it
or find some solution for it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
Hi.

We have made a patch along the lines discussed in the previous mails. We could achieve up to a 9.87X performance improvement. The improvement varies from case to case.

Exec time in seconds (speedup vs. 0 workers) for the following cases:
  (A) copy from file,  2 indexes on integer columns, 1 index on text column
  (B) copy from stdin, 2 indexes on integer columns, 1 index on text column
  (C) copy from file,  1 gist index on text column
  (D) copy from file,  3 indexes on integer columns
  (E) copy from stdin, 3 indexes on integer columns

Workers   (A)               (B)               (C)              (D)              (E)
0         1162.772(1X)      1176.035(1X)      827.669(1X)      216.171(1X)      217.376(1X)
1         1110.288(1.05X)   1120.556(1.05X)   747.384(1.11X)   174.242(1.24X)   163.492(1.33X)
2         635.249(1.83X)    668.18(1.76X)     435.673(1.9X)    133.829(1.61X)   126.516(1.72X)
4         336.835(3.45X)    346.768(3.39X)    236.406(3.5X)    105.767(2.04X)   107.382(2.02X)
8         188.577(6.17X)    194.491(6.04X)    148.962(5.56X)   100.708(2.15X)   107.72(2.01X)
16        126.819(9.17X)    146.402(8.03X)    119.923(6.9X)    97.996(2.2X)     106.531(2.04X)
20        117.845(9.87X)    149.203(7.88X)    138.741(5.96X)   97.94(2.21X)     107.5(2.02)
30        127.554(9.11X)    161.218(7.29X)    172.443(4.8X)    98.232(2.2X)     108.778(1.99X)

Posting the initial patch to get feedback.

Design of the Parallel Copy: The backend to which the "COPY FROM" query is
submitted acts as the leader, with the responsibility of reading data from
the file/stdin and launching at most n workers, as specified with the
PARALLEL 'n' option in the "COPY FROM" query.  The leader populates the
common data required for the workers' execution in the DSM and shares it
with the workers.  The leader then executes before statement triggers, if
any exist.

The leader populates the DSM chunks, which include the start offset and
chunk size; while populating the chunks it reads as many blocks as required
into the DSM data blocks from the file.  Each block is of 64K size.  The
leader parses the data to identify a chunk; the existing logic from
CopyReadLineText, which identifies the chunks, was used for this with some
changes.  The leader checks if a free chunk is available to copy the
information; if there is no free chunk, it waits till the required chunk is
freed up by a worker and then copies the identified chunk's information
(offset & chunk size) into the DSM chunks.  This process is repeated till
the complete file is processed.

Simultaneously, the workers cache the chunks (50) locally into local memory
and release the chunks to the leader for further populating.  Each worker
processes the chunks that it cached and inserts them into the table.  The
leader waits till all the populated chunks are processed by the workers and
then exits.

We would like to add support for parallel copy on tables with referential
integrity constraints and for parallelizing copy from binary format files
in the future.
The above-mentioned tests were run with the CSV format, a file size of
5.1GB & 10 million records in the table. The postgres configuration and
system configuration used are attached in config.txt.
One of my colleagues, Bharath, and I developed this patch. We would like to
thank Amit, Dilip, Robert, Andres, Ants, Kuntal, Alastair, Tomas, David,
Thomas, Andrew & Kyotaro for their thoughts/discussions/suggestions.

Thoughts?

Regards,
Vignesh

On Mon, May 18, 2020 at 10:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 15, 2020 at 6:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > My sense is that it would be a lot more sensible to do it at the
> > > *beginning* of the parallel operation. Once we do it once, we
> > > shouldn't ever do it again; that's how it works now. Deferring it
> > > until later seems much more likely to break things.
> >
> > AFAIU, we always increment the command counter after executing the
> > command.  Why do we want to do it differently here?
>
> Hmm, now I'm starting to think that I'm confused about what is under
> discussion here. Which CommandCounterIncrement() are we talking about
> here?
>

The one we do after executing a non-readonly command.  Let me try to
explain by example:

CREATE TABLE tab_fk_referenced_chk(refindex INTEGER PRIMARY KEY,
height real, weight real);
insert into tab_fk_referenced_chk values( 1, 1.1, 100);
CREATE TABLE tab_fk_referencing_chk(index INTEGER REFERENCES
tab_fk_referenced_chk(refindex), height real, weight real);

COPY tab_fk_referencing_chk(index,height,weight) FROM stdin WITH(
DELIMITER ',');
1,1.1,100
1,2.1,200
1,3.1,300
\.

In the above case, even though we are executing a single command from
the user perspective, but the currentCommandId will be four after the
command.  One increment will be for the copy command and the other
three increments are for locking tuple in PK table
(tab_fk_referenced_chk) for three tuples in FK table
(tab_fk_referencing_chk).  Now, for parallel workers, it is
(theoretically) possible that the three tuples are processed by three
different workers which don't get synced as of now.  The question was
do we see any kind of problem with this and if so can we just sync it
up at the end of parallelism.

> > First, let me clarify the CTID I have used in my email are for the
> > table in which insertion is happening which means FK table.  So, in
> > such a case, we can't have the same CTIDs queued for different
> > workers.  Basically, we use CTID to fetch the row from FK table later
> > and form a query to lock (in KEY SHARE mode) the corresponding tuple
> > in PK table.  Now, it is possible that two different workers try to
> > lock the same row of PK table.  I am not clear what problem group
> > locking can have in this case because these are non-conflicting locks.
> > Can you please elaborate a bit more?
>
> I'm concerned about two workers trying to take the same lock at the
> same time. If that's prevented by the buffer locking then I think it's
> OK, but if it's prevented by a heavyweight lock then it's not going to
> work in this case.
>

We do take buffer lock in exclusive mode before trying to acquire KEY
SHARE lock on the tuple, so the two workers shouldn't try to acquire
at the same time.  I think you are trying to see if in any case, two
workers try to acquire heavyweight lock like tuple lock or something
like that to perform the operation then it will create a problem
because due to group locking it will allow such an operation where it
should not have been.  But I don't think anything of that sort is
feasible in COPY operation and if it is then we probably need to
carefully block it or find some solution for it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
Robert Haas
Date:
On Mon, May 18, 2020 at 12:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> In the above case, even though we are executing a single command from
> the user perspective, but the currentCommandId will be four after the
> command.  One increment will be for the copy command and the other
> three increments are for locking tuple in PK table
> (tab_fk_referenced_chk) for three tuples in FK table
> (tab_fk_referencing_chk).  Now, for parallel workers, it is
> (theoretically) possible that the three tuples are processed by three
> different workers which don't get synced as of now.  The question was
> do we see any kind of problem with this and if so can we just sync it
> up at the end of parallelism.

I strongly disagree with the idea of "just sync(ing) it up at the end
of parallelism". That seems like a completely unprincipled approach to
the problem. Either the command counter increment is important or it's
not. If it's not important, maybe we can arrange to skip it in the
first place. If it is important, then it's probably not OK for each
backend to be doing it separately.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

From
Andres Freund
Date:
Hi,

On 2020-06-03 12:13:14 -0400, Robert Haas wrote:
> On Mon, May 18, 2020 at 12:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > In the above case, even though we are executing a single command from
> > the user perspective, but the currentCommandId will be four after the
> > command.  One increment will be for the copy command and the other
> > three increments are for locking tuple in PK table
> > (tab_fk_referenced_chk) for three tuples in FK table
> > (tab_fk_referencing_chk).  Now, for parallel workers, it is
> > (theoretically) possible that the three tuples are processed by three
> > different workers which don't get synced as of now.  The question was
> > do we see any kind of problem with this and if so can we just sync it
> > up at the end of parallelism.

> I strongly disagree with the idea of "just sync(ing) it up at the end
> of parallelism". That seems like a completely unprincipled approach to
> the problem. Either the command counter increment is important or it's
> not. If it's not important, maybe we can arrange to skip it in the
> first place. If it is important, then it's probably not OK for each
> backend to be doing it separately.

That scares me too. These command counter increments definitely aren't
unnecessary in the general case.

Even in the example you share above, aren't we potentially going to
actually lock rows multiple times from within the same transaction,
instead of once?  If the command counter increments from within
ri_trigger.c aren't visible to other parallel workers/leader, we'll not
correctly understand that a locked row is invisible to heap_lock_tuple,
because we're not using a new enough snapshot (by dint of not having a
new enough cid).

I've not dug through everything this could potentially cause, but it
seems pretty clearly a no-go from here.

Greetings,

Andres Freund



Re: Parallel copy

From
Andres Freund
Date:
Hi,

On 2020-06-03 15:53:24 +0530, vignesh C wrote:
> Workers / Exec time (seconds).  Test cases:
>   T1: copy from file,  2 indexes on integer columns, 1 index on text column
>   T2: copy from stdin, 2 indexes on integer columns, 1 index on text column
>   T3: copy from file,  1 gist index on text column
>   T4: copy from file,  3 indexes on integer columns
>   T5: copy from stdin, 3 indexes on integer columns
>
> Workers |        T1        |        T2        |       T3        |       T4        |       T5
> 0       | 1162.772 (1X)    | 1176.035 (1X)    | 827.669 (1X)    | 216.171 (1X)    | 217.376 (1X)
> 1       | 1110.288 (1.05X) | 1120.556 (1.05X) | 747.384 (1.11X) | 174.242 (1.24X) | 163.492 (1.33X)
> 2       | 635.249 (1.83X)  | 668.18 (1.76X)   | 435.673 (1.9X)  | 133.829 (1.61X) | 126.516 (1.72X)
> 4       | 336.835 (3.45X)  | 346.768 (3.39X)  | 236.406 (3.5X)  | 105.767 (2.04X) | 107.382 (2.02X)
> 8       | 188.577 (6.17X)  | 194.491 (6.04X)  | 148.962 (5.56X) | 100.708 (2.15X) | 107.72 (2.01X)
> 16      | 126.819 (9.17X)  | 146.402 (8.03X)  | 119.923 (6.9X)  | 97.996 (2.2X)   | 106.531 (2.04X)
> 20      | *117.845 (9.87X)*| 149.203 (7.88X)  | 138.741 (5.96X) | 97.94 (2.21X)   | 107.5 (2.02X)
> 30      | 127.554 (9.11X)  | 161.218 (7.29X)  | 172.443 (4.8X)  | 98.232 (2.2X)   | 108.778 (1.99X)

Hm. You don't explicitly mention that in your design, but given how
small the benefit of going from 0 to 1 workers is, I assume the leader
doesn't do any "chunk processing" on its own?


> Design of the Parallel Copy: The backend, to which the "COPY FROM" query is
> submitted acts as leader with the responsibility of reading data from the
> file/stdin, launching at most n number of workers as specified with
> PARALLEL 'n' option in the "COPY FROM" query. The leader populates the
> common data required for the workers execution in the DSM and shares it
> with the workers. The leader then executes before statement triggers if
> there exists any. Leader populates DSM chunks which includes the start
> offset and chunk size, while populating the chunks it reads as many blocks
> as required into the DSM data blocks from the file. Each block is of 64K
> size. The leader parses the data to identify a chunk, the existing logic
> from CopyReadLineText which identifies the chunks with some changes was
> used for this. Leader checks if a free chunk is available to copy the
> information, if there is no free chunk it waits till the required chunk is
> freed up by the worker and then copies the identified chunks information
> (offset & chunk size) into the DSM chunks. This process is repeated till
> the complete file is processed. Simultaneously, the workers cache the
> chunks(50) locally into the local memory and release the chunks to the
> leader for further populating. Each worker processes the chunk that it
> cached and inserts it into the table. The leader waits till all the chunks
> populated are processed by the workers and exits.

Why do we need the local copy of 50 chunks? Copying memory around is far
from free. I don't see why it'd be better to add per-process caching,
rather than making the DSM bigger? I can see some benefit in marking
multiple chunks as being processed with one lock acquisition, but I
don't think adding a memory copy is a good idea.


This patch *desperately* needs to be split up. It imo is close to
unreviewable, due to a large amount of changes that just move code
around without other functional changes being mixed in with the actual
new stuff.


>  /*
> + * State of the chunk.
> + */
> +typedef enum ChunkState
> +{
> +    CHUNK_INIT,                    /* initial state of chunk */
> +    CHUNK_LEADER_POPULATING,    /* leader processing chunk */
> +    CHUNK_LEADER_POPULATED,        /* leader completed populating chunk */
> +    CHUNK_WORKER_PROCESSING,    /* worker processing chunk */
> +    CHUNK_WORKER_PROCESSED        /* worker completed processing chunk */
> +}ChunkState;
> +
> +#define RAW_BUF_SIZE 65536        /* we palloc RAW_BUF_SIZE+1 bytes */
> +
> +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> +#define RINGSIZE (10 * 1000)
> +#define MAX_BLOCKS_COUNT 1000
> +#define WORKER_CHUNK_COUNT 50    /* should be mod of RINGSIZE */
> +
> +#define    IsParallelCopy()        (cstate->is_parallel)
> +#define IsLeader()                (cstate->pcdata->is_leader)
> +#define IsHeaderLine()            (cstate->header_line && cstate->cur_lineno == 1)
> +
> +/*
> + * Copy data block information.
> + */
> +typedef struct CopyDataBlock
> +{
> +    /* The number of unprocessed chunks in the current block. */
> +    pg_atomic_uint32 unprocessed_chunk_parts;
> +
> +    /*
> +     * If the current chunk data is continued into another block,
> +     * following_block will have the position where the remaining data need to
> +     * be read.
> +     */
> +    uint32    following_block;
> +
> +    /*
> +     * This flag will be set, when the leader finds out this block can be read
> +     * safely by the worker. This helps the worker to start processing the chunk
> +     * early where the chunk will be spread across many blocks and the worker
> +     * need not wait for the complete chunk to be processed.
> +     */
> +    bool   curr_blk_completed;
> +    char   data[DATA_BLOCK_SIZE + 1]; /* data read from file */
> +}CopyDataBlock;

What's the + 1 here about?


> +/*
> + * Parallel copy line buffer information.
> + */
> +typedef struct ParallelCopyLineBuf
> +{
> +    StringInfoData        line_buf;
> +    uint64                cur_lineno;    /* line number for error messages */
> +}ParallelCopyLineBuf;

Why do we need separate infrastructure for this? We shouldn't duplicate
infrastructure unnecessarily.



> +/*
> + * Common information that need to be copied to shared memory.
> + */
> +typedef struct CopyWorkerCommonData
> +{

Why is parallel specific stuff here suddenly not named ParallelCopy*
anymore? If you introduce a naming like that it imo should be used
consistently.

> +    /* low-level state data */
> +    CopyDest            copy_dest;        /* type of copy source/destination */
> +    int                 file_encoding;    /* file or remote side's character encoding */
> +    bool                need_transcoding;    /* file encoding diff from server? */
> +    bool                encoding_embeds_ascii;    /* ASCII can be non-first byte? */
> +
> +    /* parameters from the COPY command */
> +    bool                csv_mode;        /* Comma Separated Value format? */
> +    bool                header_line;    /* CSV header line? */
> +    int                 null_print_len; /* length of same */
> +    bool                force_quote_all;    /* FORCE_QUOTE *? */
> +    bool                convert_selectively;    /* do selective binary conversion? */
> +
> +    /* Working state for COPY FROM */
> +    AttrNumber          num_defaults;
> +    Oid                 relid;
> +}CopyWorkerCommonData;

But I actually think we shouldn't have this information in two different
structs. This should exist once, independent of using parallel /
non-parallel copy.


> +/* List information */
> +typedef struct ListInfo
> +{
> +    int    count;        /* count of attributes */
> +
> +    /* string info in the form info followed by info1, info2... infon  */
> +    char    info[1];
> +} ListInfo;

Based on these comments I have no idea what this could be for.


>  /*
> - * This keeps the character read at the top of the loop in the buffer
> - * even if there is more than one read-ahead.
> + * This keeps the character read at the top of the loop in the buffer
> + * even if there is more than one read-ahead.
> + */
> +#define IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(extralen) \
> +if (1) \
> +{ \
> +    if (copy_buff_state.raw_buf_ptr + (extralen) >= copy_buff_state.copy_buf_len && !hit_eof) \
> +    { \
> +        if (IsParallelCopy()) \
> +        { \
> +            copy_buff_state.chunk_size = prev_chunk_size; /* update previous chunk size */ \
> +            if (copy_buff_state.block_switched) \
> +            { \
> +                pg_atomic_sub_fetch_u32(&copy_buff_state.data_blk_ptr->unprocessed_chunk_parts, 1); \
> +                copy_buff_state.copy_buf_len = prev_copy_buf_len; \
> +            } \
> +        } \
> +        copy_buff_state.raw_buf_ptr = prev_raw_ptr; /* undo fetch */ \
> +        need_data = true; \
> +        continue; \
> +    } \
> +} else ((void) 0)

I think it's an absolutely clear no-go to add new branches to
these. They're *really* hot already, and this is going to sprinkle a
significant amount of new instructions over a lot of places.


> +/*
> + * SET_RAWBUF_FOR_LOAD - Set raw_buf to the shared memory where the file data must
> + * be read.
> + */
> +#define SET_RAWBUF_FOR_LOAD() \
> +{ \
> +    ShmCopyInfo    *pcshared_info = cstate->pcdata->pcshared_info; \
> +    uint32 cur_block_pos; \
> +    /* \
> +     * Mark the previous block as completed, worker can start copying this data. \
> +     */ \
> +    if (copy_buff_state.data_blk_ptr != copy_buff_state.curr_data_blk_ptr && \
> +        copy_buff_state.data_blk_ptr->curr_blk_completed == false) \
> +        copy_buff_state.data_blk_ptr->curr_blk_completed = true; \
> +    \
> +    copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> +    cur_block_pos = WaitGetFreeCopyBlock(pcshared_info); \
> +    copy_buff_state.curr_data_blk_ptr = &pcshared_info->data_blocks[cur_block_pos]; \
> +    \
> +    if (!copy_buff_state.data_blk_ptr) \
> +    { \
> +        copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> +        chunk_first_block = cur_block_pos; \
> +    } \
> +    else if (need_data == false) \
> +        copy_buff_state.data_blk_ptr->following_block = cur_block_pos; \
> +    \
> +    cstate->raw_buf = copy_buff_state.curr_data_blk_ptr->data; \
> +    copy_buff_state.copy_raw_buf = cstate->raw_buf; \
> +}
> +
> +/*
> + * END_CHUNK_PARALLEL_COPY - Update the chunk information in shared memory.
> + */
> +#define END_CHUNK_PARALLEL_COPY() \
> +{ \
> +    if (!IsHeaderLine()) \
> +    { \
> +        ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \
> +        ChunkBoundaries *chunkBoundaryPtr = &pcshared_info->chunk_boundaries; \
> +        if (copy_buff_state.chunk_size) \
> +        { \
> +            ChunkBoundary *chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> +            /* \
> +             * If raw_buf_ptr is zero, unprocessed_chunk_parts would have been \
> +             * incremented in SEEK_COPY_BUFF_POS. This will happen if the whole \
> +             * chunk finishes at the end of the current block. If the \
> +             * new_line_size > raw_buf_ptr, then the new block has only new line \
> +             * char content. The unprocessed count should not be increased in \
> +             * this case. \
> +             */ \
> +            if (copy_buff_state.raw_buf_ptr != 0 && \
> +                copy_buff_state.raw_buf_ptr > new_line_size) \
> +                pg_atomic_add_fetch_u32(&copy_buff_state.curr_data_blk_ptr->unprocessed_chunk_parts, 1); \
> +            \
> +            /* Update chunk size. */ \
> +            pg_atomic_write_u32(&chunkInfo->chunk_size, copy_buff_state.chunk_size); \
> +            pg_atomic_write_u32(&chunkInfo->chunk_state, CHUNK_LEADER_POPULATED); \
> +            elog(DEBUG1, "[Leader] After adding - chunk position:%d, chunk_size:%d", \
> +                        chunk_pos, copy_buff_state.chunk_size); \
> +            pcshared_info->populated++; \
> +        } \
> +        else if (new_line_size) \
> +        { \
> +            /* \
> +             * This means only new line char, empty record should be \
> +             * inserted. \
> +             */ \
> +            ChunkBoundary *chunkInfo; \
> +            chunk_pos = UpdateBlockInChunkInfo(cstate, -1, -1, 0, \
> +                                               CHUNK_LEADER_POPULATED); \
> +            chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> +            elog(DEBUG1, "[Leader] Added empty chunk with offset:%d, chunk position:%d, chunk size:%d", \
> +                         chunkInfo->start_offset, chunk_pos, \
> +                         pg_atomic_read_u32(&chunkInfo->chunk_size)); \
> +            pcshared_info->populated++; \
> +        } \
> +    }\
> +    \
> +    /*\
> +     * All of the read data is processed, reset index & len. In the\
> +     * subsequent read, we will get a new block and copy data in to the\
> +     * new block.\
> +     */\
> +    if (copy_buff_state.raw_buf_ptr == copy_buff_state.copy_buf_len)\
> +    {\
> +        cstate->raw_buf_index = 0;\
> +        cstate->raw_buf_len = 0;\
> +    }\
> +    else\
> +        cstate->raw_buf_len = copy_buff_state.copy_buf_len;\
> +}

Why are these macros? They are way way way above a length where that
makes any sort of sense.


Greetings,

Andres Freund



Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Jun 4, 2020 at 12:09 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-06-03 12:13:14 -0400, Robert Haas wrote:
> > On Mon, May 18, 2020 at 12:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > In the above case, even though we are executing a single command from
> > > the user perspective, but the currentCommandId will be four after the
> > > command.  One increment will be for the copy command and the other
> > > three increments are for locking tuple in PK table
> > > (tab_fk_referenced_chk) for three tuples in FK table
> > > (tab_fk_referencing_chk).  Now, for parallel workers, it is
> > > (theoretically) possible that the three tuples are processed by three
> > > different workers which don't get synced as of now.  The question was
> > > do we see any kind of problem with this and if so can we just sync it
> > > up at the end of parallelism.
>
> > I strongly disagree with the idea of "just sync(ing) it up at the end
> > of parallelism". That seems like a completely unprincipled approach to
> > the problem. Either the command counter increment is important or it's
> > not. If it's not important, maybe we can arrange to skip it in the
> > first place. If it is important, then it's probably not OK for each
> > backend to be doing it separately.
>
> That scares me too. These command counter increments definitely aren't
> unnecessary in the general case.
>

Yeah, this is what we want to understand.  Can you explain how they
are useful here?  AFAIU, heap_lock_tuple doesn't use the commandid while
storing the transaction information of the xact that locks the tuple.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Andres Freund
Date:
Hi,

On 2020-06-04 08:10:07 +0530, Amit Kapila wrote:
> On Thu, Jun 4, 2020 at 12:09 AM Andres Freund <andres@anarazel.de> wrote:
> > > I strongly disagree with the idea of "just sync(ing) it up at the end
> > > of parallelism". That seems like a completely unprincipled approach to
> > > the problem. Either the command counter increment is important or it's
> > > not. If it's not important, maybe we can arrange to skip it in the
> > > first place. If it is important, then it's probably not OK for each
> > > backend to be doing it separately.
> >
> > That scares me too. These command counter increments definitely aren't
> > unnecessary in the general case.
> >
> 
> Yeah, this is what we want to understand?  Can you explain how they
> are useful here?  AFAIU, heap_lock_tuple doesn't use commandid while
> storing the transaction information of xact while locking the tuple.

But the HeapTupleSatisfiesUpdate() call does use it?

And even if that weren't an issue, I don't see how it's defensible to
just randomly break the commandid coherency for parallel copy.

Greetings,

Andres Freund



Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Jun 4, 2020 at 9:10 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-06-04 08:10:07 +0530, Amit Kapila wrote:
> > On Thu, Jun 4, 2020 at 12:09 AM Andres Freund <andres@anarazel.de> wrote:
> > > > I strongly disagree with the idea of "just sync(ing) it up at the end
> > > > of parallelism". That seems like a completely unprincipled approach to
> > > > the problem. Either the command counter increment is important or it's
> > > > not. If it's not important, maybe we can arrange to skip it in the
> > > > first place. If it is important, then it's probably not OK for each
> > > > backend to be doing it separately.
> > >
> > > That scares me too. These command counter increments definitely aren't
> > > unnecessary in the general case.
> > >
> >
> > Yeah, this is what we want to understand?  Can you explain how they
> > are useful here?  AFAIU, heap_lock_tuple doesn't use commandid while
> > storing the transaction information of xact while locking the tuple.
>
> But the HeapTupleSatisfiesUpdate() call does use it?
>

It won't use 'cid' for the lockers or multi-lockers case (AFAICS, there
is special-case handling for lockers/multi-lockers).  I think it is
used for updates/deletes.

> And even if that weren't an issue, I don't see how it's defensible to
> just randomly break the commandid coherency for parallel copy.
>

At this stage, we are evaluating whether there is any need to
increment the command counter for foreign key checks, or whether it is
just happening because we are using some common code to execute the
"Select ... For Key Share" statement during these checks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
On Thu, Jun 4, 2020 at 12:44 AM Andres Freund <andres@anarazel.de> wrote
>
>
> Hm. You don't explicitly mention that in your design, but given how
> small the benefit of going from 0 to 1 workers is, I assume the leader
> doesn't do any "chunk processing" on its own?
>

Yes, you are right, the leader does not do any processing; the leader's
work is mainly to populate the shared memory with the offset
information for each record.
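
For reference, here is a minimal sketch of what that leader-side
publishing amounts to (the names, sizes and the spin-wait below are
invented for illustration and simplified from the patch quoted further
down):

#include "postgres.h"
#include "port/atomics.h"

#define SKETCH_RING_SIZE 10000          /* RINGSIZE in the patch */

typedef enum SketchChunkState
{
    SKETCH_CHUNK_INIT,                  /* slot unused / free */
    SKETCH_CHUNK_LEADER_POPULATED,      /* boundaries published by the leader */
    SKETCH_CHUNK_WORKER_PROCESSING,     /* a worker has claimed the slot */
    SKETCH_CHUNK_WORKER_PROCESSED       /* worker finished; slot reusable */
} SketchChunkState;

/* per-line entry the leader publishes and a worker consumes */
typedef struct SketchChunkEntry
{
    uint32           first_block;       /* DSM data block the line starts in */
    uint32           start_offset;      /* offset of the line in that block */
    pg_atomic_uint32 size;              /* line length in bytes */
    pg_atomic_uint32 state;             /* SketchChunkState */
} SketchChunkEntry;

typedef struct SketchRing
{
    SketchChunkEntry entries[SKETCH_RING_SIZE];
    uint32           next;              /* leader-local fill cursor */
} SketchRing;

/* Leader: wait until the next slot is free, then publish one line's bounds. */
static void
sketch_leader_publish(SketchRing *ring, uint32 block, uint32 offset, uint32 len)
{
    SketchChunkEntry *e = &ring->entries[ring->next % SKETCH_RING_SIZE];

    while (pg_atomic_read_u32(&e->state) != SKETCH_CHUNK_INIT &&
           pg_atomic_read_u32(&e->state) != SKETCH_CHUNK_WORKER_PROCESSED)
        ;                               /* real code would wait on a latch */

    e->first_block = block;
    e->start_offset = offset;
    pg_atomic_write_u32(&e->size, len);
    pg_atomic_write_u32(&e->state, SKETCH_CHUNK_LEADER_POPULATED);
    ring->next++;
}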

>
>
> > Design of the Parallel Copy: The backend, to which the "COPY FROM" query is
> > submitted acts as leader with the responsibility of reading data from the
> > file/stdin, launching at most n number of workers as specified with
> > PARALLEL 'n' option in the "COPY FROM" query. The leader populates the
> > common data required for the workers execution in the DSM and shares it
> > with the workers. The leader then executes before statement triggers if
> > there exists any. Leader populates DSM chunks which includes the start
> > offset and chunk size, while populating the chunks it reads as many blocks
> > as required into the DSM data blocks from the file. Each block is of 64K
> > size. The leader parses the data to identify a chunk, the existing logic
> > from CopyReadLineText which identifies the chunks with some changes was
> > used for this. Leader checks if a free chunk is available to copy the
> > information, if there is no free chunk it waits till the required chunk is
> > freed up by the worker and then copies the identified chunks information
> > (offset & chunk size) into the DSM chunks. This process is repeated till
> > the complete file is processed. Simultaneously, the workers cache the
> > chunks(50) locally into the local memory and release the chunks to the
> > leader for further populating. Each worker processes the chunk that it
> > cached and inserts it into the table. The leader waits till all the chunks
> > populated are processed by the workers and exits.
>
> Why do we need the local copy of 50 chunks? Copying memory around is far
> from free. I don't see why it'd be better to add per-process caching,
> rather than making the DSM bigger? I can see some benefit in marking
> multiple chunks as being processed with one lock acquisition, but I
> don't think adding a memory copy is a good idea.

We ran a performance test with a csv data file (5.1GB, 10 million tuples,
2 indexes on integer columns); the results are given below. We
noticed that in some cases the performance is better if we copy the 50
records locally and release the shared memory. We will get better
benefits as the number of workers increases. Thoughts?
-------------------------------------------------------------------------------
Workers | Exec time (with local copying of   | Exec time (without copying,
        | 50 records & releasing the shared  | processing record by record)
        | memory)                            |
-------------------------------------------------------------------------------
0       | 1162.772 (1X)                      | 1152.684 (1X)
2       | 635.249 (1.83X)                    | 647.894 (1.78X)
4       | 336.835 (3.45X)                    | 335.534 (3.43X)
8       | 188.577 (6.17X)                    | 189.461 (6.08X)
16      | 126.819 (9.17X)                    | 142.730 (8.07X)
20      | 117.845 (9.87X)                    | 146.533 (7.87X)
30      | 127.554 (9.11X)                    | 160.307 (7.19X)

> This patch *desperately* needs to be split up. It imo is close to
> unreviewable, due to a large amount of changes that just move code
> around without other functional changes being mixed in with the actual
> new stuff.

I have split the patch, the new split patches are attached.

>
>
>
> >  /*
> > + * State of the chunk.
> > + */
> > +typedef enum ChunkState
> > +{
> > +     CHUNK_INIT,                                     /* initial state of chunk */
> > +     CHUNK_LEADER_POPULATING,        /* leader processing chunk */
> > +     CHUNK_LEADER_POPULATED,         /* leader completed populating chunk */
> > +     CHUNK_WORKER_PROCESSING,        /* worker processing chunk */
> > +     CHUNK_WORKER_PROCESSED          /* worker completed processing chunk */
> > +}ChunkState;
> > +
> > +#define RAW_BUF_SIZE 65536           /* we palloc RAW_BUF_SIZE+1 bytes */
> > +
> > +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> > +#define RINGSIZE (10 * 1000)
> > +#define MAX_BLOCKS_COUNT 1000
> > +#define WORKER_CHUNK_COUNT 50        /* should be mod of RINGSIZE */
> > +
> > +#define      IsParallelCopy()                (cstate->is_parallel)
> > +#define IsLeader()                           (cstate->pcdata->is_leader)
> > +#define IsHeaderLine()                       (cstate->header_line && cstate->cur_lineno == 1)
> > +
> > +/*
> > + * Copy data block information.
> > + */
> > +typedef struct CopyDataBlock
> > +{
> > +     /* The number of unprocessed chunks in the current block. */
> > +     pg_atomic_uint32 unprocessed_chunk_parts;
> > +
> > +     /*
> > +      * If the current chunk data is continued into another block,
> > +      * following_block will have the position where the remaining data need to
> > +      * be read.
> > +      */
> > +     uint32  following_block;
> > +
> > +     /*
> > +      * This flag will be set, when the leader finds out this block can be read
> > +      * safely by the worker. This helps the worker to start processing the chunk
> > +      * early where the chunk will be spread across many blocks and the worker
> > +      * need not wait for the complete chunk to be processed.
> > +      */
> > +     bool   curr_blk_completed;
> > +     char   data[DATA_BLOCK_SIZE + 1]; /* data read from file */
> > +}CopyDataBlock;
>
> What's the + 1 here about?

Fixed this, removed +1. That is not needed.

>
>
> > +/*
> > + * Parallel copy line buffer information.
> > + */
> > +typedef struct ParallelCopyLineBuf
> > +{
> > +     StringInfoData          line_buf;
> > +     uint64                          cur_lineno;     /* line number for error messages */
> > +}ParallelCopyLineBuf;
>
> Why do we need separate infrastructure for this? We shouldn't duplicate
> infrastructure unnecessarily.
>

This was required for copying multiple records locally and
releasing the shared memory. I have not changed this; I will decide on
this based on the decision taken for one of the previous comments.

>
>
>
> > +/*
> > + * Common information that need to be copied to shared memory.
> > + */
> > +typedef struct CopyWorkerCommonData
> > +{
>
> Why is parallel specific stuff here suddenly not named ParallelCopy*
> anymore? If you introduce a naming like that it imo should be used
> consistently.

Fixed, changed to maintain ParallelCopy in all structs.

>
> > +     /* low-level state data */
> > +     CopyDest            copy_dest;          /* type of copy source/destination */
> > +     int                 file_encoding;      /* file or remote side's character encoding */
> > +     bool                need_transcoding;   /* file encoding diff from server? */
> > +     bool                encoding_embeds_ascii;      /* ASCII can be non-first byte? */
> > +
> > +     /* parameters from the COPY command */
> > +     bool                csv_mode;           /* Comma Separated Value format? */
> > +     bool                header_line;        /* CSV header line? */
> > +     int                 null_print_len; /* length of same */
> > +     bool                force_quote_all;    /* FORCE_QUOTE *? */
> > +     bool                convert_selectively;        /* do selective binary conversion? */
> > +
> > +     /* Working state for COPY FROM */
> > +     AttrNumber          num_defaults;
> > +     Oid                 relid;
> > +}CopyWorkerCommonData;
>
> But I actually think we shouldn't have this information in two different
> structs. This should exist once, independent of using parallel /
> non-parallel copy.
>

This structure helps in storing the common data from CopyStateData
that is required by the workers. This information will then be
allocated and stored in the DSM for the worker to retrieve and copy
it into CopyStateData.

>
> > +/* List information */
> > +typedef struct ListInfo
> > +{
> > +     int     count;          /* count of attributes */
> > +
> > +     /* string info in the form info followed by info1, info2... infon  */
> > +     char    info[1];
> > +} ListInfo;
>
> Based on these comments I have no idea what this could be for.
>

Have added better comments for this. The following is added: This
structure will help in converting a List data type into the below
structure format with the count having the number of elements in the
list and the info having the List elements appended contiguously. This
converted structure will be allocated in shared memory and stored in
DSM for the worker to retrieve and later convert it back to List data
type.
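
For illustration, a rough sketch of that scheme for a List of integers
(the helper names are invented; only the count-plus-contiguous-elements
layout comes from the patch) could look like this:

#include "postgres.h"
#include "nodes/pg_list.h"

/* same layout as the patch's ListInfo: count, then contiguous elements */
typedef struct SketchListInfo
{
    int     count;
    char    info[FLEXIBLE_ARRAY_MEMBER];
} SketchListInfo;

/* Flatten a List of ints into a chunk that can be copied into the DSM. */
static SketchListInfo *
sketch_serialize_int_list(List *lst)
{
    SketchListInfo *li;
    int            *dst;
    ListCell       *lc;

    li = (SketchListInfo *) palloc(offsetof(SketchListInfo, info) +
                                   list_length(lst) * sizeof(int));
    li->count = list_length(lst);
    dst = (int *) li->info;
    foreach(lc, lst)
        *dst++ = lfirst_int(lc);
    return li;
}

/* Worker side: rebuild the List from the flattened representation. */
static List *
sketch_restore_int_list(const SketchListInfo *li)
{
    List       *result = NIL;
    const int  *src = (const int *) li->info;

    for (int i = 0; i < li->count; i++)
        result = lappend_int(result, src[i]);
    return result;
}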

>
> >  /*
> > - * This keeps the character read at the top of the loop in the buffer
> > - * even if there is more than one read-ahead.
> > + * This keeps the character read at the top of the loop in the buffer
> > + * even if there is more than one read-ahead.
> > + */
> > +#define IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(extralen) \
> > +if (1) \
> > +{ \
> > +     if (copy_buff_state.raw_buf_ptr + (extralen) >= copy_buff_state.copy_buf_len && !hit_eof) \
> > +     { \
> > +             if (IsParallelCopy()) \
> > +             { \
> > +                     copy_buff_state.chunk_size = prev_chunk_size; /* update previous chunk size */ \
> > +                     if (copy_buff_state.block_switched) \
> > +                     { \
> > +                             pg_atomic_sub_fetch_u32(&copy_buff_state.data_blk_ptr->unprocessed_chunk_parts, 1); \
> > +                             copy_buff_state.copy_buf_len = prev_copy_buf_len; \
> > +                     } \
> > +             } \
> > +             copy_buff_state.raw_buf_ptr = prev_raw_ptr; /* undo fetch */ \
> > +             need_data = true; \
> > +             continue; \
> > +     } \
> > +} else ((void) 0)
>
> I think it's an absolutely clear no-go to add new branches to
> these. They're *really* hot already, and this is going to sprinkle a
> significant amount of new instructions over a lot of places.
>

Fixed, removed this.

>
>
> > +/*
> > + * SET_RAWBUF_FOR_LOAD - Set raw_buf to the shared memory where the file data must
> > + * be read.
> > + */
> > +#define SET_RAWBUF_FOR_LOAD() \
> > +{ \
> > +     ShmCopyInfo     *pcshared_info = cstate->pcdata->pcshared_info; \
> > +     uint32 cur_block_pos; \
> > +     /* \
> > +      * Mark the previous block as completed, worker can start copying this data. \
> > +      */ \
> > +     if (copy_buff_state.data_blk_ptr != copy_buff_state.curr_data_blk_ptr && \
> > +             copy_buff_state.data_blk_ptr->curr_blk_completed == false) \
> > +             copy_buff_state.data_blk_ptr->curr_blk_completed = true; \
> > +     \
> > +     copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > +     cur_block_pos = WaitGetFreeCopyBlock(pcshared_info); \
> > +     copy_buff_state.curr_data_blk_ptr = &pcshared_info->data_blocks[cur_block_pos]; \
> > +     \
> > +     if (!copy_buff_state.data_blk_ptr) \
> > +     { \
> > +             copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > +             chunk_first_block = cur_block_pos; \
> > +     } \
> > +     else if (need_data == false) \
> > +             copy_buff_state.data_blk_ptr->following_block = cur_block_pos; \
> > +     \
> > +     cstate->raw_buf = copy_buff_state.curr_data_blk_ptr->data; \
> > +     copy_buff_state.copy_raw_buf = cstate->raw_buf; \
> > +}
> > +
> > +/*
> > + * END_CHUNK_PARALLEL_COPY - Update the chunk information in shared memory.
> > + */
> > +#define END_CHUNK_PARALLEL_COPY() \
> > +{ \
> > +     if (!IsHeaderLine()) \
> > +     { \
> > +             ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \
> > +             ChunkBoundaries *chunkBoundaryPtr = &pcshared_info->chunk_boundaries; \
> > +             if (copy_buff_state.chunk_size) \
> > +             { \
> > +                     ChunkBoundary *chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > +                     /* \
> > +                      * If raw_buf_ptr is zero, unprocessed_chunk_parts would have been \
> > +                      * incremented in SEEK_COPY_BUFF_POS. This will happen if the whole \
> > +                      * chunk finishes at the end of the current block. If the \
> > +                      * new_line_size > raw_buf_ptr, then the new block has only new line \
> > +                      * char content. The unprocessed count should not be increased in \
> > +                      * this case. \
> > +                      */ \
> > +                     if (copy_buff_state.raw_buf_ptr != 0 && \
> > +                             copy_buff_state.raw_buf_ptr > new_line_size) \
> > +                             pg_atomic_add_fetch_u32(&copy_buff_state.curr_data_blk_ptr->unprocessed_chunk_parts, 1); \
> > +                     \
> > +                     /* Update chunk size. */ \
> > +                     pg_atomic_write_u32(&chunkInfo->chunk_size, copy_buff_state.chunk_size); \
> > +                     pg_atomic_write_u32(&chunkInfo->chunk_state, CHUNK_LEADER_POPULATED); \
> > +                     elog(DEBUG1, "[Leader] After adding - chunk position:%d, chunk_size:%d", \
> > +                                             chunk_pos, copy_buff_state.chunk_size); \
> > +                     pcshared_info->populated++; \
> > +             } \
> > +             else if (new_line_size) \
> > +             { \
> > +                     /* \
> > +                      * This means only new line char, empty record should be \
> > +                      * inserted. \
> > +                      */ \
> > +                     ChunkBoundary *chunkInfo; \
> > +                     chunk_pos = UpdateBlockInChunkInfo(cstate, -1, -1, 0, \
> > +                                                                                        CHUNK_LEADER_POPULATED); \
> > +                     chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > +                     elog(DEBUG1, "[Leader] Added empty chunk with offset:%d, chunk position:%d, chunk size:%d", \
> > +                                              chunkInfo->start_offset, chunk_pos, \
> > +                                              pg_atomic_read_u32(&chunkInfo->chunk_size)); \
> > +                     pcshared_info->populated++; \
> > +             } \
> > +     }\
> > +     \
> > +     /*\
> > +      * All of the read data is processed, reset index & len. In the\
> > +      * subsequent read, we will get a new block and copy data in to the\
> > +      * new block.\
> > +      */\
> > +     if (copy_buff_state.raw_buf_ptr == copy_buff_state.copy_buf_len)\
> > +     {\
> > +             cstate->raw_buf_index = 0;\
> > +             cstate->raw_buf_len = 0;\
> > +     }\
> > +     else\
> > +             cstate->raw_buf_len = copy_buff_state.copy_buf_len;\
> > +}
>
> Why are these macros? They are way way way above a length where that
> makes any sort of sense.
>

Converted these macros to functions.


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Ashutosh Sharma
Date:
Hi All,

I've spent a little bit of time going through the project discussion that has happened in this email thread, and to start with I have a few questions which I would like to put here:

Q1) Are we also planning to read the input data in parallel or is it only about performing the multi-insert operation in parallel? AFAIU, the data reading part will be done by the leader process alone so no parallelism is involved there.

Q2) How are we going to deal with partitioned tables? I mean, will there be some worker process dedicated to each partition, or how is it? Further, the challenge that I see in the case of partitioned tables is that we would have a single input file containing data to be inserted into multiple tables (aka partitions), unlike the normal case where all the tuples in the input file would belong to the same table.

Q3) In the case of toast tables, there is a possibility of having a single tuple in the input file which could be of a very big size (probably in GB), eventually resulting in a bigger file size. So, in this case, how are we going to decide the number of worker processes to be launched? I mean, although the file size is big, the number of tuples to be processed is just one or a few, so can we decide the number of worker processes to be launched based on the file size?

Q4) Who is going to process constraints (preferably the deferred constraints) that are supposed to be executed at COMMIT time? I mean, is it the leader process or the worker process, or will we not be choosing parallelism at all in such cases?

Q5) Do we have any risk of table bloating when the data is loaded in parallel? I am just asking this because in case of parallelism there would be multiple processes performing bulk inserts into a table. There is a chance that the table file might get extended even if there is some free space in the file being written into, but that space is locked by some other worker process, and hence that might result in the creation of a new block for that table. Sorry if I am missing something here.

Please note that I haven't gone through all the emails in this thread, so there is a possibility that I might have repeated a question that has already been raised and answered here. If that is the case, I am sorry for that, but it would be very helpful if someone could point out that thread so that I can go through it. Thank you.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


Re: Parallel copy

From
Amit Kapila
Date:
On Fri, Jun 12, 2020 at 4:57 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi All,
>
> I've spent a little bit of time going through the project discussion that has happened in this email thread, and to
> start with I have a few questions which I would like to put here:
>
> Q1) Are we also planning to read the input data in parallel or is it only about performing the multi-insert operation
> in parallel? AFAIU, the data reading part will be done by the leader process alone so no parallelism is involved there.
>

Yes, your understanding is correct.

> Q2) How are we going to deal with the partitioned tables?
>

I haven't studied the patch but my understanding is that we will
support parallel copy for partitioned tables with a few restrictions
as explained in my earlier email [1].  See, Case-2 (b) in the email.

> I mean will there be some worker process dedicated for each partition or how is it?

No, the split is just based on the input; otherwise, each worker
should insert just as we would have done without any workers.

> Q3) In the case of toast tables, there is a possibility of having a single tuple in the input file which could be of a
> very big size (probably in GB), eventually resulting in a bigger file size. So, in this case, how are we going to decide
> the number of worker processes to be launched? I mean, although the file size is big, the number of tuples to be
> processed is just one or a few, so can we decide the number of worker processes to be launched based on the
> file size?
>

Yeah, such situations would be tricky, so we should have an option for
the user to specify the number of workers.

> Q4) Who is going to process constraints (preferably the deferred constraints) that are supposed to be executed at
> COMMIT time? I mean, is it the leader process or the worker process, or will we not be choosing parallelism
> at all in such cases?
>

In the first version, we won't do parallelism for this.  Again, see
one of my earlier emails [1] where I have explained this and other
cases where we won't be supporting parallel copy.

> Q5) Do we have any risk of table bloating when the data is loaded in parallel? I am just asking this because in case
> of parallelism there would be multiple processes performing bulk inserts into a table. There is a chance that the table
> file might get extended even if there is some free space in the file being written into, but that space is locked by some
> other worker process, and hence that might result in the creation of a new block for that table. Sorry if I am missing
> something here.
>

Hmm, each worker will operate at the page level; after the first insertion,
the same worker will try to insert into the same page into which it has
inserted last, so there shouldn't be such a problem.

> Please note that I haven't gone through all the emails in this thread, so there is a possibility that I might have
> repeated a question that has already been raised and answered here. If that is the case, I am sorry for that, but it
> would be very helpful if someone could point out that thread so that I can go through it. Thank you.
>

No problem, I understand sometimes it is difficult to go through each
and every email especially when the discussion is long.  Anyway,
thanks for showing the interest in the patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Bharath Rupireddy
Date:
Hi,

Attached is the patch supporting parallel copy for binary format files.

The performance improvement achieved with different numbers of workers is shown below. The dataset used has 10 million tuples and is 5.3GB in size.

Test cases (exec time in sec):
  T1: copy from binary file, 2 indexes on integer columns and 1 index on text column
  T2: copy from binary file, 1 gist index on text column
  T3: copy from binary file, 3 indexes on integer columns

Parallel workers |        T1        |        T2       |       T3
0                | 1106.899 (1X)    | 772.758 (1X)    | 171.338 (1X)
1                | 1094.165 (1.01X) | 757.365 (1.02X) | 163.018 (1.05X)
2                | 618.397 (1.79X)  | 428.304 (1.8X)  | 117.508 (1.46X)
4                | 320.511 (3.45X)  | 231.938 (3.33X) | 80.297 (2.13X)
8                | 172.462 (6.42X)  | 150.212 (5.14X) | 71.518 (2.39X)
16               | 110.460 (10.02X) | 124.929 (6.18X) | 91.308 (1.88X)
20               | 98.470 (11.24X)  | 137.313 (5.63X) | 95.289 (1.79X)
30               | 109.229 (10.13X) | 173.54 (4.45X)  | 95.799 (1.78X)

Design followed for developing this patch:

The leader reads data from the file into the DSM data blocks, each of 64K size. It also identifies each tuple's data block id, start offset, end offset and tuple size, and updates this information in the ring data structure. Workers read the tuple information from the ring data structure and the actual tuple data from the data blocks in parallel, and insert the tuples into the table in parallel.
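
As a rough illustration only (the names below are invented, and it assumes
a spanning tuple continues in the next consecutive block, which the actual
patch tracks explicitly), the per-tuple bookkeeping and the worker-side
read amount to:

#include "postgres.h"

#include <string.h>

#define SKETCH_DATA_BLOCK_SIZE 65536    /* 64K DSM data blocks, as described */

/* ring entry the leader fills in for every tuple it identifies */
typedef struct SketchBinaryTupleInfo
{
    uint32  first_block;    /* DSM data block the tuple starts in */
    uint32  start_offset;   /* offset of the tuple's first byte in that block */
    uint32  end_offset;     /* offset just past its last byte, in the last block */
    uint32  size;           /* total tuple size; may span several blocks */
} SketchBinaryTupleInfo;

/*
 * Worker side: copy the tuple bytes out of the shared data blocks into a
 * local buffer, following consecutive blocks when the tuple spans more
 * than one of them.
 */
static void
sketch_read_tuple(const char (*blocks)[SKETCH_DATA_BLOCK_SIZE],
                  const SketchBinaryTupleInfo *t, char *dest)
{
    uint32  block = t->first_block;
    uint32  offset = t->start_offset;
    uint32  remaining = t->size;

    while (remaining > 0)
    {
        uint32  avail = SKETCH_DATA_BLOCK_SIZE - offset;
        uint32  n = (remaining < avail) ? remaining : avail;

        memcpy(dest, blocks[block] + offset, n);
        dest += n;
        remaining -= n;
        block++;
        offset = 0;
    }
}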

Please note that this patch can be applied on top of the series of patches that were posted previously[1] for parallel copy for csv/text files.
The correct order to apply all the patches is -
0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
and
0005-Parallel-Copy-For-Binary-Format-Files.patch

The above tests were run with the attached configuration (config.txt), which is the same as the one used for the performance tests of csv/text files posted earlier in this mail thread.

I request the community to take this patch up for review along with the parallel copy patches for csv/text files and provide feedback.


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Parallel copy

From
Ashutosh Sharma
Date:
Thanks Amit for the clarifications. Regarding partitioned tables, one of the questions was: if we are loading data into a partitioned table using the COPY command, then the input file would contain tuples for different tables (partitions), unlike the normal table case where all the tuples in the input file would belong to the same table. So, in such a case, how are we going to accumulate tuples into the DSM? I mean, will the leader process check which tuple needs to be routed to which partition and accordingly accumulate them into the DSM? For example, let's say in the input data file we have 10 tuples where the 1st tuple belongs to partition1, the 2nd belongs to partition2, and likewise. So, in such cases, will the leader process accumulate all the tuples belonging to partition1 into one DSM and tuples belonging to partition2 into some other DSM and assign them to the worker processes, or have we taken some other approach to handle this scenario?

Further, I haven't got much time to look into the links that you have shared in your previous response. I will have a look at those and will also slowly start looking into the patches as and when I get some time. Thank you.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Sat, Jun 13, 2020 at 9:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jun 12, 2020 at 4:57 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi All,
>
> I've spent little bit of time going through the project discussion that has happened in this email thread and to start with I have few questions which I would like to put here:
>
> Q1) Are we also planning to read the input data in parallel or is it only about performing the multi-insert operation in parallel? AFAIU, the data reading part will be done by the leader process alone so no parallelism is involved there.
>

Yes, your understanding is correct.

> Q2) How are we going to deal with the partitioned tables?
>

I haven't studied the patch but my understanding is that we will
support parallel copy for partitioned tables with a few restrictions
as explained in my earlier email [1].  See, Case-2 (b) in the email.

> I mean will there be some worker process dedicated for each partition or how is it?

No, the split is just based on the input; otherwise, each worker inserts
just as we would have done without any workers.

> Q3) Incase of toast tables, there is a possibility of having a single tuple in the input file which could be of a very big size (probably in GB) eventually resulting in a bigger file size. So, in this case, how are we going to decide the number of worker processes to be launched. I mean, although the file size is big, but the number of tuples to be processed is just one or few of them, so, can we decide the number of the worker processes to be launched based on the file size?
>

Yeah, such situations would be tricky, so we should have an option for
the user to specify the number of workers.

> Q4) Who is going to process constraints (preferably the deferred constraint) that is supposed to be executed at the COMMIT time? I mean is it the leader process or the worker process or in such cases we won't be choosing the parallelism at all?
>

In the first version, we won't do parallelism for this.  Again, see
one of my earlier emails [1] where I have explained this and other
cases where we won't be supporting parallel copy.

> Q5) Do we have any risk of table bloating when the data is loaded in parallel. I am just asking this because incase of parallelism there would be multiple processes performing bulk insert into a table. There is a chance that the table file might get extended even if there is some space into the file being written into, but that space is locked by some other worker process and hence that might result in a creation of a new block for that table. Sorry, if I am missing something here.
>

Hmm, each worker will operate at the page level; after the first insertion,
the same worker will try to insert into the same page into which it last
inserted, so there shouldn't be such a problem.

> Please note that I haven't gone through all the emails in this thread so there is a possibility that I might have repeated the question that has already been raised and answered here. If that is the case, I am sorry for that, but it would be very helpful if someone could point out that thread so that I can go through it. Thank you.
>

No problem, I understand it is sometimes difficult to go through each
and every email, especially when the discussion is long.  Anyway,
thanks for showing interest in the patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
Amit Kapila
Date:
On Mon, Jun 15, 2020 at 7:41 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Thanks Amit for the clarifications. Regarding partitioned table, one of the question was - if we are loading data
into a partitioned table using COPY command, then the input file would contain tuples for different tables (partitions)
unlike the normal table case where all the tuples in the input file would belong to the same table. So, in such a case,
how are we going to accumulate tuples into the DSM? I mean will the leader process check which tuple needs to be routed
to which partition and accordingly accumulate them into the DSM. For e.g. let's say in the input data file we have 10
tuples where the 1st tuple belongs to partition1, 2nd belongs to partition2 and likewise. So, in such cases, will the
leader process accumulate all the tuples belonging to partition1 into one DSM and tuples belonging to partition2 into
some other DSM and assign them to the worker process or we have taken some other approach to handle this scenario?
>

No, all the tuples (for all partitions) will be accumulated in a
single DSM and the workers/leader will route the tuple to an
appropriate partition.

> Further, I haven't got much time to look into the links that you have shared in your previous response. Will have a
look into those and will also slowly start looking into the patches as and when I get some time. Thank you.
>

Yeah, it will be good if you go through all the emails once, because
most of the decisions (and the design) in the patch are supposed to be
based on the discussion in this thread.

Note - Please don't top post, try to give inline replies.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
Hi,

I have included tests for the parallel copy feature, and a few bugs that
were identified during testing have been fixed. Attached are the patches
for the same.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

On Tue, Jun 16, 2020 at 3:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 7:41 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Thanks Amit for the clarifications. Regarding partitioned table, one of the question was - if we are loading data
into a partitioned table using COPY command, then the input file would contain tuples for different tables (partitions)
unlike the normal table case where all the tuples in the input file would belong to the same table. So, in such a case,
how are we going to accumulate tuples into the DSM? I mean will the leader process check which tuple needs to be routed
to which partition and accordingly accumulate them into the DSM. For e.g. let's say in the input data file we have 10
tuples where the 1st tuple belongs to partition1, 2nd belongs to partition2 and likewise. So, in such cases, will the
leader process accumulate all the tuples belonging to partition1 into one DSM and tuples belonging to partition2 into
some other DSM and assign them to the worker process or we have taken some other approach to handle this scenario?
> >
>
> No, all the tuples (for all partitions) will be accumulated in a
> single DSM and the workers/leader will route the tuple to an
> appropriate partition.
>
> > Further, I haven't got much time to look into the links that you have shared in your previous response. Will have a
look into those and will also slowly start looking into the patches as and when I get some time. Thank you.
> >
>
> Yeah, it will be good if you go through all the emails once because
> most of the decisions (and design) in the patch is supposed to be
> based on the discussion in this thread.
>
> Note - Please don't top post, try to give inline replies.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
vignesh C
Date:
On Mon, Jun 15, 2020 at 4:39 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> The above tests were run with the configuration attached config.txt, which is the same used for performance tests of
csv/text files posted earlier in this mail chain.
 
>
> Request the community to take this patch up for review along with the parallel copy for csv/text file patches and
provide feedback.
 
>

I had reviewed the patch, few comments:
+
+       /*
+        * Parallel copy for binary formatted files
+        */
+       ParallelCopyDataBlock *curr_data_block;
+       ParallelCopyDataBlock *prev_data_block;
+       uint32                             curr_data_offset;
+       uint32                             curr_block_pos;
+       ParallelCopyTupleInfo  curr_tuple_start_info;
+       ParallelCopyTupleInfo  curr_tuple_end_info;
 } CopyStateData;

 The new members added should be present in ParallelCopyData

+       if (cstate->curr_tuple_start_info.block_id ==
cstate->curr_tuple_end_info.block_id)
+       {
+               elog(DEBUG1,"LEADER - tuple lies in a single data block");
+
+               line_size = cstate->curr_tuple_end_info.offset -
cstate->curr_tuple_start_info.offset + 1;
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].unprocessed_line_parts,
1);
+       }
+       else
+       {
+               uint32 following_block_id =
pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].following_block;
+
+               elog(DEBUG1,"LEADER - tuple is spread across data blocks");
+
+               line_size = DATA_BLOCK_SIZE -
cstate->curr_tuple_start_info.offset -
+
pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].skip_bytes;
+
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].unprocessed_line_parts,
1);
+
+               while (following_block_id !=
cstate->curr_tuple_end_info.block_id)
+               {
+                       line_size = line_size + DATA_BLOCK_SIZE -
pcshared_info->data_blocks[following_block_id].skip_bytes;
+
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts,
1);
+
+                       following_block_id =
pcshared_info->data_blocks[following_block_id].following_block;
+
+                       if (following_block_id == -1)
+                               break;
+               }
+
+               if (following_block_id != -1)
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts,
1);
+
+               line_size = line_size + cstate->curr_tuple_end_info.offset + 1;
+       }

line_size can be computed as and when we process the tuple in
CopyReadBinaryTupleLeader and set at the end. That way the
above code can be removed.

+
+       /*
+        * Parallel copy for binary formatted files
+        */
+       ParallelCopyDataBlock *curr_data_block;
+       ParallelCopyDataBlock *prev_data_block;
+       uint32                             curr_data_offset;
+       uint32                             curr_block_pos;
+       ParallelCopyTupleInfo  curr_tuple_start_info;
+       ParallelCopyTupleInfo  curr_tuple_end_info;
 } CopyStateData;

The curr_block_pos variable is already present in ParallelCopyShmInfo;
we could use that and remove it from here.
For curr_data_offset, a similar variable raw_buf_index is present in
CopyStateData; we could use that and remove it from here.

+ if (cstate->curr_data_offset + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
+ {
+ ParallelCopyDataBlock *data_block = NULL;
+ uint8 movebytes = 0;
+
+ block_pos = WaitGetFreeCopyBlock(pcshared_info);
+
+ movebytes = DATA_BLOCK_SIZE - cstate->curr_data_offset;
+
+ cstate->curr_data_block->skip_bytes = movebytes;
+
+ data_block = &pcshared_info->data_blocks[block_pos];
+
+ if (movebytes > 0)
+ memmove(&data_block->data[0],
&cstate->curr_data_block->data[cstate->curr_data_offset],
+ movebytes);
+
+ elog(DEBUG1, "LEADER - field count is spread across data blocks -
moved %d bytes from current block %u to %u block",
+ movebytes, cstate->curr_block_pos, block_pos);
+
+ readbytes = CopyGetData(cstate, &data_block->data[movebytes], 1,
(DATA_BLOCK_SIZE - movebytes));
+
+ elog(DEBUG1, "LEADER - bytes read from file after field count is
moved to next data block %d", readbytes);
+
+ if (cstate->reached_eof)
+ ereport(ERROR,
+ (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+ errmsg("unexpected EOF in COPY data")));
+
+ cstate->curr_data_block = data_block;
+ cstate->curr_data_offset = 0;
+ cstate->curr_block_pos = block_pos;
+ }

This code is duplicated in CopyReadBinaryTupleLeader &
CopyReadBinaryAttributeLeader. We could make it a function and re-use it.
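
For illustration, the duplicated logic could be folded into a helper along these lines (a rough sketch only, reusing the names from the hunk above; the helper name and the ParallelCopyShmInfo parameter are assumptions, and the posted patch may structure this differently):

static void
MoveTrailingBytesToNewBlock(CopyState cstate, ParallelCopyShmInfo *pcshared_info)
{
	uint32		block_pos = WaitGetFreeCopyBlock(pcshared_info);
	ParallelCopyDataBlock *data_block = &pcshared_info->data_blocks[block_pos];
	uint32		movebytes = DATA_BLOCK_SIZE - cstate->curr_data_offset;
	int			readbytes;

	/* Remember how many trailing bytes of the old block were moved. */
	cstate->curr_data_block->skip_bytes = movebytes;

	/* Carry the partial field count/size over to the new block. */
	if (movebytes > 0)
		memmove(&data_block->data[0],
				&cstate->curr_data_block->data[cstate->curr_data_offset],
				movebytes);

	/* Refill the rest of the new block from the input. */
	readbytes = CopyGetData(cstate, &data_block->data[movebytes], 1,
							DATA_BLOCK_SIZE - movebytes);
	elog(DEBUG1, "LEADER - read %d bytes after moving %u bytes",
		 readbytes, movebytes);

	if (cstate->reached_eof)
		ereport(ERROR,
				(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
				 errmsg("unexpected EOF in COPY data")));

	/* Switch the leader's current position to the new block. */
	cstate->curr_data_block = data_block;
	cstate->curr_data_offset = 0;
	cstate->curr_block_pos = block_pos;
}

(A later version of the patch reportedly does something similar via an AdjustFieldInfo() function.)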

+/*
+ * CopyReadBinaryAttributeWorker - leader identifies boundaries/offsets
+ * for each attribute/column, it moves on to next data block if the
+ * attribute/column is spread across data blocks.
+ */
+static pg_attribute_always_inline Datum
+CopyReadBinaryAttributeWorker(CopyState cstate, int column_no,
+               FmgrInfo *flinfo, Oid typioparam, int32 typmod, bool *isnull)
+{
+       int32           fld_size;
+       Datum           result;

column_no is not used, it can be removed

+       if (fld_count == -1)
+       {
+               /*
+                       * Received EOF marker.  In a V3-protocol copy,
wait for the
+                       * protocol-level EOF, and complain if it doesn't come
+                       * immediately.  This ensures that we correctly
handle CopyFail,
+                       * if client chooses to send that now.
+                       *
+                       * Note that we MUST NOT try to read more data
in an old-protocol
+                       * copy, since there is no protocol-level EOF
marker then.  We
+                       * could go either way for copy from file, but
choose to throw
+                       * error if there's data after the EOF marker,
for consistency
+                       * with the new-protocol case.
+                       */
+               char            dummy;
+
+               if (cstate->copy_dest != COPY_OLD_FE &&
+                       CopyGetData(cstate, &dummy, 1, 1) > 0)
+                       ereport(ERROR,
+                                       (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+                                               errmsg("received copy
data after EOF marker")));
+               return true;
+       }
+
+       if (fld_count != attr_count)
+               ereport(ERROR,
+                       (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+                       errmsg("row field count is %d, expected %d",
+                       (int) fld_count, attr_count)));
+
+       cstate->curr_tuple_start_info.block_id = cstate->curr_block_pos;
+       cstate->curr_tuple_start_info.offset =  cstate->curr_data_offset;
+       cstate->curr_data_offset = cstate->curr_data_offset + sizeof(fld_count);
+       new_block_pos = cstate->curr_block_pos;
+
+       foreach(cur, cstate->attnumlist)
+       {
+               int                     attnum = lfirst_int(cur);
+               int                     m = attnum - 1;
+               Form_pg_attribute att = TupleDescAttr(tupDesc, m);

The above code is present in NextCopyFrom & CopyReadBinaryTupleLeader,
check if we can make a common function or we could use NextCopyFrom as
it is.

+       memcpy(&fld_count,
&cstate->curr_data_block->data[cstate->curr_data_offset],
sizeof(fld_count));
+       fld_count = (int16) pg_ntoh16(fld_count);
+
+       if (fld_count == -1)
+       {
+               return true;
+       }

Should this be an assert in the CopyReadBinaryTupleWorker function, as
this check is already done in the leader?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Ashutosh Sharma
Date:
Hi,

I just got some time to review the first patch in the list, i.e. 0001-Copy-code-readjustment-to-support-parallel-copy.patch. As the patch name suggests, it is just trying to reshuffle the existing code for the COPY command here and there. There are no extra changes added in the patch as such, but I still do have some review comments, please have a look:

1) Can you please add some comments atop the new function PopulateAttributes() describing its functionality in detail. Further, this new function contains the code from BeginCopy() to set attribute-level options used with COPY FROM such as FORCE_QUOTE, FORCE_NOT_NULL, FORCE_NULL etc. in cstate, and along with that it also copies the code from BeginCopy() to set other info such as the client encoding type, encoding conversion etc. Hence, I think it would be good to give it a better name, basically something that matches what it is actually doing.

2) Again, the name for the new function CheckCopyFromValidity() doesn't look good to me. From the function name it appears as if it does the sanity check of the entire COPY FROM command, but actually it is just doing the sanity check for the target relation specified with COPY FROM. So, probably something like CheckTargetRelValidity would look more sensible, I think? TBH, I am not good at naming the functions so you can always ignore my suggestions about function and variable names :)

3) Any reason for not making CheckCopyFromValidity a macro instead of a new function? It is just doing the sanity check for the target relation.

4) Earlier in CopyReadLine() function while trying to clear the EOL marker from cstate->line_buf.data (copied data), we were not checking if the line read by CopyReadLineText() function is a header line or not, but I can see that your patch checks that before clearing the EOL marker. Any reason for this extra check?

5) I noticed the below spurious line removal in the patch.

@@ -3839,7 +3953,6 @@ static bool
 CopyReadLine(CopyState cstate)
 {
    bool        result;
-

Please note that I haven't got a chance to look into other patches as of now. I will do that whenever possible. Thank you.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Fri, Jun 12, 2020 at 11:01 AM vignesh C <vignesh21@gmail.com> wrote:
On Thu, Jun 4, 2020 at 12:44 AM Andres Freund <andres@anarazel.de> wrote
>
>
> Hm. you don't explicitly mention that in your design, but given how
> small the benefits going from 0-1 workers is, I assume the leader
> doesn't do any "chunk processing" on its own?
>

Yes, you are right; the leader does not do any processing. The leader's
work is mainly to populate the shared memory with the offset
information for each record.

>
>
> > Design of the Parallel Copy: The backend, to which the "COPY FROM" query is
> > submitted acts as leader with the responsibility of reading data from the
> > file/stdin, launching at most n number of workers as specified with
> > PARALLEL 'n' option in the "COPY FROM" query. The leader populates the
> > common data required for the workers execution in the DSM and shares it
> > with the workers. The leader then executes before statement triggers if
> > there exists any. Leader populates DSM chunks which includes the start
> > offset and chunk size, while populating the chunks it reads as many blocks
> > as required into the DSM data blocks from the file. Each block is of 64K
> > size. The leader parses the data to identify a chunk, the existing logic
> > from CopyReadLineText which identifies the chunks with some changes was
> > used for this. Leader checks if a free chunk is available to copy the
> > information, if there is no free chunk it waits till the required chunk is
> > freed up by the worker and then copies the identified chunks information
> > (offset & chunk size) into the DSM chunks. This process is repeated till
> > the complete file is processed. Simultaneously, the workers cache the
> > chunks(50) locally into the local memory and release the chunks to the
> > leader for further populating. Each worker processes the chunk that it
> > cached and inserts it into the table. The leader waits till all the chunks
> > populated are processed by the workers and exits.
>
> Why do we need the local copy of 50 chunks? Copying memory around is far
> from free. I don't see why it'd be better to add per-process caching,
> rather than making the DSM bigger? I can see some benefit in marking
> multiple chunks as being processed with one lock acquisition, but I
> don't think adding a memory copy is a good idea.

We had run performance tests with a csv data file (5.1GB, 10 million
tuples, 2 indexes on integer columns); the results are given below. We
noticed that in some cases the performance is better if we copy the 50
records locally and release the shared memory. We will get better
benefits as the number of workers increases. Thoughts?
------------------------------------------------------------------------------
Workers | Exec time (with local copying of  | Exec time (without copying,
        | 50 records & releasing the        | processing record by record)
        | shared memory)                    |
------------------------------------------------------------------------------
0       | 1162.772 (1X)                     | 1152.684 (1X)
2       | 635.249 (1.83X)                   | 647.894 (1.78X)
4       | 336.835 (3.45X)                   | 335.534 (3.43X)
8       | 188.577 (6.17X)                   | 189.461 (6.08X)
16      | 126.819 (9.17X)                   | 142.730 (8.07X)
20      | 117.845 (9.87X)                   | 146.533 (7.87X)
30      | 127.554 (9.11X)                   | 160.307 (7.19X)

> This patch *desperately* needs to be split up. It imo is close to
> unreviewable, due to a large amount of changes that just move code
> around without other functional changes being mixed in with the actual
> new stuff.

I have split the patch, the new split patches are attached.

>
>
>
> >  /*
> > + * State of the chunk.
> > + */
> > +typedef enum ChunkState
> > +{
> > +     CHUNK_INIT,                                     /* initial state of chunk */
> > +     CHUNK_LEADER_POPULATING,        /* leader processing chunk */
> > +     CHUNK_LEADER_POPULATED,         /* leader completed populating chunk */
> > +     CHUNK_WORKER_PROCESSING,        /* worker processing chunk */
> > +     CHUNK_WORKER_PROCESSED          /* worker completed processing chunk */
> > +}ChunkState;
> > +
> > +#define RAW_BUF_SIZE 65536           /* we palloc RAW_BUF_SIZE+1 bytes */
> > +
> > +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> > +#define RINGSIZE (10 * 1000)
> > +#define MAX_BLOCKS_COUNT 1000
> > +#define WORKER_CHUNK_COUNT 50        /* should be mod of RINGSIZE */
> > +
> > +#define      IsParallelCopy()                (cstate->is_parallel)
> > +#define IsLeader()                           (cstate->pcdata->is_leader)
> > +#define IsHeaderLine()                       (cstate->header_line && cstate->cur_lineno == 1)
> > +
> > +/*
> > + * Copy data block information.
> > + */
> > +typedef struct CopyDataBlock
> > +{
> > +     /* The number of unprocessed chunks in the current block. */
> > +     pg_atomic_uint32 unprocessed_chunk_parts;
> > +
> > +     /*
> > +      * If the current chunk data is continued into another block,
> > +      * following_block will have the position where the remaining data need to
> > +      * be read.
> > +      */
> > +     uint32  following_block;
> > +
> > +     /*
> > +      * This flag will be set, when the leader finds out this block can be read
> > +      * safely by the worker. This helps the worker to start processing the chunk
> > +      * early where the chunk will be spread across many blocks and the worker
> > +      * need not wait for the complete chunk to be processed.
> > +      */
> > +     bool   curr_blk_completed;
> > +     char   data[DATA_BLOCK_SIZE + 1]; /* data read from file */
> > +}CopyDataBlock;
>
> What's the + 1 here about?

Fixed this, removed +1. That is not needed.

>
>
> > +/*
> > + * Parallel copy line buffer information.
> > + */
> > +typedef struct ParallelCopyLineBuf
> > +{
> > +     StringInfoData          line_buf;
> > +     uint64                          cur_lineno;     /* line number for error messages */
> > +}ParallelCopyLineBuf;
>
> Why do we need separate infrastructure for this? We shouldn't duplicate
> infrastructure unnecessarily.
>

This was required for copying multiple records locally and
releasing the shared memory. I have not changed this; I will decide on
this based on the decision taken for one of the previous comments.

>
>
>
> > +/*
> > + * Common information that need to be copied to shared memory.
> > + */
> > +typedef struct CopyWorkerCommonData
> > +{
>
> Why is parallel specific stuff here suddenly not named ParallelCopy*
> anymore? If you introduce a naming like that it imo should be used
> consistently.

Fixed, changed to maintain ParallelCopy in all structs.

>
> > +     /* low-level state data */
> > +     CopyDest            copy_dest;          /* type of copy source/destination */
> > +     int                 file_encoding;      /* file or remote side's character encoding */
> > +     bool                need_transcoding;   /* file encoding diff from server? */
> > +     bool                encoding_embeds_ascii;      /* ASCII can be non-first byte? */
> > +
> > +     /* parameters from the COPY command */
> > +     bool                csv_mode;           /* Comma Separated Value format? */
> > +     bool                header_line;        /* CSV header line? */
> > +     int                 null_print_len; /* length of same */
> > +     bool                force_quote_all;    /* FORCE_QUOTE *? */
> > +     bool                convert_selectively;        /* do selective binary conversion? */
> > +
> > +     /* Working state for COPY FROM */
> > +     AttrNumber          num_defaults;
> > +     Oid                 relid;
> > +}CopyWorkerCommonData;
>
> But I actually think we shouldn't have this information in two different
> structs. This should exist once, independent of using parallel /
> non-parallel copy.
>

This structure helps in storing the common data from CopyStateData
that is required by the workers. This information will be
allocated and stored in the DSM for the workers to retrieve and copy
into their own CopyStateData.

>
> > +/* List information */
> > +typedef struct ListInfo
> > +{
> > +     int     count;          /* count of attributes */
> > +
> > +     /* string info in the form info followed by info1, info2... infon  */
> > +     char    info[1];
> > +} ListInfo;
>
> Based on these comments I have no idea what this could be for.
>

Have added better comments for this. The following was added: This
structure will help in converting a List data type into the below
structure format, with count holding the number of elements in the
list and info holding the List elements appended contiguously. This
converted structure will be allocated in shared memory and stored in
the DSM for the worker to retrieve and later convert back to a List
data type.
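
For illustration, flattening a List of C strings into that layout could look roughly like the sketch below (the function name is made up for this example; it is not the patch's actual serialization code):

static ListInfo *
SerializeStringList(List *namelist, Size *size_out)
{
	Size		size = offsetof(ListInfo, info);
	ListCell   *lc;
	ListInfo   *linfo;
	char	   *ptr;

	/* First pass: work out how much space the flattened form needs. */
	foreach(lc, namelist)
		size += strlen((char *) lfirst(lc)) + 1;

	linfo = (ListInfo *) palloc0(size);
	linfo->count = list_length(namelist);

	/* Second pass: append the elements contiguously after the count. */
	ptr = linfo->info;
	foreach(lc, namelist)
	{
		char	   *name = (char *) lfirst(lc);
		Size		len = strlen(name) + 1;

		memcpy(ptr, name, len);
		ptr += len;
	}

	*size_out = size;			/* caller copies this many bytes into the DSM */
	return linfo;
}

The worker would then walk info, reading count NUL-terminated strings, and rebuild the List on its side.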

>
> >  /*
> > - * This keeps the character read at the top of the loop in the buffer
> > - * even if there is more than one read-ahead.
> > + * This keeps the character read at the top of the loop in the buffer
> > + * even if there is more than one read-ahead.
> > + */
> > +#define IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(extralen) \
> > +if (1) \
> > +{ \
> > +     if (copy_buff_state.raw_buf_ptr + (extralen) >= copy_buff_state.copy_buf_len && !hit_eof) \
> > +     { \
> > +             if (IsParallelCopy()) \
> > +             { \
> > +                     copy_buff_state.chunk_size = prev_chunk_size; /* update previous chunk size */ \
> > +                     if (copy_buff_state.block_switched) \
> > +                     { \
> > +                             pg_atomic_sub_fetch_u32(&copy_buff_state.data_blk_ptr->unprocessed_chunk_parts, 1); \
> > +                             copy_buff_state.copy_buf_len = prev_copy_buf_len; \
> > +                     } \
> > +             } \
> > +             copy_buff_state.raw_buf_ptr = prev_raw_ptr; /* undo fetch */ \
> > +             need_data = true; \
> > +             continue; \
> > +     } \
> > +} else ((void) 0)
>
> I think it's an absolutely clear no-go to add new branches to
> these. They're *really* hot already, and this is going to sprinkle a
> significant amount of new instructions over a lot of places.
>

Fixed, removed this.

>
>
> > +/*
> > + * SET_RAWBUF_FOR_LOAD - Set raw_buf to the shared memory where the file data must
> > + * be read.
> > + */
> > +#define SET_RAWBUF_FOR_LOAD() \
> > +{ \
> > +     ShmCopyInfo     *pcshared_info = cstate->pcdata->pcshared_info; \
> > +     uint32 cur_block_pos; \
> > +     /* \
> > +      * Mark the previous block as completed, worker can start copying this data. \
> > +      */ \
> > +     if (copy_buff_state.data_blk_ptr != copy_buff_state.curr_data_blk_ptr && \
> > +             copy_buff_state.data_blk_ptr->curr_blk_completed == false) \
> > +             copy_buff_state.data_blk_ptr->curr_blk_completed = true; \
> > +     \
> > +     copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > +     cur_block_pos = WaitGetFreeCopyBlock(pcshared_info); \
> > +     copy_buff_state.curr_data_blk_ptr = &pcshared_info->data_blocks[cur_block_pos]; \
> > +     \
> > +     if (!copy_buff_state.data_blk_ptr) \
> > +     { \
> > +             copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > +             chunk_first_block = cur_block_pos; \
> > +     } \
> > +     else if (need_data == false) \
> > +             copy_buff_state.data_blk_ptr->following_block = cur_block_pos; \
> > +     \
> > +     cstate->raw_buf = copy_buff_state.curr_data_blk_ptr->data; \
> > +     copy_buff_state.copy_raw_buf = cstate->raw_buf; \
> > +}
> > +
> > +/*
> > + * END_CHUNK_PARALLEL_COPY - Update the chunk information in shared memory.
> > + */
> > +#define END_CHUNK_PARALLEL_COPY() \
> > +{ \
> > +     if (!IsHeaderLine()) \
> > +     { \
> > +             ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \
> > +             ChunkBoundaries *chunkBoundaryPtr = &pcshared_info->chunk_boundaries; \
> > +             if (copy_buff_state.chunk_size) \
> > +             { \
> > +                     ChunkBoundary *chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > +                     /* \
> > +                      * If raw_buf_ptr is zero, unprocessed_chunk_parts would have been \
> > +                      * incremented in SEEK_COPY_BUFF_POS. This will happen if the whole \
> > +                      * chunk finishes at the end of the current block. If the \
> > +                      * new_line_size > raw_buf_ptr, then the new block has only new line \
> > +                      * char content. The unprocessed count should not be increased in \
> > +                      * this case. \
> > +                      */ \
> > +                     if (copy_buff_state.raw_buf_ptr != 0 && \
> > +                             copy_buff_state.raw_buf_ptr > new_line_size) \
> > +                             pg_atomic_add_fetch_u32(&copy_buff_state.curr_data_blk_ptr->unprocessed_chunk_parts, 1); \
> > +                     \
> > +                     /* Update chunk size. */ \
> > +                     pg_atomic_write_u32(&chunkInfo->chunk_size, copy_buff_state.chunk_size); \
> > +                     pg_atomic_write_u32(&chunkInfo->chunk_state, CHUNK_LEADER_POPULATED); \
> > +                     elog(DEBUG1, "[Leader] After adding - chunk position:%d, chunk_size:%d", \
> > +                                             chunk_pos, copy_buff_state.chunk_size); \
> > +                     pcshared_info->populated++; \
> > +             } \
> > +             else if (new_line_size) \
> > +             { \
> > +                     /* \
> > +                      * This means only new line char, empty record should be \
> > +                      * inserted. \
> > +                      */ \
> > +                     ChunkBoundary *chunkInfo; \
> > +                     chunk_pos = UpdateBlockInChunkInfo(cstate, -1, -1, 0, \
> > +                                                                                        CHUNK_LEADER_POPULATED); \
> > +                     chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > +                     elog(DEBUG1, "[Leader] Added empty chunk with offset:%d, chunk position:%d, chunk size:%d", \
> > +                                              chunkInfo->start_offset, chunk_pos, \
> > +                                              pg_atomic_read_u32(&chunkInfo->chunk_size)); \
> > +                     pcshared_info->populated++; \
> > +             } \
> > +     }\
> > +     \
> > +     /*\
> > +      * All of the read data is processed, reset index & len. In the\
> > +      * subsequent read, we will get a new block and copy data in to the\
> > +      * new block.\
> > +      */\
> > +     if (copy_buff_state.raw_buf_ptr == copy_buff_state.copy_buf_len)\
> > +     {\
> > +             cstate->raw_buf_index = 0;\
> > +             cstate->raw_buf_len = 0;\
> > +     }\
> > +     else\
> > +             cstate->raw_buf_len = copy_buff_state.copy_buf_len;\
> > +}
>
> Why are these macros? They are way way way above a length where that
> makes any sort of sense.
>

Converted these macros to functions.


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
vignesh C
Date:
Thanks Ashutosh for your review; my comments are inline.
On Fri, Jun 19, 2020 at 5:41 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> I just got some time to review the first patch in the list i.e.
0001-Copy-code-readjustment-to-support-parallel-copy.patch. As the patch name suggests, it is just trying to reshuffle
the existing code for COPY command here and there. There is no extra changes added in the patch as such, but still I do
have some review comments, please have a look:
>
> 1) Can you please add some comments atop the new function PopulateAttributes() describing its functionality in
detail. Further, this new function contains the code from BeginCopy() to set attribute level options used with COPY FROM
such as FORCE_QUOTE, FORCE_NOT_NULL, FORCE_NULL etc. in cstate and along with that it also copies the code from
BeginCopy() to set other infos such as client encoding type, encoding conversion etc. Hence, I think it would be good to
give it some better name, basically something that matches with what actually it is doing.
>

There is no new code added in this function; some part of the code from
BeginCopy was made into a new function, as this part of the code will also
be required by the parallel copy workers before the workers start the
actual copy operation. This code was made into a function to avoid
duplication. Changed the function name to PopulateGlobalsForCopyFrom &
added a few comments.

> 2) Again, the name for the new function CheckCopyFromValidity() doesn't look good to me. From the function name it
appears as if it does the sanity check of the entire COPY FROM command, but actually it is just doing the sanity check
for the target relation specified with COPY FROM. So, probably something like CheckTargetRelValidity would look more
sensible, I think? TBH, I am not good at naming the functions so you can always ignore my suggestions about function and
variable names :)
>

Changed as suggested.
> 3) Any reason for not making CheckCopyFromValidity as a macro instead of a new function. It is just doing the sanity
check for the target relation.
>

I felt there is a reasonable number of lines in the function & it is not
in a performance-intensive path, so I preferred a function over a macro.
Your thoughts?

> 4) Earlier in CopyReadLine() function while trying to clear the EOL marker from cstate->line_buf.data (copied data),
we were not checking if the line read by CopyReadLineText() function is a header line or not, but I can see that your
patch checks that before clearing the EOL marker. Any reason for this extra check?
>

If you see the caller of CopyReadLine, i.e. NextCopyFromRawFields, it does
nothing for the header line; the server basically calls CopyReadLine
again, so it is a kind of small optimization. Anyway, since the server is
not going to do anything with the header line, I felt there was no need to
clear the EOL marker for header lines.
/* on input just throw the header line away */
if (cstate->cur_lineno == 0 && cstate->header_line)
{
cstate->cur_lineno++;
if (CopyReadLine(cstate))
return false; /* done */
}

cstate->cur_lineno++;

/* Actually read the line into memory here */
done = CopyReadLine(cstate);
I think there is no need to make a fix for this. Your thoughts?

> 5) I noticed the below spurious line removal in the patch.
>
> @@ -3839,7 +3953,6 @@ static bool
>  CopyReadLine(CopyState cstate)
>  {
>     bool        result;
> -
>

Fixed.
I have attached the patch for the same with the fixes.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
vignesh C
Date:
On Tue, Jun 23, 2020 at 8:07 AM vignesh C <vignesh21@gmail.com> wrote:
> I have attached the patch for the same with the fixes.

The patches were not applying on the head; attached are the patches that can be applied on head.
I have added a commitfest entry[1] for this feature.

[1] - https://commitfest.postgresql.org/28/2610/


On Tue, Jun 23, 2020 at 8:07 AM vignesh C <vignesh21@gmail.com> wrote:
Thanks Ashutosh For your review, my comments are inline.
On Fri, Jun 19, 2020 at 5:41 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> I just got some time to review the first patch in the list i.e. 0001-Copy-code-readjustment-to-support-parallel-copy.patch. As the patch name suggests, it is just trying to reshuffle the existing code for COPY command here and there. There is no extra changes added in the patch as such, but still I do have some review comments, please have a look:
>
> 1) Can you please add some comments atop the new function PopulateAttributes() describing its functionality in detail. Further, this new function contains the code from BeginCopy() to set attribute level options used with COPY FROM such as FORCE_QUOTE, FORCE_NOT_NULL, FORCE_NULL etc. in cstate and along with that it also copies the code from BeginCopy() to set other infos such as client encoding type, encoding conversion etc. Hence, I think it would be good to give it some better name, basically something that matches with what actually it is doing.
>

There is no new code added in this function, some part of code from
BeginCopy was made in to a new function as this part of code will also
be required for the parallel copy workers before the workers start the
actual copy operation. This code was made into a function to avoid
duplication. Changed the function name to PopulateGlobalsForCopyFrom &
added few comments.

> 2) Again, the name for the new function CheckCopyFromValidity() doesn't look good to me. From the function name it appears as if it does the sanity check of the entire COPY FROM command, but actually it is just doing the sanity check for the target relation specified with COPY FROM. So, probably something like CheckTargetRelValidity would look more sensible, I think? TBH, I am not good at naming the functions so you can always ignore my suggestions about function and variable names :)
>

Changed as suggested.
> 3) Any reason for not making CheckCopyFromValidity as a macro instead of a new function. It is just doing the sanity check for the target relation.
>

I felt there is reasonable number of lines in the function & it is not
in performance intensive path, so I preferred function over macro.
Your thoughts?

> 4) Earlier in CopyReadLine() function while trying to clear the EOL marker from cstate->line_buf.data (copied data), we were not checking if the line read by CopyReadLineText() function is a header line or not, but I can see that your patch checks that before clearing the EOL marker. Any reason for this extra check?
>

If you see the caller of CopyReadLine, i.e. NextCopyFromRawFields does
nothing for the header line, server basically calls CopyReadLine
again, it is a kind of small optimization. Anyway server is not going
to do anything with header line, I felt no need to clear EOL marker
for header lines.
/* on input just throw the header line away */
if (cstate->cur_lineno == 0 && cstate->header_line)
{
cstate->cur_lineno++;
if (CopyReadLine(cstate))
return false; /* done */
}

cstate->cur_lineno++;

/* Actually read the line into memory here */
done = CopyReadLine(cstate);
I think no need to make a fix for this. Your thoughts?

> 5) I noticed the below spurious line removal in the patch.
>
> @@ -3839,7 +3953,6 @@ static bool
>  CopyReadLine(CopyState cstate)
>  {
>     bool        result;
> -
>

Fixed.
I have attached the patch for the same with the fixes.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Parallel copy

From
Bharath Rupireddy
Date:
Hi,

Thanks Vignesh for reviewing the parallel copy for binary format files
patch. I have tried to address the comments in the attached patch
(0006-Parallel-Copy-For-Binary-Format-Files.patch).

On Thu, Jun 18, 2020 at 6:42 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 4:39 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > The above tests were run with the configuration attached config.txt, which is the same used for performance tests
of csv/text files posted earlier in this mail chain.
 
> >
> > Request the community to take this patch up for review along with the parallel copy for csv/text file patches and
provide feedback.
 
> >
>
> I had reviewed the patch, few comments:
>
>  The new members added should be present in ParallelCopyData

Added to ParallelCopyData.

>
> line_size can be set as and when we process the tuple from
> CopyReadBinaryTupleLeader and this can be set at the end. That way the
> above code can be removed.
>

curr_tuple_start_info and curr_tuple_end_info variables are now local
variables to CopyReadBinaryTupleLeader and the line size calculation
code is moved to CopyReadBinaryAttributeLeader.

>
> curr_block_pos variable is present in ParallelCopyShmInfo, we could
> use it and remove from here.
> curr_data_offset, similar variable raw_buf_index is present in
> CopyStateData, we could use it and remove from here.
>

Yes, making use of them now.

>
> This code is duplicate in CopyReadBinaryTupleLeader &
> CopyReadBinaryAttributeLeader. We could make a function and re-use.
>

Added a new function AdjustFieldInfo.

>
> column_no is not used, it can be removed
>

Removed.

>
> The above code is present in NextCopyFrom & CopyReadBinaryTupleLeader,
> check if we can make a common function or we could use NextCopyFrom as
> it is.
>

Added a macro CHECK_FIELD_COUNT.

> +       if (fld_count == -1)
> +       {
> +               return true;
> +       }
>
> Should this be an assert in CopyReadBinaryTupleWorker function as this
> check is already done in the leader.
>

This check in the leader signifies the end of the file. For the workers,
EOF is when GetLinePosition() returns -1:
    line_pos = GetLinePosition(cstate);
    if (line_pos == -1)
        return true;
If fld_count == -1 is encountered in a worker, the worker should just
return true from CopyReadBinaryTupleWorker, marking EOF. Having this as
an assert doesn't serve the purpose, I feel.

Along with the review comments addressed
patch(0006-Parallel-Copy-For-Binary-Format-Files.patch) also attaching
all other latest series of patches(0001 to 0005) from [1], the order
of applying patches is from 0001 to 0006.

[1] https://www.postgresql.org/message-id/CALDaNm0H3N9gK7CMheoaXkO99g%3DuAPA93nSZXu0xDarPyPY6sg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Bharath Rupireddy
Date:
Hi,

It looks like the parsing of the newly introduced "PARALLEL" option for
the COPY FROM command has an issue (in
0002-Framework-for-leader-worker-in-parallel-copy.patch):
specifying ....PARALLEL '4ar2eteid'); would pass with 4 workers, since
atoi() is being used to convert the string to an integer and it just
returns 4, ignoring the trailing characters.

I used strtol(), added error checks and introduced the error
"improper use of argument to option "parallel"" for such cases.

        parallel '4ar2eteid');
ERROR:  improper use of argument to option "parallel"
LINE 5:         parallel '1\');
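
For reference, the kind of check described above could look roughly like this fragment (variable names and the error code below are assumptions for the example; the exact details in the patch may differ):

	char	   *endptr;
	long		nworkers;

	errno = 0;
	nworkers = strtol(str, &endptr, 10);	/* str: the option's string value */
	if (errno != 0 || endptr == str || *endptr != '\0' ||
		nworkers < 0 || nworkers > INT_MAX)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("improper use of argument to option \"parallel\"")));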

Along with the updated patch
0002-Framework-for-leader-worker-in-parallel-copy.patch, also
attaching all the latest patches from [1].

[1] - https://www.postgresql.org/message-id/CALj2ACW94icER3WrWapon7JkcX8j0TGRue5ycWMTEvgA3X7fOg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

On Tue, Jun 23, 2020 at 12:22 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 8:07 AM vignesh C <vignesh21@gmail.com> wrote:
> > I have attached the patch for the same with the fixes.
>
> The patches were not applying on the head, attached the patches that can be applied on head.
> I have added a commitfest entry[1] for this feature.
>
> [1] - https://commitfest.postgresql.org/28/2610/
>
>
> On Tue, Jun 23, 2020 at 8:07 AM vignesh C <vignesh21@gmail.com> wrote:
>>
>> Thanks Ashutosh For your review, my comments are inline.
>> On Fri, Jun 19, 2020 at 5:41 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I just got some time to review the first patch in the list i.e.
0001-Copy-code-readjustment-to-support-parallel-copy.patch. As the patch name suggests, it is just trying to reshuffle
the existing code for COPY command here and there. There is no extra changes added in the patch as such, but still I do
have some review comments, please have a look:
>> >
>> > 1) Can you please add some comments atop the new function PopulateAttributes() describing its functionality in
detail. Further, this new function contains the code from BeginCopy() to set attribute level options used with COPY FROM
such as FORCE_QUOTE, FORCE_NOT_NULL, FORCE_NULL etc. in cstate and along with that it also copies the code from
BeginCopy() to set other infos such as client encoding type, encoding conversion etc. Hence, I think it would be good to
give it some better name, basically something that matches with what actually it is doing.
>> >
>>
>> There is no new code added in this function, some part of code from
>> BeginCopy was made in to a new function as this part of code will also
>> be required for the parallel copy workers before the workers start the
>> actual copy operation. This code was made into a function to avoid
>> duplication. Changed the function name to PopulateGlobalsForCopyFrom &
>> added few comments.
>>
>> > 2) Again, the name for the new function CheckCopyFromValidity() doesn't look good to me. From the function name it
appears as if it does the sanity check of the entire COPY FROM command, but actually it is just doing the sanity check
for the target relation specified with COPY FROM. So, probably something like CheckTargetRelValidity would look more
sensible, I think? TBH, I am not good at naming the functions so you can always ignore my suggestions about function and
variable names :)
>> >
>>
>> Changed as suggested.
>> > 3) Any reason for not making CheckCopyFromValidity as a macro instead of a new function. It is just doing the
sanity check for the target relation.
>> >
>>
>> I felt there is reasonable number of lines in the function & it is not
>> in performance intensive path, so I preferred function over macro.
>> Your thoughts?
>>
>> > 4) Earlier in CopyReadLine() function while trying to clear the EOL marker from cstate->line_buf.data (copied
data), we were not checking if the line read by CopyReadLineText() function is a header line or not, but I can see that
your patch checks that before clearing the EOL marker. Any reason for this extra check?
>> >
>>
>> If you see the caller of CopyReadLine, i.e. NextCopyFromRawFields does
>> nothing for the header line, server basically calls CopyReadLine
>> again, it is a kind of small optimization. Anyway server is not going
>> to do anything with header line, I felt no need to clear EOL marker
>> for header lines.
>> /* on input just throw the header line away */
>> if (cstate->cur_lineno == 0 && cstate->header_line)
>> {
>> cstate->cur_lineno++;
>> if (CopyReadLine(cstate))
>> return false; /* done */
>> }
>>
>> cstate->cur_lineno++;
>>
>> /* Actually read the line into memory here */
>> done = CopyReadLine(cstate);
>> I think no need to make a fix for this. Your thoughts?
>>
>> > 5) I noticed the below spurious line removal in the patch.
>> >
>> > @@ -3839,7 +3953,6 @@ static bool
>> >  CopyReadLine(CopyState cstate)
>> >  {
>> >     bool        result;
>> > -
>> >
>>
>> Fixed.
>> I have attached the patch for the same with the fixes.
>> Thoughts?
>>
>> Regards,
>> Vignesh
>> EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Bharath Rupireddy
Date:
On Wed, Jun 24, 2020 at 2:16 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> It looks like the parsing of newly introduced "PARALLEL" option for
> COPY FROM command has an issue(in the
> 0002-Framework-for-leader-worker-in-parallel-copy.patch),
> Mentioning ....PARALLEL '4ar2eteid'); would pass with 4 workers since
> atoi() is being used for converting string to integer which just
> returns 4, ignoring other strings.
>
> I used strtol(), added error checks and introduced the error "
> improper use of argument to option "parallel"" for the above cases.
>
>         parallel '4ar2eteid');
> ERROR:  improper use of argument to option "parallel"
> LINE 5:         parallel '1\');
>
> Along with the updated patch
> 0002-Framework-for-leader-worker-in-parallel-copy.patch, also
> attaching all the latest patches from [1].
>
> [1] - https://www.postgresql.org/message-id/CALj2ACW94icER3WrWapon7JkcX8j0TGRue5ycWMTEvgA3X7fOg%40mail.gmail.com
>

I'm sorry, I forgot to attach the patches. Here is the latest series
of patches.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Bharath Rupireddy
Date:
Hi,

0006 patch has some code clean up and issue fixes found during internal testing.

Attaching the latest patches herewith.

The order of applying the patches remains the same i.e. from 0001 to 0006.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
vignesh C
Date:
Hi,

I have made a few changes in the 0003 & 0005 patches; there were a couple
of bugs in the 0003 patch and some random test failures in the 0005 patch.
Attached are new patches which include the fixes for the same.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



On Fri, Jun 26, 2020 at 2:34 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> 0006 patch has some code clean up and issue fixes found during internal testing.
>
> Attaching the latest patches herewith.
>
> The order of applying the patches remains the same i.e. from 0001 to 0006.
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
vignesh C
Date:
On Wed, Jul 1, 2020 at 2:46 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Hi,
>
> I have made few changes in 0003 & 0005 patch, there were a couple of
> bugs in 0003 patch & some random test failures in 0005 patch.
> Attached new patches which include the fixes for the same.

I have made changes in the 0003 patch to remove the changes made in pqmq.c for the parallel worker error handling hang issue. This is being discussed separately in email [1], as it is a bug in the head. The rest of the patches have no changes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
vignesh C
Date:
On Wed, Jun 24, 2020 at 1:41 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Along with the review comments addressed
> patch(0006-Parallel-Copy-For-Binary-Format-Files.patch) also attaching
> all other latest series of patches(0001 to 0005) from [1], the order
> of applying patches is from 0001 to 0006.
>
> [1] https://www.postgresql.org/message-id/CALDaNm0H3N9gK7CMheoaXkO99g%3DuAPA93nSZXu0xDarPyPY6sg%40mail.gmail.com
>

Some comments:

+       movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
+
+       cstate->pcdata->curr_data_block->skip_bytes = movebytes;
+
+       data_block = &pcshared_info->data_blocks[block_pos];
+
+       if (movebytes > 0)
+               memmove(&data_block->data[0],
&cstate->pcdata->curr_data_block->data[cstate->raw_buf_index],
+                       movebytes);
we can create a local variable and use in place of
cstate->pcdata->curr_data_block.

+       if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
+               AdjustFieldInfo(cstate, 1);
+
+       memcpy(&fld_count,
&cstate->pcdata->curr_data_block->data[cstate->raw_buf_index],
sizeof(fld_count));
Should this be like below, as the remaining size can fit in current block:
       if (cstate->raw_buf_index + sizeof(fld_count) >= DATA_BLOCK_SIZE)

+       if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE - 1))
+       {
+               AdjustFieldInfo(cstate, 2);
+               *new_block_pos = pcshared_info->cur_block_pos;
+       }
Same like above.

+       movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
+
+       cstate->pcdata->curr_data_block->skip_bytes = movebytes;
+
+       data_block = &pcshared_info->data_blocks[block_pos];
+
+       if (movebytes > 0)
Instead of the above check, we can have an assert check for movebytes.

+       if (mode == 1)
+       {
+               cstate->pcdata->curr_data_block = data_block;
+               cstate->raw_buf_index = 0;
+       }
+       else if(mode == 2)
+       {
+               ParallelCopyDataBlock *prev_data_block = NULL;
+               prev_data_block = cstate->pcdata->curr_data_block;
+               prev_data_block->following_block = block_pos;
+               cstate->pcdata->curr_data_block = data_block;
+
+               if (prev_data_block->curr_blk_completed  == false)
+                       prev_data_block->curr_blk_completed = true;
+
+               cstate->raw_buf_index = 0;
+       }

This code is common to both modes; keep it in the common flow so that the if (mode == 1) branch can be removed:
cstate->pcdata->curr_data_block = data_block;
cstate->raw_buf_index = 0;

+#define CHECK_FIELD_COUNT \
+{\
+       if (fld_count == -1) \
+       { \
+               if (IsParallelCopy() && \
+                       !IsLeader()) \
+                       return true; \
+               else if (IsParallelCopy() && \
+                       IsLeader()) \
+               { \
+                       if
(cstate->pcdata->curr_data_block->data[cstate->raw_buf_index +
sizeof(fld_count)] != 0) \
+                               ereport(ERROR, \
+
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT), \
+                                               errmsg("received copy
data after EOF marker"))); \
+                       return true; \
+               } \
We only copy sizeof(fld_count); shouldn't we check fld_count !=
cstate->max_fields? Am I missing something here?
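
In other words, something along these lines (just a sketch; the error wording is borrowed from the similar leader-side check quoted earlier and may differ in the patch):

	if (fld_count != cstate->max_fields)
		ereport(ERROR,
				(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
				 errmsg("row field count is %d, expected %d",
						(int) fld_count, cstate->max_fields)));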

+       if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE - 1))
+       {
+               AdjustFieldInfo(cstate, 2);
+               *new_block_pos = pcshared_info->cur_block_pos;
+       }
+
+       memcpy(&fld_size,
&cstate->pcdata->curr_data_block->data[cstate->raw_buf_index],
sizeof(fld_size));
+
+       cstate->raw_buf_index = cstate->raw_buf_index + sizeof(fld_size);
+
+       fld_size = (int32) pg_ntoh32(fld_size);
+
+       if (fld_size == 0)
+               ereport(ERROR,
+                               (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+                                errmsg("unexpected EOF in COPY data")));
+
+       if (fld_size < -1)
+               ereport(ERROR,
+                               (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+                                errmsg("invalid field size")));
+
+       if ((DATA_BLOCK_SIZE - cstate->raw_buf_index) >= fld_size)
+       {
+               cstate->raw_buf_index = cstate->raw_buf_index + fld_size;
+       }
We can keep the check like cstate->raw_buf_index + fld_size < ..., for
better readability and consistency.

+static pg_attribute_always_inline void
+CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
+       Oid typioparam, int32 typmod, uint32 *new_block_pos,
+       int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
+       ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
flinfo, typioparam & typmod are not used; we can remove these parameters.

+static pg_attribute_always_inline void
+CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
+       Oid typioparam, int32 typmod, uint32 *new_block_pos,
+       int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
+       ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
I felt this function need not be an inline function.

+               /* binary format */
+               /* for paralle copy leader, fill in the error
There are some typos; please run a spell check.

+               /* raw_buf_index should never cross data block size,
+                * as the required number of data blocks would have
+                * been obtained in the above while loop.
+                */
There are a few places where the commenting style should be changed to the postgres style.

+       if (cstate->pcdata->curr_data_block == NULL)
+       {
+               block_pos = WaitGetFreeCopyBlock(pcshared_info);
+
+               cstate->pcdata->curr_data_block =
&pcshared_info->data_blocks[block_pos];
+
+               cstate->raw_buf_index = 0;
+
+               readbytes = CopyGetData(cstate,
&cstate->pcdata->curr_data_block->data, 1, DATA_BLOCK_SIZE);
+
+               elog(DEBUG1, "LEADER - bytes read from file %d", readbytes);
+
+               if (cstate->reached_eof)
+                       return true;
+       }
There are many empty lines here; they are not required.


+       if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
+               AdjustFieldInfo(cstate, 1);
+
+       memcpy(&fld_count,
&cstate->pcdata->curr_data_block->data[cstate->raw_buf_index],
sizeof(fld_count));
+
+       fld_count = (int16) pg_ntoh16(fld_count);
+
+       CHECK_FIELD_COUNT;
+
+       cstate->raw_buf_index = cstate->raw_buf_index + sizeof(fld_count);
+       new_block_pos = pcshared_info->cur_block_pos;
You can run pg_indent once on these changes.

+       if (mode == 1)
+       {
+               cstate->pcdata->curr_data_block = data_block;
+               cstate->raw_buf_index = 0;
+       }
+       else if(mode == 2)
+       {
We could use macros instead of the literals 1 & 2 for better readability.
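For instance (the names below are only placeholders, not from the patch):

#define ADJUST_FIELD_INFO_NEW_BLOCK	1	/* switch to a fresh data block */
#define ADJUST_FIELD_INFO_SPLIT		2	/* field spills into the next block */

AdjustFieldInfo(cstate, ADJUST_FIELD_INFO_SPLIT) then reads much better than
AdjustFieldInfo(cstate, 2).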

+               if (tuple_start_info_ptr->block_id ==
tuple_end_info_ptr->block_id)
+               {
+                       elog(DEBUG1,"LEADER - tuple lies in a single
data block");
+
+                       *line_size = tuple_end_info_ptr->offset -
tuple_start_info_ptr->offset + 1;
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[tuple_start_info_ptr->block_id].unprocessed_line_parts,
1);
+               }
+               else
+               {
+                       uint32 following_block_id =
pcshared_info->data_blocks[tuple_start_info_ptr->block_id].following_block;
+
+                       elog(DEBUG1,"LEADER - tuple is spread across
data blocks");
+
+                       *line_size = DATA_BLOCK_SIZE -
tuple_start_info_ptr->offset -
+
pcshared_info->data_blocks[tuple_start_info_ptr->block_id].skip_bytes;
+
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[tuple_start_info_ptr->block_id].unprocessed_line_parts,
1);
+
+                       while (following_block_id !=
tuple_end_info_ptr->block_id)
+                       {
+                               *line_size = *line_size +
DATA_BLOCK_SIZE -
pcshared_info->data_blocks[following_block_id].skip_bytes;
+
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts,
1);
+
+                               following_block_id =
pcshared_info->data_blocks[following_block_id].following_block;
+
+                               if (following_block_id == -1)
+                                       break;
+                       }
+
+                       if (following_block_id != -1)
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts,
1);
+
+                       *line_size = *line_size +
tuple_end_info_ptr->offset + 1;
+               }
We could calculate the size as we parse and identify each record; if we do it
that way, this code can be removed.
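Something along these lines at the point where each field is consumed would do
(a sketch only, reusing the variable names from the quoted code):

	/* account for this field while parsing, instead of walking the
	 * chain of data blocks again afterwards */
	*line_size += sizeof(fld_size);
	if (fld_size > 0)
		*line_size += fld_size;		/* payload bytes of this field */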

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Bharath Rupireddy
Date:
Thanks Vignesh for the review. Addressed the comments in 0006 patch.

>
> we can create a local variable and use in place of
> cstate->pcdata->curr_data_block.

Done.

> +       if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
> +               AdjustFieldInfo(cstate, 1);
> +
> +       memcpy(&fld_count,
> &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index],
> sizeof(fld_count));
> Should this be like below, as the remaining size can fit in current block:
>        if (cstate->raw_buf_index + sizeof(fld_count) >= DATA_BLOCK_SIZE)
>
> +       if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE - 1))
> +       {
> +               AdjustFieldInfo(cstate, 2);
> +               *new_block_pos = pcshared_info->cur_block_pos;
> +       }
> Same like above.

Yes you are right. Changed.

>
> +       movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
> +
> +       cstate->pcdata->curr_data_block->skip_bytes = movebytes;
> +
> +       data_block = &pcshared_info->data_blocks[block_pos];
> +
> +       if (movebytes > 0)
> Instead of the above check, we can have an assert check for movebytes.

No, we can't use an assert here. In the edge case where the current data block
is filled right up to DATA_BLOCK_SIZE, movebytes will be 0, but we still need
to get a new data block. The movebytes > 0 check lets us skip the memmove in
that case.
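Roughly, the intent is the following (a sketch, not the exact patch code):

	movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
	cstate->pcdata->curr_data_block->skip_bytes = movebytes;
	data_block = &pcshared_info->data_blocks[block_pos];

	/* if the old block was exactly full, movebytes is 0 and we only switch
	 * blocks; otherwise carry the partial data over to the new block */
	if (movebytes > 0)
		memmove(&data_block->data[0],
				&cstate->pcdata->curr_data_block->data[cstate->raw_buf_index],
				movebytes);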

> +       if (mode == 1)
> +       {
> +               cstate->pcdata->curr_data_block = data_block;
> +               cstate->raw_buf_index = 0;
> +       }
> +       else if(mode == 2)
> +       {
> +               ParallelCopyDataBlock *prev_data_block = NULL;
> +               prev_data_block = cstate->pcdata->curr_data_block;
> +               prev_data_block->following_block = block_pos;
> +               cstate->pcdata->curr_data_block = data_block;
> +
> +               if (prev_data_block->curr_blk_completed  == false)
> +                       prev_data_block->curr_blk_completed = true;
> +
> +               cstate->raw_buf_index = 0;
> +       }
>
> This code is common for both, keep in common flow and remove if (mode == 1)
> cstate->pcdata->curr_data_block = data_block;
> cstate->raw_buf_index = 0;
>

Done.

> +#define CHECK_FIELD_COUNT \
> +{\
> +       if (fld_count == -1) \
> +       { \
> +               if (IsParallelCopy() && \
> +                       !IsLeader()) \
> +                       return true; \
> +               else if (IsParallelCopy() && \
> +                       IsLeader()) \
> +               { \
> +                       if
> (cstate->pcdata->curr_data_block->data[cstate->raw_buf_index +
> sizeof(fld_count)] != 0) \
> +                               ereport(ERROR, \
> +
> (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), \
> +                                               errmsg("received copy
> data after EOF marker"))); \
> +                       return true; \
> +               } \
> We only copy sizeof(fld_count), Shouldn't we check fld_count !=
> cstate->max_fields? Am I missing something here?

The fld_count != cstate->max_fields check is done after the above checks.

> +       if ((DATA_BLOCK_SIZE - cstate->raw_buf_index) >= fld_size)
> +       {
> +               cstate->raw_buf_index = cstate->raw_buf_index + fld_size;
> +       }
> We can keep the check like cstate->raw_buf_index + fld_size < ..., for
> better readability and consistency.
>

I think this is okay as it is. It reads naturally: if the bytes available in
the current data block are greater than or equal to fld_size, then the tuple
data lies in the current data block.

> +static pg_attribute_always_inline void
> +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> +       Oid typioparam, int32 typmod, uint32 *new_block_pos,
> +       int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> +       ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
> flinfo, typioparam & typmod is not used, we can remove the parameter.
>

Done.

> +static pg_attribute_always_inline void
> +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> +       Oid typioparam, int32 typmod, uint32 *new_block_pos,
> +       int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> +       ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
> I felt this function need not be an inline function.

Yes. Changed.

>
> +               /* binary format */
> +               /* for paralle copy leader, fill in the error
> There are some typos, run spell check

Done.

>
> +               /* raw_buf_index should never cross data block size,
> +                * as the required number of data blocks would have
> +                * been obtained in the above while loop.
> +                */
> There are few places, commenting style should be changed to postgres style

Changed.

>
> +       if (cstate->pcdata->curr_data_block == NULL)
> +       {
> +               block_pos = WaitGetFreeCopyBlock(pcshared_info);
> +
> +               cstate->pcdata->curr_data_block =
> &pcshared_info->data_blocks[block_pos];
> +
> +               cstate->raw_buf_index = 0;
> +
> +               readbytes = CopyGetData(cstate,
> &cstate->pcdata->curr_data_block->data, 1, DATA_BLOCK_SIZE);
> +
> +               elog(DEBUG1, "LEADER - bytes read from file %d", readbytes);
> +
> +               if (cstate->reached_eof)
> +                       return true;
> +       }
> There are many empty lines, these are not required.
>

Removed.

>
> +
> +       fld_count = (int16) pg_ntoh16(fld_count);
> +
> +       CHECK_FIELD_COUNT;
> +
> +       cstate->raw_buf_index = cstate->raw_buf_index + sizeof(fld_count);
> +       new_block_pos = pcshared_info->cur_block_pos;
> You can run pg_indent once for the changes.
>

I ran pg_indent and observed that it modifies many places. If we want to run
pg_indent on copy.c for parallel copy alone, we would first need to run it on
plain copy.c, take those changes, and then run it on all the parallel copy
files. I think we had better run pg_indent on all the parallel copy patches
once and for all, maybe just before we finish up all the code reviews.

> +       if (mode == 1)
> +       {
> +               cstate->pcdata->curr_data_block = data_block;
> +               cstate->raw_buf_index = 0;
> +       }
> +       else if(mode == 2)
> +       {
> Could use macros for 1 & 2 for better readability.

Done.

>
> +
> +                               if (following_block_id == -1)
> +                                       break;
> +                       }
> +
> +                       if (following_block_id != -1)
> +
> pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts,
> 1);
> +
> +                       *line_size = *line_size +
> tuple_end_info_ptr->offset + 1;
> +               }
> We could calculate the size as we parse and identify one record, if we
> do that way this can be removed.
>

Done.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Rafia Sabih
Date:
On Sat, 11 Jul 2020 at 08:55, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Thanks Vignesh for the review. Addressed the comments in 0006 patch.

Hi Bharath,

I was looking forward to reviewing this patch set, but unfortunately it shows
a reject in copy.c and might need a rebase.
I was applying it on master over commit
cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb.

-- 
Regards,
Rafia Sabih



Re: Parallel copy

From
Bharath Rupireddy
Date:
>
> Hi Bharath,
>
> I was looking forward to review this patch-set but unfortunately it is
> showing a reject in copy.c, and might need a rebase.
> I was applying on master over the commit-
> cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb.
>

Thanks for showing interest. Please find the patch set rebased to
latest commit b1e48bbe64a411666bb1928b9741e112e267836d.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Amit Kapila
Date:
On Sun, Jul 12, 2020 at 5:48 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> >
> > Hi Bharath,
> >
> > I was looking forward to review this patch-set but unfortunately it is
> > showing a reject in copy.c, and might need a rebase.
> > I was applying on master over the commit-
> > cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb.
> >
>
> Thanks for showing interest. Please find the patch set rebased to
> latest commit b1e48bbe64a411666bb1928b9741e112e267836d.
>

Few comments:
====================
0001-Copy-code-readjustment-to-support-parallel-copy

I am not sure converting the code to macros is a good idea; it makes the code
harder to read.  Also, there are a few changes which I am not sure are
necessary.
1.
+/*
+ * CLEAR_EOL_FROM_COPIED_DATA - Clear EOL from the copied data.
+ */
+#define CLEAR_EOL_FROM_COPIED_DATA(copy_line_data, copy_line_pos,
copy_line_size) \
+{ \
+ /* \
+ * If we didn't hit EOF, then we must have transferred the EOL marker \
+ * to line_buf along with the data.  Get rid of it. \
+ */ \
+   switch (cstate->eol_type) \
+   { \
+    case EOL_NL: \
+    Assert(copy_line_size >= 1); \
+    Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
+    copy_line_data[copy_line_pos - 1] = '\0'; \
+    copy_line_size--; \
+    break; \
+    case EOL_CR: \
+    Assert(copy_line_size >= 1); \
+    Assert(copy_line_data[copy_line_pos - 1] == '\r'); \
+    copy_line_data[copy_line_pos - 1] = '\0'; \
+    copy_line_size--; \
+    break; \
+    case EOL_CRNL: \
+    Assert(copy_line_size >= 2); \
+    Assert(copy_line_data[copy_line_pos - 2] == '\r'); \
+    Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
+    copy_line_data[copy_line_pos - 2] = '\0'; \
+    copy_line_size -= 2; \
+    break; \
+    case EOL_UNKNOWN: \
+    /* shouldn't get here */ \
+    Assert(false); \
+    break; \
+   } \
+}

In the original code, we use only len and buffer; here we use position,
length/size and buffer.  Is that really required, or can we do with just len
and buffer?

2.
+/*
+ * INCREMENTPROCESSED - Increment the lines processed.
+ */
+#define INCREMENTPROCESSED(processed)  \
+processed++;
+
+/*
+ * GETPROCESSED - Get the lines processed.
+ */
+#define GETPROCESSED(processed) \
+return processed;
+

I don't like converting the above to macros.  I don't think converting such
things to macros will buy us much.

0002-Framework-for-leader-worker-in-parallel-copy
3.
 /*
+ * Copy data block information.
+ */
+typedef struct ParallelCopyDataBlock

It would be better to add a few comments atop this data structure to explain
how it is used.

4.
+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * this is protected by the following sequence in the leader & worker.
+ * Leader should operate in the following order:
+ * 1) update first_block, start_offset & cur_lineno in any order.
+ * 2) update line_size.
+ * 3) update line_state.
+ * Worker should operate in the following order:
+ * 1) read line_size.
+ * 2) only one worker should choose one line for processing, this is handled by
+ *    using pg_atomic_compare_exchange_u32, worker will change the sate to
+ *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
+ * 3) read first_block, start_offset & cur_lineno in any order.
+ */
+typedef struct ParallelCopyLineBoundary

Here, you have mentioned how the workers and the leader should operate to make
sure access to the data is sane.  However, you have not explained what the
problem is if they don't do so, and it is not apparent to me.
Also, the purpose of this data structure is not very clear from the comments.

5.
+/*
+ * Circular queue used to store the line information.
+ */
+typedef struct ParallelCopyLineBoundaries
+{
+ /* Position for the leader to populate a line. */
+ uint32 leader_pos;

I don't think the variable needs to be named leader_pos; it is okay to name it
'pos', as the comment above it explains its usage.

7.
+#define DATA_BLOCK_SIZE RAW_BUF_SIZE
+#define RINGSIZE (10 * 1000)
+#define MAX_BLOCKS_COUNT 1000
+#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */

It would be good if you can write a few comments to explain why you
have chosen these default values.

8.
ParallelCopyCommonKeyData: shall we name this SerializedParallelCopyState or
something like that?  For example, see SerializedSnapshotData, which is used
to pass snapshot information to workers.

9.
+CopyCommonInfoForWorker(CopyState cstate, ParallelCopyCommonKeyData
*shared_cstate)

If you agree with point-8, then let's name this as
SerializeParallelCopyState.  See, if there is more usage of similar
types in the patch then lets change those as well.

10.
+ * in the DSM. The specified number of workers will then be launched.
+ *
+ */
+static ParallelContext*
+BeginParallelCopy(int nworkers, CopyState cstate, List *attnamelist, Oid relid)

No need of an extra line with only '*' in the above multi-line comment.

11.
BeginParallelCopy(..)
{
..
+ EstimateLineKeysStr(pcxt, cstate->null_print);
+ EstimateLineKeysStr(pcxt, cstate->null_print_client);
+ EstimateLineKeysStr(pcxt, cstate->delim);
+ EstimateLineKeysStr(pcxt, cstate->quote);
+ EstimateLineKeysStr(pcxt, cstate->escape);
..
}

Why do we need to do this separately for each variable of cstate?
Can't we serialize it along with other members of
SerializeParallelCopyState (a new name for ParallelCopyCommonKeyData)?

12.
BeginParallelCopy(..)
{
..
+ LaunchParallelWorkers(pcxt);
+ if (pcxt->nworkers_launched == 0)
+ {
+ EndParallelCopy(pcxt);
+ elog(WARNING,
+ "No workers available, copy will be run in non-parallel mode");
..
}

I don't see the need to issue a WARNING if we are not able to launch
workers.  We don't do that for other cases where we fail to launch
workers.

13.
+}
+/*
+ * ParallelCopyMain -
..

+}
+/*
+ * ParallelCopyLeader

One line space is required before starting a new function.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
Thanks for the comments Amit.
On Wed, Jul 15, 2020 at 10:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few comments:
> ====================
> 0001-Copy-code-readjustment-to-support-parallel-copy
>
> I am not sure converting the code to macros is a good idea, it makes
> this code harder to read.  Also, there are a few changes which I am
> not sure are necessary.
> 1.
> +/*
> + * CLEAR_EOL_FROM_COPIED_DATA - Clear EOL from the copied data.
> + */
> +#define CLEAR_EOL_FROM_COPIED_DATA(copy_line_data, copy_line_pos,
> copy_line_size) \
> +{ \
> + /* \
> + * If we didn't hit EOF, then we must have transferred the EOL marker \
> + * to line_buf along with the data.  Get rid of it. \
> + */ \
> +   switch (cstate->eol_type) \
> +   { \
> +    case EOL_NL: \
> +    Assert(copy_line_size >= 1); \
> +    Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
> +    copy_line_data[copy_line_pos - 1] = '\0'; \
> +    copy_line_size--; \
> +    break; \
> +    case EOL_CR: \
> +    Assert(copy_line_size >= 1); \
> +    Assert(copy_line_data[copy_line_pos - 1] == '\r'); \
> +    copy_line_data[copy_line_pos - 1] = '\0'; \
> +    copy_line_size--; \
> +    break; \
> +    case EOL_CRNL: \
> +    Assert(copy_line_size >= 2); \
> +    Assert(copy_line_data[copy_line_pos - 2] == '\r'); \
> +    Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
> +    copy_line_data[copy_line_pos - 2] = '\0'; \
> +    copy_line_size -= 2; \
> +    break; \
> +    case EOL_UNKNOWN: \
> +    /* shouldn't get here */ \
> +    Assert(false); \
> +    break; \
> +   } \
> +}
>
> In the original code, we are using only len and buffer, here we are
> using position, length/size and buffer.  Is it really required or can
> we do with just len and buffer?
>

Position is required so that we can have common code for parallel &
non-parallel copy; in the parallel copy case the position & length will
differ, as a line can be spread across multiple data blocks. Retained the
variables as is.
Changed the macro to a function.

> 2.
> +/*
> + * INCREMENTPROCESSED - Increment the lines processed.
> + */
> +#define INCREMENTPROCESSED(processed)  \
> +processed++;
> +
> +/*
> + * GETPROCESSED - Get the lines processed.
> + */
> +#define GETPROCESSED(processed) \
> +return processed;
> +
>
> I don't like converting above to macros.  I don't think converting
> such things to macros will buy us much.
>

This macro will be extended in
0003-Allow-copy-from-command-to-process-data-from-file.patch:
+#define INCREMENTPROCESSED(processed) \
+{ \
+       if (!IsParallelCopy()) \
+               processed++; \
+       else \
+
pg_atomic_add_fetch_u64(&cstate->pcdata->pcshared_info->processed, 1);
\
+}

This needs to be a macro so that it can handle both parallel copy and
non-parallel copy.
I'm retaining this as a macro; if you insist, I can move the change to the
0003-Allow-copy-from-command-to-process-data-from-file.patch patch.


> 0002-Framework-for-leader-worker-in-parallel-copy
> 3.
>  /*
> + * Copy data block information.
> + */
> +typedef struct ParallelCopyDataBlock
>
> It is better to add a few comments atop this data structure to explain
> how it is used?
>

Fixed.

> 4.
> + * ParallelCopyLineBoundary is common data structure between leader & worker,
> + * this is protected by the following sequence in the leader & worker.
> + * Leader should operate in the following order:
> + * 1) update first_block, start_offset & cur_lineno in any order.
> + * 2) update line_size.
> + * 3) update line_state.
> + * Worker should operate in the following order:
> + * 1) read line_size.
> + * 2) only one worker should choose one line for processing, this is handled by
> + *    using pg_atomic_compare_exchange_u32, worker will change the sate to
> + *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
> + * 3) read first_block, start_offset & cur_lineno in any order.
> + */
> +typedef struct ParallelCopyLineBoundary
>
> Here, you have mentioned how workers and leader should operate to make
> sure access to the data is sane.  However, you have not explained what
> is the problem if they don't do so and it is not apparent to me.
> Also, it is not very clear what is the purpose of this data structure
> from comments.
>

Fixed

> 5.
> +/*
> + * Circular queue used to store the line information.
> + */
> +typedef struct ParallelCopyLineBoundaries
> +{
> + /* Position for the leader to populate a line. */
> + uint32 leader_pos;
>
> I don't think the variable needs to be named as leader_pos, it is okay
> to name it is as 'pos' as the comment above it explains its usage.
>

Fixed

> 7.
> +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> +#define RINGSIZE (10 * 1000)
> +#define MAX_BLOCKS_COUNT 1000
> +#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */
>
> It would be good if you can write a few comments to explain why you
> have chosen these default values.
>

Fixed

> 8.
> ParallelCopyCommonKeyData, shall we name this as
> SerializedParallelCopyState or something like that?  For example, see
> SerializedSnapshotData which has been used to pass snapshot
> information to passed to workers.
>

Renamed as suggested

> 9.
> +CopyCommonInfoForWorker(CopyState cstate, ParallelCopyCommonKeyData
> *shared_cstate)
>
> If you agree with point-8, then let's name this as
> SerializeParallelCopyState.  See, if there is more usage of similar
> types in the patch then lets change those as well.
>

Fixed

> 10.
> + * in the DSM. The specified number of workers will then be launched.
> + *
> + */
> +static ParallelContext*
> +BeginParallelCopy(int nworkers, CopyState cstate, List *attnamelist, Oid relid)
>
> No need of an extra line with only '*' in the above multi-line comment.
>

Fixed

> 11.
> BeginParallelCopy(..)
> {
> ..
> + EstimateLineKeysStr(pcxt, cstate->null_print);
> + EstimateLineKeysStr(pcxt, cstate->null_print_client);
> + EstimateLineKeysStr(pcxt, cstate->delim);
> + EstimateLineKeysStr(pcxt, cstate->quote);
> + EstimateLineKeysStr(pcxt, cstate->escape);
> ..
> }
>
> Why do we need to do this separately for each variable of cstate?
> Can't we serialize it along with other members of
> SerializeParallelCopyState (a new name for ParallelCopyCommonKeyData)?
>

These are variable-length string variables; I felt we would not be able to
serialize them along with the other members, so they need to be serialized
separately.

> 12.
> BeginParallelCopy(..)
> {
> ..
> + LaunchParallelWorkers(pcxt);
> + if (pcxt->nworkers_launched == 0)
> + {
> + EndParallelCopy(pcxt);
> + elog(WARNING,
> + "No workers available, copy will be run in non-parallel mode");
> ..
> }
>
> I don't see the need to issue a WARNING if we are not able to launch
> workers.  We don't do that for other cases where we fail to launch
> workers.
>

Fixed

> 13.
> +}
> +/*
> + * ParallelCopyMain -
> ..
>
> +}
> +/*
> + * ParallelCopyLeader
>
> One line space is required before starting a new function.
>

Fixed

Please find the updated patch with the fixes included.


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
vignesh C
Date:
>
> Please find the updated patch with the fixes included.
>

Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
had a few indentation issues; I have fixed them and attached the updated
patch.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Ashutosh Sharma
Date:
Some review comments, mostly on the leader-side code changes:

1) Do we need a DSM key for the FORCE_QUOTE option? I think the FORCE_QUOTE option is only used with COPY TO and not COPY FROM, so I'm not sure why you have added it.

PARALLEL_COPY_KEY_FORCE_QUOTE_LIST

2) Should we be allocating the parallel copy data structure only when it is confirmed that the parallel copy is allowed?

pcdata = (ParallelCopyData *) palloc0(sizeof(ParallelCopyData));
cstate->pcdata = pcdata;

Or, if you want it allocated before confirming whether parallel copy is allowed, then I think it would be good to allocate it in the *cstate->copycontext* memory context so that when EndCopy is called at the end of the COPY FROM operation, the entire context gets deleted, thereby freeing the memory allocated for pcdata (see the sketch below). In fact, it would be good to ensure that all the local memory allocated inside the cstate structure is allocated in the *cstate->copycontext* memory context.
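For the record, the allocation I have in mind is just the following (a sketch
of the idea, not a complete change):

    /* allocate pcdata in the COPY context so that EndCopy's
     * MemoryContextDelete(cstate->copycontext) frees it automatically */
    MemoryContext oldcontext = MemoryContextSwitchTo(cstate->copycontext);

    cstate->pcdata = (ParallelCopyData *) palloc0(sizeof(ParallelCopyData));
    MemoryContextSwitchTo(oldcontext);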

3) Should we allow Parallel Copy when the insert method is CIM_MULTI_CONDITIONAL?

+   /* Check if the insertion mode is single. */
+   if (FindInsertMethod(cstate) == CIM_SINGLE)
+       return false;

I know we have added checks in CopyFrom() to ensure that if any trigger (before row or instead of) is found on any of the partitions being loaded with data, the COPY FROM operation will fail, but does that mean we are okay to perform parallel copy on a partitioned table? Have we done any performance testing with a partitioned table where the data in the input file needs to be routed to different partitions?

4) There are a lot of if-checks in the IsParallelCopyAllowed function that are also checked in the CopyFrom function, which means that in the case of parallel copy those checks get executed multiple times (first by the leader and then again by each worker process). Is that required?

5) Should the worker process be calling this function when the leader has already called it once in ExecBeforeStmtTrigger()?

/* Verify the named relation is a valid target for INSERT */
CheckValidResultRel(resultRelInfo, CMD_INSERT);

6) I think it would be good to rewrite the comments atop ParallelCopyLeader(). From the present comments it appears you were trying to present the information pointwise, but somehow ended up putting it in a paragraph. The comments also have some typos, like *line beaks*, which presumably means line breaks. This is applicable for other comments as well where you

7) Is the following check equivalent to IsWorker()? If so, it would be good to replace it with an IsWorker-like macro to improve readability.

(IsParallelCopy() && !IsLeader())
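That is, something like the following, assuming the two really are equivalent:

#define IsWorker()	(IsParallelCopy() && !IsLeader())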

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


Re: Parallel copy

From
Amit Kapila
Date:
On Fri, Jul 17, 2020 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:
>
> >
> > Please find the updated patch with the fixes included.
> >
>
> Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
> had few indentation issues, I have fixed and attached the patch for
> the same.
>

Please include a version number with each patch series, as that makes it
easier for the reviewer to verify the changes in the latest version of the
patch.  One way is to use a command like "git format-patch -6 -v
<version_of_patch_series>", or you can add the version number manually.

Review comments:
===================

0001-Copy-code-readjustment-to-support-parallel-copy
1.
@@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
  else
  nbytes = 0; /* no data need be saved */

+ if (cstate->copy_dest == COPY_NEW_FE)
+ minread = RAW_BUF_SIZE - nbytes;
+
  inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
-   1, RAW_BUF_SIZE - nbytes);
+   minread, RAW_BUF_SIZE - nbytes);

There is no comment explaining why this change was made.

0002-Framework-for-leader-worker-in-parallel-copy
2.
+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * Leader process will be populating data block, data block offset &
the size of
+ * the record in DSM for the workers to copy the data into the relation.
+ * This is protected by the following sequence in the leader & worker. If they
+ * don't follow this order the worker might process wrong line_size and leader
+ * might populate the information which worker has not yet processed or in the
+ * process of processing.
+ * Leader should operate in the following order:
+ * 1) check if line_size is -1, if not wait, it means worker is still
+ * processing.
+ * 2) set line_state to LINE_LEADER_POPULATING.
+ * 3) update first_block, start_offset & cur_lineno in any order.
+ * 4) update line_size.
+ * 5) update line_state to LINE_LEADER_POPULATED.
+ * Worker should operate in the following order:
+ * 1) check line_state is LINE_LEADER_POPULATED, if not it means
leader is still
+ * populating the data.
+ * 2) read line_size.
+ * 3) only one worker should choose one line for processing, this is handled by
+ *    using pg_atomic_compare_exchange_u32, worker will change the sate to
+ *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
+ * 4) read first_block, start_offset & cur_lineno in any order.
+ * 5) process line_size data.
+ * 6) update line_size to -1.
+ */
+typedef struct ParallelCopyLineBoundary

Are we doing all this state management to avoid using locks while processing
lines?  If so, I think we can use either a spinlock or an LWLock to keep the
main patch simple and then provide a later patch to make it lock-less.  This
will allow us to first focus on the main design of the patch rather than
trying to make this data structure processing lock-less in the best possible
way.

3.
+ /*
+ * Actual lines inserted by worker (some records will be filtered based on
+ * where condition).
+ */
+ pg_atomic_uint64 processed;
+ pg_atomic_uint64 total_worker_processed; /* total processed records
by the workers */

The difference between processed and total_worker_processed is not
clear.  Can we expand the comments a bit?

4.
+ * SerializeList - Insert a list into shared memory.
+ */
+static void
+SerializeList(ParallelContext *pcxt, int key, List *inputlist,
+   Size est_list_size)
+{
+ if (inputlist != NIL)
+ {
+ ParallelCopyKeyListInfo *sharedlistinfo = (ParallelCopyKeyListInfo
*)shm_toc_allocate(pcxt->toc,
+ est_list_size);
+ CopyListSharedMemory(inputlist, est_list_size, sharedlistinfo);
+ shm_toc_insert(pcxt->toc, key, sharedlistinfo);
+ }
+}

Why do we need to write a special mechanism (CopyListSharedMemory) to
serialize a list?  Why can't we use nodeToString?  It should be able to take
care of the List datatype; see outNode, which is called from nodeToString.
Once you do that, I think you won't even need EstimateLineKeysList; strlen
should work instead.

Check whether you have any similar special handling for other types that
could be dealt with via nodeToString.
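Roughly what I have in mind (only a sketch; the key name below is made up):

	/* leader: flatten the list into a plain string and put it in the toc */
	char	   *liststr = nodeToString(inputlist);
	Size		len = strlen(liststr) + 1;
	char	   *listspace = (char *) shm_toc_allocate(pcxt->toc, len);

	memcpy(listspace, liststr, len);
	shm_toc_insert(pcxt->toc, PARALLEL_COPY_KEY_SOME_LIST, listspace);

	/* worker: rebuild the List from the flattened string */
	List	   *restoredlist = (List *) stringToNode(
					shm_toc_lookup(toc, PARALLEL_COPY_KEY_SOME_LIST, false));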

5.
+ MemSet(shared_info_ptr, 0, est_shared_info);
+ shared_info_ptr->is_read_in_progress = true;
+ shared_info_ptr->cur_block_pos = -1;
+ shared_info_ptr->full_transaction_id = full_transaction_id;
+ shared_info_ptr->mycid = GetCurrentCommandId(true);
+ for (count = 0; count < RINGSIZE; count++)
+ {
+ ParallelCopyLineBoundary *lineInfo =
&shared_info_ptr->line_boundaries.ring[count];
+ pg_atomic_init_u32(&(lineInfo->line_size), -1);
+ }
+

You can move this initialization in a separate function.

6.
In function BeginParallelCopy(), you need to keep a provision to
collect wal_usage and buf_usage stats.  See _bt_begin_parallel for
reference.  Those will be required for pg_stat_statements.

7.
DeserializeString() -- it is better to name this function as RestoreString.
ParallelWorkerInitialization() -- it is better to name this function
as InitializeParallelCopyInfo or something like that, the current name
is quite confusing.
ParallelCopyLeader() -- how about ParallelCopyFrom? ParallelCopyLeader
doesn't sound good to me.  You can suggest something else if you don't
like ParallelCopyFrom

8.
 /*
- * PopulateGlobalsForCopyFrom - Populates the common variables
required for copy
- * from operation. This is a helper function for BeginCopy function.
+ * PopulateCatalogInformation - Populates the common variables
required for copy
+ * from operation. This is a helper function for BeginCopy &
+ * ParallelWorkerInitialization function.
  */
 static void
 PopulateGlobalsForCopyFrom(CopyState cstate, TupleDesc tupDesc,
- List *attnamelist)
+    List *attnamelist)

The actual function name and the name in the function header don't match.
I also don't like this function name; how about PopulateCommonCstateInfo?
Similarly, how about changing PopulateCatalogInformation to
PopulateCstateCatalogInfo?

9.
+static const struct
+{
+ char *fn_name;
+ copy_data_source_cb fn_addr;
+} InternalParallelCopyFuncPtrs[] =
+
+{
+ {
+ "copy_read_data", copy_read_data
+ },
+};

The function copy_read_data is present in
src/backend/replication/logical/tablesync.c and seems to be used
during logical replication.  Why do we want to expose this function as
part of this patch?

0003-Allow-copy-from-command-to-process-data-from-file-ST
10.
In the commit message, you have written "The leader does not
participate in the insertion of data, leaders only responsibility will
be to identify the lines as fast as possible for the workers to do the
actual copy operation. The leader waits till all the lines populated
are processed by the workers and exits."

I think you should also mention that we have chosen this design for the
reason that "everything stalls if the leader doesn't accept further input
data, as well as when there are no available split chunks, so it doesn't seem
like a good idea to have the leader do other work.  This is backed by the
performance data where we have seen that with 1 worker there is just a 5-10%
(or whatever percentage difference you have seen) performance difference".

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:

Thanks for your comments, Amit. I have worked on them; my thoughts are mentioned below.

On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 17, 2020 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > >
> > > Please find the updated patch with the fixes included.
> > >
> >
> > Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
> > had few indentation issues, I have fixed and attached the patch for
> > the same.
> >
>
> Ensure to use the version with each patch-series as that makes it
> easier for the reviewer to verify the changes done in the latest
> version of the patch.  One way is to use commands like "git
> format-patch -6 -v <version_of_patch_series>" or you can add the
> version number manually.
>

Taken care.

> Review comments:
> ===================
>
> 0001-Copy-code-readjustment-to-support-parallel-copy
> 1.
> @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
>   else
>   nbytes = 0; /* no data need be saved */
>
> + if (cstate->copy_dest == COPY_NEW_FE)
> + minread = RAW_BUF_SIZE - nbytes;
> +
>   inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> -   1, RAW_BUF_SIZE - nbytes);
> +   minread, RAW_BUF_SIZE - nbytes);
>
> No comment to explain why this change is done?
>
> 0002-Framework-for-leader-worker-in-parallel-copy

Currently CopyGetData copies less data into the buffer than it could, even though space is available, because minread was passed as 1 to CopyGetData. Because of this there are frequent calls to CopyGetData to fetch the data. It loads only some of the data due to the check below:
while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
After reading some data, bytesread becomes greater than minread (which is passed as 1), and the function returns with less data even though there is still space.
This change is required for the parallel copy feature: each time we get a new 64KB DSM data block and copy the data into it, so if we copy less data into the DSM data blocks we might end up consuming all of them. I felt this issue could be fixed on HEAD and have posted a separate thread [1] for it. I'm planning to remove that change here once it gets committed. Can it go as a separate
patch, or should we include it here?
[1] - https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com
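To illustrate the difference in the COPY_NEW_FE path (the calls follow the
quoted hunk; the comments are mine):

	/* old behaviour: minread = 1, so CopyGetData returns after the first
	 * frontend message, with most of the 64KB buffer still empty */
	inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
						  1, RAW_BUF_SIZE - nbytes);

	/* with the change: ask for the whole remaining space, so the loop keeps
	 * reading until the buffer is (almost) full or EOF is reached */
	inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
						  RAW_BUF_SIZE - nbytes, RAW_BUF_SIZE - nbytes);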

> 2.
> + * ParallelCopyLineBoundary is common data structure between leader & worker,
> + * Leader process will be populating data block, data block offset &
> the size of
> + * the record in DSM for the workers to copy the data into the relation.
> + * This is protected by the following sequence in the leader & worker. If they
> + * don't follow this order the worker might process wrong line_size and leader
> + * might populate the information which worker has not yet processed or in the
> + * process of processing.
> + * Leader should operate in the following order:
> + * 1) check if line_size is -1, if not wait, it means worker is still
> + * processing.
> + * 2) set line_state to LINE_LEADER_POPULATING.
> + * 3) update first_block, start_offset & cur_lineno in any order.
> + * 4) update line_size.
> + * 5) update line_state to LINE_LEADER_POPULATED.
> + * Worker should operate in the following order:
> + * 1) check line_state is LINE_LEADER_POPULATED, if not it means
> leader is still
> + * populating the data.
> + * 2) read line_size.
> + * 3) only one worker should choose one line for processing, this is handled by
> + *    using pg_atomic_compare_exchange_u32, worker will change the sate to
> + *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
> + * 4) read first_block, start_offset & cur_lineno in any order.
> + * 5) process line_size data.
> + * 6) update line_size to -1.
> + */
> +typedef struct ParallelCopyLineBoundary
>
> Are we doing all this state management to avoid using locks while
> processing lines?  If so, I think we can use either spinlock or LWLock
> to keep the main patch simple and then provide a later patch to make
> it lock-less.  This will allow us to first focus on the main design of
> the patch rather than trying to make this datastructure processing
> lock-less in the best possible way.
>

The steps would be more or less the same if we used a spinlock too: steps 1, 3 & 4 would be common, and we would have to lock & unlock instead of steps 2 & 5. I feel we can retain the current implementation.
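For reference, the worker-side claim (step 3) is essentially just the
following; this is a sketch based on the comments above, not the patch code
verbatim:

	uint32		expected = LINE_LEADER_POPULATED;

	if (pg_atomic_compare_exchange_u32(&lineInfo->line_state, &expected,
									   LINE_WORKER_PROCESSING))
	{
		/* this worker owns the line: read first_block, start_offset and
		 * cur_lineno, then process line_size bytes of data */

		/* finally hand the slot back to the leader */
		pg_atomic_write_u32(&lineInfo->line_size, -1);
	}

With a spinlock the ordering of these reads & writes stays the same; only the
claim step changes.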

> 3.
> + /*
> + * Actual lines inserted by worker (some records will be filtered based on
> + * where condition).
> + */
> + pg_atomic_uint64 processed;
> + pg_atomic_uint64 total_worker_processed; /* total processed records
> by the workers */
>
> The difference between processed and total_worker_processed is not
> clear.  Can we expand the comments a bit?
>

Fixed

> 4.
> + * SerializeList - Insert a list into shared memory.
> + */
> +static void
> +SerializeList(ParallelContext *pcxt, int key, List *inputlist,
> +   Size est_list_size)
> +{
> + if (inputlist != NIL)
> + {
> + ParallelCopyKeyListInfo *sharedlistinfo = (ParallelCopyKeyListInfo
> *)shm_toc_allocate(pcxt->toc,
> + est_list_size);
> + CopyListSharedMemory(inputlist, est_list_size, sharedlistinfo);
> + shm_toc_insert(pcxt->toc, key, sharedlistinfo);
> + }
> +}
>
> Why do we need to write a special mechanism (CopyListSharedMemory) to
> serialize a list.  Why can't we use nodeToString?  It should be able
> to take care of List datatype, see outNode which is called from
> nodeToString.  Once you do that, I think you won't need even
> EstimateLineKeysList, strlen should work instead.
>
> Check, if you have any similar special handling for other types that
> can be dealt with nodeToString?
>

Fixed

> 5.
> + MemSet(shared_info_ptr, 0, est_shared_info);
> + shared_info_ptr->is_read_in_progress = true;
> + shared_info_ptr->cur_block_pos = -1;
> + shared_info_ptr->full_transaction_id = full_transaction_id;
> + shared_info_ptr->mycid = GetCurrentCommandId(true);
> + for (count = 0; count < RINGSIZE; count++)
> + {
> + ParallelCopyLineBoundary *lineInfo =
> &shared_info_ptr->line_boundaries.ring[count];
> + pg_atomic_init_u32(&(lineInfo->line_size), -1);
> + }
> +
>
> You can move this initialization in a separate function.
>

Fixed

> 6.
> In function BeginParallelCopy(), you need to keep a provision to
> collect wal_usage and buf_usage stats.  See _bt_begin_parallel for
> reference.  Those will be required for pg_stat_statements.
>

Fixed

> 7.
> DeserializeString() -- it is better to name this function as RestoreString.
> ParallelWorkerInitialization() -- it is better to name this function
> as InitializeParallelCopyInfo or something like that, the current name
> is quite confusing.
> ParallelCopyLeader() -- how about ParallelCopyFrom? ParallelCopyLeader
> doesn't sound good to me.  You can suggest something else if you don't
> like ParallelCopyFrom
>

Fixed

> 8.
>  /*
> - * PopulateGlobalsForCopyFrom - Populates the common variables
> required for copy
> - * from operation. This is a helper function for BeginCopy function.
> + * PopulateCatalogInformation - Populates the common variables
> required for copy
> + * from operation. This is a helper function for BeginCopy &
> + * ParallelWorkerInitialization function.
>   */
>  static void
>  PopulateGlobalsForCopyFrom(CopyState cstate, TupleDesc tupDesc,
> - List *attnamelist)
> +    List *attnamelist)
>
> The actual function name and the name in function header don't match.
> I also don't like this function name, how about
> PopulateCommonCstateInfo?  Similarly how about changing
> PopulateCatalogInformation to PopulateCstateCatalogInfo?
>

Fixed

> 9.
> +static const struct
> +{
> + char *fn_name;
> + copy_data_source_cb fn_addr;
> +} InternalParallelCopyFuncPtrs[] =
> +
> +{
> + {
> + "copy_read_data", copy_read_data
> + },
> +};
>
> The function copy_read_data is present in
> src/backend/replication/logical/tablesync.c and seems to be used
> during logical replication.  Why do we want to expose this function as
> part of this patch?
>

I was thinking we could include the framework to support parallelism for logical replication too, and it could be enhanced when needed. I have now removed this in the new patch provided; it can be added back whenever required.

> 0003-Allow-copy-from-command-to-process-data-from-file-ST
> 10.
> In the commit message, you have written "The leader does not
> participate in the insertion of data, leaders only responsibility will
> be to identify the lines as fast as possible for the workers to do the
> actual copy operation. The leader waits till all the lines populated
> are processed by the workers and exits."
>
> I think you should also mention that we have chosen this design based
> on the reason "that everything stalls if the leader doesn't accept
> further input data, as well as when there are no available splitted
> chunks so it doesn't seem like a good idea to have the leader do other
> work.  This is backed by the performance data where we have seen that
> with 1 worker there is just a 5-10% (or whatever percentage difference
> you have seen) performance difference)".

Fixed.
Please find the new patch attached with the fixes.
Thoughts?


On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jul 17, 2020 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:
>
> >
> > Please find the updated patch with the fixes included.
> >
>
> Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
> had few indentation issues, I have fixed and attached the patch for
> the same.
>

Ensure to use the version with each patch-series as that makes it
easier for the reviewer to verify the changes done in the latest
version of the patch.  One way is to use commands like "git
format-patch -6 -v <version_of_patch_series>" or you can add the
version number manually.

Review comments:
===================

0001-Copy-code-readjustment-to-support-parallel-copy
1.
@@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
  else
  nbytes = 0; /* no data need be saved */

+ if (cstate->copy_dest == COPY_NEW_FE)
+ minread = RAW_BUF_SIZE - nbytes;
+
  inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
-   1, RAW_BUF_SIZE - nbytes);
+   minread, RAW_BUF_SIZE - nbytes);

No comment to explain why this change is done?

0002-Framework-for-leader-worker-in-parallel-copy
2.
+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * Leader process will be populating data block, data block offset &
the size of
+ * the record in DSM for the workers to copy the data into the relation.
+ * This is protected by the following sequence in the leader & worker. If they
+ * don't follow this order the worker might process wrong line_size and leader
+ * might populate the information which worker has not yet processed or in the
+ * process of processing.
+ * Leader should operate in the following order:
+ * 1) check if line_size is -1, if not wait, it means worker is still
+ * processing.
+ * 2) set line_state to LINE_LEADER_POPULATING.
+ * 3) update first_block, start_offset & cur_lineno in any order.
+ * 4) update line_size.
+ * 5) update line_state to LINE_LEADER_POPULATED.
+ * Worker should operate in the following order:
+ * 1) check line_state is LINE_LEADER_POPULATED, if not it means
leader is still
+ * populating the data.
+ * 2) read line_size.
+ * 3) only one worker should choose one line for processing, this is handled by
+ *    using pg_atomic_compare_exchange_u32, worker will change the sate to
+ *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
+ * 4) read first_block, start_offset & cur_lineno in any order.
+ * 5) process line_size data.
+ * 6) update line_size to -1.
+ */
+typedef struct ParallelCopyLineBoundary

Are we doing all this state management to avoid using locks while
processing lines?  If so, I think we can use either spinlock or LWLock
to keep the main patch simple and then provide a later patch to make
it lock-less.  This will allow us to first focus on the main design of
the patch rather than trying to make this datastructure processing
lock-less in the best possible way.

3.
+ /*
+ * Actual lines inserted by worker (some records will be filtered based on
+ * where condition).
+ */
+ pg_atomic_uint64 processed;
+ pg_atomic_uint64 total_worker_processed; /* total processed records
by the workers */

The difference between processed and total_worker_processed is not
clear.  Can we expand the comments a bit?

4.
+ * SerializeList - Insert a list into shared memory.
+ */
+static void
+SerializeList(ParallelContext *pcxt, int key, List *inputlist,
+   Size est_list_size)
+{
+ if (inputlist != NIL)
+ {
+ ParallelCopyKeyListInfo *sharedlistinfo = (ParallelCopyKeyListInfo
*)shm_toc_allocate(pcxt->toc,
+ est_list_size);
+ CopyListSharedMemory(inputlist, est_list_size, sharedlistinfo);
+ shm_toc_insert(pcxt->toc, key, sharedlistinfo);
+ }
+}

Why do we need to write a special mechanism (CopyListSharedMemory) to
serialize a list.  Why can't we use nodeToString?  It should be able
to take care of List datatype, see outNode which is called from
nodeToString.  Once you do that, I think you won't need even
EstimateLineKeysList, strlen should work instead.

Check, if you have any similar special handling for other types that
can be dealt with nodeToString?

5.
+ MemSet(shared_info_ptr, 0, est_shared_info);
+ shared_info_ptr->is_read_in_progress = true;
+ shared_info_ptr->cur_block_pos = -1;
+ shared_info_ptr->full_transaction_id = full_transaction_id;
+ shared_info_ptr->mycid = GetCurrentCommandId(true);
+ for (count = 0; count < RINGSIZE; count++)
+ {
+ ParallelCopyLineBoundary *lineInfo =
&shared_info_ptr->line_boundaries.ring[count];
+ pg_atomic_init_u32(&(lineInfo->line_size), -1);
+ }
+

You can move this initialization in a separate function.

6.
In function BeginParallelCopy(), you need to keep a provision to
collect wal_usage and buf_usage stats.  See _bt_begin_parallel for
reference.  Those will be required for pg_stat_statements.

7.
DeserializeString() -- it is better to name this function as RestoreString.
ParallelWorkerInitialization() -- it is better to name this function
as InitializeParallelCopyInfo or something like that, the current name
is quite confusing.
ParallelCopyLeader() -- how about ParallelCopyFrom? ParallelCopyLeader
doesn't sound good to me.  You can suggest something else if you don't
like ParallelCopyFrom

8.
 /*
- * PopulateGlobalsForCopyFrom - Populates the common variables required for copy
- * from operation. This is a helper function for BeginCopy function.
+ * PopulateCatalogInformation - Populates the common variables required for copy
+ * from operation. This is a helper function for BeginCopy &
+ * ParallelWorkerInitialization function.
  */
 static void
 PopulateGlobalsForCopyFrom(CopyState cstate, TupleDesc tupDesc,
- List *attnamelist)
+    List *attnamelist)

The actual function name and the name in function header don't match.
I also don't like this function name, how about
PopulateCommonCstateInfo?  Similarly how about changing
PopulateCatalogInformation to PopulateCstateCatalogInfo?

9.
+static const struct
+{
+ char *fn_name;
+ copy_data_source_cb fn_addr;
+} InternalParallelCopyFuncPtrs[] =
+
+{
+ {
+ "copy_read_data", copy_read_data
+ },
+};

The function copy_read_data is present in
src/backend/replication/logical/tablesync.c and seems to be used
during logical replication.  Why do we want to expose this function as
part of this patch?

0003-Allow-copy-from-command-to-process-data-from-file-ST
10.
In the commit message, you have written "The leader does not
participate in the insertion of data, leaders only responsibility will
be to identify the lines as fast as possible for the workers to do the
actual copy operation. The leader waits till all the lines populated
are processed by the workers and exits."

I think you should also mention that we have chosen this design based
on the reason "that everything stalls if the leader doesn't accept
further input data, as well as when there are no available split
chunks, so it doesn't seem like a good idea to have the leader do other
work.  This is backed by the performance data where we have seen that
with 1 worker there is just a 5-10% (or whatever percentage difference
you have seen) performance difference".

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Parallel copy

From
vignesh C
Date:
Thanks for reviewing and providing the comments Ashutosh.
Please find my thoughts below:

On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Some review comments (mostly) from the leader side code changes:  
>
> 1) Do we need a DSM key for the FORCE_QUOTE option? I think FORCE_QUOTE option is only used with COPY TO and not COPY FROM so not sure why you have added it.
>
> PARALLEL_COPY_KEY_FORCE_QUOTE_LIST
>

Fixed

> 2) Should we be allocating the parallel copy data structure only when it is confirmed that the parallel copy is allowed?
>
> pcdata = (ParallelCopyData *) palloc0(sizeof(ParallelCopyData));
> cstate->pcdata = pcdata;
>
> Or, if you want it to be allocated before confirming if Parallel copy is allowed or not, then I think it would be good to allocate it in *cstate->copycontext* memory context so that when EndCopy is called towards the end of the COPY FROM operation, the entire context itself gets deleted thereby freeing the memory space allocated for pcdata. In fact it would be good to ensure that all the local memory allocated inside the cstate structure gets allocated in the *cstate->copycontext* memory context.
>

Fixed

> 3) Should we allow Parallel Copy when the insert method is CIM_MULTI_CONDITIONAL?
>
> +   /* Check if the insertion mode is single. */
> +   if (FindInsertMethod(cstate) == CIM_SINGLE)
> +       return false;
>
> I know we have added checks in CopyFrom() to ensure that if any trigger (before row or instead of) is found on any of partition being loaded with data, then COPY FROM operation would fail, but does it mean that we are okay to perform parallel copy on partitioned table. Have we done some performance testing with the partitioned table where the data in the input file needs to be routed to the different partitions?
>

Partition data is handled like what Amit had told in one of earlier mails [1].  My colleague Bharath has run performance test with partition table, he will be sharing the results.

> 4) There are lot of if-checks in IsParallelCopyAllowed function that are checked in CopyFrom function as well which means in case of Parallel Copy those checks will get executed multiple times (first by the leader and from second time onwards by each worker process). Is that required?
>

It is called from BeginParallelCopy, so it will be called only once. This change is ok.

> 5) Should the worker process be calling this function when the leader has already called it once in ExecBeforeStmtTrigger()?
>
> /* Verify the named relation is a valid target for INSERT */
> CheckValidResultRel(resultRelInfo, CMD_INSERT);
>

Fixed.

> 6) I think it would be good to re-write the comments atop ParallelCopyLeader(). From the present comments it appears as if you were trying to put the information pointwise but somehow you ended up putting in a paragraph. The comments also have some typos like *line beaks* which possibly means line breaks. This is applicable for other comments as well where you
>

Fixed.

> 7) Is the following checking equivalent to IsWorker()? If so, it would be good to replace it with an IsWorker like macro to increase the readability.
>
> (IsParallelCopy() && !IsLeader())
>

Fixed.

These have been fixed and the new patch is attached as part of my previous mail.
[1] - https://www.postgresql.org/message-id/CAA4eK1LQPxULxw8JpucX0PwzQQRk%3Dq4jG32cU1us2%2B-mtzZUQg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
Bharath Rupireddy
Date:
On Wed, Jul 22, 2020 at 7:56 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks for reviewing and providing the comments Ashutosh.
> Please find my thoughts below:
>
> On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Some review comments (mostly) from the leader side code changes:  
> >
> > 3) Should we allow Parallel Copy when the insert method is CIM_MULTI_CONDITIONAL?
> >
> > +   /* Check if the insertion mode is single. */
> > +   if (FindInsertMethod(cstate) == CIM_SINGLE)
> > +       return false;
> >
> > I know we have added checks in CopyFrom() to ensure that if any trigger (before row or instead of) is found on any of partition being loaded with data, then COPY FROM operation would fail, but does it mean that we are okay to perform parallel copy on partitioned table. Have we done some performance testing with the partitioned table where the data in the input file needs to be routed to the different partitions?
> >
>
> Partition data is handled like what Amit had told in one of earlier mails [1].  My colleague Bharath has run performance test with partition table, he will be sharing the results.
>

I ran tests for partitioned use cases - results are similar to that of non partitioned cases[1].

Test case 1 (exec time in sec): copy from csv file, 5.1GB, 10 million tuples, 4 range partitions, 3 indexes on integer columns, unique data
Test case 2 (exec time in sec): copy from csv file, 5.1GB, 10 million tuples, 4 range partitions, unique data

parallel workers | test case 1     | test case 2
0                | 205.403 (1X)    | 135 (1X)
2                | 114.724 (1.79X) | 59.388 (2.27X)
4                | 99.017 (2.07X)  | 56.742 (2.34X)
8                | 99.722 (2.06X)  | 66.323 (2.03X)
16               | 98.147 (2.09X)  | 66.054 (2.04X)
20               | 97.723 (2.1X)   | 66.389 (2.03X)
30               | 97.048 (2.11X)  | 70.568 (1.91X)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Jul 23, 2020 at 8:51 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, Jul 22, 2020 at 7:56 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks for reviewing and providing the comments Ashutosh.
> Please find my thoughts below:
>
> On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Some review comments (mostly) from the leader side code changes:  
> >
> > 3) Should we allow Parallel Copy when the insert method is CIM_MULTI_CONDITIONAL?
> >
> > +   /* Check if the insertion mode is single. */
> > +   if (FindInsertMethod(cstate) == CIM_SINGLE)
> > +       return false;
> >
> > I know we have added checks in CopyFrom() to ensure that if any trigger (before row or instead of) is found on any of partition being loaded with data, then COPY FROM operation would fail, but does it mean that we are okay to perform parallel copy on partitioned table. Have we done some performance testing with the partitioned table where the data in the input file needs to be routed to the different partitions?
> >
>
> Partition data is handled like what Amit had told in one of earlier mails [1].  My colleague Bharath has run performance test with partition table, he will be sharing the results.
>

I ran tests for partitioned use cases - results are similar to that of non partitioned cases[1].

I could see the gain up to 10-11 times for non-partitioned cases [1], can we use similar test case here as well (with one of the indexes on text column or having gist index) to see its impact?


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
Ashutosh Sharma
Date:
I think, when doing the performance testing for partitioned table, it would be good to also mention about the distribution of data in the input file. One possible data distribution could be that we have let's say 100 tuples in the input file, and every consecutive tuple belongs to a different partition.

On Thu, Jul 23, 2020 at 8:51 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, Jul 22, 2020 at 7:56 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks for reviewing and providing the comments Ashutosh.
> Please find my thoughts below:
>
> On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Some review comments (mostly) from the leader side code changes:  
> >
> > 3) Should we allow Parallel Copy when the insert method is CIM_MULTI_CONDITIONAL?
> >
> > +   /* Check if the insertion mode is single. */
> > +   if (FindInsertMethod(cstate) == CIM_SINGLE)
> > +       return false;
> >
> > I know we have added checks in CopyFrom() to ensure that if any trigger (before row or instead of) is found on any of partition being loaded with data, then COPY FROM operation would fail, but does it mean that we are okay to perform parallel copy on partitioned table. Have we done some performance testing with the partitioned table where the data in the input file needs to be routed to the different partitions?
> >
>
> Partition data is handled like what Amit had told in one of earlier mails [1].  My colleague Bharath has run performance test with partition table, he will be sharing the results.
>

I ran tests for partitioned use cases - results are similar to that of non partitioned cases[1].

Test case 1 (exec time in sec): copy from csv file, 5.1GB, 10 million tuples, 4 range partitions, 3 indexes on integer columns, unique data
Test case 2 (exec time in sec): copy from csv file, 5.1GB, 10 million tuples, 4 range partitions, unique data

parallel workers | test case 1     | test case 2
0                | 205.403 (1X)    | 135 (1X)
2                | 114.724 (1.79X) | 59.388 (2.27X)
4                | 99.017 (2.07X)  | 56.742 (2.34X)
8                | 99.722 (2.06X)  | 66.323 (2.03X)
16               | 98.147 (2.09X)  | 66.054 (2.04X)
20               | 97.723 (2.1X)   | 66.389 (2.03X)
30               | 97.048 (2.11X)  | 70.568 (1.91X)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
Bharath Rupireddy
Date:
On Thu, Jul 23, 2020 at 9:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>>
>> I ran tests for partitioned use cases - results are similar to that of non partitioned cases[1].
>
>
> I could see the gain up to 10-11 times for non-partitioned cases [1], can we use similar test case here as well (with one of the indexes on text column or having gist index) to see its impact?
>
> [1] - https://www.postgresql.org/message-id/CALj2ACVR4WE98Per1H7ajosW8vafN16548O2UV8bG3p4D3XnPg%40mail.gmail.com
>

Thanks Amit! Please find the results of detailed testing done for partitioned use cases:

Range Partitions: consecutive rows go into the same partitions.
Test case 1 (exec time in sec): copy from csv file, 2 indexes on integer columns and 1 index on text column, 4 range partitions
Test case 2 (exec time in sec): copy from csv file, 1 gist index on text column, 4 range partitions
Test case 3 (exec time in sec): copy from csv file, 3 indexes on integer columns, 4 range partitions

parallel workers | test case 1      | test case 2     | test case 3
0                | 1051.924 (1X)    | 785.052 (1X)    | 205.403 (1X)
2                | 589.576 (1.78X)  | 421.974 (1.86X) | 114.724 (1.79X)
4                | 321.960 (3.27X)  | 230.997 (3.4X)  | 99.017 (2.07X)
8                | 199.245 (5.23X)  | 156.132 (5.02X) | 99.722 (2.06X)
16               | 127.343 (8.26X)  | 173.696 (4.52X) | 98.147 (2.09X)
20               | 122.029 (8.62X)  | 186.418 (4.21X) | 97.723 (2.1X)
30               | 142.876 (7.36X)  | 214.598 (3.66X) | 97.048 (2.11X)

On Thu, Jul 23, 2020 at 10:21 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> I think, when doing the performance testing for partitioned table, it would be good to also mention about the distribution of data in the input file. One possible data distribution could be that we have let's say 100 tuples in the input file, and every consecutive tuple belongs to a different partition.
>

To address Ashutosh's point, I used hash partitioning. Hope this helps to clear the doubt.

Hash Partitions: where there are high chances that consecutive rows may go into different partitions.
Test case 1 (exec time in sec): copy from csv file, 2 indexes on integer columns and 1 index on text column, 4 hash partitions
Test case 2 (exec time in sec): copy from csv file, 1 gist index on text column, 4 hash partitions
Test case 3 (exec time in sec): copy from csv file, 3 indexes on integer columns, 4 hash partitions

parallel workers | test case 1      | test case 2     | test case 3
0                | 1060.884 (1X)    | 812.283 (1X)    | 207.745 (1X)
2                | 572.542 (1.85X)  | 418.454 (1.94X) | 107.850 (1.93X)
4                | 298.132 (3.56X)  | 227.367 (3.57X) | 83.895 (2.48X)
8                | 169.449 (6.26X)  | 137.993 (5.89X) | 85.411 (2.43X)
16               | 112.297 (9.45X)  | 95.167 (8.53X)  | 96.136 (2.16X)
20               | 101.546 (10.45X) | 90.552 (8.97X)  | 97.066 (2.14X)
30               | 113.877 (9.32X)  | 127.17 (6.38X)  | 96.819 (2.14X)


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
vignesh C
Date:
The patches were not applying because of the recent commits.
I have rebased the patch over head & attached.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

On Thu, Jul 23, 2020 at 6:07 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Thu, Jul 23, 2020 at 9:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>>
>> I ran tests for partitioned use cases - results are similar to that of non partitioned cases[1].
>
>
> I could see the gain up to 10-11 times for non-partitioned cases [1], can we use similar test case here as well (with one of the indexes on text column or having gist index) to see its impact?
>
> [1] - https://www.postgresql.org/message-id/CALj2ACVR4WE98Per1H7ajosW8vafN16548O2UV8bG3p4D3XnPg%40mail.gmail.com
>

Thanks Amit! Please find the results of detailed testing done for partitioned use cases:

Range Partitions: consecutive rows go into the same partitions.
Test case 1 (exec time in sec): copy from csv file, 2 indexes on integer columns and 1 index on text column, 4 range partitions
Test case 2 (exec time in sec): copy from csv file, 1 gist index on text column, 4 range partitions
Test case 3 (exec time in sec): copy from csv file, 3 indexes on integer columns, 4 range partitions

parallel workers | test case 1      | test case 2     | test case 3
0                | 1051.924 (1X)    | 785.052 (1X)    | 205.403 (1X)
2                | 589.576 (1.78X)  | 421.974 (1.86X) | 114.724 (1.79X)
4                | 321.960 (3.27X)  | 230.997 (3.4X)  | 99.017 (2.07X)
8                | 199.245 (5.23X)  | 156.132 (5.02X) | 99.722 (2.06X)
16               | 127.343 (8.26X)  | 173.696 (4.52X) | 98.147 (2.09X)
20               | 122.029 (8.62X)  | 186.418 (4.21X) | 97.723 (2.1X)
30               | 142.876 (7.36X)  | 214.598 (3.66X) | 97.048 (2.11X)

On Thu, Jul 23, 2020 at 10:21 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> I think, when doing the performance testing for partitioned table, it would be good to also mention about the distribution of data in the input file. One possible data distribution could be that we have let's say 100 tuples in the input file, and every consecutive tuple belongs to a different partition.
>

To address Ashutosh's point, I used hash partitioning. Hope this helps to clear the doubt.

Hash Partitions: where there are high chances that consecutive rows may go into different partitions.
Test case 1 (exec time in sec): copy from csv file, 2 indexes on integer columns and 1 index on text column, 4 hash partitions
Test case 2 (exec time in sec): copy from csv file, 1 gist index on text column, 4 hash partitions
Test case 3 (exec time in sec): copy from csv file, 3 indexes on integer columns, 4 hash partitions

parallel workers | test case 1      | test case 2     | test case 3
0                | 1060.884 (1X)    | 812.283 (1X)    | 207.745 (1X)
2                | 572.542 (1.85X)  | 418.454 (1.94X) | 107.850 (1.93X)
4                | 298.132 (3.56X)  | 227.367 (3.57X) | 83.895 (2.48X)
8                | 169.449 (6.26X)  | 137.993 (5.89X) | 85.411 (2.43X)
16               | 112.297 (9.45X)  | 95.167 (8.53X)  | 96.136 (2.16X)
20               | 101.546 (10.45X) | 90.552 (8.97X)  | 97.066 (2.14X)
30               | 113.877 (9.32X)  | 127.17 (6.38X)  | 96.819 (2.14X)


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Parallel copy

From
Bharath Rupireddy
Date:
On Sat, Aug 1, 2020 at 9:55 AM vignesh C <vignesh21@gmail.com> wrote:
>
> The patches were not applying because of the recent commits.
> I have rebased the patch over head & attached.
>
I rebased v2-0006-Parallel-Copy-For-Binary-Format-Files.patch.

Putting together all the patches rebased on to the latest commit
b8fdee7d0ca8bd2165d46fb1468f75571b706a01. Patches from 0001 to 0005
are rebased by Vignesh, that are from the previous mail and the patch
0006 is rebased by me.

Please consider this patch set for further review.


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Tomas Vondra
Date:
On Mon, Aug 03, 2020 at 12:33:48PM +0530, Bharath Rupireddy wrote:
>On Sat, Aug 1, 2020 at 9:55 AM vignesh C <vignesh21@gmail.com> wrote:
>>
>> The patches were not applying because of the recent commits.
>> I have rebased the patch over head & attached.
>>
>I rebased v2-0006-Parallel-Copy-For-Binary-Format-Files.patch.
>
>Putting together all the patches rebased on to the latest commit
>b8fdee7d0ca8bd2165d46fb1468f75571b706a01. Patches from 0001 to 0005
>are rebased by Vignesh, that are from the previous mail and the patch
>0006 is rebased by me.
>
>Please consider this patch set for further review.
>

I'd suggest incrementing the version every time an updated version is
submitted, even if it's just a rebased version. It makes it clearer
which version of the code is being discussed, etc.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel copy

From
vignesh C
Date:
On Tue, Aug 4, 2020 at 9:51 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Aug 03, 2020 at 12:33:48PM +0530, Bharath Rupireddy wrote:
> >On Sat, Aug 1, 2020 at 9:55 AM vignesh C <vignesh21@gmail.com> wrote:
> >>
> >> The patches were not applying because of the recent commits.
> >> I have rebased the patch over head & attached.
> >>
> >I rebased v2-0006-Parallel-Copy-For-Binary-Format-Files.patch.
> >
> >Putting together all the patches rebased on to the latest commit
> >b8fdee7d0ca8bd2165d46fb1468f75571b706a01. Patches from 0001 to 0005
> >are rebased by Vignesh, that are from the previous mail and the patch
> >0006 is rebased by me.
> >
> >Please consider this patch set for further review.
> >
>
> I'd suggest incrementing the version every time an updated version is
> submitted, even if it's just a rebased version. It makes it clearer
> which version of the code is being discussed, etc.

Sure, we will take care of this when we are sending the next set of patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Greg Nancarrow
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, failed

Hi,

I don't claim to yet understand all of the Postgres internals that this patch is updating and interacting with, so I'm still testing and debugging portions of this patch, but would like to give feedback on what I've noticed so far.

I have done some ad-hoc testing of the patch using parallel copies from text/csv/binary files and have not yet struck any execution problems other than some option validation and associated error messages on boundary cases.
 

One general question that I have: is there a user benefit (over the normal non-parallel COPY) to allowing "COPY ... FROM ... WITH (PARALLEL 1)"?
 


My following comments are broken down by patch:

(1) v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch

(i) Whilst I can't entirely blame these patches for it (as they are following what is already there), I can't help noticing the use of numerous macros in src/backend/commands/copy.c which paste in multiple lines of code in various places.
It's getting a little out-of-hand. Surely the majority of these would be best inline functions instead?
Perhaps hasn't been done because too many parameters need to be passed - thoughts?


(2) v2-0002-Framework-for-leader-worker-in-parallel-copy.patch

(i) minor point: there are some tabbing/spacing issues in this patch (and the other patches), affecting alignment.
e.g. mixed tabs/spaces and misalignment in PARALLEL_COPY_KEY_xxx definitions

(ii)

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be mode of
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50


"This value should be mode of RINGSIZE ..."

-> typo: mode  (mod?  should evenly divide into RINGSIZE?)


(iii)
+ *    using pg_atomic_compare_exchange_u32, worker will change the sate to

->typo: sate  (should be "state")


(iv)

+                         errmsg("parallel option supported only for copy from"),

-> suggest change to:        errmsg("parallel option is supported only for COPY FROM"),

(v)

+            errno = 0; /* To distinguish success/failure after call */
+            val = strtol(str, &endptr, 10);
+
+            /* Check for various possible errors */
+            if ((errno == ERANGE && (val == LONG_MAX || val == LONG_MIN))
+                || (errno != 0 && val == 0) ||
+                *endptr)
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                         errmsg("improper use of argument to option \"%s\"",
+                                defel->defname),
+                         parser_errposition(pstate, defel->location)));
+
+            if (endptr == str)
+               ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                         errmsg("no digits were found in argument to option \"%s\"",
+                                defel->defname),
+                         parser_errposition(pstate, defel->location)));
+
+            cstate->nworkers = (int) val;
+
+            if (cstate->nworkers <= 0)
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                         errmsg("argument to option \"%s\" must be a positive integer greater than zero",
+                                defel->defname),
+                         parser_errposition(pstate, defel->location)));


I think this validation code needs to be improved, including the error messages (e.g. when can a "positive integer" NOT be greater than zero?)
 

There is some overlap in the "no digits were found" case between the two conditions above, depending, for example, if the argument is quoted.
 
Also, "improper use of argument to option" sounds a bit odd and vague to me. 
Finally, not range checking before casting long to int can lead to allowing out-of-range int values like in the following case:
 

test=# copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2147483648');
ERROR:  argument to option "parallel" must be a positive integer greater than zero
LINE 1: copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2...
                                                        ^
BUT the following is allowed...

test=# copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2147483649');
COPY 1000000


I'd suggest to change the above validation code to do similar validation to that for the CREATE TABLE parallel_workers storage parameter (case RELOPT_TYPE_INT in reloptions.c). Like that code, wouldn't it be best to range-check the integer option value to be within a reasonable range, say 1 to 1024, with a corresponding errdetail message if possible?
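
To make that concrete, a sketch of what I have in mind is below (the 1024 cap,
the MAX_PARALLEL_COPY_WORKERS name and the use of defGetString() are
placeholders, and the error wording is only an example):

#define MAX_PARALLEL_COPY_WORKERS	1024	/* placeholder upper bound */

			{
				char	   *str = defGetString(defel);
				char	   *endptr;
				long		val;

				errno = 0;
				val = strtol(str, &endptr, 10);

				/* reject empty input, trailing garbage, overflow and out-of-range values */
				if (endptr == str || *endptr != '\0' || errno == ERANGE ||
					val < 1 || val > MAX_PARALLEL_COPY_WORKERS)
					ereport(ERROR,
							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
							 errmsg("value \"%s\" is out of range for option \"%s\"",
									str, defel->defname),
							 errdetail("Valid values are between %d and %d.",
									   1, MAX_PARALLEL_COPY_WORKERS),
							 parser_errposition(pstate, defel->location)));

				cstate->nworkers = (int) val;
			}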
 


(3) v2-0003-Allow-copy-from-command-to-process-data-from-file.patch

(i)

Patch comment says:

"This feature allows the copy from to leverage multiple CPUs in order to copy
data from file/STDIN to a table. This adds a PARALLEL option to COPY FROM
command where the user can specify the number of workers that can be used
to perform the COPY FROM command. Specifying zero as number of workers will
disable parallelism."

BUT - the changes to ProcessCopyOptions() specified in "v2-0002-Framework-for-leader-worker-in-parallel-copy.patch" do not allow zero workers to be specified - you get an error in that case. Patch comment should be updated accordingly.
 

(ii)

#define GETPROCESSED(processed) \
-return processed;
+if (!IsParallelCopy()) \
+    return processed; \
+else \
+    return pg_atomic_read_u64(&cstate->pcdata->pcshared_info->processed);
+

I think GETPROCESSED would be better named "RETURNPROCESSED".

(iii)

The below comment seems out-of-date with the current code - is it referring to the loop embedded at the bottom of the current loop that the comment is within?
 

+        /*
+         * There is a possibility that the above loop has come out because
+         * data_blk_ptr->curr_blk_completed is set, but dataSize read might
+         * be an old value, if data_blk_ptr->curr_blk_completed and the line is
+         * completed, line_size will be set. Read the line_size again to be
+         * sure if it is complete or partial block.
+         */

(iv)

I may be wrong here, but in the following block of code, isn't there a window of opportunity (however small) in which the line_state might be updated (LINE_WORKER_PROCESSED) by another worker just AFTER pg_atomic_read_u32() returns the current line_state which is put into curr_line_state, such that a write_pos update might be missed? And then a race-condition exists for reading/setting line_size (since line_size gets atomically set after line_state is set)?

If I am wrong in thinking this synchronization might not be correct, maybe the comments could be improved here to explain how this code is safe in that respect.
 


+        /* Get the current line information. */
+        lineInfo = &pcshared_info->line_boundaries.ring[write_pos];
+        curr_line_state = pg_atomic_read_u32(&lineInfo->line_state);
+        if ((write_pos % WORKER_CHUNK_COUNT == 0) &&
+            (curr_line_state == LINE_WORKER_PROCESSED ||
+             curr_line_state == LINE_WORKER_PROCESSING))
+        {
+            pcdata->worker_processed_pos = write_pos;
+            write_pos = (write_pos + WORKER_CHUNK_COUNT) %  RINGSIZE;
+            continue;
+        }
+
+        /* Get the size of this line. */
+        dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+
+        if (dataSize != 0) /* If not an empty line. */
+        {
+            /* Get the block information. */
+            data_blk_ptr = &pcshared_info->data_blocks[lineInfo->first_block];
+
+            if (!data_blk_ptr->curr_blk_completed && (dataSize == -1))
+            {
+                /* Wait till the current line or block is added. */
+                COPY_WAIT_TO_PROCESS()
+                continue;
+            }
+        }
+
+        /* Make sure that no worker has consumed this element. */
+        if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
+                                           &line_state, LINE_WORKER_PROCESSING))
+            break;


(4) v2-0004-Documentation-for-parallel-copy.patch

(i) I think that it is necessary to mention the "max_worker_processes" option in the description of the COPY statement PARALLEL option.
 

For example, something like:

+      Perform <command>COPY FROM</command> in parallel using <replaceable
+      class="parameter"> integer</replaceable> background workers.  Please
+      note that it is not guaranteed that the number of parallel workers
+      specified in <replaceable class="parameter">integer</replaceable> will
+      be used during execution.  It is possible for a copy to run with fewer
+      workers than specified, or even with no workers at all (for example, 
+      due to the setting of max_worker_processes).  This option is allowed
+      only in <command>COPY FROM</command>.
 

(5) v2-0005-Tests-for-parallel-copy.patch

(i) None of the provided tests seem to test beyond "PARALLEL 2" 


(6) v2-0006-Parallel-Copy-For-Binary-Format-Files.patch

(i) In the ParallelCopyFrom() function, "cstate->raw_buf" is pfree()d:

+    /* raw_buf is not used in parallel copy, instead data blocks are used.*/
+    pfree(cstate->raw_buf);


This comment doesn't seem to be entirely true.
At least for text/csv file COPY FROM, cstate->raw_buf is subsequently referenced in the SetRawBufForLoad() function, which is called by CopyReadLineText():
 

    cur_data_blk_ptr = (cstate->raw_buf) ? &pcshared_info->data_blocks[cur_block_pos] : NULL;

So I think cstate->raw_buf should be set to NULL after being pfree()d, and the comment fixed/adjusted.


(ii) This patch adds some macros (involving parallel copy checks) AFTER the comment:

/* End parallel copy Macros */


Regards,
Greg Nancarrow
Fujitsu Australia

Re: Parallel copy

From
vignesh C
Date:
Thanks Greg for reviewing the patch. Please find my thoughts for your comments.

On Wed, Aug 12, 2020 at 9:10 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> I have done some ad-hoc testing of the patch using parallel copies from text/csv/binary files and have not yet struck any execution problems other than some option validation and associated error messages on boundary cases.
>
> One general question that I have: is there a user benefit (over the normal non-parallel COPY) to allowing "COPY ... FROM ... WITH (PARALLEL 1)"?
>

There will be a marginal improvement, as the worker only needs to process the
data and need not do the file reading; the file reading would have been done
by the main process. The real improvement can be seen from 2 workers
onwards.

>
> My following comments are broken down by patch:
>
> (1) v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch
>
> (i) Whilst I can't entirely blame these patches for it (as they are following what is already there), I can't help noticing the use of numerous macros in src/backend/commands/copy.c which paste in multiple lines of code in various places.
> It's getting a little out-of-hand. Surely the majority of these would be best inline functions instead?
> Perhaps hasn't been done because too many parameters need to be passed - thoughts?
>

I felt they have used macros mainly because it is a tight loop and
having macros gives better performance. I have added the macros
CLEAR_EOL_LINE, INCREMENTPROCESSED & GETPROCESSED as there will be a
slight difference between parallel copy & non-parallel copy for these.
In the remaining patches the macros will be extended to include the
parallel copy logic. Instead of having checks in the core logic, I
thought of keeping them as macros so that the readability is good.

>
> (2) v2-0002-Framework-for-leader-worker-in-parallel-copy.patch
>
> (i) minor point: there are some tabbing/spacing issues in this patch (and the other patches), affecting alignment.
> e.g. mixed tabs/spaces and misalignment in PARALLEL_COPY_KEY_xxx definitions
>

Fixed

> (ii)
>
> +/*
> + * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
> + * block to process to avoid lock contention. This value should be mode of
> + * RINGSIZE, as wrap around cases is currently not handled while selecting the
> + * WORKER_CHUNK_COUNT by the worker.
> + */
> +#define WORKER_CHUNK_COUNT 50
>
>
> "This value should be mode of RINGSIZE ..."
>
> -> typo: mode  (mod?  should evenly divide into RINGSIZE?)

Fixed, changed it to divisible by.

> (iii)
> + *    using pg_atomic_compare_exchange_u32, worker will change the sate to
>
> ->typo: sate  (should be "state")

Fixed

> (iv)
>
> +                                                errmsg("parallel option supported only for copy from"),
>
> -> suggest change to:           errmsg("parallel option is supported only for COPY FROM"),
>

Fixed

> (v)
>
> +                       errno = 0; /* To distinguish success/failure after call */
> +                       val = strtol(str, &endptr, 10);
> +
> +                       /* Check for various possible errors */
> +                       if ((errno == ERANGE && (val == LONG_MAX || val == LONG_MIN))
> +                               || (errno != 0 && val == 0) ||
> +                               *endptr)
> +                               ereport(ERROR,
> +                                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                                errmsg("improper use of argument to option \"%s\"",
> +                                                               defel->defname),
> +                                                parser_errposition(pstate, defel->location)));
> +
> +                       if (endptr == str)
> +                          ereport(ERROR,
> +                                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                                errmsg("no digits were found in argument to option \"%s\"",
> +                                                               defel->defname),
> +                                                parser_errposition(pstate, defel->location)));
> +
> +                       cstate->nworkers = (int) val;
> +
> +                       if (cstate->nworkers <= 0)
> +                               ereport(ERROR,
> +                                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                                errmsg("argument to option \"%s\" must be a positive integer greater than zero",
> +                                                               defel->defname),
> +                                                parser_errposition(pstate, defel->location)));
>
>
> I think this validation code needs to be improved, including the error messages (e.g. when can a "positive integer" NOT be greater than zero?)
>
> There is some overlap in the "no digits were found" case between the two conditions above, depending, for example, if the argument is quoted.
> Also, "improper use of argument to option" sounds a bit odd and vague to me.
> Finally, not range checking before casting long to int can lead to allowing out-of-range int values like in the following case:
>
> test=# copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2147483648');
> ERROR:  argument to option "parallel" must be a positive integer greater than zero
> LINE 1: copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2...
>                                                         ^
> BUT the following is allowed...
>
> test=# copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2147483649');
> COPY 1000000
>
>
> I'd suggest to change the above validation code to do similar validation to that for the CREATE TABLE parallel_workers storage parameter (case RELOPT_TYPE_INT in reloptions.c). Like that code, wouldn't it be best to range-check the integer option value to be within a reasonable range, say 1 to 1024, with a corresponding errdetail message if possible?
>

Fixed, changed as suggested.

> (3) v2-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> (i)
>
> Patch comment says:
>
> "This feature allows the copy from to leverage multiple CPUs in order to copy
> data from file/STDIN to a table. This adds a PARALLEL option to COPY FROM
> command where the user can specify the number of workers that can be used
> to perform the COPY FROM command. Specifying zero as number of workers will
> disable parallelism."
>
> BUT - the changes to ProcessCopyOptions() specified in "v2-0002-Framework-for-leader-worker-in-parallel-copy.patch" do not allow zero workers to be specified - you get an error in that case. Patch comment should be updated accordingly.
>

Removed "Specifying zero as number of workers will disable
parallelism". As the new  value is range from 1 to 1024.

> (ii)
>
> #define GETPROCESSED(processed) \
> -return processed;
> +if (!IsParallelCopy()) \
> +       return processed; \
> +else \
> +       return pg_atomic_read_u64(&cstate->pcdata->pcshared_info->processed);
> +
>
> I think GETPROCESSED would be better named "RETURNPROCESSED".
>

Fixed.

> (iii)
>
> The below comment seems out-of-date with the current code - is it referring to the loop embedded at the bottom of the current loop that the comment is within?
>
> +               /*
> +                * There is a possibility that the above loop has come out because
> +                * data_blk_ptr->curr_blk_completed is set, but dataSize read might
> +                * be an old value, if data_blk_ptr->curr_blk_completed and the line is
> +                * completed, line_size will be set. Read the line_size again to be
> +                * sure if it is complete or partial block.
> +                */
>

Updated, it is referring to the embedded loop at the bottom of the current loop.

> (iv)
>
> I may be wrong here, but in the following block of code, isn't there a window of opportunity (however small) in which the line_state might be updated (LINE_WORKER_PROCESSED) by another worker just AFTER pg_atomic_read_u32() returns the current line_state which is put into curr_line_state, such that a write_pos update might be missed? And then a race-condition exists for reading/setting line_size (since line_size gets atomically set after line_state is set)?
> If I am wrong in thinking this synchronization might not be correct, maybe the comments could be improved here to explain how this code is safe in that respect.
>
>
> +               /* Get the current line information. */
> +               lineInfo = &pcshared_info->line_boundaries.ring[write_pos];
> +               curr_line_state = pg_atomic_read_u32(&lineInfo->line_state);
> +               if ((write_pos % WORKER_CHUNK_COUNT == 0) &&
> +                       (curr_line_state == LINE_WORKER_PROCESSED ||
> +                        curr_line_state == LINE_WORKER_PROCESSING))
> +               {
> +                       pcdata->worker_processed_pos = write_pos;
> +                       write_pos = (write_pos + WORKER_CHUNK_COUNT) %  RINGSIZE;
> +                       continue;
> +               }
> +
> +               /* Get the size of this line. */
> +               dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> +
> +               if (dataSize != 0) /* If not an empty line. */
> +               {
> +                       /* Get the block information. */
> +                       data_blk_ptr = &pcshared_info->data_blocks[lineInfo->first_block];
> +
> +                       if (!data_blk_ptr->curr_blk_completed && (dataSize == -1))
> +                       {
> +                               /* Wait till the current line or block is added. */
> +                               COPY_WAIT_TO_PROCESS()
> +                               continue;
> +                       }
> +               }
> +
> +               /* Make sure that no worker has consumed this element. */
> +               if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
> +                                                                                  &line_state, LINE_WORKER_PROCESSING))
> +                       break;
>

This is not possible because of pg_atomic_compare_exchange_u32; this
will succeed only for one of the workers whose line_state is
LINE_LEADER_POPULATED, for the other workers it will fail. This is
explained in detail above ParallelCopyLineBoundary.
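
To spell out the claim step, it boils down to the following pattern
(simplified sketch of the code you quoted above):

		uint32		line_state = LINE_LEADER_POPULATED;

		/*
		 * Atomically move the line from LINE_LEADER_POPULATED to
		 * LINE_WORKER_PROCESSING.  Exactly one worker can win this exchange;
		 * every other worker sees the call fail (with the current state
		 * written back into line_state) and simply moves on to the next
		 * ring element.
		 */
		if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
										   &line_state,
										   LINE_WORKER_PROCESSING))
		{
			/* this worker now owns the line and will process/insert it */
		}
		else
		{
			/* another worker claimed it first; skip ahead */
		}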

>
> (4) v2-0004-Documentation-for-parallel-copy.patch
>
> (i) I think that it is necessary to mention the "max_worker_processes" option in the description of the COPY statement PARALLEL option.
>
> For example, something like:
>
> +      Perform <command>COPY FROM</command> in parallel using <replaceable
> +      class="parameter"> integer</replaceable> background workers.  Please
> +      note that it is not guaranteed that the number of parallel workers
> +      specified in <replaceable class="parameter">integer</replaceable> will
> +      be used during execution.  It is possible for a copy to run with fewer
> +      workers than specified, or even with no workers at all (for example,
> +      due to the setting of max_worker_processes).  This option is allowed
> +      only in <command>COPY FROM</command>.
>

Fixed.

> (5) v2-0005-Tests-for-parallel-copy.patch
>
> (i) None of the provided tests seem to test beyond "PARALLEL 2"
>

I intentionally ran with 1 parallel worker, because when you specify
more than 1 parallel worker the order of record insertion can vary &
there may be random failures.

>
> (6) v2-0006-Parallel-Copy-For-Binary-Format-Files.patch
>
> (i) In the ParallelCopyFrom() function, "cstate->raw_buf" is pfree()d:
>
> +       /* raw_buf is not used in parallel copy, instead data blocks are used.*/
> +       pfree(cstate->raw_buf);
>

raw_buf is not used in parallel copy, instead raw_buf will be pointing
to shared memory data blocks. This memory was allocated as part of
BeginCopyFrom, uptil this point we cannot be 100% sure as copy can be
performed sequentially like in case max_worker_processes is not
available, if it switches to sequential mode raw_buf will be used
while performing copy operation. At this place we can safely free this
memory that was allocated.

> This comment doesn't seem to be entirely true.
> At least for text/csv file COPY FROM, cstate->raw_buf is subsequently referenced in the SetRawBufForLoad() function, which is called by CopyReadLineText():
>
>     cur_data_blk_ptr = (cstate->raw_buf) ? &pcshared_info->data_blocks[cur_block_pos] : NULL;
>
> So I think cstate->raw_buf should be set to NULL after being pfree()d, and the comment fixed/adjusted.
>
>
> (ii) This patch adds some macros (involving parallel copy checks) AFTER the comment:
>
> /* End parallel copy Macros */

Fixed, moved the macros above the comment.

I have attached new set of patches with the fixes.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Greg Nancarrow
Date:
Hi Vignesh,

Some further comments:

(1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be divisible by
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50


"This value should be divisible by RINGSIZE" is not a correct
statement (since obviously 50 is not divisible by 10000).
It should say something like "This value should evenly divide into
RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".


(2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch

(i)

+                       /*
+                        * If the data is present in current block lineInfo. line_size
+                        * will be updated. If the data is spread across the blocks either

Somehow a space has been put between "lineinfo." and "line_size".
It should be: "If the data is present in current block
lineInfo.line_size will be updated"

(ii)

>This is not possible because of pg_atomic_compare_exchange_u32, this
>will succeed only for one of the workers whose line_state is
>LINE_LEADER_POPULATED, for other workers it will fail. This is
>explained in detail above ParallelCopyLineBoundary.

Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
aren't you separately reading line_state and line_size, so that
between those reads, it may have transitioned from leader to another
worker, such that the read line state ("cur_line_state", being checked
in the if block) may not actually match what is now in the line_state
and/or the read line_size ("dataSize") doesn't actually correspond to
the read line state?

(sorry, still not 100% convinced that the synchronization and checks
are safe in all cases)

(3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch

>raw_buf is not used in parallel copy, instead raw_buf will be pointing
>to shared memory data blocks. This memory was allocated as part of
>BeginCopyFrom, uptil this point we cannot be 100% sure as copy can be
>performed sequentially like in case max_worker_processes is not
>available, if it switches to sequential mode raw_buf will be used
>while performing copy operation. At this place we can safely free this
>memory that was allocated

So the following code (which checks raw_buf, which still points to
memory that has been pfreed) is still valid?

  In the SetRawBufForLoad() function, which is called by CopyReadLineText():

    cur_data_blk_ptr = (cstate->raw_buf) ?
&pcshared_info->data_blocks[cur_block_pos] : NULL;

The above code looks a bit dicey to me. I stepped over that line in
the debugger when I debugged an instance of Parallel Copy, so it
definitely gets executed.
It makes me wonder what other code could possibly be checking raw_buf
and using it in some way, when in fact what it points to has been
pfreed.

Are you able to add the following line of code, or will it (somehow)
break logic that you are relying on?

pfree(cstate->raw_buf);
cstate->raw_buf = NULL;               <=== I suggest that this line is added

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel copy

From
vignesh C
Date:
Thanks Greg for reviewing the patch. Please find my thoughts for your comments.

On Mon, Aug 17, 2020 at 9:44 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> Some further comments:
>
> (1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch
>
> +/*
> + * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
> + * block to process to avoid lock contention. This value should be divisible by
> + * RINGSIZE, as wrap around cases is currently not handled while selecting the
> + * WORKER_CHUNK_COUNT by the worker.
> + */
> +#define WORKER_CHUNK_COUNT 50
>
>
> "This value should be divisible by RINGSIZE" is not a correct
> statement (since obviously 50 is not divisible by 10000).
> It should say something like "This value should evenly divide into
> RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".
>

Fixed. Changed it to RINGSIZE should be a multiple of WORKER_CHUNK_COUNT.

> (2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> (i)
>
> +                       /*
> +                        * If the data is present in current block
> lineInfo. line_size
> +                        * will be updated. If the data is spread
> across the blocks either
>
> Somehow a space has been put between "lineinfo." and "line_size".
> It should be: "If the data is present in current block
> lineInfo.line_size will be updated"

Fixed, changed it to lineinfo->line_size.

>
> (ii)
>
> >This is not possible because of pg_atomic_compare_exchange_u32, this
> >will succeed only for one of the workers whose line_state is
> >LINE_LEADER_POPULATED, for other workers it will fail. This is
> >explained in detail above ParallelCopyLineBoundary.
>
> Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
> aren't you separately reading line_state and line_size, so that
> between those reads, it may have transitioned from leader to another
> worker, such that the read line state ("cur_line_state", being checked
> in the if block) may not actually match what is now in the line_state
> and/or the read line_size ("dataSize") doesn't actually correspond to
> the read line state?
>
> (sorry, still not 100% convinced that the synchronization and checks
> are safe in all cases)
>

I think you are describing the problem that could happen in the
following case: when we read curr_line_state, the value was
LINE_WORKER_PROCESSED or LINE_WORKER_PROCESSING. Then, if the leader is
very fast compared to the workers, the leader quickly populates one
line and sets the state to LINE_LEADER_POPULATED, i.e. the state
changes to LINE_LEADER_POPULATED while we are checking
curr_line_state. I feel this will not be a problem because the leader
will populate & wait till some ring element is available to populate.
In the meantime the worker has seen that the state is
LINE_WORKER_PROCESSED or LINE_WORKER_PROCESSING (the previous state
that it read), so the worker has identified that this chunk was
processed by some other worker; the worker will move on and try to get
the next available chunk & insert those records. It will keep
continuing till it gets the next chunk to process. Eventually one of
the workers will get this chunk and process it.

> (3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch
>
> >raw_buf is not used in parallel copy, instead raw_buf will be pointing
> >to shared memory data blocks. This memory was allocated as part of
> >BeginCopyFrom, uptil this point we cannot be 100% sure as copy can be
> >performed sequentially like in case max_worker_processes is not
> >available, if it switches to sequential mode raw_buf will be used
> >while performing copy operation. At this place we can safely free this
> >memory that was allocated
>
> So the following code (which checks raw_buf, which still points to
> memory that has been pfreed) is still valid?
>
>   In the SetRawBufForLoad() function, which is called by CopyReadLineText():
>
>     cur_data_blk_ptr = (cstate->raw_buf) ?
> &pcshared_info->data_blocks[cur_block_pos] : NULL;
>
> The above code looks a bit dicey to me. I stepped over that line in
> the debugger when I debugged an instance of Parallel Copy, so it
> definitely gets executed.
> It makes me wonder what other code could possibly be checking raw_buf
> and using it in some way, when in fact what it points to has been
> pfreed.
>
> Are you able to add the following line of code, or will it (somehow)
> break logic that you are relying on?
>
> pfree(cstate->raw_buf);
> cstate->raw_buf = NULL;               <=== I suggest that this line is added
>

You are right; I have debugged & verified that it sets it to an invalid
block, which is not expected. There are chances this would have caused
some corruption on some machines. The suggested fix is required, and I
have fixed it. I have moved this change to
0003-Allow-copy-from-command-to-process-data-from-file.patch, as
0006-Parallel-Copy-For-Binary-Format-Files is only for binary format
parallel copy & that change is a common change for parallel copy.

I have attached new set of patches with the fixes.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Greg Nancarrow
Date:
> I have attached new set of patches with the fixes.
> Thoughts?

Hi Vignesh,

I don't really have any further comments on the code, but would like
to share some results of some Parallel Copy performance tests I ran
(attached).

The tests loaded a 5GB CSV data file into a 100 column table (of
different data types). The following were varied as part of the test:
- Number of workers (1 – 10)
- No indexes / 4-indexes
- Default settings / increased resources (shared_buffers,work_mem, etc.)

(I did not do any partition-related tests as I believe those type of
tests were previously performed)

I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.


I observed the following trends:
- For the data file size used, Parallel Copy achieved best performance
using about 9 – 10 workers. Larger data files may benefit from using
more workers. However, I couldn’t really see any better performance,
for example, from using 16 workers on a 10GB CSV data file compared to
using 8 workers. Results may also vary depending on machine
characteristics.
- Parallel Copy with 1 worker ran slower than normal Copy in a couple
of cases (I did question if allowing 1 worker was useful in my patch
review).
- Typical load time improvement (load factor) for Parallel Copy was
between 2x and 3x. Better load factors can be obtained by using larger
data files and/or more indexes.
- Increasing Postgres resources made little or no difference to
Parallel Copy performance when the target table had no indexes.
Increasing Postgres resources improved Parallel Copy performance when
the target table had indexes.

Regards,
Greg Nancarrow
Fujitsu Australia

Attachment

Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> > I have attached new set of patches with the fixes.
> > Thoughts?
>
> Hi Vignesh,
>
> I don't really have any further comments on the code, but would like
> to share some results of some Parallel Copy performance tests I ran
> (attached).
>
> The tests loaded a 5GB CSV data file into a 100 column table (of
> different data types). The following were varied as part of the test:
> - Number of workers (1 – 10)
> - No indexes / 4-indexes
> - Default settings / increased resources (shared_buffers,work_mem, etc.)
>
> (I did not do any partition-related tests as I believe those type of
> tests were previously performed)
>
> I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
> The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.
>
>
> I observed the following trends:
> - For the data file size used, Parallel Copy achieved best performance
> using about 9 – 10 workers. Larger data files may benefit from using
> more workers. However, I couldn’t really see any better performance,
> for example, from using 16 workers on a 10GB CSV data file compared to
> using 8 workers. Results may also vary depending on machine
> characteristics.
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question if allowing 1 worker was useful in my patch
> review).

I think the reason is that for 1 worker case there is not much
parallelization as a leader doesn't perform the actual load work.
Vignesh, can you please once see if the results are reproducible at
your end, if so, we can once compare the perf profiles to see why in
some cases we get improvement and in other cases not. Based on that we
can decide whether to allow the 1 worker case or not.

> - Typical load time improvement (load factor) for Parallel Copy was
> between 2x and 3x. Better load factors can be obtained by using larger
> data files and/or more indexes.
>

Nice improvement and I think you are right that with larger load data
we will get even better improvement.

--
With Regards,
Amit Kapila.



Re: Parallel copy

From
vignesh C
Date:
On Thu, Aug 27, 2020 at 8:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> >
> > > I have attached new set of patches with the fixes.
> > > Thoughts?
> >
> > Hi Vignesh,
> >
> > I don't really have any further comments on the code, but would like
> > to share some results of some Parallel Copy performance tests I ran
> > (attached).
> >
> > The tests loaded a 5GB CSV data file into a 100 column table (of
> > different data types). The following were varied as part of the test:
> > - Number of workers (1 – 10)
> > - No indexes / 4-indexes
> > - Default settings / increased resources (shared_buffers,work_mem, etc.)
> >
> > (I did not do any partition-related tests as I believe those type of
> > tests were previously performed)
> >
> > I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
> > The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.
> >
> >
> > I observed the following trends:
> > - For the data file size used, Parallel Copy achieved best performance
> > using about 9 – 10 workers. Larger data files may benefit from using
> > more workers. However, I couldn’t really see any better performance,
> > for example, from using 16 workers on a 10GB CSV data file compared to
> > using 8 workers. Results may also vary depending on machine
> > characteristics.
> > - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> > of cases (I did question if allowing 1 worker was useful in my patch
> > review).
>
> I think the reason is that for 1 worker case there is not much
> parallelization as a leader doesn't perform the actual load work.
> Vignesh, can you please once see if the results are reproducible at
> your end, if so, we can once compare the perf profiles to see why in
> some cases we get improvement and in other cases not. Based on that we
> can decide whether to allow the 1 worker case or not.
>

I will spend some time on this and update.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Aug 27, 2020 at 4:56 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, Aug 27, 2020 at 8:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > >
> > > > I have attached new set of patches with the fixes.
> > > > Thoughts?
> > >
> > > Hi Vignesh,
> > >
> > > I don't really have any further comments on the code, but would like
> > > to share some results of some Parallel Copy performance tests I ran
> > > (attached).
> > >
> > > The tests loaded a 5GB CSV data file into a 100 column table (of
> > > different data types). The following were varied as part of the test:
> > > - Number of workers (1 – 10)
> > > - No indexes / 4-indexes
> > > - Default settings / increased resources (shared_buffers,work_mem, etc.)
> > >
> > > (I did not do any partition-related tests as I believe those type of
> > > tests were previously performed)
> > >
> > > I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
> > > The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.
> > >
> > >
> > > I observed the following trends:
> > > - For the data file size used, Parallel Copy achieved best performance
> > > using about 9 – 10 workers. Larger data files may benefit from using
> > > more workers. However, I couldn’t really see any better performance,
> > > for example, from using 16 workers on a 10GB CSV data file compared to
> > > using 8 workers. Results may also vary depending on machine
> > > characteristics.
> > > - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> > > of cases (I did question if allowing 1 worker was useful in my patch
> > > review).
> >
> > I think the reason is that for 1 worker case there is not much
> > parallelization as a leader doesn't perform the actual load work.
> > Vignesh, can you please once see if the results are reproducible at
> > your end, if so, we can once compare the perf profiles to see why in
> > some cases we get improvement and in other cases not. Based on that we
> > can decide whether to allow the 1 worker case or not.
> >
>
> I will spend some time on this and update.
>

Thanks.

--
With Regards,
Amit Kapila.



Re: Parallel copy

From
vignesh C
Date:


On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question if allowing 1 worker was useful in my patch
> review).

Thanks Greg for your review & testing.
I had executed various tests with 1GB, 2GB & 5GB CSV files, each with 100 columns, without parallel mode & with 1 parallel worker. The test results are given below:
Test                                                 Without parallel mode   With 1 parallel worker
====================================================================================================
1GB csv file, 100 columns (100 bytes per column)     62 seconds              47 seconds (1.32X)
1GB csv file, 100 columns (1000 bytes per column)    89 seconds              78 seconds (1.14X)
2GB csv file, 100 columns (1 byte per column)        277 seconds             256 seconds (1.08X)
5GB csv file, 100 columns (100 bytes per column)     515 seconds             445 seconds (1.16X)

I have run the tests multiple times and noticed similar execution times in all the runs for the above tests.
In the above results there is a slight improvement with 1 worker. In my tests I did not observe any degradation for copy with 1 worker compared to non-parallel copy. Can you share with me the script you used to generate the data & the DDL of the table, so that it will help me check the scenario where you faced the problem?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
Greg Nancarrow
Date:
Hi Vignesh,

> Can you share with me the script you used to generate the data & the DDL of the table, so that it will help me check
> the scenario where you faced the problem?

Unfortunately I can't directly share it (considered company IP),
though having said that it's only doing something that is relatively
simple and unremarkable, so I'd expect it to be much like what you are
currently doing. I can describe it in general.

The table being used contains 100 columns (as I pointed out earlier),
with the first column of "bigserial" type, and the others of different
types like "character varying(255)", "numeric", "date" and "time
without timezone". There's about 60 of the "character varying(255)"
overall, with the other types interspersed.

When testing with indexes, 4 b-tree indexes were used that each
included the first column and then distinctly 9 other columns.

A CSV record (row) template file was created with test data
(corresponding to the table), and that was simply copied and appended
over and over with a record prefix in order to create the test data
file.
The following shell-script basically does it (but very slowly). I was
using a small C program to do similar, a lot faster.
In my case, N=2550000 produced about a 5GB CSV file.

    file_out=data.csv; for i in $(seq 1 $N); do echo -n "$i," >> $file_out;
    cat sample_record.csv >> $file_out; done

One other thing I should mention is that between each test run, I
cleared the OS page cache, as described here:
https://linuxhint.com/clear_cache_linux/
That way, each COPY FROM is not taking advantage of any OS-cached data
from a previous COPY FROM.

If your data is somehow significantly different and you want to (and
can) share your script, then I can try it in my environment.


Regards,
Greg



Re: Parallel copy

From
vignesh C
Date:
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Hi Vignesh,
>
> > Can you share with me the script you used to generate the data & the DDL of the table, so that it will help me check
> > the scenario where you faced the problem?
>
> Unfortunately I can't directly share it (considered company IP),
> though having said that it's only doing something that is relatively
> simple and unremarkable, so I'd expect it to be much like what you are
> currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier),
> with the first column of "bigserial" type, and the others of different
> types like "character varying(255)", "numeric", "date" and "time
> without timezone". There's about 60 of the "character varying(255)"
> overall, with the other types interspersed.
>
> When testing with indexes, 4 b-tree indexes were used that each
> included the first column and then distinctly 9 other columns.
>
> A CSV record (row) template file was created with test data
> (corresponding to the table), and that was simply copied and appended
> over and over with a record prefix in order to create the test data
> file.
> The following shell-script basically does it (but very slowly). I was
> using a small C program to do similar, a lot faster.
> In my case, N=2550000 produced about a 5GB CSV file.
>
>     file_out=data.csv; for i in $(seq 1 $N); do echo -n "$i," >> $file_out;
>     cat sample_record.csv >> $file_out; done
>
> One other thing I should mention is that between each test run, I
> cleared the OS page cache, as described here:
> https://linuxhint.com/clear_cache_linux/
> That way, each COPY FROM is not taking advantage of any OS-cached data
> from a previous COPY FROM.

I will try with a similar test and check if I can reproduce.

> If your data is somehow significantly different and you want to (and
> can) share your script, then I can try it in my environment.

I have attached the scripts that I used for the test results I
mentioned in my previous mail. create.sql file has the table that I
used, insert_data_gen.txt has the insert data generation scripts. I
varied the count in insert_data_gen to generate csv files of 1GB, 2GB
& 5GB & varied the data to generate 1 char, 10 char & 100 char for
each column for various testing. You can rename insert_data_gen.txt to
insert_data_gen.sh & generate the csv file.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Greg Nancarrow
Date:
>On Wed, Sep 2, 2020 at 3:40 PM vignesh C <vignesh21@gmail.com> wrote:
> I have attached the scripts that I used for the test results I
> mentioned in my previous mail. create.sql file has the table that I
> used, insert_data_gen.txt has the insert data generation scripts. I
> varied the count in insert_data_gen to generate csv files of 1GB, 2GB
> & 5GB & varied the data to generate 1 char, 10 char & 100 char for
> each column for various testing. You can rename insert_data_gen.txt to
> insert_data_gen.sh & generate the csv file.


Hi Vignesh,

I used your script and table definition, multiplying the number of
records to produce a 5GB and 9.5GB CSV file.
I got the following results:


(1) Postgres default settings, 5GB CSV (530000 rows):

Copy Type            Duration (s)   Load factor
===============================================
Normal Copy          132.197         -

Parallel Copy
(#workers)
1                    98.428          1.34
2                    52.753          2.51
3                    37.630          3.51
4                    33.554          3.94
5                    33.636          3.93
6                    33.821          3.91
7                    34.270          3.86
8                    34.465          3.84
9                    34.315          3.85
10                   33.543          3.94


(2) Postgres increased resources, 5GB CSV (530000 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size=2GB


Copy Type            Duration (s)   Load factor
===============================================
Normal Copy          131.835         -

Parallel Copy
(#workers)
1                    98.301          1.34
2                    53.261          2.48
3                    37.868          3.48
4                    34.224          3.85
5                    33.831          3.90
6                    34.229          3.85
7                    34.512          3.82
8                    34.303          3.84
9                    34.690          3.80
10                   34.479          3.82



(3) Postgres default settings, 9.5GB CSV (1000000 rows):

Copy Type            Duration (s)   Load factor
===============================================
Normal Copy          248.503         -

Parallel Copy
(#workers)
1                    185.724         1.34
2                    99.832          2.49
3                    70.560          3.52
4                    63.328          3.92
5                    63.182          3.93
6                    64.108          3.88
7                    64.131          3.87
8                    64.350          3.86
9                    64.293          3.87
10                   63.818          3.89


(4) Postgres increased resources, 9.5GB CSV (1000000 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size=2GB


Copy Type            Duration (s)   Load factor
===============================================
Normal Copy          248.647        -

Parallel Copy
(#workers)
1                    182.236        1.36
2                    92.814         2.68
3                    67.347         3.69
4                    63.839         3.89
5                    62.672         3.97
6                    63.873         3.89
7                    64.930         3.83
8                    63.885         3.89
9                    62.397         3.98
10                   64.477         3.86



So as you found, with this particular table definition and data, 1
parallel worker always performs better than normal copy.
The different result obtained for this particular case seems to be
caused by the following factors:
- different table definition (I used a variety of column types)
- amount of data per row (I used less data per row, so more rows per
same size data file)

As I previously observed, if the target table has no indexes,
increasing resources beyond the default settings makes little
difference to the performance.

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel copy

From
vignesh C
Date:
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Hi Vignesh,
>
> > Can you share with me the script you used to generate the data & the DDL of the table, so that it will help me check
> > the scenario where you faced the problem?
>
> Unfortunately I can't directly share it (considered company IP),
> though having said that it's only doing something that is relatively
> simple and unremarkable, so I'd expect it to be much like what you are
> currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier),
> with the first column of "bigserial" type, and the others of different
> types like "character varying(255)", "numeric", "date" and "time
> without timezone". There's about 60 of the "character varying(255)"
> overall, with the other types interspersed.
>

Thanks Greg for executing & sharing the results.
I tried a test case similar to the one you suggested, but I was not
able to reproduce the degradation scenario.
If possible, can you run perf for the 1 worker & non-parallel
scenarios & share the perf results? By comparing the perf reports we
will be able to find out which of the functions is consuming more
time.
Steps for running perf:
1) get the postgres pid
2) perf record -a -g -p <above pid>
3) Run copy command
4) Execute "perf report -g" once copy finishes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Bharath Rupireddy
Date:
On Fri, Sep 11, 2020 at 3:49 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> I couldn't use the original machine from which I obtained the previous
> results, but ended up using a 4-core CentOS7 VM, which showed a
> similar pattern in the performance results for this test case.
> I obtained the following results from loading a 2GB CSV file (1000000
> rows, 4 indexes):
>
> Copy Type            Duration (s)          Load factor
> ===============================================
> Normal Copy        190.891                -
>
> Parallel Copy
> (#workers)
> 1                            210.947               0.90
>
Hi Greg,

I tried to recreate the test case (attached) and I didn't find much
difference with the custom postgresql.conf file.
Test case: 250000 tuples, 4 indexes (composite indexes with 10
columns), 3.7GB, 100 columns (as suggested by you, with all the
varchar(255) columns containing 255 characters), exec time in sec.

With custom postgresql.conf[1], the data directory was removed and
recreated after every run (I couldn't perform the OS page cache flush
for some reasons, so I chose to recreate the data directory instead,
for testing purposes):
 HEAD: 129.547, 128.624, 128.890
 Patch: 0 workers - 130.213, 131.298, 130.555
 Patch: 1 worker - 127.757, 125.560, 128.275

With default postgresql.conf, the data directory was removed and recreated
after every run:
 HEAD: 138.276, 150.472, 153.304
 Patch: 0 workers - 162.468,  149.423, 159.137
 Patch: 1 worker - 136.055, 144.250, 137.916

Few questions:
 1. Was the run performed with default postgresql.conf file? If not,
what are the changed configurations?
 2. Are the readings for normal copy(190.891sec, mentioned by you
above) taken on HEAD or with patch, 0 workers? How much is the runtime
with your test case on HEAD(Without patch) and 0 workers(With patch)?
 3. Was the run performed on release build?
 4. Were the readings taken on multiple runs(say 3 or 4 times)?

[1] - Postgres configuration used for above testing:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Greg Nancarrow
Date:
Hi Bharath,

On Tue, Sep 15, 2020 at 11:49 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Few questions:
>  1. Was the run performed with default postgresql.conf file? If not,
> what are the changed configurations?
Yes, just default settings.

>  2. Are the readings for normal copy(190.891sec, mentioned by you
> above) taken on HEAD or with patch, 0 workers?
With patch

>How much is the runtime
> with your test case on HEAD(Without patch) and 0 workers(With patch)?
TBH, I didn't test that. Looking at the changes, I wouldn't expect a
degradation of performance for normal copy (you have tested, right?).

>  3. Was the run performed on release build?
For generating the perf data I sent (normal copy vs parallel copy with
1 worker), I used a debug build (-g -O0), as that is needed for
generating all the relevant perf data for Postgres code. Previously I
ran with a release build (-O2).

>  4. Were the readings taken on multiple runs(say 3 or 4 times)?
The readings I sent were from just one run (not averaged), but I did
run the tests several times to verify the readings were representative
of the pattern I was seeing.


Fortunately I have been given permission to share the exact table
definition and data I used, so you can check the behaviour and timings
on your own test machine.
Please see the attachment.
You can create the table using the table.sql and index_4.sql
definitions in the "sql" directory.
The data.csv file (to be loaded by COPY) can be created with the
included "dupdata" tool in the "input" directory, which you need to
build, then run, specifying a suitable number of records and path of
the template record (see README). Obviously the larger the number of
records, the larger the file ...
The table can then be loaded using COPY with "format csv" (and
"parallel N" if testing parallel copy).

Regards,
Greg Nancarrow
Fujitsu Australia

Attachment

Re: Parallel copy

From
Ashutosh Sharma
Date:
Hi Vignesh,

I've spent some time today looking at your new set of patches and I've
some thoughts and queries which I would like to put here:

Why are these not part of the shared cstate structure?

    SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
    SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
    SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
    SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);

I think in the refactoring patch we could replace all the cstate
variables that would be shared between the leader and workers with a
common structure which would be used even for a serial copy. Thoughts?
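
For illustration, a minimal sketch of what such a common structure
could look like (the field names are only examples, not taken from the
patch; strings are kept as offsets into a trailing buffer so the whole
thing stays flat and can be copied into a single DSM chunk):

#include "postgres.h"

/*
 * Sketch only: one flat structure holding the COPY options that both the
 * leader and the workers need, usable for serial copy as well.
 */
typedef struct CommonCopyOptionsData
{
	/* fixed-size options */
	bool		binary;
	bool		csv_mode;
	bool		header_line;
	int			file_encoding;

	/* offsets of the option strings within data[] */
	Size		null_print_off;
	Size		delim_off;
	Size		quote_off;
	Size		escape_off;

	Size		total_len;		/* size of data[] */
	char		data[FLEXIBLE_ARRAY_MEMBER];
} CommonCopyOptionsData;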

--

Have you tested your patch when encoding conversion is needed? If so,
could you please point out the email that has the test results.

--

Apart from above, I've noticed some cosmetic errors which I am sharing here:

+#define    IsParallelCopy()        (cstate->is_parallel)
+#define IsLeader()             (cstate->pcdata->is_leader)

This doesn't look to be properly aligned.

--

+   shared_info_ptr = (ParallelCopyShmInfo *)
shm_toc_allocate(pcxt->toc, sizeof(ParallelCopyShmInfo));
+   PopulateParallelCopyShmInfo(shared_info_ptr, full_transaction_id);

..

+   /* Store shared build state, for which we reserved space. */
+   shared_cstate = (SerializedParallelCopyState
*)shm_toc_allocate(pcxt->toc, est_cstateshared);

In the first case, while typecasting you've added a space between the
typename and the function but that is missing in the second case. I
think it would be good if you could make it consistent.

Same comment applies here as well:

+   pg_atomic_uint32    line_state;     /* line state */
+   uint64              cur_lineno;     /* line number for error messages */
+}ParallelCopyLineBoundary;

...

+   CommandId                   mycid;  /* command id */
+   ParallelCopyLineBoundaries  line_boundaries; /* line array */
+} ParallelCopyShmInfo;

There is no space between the closing brace and the structure name in
the first case but it is in the second one. So, again this doesn't
look consistent.

I could also find this type of inconsistency in comments. See below:

+/* It can hold upto 10000 record information for worker to process. RINGSIZE
+ * should be a multiple of WORKER_CHUNK_COUNT, as wrap around cases
is currently
+ * not handled while selecting the WORKER_CHUNK_COUNT by the worker. */
+#define RINGSIZE (10 * 1000)

...

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. Read RINGSIZE comments before
+ * changing this value.
+ */
+#define WORKER_CHUNK_COUNT 50

You may see these kinds of errors at other places as well if you scan
through your patch.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Wed, Aug 19, 2020 at 11:51 AM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks Greg for reviewing the patch. Please find my thoughts for your comments.
>
> On Mon, Aug 17, 2020 at 9:44 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > Some further comments:
> >
> > (1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch
> >
> > +/*
> > + * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
> > + * block to process to avoid lock contention. This value should be divisible by
> > + * RINGSIZE, as wrap around cases is currently not handled while selecting the
> > + * WORKER_CHUNK_COUNT by the worker.
> > + */
> > +#define WORKER_CHUNK_COUNT 50
> >
> >
> > "This value should be divisible by RINGSIZE" is not a correct
> > statement (since obviously 50 is not divisible by 10000).
> > It should say something like "This value should evenly divide into
> > RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".
> >
>
> Fixed. Changed it to RINGSIZE should be a multiple of WORKER_CHUNK_COUNT.
>
> > (2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch
> >
> > (i)
> >
> > +                       /*
> > +                        * If the data is present in current block
> > lineInfo. line_size
> > +                        * will be updated. If the data is spread
> > across the blocks either
> >
> > Somehow a space has been put between "lineinfo." and "line_size".
> > It should be: "If the data is present in current block
> > lineInfo.line_size will be updated"
>
> Fixed, changed it to lineinfo->line_size.
>
> >
> > (ii)
> >
> > >This is not possible because of pg_atomic_compare_exchange_u32, this
> > >will succeed only for one of the workers whose line_state is
> > >LINE_LEADER_POPULATED, for other workers it will fail. This is
> > >explained in detail above ParallelCopyLineBoundary.
> >
> > Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
> > aren't you separately reading line_state and line_state, so that
> > between those reads, it may have transitioned from leader to another
> > worker, such that the read line state ("cur_line_state", being checked
> > in the if block) may not actually match what is now in the line_state
> > and/or the read line_size ("dataSize") doesn't actually correspond to
> > the read line state?
> >
> > (sorry, still not 100% convinced that the synchronization and checks
> > are safe in all cases)
> >
>
> I think that you are describing about the problem could happen in the
> following case:
> when we read curr_line_state, the value was LINE_WORKER_PROCESSED or
> LINE_WORKER_PROCESSING. Then in some cases if the leader is very fast
> compared to the workers then the leader quickly populates one line and
> sets the state to LINE_LEADER_POPULATED. State is changed to
> LINE_LEADER_POPULATED when we are checking the curr_line_state.
> I feel this will not be a problem because, Leader will populate & wait
> till some RING element is available to populate. In the meantime
> worker has seen that state is LINE_WORKER_PROCESSED or
> LINE_WORKER_PROCESSING(previous state that it read), worker has
> identified that this chunk was processed by some other worker, worker
> will move and try to get the next available chunk & insert those
> records. It will keep continuing till it gets the next chunk to
> process. Eventually one of the workers will get this chunk and process
> it.
>
> > (3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch
> >
> > >raw_buf is not used in parallel copy, instead raw_buf will be pointing
> > >to shared memory data blocks. This memory was allocated as part of
> > >BeginCopyFrom, uptil this point we cannot be 100% sure as copy can be
> > >performed sequentially like in case max_worker_processes is not
> > >available, if it switches to sequential mode raw_buf will be used
> > >while performing copy operation. At this place we can safely free this
> > >memory that was allocated
> >
> > So the following code (which checks raw_buf, which still points to
> > memory that has been pfreed) is still valid?
> >
> >   In the SetRawBufForLoad() function, which is called by CopyReadLineText():
> >
> >     cur_data_blk_ptr = (cstate->raw_buf) ?
> > &pcshared_info->data_blocks[cur_block_pos] : NULL;
> >
> > The above code looks a bit dicey to me. I stepped over that line in
> > the debugger when I debugged an instance of Parallel Copy, so it
> > definitely gets executed.
> > It makes me wonder what other code could possibly be checking raw_buf
> > and using it in some way, when in fact what it points to has been
> > pfreed.
> >
> > Are you able to add the following line of code, or will it (somehow)
> > break logic that you are relying on?
> >
> > pfree(cstate->raw_buf);
> > cstate->raw_buf = NULL;               <=== I suggest that this line is added
> >
>
> You are right, I have debugged & verified it sets it to an invalid
> block which is not expected. There are chances this would have caused
> some corruption in some machines. The suggested fix is required, I
> have fixed it. I have moved this change to
> 0003-Allow-copy-from-command-to-process-data-from-file.patch as
> 0006-Parallel-Copy-For-Binary-Format-Files is only for Binary format
> parallel copy & that change is common change for parallel copy.
>
> I have attached new set of patches with the fixes.
> Thoughts?
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Bharath Rupireddy
Date:
On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Fortunately I have been given permission to share the exact table
> definition and data I used, so you can check the behaviour and timings
> on your own test machine.
>

Thanks Greg for the script. I ran your test case and I didn't observe
any increase in exec time with 1 worker, indeed, we have benefitted a
few seconds from 0 to 1 worker as expected.

Execution time is in seconds. Each test case is executed 3 times on
release build. Each time the data directory is recreated.

Case 1: 1000000 rows, 2GB
With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423
With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678

With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822
With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573

Case 2: 2550000 rows, 5GB
With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683
With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307

With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049
With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090

[1] - Custom configuration is set up to ensure that no other processes
influence the results. The postgresql.conf used:
shared_buffers = 40GB
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
Thanks Ashutosh for your comments.

On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Vignesh,
>
> I've spent some time today looking at your new set of patches and I've
> some thoughts and queries which I would like to put here:
>
> Why are these not part of the shared cstate structure?
>
>     SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
>     SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
>     SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
>     SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
>

I have used shared_cstate mainly to share the integer & bool data
types from the leader to worker process. The above data types are of
char* data type, I will not be able to use it like how I could do it
for integer type. So I preferred to send these as separate keys to the
worker. Thoughts?

> I think in the refactoring patch we could replace all the cstate
> variables that would be shared between the leader and workers with a
> common structure which would be used even for a serial copy. Thoughts?
>

Currently we are using shared_cstate only to share the integer & bool
members from the leader to the worker. Once the worker retrieves the
shared data for the integer & bool members, it copies them into cstate.
I preferred this way because only the integer & bool members go through
shared_cstate before being copied into cstate; for the rest of the
members we are anyway copying directly back into cstate. Thoughts?

> Have you tested your patch when encoding conversion is needed? If so,
> could you please point out the email that has the test results.
>

We have not yet done encoding testing, we will do and post the results
separately in the coming days.

> Apart from above, I've noticed some cosmetic errors which I am sharing here:
>
> +#define    IsParallelCopy()        (cstate->is_parallel)
> +#define IsLeader()             (cstate->pcdata->is_leader)
>
> This doesn't look to be properly aligned.
>

Fixed.

> +   shared_info_ptr = (ParallelCopyShmInfo *)
> shm_toc_allocate(pcxt->toc, sizeof(ParallelCopyShmInfo));
> +   PopulateParallelCopyShmInfo(shared_info_ptr, full_transaction_id);
>
> ..
>
> +   /* Store shared build state, for which we reserved space. */
> +   shared_cstate = (SerializedParallelCopyState
> *)shm_toc_allocate(pcxt->toc, est_cstateshared);
>
> In the first case, while typecasting you've added a space between the
> typename and the function but that is missing in the second case. I
> think it would be good if you could make it consistent.
>

Fixed

> Same comment applies here as well:
>
> +   pg_atomic_uint32    line_state;     /* line state */
> +   uint64              cur_lineno;     /* line number for error messages */
> +}ParallelCopyLineBoundary;
>
> ...
>
> +   CommandId                   mycid;  /* command id */
> +   ParallelCopyLineBoundaries  line_boundaries; /* line array */
> +} ParallelCopyShmInfo;
>
> There is no space between the closing brace and the structure name in
> the first case but it is in the second one. So, again this doesn't
> look consistent.
>

Fixed

> I could also find this type of inconsistency in comments. See below:
>
> +/* It can hold upto 10000 record information for worker to process. RINGSIZE
> + * should be a multiple of WORKER_CHUNK_COUNT, as wrap around cases
> is currently
> + * not handled while selecting the WORKER_CHUNK_COUNT by the worker. */
> +#define RINGSIZE (10 * 1000)
>
> ...
>
> +/*
> + * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
> + * block to process to avoid lock contention. Read RINGSIZE comments before
> + * changing this value.
> + */
> +#define WORKER_CHUNK_COUNT 50
>
> You may see these kinds of errors at other places as well if you scan
> through your patch.

Fixed.

Please find the attached v5 patch which has the fixes for the same.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Bharath Rupireddy
Date:
On Thu, Sep 17, 2020 at 11:06 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> >
> > Fortunately I have been given permission to share the exact table
> > definition and data I used, so you can check the behaviour and timings
> > on your own test machine.
> >
>
> Thanks Greg for the script. I ran your test case and I didn't observe
> any increase in exec time with 1 worker, indeed, we have benefitted a
> few seconds from 0 to 1 worker as expected.
>
> Execution time is in seconds. Each test case is executed 3 times on
> release build. Each time the data directory is recreated.
>
> Case 1: 1000000 rows, 2GB
> With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423
> With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678
>
> With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822
> With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573
>
> Case 2: 2550000 rows, 5GB
> With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683
> With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307
>
> With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049
> With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090
>

Hi Greg,

If you still observe the issue in your testing environment, I'm attaching a testing patch (that applies on top of the latest parallel copy patch set, i.e. v5 1 to 6) to capture various timings such as total copy time in the leader and workers, index and table insertion time, and leader and worker waiting time. These timings are written to the server log file.

Few things to follow before testing:
 1. Is the table being dropped/truncated after the test with 0 workers and before running with 1 worker? If not, then the index insertion time would increase [1] (for me it is increasing by 10 sec). This is obvious because the 1st time the index will be created in a bottom-up manner (from leaves to root), but for the 2nd time it has to search and insert at the proper leaves and inner B+Tree nodes.
2. If possible, can you also run with custom postgresql.conf settings[2] along with default? Just to ensure that other bg processes such as checkpointer, autovacuum, bgwriter etc. don't affect our testcase. For instance, with default postgresql.conf file, it looks like checkpointing[3] is happening frequently, could you please let us know if that happens at your end?
3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue.
 4. I ran the tests in a performance test system where no other user processes (except system processes) are running. Is it possible for you to do the same?

Please capture and share the timing logs with us.

Here's a snapshot of how the added timings show up in the logs (I captured this with your test case 1: 1000000 rows, 2GB, custom postgresql.conf settings[2]).
with 0 workers:
2020-09-22 10:49:27.508 BST [163910] LOG:  totaltableinsertiontime = 24072.034 ms
2020-09-22 10:49:27.508 BST [163910] LOG:  totalindexinsertiontime = 60.682 ms
2020-09-22 10:49:27.508 BST [163910] LOG:  totalcopytime = 59664.594 ms

with 1 worker:
2020-09-22 10:53:58.409 BST [163947] LOG:  totalcopyworkerwaitingtime = 59.815 ms
2020-09-22 10:53:58.409 BST [163947] LOG:  totaltableinsertiontime = 23585.881 ms
2020-09-22 10:53:58.409 BST [163947] LOG:  totalindexinsertiontime = 30.946 ms
2020-09-22 10:53:58.409 BST [163947] LOG:  totalcopytimeworker = 47047.956 ms
2020-09-22 10:53:58.429 BST [163946] LOG:  totalcopyleaderwaitingtime = 26746.744 ms
2020-09-22 10:53:58.429 BST [163946] LOG:  totalcopytime = 47150.002 ms

[1]
0 worker:
LOG:  totaltableinsertiontime = 25491.881 ms
LOG:  totalindexinsertiontime = 14136.104 ms
LOG:  totalcopytime = 75606.858 ms
table and indexes are not dropped before the 1 worker run
1 worker:
LOG:  totalcopyworkerwaitingtime = 64.582 ms
LOG:  totaltableinsertiontime = 21360.875 ms
LOG:  totalindexinsertiontime = 24843.570 ms
LOG:  totalcopytimeworker = 69837.162 ms
LOG:  totalcopyleaderwaitingtime = 49548.441 ms
LOG:  totalcopytime = 69997.778 ms

[2]
custom postgresql.conf configuration:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

[3]
LOG:  checkpoints are occurring too frequently (14 seconds apart)
HINT:  Consider increasing the configuration parameter "max_wal_size".

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Parallel copy

From
Greg Nancarrow
Date:
Hi Bharath,

> Few things to follow before testing:
> 1. Is the table being dropped/truncated after the test with 0 workers and before running with 1 worker? If not, then
> the index insertion time would increase [1] (for me it is increasing by 10 sec). This is obvious because the 1st time
> the index will be created in a bottom-up manner (from leaves to root), but for the 2nd time it has to search and
> insert at the proper leaves and inner B+Tree nodes.

Yes, it's being truncated before running each and every COPY.

> 2. If possible, can you also run with custom postgresql.conf settings[2] along with default? Just to ensure that
> other bg processes such as checkpointer, autovacuum, bgwriter etc. don't affect our testcase. For instance, with default
> postgresql.conf file, it looks like checkpointing[3] is happening frequently, could you please let us know if that
> happens at your end?

Yes, have run with default and your custom settings. With default
settings, I can confirm that checkpointing is happening frequently
with the tests I've run here.

> 3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue.

Yes, have run 4 times. Seems to be a performance hit (whether normal
copy or parallel-1 copy) on the first COPY run on a freshly created
database. After that, results are consistent.

> 4. I ran the tests in a performance test system where no other user processes (except system processes) are running.
> Is it possible for you to do the same?
>
> Please capture and share the timing logs with us.
>

Yes, I have ensured the system is as idle as possible prior to testing.

I have attached the test results obtained after building with your
Parallel Copy patch and testing patch applied (HEAD at
733fa9aa51c526582f100aa0d375e0eb9a6bce8b).

Test results show that Parallel COPY with 1 worker is performing
better than normal COPY in the test scenarios run. There is a
performance hit (regardless of COPY type) on the very first COPY run
on a freshly-created database.

I ran the test case 4 times, and also in reverse order, with truncate
run before each COPY (output and logs named xxxx_0_1 run normal COPY
then parallel COPY, and named xxxx_1_0 run parallel COPY and then
normal COPY).

Please refer to attached results.

Regards,
Greg

Attachment

Re: Parallel copy

From
Bharath Rupireddy
Date:
Thanks Greg for the testing.

On Thu, Sep 24, 2020 at 8:27 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> > 3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue.
>
> Yes, have run 4 times. Seems to be a performance hit (whether normal
> copy or parallel-1 copy) on the first COPY run on a freshly created
> database. After that, results are consistent.
>

From the logs, I see that it is happening only with the default postgresql.conf, and there's inconsistency in the table insertion times, especially from the 1st run to the 2nd. Also, the variation in table insertion time is larger. This is expected with the default postgresql.conf because of interference from background processes. That's the reason we usually run with a custom configuration to correctly measure the performance gain.

br_default_0_1.log:
2020-09-23 22:32:36.944 JST [112616] LOG:  totaltableinsertiontime = 155068.244 ms
2020-09-23 22:33:57.615 JST [11426] LOG:  totaltableinsertiontime = 42096.275 ms

2020-09-23 22:37:39.192 JST [43097] LOG:  totaltableinsertiontime = 29135.262 ms
2020-09-23 22:38:56.389 JST [54205] LOG:  totaltableinsertiontime = 38953.912 ms
2020-09-23 22:40:27.573 JST [66485] LOG:  totaltableinsertiontime = 27895.326 ms
2020-09-23 22:41:34.948 JST [77523] LOG:  totaltableinsertiontime = 28929.642 ms
2020-09-23 22:43:18.938 JST [89857] LOG:  totaltableinsertiontime = 30625.015 ms
2020-09-23 22:44:21.938 JST [101372] LOG:  totaltableinsertiontime = 24624.045 ms

br_default_1_0.log:
2020-09-24 11:12:14.989 JST [56146] LOG:  totaltableinsertiontime = 192068.350 ms
2020-09-24 11:13:38.228 JST [88455] LOG:  totaltableinsertiontime = 30999.942 ms

2020-09-24 11:15:50.381 JST [108935] LOG:  totaltableinsertiontime = 31673.204 ms
2020-09-24 11:17:14.260 JST [118541] LOG:  totaltableinsertiontime = 31367.027 ms
2020-09-24 11:20:18.975 JST [17270] LOG:  totaltableinsertiontime = 26858.924 ms
2020-09-24 11:22:17.822 JST [26852] LOG:  totaltableinsertiontime = 66531.442 ms
2020-09-24 11:24:09.221 JST [47971] LOG:  totaltableinsertiontime = 38943.384 ms
2020-09-24 11:25:30.955 JST [58849] LOG:  totaltableinsertiontime = 28286.634 ms

br_custom_0_1.log:
2020-09-24 10:29:44.956 JST [110477] LOG:  totaltableinsertiontime = 20207.928 ms
2020-09-24 10:30:49.570 JST [120568] LOG:  totaltableinsertiontime = 23360.006 ms
2020-09-24 10:32:31.659 JST [2753] LOG:  totaltableinsertiontime = 19837.588 ms
2020-09-24 10:35:49.245 JST [31118] LOG:  totaltableinsertiontime = 21759.253 ms
2020-09-24 10:36:54.834 JST [41763] LOG:  totaltableinsertiontime = 23547.323 ms
2020-09-24 10:38:53.507 JST [56779] LOG:  totaltableinsertiontime = 21543.984 ms
2020-09-24 10:39:58.713 JST [67489] LOG:  totaltableinsertiontime = 25254.563 ms

br_custom_1_0.log:
2020-09-24 10:49:03.242 JST [15308] LOG:  totaltableinsertiontime = 16541.201 ms
2020-09-24 10:50:11.848 JST [23324] LOG:  totaltableinsertiontime = 15076.577 ms
2020-09-24 10:51:24.497 JST [35394] LOG:  totaltableinsertiontime = 16400.777 ms
2020-09-24 10:52:32.354 JST [42953] LOG:  totaltableinsertiontime = 15591.051 ms
2020-09-24 10:54:30.327 JST [61136] LOG:  totaltableinsertiontime = 16700.954 ms
2020-09-24 10:55:38.377 JST [68719] LOG:  totaltableinsertiontime = 15435.150 ms
2020-09-24 10:57:08.927 JST [83335] LOG:  totaltableinsertiontime = 17133.251 ms
2020-09-24 10:58:17.420 JST [90905] LOG:  totaltableinsertiontime = 15352.753 ms

>
> Test results show that Parallel COPY with 1 worker is performing
> better than normal COPY in the test scenarios run.
>

Good to know :)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
Bharath Rupireddy
Date:
>
> > Have you tested your patch when encoding conversion is needed? If so,
> > could you please point out the email that has the test results.
> >
>
> We have not yet done encoding testing, we will do and post the results
> separately in the coming days.
>

Hi Ashutosh,

I ran the tests ensuring pg_server_to_any() gets called from copy.c. I specified the encoding option of COPY command, with client and server encodings being UTF-8.

Tests were performed with custom postgresql.conf[1], 10 million rows, 5.2GB data. The results are given as triplets of (exec time in sec, number of workers, gain).

Use case 1: 2 indexes on integer columns, 1 index on text column
(1174.395, 0, 1X), (1127.792, 1, 1.04X), (644.260, 2, 1.82X), (341.284, 4, 3.43X), (204.423, 8, 5.74X), (140.692, 16, 8.34X), (129.843, 20, 9.04X), (134.511, 30, 8.72X)

Use case 2: 1 gist index on text column
(811.412, 0, 1X), (772.203, 1, 1.05X), (437.364, 2, 1.85X), (263.575, 4, 3.08X), (175.135, 8, 4.63X), (155.355, 16, 5.22X), (178.704, 20, 4.54X), (199.402, 30, 4.06)

Use case 3: 3 indexes on integer columns
(220.680, 0, 1X), (185.096, 1, 1.19X), (134.811, 2, 1.64X), (114.585, 4, 1.92X), (107.707, 8, 2.05X), (101.253, 16, 2.18X), (100.749, 20, 2.19X), (100.656, 30, 2.19X)
 
The results are similar to our earlier runs[2].

[1]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

[2]
https://www.postgresql.org/message-id/CALDaNm13zK%3DJXfZWqZJsm3%2B2yagYDJc%3DeJBgE4i77-4PPNj7vw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
Ashutosh Sharma
Date:
On Thu, Sep 24, 2020 at 3:00 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> >
> > > Have you tested your patch when encoding conversion is needed? If so,
> > > could you please point out the email that has the test results.
> > >
> >
> > We have not yet done encoding testing, we will do and post the results
> > separately in the coming days.
> >
>
> Hi Ashutosh,
>
> I ran the tests ensuring pg_server_to_any() gets called from copy.c. I specified the encoding option of COPY command,
> with client and server encodings being UTF-8.
>

Thanks Bharath for the testing. The results look impressive.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Wed, Jul 22, 2020 at 7:48 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > Review comments:
> > ===================
> >
> > 0001-Copy-code-readjustment-to-support-parallel-copy
> > 1.
> > @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
> >   else
> >   nbytes = 0; /* no data need be saved */
> >
> > + if (cstate->copy_dest == COPY_NEW_FE)
> > + minread = RAW_BUF_SIZE - nbytes;
> > +
> >   inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > -   1, RAW_BUF_SIZE - nbytes);
> > +   minread, RAW_BUF_SIZE - nbytes);
> >
> > No comment to explain why this change is done?
> >
> > 0002-Framework-for-leader-worker-in-parallel-copy
>
> Currently CopyGetData copies a lesser amount of data to buffer even though space is available in buffer because
> minread was passed as 1 to CopyGetData. Because of this there are frequent calls to CopyGetData for fetching the data. In
> this case it will load only some data due to the below check:
> while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
> After reading some data bytesread will be greater than minread which is passed as 1 and return with lesser amount of
> data, even though there is some space.
> This change is required for parallel copy feature as each time we get a new DSM data block which is of 64K size and
> copy the data. If we copy less data into DSM data blocks we might end up consuming all the DSM data blocks.
>

Why can't we reuse the DSM block which has unfilled space?

>  I felt this issue can be fixed as part of HEAD. Have posted a separate thread [1] for this. I'm planning to remove
> that change once it gets committed. Can that go as a separate
> patch or should we include it here?
> [1] - https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com
>

I am convinced by the reason given by Kyotaro-San in that another
thread [1] and performance data shown by Peter that this can't be an
independent improvement and rather in some cases it can do harm. Now,
if you need it for a parallel-copy path then we can change it
specifically to the parallel-copy code path but I don't understand
your reason completely.

> > 2.
..
> > + */
> > +typedef struct ParallelCopyLineBoundary
> >
> > Are we doing all this state management to avoid using locks while
> > processing lines?  If so, I think we can use either spinlock or LWLock
> > to keep the main patch simple and then provide a later patch to make
> > it lock-less.  This will allow us to first focus on the main design of
> > the patch rather than trying to make this datastructure processing
> > lock-less in the best possible way.
> >
>
> The steps will be more or less the same if we use a spinlock too. Step 1, step 3 & step 4 will be common; we have to use
> lock & unlock instead of step 2 & step 5. I feel we can retain the current implementation.
>

I'll study this in detail and let you know my opinion on the same but
in the meantime, I don't follow one part of this comment: "If they
don't follow this order the worker might process wrong line_size and
leader might populate the information which worker has not yet
processed or in the process of processing."

Do you want to say that leader might overwrite some information which
worker hasn't read yet? If so, it is not clear from the comment.
Another minor point about this comment:

+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * Leader process will be populating data block, data block offset & the size of

I think there should be a full-stop after worker instead of a comma.

>
> > 6.
> > In function BeginParallelCopy(), you need to keep a provision to
> > collect wal_usage and buf_usage stats.  See _bt_begin_parallel for
> > reference.  Those will be required for pg_stat_statements.
> >
>
> Fixed
>

How did you ensure that this is fixed? Have you tested it? If so,
please share the test. I see a basic problem with your fix.

+ /* Report WAL/buffer usage during parallel execution */
+ bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
+ walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+   &walusage[ParallelWorkerNumber]);

You need to call InstrStartParallelQuery() before the actual operation
starts; without that, the stats won't be accurate. Also, after calling
WaitForParallelWorkersToFinish(), you need to accumulate the stats
collected from the workers, which you have neither done nor is it
possible with the current code in your patch, because you haven't made
any provision to capture them in BeginParallelCopy.

I suggest you look into lazy_parallel_vacuum_indexes() and
begin_parallel_vacuum() to understand how the buffer/wal usage stats
are accumulated. Also, please test this functionality using
pg_stat_statements.
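
To make the expected pattern concrete, here is a rough sketch (only an
illustration, loosely based on how parallel vacuum does it; the
copy-side function and variable names are assumptions, not the
patch's):

#include "postgres.h"
#include "access/parallel.h"
#include "executor/instrument.h"

/* Worker side (sketch): bracket the actual load work. */
static void
parallel_copy_worker_accounting_sketch(BufferUsage *bufferusage,
									   WalUsage *walusage)
{
	InstrStartParallelQuery();
	/* ... perform the copy work assigned to this worker ... */
	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
						  &walusage[ParallelWorkerNumber]);
}

/* Leader side (sketch): after the workers finish, fold their stats in. */
static void
parallel_copy_leader_accounting_sketch(ParallelContext *pcxt,
									   BufferUsage *bufferusage,
									   WalUsage *walusage)
{
	int			i;

	WaitForParallelWorkersToFinish(pcxt);
	for (i = 0; i < pcxt->nworkers_launched; i++)
		InstrAccumParallelQuery(&bufferusage[i], &walusage[i]);
}

Without the InstrStartParallelQuery() call in the worker and the
accumulation loop in the leader, the per-worker counters never make it
into the leader's totals, which is why pg_stat_statements would show
incomplete numbers.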

>
> > 0003-Allow-copy-from-command-to-process-data-from-file-ST
> > 10.
> > In the commit message, you have written "The leader does not
> > participate in the insertion of data, leaders only responsibility will
> > be to identify the lines as fast as possible for the workers to do the
> > actual copy operation. The leader waits till all the lines populated
> > are processed by the workers and exits."
> >
> > I think you should also mention that we have chosen this design based
> > on the reason "that everything stalls if the leader doesn't accept
> > further input data, as well as when there are no available splitted
> > chunks so it doesn't seem like a good idea to have the leader do other
> > work.  This is backed by the performance data where we have seen that
> > with 1 worker there is just a 5-10% (or whatever percentage difference
> > you have seen) performance difference)".
>
> Fixed.
>

Make it one paragraph, starting from "The leader does not participate
in the insertion of data  .... just a 5-10% performance difference".
Right now the two parts look a bit disconnected.

Few additional comments:
======================
v5-0001-Copy-code-readjustment-to-support-parallel-copy
---------------------------------------------------------------------------------
1.
+/*
+ * CLEAR_EOL_LINE - Wrapper for clearing EOL.
+ */
+#define CLEAR_EOL_LINE() \
+if (!result && !IsHeaderLine()) \
+ ClearEOLFromCopiedData(cstate, cstate->line_buf.data, \
+    cstate->line_buf.len, \
+    &cstate->line_buf.len) \

I don't like this macro. I think it is sufficient to move the common
code to be called from the parallel and non-parallel path in
ClearEOLFromCopiedData but I think the other checks can be done
in-place. I think having macros for such a thing makes code less
readable.

2.
-
+static void PopulateCommonCstateInfo(CopyState cstate, TupleDesc tup_desc,
+ List *attnamelist);

Spurious line removal.

v5-0002-Framework-for-leader-worker-in-parallel-copy
---------------------------------------------------------------------------
3.
+ FullTransactionId full_transaction_id; /* xid for copy from statement */
+ CommandId mycid; /* command id */
+ ParallelCopyLineBoundaries line_boundaries; /* line array */
+} ParallelCopyShmInfo;

We already serialize FullTransactionId and CommandId via
InitializeParallelDSM->SerializeTransactionState. Can't we reuse it? I
think recently Parallel Insert patch has also done something for this
[2] so you can refer that if you want.

v5-0004-Documentation-for-parallel-copy
-----------------------------------------------------------
1.  Perform <command>COPY FROM</command> in parallel using <replaceable
+      class="parameter"> integer</replaceable> background workers.

No need for space before integer.


[1] - https://www.postgresql.org/message-id/20200911.155804.359271394064499501.horikyota.ntt%40gmail.com
[2] - https://www.postgresql.org/message-id/CAJcOf-fn1nhEtaU91NvRuA3EbvbJGACMd4_c%2BUu3XU5VMv37Aw%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Parallel copy

From
Amit Kapila
Date:
On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks Ashutosh for your comments.
>
> On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Hi Vignesh,
> >
> > I've spent some time today looking at your new set of patches and I've
> > some thoughts and queries which I would like to put here:
> >
> > Why are these not part of the shared cstate structure?
> >
> >     SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
> >     SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> >     SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> >     SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> >
>
> I have used shared_cstate mainly to share the integer & bool data
> types from the leader to worker process. The above data types are of
> char* data type, I will not be able to use it like how I could do it
> for integer type. So I preferred to send these as separate keys to the
> worker. Thoughts?
>

I think the way you have written will work but if we go with
Ashutosh's proposal it will look elegant and in the future, if we need
to share more strings as part of cstate structure then that would be
easier. You can probably refer to EstimateParamListSpace,
SerializeParamList, and RestoreParamList to see how we can share
different types of data in one key.
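
For example, a rough sketch of that approach (only an illustration of
the general pattern, not code from those functions or from the patch)
could pack each string with a length prefix so that all of them fit
under a single key:

#include "postgres.h"

/* Append one length-prefixed string (length 0 marks a NULL string). */
static char *
copy_option_serialize_string(char *ptr, const char *str)
{
	int			len = str ? (int) strlen(str) + 1 : 0;

	memcpy(ptr, &len, sizeof(int));
	ptr += sizeof(int);
	if (len > 0)
		memcpy(ptr, str, len);
	return ptr + len;
}

/* Read back one length-prefixed string in the worker. */
static char *
copy_option_restore_string(char *ptr, char **str)
{
	int			len;

	memcpy(&len, ptr, sizeof(int));
	ptr += sizeof(int);
	*str = (len > 0) ? pstrdup(ptr) : NULL;
	return ptr + len;
}

The leader would call the serialize function once per string
(null_print, delim, quote, escape) into one buffer obtained from
shm_toc_allocate(), and the worker would call the restore function the
same number of times in the same order, so only one DSM key is needed.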

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
Ashutosh Sharma
Date:
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > Thanks Ashutosh for your comments.
> >
> > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > >
> > > Hi Vignesh,
> > >
> > > I've spent some time today looking at your new set of patches and I've
> > > some thoughts and queries which I would like to put here:
> > >
> > > Why are these not part of the shared cstate structure?
> > >
> > >     SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
> > >     SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > >     SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > >     SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> > >
> >
> > I have used shared_cstate mainly to share the integer & bool data
> > types from the leader to worker process. The above data types are of
> > char* data type, I will not be able to use it like how I could do it
> > for integer type. So I preferred to send these as separate keys to the
> > worker. Thoughts?
> >
>
> I think the way you have written will work but if we go with
> Ashutosh's proposal it will look elegant and in the future, if we need
> to share more strings as part of cstate structure then that would be
> easier. You can probably refer to EstimateParamListSpace,
> SerializeParamList, and RestoreParamList to see how we can share
> different types of data in one key.
>

Yeah. And in addition to that it will also reduce the number of DSM
keys that we need to maintain.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: Parallel copy

From
Greg Nancarrow
Date:
Hi Vignesh and Bharath,

Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
parallel-unsafe.
Can you explain why this is?

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel copy

From
Amit Kapila
Date:
On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few additional comments:
> ======================

Some more comments:

v5-0002-Framework-for-leader-worker-in-parallel-copy
===========================================
1.
These values
+ * help in handover of multiple records with significant size of data to be
+ * processed by each of the workers to make sure there is no context
switch & the
+ * work is fairly distributed among the workers.

How about writing it as: "These values help in the handover of
multiple records with the significant size of data to be processed by
each of the workers. This also ensures there is no context switch and
the work is fairly distributed among the workers."

2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as
power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024,
and accordingly choose RINGSIZE. At many places, we do that way. I
think it can sometimes help in faster processing due to cache size
requirements and in this case, I don't see a reason why we can't
choose these values to be power-of-two. If you agree with this change
then also do some performance testing after this change?

3.
+ bool   curr_blk_completed;
+ char   data[DATA_BLOCK_SIZE]; /* data read from file */
+ uint8  skip_bytes;
+} ParallelCopyDataBlock;

Is there a reason to keep skip_bytes after data? Normally the variable
size data is at the end of the structure. Also, there is no comment
explaining the purpose of skip_bytes.
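
In other words, something along these lines (just a sketch of the
suggested ordering, not the actual patch definition, and the skip_bytes
comment is only a placeholder for whatever its real purpose is):

typedef struct ParallelCopyDataBlock
{
    bool    curr_blk_completed;
    uint8   skip_bytes;             /* add a comment explaining what is skipped and why */
    char    data[DATA_BLOCK_SIZE];  /* data read from file; large array kept last */
} ParallelCopyDataBlock;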

4.
+ * Copy data block information.
+ * ParallelCopyDataBlock's will be created in DSM. Data read from file will be
+ * copied in these DSM data blocks. The leader process identifies the records
+ * and the record information will be shared to the workers. The workers will
+ * insert the records into the table. There can be one or more number
of records
+ * in each of the data block based on the record size.
+ */
+typedef struct ParallelCopyDataBlock

Keep one empty line after the description line like below. I also
suggested to do a minor tweak in the above sentence which is as
follows:

* Copy data block information.
*
* These data blocks are created in DSM. Data read ...

Try to follow a similar format in other comments as well.

5. I think it is better to move parallelism related code to a new file
(we can name it as copyParallel.c or something like that).

6. copy.c(1648,25): warning C4133: 'function': incompatible types -
from 'ParallelCopyLineState *' to 'uint32 *'
Getting above compilation warning on Windows.

v5-0003-Allow-copy-from-command-to-process-data-from-file
==================================================
1.
@@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate,
  * only in text mode.
  */
  initStringInfo(&cstate->attribute_buf);
- cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
+ cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *)
palloc(RAW_BUF_SIZE + 1);

Is there any way IsParallelCopy can be true by this time? AFAICS, we don't do
anything about parallelism until after this. If you want to save this
allocation then we need to move this after we determine that
parallelism can be used or not and accordingly the below code in the
patch needs to be changed.

 * ParallelCopyFrom - parallel copy leader's functionality.
  *
  * Leader executes the before statement for before statement trigger, if before
@@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate)
  ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
  ereport(DEBUG1, (errmsg("Running parallel copy leader")));

+ /* raw_buf is not used in parallel copy, instead data blocks are used.*/
+ pfree(cstate->raw_buf);
+ cstate->raw_buf = NULL;

Is there anything else also the allocation of which depends on parallelism?

2.
+static pg_attribute_always_inline bool
+IsParallelCopyAllowed(CopyState cstate)
+{
+ /* Parallel copy not allowed for frontend (2.0 protocol) & binary option. */
+ if ((cstate->copy_dest == COPY_OLD_FE) || cstate->binary)
+ return false;
+
+ /* Check if copy is into foreign table or temporary table. */
+ if (cstate->rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
+ RelationUsesLocalBuffers(cstate->rel))
+ return false;
+
+ /* Check if trigger function is parallel safe. */
+ if (cstate->rel->trigdesc != NULL &&
+ !IsTriggerFunctionParallelSafe(cstate->rel->trigdesc))
+ return false;
+
+ /*
+ * Check if there is after statement or instead of trigger or transition
+ * table triggers.
+ */
+ if (cstate->rel->trigdesc != NULL &&
+ (cstate->rel->trigdesc->trig_insert_after_statement ||
+ cstate->rel->trigdesc->trig_insert_instead_row ||
+ cstate->rel->trigdesc->trig_insert_new_table))
+ return false;
+
+ /* Check if the volatile expressions are parallel safe, if present any. */
+ if (!CheckExprParallelSafety(cstate))
+ return false;
+
+ /* Check if the insertion mode is single. */
+ if (FindInsertMethod(cstate) == CIM_SINGLE)
+ return false;
+
+ return true;
+}

In the comments, we should write why parallelism is not allowed for a
particular case. The cases where parallel-unsafe clause is involved
are okay but it is not clear from comments why it is not allowed in
other cases.
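
For example (wording mine, only to show the level of detail I mean; the
comment for each branch should of course state the real reason), the
temporary-table check could read something like:

    /*
     * Temporary tables use backend-local buffers, which parallel workers
     * cannot access, so parallel copy cannot be used for them.
     */
    if (RelationUsesLocalBuffers(cstate->rel))
        return false;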

3.
+ ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
+ ParallelCopyLineBoundary *lineInfo;
+ uint32 line_first_block = pcshared_info->cur_block_pos;
+ line_pos = UpdateBlockInLineInfo(cstate,
+    line_first_block,
+    cstate->raw_buf_index, -1,
+    LINE_LEADER_POPULATING);
+ lineInfo = &pcshared_info->line_boundaries.ring[line_pos];
+ elog(DEBUG1, "[Leader] Adding - block:%d, offset:%d, line position:%d",
+ line_first_block, lineInfo->start_offset, line_pos);

Can we take all the code here inside function UpdateBlockInLineInfo? I
see that it is called from one other place but I guess most of the
surrounding code there can also be moved inside the function. Can we
change the name of the function to UpdateSharedLineInfo or something
like that and remove inline marking from this? I am not sure we want
to inline such big functions. If it makes a difference in performance
then we can probably consider it.

4.
EndLineParallelCopy()
{
..
+ /* Update line size. */
+ pg_atomic_write_u32(&lineInfo->line_size, line_size);
+ pg_atomic_write_u32(&lineInfo->line_state, LINE_LEADER_POPULATED);
+ elog(DEBUG1, "[Leader] After adding - line position:%d, line_size:%d",
+ line_pos, line_size);
..
}

Can we instead call UpdateSharedLineInfo (new function name for
UpdateBlockInLineInfo) to do this and maybe see it only updates the
required info? The idea is to centralize the code for updating
SharedLineInfo.

5.
+static uint32
+GetLinePosition(CopyState cstate)
+{
+ ParallelCopyData *pcdata = cstate->pcdata;
+ ParallelCopyShmInfo *pcshared_info = pcdata->pcshared_info;
+ uint32  previous_pos = pcdata->worker_processed_pos;
+ uint32 write_pos = (previous_pos == -1) ? 0 : (previous_pos + 1) % RINGSIZE;

It seems to me that each worker has to hop through all the processed
chunks before getting the chunk which it can process. This will work
but I think it is better if we have some shared counter which can tell
us the next chunk to be processed and avoid all the unnecessary work
of hopping to find the exact position.
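
Roughly what I have in mind is something like the following (only a
sketch; next_line_to_process is a made-up field, and the worker would
still have to wait for the leader to mark the claimed slot as
populated):

/* In ParallelCopyShmInfo (hypothetical field):
 *     pg_atomic_uint32 next_line_to_process;    initialized to 0 by the leader
 */
static uint32
ClaimNextLinePosition(ParallelCopyShmInfo *pcshared_info)
{
    /* Atomically hand out ring positions; no hopping over processed entries. */
    uint32  pos = pg_atomic_fetch_add_u32(&pcshared_info->next_line_to_process, 1);

    return pos % RINGSIZE;
}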

v5-0004-Documentation-for-parallel-copy
-----------------------------------------
1. Can you add one or two examples towards the end of the page where
we have examples for other Copy options?


Please run pgindent on all patches as that will make the code look better.

From the testing perspective,
1. Test by having something force_parallel_mode = regress which means
that all existing Copy tests in the regression will be executed via
new worker code. You can have this as a test-only patch for now and
make sure all existing tests passed with this.
2. Do we have tests for toast tables? I think if you implement the
previous point some existing tests might cover it but I feel we should
have at least one or two tests for the same.
3. Have we checked the code coverage of the newly added code with
existing tests?

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
vignesh C
Date:
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Few additional comments:
> > ======================
>
> Some more comments:
>

Thanks Amit for the comments. I will work on them and provide
a patch in the next few days.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Hi Vignesh and Bharath,
>
> Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?
>

I don't think we need to restrict this case and even if there is some
reason to do so then probably the same should be mentioned in the
comments.

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
Tomas Vondra
Date:
Hello Vignesh,

I've done some basic benchmarking on the v4 version of the patches (but
AFAIKC the v5 should perform about the same), and some initial review.

For the benchmarking, I used the lineitem table from TPC-H - for 75GB
data set, this largest table is about 64GB once loaded, with another
54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
NVME storage.

The COPY duration with varying number of workers (specified using the
parallel COPY option) looks like this:

      workers    duration
     ---------------------
            0        1366
            1        1255
            2         704
            3         526
            4         434
            5         385
            6         347
            7         322
            8         327

So this seems to work pretty well - initially we get almost linear
speedup, then it slows down (likely due to contention for locks, I/O
etc.). Not bad.

I've only done a quick review, but overall the patch looks in fairly
good shape.

1) I don't quite understand why we need INCREMENTPROCESSED and
RETURNPROCESSED, considering it just does ++ or return. It just
obfuscated the code, I think.

2) I find it somewhat strange that BeginParallelCopy can just decide not
to do parallel copy after all. Why not make this decision in the
caller? Or maybe it's fine this way, not sure.

3) AFAIK we don't modify typedefs.list in patches, so these changes
should be removed. 

4) IsTriggerFunctionParallelSafe actually checks all triggers, not just
one, so the comment needs minor rewording.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel copy

From
Amit Kapila
Date:
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hello Vignesh,
>
> I've done some basic benchmarking on the v4 version of the patches (but
> AFAIKC the v5 should perform about the same), and some initial review.
>
> For the benchmarking, I used the lineitem table from TPC-H - for 75GB
> data set, this largest table is about 64GB once loaded, with another
> 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
> NVME storage.
>
> The COPY duration with varying number of workers (specified using the
> parallel COPY option) looks like this:
>
>       workers    duration
>      ---------------------
>             0        1366
>             1        1255
>             2         704
>             3         526
>             4         434
>             5         385
>             6         347
>             7         322
>             8         327
>
> So this seems to work pretty well - initially we get almost linear
> speedup, then it slows down (likely due to contention for locks, I/O
> etc.). Not bad.
>

+1. These numbers (> 4x speed up) look good to me.


-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
vignesh C
Date:
On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 7:48 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > Review comments:
> > > ===================
> > >
> > > 0001-Copy-code-readjustment-to-support-parallel-copy
> > > 1.
> > > @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
> > >   else
> > >   nbytes = 0; /* no data need be saved */
> > >
> > > + if (cstate->copy_dest == COPY_NEW_FE)
> > > + minread = RAW_BUF_SIZE - nbytes;
> > > +
> > >   inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > > -   1, RAW_BUF_SIZE - nbytes);
> > > +   minread, RAW_BUF_SIZE - nbytes);
> > >
> > > No comment to explain why this change is done?
> > >
> > > 0002-Framework-for-leader-worker-in-parallel-copy
> >
> > Currently CopyGetData copies a lesser amount of data to buffer even though space is available in buffer because
> > minread was passed as 1 to CopyGetData. Because of this there are frequent call to CopyGetData for fetching the data.
> > In this case it will load only some data due to the below check:
> > while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
> > After reading some data bytesread will be greater than minread which is passed as 1 and return with lesser amount
> > of data, even though there is some space.
> > This change is required for parallel copy feature as each time we get a new DSM data block which is of 64K size and
> > copy the data. If we copy less data into DSM data blocks we might end up consuming all the DSM data blocks.
> >
>
> Why can't we reuse the DSM block which has unfilled space?
>
> > I felt this issue can be fixed as part of HEAD. Have posted a separate thread [1] for this. I'm planning to remove
> > that change once it gets committed. Can that go as a separate
> > patch or should we include it here?
> > [1] - https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com
> >
>
> I am convinced by the reason given by Kyotaro-San in that another
> thread [1] and performance data shown by Peter that this can't be an
> independent improvement and rather in some cases it can do harm. Now,
> if you need it for a parallel-copy path then we can change it
> specifically to the parallel-copy code path but I don't understand
> your reason completely.
>

Whenever we need data to be populated, we will get a new data block &
pass it to CopyGetData to populate the data. In case of file copy, the
server will completely fill the data block; if data is available it
will load the data block completely. There is no scenario where a
partial data block is returned even though data is present, except at
EOF or when no data is available. But in case of STDIN data copy, even
though there is 8K data available in the data block & 8K data available
in STDIN, CopyGetData will return as soon as the libpq buffer data is
more than the minread. We pass a new data block every time to load
data, so every time we pass an 8K data block but CopyGetData loads only
a few bytes into the new data block & returns. I wanted to keep the
same data population logic for both file copy & STDIN copy, i.e. copy
full 8K data blocks & then process the populated data. There is an
alternative solution: I can have some special handling in case of STDIN
wherein the existing data block can be passed with the index from
where the data should be copied. Thoughts?
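
As a very rough sketch of that alternative (all names here are
hypothetical, nothing below is from the patch), the special handling
would keep appending to the current block until it is full rather than
starting a new block on every call:

/* Hypothetical sketch: keep filling the current DSM data block from STDIN
 * until it is full, instead of handing CopyGetData a fresh block each time. */
static int
FillDataBlockFromStdin(CopyState cstate, ParallelCopyDataBlock *block, int filled)
{
    while (filled < DATA_BLOCK_SIZE && !cstate->reached_eof)
    {
        int     got = CopyGetData(cstate, block->data + filled,
                                  1, DATA_BLOCK_SIZE - filled);

        if (got == 0)
            break;              /* nothing more available right now */
        filled += got;
    }

    return filled;              /* caller moves to a new block once this is full */
}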

> > > 2.
> ..
> > > + */
> > > +typedef struct ParallelCopyLineBoundary
> > >
> > > Are we doing all this state management to avoid using locks while
> > > processing lines?  If so, I think we can use either spinlock or LWLock
> > > to keep the main patch simple and then provide a later patch to make
> > > it lock-less.  This will allow us to first focus on the main design of
> > > the patch rather than trying to make this datastructure processing
> > > lock-less in the best possible way.
> > >
> >
> > The steps will be more or less same if we use spinlock too. step 1, step 3 & step 4 will be common we have to use
> > lock & unlock instead of step 2 & step 5. I feel we can retain the current implementation.
> >
>
> I'll study this in detail and let you know my opinion on the same but
> in the meantime, I don't follow one part of this comment: "If they
> don't follow this order the worker might process wrong line_size and
> leader might populate the information which worker has not yet
> processed or in the process of processing."
>
> Do you want to say that leader might overwrite some information which
> worker hasn't read yet? If so, it is not clear from the comment.
> Another minor point about this comment:
>

Here leader and worker must follow these steps to avoid any corruption
or hang issue. Changed it to:
 * The leader & worker process access the shared line information by following
 * the below steps to avoid any data corruption or hang:

> + * ParallelCopyLineBoundary is common data structure between leader & worker,
> + * Leader process will be populating data block, data block offset &
> the size of
>
> I think there should be a full-stop after worker instead of a comma.
>

Changed it.

> >
> > > 6.
> > > In function BeginParallelCopy(), you need to keep a provision to
> > > collect wal_usage and buf_usage stats.  See _bt_begin_parallel for
> > > reference.  Those will be required for pg_stat_statements.
> > >
> >
> > Fixed
> >
>
> How did you ensure that this is fixed? Have you tested it, if so
> please share the test? I see a basic problem with your fix.
>
> + /* Report WAL/buffer usage during parallel execution */
> + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> +   &walusage[ParallelWorkerNumber]);
>
> You need to call InstrStartParallelQuery() before the actual operation
> starts, without that stats won't be accurate? Also, after calling
> WaitForParallelWorkersToFinish(), you need to accumulate the stats
> collected from workers which neither you have done nor is possible
> with the current code in your patch because you haven't made any
> provision to capture them in BeginParallelCopy.
>
> I suggest you look into lazy_parallel_vacuum_indexes() and
> begin_parallel_vacuum() to understand how the buffer/wal usage stats
> are accumulated. Also, please test this functionality using
> pg_stat_statements.
>

Made changes accordingly.
I have verified it using:
postgres=# select * from pg_stat_statements where query like '%copy%';
-[ RECORD 1 ]-------+------------------------------------------------------------------------------------------------------
userid              | 10
dbid                | 13743
queryid             | -6947756673093447609
query               | copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',')
plans               | 0
total_plan_time     | 0
min_plan_time       | 0
max_plan_time       | 0
mean_plan_time      | 0
stddev_plan_time    | 0
calls               | 1
total_exec_time     | 265.195105
min_exec_time       | 265.195105
max_exec_time       | 265.195105
mean_exec_time      | 265.195105
stddev_exec_time    | 0
rows                | 175000
shared_blks_hit     | 1916
shared_blks_read    | 0
shared_blks_dirtied | 946
shared_blks_written | 946
local_blks_hit      | 0
local_blks_read     | 0
local_blks_dirtied  | 0
local_blks_written  | 0
temp_blks_read      | 0
temp_blks_written   | 0
blk_read_time       | 0
blk_write_time      | 0
wal_records         | 1116
wal_fpi             | 0
wal_bytes           | 3587203
-[ RECORD 2 ]-------+------------------------------------------------------------------------------------------------------
userid              | 10
dbid                | 13743
queryid             | 8570215596364326047
query               | copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',', parallel '2')
plans               | 0
total_plan_time     | 0
min_plan_time       | 0
max_plan_time       | 0
mean_plan_time      | 0
stddev_plan_time    | 0
calls               | 1
total_exec_time     | 35668.402482
min_exec_time       | 35668.402482
max_exec_time       | 35668.402482
mean_exec_time      | 35668.402482
stddev_exec_time    | 0
rows                | 175000
shared_blks_hit     | 3101
shared_blks_read    | 36
shared_blks_dirtied | 952
shared_blks_written | 919
local_blks_hit      | 0
local_blks_read     | 0
local_blks_dirtied  | 0
local_blks_written  | 0
temp_blks_read      | 0
temp_blks_written   | 0
blk_read_time       | 0
blk_write_time      | 0
wal_records         | 1119
wal_fpi             | 6
wal_bytes           | 3624405

> >
> > > 0003-Allow-copy-from-command-to-process-data-from-file-ST
> > > 10.
> > > In the commit message, you have written "The leader does not
> > > participate in the insertion of data, leaders only responsibility will
> > > be to identify the lines as fast as possible for the workers to do the
> > > actual copy operation. The leader waits till all the lines populated
> > > are processed by the workers and exits."
> > >
> > > I think you should also mention that we have chosen this design based
> > > on the reason "that everything stalls if the leader doesn't accept
> > > further input data, as well as when there are no available splitted
> > > chunks so it doesn't seem like a good idea to have the leader do other
> > > work.  This is backed by the performance data where we have seen that
> > > with 1 worker there is just a 5-10% (or whatever percentage difference
> > > you have seen) performance difference)".
> >
> > Fixed.
> >
>
> Make it a one-paragraph starting from "The leader does not participate
> in the insertion of data  .... just a 5-10% performance difference".
> Right now both the parts look a bit disconnected.
>

Made the contents starting from "The leader does not" in a paragraph.

> Few additional comments:
> ======================
> v5-0001-Copy-code-readjustment-to-support-parallel-copy
> ---------------------------------------------------------------------------------
> 1.
> +/*
> + * CLEAR_EOL_LINE - Wrapper for clearing EOL.
> + */
> +#define CLEAR_EOL_LINE() \
> +if (!result && !IsHeaderLine()) \
> + ClearEOLFromCopiedData(cstate, cstate->line_buf.data, \
> +    cstate->line_buf.len, \
> +    &cstate->line_buf.len) \
>
> I don't like this macro. I think it is sufficient to move the common
> code to be called from the parallel and non-parallel path in
> ClearEOLFromCopiedData but I think the other checks can be done
> in-place. I think having macros for such a thing makes code less
> readable.
>

I have removed the macro & called ClearEOLFromCopiedData directly
wherever required.

> 2.
> -
> +static void PopulateCommonCstateInfo(CopyState cstate, TupleDesc tup_desc,
> + List *attnamelist);
>
> Spurious line removal.
>

I have modified it to keep it as it is.

> v5-0002-Framework-for-leader-worker-in-parallel-copy
> ---------------------------------------------------------------------------
> 3.
> + FullTransactionId full_transaction_id; /* xid for copy from statement */
> + CommandId mycid; /* command id */
> + ParallelCopyLineBoundaries line_boundaries; /* line array */
> +} ParallelCopyShmInfo;
>
> We already serialize FullTransactionId and CommandId via
> InitializeParallelDSM->SerializeTransactionState. Can't we reuse it? I
> think recently Parallel Insert patch has also done something for this
> [2] so you can refer that if you want.
>

Changed it to remove setting of command id & full transaction id.
Added a function SetCurrentCommandIdUsedForWorker to set
currentCommandIdUsed to true & called GetCurrentCommandId by passing
!IsParallelCopy().

> v5-0004-Documentation-for-parallel-copy
> -----------------------------------------------------------
> 1.  Perform <command>COPY FROM</command> in parallel using <replaceable
> +      class="parameter"> integer</replaceable> background workers.
>
> No need for space before integer.
>

I have removed it.

Attached v6 patch with the fixes.



Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
vignesh C
Date:
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Hi Vignesh and Bharath,
>
> Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?

Yes we don't need to restrict parallelism for RI_TRIGGER_PK cases as
we don't do any command counter increments while performing PK checks
as opposed to RI_TRIGGER_FK/foreign key checks. We have modified this
in the v6 patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > Thanks Ashutosh for your comments.
> >
> > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > >
> > > Hi Vignesh,
> > >
> > > I've spent some time today looking at your new set of patches and I've
> > > some thoughts and queries which I would like to put here:
> > >
> > > Why are these not part of the shared cstate structure?
> > >
> > >     SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
> > >     SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > >     SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > >     SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> > >
> >
> > I have used shared_cstate mainly to share the integer & bool data
> > types from the leader to worker process. The above data types are of
> > char* data type, I will not be able to use it like how I could do it
> > for integer type. So I preferred to send these as separate keys to the
> > worker. Thoughts?
> >
>
> I think the way you have written will work but if we go with
> Ashutosh's proposal it will look elegant and in the future, if we need
> to share more strings as part of cstate structure then that would be
> easier. You can probably refer to EstimateParamListSpace,
> SerializeParamList, and RestoreParamList to see how we can share
> different types of data in one key.
>

Thanks for the solution Amit, I have fixed this and handled it in the
v6 patch shared in my previous mail.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Greg Nancarrow
Date:
On Thu, Oct 8, 2020 at 5:44 AM vignesh C <vignesh21@gmail.com> wrote:

> Attached v6 patch with the fixes.
>

Hi Vignesh,

I noticed a couple of issues when scanning the code in the following patch:

    v6-0003-Allow-copy-from-command-to-process-data-from-file.patch

In the following code, it will put a junk uint16 value into *destptr
(and thus may well cause a crash) on a Big Endian architecture
(Solaris Sparc, s390x, etc.):
You're storing a (uint16) string length in a uint32 and then pulling
out the lower two bytes of the uint32 and copying them into the
location pointed to by destptr.


static void
+CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
+ uint32 *copiedsize)
+{
+ uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
+
+ memcpy(destptr, (uint16 *) &len, sizeof(uint16));
+ *copiedsize += sizeof(uint16);
+ if (len)
+ {
+ memcpy(destptr + sizeof(uint16), srcPtr, len);
+ *copiedsize += len;
+ }
+}

I suggest you change the code to:

    uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
    memcpy(destptr, &len, sizeof(uint16));

[I assume string length here can't ever exceed (65535 - 1), right?]

Looking a bit deeper into this, I'm wondering if in fact your
EstimateStringSize() and EstimateNodeSize() functions should be using
BUFFERALIGN() for EACH stored string/node (rather than just calling
shm_toc_estimate_chunk() once at the end, after the length of packed
strings and nodes has been estimated), to ensure alignment of start of
each string/node. Other Postgres code appears to be aligning each
stored chunk using shm_toc_estimate_chunk(). See the definition of
that macro and its current usages.

Then you could safely use:

    uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
    *(uint16 *)destptr = len;
    *copiedsize += sizeof(uint16);
    if (len)
    {
        memcpy(destptr + sizeof(uint16), srcPtr, len);
        *copiedsize += len;
    }

and in the CopyStringFromSharedMemory() function, then could safely use:

    len = *(uint16 *)srcPtr;

The compiler may be smart enough to optimize-away the memcpy() in this
case anyway, but there are issues in doing this for architectures that
take a performance hit for unaligned access, or don't support
unaligned access.

Also, in CopyXXXXFromSharedMemory() functions, you should use palloc()
instead of palloc0(), as you're filling the entire palloc'd buffer
anyway, so no need to ask for additional MemSet() of all buffer bytes
to 0 prior to memcpy().


Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel copy

From
vignesh C
Date:
On Mon, Sep 28, 2020 at 6:37 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > Thanks Ashutosh for your comments.
> > >
> > > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > >
> > > > Hi Vignesh,
> > > >
> > > > I've spent some time today looking at your new set of patches and I've
> > > > some thoughts and queries which I would like to put here:
> > > >
> > > > Why are these not part of the shared cstate structure?
> > > >
> > > >     SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
> > > >     SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > > >     SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > > >     SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> > > >
> > >
> > > I have used shared_cstate mainly to share the integer & bool data
> > > types from the leader to worker process. The above data types are of
> > > char* data type, I will not be able to use it like how I could do it
> > > for integer type. So I preferred to send these as separate keys to the
> > > worker. Thoughts?
> > >
> >
> > I think the way you have written will work but if we go with
> > Ashutosh's proposal it will look elegant and in the future, if we need
> > to share more strings as part of cstate structure then that would be
> > easier. You can probably refer to EstimateParamListSpace,
> > SerializeParamList, and RestoreParamList to see how we can share
> > different types of data in one key.
> >
>
> Yeah. And in addition to that it will also reduce the number of DSM
> keys that we need to maintain.
>

Thanks Ashutosh, This is handled as part of the v6 patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Few additional comments:
> > ======================
>
> Some more comments:
>
> v5-0002-Framework-for-leader-worker-in-parallel-copy
> ===========================================
> 1.
> These values
> + * help in handover of multiple records with significant size of data to be
> + * processed by each of the workers to make sure there is no context
> switch & the
> + * work is fairly distributed among the workers.
>
> How about writing it as: "These values help in the handover of
> multiple records with the significant size of data to be processed by
> each of the workers. This also ensures there is no context switch and
> the work is fairly distributed among the workers."

Changed as suggested.

>
> 2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as
> power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024,
> and accordingly choose RINGSIZE. At many places, we do that way. I
> think it can sometimes help in faster processing due to cache size
> requirements and in this case, I don't see a reason why we can't
> choose these values to be power-of-two. If you agree with this change
> then also do some performance testing after this change?
>

Modified as suggested. Have checked a few performance tests & verified
there is no degradation. We will post a performance run of this
separately in the coming days.

> 3.
> + bool   curr_blk_completed;
> + char   data[DATA_BLOCK_SIZE]; /* data read from file */
> + uint8  skip_bytes;
> +} ParallelCopyDataBlock;
>
> Is there a reason to keep skip_bytes after data? Normally the variable
> size data is at the end of the structure. Also, there is no comment
> explaining the purpose of skip_bytes.
>

Modified as suggested and added comments.

> 4.
> + * Copy data block information.
> + * ParallelCopyDataBlock's will be created in DSM. Data read from file will be
> + * copied in these DSM data blocks. The leader process identifies the records
> + * and the record information will be shared to the workers. The workers will
> + * insert the records into the table. There can be one or more number
> of records
> + * in each of the data block based on the record size.
> + */
> +typedef struct ParallelCopyDataBlock
>
> Keep one empty line after the description line like below. I also
> suggested to do a minor tweak in the above sentence which is as
> follows:
>
> * Copy data block information.
> *
> * These data blocks are created in DSM. Data read ...
>
> Try to follow a similar format in other comments as well.
>

Modified as suggested.

> 5. I think it is better to move parallelism related code to a new file
> (we can name it as copyParallel.c or something like that).
>

Modified, added a copyparallel.c file to hold the copy parallelism
functionality. Some of the function prototypes & data structures were
moved to the copy.h header file so that they can be shared between
copy.c & copyparallel.c.

> 6. copy.c(1648,25): warning C4133: 'function': incompatible types -
> from 'ParallelCopyLineState *' to 'uint32 *'
> Getting above compilation warning on Windows.
>

Modified the data type.

> v5-0003-Allow-copy-from-command-to-process-data-from-file
> ==================================================
> 1.
> @@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate,
>   * only in text mode.
>   */
>   initStringInfo(&cstate->attribute_buf);
> - cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
> + cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *)
> palloc(RAW_BUF_SIZE + 1);
>
> Is there any way IsParallelCopy can be true by this time? AFAICS, we don't do
> anything about parallelism until after this. If you want to save this
> allocation then we need to move this after we determine that
> parallelism can be used or not and accordingly the below code in the
> patch needs to be changed.
>
>  * ParallelCopyFrom - parallel copy leader's functionality.
>   *
>   * Leader executes the before statement for before statement trigger, if before
> @@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate)
>   ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
>   ereport(DEBUG1, (errmsg("Running parallel copy leader")));
>
> + /* raw_buf is not used in parallel copy, instead data blocks are used.*/
> + pfree(cstate->raw_buf);
> + cstate->raw_buf = NULL;
>

Removed the palloc change; raw_buf will be allocated for both parallel
and non-parallel copy. One other solution I thought of was to move
the memory allocation to CopyFrom, but this might affect FDWs
which use BeginCopyFrom, NextCopyFrom & EndCopyFrom. So I have
kept the allocation in BeginCopyFrom & free it for parallel copy in
ParallelCopyFrom.

> Is there anything else also the allocation of which depends on parallelism?
>

I felt this is the only allocated memory that sequential copy requires
and which is not required in parallel copy.

> 2.
> +static pg_attribute_always_inline bool
> +IsParallelCopyAllowed(CopyState cstate)
> +{
> + /* Parallel copy not allowed for frontend (2.0 protocol) & binary option. */
> + if ((cstate->copy_dest == COPY_OLD_FE) || cstate->binary)
> + return false;
> +
> + /* Check if copy is into foreign table or temporary table. */
> + if (cstate->rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
> + RelationUsesLocalBuffers(cstate->rel))
> + return false;
> +
> + /* Check if trigger function is parallel safe. */
> + if (cstate->rel->trigdesc != NULL &&
> + !IsTriggerFunctionParallelSafe(cstate->rel->trigdesc))
> + return false;
> +
> + /*
> + * Check if there is after statement or instead of trigger or transition
> + * table triggers.
> + */
> + if (cstate->rel->trigdesc != NULL &&
> + (cstate->rel->trigdesc->trig_insert_after_statement ||
> + cstate->rel->trigdesc->trig_insert_instead_row ||
> + cstate->rel->trigdesc->trig_insert_new_table))
> + return false;
> +
> + /* Check if the volatile expressions are parallel safe, if present any. */
> + if (!CheckExprParallelSafety(cstate))
> + return false;
> +
> + /* Check if the insertion mode is single. */
> + if (FindInsertMethod(cstate) == CIM_SINGLE)
> + return false;
> +
> + return true;
> +}
>
> In the comments, we should write why parallelism is not allowed for a
> particular case. The cases where parallel-unsafe clause is involved
> are okay but it is not clear from comments why it is not allowed in
> other cases.
>

Added comments.

> 3.
> + ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
> + ParallelCopyLineBoundary *lineInfo;
> + uint32 line_first_block = pcshared_info->cur_block_pos;
> + line_pos = UpdateBlockInLineInfo(cstate,
> +    line_first_block,
> +    cstate->raw_buf_index, -1,
> +    LINE_LEADER_POPULATING);
> + lineInfo = &pcshared_info->line_boundaries.ring[line_pos];
> + elog(DEBUG1, "[Leader] Adding - block:%d, offset:%d, line position:%d",
> + line_first_block, lineInfo->start_offset, line_pos);
>
> Can we take all the code here inside function UpdateBlockInLineInfo? I
> see that it is called from one other place but I guess most of the
> surrounding code there can also be moved inside the function. Can we
> change the name of the function to UpdateSharedLineInfo or something
> like that and remove inline marking from this? I am not sure we want
> to inline such big functions. If it makes a difference in performance
> then we can probably consider it.
>

Changed as suggested.

> 4.
> EndLineParallelCopy()
> {
> ..
> + /* Update line size. */
> + pg_atomic_write_u32(&lineInfo->line_size, line_size);
> + pg_atomic_write_u32(&lineInfo->line_state, LINE_LEADER_POPULATED);
> + elog(DEBUG1, "[Leader] After adding - line position:%d, line_size:%d",
> + line_pos, line_size);
> ..
> }
>
> Can we instead call UpdateSharedLineInfo (new function name for
> UpdateBlockInLineInfo) to do this and maybe see it only updates the
> required info? The idea is to centralize the code for updating
> SharedLineInfo.
>

Updated as suggested.

> 5.
> +static uint32
> +GetLinePosition(CopyState cstate)
> +{
> + ParallelCopyData *pcdata = cstate->pcdata;
> + ParallelCopyShmInfo *pcshared_info = pcdata->pcshared_info;
> + uint32  previous_pos = pcdata->worker_processed_pos;
> + uint32 write_pos = (previous_pos == -1) ? 0 : (previous_pos + 1) % RINGSIZE;
>
> It seems to me that each worker has to hop through all the processed
> chunks before getting the chunk which it can process. This will work
> but I think it is better if we have some shared counter which can tell
> us the next chunk to be processed and avoid all the unnecessary work
> of hopping to find the exact position.

I had tried using a spinlock to track this position instead
of hopping through the processed chunks, but I did not get the earlier
performance results; there was a slight degradation:
Use case 2: 3 indexes on integer columns
Run on earlier patches without spinlock:
(220.680, 0, 1X), (185.096, 1, 1.19X), (134.811, 2, 1.64X), (114.585,
4, 1.92X), (107.707, 8, 2.05X), (101.253, 16, 2.18X), (100.749, 20,
2.19X), (100.656, 30, 2.19X)
Run on latest v6 patches with spinlock:
(216.059, 0, 1X), (177.639, 1, 1.22X), (145.213, 2, 1.49X), (126.370,
4, 1.71X), (121.013, 8, 1.78X), (102.933, 16, 2.1X), (103.000, 20,
2.1X), (100.308, 30, 2.15X)
I have not included these changes as there was some performance
degradation. I will try to come up with a different solution for this and
discuss it in the coming days. This point is not yet handled.


> v5-0004-Documentation-for-parallel-copy
> -----------------------------------------
> 1. Can you add one or two examples towards the end of the page where
> we have examples for other Copy options?
>
>
> Please run pgindent on all patches as that will make the code look better.

Have run pgindent on the latest patches.

> From the testing perspective,
> 1. Test by having something force_parallel_mode = regress which means
> that all existing Copy tests in the regression will be executed via
> new worker code. You can have this as a test-only patch for now and
> make sure all existing tests passed with this.
> 2. Do we have tests for toast tables? I think if you implement the
> previous point some existing tests might cover it but I feel we should
> have at least one or two tests for the same.
> 3. Have we checked the code coverage of the newly added code with
> existing tests?

These will be handled in the next few days.

These changes are present as part of the v6 patch set.

I'm summarizing the pending open points so that I don't miss anything:
1) Performance test on latest patch set.
2) Testing points suggested.
3) Support of parallel copy for COPY_OLD_FE.
4) Worker has to hop through all the processed chunks before getting
the chunk which it can process.
5) Handling of Tomas's comments.
6) Handling of Greg's comments.

We plan to work on these & complete them in the next few days.


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I am convinced by the reason given by Kyotaro-San in that another
> > thread [1] and performance data shown by Peter that this can't be an
> > independent improvement and rather in some cases it can do harm. Now,
> > if you need it for a parallel-copy path then we can change it
> > specifically to the parallel-copy code path but I don't understand
> > your reason completely.
> >
>
> Whenever we need data to be populated, we will get a new data block &
> pass it to CopyGetData to populate the data. In case of file copy, the
> server will completely fill the data block. We expect the data to be
> filled completely. If data is available it will completely load the
> complete data block in case of file copy. There is no scenario where
> even if data is present a partial data block will be returned except
> for EOF or no data available. But in case of STDIN data copy, even
> though there is 8K data available in data block & 8K data available in
> STDIN, CopyGetData will return as soon as libpq buffer data is more
> than the minread. We will pass new data block every time to load data.
> Every time we pass an 8K data block but CopyGetData loads a few bytes
> in the new data block & returns. I wanted to keep the same data
> population logic for both file copy & STDIN copy i.e copy full 8K data
> blocks & then the populated data can be required. There is an
> alternative solution I can have some special handling in case of STDIN
> wherein the existing data block can be passed with the index from
> where the data should be copied. Thoughts?
>

What you are proposing as an alternative solution, isn't that what we
are doing without the patch? IIUC, you require this because of your
corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that
right? If so, what is the difficulty in making it behave similarly to
the non-parallel case?

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > + */
> > > > +typedef struct ParallelCopyLineBoundary
> > > >
> > > > Are we doing all this state management to avoid using locks while
> > > > processing lines?  If so, I think we can use either spinlock or LWLock
> > > > to keep the main patch simple and then provide a later patch to make
> > > > it lock-less.  This will allow us to first focus on the main design of
> > > > the patch rather than trying to make this datastructure processing
> > > > lock-less in the best possible way.
> > > >
> > >
> > > The steps will be more or less same if we use spinlock too. step 1, step 3 & step 4 will be common we have to use
> > > lock & unlock instead of step 2 & step 5. I feel we can retain the current implementation.
> > >
> >
> > I'll study this in detail and let you know my opinion on the same but
> > in the meantime, I don't follow one part of this comment: "If they
> > don't follow this order the worker might process wrong line_size and
> > leader might populate the information which worker has not yet
> > processed or in the process of processing."
> >
> > Do you want to say that leader might overwrite some information which
> > worker hasn't read yet? If so, it is not clear from the comment.
> > Another minor point about this comment:
> >
>
> Here leader and worker must follow these steps to avoid any corruption
> or hang issue. Changed it to:
>  * The leader & worker process access the shared line information by following
>  * the below steps to avoid any data corruption or hang:
>

Actually, I wanted more along the lines of why such corruption or hang can
happen. It might help reviewers understand why you have followed
such a sequence.

> >
> > How did you ensure that this is fixed? Have you tested it, if so
> > please share the test? I see a basic problem with your fix.
> >
> > + /* Report WAL/buffer usage during parallel execution */
> > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > +   &walusage[ParallelWorkerNumber]);
> >
> > You need to call InstrStartParallelQuery() before the actual operation
> > starts, without that stats won't be accurate? Also, after calling
> > WaitForParallelWorkersToFinish(), you need to accumulate the stats
> > collected from workers which neither you have done nor is possible
> > with the current code in your patch because you haven't made any
> > provision to capture them in BeginParallelCopy.
> >
> > I suggest you look into lazy_parallel_vacuum_indexes() and
> > begin_parallel_vacuum() to understand how the buffer/wal usage stats
> > are accumulated. Also, please test this functionality using
> > pg_stat_statements.
> >
>
> Made changes accordingly.
> I have verified it using:
> postgres=# select * from pg_stat_statements where query like '%copy%';
>  userid | dbid  |       queryid        |
>                          query
>                | plans | total_plan_time |
> min_plan_time | max_plan_time | mean_plan_time | stddev_plan_time |
> calls | total_exec_time | min_exec_time | max_exec_time |
> mean_exec_time | stddev_exec_time |  rows  | shared_blks_hi
> t | shared_blks_read | shared_blks_dirtied | shared_blks_written |
> local_blks_hit | local_blks_read | local_blks_dirtied |
> local_blks_written | temp_blks_read | temp_blks_written | blk_
> read_time | blk_write_time | wal_records | wal_fpi | wal_bytes
>
--------+-------+----------------------+---------------------------------------------------------------------------------------------------------------------+-------+-----------------+-
>
--------------+---------------+----------------+------------------+-------+-----------------+---------------+---------------+----------------+------------------+--------+---------------
>
--+------------------+---------------------+---------------------+----------------+-----------------+--------------------+--------------------+----------------+-------------------+-----
> ----------+----------------+-------------+---------+-----------
>      10 | 13743 | -6947756673093447609 | copy hw from
> '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format
> csv, delimiter ',')               |     0 |               0 |
>             0 |             0 |              0 |                0 |
>  1 |      265.195105 |    265.195105 |    265.195105 |     265.195105
> |                0 | 175000 |            191
> 6 |                0 |                 946 |                 946 |
>          0 |               0 |                  0 |                  0
> |              0 |                 0 |
>         0 |              0 |        1116 |       0 |   3587203
>      10 | 13743 |  8570215596364326047 | copy hw from
> '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format
> csv, delimiter ',', parallel '2') |     0 |               0 |
>             0 |             0 |              0 |                0 |
>  1 |    35668.402482 |  35668.402482 |  35668.402482 |   35668.402482
> |                0 | 175000 |            310
> 1 |               36 |                 952 |                 919 |
>          0 |               0 |                  0 |                  0
> |              0 |                 0 |
>         0 |              0 |        1119 |       6 |   3624405
> (2 rows)
>

I am not able to properly parse the data, but if I understand correctly, the
WAL data for the non-parallel (1116 | 0 | 3587203) and parallel (1119 | 6 |
3624405) cases doesn't seem to be the same. Is that right? If so, why? Please
ensure that no checkpoint happens in either case.

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C <vignesh21@gmail.com> wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
>     v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
>     uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
>     memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>

Your suggestion makes sense to me if the assumption related to string
length is correct. If we can't ensure that then we probably need to
use a four-byte uint32 to store the length.

> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>

I am not sure if this is required for the purpose of correctness. AFAIU,
we do store/estimate multiple parameters in same way at other places,
see EstimateParamListSpace and SerializeParamList. Do you have
something else in mind?

While looking at the latest code, I observed below issue in patch
v6-0003-Allow-copy-from-command-to-process-data-from-file:

+ /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
+ est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
+ shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
+ &rangeTableStr, &attnameListStr,
+ ¬nullListStr, &nullListStr,
+ &convertListStr);

Here, do we need to separately estimate the size of
SerializedParallelCopyState when it is also done in
EstimateCstateSize?

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
Greg Nancarrow
Date:
On Fri, Oct 9, 2020 at 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Looking a bit deeper into this, I'm wondering if in fact your
> > EstimateStringSize() and EstimateNodeSize() functions should be using
> > BUFFERALIGN() for EACH stored string/node (rather than just calling
> > shm_toc_estimate_chunk() once at the end, after the length of packed
> > strings and nodes has been estimated), to ensure alignment of start of
> > each string/node. Other Postgres code appears to be aligning each
> > stored chunk using shm_toc_estimate_chunk(). See the definition of
> > that macro and its current usages.
> >
>
> I am not sure if this is required for the purpose of correctness. AFAIU,
> we do store/estimate multiple parameters in same way at other places,
> see EstimateParamListSpace and SerializeParamList. Do you have
> something else in mind?
>

The point I was trying to make is that potentially more efficient code
can be used if the individual strings/nodes are aligned, rather than
packed (as they are now), but as you point out, there are already
cases (e.g. SerializeParamList) where within the separately-aligned
chunks the data is not aligned, so maybe not a big deal. Oh well,
without alignment, that means use of memcpy() cannot really be avoided
here for serializing/de-serializing ints etc., let's hope the compiler
optimizes it as best it can.
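
For what it's worth, the portable pattern being referred to is simply
this (generic C, not from the patch):

#include <stdint.h>
#include <string.h>

/* Read a uint16 from a packed (possibly unaligned) buffer.  memcpy() is
 * safe on any architecture; *(uint16_t *) buf is only safe when buf is
 * suitably aligned.  Most compilers turn this into a single load on
 * platforms where unaligned access is cheap. */
static inline uint16_t
read_packed_uint16(const char *buf)
{
    uint16_t    val;

    memcpy(&val, buf, sizeof(val));
    return val;
}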

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel copy

From
Bharath Rupireddy
Date:
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> From the testing perspective,
> 1. Test by having something force_parallel_mode = regress which means
> that all existing Copy tests in the regression will be executed via
> new worker code. You can have this as a test-only patch for now and
> make sure all existing tests passed with this.
>

I don't think all the existing copy test cases (except the new test cases added in the parallel copy patch set) would run inside the parallel worker if force_parallel_mode is on. This is because the parallelism will be picked up for parallel copy only if the parallel option is specified, unlike parallelism for select queries.

Anyway, I ran with force_parallel_mode set to on and to regress. All copy-related tests and make check/make check-world ran fine.

>
> 2. Do we have tests for toast tables? I think if you implement the
> previous point some existing tests might cover it but I feel we should
> have at least one or two tests for the same.
>

Toast table use case 1: 10000 tuples, 9.6GB data, 3 indexes (2 on integer columns, 1 on a text column, not the toast column), csv file, each row is > 1320KB:
(222.767, 0, 1X), (134.171, 1, 1.66X), (93.749, 2, 2.38X), (93.672, 4, 2.38X), (94.827, 8, 2.35X), (93.766, 16, 2.37X), (98.153, 20, 2.27X), (122.721, 30, 1.81X)

Toast table use case 2: 100000 tuples, 96GB data, 3 indexes (2 on integer columns, 1 on a text column, not the toast column), csv file, each row is > 1320KB:
(2255.032, 0, 1X), (1358.628, 1, 1.66X), (901.170, 2, 2.5X), (912.743, 4, 2.47X), (988.718, 8, 2.28X), (938.000, 16, 2.4X), (997.556, 20, 2.26X), (1000.586, 30, 2.25X)

Toast table use case 3: 10000 tuples, 9.6GB, no indexes, binary file, each row is > 1320KB:
(136.983, 0, 1X), (136.418, 1, 1X), (81.896, 2, 1.66X), (62.929, 4, 2.16X), (52.311, 8, 2.6X), (40.032, 16, 3.49X), (44.097, 20, 3.09X), (62.310, 30, 2.18X)

In the case of a Toast table, we could achieve up to 2.5X for csv files, and 3.5X for binary files. We are analyzing this point and will post an update on our findings soon.

While testing the toast table case with a binary file, I discovered an issue with the earlier v6-0006-Parallel-Copy-For-Binary-Format-Files.patch from [1]; I fixed it and added the updated v6-0006 patch here. Please note that I'm also attaching patches 1 to 5 from version 6 just for completeness; they have no changes from what Vignesh sent earlier in [1].

>
> 3. Have we checked the code coverage of the newly added code with
> existing tests?
>

So far, we have manually ensured that most of the code paths are covered (see the list of test cases below). But we are also planning to measure code coverage with a tool in the coming days.

Apart from the above tests, I also captured performance measurements on the latest v6 patch set.

Use case 1: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, csv file
(1168.484, 0, 1X), (1116.442, 1, 1.05X), (641.272, 2, 1.82X), (338.963, 4, 3.45X), (202.914, 8, 5.76X), (139.884, 16, 8.35X), (128.955, 20, 9.06X), (131.898, 30, 8.86X)

Use case 2: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, binary file
(1097.83, 0, 1X), (1095.735, 1, 1.002X), (625.610, 2, 1.75X), (319.833, 4, 3.43X), (186.908, 8, 5.87X), (132.115, 16, 8.31X), (128.854, 20, 8.52X), (134.965, 30, 8.13X)

Use case 3: 10 million rows, 5.2GB data, 3 indexes on integer columns, csv file
(218.227, 0, 1X), (182.815, 1, 1.19X), (135.500, 2, 1.61X), (113.954, 4, 1.91X), (106.243, 8, 2.05X), (101.222, 16, 2.15X), (100.378, 20, 2.17X), (100.351, 30, 2.17X)

All the above tests are performed on the latest v6 patch set (attached here in this thread) with custom postgresql.conf [2]. The results are of the triplet form (exec time in sec, number of workers, gain).

Overall, we have the below test cases to cover the code and for performance measurements. We plan to run these tests whenever a new set of patches is posted.

1. csv
2. binary
3. force parallel mode = regress
4. toast data csv and binary
5. foreign key check, before row, after row, before statement, after statement, instead of triggers
6. partition case
7. foreign partitions and partitions having trigger cases
8. where clause having parallel unsafe and safe expression, default parallel unsafe and safe expression
9. temp, global, local, unlogged, inherited tables cases, foreign tables

[1] https://www.postgresql.org/message-id/CALDaNm29DJKy0-vozs8eeBRf2u3rbvPdZHCocrd0VjoWHS7h5A%40mail.gmail.com
[2]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Parallel copy

From
Amit Kapila
Date:
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > From the testing perspective,
> > 1. Test by having something force_parallel_mode = regress which means
> > that all existing Copy tests in the regression will be executed via
> > new worker code. You can have this as a test-only patch for now and
> > make sure all existing tests passed with this.
> >
>
> I don't think all the existing copy test cases (except the new test cases added in the parallel copy patch set) would
> run inside the parallel worker if force_parallel_mode is on. This is because, the parallelism will be picked up for
> parallel copy only if parallel option is specified unlike parallelism for select queries.
>

Sure, you need to change the code such that when force_parallel_mode =
'regress' is specified then it always uses one worker. This is
primarily for testing purposes and will help during the development of
this patch as it will make all existing Copy tests use quite a good
portion of the parallel infrastructure.

>
> All the above tests are performed on the latest v6 patch set (attached here in this thread) with custom
> postgresql.conf [1]. The results are of the triplet form (exec time in sec, number of workers, gain)
>

Okay, so I am assuming the performance is the same as we have seen
with the earlier versions of patches.

> Overall, we have below test cases to cover the code and for performance measurements. We plan to run these tests
> whenever a new set of patches is posted.
>
> 1. csv
> 2. binary

Don't we need the tests for plain text files as well?

> 3. force parallel mode = regress
> 4. toast data csv and binary
> 5. foreign key check, before row, after row, before statement, after statement, instead of triggers
> 6. partition case
> 7. foreign partitions and partitions having trigger cases
> 8. where clause having parallel unsafe and safe expression, default parallel unsafe and safe expression
> 9. temp, global, local, unlogged, inherited tables cases, foreign tables
>

Sounds like good coverage. So, are you doing all this testing
manually? How are you maintaining these tests?

--
With Regards,
Amit Kapila.



Re: Parallel copy

From
Bharath Rupireddy
Date:
On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > From the testing perspective,
> > > 1. Test by having something force_parallel_mode = regress which means
> > > that all existing Copy tests in the regression will be executed via
> > > new worker code. You can have this as a test-only patch for now and
> > > make sure all existing tests passed with this.
> > >
> >
> > I don't think all the existing copy test cases (except the new test cases added in the parallel copy patch set)
> > would run inside the parallel worker if force_parallel_mode is on. This is because, the parallelism will be picked up
> > for parallel copy only if parallel option is specified unlike parallelism for select queries.
> >
>
> Sure, you need to change the code such that when force_parallel_mode =
> 'regress' is specified then it always uses one worker. This is
> primarily for testing purposes and will help during the development of
> this patch as it will make all existing Copy tests use quite a good
> portion of the parallel infrastructure.
>

IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS
as the default value in guc.c, and then adjust the parallelism-related
code in copy.c such that it always picks 1 worker and spawns it. This
way, all the existing copy test cases would run in a parallel worker.
Please let me know if this is okay. If yes, I will do this and update
here.
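
Something along these lines in the worker-count decision, I think (a rough,
hypothetical sketch only; the exact field and function names in the patch may
differ, and cstate->nworkers here is assumed to be <= 0 when no PARALLEL
option was given):

    /* Hypothetical sketch: force one worker under force_parallel_mode = regress. */
    if (cstate->nworkers <= 0 && force_parallel_mode == FORCE_PARALLEL_REGRESS)
        cstate->nworkers = 1;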

>
> > All the above tests are performed on the latest v6 patch set (attached here in this thread) with custom
> > postgresql.conf [1]. The results are of the triplet form (exec time in sec, number of workers, gain)
> >
>
> Okay, so I am assuming the performance is the same as we have seen
> with the earlier versions of patches.
>

Yes. The most recent run was on the v5 patch set [1].

>
> > Overall, we have below test cases to cover the code and for performance measurements. We plan to run these tests
> > whenever a new set of patches is posted.
> >
> > 1. csv
> > 2. binary
>
> Don't we need the tests for plain text files as well?
>

Will add one.

>
> > 3. force parallel mode = regress
> > 4. toast data csv and binary
> > 5. foreign key check, before row, after row, before statement, after statement, instead of triggers
> > 6. partition case
> > 7. foreign partitions and partitions having trigger cases
> > 8. where clause having parallel unsafe and safe expression, default parallel unsafe and safe expression
> > 9. temp, global, local, unlogged, inherited tables cases, foreign tables
> >
>
> Sounds like good coverage. So, are you doing all this testing
> manually? How are you maintaining these tests?
>

Yes, running them manually. A few of the tests (1, 2, 4) require huge
datasets for performance measurements, and the other test cases are to
ensure we don't choose parallelism. We will try to add the test cases
that are not meant for performance measurement to the patch's tests.

[1] -
https://www.postgresql.org/message-id/CALj2ACW%3Djm5ri%2B7rXiQaFT_c5h2rVS%3DcJOQVFR5R%2Bbowt3QDkw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Fri, Oct 9, 2020 at 3:50 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > From the testing perspective,
> > > > 1. Test by having something force_parallel_mode = regress which means
> > > > that all existing Copy tests in the regression will be executed via
> > > > new worker code. You can have this as a test-only patch for now and
> > > > make sure all existing tests passed with this.
> > > >
> > >
> > > I don't think all the existing copy test cases (except the new test cases added in the parallel copy patch set)
> > > would run inside the parallel worker if force_parallel_mode is on. This is because, the parallelism will be picked up
> > > for parallel copy only if parallel option is specified unlike parallelism for select queries.
> > >
> >
> > Sure, you need to change the code such that when force_parallel_mode =
> > 'regress' is specified then it always uses one worker. This is
> > primarily for testing purposes and will help during the development of
> > this patch as it will make all existing Copy tests use quite a good
> > portion of the parallel infrastructure.
> >
>
> IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS
> as default value in guc.c,
>

No need to set this as the default value. You can change it in
postgresql.conf before running tests.

> and then adjust the parallelism related
> code in copy.c such that it always picks 1 worker and spawns it. This
> way, all the existing copy test cases would be run in parallel worker.
> Please let me know if this is okay.
>

Yeah, this sounds fine.

> If yes, I will do this and update
> here.
>

Okay, thanks, but do verify the difference in test execution before and
after your change: after your change, all the 'copy' tests should
invoke the worker to perform the copy.

> >
> > > All the above tests are performed on the latest v6 patch set (attached here in this thread) with custom
> > > postgresql.conf [1]. The results are of the triplet form (exec time in sec, number of workers, gain)
> > >
> >
> > Okay, so I am assuming the performance is the same as we have seen
> > with the earlier versions of patches.
> >
>
> Yes. Most recent run on v5 patch set [1]
>

Okay, good to know that.

--
With Regards,
Amit Kapila.



Re: Parallel copy

From
vignesh C
Date:
On Fri, Oct 9, 2020 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> While looking at the latest code, I observed below issue in patch
> v6-0003-Allow-copy-from-command-to-process-data-from-file:
>
> + /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
> + est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
> + shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
> + shm_toc_estimate_keys(&pcxt->estimator, 1);
> +
> + strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
> + &rangeTableStr, &attnameListStr,
> + &notnullListStr, &nullListStr,
> + &convertListStr);
>
> Here, do we need to separately estimate the size of
> SerializedParallelCopyState when it is also done in
> EstimateCstateSize?

This is not required; it has been removed in the attached patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Bharath Rupireddy
Date:
I did performance testing on v7 patch set[1] with custom
postgresql.conf[2]. The results are of the triplet form (exec time in
sec, number of workers, gain)

Use case 1: 10 million rows, 5.2GB data, 2 indexes on integer columns,
1 index on text column, binary file
(1104.898, 0, 1X), (1112.221, 1, 1X), (640.236, 2, 1.72X), (335.090,
4, 3.3X), (200.492, 8, 5.51X), (131.448, 16, 8.4X), (121.832, 20,
9.1X), (124.287, 30, 8.9X)

Use case 2: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1
index on text column, copy from stdin, csv format
(1203.282, 0, 1X), (1135.517, 1, 1.06X), (655.140, 2, 1.84X),
(343.688, 4, 3.5X), (203.742, 8, 5.9X), (144.793, 16, 8.31X),
(133.339, 20, 9.02X), (136.672, 30, 8.8X)

Use case 3: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1
index on text column, text file
(1165.991, 0, 1X), (1128.599, 1, 1.03X), (644.793, 2, 1.81X),
(342.813, 4, 3.4X), (204.279, 8, 5.71X), (139.986, 16, 8.33X),
(128.259, 20, 9.1X), (132.764, 30, 8.78X)

Above results are similar to the results with earlier versions of the patch set.

On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Sure, you need to change the code such that when force_parallel_mode =
> 'regress' is specified then it always uses one worker. This is
> primarily for testing purposes and will help during the development of
> this patch as it will make all existing Copy tests use quite a good
> portion of the parallel infrastructure.
>

I performed force_parallel_mode = regress testing and found 2 issues;
the fixes for these are available in the v7 patch set [1].

>
> > Overall, we have below test cases to cover the code and for performance measurements. We plan to run these tests
> > whenever a new set of patches is posted.
> >
> > 1. csv
> > 2. binary
>
> Don't we need the tests for plain text files as well?
>

I added a text use case; the above-mentioned results are the perf results on the v7 patch set [1].

>
> > 3. force parallel mode = regress
> > 4. toast data csv and binary
> > 5. foreign key check, before row, after row, before statement, after statement, instead of triggers
> > 6. partition case
> > 7. foreign partitions and partitions having trigger cases
> > 8. where clause having parallel unsafe and safe expression, default parallel unsafe and safe expression
> > 9. temp, global, local, unlogged, inherited tables cases, foreign tables
> >
>
> Sounds like good coverage. So, are you doing all this testing
> manually? How are you maintaining these tests?
>

All test cases listed above, except for the cases that are meant to
measure perf gain with huge data, are present in the v7-0005 patch in
the v7 patch set [1].

[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

[2]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:


On Fri, Oct 9, 2020 at 10:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > I am convinced by the reason given by Kyotaro-San in that another
> > > thread [1] and performance data shown by Peter that this can't be an
> > > independent improvement and rather in some cases it can do harm. Now,
> > > if you need it for a parallel-copy path then we can change it
> > > specifically to the parallel-copy code path but I don't understand
> > > your reason completely.
> > >
> >
> > Whenever we need data to be populated, we will get a new data block &
> > pass it to CopyGetData to populate the data. In case of file copy, the
> > server will completely fill the data block. We expect the data to be
> > filled completely. If data is available it will completely load the
> > complete data block in case of file copy. There is no scenario where
> > even if data is present a partial data block will be returned except
> > for EOF or no data available. But in case of STDIN data copy, even
> > though there is 8K data available in data block & 8K data available in
> > STDIN, CopyGetData will return as soon as libpq buffer data is more
> > than the minread. We will pass new data block every time to load data.
> > Every time we pass an 8K data block but CopyGetData loads a few bytes
> > in the new data block & returns. I wanted to keep the same data
> > population logic for both file copy & STDIN copy i.e copy full 8K data
> > blocks & then the populated data can be required. There is an
> > alternative solution I can have some special handling in case of STDIN
> > wherein the existing data block can be passed with the index from
> > where the data should be copied. Thoughts?
> >
>
> What you are proposing as an alternative solution, isn't that what we
> are doing without the patch? IIUC, you require this because of your
> corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that
> right? If so, what is the difficulty in making it behave similar to
> the non-parallel case?
>

The alternate solution is similar to how existing copy handles STDIN copies. I have made changes in the v7 patch attached in [1] to make parallel copy handle STDIN data similarly to non-parallel copy, so the original comment on why this change is required has been removed from the 0001 patch:
> > + if (cstate->copy_dest == COPY_NEW_FE)
> > + minread = RAW_BUF_SIZE - nbytes;
> > +
> >   inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > -   1, RAW_BUF_SIZE - nbytes);
> > +   minread, RAW_BUF_SIZE - nbytes);
> >
> > No comment to explain why this change is done?

[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
vignesh C
Date:
On Fri, Oct 9, 2020 at 11:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > + */
> > > > > +typedef struct ParallelCopyLineBoundary
> > > > >
> > > > > Are we doing all this state management to avoid using locks while
> > > > > processing lines?  If so, I think we can use either spinlock or LWLock
> > > > > to keep the main patch simple and then provide a later patch to make
> > > > > it lock-less.  This will allow us to first focus on the main design of
> > > > > the patch rather than trying to make this datastructure processing
> > > > > lock-less in the best possible way.
> > > > >
> > > >
> > > > The steps will be more or less same if we use spinlock too. step 1, step 3 & step 4 will be common we have to
> > > > use lock & unlock instead of step 2 & step 5. I feel we can retain the current implementation.
> > > >
> > >
> > > I'll study this in detail and let you know my opinion on the same but
> > > in the meantime, I don't follow one part of this comment: "If they
> > > don't follow this order the worker might process wrong line_size and
> > > leader might populate the information which worker has not yet
> > > processed or in the process of processing."
> > >
> > > Do you want to say that leader might overwrite some information which
> > > worker hasn't read yet? If so, it is not clear from the comment.
> > > Another minor point about this comment:
> > >
> >
> > Here leader and worker must follow these steps to avoid any corruption
> > or hang issue. Changed it to:
> >  * The leader & worker process access the shared line information by following
> >  * the below steps to avoid any data corruption or hang:
> >
>
> Actually, I wanted more on the lines why such corruption or hang can
> happen? It might help reviewers to understand why you have followed
> such a sequence.

There are 3 variables which the leader & worker work on: line_size,
line_state & data. The leader first updates line_state, populates the
data, and then updates line_size & line_state. The worker waits for
line_state to be updated; once it is, the worker reads the data based
on line_size. If the leader and worker are not synchronized in this
way, a wrong line_size may be read and the wrong amount of data
processed, and anything can happen. This is the usual concurrency case
with readers/writers. I felt that much detail need not be mentioned.
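
Roughly, the ordering is as below (an illustrative sketch only, not the
exact patch code; lineInfo, data_blk_ptr and ProcessOneLine are placeholder
names, and the LINE_LEADER_* states follow the patch's enum):

    /* Leader, for each line: */
    pg_atomic_write_u32(&lineInfo->line_state, LINE_LEADER_POPULATING);
    memcpy(data_blk_ptr, line_data, line_len);  /* copy the line into the DSM data block */
    lineInfo->line_size = line_len;
    pg_write_barrier();                         /* publish data/size before the state flag */
    pg_atomic_write_u32(&lineInfo->line_state, LINE_LEADER_POPULATED);

    /* Worker, for each line: */
    while (pg_atomic_read_u32(&lineInfo->line_state) != LINE_LEADER_POPULATED)
        ;                                       /* wait for the leader */
    pg_read_barrier();                          /* read the flag before data/size */
    ProcessOneLine(data_blk_ptr, lineInfo->line_size);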

> > >
> > > How did you ensure that this is fixed? Have you tested it, if so
> > > please share the test? I see a basic problem with your fix.
> > >
> > > + /* Report WAL/buffer usage during parallel execution */
> > > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> > > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> > > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > > +   &walusage[ParallelWorkerNumber]);
> > >
> > > You need to call InstrStartParallelQuery() before the actual operation
> > > starts, without that stats won't be accurate? Also, after calling
> > > WaitForParallelWorkersToFinish(), you need to accumulate the stats
> > > collected from workers which neither you have done nor is possible
> > > with the current code in your patch because you haven't made any
> > > provision to capture them in BeginParallelCopy.
> > >
> > > I suggest you look into lazy_parallel_vacuum_indexes() and
> > > begin_parallel_vacuum() to understand how the buffer/wal usage stats
> > > are accumulated. Also, please test this functionality using
> > > pg_stat_statements.
> > >
> >
> > Made changes accordingly.
> > I have verified it using:
> > postgres=# select * from pg_stat_statements where query like '%copy%';
> >  userid | dbid  |       queryid        |
> >                          query
> >                | plans | total_plan_time |
> > min_plan_time | max_plan_time | mean_plan_time | stddev_plan_time |
> > calls | total_exec_time | min_exec_time | max_exec_time |
> > mean_exec_time | stddev_exec_time |  rows  | shared_blks_hi
> > t | shared_blks_read | shared_blks_dirtied | shared_blks_written |
> > local_blks_hit | local_blks_read | local_blks_dirtied |
> > local_blks_written | temp_blks_read | temp_blks_written | blk_
> > read_time | blk_write_time | wal_records | wal_fpi | wal_bytes
> >
--------+-------+----------------------+---------------------------------------------------------------------------------------------------------------------+-------+-----------------+-
> >
--------------+---------------+----------------+------------------+-------+-----------------+---------------+---------------+----------------+------------------+--------+---------------
> >
--+------------------+---------------------+---------------------+----------------+-----------------+--------------------+--------------------+----------------+-------------------+-----
> > ----------+----------------+-------------+---------+-----------
> >      10 | 13743 | -6947756673093447609 | copy hw from
> > '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format
> > csv, delimiter ',')               |     0 |               0 |
> >             0 |             0 |              0 |                0 |
> >  1 |      265.195105 |    265.195105 |    265.195105 |     265.195105
> > |                0 | 175000 |            191
> > 6 |                0 |                 946 |                 946 |
> >          0 |               0 |                  0 |                  0
> > |              0 |                 0 |
> >         0 |              0 |        1116 |       0 |   3587203
> >      10 | 13743 |  8570215596364326047 | copy hw from
> > '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format
> > csv, delimiter ',', parallel '2') |     0 |               0 |
> >             0 |             0 |              0 |                0 |
> >  1 |    35668.402482 |  35668.402482 |  35668.402482 |   35668.402482
> > |                0 | 175000 |            310
> > 1 |               36 |                 952 |                 919 |
> >          0 |               0 |                  0 |                  0
> > |              0 |                 0 |
> >         0 |              0 |        1119 |       6 |   3624405
> > (2 rows)
> >
>
> I am not able to properly parse the data but if I understand correctly, the wal
> data for the non-parallel (1116 |       0 |   3587203) and parallel (1119
> |       6 |   3624405) cases doesn't seem to be the same. Is that
> right? If so, why? Please ensure that no checkpoint happens for both
> cases.
>

I have disabled checkpoints; the results with checkpoints disabled
are given below:
                          | wal_records | wal_fpi | wal_bytes
Sequential Copy           | 1116        | 0       | 3587669
Parallel Copy (1 worker)  | 1116        | 0       | 3587669
Parallel Copy (4 workers) | 1121        | 0       | 3587668
I noticed that for 1 worker the wal_records & wal_bytes are the same as
for sequential copy, but with other worker counts there is a difference
in wal_records & wal_bytes. I think the difference should be ok because
with more than 1 worker the order in which records are processed differs
based on which worker picks which records from the input file. In the
case of sequential copy/1 worker the records are always processed in the
same order, hence the wal_bytes are the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hello Vignesh,
>
> I've done some basic benchmarking on the v4 version of the patches (but
> AFAIKC the v5 should perform about the same), and some initial review.
>
> For the benchmarking, I used the lineitem table from TPC-H - for 75GB
> data set, this largest table is about 64GB once loaded, with another
> 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
> NVME storage.
>
> The COPY duration with varying number of workers (specified using the
> parallel COPY option) looks like this:
>
>       workers    duration
>      ---------------------
>             0        1366
>             1        1255
>             2         704
>             3         526
>             4         434
>             5         385
>             6         347
>             7         322
>             8         327
>
> So this seems to work pretty well - initially we get almost linear
> speedup, then it slows down (likely due to contention for locks, I/O
> etc.). Not bad.

Thanks for testing with different workers & posting the results.

> I've only done a quick review, but overall the patch looks in fairly
> good shape.
>
> 1) I don't quite understand why we need INCREMENTPROCESSED and
> RETURNPROCESSED, considering it just does ++ or return. It just
> obfuscated the code, I think.
>

I have removed the macros.

> 2) I find it somewhat strange that BeginParallelCopy can just decide not
> to do parallel copy after all. Why not to do this decisions in the
> caller? Or maybe it's fine this way, not sure.
>

I have moved the check IsParallelCopyAllowed to the caller.

> 3) AFAIK we don't modify typedefs.list in patches, so these changes
> should be removed.
>

I have seen that typedefs.list gets changed in many commits, and it also helps in running pgindent, so I'm retaining this change.

> 4) IsTriggerFunctionParallelSafe actually checks all triggers, not just
> one, so the comment needs minor rewording.
>

Modified the comments.

Thanks for the comments and for sharing the test results, Tomas. These changes are fixed in one of my earlier mails [1].

[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
vignesh C
Date:
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C <vignesh21@gmail.com> wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
>     v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
>     uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
>     memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>
> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>

I'm not handling this; it is similar to how it is handled in other places.

> Then you could safely use:
>
>     uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
>     *(uint16 *)destptr = len;
>     *copiedsize += sizeof(uint16);
>     if (len)
>     {
>         memcpy(destptr + sizeof(uint16), srcPtr, len);
>         *copiedsize += len;
>     }
>
> and in the CopyStringFromSharedMemory() function, then could safely use:
>
>     len = *(uint16 *)srcPtr;
>
> The compiler may be smart enough to optimize-away the memcpy() in this
> case anyway, but there are issues in doing this for architectures that
> take a performance hit for unaligned access, or don't support
> unaligned access.

Changed it to uint32, so that there are no issues in case the length exceeds 65535, and also to avoid problems on big-endian architectures.
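
For reference, the helper now looks roughly like this (paraphrased from the quoted code above with the length widened to uint32; not the exact patch code):

    static void
    CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
                             uint32 *copiedsize)
    {
        uint32      len = srcPtr ? strlen(srcPtr) + 1 : 0;

        /* Copy the whole uint32 length, so nothing is truncated on big-endian machines. */
        memcpy(destptr, &len, sizeof(uint32));
        *copiedsize += sizeof(uint32);
        if (len)
        {
            memcpy(destptr + sizeof(uint32), srcPtr, len);
            *copiedsize += len;
        }
    }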

> Also, in CopyXXXXFromSharedMemory() functions, you should use palloc()
> instead of palloc0(), as you're filling the entire palloc'd buffer
> anyway, so no need to ask for additional MemSet() of all buffer bytes
> to 0 prior to memcpy().
>

I have changed palloc0 to palloc.

Thanks, Greg, for reviewing and providing your comments. These changes are fixed in one of my earlier mails [1].
[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
Amit Kapila
Date:
On Wed, Oct 14, 2020 at 6:51 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 11:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I am not able to properly parse the data but if I understand correctly, the wal
> > data for the non-parallel (1116 |       0 |   3587203) and parallel (1119
> > |       6 |   3624405) cases doesn't seem to be the same. Is that
> > right? If so, why? Please ensure that no checkpoint happens for both
> > cases.
> >
>
> I have disabled checkpoint, the results with the checkpoint disabled
> are given below:
>                                            | wal_records | wal_fpi | wal_bytes
> Sequential Copy                   | 1116            |       0   |   3587669
> Parallel Copy(1 worker)         | 1116            |       0   |   3587669
> Parallel Copy(4 worker)         | 1121            |       0   |   3587668
> I noticed that for 1 worker wal_records & wal_bytes are same as
> sequential copy, but with different worker count I had noticed that
> there is difference in wal_records & wal_bytes, I think the difference
> should be ok because with more than 1 worker the order of records
> processed will be different based on which worker picks which records
> to process from input file. In the case of sequential copy/1 worker
> the order in which the records will be processed is always in the same
> order hence wal_bytes are the same.
>

Are all records of the same size in your test? If so, then why should
the order matter? Also, even though the number of wal_records has
increased, wal_bytes has not increased; rather, it is one byte less.
Can we identify what is going on here? I don't intend to say that it
is a problem, but we should know the reason clearly.

-- 
With Regards,
Amit Kapila.



RE: Parallel copy

From
"Hou, Zhijie"
Date:
Hi Vignesh,

After having a look at the patch,
I have some suggestions for
0003-Allow-copy-from-command-to-process-data-from-file.patch.

1.

+static uint32
+EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
+                   char **whereClauseStr, char **rangeTableStr,
+                   char **attnameListStr, char **notnullListStr,
+                   char **nullListStr, char **convertListStr)
+{
+    uint32        strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
+
+    strsize += EstimateStringSize(cstate->null_print);
+    strsize += EstimateStringSize(cstate->delim);
+    strsize += EstimateStringSize(cstate->quote);
+    strsize += EstimateStringSize(cstate->escape);


It uses the function EstimateStringSize to get the strlen of null_print, delim, quote and escape.
But the length of null_print seems to have already been stored in null_print_len.
And delim/quote/escape must be 1 byte, so calling strlen again seems unnecessary.

How about "strsize += sizeof(uint32) + cstate->null_print_len + 1"?

2.
+    strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr);

+    copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr,
+                                           shmptr + copiedsize);

Some string lengths are counted twice.
For 'whereClauseStr', strlen is called once in EstimateNodeSize and again in CopyStringToSharedMemory.
I don't know whether it's worth refactoring the code to avoid the duplicate strlen. What do you think?

Best regards,
houzj









Re: Parallel copy

From
Amit Kapila
Date:
On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi Vignesh,
>
> After having a look over the patch,
> I have some suggestions for
> 0003-Allow-copy-from-command-to-process-data-from-file.patch.
>
> 1.
>
> +static uint32
> +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
> +                                  char **whereClauseStr, char **rangeTableStr,
> +                                  char **attnameListStr, char **notnullListStr,
> +                                  char **nullListStr, char **convertListStr)
> +{
> +       uint32          strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
> +
> +       strsize += EstimateStringSize(cstate->null_print);
> +       strsize += EstimateStringSize(cstate->delim);
> +       strsize += EstimateStringSize(cstate->quote);
> +       strsize += EstimateStringSize(cstate->escape);
>
>
> It use function EstimateStringSize to get the strlen of null_print, delim, quote and escape.
> But the length of null_print seems has been stored in null_print_len.
> And delim/quote/escape must be 1 byte, so I think call strlen again seems unnecessary.
>
> How about  " strsize += sizeof(uint32) + cstate->null_print_len + 1"
>

+1. This seems like a good suggestion, but add comments for
delim/quote/escape to indicate that we are considering one byte for
each. I think this will obviate the need for the function
EstimateStringSize. Another thing in this regard is that we normally
use the add_size function to compute the size, but I don't see that being
used in this and nearby computations. That helps us to detect overflow
of the addition, if any.
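
Something like the following, for illustration only (the exact fields and
sizes in the patch may differ; delim/quote/escape are assumed to be
single-byte strings and null_print_len to hold strlen(null_print)):

    Size        strsize = MAXALIGN(sizeof(SerializedParallelCopyState));

    /* add_size() errors out on overflow instead of silently wrapping. */
    strsize = add_size(strsize, sizeof(uint32) + cstate->null_print_len + 1);
    strsize = add_size(strsize, sizeof(uint32) + 2);    /* delim: 1 byte + NUL */
    strsize = add_size(strsize, sizeof(uint32) + 2);    /* quote: 1 byte + NUL */
    strsize = add_size(strsize, sizeof(uint32) + 2);    /* escape: 1 byte + NUL */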

EstimateCstateSize()
{
..
+
+ strsize++;
..
}

Why do we need this additional one-byte increment? Does it make sense
to add a small comment for the same?

> 2.
> +       strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr);
>
> +       copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr,
> +                                                                                  shmptr + copiedsize);
>
> Some string length is counted for two times.
> The ' whereClauseStr ' has call strlen in EstimateNodeSize once and call strlen in CopyStringToSharedMemory again.
> I don't know wheather it's worth to refacor the code to avoid duplicate strlen . what do you think ?
>

It doesn't seem worth it to me. We would probably need to use additional
variables to save those lengths, and I think that will add more
code/complexity than it saves. See EstimateParamListSpace and
SerializeParamList, where we get the typeLen each time; that way the
code looks neat to me, and we are not going to save much by not
following a similar approach here.

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
vignesh C
Date:
On Thu, Oct 15, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 14, 2020 at 6:51 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Fri, Oct 9, 2020 at 11:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > I am not able to properly parse the data but if I understand correctly, the wal
> > > data for the non-parallel (1116 |       0 |   3587203) and parallel (1119
> > > |       6 |   3624405) cases doesn't seem to be the same. Is that
> > > right? If so, why? Please ensure that no checkpoint happens for both
> > > cases.
> > >
> >
> > I have disabled checkpoint, the results with the checkpoint disabled
> > are given below:
> >                                            | wal_records | wal_fpi | wal_bytes
> > Sequential Copy                   | 1116            |       0   |   3587669
> > Parallel Copy(1 worker)         | 1116            |       0   |   3587669
> > Parallel Copy(4 worker)         | 1121            |       0   |   3587668
> > I noticed that for 1 worker wal_records & wal_bytes are same as
> > sequential copy, but with different worker count I had noticed that
> > there is difference in wal_records & wal_bytes, I think the difference
> > should be ok because with more than 1 worker the order of records
> > processed will be different based on which worker picks which records
> > to process from input file. In the case of sequential copy/1 worker
> > the order in which the records will be processed is always in the same
> > order hence wal_bytes are the same.
> >
>
> Are all records of the same size in your test? If so, then why the
> order should matter? Also, even the number of wal_records has
> increased but wal_bytes are not increased, rather it is one-byte less.
> Can we identify what is going on here? I don't intend to say that it
> is a problem but we should know the reason clearly.

The earlier run that I executed was with varying record sizes. The
results below are after modifying the records to keep them the same size:
                           | wal_records | wal_fpi | wal_bytes
Sequential Copy            | 1307        | 0       | 4198526
Parallel Copy (1 worker)   | 1307        | 0       | 4198526
Parallel Copy (2 workers)  | 1308        | 0       | 4198836
Parallel Copy (4 workers)  | 1307        | 0       | 4199147
Parallel Copy (8 workers)  | 1312        | 0       | 4199735
Parallel Copy (16 workers) | 1313        | 0       | 4200311

I still noticed some difference in wal_records & wal_bytes. I feel the
difference is because of the following reasons:
Each worker prepares 1000 tuples and then does heap_multi_insert for
those 1000 tuples. In our case approximately 185 tuples fit in 1 page,
so 925 tuples are stored in 5 WAL records and the remaining 75 tuples
in the next WAL record. The wal dump looks like below:
rmgr: Heap2       len (rec/tot):   3750/  3750, tx:        510, lsn:
0/0160EC80, prev 0/0160DDB0, desc: MULTI_INSERT+INIT 185 tuples flags
0x00, blkref #0: rel 1663/13751/16384 blk 0
rmgr: Heap2       len (rec/tot):   3750/  3750, tx:        510, lsn:
0/0160FB28, prev 0/0160EC80, desc: MULTI_INSERT+INIT 185 tuples flags
0x00, blkref #0: rel 1663/13751/16384 blk 1
rmgr: Heap2       len (rec/tot):   3750/  3750, tx:        510, lsn:
0/016109E8, prev 0/0160FB28, desc: MULTI_INSERT+INIT 185 tuples flags
0x00, blkref #0: rel 1663/13751/16384 blk 2
rmgr: Heap2       len (rec/tot):   3750/  3750, tx:        510, lsn:
0/01611890, prev 0/016109E8, desc: MULTI_INSERT+INIT 185 tuples flags
0x00, blkref #0: rel 1663/13751/16384 blk 3
rmgr: Heap2       len (rec/tot):   3750/  3750, tx:        510, lsn:
0/01612750, prev 0/01611890, desc: MULTI_INSERT+INIT 185 tuples flags
0x00, blkref #0: rel 1663/13751/16384 blk 4
rmgr: Heap2       len (rec/tot):   1550/  1550, tx:        510, lsn:
0/016135F8, prev 0/01612750, desc: MULTI_INSERT+INIT 75 tuples flags
0x02, blkref #0: rel 1663/13751/16384 blk 5

After the first 1000 tuples are inserted, when the worker tries to
insert another 1000 tuples it will first use the last page, which still
has free space for 110 more tuples:
rmgr: Heap2       len (rec/tot):   2470/  2470, tx:        510, lsn:
0/01613C08, prev 0/016135F8, desc: MULTI_INSERT 110 tuples flags 0x00,
blkref #0: rel 1663/13751/16384 blk 5
rmgr: Heap2       len (rec/tot):   3750/  3750, tx:        510, lsn:
0/016145C8, prev 0/01613C08, desc: MULTI_INSERT+INIT 185 tuples flags
0x00, blkref #0: rel 1663/13751/16384 blk 6
rmgr: Heap2       len (rec/tot):   3750/  3750, tx:        510, lsn:
0/01615470, prev 0/016145C8, desc: MULTI_INSERT+INIT 185 tuples flags
0x00, blkref #0: rel 1663/13751/16384 blk 7
rmgr: Heap2       len (rec/tot):   3750/  3750, tx:        510, lsn:
0/01616330, prev 0/01615470, desc: MULTI_INSERT+INIT 185 tuples flags
0x00, blkref #0: rel 1663/13751/16384 blk 8
rmgr: Heap2       len (rec/tot):   3750/  3750, tx:        510, lsn:
0/016171D8, prev 0/01616330, desc: MULTI_INSERT+INIT 185 tuples flags
0x00, blkref #0: rel 1663/13751/16384 blk 9
rmgr: Heap2       len (rec/tot):   3050/  3050, tx:        510, lsn:
0/01618098, prev 0/016171D8, desc: MULTI_INSERT+INIT 150 tuples flags
0x02, blkref #0: rel 1663/13751/16384 blk 10

This behavior is the same for sequential copy and copy with 1 worker,
as the sequence of inserts & the pages used to insert are in the same
order. Two reasons together result in the varying wal_bytes &
wal_records with multiple workers: 1) When more than 1 worker is
involved, the sequence in which pages are selected is not guaranteed,
so the MULTI_INSERT tuple count varies and the
MULTI_INSERT/MULTI_INSERT+INIT description varies. 2) wal_records
increases with a higher number of workers because, when the tuples are
split across the workers, one of the workers will have a few more WAL
records, since the last heap_multi_insert gets split across the workers
and generates new wal records like:
rmgr: Heap2       len (rec/tot):    600/   600, tx:        510, lsn:
0/019F8B08, prev 0/019F7C48, desc: MULTI_INSERT 25 tuples flags 0x00,
blkref #0: rel 1663/13751/16384 blk 1065

Attached the tar of wal file dump which was used for analysis.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Bharath Rupireddy
Date:
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 2. Do we have tests for toast tables? I think if you implement the
> > previous point some existing tests might cover it but I feel we should
> > have at least one or two tests for the same.
> >
> Toast table use case 1: 10000 tuples, 9.6GB data, 3 indexes 2 on integer columns, 1 on text column(not the toast column), csv file, each row is > 1320KB:
> (222.767, 0, 1X), (134.171, 1, 1.66X), (93.749, 2, 2.38X), (93.672, 4, 2.38X), (94.827, 8, 2.35X), (93.766, 16, 2.37X), (98.153, 20, 2.27X), (122.721, 30, 1.81X)
>
> Toast table use case 2: 100000 tuples, 96GB data, 3 indexes 2 on integer columns, 1 on text column(not the toast column), csv file, each row is > 1320KB:
> (2255.032, 0, 1X), (1358.628, 1, 1.66X), (901.170, 2, 2.5X), (912.743, 4, 2.47X), (988.718, 8, 2.28X), (938.000, 16, 2.4X), (997.556, 20, 2.26X), (1000.586, 30, 2.25X)
>
> Toast table use case3: 10000 tuples, 9.6GB, no indexes, binary file, each row is > 1320KB:
> (136.983, 0, 1X), (136.418, 1, 1X), (81.896, 2, 1.66X), (62.929, 4, 2.16X), (52.311, 8, 2.6X), (40.032, 16, 3.49X), (44.097, 20, 3.09X), (62.310, 30, 2.18X)
>
> In the case of a Toast table, we could achieve upto 2.5X for csv files, and 3.5X for binary files. We are analyzing this point and will post an update on our findings soon.
>

I analyzed the above point of getting only up to 2.5X performance improvement for csv files with a toast table with 3 indexes - 2 on integer columns and 1 on a text column (not the toast column). The reason is that the workers are fast enough to do the work and end up waiting for the leader to fill in the data blocks, and in this case the leader is already serving the workers at its maximum possible speed. Hence most of the time the workers are waiting, not doing any beneficial work.

Having observed the above point, I tried to make the workers perform more work to reduce their waiting time. For this, I added a gist index on the toasted text column. The use case and results are as follows.

Toast table use case4: 10000 tuples, 9.6GB, 4 indexes - 2 on integer columns, 1 on non-toasted text column and 1 gist index on toasted text column, csv file, each row is  ~ 12.2KB:

(1322.839, 0, 1X), (1261.176, 1, 1.05X), (632.296, 2, 2.09X), (321.941, 4, 4.11X), (181.796, 8, 7.27X), (105.750, 16, 12.51X), (107.099, 20, 12.35X), (123.262, 30, 10.73X)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
vignesh C
Date:
On Mon, Oct 19, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
> >
> > Hi Vignesh,
> >
> > After having a look over the patch,
> > I have some suggestions for
> > 0003-Allow-copy-from-command-to-process-data-from-file.patch.
> >
> > 1.
> >
> > +static uint32
> > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
> > +                                  char **whereClauseStr, char **rangeTableStr,
> > +                                  char **attnameListStr, char **notnullListStr,
> > +                                  char **nullListStr, char **convertListStr)
> > +{
> > +       uint32          strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
> > +
> > +       strsize += EstimateStringSize(cstate->null_print);
> > +       strsize += EstimateStringSize(cstate->delim);
> > +       strsize += EstimateStringSize(cstate->quote);
> > +       strsize += EstimateStringSize(cstate->escape);
> >
> >
> > It use function EstimateStringSize to get the strlen of null_print, delim, quote and escape.
> > But the length of null_print seems has been stored in null_print_len.
> > And delim/quote/escape must be 1 byte, so I think call strlen again seems unnecessary.
> >
> > How about  " strsize += sizeof(uint32) + cstate->null_print_len + 1"
> >
>
> +1. This seems like a good suggestion but add comments for
> delim/quote/escape to indicate that we are considering one-byte for
> each. I think this will obviate the need of function
> EstimateStringSize. Another thing in this regard is that we normally
> use add_size function to compute the size but I don't see that being
> used in this and nearby computation. That helps us to detect overflow
> of addition if any.
>
> EstimateCstateSize()
> {
> ..
> +
> + strsize++;
> ..
> }
>
> Why do we need this additional one-byte increment? Does it make sense
> to add a small comment for the same?
>

Changed it to handle null_print, delim, quote & escape accordingly in
the attached patch. The one-byte increment is not required; I have
removed it.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
vignesh C
Date:


On Thu, Oct 8, 2020 at 11:15 AM vignesh C <vignesh21@gmail.com> wrote:
>
> I'm summarizing the pending open points so that I don't miss anything:
> 1) Performance test on latest patch set.

It is tested and the results are shared by Bharath at [1].

> 2) Testing points suggested.

Tests are added as suggested and the details are shared by Bharath at [1].

> 3) Support of parallel copy for COPY_OLD_FE.

It is handled as part of v8 patch shared at [2]

> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.

Open

> 5) Handling of Tomas's comments.

I have fixed and updated the fix details as part of [3] 

> 6) Handling of Greg's comments.

I have fixed and updated the fix details as part of [4] 

Except for "4) Worker has to hop through all the processed chunks before getting the chunk which it can process", all open tasks are handled. I will work on this and provide an update shortly. 

Re: Parallel copy

From
Bharath Rupireddy
Date:
Hi Vignesh,

I took a look at the v8 patch set. Here are some comments:

1. PopulateCommonCstateInfo() -- can we use PopulateCommonCStateInfo()
or PopulateCopyStateInfo()? And also EstimateCstateSize() --
EstimateCStateSize(), PopulateCstateCatalogInfo() --
PopulateCStateCatalogInfo()?

2. Instead of mentioning numbers like 1024, 64K, 10240 in the
comments, can we represent them in terms of macros?
/* It can hold 1024 blocks of 64K data in DSM to be processed by the worker. */
#define MAX_BLOCKS_COUNT 1024
/*
 * It can hold upto 10240 record information for worker to process. RINGSIZE

3. How about
"
Each worker at once will pick the WORKER_CHUNK_COUNT records from the
DSM data blocks and store them in it's local memory.
This is to make workers not contend much while getting record
information from the DSM. Read RINGSIZE comments before
 changing this value.
"
instead of
/*
 * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
 * block to process to avoid lock contention. Read RINGSIZE comments before
 * changing this value.
 */

4.  How about one line gap before and after for comments: "Leader
should operate in the following order:" and "Worker should operate in
the following order:"

5. Can we move RAW_BUF_BYTES macro definition to the beginning of the
copy.h where all the macro are defined?

6. I don't think we need the change in toast_internals.c with the
temporary hack Assert(!(IsParallelWorker() && !currentCommandIdUsed));
in GetCurrentCommandId()

7. I think
    /* Can't perform copy in parallel */
    if (parallel_workers <= 0)
        return NULL;
can be
    /* Can't perform copy in parallel */
    if (parallel_workers == 0)
        return NULL;
as parallel_workers can never be < 0 since we enter BeginParallelCopy
only if cstate->nworkers > 0 and also we are not allowed to have
negative values for max_worker_processes.

8. Do we want to pfree(cstate->pcdata) in case we failed to start any
parallel workers? We would have allocated a good amount of memory for it by then:
    else
        {
            /*
             * Reset nworkers to -1 here. This is useful in cases where user
             * specifies parallel workers, but, no worker is picked up, so go
             * back to non parallel mode value of nworkers.
             */
            cstate->nworkers = -1;
            *processed = CopyFrom(cstate);    /* copy from file to database */
        }

9. Instead of calling CopyStringToSharedMemory() for each string
variable, can't we just create a linked list of all the strings that
need to be copied into shm and call CopyStringToSharedMemory() only
once? We could avoid 5 function calls?

10. Similar to above comment: can we fill all the required
cstate->variables inside the function CopyNodeFromSharedMemory() and
call it only once? In each worker we could save overhead of 5 function
calls.

11. Looks like CopyStringFromSharedMemory() and
CopyNodeFromSharedMemory() do almost the same things except
stringToNode() and pfree(destptr);. Can we have a generic function
CopyFromSharedMemory() or something else and handle with flag "bool
isnode" to differentiate the two use cases?

12. Can we move below check to the end in IsParallelCopyAllowed()?
    /* Check parallel safety of the trigger functions. */
    if (cstate->rel->trigdesc != NULL &&
        !CheckRelTrigFunParallelSafety(cstate->rel->trigdesc))
        return false;

13. CacheLineInfo(): Instead of goto empty_data_line_update; how about
having this directly inside the if block as it's being used only once?

14. GetWorkerLine(): How about avoiding goto statements and replacing
the common code with an always-inline static function or a macro?

15. UpdateSharedLineInfo(): Below line is misaligned.
                lineInfo->first_block = blk_pos;
        lineInfo->start_offset = offset;

16. ParallelCopyFrom(): Do we need CHECK_FOR_INTERRUPTS(); at the
start of  for (;;)?

17. Remove extra lines after #define IsHeaderLine()
(cstate->header_line && cstate->cur_lineno == 1) in copy.h

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
>
> 9. Instead of calling CopyStringToSharedMemory() for each string
> variable, can't we just create a linked list of all the strings that
> need to be copied into shm and call CopyStringToSharedMemory() only
> once? We could avoid 5 function calls?
>

If we want to avoid the different function calls then can't we just store
all these strings in a local structure and use it? That might improve
the other parts of the code as well where we are using these as
individual parameters.
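
For instance, something along these lines (a hypothetical shape only, with invented names):

    /* Bundles the serialized strings so they can be passed around together. */
    typedef struct ParallelCopyStringList
    {
        char       *whereClauseStr;
        char       *rangeTableStr;
        char       *attnameListStr;
        char       *notnullListStr;
        char       *nullListStr;
        char       *convertListStr;
    } ParallelCopyStringList;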

> 10. Similar to above comment: can we fill all the required
> cstate->variables inside the function CopyNodeFromSharedMemory() and
> call it only once? In each worker we could save overhead of 5 function
> calls.
>

Yeah, that makes sense.

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
Bharath Rupireddy
Date:
On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> 17. Remove extra lines after #define IsHeaderLine()
> (cstate->header_line && cstate->cur_lineno == 1) in copy.h
>

 I missed one comment:

 18. I think we need to treat the number of parallel workers as an
integer similar to the parallel option in vacuum.

postgres=# copy t1 from stdin with(parallel '1');      <<<<< - we
should not allow this.
Enter data to be copied followed by a newline.

postgres=# vacuum (parallel '1') t1;
ERROR:  parallel requires an integer value


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Heikki Linnakangas
Date:
I had a brief look at this patch. Important work! A couple of first 
impressions:

1. The split between patches 
0002-Framework-for-leader-worker-in-parallel-copy.patch and 
0003-Allow-copy-from-command-to-process-data-from-file.patch is quite 
artificial. All the stuff introduced in the first is unused until the 
second patch is applied. The first patch introduces a forward 
declaration for ParallelCopyData(), but the function only comes in the 
second patch. The comments in the first patch talk about 
LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only 
comes in the second patch. I think these have to be merged into one. If you 
want to split it somehow, I'd suggest having a separate patch just to 
move CopyStateData from copy.c to copy.h. The subsequent patch would 
then be easier to read as you could see more easily what's being added 
to CopyStateData. Actually I think it would be better to have a new 
header file, copy_internal.h, to hold CopyStateData and the other 
structs, and keep copy.h as it is.

2. This desperately needs some kind of a high-level overview of how it 
works. What is a leader, what is a worker? Which process does each step 
of COPY processing, like reading from the file/socket, splitting the 
input into lines, handling escapes, calling input functions, and 
updating the heap and indexes? What data structures are used for the 
communication? How is the work synchronized between the processes? 
There are comments on those individual aspects scattered in the patch, 
but if you're not already familiar with it, you don't know where to 
start. There's some of that in the commit message, but it needs to be 
somewhere in the source code, maybe in a long comment at the top of 
copyparallel.c.

3. I'm surprised there's a separate ParallelCopyLineBoundary struct for 
every input line. Doesn't that incur a lot of synchronization overhead? 
I haven't done any testing, this is just my gut feeling, but I assumed 
you'd work in batches of, say, 100 or 1000 lines each.

- Heikki



Re: Parallel copy

From
Ashutosh Sharma
Date:
Hi Vignesh,

Thanks for the updated patches. Here are some more comments that I can
find after reviewing your latest patches:

+/*
+ * This structure helps in storing the common data from CopyStateData that are
+ * required by the workers. This information will then be allocated and stored
+ * into the DSM for the worker to retrieve and copy it to CopyStateData.
+ */
+typedef struct SerializedParallelCopyState
+{
+   /* low-level state data */
+   CopyDest    copy_dest;      /* type of copy source/destination */
+   int         file_encoding;  /* file or remote side's character encoding */
+   bool        need_transcoding;   /* file encoding diff from server? */
+   bool        encoding_embeds_ascii;  /* ASCII can be non-first byte? */
+
...
...
+
+   /* Working state for COPY FROM */
+   AttrNumber  num_defaults;
+   Oid         relid;
+} SerializedParallelCopyState;

Can the above structure not be part of the CopyStateData structure? I
am just asking this question because all the fields present in the
above structure are also present in the CopyStateData structure. So,
including it in the CopyStateData structure will reduce the code
duplication and will also make CopyStateData a bit shorter.

--

+           pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
+                                    relid);

Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
function when we are already passing the cstate structure, from which
both of these can be retrieved?

--

+/* DSM keys for parallel copy.  */
+#define PARALLEL_COPY_KEY_SHARED_INFO              1
+#define PARALLEL_COPY_KEY_CSTATE                   2
+#define PARALLEL_COPY_WAL_USAGE                    3
+#define PARALLEL_COPY_BUFFER_USAGE                 4

DSM key names do not appear to be consistent. For shared info and
cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
but for WalUsage and BufferUsage structures, it is prefixed with
"PARALLEL_COPY". I think it would be better to make them consistent.

--

    if (resultRelInfo->ri_TrigDesc != NULL &&
        (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
         resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
    {
        /*
         * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
         * triggers on the table. Such triggers might query the table we're
         * inserting into and act differently if the tuples that have already
         * been processed and prepared for insertion are not there.
         */
        insertMethod = CIM_SINGLE;
    }
    else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
             resultRelInfo->ri_TrigDesc->trig_insert_new_table)
    {
        /*
         * For partitioned tables we can't support multi-inserts when there
         * are any statement level insert triggers. It might be possible to
         * allow partitioned tables with such triggers in the future, but for
         * now, CopyMultiInsertInfoFlush expects that any before row insert
         * and statement level insert triggers are on the same relation.
         */
        insertMethod = CIM_SINGLE;
    }
    else if (resultRelInfo->ri_FdwRoutine != NULL ||
             cstate->volatile_defexprs)
    {
...
...

I think, if possible, all these if-else checks in CopyFrom() can be
moved to a single function which can probably be named as
IdentifyCopyInsertMethod() and this function can be called in
IsParallelCopyAllowed(). This will ensure that in case of Parallel
Copy when the leader has performed all these checks, the worker won't
do it again. I also feel that it will make the code look a bit
cleaner.
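
To illustrate the shape of what I have in mind, a rough sketch (the
signature and the exact set of checks are only an approximation of the
current CopyFrom() logic, not a drop-in replacement):

static CopyInsertMethod
IdentifyCopyInsertMethod(CopyState cstate, ResultRelInfo *resultRelInfo,
                         PartitionTupleRouting *proute)
{
    /* BEFORE/INSTEAD OF row triggers rule out multi-inserts. */
    if (resultRelInfo->ri_TrigDesc != NULL &&
        (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
         resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
        return CIM_SINGLE;

    /* Partitioned table with statement-level insert triggers. */
    if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
        resultRelInfo->ri_TrigDesc->trig_insert_new_table)
        return CIM_SINGLE;

    /* Foreign tables and volatile default expressions. */
    if (resultRelInfo->ri_FdwRoutine != NULL || cstate->volatile_defexprs)
        return CIM_SINGLE;

    /* Partitioned tables can use multi-insert only conditionally. */
    if (proute != NULL)
        return CIM_MULTI_CONDITIONAL;

    return CIM_MULTI;
}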

--

+void
+ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
+{
...
...
+   InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+                         &walusage[ParallelWorkerNumber]);
+
+   MemoryContextSwitchTo(oldcontext);
+   pfree(cstate);
+   return;
+}

It seems like you also need to delete the memory context
(cstate->copycontext) here.

--

+void
+ExecBeforeStmtTrigger(CopyState cstate)
+{
+   EState     *estate = CreateExecutorState();
+   ResultRelInfo *resultRelInfo;

This function has a lot of comments which have been copied as-is from
the CopyFrom function. I think it would be good to remove those
comments from here and instead mention that the code in this function
has been taken from the CopyFrom function, so that anyone with
questions can refer to CopyFrom. This will again avoid unnecessary
code in the patch.

--

As Heikki rightly pointed out in his previous email, we need some high
level description of how Parallel Copy works somewhere in
copyparallel.c file. For reference, please see how a brief description
about parallel vacuum has been added in the vacuumlazy.c file.

 * Lazy vacuum supports parallel execution with parallel worker processes.  In
 * a parallel vacuum, we perform both index vacuum and index cleanup with
 * parallel worker processes.  Individual indexes are processed by one vacuum
...
...

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


On Wed, Oct 21, 2020 at 12:08 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, Oct 19, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
> > >
> > > Hi Vignesh,
> > >
> > > After having a look over the patch,
> > > I have some suggestions for
> > > 0003-Allow-copy-from-command-to-process-data-from-file.patch.
> > >
> > > 1.
> > >
> > > +static uint32
> > > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
> > > +                                  char **whereClauseStr, char **rangeTableStr,
> > > +                                  char **attnameListStr, char **notnullListStr,
> > > +                                  char **nullListStr, char **convertListStr)
> > > +{
> > > +       uint32          strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
> > > +
> > > +       strsize += EstimateStringSize(cstate->null_print);
> > > +       strsize += EstimateStringSize(cstate->delim);
> > > +       strsize += EstimateStringSize(cstate->quote);
> > > +       strsize += EstimateStringSize(cstate->escape);
> > >
> > >
> > > It use function EstimateStringSize to get the strlen of null_print, delim, quote and escape.
> > > But the length of null_print seems has been stored in null_print_len.
> > > And delim/quote/escape must be 1 byte, so I think call strlen again seems unnecessary.
> > >
> > > How about  " strsize += sizeof(uint32) + cstate->null_print_len + 1"
> > >
> >
> > +1. This seems like a good suggestion but add comments for
> > delim/quote/escape to indicate that we are considering one-byte for
> > each. I think this will obviate the need of function
> > EstimateStringSize. Another thing in this regard is that we normally
> > use add_size function to compute the size but I don't see that being
> > used in this and nearby computation. That helps us to detect overflow
> > of addition if any.
> >
> > EstimateCstateSize()
> > {
> > ..
> > +
> > + strsize++;
> > ..
> > }
> >
> > Why do we need this additional one-byte increment? Does it make sense
> > to add a small comment for the same?
> >
>
> Changed it to handle null_print, delim, quote & escape accordingly in
> the attached patch, the one byte increment is not required, I have
> removed it.
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Ashutosh Sharma
Date:
On Fri, Oct 23, 2020 at 5:42 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Vignesh,
>
> Thanks for the updated patches. Here are some more comments that I can
> find after reviewing your latest patches:
>
> +/*
> + * This structure helps in storing the common data from CopyStateData that are
> + * required by the workers. This information will then be allocated and stored
> + * into the DSM for the worker to retrieve and copy it to CopyStateData.
> + */
> +typedef struct SerializedParallelCopyState
> +{
> +   /* low-level state data */
> +   CopyDest    copy_dest;      /* type of copy source/destination */
> +   int         file_encoding;  /* file or remote side's character encoding */
> +   bool        need_transcoding;   /* file encoding diff from server? */
> +   bool        encoding_embeds_ascii;  /* ASCII can be non-first byte? */
> +
> ...
> ...
> +
> +   /* Working state for COPY FROM */
> +   AttrNumber  num_defaults;
> +   Oid         relid;
> +} SerializedParallelCopyState;
>
> Can the above structure not be part of the CopyStateData structure? I
> am just asking this question because all the fields present in the
> above structure are also present in the CopyStateData structure. So,
> including it in the CopyStateData structure will reduce the code
> duplication and will also make CopyStateData a bit shorter.
>
> --
>
> +           pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
> +                                    relid);
>
> Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
> function when we are already passing cstate structure, using which
> both of these information can be retrieved ?
>
> --
>
> +/* DSM keys for parallel copy.  */
> +#define PARALLEL_COPY_KEY_SHARED_INFO              1
> +#define PARALLEL_COPY_KEY_CSTATE                   2
> +#define PARALLEL_COPY_WAL_USAGE                    3
> +#define PARALLEL_COPY_BUFFER_USAGE                 4
>
> DSM key names do not appear to be consistent. For shared info and
> cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
> but for WalUsage and BufferUsage structures, it is prefixed with
> "PARALLEL_COPY". I think it would be better to make them consistent.
>
> --
>
>     if (resultRelInfo->ri_TrigDesc != NULL &&
>         (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
>          resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
>     {
>         /*
>          * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
>          * triggers on the table. Such triggers might query the table we're
>          * inserting into and act differently if the tuples that have already
>          * been processed and prepared for insertion are not there.
>          */
>         insertMethod = CIM_SINGLE;
>     }
>     else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
>              resultRelInfo->ri_TrigDesc->trig_insert_new_table)
>     {
>         /*
>          * For partitioned tables we can't support multi-inserts when there
>          * are any statement level insert triggers. It might be possible to
>          * allow partitioned tables with such triggers in the future, but for
>          * now, CopyMultiInsertInfoFlush expects that any before row insert
>          * and statement level insert triggers are on the same relation.
>          */
>         insertMethod = CIM_SINGLE;
>     }
>     else if (resultRelInfo->ri_FdwRoutine != NULL ||
>              cstate->volatile_defexprs)
>     {
> ...
> ...
>
> I think, if possible, all these if-else checks in CopyFrom() can be
> moved to a single function which can probably be named as
> IdentifyCopyInsertMethod() and this function can be called in
> IsParallelCopyAllowed(). This will ensure that in case of Parallel
> Copy when the leader has performed all these checks, the worker won't
> do it again. I also feel that it will make the code look a bit
> cleaner.
>

Just rewriting above comment to make it a bit more clear:

I think, if possible, all these if-else checks in CopyFrom() should be
moved to a separate function which can probably be named as
IdentifyCopyInsertMethod() and this function called from
IsParallelCopyAllowed() and CopyFrom() functions. It will only be
called from CopyFrom() when IsParallelCopy() returns false. This will
ensure that in case of Parallel Copy if the leader has performed all
these checks, the worker won't do it again. I also feel that having a
separate function containing all these checks will make the code look
a bit cleaner.

> --
>
> +void
> +ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
> +{
> ...
> ...
> +   InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> +                         &walusage[ParallelWorkerNumber]);
> +
> +   MemoryContextSwitchTo(oldcontext);
> +   pfree(cstate);
> +   return;
> +}
>
> It seems like you also need to delete the memory context
> (cstate->copycontext) here.
>
> --
>
> +void
> +ExecBeforeStmtTrigger(CopyState cstate)
> +{
> +   EState     *estate = CreateExecutorState();
> +   ResultRelInfo *resultRelInfo;
>
> This function has a lot of comments which have been copied as it is
> from the CopyFrom function, I think it would be good to remove those
> comments from here and mention that this code changes done in this
> function has been taken from the CopyFrom function. If any queries
> people may refer to the CopyFrom function. This will again avoid the
> unnecessary code in the patch.
>
> --
>
> As Heikki rightly pointed out in his previous email, we need some high
> level description of how Parallel Copy works somewhere in
> copyparallel.c file. For reference, please see how a brief description
> about parallel vacuum has been added in the vacuumlazy.c file.
>
>  * Lazy vacuum supports parallel execution with parallel worker processes.  In
>  * a parallel vacuum, we perform both index vacuum and index cleanup with
>  * parallel worker processes.  Individual indexes are processed by one vacuum
> ...
> ...
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com
>
>
> On Wed, Oct 21, 2020 at 12:08 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, Oct 19, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
> > > >
> > > > Hi Vignesh,
> > > >
> > > > After having a look over the patch,
> > > > I have some suggestions for
> > > > 0003-Allow-copy-from-command-to-process-data-from-file.patch.
> > > >
> > > > 1.
> > > >
> > > > +static uint32
> > > > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
> > > > +                                  char **whereClauseStr, char **rangeTableStr,
> > > > +                                  char **attnameListStr, char **notnullListStr,
> > > > +                                  char **nullListStr, char **convertListStr)
> > > > +{
> > > > +       uint32          strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
> > > > +
> > > > +       strsize += EstimateStringSize(cstate->null_print);
> > > > +       strsize += EstimateStringSize(cstate->delim);
> > > > +       strsize += EstimateStringSize(cstate->quote);
> > > > +       strsize += EstimateStringSize(cstate->escape);
> > > >
> > > >
> > > > It use function EstimateStringSize to get the strlen of null_print, delim, quote and escape.
> > > > But the length of null_print seems has been stored in null_print_len.
> > > > And delim/quote/escape must be 1 byte, so I think call strlen again seems unnecessary.
> > > >
> > > > How about  " strsize += sizeof(uint32) + cstate->null_print_len + 1"
> > > >
> > >
> > > +1. This seems like a good suggestion but add comments for
> > > delim/quote/escape to indicate that we are considering one-byte for
> > > each. I think this will obviate the need of function
> > > EstimateStringSize. Another thing in this regard is that we normally
> > > use add_size function to compute the size but I don't see that being
> > > used in this and nearby computation. That helps us to detect overflow
> > > of addition if any.
> > >
> > > EstimateCstateSize()
> > > {
> > > ..
> > > +
> > > + strsize++;
> > > ..
> > > }
> > >
> > > Why do we need this additional one-byte increment? Does it make sense
> > > to add a small comment for the same?
> > >
> >
> > Changed it to handle null_print, delim, quote & escape accordingly in
> > the attached patch, the one byte increment is not required, I have
> > removed it.
> >
> > Regards,
> > Vignesh
> > EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
Thanks for the comments, please find my thoughts below.

On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi Vignesh,
>
> I took a look at the v8 patch set. Here are some comments:
>
> 1. PopulateCommonCstateInfo() -- can we use PopulateCommonCStateInfo()
> or PopulateCopyStateInfo()? And also EstimateCstateSize() --
> EstimateCStateSize(), PopulateCstateCatalogInfo() --
> PopulateCStateCatalogInfo()?
>

Changed as suggested.

> 2. Instead of mentioning numbers like 1024, 64K, 10240 in the
> comments, can we represent them in terms of macros?
> /* It can hold 1024 blocks of 64K data in DSM to be processed by the worker. */
> #define MAX_BLOCKS_COUNT 1024
> /*
>  * It can hold upto 10240 record information for worker to process. RINGSIZE
>

Changed as suggested.

> 3. How about
> "
> Each worker at once will pick the WORKER_CHUNK_COUNT records from the
> DSM data blocks and store them in it's local memory.
> This is to make workers not contend much while getting record
> information from the DSM. Read RINGSIZE comments before
>  changing this value.
> "
> instead of
> /*
>  * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
>  * block to process to avoid lock contention. Read RINGSIZE comments before
>  * changing this value.
>  */
>

Rephrased it.

> 4.  How about one line gap before and after for comments: "Leader
> should operate in the following order:" and "Worker should operate in
> the following order:"
>

Changed it.

> 5. Can we move RAW_BUF_BYTES macro definition to the beginning of the
> copy.h where all the macro are defined?
>

That change was done as part of another commit and we are using it
as-is. I prefer to keep it as it is.

> 6. I don't think we need the change in toast_internals.c with the
> temporary hack Assert(!(IsParallelWorker() && !currentCommandIdUsed));
> in GetCurrentCommandId()
>

Modified it.

> 7. I think
>     /* Can't perform copy in parallel */
>     if (parallel_workers <= 0)
>         return NULL;
> can be
>     /* Can't perform copy in parallel */
>     if (parallel_workers == 0)
>         return NULL;
> as parallel_workers can never be < 0 since we enter BeginParallelCopy
> only if cstate->nworkers > 0 and also we are not allowed to have
> negative values for max_worker_processes.
>

Modified it.

> 8. Do we want to pfree(cstate->pcdata) in case we failed to start any
> parallel workers, we would have allocated a good
>     else
>         {
>             /*
>              * Reset nworkers to -1 here. This is useful in cases where user
>              * specifies parallel workers, but, no worker is picked up, so go
>              * back to non parallel mode value of nworkers.
>              */
>             cstate->nworkers = -1;
>             *processed = CopyFrom(cstate);    /* copy from file to database */
>         }
>

Added pfree.

> 9. Instead of calling CopyStringToSharedMemory() for each string
> variable, can't we just create a linked list of all the strings that
> need to be copied into shm and call CopyStringToSharedMemory() only
> once? We could avoid 5 function calls?
>

I feel keeping it this way makes the code more readable, and this is
not in a performance-intensive tight loop. I'm retaining the change as
is unless we feel this will make an impact.

> 10. Similar to above comment: can we fill all the required
> cstate->variables inside the function CopyNodeFromSharedMemory() and
> call it only once? In each worker we could save overhead of 5 function
> calls.
>

same as above.

> 11. Looks like CopyStringFromSharedMemory() and
> CopyNodeFromSharedMemory() do almost the same things except
> stringToNode() and pfree(destptr);. Can we have a generic function
> CopyFromSharedMemory() or something else and handle with flag "bool
> isnode" to differentiate the two use cases?
>

Removed CopyStringFromSharedMemory & used CopyNodeFromSharedMemory
appropriately. CopyNodeFromSharedMemory is renamed to
RestoreNodeFromSharedMemory to keep the name consistent.

> 12. Can we move below check to the end in IsParallelCopyAllowed()?
>     /* Check parallel safety of the trigger functions. */
>     if (cstate->rel->trigdesc != NULL &&
>         !CheckRelTrigFunParallelSafety(cstate->rel->trigdesc))
>         return false;
>

Modified.

> 13. CacheLineInfo(): Instead of goto empty_data_line_update; how about
> having this directly inside the if block as it's being used only once?
>

Have removed the goto by using a macro.

> 14. GetWorkerLine(): How about avoiding the goto statements and replacing
> the common code with an always-inline static function or a macro?
>

Have removed the goto by using a macro.

> 15. UpdateSharedLineInfo(): Below line is misaligned.
>                 lineInfo->first_block = blk_pos;
>         lineInfo->start_offset = offset;
>

Changed it.

> 16. ParallelCopyFrom(): Do we need CHECK_FOR_INTERRUPTS(); at the
> start of  for (;;)?
>

Added it.

> 17. Remove extra lines after #define IsHeaderLine()
> (cstate->header_line && cstate->cur_lineno == 1) in copy.h
>

Modified it.

Attached v9 patches have the fixes for the above comments.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
vignesh C
Date:
On Wed, Oct 21, 2020 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> >
> > 9. Instead of calling CopyStringToSharedMemory() for each string
> > variable, can't we just create a linked list of all the strings that
> > need to be copied into shm and call CopyStringToSharedMemory() only
> > once? We could avoid 5 function calls?
> >
>
> If we want to avoid different function calls then can't we just store
> all these strings in a local structure and use it? That might improve
> the other parts of code as well where we are using these as individual
> parameters.
>

I have made one structure, SerializedListToStrCState, to store all
these variables. The rest of the common variables are copied directly
from/into cstate.
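
Roughly, it holds the serialized string forms of the list/expression
parameters that were earlier passed around individually, something
like this (illustrative sketch; see the v9 patch for the actual
definition):

typedef struct SerializedListToStrCState
{
    /* nodeToString() output of the corresponding cstate lists/expressions */
    char   *whereClauseStr;
    char   *rangeTableStr;
    char   *attnameListStr;
    char   *notnullListStr;
    char   *nullListStr;
    char   *convertListStr;
} SerializedListToStrCState;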

> > 10. Similar to above comment: can we fill all the required
> > cstate->variables inside the function CopyNodeFromSharedMemory() and
> > call it only once? In each worker we could save overhead of 5 function
> > calls.
> >
>
> Yeah, that makes sense.
>

I feel keeping it this way makes the code more readable, and this is
not in a performance-intensive tight loop. I'm retaining the change as
is unless we feel this will make an impact.

This is addressed in v9 patch shared at [1].
[1] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
On Wed, Oct 21, 2020 at 4:20 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > 17. Remove extra lines after #define IsHeaderLine()
> > (cstate->header_line && cstate->cur_lineno == 1) in copy.h
> >
>
>  I missed one comment:
>
>  18. I think we need to treat the number of parallel workers as an
> integer similar to the parallel option in vacuum.
>
> postgres=# copy t1 from stdin with(parallel '1');      <<<<< - we
> should not allow this.
> Enter data to be copied followed by a newline.
>
> postgres=# vacuum (parallel '1') t1;
> ERROR:  parallel requires an integer value
>

I have made the behavior the same as vacuum.
This is addressed in v9 patch shared at [1].
[1] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
Thanks Heikki for reviewing and providing your comments. Please find
my thoughts below.

On Fri, Oct 23, 2020 at 2:01 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I had a brief look at this patch. Important work! A couple of first
> impressions:
>
> 1. The split between patches
> 0002-Framework-for-leader-worker-in-parallel-copy.patch and
> 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite
> artificial. All the stuff introduced in the first is unused until the
> second patch is applied. The first patch introduces a forward
> declaration for ParallelCopyData(), but the function only comes in the
> second patch. The comments in the first patch talk about
> LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only
> comes in the second patch. I think these have to be merged into one. If you
> want to split it somehow, I'd suggest having a separate patch just to
> move CopyStateData from copy.c to copy.h. The subsequent patch would
> then be easier to read as you could see more easily what's being added
> to CopyStateData. Actually I think it would be better to have a new
> header file, copy_internal.h, to hold CopyStateData and the other
> structs, and keep copy.h as it is.
>

I have merged 0002 & 0003 patch, I have moved few things like creation
of copy_internal.h, moving of CopyStateData from copy.c into
copy_internal.h into 0001 patch.

> 2. This desperately needs some kind of a high-level overview of how it
> works. What is a leader, what is a worker? Which process does each step
> of COPY processing, like reading from the file/socket, splitting the
> input into lines, handling escapes, calling input functions, and
> updating the heap and indexes? What data structures are used for the
> communication? How is the work synchronized between the processes?
> There are comments on those individual aspects scattered in the patch,
> but if you're not already familiar with it, you don't know where to
> start. There's some of that in the commit message, but it needs to be
> somewhere in the source code, maybe in a long comment at the top of
> copyparallel.c.
>

Added it in copyparallel.c

> 3. I'm surprised there's a separate ParallelCopyLineBoundary struct for
> every input line. Doesn't that incur a lot of synchronization overhead?
> I haven't done any testing, this is just my gut feeling, but I assumed
> you'd work in batches of, say, 100 or 1000 lines each.
>

Data read from the file is stored in the DSM, which is of size 64K *
1024. The leader parses the input and identifies the line boundaries
(which data block a line starts in, the starting offset within that
block, and the line size); this information is kept in
ParallelCopyLineBoundary. Like you said, each worker processes
WORKER_CHUNK_COUNT (64) lines at a time. Performance test results run
for parallel copy are available at [1]. This is addressed in the v9
patch shared at [2].

[1] https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
Thanks Ashutosh for reviewing and providing your comments.

On Fri, Oct 23, 2020 at 5:43 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Vignesh,
>
> Thanks for the updated patches. Here are some more comments that I can
> find after reviewing your latest patches:
>
> +/*
> + * This structure helps in storing the common data from CopyStateData that are
> + * required by the workers. This information will then be allocated and stored
> + * into the DSM for the worker to retrieve and copy it to CopyStateData.
> + */
> +typedef struct SerializedParallelCopyState
> +{
> +   /* low-level state data */
> +   CopyDest    copy_dest;      /* type of copy source/destination */
> +   int         file_encoding;  /* file or remote side's character encoding */
> +   bool        need_transcoding;   /* file encoding diff from server? */
> +   bool        encoding_embeds_ascii;  /* ASCII can be non-first byte? */
> +
> ...
> ...
> +
> +   /* Working state for COPY FROM */
> +   AttrNumber  num_defaults;
> +   Oid         relid;
> +} SerializedParallelCopyState;
>
> Can the above structure not be part of the CopyStateData structure? I
> am just asking this question because all the fields present in the
> above structure are also present in the CopyStateData structure. So,
> including it in the CopyStateData structure will reduce the code
> duplication and will also make CopyStateData a bit shorter.
>

I have removed the common members from the structure, so there are now
no common members between CopyStateData & the new structure. I'm using
CopyStateData directly to copy to/from in the new patch.

> --
>
> +           pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
> +                                    relid);
>
> Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
> function when we are already passing cstate structure, using which
> both of these information can be retrieved ?
>

nworkers need not be passed, as you have suggested, but relid needs to
be passed because we will be setting it in pcdata. I have modified
nworkers as suggested.

> --
>
> +/* DSM keys for parallel copy.  */
> +#define PARALLEL_COPY_KEY_SHARED_INFO              1
> +#define PARALLEL_COPY_KEY_CSTATE                   2
> +#define PARALLEL_COPY_WAL_USAGE                    3
> +#define PARALLEL_COPY_BUFFER_USAGE                 4
>
> DSM key names do not appear to be consistent. For shared info and
> cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
> but for WalUsage and BufferUsage structures, it is prefixed with
> "PARALLEL_COPY". I think it would be better to make them consistent.
>

Modified as suggested

> --
>
>     if (resultRelInfo->ri_TrigDesc != NULL &&
>         (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
>          resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
>     {
>         /*
>          * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
>          * triggers on the table. Such triggers might query the table we're
>          * inserting into and act differently if the tuples that have already
>          * been processed and prepared for insertion are not there.
>          */
>         insertMethod = CIM_SINGLE;
>     }
>     else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
>              resultRelInfo->ri_TrigDesc->trig_insert_new_table)
>     {
>         /*
>          * For partitioned tables we can't support multi-inserts when there
>          * are any statement level insert triggers. It might be possible to
>          * allow partitioned tables with such triggers in the future, but for
>          * now, CopyMultiInsertInfoFlush expects that any before row insert
>          * and statement level insert triggers are on the same relation.
>          */
>         insertMethod = CIM_SINGLE;
>     }
>     else if (resultRelInfo->ri_FdwRoutine != NULL ||
>              cstate->volatile_defexprs)
>     {
> ...
> ...
>
> I think, if possible, all these if-else checks in CopyFrom() can be
> moved to a single function which can probably be named as
> IdentifyCopyInsertMethod() and this function can be called in
> IsParallelCopyAllowed(). This will ensure that in case of Parallel
> Copy when the leader has performed all these checks, the worker won't
> do it again. I also feel that it will make the code look a bit
> cleaner.
>

In the recently posted patch we have changed this to simplify the
check for parallel copy, so it is not an exact match. I feel this
comment is no longer applicable to the latest patch.

> --
>
> +void
> +ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
> +{
> ...
> ...
> +   InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> +                         &walusage[ParallelWorkerNumber]);
> +
> +   MemoryContextSwitchTo(oldcontext);
> +   pfree(cstate);
> +   return;
> +}
>
> It seems like you also need to delete the memory context
> (cstate->copycontext) here.
>

Added it.

> --
>
> +void
> +ExecBeforeStmtTrigger(CopyState cstate)
> +{
> +   EState     *estate = CreateExecutorState();
> +   ResultRelInfo *resultRelInfo;
>
> This function has a lot of comments which have been copied as it is
> from the CopyFrom function, I think it would be good to remove those
> comments from here and mention that this code changes done in this
> function has been taken from the CopyFrom function. If any queries
> people may refer to the CopyFrom function. This will again avoid the
> unnecessary code in the patch.
>

Changed as suggested.

> --
>
> As Heikki rightly pointed out in his previous email, we need some high
> level description of how Parallel Copy works somewhere in
> copyparallel.c file. For reference, please see how a brief description
> about parallel vacuum has been added in the vacuumlazy.c file.
>
>  * Lazy vacuum supports parallel execution with parallel worker processes.  In
>  * a parallel vacuum, we perform both index vacuum and index cleanup with
>  * parallel worker processes.  Individual indexes are processed by one vacuum
> ...

Added it in copyparallel.c

This is addressed in v9 patch shared at [1].
[1] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
On Fri, Oct 23, 2020 at 6:58 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> >
> > I think, if possible, all these if-else checks in CopyFrom() can be
> > moved to a single function which can probably be named as
> > IdentifyCopyInsertMethod() and this function can be called in
> > IsParallelCopyAllowed(). This will ensure that in case of Parallel
> > Copy when the leader has performed all these checks, the worker won't
> > do it again. I also feel that it will make the code look a bit
> > cleaner.
> >
>
> Just rewriting above comment to make it a bit more clear:
>
> I think, if possible, all these if-else checks in CopyFrom() should be
> moved to a separate function which can probably be named as
> IdentifyCopyInsertMethod() and this function called from
> IsParallelCopyAllowed() and CopyFrom() functions. It will only be
> called from CopyFrom() when IsParallelCopy() returns false. This will
> ensure that in case of Parallel Copy if the leader has performed all
> these checks, the worker won't do it again. I also feel that having a
> separate function containing all these checks will make the code look
> a bit cleaner.
>

In the recently posted patch we have changed this to simplify the
check for parallel copy, so it is not an exact match. I feel this
comment is no longer applicable to the latest patch.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



RE: Parallel copy

From
"Hou, Zhijie"
Date:
Hi 

I found some issues in v9-0002:

1.
+
+    elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
+         write_pos, lineInfo->first_block,
+         pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
+         offset, pg_atomic_read_u32(&lineInfo->line_size));
+

write_pos and the other variables printed here are of type uint32, so I think it's better to use '%u' in the elog message.
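
For example, the same call with just the format specifiers changed:

    elog(DEBUG1, "[Worker] Processing - line position:%u, block:%u, unprocessed lines:%u, offset:%u, line size:%u",
         write_pos, lineInfo->first_block,
         pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
         offset, pg_atomic_read_u32(&lineInfo->line_size));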

2.
+         * line_size will be set. Read the line_size again to be sure if it is
+         * completed or partial block.
+         */
+        dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+        if (dataSize)

It uses dataSize (type int) to hold a uint32 value, which seems a little dangerous.
Would it be better to define dataSize as uint32 here?

3.
Since the functions with 'Cstate' in the name have been changed to 'CState',
I think we can change PopulateCommonCstateInfo as well.

4.
+    if (pcdata->worker_line_buf_count)

I think checks like the above can be written as 'if (xxx > 0)', which seems easier to understand.


Best regards,
houzj



Re: Parallel copy

From
Amit Kapila
Date:
On Tue, Oct 27, 2020 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote:
>
[latest version]

I think the parallel-safety checks in this patch
(v9-0002-Allow-copy-from-command-to-process-data-from-file) are
incomplete and wrong. See below comments.
1.
+static pg_attribute_always_inline bool
+CheckExprParallelSafety(CopyState cstate)
+{
+ if (contain_volatile_functions(cstate->whereClause))
+ {
+ if (max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE)
+ return false;
+ }

I don't understand the above check. Why do we only need to check the
WHERE clause for parallel-safety when it contains volatile functions?
It should be checked otherwise as well, no? A similar comment applies
to the other checks in this function. Also, I don't think there is a
need to make this function inline.
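
To be concrete, I would expect something more like the below (just a
sketch to illustrate the point; whether casting the expression to
Query * is the right interface for max_parallel_hazard() is a separate
question):

static bool
CheckExprParallelSafety(CopyState cstate)
{
    /* Check the WHERE clause unconditionally, not only when volatile. */
    if (cstate->whereClause != NULL &&
        max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE)
        return false;

    /* ... and likewise for default expressions and the other checks ... */

    return true;
}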

2.
+/*
+ * IsParallelCopyAllowed
+ *
+ * Check if parallel copy can be allowed.
+ */
+bool
+IsParallelCopyAllowed(CopyState cstate)
{
..
+ * When there are BEFORE/AFTER/INSTEAD OF row triggers on the table. We do
+ * not allow parallelism in such cases because such triggers might query
+ * the table we are inserting into and act differently if the tuples that
+ * have already been processed and prepared for insertion are not there.
+ * Now, if we allow parallelism with such triggers the behaviour would
+ * depend on if the parallel worker has already inserted or not that
+ * particular tuples.
+ */
+ if (cstate->rel->trigdesc != NULL &&
+ (cstate->rel->trigdesc->trig_insert_after_statement ||
+ cstate->rel->trigdesc->trig_insert_new_table ||
+ cstate->rel->trigdesc->trig_insert_before_row ||
+ cstate->rel->trigdesc->trig_insert_after_row ||
+ cstate->rel->trigdesc->trig_insert_instead_row))
+ return false;
..

Why do we need to disable parallelism for before/after row triggers
unless they have parallel-unsafe functions? I see that a few lines down
in this function you are checking the parallel-safety of the trigger
functions; what is the use of that check if you are already disabling
parallelism with the above check?

3. What about if the index on table has expressions that are
parallel-unsafe? What is your strategy to check parallel-safety for
partitioned tables?

I suggest checking Greg's patch for parallel-safety of Inserts [1]. I
think you will find that most of those checks are required here as
well and see how we can use that patch (at least what is common). I
feel the first patch should be just to have parallel-safety checks and
we can test that by trying to enable Copy with force_parallel_mode. We
can build the rest of the patch atop of it or in other words, let's
move all parallel-safety work into a separate patch.

Few assorted comments:
========================
1.
+/*
+ * ESTIMATE_NODE_SIZE - Estimate the size required for  node type in shared
+ * memory.
+ */
+#define ESTIMATE_NODE_SIZE(list, listStr, strsize) \
+{ \
+ uint32 estsize = sizeof(uint32); \
+ if ((List *)list != NIL) \
+ { \
+ listStr = nodeToString(list); \
+ estsize += strlen(listStr) + 1; \
+ } \
+ \
+ strsize = add_size(strsize, estsize); \
+}

This can probably be a function instead of a macro.

2.
+/*
+ * ESTIMATE_1BYTE_STR_SIZE - Estimate the size required for  1Byte strings in
+ * shared memory.
+ */
+#define ESTIMATE_1BYTE_STR_SIZE(src, strsize) \
+{ \
+ strsize = add_size(strsize, sizeof(uint8)); \
+ strsize = add_size(strsize, (src) ? 1 : 0); \
+}

This could be an inline function.

3.
+/*
+ * SERIALIZE_1BYTE_STR - Copy 1Byte strings to shared memory.
+ */
+#define SERIALIZE_1BYTE_STR(dest, src, copiedsize) \
+{ \
+ uint8 len = (src) ? 1 : 0; \
+ memcpy(dest + copiedsize, (uint8 *) &len, sizeof(uint8)); \
+ copiedsize += sizeof(uint8); \
+ if (src) \
+ dest[copiedsize++] = src[0]; \
+}

Similarly, this could be a function. I think keeping such things as
macros in between the code makes it difficult to read. Please see if
you can make these and similar macros into functions unless they are
doing only a few memory operations. Having functions makes it easier
to debug the code as well.
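
For example, the two 1-byte-string helpers could become inline
functions along these lines (untested sketch; names and widths are
just indicative):

static inline Size
EstimateOneByteStrSize(const char *src, Size strsize)
{
    strsize = add_size(strsize, sizeof(uint8));
    strsize = add_size(strsize, src ? 1 : 0);
    return strsize;
}

static inline uint32
SerializeOneByteStr(char *dest, const char *src, uint32 copiedsize)
{
    uint8       len = src ? 1 : 0;

    memcpy(dest + copiedsize, &len, sizeof(uint8));
    copiedsize += sizeof(uint8);
    if (src)
        dest[copiedsize++] = src[0];
    return copiedsize;
}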

[1] - https://www.postgresql.org/message-id/CAJcOf-cgfjj0NfYPrNFGmQJxsnNW102LTXbzqxQJuziar1EKfQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
Heikki Linnakangas
Date:
On 27/10/2020 15:36, vignesh C wrote:
> Attached v9 patches have the fixes for the above comments.

I did some testing:

/tmp/longdata.pl:
--------
#!/usr/bin/perl
#
# Generate three rows:
# foo
# longdatalongdatalongdata...
# bar
#
# The length of the middle row is given as command line arg.
#

my $bytes = $ARGV[0];

print "foo\n";
for(my $i = 0; $i < $bytes; $i+=8){
     print "longdata";
}
print "\n";
print "bar\n";
--------

postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000' 
with (parallel 2);

This gets stuck forever (or at least I didn't have the patience to wait 
for it to finish). Both worker processes are consuming 100% of CPU.

- Heikki



Re: Parallel copy

From
"Daniel Westermann (DWE)"
Date:
On 27/10/2020 15:36, vignesh C wrote:
>> Attached v9 patches have the fixes for the above comments.

>I did some testing:

I did some testing as well and have a cosmetic remark:

postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000000000);
ERROR:  value 1000000000 out of bounds for option "parallel"
DETAIL:  Valid values are between "1" and "1024".
postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 100000000000);
ERROR:  parallel requires an integer value
postgres=#

Wouldn't it make more sense to only have one error message? The first one seems to be the better message.

Regards
Daniel


Re: Parallel copy

From
Amit Kapila
Date:
On Thu, Oct 29, 2020 at 11:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Oct 27, 2020 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> [latest version]
>
> I think the parallel-safety checks in this patch
> (v9-0002-Allow-copy-from-command-to-process-data-from-file) are
> incomplete and wrong.
>

One more point: some time back [1] I had given a suggestion related to
the way workers process the set of lines (aka chunk). I think you can
try increasing the chunk size to, say, 100, 500, or 1000, and use a
shared counter to remember the number of chunks processed.
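
Something as simple as an atomic counter in the shared structure
should do for that (field and variable names below are just
placeholders):

    /* in the shared info structure */
    pg_atomic_uint64    total_chunks_processed;

    /* in the worker, after it finishes processing a chunk */
    pg_atomic_fetch_add_u64(&pcshared_info->total_chunks_processed, 1);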

[1] - https://www.postgresql.org/message-id/CAA4eK1L-Xgw1zZEbGePmhBBWmEmLFL6rCaiOMDPnq2GNMVz-sg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
Heikki Linnakangas
Date:
On 27/10/2020 15:36, vignesh C wrote:
> Attached v9 patches have the fixes for the above comments.

I find this design to be very complicated. Why does the line-boundary 
information need to be in shared memory? I think this would be much 
simpler if each worker grabbed a fixed-size block of raw data, and 
processed that.

In your patch, the leader process scans the input to find out where one 
line ends and another begins, and because of that decision, the leader 
needs to make the line boundaries available in shared memory, for the 
worker processes. If we moved that responsibility to the worker 
processes, you wouldn't need to keep the line boundaries in shared 
memory. A worker would only need to pass enough state to the next worker 
to tell it where to start scanning the next block.

Whether the leader process finds the EOLs or the worker processes, it's 
pretty clear that it needs to be done ASAP, for a chunk at a time, 
because that cannot be done in parallel. I think some refactoring in 
CopyReadLine() and friends would be in order. It probably would be 
faster, or at least not slower, to find all the EOLs in a block in one 
tight loop, even when parallel copy is not used.

- Heikki



Re: Parallel copy

From
Heikki Linnakangas
Date:
On 30/10/2020 18:36, Heikki Linnakangas wrote:
> I find this design to be very complicated. Why does the line-boundary
> information need to be in shared memory? I think this would be much
> simpler if each worker grabbed a fixed-size block of raw data, and
> processed that.
> 
> In your patch, the leader process scans the input to find out where one
> line ends and another begins, and because of that decision, the leader
> needs to make the line boundaries available in shared memory, for the
> worker processes. If we moved that responsibility to the worker
> processes, you wouldn't need to keep the line boundaries in shared
> memory. A worker would only need to pass enough state to the next worker
> to tell it where to start scanning the next block.

Here's a high-level sketch of how I'm imagining this to work:

The shared memory structure consists of a queue of blocks, arranged as a 
ring buffer. Each block is of fixed size, and contains 64 kB of data, 
and a few fields for coordination:

typedef struct
{
     /* Current state of the block */
     pg_atomic_uint32 state;

     /* starting offset of first line within the block */
     int     startpos;

     char    data[64 kB];
} ParallelCopyDataBlock;

Where state is one of:

enum {
   FREE,       /* buffer is empty */
   FILLED,     /* leader has filled the buffer with raw data */
   READY,      /* start pos has been filled in, but no worker process 
has claimed the block yet */
   PROCESSING, /* worker has claimed the block, and is processing it */
}

State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the COPY 
progresses, the ring of blocks will always look something like this:

blk 0 startpos  0: PROCESSING [worker 1]
blk 1 startpos 12: PROCESSING [worker 2]
blk 2 startpos 10: READY
blk 3 starptos  -: FILLED
blk 4 startpos  -: FILLED
blk 5 starptos  -: FILLED
blk 6 startpos  -: FREE
blk 7 startpos  -: FREE

Typically, each worker process is busy processing a block. After the 
blocks being processed, there is one block in READY state, and after 
that, blocks in FILLED state.

Leader process:

The leader process is simple. It picks the next FREE buffer, fills it 
with raw data from the file, and marks it as FILLED. If no buffers are 
FREE, wait.

Worker process:

1. Claim next READY block from queue, by changing its state to
    PROCESSING. If the next block is not READY yet, wait until it is.

2. Start scanning the block from 'startpos', finding end-of-line
    markers. (in CSV mode, need to track when we're in-quotes).

3. When you reach the end of the block, if the last line continues to
    next block, wait for the next block to become FILLED. Peek into the
    next block, and copy the remaining part of the split line to a local
    buffer, and set the 'startpos' on the next block to point to the end
    of the split line. Mark the next block as READY.

4. Process all the lines in the block, call input functions, insert
    rows.

5. Mark the block as DONE.

In this design, you don't need to keep line boundaries in shared memory, 
because each worker process is responsible for finding the line 
boundaries of its own block.
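
FWIW, the claim step itself boils down to a single compare-and-swap on 
the state field, something like this (untested sketch; PCB_READY and 
PCB_PROCESSING stand for the READY and PROCESSING values above):

static bool
TryClaimBlock(ParallelCopyDataBlock *blk)
{
    uint32      expected = PCB_READY;

    /* Atomically flip READY -> PROCESSING; only one worker can win. */
    return pg_atomic_compare_exchange_u32(&blk->state, &expected,
                                          PCB_PROCESSING);
}

A worker that loses the race, or finds the block still FILLED or FREE, 
just sleeps briefly (with a CHECK_FOR_INTERRUPTS()) and retries.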

There's a point of serialization here, in that the next block cannot be 
processed, until the worker working on the previous block has finished 
scanning the EOLs, and set the starting position on the next block, 
putting it in READY state. That's not very different from your patch, 
where you had a similar point of serialization because the leader 
scanned the EOLs, but I think the coordination between processes is 
simpler here.

- Heikki



Re: Parallel copy

From
Heikki Linnakangas
Date:
On 30/10/2020 18:36, Heikki Linnakangas wrote:
> Whether the leader process finds the EOLs or the worker processes, it's
> pretty clear that it needs to be done ASAP, for a chunk at a time,
> because that cannot be done in parallel. I think some refactoring in
> CopyReadLine() and friends would be in order. It probably would be
> faster, or at least not slower, to find all the EOLs in a block in one
> tight loop, even when parallel copy is not used.

Something like the attached. It passes the regression tests, but it's 
quite incomplete. It's missing handling of "\." as end-of-file marker, 
and I haven't tested encoding conversions at all, for starters. Quick 
testing suggests that this is a little bit faster than the current code, 
but the difference is small; I had to use a "WHERE false" option to 
really see the difference.

The crucial thing here is that there's a new function, ParseLinesText(), 
to find all end-of-line characters in a buffer in one go. In this patch, 
it's used against 'raw_buf', but with parallel copy, you could point it 
at a block in shared memory instead.

- Heikki

Attachment

Re: Parallel copy

From
Tomas Vondra
Date:
Hi,

I've done a bit more testing today, and I think the parsing is busted in
some way. Consider this:

     test=# create extension random;
     CREATE EXTENSION
     
     test=# create table t (a text);
     CREATE TABLE
     
     test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,10000);
     INSERT 0 10000
     
     test=# copy t to '/mnt/data/t.csv';
     COPY 10000
     
     test=# truncate t;
     TRUNCATE TABLE
     
     test=# copy t from '/mnt/data/t.csv';
     COPY 10000
     
     test=# truncate t;
     TRUNCATE TABLE
     
     test=# copy t from '/mnt/data/t.csv' with (parallel 2);
     ERROR:  invalid byte sequence for encoding "UTF8": 0x00
     CONTEXT:  COPY t, line 485:
"m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB>F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
     parallel worker


The functions come from an extension I use to generate random data, I've
pushed it to github [1]. The random_string() generates a random string
with ASCII characters, symbols and a couple special characters (\r\n\t).
The intent was to try loading data where a field may span multiple 64kB
blocks and may contain newlines etc.

The non-parallel copy works fine, the parallel one fails. I haven't
investigated the details, but I guess it gets confused about where a
string starts/ends, or something like that.


[1] https://github.com/tvondra/random


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Parallel copy

From
Tomas Vondra
Date:
On Fri, Oct 30, 2020 at 06:41:41PM +0200, Heikki Linnakangas wrote:
>On 30/10/2020 18:36, Heikki Linnakangas wrote:
>>I find this design to be very complicated. Why does the line-boundary
>>information need to be in shared memory? I think this would be much
>>simpler if each worker grabbed a fixed-size block of raw data, and
>>processed that.
>>
>>In your patch, the leader process scans the input to find out where one
>>line ends and another begins, and because of that decision, the leader
>>needs to make the line boundaries available in shared memory, for the
>>worker processes. If we moved that responsibility to the worker
>>processes, you wouldn't need to keep the line boundaries in shared
>>memory. A worker would only need to pass enough state to the next worker
>>to tell it where to start scanning the next block.
>
>Here's a high-level sketch of how I'm imagining this to work:
>
>The shared memory structure consists of a queue of blocks, arranged as 
>a ring buffer. Each block is of fixed size, and contains 64 kB of 
>data, and a few fields for coordination:
>
>typedef struct
>{
>    /* Current state of the block */
>    pg_atomic_uint32 state;
>
>    /* starting offset of first line within the block */
>    int     startpos;
>
>    char    data[64 kB];
>} ParallelCopyDataBlock;
>
>Where state is one of:
>
>enum {
>  FREE,       /* buffer is empty */
>  FILLED,     /* leader has filled the buffer with raw data */
>  READY,      /* start pos has been filled in, but no worker process 
>has claimed the block yet */
>  PROCESSING, /* worker has claimed the block, and is processing it */
>}
>
>State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the 
>COPY progresses, the ring of blocks will always look something like 
>this:
>
>blk 0 startpos  0: PROCESSING [worker 1]
>blk 1 startpos 12: PROCESSING [worker 2]
>blk 2 startpos 10: READY
>blk 3 starptos  -: FILLED
>blk 4 startpos  -: FILLED
>blk 5 starptos  -: FILLED
>blk 6 startpos  -: FREE
>blk 7 startpos  -: FREE
>
>Typically, each worker process is busy processing a block. After the 
>blocks being processed, there is one block in READY state, and after 
>that, blocks in FILLED state.
>
>Leader process:
>
>The leader process is simple. It picks the next FREE buffer, fills it 
>with raw data from the file, and marks it as FILLED. If no buffers are 
>FREE, wait.
>
>Worker process:
>
>1. Claim next READY block from queue, by changing its state to
>   PROCESSING. If the next block is not READY yet, wait until it is.
>
>2. Start scanning the block from 'startpos', finding end-of-line
>   markers. (in CSV mode, need to track when we're in-quotes).
>
>3. When you reach the end of the block, if the last line continues to
>   next block, wait for the next block to become FILLED. Peek into the
>   next block, and copy the remaining part of the split line to a local
>   buffer, and set the 'startpos' on the next block to point to the end
>   of the split line. Mark the next block as READY.
>
>4. Process all the lines in the block, call input functions, insert
>   rows.
>
>5. Mark the block as DONE.
>
>In this design, you don't need to keep line boundaries in shared 
>memory, because each worker process is responsible for finding the 
>line boundaries of its own block.
>
>There's a point of serialization here, in that the next block cannot 
>be processed, until the worker working on the previous block has 
>finished scanning the EOLs, and set the starting position on the next 
>block, putting it in READY state. That's not very different from your 
>patch, where you had a similar point of serialization because the 
>leader scanned the EOLs, but I think the coordination between 
>processes is simpler here.
>

I agree this design looks simpler. I'm a bit worried about serializing
the parsing like this, though. It's true the current approach (where the
first phase of parsing happens in the leader) has a similar issue, but I
think it would be easier to improve that in that design.

My plan was to parallelize the parsing roughly like this:

1) split the input buffer into smaller chunks

2) let workers scan the buffers and record positions of interesting
characters (delimiters, quotes, ...) and pass it back to the leader

3) use the information to actually parse the input data (we only need to
look at the interesting characters, skipping large parts of data)

4) pass the parsed chunks to workers, just like in the current patch


But maybe something like that would be possible even with the approach
you propose - we could have a special parse phase for processing each
buffer, where any worker could look for the special characters, record
the positions in a bitmap next to the buffer. So the whole sequence of
states would look something like this:

     EMPTY
     FILLED
     PARSED
     READY
     PROCESSING

Of course, the question is whether parsing really is sufficiently
expensive for this to be worth it.
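
To make step 2 a bit more concrete, the per-chunk scan could be as
simple as this (sketch only; the set of "interesting" characters and
the bitmap layout are just placeholders):

static void
MarkInterestingPositions(const char *chunk, int len, uint8 *bitmap)
{
    /* one bit per input byte; caller provides a zeroed bitmap */
    for (int i = 0; i < len; i++)
    {
        char        c = chunk[i];

        if (c == '\n' || c == '\r' || c == '"' || c == '\\' || c == ',')
            bitmap[i / 8] |= (uint8) (1 << (i % 8));
    }
}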


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Parallel copy

From
Heikki Linnakangas
Date:
On 30/10/2020 22:56, Tomas Vondra wrote:
> I agree this design looks simpler. I'm a bit worried about serializing
> the parsing like this, though. It's true the current approach (where the
> first phase of parsing happens in the leader) has a similar issue, but I
> think it would be easier to improve that in that design.
> 
> My plan was to parallelize the parsing roughly like this:
> 
> 1) split the input buffer into smaller chunks
> 
> 2) let workers scan the buffers and record positions of interesting
> characters (delimiters, quotes, ...) and pass it back to the leader
> 
> 3) use the information to actually parse the input data (we only need to
> look at the interesting characters, skipping large parts of data)
> 
> 4) pass the parsed chunks to workers, just like in the current patch
> 
> 
> But maybe something like that would be possible even with the approach
> you propose - we could have a special parse phase for processing each
> buffer, where any worker could look for the special characters, record
> the positions in a bitmap next to the buffer. So the whole sequence of
> states would look something like this:
> 
>       EMPTY
>       FILLED
>       PARSED
>       READY
>       PROCESSING

I think it's even simpler than that. You don't need to communicate the 
"interesting positions" between processes, if the same worker takes care 
of the chunk through all states from FILLED to DONE.

You can build the bitmap of interesting positions immediately in FILLED 
state, independently of all previous blocks. Once you've built the 
bitmap, you need to wait for the information on where the first line 
starts, but presumably finding the interesting positions is the 
expensive part.

> Of course, the question is whether parsing really is sufficiently
> expensive for this to be worth it.

Yeah, I don't think it's worth it. Splitting the lines is pretty fast, I 
think we have many years to come before that becomes a bottleneck. But 
if it turns out I'm wrong and we need to implement that, the path is 
pretty straightforward.

- Heikki



Re: Parallel copy

From
Tomas Vondra
Date:
On Sat, Oct 31, 2020 at 12:09:32AM +0200, Heikki Linnakangas wrote:
>On 30/10/2020 22:56, Tomas Vondra wrote:
>>I agree this design looks simpler. I'm a bit worried about serializing
>>the parsing like this, though. It's true the current approach (where the
>>first phase of parsing happens in the leader) has a similar issue, but I
>>think it would be easier to improve that in that design.
>>
>>My plan was to parallelize the parsing roughly like this:
>>
>>1) split the input buffer into smaller chunks
>>
>>2) let workers scan the buffers and record positions of interesting
>>characters (delimiters, quotes, ...) and pass it back to the leader
>>
>>3) use the information to actually parse the input data (we only need to
>>look at the interesting characters, skipping large parts of data)
>>
>>4) pass the parsed chunks to workers, just like in the current patch
>>
>>
>>But maybe something like that would be possible even with the approach
>>you propose - we could have a special parse phase for processing each
>>buffer, where any worker could look for the special characters, record
>>the positions in a bitmap next to the buffer. So the whole sequence of
>>states would look something like this:
>>
>>      EMPTY
>>      FILLED
>>      PARSED
>>      READY
>>      PROCESSING
>
>I think it's even simpler than that. You don't need to communicate the 
>"interesting positions" between processes, if the same worker takes 
>care of the chunk through all states from FILLED to DONE.
>
>You can build the bitmap of interesting positions immediately in 
>FILLED state, independently of all previous blocks. Once you've built 
>the bitmap, you need to wait for the information on where the first 
>line starts, but presumably finding the interesting positions is the 
>expensive part.
>

I don't think it's that simple. For example, the previous block may
contain a very long value (say, 1MB), so a bunch of blocks have to be
processed by the same worker. That probably makes the state transitions
a bit more complicated, and it also means the bitmap would need to be passed to the
worker that actually processes the block. Or we might just ignore this,
on the grounds that it's not a very common situation.


>>Of course, the question is whether parsing really is sufficiently
>>expensive for this to be worth it.
>
>Yeah, I don't think it's worth it. Splitting the lines is pretty fast, 
>I think we have many years to come before that becomes a bottleneck. 
>But if it turns out I'm wrong and we need to implement that, the path 
>is pretty straightforward.
>

OK. I agree the parsing is relatively cheap, and I don't recall seeing
CSV parsing as a bottleneck in production.  I suspect that might simply
be because we're hitting other bottlenecks first, though.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Parallel copy

From
Amit Kapila
Date:
On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Leader process:
>
> The leader process is simple. It picks the next FREE buffer, fills it
> with raw data from the file, and marks it as FILLED. If no buffers are
> FREE, wait.
>
> Worker process:
>
> 1. Claim next READY block from queue, by changing its state to
>     PROCESSING. If the next block is not READY yet, wait until it is.
>
> 2. Start scanning the block from 'startpos', finding end-of-line
>     markers. (in CSV mode, need to track when we're in-quotes).
>
> 3. When you reach the end of the block, if the last line continues to
>     next block, wait for the next block to become FILLED. Peek into the
>     next block, and copy the remaining part of the split line to a local
>     buffer, and set the 'startpos' on the next block to point to the end
>     of the split line. Mark the next block as READY.
>
> 4. Process all the lines in the block, call input functions, insert
>     rows.
>
> 5. Mark the block as DONE.
>
> In this design, you don't need to keep line boundaries in shared memory,
> because each worker process is responsible for finding the line
> boundaries of its own block.
>
> There's a point of serialization here, in that the next block cannot be
> processed, until the worker working on the previous block has finished
> scanning the EOLs, and set the starting position on the next block,
> putting it in READY state. That's not very different from your patch,
> where you had a similar point of serialization because the leader
> scanned the EOLs,
>

But in the design (single producer, multiple consumers) used by the
patch, the worker doesn't need to wait till the complete block is
processed; it can start processing the lines already found. This will
also allow workers to start processing the data much earlier, as they
don't need to wait for all the offsets corresponding to a 64K block to
be ready. However, in the design where each worker is processing a 64K
block, it can lead to much longer waits. I think this will impact the
Copy STDIN case more, where in most cases (200-300 byte tuples) we
receive the data line by line from the client and the leader finds the
line-endings. If the leader doesn't find the line-endings, the workers
need to wait till the leader fills the entire 64K chunk; OTOH, with the
current approach the worker can start as soon as the leader is able to
populate some minimum number of line-endings.
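To illustrate the difference, the handoff in the single-producer design
boils down to something like this sketch (plain C11 atomics with invented
names; the real patch uses shared memory structures and latches):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LINES 8192          /* bounded for the sketch; the patch reuses entries */

static uint32_t line_offsets[MAX_LINES];    /* filled by the leader only */
static _Atomic uint32_t lines_published;    /* bumped by the leader */
static _Atomic uint32_t next_line_to_claim; /* fetch-added by the workers */

/* Leader: called each time a line ending has been found in the input. */
static void
publish_line(uint32_t offset)
{
    uint32_t    slot = atomic_load(&lines_published);

    if (slot >= MAX_LINES)
        return;                 /* sketch only; the real code recycles slots */
    line_offsets[slot] = offset;
    atomic_store(&lines_published, slot + 1);
}

/* Worker: claim the next published line as soon as it becomes available. */
static bool
claim_line(uint32_t *offset)
{
    uint32_t    mine = atomic_fetch_add(&next_line_to_claim, 1);

    if (mine >= MAX_LINES)
        return false;
    while (atomic_load(&lines_published) <= mine)
        ;                       /* the real code waits on a latch instead */
    *offset = line_offsets[mine];
    return true;
}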

The other point is that the leader backend won't be used completely as
it is only doing a very small part (primarily reading the file) of the
overall work.

We have discussed both these approaches (a) single producer multiple
consumer, and (b) all workers doing the processing as you are saying
in the beginning and concluded that (a) is better, see some of the
relevant emails [1][2][3].

[1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
[2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
[3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
Heikki Linnakangas
Date:
On 02/11/2020 08:14, Amit Kapila wrote:
> On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Leader process:
>>
>> The leader process is simple. It picks the next FREE buffer, fills it
>> with raw data from the file, and marks it as FILLED. If no buffers are
>> FREE, wait.
>>
>> Worker process:
>>
>> 1. Claim next READY block from queue, by changing its state to
>>      PROCESSING. If the next block is not READY yet, wait until it is.
>>
>> 2. Start scanning the block from 'startpos', finding end-of-line
>>      markers. (in CSV mode, need to track when we're in-quotes).
>>
>> 3. When you reach the end of the block, if the last line continues to
>>      next block, wait for the next block to become FILLED. Peek into the
>>      next block, and copy the remaining part of the split line to a local
>>      buffer, and set the 'startpos' on the next block to point to the end
>>      of the split line. Mark the next block as READY.
>>
>> 4. Process all the lines in the block, call input functions, insert
>>      rows.
>>
>> 5. Mark the block as DONE.
>>
>> In this design, you don't need to keep line boundaries in shared memory,
>> because each worker process is responsible for finding the line
>> boundaries of its own block.
>>
>> There's a point of serialization here, in that the next block cannot be
>> processed, until the worker working on the previous block has finished
>> scanning the EOLs, and set the starting position on the next block,
>> putting it in READY state. That's not very different from your patch,
>> where you had a similar point of serialization because the leader
>> scanned the EOLs,
> 
> But in the design (single producer multiple consumer) used by the
> patch the worker doesn't need to wait till the complete block is
> processed, it can start processing the lines already found. This will
> also allow workers to start much earlier to process the data as it
> doesn't need to wait for all the offsets corresponding to 64K block
> ready. However, in the design where each worker is processing the 64K
> block, it can lead to much longer waits. I think this will impact the
> Copy STDIN case more where in most cases (200-300 bytes tuples) we
> receive line-by-line from client and find the line-endings by leader.
> If the leader doesn't find the line-endings the workers need to wait
> till the leader fill the entire 64K chunk, OTOH, with current approach
> the worker can start as soon as leader is able to populate some
> minimum number of line-endings

You can use a smaller block size. However, the point of parallel copy is 
to maximize bandwidth. If the workers ever have to sit idle, it means 
that the bottleneck is in receiving data from the client, i.e. the 
backend is fast enough, and you can't make the overall COPY finish any 
faster no matter how you do it.

> The other point is that the leader backend won't be used completely as
> it is only doing a very small part (primarily reading the file) of the
> overall work.

An idle process doesn't cost anything. If you have free CPU resources, 
use more workers.

> We have discussed both these approaches (a) single producer multiple
> consumer, and (b) all workers doing the processing as you are saying
> in the beginning and concluded that (a) is better, see some of the
> relevant emails [1][2][3].
> 
> [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de

Sorry I'm late to the party. I don't think the design I proposed was 
discussed in those threads. The alternative that's discussed in that 
thread seems to be something much more fine-grained, where processes 
claim individual lines. I'm not sure though, I didn't fully understand 
the alternative designs.

I want to throw out one more idea. It's an interim step, not the final 
solution we want, but a useful step in getting there:

Have the leader process scan the input for line-endings. Split the input 
data into blocks of slightly under 64 kB in size, so that a line never 
crosses a block. Put the blocks in shared memory.

A worker process claims a block from shared memory, processes it from 
beginning to end. It *also* has to parse the input to split it into lines.

In this design, the line-splitting is done twice. That's clearly not 
optimal, and we want to avoid that in the final patch, but I think it 
would be a useful milestone. After that patch is done, write another 
patch to either a) implement the design I sketched, where blocks are 
fixed-size and a worker notifies the next worker on where the first line 
in next block begins, or b) have the leader process report the 
line-ending positions in shared memory, so that workers don't need to 
scan them again.

Even if we apply the patches together, I think splitting them like that 
would make for easier review.
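For illustration, the leader's side of this interim scheme amounts to
something like the sketch below (invented names; a single line longer than
the block limit would need special handling, as discussed elsewhere in the
thread):

#include <stddef.h>
#include <string.h>

#define BLOCK_LIMIT (64 * 1024)

/*
 * Return how many bytes of 'input' (whole lines only) fit into one block of
 * slightly under 64 kB; the caller copies input[0..result) into shared
 * memory.  A single line longer than BLOCK_LIMIT makes this return 0.
 */
static size_t
next_block_len(const char *input, size_t len)
{
    size_t      used = 0;

    for (;;)
    {
        const char *nl = memchr(input + used, '\n', len - used);
        size_t      linelen;

        if (nl == NULL)
            linelen = len - used;       /* trailing, unterminated line */
        else
            linelen = (size_t) (nl - (input + used)) + 1;

        if (linelen == 0 || used + linelen > BLOCK_LIMIT)
            break;
        used += linelen;
        if (nl == NULL)
            break;
    }
    return used;
}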

- Heikki



Re: Parallel copy

From
Heikki Linnakangas
Date:
On 02/11/2020 09:10, Heikki Linnakangas wrote:
> On 02/11/2020 08:14, Amit Kapila wrote:
>> We have discussed both these approaches (a) single producer multiple
>> consumer, and (b) all workers doing the processing as you are saying
>> in the beginning and concluded that (a) is better, see some of the
>> relevant emails [1][2][3].
>>
>> [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
>> [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
>> [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
> 
> Sorry I'm late to the party. I don't think the design I proposed was
> discussed in that threads. The alternative that's discussed in that
> thread seems to be something much more fine-grained, where processes
> claim individual lines. I'm not sure though, I didn't fully understand
> the alternative designs.

I read the thread more carefully, and I think Robert had basically the 
right idea here 
(https://www.postgresql.org/message-id/CA%2BTgmoZMU4az9MmdJtg04pjRa0wmWQtmoMxttdxNrupYJNcR3w%40mail.gmail.com):

> I really think we don't want a single worker in charge of finding
> tuple boundaries for everybody. That adds a lot of unnecessary
> inter-process communication and synchronization. Each process should
> just get the next tuple starting after where the last one ended, and
> then advance the end pointer so that the next process can do the same
> thing. [...]

And here 
(https://www.postgresql.org/message-id/CA%2BTgmoZw%2BF3y%2BoaxEsHEZBxdL1x1KAJ7pRMNgCqX0WjmjGNLrA%40mail.gmail.com):

> On Thu, Apr 9, 2020 at 2:55 PM Andres Freund
> <andres(at)anarazel(dot)de> wrote:
>> I'm fairly certain that we do *not* want to distribute input data
>> between processes on a single tuple basis. Probably not even below
>> a few hundred kb. If there's any sort of natural clustering in the
>> loaded data - extremely common, think timestamps - splitting on a
>> granular basis will make indexing much more expensive. And have a
>> lot more contention.
> 
> That's a fair point. I think the solution ought to be that once any
> process starts finding line endings, it continues until it's grabbed
> at least a certain amount of data for itself. Then it stops and lets
> some other process grab a chunk of data.

Yes! That's pretty close to the design I sketched. I imagined that the 
leader would divide the input into 64 kB blocks, and each block would 
have a few metadata fields, notably the starting position of the first 
line in the block. I think Robert envisioned having a single "next 
starting position" field in shared memory. That works too, and is even 
simpler, so +1 for that.

For some reason, the discussion took a different turn from there, to 
discuss how the line-endings (called "chunks" in the discussion) should 
be represented in shared memory. But none of that is necessary with 
Robert's design.
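A hedged sketch of that single shared "next starting position", using plain
C11 atomics and invented names (the real code would use PostgreSQL's shared
memory and atomics APIs):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MIN_CHUNK (64 * 1024)

static _Atomic size_t next_start;       /* shared: start of unclaimed input */

/* Claim input[*start..*end) for this worker; false once input is exhausted. */
static bool
claim_chunk(const char *input, size_t len, size_t *start, size_t *end)
{
    size_t      begin = atomic_load(&next_start);

    for (;;)
    {
        size_t      stop;

        if (begin >= len)
            return false;

        /* grab at least MIN_CHUNK bytes, then extend to the next newline */
        stop = begin + MIN_CHUNK;
        if (stop >= len)
            stop = len;
        else
        {
            const char *nl = memchr(input + stop, '\n', len - stop);

            stop = nl ? (size_t) (nl - input) + 1 : len;
        }

        /* publish the new position; retry if another worker advanced it first */
        if (atomic_compare_exchange_weak(&next_start, &begin, stop))
        {
            *start = begin;
            *end = stop;
            return true;
        }
    }
}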

- Heikki



Re: Parallel copy

From
Amit Kapila
Date:
On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/11/2020 08:14, Amit Kapila wrote:
> > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> In this design, you don't need to keep line boundaries in shared memory,
> >> because each worker process is responsible for finding the line
> >> boundaries of its own block.
> >>
> >> There's a point of serialization here, in that the next block cannot be
> >> processed, until the worker working on the previous block has finished
> >> scanning the EOLs, and set the starting position on the next block,
> >> putting it in READY state. That's not very different from your patch,
> >> where you had a similar point of serialization because the leader
> >> scanned the EOLs,
> >
> > But in the design (single producer multiple consumer) used by the
> > patch the worker doesn't need to wait till the complete block is
> > processed, it can start processing the lines already found. This will
> > also allow workers to start much earlier to process the data as it
> > doesn't need to wait for all the offsets corresponding to 64K block
> > ready. However, in the design where each worker is processing the 64K
> > block, it can lead to much longer waits. I think this will impact the
> > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > receive line-by-line from client and find the line-endings by leader.
> > If the leader doesn't find the line-endings the workers need to wait
> > till the leader fill the entire 64K chunk, OTOH, with current approach
> > the worker can start as soon as leader is able to populate some
> > minimum number of line-endings
>
> You can use a smaller block size.
>

Sure, but the same problem can happen if the last line in that block
is too long and we need to peek into the next block. And then there
could be cases where a single line could be greater than 64K.

> However, the point of parallel copy is
> to maximize bandwidth.
>

Okay, but this first phase (finding the line boundaries) cannot be done
in parallel anyway, and we have seen in some of the initial
benchmarking that this initial phase is a small part of the work,
especially when the table has indexes, constraints, etc. So, I think
it won't matter much whether this splitting is done in a single process
or in multiple processes.

> If the workers ever have to sit idle, it means
> that the bottleneck is in receiving data from the client, i.e. the
> backend is fast enough, and you can't make the overall COPY finish any
> faster no matter how you do it.
>
> > The other point is that the leader backend won't be used completely as
> > it is only doing a very small part (primarily reading the file) of the
> > overall work.
>
> An idle process doesn't cost anything. If you have free CPU resources,
> use more workers.
>
> > We have discussed both these approaches (a) single producer multiple
> > consumer, and (b) all workers doing the processing as you are saying
> > in the beginning and concluded that (a) is better, see some of the
> > relevant emails [1][2][3].
> >
> > [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> > [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> > [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
>
> Sorry I'm late to the party. I don't think the design I proposed was
> discussed in that threads.
>

I think something close to that was discussed, as you have noticed in
your next email, but IIRC, because many people (Andres, Ants, myself
and the author) favoured the current approach (single reader and
multiple consumers), we decided to go with that. I feel this patch is
very much in the POC stage, due to which the code doesn't look good,
and as we move forward we need to see what the better way to improve
it is; maybe one of the ways is to split it as you are suggesting so
that it can be easier to review. I think the other important thing
which this patch has not addressed properly is the parallel-safety
checks, as pointed out by me earlier. There are two things to solve
there: (a) the lower-level code (like heap_* APIs,
CommandCounterIncrement, xact.c APIs, etc.) has checks which don't
allow any writes; we need to see which of those we can relax now (or
do some additional work to get past those checks) after some of the
work done for parallel writes in PG-13 [1][2], and (b) in which cases
parallel writes (parallel copy) can be allowed, for example, we need
to identify whether the table or one of its partitions has any
constraint/expression which is parallel-unsafe.

[1] 85f6b49 Allow relation extension lock to conflict among parallel
group members
[2] 3ba59cc Allow page lock to conflict among parallel group members

>
> I want to throw out one more idea. It's an interim step, not the final
> solution we want, but a useful step in getting there:
>
> Have the leader process scan the input for line-endings. Split the input
> data into blocks of slightly under 64 kB in size, so that a line never
> crosses a block. Put the blocks in shared memory.
>
> A worker process claims a block from shared memory, processes it from
> beginning to end. It *also* has to parse the input to split it into lines.
>
> In this design, the line-splitting is done twice. That's clearly not
> optimal, and we want to avoid that in the final patch, but I think it
> would be a useful milestone. After that patch is done, write another
> patch to either a) implement the design I sketched, where blocks are
> fixed-size and a worker notifies the next worker on where the first line
> in next block begins, or b) have the leader process report the
> line-ending positions in shared memory, so that workers don't need to
> scan them again.
>
> Even if we apply the patches together, I think splitting them like that
> would make for easier review.
>

I think this is worth exploring especially if it makes the patch
easier to review.

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
Heikki Linnakangas
Date:
On 03/11/2020 10:59, Amit Kapila wrote:
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> However, the point of parallel copy is to maximize bandwidth.
> 
> Okay, but this first-phase (finding the line boundaries) can anyway
> be not done in parallel and we have seen in some of the initial 
> benchmarking that this initial phase is a small part of work 
> especially when the table has indexes, constraints, etc. So, I think 
> it won't matter much if this splitting is done in a single process
> or multiple processes.

Right, it won't matter performance-wise. That's not my point. The 
difference is in the complexity. If you don't store the line boundaries 
in shared memory, you get away with much simpler shared memory structures.

> I think something close to that is discussed as you have noticed in
> your next email but IIRC, because many people (Andres, Ants, myself
> and author) favoured the current approach (single reader and multiple
> consumers) we decided to go with that. I feel this patch is very much
> in the POC stage due to which the code doesn't look good and as we
> move forward we need to see what is the better way to improve it,
> maybe one of the ways is to split it as you are suggesting so that it
> can be easier to review.

Sure. I think the roadmap here is:

1. Split copy.c [1]. Not strictly necessary, but I think it'd make this 
nice to review and work with.

2. Refactor CopyReadLine(), so that finding the line-endings and the 
rest of the line-parsing are separated into separate functions.

3. Implement parallel copy.

> I think the other important thing which this
> patch has not addressed properly is the parallel-safety checks as
> pointed by me earlier. There are two things to solve there (a) the
> lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
> APIs, etc.) have checks which doesn't allow any writes, we need to see
> which of those we can open now (or do some additional work to prevent
> from those checks) after some of the work done for parallel-writes in
> PG-13[1][2], and (b) in which all cases we can parallel-writes
> (parallel copy) is allowed, for example need to identify whether table
> or one of its partitions has any constraint/expression which is
> parallel-unsafe.

Agreed, that needs to be solved. I haven't given it any thought myself.

- Heikki

[1] 
https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi



RE: Parallel copy

From
"Hou, Zhijie"
Date:
Hi

> 
> my $bytes = $ARGV[0];
> for(my $i = 0; $i < $bytes; $i+=8){
>      print "longdata";
> }
> print "\n";
> --------
> 
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> with (parallel 2);
> 
> This gets stuck forever (or at least I didn't have the patience to wait
> it finish). Both worker processes are consuming 100% of CPU.

I had a look over this problem.

The ParallelCopyDataBlock has a size limit:
    uint8        skip_bytes;
    char        data[DATA_BLOCK_SIZE];    /* data read from file */

It seems the input line is so long that the leader process runs out of
the shared memory reserved for the parallel copy workers, and the
leader process keeps waiting for a free block.

The worker process, in turn, waits until line_state becomes
LINE_LEADER_POPULATED, but the leader process won't set line_state
until it has read the whole line.

So it is stuck forever. Maybe we should reconsider this situation.

The stack is as follows:

Leader stack:
#3  0x000000000075f7a1 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
#4  0x00000000005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
#5  0x00000000005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536, copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
#6  0x00000000005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
#7  0x00000000005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863

Worker stack:
#0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
#1  0x00000000005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
#2  0x00000000005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
#3  0x00000000005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
#4  0x00000000005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
#5  0x00000000005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985


Best regards,
houzj




Re: Parallel copy

From
vignesh C
Date:
On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi
>
> >
> > my $bytes = $ARGV[0];
> > for(my $i = 0; $i < $bytes; $i+=8){
> >      print "longdata";
> > }
> > print "\n";
> > --------
> >
> > postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> > with (parallel 2);
> >
> > This gets stuck forever (or at least I didn't have the patience to wait
> > it finish). Both worker processes are consuming 100% of CPU.
>
> I had a look over this problem.
>
> the ParallelCopyDataBlock has size limit:
>         uint8           skip_bytes;
>         char            data[DATA_BLOCK_SIZE];  /* data read from file */
>
> It seems the input line is so long that the leader process run out of the Shared memory among parallel copy workers.
> And the leader process keep waiting free block.
>
> For the worker process, it wait util line_state becomes LINE_LEADER_POPULATED,
> But leader process won't set the line_state unless it read the whole line.
>
> So it stuck forever.
> May be we should reconsider about this situation.
>
> The stack is as follows:
>
> Leader stack:
> #3  0x000000000075f7a1 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
> #4  0x00000000005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
> #5  0x00000000005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536, copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
> #6  0x00000000005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
> #7  0x00000000005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863
>
> Worker stack:
> #0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
> #1  0x00000000005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
> #2  0x00000000005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
> #3  0x00000000005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
> #4  0x00000000005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
> #5  0x00000000005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985
>

Thanks for providing your thoughts. I have analyzed this issue and I'm
working on a fix for it; I will be posting a patch shortly.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed, until the worker working on the previous block has finished
> > >> scanning the EOLs, and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer multiple consumer) used by the
> > > patch the worker doesn't need to wait till the complete block is
> > > processed, it can start processing the lines already found. This will
> > > also allow workers to start much earlier to process the data as it
> > > doesn't need to wait for all the offsets corresponding to 64K block
> > > ready. However, in the design where each worker is processing the 64K
> > > block, it can lead to much longer waits. I think this will impact the
> > > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > > receive line-by-line from client and find the line-endings by leader.
> > > If the leader doesn't find the line-endings the workers need to wait
> > > till the leader fill the entire 64K chunk, OTOH, with current approach
> > > the worker can start as soon as leader is able to populate some
> > > minimum number of line-endings
> >
> > You can use a smaller block size.
> >
>
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block. And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
> >
>
> Okay, but this first-phase (finding the line boundaries) can anyway be
> not done in parallel and we have seen in some of the initial
> benchmarking that this initial phase is a small part of work
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much if this splitting is done in a single process or
> multiple processes.
>
> > If the workers ever have to sit idle, it means
> > that the bottleneck is in receiving data from the client, i.e. the
> > backend is fast enough, and you can't make the overall COPY finish any
> > faster no matter how you do it.
> >
> > > The other point is that the leader backend won't be used completely as
> > > it is only doing a very small part (primarily reading the file) of the
> > > overall work.
> >
> > An idle process doesn't cost anything. If you have free CPU resources,
> > use more workers.
> >
> > > We have discussed both these approaches (a) single producer multiple
> > > consumer, and (b) all workers doing the processing as you are saying
> > > in the beginning and concluded that (a) is better, see some of the
> > > relevant emails [1][2][3].
> > >
> > > [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> > > [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> > > [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
> >
> > Sorry I'm late to the party. I don't think the design I proposed was
> > discussed in that threads.
> >
>
> I think something close to that is discussed as you have noticed in
> your next email but IIRC, because many people (Andres, Ants, myself
> and author) favoured the current approach (single reader and multiple
> consumers) we decided to go with that. I feel this patch is very much
> in the POC stage due to which the code doesn't look good and as we
> move forward we need to see what is the better way to improve it,
> maybe one of the ways is to split it as you are suggesting so that it
> can be easier to review. I think the other important thing which this
> patch has not addressed properly is the parallel-safety checks as
> pointed by me earlier. There are two things to solve there (a) the
> lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
> APIs, etc.) have checks which doesn't allow any writes, we need to see
> which of those we can open now (or do some additional work to prevent
> from those checks) after some of the work done for parallel-writes in
> PG-13[1][2], and (b) in which all cases we can parallel-writes
> (parallel copy) is allowed, for example need to identify whether table
> or one of its partitions has any constraint/expression which is
> parallel-unsafe.
>

I have worked to provide a patch for the parallel-safety checks. It
checks whether copy can be performed in parallel. Parallel copy cannot
be performed in the following cases: a) if the relation is a temporary
table, b) if the relation is a foreign table, c) if the relation has
non-parallel-safe index expressions, d) if the relation has triggers
of any type other than before-statement triggers, e) if the relation
has check constraints which are not parallel safe, or f) if the
relation is partitioned and any partition falls into the above
categories. This patch has the checks for it and will be used by the
parallel copy implementation.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Amit Kapila
Date:
On Tue, Nov 10, 2020 at 7:12 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> I have worked to provide a patch for the parallel safety checks. It
> checks if parallely copy can be performed, Parallel copy cannot be
> performed for the following a) If relation is temporary table b) If
> relation is foreign table c) If relation has non parallel safe index
> expressions d) If relation has triggers present whose type is of non
> before statement trigger type e) If relation has check constraint
> which are not parallel safe f) If relation has partition and any
> partition has the above type. This patch has the checks for it. This
> patch will be used by parallel copy implementation.
>

How did you ensure that this is sufficient? For parallel-insert's
patch we have enabled parallel-mode for Inserts and ran the tests with
force_parallel_mode to see if we are not missing anything. Also, it
seems there are many common things here w.r.t parallel-insert patch,
is it possible to prepare this atop that patch or do you have any
reason to keep this separate?

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
vignesh C
Date:
On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 10, 2020 at 7:12 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > I have worked to provide a patch for the parallel safety checks. It
> > checks if parallely copy can be performed, Parallel copy cannot be
> > performed for the following a) If relation is temporary table b) If
> > relation is foreign table c) If relation has non parallel safe index
> > expressions d) If relation has triggers present whose type is of non
> > before statement trigger type e) If relation has check constraint
> > which are not parallel safe f) If relation has partition and any
> > partition has the above type. This patch has the checks for it. This
> > patch will be used by parallel copy implementation.
> >
>
> How did you ensure that this is sufficient? For parallel-insert's
> patch we have enabled parallel-mode for Inserts and ran the tests with
> force_parallel_mode to see if we are not missing anything. Also, it
> seems there are many common things here w.r.t parallel-insert patch,
> is it possible to prepare this atop that patch or do you have any
> reason to keep this separate?
>

I have done similar testing for copy too: I set force_parallel_mode to
regress, hardcoded the code to pick parallel workers for the copy
operation, and ran make installcheck-world to verify. Many checks in
this patch are common between both patches, but I was not sure how to
handle that, as both projects are in progress and are being updated
based on the reviewers' opinions. How to handle this?
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
Amit Kapila
Date:
On Wed, Nov 11, 2020 at 10:42 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Nov 10, 2020 at 7:12 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > >
> > > I have worked to provide a patch for the parallel safety checks. It
> > > checks if parallely copy can be performed, Parallel copy cannot be
> > > performed for the following a) If relation is temporary table b) If
> > > relation is foreign table c) If relation has non parallel safe index
> > > expressions d) If relation has triggers present whose type is of non
> > > before statement trigger type e) If relation has check constraint
> > > which are not parallel safe f) If relation has partition and any
> > > partition has the above type. This patch has the checks for it. This
> > > patch will be used by parallel copy implementation.
> > >
> >
> > How did you ensure that this is sufficient? For parallel-insert's
> > patch we have enabled parallel-mode for Inserts and ran the tests with
> > force_parallel_mode to see if we are not missing anything. Also, it
> > seems there are many common things here w.r.t parallel-insert patch,
> > is it possible to prepare this atop that patch or do you have any
> > reason to keep this separate?
> >
>
> I have done similar testing for copy too, I had set force_parallel
> mode to regress, hardcoded in the code to pick parallel workers for
> copy operation and ran make installcheck-world to verify. Many checks
> in this patch are common between both patches, but I was not sure how
> to handle it as both the projects are in-progress and are being
> updated based on the reviewer's opinion. How to handle this?
> Thoughts?
>

I have not studied the differences in detail but if it is possible to
prepare it on top of that patch then there shouldn't be a problem. To
avoid confusion if you want you can always either post the latest
version of that patch with your patch or point to it.

-- 
With Regards,
Amit Kapila.



Re: Parallel copy

From
Bharath Rupireddy
Date:
On Thu, Oct 29, 2020 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.
>
> One more point, I have noticed that some time back [1], I have given
> one suggestion related to the way workers process the set of lines
> (aka chunk). I think you can try by increasing the chunk size to say
> 100, 500, 1000 and use some shared counter to remember the number of
> chunks processed.
>

Hi, I did some analysis on using a spinlock-protected worker write
position (i.e. each worker acquires a spinlock on a shared write
position to choose the next available chunk) vs. each worker hopping
to get the next available chunk position.

Use case: 10mn rows, 5.6GB data, 2 indexes on integer columns, 1 index
on a text column. Results are of the form (number of workers, total
exec time in sec, index insertion time in sec, worker write pos get
time in sec, buffer contention event count):

With spinlock:
(1,1126.443,1060.067,0.478,0), (2,669.343,630.769,0.306,26),
(4,346.297,326.950,0.161,89), (8,209.600,196.417,0.088,291),
(16,166.113,157.086,0.065,1468), (20,173.884,166.013,0.067,2700),
(30,173.087,1166.565,0.0065,5346)

Without spinlock:
(1,1119.695,1054.586,0.496,0), (2,645.733,608.313,1.5,8),
(4,340.620,320.344,1.6,58), (8,203.985,189.644,1.3,222),
(16,142.997,133.045,1,813), (20,132.621,122.527,1.1,1215),
(30,135.737,126.716,1.5,2901)

With the spinlock, each worker gets the required write position quickly
and proceeds until the index insertion (which becomes a single point of
contention), where we observed more buffer lock contention. The reason
is that all the workers reach the index insertion point at around the
same time.

Without the spinlock, each worker spends some time hopping to get the
write position, during which the other workers are inserting into the
indexes. So basically, the workers do not all reach the index insertion
point at the same time, and hence there is less buffer lock contention.

The same behaviour (explained above) is observed with different worker
chunk counts (default 64, 128, 512 and 1024), i.e. the number of tuples
each worker caches into its local memory before inserting into the
table.

In summary: with the spinlock, it looks like we are able to avoid
workers waiting to get the next chunk, which also means that we are not
creating any contention point inside the parallel copy code. However,
this causes another choking point, i.e. index insertion if indexes are
present on the table, which is out of scope of the parallel copy code.
We think that it would be good to use a spinlock-protected worker write
position or an atomic variable for the worker write position (as it
performs equal to the spinlock or a little better on some platforms).
Thoughts?
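For illustration, the two variants compared above boil down to roughly
the following sketch (plain pthread/C11 primitives with invented names;
the patch itself would presumably use the PostgreSQL spinlock and
atomics APIs):

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static pthread_mutex_t write_pos_lock = PTHREAD_MUTEX_INITIALIZER;
static uint32_t write_pos_locked;           /* protected by write_pos_lock */

static _Atomic uint32_t write_pos_atomic;   /* lock-free counterpart */

/* Variant 1: protect the shared write position with a lock. */
static uint32_t
claim_next_chunk_locked(void)
{
    uint32_t    mine;

    pthread_mutex_lock(&write_pos_lock);
    mine = write_pos_locked++;
    pthread_mutex_unlock(&write_pos_lock);
    return mine;
}

/* Variant 2: a single atomic fetch-and-add, no lock at all. */
static uint32_t
claim_next_chunk_atomic(void)
{
    return atomic_fetch_add(&write_pos_atomic, 1);
}

Either way there is exactly one shared counter; the measurements above
suggest the difference shows up mainly in how bunched-up the workers
arrive at the index insertion phase.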

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
vignesh C
Date:
On Thu, Oct 29, 2020 at 11:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Oct 27, 2020 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> [latest version]
>
> I think the parallel-safety checks in this patch
> (v9-0002-Allow-copy-from-command-to-process-data-from-file) are
> incomplete and wrong. See below comments.
> 1.
> +static pg_attribute_always_inline bool
> +CheckExprParallelSafety(CopyState cstate)
> +{
> + if (contain_volatile_functions(cstate->whereClause))
> + {
> + if (max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE)
> + return false;
> + }
>
> I don't understand the above check. Why do we only need to check where
> clause for parallel-safety when it contains volatile functions? It
> should be checked otherwise as well, no? The similar comment applies
> to other checks in this function. Also, I don't think there is a need
> to make this function inline.
>

I felt we should check whether the where clause is parallel safe and
also check that it does not contain volatile functions; the latter is
to avoid cases where expressions may query the table we're inserting
into. Modified it accordingly.
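For reference, a hedged sketch of that combined check, reusing the calls
from the quoted snippet (the actual function in the patch may be named
and structured differently):

/*
 * The WHERE clause must be parallel safe and must not contain volatile
 * functions; check both unconditionally instead of only checking
 * safety when a volatile function is present.
 */
static bool
CheckWhereClauseParallelSafety(CopyState cstate)
{
    if (cstate->whereClause == NULL)
        return true;

    if (contain_volatile_functions(cstate->whereClause))
        return false;

    /* cast as in the quoted snippet */
    if (max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE)
        return false;

    return true;
}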

> 2.
> +/*
> + * IsParallelCopyAllowed
> + *
> + * Check if parallel copy can be allowed.
> + */
> +bool
> +IsParallelCopyAllowed(CopyState cstate)
> {
> ..
> + * When there are BEFORE/AFTER/INSTEAD OF row triggers on the table. We do
> + * not allow parallelism in such cases because such triggers might query
> + * the table we are inserting into and act differently if the tuples that
> + * have already been processed and prepared for insertion are not there.
> + * Now, if we allow parallelism with such triggers the behaviour would
> + * depend on if the parallel worker has already inserted or not that
> + * particular tuples.
> + */
> + if (cstate->rel->trigdesc != NULL &&
> + (cstate->rel->trigdesc->trig_insert_after_statement ||
> + cstate->rel->trigdesc->trig_insert_new_table ||
> + cstate->rel->trigdesc->trig_insert_before_row ||
> + cstate->rel->trigdesc->trig_insert_after_row ||
> + cstate->rel->trigdesc->trig_insert_instead_row))
> + return false;
> ..
>
> Why do we need to disable parallelism for before/after row triggers
> unless they have parallel-unsafe functions? I see a few lines down in
> this function you are checking parallel-safety of trigger functions,
> what is the use of the same if you are already disabling parallelism
> with the above check.
>

Currently only before-statement triggers are supported; the rest of the
triggers are not supported, and a comment to that effect is present
atop the checks. Removed the parallel-safe check which was not required.

> 3. What about if the index on table has expressions that are
> parallel-unsafe? What is your strategy to check parallel-safety for
> partitioned tables?
>
> I suggest checking Greg's patch for parallel-safety of Inserts [1]. I
> think you will find that most of those checks are required here as
> well and see how we can use that patch (at least what is common). I
> feel the first patch should be just to have parallel-safety checks and
> we can test that by trying to enable Copy with force_parallel_mode. We
> can build the rest of the patch atop of it or in other words, let's
> move all parallel-safety work into a separate patch.
>

I have made this a separate patch as of now. I will work on seeing
whether I can use Greg's changes as they are; if required, I will
provide a few review comments on top of Greg's patch so that it is
usable for parallel copy too, and later post a separate patch with the
changes on top of it. I will retain it as a separate patch till then.

> Few assorted comments:
> ========================
> 1.
> +/*
> + * ESTIMATE_NODE_SIZE - Estimate the size required for  node type in shared
> + * memory.
> + */
> +#define ESTIMATE_NODE_SIZE(list, listStr, strsize) \
> +{ \
> + uint32 estsize = sizeof(uint32); \
> + if ((List *)list != NIL) \
> + { \
> + listStr = nodeToString(list); \
> + estsize += strlen(listStr) + 1; \
> + } \
> + \
> + strsize = add_size(strsize, estsize); \
> +}
>
> This can be probably a function instead of a macro.
>

Changed it to a function.
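For reference, a hedged sketch of what that macro-to-function conversion
could look like (the name and exact signature are invented, not taken
from the v10 patch):

static void
EstimateNodeSize(List *list, char **listStr, Size *strsize)
{
    Size        estsize = sizeof(uint32);

    if (list != NIL)
    {
        *listStr = nodeToString(list);
        estsize += strlen(*listStr) + 1;
    }
    *strsize = add_size(*strsize, estsize);
}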

> 2.
> +/*
> + * ESTIMATE_1BYTE_STR_SIZE - Estimate the size required for  1Byte strings in
> + * shared memory.
> + */
> +#define ESTIMATE_1BYTE_STR_SIZE(src, strsize) \
> +{ \
> + strsize = add_size(strsize, sizeof(uint8)); \
> + strsize = add_size(strsize, (src) ? 1 : 0); \
> +}
>
> This could be an inline function.
>

Changed it to an inline function.

> 3.
> +/*
> + * SERIALIZE_1BYTE_STR - Copy 1Byte strings to shared memory.
> + */
> +#define SERIALIZE_1BYTE_STR(dest, src, copiedsize) \
> +{ \
> + uint8 len = (src) ? 1 : 0; \
> + memcpy(dest + copiedsize, (uint8 *) &len, sizeof(uint8)); \
> + copiedsize += sizeof(uint8); \
> + if (src) \
> + dest[copiedsize++] = src[0]; \
> +}
>
> Similarly, this could be a function. I think keeping such things as
> macros in-between code makes it difficult to read. Please see if you
> can make these and similar macros as functions unless they are doing
> few memory instructions. Having functions makes it easier to debug the
> code as well.
>

Changed it to a function.

Attached v10 patch has the fixes for the same.


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
vignesh C
Date:
On Thu, Oct 29, 2020 at 2:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> > Attached v9 patches have the fixes for the above comments.
>
> I did some testing:
>
> /tmp/longdata.pl:
> --------
> #!/usr/bin/perl
> #
> # Generate three rows:
> # foo
> # longdatalongdatalongdata...
> # bar
> #
> # The length of the middle row is given as command line arg.
> #
>
> my $bytes = $ARGV[0];
>
> print "foo\n";
> for(my $i = 0; $i < $bytes; $i+=8){
>      print "longdata";
> }
> print "\n";
> print "bar\n";
> --------
>
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> with (parallel 2);
>
> This gets stuck forever (or at least I didn't have the patience to wait
> it finish). Both worker processes are consuming 100% of CPU.
>

Thanks for identifying this issue; it is fixed in the v10 patch posted at [1]
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
vignesh C
Date:
On Wed, Oct 28, 2020 at 5:36 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi
>
> I found some issue in v9-0002
>
> 1.
> +
> +       elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
> +                write_pos, lineInfo->first_block,
> +                pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
> +                offset, pg_atomic_read_u32(&lineInfo->line_size));
> +
>
> write_pos or other variable to be printed here are type of uint32, I think it'better to use '%u' in elog msg.
>

Modified it.

> 2.
> +                * line_size will be set. Read the line_size again to be sure if it is
> +                * completed or partial block.
> +                */
> +               dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> +               if (dataSize)
>
> It use dataSize( type int ) to get uint32 which seems a little dangerous.
> Is it better to define dataSize uint32 here?
>

Modified it.

> 3.
> Since function with  'Cstate' in name has been changed to 'CState'
> I think we can change function PopulateCommonCstateInfo as well.
>

Modified it.

> 4.
> +       if (pcdata->worker_line_buf_count)
>
> I think some check like the above can be 'if (xxx > 0)', which seems easier to understand.

Modified it.

Thanks for the comments; these issues are fixed in the v10 patch posted at [1]
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
vignesh C
Date:
On Thu, Oct 29, 2020 at 2:26 PM Daniel Westermann (DWE)
<daniel.westermann@dbi-services.com> wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> >> Attached v9 patches have the fixes for the above comments.
>
> >I did some testing:
>
> I did some testing as well and have a cosmetic remark:
>
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000000000);
> ERROR:  value 1000000000 out of bounds for option "parallel"
> DETAIL:  Valid values are between "1" and "1024".
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 100000000000);
> ERROR:  parallel requires an integer value
> postgres=#
>
> Wouldn't it make more sense to only have one error message? The first one seems to be the better message.
>

I had seen similar behavior in other places too:
postgres=# vacuum (parallel 1000000000) t1;
ERROR:  parallel vacuum degree must be between 0 and 1024
LINE 1: vacuum (parallel 1000000000) t1;
                ^
postgres=# vacuum (parallel 100000000000) t1;
ERROR:  parallel requires an integer value

I'm not sure if we should fix this.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
On Fri, Nov 13, 2020 at 2:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 11, 2020 at 10:42 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Nov 10, 2020 at 7:12 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > >
> > > > I have worked to provide a patch for the parallel safety checks. It
> > > > checks if parallely copy can be performed, Parallel copy cannot be
> > > > performed for the following a) If relation is temporary table b) If
> > > > relation is foreign table c) If relation has non parallel safe index
> > > > expressions d) If relation has triggers present whose type is of non
> > > > before statement trigger type e) If relation has check constraint
> > > > which are not parallel safe f) If relation has partition and any
> > > > partition has the above type. This patch has the checks for it. This
> > > > patch will be used by parallel copy implementation.
> > > >
> > >
> > > How did you ensure that this is sufficient? For parallel-insert's
> > > patch we have enabled parallel-mode for Inserts and ran the tests with
> > > force_parallel_mode to see if we are not missing anything. Also, it
> > > seems there are many common things here w.r.t parallel-insert patch,
> > > is it possible to prepare this atop that patch or do you have any
> > > reason to keep this separate?
> > >
> >
> > I have done similar testing for copy too, I had set force_parallel
> > mode to regress, hardcoded in the code to pick parallel workers for
> > copy operation and ran make installcheck-world to verify. Many checks
> > in this patch are common between both patches, but I was not sure how
> > to handle it as both the projects are in-progress and are being
> > updated based on the reviewer's opinion. How to handle this?
> > Thoughts?
> >
>
> I have not studied the differences in detail but if it is possible to
> prepare it on top of that patch then there shouldn't be a problem. To
> avoid confusion if you want you can always either post the latest
> version of that patch with your patch or point to it.
>

I have made this a separate patch as of now. I will work on seeing
whether I can use Greg's changes as they are; if required, I will
provide a few review comments on top of Greg's patch so that it is
usable for parallel copy too, and later post a separate patch with the
changes on top of it. I will retain it as a separate patch till then.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:


On Sat, Oct 31, 2020 at 2:07 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> I've done a bit more testing today, and I think the parsing is busted in
> some way. Consider this:
>
>      test=# create extension random;
>      CREATE EXTENSION
>
>      test=# create table t (a text);
>      CREATE TABLE
>
>      test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,10000);
>      INSERT 0 10000
>
>      test=# copy t to '/mnt/data/t.csv';
>      COPY 10000
>
>      test=# truncate t;
>      TRUNCATE TABLE
>
>      test=# copy t from '/mnt/data/t.csv';
>      COPY 10000
>
>      test=# truncate t;
>      TRUNCATE TABLE
>
>      test=# copy t from '/mnt/data/t.csv' with (parallel 2);
>      ERROR:  invalid byte sequence for encoding "UTF8": 0x00
>      CONTEXT:  COPY t, line 485: "m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB>F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
>      parallel worker
>
>
> The functions come from an extension I use to generate random data, I've
> pushed it to github [1]. The random_string() generates a random string
> with ASCII characters, symbols and a couple special characters (\r\n\t).
> The intent was to try loading data where a fields may span multiple 64kB
> blocks and may contain newlines etc.
>
> The non-parallel copy works fine, the parallel one fails. I haven't
> investigated the details, but I guess it gets confused about where a
> string starts/end, or something like that.
>

Thanks for identifying this issue; it is fixed in the v10 patch posted at [1]
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

From
vignesh C
Date:
On Sat, Nov 7, 2020 at 7:01 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
> >
> > Hi
> >
> > >
> > > my $bytes = $ARGV[0];
> > > for(my $i = 0; $i < $bytes; $i+=8){
> > >      print "longdata";
> > > }
> > > print "\n";
> > > --------
> > >
> > > postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> > > with (parallel 2);
> > >
> > > This gets stuck forever (or at least I didn't have the patience to wait
> > > for it to finish). Both worker processes are consuming 100% of CPU.
> >
> > I had a look at this problem.
> >
> > The ParallelCopyDataBlock has a size limit:
> >         uint8           skip_bytes;
> >         char            data[DATA_BLOCK_SIZE];  /* data read from file */
> >
> > It seems the input line is so long that the leader process runs out of the shared memory used by the parallel copy workers,
> > and the leader process keeps waiting for a free block.
> >
> > The worker process waits until line_state becomes LINE_LEADER_POPULATED,
> > but the leader process won't set line_state until it has read the whole line.
> >
> > So it is stuck forever.
> > Maybe we should reconsider this situation.
> >
> > The stack is as follows:
> >
> > Leader stack:
> > #3  0x000000000075f7a1 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
> > #4  0x00000000005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
> > #5  0x00000000005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536,
> >     copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
> > #6  0x00000000005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
> > #7  0x00000000005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863
> >
> > Worker stack:
> > #0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
> > #1  0x00000000005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
> > #2  0x00000000005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
> > #3  0x00000000005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
> > #4  0x00000000005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
> > #5  0x00000000005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985
> >
>
> Thanks for providing your thoughts. I have analyzed this issue and I'm
> working on a fix for it; I will post a patch for this
> shortly.
>

I have fixed this and provided a patch at [1].
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

RE: Parallel copy

From
"Hou, Zhijie"
Date:
Hi Vignesh,

I took a look at the v10 patch set. Here are some comments:

1. 
+/*
+ * CheckExprParallelSafety
+ *
+ * Determine if where cluase and default expressions are parallel safe & do not
+ * have volatile expressions, return true if condition satisfies else return
+ * false.
+ */

'cluase' seems a typo.


2.
+            /*
+             * Make sure that no worker has consumed this element, if this
+             * line is spread across multiple data blocks, worker would have
+             * started processing, no need to change the state to
+             * LINE_LEADER_POPULATING in this case.
+             */
+            (void) pg_atomic_compare_exchange_u32(&lineInfo->line_state,
+                                                  &current_line_state,
+                                                  LINE_LEADER_POPULATED);
About the comment:

+             * started processing, no need to change the state to
+             * LINE_LEADER_POPULATING in this case.

Does it mean "no need to change the state to LINE_LEADER_POPULATED" here?


3.
+ * 3) only one worker should choose one line for processing, this is handled by
+ *    using pg_atomic_compare_exchange_u32, worker will change the state to
+ *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.

In the latest patch, it will set the state to LINE_WORKER_PROCESSING if line_state is LINE_LEADER_POPULATED or
LINE_LEADER_POPULATING,
so the comment here seems wrong.


4.
A suggestion for CacheLineInfo.

It uses appendBinaryStringXXX to store the line in memory.
appendBinaryStringXXX will double the string memory when there is not enough space.

How about calling enlargeStringInfo in advance if we already know the whole line size?
It can avoid some memory waste and may improve performance a little.
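
To make the idea concrete, here is a minimal sketch of what I mean (this is
not the patch code; copy_line_to_buffer is just an illustrative name):

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Illustrative only: when the whole line length is known up front,
 * reserve the space once instead of letting appendBinaryStringInfo
 * grow the buffer by repeated doubling.
 */
static void
copy_line_to_buffer(StringInfo buf, const char *data, int line_size)
{
    /* one enlargement covering at least line_size additional bytes */
    enlargeStringInfo(buf, line_size);

    /* no further enlargement should be needed while appending */
    appendBinaryStringInfo(buf, data, line_size);
}

Note that enlargeStringInfo only guarantees room for the requested number of
additional bytes, so it would have to be called with the full remaining line
size before the first append.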


Best regards,
houzj





Re: Parallel copy

From
vignesh C
Date:
Thanks for the comments.
> I took a look at the v10 patch set. Here are some comments:
>
> 1.
> +/*
> + * CheckExprParallelSafety
> + *
> + * Determine if where cluase and default expressions are parallel safe & do not
> + * have volatile expressions, return true if condition satisfies else return
> + * false.
> + */
>
> 'cluase' seems a typo.
>

changed.

> 2.
> +                       /*
> +                        * Make sure that no worker has consumed this element, if this
> +                        * line is spread across multiple data blocks, worker would have
> +                        * started processing, no need to change the state to
> +                        * LINE_LEADER_POPULATING in this case.
> +                        */
> +                       (void) pg_atomic_compare_exchange_u32(&lineInfo->line_state,
> +                                                             &current_line_state,
> +                                                             LINE_LEADER_POPULATED);
> About the comment:
>
> +                        * started processing, no need to change the state to
> +                        * LINE_LEADER_POPULATING in this case.
>
> Does it mean "no need to change the state to LINE_LEADER_POPULATED" here?
>
>

Yes, it is LINE_LEADER_POPULATED; changed accordingly.

> 3.
> + * 3) only one worker should choose one line for processing, this is handled by
> + *    using pg_atomic_compare_exchange_u32, worker will change the state to
> + *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
>
> In the latest patch, it will set the state to LINE_WORKER_PROCESSING if line_state is LINE_LEADER_POPULATED or
> LINE_LEADER_POPULATING,
> so the comment here seems wrong.
>

Updated the comments.

> 4.
> A suggestion for CacheLineInfo.
>
> It uses appendBinaryStringXXX to store the line in memory.
> appendBinaryStringXXX will double the string memory when there is not enough space.
>
> How about calling enlargeStringInfo in advance if we already know the whole line size?
> It can avoid some memory waste and may improve performance a little.
>

Here we will not know the size beforehand; in some cases we will start
processing the data as soon as the current block is populated and keep
processing block by block, so we come to know the size only at the end.
We cannot use enlargeStringInfo because of this.

The attached v11 patch has the fix for this; it also includes changes
to rebase on top of head.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

RE: Parallel copy

From
"Hou, Zhijie"
Date:
> > 4.
> > A suggestion for CacheLineInfo.
> >
> > It use appendBinaryStringXXX to store the line in memory.
> > appendBinaryStringXXX will double the str memory when there is no enough
> spaces.
> >
> > How about call enlargeStringInfo in advance, if we already know the whole
> line size?
> > It can avoid some memory waste and may impove a little performance.
> >
> 
> Here we will not know the size beforehand; in some cases we will start
> processing the data as soon as the current block is populated and keep
> processing block by block, so we come to know the size only at the end.
> We cannot use enlargeStringInfo because of this.
>
> The attached v11 patch has the fix for this; it also includes changes
> to rebase on top of head.

Thanks for the explanation.

I think there are still cases where we can know the size.

+         * line_size will be set. Read the line_size again to be sure if it is
+         * completed or partial block.
+         */
+        dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+        if (dataSize != -1)
+        {

If I am not wrong, this seems to be the branch that processes a populated block.
I think we can check copiedSize here: if copiedSize == 0, that means
dataSize is the size of the whole line, and in this case we can do the enlarge.
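
Something along these lines, perhaps (a rough sketch only, reusing the
patch's naming for line_size, copiedSize and line_buf; this is not the
actual patch code):

#include "postgres.h"
#include "lib/stringinfo.h"
#include "port/atomics.h"

/*
 * Rough sketch: enlarge the destination buffer once when the full line
 * size is already known, i.e. the leader has set line_size and nothing
 * has been copied into line_buf yet (copiedSize == 0).
 */
static void
maybe_reserve_line_buf(StringInfo line_buf,
                       pg_atomic_uint32 *line_size,
                       uint32 copiedSize)
{
    uint32      dataSize = pg_atomic_read_u32(line_size);

    /* -1 means the leader has not yet finished populating the line */
    if (dataSize != (uint32) -1 && copiedSize == 0)
        enlargeStringInfo(line_buf, (int) dataSize);
}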


Best regards,
houzj





Re: Parallel copy

From
vignesh C
Date:
On Mon, Dec 7, 2020 at 3:00 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> > The attached v11 patch has the fix for this; it also includes changes to
> > rebase on top of head.
>
> Thanks for the explanation.
>
> I think there are still cases where we can know the size.
>
> +                * line_size will be set. Read the line_size again to be sure if it is
> +                * completed or partial block.
> +                */
> +               dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> +               if (dataSize != -1)
> +               {
>
> If I am not wrong, this seems to be the branch that processes a populated block.
> I think we can check copiedSize here: if copiedSize == 0, that means
> dataSize is the size of the whole line, and in this case we can do the enlarge.
>
>

Yes, this optimization can be done; I will handle it in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



RE: Parallel copy

From
"Hou, Zhijie"
Date:
Hi

> Yes, this optimization can be done; I will handle it in the next patch
> set.
> 

I have a suggestion for the parallel safety check.

As designed, the leader does not participate in the insertion of data.
If the user specifies (PARALLEL 1), there is only one worker process, which will do all the insertion.

IMO, we can skip some of the safety checks in this case, because those checks exist to limit parallel insertion
(except for temporary tables or ...).

So, how about checking (PARALLEL 1) separately?
Although it looks a bit complicated, (PARALLEL 1) does give a good performance improvement.
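
Roughly what I have in mind is sketched below. This is only an
illustration: FindMaxParallelWorkers and CheckParallelInsertSafety are
made-up names, not functions from the patch.

#include "postgres.h"
#include "utils/rel.h"

/* Illustrative helper assumed to run the full set of checks a) - f). */
extern bool CheckParallelInsertSafety(Relation rel);

/*
 * Sketch: decide how many workers parallel copy may use.  With
 * (PARALLEL 1) only a single process inserts, so checks that merely
 * guard against concurrent insertion could be relaxed; checks that hold
 * even for one worker (e.g. temporary tables, which are not visible to
 * other backends) must still apply.
 */
static int
FindMaxParallelWorkers(Relation rel, int requested_workers)
{
    if (RelationUsesLocalBuffers(rel))      /* temporary table */
        return 0;

    if (requested_workers == 1)
        return 1;                           /* effectively serial insertion */

    if (!CheckParallelInsertSafety(rel))
        return 0;

    return requested_workers;
}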

Best regards,
houzj



Re: Parallel copy

From
vignesh C
Date:
On Wed, Dec 23, 2020 at 3:05 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi
>
> > Yes this optimization can be done, I will handle this in the next patch
> > set.
> >
>
> I have a suggestion for the parallel safety check.
>
> As designed, the leader does not participate in the insertion of data.
> If the user specifies (PARALLEL 1), there is only one worker process, which will do all the insertion.
>
> IMO, we can skip some of the safety checks in this case, because those checks exist to limit parallel insertion
> (except for temporary tables or ...).
>
> So, how about checking (PARALLEL 1) separately?
> Although it looks a bit complicated, (PARALLEL 1) does give a good performance improvement.
>

Thanks for the comments, Hou Zhijie. I will run a few tests with 1
worker and try to include this in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

From
vignesh C
Date:
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed until the worker working on the previous block has finished
> > >> scanning the EOLs and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer, multiple consumers) used by the
> > > patch, the worker doesn't need to wait till the complete block is
> > > processed; it can start processing the lines already found. This will
> > > also allow workers to start processing the data much earlier, as they
> > > don't need to wait for all the offsets corresponding to the 64K block
> > > to be ready. However, the design where each worker processes a 64K
> > > block can lead to much longer waits. I think this will impact the
> > > COPY FROM STDIN case more, where in most cases (200-300 byte tuples)
> > > we receive data line by line from the client and the leader finds the
> > > line endings. If the leader doesn't find the line endings, the workers
> > > need to wait till the leader fills the entire 64K chunk; OTOH, with
> > > the current approach a worker can start as soon as the leader is able
> > > to populate some minimum number of line endings.
> >
> > You can use a smaller block size.
> >
>
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block. And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
> >
>
> Okay, but this first phase (finding the line boundaries) cannot anyway be
> done in parallel, and we have seen in some of the initial
> benchmarking that this initial phase is a small part of the work,
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much whether this splitting is done in a single process
> or in multiple processes.
>

I wrote a patch to compare the performance of the current implementation,
where the leader identifies the line boundaries, against a design where
the workers identify the line boundaries. The results are given below:
for each use case, the table lists the number of workers, the parallel
copy time in seconds with the leader identifying the line boundaries, and
the time in seconds with the workers identifying the line boundaries.

Use case 1 - 10 million rows, 5.2GB data, 3 indexes on integer columns:
workers | leader identifies (sec) | workers identify (sec)
      1 |                 211.206 |                632.583
      2 |                 165.402 |                360.152
      4 |                 137.608 |                219.623
      8 |                 128.003 |                206.851
     16 |                 114.518 |                177.790
     20 |                 109.257 |                170.058
     30 |                 102.050 |                158.376

Use case 2 - 10 million rows, 5.2GB data, 2 indexes on integer columns,
1 index on text column, csv file:
workers | leader identifies (sec) | workers identify (sec)
      1 |                1212.356 |               1602.118
      2 |                 707.191 |                849.105
      4 |                 369.620 |                441.068
      8 |                 221.359 |                252.775
     16 |                 167.152 |                180.207
     20 |                 168.804 |                181.986
     30 |                 172.320 |                194.875

Use case 3 - 10 million rows, 5.2GB data, without index:
workers | leader identifies (sec) | workers identify (sec)
      1 |                  96.317 |                437.453
      2 |                  70.730 |                240.517
      4 |                  64.436 |                197.604
      8 |                  67.186 |                175.630
     16 |                  76.561 |                156.015
     20 |                  81.025 |                150.687
     30 |                  86.578 |                148.481

Use case 4 - 10000 records, 9.6GB, toast data:
workers | leader identifies (sec) | workers identify (sec)
      1 |                 147.076 |                276.323
      2 |                 101.610 |                141.893
      4 |                 100.703 |                134.096
      8 |                 112.583 |                134.765
     16 |                 101.898 |                135.789
     20 |                 109.258 |                135.625
     30 |                 109.219 |                136.144

Attached is the patch that was used for this comparison. The patch is
written on top of the parallel copy patch.
The design that Amit, Andres and I voted for, i.e. the leader identifying
the line boundaries and sharing them in shared memory, is performing
better.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Parallel copy

From
Bharath Rupireddy
Date:
On Mon, Dec 28, 2020 at 3:14 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Attached is the patch that was used for this comparison. The patch is
> written on top of the parallel copy patch.
> The design that Amit, Andres and I voted for, i.e. the leader identifying
> the line boundaries and sharing them in shared memory, is performing
> better.

Hi Hackers, I see the following as some of the open problems with the parallel copy feature:

1) The leader identifying the line/tuple boundaries from the file and
letting the workers pick them up and insert in parallel, vs. the leader
reading the file and letting the workers identify the line/tuple
boundaries and insert
2) Determining parallel safety of partitioned tables
3) Bulk extension of the relation while inserting, i.e. adding more than
one extra block to the relation in RelationAddExtraBlocks

Please let me know if I'm missing anything.

For (1) - Vignesh's experiments above show that "the leader identifying
the line/tuple boundaries from the file and letting the workers pick them
up and insert in parallel" fares better.
For (2) - while this is being discussed in another thread (I'm not sure
of the status of that thread), how about we take this feature without
support for partitioned tables, i.e. parallel copy is disabled for
partitioned tables? Once the other discussion reaches a logical end, we
can come back and enable parallel copy for partitioned tables.
For (3) - we need a way to extend the relation or add new blocks quickly;
fallocate might help here. I'm not sure who's working on it; others can
comment better here.
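
Just to illustrate the idea for (3): something like the below on the
relation segment file could reserve space for many blocks in one call.
The function name and the hard-coded 8kB block size are only for
illustration; the real change would have to go through smgr/md.c.

#include <fcntl.h>

/*
 * Pre-allocate space for nblocks new 8kB blocks at the end of a
 * relation segment file in a single call, instead of extending it
 * block by block.  Returns 0 on success, an errno value otherwise.
 */
static int
preallocate_blocks(int fd, off_t current_size, int nblocks)
{
    return posix_fallocate(fd, current_size, (off_t) nblocks * 8192);
}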

Can we take the "parallel copy" feature forward, of course with some
restrictions in place?

Thoughts?

Regards,
Bharath Rupireddy.