Thread: Parallel copy
This work is to parallelize the copy command, and in particular the "Copy <table_name> From 'filename' Where <condition>;" command. Before going into how and what portion of 'copy command' processing we can parallelize, let us briefly review the top-level operations we perform while copying from a file into a table. We read the file in 64KB chunks, then find the line endings and process that data line by line, where each line corresponds to one tuple. We first form the tuple (in the form of a value/null array) from that line, check whether it satisfies the WHERE condition and, if so, perform constraint checks and a few other checks, and then finally store it in a local tuple array. Once we have accumulated 1000 tuples or consumed 64KB (whichever comes first), we insert them together via the table_multi_insert API, and then for each tuple we insert into the index(es) and execute after-row triggers.

So we do a lot of work after reading each 64KB chunk, and we can read the next chunk only after all the tuples in the previous chunk are processed. This gives us an opportunity to parallelize the processing of each 64KB chunk. I think we can do this in more than one way.

The first idea is that we allocate each chunk to a worker, and once the worker has finished processing the current chunk, it can start with the next unprocessed chunk. Here, we need to see how to handle the partial tuples at the end or beginning of each chunk. We can read the chunks into dsa/dsm instead of into a local buffer for processing. Alternatively, if we think that accessing shared memory can be costly, we can read the entire chunk into local memory but copy the partial tuple at the beginning of a chunk (if any) to a dsa; we mainly need the partial tuple in the shared memory area. The worker which has found the initial part of the partial tuple will be responsible for processing the entire tuple. Now, to detect whether there is a partial tuple at the beginning of a chunk, we always start reading one byte prior to the start of the current chunk, and if that byte is not a line-terminating byte, we know the first line is a partial tuple. While processing the chunk, we will then ignore this first line and start after the first line terminator.

To connect the partial tuple in two consecutive chunks, we need another data structure (for ease of reference in this email, I call it CTM (chunk-tuple-map)) in shared memory, where we store some per-chunk information like the chunk number, the dsa location of that chunk, and a variable which indicates whether we can free/reuse the current entry. Whenever we encounter a partial tuple at the beginning of a chunk, we note down its chunk number and dsa location in the CTM. Then, whenever we encounter a partial tuple at the end of a chunk, we search the CTM for the next chunk number and read from the corresponding dsa location until we encounter a line-terminating byte. Once we have read and processed this partial tuple, we can mark the entry as available for reuse. There are some loose ends here, like how many entries we should allocate in this data structure. It depends on whether we want to allow a worker to start reading the next chunk before the partial tuple of the previous chunk is processed. To keep it simple, we can allow the worker to process the next chunk only when the partial tuple in the previous chunk is processed. This allows us to keep the number of entries in the CTM equal to the number of workers.
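To make the CTM idea concrete, here is a minimal sketch of what an entry could look like. All names, the entry count, and the use of a plain offset in place of a real dsa_pointer are illustrative assumptions, not part of any actual patch:

    /*
     * Hypothetical chunk-tuple-map (CTM); everything here is illustrative.
     * In PostgreSQL this would live in a DSM segment, the chunk data in DSA,
     * and the array would be protected by an LWLock or per-entry spinlocks.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define CTM_ENTRIES 8               /* assumption: one entry per worker */

    typedef struct CtmEntry
    {
        uint64_t    chunk_number;       /* which 64KB chunk this describes */
        uint64_t    chunk_dsa_offset;   /* stand-in for a dsa_pointer */
        bool        reusable;           /* true once the partial tuple spanning
                                         * into this chunk has been consumed */
    } CtmEntry;

    typedef struct ChunkTupleMap
    {
        CtmEntry    entries[CTM_ENTRIES];
    } ChunkTupleMap;

A worker that finds a partial tuple at the end of its chunk would look up the entry whose chunk_number is its own chunk number plus one, read from that chunk's data until the first line terminator, and then set reusable to true.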
I think we can easily improve this if we want, but I don't think it will matter too much: in most cases, by the time we have processed the tuples in a chunk, its partial tuple will already have been consumed by the other worker.

Another approach that came up during an offlist discussion with Robert is that we have one dedicated worker for reading the chunks from the file; it copies the complete tuples of one chunk into shared memory and, once that is done, hands over that chunk to another worker which can process the tuples in that area. We can imagine that the reader worker is responsible for forming some sort of work queue that can be processed by the other workers. In this idea, we won't be able to get the benefit of initial tokenization (forming tuple boundaries) via parallel workers, and we might need some additional memory processing: after the reader worker has handed over the initial shared memory segment, we need to somehow identify tuple boundaries and then process them.

Another thing we need to figure out is how many workers to use for the copy command. I think we can base it on the file size, which needs some experiments, or maybe on user input.

I think we have two related problems to solve for this: (a) the relation extension lock (required for extending the relation), which won't conflict among workers due to group locking; we are working on a solution for this in another thread [1], and (b) the use of page locks in GIN indexes; we can probably disallow parallelism if the table has a GIN index, which is not a great thing, but not bad either.

To be clear, this work is for PG14.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B%2BSs%3DGn1LA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> This work is to parallelize the copy command, and in particular the "Copy <table_name> From 'filename' Where <condition>;" command.

Nice project, and a great stepping stone towards parallel DML.

> The first idea is that we allocate each chunk to a worker, and once the worker has finished processing the current chunk, it can start with the next unprocessed chunk. Here, we need to see how to handle the partial tuples at the end or beginning of each chunk. We can read the chunks into dsa/dsm instead of into a local buffer for processing. Alternatively, if we think that accessing shared memory can be costly, we can read the entire chunk into local memory but copy the partial tuple at the beginning of a chunk (if any) to a dsa; we mainly need the partial tuple in the shared memory area. The worker which has found the initial part of the partial tuple will be responsible for processing the entire tuple. Now, to detect whether there is a partial tuple at the beginning of a chunk, we always start reading one byte prior to the start of the current chunk, and if that byte is not a line-terminating byte, we know the first line is a partial tuple. While processing the chunk, we will then ignore this first line and start after the first line terminator.

That's quite similar to the approach I took with a parallel file_fdw patch[1], which mostly consisted of parallelising the reading part of copy.c, except that...

> To connect the partial tuple in two consecutive chunks, we need another data structure (for ease of reference in this email, I call it CTM (chunk-tuple-map)) in shared memory, where we store some per-chunk information like the chunk number, the dsa location of that chunk, and a variable which indicates whether we can free/reuse the current entry. Whenever we encounter a partial tuple at the beginning of a chunk, we note down its chunk number and dsa location in the CTM. Then, whenever we encounter a partial tuple at the end of a chunk, we search the CTM for the next chunk number and read from the corresponding dsa location until we encounter a line-terminating byte. Once we have read and processed this partial tuple, we can mark the entry as available for reuse. There are some loose ends here, like how many entries we should allocate in this data structure. It depends on whether we want to allow a worker to start reading the next chunk before the partial tuple of the previous chunk is processed. To keep it simple, we can allow the worker to process the next chunk only when the partial tuple in the previous chunk is processed. This allows us to keep the number of entries in the CTM equal to the number of workers. I think we can easily improve this if we want, but I don't think it will matter too much: in most cases, by the time we have processed the tuples in a chunk, its partial tuple will already have been consumed by the other worker.

... I didn't use a shm 'partial tuple' exchanging mechanism; I just had each worker follow the final tuple in its chunk into the next chunk, and had each worker ignore the first tuple in each chunk after chunk 0, because it knows someone else is looking after that. That means there was some double reading going on near the boundaries, and considering how much I've been complaining about bogus extra system calls on this mailing list lately, yeah, your idea of doing a bit more coordination is a better idea.
If you go this way, you might at least find the copy.c part of the patch I wrote useful as stand-in scaffolding code while you prototype the parallel writing side, if you don't already have something better for this?

> Another approach that came up during an offlist discussion with Robert is that we have one dedicated worker for reading the chunks from the file; it copies the complete tuples of one chunk into shared memory and, once that is done, hands over that chunk to another worker which can process the tuples in that area. We can imagine that the reader worker is responsible for forming some sort of work queue that can be processed by the other workers. In this idea, we won't be able to get the benefit of initial tokenization (forming tuple boundaries) via parallel workers, and we might need some additional memory processing: after the reader worker has handed over the initial shared memory segment, we need to somehow identify tuple boundaries and then process them.

Yeah, I have also wondered about something like this in a slightly different context. For parallel query in general, I wondered if there should be a Parallel Scatter node that can be put on top of any parallel-safe plan; it runs the plan in a worker process that just pushes tuples into a single-producer multi-consumer shm queue, and then other workers read from that whenever they need a tuple. Hmm, but for COPY, I suppose you'd want to push the raw lines, with minimal examination, into a shm queue rather than tuples, so I guess that's a bit different.

> Another thing we need to figure out is how many workers to use for the copy command. I think we can base it on the file size, which needs some experiments, or maybe on user input.

It seems like we don't even really have a general model for that sort of thing in the rest of the system yet, and I guess some kind of fairly dumb explicit system would make sense in the early days...

> Thoughts?

This is cool.

[1] https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com
On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This work is to parallelize the copy command, and in particular the "Copy <table_name> From 'filename' Where <condition>;" command.
>
> Nice project, and a great stepping stone towards parallel DML.

Thanks.

> > The first idea is that we allocate each chunk to a worker, and once the worker has finished processing the current chunk, it can start with the next unprocessed chunk. Here, we need to see how to handle the partial tuples at the end or beginning of each chunk. We can read the chunks into dsa/dsm instead of into a local buffer for processing. Alternatively, if we think that accessing shared memory can be costly, we can read the entire chunk into local memory but copy the partial tuple at the beginning of a chunk (if any) to a dsa; we mainly need the partial tuple in the shared memory area. The worker which has found the initial part of the partial tuple will be responsible for processing the entire tuple. Now, to detect whether there is a partial tuple at the beginning of a chunk, we always start reading one byte prior to the start of the current chunk, and if that byte is not a line-terminating byte, we know the first line is a partial tuple. While processing the chunk, we will then ignore this first line and start after the first line terminator.
>
> That's quite similar to the approach I took with a parallel file_fdw patch[1], which mostly consisted of parallelising the reading part of copy.c, except that...
>
> > To connect the partial tuple in two consecutive chunks, we need another data structure (for ease of reference in this email, I call it CTM (chunk-tuple-map)) in shared memory, where we store some per-chunk information like the chunk number, the dsa location of that chunk, and a variable which indicates whether we can free/reuse the current entry. Whenever we encounter a partial tuple at the beginning of a chunk, we note down its chunk number and dsa location in the CTM. Then, whenever we encounter a partial tuple at the end of a chunk, we search the CTM for the next chunk number and read from the corresponding dsa location until we encounter a line-terminating byte. Once we have read and processed this partial tuple, we can mark the entry as available for reuse. There are some loose ends here, like how many entries we should allocate in this data structure. It depends on whether we want to allow a worker to start reading the next chunk before the partial tuple of the previous chunk is processed. To keep it simple, we can allow the worker to process the next chunk only when the partial tuple in the previous chunk is processed. This allows us to keep the number of entries in the CTM equal to the number of workers. I think we can easily improve this if we want, but I don't think it will matter too much: in most cases, by the time we have processed the tuples in a chunk, its partial tuple will already have been consumed by the other worker.
>
> ... I didn't use a shm 'partial tuple' exchanging mechanism; I just had each worker follow the final tuple in its chunk into the next chunk, and had each worker ignore the first tuple in each chunk after chunk 0, because it knows someone else is looking after that.
> That means there was some double reading going on near the boundaries,

Right, and especially if the part in the second chunk is bigger, we might need to read most of the second chunk.

> and considering how much I've been complaining about bogus extra system calls on this mailing list lately, yeah, your idea of doing a bit more coordination is a better idea. If you go this way, you might at least find the copy.c part of the patch I wrote useful as stand-in scaffolding code while you prototype the parallel writing side, if you don't already have something better for this?

No, I haven't started writing anything yet, but I have some ideas on how to achieve this. I quickly skimmed through your patch, and I think it can be used as a starting point, though if we decide to go with accumulating the partial tuple, or all the data, in shm, then things might differ.

> > Another approach that came up during an offlist discussion with Robert is that we have one dedicated worker for reading the chunks from the file; it copies the complete tuples of one chunk into shared memory and, once that is done, hands over that chunk to another worker which can process the tuples in that area. We can imagine that the reader worker is responsible for forming some sort of work queue that can be processed by the other workers. In this idea, we won't be able to get the benefit of initial tokenization (forming tuple boundaries) via parallel workers, and we might need some additional memory processing: after the reader worker has handed over the initial shared memory segment, we need to somehow identify tuple boundaries and then process them.
>
> Yeah, I have also wondered about something like this in a slightly different context. For parallel query in general, I wondered if there should be a Parallel Scatter node that can be put on top of any parallel-safe plan; it runs the plan in a worker process that just pushes tuples into a single-producer multi-consumer shm queue, and then other workers read from that whenever they need a tuple.

The idea sounds great, but past experience shows that shoving all the tuples through a queue might add significant overhead. However, I don't know how exactly you are planning to use it.

> Hmm, but for COPY, I suppose you'd want to push the raw lines, with minimal examination, into a shm queue rather than tuples, so I guess that's a bit different.

Yeah.

> > Another thing we need to figure out is how many workers to use for the copy command. I think we can base it on the file size, which needs some experiments, or maybe on user input.
>
> It seems like we don't even really have a general model for that sort of thing in the rest of the system yet, and I guess some kind of fairly dumb explicit system would make sense in the early days...

Makes sense.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, 14 Feb 2020 at 11:57, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
...
> > Another approach that came up during an offlist discussion with Robert
> > is that we have one dedicated worker for reading the chunks from file
> > and it copies the complete tuples of one chunk in the shared memory
> > and once that is done, hands over that chunk to another worker which
> > can process tuples in that area. We can imagine that the reader
> > worker is responsible for forming some sort of work queue that can be
> > processed by the other workers. In this idea, we won't be able to get
> > the benefit of initial tokenization (forming tuple boundaries) via
> > parallel workers and might need some additional memory processing as
> > after reader worker has handed the initial shared memory segment, we
> > need to somehow identify tuple boundaries and then process them.
Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required. A single worker could also process a stream without having to reread/rewind, so it would be able to process input from STDIN or PROGRAM sources, making the improvements applicable to load operations done by third-party tools and scripted \copy in psql.
>
...
> > Another thing we need to figure out is how many workers to use for
> > the copy command. I think we can use it based on the file size which
> > needs some experiments or may be based on user input.
>
> It seems like we don't even really have a general model for that sort
> of thing in the rest of the system yet, and I guess some kind of
> fairly dumb explicit system would make sense in the early days...
>
makes sense.
The ratio between chunking or line parsing processes and the parallel worker pool would vary with the width of the table, complexity of the data or file (dates, encoding conversions), complexity of constraints and acceptable impact of the load. Being able to control it through user input would be great.
--
Alastair
On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>
> On Fri, 14 Feb 2020 at 11:57, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > >
> > > On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> ...
>
> > > > Another approach that came up during an offlist discussion with Robert is that we have one dedicated worker for reading the chunks from the file; it copies the complete tuples of one chunk into shared memory and, once that is done, hands over that chunk to another worker which can process the tuples in that area. We can imagine that the reader worker is responsible for forming some sort of work queue that can be processed by the other workers. In this idea, we won't be able to get the benefit of initial tokenization (forming tuple boundaries) via parallel workers, and we might need some additional memory processing: after the reader worker has handed over the initial shared memory segment, we need to somehow identify tuple boundaries and then process them.
>
> Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required.

AFAIU, the whole of this in-quote/out-of-quote state is managed inside CopyReadLineText, which will be done by each of the parallel workers, something along the lines of what Thomas did in his patch [1]. Basically, we need to invent a mechanism to allocate chunks to individual workers, and then the whole processing will be done as we are doing now, except for the special handling of partial tuples which I explained in my previous email. Am I missing something here?

> ...
>
> > > > Another thing we need to figure out is how many workers to use for the copy command. I think we can base it on the file size, which needs some experiments, or maybe on user input.
> > >
> > > It seems like we don't even really have a general model for that sort of thing in the rest of the system yet, and I guess some kind of fairly dumb explicit system would make sense in the early days...
> >
> > makes sense.
>
> The ratio between chunking or line-parsing processes and the parallel worker pool would vary with the width of the table, complexity of the data or file (dates, encoding conversions), complexity of constraints and acceptable impact of the load. Being able to control it through user input would be great.

Okay. I think one simple way could be that we compute the number of workers based on file size (some experiments are required to determine this) unless the user has given the input. If the user has provided the input, then we can use that, with an upper limit of max_parallel_workers.

[1] - https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
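As a strawman for that computation, it could look something like the sketch below; the function name, thresholds, and scaling constants are all assumptions that would need exactly the experiments mentioned above:

    /* Hypothetical heuristic: pick a worker count from the file size.
     * All constants are made up for illustration. */
    static int
    choose_copy_workers(long file_size, int user_requested,
                        int max_parallel_workers)
    {
        int         nworkers;

        if (user_requested > 0)
            nworkers = user_requested;          /* user input wins */
        else if (file_size < 16L * 1024 * 1024)
            nworkers = 0;                       /* small file: stay serial */
        else
            nworkers = (int) (file_size / (256L * 1024 * 1024)) + 1;

        /* clamp to the system-wide limit */
        if (nworkers > max_parallel_workers)
            nworkers = max_parallel_workers;
        return nworkers;
    }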
On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>
> ...
>
> > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required.
>
> AFAIU, the whole of this in-quote/out-of-quote state is managed inside CopyReadLineText, which will be done by each of the parallel workers, something along the lines of what Thomas did in his patch [1]. Basically, we need to invent a mechanism to allocate chunks to individual workers, and then the whole processing will be done as we are doing now, except for the special handling of partial tuples which I explained in my previous email. Am I missing something here?

The problem case that I see is the chunk boundary falling in the middle of a quoted field where
- The quote opens in chunk 1
- The quote closes in chunk 2
- There is an EoL character between the start of chunk 2 and the closing quote

When the worker processing chunk 2 starts, it believes itself to be in out-of-quote state, so only data between the start of the chunk and the EoL is regarded as belonging to the partial line. From that point on, the parsing of the rest of the chunk goes off track.

Some of the resulting errors can be avoided by, for instance, requiring a quote to be preceded by a delimiter or EoL. That answer fails when fields end with EoL characters, which happens often enough in the wild.

Recovering from an incorrect in-quote/out-of-quote state assumption at the start of parsing a chunk just seems like a hole with no bottom. So it looks to me like it's best done in a single process which can keep track of that state reliably.

--
Alastair
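To make this problem case concrete, consider a hypothetical two-tuple input where the first tuple's quoted field contains an embedded newline and the chunk boundary falls inside that field:

    logical tuples:   1,"alpha<LF>beta"      and      2,plain

    raw stream:       1,"al | pha<LF>beta"<LF>2,plain<LF>
                            ^ chunk boundary

The worker for chunk 2, assuming out-of-quote state, treats only "pha" (up to the first LF) as the tail of chunk 1's partial tuple, and starts parsing a fresh line at beta",... even though the first tuple really ends at the LF after beta". The stray quote then flips the in-quote state, and everything after it is parsed off track.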
On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
>
> On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
> >
> > ...
> >
> > > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required.
> >
> > AFAIU, the whole of this in-quote/out-of-quote state is managed inside CopyReadLineText, which will be done by each of the parallel workers, something along the lines of what Thomas did in his patch [1]. Basically, we need to invent a mechanism to allocate chunks to individual workers, and then the whole processing will be done as we are doing now, except for the special handling of partial tuples which I explained in my previous email. Am I missing something here?
>
> The problem case that I see is the chunk boundary falling in the middle of a quoted field where
> - The quote opens in chunk 1
> - The quote closes in chunk 2
> - There is an EoL character between the start of chunk 2 and the closing quote
>
> When the worker processing chunk 2 starts, it believes itself to be in out-of-quote state, so only data between the start of the chunk and the EoL is regarded as belonging to the partial line. From that point on, the parsing of the rest of the chunk goes off track.
>
> Some of the resulting errors can be avoided by, for instance, requiring a quote to be preceded by a delimiter or EoL. That answer fails when fields end with EoL characters, which happens often enough in the wild.
>
> Recovering from an incorrect in-quote/out-of-quote state assumption at the start of parsing a chunk just seems like a hole with no bottom. So it looks to me like it's best done in a single process which can keep track of that state reliably.

Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Feb 15, 2020 at 06:02:06PM +0530, Amit Kapila wrote:
> On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> >
> > On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
> > >
> > > ...
> > >
> > > > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required.
> > >
> > > AFAIU, the whole of this in-quote/out-of-quote state is managed inside CopyReadLineText, which will be done by each of the parallel workers, something along the lines of what Thomas did in his patch [1]. Basically, we need to invent a mechanism to allocate chunks to individual workers, and then the whole processing will be done as we are doing now, except for the special handling of partial tuples which I explained in my previous email. Am I missing something here?
> >
> > The problem case that I see is the chunk boundary falling in the middle of a quoted field where
> > - The quote opens in chunk 1
> > - The quote closes in chunk 2
> > - There is an EoL character between the start of chunk 2 and the closing quote
> >
> > When the worker processing chunk 2 starts, it believes itself to be in out-of-quote state, so only data between the start of the chunk and the EoL is regarded as belonging to the partial line. From that point on, the parsing of the rest of the chunk goes off track.
> >
> > Some of the resulting errors can be avoided by, for instance, requiring a quote to be preceded by a delimiter or EoL. That answer fails when fields end with EoL characters, which happens often enough in the wild.
> >
> > Recovering from an incorrect in-quote/out-of-quote state assumption at the start of parsing a chunk just seems like a hole with no bottom. So it looks to me like it's best done in a single process which can keep track of that state reliably.
>
> Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.

I see two pieces of this puzzle: an input format we control, and the ones we don't. In the former case, we could encode all fields with base85 (or something similar that reduces the input alphabet efficiently), then reserve bytes that denote delimiters of various types. ASCII has separators for file, group, record, and unit that we could use as inspiration.

I don't have anything to offer for free-form input other than to agree that it looks like a hole with no bottom, and maybe we should just keep that process serial, at least until someone finds a bottom.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
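As a sketch of what such a controlled format could reserve (illustrative only; these are the standard ASCII control separators, and the safety argument rests on base85's alphabet being purely printable):

    /* ASCII control separators, usable as unambiguous delimiters under the
     * assumption that field payloads are base85-encoded: base85 emits only
     * printable characters, so these bytes can never occur inside a field. */
    #define FILE_SEP    0x1C        /* FS */
    #define GROUP_SEP   0x1D        /* GS */
    #define RECORD_SEP  0x1E        /* RS: row boundary */
    #define UNIT_SEP    0x1F        /* US: field boundary */

With such a format, chunk splitting becomes trivial: every RECORD_SEP byte is a true row boundary, and workers need no quoting state at all.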
On 2/15/20 7:32 AM, Amit Kapila wrote:
> On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
>> On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>> ...
>>>> Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required.
>>>
>>> AFAIU, the whole of this in-quote/out-of-quote state is managed inside CopyReadLineText, which will be done by each of the parallel workers, something along the lines of what Thomas did in his patch [1]. Basically, we need to invent a mechanism to allocate chunks to individual workers, and then the whole processing will be done as we are doing now, except for the special handling of partial tuples which I explained in my previous email. Am I missing something here?
>>
>> The problem case that I see is the chunk boundary falling in the middle of a quoted field where
>> - The quote opens in chunk 1
>> - The quote closes in chunk 2
>> - There is an EoL character between the start of chunk 2 and the closing quote
>>
>> When the worker processing chunk 2 starts, it believes itself to be in out-of-quote state, so only data between the start of the chunk and the EoL is regarded as belonging to the partial line. From that point on, the parsing of the rest of the chunk goes off track.
>>
>> Some of the resulting errors can be avoided by, for instance, requiring a quote to be preceded by a delimiter or EoL. That answer fails when fields end with EoL characters, which happens often enough in the wild.
>>
>> Recovering from an incorrect in-quote/out-of-quote state assumption at the start of parsing a chunk just seems like a hole with no bottom. So it looks to me like it's best done in a single process which can keep track of that state reliably.
>
> Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.

IIRC, in_quote only matters here in CSV mode (because CSV fields can have embedded newlines). So why not just forbid parallel copy in CSV mode, at least for now? I guess it depends on the actual use case. If we expect to be parallel loading humungous CSVs then that won't fly.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> On 2/15/20 7:32 AM, Amit Kapila wrote:
> > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> >>
> >> The problem case that I see is the chunk boundary falling in the middle of a quoted field where
> >> - The quote opens in chunk 1
> >> - The quote closes in chunk 2
> >> - There is an EoL character between the start of chunk 2 and the closing quote
> >>
> >> When the worker processing chunk 2 starts, it believes itself to be in out-of-quote state, so only data between the start of the chunk and the EoL is regarded as belonging to the partial line. From that point on, the parsing of the rest of the chunk goes off track.
> >>
> >> Some of the resulting errors can be avoided by, for instance, requiring a quote to be preceded by a delimiter or EoL. That answer fails when fields end with EoL characters, which happens often enough in the wild.
> >>
> >> Recovering from an incorrect in-quote/out-of-quote state assumption at the start of parsing a chunk just seems like a hole with no bottom. So it looks to me like it's best done in a single process which can keep track of that state reliably.
> >
> > Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.
>
> IIRC, in_quote only matters here in CSV mode (because CSV fields can have embedded newlines).

AFAIU, that is correct.

> So why not just forbid parallel copy in CSV mode, at least for now? I guess it depends on the actual use case. If we expect to be parallel loading humungous CSVs then that won't fly.

I am not sure about this part. However, I guess we should at the very least have some extendable solution that can deal with CSV; otherwise, we might end up re-designing everything if someday we want to deal with CSV. One naive idea is that in CSV mode we can set things up slightly differently: a worker won't start processing a chunk unless the previous chunk is completely parsed. So each worker would first parse and tokenize the entire chunk and then start writing it. This will make the reading/parsing part serialized, but writes can still be parallel. Now, I don't know if it is a good idea to process in a different way for CSV mode.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, 15 Feb 2020 at 14:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.

I think having a single process handle splitting the input into tuples makes most sense. It's possible to parse CSV at multiple GB/s rates [1]; finding tuple boundaries is a subset of that task.

My first thought for a design would be to have two shared memory ring buffers, one for data and one for tuple start positions. The reader process reads the CSV data into the main buffer, finds tuple start locations in there, and writes those to the secondary buffer.

Worker processes claim a chunk of tuple positions from the secondary buffer and update their "keep this data around" position with the first position. They then proceed to parse and insert the tuples, updating their position until they find the end of the last tuple in the chunk.

Buffer size and maximum and minimum chunk size could be tunable. Ideally the buffers would be at least big enough to absorb one of the workers getting scheduled out for a timeslice, which could be up to tens of megabytes.

Regards,
Ants Aasma

[1] https://github.com/geofflangdale/simdcsv/
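A rough sketch of the shared memory layout this design implies follows; all names and sizes are assumptions for illustration, and a real patch would use pg_atomic_uint64 (or an LWLock) for the position fields rather than plain integers:

    #include <stdint.h>

    #define DATA_BUF_SIZE (8 * 1024 * 1024)   /* raw input bytes (ring) */
    #define POS_BUF_SIZE  (64 * 1024)         /* tuple start offsets (ring) */
    #define MAX_WORKERS   16

    typedef struct ParallelCopyShared
    {
        /* Primary ring: raw CSV bytes, filled by the reader. */
        char        data[DATA_BUF_SIZE];
        uint64_t    data_write_pos;     /* total bytes written by the reader */

        /* Secondary ring: absolute byte offset of each tuple start. */
        uint64_t    tuple_start[POS_BUF_SIZE];
        uint64_t    pos_write_pos;      /* entries published by the reader */
        uint64_t    pos_claim_pos;      /* next entry a worker may claim */

        /*
         * Per-worker "keep this data around" positions.  The reader must not
         * overwrite data bytes at or beyond the minimum of these.
         */
        uint64_t    keep_pos[MAX_WORKERS];
    } ParallelCopyShared;

Positions are monotonically increasing counters; the offset into a ring is simply the position modulo the ring size.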
At Mon, 17 Feb 2020 16:49:22 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> > On 2/15/20 7:32 AM, Amit Kapila wrote:
> > > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> >
> > So why not just forbid parallel copy in CSV mode, at least for now? I guess it depends on the actual use case. If we expect to be parallel loading humungous CSVs then that won't fly.
>
> I am not sure about this part. However, I guess we should at the very least have some extendable solution that can deal with CSV; otherwise, we might end up re-designing everything if someday we want to deal with CSV. One naive idea is that in CSV mode we can set things up slightly differently: a worker won't start processing a chunk unless the previous chunk is completely parsed. So each worker would first parse and tokenize the entire chunk and then start writing it. This will make the reading/parsing part serialized, but writes can still be parallel. Now, I don't know if it is a good idea to process in a different way for CSV mode.

In an extreme case, if we didn't see a QUOTE in a chunk, we cannot know whether the chunk is in a quoted section or not until all the past chunks are parsed. In the end we are forced to parse fully sequentially as long as we allow QUOTE.

On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV, QUOTE '')" in order to signal that there's no quoted section in the file, then all chunks would be fully concurrently parsable.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Feb 18, 2020 at 4:04 AM Ants Aasma <ants@cybertec.at> wrote:
> On Sat, 15 Feb 2020 at 14:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.
>
> I think having a single process handle splitting the input into tuples makes most sense. It's possible to parse CSV at multiple GB/s rates [1]; finding tuple boundaries is a subset of that task.

Yeah, this is compelling. Even though it has to read the file serially, the real gains from parallel COPY should come from doing the real work in parallel: data-type parsing, tuple forming, WHERE clause filtering, partition routing, buffer management, insertion and associated triggers, FKs and index maintenance.

The reason I used the other approach for the file_fdw patch is that I was trying to make it look as much as possible like parallel sequential scan and not create an extra worker, because I didn't feel like an FDW should be allowed to do that (what if executor nodes all over the query tree created worker processes willy-nilly?). Obviously it doesn't work correctly for embedded newlines, and even if you decree that multi-line values aren't allowed in parallel COPY, the stuff about tuples crossing chunk boundaries is still a bit unpleasant (whether solved by double reading as I showed, or by a bunch of tap dancing in shared memory) and creates overheads.

> My first thought for a design would be to have two shared memory ring buffers, one for data and one for tuple start positions. The reader process reads the CSV data into the main buffer, finds tuple start locations in there, and writes those to the secondary buffer.
>
> Worker processes claim a chunk of tuple positions from the secondary buffer and update their "keep this data around" position with the first position. They then proceed to parse and insert the tuples, updating their position until they find the end of the last tuple in the chunk.

+1. That sort of two-queue scheme is exactly how I sketched out a multi-consumer queue for a hypothetical Parallel Scatter node. It probably gets a bit trickier when the payload has to be broken up into fragments to wrap around the "data" buffer N times.
On Tue, 18 Feb 2020 at 04:40, Thomas Munro <thomas.munro@gmail.com> wrote:
> +1. That sort of two-queue scheme is exactly how I sketched out a multi-consumer queue for a hypothetical Parallel Scatter node. It probably gets a bit trickier when the payload has to be broken up into fragments to wrap around the "data" buffer N times.

At least for copy it should be easy enough - it already has to handle reading data block by block. If the worker updates its position while doing so, the reader can wrap around the data buffer. There will be no parallelism while one worker is buffering up a line larger than the data buffer, but that doesn't seem like a major issue. Once the line is buffered and the worker begins inserting, the next worker can start buffering the next tuple.

Regards,
Ants Aasma
On Mon, Feb 17, 2020 at 8:34 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Sat, 15 Feb 2020 at 14:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.
>
> I think having a single process handle splitting the input into tuples makes most sense. It's possible to parse CSV at multiple GB/s rates [1]; finding tuple boundaries is a subset of that task.
>
> My first thought for a design would be to have two shared memory ring buffers, one for data and one for tuple start positions. The reader process reads the CSV data into the main buffer, finds tuple start locations in there, and writes those to the secondary buffer.
>
> Worker processes claim a chunk of tuple positions from the secondary buffer and update their "keep this data around" position with the first position. They then proceed to parse and insert the tuples, updating their position until they find the end of the last tuple in the chunk.

This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 18, 2020 at 7:28 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Mon, 17 Feb 2020 16:49:22 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> > > On 2/15/20 7:32 AM, Amit Kapila wrote:
> > > > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> > >
> > > So why not just forbid parallel copy in CSV mode, at least for now? I guess it depends on the actual use case. If we expect to be parallel loading humungous CSVs then that won't fly.
> >
> > I am not sure about this part. However, I guess we should at the very least have some extendable solution that can deal with CSV; otherwise, we might end up re-designing everything if someday we want to deal with CSV. One naive idea is that in CSV mode we can set things up slightly differently: a worker won't start processing a chunk unless the previous chunk is completely parsed. So each worker would first parse and tokenize the entire chunk and then start writing it. This will make the reading/parsing part serialized, but writes can still be parallel. Now, I don't know if it is a good idea to process in a different way for CSV mode.
>
> In an extreme case, if we didn't see a QUOTE in a chunk, we cannot know whether the chunk is in a quoted section or not until all the past chunks are parsed. In the end we are forced to parse fully sequentially as long as we allow QUOTE.

Right. I think the benefits of this as compared to the single-reader idea would be (a) we can save accessing shared memory for most of the chunk, and (b) for non-CSV mode, even the tokenization (finding line boundaries) would also be parallel. OTOH, doing the processing differently for CSV and non-CSV mode might not be good.

> On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV, QUOTE '')" in order to signal that there's no quoted section in the file, then all chunks would be fully concurrently parsable.

Yeah, if we can provide such an option, we can probably make parallel CSV processing equivalent to non-CSV. However, users might not like this, as I think in some cases it won't be easy for them to tell whether the file has quoted fields or not. I am not very sure of this point.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
At Tue, 18 Feb 2020 15:59:36 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Tue, Feb 18, 2020 at 7:28 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > In an extreme case, if we didn't see a QUOTE in a chunk, we cannot know whether the chunk is in a quoted section or not until all the past chunks are parsed. In the end we are forced to parse fully sequentially as long as we allow QUOTE.
>
> Right. I think the benefits of this as compared to the single-reader idea would be (a) we can save accessing shared memory for most of the chunk, and (b) for non-CSV mode, even the tokenization (finding line boundaries) would also be parallel. OTOH, doing the processing differently for CSV and non-CSV mode might not be good.

Agreed. So I think it's a good point of compromise.

> > On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV, QUOTE '')" in order to signal that there's no quoted section in the file, then all chunks would be fully concurrently parsable.
>
> Yeah, if we can provide such an option, we can probably make parallel CSV processing equivalent to non-CSV. However, users might not like this, as I think in some cases it won't be easy for them to tell whether the file has quoted fields or not. I am not very sure of this point.

I'm not sure how large a portion of the usage contains quoted sections, so I'm not sure how useful it is.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.

I don't think any extra copying is needed. The reader can directly fread()/pq_copymsgbytes() into shared memory, and the workers can run the CopyReadLineText() inner loop directly off of the buffer in shared memory.

For serial performance of tokenization into lines, I really think a SIMD based approach will be fast enough for quite some time. I hacked up the code in the simdcsv project to only tokenize on line endings, and it was able to tokenize a CSV file with short lines at 8+ GB/s. There are going to be many other bottlenecks before this one starts limiting. Patch attached if you'd like to try that out.

Regards,
Ants Aasma
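As a portable scalar illustration of what the tokenization step amounts to (a stand-in sketch using memchr, not simdcsv's actual code; quote-state handling is assumed to be tracked by the caller or irrelevant for the input):

    #include <stddef.h>
    #include <string.h>

    /* Record the offset of every '\n' in buf[0..len) into out[]; returns the
     * number of line endings found. A missing trailing newline simply means
     * the buffer ends with a partial line. */
    static size_t
    find_line_endings(const char *buf, size_t len, size_t *out, size_t out_cap)
    {
        size_t      n = 0;
        const char *p = buf;
        const char *end = buf + len;

        while (n < out_cap && p < end)
        {
            const char *nl = memchr(p, '\n', (size_t) (end - p));

            if (nl == NULL)
                break;                  /* partial line at the end */
            out[n++] = (size_t) (nl - buf);
            p = nl + 1;                 /* continue after this newline */
        }
        return n;
    }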
On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.
>
> I don't think any extra copying is needed.

I am talking about access to shared memory instead of the process-local memory. I understand that an extra copy won't be required.

> The reader can directly fread()/pq_copymsgbytes() into shared memory, and the workers can run the CopyReadLineText() inner loop directly off of the buffer in shared memory.

I am slightly confused here. AFAIU, the for(;;) loop in CopyReadLineText is about finding the line endings, which we thought the reader process would do.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Feb 16, 2020 at 12:51 AM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> IIRC, in_quote only matters here in CSV mode (because CSV fields can
> have embedded newlines). So why not just forbid parallel copy in CSV
> mode, at least for now? I guess it depends on the actual use case. If we
> expect to be parallel loading humungous CSVs then that won't fly.
Loading large CSV files is pretty common here. I hope this can be supported.
On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> >
> > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.
> >
> > I don't think any extra copying is needed.
>
> I am talking about access to shared memory instead of the process-local memory. I understand that an extra copy won't be required.
>
> > The reader can directly fread()/pq_copymsgbytes() into shared memory, and the workers can run the CopyReadLineText() inner loop directly off of the buffer in shared memory.
>
> I am slightly confused here. AFAIU, the for(;;) loop in CopyReadLineText is about finding the line endings, which we thought the reader process would do.

Indeed, I somehow misread the code while scanning over it. So CopyReadLineText currently copies data from cstate->raw_buf to the StringInfo in cstate->line_buf. In parallel mode it would copy it from the shared data buffer to the local line_buf until it hits the line end found by the data reader. The amount of copying done is still exactly the same as it is now.

Regards,
Ants Aasma
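For illustration, that worker-side copy might look roughly like the sketch below; ParallelCopyShared and DATA_BUF_SIZE are the hypothetical names from the ring-buffer sketch earlier in the thread, while appendBinaryStringInfo() is the existing stringinfo routine copy.c already uses to fill line_buf:

    #include "postgres.h"
    #include "lib/stringinfo.h"

    /* Hypothetical: copy one line from the shared data ring into a worker's
     * local line_buf, handling wrap-around of the ring. */
    static void
    copy_line_to_local(ParallelCopyShared *shared, StringInfo line_buf,
                       uint64 line_start, uint64 line_end)
    {
        uint64      start = line_start % DATA_BUF_SIZE;
        uint64      len = line_end - line_start;

        if (start + len <= DATA_BUF_SIZE)
            appendBinaryStringInfo(line_buf, shared->data + start, (int) len);
        else
        {
            /* The line wraps past the end of the ring: copy two pieces. */
            uint64      first = DATA_BUF_SIZE - start;

            appendBinaryStringInfo(line_buf, shared->data + start, (int) first);
            appendBinaryStringInfo(line_buf, shared->data, (int) (len - first));
        }
    }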
On Tue, Feb 18, 2020 at 06:51:29PM +0530, Amit Kapila wrote:
> On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> >
> > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.
> >
> > I don't think any extra copying is needed.
>
> I am talking about access to shared memory instead of the process-local memory. I understand that an extra copy won't be required.

Isn't accessing shared memory from different pieces of execution what threads were designed to do?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Tue, Feb 18, 2020 at 8:41 PM David Fetter <david@fetter.org> wrote:
>
> On Tue, Feb 18, 2020 at 06:51:29PM +0530, Amit Kapila wrote:
> > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> > >
> > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.
> > >
> > > I don't think any extra copying is needed.
> >
> > I am talking about access to shared memory instead of the process-local memory. I understand that an extra copy won't be required.
>
> Isn't accessing shared memory from different pieces of execution what threads were designed to do?

Sorry, but I don't understand what you mean by the above. We are going to use background workers (which are processes) for the parallel workers. In general, accessing shared memory might not make a big difference compared to local memory, especially because the cost of the other stuff in copy is relatively higher. But still, it is a point to consider.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> > >
> > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.
> > >
> > > I don't think any extra copying is needed.
> >
> > I am talking about access to shared memory instead of the process-local memory. I understand that an extra copy won't be required.
> >
> > > The reader can directly fread()/pq_copymsgbytes() into shared memory, and the workers can run the CopyReadLineText() inner loop directly off of the buffer in shared memory.
> >
> > I am slightly confused here. AFAIU, the for(;;) loop in CopyReadLineText is about finding the line endings, which we thought the reader process would do.
>
> Indeed, I somehow misread the code while scanning over it. So CopyReadLineText currently copies data from cstate->raw_buf to the StringInfo in cstate->line_buf. In parallel mode it would copy it from the shared data buffer to the local line_buf until it hits the line end found by the data reader. The amount of copying done is still exactly the same as it is now.

Yeah, at a broader level it will be something like that, but the actual details might vary during implementation. BTW, have you given any thought to the other approach I shared above [1]? We might not go with that idea, but it is better to discuss different ideas and evaluate their pros and cons.

[1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 18, 2020 at 7:51 PM Mike Blackwell <mike.blackwell@rrd.com> wrote:
On Sun, Feb 16, 2020 at 12:51 AM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
IIRC, in_quote only matters here in CSV mode (because CSV fields can
have embedded newlines). So why not just forbid parallel copy in CSV
mode, at least for now? I guess it depends on the actual use case. If we
expect to be parallel loading humungous CSVs then that won't fly.

Loading large CSV files is pretty common here. I hope this can be supported.

Thank you for your inputs. They are important and valuable.
On Wed, 19 Feb 2020 at 06:22, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants@cybertec.at> wrote: > > > > On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote: > > > > > > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > This is something similar to what I had also in mind for this idea. I > > > > > had thought of handing over complete chunk (64K or whatever we > > > > > decide). The one thing that slightly bothers me is that we will add > > > > > some additional overhead of copying to and from shared memory which > > > > > was earlier from local process memory. And, the tokenization (finding > > > > > line boundaries) would be serial. I think that tokenization should be > > > > > a small part of the overall work we do during the copy operation, but > > > > > will do some measurements to ascertain the same. > > > > > > > > I don't think any extra copying is needed. > > > > > > > > > > I am talking about access to shared memory instead of the process > > > local memory. I understand that an extra copy won't be required. > > > > > > > The reader can directly > > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run > > > > CopyReadLineText() inner loop directly off of the buffer in shared memory. > > > > > > > > > > I am slightly confused here. AFAIU, the for(;;) loop in > > > CopyReadLineText is about finding the line endings which we thought > > > that the reader process will do. > > > > Indeed, I somehow misread the code while scanning over it. So CopyReadLineText > > currently copies data from cstate->raw_buf to the StringInfo in > > cstate->line_buf. In parallel mode it would copy it from the shared data buffer > > to local line_buf until it hits the line end found by the data reader. The > > amount of copying done is still exactly the same as it is now. > > > > Yeah, on a broader level it will be something like that, but actual > details might vary during implementation. BTW, have you given any > thoughts on one other approach I have shared above [1]? We might not > go with that idea, but it is better to discuss different ideas and > evaluate their pros and cons. > > [1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com It seems to be that at least for the general CSV case the tokenization to tuples is an inherently serial task. Adding thread synchronization to that path for coordinating between multiple workers is only going to make it slower. It may be possible to enforce limitations on the input (e.g. no quotes allowed) or do some speculative tokenization (e.g. if we encounter quote before newline assume the chunk started in a quoted section) to make it possible to do the tokenization in parallel. But given that the simpler and more featured approach of handling it in a single reader process looks to be fast enough, I don't see the point. I rather think that the next big step would be to overlap reading input and tokenization, hopefully by utilizing Andres's work on asyncio. Regards, Ants Aasma
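As a hedged editorial sketch of the "speculative tokenization" aside above, here is one dual-scan variant: a worker scans its chunk under both possible starting quote states and the correct result is picked once the previous chunk's final state is known, without rescanning. This uses simplified CSV rules (bare quote toggling, no escape handling), a fixed-size result array, and invented names throughout:

#include <stdbool.h>
#include <stddef.h>

typedef struct ScanResult
{
    size_t  newlines[8192];   /* offsets of unquoted newlines */
    int     count;
    bool    ends_in_quote;    /* quote state at the end of the chunk */
} ScanResult;

static void
scan_chunk(const char *buf, size_t len, bool start_in_quote, ScanResult *res)
{
    bool in_quote = start_in_quote;

    res->count = 0;
    for (size_t i = 0; i < len; i++)
    {
        if (buf[i] == '"')
            in_quote = !in_quote;
        else if (buf[i] == '\n' && !in_quote && res->count < 8192)
            res->newlines[res->count++] = i;
    }
    res->ends_in_quote = in_quote;
}

/* Speculative driver: compute both outcomes up front; pick one later,
 * based on the previous chunk's ends_in_quote. */
static void
scan_chunk_speculative(const char *buf, size_t len,
                       ScanResult *if_outside, ScanResult *if_inside)
{
    scan_chunk(buf, len, false, if_outside);
    scan_chunk(buf, len, true, if_inside);
}

The cost is scanning each chunk twice, which is why the single-reader approach may still win unless the scan itself is made very cheap.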
On Wed, Feb 19, 2020 at 11:02:15AM +0200, Ants Aasma wrote: >On Wed, 19 Feb 2020 at 06:22, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants@cybertec.at> wrote: >> > >> > On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote: >> > > >> > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote: >> > > > >> > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote: >> > > > > This is something similar to what I had also in mind for this idea. I >> > > > > had thought of handing over complete chunk (64K or whatever we >> > > > > decide). The one thing that slightly bothers me is that we will add >> > > > > some additional overhead of copying to and from shared memory which >> > > > > was earlier from local process memory. And, the tokenization (finding >> > > > > line boundaries) would be serial. I think that tokenization should be >> > > > > a small part of the overall work we do during the copy operation, but >> > > > > will do some measurements to ascertain the same. >> > > > >> > > > I don't think any extra copying is needed. >> > > > >> > > >> > > I am talking about access to shared memory instead of the process >> > > local memory. I understand that an extra copy won't be required. >> > > >> > > > The reader can directly >> > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run >> > > > CopyReadLineText() inner loop directly off of the buffer in shared memory. >> > > > >> > > >> > > I am slightly confused here. AFAIU, the for(;;) loop in >> > > CopyReadLineText is about finding the line endings which we thought >> > > that the reader process will do. >> > >> > Indeed, I somehow misread the code while scanning over it. So CopyReadLineText >> > currently copies data from cstate->raw_buf to the StringInfo in >> > cstate->line_buf. In parallel mode it would copy it from the shared data buffer >> > to local line_buf until it hits the line end found by the data reader. The >> > amount of copying done is still exactly the same as it is now. >> > >> >> Yeah, on a broader level it will be something like that, but actual >> details might vary during implementation. BTW, have you given any >> thoughts on one other approach I have shared above [1]? We might not >> go with that idea, but it is better to discuss different ideas and >> evaluate their pros and cons. >> >> [1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com > >It seems to be that at least for the general CSV case the tokenization to >tuples is an inherently serial task. Adding thread synchronization to that path >for coordinating between multiple workers is only going to make it slower. It >may be possible to enforce limitations on the input (e.g. no quotes allowed) or >do some speculative tokenization (e.g. if we encounter quote before newline >assume the chunk started in a quoted section) to make it possible to do the >tokenization in parallel. But given that the simpler and more featured approach >of handling it in a single reader process looks to be fast enough, I don't see >the point. I rather think that the next big step would be to overlap reading >input and tokenization, hopefully by utilizing Andres's work on asyncio. > I generally agree with the impression that parsing CSV is tricky and unlikely to benefit from parallelism in general. There may be cases with restrictions making it easier (e.g. 
restrictions on the format) but that might be a bit too complex to start with.

For example, I had an idea to parallelise the parsing by splitting it into two phases:

1) indexing

Split the CSV file into equally-sized chunks, and make each worker just scan through its chunk and store positions of delimiters, quotes, newlines etc. This is probably the most expensive part of the parsing (essentially going char by char), and we'd speed it up linearly.

2) merge

Combine the information from (1) in a single process, and actually parse the CSV data - we would not have to inspect each character, because we'd know the positions of the interesting chars, so this should be fast. We might have to recheck some stuff (e.g. escaping) but it should still be much faster.

But yes, this may be a bit complex and I'm not sure it's worth it.

The one piece of information I'm missing here is at least a very rough quantification of the individual steps of CSV processing - for example if parsing takes only 10% of the time, it's pretty pointless to start by parallelising this part and we should focus on the rest. If it's 50% it might be a different story. Has anyone done any measurements?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
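As a rough editorial illustration of phase (1), here is a minimal C sketch. It assumes caller-allocated position arrays (each large enough for the chunk) and deliberately ignores the quoting rules, which phase (2) would have to resolve; all names are invented:

#include <stddef.h>

typedef struct CharIndex
{
    size_t *quotes;     /* positions of '"' */
    size_t *delims;     /* positions of ',' */
    size_t *newlines;   /* positions of '\n' */
    size_t  nquotes, ndelims, nnewlines;
} CharIndex;

/* Scan one equally-sized chunk starting at file offset 'off' and record
 * absolute positions of the characters the merge phase cares about. */
static void
index_chunk(const char *buf, size_t off, size_t len, CharIndex *idx)
{
    idx->nquotes = idx->ndelims = idx->nnewlines = 0;
    for (size_t i = 0; i < len; i++)
    {
        switch (buf[i])
        {
            case '"':  idx->quotes[idx->nquotes++] = off + i;     break;
            case ',':  idx->delims[idx->ndelims++] = off + i;     break;
            case '\n': idx->newlines[idx->nnewlines++] = off + i; break;
        }
    }
}

The per-chunk scans are independent, so they parallelise linearly; only the walk over the recorded positions stays serial.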
On Wed, Feb 19, 2020 at 4:08 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > The one piece of information I'm missing here is at least a very rough > quantification of the individual steps of CSV processing - for example > if parsing takes only 10% of the time, it's pretty pointless to start by > parallelising this part and we should focus on the rest. If it's 50% it > might be a different story. > Right, this is important information to know. > Has anyone done any measurements? > Not yet, but planning to work on it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote: > This work is to parallelize the copy command and in particular "Copy > <table_name> from 'filename' Where <condition>;" command. Apropos of the initial parsing issue generally, there's an interesting approach taken here: https://github.com/robertdavidgraham/wc2 Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Feb 20, 2020 at 5:12 AM David Fetter <david@fetter.org> wrote: > > On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote: > > This work is to parallelize the copy command and in particular "Copy > > <table_name> from 'filename' Where <condition>;" command. > > Apropos of the initial parsing issue generally, there's an interesting > approach taken here: https://github.com/robertdavidgraham/wc2 > Thanks for sharing. I might be missing something, but I can't figure out how this can help here. Does this in some way help to allow multiple workers to read and tokenize the chunks? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Feb 20, 2020 at 04:11:39PM +0530, Amit Kapila wrote:
>On Thu, Feb 20, 2020 at 5:12 AM David Fetter <david@fetter.org> wrote:
>>
>> On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
>> > This work is to parallelize the copy command and in particular "Copy
>> > <table_name> from 'filename' Where <condition>;" command.
>>
>> Apropos of the initial parsing issue generally, there's an interesting
>> approach taken here: https://github.com/robertdavidgraham/wc2
>>
>
>Thanks for sharing. I might be missing something, but I can't figure
>out how this can help here. Does this in some way help to allow
>multiple workers to read and tokenize the chunks?
>

I think wc2 is showing that maybe instead of parallelizing the parsing, we might instead try using a different tokenizer/parser and make the implementation more efficient instead of just throwing more CPUs on it.

I don't know if our code is similar to what wc does; maybe parsing CSV is more complicated than what wc does.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
> On Thu, Feb 20, 2020 at 04:11:39PM +0530, Amit Kapila wrote:
> > On Thu, Feb 20, 2020 at 5:12 AM David Fetter <david@fetter.org> wrote:
> > >
> > > On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> > > > This work is to parallelize the copy command and in particular "Copy
> > > > <table_name> from 'filename' Where <condition>;" command.
> > >
> > > Apropos of the initial parsing issue generally, there's an interesting
> > > approach taken here: https://github.com/robertdavidgraham/wc2
> > >
> >
> > Thanks for sharing. I might be missing something, but I can't figure
> > out how this can help here. Does this in some way help to allow
> > multiple workers to read and tokenize the chunks?
>
> I think wc2 is showing that maybe instead of parallelizing the
> parsing, we might instead try using a different tokenizer/parser and
> make the implementation more efficient instead of just throwing more
> CPUs on it.

That was what I had in mind.

> I don't know if our code is similar to what wc does; maybe parsing
> CSV is more complicated than what wc does.

CSV parsing differs from wc in that there are more states in the state machine, but I don't see anything fundamentally different.

Best,
David.

--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
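For readers unfamiliar with the state machine being compared here, a stripped-down editorial sketch of the CSV record splitter might look like the following (the real COPY code has more states, e.g. for escapes); note how each transition depends on the previous state:

#include <stddef.h>

typedef enum { OUTSIDE_QUOTES, INSIDE_QUOTES } CsvState;

/* Count records in buf: a newline only ends a record outside quotes.
 * A doubled quote ("") toggles the state twice, so a plain toggle is
 * enough for record counting. */
static size_t
count_csv_records(const char *buf, size_t len)
{
    CsvState state = OUTSIDE_QUOTES;
    size_t   records = 0;

    for (size_t i = 0; i < len; i++)
    {
        char c = buf[i];

        if (c == '"')
            state = (state == OUTSIDE_QUOTES) ? INSIDE_QUOTES : OUTSIDE_QUOTES;
        else if (c == '\n' && state == OUTSIDE_QUOTES)
            records++;
    }
    return records;
}

The dependency chain Ants describes in the next message is exactly the serial update of 'state' above: byte i+1 cannot be classified until byte i's transition is known.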
On Thu, 20 Feb 2020 at 18:43, David Fetter <david@fetter.org> wrote:
> On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
> > I think the wc2 is showing that maybe instead of parallelizing the
> > parsing, we might instead try using a different tokenizer/parser and
> > make the implementation more efficient instead of just throwing more
> > CPUs on it.
>
> That was what I had in mind.
>
> > I don't know if our code is similar to what wc does; maybe parsing
> > CSV is more complicated than what wc does.
>
> CSV parsing differs from wc in that there are more states in the state
> machine, but I don't see anything fundamentally different.

The trouble with a state machine based approach is that the state transitions form a dependency chain, which means that at best the processing rate will be 4-5 cycles per byte (L1 latency to fetch the next state).

I whipped together a quick prototype that uses SIMD and bitmap manipulations to do the equivalent of CopyReadLineText() in csv mode including quotes and escape handling; this runs at 0.25-0.5 cycles per byte.

Regards,
Ants Aasma
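Ants's actual prototype was attached to his message; the following is only an editorial illustration of the general SIMD-plus-bitmap idea, not his code. It uses SSE2 intrinsics plus a GCC/Clang builtin, and simplified quote handling: 16 bytes are compared against '\n' and '"' at once, yielding bitmasks, and the stateful loop then visits only the set bits instead of every byte:

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

static size_t
count_records_simd(const char *buf, size_t len)
{
    const __m128i nl = _mm_set1_epi8('\n');
    const __m128i qt = _mm_set1_epi8('"');
    bool   in_quote = false;
    size_t records = 0;
    size_t i = 0;

    for (; i + 16 <= len; i += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i *) (buf + i));
        uint32_t nl_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, nl));
        uint32_t qt_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, qt));
        uint32_t interesting = nl_mask | qt_mask;

        /* Visit set bits in ascending position order. */
        while (interesting)
        {
            int bit = __builtin_ctz(interesting);

            interesting &= interesting - 1;   /* clear lowest set bit */
            if (qt_mask & (1u << bit))
                in_quote = !in_quote;
            else if (!in_quote)
                records++;
        }
    }
    for (; i < len; i++)    /* scalar tail */
    {
        if (buf[i] == '"')
            in_quote = !in_quote;
        else if (buf[i] == '\n' && !in_quote)
            records++;
    }
    return records;
}

For typical data most 16-byte blocks contain no quotes or newlines at all, so the inner bit loop rarely runs, which is where the cycles-per-byte win comes from.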
On Fri, Feb 21, 2020 at 02:54:31PM +0200, Ants Aasma wrote: >On Thu, 20 Feb 2020 at 18:43, David Fetter <david@fetter.org> wrote:> >> On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote: >> > I think the wc2 is showing that maybe instead of parallelizing the >> > parsing, we might instead try using a different tokenizer/parser and >> > make the implementation more efficient instead of just throwing more >> > CPUs on it. >> >> That was what I had in mind. >> >> > I don't know if our code is similar to what wc does, maytbe parsing >> > csv is more complicated than what wc does. >> >> CSV parsing differs from wc in that there are more states in the state >> machine, but I don't see anything fundamentally different. > >The trouble with a state machine based approach is that the state >transitions form a dependency chain, which means that at best the >processing rate will be 4-5 cycles per byte (L1 latency to fetch the >next state). > >I whipped together a quick prototype that uses SIMD and bitmap >manipulations to do the equivalent of CopyReadLineText() in csv mode >including quotes and escape handling, this runs at 0.25-0.5 cycles per >byte. > Interesting. How does that compare to what we currently have? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 18, 2020 at 6:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I am talking about access to shared memory instead of the process > local memory. I understand that an extra copy won't be required. You make it sound like there is some performance penalty for accessing shared memory, but I don't think that's true. It's true that *contended* access to shared memory can be slower, because if multiple processes are trying to access the same memory, and especially if multiple processes are trying to write the same memory, then the cache lines have to be shared and that has a cost. However, I don't think that would create any noticeable effect in this case. First, there's presumably only one writer process. Second, you wouldn't normally have multiple readers working on the same part of the data at the same time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-02-19 11:38:45 +0100, Tomas Vondra wrote: > I generally agree with the impression that parsing CSV is tricky and > unlikely to benefit from parallelism in general. There may be cases with > restrictions making it easier (e.g. restrictions on the format) but that > might be a bit too complex to start with. > > For example, I had an idea to parallelise the planning by splitting it > into two phases: FWIW, I think we ought to rewrite our COPY parsers before we go for complex schemes. They're way slower than a decent green-field CSV/... parser. > The one piece of information I'm missing here is at least a very rough > quantification of the individual steps of CSV processing - for example > if parsing takes only 10% of the time, it's pretty pointless to start by > parallelising this part and we should focus on the rest. If it's 50% it > might be a different story. Has anyone done any measurements? Not recently, but I'm pretty sure that I've observed CSV parsing to be way more than 10%. Greetings, Andres Freund
On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote: >Hi, > >On 2020-02-19 11:38:45 +0100, Tomas Vondra wrote: >> I generally agree with the impression that parsing CSV is tricky and >> unlikely to benefit from parallelism in general. There may be cases with >> restrictions making it easier (e.g. restrictions on the format) but that >> might be a bit too complex to start with. >> >> For example, I had an idea to parallelise the planning by splitting it >> into two phases: > >FWIW, I think we ought to rewrite our COPY parsers before we go for >complex schemes. They're way slower than a decent green-field >CSV/... parser. > Yep, that's quite possible. > >> The one piece of information I'm missing here is at least a very rough >> quantification of the individual steps of CSV processing - for example >> if parsing takes only 10% of the time, it's pretty pointless to start by >> parallelising this part and we should focus on the rest. If it's 50% it >> might be a different story. Has anyone done any measurements? > >Not recently, but I'm pretty sure that I've observed CSV parsing to be >way more than 10%. > Perhaps. I guess it'll depend on the CSV file (number of fields, ...), so I still think we need to do some measurements first. I'm willing to do that, but (a) I doubt I'll have time for that until after 2020-03, and (b) it'd be good to agree on some set of typical CSV files. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> >Hi,
> >
> >> The one piece of information I'm missing here is at least a very rough
> >> quantification of the individual steps of CSV processing - for example
> >> if parsing takes only 10% of the time, it's pretty pointless to start by
> >> parallelising this part and we should focus on the rest. If it's 50% it
> >> might be a different story. Has anyone done any measurements?
> >
> >Not recently, but I'm pretty sure that I've observed CSV parsing to be
> >way more than 10%.
> >
>
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first.
>

Agreed.

> I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.
>

Right, I don't know what is the best way to define that. I can think of the below tests.

1. A table with 10 columns (with datatypes as integers, date, text). It has one index (unique/primary). Load with 1 million rows (basically the data should be probably 5-10 GB).
2. A table with 10 columns (with datatypes as integers, date, text). It has three indexes, one index can be (unique/primary). Load with 1 million rows (basically the data should be probably 5-10 GB).
3. A table with 10 columns (with datatypes as integers, date, text). It has three indexes, one index can be (unique/primary). It has before and after triggers. Load with 1 million rows (basically the data should be probably 5-10 GB).
4. A table with 10 columns (with datatypes as integers, date, text). It has five or six indexes, one index can be (unique/primary). Load with 1 million rows (basically the data should be probably 5-10 GB).

Among all these tests, we can check how much time we spend in reading and parsing the csv files vs. the rest of the execution.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, 26 Feb 2020 at 10:54, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > ...
> >
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first.
> >
>
> Agreed.
>
> > I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
> >
>
> Right, I don't know what is the best way to define that. I can think
> of the below tests.
>
> 1. A table with 10 columns (with datatypes as integers, date, text).
> It has one index (unique/primary). Load with 1 million rows (basically
> the data should be probably 5-10 GB).
> 2. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). Load with 1
> million rows (basically the data should be probably 5-10 GB).
> 3. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). It has before
> and after triggers. Load with 1 million rows (basically the data
> should be probably 5-10 GB).
> 4. A table with 10 columns (with datatypes as integers, date, text).
> It has five or six indexes, one index can be (unique/primary). Load
> with 1 million rows (basically the data should be probably 5-10 GB).
>
> Among all these tests, we can check how much time we spend in
> reading and parsing the csv files vs. the rest of the execution.

That's a good set of tests of what happens after the parse. Two simpler test runs may provide useful baselines - no constraints/indexes with all columns varchar, and no constraints/indexes with columns correctly typed.

For testing the impact of various parts of the parse process, my idea would be (a generator sketch follows below):

- A base dataset with 10 columns including int, date and text. One text field quoted and containing both delimiters and line terminators
- A derivative to measure just line parsing - strip the quotes around the text field and quote the whole row as one text field
- A derivative to measure the impact of quoted fields - clean up the text field so it doesn't require quoting
- A derivative to measure the impact of row length - run ten rows together to make 100 column rows, but only a tenth as many rows

If that sounds reasonable, I'll try to knock up a generator.

--
Alastair
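A possible starting point for such a generator, sketched editorially in C (not Alastair's actual tool; row count, column mix, and values are placeholders) - it emits the "base dataset" variant with a quoted text field containing both the delimiter and an embedded line terminator:

#include <stdio.h>

int
main(void)
{
    const long nrows = 1000000;     /* placeholder row count */

    printf("id,flag,created,comment\n");
    for (long i = 0; i < nrows; i++)
    {
        /* Quoted field with a comma and a newline inside, per CSV rules. */
        printf("%ld,%d,2020-02-%02ld,\"note %ld, line one\nline two\"\n",
               i, (int) (i % 2), (i % 28) + 1, i);
    }
    return 0;
}

The derivative datasets then fall out mechanically: re-quote the whole row for the line-parsing variant, drop the embedded specials for the unquoted variant, and concatenate ten rows for the wide-row variant.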
On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first. I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.

I agree that getting a nice varied dataset would be nice. Including things like narrow integer only tables, strings with newlines and escapes in them, extremely wide rows.

I tried to capture a quick profile just to see what it looks like. Grabbed a random open data set from the web, about 800MB of narrow rows CSV [1].

Script:
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Profile:
# Samples: 59K of event 'cycles:u'
# Event count (approx.): 57644269486
#
# Overhead  Command   Shared Object  Symbol
# ........  ........  .............  .......................................
#
    18.24%  postgres  postgres       [.] CopyReadLine
     9.23%  postgres  postgres       [.] NextCopyFrom
     8.87%  postgres  postgres       [.] NextCopyFromRawFields
     5.82%  postgres  postgres       [.] pg_verify_mbstr_len
     5.45%  postgres  postgres       [.] pg_strtoint32
     4.16%  postgres  postgres       [.] heap_fill_tuple
     4.03%  postgres  postgres       [.] heap_compute_data_size
     3.83%  postgres  postgres       [.] CopyFrom
     3.78%  postgres  postgres       [.] AllocSetAlloc
     3.53%  postgres  postgres       [.] heap_form_tuple
     2.96%  postgres  postgres       [.] InputFunctionCall
     2.89%  postgres  libc-2.30.so   [.] __memmove_avx_unaligned_erms
     1.82%  postgres  libc-2.30.so   [.] __strlen_avx2
     1.72%  postgres  postgres       [.] AllocSetReset
     1.72%  postgres  postgres       [.] RelationPutHeapTuple
     1.47%  postgres  postgres       [.] heap_prepare_insert
     1.31%  postgres  postgres       [.] heap_multi_insert
     1.25%  postgres  postgres       [.] textin
     1.24%  postgres  postgres       [.] int4in
     1.05%  postgres  postgres       [.] tts_buffer_heap_clear
     0.85%  postgres  postgres       [.] pg_any_to_server
     0.80%  postgres  postgres       [.] pg_comp_crc32c_sse42
     0.77%  postgres  postgres       [.] cstring_to_text_with_len
     0.69%  postgres  postgres       [.] AllocSetFree
     0.60%  postgres  postgres       [.] appendBinaryStringInfo
     0.55%  postgres  postgres       [.] tts_buffer_heap_materialize.part.0
     0.54%  postgres  postgres       [.] palloc
     0.54%  postgres  libc-2.30.so   [.] __memmove_avx_unaligned
     0.51%  postgres  postgres       [.] palloc0
     0.51%  postgres  postgres       [.] pg_encoding_max_length
     0.48%  postgres  postgres       [.] enlargeStringInfo
     0.47%  postgres  postgres       [.] ExecStoreVirtualTuple
     0.45%  postgres  postgres       [.] PageAddItemExtended

So that confirms that the parsing is a huge chunk of overhead with current splitting into lines being the largest portion. Amdahl's law says that splitting into tuples needs to be made fast before parallelizing makes any sense.

Regards,
Ants Aasma

[1] https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip
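As a rough editorial back-of-the-envelope on the profile above (an illustration of the Amdahl point, not a figure from the thread): if line splitting (CopyReadLine, ~18%) stays serial and everything else parallelizes perfectly across N workers, the speedup is capped at

    S(N) = 1 / (0.18 + 0.82/N),   so S(infinity) = 1/0.18, roughly 5.5x.

Treating the wider tokenizing path as serial too (CopyReadLine + NextCopyFromRawFields, roughly 27%) drops the ceiling to about 1/0.27, roughly 3.7x, regardless of worker count.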
On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be nice. Including
> things like narrow integer only tables, strings with newlines and
> escapes in them, extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like.
> Grabbed a random open data set from the web, about 800MB of narrow
> rows CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead  Command   Shared Object  Symbol
> # ........  ........  .............  ..................................
> #
>     18.24%  postgres  postgres       [.] CopyReadLine
>      9.23%  postgres  postgres       [.] NextCopyFrom
>      8.87%  postgres  postgres       [.] NextCopyFromRawFields
>      5.82%  postgres  postgres       [.] pg_verify_mbstr_len
>      5.45%  postgres  postgres       [.] pg_strtoint32
>      4.16%  postgres  postgres       [.] heap_fill_tuple
>      4.03%  postgres  postgres       [.] heap_compute_data_size
>      3.83%  postgres  postgres       [.] CopyFrom
>      3.78%  postgres  postgres       [.] AllocSetAlloc
>      3.53%  postgres  postgres       [.] heap_form_tuple
>      2.96%  postgres  postgres       [.] InputFunctionCall
>      2.89%  postgres  libc-2.30.so   [.] __memmove_avx_unaligned_erms
>      1.82%  postgres  libc-2.30.so   [.] __strlen_avx2
>      1.72%  postgres  postgres       [.] AllocSetReset
>      1.72%  postgres  postgres       [.] RelationPutHeapTuple
>      1.47%  postgres  postgres       [.] heap_prepare_insert
>      1.31%  postgres  postgres       [.] heap_multi_insert
>      1.25%  postgres  postgres       [.] textin
>      1.24%  postgres  postgres       [.] int4in
>      1.05%  postgres  postgres       [.] tts_buffer_heap_clear
>      0.85%  postgres  postgres       [.] pg_any_to_server
>      0.80%  postgres  postgres       [.] pg_comp_crc32c_sse42
>      0.77%  postgres  postgres       [.] cstring_to_text_with_len
>      0.69%  postgres  postgres       [.] AllocSetFree
>      0.60%  postgres  postgres       [.] appendBinaryStringInfo
>      0.55%  postgres  postgres       [.] tts_buffer_heap_materialize.part.0
>      0.54%  postgres  postgres       [.] palloc
>      0.54%  postgres  libc-2.30.so   [.] __memmove_avx_unaligned
>      0.51%  postgres  postgres       [.] palloc0
>      0.51%  postgres  postgres       [.] pg_encoding_max_length
>      0.48%  postgres  postgres       [.] enlargeStringInfo
>      0.47%  postgres  postgres       [.] ExecStoreVirtualTuple
>      0.45%  postgres  postgres       [.] PageAddItemExtended
>
> So that confirms that the parsing is a huge chunk of overhead with
> current splitting into lines being the largest portion. Amdahl's law
> says that splitting into tuples needs to be made fast before
> parallelizing makes any sense.
>

I ran a very simple case on a table with 2 indexes and I can see that a lot of time is spent in index insertion. I agree that there is a good amount of time spent in tokenizing, but it is not very huge compared to index insertion. I have expanded the time spent in the CopyFrom function from my perf report and pasted it here. We can see that a lot of time is spent in ExecInsertIndexTuples (77%). I agree that we need to further evaluate how much of that is I/O vs. CPU operations.

But, the point I want to make is that it's not true for all the cases that parsing takes the maximum amount of time.

- 99.50% CopyFrom
   - 82.90% CopyMultiInsertInfoFlush
      - 82.85% CopyMultiInsertBufferFlush
         + 77.68% ExecInsertIndexTuples
         + 3.74% table_multi_insert
         + 0.89% ExecClearTuple
   - 12.54% NextCopyFrom
      - 7.70% NextCopyFromRawFields
         - 5.72% CopyReadLine
              3.96% CopyReadLineText
            + 1.49% pg_any_to_server
           1.86% CopyReadAttributesCSV
      + 3.68% InputFunctionCall
   + 2.11% ExecMaterializeSlot
   + 0.94% MemoryContextReset

My test:

-- Prepare:
CREATE TABLE t (a int, b int, c varchar);
insert into t select i,i, 'aaaaaaaaaaaaaaaaaaaaaaaa' from generate_series(1,10000000) as i;
copy t to '/home/dilipkumar/a.csv' WITH (FORMAT 'csv', HEADER true);
truncate table t;
create index idx on t(a);
create index idx1 on t(c);

-- Test CopyFrom and measure with perf:
copy t from '/home/dilipkumar/a.csv' WITH (FORMAT 'csv', HEADER true);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 26, 2020 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> > >Hi,
> > >
> > >> The one piece of information I'm missing here is at least a very rough
> > >> quantification of the individual steps of CSV processing - for example
> > >> if parsing takes only 10% of the time, it's pretty pointless to start by
> > >> parallelising this part and we should focus on the rest. If it's 50% it
> > >> might be a different story. Has anyone done any measurements?
> > >
> > >Not recently, but I'm pretty sure that I've observed CSV parsing to be
> > >way more than 10%.
> > >
> >
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first.
> >
>
> Agreed.
>
> > I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
> >
>
> Right, I don't know what is the best way to define that. I can think
> of the below tests.
>
> 1. A table with 10 columns (with datatypes as integers, date, text).
> It has one index (unique/primary). Load with 1 million rows (basically
> the data should be probably 5-10 GB).
> 2. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). Load with 1
> million rows (basically the data should be probably 5-10 GB).
> 3. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). It has before
> and after trigeers. Load with 1 million rows (basically the data
> should be probably 5-10 GB).
> 4. A table with 10 columns (with datatypes as integers, date, text).
> It has five or six indexes, one index can be (unique/primary). Load
> with 1 million rows (basically the data should be probably 5-10 GB).
>
I have tried to capture the execution time taken for 3 scenarios which I felt could give a fair idea:
Test1 (Table with 3 indexes and 1 trigger)
Test2 (Table with 2 indexes)
Test3 (Table without indexes/triggers)
I have captured the following details:
File Read Time - time taken to read the file from the CopyGetData function.
Read Line Time - time taken to read a line from the NextCopyFrom function (read time & tokenise time excluded).
Tokenize Time - time taken to tokenize the contents from the NextCopyFromRawFields function.
Data Execution Time - the remaining execution time out of the total time.
The execution breakdown for the tests are given below:
Test / Time (in seconds) | Total Time | File Read Time | Read Line / Buffer Read Time | Tokenize Time | Data Execution Time |
Test1 | 1693.369 | 0.256 | 34.173 | 5.578 | 1653.362 |
Test2 | 736.096 | 0.288 | 39.762 | 6.525 | 689.521 |
Test3 | 112.06 | 0.266 | 39.189 | 6.433 | 66.172 |
Steps for the scenarios:
Test1(Table with 3 indexes and 1 trigger):
CREATE TABLE census2 (year int,age int,ethnic int,sex int,area text,count text);
CREATE TABLE census3(year int,age int,ethnic int,sex int,area text,count text);
CREATE INDEX idx1_census2 on census2(year);
CREATE INDEX idx2_census2 on census2(age);
CREATE INDEX idx3_census2 on census2(ethnic);
CREATE or REPLACE FUNCTION census2_afterinsert()
RETURNS TRIGGER
AS $$
BEGIN
INSERT INTO census3 SELECT * FROM census2 limit 1;
RETURN NEW;
END;
$$
LANGUAGE plpgsql;
CREATE TRIGGER census2_trigger AFTER INSERT ON census2 FOR EACH ROW EXECUTE PROCEDURE census2_afterinsert();
COPY census2 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
Test2 (Table with 2 indexes):
CREATE TABLE census1 (year int,age int,ethnic int,sex int,area text,count text);
CREATE INDEX idx1_census1 on census1(year);
CREATE INDEX idx2_census1 on census1(age);
COPY census1 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
Test3 (Table without indexes/triggers):
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
Note: The Data8277.csv used was the same data that Ants Aasma had used.
From the above results we can infer that Read line will have to be done sequentially; Read line time takes about 2.01%, 5.40% and 34.97% of the total time in the three tests. I felt we will be able to parallelise the remaining phases of the copy. The performance improvement will vary based on the scenario (indexes/triggers); it will be proportionate to the number of indexes and triggers. Read line can also be parallelised in the text (non-csv) format. I feel parallelising copy could give a significant improvement in quite a few scenarios.
Further, I'm planning to see how the execution will be for a toast table. I'm also planning to do a test on a RAM disk, where I will configure the data on the RAM disk so that we can further eliminate the I/O cost.
Thoughts?
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be nice. Including
> things like narrow integer only tables, strings with newlines and
> escapes in them, extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like.
> Grabbed a random open data set from the web, about 800MB of narrow
> rows CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead Command Shared Object Symbol
> # ........ ........ ..................
> .......................................
> #
> 18.24% postgres postgres [.] CopyReadLine
> 9.23% postgres postgres [.] NextCopyFrom
> 8.87% postgres postgres [.] NextCopyFromRawFields
> 5.82% postgres postgres [.] pg_verify_mbstr_len
> 5.45% postgres postgres [.] pg_strtoint32
> 4.16% postgres postgres [.] heap_fill_tuple
> 4.03% postgres postgres [.] heap_compute_data_size
> 3.83% postgres postgres [.] CopyFrom
> 3.78% postgres postgres [.] AllocSetAlloc
> 3.53% postgres postgres [.] heap_form_tuple
> 2.96% postgres postgres [.] InputFunctionCall
> 2.89% postgres libc-2.30.so [.] __memmove_avx_unaligned_erms
> 1.82% postgres libc-2.30.so [.] __strlen_avx2
> 1.72% postgres postgres [.] AllocSetReset
> 1.72% postgres postgres [.] RelationPutHeapTuple
> 1.47% postgres postgres [.] heap_prepare_insert
> 1.31% postgres postgres [.] heap_multi_insert
> 1.25% postgres postgres [.] textin
> 1.24% postgres postgres [.] int4in
> 1.05% postgres postgres [.] tts_buffer_heap_clear
> 0.85% postgres postgres [.] pg_any_to_server
> 0.80% postgres postgres [.] pg_comp_crc32c_sse42
> 0.77% postgres postgres [.] cstring_to_text_with_len
> 0.69% postgres postgres [.] AllocSetFree
> 0.60% postgres postgres [.] appendBinaryStringInfo
> 0.55% postgres postgres [.] tts_buffer_heap_materialize.part.0
> 0.54% postgres postgres [.] palloc
> 0.54% postgres libc-2.30.so [.] __memmove_avx_unaligned
> 0.51% postgres postgres [.] palloc0
> 0.51% postgres postgres [.] pg_encoding_max_length
> 0.48% postgres postgres [.] enlargeStringInfo
> 0.47% postgres postgres [.] ExecStoreVirtualTuple
> 0.45% postgres postgres [.] PageAddItemExtended
>
> So that confirms that the parsing is a huge chunk of overhead with
> current splitting into lines being the largest portion. Amdahl's law
> says that splitting into tuples needs to be made fast before
> parallelizing makes any sense.
>
I had taken perf report with the same test data that you had used, I was getting the following results:
.....
+ 99.61% 0.00% postgres postgres [.] PortalRun
+ 99.61% 0.00% postgres postgres [.] PortalRunMulti
+ 99.61% 0.00% postgres postgres [.] PortalRunUtility
+ 99.61% 0.00% postgres postgres [.] ProcessUtility
+ 99.61% 0.00% postgres postgres [.] standard_ProcessUtility
+ 99.61% 0.00% postgres postgres [.] DoCopy
+ 99.30% 0.94% postgres postgres [.] CopyFrom
+ 51.61% 7.76% postgres postgres [.] NextCopyFrom
+ 23.66% 0.01% postgres postgres [.] CopyMultiInsertInfoFlush
+ 23.61% 0.28% postgres postgres [.] CopyMultiInsertBufferFlush
+ 21.99% 1.02% postgres postgres [.] NextCopyFromRawFields
+ 19.79% 0.01% postgres postgres [.] table_multi_insert
+ 19.32% 3.00% postgres postgres [.] heap_multi_insert
+ 18.27% 2.44% postgres postgres [.] InputFunctionCall
+ 15.19% 0.89% postgres postgres [.] CopyReadLine
+ 13.05% 0.18% postgres postgres [.] ExecMaterializeSlot
+ 13.00% 0.55% postgres postgres [.] tts_buffer_heap_materialize
+ 12.31% 1.77% postgres postgres [.] heap_form_tuple
+ 10.43% 0.45% postgres postgres [.] int4in
+ 10.18% 8.92% postgres postgres [.] CopyReadLineText
......
In my results I observed that table_multi_insert execution was nearly 20%. Also, I felt that once we have made a few tuples from CopyReadLine, the parallel workers should be able to start consuming and processing that data. We need not wait for the complete tokenisation to be finished. Once a few tuples are tokenised, parallel workers should start consuming the data in parallel while tokenisation continues simultaneously. In this way, once the copy is done in parallel, the total execution time should be the CopyReadLine time + a delta of processing time.
Thoughts?
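As an editorial sketch of that overlap - plain C11 atomics standing in for the shared-memory machinery PostgreSQL would actually use, with all names invented - the leader publishes a count of tokenised lines as it goes and workers claim lines as soon as they appear, instead of waiting for tokenisation of the whole input:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct LineFeed
{
    _Atomic long produced;   /* lines tokenised so far (leader bumps this) */
    _Atomic long claimed;    /* next line index to hand out */
    _Atomic bool done;       /* leader sets this when input is exhausted */
} LineFeed;

/* Leader side: call after each line's boundaries are known. */
static void
publish_line(LineFeed *feed)
{
    atomic_fetch_add(&feed->produced, 1);
}

/* Worker side: returns a line index to process, or -1 when all are done. */
static long
claim_line(LineFeed *feed)
{
    for (;;)
    {
        long next = atomic_load(&feed->claimed);

        if (next < atomic_load(&feed->produced))
        {
            if (atomic_compare_exchange_weak(&feed->claimed, &next, next + 1))
                return next;
        }
        else if (atomic_load(&feed->done))
            return -1;
        /* else: spin or sleep until the leader produces more */
    }
}

With this shape the tail cost is only the last few lines' processing, matching the "CopyReadLine time + delta" estimate above.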
I have got the execution breakdown for a few scenarios with a normal disk and a RAM disk.
Execution breakdown on normal disk:
Test / Time (in seconds) | Total Time | File Read Time | CopyReadLine Time | Remaining Execution Time | Read line percentage |
Test1(3 index + 1 trigger) | 2099.017 | 0.311 | 10.256 | 2088.45 | 0.4886096682 |
Test2(2 index) | 657.994 | 0.303 | 10.171 | 647.52 | 1.545758776 |
Test3(no index, no trigger) | 112.41 | 0.296 | 10.996 | 101.118 | 9.782047861 |
Test4(toast) | 360.028 | 1.43 | 46.556 | 312.042 | 12.93121646 |
Execution breakdown on RAM disk:
Test / Time (in seconds) | Total Time | File Read Time | CopyReadLine Time | Remaining Execution Time | Read line percentage |
Test1(3 index + 1 trigger) | 1571.558 | 0.259 | 6.986 | 1564.313 | 0.4445270235 |
Test2(2 index) | 369.942 | 0.263 | 6.848 | 362.831 | 1.851100983 |
Test3(no index, no trigger) | 54.077 | 0.239 | 6.805 | 47.033 | 12.58390813 |
Test4(toast) | 96.323 | 0.918 | 26.603 | 68.802 | 27.61853348 |
Steps for the scenarios:
Test1(Table with 3 indexes and 1 trigger):
CREATE TABLE census2 (year int,age int,ethnic int,sex int,area text,count text);
CREATE TABLE census3(year int,age int,ethnic int,sex int,area text,count text);
CREATE INDEX idx1_census2 on census2(year);
CREATE INDEX idx2_census2 on census2(age);
CREATE INDEX idx3_census2 on census2(ethnic);
CREATE or REPLACE FUNCTION census2_afterinsert()
RETURNS TRIGGER
AS $$
BEGIN
INSERT INTO census3 SELECT * FROM census2 limit 1;
RETURN NEW;
END;
$$
LANGUAGE plpgsql;
CREATE TRIGGER census2_trigger AFTER INSERT ON census2 FOR EACH ROW EXECUTE PROCEDURE census2_afterinsert();
COPY census2 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
Test2 (Table with 2 indexes):
CREATE TABLE census1 (year int,age int,ethnic int,sex int,area text,count text);
CREATE INDEX idx1_census1 on census1(year);
CREATE INDEX idx2_census1 on census1(age);
COPY census1 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
Test3 (Table without indexes/triggers):
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
The random open data set from the web, about 800MB of narrow rows CSV [1], was used in the above tests - the same one that Ants Aasma had used.
Test4 (Toast table):
CREATE TABLE indtoasttest(descr text, cnt int DEFAULT 0, f1 text, f2 text);
alter table indtoasttest alter column f1 set storage external;
alter table indtoasttest alter column f2 set storage external;
inserted 262144 records
copy indtoasttest to '/mnt/magnetic/vignesh.c/postgres/toast_data3.csv' WITH (FORMAT 'csv', HEADER true);
CREATE TABLE indtoasttest1(descr text, cnt int DEFAULT 0, f1 text, f2 text);
alter table indtoasttest1 alter column f1 set storage external;
alter table indtoasttest1 alter column f2 set storage external;
copy indtoasttest1 from '/mnt/magnetic/vignesh.c/postgres/toast_data3.csv' WITH (FORMAT 'csv', HEADER true);
Attached is the patch, for reference, which was used to capture the execution time breakdown.
Thoughts?
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 3, 2020 at 11:44 AM vignesh C <vignesh21@gmail.com> wrote:
On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be nice. Including
> things like narrow integer only tables, strings with newlines and
> escapes in them, extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like.
> Grabbed a random open data set from the web, about 800MB of narrow
> rows CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead Command Shared Object Symbol
> # ........ ........ ..................
> .......................................
> #
> 18.24% postgres postgres [.] CopyReadLine
> 9.23% postgres postgres [.] NextCopyFrom
> 8.87% postgres postgres [.] NextCopyFromRawFields
> 5.82% postgres postgres [.] pg_verify_mbstr_len
> 5.45% postgres postgres [.] pg_strtoint32
> 4.16% postgres postgres [.] heap_fill_tuple
> 4.03% postgres postgres [.] heap_compute_data_size
> 3.83% postgres postgres [.] CopyFrom
> 3.78% postgres postgres [.] AllocSetAlloc
> 3.53% postgres postgres [.] heap_form_tuple
> 2.96% postgres postgres [.] InputFunctionCall
> 2.89% postgres libc-2.30.so [.] __memmove_avx_unaligned_erms
> 1.82% postgres libc-2.30.so [.] __strlen_avx2
> 1.72% postgres postgres [.] AllocSetReset
> 1.72% postgres postgres [.] RelationPutHeapTuple
> 1.47% postgres postgres [.] heap_prepare_insert
> 1.31% postgres postgres [.] heap_multi_insert
> 1.25% postgres postgres [.] textin
> 1.24% postgres postgres [.] int4in
> 1.05% postgres postgres [.] tts_buffer_heap_clear
> 0.85% postgres postgres [.] pg_any_to_server
> 0.80% postgres postgres [.] pg_comp_crc32c_sse42
> 0.77% postgres postgres [.] cstring_to_text_with_len
> 0.69% postgres postgres [.] AllocSetFree
> 0.60% postgres postgres [.] appendBinaryStringInfo
> 0.55% postgres postgres [.] tts_buffer_heap_materialize.part.0
> 0.54% postgres postgres [.] palloc
> 0.54% postgres libc-2.30.so [.] __memmove_avx_unaligned
> 0.51% postgres postgres [.] palloc0
> 0.51% postgres postgres [.] pg_encoding_max_length
> 0.48% postgres postgres [.] enlargeStringInfo
> 0.47% postgres postgres [.] ExecStoreVirtualTuple
> 0.45% postgres postgres [.] PageAddItemExtended
>
> So that confirms that the parsing is a huge chunk of overhead with
> current splitting into lines being the largest portion. Amdahl's law
> says that splitting into tuples needs to be made fast before
> parallelizing makes any sense.
>
I had taken perf report with the same test data that you had used, I was getting the following results:
.....
+ 99.61% 0.00% postgres postgres [.] PortalRun
+ 99.61% 0.00% postgres postgres [.] PortalRunMulti
+ 99.61% 0.00% postgres postgres [.] PortalRunUtility
+ 99.61% 0.00% postgres postgres [.] ProcessUtility
+ 99.61% 0.00% postgres postgres [.] standard_ProcessUtility
+ 99.61% 0.00% postgres postgres [.] DoCopy
+ 99.30% 0.94% postgres postgres [.] CopyFrom
+ 51.61% 7.76% postgres postgres [.] NextCopyFrom
+ 23.66% 0.01% postgres postgres [.] CopyMultiInsertInfoFlush
+ 23.61% 0.28% postgres postgres [.] CopyMultiInsertBufferFlush
+ 21.99% 1.02% postgres postgres [.] NextCopyFromRawFields
+ 19.79% 0.01% postgres postgres [.] table_multi_insert
+ 19.32% 3.00% postgres postgres [.] heap_multi_insert
+ 18.27% 2.44% postgres postgres [.] InputFunctionCall
+ 15.19% 0.89% postgres postgres [.] CopyReadLine
+ 13.05% 0.18% postgres postgres [.] ExecMaterializeSlot
+ 13.00% 0.55% postgres postgres [.] tts_buffer_heap_materialize
+ 12.31% 1.77% postgres postgres [.] heap_form_tuple
+ 10.43% 0.45% postgres postgres [.] int4in
+ 10.18% 8.92% postgres postgres [.] CopyReadLineText
......

In my results, table_multi_insert accounted for nearly 20% of the execution time. Also, I feel that once CopyReadLine has produced a few tuples, the parallel workers should be able to start consuming and processing that data; we need not wait for the complete tokenisation to be finished. Once a few tuples are tokenised, the parallel workers should start consuming the data in parallel while tokenisation continues simultaneously. In this way, once the copy is done in parallel, the total execution time should be the CopyReadLine time plus a small delta of processing time.

Thoughts?
On Thu, Mar 12, 2020 at 6:39 PM vignesh C <vignesh21@gmail.com> wrote:
>

Existing copy code flow:
Copy supports the copy operation from csv, txt & bin format files. For processing the csv & text formats, the server reads a 64KB chunk, or less if the remaining contents of the input file are smaller than that. The server then reads one tuple of data and processes it. If the tuple that was generated consumed less than the 64KB of data read, the server will try to generate another tuple for processing from the remaining unprocessed data. If it is not able to generate one tuple from the unprocessed data, it will read a further 64KB (or whatever lesser size remains in the file) and send the tuple for processing. This process is repeated till the complete file is processed. For processing a bin format file the flow is slightly different: the server reads the number of columns that are present, then reads each column's size followed by the actual column contents, repeating this for all the columns, and then processes the tuple that is generated. This process is repeated for all the remaining tuples in the bin file. The tuple processing flow is the same in all the formats. Currently all the operations happen sequentially. This project will help in parallelizing the copy operation.

I'm planning to do a POC of parallel copy with the below design.

Proposed syntax:
COPY table_name FROM 'copy_file' WITH (FORMAT 'format', PARALLEL 'workers');

Users can specify the number of workers that must be used for copying the data in parallel. Here 'workers' is the number of workers to be used for the parallel copy operation, apart from the leader. The leader is responsible for reading the data from the input file and generating work for the workers. The leader will start a transaction and share it with the workers; all workers will use the same transaction to insert the records.

The leader will create a circular queue and share it across the workers. The circular queue will be present in DSM. The leader will use a fixed-size queue to share the contents between the leader and the workers; currently we will have 100 elements in the queue. This will be created before the workers are started and shared with the workers. The data structures that are required by the parallel workers will be initialized by the leader, the size required in DSM will be calculated, and the necessary keys will be loaded in the DSM. The specified number of workers will then be launched.

The leader will read the table data from the file and copy the contents to the queue element by element. Each element in the queue will have a 64KB DSA, which will be used to store tuple contents from the file. The leader will try to copy as much content as possible within one 64KB DSA queue element; we intend to store at least one tuple in each queue element. There are some cases where the 64KB space may not be enough to store a single tuple, mostly where the table has toast data present and a single tuple can be more than 64KB in size. In these scenarios we will extend the DSA space accordingly. We cannot change the size of the DSM once the workers are launched, whereas in the case of DSA we can free the DSA pointer and reallocate it based on the memory size required. This is the very reason for choosing DSA over DSM for storing the data that must be inserted into the relation. The leader will keep loading data into the queue till the queue becomes full.

The leader will transform its role into a worker either when the queue is full or when the complete file has been processed. Once the queue is full, the leader will switch its role to become a worker, and it will continue to act as a worker till 25% of the elements in the queue have been consumed by the workers. Once there is at least 25% space available in the queue, the leader-turned-worker will switch its role back to become the leader again. The above process of filling the queue will be continued by the leader until the whole file is processed, and the leader will then wait until the workers finish processing the queue elements.

The copy-from functionality is also used during initdb operations, where the copy is intended to be performed in single (non-parallel) mode, and the user can also still continue running in non-parallel mode. In the non-parallel case, memory allocation will happen using palloc instead of DSM/DSA, and most of the flow will be the same in both the parallel and non-parallel cases.

We had a couple of options for the way in which queue elements can be stored.
Option 1: Each element (DSA chunk) will contain tuples such that each tuple is preceded by the length of the tuple. So the tuples will be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2, tuple-2), ....
Option 2: Each element (DSA chunk) will contain only tuples, (tuple-1), (tuple-2), ....., and we will have a second ring-buffer which contains a start-offset or length of each tuple.
The old design used to generate one tuple of data and process it tuple by tuple. In the new design, the server will generate multiple tuples of data per queue element, and the worker will then process the data tuple by tuple. As we are processing the data tuple by tuple, I felt both of the options are almost the same. However, option 1 was chosen over option 2 as we can save some space that would otherwise be required by another variable in each element of the queue.

The parallel workers will read the tuples from the queue and perform all of the following operations: a) where clause handling, b) converting the tuple to columns, c) adding default/null values for the columns that are not present in that record, d) finding the partition if it is a partitioned table, e) before-row insert triggers and constraints, f) insertion of the data. The rest of the flow is the same as the existing code.

Enhancements after the POC is done: Initially we plan to use the number of workers that the user has specified; later we will do some experiments and think of an approach to choose the number of workers automatically after processing sample contents from the file. Initially we plan to use 100 elements in the queue; later we will experiment to find the right queue size once the basic patch is ready. Initially we plan to start the transaction from the leader and share it across the workers; later we will change this in such a way that the first process that performs an insert operation starts the transaction and shares it with the rest.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
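For concreteness, a minimal C sketch of the queue layout described above might look like the following. All names here (ParallelCopyQueue, ParallelCopyElement, the PC_* constants and pc_queue_full) are hypothetical, invented only to illustrate the design and not taken from any posted patch; dsa_pointer and the pg_atomic_* operations are the real PostgreSQL facilities the proposal refers to.

    #include "postgres.h"
    #include "port/atomics.h"
    #include "utils/dsa.h"

    #define PC_QUEUE_NELEMENTS  100             /* fixed number of queue slots */
    #define PC_ELEMENT_SIZE     (64 * 1024)     /* default 64KB of tuple data */

    typedef struct ParallelCopyElement
    {
        dsa_pointer tuple_data;     /* 64KB DSA chunk; freed and reallocated
                                     * larger if a single tuple exceeds 64KB */
        uint32      data_len;       /* bytes of raw tuple data stored */
        uint32      ntuples;        /* complete tuples packed into this chunk,
                                     * each stored as (length, tuple) per
                                     * option 1 above */
    } ParallelCopyElement;

    typedef struct ParallelCopyQueue
    {
        /* Leader advances 'filled'; workers advance 'consumed'. */
        pg_atomic_uint64    filled;
        pg_atomic_uint64    consumed;
        ParallelCopyElement elements[PC_QUEUE_NELEMENTS];
    } ParallelCopyQueue;

    /* The leader would switch to worker mode when this becomes true. */
    static inline bool
    pc_queue_full(ParallelCopyQueue *queue)
    {
        return pg_atomic_read_u64(&queue->filled) -
               pg_atomic_read_u64(&queue->consumed) >= PC_QUEUE_NELEMENTS;
    }

Under this sketch the leader fills elements[filled % PC_QUEUE_NELEMENTS] and then bumps 'filled', workers do the mirror image on 'consumed', and the 25% threshold for switching back from worker to leader would just be a comparison against (filled - consumed).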
On Tue, 7 Apr 2020 at 08:24, vignesh C <vignesh21@gmail.com> wrote:
> Leader will create a circular queue and share it across the workers. The circular queue will be present in DSM. Leader will be using a fixed size queue to share the contents between the leader and the workers. Currently we will have 100 elements present in the queue. This will be created before the workers are started and shared with the workers. The data structures that are required by the parallel workers will be initialized by the leader, the size required in dsm will be calculated and the necessary keys will be loaded in the DSM. The specified number of workers will then be launched. Leader will read the table data from the file and copy the contents to the queue element by element. Each element in the queue will have 64K size DSA. This DSA will be used to store tuple contents from the file. The leader will try to copy as much content as possible within one 64K DSA queue element. We intend to store at least one tuple in each queue element. There are some cases where the 64K space may not be enough to store a single tuple. Mostly in cases where the table has toast data present and the single tuple can be more than 64K size. In these scenarios we will extend the DSA space accordingly. We cannot change the size of the dsm once the workers are launched. Whereas in case of DSA we can free the dsa pointer and reallocate the dsa pointer based on the memory size required. This is the very reason for choosing DSA over DSM for storing the data that must be inserted into the relation.

I think the element-based approach and the requirement that all tuples fit into the queue make things unnecessarily complex. The approach I detailed earlier allows for tuples to be bigger than the buffer. In that case a worker will claim the long tuple from the ring queue of tuple start positions, and start copying it into its local line_buf. This can wrap around the buffer multiple times until the next start position shows up. At that point this worker can proceed with inserting the tuple and the next worker will claim the next tuple.

This way nothing needs to be resized, there is no risk of a file with huge tuples running the system out of memory because each element would have to be reallocated to be huge, and the number of elements is not something that has to be tuned.

> We had a couple of options for the way in which queue elements can be stored. Option 1: Each element (DSA chunk) will contain tuples such that each tuple will be preceded by the length of the tuple. So the tuples will be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2, tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only tuples (tuple-1), (tuple-2), ..... And we will have a second ring-buffer which contains a start-offset or length of each tuple. The old design used to generate one tuple of data and process tuple by tuple. In the new design, the server will generate multiple tuples of data per queue element. The worker will then process data tuple by tuple. As we are processing the data tuple by tuple, I felt both of the options are almost the same. However Design1 was chosen over Design 2 as we can save up on some space that was required by another variable in each element of the queue.

With option 1 it's not possible to read input data into shared memory, and there needs to be an extra memcpy in the time-critical sequential flow of the leader. With option 2 data could be read directly into the shared memory buffer. With future async io support, reading and looking for tuple boundaries could be performed concurrently.

Regards,
Ants Aasma
Cybertec
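To illustrate the scheme Ants describes (a shared byte ring plus a ring of tuple start offsets, with oversized tuples simply wrapping the byte ring), here is a rough, simplified C sketch. The names (PcShared, pc_claim_next_tuple) and sizes are hypothetical. For brevity this version only claims a tuple once the next start offset is already published and copies byte by byte; the described scheme would instead start copying immediately, waiting for the next start position to show up, and would memcpy the one or two contiguous spans. Flow control (the reader waiting for workers before overwriting ring bytes) is also omitted.

    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "storage/spin.h"

    #define PC_DATA_RING_SIZE   (1024 * 1024)   /* shared byte ring */
    #define PC_OFFSET_RING_SIZE 8192            /* ring of tuple start offsets */

    typedef struct PcShared
    {
        slock_t     mutex;          /* protects next_tuple and ntuples */
        uint32      next_tuple;     /* next unclaimed offset-ring entry */
        uint32      ntuples;        /* offset-ring entries published so far */
        uint64      start_offset[PC_OFFSET_RING_SIZE];
        char        data[PC_DATA_RING_SIZE];    /* raw input bytes */
    } PcShared;

    /* Claim the next complete tuple and copy it into this worker's line_buf. */
    static bool
    pc_claim_next_tuple(PcShared *shared, StringInfo line_buf)
    {
        uint64      start;
        uint64      end;

        SpinLockAcquire(&shared->mutex);
        /* Need the next tuple's start offset to know where ours ends. */
        if (shared->next_tuple + 1 >= shared->ntuples)
        {
            SpinLockRelease(&shared->mutex);
            return false;
        }
        start = shared->start_offset[shared->next_tuple % PC_OFFSET_RING_SIZE];
        end = shared->start_offset[(shared->next_tuple + 1) % PC_OFFSET_RING_SIZE];
        shared->next_tuple++;
        SpinLockRelease(&shared->mutex);

        /* A tuple longer than the byte ring just wraps it multiple times. */
        resetStringInfo(line_buf);
        for (uint64 off = start; off < end; off++)
            appendStringInfoChar(line_buf, shared->data[off % PC_DATA_RING_SIZE]);
        return true;
    }

Note that nothing here ever needs resizing: the worker's local line_buf grows as needed, while the shared structures stay a fixed size regardless of tuple length.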
On Tue, Apr 7, 2020 at 7:08 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 7 Apr 2020 at 08:24, vignesh C <vignesh21@gmail.com> wrote:
> > Leader will create a circular queue and share it across the workers. The circular queue will be present in DSM. Leader will be using a fixed size queue to share the contents between the leader and the workers. Currently we will have 100 elements present in the queue. This will be created before the workers are started and shared with the workers. The data structures that are required by the parallel workers will be initialized by the leader, the size required in dsm will be calculated and the necessary keys will be loaded in the DSM. The specified number of workers will then be launched. Leader will read the table data from the file and copy the contents to the queue element by element. Each element in the queue will have 64K size DSA. This DSA will be used to store tuple contents from the file. The leader will try to copy as much content as possible within one 64K DSA queue element. We intend to store at least one tuple in each queue element. There are some cases where the 64K space may not be enough to store a single tuple. Mostly in cases where the table has toast data present and the single tuple can be more than 64K size. In these scenarios we will extend the DSA space accordingly. We cannot change the size of the dsm once the workers are launched. Whereas in case of DSA we can free the dsa pointer and reallocate the dsa pointer based on the memory size required. This is the very reason for choosing DSA over DSM for storing the data that must be inserted into the relation.
>
> I think the element based approach and requirement that all tuples fit into the queue makes things unnecessarily complex. The approach I detailed earlier allows for tuples to be bigger than the buffer. In that case a worker will claim the long tuple from the ring queue of tuple start positions, and starts copying it into its local line_buf. This can wrap around the buffer multiple times until the next start position shows up. At that point this worker can proceed with inserting the tuple and the next worker will claim the next tuple.
>

IIUC, with the fixed-size buffer, parallelism might suffer a bit, because the reader process won't be able to continue until the worker has copied the data from the shared buffer to its local buffer. I think somewhat more leader-worker coordination will be required with a fixed buffer size. However, as you pointed out, we can't allow the buffer to grow to the maximum size possible for all tuples, as that might require a lot of memory. One idea could be that we allow it for the first such tuple, and then if any other element/chunk in the queue requires more memory than the default 64KB, we always fall back to using the memory we have allocated for that first chunk. This will allow us to use extra memory for only one tuple, and it won't hurt parallelism much, as in many cases not all tuples will be so large.

I think in the proposed approach a queue element is nothing but a way to divide the work among workers based on size rather than on the number of tuples. Say we try to divide the work among workers based on start offsets: that can be more tricky, because it could lead either to a lot of contention, if we choose say one offset per worker (basically copy the data for one tuple, process it and then pick the next tuple), or to an unequal division of work, because some tuples can be smaller and others can be bigger. I guess division based on size would be a better idea. OTOH, I see the advantage of your approach as well and I will think more on it.

> > We had a couple of options for the way in which queue elements can be stored. Option 1: Each element (DSA chunk) will contain tuples such that each tuple will be preceded by the length of the tuple. So the tuples will be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2, tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only tuples (tuple-1), (tuple-2), ..... And we will have a second ring-buffer which contains a start-offset or length of each tuple. The old design used to generate one tuple of data and process tuple by tuple. In the new design, the server will generate multiple tuples of data per queue element. The worker will then process data tuple by tuple. As we are processing the data tuple by tuple, I felt both of the options are almost the same. However Design1 was chosen over Design 2 as we can save up on some space that was required by another variable in each element of the queue.
>
> With option 1 it's not possible to read input data into shared memory and there needs to be an extra memcpy in the time critical sequential flow of the leader. With option 2 data could be read directly into the shared memory buffer. With future async io support, reading and looking for tuple boundaries could be performed concurrently.
>

Yeah, option 2 sounds better.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
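A rough sketch of the "enlarge once, then fall back" idea from the previous message: only the first oversized tuple gets a chunk larger than the 64KB default, and any later oversized tuple reuses that same chunk, serializing on it rather than allocating more memory. The helper name and leader-local bookkeeping here are hypothetical; dsa_allocate/dsa_free are the real DSA calls.

    #include "postgres.h"
    #include "utils/dsa.h"

    #define PC_DEFAULT_CHUNK_SIZE   (64 * 1024)

    /* Leader-local bookkeeping for the single enlarged chunk. */
    static dsa_pointer pc_big_chunk = InvalidDsaPointer;
    static Size pc_big_chunk_size = 0;

    /*
     * Return a DSA chunk that can hold 'needed' bytes. Requests beyond the
     * 64KB default are all funneled through one enlarged chunk, so at most
     * one chunk ever grows; the caller must wait until that chunk is free
     * before reusing it for another oversized tuple.
     */
    static dsa_pointer
    pc_get_chunk(dsa_area *area, Size needed)
    {
        if (needed <= PC_DEFAULT_CHUNK_SIZE)
            return dsa_allocate(area, PC_DEFAULT_CHUNK_SIZE);

        if (pc_big_chunk == InvalidDsaPointer || pc_big_chunk_size < needed)
        {
            if (pc_big_chunk != InvalidDsaPointer)
                dsa_free(area, pc_big_chunk);
            pc_big_chunk = dsa_allocate(area, needed);
            pc_big_chunk_size = needed;
        }
        return pc_big_chunk;
    }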
On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote: > I think the element based approach and requirement that all tuples fit > into the queue makes things unnecessarily complex. The approach I > detailed earlier allows for tuples to be bigger than the buffer. In > that case a worker will claim the long tuple from the ring queue of > tuple start positions, and starts copying it into its local line_buf. > This can wrap around the buffer multiple times until the next start > position shows up. At that point this worker can proceed with > inserting the tuple and the next worker will claim the next tuple. > > This way nothing needs to be resized, there is no risk of a file with > huge tuples running the system out of memory because each element will > be reallocated to be huge and the number of elements is not something > that has to be tuned. +1. This seems like the right way to do it. > > We had a couple of options for the way in which queue elements can be stored. > > Option 1: Each element (DSA chunk) will contain tuples such that each > > tuple will be preceded by the length of the tuple. So the tuples will > > be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2, > > tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only > > tuples (tuple-1), (tuple-2), ..... And we will have a second > > ring-buffer which contains a start-offset or length of each tuple. The > > old design used to generate one tuple of data and process tuple by > > tuple. In the new design, the server will generate multiple tuples of > > data per queue element. The worker will then process data tuple by > > tuple. As we are processing the data tuple by tuple, I felt both of > > the options are almost the same. However Design1 was chosen over > > Design 2 as we can save up on some space that was required by another > > variable in each element of the queue. > > With option 1 it's not possible to read input data into shared memory > and there needs to be an extra memcpy in the time critical sequential > flow of the leader. With option 2 data could be read directly into the > shared memory buffer. With future async io support, reading and > looking for tuple boundaries could be performed concurrently. But option 2 still seems significantly worse than your proposal above, right? I really think we don't want a single worker in charge of finding tuple boundaries for everybody. That adds a lot of unnecessary inter-process communication and synchronization. Each process should just get the next tuple starting after where the last one ended, and then advance the end pointer so that the next process can do the same thing. Vignesh's proposal involves having a leader process that has to switch roles - he picks an arbitrary 25% threshold - and if it doesn't switch roles at the right time, performance will be impacted. If the leader doesn't get scheduled in time to refill the queue before it runs completely empty, workers will have to wait. Ants's scheme avoids that risk: whoever needs the next tuple reads the next line. There's no need to ever wait for the leader because there is no leader. I think it's worth enumerating some of the other ways that a project in this area can fail to achieve good speedups, so that we can try to avoid those that are avoidable and be aware of the others: - If we're unable to supply data to the COPY process as fast as the workers could load it, then speed will be limited at that point. 
We know reading the file from disk is pretty fast compared to what a single process can do. I'm not sure we've tested what happens with a network socket. It will depend on the network speed some, but it might be useful to know how many MB/s we can pump through over a UNIX socket. - The portion of the time that is used to split the lines is not easily parallelizable. That seems to be a fairly small percentage for a reasonably wide table, but it looks significant (13-18%) for a narrow table. Such cases will gain less performance and be limited to a smaller number of workers. I think we also need to be careful about files whose lines are longer than the size of the buffer. If we're not careful, we could get a significant performance drop-off in such cases. We should make sure to pick an algorithm that seems like it will handle such cases without serious regressions and check that a file composed entirely of such long lines is handled reasonably efficiently. - There could be index contention. Let's suppose that we can read data super fast and break it up into lines super fast. Maybe the file we're reading is fully RAM-cached and the lines are long. Now all of the backends are inserting into the indexes at the same time, and they might be trying to insert into the same pages. If so, lock contention could become a factor that hinders performance. - There could also be similar contention on the heap. Say the tuples are narrow, and many backends are trying to insert tuples into the same heap page at the same time. This would lead to many lock/unlock cycles. This could be avoided if the backends avoid targeting the same heap pages, but I'm not sure there's any reason to expect that they would do so unless we make some special provision for it. - These problems could also arise with respect to TOAST table insertions, either on the TOAST table itself or on its index. This would only happen if the table contains a lot of toastable values, but that could be the case: imagine a table with a bunch of columns each of which contains a long string that isn't very compressible. - What else? I bet the above list is not comprehensive. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
> - If we're unable to supply data to the COPY process as fast as the workers could load it, then speed will be limited at that point. We know reading the file from disk is pretty fast compared to what a single process can do. I'm not sure we've tested what happens with a network socket. It will depend on the network speed some, but it might be useful to know how many MB/s we can pump through over a UNIX socket.

This raises a good point. If at some point we want to minimize the amount of memory copies, then we might want to allow RDMA to write incoming network traffic directly into a distributing ring buffer, which would include the protocol-level headers. But at this point we are so far off from network reception becoming a bottleneck that I don't think it's worth holding anything up to allow for zero-copy transfers.

> - The portion of the time that is used to split the lines is not easily parallelizable. That seems to be a fairly small percentage for a reasonably wide table, but it looks significant (13-18%) for a narrow table. Such cases will gain less performance and be limited to a smaller number of workers. I think we also need to be careful about files whose lines are longer than the size of the buffer. If we're not careful, we could get a significant performance drop-off in such cases. We should make sure to pick an algorithm that seems like it will handle such cases without serious regressions and check that a file composed entirely of such long lines is handled reasonably efficiently.

I don't have a proof, but my gut feel tells me that it's fundamentally impossible to ingest csv without a serial line-ending/comment tokenization pass.

The current line splitting algorithm is terrible. I'm currently working with some scientific data where on ingestion CopyReadLineText() is about 25% on profiles. I prototyped a replacement that can do ~8GB/s on narrow rows, more on wider ones.

For rows that are consistently wider than the input buffer I think parallelism will still give a win - the serial phase is just a memcpy through a ringbuffer, after which a worker goes away to perform the actual insert, letting the next worker read the data. The memcpy is already happening today; CopyReadLineText() copies the input buffer into a StringInfo, so the only extra work is synchronization between leader and worker.

> - There could be index contention. Let's suppose that we can read data super fast and break it up into lines super fast. Maybe the file we're reading is fully RAM-cached and the lines are long. Now all of the backends are inserting into the indexes at the same time, and they might be trying to insert into the same pages. If so, lock contention could become a factor that hinders performance.

Different data distribution strategies can have an effect on that. Dealing out input data in larger or smaller chunks will have a considerable effect on contention, btree page splits and all kinds of things. I think the common theme would be a push to increase chunk size to reduce contention.

> - There could also be similar contention on the heap. Say the tuples are narrow, and many backends are trying to insert tuples into the same heap page at the same time. This would lead to many lock/unlock cycles. This could be avoided if the backends avoid targeting the same heap pages, but I'm not sure there's any reason to expect that they would do so unless we make some special provision for it.

I thought there already was a provision for that. Am I mis-remembering?

> - What else? I bet the above list is not comprehensive.

I think the parallel copy patch needs to concentrate on splitting input data to workers. After that, any performance issues would be basically the same as for a normal parallel insert workload. There may well be bottlenecks there, but those could be tackled independently.

Regards,
Ants Aasma
Cybertec
On Thu, Apr 9, 2020 at 1:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote:
> >
> > With option 1 it's not possible to read input data into shared memory and there needs to be an extra memcpy in the time critical sequential flow of the leader. With option 2 data could be read directly into the shared memory buffer. With future async io support, reading and looking for tuple boundaries could be performed concurrently.
>
> But option 2 still seems significantly worse than your proposal above, right?
>
> I really think we don't want a single worker in charge of finding tuple boundaries for everybody. That adds a lot of unnecessary inter-process communication and synchronization. Each process should just get the next tuple starting after where the last one ended, and then advance the end pointer so that the next process can do the same thing. Vignesh's proposal involves having a leader process that has to switch roles - he picks an arbitrary 25% threshold - and if it doesn't switch roles at the right time, performance will be impacted. If the leader doesn't get scheduled in time to refill the queue before it runs completely empty, workers will have to wait. Ants's scheme avoids that risk: whoever needs the next tuple reads the next line. There's no need to ever wait for the leader because there is no leader.
>

Hmm, I think in his scheme also there is a single reader process. See the email above [1] where he described how it should work. I think the difference is in the division of work. AFAIU, in Ants's scheme, the worker needs to pick the work from the tuple-offset queue, whereas in Vignesh's scheme it will be based on size (each worker will probably get 64KB of work). I think in his scheme the main thing to figure out is how many tuple offsets should be assigned to each worker in one go, so that we don't unnecessarily add contention for finding the next work unit.

I think we need to find the right balance between size and the number of tuples. I am trying to consider size here because larger tuples will probably require more time, as we need to allocate more space for them and they probably also require more processing time. One way to achieve that could be that each worker tries to claim 500 tuples (or some other threshold number), but if their size is greater than 64KB (or some other threshold size), then the worker tries with a smaller number of tuples, such that the size of the chunk of tuples stays below the threshold size.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
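The claim heuristic suggested in the message above (claim up to 500 tuples, but stop early once their combined size passes 64KB) could look roughly like the following C sketch; the thresholds, names, and the assumption that per-tuple lengths are already available are all illustrative only.

    #define PC_MAX_TUPLES_PER_CLAIM 500
    #define PC_MAX_BYTES_PER_CLAIM  (64 * 1024)

    /*
     * Given the lengths of the available, not-yet-claimed tuples, decide how
     * many of them one worker should claim in a single locked operation.
     */
    static uint32
    pc_tuples_to_claim(const uint64 *tuple_len, uint32 navailable)
    {
        uint64      total = 0;
        uint32      n = 0;

        while (n < navailable && n < PC_MAX_TUPLES_PER_CLAIM)
        {
            /* Always claim at least one tuple, however large it is. */
            if (n > 0 && total + tuple_len[n] > PC_MAX_BYTES_PER_CLAIM)
                break;
            total += tuple_len[n];
            n++;
        }
        return n;
    }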
On Thu, Apr 9, 2020 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Apr 9, 2020 at 1:00 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote: > > > > > > With option 1 it's not possible to read input data into shared memory > > > and there needs to be an extra memcpy in the time critical sequential > > > flow of the leader. With option 2 data could be read directly into the > > > shared memory buffer. With future async io support, reading and > > > looking for tuple boundaries could be performed concurrently. > > > > But option 2 still seems significantly worse than your proposal above, right? > > > > I really think we don't want a single worker in charge of finding > > tuple boundaries for everybody. That adds a lot of unnecessary > > inter-process communication and synchronization. Each process should > > just get the next tuple starting after where the last one ended, and > > then advance the end pointer so that the next process can do the same > > thing. Vignesh's proposal involves having a leader process that has to > > switch roles - he picks an arbitrary 25% threshold - and if it doesn't > > switch roles at the right time, performance will be impacted. If the > > leader doesn't get scheduled in time to refill the queue before it > > runs completely empty, workers will have to wait. Ants's scheme avoids > > that risk: whoever needs the next tuple reads the next line. There's > > no need to ever wait for the leader because there is no leader. > > > > Hmm, I think in his scheme also there is a single reader process. See > the email above [1] where he described how it should work. > oops, I forgot to specify the link to the email. See https://www.postgresql.org/message-id/CANwKhkO87A8gApobOz_o6c9P5auuEG1W2iCz0D5CfOeGgAnk3g%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 9, 2020 at 3:55 AM Ants Aasma <ants@cybertec.at> wrote:
>
> On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
> > - The portion of the time that is used to split the lines is not easily parallelizable. That seems to be a fairly small percentage for a reasonably wide table, but it looks significant (13-18%) for a narrow table. Such cases will gain less performance and be limited to a smaller number of workers. I think we also need to be careful about files whose lines are longer than the size of the buffer. If we're not careful, we could get a significant performance drop-off in such cases. We should make sure to pick an algorithm that seems like it will handle such cases without serious regressions and check that a file composed entirely of such long lines is handled reasonably efficiently.
>
> I don't have a proof, but my gut feel tells me that it's fundamentally impossible to ingest csv without a serial line-ending/comment tokenization pass.
>

I think even if we try to do it via multiple workers, it might not be better. In such a scheme, every worker needs to update the end boundaries, and the next worker needs to keep checking whether the previous one has updated the end pointer. I think this can add a significant synchronization effort for cases where tuples are 100 or so bytes, which will be a common case.

> The current line splitting algorithm is terrible. I'm currently working with some scientific data where on ingestion CopyReadLineText() is about 25% on profiles. I prototyped a replacement that can do ~8GB/s on narrow rows, more on wider ones.
>

Good to hear. I think that will be a good project on its own, and it might give a boost to parallel copy, as with that we can further reduce the non-parallelizable work unit.

> For rows that are consistently wider than the input buffer I think parallelism will still give a win - the serial phase is just memcpy through a ringbuffer, after which a worker goes away to perform the actual insert, letting the next worker read the data. The memcpy is already happening today, CopyReadLineText() copies the input buffer into a StringInfo, so the only extra work is synchronization between leader and worker.
>
> > - There could also be similar contention on the heap. Say the tuples are narrow, and many backends are trying to insert tuples into the same heap page at the same time. This would lead to many lock/unlock cycles. This could be avoided if the backends avoid targeting the same heap pages, but I'm not sure there's any reason to expect that they would do so unless we make some special provision for it.
>
> I thought there already was a provision for that. Am I mis-remembering?
>

Copy uses heap_multi_insert to insert a batch of tuples, and I think each batch should ideally use a different page, mostly a new one. So I am not sure if this will be a problem, or a problem of a level for which we need to do some special handling. But if this turns out to be a problem, we definitely need some better way to deal with it.

> > - What else? I bet the above list is not comprehensive.
>
> I think parallel copy patch needs to concentrate on splitting input data to workers. After that any performance issues would be basically the same as a normal parallel insert workload. There may well be bottlenecks there, but those could be tackled independently.
>

I agree.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 9, 2020 at 1:00 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote: > > I think the element based approach and requirement that all tuples fit > > into the queue makes things unnecessarily complex. The approach I > > detailed earlier allows for tuples to be bigger than the buffer. In > > that case a worker will claim the long tuple from the ring queue of > > tuple start positions, and starts copying it into its local line_buf. > > This can wrap around the buffer multiple times until the next start > > position shows up. At that point this worker can proceed with > > inserting the tuple and the next worker will claim the next tuple. > > > > This way nothing needs to be resized, there is no risk of a file with > > huge tuples running the system out of memory because each element will > > be reallocated to be huge and the number of elements is not something > > that has to be tuned. > > +1. This seems like the right way to do it. > > > > We had a couple of options for the way in which queue elements can be stored. > > > Option 1: Each element (DSA chunk) will contain tuples such that each > > > tuple will be preceded by the length of the tuple. So the tuples will > > > be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2, > > > tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only > > > tuples (tuple-1), (tuple-2), ..... And we will have a second > > > ring-buffer which contains a start-offset or length of each tuple. The > > > old design used to generate one tuple of data and process tuple by > > > tuple. In the new design, the server will generate multiple tuples of > > > data per queue element. The worker will then process data tuple by > > > tuple. As we are processing the data tuple by tuple, I felt both of > > > the options are almost the same. However Design1 was chosen over > > > Design 2 as we can save up on some space that was required by another > > > variable in each element of the queue. > > > > With option 1 it's not possible to read input data into shared memory > > and there needs to be an extra memcpy in the time critical sequential > > flow of the leader. With option 2 data could be read directly into the > > shared memory buffer. With future async io support, reading and > > looking for tuple boundaries could be performed concurrently. > > But option 2 still seems significantly worse than your proposal above, right? > > I really think we don't want a single worker in charge of finding > tuple boundaries for everybody. That adds a lot of unnecessary > inter-process communication and synchronization. Each process should > just get the next tuple starting after where the last one ended, and > then advance the end pointer so that the next process can do the same > thing. Vignesh's proposal involves having a leader process that has to > switch roles - he picks an arbitrary 25% threshold - and if it doesn't > switch roles at the right time, performance will be impacted. If the > leader doesn't get scheduled in time to refill the queue before it > runs completely empty, workers will have to wait. Ants's scheme avoids > that risk: whoever needs the next tuple reads the next line. There's > no need to ever wait for the leader because there is no leader. I agree that if the leader switches the role, then it is possible that sometimes the leader might not produce the work before the queue is empty. 
OTOH, the problem with the approach you are suggesting is that the work will be generated on demand, i.e. there is no specific process that is generating the data while the workers are busy inserting it. So IMHO, if we have a specific leader process, then there will always be work available for all the workers. I agree that we need to find the correct point at which the leader should work as a worker. One idea could be that when the queue is full and there is no space to push more work to the queue, the leader itself processes that work.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 9, 2020 at 7:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I agree that if the leader switches the role, then it is possible that > sometimes the leader might not produce the work before the queue is > empty. OTOH, the problem with the approach you are suggesting is that > the work will be generated on-demand, i.e. there is no specific > process who is generating the data while workers are busy inserting > the data. I think you have a point. The way I think things could go wrong if we don't have a leader is if it tends to happen that everyone wants new work at the same time. In that case, everyone will wait at once, whereas if there is a designated process that aggressively queues up work, we could perhaps avoid that. Note that you really have to have the case where everyone wants new work at the exact same moment, because otherwise they just all take turns finding work for themselves, and everything is fine, because nobody's waiting for anybody else to do any work, so everyone is always making forward progress. Now on the other hand, if we do have a leader, and for some reason it's slow in responding, everyone will have to wait. That could happen either because the leader also has other responsibilities, like reading data or helping with the main work when the queue is full, or just because the system is really busy and the leader doesn't get scheduled on-CPU for a while. I am inclined to think that's likely to be a more serious problem. The thing is, the problem of everyone needing new work at the same time can't really keep on repeating. Say that everyone finishes processing their first chunk at the same time. Now everyone needs a second chunk, and in a leaderless system, they must take turns getting it. So they will go in some order. The ones who go later will presumably also finish later, so the end times for the second and following chunks will be scattered. You shouldn't get repeated pile-ups with everyone finishing at the same time, because each time it happens, it will force a little bit of waiting that will spread things out. If they clump up again, that will happen again, but it shouldn't happen every time. But in the case where there is a leader, I don't think there's any similar protection. Suppose we go with the design Vignesh proposes where the leader switches to processing chunks when the queue is more than 75% full. If the leader has a "hiccup" where it gets swapped out or is busy with processing a chunk for a longer-than-normal time, all of the other processes have to wait for it. Now we can probably tune this to some degree by adjusting the queue size and fullness thresholds, but the optimal values for those parameters might be quite different on different systems, depending on load, I/O performance, CPU architecture, etc. If there's a system or configuration where the leader tends not to respond fast enough, it will probably just keep happening, because nothing in the algorithm will tend to shake it out of that bad pattern. I'm not 100% certain that my analysis here is right, so it will be interesting to hear from other people. However, as a general rule, I think we want to minimize the amount of work that can only be done by one process (the leader) and maximize the amount that can be done by any process with whichever one is available taking on the job. In the case of COPY FROM STDIN, the reads from the network socket can only be done by the one process connected to it. 
In the case of COPY from a file, even that could be rotated around, if all processes open the file individually and seek to the appropriate offset. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On April 9, 2020 4:01:43 AM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Apr 9, 2020 at 3:55 AM Ants Aasma <ants@cybertec.at> wrote:
> >
> > On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > > - The portion of the time that is used to split the lines is not easily parallelizable. That seems to be a fairly small percentage for a reasonably wide table, but it looks significant (13-18%) for a narrow table. Such cases will gain less performance and be limited to a smaller number of workers. I think we also need to be careful about files whose lines are longer than the size of the buffer. If we're not careful, we could get a significant performance drop-off in such cases. We should make sure to pick an algorithm that seems like it will handle such cases without serious regressions and check that a file composed entirely of such long lines is handled reasonably efficiently.
> >
> > I don't have a proof, but my gut feel tells me that it's fundamentally impossible to ingest csv without a serial line-ending/comment tokenization pass.

I can't quite see a way either. But even if it were, I have a hard time seeing parallelizing that path as the right thing.

> I think even if we try to do it via multiple workers it might not be better. In such a scheme, every worker needs to update the end boundaries and the next worker to keep a check if the previous has updated the end pointer. I think this can add a significant synchronization effort for cases where tuples are of 100 or so bytes which will be a common case.

It seems like it'd also have terrible caching and instruction-level parallelism behavior. By constantly switching the process that analyzes boundaries, the current data will have to be brought into L1/registers, rather than staying there.

I'm fairly certain that we do *not* want to distribute input data between processes on a single tuple basis. Probably not even below a few hundred kb. If there's any sort of natural clustering in the loaded data - extremely common, think timestamps - splitting on a granular basis will make indexing much more expensive. And have a lot more contention.

> > The current line splitting algorithm is terrible. I'm currently working with some scientific data where on ingestion CopyReadLineText() is about 25% on profiles. I prototyped a replacement that can do ~8GB/s on narrow rows, more on wider ones.

We should really replace the entire copy parsing code. It's terrible.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Apr 9, 2020 at 2:55 PM Andres Freund <andres@anarazel.de> wrote:
> I'm fairly certain that we do *not* want to distribute input data between processes on a single tuple basis. Probably not even below a few hundred kb. If there's any sort of natural clustering in the loaded data - extremely common, think timestamps - splitting on a granular basis will make indexing much more expensive. And have a lot more contention.

That's a fair point. I think the solution ought to be that once any process starts finding line endings, it continues until it's grabbed at least a certain amount of data for itself. Then it stops and lets some other process grab a chunk of data.

Or are you arguing that there should be only one process that's allowed to find line endings for the entire duration of the load?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On April 9, 2020 12:29:09 PM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 9, 2020 at 2:55 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm fairly certain that we do *not* want to distribute input data between processes on a single tuple basis. Probably not even below a few hundred kb. If there's any sort of natural clustering in the loaded data - extremely common, think timestamps - splitting on a granular basis will make indexing much more expensive. And have a lot more contention.
>
> That's a fair point. I think the solution ought to be that once any process starts finding line endings, it continues until it's grabbed at least a certain amount of data for itself. Then it stops and lets some other process grab a chunk of data.
>
> Or are you arguing that there should be only one process that's allowed to find line endings for the entire duration of the load?

I've not yet read the whole thread. So I'm probably restating ideas.

Imo, yes, there should be only one process doing the chunking. For ILP, cache efficiency, but also because the leader is the only process with access to the network socket. It should load input data into one large buffer that's shared across processes. There should be a separate ringbuffer with tuple/partial tuple (for huge tuples) offsets. Worker processes should grab large chunks of offsets from the offset ringbuffer. If the ringbuffer is not full, the worker chunks should be reduced in size.

Given that everything stalls if the leader doesn't accept further input data, as well as when there are no available split chunks, it doesn't seem like a good idea to have the leader do other work.

I don't think optimizing/targeting copy from local files, where multiple processes could read, is useful. COPY STDIN is the only thing that practically matters.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Apr 9, 2020 at 4:00 PM Andres Freund <andres@anarazel.de> wrote: > I've not yet read the whole thread. So I'm probably restating ideas. Yeah, but that's OK. > Imo, yes, there should be only one process doing the chunking. For ilp, cache efficiency, but also because the leader isthe only process with access to the network socket. It should load input data into one large buffer that's shared acrossprocesses. There should be a separate ringbuffer with tuple/partial tuple (for huge tuples) offsets. Worker processesshould grab large chunks of offsets from the offset ringbuffer. If the ringbuffer is not full, the worker chunksshould be reduced in size. My concern here is that it's going to be hard to avoid processes going idle. If the leader does nothing at all once the ring buffer is full, it's wasting time that it could spend processing a chunk. But if it picks up a chunk, then it might not get around to refilling the buffer before other processes are idle with no work to do. Still, it might be the case that having the process that is reading the data also find the line endings is so fast that it makes no sense to split those two tasks. After all, whoever just read the data must have it in cache, and that helps a lot. > Given that everything stalls if the leader doesn't accept further input data, as well as when there are no available splittedchunks, it doesn't seem like a good idea to have the leader do other work. > > I don't think optimizing/targeting copy from local files, where multiple processes could read, is useful. COPY STDIN isthe only thing that practically matters. Yeah, I think Amit has been thinking primarily in terms of COPY from files, and I've been encouraging him to at least consider the STDIN case. But I think you're right, and COPY FROM STDIN should be the design center for this feature. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2020-04-10 07:40:06 -0400, Robert Haas wrote:
> On Thu, Apr 9, 2020 at 4:00 PM Andres Freund <andres@anarazel.de> wrote:
> > Imo, yes, there should be only one process doing the chunking. For ILP, cache efficiency, but also because the leader is the only process with access to the network socket. It should load input data into one large buffer that's shared across processes. There should be a separate ringbuffer with tuple/partial tuple (for huge tuples) offsets. Worker processes should grab large chunks of offsets from the offset ringbuffer. If the ringbuffer is not full, the worker chunks should be reduced in size.
>
> My concern here is that it's going to be hard to avoid processes going idle. If the leader does nothing at all once the ring buffer is full, it's wasting time that it could spend processing a chunk. But if it picks up a chunk, then it might not get around to refilling the buffer before other processes are idle with no work to do.

An idle process doesn't cost much. Processes that use CPU inefficiently however...

> Still, it might be the case that having the process that is reading the data also find the line endings is so fast that it makes no sense to split those two tasks. After all, whoever just read the data must have it in cache, and that helps a lot.

Yea. And if it's not fast enough to split lines, then we have a problem regardless of which process does the splitting.

Greetings,

Andres Freund
On Fri, Apr 10, 2020 at 2:26 PM Andres Freund <andres@anarazel.de> wrote: > > Still, it might be the case that having the process that is reading > > the data also find the line endings is so fast that it makes no sense > > to split those two tasks. After all, whoever just read the data must > > have it in cache, and that helps a lot. > > Yea. And if it's not fast enough to split lines, then we have a problem > regardless of which process does the splitting. Still, if the reader does the splitting, then you don't need as much IPC, right? The shared memory data structure is just a ring of bytes, and whoever reads from it is responsible for the rest. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2020-04-13 14:13:46 -0400, Robert Haas wrote:
> On Fri, Apr 10, 2020 at 2:26 PM Andres Freund <andres@anarazel.de> wrote:
> > > Still, it might be the case that having the process that is reading the data also find the line endings is so fast that it makes no sense to split those two tasks. After all, whoever just read the data must have it in cache, and that helps a lot.
> >
> > Yea. And if it's not fast enough to split lines, then we have a problem regardless of which process does the splitting.
>
> Still, if the reader does the splitting, then you don't need as much IPC, right? The shared memory data structure is just a ring of bytes, and whoever reads from it is responsible for the rest.

I don't think so. If only one process does the splitting, the exclusively locked section is just popping off a bunch of offsets off the ring. And that could fairly easily be done with atomic ops (since what we need is basically a single-producer multiple-consumer queue, which can be done lock-free fairly easily). Whereas in the case of each process doing the splitting, the exclusively locked part is splitting along lines - which takes considerably longer than just popping off a few offsets.

Greetings,

Andres Freund
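The single-producer/multiple-consumer pop described above could be sketched with PostgreSQL's atomics roughly as follows. With only one producer advancing 'produced', a consumer needs nothing more than a compare-and-swap to pop a batch of offsets. The type and function names are hypothetical; the pg_atomic_* calls are the real atomics API.

    #include "postgres.h"
    #include "port/atomics.h"

    typedef struct PcOffsetRing
    {
        pg_atomic_uint64    produced;   /* advanced only by the reader process */
        pg_atomic_uint64    consumed;   /* advanced by workers via CAS */
        /* the offsets themselves live in a ring indexed by (pos % ring size) */
    } PcOffsetRing;

    /*
     * Try to claim up to 'want' consecutive offset-ring positions. Returns
     * the number claimed and sets *first to the first claimed position.
     */
    static uint32
    pc_claim_offsets(PcOffsetRing *ring, uint32 want, uint64 *first)
    {
        for (;;)
        {
            uint64      old = pg_atomic_read_u64(&ring->consumed);
            uint64      avail = pg_atomic_read_u64(&ring->produced) - old;
            uint64      take = Min(avail, (uint64) want);

            if (take == 0)
                return 0;       /* nothing published yet */
            if (pg_atomic_compare_exchange_u64(&ring->consumed, &old, old + take))
            {
                *first = old;
                return (uint32) take;
            }
            /* CAS failed: another worker claimed first; retry. */
        }
    }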
On Mon, Apr 13, 2020 at 4:16 PM Andres Freund <andres@anarazel.de> wrote: > I don't think so. If only one process does the splitting, the > exclusively locked section is just popping off a bunch of offsets of the > ring. And that could fairly easily be done with atomic ops (since what > we need is basically a single producer multiple consumer queue, which > can be done lock free fairly easily ). Whereas in the case of each > process doing the splitting, the exclusively locked part is splitting > along lines - which takes considerably longer than just popping off a > few offsets. Hmm, that does seem believable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello,

I was going through some literature on parsing CSV files in a fully parallelized way and found (from [1]) an interesting approach implemented in the open-source project ParaText [2]. The algorithm follows a two-phase approach: the first pass identifies the adjusted chunks in parallel by exploiting the simplicity of CSV formats, and the second phase processes the complete records within each adjusted chunk by one of the available workers. Here is the sketch:

1. Each worker scans a distinct fixed-sized chunk of the CSV file and collects the following three stats from the chunk:
a) number of quotes
b) position of the first new line after an even number of quotes
c) position of the first new line after an odd number of quotes
2. Once the stats from all the chunks are collected, the leader identifies the adjusted chunk boundaries by iterating over the stats linearly:
- For the k-th chunk, the leader adds up the number of quotes in the first k-1 chunks.
- If that number is even, then the k-th chunk does not start in the middle of a quoted field, and the first newline after an even number of quotes (the second collected stat) is the first record delimiter in this chunk.
- Otherwise, if the number is odd, the first newline after an odd number of quotes (the third collected stat) is the first record delimiter.
- The end position of the adjusted chunk is obtained from the starting position of the next adjusted chunk.
3. Once the boundaries of the chunks are determined (forming adjusted chunks), an individual worker may take up one adjusted chunk and process its tuples independently.

Although this approach parses the CSV in parallel, it requires two scans of the CSV file. So, given a system with a spinning hard disk and small RAM, as per my understanding, the algorithm will perform very poorly. But if we use this algorithm to parse a CSV file on a multi-core system with a large RAM, the performance might improve significantly [1].

Hence, I was trying to think whether we can leverage this idea for implementing parallel COPY in PG. We can design an algorithm similar to parallel hash-join where the workers pass through different phases:
1. Phase 1 - Read fixed-size chunks in parallel, and store the chunks and the small stats about each chunk in shared memory. If the shared memory is full, go to phase 2.
2. Phase 2 - Allow a single worker to process the stats and decide the actual chunk boundaries so that no tuple spans two different chunks. Go to phase 3.
3. Phase 3 - Each worker picks one adjusted chunk, then parses and processes the tuples from it. Once done with one chunk, it picks the next one, and so on.
4. If there are still some unread contents, go back to phase 1.
We can probably use separate workers for phase 1 and phase 3 so that they can work concurrently.

Advantages:
1. Each worker spends a significant amount of time in each phase and gets the benefit of the instruction cache - at least in phase 1.
2. It also has the same advantage as parallel hash join - fast workers get to do more work.
3. We can extend this solution for reading data from STDIN. Of course, phase 1 and phase 2 must then be performed by the leader process, which can read from the socket.

Disadvantages:
1. Surely doesn't work if we don't have enough shared memory.
2. Probably, this approach is just impractical for PG due to certain limitations.

Thoughts?

[1] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/chunker-sigmod19.pdf
[2] ParaText. https://github.com/wiseio/paratext.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
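[To make the first pass above concrete, here is a minimal sketch in C of the per-chunk statistics collection. The ChunkStats layout and function name are invented for illustration, and it assumes quotes are escaped by doubling them (standard CSV), which is what keeps the parity trick valid - backslash-style escaping breaks it, as noted in the reply below:]

    #include <stddef.h>
    #include <stdint.h>

    /* Per-chunk statistics for the first parallel pass (names invented). */
    typedef struct ChunkStats
    {
        uint64_t n_quotes;       /* total quote chars in the chunk */
        int64_t  first_nl_even;  /* offset of first newline seen after an
                                  * even number of quotes, or -1 */
        int64_t  first_nl_odd;   /* same, after an odd number of quotes */
    } ChunkStats;

    static void
    collect_chunk_stats(const char *chunk, size_t len, ChunkStats *st)
    {
        st->n_quotes = 0;
        st->first_nl_even = -1;
        st->first_nl_odd = -1;

        for (size_t i = 0; i < len; i++)
        {
            if (chunk[i] == '"')
                st->n_quotes++;
            else if (chunk[i] == '\n')
            {
                if ((st->n_quotes & 1) == 0 && st->first_nl_even < 0)
                    st->first_nl_even = (int64_t) i;
                else if ((st->n_quotes & 1) == 1 && st->first_nl_odd < 0)
                    st->first_nl_odd = (int64_t) i;
            }
        }
    }

[The leader's step 2 is then just the linear pass described above: sum n_quotes over chunks 1..k-1 and pick first_nl_even or first_nl_odd for chunk k accordingly.]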
On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> 1. Each worker scans a distinct fixed-size chunk of the CSV file and
> collects the following three stats from the chunk:
> a) number of quotes
> b) position of the first newline after an even number of quotes
> c) position of the first newline after an odd number of quotes
> 2. Once stats from all the chunks are collected, the leader identifies
> the adjusted chunk boundaries by iterating over the stats linearly:
> - For the k-th chunk, the leader adds up the number of quotes in the
> first k-1 chunks.
> - If the number is even, then the k-th chunk does not start in the
> middle of a quoted field, and the first newline after an even number
> of quotes (the second collected stat) is the first record delimiter
> in this chunk.
> - Otherwise, if the number is odd, the first newline after an odd
> number of quotes (the third collected stat) is the first record
> delimiter.
> - The end position of the adjusted chunk is obtained from the
> starting position of the next adjusted chunk.

The trouble is that, at least with the current coding, the number of
quotes in a chunk can depend on whether the chunk started in a quote
or not. That's because escape characters only count inside quotes. See
for example the following csv:

foo,\"bar
baz",\"xyz"

This currently parses as one line, and the number of parsed quotes
doesn't change if you add a quote in front.

But the general approach of doing the tokenization in parallel and
then a serial pass over the tokenization would still work. The quote
counting and newline finding just have to be done for both the
starting-in-quote and the not-starting-in-quote cases.

Using phases doesn't look like the correct approach - the tokenization
can be prepared just in time for the serial pass, and processing of a
chunk can proceed immediately after. This could all be done by having
the data in a single ringbuffer with a processing pipeline where one
process does the reading, then workers grab tokenization chunks as
they become available, then one process handles determining the chunk
boundaries, after which the chunks are processed.

But I still don't think this is something to worry about for the first
version. Just a better line splitting algorithm should go a looong way
in feeding a large number of workers, even when inserting into an
unindexed, unlogged table. If we get the SIMD line splitting in, it
will be enough to overwhelm most I/O subsystems available today.

Regards,
Ants Aasma
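[To illustrate the dual-state scan Ants suggests: each chunk is scanned under both assumptions about its starting state, and the serial pass later selects the right result by chaining chunk k-1's ending state into chunk k. This is a sketch with invented names; escaping is reduced to backslash-inside-quotes, and an escape character landing exactly on a chunk boundary would need an extra carried bit that is ignored here:]

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct SplitStats
    {
        int64_t first_rec_delim;    /* first top-level newline, or -1 */
        bool    ends_in_quote;      /* scanner state at end of the chunk */
    } SplitStats;

    /*
     * Scan one chunk under both starting assumptions: st[0] assumes the
     * chunk starts outside a quoted field, st[1] assumes inside one.
     * Written as two passes for clarity; they are easy to fuse into one.
     */
    static void
    scan_chunk_both(const char *chunk, size_t len, SplitStats st[2])
    {
        for (int s = 0; s < 2; s++)
        {
            bool in_quote = (s == 1);
            bool last_was_esc = false;

            st[s].first_rec_delim = -1;
            for (size_t i = 0; i < len; i++)
            {
                char c = chunk[i];

                if (last_was_esc)
                    last_was_esc = false;       /* escaped char: no effect */
                else if (in_quote && c == '\\')
                    last_was_esc = true;        /* escapes count only in quotes */
                else if (c == '"')
                    in_quote = !in_quote;
                else if (c == '\n' && !in_quote && st[s].first_rec_delim < 0)
                    st[s].first_rec_delim = (int64_t) i;
            }
            st[s].ends_in_quote = in_quote;
        }
    }

[The serial pass then walks the chunks in order: if chunk k's true starting state is "outside a quote", its result is st[0] and chunk k+1's starting state is st[0].ends_in_quote, and correspondingly for st[1].]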
On Mon, 13 Apr 2020 at 23:16, Andres Freund <andres@anarazel.de> wrote:
> > Still, if the reader does the splitting, then you don't need as much
> > IPC, right? The shared memory data structure is just a ring of bytes,
> > and whoever reads from it is responsible for the rest.
>
> I don't think so. If only one process does the splitting, the
> exclusively locked section is just popping off a bunch of offsets of the
> ring. And that could fairly easily be done with atomic ops (since what
> we need is basically a single producer multiple consumer queue, which
> can be done lock free fairly easily). Whereas in the case of each
> process doing the splitting, the exclusively locked part is splitting
> along lines - which takes considerably longer than just popping off a
> few offsets.

I see the benefit of having one process responsible for splitting as
being able to run ahead of the workers to queue up work when many of
them need new data at the same time.

I don't think the locking benefits of a ring are important in this
case. At the current, rather conservative chunk sizes we are looking
at ~100k chunks per second at best, so normal locking should be
perfectly adequate. And the chunk size can easily be increased. I see
the main value in it being simple.

But there is a point that having a layer of indirection instead of a
linear buffer allows some workers to fall behind - either because the
kernel scheduled them out for a time slice, or because they need to do
I/O, or because inserting some tuple hit a unique conflict and needs
to wait for a transaction to complete or abort to resolve it. With a
ring buffer, reading has to wait on the slowest worker still reading
its chunk. Having workers copy the data to a local buffer as the first
step would reduce the probability of hitting any issues. But still, at
GB/s rates, hiding a 10ms timeslice of delay would need tens of
megabytes of buffer (at 1 GB/s, a 10 ms stall covers 10 MB of input).

FWIW, I think just increasing the buffer is good enough - the CPUs
processing this workload are likely to have tens to hundreds of
megabytes of cache on board.
On Wed, Apr 15, 2020 at 1:10 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> Hence, I was trying to think whether we can leverage this idea for
> implementing parallel COPY in PG. We can design an algorithm similar
> to parallel hash-join where the workers pass through different phases.
> 1. Phase 1 - Read fixed-size chunks in parallel, storing the chunks
> and the small stats about each chunk in shared memory. If the shared
> memory is full, go to phase 2.
> 2. Phase 2 - Allow a single worker to process the stats and decide
> the actual chunk boundaries so that no tuple spans two different
> chunks. Go to phase 3.
>
> 3. Phase 3 - Each worker picks one adjusted chunk and parses and
> processes tuples from it. Once done with one chunk, it picks the next
> one, and so on.
>
> 4. If there are still some unread contents, go back to phase 1.
>
> We can probably use separate workers for phase 1 and phase 3 so that
> they can work concurrently.
>
> Advantages:
> 1. Each worker spends some significant time in each phase. Gets the
> benefit of the instruction cache - at least in phase 1.
> 2. It also has the same advantage as parallel hash join - fast
> workers get to do more work.
> 3. We can extend this solution for reading data from STDIN. Of
> course, phases 1 and 2 must then be performed by the leader process,
> which can read from the socket.
>
> Disadvantages:
> 1. Surely doesn't work if we don't have enough shared memory.
> 2. Probably, this approach is just impractical for PG due to certain
> limitations.

As I understand this, it needs to parse the lines twice (the second
time in phase 3), and till the first two phases are over, we can't
start the tuple processing work which is done in phase 3. So even if
the tokenization is done a bit faster, we will lose some time on
processing the tuples, which might not be an overall win; in fact, it
can be worse compared to the single reader approach being discussed.
Now, if the work done in tokenization were a major (or significant)
portion of the copy, then thinking of such a technique might be
useful, but that is not the case, as seen in the data shared above in
this thread (the tokenize time is very small compared to the data
processing time).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Apr 15, 2020 at 7:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> As I understand this, it needs to parse the lines twice (the second
> time in phase 3), and till the first two phases are over, we can't
> start the tuple processing work which is done in phase 3. So even if
> the tokenization is done a bit faster, we will lose some time on
> processing the tuples, which might not be an overall win; in fact, it
> can be worse compared to the single reader approach being discussed.
> Now, if the work done in tokenization were a major (or significant)
> portion of the copy, then thinking of such a technique might be
> useful, but that is not the case, as seen in the data shared above in
> this thread (the tokenize time is very small compared to the data
> processing time).

It seems to me that a good first step here might be to forget about
parallelism for a minute and just write a patch to make the line
splitting as fast as possible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 15, 2020 at 2:15 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > 1. Each worker scans a distinct fixed-size chunk of the CSV file and
> > collects the following three stats from the chunk:
> > a) number of quotes
> > b) position of the first newline after an even number of quotes
> > c) position of the first newline after an odd number of quotes
> > 2. Once stats from all the chunks are collected, the leader identifies
> > the adjusted chunk boundaries by iterating over the stats linearly:
> > - For the k-th chunk, the leader adds up the number of quotes in the
> > first k-1 chunks.
> > - If the number is even, then the k-th chunk does not start in the
> > middle of a quoted field, and the first newline after an even number
> > of quotes (the second collected stat) is the first record delimiter
> > in this chunk.
> > - Otherwise, if the number is odd, the first newline after an odd
> > number of quotes (the third collected stat) is the first record
> > delimiter.
> > - The end position of the adjusted chunk is obtained from the
> > starting position of the next adjusted chunk.
>
> The trouble is that, at least with the current coding, the number of
> quotes in a chunk can depend on whether the chunk started in a quote
> or not. That's because escape characters only count inside quotes. See
> for example the following csv:
>
> foo,\"bar
> baz",\"xyz"
>
> This currently parses as one line, and the number of parsed quotes
> doesn't change if you add a quote in front.
>
> But the general approach of doing the tokenization in parallel and
> then a serial pass over the tokenization would still work. The quote
> counting and newline finding just have to be done for both the
> starting-in-quote and the not-starting-in-quote cases.

Yeah, right.

> Using phases doesn't look like the correct approach - the tokenization
> can be prepared just in time for the serial pass, and processing of a
> chunk can proceed immediately after. This could all be done by having
> the data in a single ringbuffer with a processing pipeline where one
> process does the reading, then workers grab tokenization chunks as
> they become available, then one process handles determining the chunk
> boundaries, after which the chunks are processed.

I was thinking from this point of view - the sooner we introduce
parallelism in the process, the greater the benefits. Probably there
isn't any way to avoid a single serial pass over the data (phase 2 in
the above case) to determine the chunk boundaries. So yeah, if the
reading and tokenisation phase doesn't take much time, parallelising
it would just be overkill. As pointed out by Andres and you, using a
lock-free circular buffer implementation sounds like the way to go
forward. AFAIK, a CAS-based FIFO circular queue implementation suffers
from two problems:
1. (as pointed out by you) slow workers may block producers;
2. since it doesn't partition the queue among the workers, it doesn't
achieve good locality and cache-friendliness, which limits scalability
on NUMA systems.

> But I still don't think this is something to worry about for the first
> version. Just a better line splitting algorithm should go a looong way
> in feeding a large number of workers, even when inserting into an
> unindexed, unlogged table. If we get the SIMD line splitting in, it
> will be enough to overwhelm most I/O subsystems available today.

Yeah. Parsing text is a great use case for data parallelism, which
can be achieved by SIMD instructions.
Consider processing the 8-bit ASCII characters 64 at a time in a
512-bit SIMD word. A lot of code and complexity from CopyReadLineText
would surely go away. And further (I'm not sure on this point), if we
can use the schema of the table, perhaps JIT can generate machine code
to read the fields efficiently based on their types.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
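[As a rough illustration of that idea (not code from any patch in this thread): with AVX-512BW, a single byte-compare produces a 64-bit mask of newline positions per 64-byte block, which can then be walked with bit tricks. The sketch assumes len is a multiple of 64, ignores quoting rules entirely, and uses a GCC/Clang builtin; a real version would need a scalar tail and runtime CPU dispatch:]

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Report each newline position in buf via the caller's callback. */
    static void
    find_newlines_avx512(const char *buf, size_t len,
                         void (*emit)(size_t pos, void *arg), void *arg)
    {
        const __m512i nl = _mm512_set1_epi8('\n');

        for (size_t off = 0; off < len; off += 64)
        {
            __m512i   block = _mm512_loadu_si512(buf + off);
            __mmask64 hits  = _mm512_cmpeq_epi8_mask(block, nl);

            while (hits)
            {
                emit(off + (size_t) __builtin_ctzll(hits), arg);
                hits &= hits - 1;   /* clear lowest set bit */
            }
        }
    }

[Counting quotes per block for the parity scheme works the same way: compare against '"' and popcount the resulting mask.]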
On 2020-04-15 10:12:14 -0400, Robert Haas wrote:
> On Wed, Apr 15, 2020 at 7:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > As I understand this, it needs to parse the lines twice (the second
> > time in phase 3), and till the first two phases are over, we can't
> > start the tuple processing work which is done in phase 3. So even if
> > the tokenization is done a bit faster, we will lose some time on
> > processing the tuples, which might not be an overall win; in fact, it
> > can be worse compared to the single reader approach being discussed.
> > Now, if the work done in tokenization were a major (or significant)
> > portion of the copy, then thinking of such a technique might be
> > useful, but that is not the case, as seen in the data shared above in
> > this thread (the tokenize time is very small compared to the data
> > processing time).
>
> It seems to me that a good first step here might be to forget about
> parallelism for a minute and just write a patch to make the line
> splitting as fast as possible.

+1

Compared to all the rest of the effort during COPY, a fast "split
rows" implementation should not be a bottleneck anymore.
Hi,

On 2020-04-15 20:36:39 +0530, Kuntal Ghosh wrote:
> I was thinking from this point of view - the sooner we introduce
> parallelism in the process, the greater the benefits.

I don't really agree. Sure, that's true from a theoretical
perspective, but the incremental gains may be very small, and the cost
in complexity very high. If we can get single-threaded splitting of
rows to be >4GB/s, which should very well be attainable, the rest of
the COPY work is going to dominate the time. We shouldn't add
complexity to parallelize more of the line splitting, care too much
about scalable datastructures, etc., when after some straightforward
optimization the bottleneck will usually still be in the
already-parallelized part.

I'd expect that for now we'd likely hit scalability issues in other
parts of the system first (e.g. extension locks, buffer mapping).

Greetings,

Andres Freund
Hi,

On 2020-04-15 12:05:47 +0300, Ants Aasma wrote:
> I see the benefit of having one process responsible for splitting as
> being able to run ahead of the workers to queue up work when many of
> them need new data at the same time.

Yea, I agree.

> I don't think the locking benefits of a ring are important in this
> case. At current rather conservative chunk sizes we are looking at
> ~100k chunks per second at best, normal locking should be perfectly
> adequate. And chunk size can easily be increased. I see the main value
> in it being simple.

I think the locking benefit of not needing to hold a lock *while*
splitting (as we'd need in some proposals floated earlier) is likely
to already be beneficial. I don't think we need to worry about lock
scalability protecting the queue of already-split data, for now.

I don't think we really want to have a much larger chunk size, btw.
That makes it more likely for the data handed to different workers to
take an uneven amount of time to process.

> But there is a point that having a layer of indirection instead of a
> linear buffer allows for some workers to fall behind.

Yea. It'd probably make sense to read the input data into an array of
evenly sized blocks, and have the datastructure (still think a
ringbuffer makes sense) of split boundaries point into those entries.
If we don't require the input blocks to be in-order in that array, we
can reuse blocks therein that are fully processed, even if "earlier"
data in the input has not yet been fully processed.

> With a ring buffer reading has to wait on the slowest worker reading
> its chunk.

To be clear, I was only thinking of using a ringbuffer to indicate
split boundaries, and workers would just pop entries from it before
they actually process the data (stored outside of the ringbuffer).
Since the split boundaries will always be read in order by workers,
and the entries will be tiny, there's no need to avoid copying out
entries.

So basically what I was thinking we *eventually* may want (I'd forgo
some of this initially) is something like:

struct InputBlock
{
    uint32 unprocessed_chunk_parts;
    uint32 following_block;
    char   data[INPUT_BLOCK_SIZE];
};

// array of input data, with > 2*nworkers entries
InputBlock *input_blocks;

struct ChunkedInputBoundary
{
    uint32 firstblock;
    uint32 startoff;
};

struct ChunkedInputBoundaries
{
    uint32 read_pos;
    uint32 write_end;
    ChunkedInputBoundary ring[RINGSIZE];
};

Where the leader would read data into InputBlocks with
unprocessed_chunk_parts == 0. Then it'd split the read input data into
chunks (presumably with chunk size << input block size), putting
identified chunks into ChunkedInputBoundaries. For each
ChunkedInputBoundary it'd increment the unprocessed_chunk_parts of
each InputBlock containing parts of the chunk. For chunks spanning
more than one InputBlock, each InputBlock's following_block would be
set accordingly.

Workers would just pop an entry from the ringbuffer (making that entry
reusable), and process the chunk. The underlying data would not be
copied out of the InputBlocks, but obviously readers would need to
take care to handle InputBlock boundaries. Whenever a chunk is fully
read, or when crossing an InputBlock boundary, the InputBlock's
unprocessed_chunk_parts would be decremented.

Recycling of InputBlocks could probably just be an occasional linear
search for buffers with unprocessed_chunk_parts == 0.

Something roughly like this should not be too complicated to implement.
Unless extremely unlucky (very wide input data spanning many
InputBlocks), a straggling reader would not prevent global progress;
it'd just prevent reuse of the InputBlocks holding data for its chunk
(normally that'd be two InputBlocks, not more).

> Having workers copy the data to a local buffer as the first
> step would reduce the probability of hitting any issues. But still, at
> GB/s rates, hiding a 10ms timeslice of delay would need 10's of
> megabytes of buffer.

Yea. Given the likelihood of blocking on resources (reading in index
data, writing out dirty buffers for reclaim, row locks for uniqueness
checks, extension locks, ...), as well as non-uniform per-row costs
(partial indexes, index splits, ...), I think we ought to try to cope
well with that. IMO/IME it'll be common to see stalls that are much
longer than 10ms for processes that do COPY, even when the system is
not overloaded.

> FWIW. I think just increasing the buffer is good enough - the CPUs
> processing this workload are likely to have tens to hundreds of
> megabytes of cache on board.

It'll not necessarily be a cache shared between leader / workers
though, and some of the cache-to-cache transfers will be more
expensive even within a socket (between core complexes for AMD,
multi-chip processors for Intel).

Greetings,

Andres Freund
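[To make the consumer side of this design concrete, here is one way a worker step might look under the structs above. Everything here is an assumption layered on the mail: unprocessed_chunk_parts is taken to be a pg_atomic_uint32 (port/atomics.h), the claim of read_pos is shown as a plain increment where a real version needs a lock or CAS, and process_chunk_bytes() is an invented stand-in for the actual tuple-forming work:]

    /* assumes the InputBlock / ChunkedInputBoundaries definitions above,
     * with unprocessed_chunk_parts declared as pg_atomic_uint32 */

    static void
    worker_consume_one(ChunkedInputBoundaries *bounds, InputBlock *blocks)
    {
        ChunkedInputBoundary b;
        uint32      blkno;
        uint32      off;

        /* entries are tiny and consumed in order, so copy one out */
        b = bounds->ring[bounds->read_pos % RINGSIZE];
        bounds->read_pos++;             /* really an atomic/locked claim */

        blkno = b.firstblock;
        off = b.startoff;

        for (;;)
        {
            InputBlock *blk = &blocks[blkno];
            bool        chunk_done;

            /* parse/insert the part of this chunk living in this block */
            chunk_done = process_chunk_bytes(blk->data + off,
                                             INPUT_BLOCK_SIZE - off);

            /* release our reference so the leader can recycle the block */
            pg_atomic_sub_fetch_u32(&blk->unprocessed_chunk_parts, 1);

            if (chunk_done)
                break;
            blkno = blk->following_block;   /* chunk continues elsewhere */
            off = 0;
        }
    }

[Recycling then matches the mail's description: the leader reuses any InputBlock whose counter has dropped back to zero.]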
On Wed, Apr 15, 2020 at 10:45 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-04-15 20:36:39 +0530, Kuntal Ghosh wrote:
> > I was thinking from this point of view - the sooner we introduce
> > parallelism in the process, the greater the benefits.
>
> I don't really agree. Sure, that's true from a theoretical
> perspective, but the incremental gains may be very small, and the cost
> in complexity very high. If we can get single-threaded splitting of
> rows to be >4GB/s, which should very well be attainable, the rest of
> the COPY work is going to dominate the time. We shouldn't add
> complexity to parallelize more of the line splitting, care too much
> about scalable datastructures, etc., when after some straightforward
> optimization the bottleneck will usually still be in the
> already-parallelized part.
>
> I'd expect that for now we'd likely hit scalability issues in other
> parts of the system first (e.g. extension locks, buffer mapping).

Got your point. In this particular case, a single producer is fast
enough (or probably we can make it fast enough) to generate enough
chunks for multiple consumers so that they don't sit idle waiting for
work.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Apr 15, 2020 at 11:49 PM Andres Freund <andres@anarazel.de> wrote:
>
> To be clear, I was only thinking of using a ringbuffer to indicate
> split boundaries, and workers would just pop entries from it before
> they actually process the data (stored outside of the ringbuffer).
> Since the split boundaries will always be read in order by workers,
> and the entries will be tiny, there's no need to avoid copying out
> entries.

I think the binary mode processing will be slightly different because,
unlike the text and csv formats, the data is stored in (length, value)
format for each column and there are no line markers. I don't think
there will be a big difference, but still, we need to keep track
somewhere of the format of the data in the ring buffers. Basically, we
can copy the data in (length, value) format, and once the workers know
about the format, they will parse the data appropriately. We currently
also have a different way of parsing the binary format; see
NextCopyFrom. I think we need to be careful about avoiding duplicate
work as much as possible.

Apart from this, we have analyzed the other cases, as mentioned below,
where we need to decide whether we can allow parallelism for the copy
command.

Case-1:
Do we want to enable parallelism for a copy when transition tables are
involved? Basically, during the copy, we capture tuples in transition
tables for certain cases, like when an after statement trigger
accesses the same relation on which we have the trigger. See the
example below [1]. We decide this in the function
MakeTransitionCaptureState. For such cases, we collect minimal tuples
in the tuple store after processing them so that later after statement
triggers can access them. Now, if we want to enable parallelism for
such cases, we instead need to store and access tuples from a shared
tuple store (sharedtuplestore.c/sharedtuplestore.h). However, it
doesn't have the facility to store tuples in memory, so we would
always need to store and access them from a file, which could be
costly, unless we also add a way to store minimal tuples in shared
memory up to work_mem and only then spill to the shared tuple store.
It is possible to do all or part of this work to enable parallel copy
for such cases, but I am not sure if it is worth it. We can decide not
to enable parallelism for such cases and later allow it if we see
demand; that will also help us avoid additional work/complexity in the
first version of the patch.

Case-2:
The single insertion mode (CIM_SINGLE) is used in various scenarios,
and whether we can allow parallelism for those depends on the case, as
discussed below:

a. When there are BEFORE/INSTEAD OF triggers on the table. We don't
allow multi-inserts in such cases because such triggers might query
the table we're inserting into and act differently if the tuples that
have already been processed and prepared for insertion are not there.
Now, if we allow parallelism with such triggers, the behavior would
depend on whether a parallel worker has already inserted that
particular row or not. I guess such functions should ideally be marked
as parallel-unsafe. So, in short, whether to allow parallelism in this
case depends on the parallel-safety marking of the trigger function.

b. For partitioned tables, we can't support multi-inserts when there
are any statement-level insert triggers. This is because, as of now,
we expect that any before row insert and statement-level insert
triggers are on the same relation.
Now, there is no harm in allowing parallelism for such cases, but it
depends upon whether we have the infrastructure (basically, allowing
tuples to be collected in a shared tuple store) to support
statement-level insert triggers.

c. For inserts into foreign tables. We can't allow parallelism in this
case because each worker would need to establish an FDW connection and
operate in a separate transaction. Unless we have the capability to
provide a two-phase commit protocol for "Transactions involving
multiple postgres foreign servers" (which is being discussed in a
separate thread [2]), we can't allow this.

d. If there are volatile default expressions or the WHERE clause
contains a volatile expression. Here, we can check whether the
expression is parallel-safe, and if so, allow parallelism.

Case-3:
In the copy command, for performing foreign key checks, we take a KEY
SHARE lock on the primary key table rows, which in turn will increment
the command counter and update the snapshot. Now, as we share the
snapshot at the beginning of the command, we can't allow it to be
changed later. So, unless we do something special for it, I think we
can't allow parallelism in such cases.

I couldn't think of many problems if we allow parallelism in such
cases. One inconsistency, if we allow FK checks via workers, would be
that at the end of COPY the value of the command counter will not be
what we expect, as we wouldn't have accounted for the increments from
the workers. Now, if COPY is being done inside a transaction, the
subsequent commands would not be assigned the correct values. Also,
for executing deferred triggers, we use the transaction snapshot, so
if anything is changed in the snapshot via parallel workers, ideally
the changed snapshot should have been synced back from the workers.
Another concern could be that different workers can try to acquire a
KEY SHARE lock on the same tuples, which they will be able to acquire
due to group locking or otherwise, but I don't see any problem with
it. I am not sure if any of the above leads to a user-visible problem,
but I might be missing something. I think if we can think of any real
problems, we can try to design a better solution to address them.

Case-4:
For deferred triggers, it seems we record CTIDs of tuples (via
ExecARInsertTriggers->AfterTriggerSaveEvent) and then execute the
deferred triggers at transaction end using AfterTriggerFireDeferred,
or at the end of the statement. The challenge in allowing parallelism
for such cases is that we need to capture the CTID events in shared
memory. For that, we would either need to invent new infrastructure
for event capturing in shared memory, which would be a huge task on
its own, or get the CTIDs to the leader via shared memory and have the
leader add them to the event queues - but then we need to ensure the
order of the CTIDs (basically, it should be the same order in which we
have processed them).

[1] -
create or replace function dump_insert() returns trigger language plpgsql as
$$
  begin
    raise notice 'trigger = %, new table = %',
                 TG_NAME,
                 (select string_agg(new_table::text, ', ' order by a)
                  from new_table);
    return null;
  end;
$$;
create table test (a int);
create trigger trg1_test after insert on test referencing new table as
new_table for each statement execute procedure dump_insert();
copy test (a) from stdin;
1
2
3
\.

[2] - https://www.postgresql.org/message-id/20191206.173215.1818665441859410805.horikyota.ntt%40gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
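[For context on Case-3: the command counter increments come from the row lock the RI trigger takes for each inserted FK row. Schematically - with illustrative table and column names, mirroring the shape of the query ri_triggers.c builds - the check it runs against the PK table is:]

    -- what the FK check for one inserted row effectively executes
    SELECT 1 FROM ONLY pk_table x
        WHERE refindex = $1      -- the inserted row's FK value
        FOR KEY SHARE OF x;

[It is this FOR KEY SHARE row lock, taken once per inserted row, that bumps the command counter under discussion.]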
I wonder why you're still looking at this instead of looking at just
speeding up the current code, especially the line splitting, per the
previous discussion, and then coming back to study this issue more
after that's done.

On Mon, May 11, 2020 at 8:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Apart from this, we have analyzed the other cases, as mentioned below,
> where we need to decide whether we can allow parallelism for the copy
> command.
> Case-1:
> Do we want to enable parallelism for a copy when transition tables are
> involved?

I think it would be OK not to support this.

> Case-2:
> a. When there are BEFORE/INSTEAD OF triggers on the table.
> b. For partitioned tables, we can't support multi-inserts when there
> are any statement-level insert triggers.
> c. For inserts into foreign tables.
> d. If there are volatile default expressions or the WHERE clause
> contains a volatile expression. Here, we can check whether the
> expression is parallel-safe, and if so, allow parallelism.

This all sounds fine.

> Case-3:
> In the copy command, for performing foreign key checks, we take a KEY
> SHARE lock on the primary key table rows, which in turn will increment
> the command counter and update the snapshot. Now, as we share the
> snapshot at the beginning of the command, we can't allow it to be
> changed later. So, unless we do something special for it, I think we
> can't allow parallelism in such cases.

This sounds like much more of a problem to me; it'd be a significant
restriction that would kick in in routine cases where the user isn't
doing anything particularly exciting. The command counter presumably
only needs to be updated once per command, so maybe we could do that
before we start parallelism. However, I think we would need to have
some kind of dynamic shared memory structure to which new combo CIDs
can be added by any member of the group, and then discovered by other
members of the group later. At the end of the parallel operation, the
leader must discover any combo CIDs added by others to that table
before destroying it, even if it has no immediate use for the
information. We can't allow a situation where the group members have
inconsistent notions of which combo CIDs exist or what their mappings
are, and if KEY SHARE locks are being taken, new combo CIDs could be
created.

> Case-4:
> For deferred triggers, it seems we record CTIDs of tuples (via
> ExecARInsertTriggers->AfterTriggerSaveEvent) and then execute the
> deferred triggers at transaction end using AfterTriggerFireDeferred,
> or at the end of the statement.

I think this could be left for the future.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 11, 2020 at 11:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I wonder why you're still looking at this instead of looking at just
> speeding up the current code, especially the line splitting,

Because the line splitting is just 1-2% of the overall work in common
cases. See the data shared by Vignesh for various workloads [1]. The
time it takes is approximately in the range of 0.5-12%, and for cases
like a table with a few indexes, it is not more than 1-2%.

[1] - https://www.postgresql.org/message-id/CALDaNm3r8cPsk0Vo_-6AXipTrVwd0o9U2S0nCmRdku1Dn-Tpqg%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 11, 2020 at 11:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > Case-3:
> > In the copy command, for performing foreign key checks, we take a KEY
> > SHARE lock on the primary key table rows, which in turn will
> > increment the command counter and update the snapshot. Now, as we
> > share the snapshot at the beginning of the command, we can't allow it
> > to be changed later. So, unless we do something special for it, I
> > think we can't allow parallelism in such cases.
>
> This sounds like much more of a problem to me; it'd be a significant
> restriction that would kick in in routine cases where the user isn't
> doing anything particularly exciting. The command counter presumably
> only needs to be updated once per command, so maybe we could do that
> before we start parallelism. However, I think we would need to have
> some kind of dynamic shared memory structure to which new combo CIDs
> can be added by any member of the group, and then discovered by other
> members of the group later. At the end of the parallel operation, the
> leader must discover any combo CIDs added by others to that table
> before destroying it, even if it has no immediate use for the
> information. We can't allow a situation where the group members have
> inconsistent notions of which combo CIDs exist or what their mappings
> are, and if KEY SHARE locks are being taken, new combo CIDs could be
> created.

AFAIU, we don't generate combo CIDs for this case. See the below code
in heap_lock_tuple():

/*
 * Store transaction information of xact locking the tuple.
 *
 * Note: Cmax is meaningless in this context, so don't set it; this avoids
 * possibly generating a useless combo CID.  Moreover, if we're locking a
 * previously updated tuple, it's important to preserve the Cmax.
 *
 * Also reset the HOT UPDATE bit, but only if there's no update; otherwise
 * we would break the HOT chain.
 */
tuple->t_data->t_infomask &= ~HEAP_XMAX_BITS;
tuple->t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
tuple->t_data->t_infomask |= new_infomask;
tuple->t_data->t_infomask2 |= new_infomask2;

I don't understand why we need to do something special for combo CIDs
if they are not generated during this operation.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, May 12, 2020 at 1:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I don't understand why we need to do something special for combo CIDs > if they are not generated during this operation? Hmm. Well I guess if they're not being generated then we don't need to do anything about them, but I still think we should try to work around having to disable parallelism for a table which is referenced by foreign keys. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 14, 2020 at 12:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 1:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I don't understand why we need to do something special for combo CIDs
> > if they are not generated during this operation.
>
> Hmm. Well I guess if they're not being generated then we don't need to
> do anything about them, but I still think we should try to work around
> having to disable parallelism for a table which is referenced by
> foreign keys.

Okay, just to be clear, we want to allow parallelism for a table that
has foreign keys; basically, a parallel copy should work while loading
data into tables having FK references. To support that, we need to
consider a few things.

a. Currently, we increment the command counter each time we take a KEY
SHARE lock on a tuple during trigger execution. I am really not sure
whether this is required during copy command execution, or whether we
can just increment it once for the whole copy. If we need to increment
the command counter just once for the copy command, then for parallel
copy we can ensure that we do it just once at the end of the parallel
copy; if not, we might need some special handling.

b. Another point is that after inserting rows we record the CTIDs of
the tuples in the event queue, and then once all tuples are processed
we call the FK trigger for each CTID. Now, with parallelism, the FK
checks will be processed as soon as a worker has processed one chunk.
I don't see any problem with it, but still, this will be a bit
different from what we do in the serial case. Do you see any problem
with this?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, May 14, 2020 at 11:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Okay, just to be clear, we want to allow parallelism for a table that
> has foreign keys; basically, a parallel copy should work while loading
> data into tables having FK references. To support that, we need to
> consider a few things.
>
> a. Currently, we increment the command counter each time we take a KEY
> SHARE lock on a tuple during trigger execution. I am really not sure
> whether this is required during copy command execution, or whether we
> can just increment it once for the whole copy. If we need to increment
> the command counter just once for the copy command, then for parallel
> copy we can ensure that we do it just once at the end of the parallel
> copy; if not, we might need some special handling.
>
> b. Another point is that after inserting rows we record the CTIDs of
> the tuples in the event queue, and then once all tuples are processed
> we call the FK trigger for each CTID. Now, with parallelism, the FK
> checks will be processed as soon as a worker has processed one chunk.
> I don't see any problem with it, but still, this will be a bit
> different from what we do in the serial case. Do you see any problem
> with this?

IMHO, it should not be a problem, because without parallelism, too, we
trigger the foreign key check when we detect EOF / end of data from
STDIN. And with parallel workers, a worker will assume that it has
completed all its work and can go for the foreign key check only after
the leader receives EOF / end of data from STDIN. The only difference
is that each worker is not waiting for all the data (from all the
workers) to get inserted before checking the constraint. Moreover, we
are not supporting external triggers with the parallel copy;
otherwise, we might have to worry that those triggers could do
something on the primary table before we check the constraint. I am
not sure if there are any other factors that I am missing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, May 14, 2020 at 2:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > To support that, we need to consider a few things. > a. Currently, we increment the command counter each time we take a key > share lock on a tuple during trigger execution. I am really not sure > if this is required during Copy command execution or we can just > increment it once for the copy. If we need to increment the command > counter just once for copy command then for the parallel copy we can > ensure that we do it just once at the end of the parallel copy but if > not then we might need some special handling. My sense is that it would be a lot more sensible to do it at the *beginning* of the parallel operation. Once we do it once, we shouldn't ever do it again; that's how it works now. Deferring it until later seems much more likely to break things. > b. Another point is that after inserting rows we record CTIDs of the > tuples in the event queue and then once all tuples are processed we > call FK trigger for each CTID. Now, with parallelism, the FK checks > will be processed once the worker processed one chunk. I don't see > any problem with it but still, this will be a bit different from what > we do in serial case. Do you see any problem with this? I think there could be some problems here. For instance, suppose that there are two entries for different workers for the same CTID. If the leader were trying to do all the work, they'd be handled consecutively. If they were from completely unrelated processes, locking would serialize them. But group locking won't, so there you have an issue, I think. Also, it's not ideal from a work-distribution perspective: one worker could finish early and be unable to help the others. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, May 15, 2020 at 1:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, May 14, 2020 at 2:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > To support that, we need to consider a few things.
> > a. Currently, we increment the command counter each time we take a
> > KEY SHARE lock on a tuple during trigger execution. I am really not
> > sure whether this is required during copy command execution, or
> > whether we can just increment it once for the whole copy. If we need
> > to increment the command counter just once for the copy command, then
> > for parallel copy we can ensure that we do it just once at the end of
> > the parallel copy; if not, we might need some special handling.
>
> My sense is that it would be a lot more sensible to do it at the
> *beginning* of the parallel operation. Once we do it once, we
> shouldn't ever do it again; that's how it works now. Deferring it
> until later seems much more likely to break things.

AFAIU, we always increment the command counter after executing the
command. Why do we want to do it differently here?

> > b. Another point is that after inserting rows we record the CTIDs of
> > the tuples in the event queue, and then once all tuples are processed
> > we call the FK trigger for each CTID. Now, with parallelism, the FK
> > checks will be processed as soon as a worker has processed one chunk.
> > I don't see any problem with it, but still, this will be a bit
> > different from what we do in the serial case. Do you see any problem
> > with this?
>
> I think there could be some problems here. For instance, suppose that
> there are two entries for different workers for the same CTID.

First, let me clarify: the CTIDs I have used in my email are for the
table into which the insertion is happening, which means the FK table.
So, in such a case, we can't have the same CTIDs queued for different
workers. Basically, we use the CTID to fetch the row from the FK table
later and form a query to lock (in KEY SHARE mode) the corresponding
tuple in the PK table. Now, it is possible that two different workers
try to lock the same row of the PK table. I am not clear what problem
group locking can have in this case, because these are non-conflicting
locks. Can you please elaborate a bit more?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, May 15, 2020 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > My sense is that it would be a lot more sensible to do it at the > > *beginning* of the parallel operation. Once we do it once, we > > shouldn't ever do it again; that's how it works now. Deferring it > > until later seems much more likely to break things. > > AFAIU, we always increment the command counter after executing the > command. Why do we want to do it differently here? Hmm, now I'm starting to think that I'm confused about what is under discussion here. Which CommandCounterIncrement() are we talking about here? > First, let me clarify the CTID I have used in my email are for the > table in which insertion is happening which means FK table. So, in > such a case, we can't have the same CTIDs queued for different > workers. Basically, we use CTID to fetch the row from FK table later > and form a query to lock (in KEY SHARE mode) the corresponding tuple > in PK table. Now, it is possible that two different workers try to > lock the same row of PK table. I am not clear what problem group > locking can have in this case because these are non-conflicting locks. > Can you please elaborate a bit more? I'm concerned about two workers trying to take the same lock at the same time. If that's prevented by the buffer locking then I think it's OK, but if it's prevented by a heavyweight lock then it's not going to work in this case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, May 15, 2020 at 6:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > My sense is that it would be a lot more sensible to do it at the
> > > *beginning* of the parallel operation. Once we do it once, we
> > > shouldn't ever do it again; that's how it works now. Deferring it
> > > until later seems much more likely to break things.
> >
> > AFAIU, we always increment the command counter after executing the
> > command. Why do we want to do it differently here?
>
> Hmm, now I'm starting to think that I'm confused about what is under
> discussion here. Which CommandCounterIncrement() are we talking about
> here?

The one we do after executing a non-readonly command. Let me try to
explain by example:

CREATE TABLE tab_fk_referenced_chk(refindex INTEGER PRIMARY KEY,
    height real, weight real);
insert into tab_fk_referenced_chk values( 1, 1.1, 100);
CREATE TABLE tab_fk_referencing_chk(index INTEGER REFERENCES
    tab_fk_referenced_chk(refindex), height real, weight real);

COPY tab_fk_referencing_chk(index,height,weight) FROM stdin WITH(
    DELIMITER ',');
1,1.1,100
1,2.1,200
1,3.1,300
\.

In the above case, even though we are executing a single command from
the user's perspective, the currentCommandId will be four after the
command. One increment will be for the copy command, and the other
three increments are for locking the tuple in the PK table
(tab_fk_referenced_chk) for the three tuples in the FK table
(tab_fk_referencing_chk). Now, for parallel workers, it is
(theoretically) possible that the three tuples are processed by three
different workers, which don't get synced as of now. The question was:
do we see any kind of problem with this, and if so, can we just sync
it up at the end of parallelism?

> > First, let me clarify: the CTIDs I have used in my email are for the
> > table into which the insertion is happening, which means the FK
> > table. So, in such a case, we can't have the same CTIDs queued for
> > different workers. Basically, we use the CTID to fetch the row from
> > the FK table later and form a query to lock (in KEY SHARE mode) the
> > corresponding tuple in the PK table. Now, it is possible that two
> > different workers try to lock the same row of the PK table. I am not
> > clear what problem group locking can have in this case, because these
> > are non-conflicting locks. Can you please elaborate a bit more?
>
> I'm concerned about two workers trying to take the same lock at the
> same time. If that's prevented by the buffer locking then I think it's
> OK, but if it's prevented by a heavyweight lock then it's not going to
> work in this case.

We do take the buffer lock in exclusive mode before trying to acquire
a KEY SHARE lock on the tuple, so two workers shouldn't be able to try
to acquire it at the same time. I think you are trying to see whether
there is any case where two workers try to acquire a heavyweight lock,
like a tuple lock or something similar, to perform the operation; that
would create a problem because, due to group locking, an operation
would be allowed that should not have been. But I don't think anything
of that sort is feasible in the COPY operation, and if it is, then we
probably need to carefully block it or find some solution for it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi.
We have made a patch along the lines discussed in the previous mails. We could achieve up to a 9.87X performance improvement; the improvement varies from case to case.
Workers / exec time (seconds) | copy from file, 2 indexes on integer columns + 1 index on text column | copy from stdin, 2 indexes on integer columns + 1 index on text column | copy from file, 1 gist index on text column | copy from file, 3 indexes on integer columns | copy from stdin, 3 indexes on integer columns |
0 | 1162.772(1X) | 1176.035(1X) | 827.669(1X) | 216.171(1X) | 217.376(1X) |
1 | 1110.288(1.05X) | 1120.556(1.05X) | 747.384(1.11X) | 174.242(1.24X) | 163.492(1.33X) |
2 | 635.249(1.83X) | 668.18(1.76X) | 435.673(1.9X) | 133.829(1.61X) | 126.516(1.72X) |
4 | 336.835(3.45X) | 346.768(3.39X) | 236.406(3.5X) | 105.767(2.04X) | 107.382(2.02X) |
8 | 188.577(6.17X) | 194.491(6.04X) | 148.962(5.56X) | 100.708(2.15X) | 107.72(2.01X) |
16 | 126.819(9.17X) | 146.402(8.03X) | 119.923(6.9X) | 97.996(2.2X) | 106.531(2.04X) |
20 | 117.845(9.87X) | 149.203(7.88X) | 138.741(5.96X) | 97.94(2.21X) | 107.5(2.02X) |
30 | 127.554(9.11X) | 161.218(7.29X) | 172.443(4.8X) | 98.232(2.2X) | 108.778(1.99X) |
Posting the initial patch to get feedback.
Design of the Parallel Copy: The backend to which the "COPY FROM" query is submitted acts as the leader, with the responsibility of reading data from the file/stdin and launching at most n workers, as specified with the PARALLEL 'n' option in the "COPY FROM" query. The leader populates the common data required for the workers' execution in the DSM and shares it with the workers. The leader then executes any before statement triggers, if they exist.

The leader populates the DSM chunk entries, which include the start offset and chunk size; while populating the chunks, it reads as many blocks as required from the file into the DSM data blocks. Each block is 64K in size. The leader parses the data to identify a chunk; the existing logic from CopyReadLineText, which identifies the chunks, was used for this with some changes. The leader checks whether a free chunk entry is available to copy the information into; if there is no free entry, it waits till the required entry is freed up by a worker, and then copies the identified chunk's information (offset & chunk size) into the DSM chunk entries. This process is repeated till the complete file is processed.

Simultaneously, the workers cache the chunks (50 at a time) into local memory and release the entries to the leader for further populating. Each worker processes the chunks that it cached and inserts them into the table. The leader waits till all the populated chunks are processed by the workers, and then exits.
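[By way of illustration, an invocation would presumably look like the following; the exact option spelling is an assumption based on the description above, and the table/file names are made up:]

    -- load a file using up to 4 parallel workers (illustrative names)
    COPY orders FROM '/data/orders.csv' WITH (FORMAT csv, PARALLEL '4');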
In the future, we would like to add support for parallel copy on tables with referential integrity constraints, and for parallelizing copy from binary format files.
The above-mentioned tests were run with CSV format, a file size of 5.1GB, and 10 million records in the table. The postgres configuration and the system configuration used are attached in config.txt.
This patch was developed by me and one of my colleagues, Bharath. We would like to thank Amit, Dilip, Robert, Andres, Ants, Kuntal, Alastair, Tomas, David, Thomas, Andrew & Kyotaro for their thoughts/discussions/suggestions.
Thoughts?
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 18, 2020 at 12:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> In the above case, even though we are executing a single command from
> the user's perspective, the currentCommandId will be four after the
> command. One increment will be for the copy command, and the other
> three increments are for locking the tuple in the PK table
> (tab_fk_referenced_chk) for the three tuples in the FK table
> (tab_fk_referencing_chk). Now, for parallel workers, it is
> (theoretically) possible that the three tuples are processed by three
> different workers, which don't get synced as of now. The question was:
> do we see any kind of problem with this, and if so, can we just sync
> it up at the end of parallelism?

I strongly disagree with the idea of "just sync(ing) it up at the end
of parallelism". That seems like a completely unprincipled approach to
the problem. Either the command counter increment is important or it's
not. If it's not important, maybe we can arrange to skip it in the
first place. If it is important, then it's probably not OK for each
backend to be doing it separately.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-06-03 12:13:14 -0400, Robert Haas wrote:
> On Mon, May 18, 2020 at 12:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > In the above case, even though we are executing a single command from
> > the user's perspective, the currentCommandId will be four after the
> > command. One increment will be for the copy command, and the other
> > three increments are for locking the tuple in the PK table
> > (tab_fk_referenced_chk) for the three tuples in the FK table
> > (tab_fk_referencing_chk). Now, for parallel workers, it is
> > (theoretically) possible that the three tuples are processed by three
> > different workers, which don't get synced as of now. The question was:
> > do we see any kind of problem with this, and if so, can we just sync
> > it up at the end of parallelism?

> I strongly disagree with the idea of "just sync(ing) it up at the end
> of parallelism". That seems like a completely unprincipled approach to
> the problem. Either the command counter increment is important or it's
> not. If it's not important, maybe we can arrange to skip it in the
> first place. If it is important, then it's probably not OK for each
> backend to be doing it separately.

That scares me too. These command counter increments definitely aren't
unnecessary in the general case. Even in the example you share above,
aren't we potentially going to actually lock rows multiple times from
within the same transaction, instead of once?

If the command counter increments from within ri_triggers.c aren't
visible to the other parallel workers / the leader, we'll not
correctly understand that a locked row is invisible to
heap_lock_tuple, because we're not using a new enough snapshot (by
dint of not having a new enough cid). I've not dug through everything
that'd potentially break, but it seems pretty clearly a no-go from
here.

Greetings,

Andres Freund
Hi,

On 2020-06-03 15:53:24 +0530, vignesh C wrote:
> Workers / exec time (seconds) | copy from file, 2 indexes on integer
> columns + 1 index on text column | copy from stdin, 2 indexes on
> integer columns + 1 index on text column | copy from file, 1 gist
> index on text column | copy from file, 3 indexes on integer columns |
> copy from stdin, 3 indexes on integer columns |
> 0 | 1162.772(1X) | 1176.035(1X) | 827.669(1X) | 216.171(1X) | 217.376(1X) |
> 1 | 1110.288(1.05X) | 1120.556(1.05X) | 747.384(1.11X) | 174.242(1.24X) | 163.492(1.33X) |
> 2 | 635.249(1.83X) | 668.18(1.76X) | 435.673(1.9X) | 133.829(1.61X) | 126.516(1.72X) |
> 4 | 336.835(3.45X) | 346.768(3.39X) | 236.406(3.5X) | 105.767(2.04X) | 107.382(2.02X) |
> 8 | 188.577(6.17X) | 194.491(6.04X) | 148.962(5.56X) | 100.708(2.15X) | 107.72(2.01X) |
> 16 | 126.819(9.17X) | 146.402(8.03X) | 119.923(6.9X) | 97.996(2.2X) | 106.531(2.04X) |
> 20 | 117.845(9.87X) | 149.203(7.88X) | 138.741(5.96X) | 97.94(2.21X) | 107.5(2.02X) |
> 30 | 127.554(9.11X) | 161.218(7.29X) | 172.443(4.8X) | 98.232(2.2X) | 108.778(1.99X) |

Hm. You don't explicitly mention it in your design, but given how
small the benefit of going from 0 to 1 workers is, I assume the leader
doesn't do any "chunk processing" on its own?

> Design of the Parallel Copy: The backend to which the "COPY FROM"
> query is submitted acts as the leader, with the responsibility of
> reading data from the file/stdin and launching at most n workers, as
> specified with the PARALLEL 'n' option in the "COPY FROM" query. The
> leader populates the common data required for the workers' execution
> in the DSM and shares it with the workers. The leader then executes
> any before statement triggers, if they exist.
>
> The leader populates the DSM chunk entries, which include the start
> offset and chunk size; while populating the chunks, it reads as many
> blocks as required from the file into the DSM data blocks. Each block
> is 64K in size. The leader parses the data to identify a chunk; the
> existing logic from CopyReadLineText, which identifies the chunks,
> was used for this with some changes. The leader checks whether a free
> chunk entry is available to copy the information into; if there is no
> free entry, it waits till the required entry is freed up by a worker,
> and then copies the identified chunk's information (offset & chunk
> size) into the DSM chunk entries. This process is repeated till the
> complete file is processed.
>
> Simultaneously, the workers cache the chunks (50 at a time) into
> local memory and release the entries to the leader for further
> populating. Each worker processes the chunks that it cached and
> inserts them into the table. The leader waits till all the populated
> chunks are processed by the workers, and then exits.

Why do we need the local copy of 50 chunks? Copying memory around is
far from free. I don't see why it'd be better to add per-process
caching, rather than making the DSM bigger? I can see some benefit in
marking multiple chunks as being processed with one lock acquisition,
but I don't think adding a memory copy is a good idea.

This patch *desperately* needs to be split up. It imo is close to
unreviewable, due to a large amount of changes that just move code
around without other functional changes being mixed in with the actual
new stuff.

> /*
> + * State of the chunk.
> + */ > +typedef enum ChunkState > +{ > + CHUNK_INIT, /* initial state of chunk */ > + CHUNK_LEADER_POPULATING, /* leader processing chunk */ > + CHUNK_LEADER_POPULATED, /* leader completed populating chunk */ > + CHUNK_WORKER_PROCESSING, /* worker processing chunk */ > + CHUNK_WORKER_PROCESSED /* worker completed processing chunk */ > +}ChunkState; > + > +#define RAW_BUF_SIZE 65536 /* we palloc RAW_BUF_SIZE+1 bytes */ > + > +#define DATA_BLOCK_SIZE RAW_BUF_SIZE > +#define RINGSIZE (10 * 1000) > +#define MAX_BLOCKS_COUNT 1000 > +#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */ > + > +#define IsParallelCopy() (cstate->is_parallel) > +#define IsLeader() (cstate->pcdata->is_leader) > +#define IsHeaderLine() (cstate->header_line && cstate->cur_lineno == 1) > + > +/* > + * Copy data block information. > + */ > +typedef struct CopyDataBlock > +{ > + /* The number of unprocessed chunks in the current block. */ > + pg_atomic_uint32 unprocessed_chunk_parts; > + > + /* > + * If the current chunk data is continued into another block, > + * following_block will have the position where the remaining data need to > + * be read. > + */ > + uint32 following_block; > + > + /* > + * This flag will be set, when the leader finds out this block can be read > + * safely by the worker. This helps the worker to start processing the chunk > + * early where the chunk will be spread across many blocks and the worker > + * need not wait for the complete chunk to be processed. > + */ > + bool curr_blk_completed; > + char data[DATA_BLOCK_SIZE + 1]; /* data read from file */ > +}CopyDataBlock; What's the + 1 here about? > +/* > + * Parallel copy line buffer information. > + */ > +typedef struct ParallelCopyLineBuf > +{ > + StringInfoData line_buf; > + uint64 cur_lineno; /* line number for error messages */ > +}ParallelCopyLineBuf; Why do we need separate infrastructure for this? We shouldn't duplicate infrastructure unnecessarily. > +/* > + * Common information that need to be copied to shared memory. > + */ > +typedef struct CopyWorkerCommonData > +{ Why is parallel specific stuff here suddenly not named ParallelCopy* anymore? If you introduce a naming like that it imo should be used consistently. > + /* low-level state data */ > + CopyDest copy_dest; /* type of copy source/destination */ > + int file_encoding; /* file or remote side's character encoding */ > + bool need_transcoding; /* file encoding diff from server? */ > + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */ > + > + /* parameters from the COPY command */ > + bool csv_mode; /* Comma Separated Value format? */ > + bool header_line; /* CSV header line? */ > + int null_print_len; /* length of same */ > + bool force_quote_all; /* FORCE_QUOTE *? */ > + bool convert_selectively; /* do selective binary conversion? */ > + > + /* Working state for COPY FROM */ > + AttrNumber num_defaults; > + Oid relid; > +}CopyWorkerCommonData; But I actually think we shouldn't have this information in two different structs. This should exist once, independent of using parallel / non-parallel copy. > +/* List information */ > +typedef struct ListInfo > +{ > + int count; /* count of attributes */ > + > + /* string info in the form info followed by info1, info2... infon */ > + char info[1]; > +} ListInfo; Based on these comments I have no idea what this could be for. > /* > - * This keeps the character read at the top of the loop in the buffer > - * even if there is more than one read-ahead. 
> + * This keeps the character read at the top of the loop in the buffer > + * even if there is more than one read-ahead. > + */ > +#define IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(extralen) \ > +if (1) \ > +{ \ > + if (copy_buff_state.raw_buf_ptr + (extralen) >= copy_buff_state.copy_buf_len && !hit_eof) \ > + { \ > + if (IsParallelCopy()) \ > + { \ > + copy_buff_state.chunk_size = prev_chunk_size; /* update previous chunk size */ \ > + if (copy_buff_state.block_switched) \ > + { \ > + pg_atomic_sub_fetch_u32(©_buff_state.data_blk_ptr->unprocessed_chunk_parts, 1); \ > + copy_buff_state.copy_buf_len = prev_copy_buf_len; \ > + } \ > + } \ > + copy_buff_state.raw_buf_ptr = prev_raw_ptr; /* undo fetch */ \ > + need_data = true; \ > + continue; \ > + } \ > +} else ((void) 0) I think it's an absolutely clear no-go to add new branches to these. They're *really* hot already, and this is going to sprinkle a significant amount of new instructions over a lot of places. > +/* > + * SET_RAWBUF_FOR_LOAD - Set raw_buf to the shared memory where the file data must > + * be read. > + */ > +#define SET_RAWBUF_FOR_LOAD() \ > +{ \ > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \ > + uint32 cur_block_pos; \ > + /* \ > + * Mark the previous block as completed, worker can start copying this data. \ > + */ \ > + if (copy_buff_state.data_blk_ptr != copy_buff_state.curr_data_blk_ptr && \ > + copy_buff_state.data_blk_ptr->curr_blk_completed == false) \ > + copy_buff_state.data_blk_ptr->curr_blk_completed = true; \ > + \ > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \ > + cur_block_pos = WaitGetFreeCopyBlock(pcshared_info); \ > + copy_buff_state.curr_data_blk_ptr = &pcshared_info->data_blocks[cur_block_pos]; \ > + \ > + if (!copy_buff_state.data_blk_ptr) \ > + { \ > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \ > + chunk_first_block = cur_block_pos; \ > + } \ > + else if (need_data == false) \ > + copy_buff_state.data_blk_ptr->following_block = cur_block_pos; \ > + \ > + cstate->raw_buf = copy_buff_state.curr_data_blk_ptr->data; \ > + copy_buff_state.copy_raw_buf = cstate->raw_buf; \ > +} > + > +/* > + * END_CHUNK_PARALLEL_COPY - Update the chunk information in shared memory. > + */ > +#define END_CHUNK_PARALLEL_COPY() \ > +{ \ > + if (!IsHeaderLine()) \ > + { \ > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \ > + ChunkBoundaries *chunkBoundaryPtr = &pcshared_info->chunk_boundaries; \ > + if (copy_buff_state.chunk_size) \ > + { \ > + ChunkBoundary *chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \ > + /* \ > + * If raw_buf_ptr is zero, unprocessed_chunk_parts would have been \ > + * incremented in SEEK_COPY_BUFF_POS. This will happen if the whole \ > + * chunk finishes at the end of the current block. If the \ > + * new_line_size > raw_buf_ptr, then the new block has only new line \ > + * char content. The unprocessed count should not be increased in \ > + * this case. \ > + */ \ > + if (copy_buff_state.raw_buf_ptr != 0 && \ > + copy_buff_state.raw_buf_ptr > new_line_size) \ > + pg_atomic_add_fetch_u32(©_buff_state.curr_data_blk_ptr->unprocessed_chunk_parts, 1); \ > + \ > + /* Update chunk size. 
*/ \ > + pg_atomic_write_u32(&chunkInfo->chunk_size, copy_buff_state.chunk_size); \ > + pg_atomic_write_u32(&chunkInfo->chunk_state, CHUNK_LEADER_POPULATED); \ > + elog(DEBUG1, "[Leader] After adding - chunk position:%d, chunk_size:%d", \ > + chunk_pos, copy_buff_state.chunk_size); \ > + pcshared_info->populated++; \ > + } \ > + else if (new_line_size) \ > + { \ > + /* \ > + * This means only new line char, empty record should be \ > + * inserted. \ > + */ \ > + ChunkBoundary *chunkInfo; \ > + chunk_pos = UpdateBlockInChunkInfo(cstate, -1, -1, 0, \ > + CHUNK_LEADER_POPULATED); \ > + chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \ > + elog(DEBUG1, "[Leader] Added empty chunk with offset:%d, chunk position:%d, chunk size:%d", \ > + chunkInfo->start_offset, chunk_pos, \ > + pg_atomic_read_u32(&chunkInfo->chunk_size)); \ > + pcshared_info->populated++; \ > + } \ > + }\ > + \ > + /*\ > + * All of the read data is processed, reset index & len. In the\ > + * subsequent read, we will get a new block and copy data in to the\ > + * new block.\ > + */\ > + if (copy_buff_state.raw_buf_ptr == copy_buff_state.copy_buf_len)\ > + {\ > + cstate->raw_buf_index = 0;\ > + cstate->raw_buf_len = 0;\ > + }\ > + else\ > + cstate->raw_buf_len = copy_buff_state.copy_buf_len;\ > +} Why are these macros? They are way way way above a length where that makes any sort of sense. Greetings, Andres Freund
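(The reply further down converts these macros to functions. As a hedged illustration of what reviewers ask for here, the general shape of such a conversion is sketched below; BufState is a hypothetical stand-in for the patch's copy_buff_state, not the actual type.)

    #include <stdbool.h>

    /* Hypothetical stand-in for the patch's copy buffer state. */
    typedef struct
    {
        int raw_buf_ptr;
        int copy_buf_len;
    } BufState;

    /*
     * A static inline function instead of a statement macro: arguments
     * are evaluated once, the compiler type-checks them, and the
     * debugger has a symbol to step into, all at no runtime cost.
     */
    static inline bool
    need_refill(const BufState *state, int extralen, bool hit_eof)
    {
        return state->raw_buf_ptr + extralen >= state->copy_buf_len && !hit_eof;
    }

The `continue` in the original macro has to stay at the call site, which is one reason such code often starts life as a macro; the predicate itself still extracts cleanly.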
On Thu, Jun 4, 2020 at 12:09 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-06-03 12:13:14 -0400, Robert Haas wrote:
> > On Mon, May 18, 2020 at 12:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > In the above case, even though we are executing a single command from
> > > the user perspective, but the currentCommandId will be four after the
> > > command. One increment will be for the copy command and the other
> > > three increments are for locking tuple in PK table
> > > (tab_fk_referenced_chk) for three tuples in FK table
> > > (tab_fk_referencing_chk). Now, for parallel workers, it is
> > > (theoretically) possible that the three tuples are processed by three
> > > different workers which don't get synced as of now. The question was
> > > do we see any kind of problem with this and if so can we just sync it
> > > up at the end of parallelism.
>
> > I strongly disagree with the idea of "just sync(ing) it up at the end
> > of parallelism". That seems like a completely unprincipled approach to
> > the problem. Either the command counter increment is important or it's
> > not. If it's not important, maybe we can arrange to skip it in the
> > first place. If it is important, then it's probably not OK for each
> > backend to be doing it separately.
>
> That scares me too. These command counter increments definitely aren't
> unnecessary in the general case.
>

Yeah, this is what we want to understand. Can you explain how they are
useful here? AFAIU, heap_lock_tuple doesn't use commandid while storing
the transaction information of xact while locking the tuple.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,

On 2020-06-04 08:10:07 +0530, Amit Kapila wrote:
> On Thu, Jun 4, 2020 at 12:09 AM Andres Freund <andres@anarazel.de> wrote:
> > > I strongly disagree with the idea of "just sync(ing) it up at the end
> > > of parallelism". That seems like a completely unprincipled approach to
> > > the problem. Either the command counter increment is important or it's
> > > not. If it's not important, maybe we can arrange to skip it in the
> > > first place. If it is important, then it's probably not OK for each
> > > backend to be doing it separately.
> >
> > That scares me too. These command counter increments definitely aren't
> > unnecessary in the general case.
>
> Yeah, this is what we want to understand. Can you explain how they are
> useful here? AFAIU, heap_lock_tuple doesn't use commandid while storing
> the transaction information of xact while locking the tuple.

But the HeapTupleSatisfiesUpdate() call does use it? And even if that
weren't an issue, I don't see how it's defensible to just randomly
break the commandid coherency for parallel copy.

Greetings,

Andres Freund
On Thu, Jun 4, 2020 at 9:10 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-06-04 08:10:07 +0530, Amit Kapila wrote:
> > On Thu, Jun 4, 2020 at 12:09 AM Andres Freund <andres@anarazel.de> wrote:
> > > > I strongly disagree with the idea of "just sync(ing) it up at the end
> > > > of parallelism". That seems like a completely unprincipled approach to
> > > > the problem. Either the command counter increment is important or it's
> > > > not. If it's not important, maybe we can arrange to skip it in the
> > > > first place. If it is important, then it's probably not OK for each
> > > > backend to be doing it separately.
> > >
> > > That scares me too. These command counter increments definitely aren't
> > > unnecessary in the general case.
> >
> > Yeah, this is what we want to understand. Can you explain how they are
> > useful here? AFAIU, heap_lock_tuple doesn't use commandid while storing
> > the transaction information of xact while locking the tuple.
>
> But the HeapTupleSatisfiesUpdate() call does use it?
>

It won't use 'cid' for the lockers or multi-lockers case (AFAICS, there
is special case handling for lockers/multi-lockers). I think it is used
for updates/deletes.

> And even if that weren't an issue, I don't see how it's defensible to
> just randomly break the commandid coherency for parallel copy.
>

At this stage, we are evaluating whether there is any need to increment
the command counter for foreign key checks or whether it is just
happening because we are using some common code to execute the "SELECT
... FOR KEY SHARE" statement during these checks.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi All,
I've spent a little bit of time going through the project discussion that has happened in this email thread, and to start with I have a few questions which I would like to put here:
Q1) Are we also planning to read the input data in parallel, or is it only about performing the multi-insert operation in parallel? AFAIU, the data reading part will be done by the leader process alone, so no parallelism is involved there.
Q2) How are we going to deal with partitioned tables? I mean, will there be some worker process dedicated to each partition, or how is it? Further, the challenge that I see in case of partitioned tables is that we would have a single input file containing data to be inserted into multiple tables (aka partitions), unlike the normal case where all the tuples in the input file would belong to the same table.
Q3) In case of toast tables, there is a possibility of having a single tuple in the input file which could be of a very big size (probably in GB), eventually resulting in a bigger file size. So, in this case, how are we going to decide the number of worker processes to be launched? I mean, although the file size is big, the number of tuples to be processed is just one or a few of them, so can we decide the number of worker processes to be launched based on the file size?
Q4) Who is going to process constraints (particularly deferred constraints) that are supposed to be executed at COMMIT time? I mean, is it the leader process or the worker process, or in such cases won't we be choosing parallelism at all?
Q5) Do we have any risk of table bloating when the data is loaded in parallel? I am just asking this because in case of parallelism there would be multiple processes performing bulk inserts into a table. There is a chance that the table file might get extended even if there is some space in the file being written into, but that space is locked by some other worker process, and hence that might result in the creation of a new block for that table. Sorry if I am missing something here.
Please note that I haven't gone through all the emails in this thread so there is a possibility that I might have repeated the question that has already been raised and answered here. If that is the case, I am sorry for that, but it would be very helpful if someone could point out that thread so that I can go through it. Thank you.
On Fri, Jun 12, 2020 at 11:01 AM vignesh C <vignesh21@gmail.com> wrote:
On Thu, Jun 4, 2020 at 12:44 AM Andres Freund <andres@anarazel.de> wrote
>
>
> Hm. you don't explicitly mention that in your design, but given how
> small the benefits going from 0-1 workers is, I assume the leader
> doesn't do any "chunk processing" on its own?
>
Yes, you are right: the leader does not do any processing. The leader's
work is mainly to populate the shared memory with the offset
information for each record.
>
>
> > Design of the Parallel Copy: The backend, to which the "COPY FROM" query is
> > submitted acts as leader with the responsibility of reading data from the
> > file/stdin, launching at most n number of workers as specified with
> > PARALLEL 'n' option in the "COPY FROM" query. The leader populates the
> > common data required for the workers execution in the DSM and shares it
> > with the workers. The leader then executes before statement triggers if
> > there exists any. Leader populates DSM chunks which includes the start
> > offset and chunk size, while populating the chunks it reads as many blocks
> > as required into the DSM data blocks from the file. Each block is of 64K
> > size. The leader parses the data to identify a chunk, the existing logic
> > from CopyReadLineText which identifies the chunks with some changes was
> > used for this. Leader checks if a free chunk is available to copy the
> > information, if there is no free chunk it waits till the required chunk is
> > freed up by the worker and then copies the identified chunks information
> > (offset & chunk size) into the DSM chunks. This process is repeated till
> > the complete file is processed. Simultaneously, the workers cache the
> > chunks(50) locally into the local memory and release the chunks to the
> > leader for further populating. Each worker processes the chunk that it
> > cached and inserts it into the table. The leader waits till all the chunks
> > populated are processed by the workers and exits.
>
> Why do we need the local copy of 50 chunks? Copying memory around is far
> from free. I don't see why it'd be better to add per-process caching,
> rather than making the DSM bigger? I can see some benefit in marking
> multiple chunks as being processed with one lock acquisition, but I
> don't think adding a memory copy is a good idea.
We ran performance tests with a csv data file (5.1GB, 10 million
tuples, 2 indexes on integer columns); results for the same are given
below. We noticed in some cases the performance is better if we copy
the 50 records locally and release the shared memory. We will get
better benefits as the workers increase. Thoughts?
---------------------------------------------------------------------------
 Workers | Exec time (with local copying of | Exec time (without copying,
         | 50 records & releasing the       | processing record by record)
         | shared memory)                   |
---------------------------------------------------------------------------
 0       | 1162.772 (1X)                    | 1152.684 (1X)
 2       | 635.249 (1.83X)                  | 647.894 (1.78X)
 4       | 336.835 (3.45X)                  | 335.534 (3.43X)
 8       | 188.577 (6.17X)                  | 189.461 (6.08X)
 16      | 126.819 (9.17X)                  | 142.730 (8.07X)
 20      | 117.845 (9.87X)                  | 146.533 (7.87X)
 30      | 127.554 (9.11X)                  | 160.307 (7.19X)
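(One concrete shape of the alternative suggested above, claiming a batch of ring slots with a single atomic operation and processing them in place with no local memcpy, is sketched here. The Ring layout is hypothetical, and the sketch elides the wraparound and leader-race handling a real implementation needs.)

    #include <stdatomic.h>
    #include <stdint.h>

    #define RINGSIZE   (10 * 1000)
    #define BATCH_SIZE 50

    typedef struct
    {
        uint32_t block_id;      /* which 64K data block the chunk lives in */
        uint32_t start_offset;  /* chunk start within that block */
        uint32_t size;          /* chunk size in bytes */
    } ChunkEntry;

    typedef struct
    {
        atomic_uint next_unclaimed;  /* next slot no worker owns yet */
        uint32_t    populated;       /* slots the leader has filled
                                      * (simplified: needs synchronization too) */
        ChunkEntry  entries[RINGSIZE];
    } Ring;

    /*
     * Reserve up to BATCH_SIZE consecutive slots with one atomic
     * fetch-add, instead of memcpy'ing chunk data into worker-local
     * memory. The worker then processes entries[(first + i) % RINGSIZE]
     * in place, straight out of shared memory.
     */
    static uint32_t
    claim_batch(Ring *ring, uint32_t *nclaimed)
    {
        uint32_t first = atomic_fetch_add(&ring->next_unclaimed, BATCH_SIZE);
        uint32_t avail = (first < ring->populated) ? ring->populated - first : 0;

        *nclaimed = (avail < BATCH_SIZE) ? avail : BATCH_SIZE;
        /* Over-claimed slots (avail < BATCH_SIZE) would have to be
         * re-offered in a real implementation; that is elided here. */
        return first;
    }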
> This patch *desperately* needs to be split up. It imo is close to
> unreviewable, due to a large amount of changes that just move code
> around without other functional changes being mixed in with the actual
> new stuff.
I have split the patch, the new split patches are attached.
>
>
>
> > /*
> > + * State of the chunk.
> > + */
> > +typedef enum ChunkState
> > +{
> > + CHUNK_INIT, /* initial state of chunk */
> > + CHUNK_LEADER_POPULATING, /* leader processing chunk */
> > + CHUNK_LEADER_POPULATED, /* leader completed populating chunk */
> > + CHUNK_WORKER_PROCESSING, /* worker processing chunk */
> > + CHUNK_WORKER_PROCESSED /* worker completed processing chunk */
> > +}ChunkState;
> > +
> > +#define RAW_BUF_SIZE 65536 /* we palloc RAW_BUF_SIZE+1 bytes */
> > +
> > +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> > +#define RINGSIZE (10 * 1000)
> > +#define MAX_BLOCKS_COUNT 1000
> > +#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */
> > +
> > +#define IsParallelCopy() (cstate->is_parallel)
> > +#define IsLeader() (cstate->pcdata->is_leader)
> > +#define IsHeaderLine() (cstate->header_line && cstate->cur_lineno == 1)
> > +
> > +/*
> > + * Copy data block information.
> > + */
> > +typedef struct CopyDataBlock
> > +{
> > + /* The number of unprocessed chunks in the current block. */
> > + pg_atomic_uint32 unprocessed_chunk_parts;
> > +
> > + /*
> > + * If the current chunk data is continued into another block,
> > + * following_block will have the position where the remaining data need to
> > + * be read.
> > + */
> > + uint32 following_block;
> > +
> > + /*
> > + * This flag will be set, when the leader finds out this block can be read
> > + * safely by the worker. This helps the worker to start processing the chunk
> > + * early where the chunk will be spread across many blocks and the worker
> > + * need not wait for the complete chunk to be processed.
> > + */
> > + bool curr_blk_completed;
> > + char data[DATA_BLOCK_SIZE + 1]; /* data read from file */
> > +}CopyDataBlock;
>
> What's the + 1 here about?
Fixed this, removed +1. That is not needed.
>
>
> > +/*
> > + * Parallel copy line buffer information.
> > + */
> > +typedef struct ParallelCopyLineBuf
> > +{
> > + StringInfoData line_buf;
> > + uint64 cur_lineno; /* line number for error messages */
> > +}ParallelCopyLineBuf;
>
> Why do we need separate infrastructure for this? We shouldn't duplicate
> infrastructure unnecessarily.
>
This was required for copying the multiple records locally and
releasing the shared memory. I have not changed this, will decide on
this based on the decision taken for one of the previous comments.
>
>
>
> > +/*
> > + * Common information that need to be copied to shared memory.
> > + */
> > +typedef struct CopyWorkerCommonData
> > +{
>
> Why is parallel specific stuff here suddenly not named ParallelCopy*
> anymore? If you introduce a naming like that it imo should be used
> consistently.
Fixed, changed to maintain ParallelCopy in all structs.
>
> > + /* low-level state data */
> > + CopyDest copy_dest; /* type of copy source/destination */
> > + int file_encoding; /* file or remote side's character encoding */
> > + bool need_transcoding; /* file encoding diff from server? */
> > + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */
> > +
> > + /* parameters from the COPY command */
> > + bool csv_mode; /* Comma Separated Value format? */
> > + bool header_line; /* CSV header line? */
> > + int null_print_len; /* length of same */
> > + bool force_quote_all; /* FORCE_QUOTE *? */
> > + bool convert_selectively; /* do selective binary conversion? */
> > +
> > + /* Working state for COPY FROM */
> > + AttrNumber num_defaults;
> > + Oid relid;
> > +}CopyWorkerCommonData;
>
> But I actually think we shouldn't have this information in two different
> structs. This should exist once, independent of using parallel /
> non-parallel copy.
>
This structure helps in storing the common data from CopyStateData
that are required by the workers. This information will then be
allocated and stored into the DSM for the worker to retrieve and copy
it to CopyStateData.
>
> > +/* List information */
> > +typedef struct ListInfo
> > +{
> > + int count; /* count of attributes */
> > +
> > + /* string info in the form info followed by info1, info2... infon */
> > + char info[1];
> > +} ListInfo;
>
> Based on these comments I have no idea what this could be for.
>
Have added better comments for this. The following is added: This
structure will help in converting a List data type into the below
structure format with the count having the number of elements in the
list and the info having the List elements appended contiguously. This
converted structure will be allocated in shared memory and stored in
DSM for the worker to retrieve and later convert it back to List data
type.
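(A small, self-contained sketch of the flattening that comment describes: count first, then the list elements appended contiguously, the way a List would be placed in the DSM. Names and layout here are illustrative, not the patch's exact definitions.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative flattened layout: count, then NUL-terminated strings. */
    typedef struct
    {
        int  count;     /* number of elements */
        char info[];    /* info1\0info2\0 ... infoN\0, contiguous */
    } ListInfo;

    static ListInfo *
    flatten(const char **items, int count)
    {
        size_t len = 0;
        for (int i = 0; i < count; i++)
            len += strlen(items[i]) + 1;

        ListInfo *li = malloc(sizeof(ListInfo) + len);
        li->count = count;

        char *p = li->info;
        for (int i = 0; i < count; i++)
        {
            size_t n = strlen(items[i]) + 1;
            memcpy(p, items[i], n);     /* element plus its terminator */
            p += n;
        }
        return li;
    }

    int main(void)
    {
        const char *cols[] = { "id", "name", "price" };
        ListInfo *li = flatten(cols, 3);

        /* A worker walks the buffer to rebuild the list. */
        const char *p = li->info;
        for (int i = 0; i < li->count; i++, p += strlen(p) + 1)
            printf("element %d: %s\n", i, p);

        free(li);
        return 0;
    }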
>
> > /*
> > - * This keeps the character read at the top of the loop in the buffer
> > - * even if there is more than one read-ahead.
> > + * This keeps the character read at the top of the loop in the buffer
> > + * even if there is more than one read-ahead.
> > + */
> > +#define IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(extralen) \
> > +if (1) \
> > +{ \
> > + if (copy_buff_state.raw_buf_ptr + (extralen) >= copy_buff_state.copy_buf_len && !hit_eof) \
> > + { \
> > + if (IsParallelCopy()) \
> > + { \
> > + copy_buff_state.chunk_size = prev_chunk_size; /* update previous chunk size */ \
> > + if (copy_buff_state.block_switched) \
> > + { \
> > + pg_atomic_sub_fetch_u32(&copy_buff_state.data_blk_ptr->unprocessed_chunk_parts, 1); \
> > + copy_buff_state.copy_buf_len = prev_copy_buf_len; \
> > + } \
> > + } \
> > + copy_buff_state.raw_buf_ptr = prev_raw_ptr; /* undo fetch */ \
> > + need_data = true; \
> > + continue; \
> > + } \
> > +} else ((void) 0)
>
> I think it's an absolutely clear no-go to add new branches to
> these. They're *really* hot already, and this is going to sprinkle a
> significant amount of new instructions over a lot of places.
>
Fixed, removed this.
>
>
> > +/*
> > + * SET_RAWBUF_FOR_LOAD - Set raw_buf to the shared memory where the file data must
> > + * be read.
> > + */
> > +#define SET_RAWBUF_FOR_LOAD() \
> > +{ \
> > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \
> > + uint32 cur_block_pos; \
> > + /* \
> > + * Mark the previous block as completed, worker can start copying this data. \
> > + */ \
> > + if (copy_buff_state.data_blk_ptr != copy_buff_state.curr_data_blk_ptr && \
> > + copy_buff_state.data_blk_ptr->curr_blk_completed == false) \
> > + copy_buff_state.data_blk_ptr->curr_blk_completed = true; \
> > + \
> > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > + cur_block_pos = WaitGetFreeCopyBlock(pcshared_info); \
> > + copy_buff_state.curr_data_blk_ptr = &pcshared_info->data_blocks[cur_block_pos]; \
> > + \
> > + if (!copy_buff_state.data_blk_ptr) \
> > + { \
> > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > + chunk_first_block = cur_block_pos; \
> > + } \
> > + else if (need_data == false) \
> > + copy_buff_state.data_blk_ptr->following_block = cur_block_pos; \
> > + \
> > + cstate->raw_buf = copy_buff_state.curr_data_blk_ptr->data; \
> > + copy_buff_state.copy_raw_buf = cstate->raw_buf; \
> > +}
> > +
> > +/*
> > + * END_CHUNK_PARALLEL_COPY - Update the chunk information in shared memory.
> > + */
> > +#define END_CHUNK_PARALLEL_COPY() \
> > +{ \
> > + if (!IsHeaderLine()) \
> > + { \
> > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \
> > + ChunkBoundaries *chunkBoundaryPtr = &pcshared_info->chunk_boundaries; \
> > + if (copy_buff_state.chunk_size) \
> > + { \
> > + ChunkBoundary *chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > + /* \
> > + * If raw_buf_ptr is zero, unprocessed_chunk_parts would have been \
> > + * incremented in SEEK_COPY_BUFF_POS. This will happen if the whole \
> > + * chunk finishes at the end of the current block. If the \
> > + * new_line_size > raw_buf_ptr, then the new block has only new line \
> > + * char content. The unprocessed count should not be increased in \
> > + * this case. \
> > + */ \
> > + if (copy_buff_state.raw_buf_ptr != 0 && \
> > + copy_buff_state.raw_buf_ptr > new_line_size) \
> > + pg_atomic_add_fetch_u32(&copy_buff_state.curr_data_blk_ptr->unprocessed_chunk_parts, 1); \
> > + \
> > + /* Update chunk size. */ \
> > + pg_atomic_write_u32(&chunkInfo->chunk_size, copy_buff_state.chunk_size); \
> > + pg_atomic_write_u32(&chunkInfo->chunk_state, CHUNK_LEADER_POPULATED); \
> > + elog(DEBUG1, "[Leader] After adding - chunk position:%d, chunk_size:%d", \
> > + chunk_pos, copy_buff_state.chunk_size); \
> > + pcshared_info->populated++; \
> > + } \
> > + else if (new_line_size) \
> > + { \
> > + /* \
> > + * This means only new line char, empty record should be \
> > + * inserted. \
> > + */ \
> > + ChunkBoundary *chunkInfo; \
> > + chunk_pos = UpdateBlockInChunkInfo(cstate, -1, -1, 0, \
> > + CHUNK_LEADER_POPULATED); \
> > + chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > + elog(DEBUG1, "[Leader] Added empty chunk with offset:%d, chunk position:%d, chunk size:%d", \
> > + chunkInfo->start_offset, chunk_pos, \
> > + pg_atomic_read_u32(&chunkInfo->chunk_size)); \
> > + pcshared_info->populated++; \
> > + } \
> > + }\
> > + \
> > + /*\
> > + * All of the read data is processed, reset index & len. In the\
> > + * subsequent read, we will get a new block and copy data in to the\
> > + * new block.\
> > + */\
> > + if (copy_buff_state.raw_buf_ptr == copy_buff_state.copy_buf_len)\
> > + {\
> > + cstate->raw_buf_index = 0;\
> > + cstate->raw_buf_len = 0;\
> > + }\
> > + else\
> > + cstate->raw_buf_len = copy_buff_state.copy_buf_len;\
> > +}
>
> Why are these macros? They are way way way above a length where that
> makes any sort of sense.
>
Converted these macros to functions.
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Hi,
Attached is the patch supporting parallel copy for binary format files.
The performance improvement achieved with different numbers of workers is shown below. The dataset used has 10 million tuples and is 5.3GB in size. Execution times are in seconds.

 Workers | Test 1: copy from binary file, | Test 2: copy from binary  | Test 3: copy from binary
         | 2 indexes on integer columns   | file, 1 gist index on     | file, 3 indexes on
         | and 1 index on text column     | text column               | integer columns
 --------+--------------------------------+---------------------------+--------------------------
 0       | 1106.899 (1X)                  | 772.758 (1X)              | 171.338 (1X)
 1       | 1094.165 (1.01X)               | 757.365 (1.02X)           | 163.018 (1.05X)
 2       | 618.397 (1.79X)                | 428.304 (1.8X)            | 117.508 (1.46X)
 4       | 320.511 (3.45X)                | 231.938 (3.33X)           | 80.297 (2.13X)
 8       | 172.462 (6.42X)                | 150.212 (5.14X)           | 71.518 (2.39X)
 16      | 110.460 (10.02X)               | 124.929 (6.18X)           | 91.308 (1.88X)
 20      | 98.470 (11.24X)                | 137.313 (5.63X)           | 95.289 (1.79X)
 30      | 109.229 (10.13X)               | 173.54 (4.45X)            | 95.799 (1.78X)
Design followed for developing this patch:
The leader reads data from the file into the DSM data blocks, each of 64K size. It also identifies, for each tuple, the data block id, start offset, end offset and tuple size, and updates this information in the ring data structure. Workers read the tuple information from the ring data structure and the actual tuple data from the data blocks in parallel, and insert the tuples into the table in parallel, as sketched below.
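(To visualize the design, here is a self-contained sketch of the shared area just described: 64K data blocks plus a ring of per-tuple entries carrying block id, offsets and size. Field names are illustrative stand-ins, not the patch's exact structures.)

    #include <stdint.h>

    #define DATA_BLOCK_SIZE 65536
    #define RINGSIZE        (10 * 1000)
    #define MAX_BLOCKS      1000

    /* One shared 64K block of raw file data, filled by the leader. */
    typedef struct
    {
        char data[DATA_BLOCK_SIZE];
    } DataBlock;

    /* One ring entry: where a single binary tuple's bytes live. */
    typedef struct
    {
        uint32_t block_id;      /* block holding the tuple's first byte */
        uint32_t start_offset;  /* tuple's first byte within that block */
        uint32_t end_offset;    /* its last byte; for a tuple spanning
                                 * blocks this lies in a later block */
        uint32_t size;          /* total tuple size in bytes */
    } TupleEntry;

    /* The leader advances a producer position over ring[]; each worker
     * claims entries and reads tuple bytes straight out of blocks[], so
     * tuple data is never copied out of shared memory before insertion. */
    typedef struct
    {
        DataBlock  blocks[MAX_BLOCKS];
        TupleEntry ring[RINGSIZE];
    } SharedCopyArea;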
Please note that this patch can be applied on the series of patches that were posted previously[1] for parallel copy for csv/text files.
The correct order to apply all the patches is -
0003-Allow-copy-from-command-to-process-data-from-file-ST.patch and
0005-Parallel-Copy-For-Binary-Format-Files.patch
The above tests were run with the attached configuration (config.txt), which is the same one used for the performance tests of csv/text files posted earlier in this mail chain.
Request the community to take this patch up for review along with the parallel copy for csv/text file patches and provide feedback.
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Thanks Amit for the clarifications. Regarding partitioned tables, one of the questions was: if we are loading data into a partitioned table using the COPY command, then the input file would contain tuples for different tables (partitions), unlike the normal table case where all the tuples in the input file would belong to the same table. So, in such a case, how are we going to accumulate tuples into the DSM? I mean, will the leader process check which tuple needs to be routed to which partition and accordingly accumulate them into the DSM? For e.g., let's say in the input data file we have 10 tuples where the 1st tuple belongs to partition1, the 2nd belongs to partition2 and likewise. So, in such cases, will the leader process accumulate all the tuples belonging to partition1 into one DSM and tuples belonging to partition2 into some other DSM and assign them to the worker process, or have we taken some other approach to handle this scenario?
Further, I haven't got much time to look into the links that you have shared in your previous response. Will have a look into those and will also slowly start looking into the patches as and when I get some time. Thank you.
On Sat, Jun 13, 2020 at 9:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jun 12, 2020 at 4:57 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi All,
>
> I've spent a little bit of time going through the project discussion that has happened in this email thread, and to start with I have a few questions which I would like to put here:
>
> Q1) Are we also planning to read the input data in parallel or is it only about performing the multi-insert operation in parallel? AFAIU, the data reading part will be done by the leader process alone so no parallelism is involved there.
>
Yes, your understanding is correct.
> Q2) How are we going to deal with the partitioned tables?
>
I haven't studied the patch but my understanding is that we will
support parallel copy for partitioned tables with a few restrictions
as explained in my earlier email [1]. See, Case-2 (b) in the email.
> I mean will there be some worker process dedicated for each partition or how is it?
No, the split is just based on the input; otherwise, each worker
should insert as we would have done without any workers.
> Q3) In case of toast tables, there is a possibility of having a single tuple in the input file which could be of a very big size (probably in GB), eventually resulting in a bigger file size. So, in this case, how are we going to decide the number of worker processes to be launched? I mean, although the file size is big, the number of tuples to be processed is just one or a few of them, so can we decide the number of worker processes to be launched based on the file size?
>
Yeah, such situations would be tricky, so we should have an option for
the user to specify the number of workers.
> Q4) Who is going to process constraints (preferably the deferred constraint) that is supposed to be executed at the COMMIT time? I mean is it the leader process or the worker process or in such cases we won't be choosing the parallelism at all?
>
In the first version, we won't do parallelism for this. Again, see
one of my earlier emails [1] where I have explained this and other
cases where we won't be supporting parallel copy.
> Q5) Do we have any risk of table bloating when the data is loaded in parallel? I am just asking this because in case of parallelism there would be multiple processes performing bulk inserts into a table. There is a chance that the table file might get extended even if there is some space in the file being written into, but that space is locked by some other worker process, and hence that might result in the creation of a new block for that table. Sorry if I am missing something here.
>
Hmm, each worker will operate at the page level; after the first
insertion, the same worker will try to insert in the same page in
which it inserted last, so there shouldn't be such a problem.
> Please note that I haven't gone through all the emails in this thread so there is a possibility that I might have repeated the question that has already been raised and answered here. If that is the case, I am sorry for that, but it would be very helpful if someone could point out that thread so that I can go through it. Thank you.
>
No problem, I understand sometimes it is difficult to go through each
and every email, especially when the discussion is long. Anyway,
thanks for showing interest in the patch.
[1] - https://www.postgresql.org/message-id/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
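(A toy model of the bulk-insert behaviour described in the answer to Q5 above: each worker remembers the page it last wrote and extends the relation only when that page is full, so concurrent workers need not leave half-empty pages behind. This is a simplified simulation, not heapam code; the names are hypothetical.)

    #include <stdint.h>

    #define PAGE_CAPACITY 100          /* tuples per page; toy number */

    typedef struct
    {
        uint32_t last_page;            /* page this worker last inserted into */
        uint32_t used_on_page;         /* tuples already placed on that page,
                                        * initialized to PAGE_CAPACITY so the
                                        * first insert claims a fresh page */
    } WorkerInsertState;

    /* Returns the page the next tuple goes to; *npages (the relation
     * length) grows only when this worker's own current page is full. */
    static uint32_t
    target_page(WorkerInsertState *ws, uint32_t *npages)
    {
        if (ws->used_on_page == PAGE_CAPACITY)
        {
            ws->last_page = (*npages)++;   /* extend: claim a fresh page */
            ws->used_on_page = 0;
        }
        ws->used_on_page++;
        return ws->last_page;
    }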
On Mon, Jun 15, 2020 at 7:41 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Thanks Amit for the clarifications. Regarding partitioned tables, one
> of the questions was: if we are loading data into a partitioned table
> using the COPY command, then the input file would contain tuples for
> different tables (partitions), unlike the normal table case where all
> the tuples in the input file would belong to the same table. So, in
> such a case, how are we going to accumulate tuples into the DSM? I
> mean, will the leader process check which tuple needs to be routed to
> which partition and accordingly accumulate them into the DSM? For
> e.g., let's say in the input data file we have 10 tuples where the 1st
> tuple belongs to partition1, the 2nd belongs to partition2 and
> likewise. So, in such cases, will the leader process accumulate all
> the tuples belonging to partition1 into one DSM and tuples belonging
> to partition2 into some other DSM and assign them to the worker
> process, or have we taken some other approach to handle this scenario?
>

No, all the tuples (for all partitions) will be accumulated in a
single DSM and the workers/leader will route the tuple to an
appropriate partition.

> Further, I haven't got much time to look into the links that you have
> shared in your previous response. Will have a look into those and will
> also slowly start looking into the patches as and when I get some
> time. Thank you.
>

Yeah, it will be good if you go through all the emails once because
most of the decisions (and design) in the patch are based on the
discussion in this thread.

Note - Please don't top post, try to give inline replies.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
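(A sketch of the routing just described: one shared queue of tuples for all partitions, with each consumer computing the target partition per tuple. partition_for() is a hypothetical stand-in for PostgreSQL's tuple-routing machinery, and the queue is a plain array for illustration.)

    #include <stdio.h>

    #define NPARTS 4

    /* Hypothetical stand-in for the executor's tuple-routing lookup. */
    static int
    partition_for(int key)
    {
        return key % NPARTS;    /* a hash/range partition lookup in reality */
    }

    int main(void)
    {
        /* A single shared queue holds tuples destined for all partitions. */
        int queue[] = { 10, 7, 42, 13, 8, 3 };
        int n = sizeof(queue) / sizeof(queue[0]);

        /* Each worker runs this loop over the entries it claims: route
         * the tuple, then insert into that partition itself. */
        for (int i = 0; i < n; i++)
            printf("tuple %d -> partition %d\n",
                   queue[i], partition_for(queue[i]));
        return 0;
    }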
Hi,

I have included tests for the parallel copy feature; a few bugs that
were identified during testing have been fixed. Attached patches for
the same. Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

On Tue, Jun 16, 2020 at 3:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> No, all the tuples (for all partitions) will be accumulated in a
> single DSM and the workers/leader will route the tuple to an
> appropriate partition.
>
> Yeah, it will be good if you go through all the emails once because
> most of the decisions (and design) in the patch are based on the
> discussion in this thread.
>
> Note - Please don't top post, try to give inline replies.
Attachment
- 0005-Tests-for-parallel-copy.patch
- 0004-Documentation-for-parallel-copy.patch
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
On Mon, Jun 15, 2020 at 4:39 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> The above tests were run with the configuration attached config.txt, which is the same used for performance tests of csv/text files posted earlier in this mail chain.
>
> Request the community to take this patch up for review along with the parallel copy for csv/text file patches and provide feedback.
>

I had reviewed the patch, a few comments:

+
+   /*
+    * Parallel copy for binary formatted files
+    */
+   ParallelCopyDataBlock *curr_data_block;
+   ParallelCopyDataBlock *prev_data_block;
+   uint32 curr_data_offset;
+   uint32 curr_block_pos;
+   ParallelCopyTupleInfo curr_tuple_start_info;
+   ParallelCopyTupleInfo curr_tuple_end_info;
 } CopyStateData;

The new members added should be present in ParallelCopyData.

+ if (cstate->curr_tuple_start_info.block_id == cstate->curr_tuple_end_info.block_id)
+ {
+     elog(DEBUG1,"LEADER - tuple lies in a single data block");
+
+     line_size = cstate->curr_tuple_end_info.offset - cstate->curr_tuple_start_info.offset + 1;
+     pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].unprocessed_line_parts, 1);
+ }
+ else
+ {
+     uint32 following_block_id = pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].following_block;
+
+     elog(DEBUG1,"LEADER - tuple is spread across data blocks");
+
+     line_size = DATA_BLOCK_SIZE - cstate->curr_tuple_start_info.offset -
+                 pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].skip_bytes;
+
+     pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].unprocessed_line_parts, 1);
+
+     while (following_block_id != cstate->curr_tuple_end_info.block_id)
+     {
+         line_size = line_size + DATA_BLOCK_SIZE - pcshared_info->data_blocks[following_block_id].skip_bytes;
+
+         pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
+
+         following_block_id = pcshared_info->data_blocks[following_block_id].following_block;
+
+         if (following_block_id == -1)
+             break;
+     }
+
+     if (following_block_id != -1)
+         pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
+
+     line_size = line_size + cstate->curr_tuple_end_info.offset + 1;
+ }

line_size can be set as and when we process the tuple from
CopyReadBinaryTupleLeader, and this can be set at the end. That way the
above code can be removed.

+
+   /*
+    * Parallel copy for binary formatted files
+    */
+   ParallelCopyDataBlock *curr_data_block;
+   ParallelCopyDataBlock *prev_data_block;
+   uint32 curr_data_offset;
+   uint32 curr_block_pos;
+   ParallelCopyTupleInfo curr_tuple_start_info;
+   ParallelCopyTupleInfo curr_tuple_end_info;
 } CopyStateData;

curr_block_pos is present in ParallelCopyShmInfo; we could use it and
remove it from here. For curr_data_offset, a similar variable
raw_buf_index is present in CopyStateData; we could use that and remove
this one.

+ if (cstate->curr_data_offset + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
+ {
+     ParallelCopyDataBlock *data_block = NULL;
+     uint8 movebytes = 0;
+
+     block_pos = WaitGetFreeCopyBlock(pcshared_info);
+
+     movebytes = DATA_BLOCK_SIZE - cstate->curr_data_offset;
+
+     cstate->curr_data_block->skip_bytes = movebytes;
+
+     data_block = &pcshared_info->data_blocks[block_pos];
+
+     if (movebytes > 0)
+         memmove(&data_block->data[0], &cstate->curr_data_block->data[cstate->curr_data_offset],
+                 movebytes);
+
+     elog(DEBUG1, "LEADER - field count is spread across data blocks - moved %d bytes from current block %u to %u block",
+          movebytes, cstate->curr_block_pos, block_pos);
+
+     readbytes = CopyGetData(cstate, &data_block->data[movebytes], 1, (DATA_BLOCK_SIZE - movebytes));
+
+     elog(DEBUG1, "LEADER - bytes read from file after field count is moved to next data block %d", readbytes);
+
+     if (cstate->reached_eof)
+         ereport(ERROR,
+                 (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+                  errmsg("unexpected EOF in COPY data")));
+
+     cstate->curr_data_block = data_block;
+     cstate->curr_data_offset = 0;
+     cstate->curr_block_pos = block_pos;
+ }

This code is duplicated in CopyReadBinaryTupleLeader &
CopyReadBinaryAttributeLeader. We could make a function and re-use it.

+/*
+ * CopyReadBinaryAttributeWorker - leader identifies boundaries/offsets
+ * for each attribute/column, it moves on to next data block if the
+ * attribute/column is spread across data blocks.
+ */
+static pg_attribute_always_inline Datum
+CopyReadBinaryAttributeWorker(CopyState cstate, int column_no,
+    FmgrInfo *flinfo, Oid typioparam, int32 typmod, bool *isnull)
+{
+    int32 fld_size;
+    Datum result;

column_no is not used; it can be removed.

+ if (fld_count == -1)
+ {
+     /*
+      * Received EOF marker. In a V3-protocol copy, wait for the
+      * protocol-level EOF, and complain if it doesn't come
+      * immediately. This ensures that we correctly handle CopyFail,
+      * if client chooses to send that now.
+      *
+      * Note that we MUST NOT try to read more data in an old-protocol
+      * copy, since there is no protocol-level EOF marker then. We
+      * could go either way for copy from file, but choose to throw
+      * error if there's data after the EOF marker, for consistency
+      * with the new-protocol case.
+      */
+     char dummy;
+
+     if (cstate->copy_dest != COPY_OLD_FE &&
+         CopyGetData(cstate, &dummy, 1, 1) > 0)
+         ereport(ERROR,
+                 (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+                  errmsg("received copy data after EOF marker")));
+     return true;
+ }
+
+ if (fld_count != attr_count)
+     ereport(ERROR,
+             (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+              errmsg("row field count is %d, expected %d",
+                     (int) fld_count, attr_count)));
+
+ cstate->curr_tuple_start_info.block_id = cstate->curr_block_pos;
+ cstate->curr_tuple_start_info.offset = cstate->curr_data_offset;
+ cstate->curr_data_offset = cstate->curr_data_offset + sizeof(fld_count);
+ new_block_pos = cstate->curr_block_pos;
+
+ foreach(cur, cstate->attnumlist)
+ {
+     int attnum = lfirst_int(cur);
+     int m = attnum - 1;
+     Form_pg_attribute att = TupleDescAttr(tupDesc, m);

The above code is present in NextCopyFrom & CopyReadBinaryTupleLeader;
check if we can make a common function, or we could use NextCopyFrom as
it is.

+ memcpy(&fld_count, &cstate->curr_data_block->data[cstate->curr_data_offset], sizeof(fld_count));
+ fld_count = (int16) pg_ntoh16(fld_count);
+
+ if (fld_count == -1)
+ {
+     return true;
+ }

Should this be an assert in the CopyReadBinaryTupleWorker function, as
this check is already done in the leader?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Hi,
1) Can you please add some comments atop the new function PopulateAttributes() describing its functionality in detail? Further, this new function contains the code from BeginCopy() to set attribute-level options used with COPY FROM such as FORCE_QUOTE, FORCE_NOT_NULL, FORCE_NULL etc. in cstate, and along with that it also copies the code from BeginCopy() to set other info such as the client encoding type, encoding conversion etc. Hence, I think it would be good to give it a better name, basically something that matches what it actually does.
2) Again, the name for the new function CheckCopyFromValidity() doesn't look good to me. From the function name it appears as if it does the sanity check of the entire COPY FROM command, but actually it is just doing the sanity check for the target relation specified with COPY FROM. So, probably something like CheckTargetRelValidity would look more sensible, I think? TBH, I am not good at naming the functions so you can always ignore my suggestions about function and variable names :)
3) Any reason for not making CheckCopyFromValidity as a macro instead of a new function. It is just doing the sanity check for the target relation.
4) Earlier, in the CopyReadLine() function, while trying to clear the EOL marker from cstate->line_buf.data (copied data), we were not checking whether the line read by the CopyReadLineText() function is a header line, but I can see that your patch checks that before clearing the EOL marker. Any reason for this extra check?
5) I noticed the below spurious line removal in the patch.
@@ -3839,7 +3953,6 @@ static bool
CopyReadLine(CopyState cstate)
{
bool result;
-
Please note that I haven't got a chance to look into other patches as of now. I will do that whenever possible. Thank you.
On Fri, Jun 12, 2020 at 11:01 AM vignesh C <vignesh21@gmail.com> wrote:
On Thu, Jun 4, 2020 at 12:44 AM Andres Freund <andres@anarazel.de> wrote:
>
>
> Hm. you don't explicitly mention that in your design, but given how
> small the benefits going from 0-1 workers is, I assume the leader
> doesn't do any "chunk processing" on its own?
>
Yes, you are right; the leader does not do any chunk processing. The
leader's work is mainly to populate the shared memory with the offset
information for each record.
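As a rough illustration of that division of labour, a simplified sketch of the leader loop follows; every helper name here is a placeholder, not the patch's code, and only the ring-entry fields quoted elsewhere in this thread are assumed:

    /*
     * Simplified leader loop: read file data into shared 64K blocks,
     * find record boundaries, and publish (offset, size) ring entries
     * for the workers to consume.
     */
    while (!cstate->reached_eof)
    {
        uint32 start_offset;
        uint32 size;
        int    pos;

        ReadIntoSharedBlocks(cstate);       /* fill DSM data blocks */
        while (FindNextRecordBoundary(cstate, &start_offset, &size))
        {
            pos = WaitGetFreeChunkEntry(pcshared_info); /* may wait on workers */
            pcshared_info->chunk_boundaries.ring[pos].start_offset = start_offset;
            pg_atomic_write_u32(&pcshared_info->chunk_boundaries.ring[pos].chunk_size,
                                size);
            pg_atomic_write_u32(&pcshared_info->chunk_boundaries.ring[pos].chunk_state,
                                CHUNK_LEADER_POPULATED);
            pcshared_info->populated++;
        }
    }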
>
>
> > Design of the Parallel Copy: The backend, to which the "COPY FROM" query is
> > submitted acts as leader with the responsibility of reading data from the
> > file/stdin, launching at most n number of workers as specified with
> > PARALLEL 'n' option in the "COPY FROM" query. The leader populates the
> > common data required for the workers execution in the DSM and shares it
> > with the workers. The leader then executes before statement triggers if
> > there exists any. Leader populates DSM chunks which includes the start
> > offset and chunk size, while populating the chunks it reads as many blocks
> > as required into the DSM data blocks from the file. Each block is of 64K
> > size. The leader parses the data to identify a chunk, the existing logic
> > from CopyReadLineText which identifies the chunks with some changes was
> > used for this. Leader checks if a free chunk is available to copy the
> > information, if there is no free chunk it waits till the required chunk is
> > freed up by the worker and then copies the identified chunks information
> > (offset & chunk size) into the DSM chunks. This process is repeated till
> > the complete file is processed. Simultaneously, the workers cache the
> > chunks(50) locally into the local memory and release the chunks to the
> > leader for further populating. Each worker processes the chunk that it
> > cached and inserts it into the table. The leader waits till all the chunks
> > populated are processed by the workers and exits.
>
> Why do we need the local copy of 50 chunks? Copying memory around is far
> from free. I don't see why it'd be better to add per-process caching,
> rather than making the DSM bigger? I can see some benefit in marking
> multiple chunks as being processed with one lock acquisition, but I
> don't think adding a memory copy is a good idea.
We ran performance tests with a 5.1GB csv data file of 10 million
tuples, with 2 indexes on integer columns; the results are given below.
We noticed that in some cases the performance is better if we copy the
50 records locally and release the shared memory. The benefit grows as
the number of workers increases. Thoughts?
-------------------------------------------------------------------------
Workers | Exec time (with local copying | Exec time (without copying,
        | 50 records & releasing the    | processing record by record)
        | shared memory)                |
-------------------------------------------------------------------------
 0      | 1162.772 (1X)                 | 1152.684 (1X)
 2      |  635.249 (1.83X)              |  647.894 (1.78X)
 4      |  336.835 (3.45X)              |  335.534 (3.43X)
 8      |  188.577 (6.17X)              |  189.461 (6.08X)
16      |  126.819 (9.17X)              |  142.730 (8.07X)
20      |  117.845 (9.87X)              |  146.533 (7.87X)
30      |  127.554 (9.11X)              |  160.307 (7.19X)
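For clarity, here is a hypothetical sketch of the worker-side caching that these numbers compare; the locking scheme and helper names are assumptions, not the patch's code:

    /*
     * A worker grabs up to WORKER_CHUNK_COUNT ring entries under one
     * lock acquisition, copies them into backend-local memory, and
     * releases the shared entries so the leader can reuse them while
     * the worker parses the cached chunks.
     */
    static int
    CacheChunksLocally(CopyState cstate)
    {
        ParallelCopyShmInfo *shm = cstate->pcdata->pcshared_info;
        int nc = 0;

        SpinLockAcquire(&shm->ring_lock);           /* assumed lock */
        while (nc < WORKER_CHUNK_COUNT && ChunkAvailable(shm))
        {
            cstate->pcdata->local_chunks[nc++] = *NextPopulatedChunk(shm);
            MarkChunkEntryReusable(shm);
        }
        SpinLockRelease(&shm->ring_lock);
        return nc;      /* number of chunks to parse locally */
    }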
> This patch *desperately* needs to be split up. It imo is close to
> unreviewable, due to a large amount of changes that just move code
> around without other functional changes being mixed in with the actual
> new stuff.
I have split the patch, the new split patches are attached.
>
>
>
> > /*
> > + * State of the chunk.
> > + */
> > +typedef enum ChunkState
> > +{
> > + CHUNK_INIT, /* initial state of chunk */
> > + CHUNK_LEADER_POPULATING, /* leader processing chunk */
> > + CHUNK_LEADER_POPULATED, /* leader completed populating chunk */
> > + CHUNK_WORKER_PROCESSING, /* worker processing chunk */
> > + CHUNK_WORKER_PROCESSED /* worker completed processing chunk */
> > +}ChunkState;
> > +
> > +#define RAW_BUF_SIZE 65536 /* we palloc RAW_BUF_SIZE+1 bytes */
> > +
> > +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> > +#define RINGSIZE (10 * 1000)
> > +#define MAX_BLOCKS_COUNT 1000
> > +#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */
> > +
> > +#define IsParallelCopy() (cstate->is_parallel)
> > +#define IsLeader() (cstate->pcdata->is_leader)
> > +#define IsHeaderLine() (cstate->header_line && cstate->cur_lineno == 1)
> > +
> > +/*
> > + * Copy data block information.
> > + */
> > +typedef struct CopyDataBlock
> > +{
> > + /* The number of unprocessed chunks in the current block. */
> > + pg_atomic_uint32 unprocessed_chunk_parts;
> > +
> > + /*
> > + * If the current chunk data is continued into another block,
> > + * following_block will have the position where the remaining data need to
> > + * be read.
> > + */
> > + uint32 following_block;
> > +
> > + /*
> > + * This flag will be set, when the leader finds out this block can be read
> > + * safely by the worker. This helps the worker to start processing the chunk
> > + * early where the chunk will be spread across many blocks and the worker
> > + * need not wait for the complete chunk to be processed.
> > + */
> > + bool curr_blk_completed;
> > + char data[DATA_BLOCK_SIZE + 1]; /* data read from file */
> > +}CopyDataBlock;
>
> What's the + 1 here about?
Fixed this, removed +1. That is not needed.
>
>
> > +/*
> > + * Parallel copy line buffer information.
> > + */
> > +typedef struct ParallelCopyLineBuf
> > +{
> > + StringInfoData line_buf;
> > + uint64 cur_lineno; /* line number for error messages */
> > +}ParallelCopyLineBuf;
>
> Why do we need separate infrastructure for this? We shouldn't duplicate
> infrastructure unnecessarily.
>
This was required for copying multiple records locally and releasing
the shared memory. I have not changed this; I will decide on it based
on the conclusion reached for one of the previous comments.
>
>
>
> > +/*
> > + * Common information that need to be copied to shared memory.
> > + */
> > +typedef struct CopyWorkerCommonData
> > +{
>
> Why is parallel specific stuff here suddenly not named ParallelCopy*
> anymore? If you introduce a naming like that it imo should be used
> consistently.
Fixed, changed to maintain ParallelCopy in all structs.
>
> > + /* low-level state data */
> > + CopyDest copy_dest; /* type of copy source/destination */
> > + int file_encoding; /* file or remote side's character encoding */
> > + bool need_transcoding; /* file encoding diff from server? */
> > + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */
> > +
> > + /* parameters from the COPY command */
> > + bool csv_mode; /* Comma Separated Value format? */
> > + bool header_line; /* CSV header line? */
> > + int null_print_len; /* length of same */
> > + bool force_quote_all; /* FORCE_QUOTE *? */
> > + bool convert_selectively; /* do selective binary conversion? */
> > +
> > + /* Working state for COPY FROM */
> > + AttrNumber num_defaults;
> > + Oid relid;
> > +}CopyWorkerCommonData;
>
> But I actually think we shouldn't have this information in two different
> structs. This should exist once, independent of using parallel /
> non-parallel copy.
>
This structure helps in storing the common data from CopyStateData
that is required by the workers. This information is then allocated
and stored in the DSM, for each worker to retrieve and copy into its
CopyStateData.
>
> > +/* List information */
> > +typedef struct ListInfo
> > +{
> > + int count; /* count of attributes */
> > +
> > + /* string info in the form info followed by info1, info2... infon */
> > + char info[1];
> > +} ListInfo;
>
> Based on these comments I have no idea what this could be for.
>
I have added better comments for this. The following was added: this
structure helps in converting a List data type into the above
structure format, with count holding the number of elements in the
list and info holding the List elements appended contiguously. The
converted structure is allocated in shared memory and stored in the
DSM, for the worker to retrieve and later convert back into a List
data type.
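As a concrete illustration of that layout, here is a minimal sketch of flattening an integer List into the ListInfo format; it assumes int-sized elements and is illustrative only, not the patch's code:

    /*
     * Flatten an integer List into the ListInfo layout: count, followed
     * by the elements stored contiguously.  The result can be copied
     * into the DSM and rebuilt into a List by a worker.
     */
    static ListInfo *
    SerializeIntList(List *list)
    {
        Size      size = offsetof(ListInfo, info) +
                         list_length(list) * sizeof(int);
        ListInfo *linfo = (ListInfo *) palloc0(size);
        ListCell *lc;
        int       i = 0;

        linfo->count = list_length(list);
        foreach(lc, list)
            ((int *) linfo->info)[i++] = lfirst_int(lc);
        return linfo;
    }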
>
> > /*
> > - * This keeps the character read at the top of the loop in the buffer
> > - * even if there is more than one read-ahead.
> > + * This keeps the character read at the top of the loop in the buffer
> > + * even if there is more than one read-ahead.
> > + */
> > +#define IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(extralen) \
> > +if (1) \
> > +{ \
> > + if (copy_buff_state.raw_buf_ptr + (extralen) >= copy_buff_state.copy_buf_len && !hit_eof) \
> > + { \
> > + if (IsParallelCopy()) \
> > + { \
> > + copy_buff_state.chunk_size = prev_chunk_size; /* update previous chunk size */ \
> > + if (copy_buff_state.block_switched) \
> > + { \
> > + pg_atomic_sub_fetch_u32(&copy_buff_state.data_blk_ptr->unprocessed_chunk_parts, 1); \
> > + copy_buff_state.copy_buf_len = prev_copy_buf_len; \
> > + } \
> > + } \
> > + copy_buff_state.raw_buf_ptr = prev_raw_ptr; /* undo fetch */ \
> > + need_data = true; \
> > + continue; \
> > + } \
> > +} else ((void) 0)
>
> I think it's an absolutely clear no-go to add new branches to
> these. They're *really* hot already, and this is going to sprinkle a
> significant amount of new instructions over a lot of places.
>
Fixed, removed this.
>
>
> > +/*
> > + * SET_RAWBUF_FOR_LOAD - Set raw_buf to the shared memory where the file data must
> > + * be read.
> > + */
> > +#define SET_RAWBUF_FOR_LOAD() \
> > +{ \
> > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \
> > + uint32 cur_block_pos; \
> > + /* \
> > + * Mark the previous block as completed, worker can start copying this data. \
> > + */ \
> > + if (copy_buff_state.data_blk_ptr != copy_buff_state.curr_data_blk_ptr && \
> > + copy_buff_state.data_blk_ptr->curr_blk_completed == false) \
> > + copy_buff_state.data_blk_ptr->curr_blk_completed = true; \
> > + \
> > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > + cur_block_pos = WaitGetFreeCopyBlock(pcshared_info); \
> > + copy_buff_state.curr_data_blk_ptr = &pcshared_info->data_blocks[cur_block_pos]; \
> > + \
> > + if (!copy_buff_state.data_blk_ptr) \
> > + { \
> > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > + chunk_first_block = cur_block_pos; \
> > + } \
> > + else if (need_data == false) \
> > + copy_buff_state.data_blk_ptr->following_block = cur_block_pos; \
> > + \
> > + cstate->raw_buf = copy_buff_state.curr_data_blk_ptr->data; \
> > + copy_buff_state.copy_raw_buf = cstate->raw_buf; \
> > +}
> > +
> > +/*
> > + * END_CHUNK_PARALLEL_COPY - Update the chunk information in shared memory.
> > + */
> > +#define END_CHUNK_PARALLEL_COPY() \
> > +{ \
> > + if (!IsHeaderLine()) \
> > + { \
> > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \
> > + ChunkBoundaries *chunkBoundaryPtr = &pcshared_info->chunk_boundaries; \
> > + if (copy_buff_state.chunk_size) \
> > + { \
> > + ChunkBoundary *chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > + /* \
> > + * If raw_buf_ptr is zero, unprocessed_chunk_parts would have been \
> > + * incremented in SEEK_COPY_BUFF_POS. This will happen if the whole \
> > + * chunk finishes at the end of the current block. If the \
> > + * new_line_size > raw_buf_ptr, then the new block has only new line \
> > + * char content. The unprocessed count should not be increased in \
> > + * this case. \
> > + */ \
> > + if (copy_buff_state.raw_buf_ptr != 0 && \
> > + copy_buff_state.raw_buf_ptr > new_line_size) \
> > + pg_atomic_add_fetch_u32(&copy_buff_state.curr_data_blk_ptr->unprocessed_chunk_parts, 1); \
> > + \
> > + /* Update chunk size. */ \
> > + pg_atomic_write_u32(&chunkInfo->chunk_size, copy_buff_state.chunk_size); \
> > + pg_atomic_write_u32(&chunkInfo->chunk_state, CHUNK_LEADER_POPULATED); \
> > + elog(DEBUG1, "[Leader] After adding - chunk position:%d, chunk_size:%d", \
> > + chunk_pos, copy_buff_state.chunk_size); \
> > + pcshared_info->populated++; \
> > + } \
> > + else if (new_line_size) \
> > + { \
> > + /* \
> > + * This means only new line char, empty record should be \
> > + * inserted. \
> > + */ \
> > + ChunkBoundary *chunkInfo; \
> > + chunk_pos = UpdateBlockInChunkInfo(cstate, -1, -1, 0, \
> > + CHUNK_LEADER_POPULATED); \
> > + chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > + elog(DEBUG1, "[Leader] Added empty chunk with offset:%d, chunk position:%d, chunk size:%d", \
> > + chunkInfo->start_offset, chunk_pos, \
> > + pg_atomic_read_u32(&chunkInfo->chunk_size)); \
> > + pcshared_info->populated++; \
> > + } \
> > + }\
> > + \
> > + /*\
> > + * All of the read data is processed, reset index & len. In the\
> > + * subsequent read, we will get a new block and copy data in to the\
> > + * new block.\
> > + */\
> > + if (copy_buff_state.raw_buf_ptr == copy_buff_state.copy_buf_len)\
> > + {\
> > + cstate->raw_buf_index = 0;\
> > + cstate->raw_buf_len = 0;\
> > + }\
> > + else\
> > + cstate->raw_buf_len = copy_buff_state.copy_buf_len;\
> > +}
>
> Why are these macros? They are way way way above a length where that
> makes any sort of sense.
>
Converted these macros to functions.
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Thanks Ashutosh for your review, my comments are inline.

On Fri, Jun 19, 2020 at 5:41 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> I just got some time to review the first patch in the list i.e. 0001-Copy-code-readjustment-to-support-parallel-copy.patch. As the patch name suggests, it is just trying to reshuffle the existing code for the COPY command here and there. There are no extra changes added in the patch as such, but I still do have some review comments, please have a look:
>
> 1) Can you please add some comments atop the new function PopulateAttributes() describing its functionality in detail? Further, this new function contains the code from BeginCopy() to set attribute-level options used with COPY FROM such as FORCE_QUOTE, FORCE_NOT_NULL, FORCE_NULL etc. in cstate, and along with that it also copies the code from BeginCopy() to set other info such as the client encoding type, encoding conversion etc. Hence, I think it would be good to give it a better name, basically something that matches what it actually does.
>

There is no new code added in this function; some part of the code from
BeginCopy was made into a new function, as this part of the code will
also be required for the parallel copy workers before they start the
actual copy operation. This code was made into a function to avoid
duplication. Changed the function name to PopulateGlobalsForCopyFrom and
added a few comments.

> 2) Again, the name for the new function CheckCopyFromValidity() doesn't look good to me. From the function name it appears as if it does the sanity check of the entire COPY FROM command, but actually it is just doing the sanity check for the target relation specified with COPY FROM. So, probably something like CheckTargetRelValidity would look more sensible, I think? TBH, I am not good at naming functions so you can always ignore my suggestions about function and variable names :)
>

Changed as suggested.

> 3) Any reason for not making CheckCopyFromValidity a macro instead of a new function? It is just doing the sanity check for the target relation.
>

I felt there is a reasonable number of lines in the function and it is
not in a performance-intensive path, so I preferred a function over a
macro. Your thoughts?

> 4) Earlier, in the CopyReadLine() function, while trying to clear the EOL marker from cstate->line_buf.data (copied data), we were not checking whether the line read by the CopyReadLineText() function is a header line, but I can see that your patch checks that before clearing the EOL marker. Any reason for this extra check?
>

If you see the caller of CopyReadLine, i.e. NextCopyFromRawFields, it
does nothing for the header line; the server basically calls
CopyReadLine again. It is a kind of small optimization: the server is
not going to do anything with the header line anyway, so I felt there is
no need to clear the EOL marker for header lines.

/* on input just throw the header line away */
if (cstate->cur_lineno == 0 && cstate->header_line)
{
    cstate->cur_lineno++;
    if (CopyReadLine(cstate))
        return false;   /* done */
}

cstate->cur_lineno++;

/* Actually read the line into memory here */
done = CopyReadLine(cstate);

I think there is no need to make a fix for this. Your thoughts?

> 5) I noticed the below spurious line removal in the patch.
>
> @@ -3839,7 +3953,6 @@ static bool
> CopyReadLine(CopyState cstate)
> {
> bool result;
> -
>

Fixed.

I have attached the patch for the same with the fixes.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jun 23, 2020 at 8:07 AM vignesh C <vignesh21@gmail.com> wrote:
> I have attached the patch for the same with the fixes.
The patches were not applying on head; attached are the patches that can be applied on head.
I have added a commitfest entry[1] for this feature.

[1] - https://commitfest.postgresql.org/28/2610/
On Tue, Jun 23, 2020 at 8:07 AM vignesh C <vignesh21@gmail.com> wrote:
Thanks Ashutosh for your review, my comments are inline.
On Fri, Jun 19, 2020 at 5:41 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> I just got some time to review the first patch in the list i.e. 0001-Copy-code-readjustment-to-support-parallel-copy.patch. As the patch name suggests, it is just trying to reshuffle the existing code for COPY command here and there. There is no extra changes added in the patch as such, but still I do have some review comments, please have a look:
>
> 1) Can you please add some comments atop the new function PopulateAttributes() describing its functionality in detail. Further, this new function contains the code from BeginCopy() to set attribute level options used with COPY FROM such as FORCE_QUOTE, FORCE_NOT_NULL, FORCE_NULL etc. in cstate and along with that it also copies the code from BeginCopy() to set other infos such as client encoding type, encoding conversion etc. Hence, I think it would be good to give it some better name, basically something that matches with what actually it is doing.
>
There is no new code added in this function, some part of code from
BeginCopy was made in to a new function as this part of code will also
be required for the parallel copy workers before the workers start the
actual copy operation. This code was made into a function to avoid
duplication. Changed the function name to PopulateGlobalsForCopyFrom &
added few comments.
> 2) Again, the name for the new function CheckCopyFromValidity() doesn't look good to me. From the function name it appears as if it does the sanity check of the entire COPY FROM command, but actually it is just doing the sanity check for the target relation specified with COPY FROM. So, probably something like CheckTargetRelValidity would look more sensible, I think? TBH, I am not good at naming the functions so you can always ignore my suggestions about function and variable names :)
>
Changed as suggested.
> 3) Any reason for not making CheckCopyFromValidity as a macro instead of a new function. It is just doing the sanity check for the target relation.
>
I felt there is reasonable number of lines in the function & it is not
in performance intensive path, so I preferred function over macro.
Your thoughts?
> 4) Earlier in CopyReadLine() function while trying to clear the EOL marker from cstate->line_buf.data (copied data), we were not checking if the line read by CopyReadLineText() function is a header line or not, but I can see that your patch checks that before clearing the EOL marker. Any reason for this extra check?
>
If you see the caller of CopyReadLine, i.e. NextCopyFromRawFields does
nothing for the header line, server basically calls CopyReadLine
again, it is a kind of small optimization. Anyway server is not going
to do anything with header line, I felt no need to clear EOL marker
for header lines.
/* on input just throw the header line away */
if (cstate->cur_lineno == 0 && cstate->header_line)
{
cstate->cur_lineno++;
if (CopyReadLine(cstate))
return false; /* done */
}
cstate->cur_lineno++;
/* Actually read the line into memory here */
done = CopyReadLine(cstate);
I think no need to make a fix for this. Your thoughts?
> 5) I noticed the below spurious line removal in the patch.
>
> @@ -3839,7 +3953,6 @@ static bool
> CopyReadLine(CopyState cstate)
> {
> bool result;
> -
>
Fixed.
I have attached the patch for the same with the fixes.
Thoughts?
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0004-Documentation-for-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
Hi,

Thanks Vignesh for reviewing the parallel copy for binary format files patch. I tried to address the comments in the attached patch (0006-Parallel-Copy-For-Binary-Format-Files.patch).

On Thu, Jun 18, 2020 at 6:42 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 4:39 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > The above tests were run with the configuration attached config.txt, which is the same used for performance tests of csv/text files posted earlier in this mail chain.
> >
> > Request the community to take this patch up for review along with the parallel copy for csv/text file patches and provide feedback.
> >
>
> I had reviewed the patch, a few comments:
>
> The new members added should be present in ParallelCopyData

Added to ParallelCopyData.

> line_size can be set as and when we process the tuple from
> CopyReadBinaryTupleLeader and this can be set at the end. That way the
> above code can be removed.

curr_tuple_start_info and curr_tuple_end_info are now local variables in
CopyReadBinaryTupleLeader, and the line size calculation code has moved
to CopyReadBinaryAttributeLeader.

> curr_block_pos variable is present in ParallelCopyShmInfo, we could
> use it and remove from here.
> curr_data_offset, similar variable raw_buf_index is present in
> CopyStateData, we could use it and remove from here.

Yes, making use of them now.

> This code is duplicate in CopyReadBinaryTupleLeader &
> CopyReadBinaryAttributeLeader. We could make a function and re-use.

Added a new function AdjustFieldInfo.

> column_no is not used, it can be removed

Removed.

> The above code is present in NextCopyFrom & CopyReadBinaryTupleLeader,
> check if we can make a common function or we could use NextCopyFrom as
> it is.

Added a macro CHECK_FIELD_COUNT.

> + if (fld_count == -1)
> + {
> +     return true;
> + }
>
> Should this be an assert in CopyReadBinaryTupleWorker function as this
> check is already done in the leader.

This check in the leader signifies the end of the file. For the workers,
eof is when GetLinePosition() returns -1:

line_pos = GetLinePosition(cstate);
if (line_pos == -1)
    return true;

In case if (fld_count == -1) is encountered in the worker, the worker
should just return true from CopyReadBinaryTupleWorker, marking eof.
Having this as an assert doesn't serve the purpose, I feel.

Along with the review-comments-addressed patch
(0006-Parallel-Copy-For-Binary-Format-Files.patch), also attaching all
other latest series of patches (0001 to 0005) from [1]; the order of
applying patches is from 0001 to 0006.

[1] https://www.postgresql.org/message-id/CALDaNm0H3N9gK7CMheoaXkO99g%3DuAPA93nSZXu0xDarPyPY6sg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
Hi,

It looks like the parsing of the newly introduced "PARALLEL" option for the COPY FROM command has an issue (in 0002-Framework-for-leader-worker-in-parallel-copy.patch): mentioning ....PARALLEL '4ar2eteid'); would pass with 4 workers, since atoi() is being used for converting the string to an integer, and it just returns 4, ignoring the trailing characters.

I used strtol(), added error checks, and introduced the error "improper use of argument to option "parallel"" for the above cases.

parallel '4ar2eteid');
ERROR:  improper use of argument to option "parallel"
LINE 5: parallel '1\');

Along with the updated patch 0002-Framework-for-leader-worker-in-parallel-copy.patch, also attaching all the latest patches from [1].

[1] - https://www.postgresql.org/message-id/CALj2ACW94icER3WrWapon7JkcX8j0TGRue5ycWMTEvgA3X7fOg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

On Tue, Jun 23, 2020 at 12:22 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 8:07 AM vignesh C <vignesh21@gmail.com> wrote:
> > I have attached the patch for the same with the fixes.
>
> The patches were not applying on head; attached are the patches that can be applied on head.
> I have added a commitfest entry[1] for this feature.
>
> [1] - https://commitfest.postgresql.org/28/2610/
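For reference, a hedged sketch of the strict parsing described above; the option plumbing and the upper bound are assumptions, not the exact patch code:

    /* Parse the PARALLEL option strictly, rejecting trailing garbage. */
    char   *str = defGetString(defel);          /* e.g. "4ar2eteid" */
    char   *endptr;
    long    nworkers;

    errno = 0;
    nworkers = strtol(str, &endptr, 10);
    if (errno != 0 || endptr == str || *endptr != '\0' ||
        nworkers <= 0 || nworkers > 1024)       /* limit assumed */
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("improper use of argument to option \"parallel\"")));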
On Wed, Jun 24, 2020 at 2:16 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> It looks like the parsing of the newly introduced "PARALLEL" option for
> the COPY FROM command has an issue (in
> 0002-Framework-for-leader-worker-in-parallel-copy.patch): mentioning
> ....PARALLEL '4ar2eteid'); would pass with 4 workers, since atoi() is
> being used for converting the string to an integer, and it just returns
> 4, ignoring the trailing characters.
>
> I used strtol(), added error checks, and introduced the error "improper
> use of argument to option "parallel"" for the above cases.
>

I'm sorry, I forgot to attach the patches. Here is the latest series of patches.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
Hi,

0006 patch has some code clean-up and issue fixes found during internal testing.

Attaching the latest patches herewith. The order of applying the patches remains the same, i.e. from 0001 to 0006.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
Hi,

I have made a few changes in the 0003 & 0005 patches; there were a couple of bugs in the 0003 patch and some random test failures in the 0005 patch. Attached are new patches which include the fixes for the same.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

On Fri, Jun 26, 2020 at 2:34 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> 0006 patch has some code clean up and issue fixes found during internal testing.
>
> Attaching the latest patches herewith.
>
> The order of applying the patches remains the same i.e. from 0001 to 0006.
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
On Wed, Jul 1, 2020 at 2:46 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Hi,
>
> I have made few changes in 0003 & 0005 patch, there were a couple of
> bugs in 0003 patch & some random test failures in 0005 patch.
> Attached new patches which include the fixes for the same.
I have made changes in the 0003 patch to remove the changes made in pqmq.c for the parallel worker error handling hang issue. This is being discussed separately in email [1], as it is a bug in head. The rest of the patches have no changes.
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
On Wed, Jun 24, 2020 at 1:41 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Along with the review comments addressed
> patch(0006-Parallel-Copy-For-Binary-Format-Files.patch) also attaching
> all other latest series of patches(0001 to 0005) from [1], the order
> of applying patches is from 0001 to 0006.
>
> [1] https://www.postgresql.org/message-id/CALDaNm0H3N9gK7CMheoaXkO99g%3DuAPA93nSZXu0xDarPyPY6sg%40mail.gmail.com
>

Some comments:

+ movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
+
+ cstate->pcdata->curr_data_block->skip_bytes = movebytes;
+
+ data_block = &pcshared_info->data_blocks[block_pos];
+
+ if (movebytes > 0)
+     memmove(&data_block->data[0], &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index],
+             movebytes);

We can create a local variable and use it in place of
cstate->pcdata->curr_data_block.

+ if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
+     AdjustFieldInfo(cstate, 1);
+
+ memcpy(&fld_count, &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index], sizeof(fld_count));

Should this be like below, as the remaining size can fit in the current block?
if (cstate->raw_buf_index + sizeof(fld_count) >= DATA_BLOCK_SIZE)

+ if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE - 1))
+ {
+     AdjustFieldInfo(cstate, 2);
+     *new_block_pos = pcshared_info->cur_block_pos;
+ }

Same as above.

+ movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
+
+ cstate->pcdata->curr_data_block->skip_bytes = movebytes;
+
+ data_block = &pcshared_info->data_blocks[block_pos];
+
+ if (movebytes > 0)

Instead of the above check, we can have an assert check for movebytes.

+ if (mode == 1)
+ {
+     cstate->pcdata->curr_data_block = data_block;
+     cstate->raw_buf_index = 0;
+ }
+ else if(mode == 2)
+ {
+     ParallelCopyDataBlock *prev_data_block = NULL;
+     prev_data_block = cstate->pcdata->curr_data_block;
+     prev_data_block->following_block = block_pos;
+     cstate->pcdata->curr_data_block = data_block;
+
+     if (prev_data_block->curr_blk_completed == false)
+         prev_data_block->curr_blk_completed = true;
+
+     cstate->raw_buf_index = 0;
+ }

This code is common to both branches; keep it in the common flow and
remove the if (mode == 1):
cstate->pcdata->curr_data_block = data_block;
cstate->raw_buf_index = 0;

+#define CHECK_FIELD_COUNT \
+{\
+    if (fld_count == -1) \
+    { \
+        if (IsParallelCopy() && \
+            !IsLeader()) \
+            return true; \
+        else if (IsParallelCopy() && \
+            IsLeader()) \
+        { \
+            if (cstate->pcdata->curr_data_block->data[cstate->raw_buf_index + sizeof(fld_count)] != 0) \
+                ereport(ERROR, \
+                        (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), \
+                         errmsg("received copy data after EOF marker"))); \
+            return true; \
+        } \

We only copy sizeof(fld_count). Shouldn't we check fld_count !=
cstate->max_fields? Am I missing something here?

+ if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE - 1))
+ {
+     AdjustFieldInfo(cstate, 2);
+     *new_block_pos = pcshared_info->cur_block_pos;
+ }
+
+ memcpy(&fld_size, &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index], sizeof(fld_size));
+
+ cstate->raw_buf_index = cstate->raw_buf_index + sizeof(fld_size);
+
+ fld_size = (int32) pg_ntoh32(fld_size);
+
+ if (fld_size == 0)
+     ereport(ERROR,
+             (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+              errmsg("unexpected EOF in COPY data")));
+
+ if (fld_size < -1)
+     ereport(ERROR,
+             (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+              errmsg("invalid field size")));
+
+ if ((DATA_BLOCK_SIZE - cstate->raw_buf_index) >= fld_size)
+ {
+     cstate->raw_buf_index = cstate->raw_buf_index + fld_size;
+ }

We can keep the check like cstate->raw_buf_index + fld_size < ..., for
better readability and consistency.

+static pg_attribute_always_inline void
+CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
+    Oid typioparam, int32 typmod, uint32 *new_block_pos,
+    int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
+    ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)

flinfo, typioparam & typmod are not used; we can remove these parameters.

+static pg_attribute_always_inline void
+CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
+    Oid typioparam, int32 typmod, uint32 *new_block_pos,
+    int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
+    ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)

I felt this function need not be an inline function.

+ /* binary format */
+ /* for paralle copy leader, fill in the error

There are some typos; run a spell check.

+ /* raw_buf_index should never cross data block size,
+  * as the required number of data blocks would have
+  * been obtained in the above while loop.
+  */

In a few places the commenting style should be changed to the postgres
style.

+ if (cstate->pcdata->curr_data_block == NULL)
+ {
+     block_pos = WaitGetFreeCopyBlock(pcshared_info);
+
+     cstate->pcdata->curr_data_block = &pcshared_info->data_blocks[block_pos];
+
+     cstate->raw_buf_index = 0;
+
+     readbytes = CopyGetData(cstate, &cstate->pcdata->curr_data_block->data, 1, DATA_BLOCK_SIZE);
+
+     elog(DEBUG1, "LEADER - bytes read from file %d", readbytes);
+
+     if (cstate->reached_eof)
+         return true;
+ }

There are many empty lines; these are not required.

+ if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
+     AdjustFieldInfo(cstate, 1);
+
+ memcpy(&fld_count, &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index], sizeof(fld_count));
+
+ fld_count = (int16) pg_ntoh16(fld_count);
+
+ CHECK_FIELD_COUNT;
+
+ cstate->raw_buf_index = cstate->raw_buf_index + sizeof(fld_count);
+ new_block_pos = pcshared_info->cur_block_pos;

You can run pg_indent once for the changes.

+ if (mode == 1)
+ {
+     cstate->pcdata->curr_data_block = data_block;
+     cstate->raw_buf_index = 0;
+ }
+ else if(mode == 2)
+ {

Could use macros for 1 & 2, for better readability.

+ if (tuple_start_info_ptr->block_id == tuple_end_info_ptr->block_id)
+ {
+     elog(DEBUG1,"LEADER - tuple lies in a single data block");
+
+     *line_size = tuple_end_info_ptr->offset - tuple_start_info_ptr->offset + 1;
+     pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[tuple_start_info_ptr->block_id].unprocessed_line_parts, 1);
+ }
+ else
+ {
+     uint32 following_block_id = pcshared_info->data_blocks[tuple_start_info_ptr->block_id].following_block;
+
+     elog(DEBUG1,"LEADER - tuple is spread across data blocks");
+
+     *line_size = DATA_BLOCK_SIZE - tuple_start_info_ptr->offset -
+                  pcshared_info->data_blocks[tuple_start_info_ptr->block_id].skip_bytes;
+
+     pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[tuple_start_info_ptr->block_id].unprocessed_line_parts, 1);
+
+     while (following_block_id != tuple_end_info_ptr->block_id)
+     {
+         *line_size = *line_size + DATA_BLOCK_SIZE - pcshared_info->data_blocks[following_block_id].skip_bytes;
+
+         pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
+
+         following_block_id = pcshared_info->data_blocks[following_block_id].following_block;
+
+         if (following_block_id == -1)
+             break;
+     }
+
+     if (following_block_id != -1)
+         pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
+
+     *line_size = *line_size + tuple_end_info_ptr->offset + 1;
+ }

We could calculate the size as we parse and identify one record; if we
do it that way, this can be removed.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Thanks Vignesh for the review. Addressed the comments in the 0006 patch.

>
> we can create a local variable and use in place of
> cstate->pcdata->curr_data_block.

Done.

> + if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
> +     AdjustFieldInfo(cstate, 1);
> +
> + memcpy(&fld_count, &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index], sizeof(fld_count));
>
> Should this be like below, as the remaining size can fit in the current block?
> if (cstate->raw_buf_index + sizeof(fld_count) >= DATA_BLOCK_SIZE)
>
> + if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE - 1))
> + {
> +     AdjustFieldInfo(cstate, 2);
> +     *new_block_pos = pcshared_info->cur_block_pos;
> + }
>
> Same as above.

Yes, you are right. Changed.

> + movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
> +
> + cstate->pcdata->curr_data_block->skip_bytes = movebytes;
> +
> + data_block = &pcshared_info->data_blocks[block_pos];
> +
> + if (movebytes > 0)
>
> Instead of the above check, we can have an assert check for movebytes.

No, we can't use an assert here. For the edge case where the current
data block is filled to the full DATA_BLOCK_SIZE, movebytes will be 0,
but we still need to get a new data block. The movebytes > 0 check lets
us avoid the memmove in that case.

> + if (mode == 1)
> + {
> +     cstate->pcdata->curr_data_block = data_block;
> +     cstate->raw_buf_index = 0;
> + }
> + else if(mode == 2)
> + {
> +     ParallelCopyDataBlock *prev_data_block = NULL;
> +     prev_data_block = cstate->pcdata->curr_data_block;
> +     prev_data_block->following_block = block_pos;
> +     cstate->pcdata->curr_data_block = data_block;
> +
> +     if (prev_data_block->curr_blk_completed == false)
> +         prev_data_block->curr_blk_completed = true;
> +
> +     cstate->raw_buf_index = 0;
> + }
>
> This code is common to both branches; keep it in the common flow and
> remove the if (mode == 1):
> cstate->pcdata->curr_data_block = data_block;
> cstate->raw_buf_index = 0;

Done.

> +#define CHECK_FIELD_COUNT \
> +{\
> +    if (fld_count == -1) \
> +    { \
> +        if (IsParallelCopy() && \
> +            !IsLeader()) \
> +            return true; \
> +        else if (IsParallelCopy() && \
> +            IsLeader()) \
> +        { \
> +            if (cstate->pcdata->curr_data_block->data[cstate->raw_buf_index + sizeof(fld_count)] != 0) \
> +                ereport(ERROR, \
> +                        (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), \
> +                         errmsg("received copy data after EOF marker"))); \
> +            return true; \
> +        } \
>
> We only copy sizeof(fld_count). Shouldn't we check fld_count !=
> cstate->max_fields? Am I missing something here?

The fld_count != cstate->max_fields check is done after the above checks.

> + if ((DATA_BLOCK_SIZE - cstate->raw_buf_index) >= fld_size)
> + {
> +     cstate->raw_buf_index = cstate->raw_buf_index + fld_size;
> + }
>
> We can keep the check like cstate->raw_buf_index + fld_size < ..., for
> better readability and consistency.

I think this is okay as it is: it conveys that if the bytes available in
the current data block are greater than or equal to fld_size, then the
tuple lies in the current data block.

> +static pg_attribute_always_inline void
> +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> +    Oid typioparam, int32 typmod, uint32 *new_block_pos,
> +    int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> +    ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
>
> flinfo, typioparam & typmod are not used; we can remove these parameters.
>

Done.

> +static pg_attribute_always_inline void
> +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> +    Oid typioparam, int32 typmod, uint32 *new_block_pos,
> +    int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> +    ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
>
> I felt this function need not be an inline function.

Yes. Changed.

> + /* binary format */
> + /* for paralle copy leader, fill in the error
>
> There are some typos; run a spell check.

Done.

> + /* raw_buf_index should never cross data block size,
> +  * as the required number of data blocks would have
> +  * been obtained in the above while loop.
> +  */
>
> In a few places the commenting style should be changed to the postgres style.

Changed.

> + if (cstate->pcdata->curr_data_block == NULL)
> + {
> +     block_pos = WaitGetFreeCopyBlock(pcshared_info);
> +
> +     cstate->pcdata->curr_data_block = &pcshared_info->data_blocks[block_pos];
> +
> +     cstate->raw_buf_index = 0;
> +
> +     readbytes = CopyGetData(cstate, &cstate->pcdata->curr_data_block->data, 1, DATA_BLOCK_SIZE);
> +
> +     elog(DEBUG1, "LEADER - bytes read from file %d", readbytes);
> +
> +     if (cstate->reached_eof)
> +         return true;
> + }
>
> There are many empty lines; these are not required.
>

Removed.

> +
> + fld_count = (int16) pg_ntoh16(fld_count);
> +
> + CHECK_FIELD_COUNT;
> +
> + cstate->raw_buf_index = cstate->raw_buf_index + sizeof(fld_count);
> + new_block_pos = pcshared_info->cur_block_pos;
>
> You can run pg_indent once for the changes.
>

I ran pg_indent and observed that there are many places getting modified
by pg_indent. If we need to run pg_indent on copy.c for parallel copy
alone, then first we need to run it on plain copy.c and take those
changes, and then run it for all the parallel copy files. I think we had
better run pg_indent for all the parallel copy patches once and for all,
maybe just before we finish up all the code reviews.

> + if (mode == 1)
> + {
> +     cstate->pcdata->curr_data_block = data_block;
> +     cstate->raw_buf_index = 0;
> + }
> + else if(mode == 2)
> + {
>
> Could use macros for 1 & 2, for better readability.

Done.

> +
> + if (following_block_id == -1)
> +     break;
> + }
> +
> + if (following_block_id != -1)
> +     pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
> +
> + *line_size = *line_size + tuple_end_info_ptr->offset + 1;
> + }
>
> We could calculate the size as we parse and identify one record; if we
> do it that way, this can be removed.
>

Done.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
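To make the movebytes edge case above concrete, here is a short annotated excerpt based on the hunk quoted in this review; the declarations are assumed from the surrounding patch:

    uint32 movebytes;

    /*
     * Carry any partial field header over to the head of the next
     * shared data block.  If the current block ended exactly on the
     * boundary, movebytes is 0: a new block is still needed, but there
     * is nothing to carry over, so the memmove is skipped.
     */
    movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
    curr_data_block->skip_bytes = movebytes;
    data_block = &pcshared_info->data_blocks[block_pos];
    if (movebytes > 0)
        memmove(&data_block->data[0],
                &curr_data_block->data[cstate->raw_buf_index],
                movebytes);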
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
On Sat, 11 Jul 2020 at 08:55, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Thanks Vignesh for the review. Addressed the comments in 0006 patch. > > > > > we can create a local variable and use in place of > > cstate->pcdata->curr_data_block. > > Done. > > > + if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1)) > > + AdjustFieldInfo(cstate, 1); > > + > > + memcpy(&fld_count, > > &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index], > > sizeof(fld_count)); > > Should this be like below, as the remaining size can fit in current block: > > if (cstate->raw_buf_index + sizeof(fld_count) >= DATA_BLOCK_SIZE) > > > > + if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE - 1)) > > + { > > + AdjustFieldInfo(cstate, 2); > > + *new_block_pos = pcshared_info->cur_block_pos; > > + } > > Same like above. > > Yes you are right. Changed. > > > > > + movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index; > > + > > + cstate->pcdata->curr_data_block->skip_bytes = movebytes; > > + > > + data_block = &pcshared_info->data_blocks[block_pos]; > > + > > + if (movebytes > 0) > > Instead of the above check, we can have an assert check for movebytes. > > No, we can't use assert here. For the edge case where the current data > block is full to the size DATA_BLOCK_SIZE, then movebytes will be 0, > but we need to get a new data block. We avoid memmove by having > movebytes>0 check. > > > + if (mode == 1) > > + { > > + cstate->pcdata->curr_data_block = data_block; > > + cstate->raw_buf_index = 0; > > + } > > + else if(mode == 2) > > + { > > + ParallelCopyDataBlock *prev_data_block = NULL; > > + prev_data_block = cstate->pcdata->curr_data_block; > > + prev_data_block->following_block = block_pos; > > + cstate->pcdata->curr_data_block = data_block; > > + > > + if (prev_data_block->curr_blk_completed == false) > > + prev_data_block->curr_blk_completed = true; > > + > > + cstate->raw_buf_index = 0; > > + } > > > > This code is common for both, keep in common flow and remove if (mode == 1) > > cstate->pcdata->curr_data_block = data_block; > > cstate->raw_buf_index = 0; > > > > Done. > > > +#define CHECK_FIELD_COUNT \ > > +{\ > > + if (fld_count == -1) \ > > + { \ > > + if (IsParallelCopy() && \ > > + !IsLeader()) \ > > + return true; \ > > + else if (IsParallelCopy() && \ > > + IsLeader()) \ > > + { \ > > + if > > (cstate->pcdata->curr_data_block->data[cstate->raw_buf_index + > > sizeof(fld_count)] != 0) \ > > + ereport(ERROR, \ > > + > > (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), \ > > + errmsg("received copy > > data after EOF marker"))); \ > > + return true; \ > > + } \ > > We only copy sizeof(fld_count), Shouldn't we check fld_count != > > cstate->max_fields? Am I missing something here? > > fld_count != cstate->max_fields check is done after the above checks. > > > + if ((DATA_BLOCK_SIZE - cstate->raw_buf_index) >= fld_size) > > + { > > + cstate->raw_buf_index = cstate->raw_buf_index + fld_size; > > + } > > We can keep the check like cstate->raw_buf_index + fld_size < ..., for > > better readability and consistency. > > > > I think this is okay. It gives a good meaning that available bytes in > the current data block is greater or equal to fld_size then, the tuple > lies in the current data block. 
> > +static pg_attribute_always_inline void
> > +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> > + Oid typioparam, int32 typmod, uint32 *new_block_pos,
> > + int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> > + ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
> > flinfo, typioparam & typmod are not used, we can remove the parameters.
>
> Done.
>
> > +static pg_attribute_always_inline void
> > +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> > + Oid typioparam, int32 typmod, uint32 *new_block_pos,
> > + int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> > + ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
> > I felt this function need not be an inline function.
>
> Yes. Changed.
>
> > + /* binary format */
> > + /* for paralle copy leader, fill in the error
> > There are some typos, run spell check
>
> Done.
>
> > + /* raw_buf_index should never cross data block size,
> > + * as the required number of data blocks would have
> > + * been obtained in the above while loop.
> > + */
> > There are a few places where the commenting style should be changed to
> > the postgres style.
>
> Changed.
>
> > + if (cstate->pcdata->curr_data_block == NULL)
> > + {
> > + block_pos = WaitGetFreeCopyBlock(pcshared_info);
> > +
> > + cstate->pcdata->curr_data_block = &pcshared_info->data_blocks[block_pos];
> > +
> > + cstate->raw_buf_index = 0;
> > +
> > + readbytes = CopyGetData(cstate, &cstate->pcdata->curr_data_block->data, 1, DATA_BLOCK_SIZE);
> > +
> > + elog(DEBUG1, "LEADER - bytes read from file %d", readbytes);
> > +
> > + if (cstate->reached_eof)
> > + return true;
> > + }
> > There are many empty lines, these are not required.
>
> Removed.
>
> > +
> > + fld_count = (int16) pg_ntoh16(fld_count);
> > +
> > + CHECK_FIELD_COUNT;
> > +
> > + cstate->raw_buf_index = cstate->raw_buf_index + sizeof(fld_count);
> > + new_block_pos = pcshared_info->cur_block_pos;
> > You can run pg_indent once for the changes.
>
> I ran pg_indent and observed that there are many places getting
> modified by pg_indent. If we need to run pg_indent on copy.c for
> parallel copy alone, then first, we need to run it on plain copy.c,
> take those changes, and then run it for all parallel copy files. I
> think we had better run pg_indent for all the parallel copy patches
> once and for all, maybe just before we finish up all the code reviews.
>
> > + if (mode == 1)
> > + {
> > + cstate->pcdata->curr_data_block = data_block;
> > + cstate->raw_buf_index = 0;
> > + }
> > + else if(mode == 2)
> > + {
> > Could use macros for 1 & 2 for better readability.
>
> Done.
>
> > +
> > + if (following_block_id == -1)
> > + break;
> > + }
> > +
> > + if (following_block_id != -1)
> > + pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
> > +
> > + *line_size = *line_size + tuple_end_info_ptr->offset + 1;
> > + }
> > We could calculate the size as we parse and identify one record; if we
> > do it that way, this can be removed.
>
> Done.

Hi Bharath,

I was looking forward to review this patch-set but unfortunately it is
showing a reject in copy.c, and might need a rebase.
I was applying on master over the commit
cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb.

--
Regards,
Rafia Sabih
> Hi Bharath,
>
> I was looking forward to review this patch-set but unfortunately it is
> showing a reject in copy.c, and might need a rebase.
> I was applying on master over the commit
> cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb.
>

Thanks for showing interest. Please find the patch set rebased to
latest commit b1e48bbe64a411666bb1928b9741e112e267836d.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
On Sun, Jul 12, 2020 at 5:48 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> > Hi Bharath,
> >
> > I was looking forward to review this patch-set but unfortunately it is
> > showing a reject in copy.c, and might need a rebase.
> > I was applying on master over the commit
> > cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb.
>
> Thanks for showing interest. Please find the patch set rebased to
> latest commit b1e48bbe64a411666bb1928b9741e112e267836d.
>

Few comments:
====================
0001-Copy-code-readjustment-to-support-parallel-copy

I am not sure converting the code to macros is a good idea; it makes
this code harder to read. Also, there are a few changes which I am
not sure are necessary.

1.
+/*
+ * CLEAR_EOL_FROM_COPIED_DATA - Clear EOL from the copied data.
+ */
+#define CLEAR_EOL_FROM_COPIED_DATA(copy_line_data, copy_line_pos, copy_line_size) \
+{ \
+ /* \
+ * If we didn't hit EOF, then we must have transferred the EOL marker \
+ * to line_buf along with the data. Get rid of it. \
+ */ \
+ switch (cstate->eol_type) \
+ { \
+ case EOL_NL: \
+ Assert(copy_line_size >= 1); \
+ Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
+ copy_line_data[copy_line_pos - 1] = '\0'; \
+ copy_line_size--; \
+ break; \
+ case EOL_CR: \
+ Assert(copy_line_size >= 1); \
+ Assert(copy_line_data[copy_line_pos - 1] == '\r'); \
+ copy_line_data[copy_line_pos - 1] = '\0'; \
+ copy_line_size--; \
+ break; \
+ case EOL_CRNL: \
+ Assert(copy_line_size >= 2); \
+ Assert(copy_line_data[copy_line_pos - 2] == '\r'); \
+ Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
+ copy_line_data[copy_line_pos - 2] = '\0'; \
+ copy_line_size -= 2; \
+ break; \
+ case EOL_UNKNOWN: \
+ /* shouldn't get here */ \
+ Assert(false); \
+ break; \
+ } \
+}

In the original code, we are using only len and buffer; here we are
using position, length/size and buffer. Is it really required, or can
we do with just len and buffer?

2.
+/*
+ * INCREMENTPROCESSED - Increment the lines processed.
+ */
+#define INCREMENTPROCESSED(processed) \
+processed++;
+
+/*
+ * GETPROCESSED - Get the lines processed.
+ */
+#define GETPROCESSED(processed) \
+return processed;
+

I don't like converting the above to macros. I don't think converting
such things to macros will buy us much.

0002-Framework-for-leader-worker-in-parallel-copy
3.
/*
+ * Copy data block information.
+ */
+typedef struct ParallelCopyDataBlock

It is better to add a few comments atop this data structure to explain
how it is used.

4.
+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * this is protected by the following sequence in the leader & worker.
+ * Leader should operate in the following order:
+ * 1) update first_block, start_offset & cur_lineno in any order.
+ * 2) update line_size.
+ * 3) update line_state.
+ * Worker should operate in the following order:
+ * 1) read line_size.
+ * 2) only one worker should choose one line for processing, this is handled by
+ * using pg_atomic_compare_exchange_u32, worker will change the sate to
+ * LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
+ * 3) read first_block, start_offset & cur_lineno in any order.
+ */
+typedef struct ParallelCopyLineBoundary

Here, you have mentioned how workers and leader should operate to make
sure access to the data is sane. However, you have not explained what
the problem is if they don't do so, and it is not apparent to me.
Also, it is not very clear from the comments what the purpose of this
data structure is.

5.
+/*
+ * Circular queue used to store the line information.
+ */
+typedef struct ParallelCopyLineBoundaries
+{
+ /* Position for the leader to populate a line. */
+ uint32 leader_pos;

I don't think the variable needs to be named as leader_pos; it is okay
to name it as 'pos', as the comment above it explains its usage.

7.
+#define DATA_BLOCK_SIZE RAW_BUF_SIZE
+#define RINGSIZE (10 * 1000)
+#define MAX_BLOCKS_COUNT 1000
+#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */

It would be good if you can write a few comments to explain why you
have chosen these default values.

8.
ParallelCopyCommonKeyData, shall we name this as
SerializedParallelCopyState or something like that? For example, see
SerializedSnapshotData, which has been used to pass snapshot
information to workers.

9.
+CopyCommonInfoForWorker(CopyState cstate, ParallelCopyCommonKeyData *shared_cstate)

If you agree with point 8, then let's name this as
SerializeParallelCopyState. See if there is more usage of similar
types in the patch; then let's change those as well.

10.
+ * in the DSM. The specified number of workers will then be launched.
+ *
+ */
+static ParallelContext*
+BeginParallelCopy(int nworkers, CopyState cstate, List *attnamelist, Oid relid)

No need of an extra line with only '*' in the above multi-line comment.

11.
BeginParallelCopy(..)
{
..
+ EstimateLineKeysStr(pcxt, cstate->null_print);
+ EstimateLineKeysStr(pcxt, cstate->null_print_client);
+ EstimateLineKeysStr(pcxt, cstate->delim);
+ EstimateLineKeysStr(pcxt, cstate->quote);
+ EstimateLineKeysStr(pcxt, cstate->escape);
..
}

Why do we need to do this separately for each variable of cstate?
Can't we serialize it along with other members of
SerializeParallelCopyState (a new name for ParallelCopyCommonKeyData)?

12.
BeginParallelCopy(..)
{
..
+ LaunchParallelWorkers(pcxt);
+ if (pcxt->nworkers_launched == 0)
+ {
+ EndParallelCopy(pcxt);
+ elog(WARNING,
+ "No workers available, copy will be run in non-parallel mode");
..
}

I don't see the need to issue a WARNING if we are not able to launch
workers. We don't do that for other cases where we fail to launch
workers.

13.
+}
+/*
+ * ParallelCopyMain -
..

+}
+/*
+ * ParallelCopyLeader

One line space is required before starting a new function.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thanks for the comments, Amit. I have worked on them; my thoughts are
mentioned below.

On Wed, Jul 15, 2020 at 10:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few comments:
> ====================
> 0001-Copy-code-readjustment-to-support-parallel-copy
>
> I am not sure converting the code to macros is a good idea; it makes
> this code harder to read. Also, there are a few changes which I am
> not sure are necessary.
> 1.
> +/*
> + * CLEAR_EOL_FROM_COPIED_DATA - Clear EOL from the copied data.
> + */
> +#define CLEAR_EOL_FROM_COPIED_DATA(copy_line_data, copy_line_pos, copy_line_size) \
> +{ \
> + /* \
> + * If we didn't hit EOF, then we must have transferred the EOL marker \
> + * to line_buf along with the data. Get rid of it. \
> + */ \
> + switch (cstate->eol_type) \
> + { \
> + case EOL_NL: \
> + Assert(copy_line_size >= 1); \
> + Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
> + copy_line_data[copy_line_pos - 1] = '\0'; \
> + copy_line_size--; \
> + break; \
> + case EOL_CR: \
> + Assert(copy_line_size >= 1); \
> + Assert(copy_line_data[copy_line_pos - 1] == '\r'); \
> + copy_line_data[copy_line_pos - 1] = '\0'; \
> + copy_line_size--; \
> + break; \
> + case EOL_CRNL: \
> + Assert(copy_line_size >= 2); \
> + Assert(copy_line_data[copy_line_pos - 2] == '\r'); \
> + Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
> + copy_line_data[copy_line_pos - 2] = '\0'; \
> + copy_line_size -= 2; \
> + break; \
> + case EOL_UNKNOWN: \
> + /* shouldn't get here */ \
> + Assert(false); \
> + break; \
> + } \
> +}
>
> In the original code, we are using only len and buffer; here we are
> using position, length/size and buffer. Is it really required, or can
> we do with just len and buffer?
>

Position is required so that we can have common code for parallel &
non-parallel copy; in the case of parallel copy, position & length will
differ as a line can spread across multiple data blocks. Retained the
variables as is. Changed the macro to a function.

> 2.
> +/*
> + * INCREMENTPROCESSED - Increment the lines processed.
> + */
> +#define INCREMENTPROCESSED(processed) \
> +processed++;
> +
> +/*
> + * GETPROCESSED - Get the lines processed.
> + */
> +#define GETPROCESSED(processed) \
> +return processed;
> +
>
> I don't like converting the above to macros. I don't think converting
> such things to macros will buy us much.
>

This macro will be extended in
0003-Allow-copy-from-command-to-process-data-from-file.patch:

+#define INCREMENTPROCESSED(processed) \
+{ \
+ if (!IsParallelCopy()) \
+ processed++; \
+ else \
+ pg_atomic_add_fetch_u64(&cstate->pcdata->pcshared_info->processed, 1); \
+}

This needs to be a macro so that it can handle both parallel copy and
non-parallel copy. I am retaining this as a macro; if you insist, I can
move the change to the
0003-Allow-copy-from-command-to-process-data-from-file.patch patch.

> 0002-Framework-for-leader-worker-in-parallel-copy
> 3.
> /*
> + * Copy data block information.
> + */
> +typedef struct ParallelCopyDataBlock
>
> It is better to add a few comments atop this data structure to explain
> how it is used.
>

Fixed.

> 4.
> + * ParallelCopyLineBoundary is common data structure between leader & worker,
> + * this is protected by the following sequence in the leader & worker.
> + * Leader should operate in the following order:
> + * 1) update first_block, start_offset & cur_lineno in any order.
> + * 2) update line_size.
> + * 3) update line_state.
> + * Worker should operate in the following order:
> + * 1) read line_size.
> + * 2) only one worker should choose one line for processing, this is handled by
> + * using pg_atomic_compare_exchange_u32, worker will change the sate to
> + * LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
> + * 3) read first_block, start_offset & cur_lineno in any order.
> + */
> +typedef struct ParallelCopyLineBoundary
>
> Here, you have mentioned how workers and leader should operate to make
> sure access to the data is sane. However, you have not explained what
> the problem is if they don't do so, and it is not apparent to me.
> Also, it is not very clear from the comments what the purpose of this
> data structure is.
>

Fixed.

> 5.
> +/*
> + * Circular queue used to store the line information.
> + */
> +typedef struct ParallelCopyLineBoundaries
> +{
> + /* Position for the leader to populate a line. */
> + uint32 leader_pos;
>
> I don't think the variable needs to be named as leader_pos; it is okay
> to name it as 'pos', as the comment above it explains its usage.
>

Fixed.

> 7.
> +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> +#define RINGSIZE (10 * 1000)
> +#define MAX_BLOCKS_COUNT 1000
> +#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */
>
> It would be good if you can write a few comments to explain why you
> have chosen these default values.
>

Fixed.

> 8.
> ParallelCopyCommonKeyData, shall we name this as
> SerializedParallelCopyState or something like that? For example, see
> SerializedSnapshotData, which has been used to pass snapshot
> information to workers.
>

Renamed as suggested.

> 9.
> +CopyCommonInfoForWorker(CopyState cstate, ParallelCopyCommonKeyData *shared_cstate)
>
> If you agree with point 8, then let's name this as
> SerializeParallelCopyState. See if there is more usage of similar
> types in the patch; then let's change those as well.
>

Fixed.

> 10.
> + * in the DSM. The specified number of workers will then be launched.
> + *
> + */
> +static ParallelContext*
> +BeginParallelCopy(int nworkers, CopyState cstate, List *attnamelist, Oid relid)
>
> No need of an extra line with only '*' in the above multi-line comment.
>

Fixed.

> 11.
> BeginParallelCopy(..)
> {
> ..
> + EstimateLineKeysStr(pcxt, cstate->null_print);
> + EstimateLineKeysStr(pcxt, cstate->null_print_client);
> + EstimateLineKeysStr(pcxt, cstate->delim);
> + EstimateLineKeysStr(pcxt, cstate->quote);
> + EstimateLineKeysStr(pcxt, cstate->escape);
> ..
> }
>
> Why do we need to do this separately for each variable of cstate?
> Can't we serialize it along with other members of
> SerializeParallelCopyState (a new name for ParallelCopyCommonKeyData)?
>

These are variable-length string variables; I felt we will not be able
to serialize them along with the other members, and they need to be
serialized separately.

> 12.
> BeginParallelCopy(..)
> {
> ..
> + LaunchParallelWorkers(pcxt);
> + if (pcxt->nworkers_launched == 0)
> + {
> + EndParallelCopy(pcxt);
> + elog(WARNING,
> + "No workers available, copy will be run in non-parallel mode");
> ..
> }
>
> I don't see the need to issue a WARNING if we are not able to launch
> workers. We don't do that for other cases where we fail to launch
> workers.
>

Fixed.

> 13.
> +}
> +/*
> + * ParallelCopyMain -
> ..
>
> +}
> +/*
> + * ParallelCopyLeader
>
> One line space is required before starting a new function.
>

Fixed.

Please find the updated patch with the fixes included.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
>
> Please find the updated patch with the fixes included.
>

Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
had few indentation issues, I have fixed and attached the patch for
the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
Some review comments (mostly) from the leader side code changes:
1) Do we need a DSM key for the FORCE_QUOTE option? I think FORCE_QUOTE option is only used with COPY TO and not COPY FROM so not sure why you have added it.
PARALLEL_COPY_KEY_FORCE_QUOTE_LIST
2) Should we be allocating the parallel copy data structure only when it is confirmed that the parallel copy is allowed?
pcdata = (ParallelCopyData *) palloc0(sizeof(ParallelCopyData));
cstate->pcdata = pcdata;
Or, if you want it to be allocated before confirming if Parallel copy is allowed or not, then I think it would be good to allocate it in the *cstate->copycontext* memory context so that when EndCopy is called towards the end of the COPY FROM operation, the entire context itself gets deleted, thereby freeing the memory space allocated for pcdata. In fact it would be good to ensure that all the local memory allocated inside the cstate structure gets allocated in the *cstate->copycontext* memory context.
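For illustration, a minimal sketch of that suggestion, assuming the patch's ParallelCopyData type (the names come from the patch under review, not committed code):

    /*
     * Sketch only: allocate pcdata inside cstate->copycontext so that
     * EndCopy()'s deletion of that context reclaims it automatically.
     */
    MemoryContext oldcontext;

    oldcontext = MemoryContextSwitchTo(cstate->copycontext);
    cstate->pcdata = (ParallelCopyData *) palloc0(sizeof(ParallelCopyData));
    MemoryContextSwitchTo(oldcontext);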
3) Should we allow Parallel Copy when the insert method is CIM_MULTI_CONDITIONAL?
+ /* Check if the insertion mode is single. */
+ if (FindInsertMethod(cstate) == CIM_SINGLE)
+ return false;
I know we have added checks in CopyFrom() to ensure that if any trigger (before row or instead of) is found on any of partition being loaded with data, then COPY FROM operation would fail, but does it mean that we are okay to perform parallel copy on partitioned table. Have we done some performance testing with the partitioned table where the data in the input file needs to be routed to the different partitions?
4) There are lot of if-checks in IsParallelCopyAllowed function that are checked in CopyFrom function as well which means in case of Parallel Copy those checks will get executed multiple times (first by the leader and from second time onwards by each worker process). Is that required?
5) Should the worker process be calling this function when the leader has already called it once in ExecBeforeStmtTrigger()?
/* Verify the named relation is a valid target for INSERT */
CheckValidResultRel(resultRelInfo, CMD_INSERT);
6) I think it would be good to re-write the comments atop ParallelCopyLeader(). From the present comments it appears as if you were trying to put the information pointwise but somehow you ended up putting in a paragraph. The comments also have some typos like *line beaks* which possibly means line breaks. This is applicable for other comments as well where you
7) Is the following checking equivalent to IsWorker()? If so, it would be good to replace it with an IsWorker like macro to increase the readability.
(IsParallelCopy() && !IsLeader())
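For instance, the suggested macro could be as simple as the following sketch (IsParallelCopy() and IsLeader() are the patch's existing helpers; IsWorker is the hypothetical name):

    /* Hypothetical helper mirroring the patch's IsLeader() macro. */
    #define IsWorker() (IsParallelCopy() && !IsLeader())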
On Fri, Jul 17, 2020 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Please find the updated patch with the fixes included.
>
> Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
> had few indentation issues, I have fixed and attached the patch for
> the same.
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 17, 2020 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:
>
> >
> > Please find the updated patch with the fixes included.
> >
>
> Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
> had few indentation issues, I have fixed and attached the patch for
> the same.
>

Ensure to use the version with each patch-series as that makes it
easier for the reviewer to verify the changes done in the latest
version of the patch. One way is to use commands like "git
format-patch -6 -v <version_of_patch_series>" or you can add the
version number manually.

Review comments:
===================

0001-Copy-code-readjustment-to-support-parallel-copy
1.
@@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
 else
 nbytes = 0; /* no data need be saved */

+ if (cstate->copy_dest == COPY_NEW_FE)
+ minread = RAW_BUF_SIZE - nbytes;
+
 inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
- 1, RAW_BUF_SIZE - nbytes);
+ minread, RAW_BUF_SIZE - nbytes);

No comment to explain why this change is done?

0002-Framework-for-leader-worker-in-parallel-copy
2.
+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * Leader process will be populating data block, data block offset & the size of
+ * the record in DSM for the workers to copy the data into the relation.
+ * This is protected by the following sequence in the leader & worker. If they
+ * don't follow this order the worker might process wrong line_size and leader
+ * might populate the information which worker has not yet processed or in the
+ * process of processing.
+ * Leader should operate in the following order:
+ * 1) check if line_size is -1, if not wait, it means worker is still
+ * processing.
+ * 2) set line_state to LINE_LEADER_POPULATING.
+ * 3) update first_block, start_offset & cur_lineno in any order.
+ * 4) update line_size.
+ * 5) update line_state to LINE_LEADER_POPULATED.
+ * Worker should operate in the following order:
+ * 1) check line_state is LINE_LEADER_POPULATED, if not it means leader is still
+ * populating the data.
+ * 2) read line_size.
+ * 3) only one worker should choose one line for processing, this is handled by
+ * using pg_atomic_compare_exchange_u32, worker will change the sate to
+ * LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
+ * 4) read first_block, start_offset & cur_lineno in any order.
+ * 5) process line_size data.
+ * 6) update line_size to -1.
+ */
+typedef struct ParallelCopyLineBoundary

Are we doing all this state management to avoid using locks while
processing lines? If so, I think we can use either spinlock or LWLock
to keep the main patch simple and then provide a later patch to make
it lock-less. This will allow us to first focus on the main design of
the patch rather than trying to make this datastructure processing
lock-less in the best possible way.

3.
+ /*
+ * Actual lines inserted by worker (some records will be filtered based on
+ * where condition).
+ */
+ pg_atomic_uint64 processed;
+ pg_atomic_uint64 total_worker_processed; /* total processed records by the workers */

The difference between processed and total_worker_processed is not
clear. Can we expand the comments a bit?

4.
+ * SerializeList - Insert a list into shared memory.
+ */
+static void
+SerializeList(ParallelContext *pcxt, int key, List *inputlist,
+ Size est_list_size)
+{
+ if (inputlist != NIL)
+ {
+ ParallelCopyKeyListInfo *sharedlistinfo = (ParallelCopyKeyListInfo *)shm_toc_allocate(pcxt->toc,
+ est_list_size);
+ CopyListSharedMemory(inputlist, est_list_size, sharedlistinfo);
+ shm_toc_insert(pcxt->toc, key, sharedlistinfo);
+ }
+}

Why do we need to write a special mechanism (CopyListSharedMemory) to
serialize a list. Why can't we use nodeToString? It should be able
to take care of List datatype, see outNode which is called from
nodeToString. Once you do that, I think you won't need even
EstimateLineKeysList, strlen should work instead.

Check, if you have any similar special handling for other types that
can be dealt with nodeToString?

5.
+ MemSet(shared_info_ptr, 0, est_shared_info);
+ shared_info_ptr->is_read_in_progress = true;
+ shared_info_ptr->cur_block_pos = -1;
+ shared_info_ptr->full_transaction_id = full_transaction_id;
+ shared_info_ptr->mycid = GetCurrentCommandId(true);
+ for (count = 0; count < RINGSIZE; count++)
+ {
+ ParallelCopyLineBoundary *lineInfo = &shared_info_ptr->line_boundaries.ring[count];
+ pg_atomic_init_u32(&(lineInfo->line_size), -1);
+ }
+

You can move this initialization in a separate function.

6.
In function BeginParallelCopy(), you need to keep a provision to
collect wal_usage and buf_usage stats. See _bt_begin_parallel for
reference. Those will be required for pg_stat_statements.

7.
DeserializeString() -- it is better to name this function as RestoreString.
ParallelWorkerInitialization() -- it is better to name this function
as InitializeParallelCopyInfo or something like that, the current name
is quite confusing.
ParallelCopyLeader() -- how about ParallelCopyFrom? ParallelCopyLeader
doesn't sound good to me. You can suggest something else if you don't
like ParallelCopyFrom.

8.
/*
- * PopulateGlobalsForCopyFrom - Populates the common variables required for copy
- * from operation. This is a helper function for BeginCopy function.
+ * PopulateCatalogInformation - Populates the common variables required for copy
+ * from operation. This is a helper function for BeginCopy &
+ * ParallelWorkerInitialization function.
 */
 static void
 PopulateGlobalsForCopyFrom(CopyState cstate, TupleDesc tupDesc,
- List *attnamelist)
+ List *attnamelist)

The actual function name and the name in the function header don't match.
I also don't like this function name, how about
PopulateCommonCstateInfo? Similarly how about changing
PopulateCatalogInformation to PopulateCstateCatalogInfo?

9.
+static const struct
+{
+ char *fn_name;
+ copy_data_source_cb fn_addr;
+} InternalParallelCopyFuncPtrs[] =
+
+{
+ {
+ "copy_read_data", copy_read_data
+ },
+};

The function copy_read_data is present in
src/backend/replication/logical/tablesync.c and seems to be used
during logical replication. Why do we want to expose this function as
part of this patch?

0003-Allow-copy-from-command-to-process-data-from-file-ST
10.
In the commit message, you have written "The leader does not
participate in the insertion of data, leaders only responsibility will
be to identify the lines as fast as possible for the workers to do the
actual copy operation. The leader waits till all the lines populated
are processed by the workers and exits."
I think you should also mention that we have chosen this design based
on the reason "that everything stalls if the leader doesn't accept
further input data, as well as when there are no available splitted
chunks so it doesn't seem like a good idea to have the leader do other
work. This is backed by the performance data where we have seen that
with 1 worker there is just a 5-10% (or whatever percentage difference
you have seen) performance difference".

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thanks for your comments, Amit. I have worked on them; my thoughts on the same are mentioned below.
On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 17, 2020 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > >
> > > Please find the updated patch with the fixes included.
> > >
> >
> > Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
> > had few indentation issues, I have fixed and attached the patch for
> > the same.
> >
>
> Ensure to use the version with each patch-series as that makes it
> easier for the reviewer to verify the changes done in the latest
> version of the patch. One way is to use commands like "git
> format-patch -6 -v <version_of_patch_series>" or you can add the
> version number manually.
>
Taken care.
> Review comments:
> ===================
>
> 0001-Copy-code-readjustment-to-support-parallel-copy
> 1.
> @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
> else
> nbytes = 0; /* no data need be saved */
>
> + if (cstate->copy_dest == COPY_NEW_FE)
> + minread = RAW_BUF_SIZE - nbytes;
> +
> inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> - 1, RAW_BUF_SIZE - nbytes);
> + minread, RAW_BUF_SIZE - nbytes);
>
> No comment to explain why this change is done?
>
> 0002-Framework-for-leader-worker-in-parallel-copy
Currently, CopyGetData copies a lesser amount of data into the buffer even though space is available, because minread was passed as 1 to CopyGetData. Because of this, there are frequent calls to CopyGetData for fetching the data. It loads only some of the data due to the below check:
while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
After reading some data, bytesread becomes greater than minread (which is passed as 1) and the function returns with a lesser amount of data, even though there is still space.
This change is required for the parallel copy feature, as each time we get a new DSM data block of 64K size and copy the data into it. If we copy less data into the DSM data blocks, we might end up consuming all of them. I felt this issue can be fixed in HEAD and have posted a separate thread [1] for it. I'm planning to remove that change once it gets committed. Can that go as a separate patch, or should we include it here?
[1] - https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com
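To illustrate the effect, here is a sketch of the call site only (block->data is illustrative naming; CopyGetData and DATA_BLOCK_SIZE are from the patch, and the exact upstream code differs):

    int nread;

    /* Before: minread = 1, so CopyGetData() may return after a single
     * small frontend message, leaving most of the 64KB block unused. */
    nread = CopyGetData(cstate, block->data, 1, DATA_BLOCK_SIZE);

    /* After: request the whole block, so the read loop keeps going
     * until the block is full or EOF is reached. */
    nread = CopyGetData(cstate, block->data, DATA_BLOCK_SIZE, DATA_BLOCK_SIZE);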
> 2.
> + * ParallelCopyLineBoundary is common data structure between leader & worker,
> + * Leader process will be populating data block, data block offset &
> the size of
> + * the record in DSM for the workers to copy the data into the relation.
> + * This is protected by the following sequence in the leader & worker. If they
> + * don't follow this order the worker might process wrong line_size and leader
> + * might populate the information which worker has not yet processed or in the
> + * process of processing.
> + * Leader should operate in the following order:
> + * 1) check if line_size is -1, if not wait, it means worker is still
> + * processing.
> + * 2) set line_state to LINE_LEADER_POPULATING.
> + * 3) update first_block, start_offset & cur_lineno in any order.
> + * 4) update line_size.
> + * 5) update line_state to LINE_LEADER_POPULATED.
> + * Worker should operate in the following order:
> + * 1) check line_state is LINE_LEADER_POPULATED, if not it means
> leader is still
> + * populating the data.
> + * 2) read line_size.
> + * 3) only one worker should choose one line for processing, this is handled by
> + * using pg_atomic_compare_exchange_u32, worker will change the sate to
> + * LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
> + * 4) read first_block, start_offset & cur_lineno in any order.
> + * 5) process line_size data.
> + * 6) update line_size to -1.
> + */
> +typedef struct ParallelCopyLineBoundary
>
> Are we doing all this state management to avoid using locks while
> processing lines? If so, I think we can use either spinlock or LWLock
> to keep the main patch simple and then provide a later patch to make
> it lock-less. This will allow us to first focus on the main design of
> the patch rather than trying to make this datastructure processing
> lock-less in the best possible way.
>
The steps will be more or less the same if we use a spinlock too: steps 1, 3 and 4 will be common, and we would have to use lock & unlock instead of steps 2 and 5. I feel we can retain the current implementation.
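For reference, the worker-side claim described above boils down to something like the following sketch (field and constant names follow the patch, assuming line_state is a pg_atomic_uint32 as the compare-exchange implies; ProcessLine() is a placeholder for the actual work):

    uint32 expected = LINE_LEADER_POPULATED;

    /* Only one worker's compare-and-exchange succeeds, so exactly one
     * worker owns the line; the others move on to the next slot. */
    if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
                                       &expected,
                                       LINE_WORKER_PROCESSING))
    {
        /* Safe to read first_block, start_offset and cur_lineno now. */
        ProcessLine(lineInfo);                          /* placeholder */
        pg_atomic_write_u32(&lineInfo->line_size, -1);  /* free the slot */
    }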
> 3.
> + /*
> + * Actual lines inserted by worker (some records will be filtered based on
> + * where condition).
> + */
> + pg_atomic_uint64 processed;
> + pg_atomic_uint64 total_worker_processed; /* total processed records
> by the workers */
>
> The difference between processed and total_worker_processed is not
> clear. Can we expand the comments a bit?
>
Fixed
> 4.
> + * SerializeList - Insert a list into shared memory.
> + */
> +static void
> +SerializeList(ParallelContext *pcxt, int key, List *inputlist,
> + Size est_list_size)
> +{
> + if (inputlist != NIL)
> + {
> + ParallelCopyKeyListInfo *sharedlistinfo = (ParallelCopyKeyListInfo
> *)shm_toc_allocate(pcxt->toc,
> + est_list_size);
> + CopyListSharedMemory(inputlist, est_list_size, sharedlistinfo);
> + shm_toc_insert(pcxt->toc, key, sharedlistinfo);
> + }
> +}
>
> Why do we need to write a special mechanism (CopyListSharedMemory) to
> serialize a list. Why can't we use nodeToString? It should be able
> to take care of List datatype, see outNode which is called from
> nodeToString. Once you do that, I think you won't need even
> EstimateLineKeysList, strlen should work instead.
>
> Check, if you have any similar special handling for other types that
> can be dealt with nodeToString?
>
Fixed
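As a rough sketch of the nodeToString() approach adopted here (variable names are illustrative; shm_toc_* are the standard parallel-context APIs):

    /* Leader: serialize any Node, including a List, as its string form. */
    char    *liststr = nodeToString(inputlist);
    char    *shmptr = (char *) shm_toc_allocate(pcxt->toc, strlen(liststr) + 1);

    strcpy(shmptr, liststr);
    shm_toc_insert(pcxt->toc, key, shmptr);

    /* Worker: restore the List from the serialized string. */
    List    *restored = (List *) stringToNode(shm_toc_lookup(toc, key, false));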
> 5.
> + MemSet(shared_info_ptr, 0, est_shared_info);
> + shared_info_ptr->is_read_in_progress = true;
> + shared_info_ptr->cur_block_pos = -1;
> + shared_info_ptr->full_transaction_id = full_transaction_id;
> + shared_info_ptr->mycid = GetCurrentCommandId(true);
> + for (count = 0; count < RINGSIZE; count++)
> + {
> + ParallelCopyLineBoundary *lineInfo =
> &shared_info_ptr->line_boundaries.ring[count];
> + pg_atomic_init_u32(&(lineInfo->line_size), -1);
> + }
> +
>
> You can move this initialization in a separate function.
>
Fixed
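The extracted helper might look like this sketch (the shared-info type and function names are assumed; the loop body is taken from the quoted code):

    static void
    InitializeLineBoundaries(ParallelCopyShmInfo *shared_info_ptr)
    {
        int count;

        /* Mark every ring slot as free (-1) before workers start. */
        for (count = 0; count < RINGSIZE; count++)
        {
            ParallelCopyLineBoundary *lineInfo =
                &shared_info_ptr->line_boundaries.ring[count];

            pg_atomic_init_u32(&(lineInfo->line_size), -1);
        }
    }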
> 6.
> In function BeginParallelCopy(), you need to keep a provision to
> collect wal_usage and buf_usage stats. See _bt_begin_parallel for
> reference. Those will be required for pg_stat_statements.
>
Fixed
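For context, the provision being asked for follows the usual parallel-operation pattern; a sketch modeled on _bt_begin_parallel() (the bufferusage/walusage arrays are assumed to live in the DSM, and the TOC wiring is omitted):

    /* Leader, while sizing the DSM: one usage slot per worker. */
    shm_toc_estimate_chunk(&pcxt->estimator,
                           mul_size(sizeof(BufferUsage), pcxt->nworkers));
    shm_toc_estimate_chunk(&pcxt->estimator,
                           mul_size(sizeof(WalUsage), pcxt->nworkers));

    /* Worker, around the copy work: */
    InstrStartParallelQuery();
    /* ... perform the parallel copy ... */
    InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
                          &walusage[ParallelWorkerNumber]);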
> 7.
> DeserializeString() -- it is better to name this function as RestoreString.
> ParallelWorkerInitialization() -- it is better to name this function
> as InitializeParallelCopyInfo or something like that, the current name
> is quite confusing.
> ParallelCopyLeader() -- how about ParallelCopyFrom? ParallelCopyLeader
> doesn't sound good to me. You can suggest something else if you don't
> like ParallelCopyFrom
>
Fixed
> 8.
> /*
> - * PopulateGlobalsForCopyFrom - Populates the common variables
> required for copy
> - * from operation. This is a helper function for BeginCopy function.
> + * PopulateCatalogInformation - Populates the common variables
> required for copy
> + * from operation. This is a helper function for BeginCopy &
> + * ParallelWorkerInitialization function.
> */
> static void
> PopulateGlobalsForCopyFrom(CopyState cstate, TupleDesc tupDesc,
> - List *attnamelist)
> + List *attnamelist)
>
> The actual function name and the name in function header don't match.
> I also don't like this function name, how about
> PopulateCommonCstateInfo? Similarly how about changing
> PopulateCatalogInformation to PopulateCstateCatalogInfo?
>
Fixed
> 9.
> +static const struct
> +{
> + char *fn_name;
> + copy_data_source_cb fn_addr;
> +} InternalParallelCopyFuncPtrs[] =
> +
> +{
> + {
> + "copy_read_data", copy_read_data
> + },
> +};
>
> The function copy_read_data is present in
> src/backend/replication/logical/tablesync.c and seems to be used
> during logical replication. Why do we want to expose this function as
> part of this patch?
>
I was thinking we could include the framework to support parallelism for logical replication too, and it could be enhanced when needed. I have now removed this in the new patch provided; it can be added back whenever required.
> 0003-Allow-copy-from-command-to-process-data-from-file-ST
> 10.
> In the commit message, you have written "The leader does not
> participate in the insertion of data, leaders only responsibility will
> be to identify the lines as fast as possible for the workers to do the
> actual copy operation. The leader waits till all the lines populated
> are processed by the workers and exits."
>
> I think you should also mention that we have chosen this design based
> on the reason "that everything stalls if the leader doesn't accept
> further input data, as well as when there are no available splitted
> chunks so it doesn't seem like a good idea to have the leader do other
> work. This is backed by the performance data where we have seen that
> with 1 worker there is just a 5-10% (or whatever percentage difference
> you have seen) performance difference)".
Fixed.
Please find the new patch attached with the fixes.
Thoughts?
Attachment
- v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v2-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v2-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v2-0004-Documentation-for-parallel-copy.patch
- v2-0005-Tests-for-parallel-copy.patch
- v2-0006-Parallel-Copy-For-Binary-Format-Files.patch
Thanks for reviewing and providing the comments Ashutosh.
Please find my thoughts below:
On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Some review comments (mostly) from the leader side code changes:
>
> 1) Do we need a DSM key for the FORCE_QUOTE option? I think FORCE_QUOTE option is only used with COPY TO and not COPY FROM so not sure why you have added it.
>
> PARALLEL_COPY_KEY_FORCE_QUOTE_LIST
>
Fixed
> 2) Should we be allocating the parallel copy data structure only when it is confirmed that the parallel copy is allowed?
>
> pcdata = (ParallelCopyData *) palloc0(sizeof(ParallelCopyData));
> cstate->pcdata = pcdata;
>
> Or, if you want it to be allocated before confirming if Parallel copy is allowed or not, then I think it would be good to allocate it in *cstate->copycontext* memory context so that when EndCopy is called towards the end of the COPY FROM operation, the entire context itself gets deleted thereby freeing the memory space allocated for pcdata. In fact it would be good to ensure that all the local memory allocated inside the ctstate structure gets allocated in the *cstate->copycontext* memory context.
>
Fixed
> 3) Should we allow Parallel Copy when the insert method is CIM_MULTI_CONDITIONAL?
>
> + /* Check if the insertion mode is single. */
> + if (FindInsertMethod(cstate) == CIM_SINGLE)
> + return false;
>
> I know we have added checks in CopyFrom() to ensure that if any trigger (before row or instead of) is found on any of partition being loaded with data, then COPY FROM operation would fail, but does it mean that we are okay to perform parallel copy on partitioned table. Have we done some performance testing with the partitioned table where the data in the input file needs to be routed to the different partitions?
>
Partition data is handled like what Amit had told in one of earlier mails [1]. My colleague Bharath has run performance test with partition table, he will be sharing the results.
> 4) There are lot of if-checks in IsParallelCopyAllowed function that are checked in CopyFrom function as well which means in case of Parallel Copy those checks will get executed multiple times (first by the leader and from second time onwards by each worker process). Is that required?
>
It is called from BeginParallelCopy, so it will be called only once. This change is ok.
> 5) Should the worker process be calling this function when the leader has already called it once in ExecBeforeStmtTrigger()?
>
> /* Verify the named relation is a valid target for INSERT */
> CheckValidResultRel(resultRelInfo, CMD_INSERT);
>
Fixed.
> 6) I think it would be good to re-write the comments atop ParallelCopyLeader(). From the present comments it appears as if you were trying to put the information pointwise but somehow you ended up putting in a paragraph. The comments also have some typos like *line beaks* which possibly means line breaks. This is applicable for other comments as well where you
>
Fixed.
> 7) Is the following checking equivalent to IsWorker()? If so, it would be good to replace it with an IsWorker like macro to increase the readability.
>
> (IsParallelCopy() && !IsLeader())
>
Fixed.
These have been fixed and the new patch is attached as part of my previous mail.
[1] - https://www.postgresql.org/message-id/CAA4eK1LQPxULxw8JpucX0PwzQQRk%3Dq4jG32cU1us2%2B-mtzZUQg%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 22, 2020 at 7:56 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks for reviewing and providing the comments Ashutosh.
> Please find my thoughts below:
>
> On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Some review comments (mostly) from the leader side code changes:
> >
> > 3) Should we allow Parallel Copy when the insert method is CIM_MULTI_CONDITIONAL?
> >
> > + /* Check if the insertion mode is single. */
> > + if (FindInsertMethod(cstate) == CIM_SINGLE)
> > + return false;
> >
> > I know we have added checks in CopyFrom() to ensure that if any trigger (before row or instead of) is found on any of partition being loaded with data, then COPY FROM operation would fail, but does it mean that we are okay to perform parallel copy on partitioned table. Have we done some performance testing with the partitioned table where the data in the input file needs to be routed to the different partitions?
> >
>
> Partition data is handled like what Amit had told in one of earlier mails [1]. My colleague Bharath has run performance test with partition table, he will be sharing the results.
I ran tests for partitioned use cases - results are similar to that of non partitioned cases[1].
parallel workers | test case 1(exec time in sec): copy from csv file, 5.1GB, 10million tuples, 4 range partitions, 3 indexes on integer columns unique data | test case 2(exec time in sec): copy from csv file, 5.1GB, 10million tuples, 4 range partitions, unique data |
0 | 205.403(1X) | 135(1X) |
2 | 114.724(1.79X) | 59.388(2.27X) |
4 | 99.017(2.07X) | 56.742(2.34X) |
8 | 99.722(2.06X) | 66.323(2.03X) |
16 | 98.147(2.09X) | 66.054(2.04X) |
20 | 97.723(2.1X) | 66.389(2.03X) |
30 | 97.048(2.11X) | 70.568(1.91X) |
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 23, 2020 at 8:51 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I ran tests for partitioned use cases - results are similar to those of
> non-partitioned cases [1].
>
I could see the gain up to 10-11 times for non-partitioned cases [1]; can we use a similar test case here as well (with one of the indexes on a text column, or having a gist index) to see its impact?

[1] - https://www.postgresql.org/message-id/CALj2ACVR4WE98Per1H7ajosW8vafN16548O2UV8bG3p4D3XnPg%40mail.gmail.com
I think, when doing the performance testing for a partitioned table, it would be good to also mention the distribution of data in the input file. One possible data distribution could be that we have, let's say, 100 tuples in the input file, and every consecutive tuple belongs to a different partition.
On Thu, Jul 23, 2020 at 9:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>>
>> I ran tests for partitioned use cases - results are similar to that of non partitioned cases[1].
>
>
> I could see the gain up to 10-11 times for non-partitioned cases [1], can we use similar test case here as well (with one of the indexes on text column or having gist index) to see its impact?
>
> [1] - https://www.postgresql.org/message-id/CALj2ACVR4WE98Per1H7ajosW8vafN16548O2UV8bG3p4D3XnPg%40mail.gmail.com
Thanks Amit! Please find the results of detailed testing done for partitioned use cases:
Range Partitions: consecutive rows go into the same partition.
parallel workers | test case 1 (exec time in sec): copy from csv file, 2 indexes on integer columns and 1 index on text column, 4 range partitions | test case 2 (exec time in sec): copy from csv file, 1 gist index on text column, 4 range partitions | test case 3 (exec time in sec): copy from csv file, 3 indexes on integer columns, 4 range partitions |
0 | 1051.924(1X) | 785.052(1X) | 205.403(1X) |
2 | 589.576(1.78X) | 421.974(1.86X) | 114.724(1.79X) |
4 | 321.960(3.27X) | 230.997(3.4X) | 99.017(2.07X) |
8 | 199.245(5.23X) | 156.132(5.02X) | 99.722(2.06X) |
16 | 127.343(8.26X) | 173.696(4.52X) | 98.147(2.09X) |
20 | 122.029(8.62X) | 186.418(4.21X) | 97.723(2.1X) |
30 | 142.876(7.36X) | 214.598(3.66X) | 97.048(2.11X) |
On Thu, Jul 23, 2020 at 10:21 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> I think, when doing the performance testing for a partitioned table, it would be good to also mention the distribution of data in the input file. One possible data distribution could be that we have, let's say, 100 tuples in the input file, and every consecutive tuple belongs to a different partition.
>
To address Ashutosh's point, I used hash partitioning. Hope this helps to clear the doubt.
Hash Partitions: there is a high chance that consecutive rows go into different partitions.
parallel workers | test case 1 (exec time in sec): copy from csv file, 2 indexes on integer columns and 1 index on text column, 4 hash partitions | test case 2 (exec time in sec): copy from csv file, 1 gist index on text column, 4 hash partitions | test case 3 (exec time in sec): copy from csv file, 3 indexes on integer columns, 4 hash partitions |
0 | 1060.884(1X) | 812.283(1X) | 207.745(1X) |
2 | 572.542(1.85X) | 418.454(1.94X) | 107.850(1.93X) |
4 | 298.132(3.56X) | 227.367(3.57X) | 83.895(2.48X) |
8 | 169.449(6.26X) | 137.993(5.89X) | 85.411(2.43X) |
16 | 112.297(9.45X) | 95.167(8.53X) | 96.136(2.16X) |
20 | 101.546(10.45X) | 90.552(8.97X) | 97.066(2.14X) |
30 | 113.877(9.32X) | 127.17(6.38X) | 96.819(2.14X) |
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
The patches were not applying because of the recent commits.
I have rebased the patch over head & attached.
Attachment
On Sat, Aug 1, 2020 at 9:55 AM vignesh C <vignesh21@gmail.com> wrote:
>
> The patches were not applying because of the recent commits.
> I have rebased the patch over head & attached.
>

I rebased v2-0006-Parallel-Copy-For-Binary-Format-Files.patch.

Putting together all the patches, rebased onto the latest commit b8fdee7d0ca8bd2165d46fb1468f75571b706a01: patches 0001 to 0005 are rebased by Vignesh and are from the previous mail, and patch 0006 is rebased by me.

Please consider this patch set for further review.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v2-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v2-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v2-0004-Documentation-for-parallel-copy.patch
- v2-0005-Tests-for-parallel-copy.patch
- v2-0006-Parallel-Copy-For-Binary-Format-Files.patch
On Mon, Aug 03, 2020 at 12:33:48PM +0530, Bharath Rupireddy wrote:
> I rebased v2-0006-Parallel-Copy-For-Binary-Format-Files.patch.
>
> Putting together all the patches, rebased onto the latest commit
> b8fdee7d0ca8bd2165d46fb1468f75571b706a01.
>
> Please consider this patch set for further review.
>

I'd suggest incrementing the version every time an updated version is submitted, even if it's just a rebased version. It makes it clearer which version of the code is being discussed, etc.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Aug 4, 2020 at 9:51 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> I'd suggest incrementing the version every time an updated version is
> submitted, even if it's just a rebased version. It makes it clearer
> which version of the code is being discussed, etc.
>

Sure, we will take care of this when we are sending the next set of patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, failed

Hi,

I don't claim to yet understand all of the Postgres internals that this patch is updating and interacting with, so I'm still testing and debugging portions of this patch, but would like to give feedback on what I've noticed so far.

I have done some ad-hoc testing of the patch using parallel copies from text/csv/binary files and have not yet struck any execution problems, other than some option validation and associated error messages on boundary cases.

One general question that I have: is there a user benefit (over the normal non-parallel COPY) to allowing "COPY ... FROM ... WITH (PARALLEL 1)"?

My following comments are broken down by patch:

(1) v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch

(i) Whilst I can't entirely blame these patches for it (as they are following what is already there), I can't help noticing the use of numerous macros in src/backend/commands/copy.c which paste in multiple lines of code in various places. It's getting a little out of hand. Surely the majority of these would be best as inline functions instead? Perhaps this hasn't been done because too many parameters need to be passed - thoughts?

(2) v2-0002-Framework-for-leader-worker-in-parallel-copy.patch

(i) Minor point: there are some tabbing/spacing issues in this patch (and the other patches), affecting alignment, e.g. mixed tabs/spaces and misalignment in the PARALLEL_COPY_KEY_xxx definitions.

(ii)

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be mode of
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50

"This value should be mode of RINGSIZE ..."

-> typo: mode (mod? should evenly divide into RINGSIZE?)

(iii)

+ * using pg_atomic_compare_exchange_u32, worker will change the sate to

-> typo: sate (should be "state")

(iv)

+ errmsg("parallel option supported only for copy from"),

-> suggest change to: errmsg("parallel option is supported only for COPY FROM"),

(v)

+ errno = 0; /* To distinguish success/failure after call */
+ val = strtol(str, &endptr, 10);
+
+ /* Check for various possible errors */
+ if ((errno == ERANGE && (val == LONG_MAX || val == LONG_MIN))
+ || (errno != 0 && val == 0) ||
+ *endptr)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("improper use of argument to option \"%s\"",
+ defel->defname),
+ parser_errposition(pstate, defel->location)));
+
+ if (endptr == str)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("no digits were found in argument to option \"%s\"",
+ defel->defname),
+ parser_errposition(pstate, defel->location)));
+
+ cstate->nworkers = (int) val;
+
+ if (cstate->nworkers <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("argument to option \"%s\" must be a positive integer greater than zero",
+ defel->defname),
+ parser_errposition(pstate, defel->location)));

I think this validation code needs to be improved, including the error messages (e.g. when can a "positive integer" NOT be greater than zero?).
There is some overlap in the "no digits were found" case between the two conditions above, depending, for example, on whether the argument is quoted.
Also, "improper use of argument to option" sounds a bit odd and vague to me.
Finally, not range-checking before casting long to int can lead to allowing out-of-range int values, as in the following case:

test=# copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2147483648');
ERROR:  argument to option "parallel" must be a positive integer greater than zero
LINE 1: copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2...

BUT the following is allowed...

test=# copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2147483649');
COPY 1000000

I'd suggest changing the above validation code to do similar validation to that for the CREATE TABLE parallel_workers storage parameter (case RELOPT_TYPE_INT in reloptions.c). Like that code, wouldn't it be best to range-check the integer option value to be within a reasonable range, say 1 to 1024, with a corresponding errdetail message if possible?

(3) v2-0003-Allow-copy-from-command-to-process-data-from-file.patch

(i) Patch comment says:

"This feature allows the copy from to leverage multiple CPUs in order to copy
data from file/STDIN to a table. This adds a PARALLEL option to COPY FROM
command where the user can specify the number of workers that can be used
to perform the COPY FROM command. Specifying zero as number of workers will
disable parallelism."

BUT - the changes to ProcessCopyOptions() specified in "v2-0002-Framework-for-leader-worker-in-parallel-copy.patch" do not allow zero workers to be specified - you get an error in that case. The patch comment should be updated accordingly.

(ii)

#define GETPROCESSED(processed) \
-return processed;
+if (!IsParallelCopy()) \
+ return processed; \
+else \
+ return pg_atomic_read_u64(&cstate->pcdata->pcshared_info->processed);
+

I think GETPROCESSED would be better named "RETURNPROCESSED".

(iii) The below comment seems out-of-date with the current code - is it referring to the loop embedded at the bottom of the current loop that the comment is within?

+ /*
+ * There is a possibility that the above loop has come out because
+ * data_blk_ptr->curr_blk_completed is set, but dataSize read might
+ * be an old value, if data_blk_ptr->curr_blk_completed and the line is
+ * completed, line_size will be set. Read the line_size again to be
+ * sure if it is complete or partial block.
+ */

(iv) I may be wrong here, but in the following block of code, isn't there a window of opportunity (however small) in which the line_state might be updated (LINE_WORKER_PROCESSED) by another worker just AFTER pg_atomic_read_u32() returns the current line_state which is put into curr_line_state, such that a write_pos update might be missed? And then a race condition exists for reading/setting line_size (since line_size gets atomically set after line_state is set)?
If I am wrong in thinking this synchronization might not be correct, maybe the comments could be improved here to explain how this code is safe in that respect.

+ /* Get the current line information. */
+ lineInfo = &pcshared_info->line_boundaries.ring[write_pos];
+ curr_line_state = pg_atomic_read_u32(&lineInfo->line_state);
+ if ((write_pos % WORKER_CHUNK_COUNT == 0) &&
+ (curr_line_state == LINE_WORKER_PROCESSED ||
+ curr_line_state == LINE_WORKER_PROCESSING))
+ {
+ pcdata->worker_processed_pos = write_pos;
+ write_pos = (write_pos + WORKER_CHUNK_COUNT) % RINGSIZE;
+ continue;
+ }
+
+ /* Get the size of this line. */
+ dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+
+ if (dataSize != 0) /* If not an empty line. */
+ {
+ /* Get the block information. */
+ data_blk_ptr = &pcshared_info->data_blocks[lineInfo->first_block];
+
+ if (!data_blk_ptr->curr_blk_completed && (dataSize == -1))
+ {
+ /* Wait till the current line or block is added. */
+ COPY_WAIT_TO_PROCESS()
+ continue;
+ }
+ }
+
+ /* Make sure that no worker has consumed this element. */
+ if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
+ &line_state, LINE_WORKER_PROCESSING))
+ break;

(4) v2-0004-Documentation-for-parallel-copy.patch

(i) I think that it is necessary to mention the "max_worker_processes" option in the description of the COPY statement PARALLEL option.

For example, something like:

+ Perform <command>COPY FROM</command> in parallel using <replaceable
+ class="parameter"> integer</replaceable> background workers. Please
+ note that it is not guaranteed that the number of parallel workers
+ specified in <replaceable class="parameter">integer</replaceable> will
+ be used during execution. It is possible for a copy to run with fewer
+ workers than specified, or even with no workers at all (for example,
+ due to the setting of max_worker_processes). This option is allowed
+ only in <command>COPY FROM</command>.

(5) v2-0005-Tests-for-parallel-copy.patch

(i) None of the provided tests seem to test beyond "PARALLEL 2".

(6) v2-0006-Parallel-Copy-For-Binary-Format-Files.patch

(i) In the ParallelCopyFrom() function, "cstate->raw_buf" is pfree()d:

+ /* raw_buf is not used in parallel copy, instead data blocks are used.*/
+ pfree(cstate->raw_buf);

This comment doesn't seem to be entirely true.
At least for text/csv file COPY FROM, cstate->raw_buf is subsequently referenced in the SetRawBufForLoad() function, which is called by CopyReadLineText():

cur_data_blk_ptr = (cstate->raw_buf) ? &pcshared_info->data_blocks[cur_block_pos] : NULL;

So I think cstate->raw_buf should be set to NULL after being pfree()d, and the comment fixed/adjusted.

(ii) This patch adds some macros (involving parallel copy checks) AFTER the comment:

/* End parallel copy Macros */

Regards,
Greg Nancarrow
Fujitsu Australia
Thanks Greg for reviewing the patch. Please find my thoughts on your comments below.

On Wed, Aug 12, 2020 at 9:10 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> I have done some ad-hoc testing of the patch using parallel copies from
> text/csv/binary files and have not yet struck any execution problems,
> other than some option validation and associated error messages on
> boundary cases.
>
> One general question that I have: is there a user benefit (over the normal
> non-parallel COPY) to allowing "COPY ... FROM ... WITH (PARALLEL 1)"?
>

There will be a marginal improvement, as the worker only needs to process the data and need not do the file reading; the file reading will have been done by the main process. The real improvement can be seen from 2 workers onwards.

> (1) v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch
>
> (i) Whilst I can't entirely blame these patches for it (as they are
> following what is already there), I can't help noticing the use of numerous
> macros in src/backend/commands/copy.c which paste in multiple lines of code
> in various places.
> It's getting a little out of hand. Surely the majority of these would be
> best as inline functions instead?
> Perhaps this hasn't been done because too many parameters need to be
> passed - thoughts?
>

I felt they have used macros mainly because this is a tight loop and having macros gives better performance. I have added the macros CLEAR_EOL_LINE, INCREMENTPROCESSED & GETPROCESSED because there will be a slight difference between parallel copy and non-parallel copy for these. In the remaining patches the macros will be extended to include the parallel copy logic. Instead of having checks in the core logic, I thought of keeping them as macros so that the readability is good.

> (2) v2-0002-Framework-for-leader-worker-in-parallel-copy.patch
>
> (i) Minor point: there are some tabbing/spacing issues in this patch (and
> the other patches), affecting alignment, e.g. mixed tabs/spaces and
> misalignment in the PARALLEL_COPY_KEY_xxx definitions.
>

Fixed.

> (ii)
>
> +#define WORKER_CHUNK_COUNT 50
>
> "This value should be mode of RINGSIZE ..."
>
> -> typo: mode (mod? should evenly divide into RINGSIZE?)
>

Fixed, changed it to "divisible by".
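As a side note, this multiple-of relationship could also be enforced mechanically at compile time; a minimal sketch, assuming the constants keep the values quoted in this thread (10000 and 50) and that c.h's StaticAssertDecl is available:

/*
 * Sketch only: fail compilation if the ring size stops being a whole
 * multiple of the per-worker chunk count, since a partially filled
 * chunk wrapping around the ring is not handled.
 */
#define RINGSIZE			(10 * 1000)
#define WORKER_CHUNK_COUNT	50

StaticAssertDecl(RINGSIZE % WORKER_CHUNK_COUNT == 0,
				 "RINGSIZE must be a multiple of WORKER_CHUNK_COUNT");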
> (iii)
>
> + * using pg_atomic_compare_exchange_u32, worker will change the sate to
>
> -> typo: sate (should be "state")
>

Fixed.

> (iv)
>
> + errmsg("parallel option supported only for copy from"),
>
> -> suggest change to: errmsg("parallel option is supported only for COPY FROM"),
>

Fixed.

> (v)
>
> [strtol-based validation code for the PARALLEL option]
>
> I think this validation code needs to be improved, including the error
> messages (e.g. when can a "positive integer" NOT be greater than zero?).
> There is some overlap in the "no digits were found" case between the two
> conditions above, depending, for example, on whether the argument is quoted.
> Also, "improper use of argument to option" sounds a bit odd and vague to me.
> Finally, not range-checking before casting long to int can lead to allowing
> out-of-range int values.
>
> I'd suggest changing the above validation code to do similar validation to
> that for the CREATE TABLE parallel_workers storage parameter (case
> RELOPT_TYPE_INT in reloptions.c). Like that code, wouldn't it be best to
> range-check the integer option value to be within a reasonable range, say
> 1 to 1024, with a corresponding errdetail message if possible?
>

Fixed, changed as suggested.

> (3) v2-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> (i) Patch comment says:
>
> "... Specifying zero as number of workers will disable parallelism."
>
> BUT - the changes to ProcessCopyOptions() specified in
> "v2-0002-Framework-for-leader-worker-in-parallel-copy.patch" do not allow
> zero workers to be specified - you get an error in that case. The patch
> comment should be updated accordingly.
>

Removed "Specifying zero as number of workers will disable parallelism", as the new valid range is 1 to 1024.
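For illustration, a hedged sketch of the kind of range-checked validation this converges on; it leans on parse_int() from guc.h rather than raw strtol(), and the 1 to 1024 bounds follow the review suggestion (whether the patch does it exactly this way is an assumption):

/*
 * Sketch of PARALLEL option validation inside ProcessCopyOptions().
 * parse_int() rejects non-integer strings and values that overflow
 * int, which closes the long-to-int cast hole shown earlier.
 */
int			nworkers;

if (!parse_int(defGetString(defel), &nworkers, 0, NULL))
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("value for option \"%s\" must be an integer",
					defel->defname),
			 parser_errposition(pstate, defel->location)));

/* Range-check, as is done for integer reloptions. */
if (nworkers < 1 || nworkers > 1024)
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("value %d out of bounds for option \"%s\"",
					nworkers, defel->defname),
			 errdetail("Valid values are between %d and %d.", 1, 1024),
			 parser_errposition(pstate, defel->location)));

cstate->nworkers = nworkers;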
> (ii)
>
> #define GETPROCESSED(processed) \
> -return processed;
> +if (!IsParallelCopy()) \
> + return processed; \
> +else \
> + return pg_atomic_read_u64(&cstate->pcdata->pcshared_info->processed);
>
> I think GETPROCESSED would be better named "RETURNPROCESSED".
>

Fixed.

> (iii) The below comment seems out-of-date with the current code - is it
> referring to the loop embedded at the bottom of the current loop that the
> comment is within?
>

Updated; it is referring to the embedded loop at the bottom of the current loop.

> (iv) I may be wrong here, but in the following block of code, isn't there a
> window of opportunity (however small) in which the line_state might be
> updated (LINE_WORKER_PROCESSED) by another worker just AFTER
> pg_atomic_read_u32() returns the current line_state which is put into
> curr_line_state, such that a write_pos update might be missed? And then a
> race condition exists for reading/setting line_size (since line_size gets
> atomically set after line_state is set)?
> If I am wrong in thinking this synchronization might not be correct, maybe
> the comments could be improved here to explain how this code is safe in
> that respect.
>
> [worker loop that reads curr_line_state/line_size and then claims the line
> with pg_atomic_compare_exchange_u32]
>

This is not possible because of pg_atomic_compare_exchange_u32: it will succeed only for one of the workers whose line_state is LINE_LEADER_POPULATED; for the other workers it will fail. This is explained in detail above ParallelCopyLineBoundary.

> (4) v2-0004-Documentation-for-parallel-copy.patch
>
> (i) I think that it is necessary to mention the "max_worker_processes"
> option in the description of the COPY statement PARALLEL option.
>
> For example, something like:
>
> + Perform <command>COPY FROM</command> in parallel using <replaceable
> + class="parameter"> integer</replaceable> background workers. Please
> + note that it is not guaranteed that the number of parallel workers
> + specified in <replaceable class="parameter">integer</replaceable> will
> + be used during execution. It is possible for a copy to run with fewer
> + workers than specified, or even with no workers at all (for example,
> + due to the setting of max_worker_processes). This option is allowed
> + only in <command>COPY FROM</command>.
>

Fixed.

> (5) v2-0005-Tests-for-parallel-copy.patch
>
> (i) None of the provided tests seem to test beyond "PARALLEL 2".
>

I intentionally ran with 1 parallel worker, because when you specify more than 1 parallel worker the order of record insertion can vary, and there may be random failures.

> (6) v2-0006-Parallel-Copy-For-Binary-Format-Files.patch
>
> (i) In the ParallelCopyFrom() function, "cstate->raw_buf" is pfree()d:
>
> + /* raw_buf is not used in parallel copy, instead data blocks are used.*/
> + pfree(cstate->raw_buf);
>

raw_buf is not used in parallel copy; instead, raw_buf will point to shared memory data blocks. This memory was allocated as part of BeginCopyFrom; up until this point we cannot be 100% sure, as the copy can still be performed sequentially (for example, if max_worker_processes is not available). If it switches to sequential mode, raw_buf will be used while performing the copy operation. At this place we can safely free the memory that was allocated.

> This comment doesn't seem to be entirely true.
> At least for text/csv file COPY FROM, cstate->raw_buf is subsequently
> referenced in the SetRawBufForLoad() function, which is called by
> CopyReadLineText():
>
> cur_data_blk_ptr = (cstate->raw_buf) ? &pcshared_info->data_blocks[cur_block_pos] : NULL;
>
> So I think cstate->raw_buf should be set to NULL after being pfree()d, and
> the comment fixed/adjusted.
>
> (ii) This patch adds some macros (involving parallel copy checks) AFTER
> the comment:
>
> /* End parallel copy Macros */
>

Fixed, moved the macros above the comment.

I have attached a new set of patches with the fixes.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v3-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v3-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v3-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v3-0004-Documentation-for-parallel-copy.patch
- v3-0005-Tests-for-parallel-copy.patch
- v3-0006-Parallel-Copy-For-Binary-Format-Files.patch
Hi Vignesh,

Some further comments:

(1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be divisible by
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50

"This value should be divisible by RINGSIZE" is not a correct statement (since obviously 50 is not divisible by 10000).
It should say something like "This value should evenly divide into RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".

(2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch

(i)

+ /*
+ * If the data is present in current block lineInfo. line_size
+ * will be updated. If the data is spread across the blocks either

Somehow a space has been put between "lineInfo." and "line_size".
It should be: "If the data is present in current block lineInfo.line_size will be updated".

(ii)

> This is not possible because of pg_atomic_compare_exchange_u32: it will
> succeed only for one of the workers whose line_state is
> LINE_LEADER_POPULATED; for the other workers it will fail. This is
> explained in detail above ParallelCopyLineBoundary.

Yes, but prior to that call to pg_atomic_compare_exchange_u32(), aren't you separately reading line_state and line_size, so that between those reads it may have transitioned from leader to another worker, such that the read line state ("curr_line_state", being checked in the if block) may not actually match what is now in line_state, and/or the read line_size ("dataSize") doesn't actually correspond to the read line state?

(Sorry, still not 100% convinced that the synchronization and checks are safe in all cases.)

(3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch

> raw_buf is not used in parallel copy; instead, raw_buf will point to
> shared memory data blocks. This memory was allocated as part of
> BeginCopyFrom ... At this place we can safely free the memory that was
> allocated.

So the following code (which checks raw_buf, which still points to memory that has been pfreed) is still valid?

In the SetRawBufForLoad() function, which is called by CopyReadLineText():

cur_data_blk_ptr = (cstate->raw_buf) ? &pcshared_info->data_blocks[cur_block_pos] : NULL;

The above code looks a bit dicey to me. I stepped over that line in the debugger when I debugged an instance of Parallel Copy, so it definitely gets executed.
It makes me wonder what other code could possibly be checking raw_buf and using it in some way, when in fact what it points to has been pfreed.

Are you able to add the following line of code, or will it (somehow) break logic that you are relying on?

pfree(cstate->raw_buf);
cstate->raw_buf = NULL;   <=== I suggest that this line is added

Regards,
Greg Nancarrow
Fujitsu Australia
Thanks Greg for reviewing the patch. Please find my thoughts on your comments below.

On Mon, Aug 17, 2020 at 9:44 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> (1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch
>
> "This value should be divisible by RINGSIZE" is not a correct statement
> (since obviously 50 is not divisible by 10000).
> It should say something like "This value should evenly divide into
> RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".
>

Fixed. Changed it to "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".

> (2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> (i) Somehow a space has been put between "lineInfo." and "line_size".
> It should be: "If the data is present in current block
> lineInfo.line_size will be updated".
>

Fixed, changed it to lineInfo->line_size.

> (ii) Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
> aren't you separately reading line_state and line_size, so that between
> those reads it may have transitioned from leader to another worker, such
> that the read line state ("curr_line_state", being checked in the if
> block) may not actually match what is now in line_state, and/or the read
> line_size ("dataSize") doesn't actually correspond to the read line state?
>
> (Sorry, still not 100% convinced that the synchronization and checks are
> safe in all cases.)
>

I think you are describing the problem that could happen in the following case: when we read curr_line_state, the value was LINE_WORKER_PROCESSED or LINE_WORKER_PROCESSING; then, if the leader is very fast compared to the workers, the leader quickly populates one line and sets the state to LINE_LEADER_POPULATED, so the state changes to LINE_LEADER_POPULATED while we are checking curr_line_state.
I feel this will not be a problem because the leader populates a line and then waits until some ring element is available to populate. In the meantime, the worker has seen that the state is LINE_WORKER_PROCESSED or LINE_WORKER_PROCESSING (the previous state that it read), so it identifies that this chunk was processed by some other worker, moves on, and tries to get the next available chunk and insert those records. It will keep continuing until it gets the next chunk to process. Eventually one of the workers will get this chunk and process it.

> (3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch
>
> So the following code (which checks raw_buf, which still points to memory
> that has been pfreed) is still valid?
>
> In the SetRawBufForLoad() function, which is called by CopyReadLineText():
>
> cur_data_blk_ptr = (cstate->raw_buf) ? &pcshared_info->data_blocks[cur_block_pos] : NULL;
>
> Are you able to add the following line of code, or will it (somehow)
> break logic that you are relying on?
>
> pfree(cstate->raw_buf);
> cstate->raw_buf = NULL;   <=== I suggest that this line is added
>

You are right; I have debugged and verified that it sets it to an invalid block, which is not expected. There is a chance this would have caused some corruption on some machines. The suggested fix is required, and I have made it. I have moved this change to 0003-Allow-copy-from-command-to-process-data-from-file.patch, as 0006-Parallel-Copy-For-Binary-Format-Files is only for binary-format parallel copy and this change is common to parallel copy.

I have attached a new set of patches with the fixes.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v4-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v4-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v4-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v4-0004-Documentation-for-parallel-copy.patch
- v4-0005-Tests-for-parallel-copy.patch
- v4-0006-Parallel-Copy-For-Binary-Format-Files.patch
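To make the line-claiming argument in the last two mails easier to follow, here is a minimal sketch of the protocol as described in this thread (field and constant names follow the quoted patch excerpts; the surrounding retry loop and error handling are elided):

/*
 * Sketch: how exactly one worker wins a populated line.  The CAS only
 * succeeds if line_state still equals LINE_LEADER_POPULATED at the
 * instant of the exchange, so a stale earlier read of the state cannot
 * cause two workers to process the same line; a loser simply sees the
 * CAS fail and moves on to the next ring position.
 */
uint32		expected = LINE_LEADER_POPULATED;

if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
								   &expected,
								   LINE_WORKER_PROCESSING))
{
	/* This worker owns the line; reading line_size is now safe. */
	uint32		dataSize = pg_atomic_read_u32(&lineInfo->line_size);

	/* ... copy out and insert the line ... */

	pg_atomic_write_u32(&lineInfo->line_state, LINE_WORKER_PROCESSED);
}
else
{
	/* Another worker claimed the line; skip ahead. */
}

In this reading, the earlier relaxed reads of line_state and line_size are only an optimization to skip obviously taken positions; correctness rests on the exchange itself.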
> I have attached a new set of patches with the fixes.
> Thoughts?

Hi Vignesh,

I don't really have any further comments on the code, but would like to share some results of some Parallel Copy performance tests I ran (attached).

The tests loaded a 5GB CSV data file into a 100 column table (of different data types). The following were varied as part of the test:
- Number of workers (1 – 10)
- No indexes / 4 indexes
- Default settings / increased resources (shared_buffers, work_mem, etc.)

(I did not do any partition-related tests, as I believe those types of tests were previously performed.)

I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.

I observed the following trends:
- For the data file size used, Parallel Copy achieved best performance using about 9 – 10 workers. Larger data files may benefit from using more workers. However, I couldn't really see any better performance, for example, from using 16 workers on a 10GB CSV data file compared to using 8 workers. Results may also vary depending on machine characteristics.
- Parallel Copy with 1 worker ran slower than normal Copy in a couple of cases (I did question whether allowing 1 worker was useful in my patch review).
- Typical load-time improvement (load factor) for Parallel Copy was between 2x and 3x. Better load factors can be obtained by using larger data files and/or more indexes.
- Increasing Postgres resources made little or no difference to Parallel Copy performance when the target table had no indexes. Increasing Postgres resources improved Parallel Copy performance when the target table had indexes.

Regards,
Greg Nancarrow
Fujitsu Australia
Attachment
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> I observed the following trends:
> - For the data file size used, Parallel Copy achieved best performance
> using about 9 – 10 workers.
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question whether allowing 1 worker was useful in my
> patch review).

I think the reason is that for the 1 worker case there is not much parallelization, as the leader doesn't perform the actual load work. Vignesh, can you please once see if the results are reproducible at your end? If so, we can compare the perf profiles to see why in some cases we get improvement and in other cases not. Based on that we can decide whether to allow the 1 worker case or not.

> - Typical load-time improvement (load factor) for Parallel Copy was
> between 2x and 3x. Better load factors can be obtained by using larger
> data files and/or more indexes.

Nice improvement, and I think you are right that with larger load data we will get even better improvement.

--
With Regards,
Amit Kapila.
On Thu, Aug 27, 2020 at 8:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think the reason is that for the 1 worker case there is not much
> parallelization, as the leader doesn't perform the actual load work.
> Vignesh, can you please once see if the results are reproducible at
> your end? If so, we can compare the perf profiles to see why in some
> cases we get improvement and in other cases not. Based on that we can
> decide whether to allow the 1 worker case or not.
>

I will spend some time on this and update.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 27, 2020 at 4:56 PM vignesh C <vignesh21@gmail.com> wrote:
>
> I will spend some time on this and update.
>

Thanks.

--
With Regards,
Amit Kapila.
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question if allowing 1 worker was useful in my patch
> review).
Thanks Greg for your review & testing.
I executed various tests with 1GB, 2GB & 5GB csv files with 100 columns, without parallel mode and with 1 parallel worker. The test results are given below:
Test | Without parallel mode | With 1 Parallel worker |
1GB csv file 100 columns (100 bytes data in each column) | 62 seconds | 47 seconds (1.32X) |
1GB csv file 100 columns (1000 bytes data in each column) | 89 seconds | 78 seconds (1.14X) |
2GB csv file 100 columns (1 byte data in each column) | 277 seconds | 256 seconds (1.08X) |
5GB csv file 100 columns (100 byte data in each column) | 515 seconds | 445 seconds (1.16X) |
I ran the tests multiple times and noticed similar execution times in all the runs for the above tests.
In the above results there is a slight improvement with 1 worker. In my tests I did not observe any degradation for copy with 1 worker compared to non-parallel copy. Can you share with me the script you used to generate the data and the DDL of the table, so that it will help me check the scenario where you faced the problem?
Hi Vignesh,

> Can you share with me the script you used to generate the data and the
> DDL of the table, so that it will help me check the scenario where you
> faced the problem?

Unfortunately I can't directly share it (it's considered company IP), though having said that, it's only doing something relatively simple and unremarkable, so I'd expect it to be much like what you are currently doing. I can describe it in general.

The table being used contains 100 columns (as I pointed out earlier), with the first column of "bigserial" type, and the others of different types like "character varying(255)", "numeric", "date" and "time without timezone". There are about 60 of the "character varying(255)" overall, with the other types interspersed.

When testing with indexes, 4 b-tree indexes were used, each including the first column and then distinctly 9 other columns.

A CSV record (row) template file was created with test data (corresponding to the table), and that was simply copied and appended over and over, with a record prefix, in order to create the test data file.
The following shell script basically does it (but very slowly). I was using a small C program to do similar, a lot faster.
In my case, N=2550000 produced about a 5GB CSV file.

file_out=data.csv; for i in {1..N}; do echo -n "$i," >> $file_out; cat sample_record.csv >> $file_out; done

One other thing I should mention is that between each test run, I cleared the OS page cache, as described here: https://linuxhint.com/clear_cache_linux/
That way, each COPY FROM is not taking advantage of any OS-cached data from a previous COPY FROM.

If your data is somehow significantly different and you want to (and can) share your script, then I can try it in my environment.

Regards,
Greg
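For anyone reproducing this setup, here is a sketch of the kind of small C generator described above (the original program was not shared, so the file names, the single-record template assumption, and the row count are all assumptions mirroring the shell loop):

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	/* Paths and row count are assumptions for illustration. */
	FILE	   *in = fopen("sample_record.csv", "rb");
	FILE	   *out = fopen("data.csv", "w");
	char	   *template;
	long		len;

	if (in == NULL || out == NULL)
		return 1;

	/* Slurp the one-record template, including its trailing newline. */
	fseek(in, 0, SEEK_END);
	len = ftell(in);
	rewind(in);
	template = malloc(len + 1);
	if (template == NULL || fread(template, 1, len, in) != (size_t) len)
		return 1;
	template[len] = '\0';

	/* Emit "<row number>,<template>" N times, like the shell loop above. */
	for (long i = 1; i <= 2550000; i++)
		fprintf(out, "%ld,%s", i, template);

	fclose(in);
	fclose(out);
	free(template);
	return 0;
}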
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> The following shell script basically does it (but very slowly). I was
> using a small C program to do similar, a lot faster.
> In my case, N=2550000 produced about a 5GB CSV file.
>
> One other thing I should mention is that between each test run, I
> cleared the OS page cache. That way, each COPY FROM is not taking
> advantage of any OS-cached data from a previous COPY FROM.
>

I will try with a similar test and check if I can reproduce.

> If your data is somehow significantly different and you want to (and
> can) share your script, then I can try it in my environment.
>

I have attached the scripts that I used for the test results I mentioned in my previous mail. create.sql has the table that I used; insert_data_gen.txt has the insert data generation scripts. I varied the count in insert_data_gen to generate csv files of 1GB, 2GB & 5GB, and varied the data to generate 1-char, 10-char & 100-char values for each column for the various tests. You can rename insert_data_gen.txt to insert_data_gen.sh and generate the csv file.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Sep 2, 2020 at 3:40 PM vignesh C <vignesh21@gmail.com> wrote:
>
> I have attached the scripts that I used for the test results I
> mentioned in my previous mail. create.sql has the table that I used;
> insert_data_gen.txt has the insert data generation scripts.
>

Hi Vignesh,

I used your script and table definition, multiplying the number of records to produce a 5GB and a 9.5GB CSV file. I got the following results:

(1) Postgres default settings, 5GB CSV (530000 rows):

Copy Type            Duration (s)  Load factor
===============================================
Normal Copy          132.197       -

Parallel Copy
(#workers)
1                    98.428        1.34
2                    52.753        2.51
3                    37.630        3.51
4                    33.554        3.94
5                    33.636        3.93
6                    33.821        3.91
7                    34.270        3.86
8                    34.465        3.84
9                    34.315        3.85
10                   33.543        3.94

(2) Postgres increased resources, 5GB CSV (530000 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size = 2GB

Copy Type            Duration (s)  Load factor
===============================================
Normal Copy          131.835       -

Parallel Copy
(#workers)
1                    98.301        1.34
2                    53.261        2.48
3                    37.868        3.48
4                    34.224        3.85
5                    33.831        3.90
6                    34.229        3.85
7                    34.512        3.82
8                    34.303        3.84
9                    34.690        3.80
10                   34.479        3.82

(3) Postgres default settings, 9.5GB CSV (1000000 rows):

Copy Type            Duration (s)  Load factor
===============================================
Normal Copy          248.503       -

Parallel Copy
(#workers)
1                    185.724       1.34
2                    99.832        2.49
3                    70.560        3.52
4                    63.328        3.92
5                    63.182        3.93
6                    64.108        3.88
7                    64.131        3.87
8                    64.350        3.86
9                    64.293        3.87
10                   63.818        3.89

(4) Postgres increased resources, 9.5GB CSV (1000000 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size = 2GB

Copy Type            Duration (s)  Load factor
===============================================
Normal Copy          248.647       -

Parallel Copy
(#workers)
1                    182.236       1.36
2                    92.814        2.68
3                    67.347        3.69
4                    63.839        3.89
5                    62.672        3.97
6                    63.873        3.89
7                    64.930        3.83
8                    63.885        3.89
9                    62.397        3.98
10                   64.477        3.86

So, as you found, with this particular table definition and data, 1 parallel worker always performs better than normal copy.
The different result obtained for this particular case seems to be caused by the following factors:
- different table definition (I used a variety of column types)
- amount of data per row (I used less data per row, so more rows per same-size data file)

As I previously observed, if the target table has no indexes, increasing resources beyond the default settings makes little difference to the performance.

Regards,
Greg Nancarrow
Fujitsu Australia
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Hi Vignesh,
>
> > Can you share with me the script you used to generate the data & the DDL of the table, so that it will help me check the scenario where you faced the problem.
>
> Unfortunately I can't directly share it (considered company IP), though having said that it's only doing something that is relatively simple and unremarkable, so I'd expect it to be much like what you are currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier), with the first column of "bigserial" type, and the others of different types like "character varying(255)", "numeric", "date" and "time without timezone". There are about 60 of the "character varying(255)" columns overall, with the other types interspersed.
>

Thanks Greg for executing & sharing the results. I tried a similar test case to the one you suggested, but I was not able to reproduce the degradation scenario. If possible, can you run perf for the scenario with 1 worker & for non-parallel mode & share the perf results? We will then be able to find out which of the functions is consuming more time by comparing the perf reports.

Steps for running perf:
1) Get the postgres pid
2) perf record -a -g -p <above pid>
3) Run the copy command
4) Execute "perf report -g" once the copy finishes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 11, 2020 at 3:49 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> I couldn't use the original machine from which I obtained the previous results, but ended up using a 4-core CentOS7 VM, which showed a similar pattern in the performance results for this test case.
> I obtained the following results from loading a 2GB CSV file (1000000 rows, 4 indexes):
>
> Copy Type        Duration (s)   Load factor
> ===============================================
> Normal Copy      190.891        -
>
> Parallel Copy
> (#workers)
> 1                210.947        0.90
>

Hi Greg,

I tried to recreate the test case (attached) and I didn't find much difference with the custom postgresql.conf file.

Test case: 250000 tuples, 4 indexes (composite indexes with 10 columns), 3.7GB, 100 columns (as suggested by you, with all the varchar(255) columns having 255 characters), exec time in sec.

With custom postgresql.conf[1], removed and recreated the data directory after every run (I couldn't perform the OS page cache flush for some reason, so I chose to recreate the data directory instead, for testing purposes):
HEAD: 129.547, 128.624, 128.890
Patch: 0 workers - 130.213, 131.298, 130.555
Patch: 1 worker - 127.757, 125.560, 128.275

With default postgresql.conf, removed and recreated the data directory after every run:
HEAD: 138.276, 150.472, 153.304
Patch: 0 workers - 162.468, 149.423, 159.137
Patch: 1 worker - 136.055, 144.250, 137.916

Few questions:
1. Was the run performed with the default postgresql.conf file? If not, what are the changed configurations?
2. Are the readings for normal copy (190.891 sec, mentioned by you above) taken on HEAD or with the patch, 0 workers? How much is the runtime with your test case on HEAD (without patch) and with 0 workers (with patch)?
3. Was the run performed on a release build?
4. Were the readings taken over multiple runs (say 3 or 4 times)?

[1] - Postgres configuration used for the above testing:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi Bharath,

On Tue, Sep 15, 2020 at 11:49 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Few questions:
> 1. Was the run performed with the default postgresql.conf file? If not, what are the changed configurations?

Yes, just default settings.

> 2. Are the readings for normal copy (190.891 sec, mentioned by you above) taken on HEAD or with the patch, 0 workers?

With patch.

> How much is the runtime with your test case on HEAD (without patch) and with 0 workers (with patch)?

TBH, I didn't test that. Looking at the changes, I wouldn't expect a degradation of performance for normal copy (you have tested that, right?).

> 3. Was the run performed on a release build?

For generating the perf data I sent (normal copy vs parallel copy with 1 worker), I used a debug build (-g -O0), as that is needed for generating all the relevant perf data for Postgres code. Previously I ran with a release build (-O2).

> 4. Were the readings taken over multiple runs (say 3 or 4 times)?

The readings I sent were from just one run (not averaged), but I did run the tests several times to verify the readings were representative of the pattern I was seeing.

Fortunately I have been given permission to share the exact table definition and data I used, so you can check the behaviour and timings on your own test machine. Please see the attachment.

You can create the table using the table.sql and index_4.sql definitions in the "sql" directory. The data.csv file (to be loaded by COPY) can be created with the included "dupdata" tool in the "input" directory, which you need to build, then run, specifying a suitable number of records and the path of the template record (see README). Obviously, the larger the number of records, the larger the file. The table can then be loaded using COPY with "format csv" (and "parallel N" if testing parallel copy).

Regards,
Greg Nancarrow
Fujitsu Australia
Attachment
Hi Vignesh,

I've spent some time today looking at your new set of patches, and I have some thoughts and queries which I would like to put here:

Why are these not part of the shared cstate structure?

SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);

I think in the refactoring patch we could replace all the cstate variables that would be shared between the leader and workers with a common structure which would be used even for a serial copy. Thoughts?

--

Have you tested your patch when encoding conversion is needed? If so, could you please point out the email that has the test results.

--

Apart from the above, I've noticed some cosmetic errors which I am sharing here:

+#define IsParallelCopy() (cstate->is_parallel)
+#define IsLeader() (cstate->pcdata->is_leader)

This doesn't look to be properly aligned.

--

+ shared_info_ptr = (ParallelCopyShmInfo *) shm_toc_allocate(pcxt->toc, sizeof(ParallelCopyShmInfo));
+ PopulateParallelCopyShmInfo(shared_info_ptr, full_transaction_id);

..

+ /* Store shared build state, for which we reserved space. */
+ shared_cstate = (SerializedParallelCopyState *)shm_toc_allocate(pcxt->toc, est_cstateshared);

In the first case, while typecasting you've added a space between the typename and the function, but that is missing in the second case. I think it would be good if you could make it consistent.

Same comment applies here as well:

+ pg_atomic_uint32 line_state; /* line state */
+ uint64 cur_lineno; /* line number for error messages */
+}ParallelCopyLineBoundary;

...

+ CommandId mycid; /* command id */
+ ParallelCopyLineBoundaries line_boundaries; /* line array */
+} ParallelCopyShmInfo;

There is no space between the closing brace and the structure name in the first case, but there is in the second one. So, again, this doesn't look consistent.

I could also find this type of inconsistency in comments. See below:

+/* It can hold upto 10000 record information for worker to process. RINGSIZE
+ * should be a multiple of WORKER_CHUNK_COUNT, as wrap around cases is currently
+ * not handled while selecting the WORKER_CHUNK_COUNT by the worker. */
+#define RINGSIZE (10 * 1000)

...

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. Read RINGSIZE comments before
+ * changing this value.
+ */
+#define WORKER_CHUNK_COUNT 50

You may see these kinds of errors at other places as well if you scan through your patch.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Aug 19, 2020 at 11:51 AM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks Greg for reviewing the patch. Please find my thoughts for your comments.
>
> On Mon, Aug 17, 2020 at 9:44 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > Some further comments:
> >
> > (1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch
> >
> > +/*
> > + * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
> > + * block to process to avoid lock contention. This value should be divisible by
> > + * RINGSIZE, as wrap around cases is currently not handled while selecting the
> > + * WORKER_CHUNK_COUNT by the worker.
> > + */
> > +#define WORKER_CHUNK_COUNT 50
> >
> > "This value should be divisible by RINGSIZE" is not a correct
> > statement (since obviously 50 is not divisible by 10000).
> > It should say something like "This value should evenly divide into
> > RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".
> >
>
> Fixed. Changed it to "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".
>
> > (2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch
> >
> > (i)
> >
> > + /*
> > + * If the data is present in current block lineInfo. line_size
> > + * will be updated. If the data is spread across the blocks either
> >
> > Somehow a space has been put between "lineinfo." and "line_size".
> > It should be: "If the data is present in current block
> > lineInfo.line_size will be updated"
>
> Fixed, changed it to lineinfo->line_size.
>
> > (ii)
> >
> > > This is not possible because of pg_atomic_compare_exchange_u32, this
> > > will succeed only for one of the workers whose line_state is
> > > LINE_LEADER_POPULATED, for other workers it will fail. This is
> > > explained in detail above ParallelCopyLineBoundary.
> >
> > Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
> > aren't you separately reading line_state and line_size, so that
> > between those reads, it may have transitioned from leader to another
> > worker, such that the read line state ("cur_line_state", being checked
> > in the if block) may not actually match what is now in the line_state
> > and/or the read line_size ("dataSize") doesn't actually correspond to
> > the read line state?
> >
> > (sorry, still not 100% convinced that the synchronization and checks
> > are safe in all cases)
> >
>
> I think you are describing the problem that could happen in the
> following case: when we read curr_line_state, the value was
> LINE_WORKER_PROCESSED or LINE_WORKER_PROCESSING. Then, in some cases,
> if the leader is very fast compared to the workers, the leader quickly
> populates one line and sets the state to LINE_LEADER_POPULATED, so the
> state changes to LINE_LEADER_POPULATED while we are checking
> curr_line_state. I feel this will not be a problem because the leader
> will populate & wait till some RING element is available to populate.
> In the meantime the worker has seen that the state is
> LINE_WORKER_PROCESSED or LINE_WORKER_PROCESSING (the previous state
> that it read); the worker has identified that this chunk was processed
> by some other worker, so it will move on and try to get the next
> available chunk & insert those records. It will keep continuing till it
> gets the next chunk to process. Eventually one of the workers will get
> this chunk and process it.
>
> > (3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch
> >
> > > raw_buf is not used in parallel copy, instead raw_buf will be pointing
> > > to shared memory data blocks. This memory was allocated as part of
> > > BeginCopyFrom, uptil this point we cannot be 100% sure as copy can be
> > > performed sequentially like in case max_worker_processes is not
> > > available, if it switches to sequential mode raw_buf will be used
> > > while performing copy operation. At this place we can safely free this
> > > memory that was allocated
> >
> > So the following code (which checks raw_buf, which still points to
> > memory that has been pfreed) is still valid?
> >
> > In the SetRawBufForLoad() function, which is called by CopyReadLineText():
> >
> > cur_data_blk_ptr = (cstate->raw_buf) ?
> > &pcshared_info->data_blocks[cur_block_pos] : NULL;
> >
> > The above code looks a bit dicey to me. I stepped over that line in
> > the debugger when I debugged an instance of Parallel Copy, so it
> > definitely gets executed.
> > It makes me wonder what other code could possibly be checking raw_buf
> > and using it in some way, when in fact what it points to has been
> > pfreed.
> >
> > Are you able to add the following line of code, or will it (somehow)
> > break logic that you are relying on?
> >
> > pfree(cstate->raw_buf);
> > cstate->raw_buf = NULL; <=== I suggest that this line is added
> >
>
> You are right, I have debugged & verified that it sets it to an invalid
> block, which is not expected. There are chances this would have caused
> some corruption on some machines. The suggested fix is required, and I
> have made it. I have moved this change to
> 0003-Allow-copy-from-command-to-process-data-from-file.patch, as
> 0006-Parallel-Copy-For-Binary-Format-Files is only for binary format
> parallel copy & that change is a common change for parallel copy.
>
> I have attached a new set of patches with the fixes.
> Thoughts?
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
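[Editor's note: to make the line_state handoff discussed in point (2)(ii) above concrete, here is a minimal sketch in C. It reuses the patch's struct and state names, but TryClaimLine and the surrounding logic are simplified illustrations, not actual patch code.]

#include "port/atomics.h"

/*
 * Sketch: a worker tries to claim a ring entry that the leader has
 * marked LINE_LEADER_POPULATED. Only one CAS can succeed, so exactly
 * one worker transitions the entry to LINE_WORKER_PROCESSING; losers
 * see the CAS fail (expected is overwritten with the current state)
 * and simply move on to the next ring position.
 */
static bool
TryClaimLine(ParallelCopyLineBoundary *lineInfo)
{
	uint32		expected = LINE_LEADER_POPULATED;

	return pg_atomic_compare_exchange_u32(&lineInfo->line_state,
										  &expected,
										  LINE_WORKER_PROCESSING);
}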
On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Fortunately I have been given permission to share the exact table
> definition and data I used, so you can check the behaviour and timings
> on your own test machine.
>

Thanks Greg for the script. I ran your test case and I didn't observe any increase in exec time with 1 worker; indeed, we gained a few seconds going from 0 to 1 worker, as expected.

Execution time is in seconds. Each test case is executed 3 times on a release build. Each time the data directory is recreated.

Case 1: 1000000 rows, 2GB
With Patch, default configuration, 0 workers: 88.933, 92.261, 88.423
With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678

With Patch, custom configuration, 0 workers: 76.191, 78.160, 78.822
With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573

Case 2: 2550000 rows, 5GB
With Patch, default configuration, 0 workers: 246.031, 188.323, 216.683
With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307

With Patch, custom configuration, 0 workers: 197.234, 195.866, 196.049
With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090

[1] - The custom configuration is set up to ensure that no other processes influence the results. The postgresql.conf used:
shared_buffers = 40GB
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Thanks Ashutosh for your comments.

On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Vignesh,
>
> I've spent some time today looking at your new set of patches, and I have
> some thoughts and queries which I would like to put here:
>
> Why are these not part of the shared cstate structure?
>
> SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
> SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
>

I have used shared_cstate mainly to share the integer & bool data types from the leader to the worker process. The above variables are of char* data type, so I will not be able to share them the way I could for the integer types. So I preferred to send these as separate keys to the worker. Thoughts?

> I think in the refactoring patch we could replace all the cstate
> variables that would be shared between the leader and workers with a
> common structure which would be used even for a serial copy. Thoughts?
>

Currently we are using shared_cstate only to share the integer & bool data types from the leader to the workers. Once a worker retrieves the shared data for the integer & bool data types, it copies them to cstate. I preferred this way because only for integer & bool do we retrieve into shared_cstate & copy to cstate; for the rest of the members we are anyway directly copying back to cstate. Thoughts?

> Have you tested your patch when encoding conversion is needed? If so,
> could you please point out the email that has the test results.
>

We have not yet done encoding testing; we will do it and post the results separately in the coming days.

> Apart from the above, I've noticed some cosmetic errors which I am sharing here:
>
> +#define IsParallelCopy() (cstate->is_parallel)
> +#define IsLeader() (cstate->pcdata->is_leader)
>
> This doesn't look to be properly aligned.
>

Fixed.

> + shared_info_ptr = (ParallelCopyShmInfo *) shm_toc_allocate(pcxt->toc, sizeof(ParallelCopyShmInfo));
> + PopulateParallelCopyShmInfo(shared_info_ptr, full_transaction_id);
>
> ..
>
> + /* Store shared build state, for which we reserved space. */
> + shared_cstate = (SerializedParallelCopyState *)shm_toc_allocate(pcxt->toc, est_cstateshared);
>
> In the first case, while typecasting you've added a space between the
> typename and the function, but that is missing in the second case. I
> think it would be good if you could make it consistent.
>

Fixed.

> Same comment applies here as well:
>
> + pg_atomic_uint32 line_state; /* line state */
> + uint64 cur_lineno; /* line number for error messages */
> +}ParallelCopyLineBoundary;
>
> ...
>
> + CommandId mycid; /* command id */
> + ParallelCopyLineBoundaries line_boundaries; /* line array */
> +} ParallelCopyShmInfo;
>
> There is no space between the closing brace and the structure name in
> the first case, but there is in the second one. So, again, this doesn't
> look consistent.
>

Fixed.

> I could also find this type of inconsistency in comments. See below:
>
> +/* It can hold upto 10000 record information for worker to process. RINGSIZE
> + * should be a multiple of WORKER_CHUNK_COUNT, as wrap around cases is currently
> + * not handled while selecting the WORKER_CHUNK_COUNT by the worker. */
> +#define RINGSIZE (10 * 1000)
>
> ...
>
> +/*
> + * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
> + * block to process to avoid lock contention. Read RINGSIZE comments before
> + * changing this value.
> + */
> +#define WORKER_CHUNK_COUNT 50
>
> You may see these kinds of errors at other places as well if you scan
> through your patch.

Fixed.

Please find attached the v5 patch set, which has the fixes for the same.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v5-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v5-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v5-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v5-0004-Documentation-for-parallel-copy.patch
- v5-0005-Tests-for-parallel-copy.patch
- v5-0006-Parallel-Copy-For-Binary-Format-Files.patch
On Thu, Sep 17, 2020 at 11:06 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> >
> > Fortunately I have been given permission to share the exact table
> > definition and data I used, so you can check the behaviour and timings
> > on your own test machine.
> >
>
> Thanks Greg for the script. I ran your test case and I didn't observe
> any increase in exec time with 1 worker, indeed, we have benefitted a
> few seconds from 0 to 1 worker as expected.
>
> Execution time is in seconds. Each test case is executed 3 times on
> release build. Each time the data directory is recreated.
>
> Case 1: 1000000 rows, 2GB
> With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423
> With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678
>
> With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822
> With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573
>
> Case 2: 2550000 rows, 5GB
> With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683
> With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307
>
> With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049
> With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090
>
Hi Greg,
If you still observe the issue in your testing environment, I'm attaching a testing patch (that applies on top of the latest parallel copy patch set, i.e. v5 1 to 6) to capture various timings, such as total copy time in the leader and workers, index and table insertion time, and leader and worker waiting time. These logs are shown in the server log file.
Few things to follow before testing:
1. Is the table being dropped/truncated after the test with 0 workers and before running with 1 worker? If not, then the index insertion time would increase [1] (for me it is increasing by 10 sec). This is expected because the 1st time the index will be built in a bottom-up manner (from leaves to root), but the 2nd time it has to search and insert at the proper leaf and inner B+Tree nodes.
2. If possible, can you also run with the custom postgresql.conf settings [2] along with the defaults? Just to ensure that other bg processes such as checkpointer, autovacuum, bgwriter etc. don't affect our test case. For instance, with the default postgresql.conf file, it looks like checkpointing [3] is happening frequently; could you please let us know if that happens at your end?
3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue.
4. I ran the tests in a performance test system where no other user processes(except system processes) are running. Is it possible for you to do the same?
Please capture and share the timing logs with us.
Here's a snapshot of how the added timings show up in the logs (I captured this with your test case, case 1: 1000000 rows, 2GB, custom postgresql.conf file settings [2]).
with 0 workers:
2020-09-22 10:49:27.508 BST [163910] LOG: totaltableinsertiontime = 24072.034 ms
2020-09-22 10:49:27.508 BST [163910] LOG: totalindexinsertiontime = 60.682 ms
2020-09-22 10:49:27.508 BST [163910] LOG: totalcopytime = 59664.594 ms
with 1 worker:
2020-09-22 10:53:58.409 BST [163947] LOG: totalcopyworkerwaitingtime = 59.815 ms
2020-09-22 10:53:58.409 BST [163947] LOG: totaltableinsertiontime = 23585.881 ms
2020-09-22 10:53:58.409 BST [163947] LOG: totalindexinsertiontime = 30.946 ms
2020-09-22 10:53:58.409 BST [163947] LOG: totalcopytimeworker = 47047.956 ms
2020-09-22 10:53:58.429 BST [163946] LOG: totalcopyleaderwaitingtime = 26746.744 ms
2020-09-22 10:53:58.429 BST [163946] LOG: totalcopytime = 47150.002 ms
[1]
0 worker:
LOG: totaltableinsertiontime = 25491.881 ms
LOG: totalindexinsertiontime = 14136.104 ms
LOG: totalcopytime = 75606.858 ms
table and indexes are not dropped
1 worker:
LOG: totalcopyworkerwaitingtime = 64.582 ms
LOG: totaltableinsertiontime = 21360.875 ms
LOG: totalindexinsertiontime = 24843.570 ms
LOG: totalcopytimeworker = 69837.162 ms
LOG: totalcopyleaderwaitingtime = 49548.441 ms
LOG: totalcopytime = 69997.778 ms
[2]
custom postgresql.conf configuration:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
[3]
LOG: checkpoints are occurring too frequently (14 seconds apart)
HINT: Consider increasing the configuration parameter "max_wal_size".
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi Bharath,

> Few things to follow before testing:
> 1. Is the table being dropped/truncated after the test with 0 workers and before running with 1 worker? If not, then the index insertion time would increase [1] (for me it is increasing by 10 sec). This is expected because the 1st time the index will be built in a bottom-up manner (from leaves to root), but the 2nd time it has to search and insert at the proper leaf and inner B+Tree nodes.

Yes, it's being truncated before running each and every COPY.

> 2. If possible, can you also run with the custom postgresql.conf settings [2] along with the defaults? Just to ensure that other bg processes such as checkpointer, autovacuum, bgwriter etc. don't affect our test case. For instance, with the default postgresql.conf file, it looks like checkpointing [3] is happening frequently; could you please let us know if that happens at your end?

Yes, I have run with both the default and your custom settings. With the default settings, I can confirm that checkpointing is happening frequently with the tests I've run here.

> 3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue.

Yes, I have run it 4 times. There seems to be a performance hit (whether normal copy or parallel-1 copy) on the first COPY run on a freshly created database. After that, results are consistent.

> 4. I ran the tests in a performance test system where no other user processes (except system processes) are running. Is it possible for you to do the same?
>
> Please capture and share the timing logs with us.

Yes, I have ensured the system is as idle as possible prior to testing.

I have attached the test results obtained after building with your Parallel Copy patch and testing patch applied (HEAD at 733fa9aa51c526582f100aa0d375e0eb9a6bce8b). The test results show that Parallel COPY with 1 worker performs better than normal COPY in the test scenarios run. There is a performance hit (regardless of COPY type) on the very first COPY run on a freshly-created database.

I ran the test case 4 times, and also in reverse order, with truncate run before each COPY (outputs and logs named xxxx_0_1 run normal COPY then parallel COPY, and those named xxxx_1_0 run parallel COPY then normal COPY). Please refer to the attached results.

Regards,
Greg
Attachment
Thanks Greg for the testing.
On Thu, Sep 24, 2020 at 8:27 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> > 3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue.
>
> Yes, have run 4 times. Seems to be a performance hit (whether normal
> copy or parallel-1 copy) on the first COPY run on a freshly created
> database. After that, results are consistent.
>
From the logs, I see that it is happening only with the default postgresql.conf, and that there's inconsistency in the table insertion times, especially from the 1st run to the 2nd. Also, the table insertion time varies more. This is expected with the default postgresql.conf because of interference from the background processes. That's the reason we usually run with a custom configuration, to measure the performance gain correctly.
br_default_0_1.log:
2020-09-23 22:32:36.944 JST [112616] LOG: totaltableinsertiontime = 155068.244 ms
2020-09-23 22:33:57.615 JST [11426] LOG: totaltableinsertiontime = 42096.275 ms
2020-09-23 22:37:39.192 JST [43097] LOG: totaltableinsertiontime = 29135.262 ms
2020-09-23 22:38:56.389 JST [54205] LOG: totaltableinsertiontime = 38953.912 ms
2020-09-23 22:40:27.573 JST [66485] LOG: totaltableinsertiontime = 27895.326 ms
2020-09-23 22:41:34.948 JST [77523] LOG: totaltableinsertiontime = 28929.642 ms
2020-09-23 22:43:18.938 JST [89857] LOG: totaltableinsertiontime = 30625.015 ms
2020-09-23 22:44:21.938 JST [101372] LOG: totaltableinsertiontime = 24624.045 ms
br_default_1_0.log:
2020-09-24 11:12:14.989 JST [56146] LOG: totaltableinsertiontime = 192068.350 ms
2020-09-24 11:13:38.228 JST [88455] LOG: totaltableinsertiontime = 30999.942 ms
2020-09-24 11:15:50.381 JST [108935] LOG: totaltableinsertiontime = 31673.204 ms
2020-09-24 11:17:14.260 JST [118541] LOG: totaltableinsertiontime = 31367.027 ms
2020-09-24 11:20:18.975 JST [17270] LOG: totaltableinsertiontime = 26858.924 ms
2020-09-24 11:22:17.822 JST [26852] LOG: totaltableinsertiontime = 66531.442 ms
2020-09-24 11:24:09.221 JST [47971] LOG: totaltableinsertiontime = 38943.384 ms
2020-09-24 11:25:30.955 JST [58849] LOG: totaltableinsertiontime = 28286.634 ms
br_custom_0_1.log:
2020-09-24 10:29:44.956 JST [110477] LOG: totaltableinsertiontime = 20207.928 ms
2020-09-24 10:30:49.570 JST [120568] LOG: totaltableinsertiontime = 23360.006 ms
2020-09-24 10:32:31.659 JST [2753] LOG: totaltableinsertiontime = 19837.588 ms
2020-09-24 10:35:49.245 JST [31118] LOG: totaltableinsertiontime = 21759.253 ms
2020-09-24 10:36:54.834 JST [41763] LOG: totaltableinsertiontime = 23547.323 ms
2020-09-24 10:38:53.507 JST [56779] LOG: totaltableinsertiontime = 21543.984 ms
2020-09-24 10:39:58.713 JST [67489] LOG: totaltableinsertiontime = 25254.563 ms
br_custom_1_0.log:
2020-09-24 10:49:03.242 JST [15308] LOG: totaltableinsertiontime = 16541.201 ms
2020-09-24 10:50:11.848 JST [23324] LOG: totaltableinsertiontime = 15076.577 ms
2020-09-24 10:51:24.497 JST [35394] LOG: totaltableinsertiontime = 16400.777 ms
2020-09-24 10:52:32.354 JST [42953] LOG: totaltableinsertiontime = 15591.051 ms
2020-09-24 10:54:30.327 JST [61136] LOG: totaltableinsertiontime = 16700.954 ms
2020-09-24 10:55:38.377 JST [68719] LOG: totaltableinsertiontime = 15435.150 ms
2020-09-24 10:57:08.927 JST [83335] LOG: totaltableinsertiontime = 17133.251 ms
2020-09-24 10:58:17.420 JST [90905] LOG: totaltableinsertiontime = 15352.753 ms
>
> Test results show that Parallel COPY with 1 worker is performing
> better than normal COPY in the test scenarios run.
>
Good to know :)
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
>
> > Have you tested your patch when encoding conversion is needed? If so,
> > could you please point out the email that has the test results.
> >
>
> We have not yet done encoding testing, we will do and post the results
> separately in the coming days.
>
Hi Ashutosh,
I ran the tests ensuring pg_server_to_any() gets called from copy.c. I specified the encoding option of the COPY command, with the client and server encodings being UTF-8.
Tests were performed with the custom postgresql.conf [1], 10 million rows, 5.2GB data. The results are triplets of the form (exec time in sec, number of workers, gain):
Use case 1: 2 indexes on integer columns, 1 index on text column
(1174.395, 0, 1X), (1127.792, 1, 1.04X), (644.260, 2, 1.82X), (341.284, 4, 3.43X), (204.423, 8, 5.74X), (140.692, 16, 8.34X), (129.843, 20, 9.04X), (134.511, 30, 8.72X)
Use case 2: 1 gist index on text column
(811.412, 0, 1X), (772.203, 1, 1.05X), (437.364, 2, 1.85X), (263.575, 4, 3.08X), (175.135, 8, 4.63X), (155.355, 16, 5.22X), (178.704, 20, 4.54X), (199.402, 30, 4.06X)
Use case 3: 3 indexes on integer columns
(220.680, 0, 1X), (185.096, 1, 1.19X), (134.811, 2, 1.64X), (114.585, 4, 1.92X), (107.707, 8, 2.05X), (101.253, 16, 2.18X), (100.749, 20, 2.19X), (100.656, 30, 2.19X)
The results are similar to our earlier runs[2].
[1]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
[2]
https://www.postgresql.org/message-id/CALDaNm13zK%3DJXfZWqZJsm3%2B2yagYDJc%3DeJBgE4i77-4PPNj7vw%40mail.gmail.com
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Sep 24, 2020 at 3:00 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> > > Have you tested your patch when encoding conversion is needed? If so,
> > > could you please point out the email that has the test results.
> > >
> >
> > We have not yet done encoding testing; we will do it and post the results
> > separately in the coming days.
> >
>
> Hi Ashutosh,
>
> I ran the tests ensuring pg_server_to_any() gets called from copy.c. I specified the encoding option of the COPY command, with the client and server encodings being UTF-8.
>

Thanks Bharath for the testing. The results look impressive.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 22, 2020 at 7:48 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Review comments:
> > ===================
> >
> > 0001-Copy-code-readjustment-to-support-parallel-copy
> > 1.
> > @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
> > else
> > nbytes = 0; /* no data need be saved */
> >
> > + if (cstate->copy_dest == COPY_NEW_FE)
> > + minread = RAW_BUF_SIZE - nbytes;
> > +
> > inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > - 1, RAW_BUF_SIZE - nbytes);
> > + minread, RAW_BUF_SIZE - nbytes);
> >
> > No comment to explain why this change is done?
> >
> > 0002-Framework-for-leader-worker-in-parallel-copy
>
> Currently CopyGetData copies a lesser amount of data to the buffer even
> though space is available in the buffer, because minread was passed as 1
> to CopyGetData. Because of this there are frequent calls to CopyGetData
> for fetching the data. In this case it will load only some data, due to
> the below check:
> while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
> After reading some data, bytesread will be greater than minread (which is
> passed as 1) and it returns with a lesser amount of data, even though
> there is some space.
> This change is required for the parallel copy feature, as each time we
> get a new DSM data block, which is of 64K size, and copy the data. If we
> copy less data into the DSM data blocks we might end up consuming all the
> DSM data blocks.

Why can't we reuse the DSM block which has unfilled space?

> I felt this issue can be fixed as part of HEAD. Have posted a separate
> thread [1] for this. I'm planning to remove that change once it gets
> committed. Can that go as a separate patch or should we include it here?
> [1] - https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com

I am convinced by the reason given by Kyotaro-San in that other thread [1] and by the performance data shown by Peter that this can't be an independent improvement and rather, in some cases, it can do harm. Now, if you need it for the parallel-copy path then we can change it specifically for the parallel-copy code path, but I don't understand your reason completely.

> > 2. ..
> > + */
> > +typedef struct ParallelCopyLineBoundary
> >
> > Are we doing all this state management to avoid using locks while
> > processing lines? If so, I think we can use either spinlock or LWLock
> > to keep the main patch simple and then provide a later patch to make
> > it lock-less. This will allow us to first focus on the main design of
> > the patch rather than trying to make this datastructure processing
> > lock-less in the best possible way.
> >
>
> The steps will be more or less the same if we use a spinlock too. Steps
> 1, 3 & 4 will be common; we would have to use lock & unlock instead of
> steps 2 & 5. I feel we can retain the current implementation.

I'll study this in detail and let you know my opinion on the same, but in the meantime, I don't follow one part of this comment: "If they don't follow this order the worker might process wrong line_size and leader might populate the information which worker has not yet processed or in the process of processing."

Do you want to say that the leader might overwrite some information which the worker hasn't read yet? If so, it is not clear from the comment.

Another minor point about this comment:

+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * Leader process will be populating data block, data block offset & the size of

I think there should be a full stop after "worker" instead of a comma.

> > 6.
> > In function BeginParallelCopy(), you need to keep a provision to
> > collect wal_usage and buf_usage stats. See _bt_begin_parallel for
> > reference. Those will be required for pg_stat_statements.
> >
>
> Fixed.

How did you ensure that this is fixed? Have you tested it, and if so, please share the test. I see a basic problem with your fix.

+ /* Report WAL/buffer usage during parallel execution */
+ bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
+ walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ &walusage[ParallelWorkerNumber]);

You need to call InstrStartParallelQuery() before the actual operation starts; without that, the stats won't be accurate. Also, after calling WaitForParallelWorkersToFinish(), you need to accumulate the stats collected from the workers, which you have neither done nor is possible with the current code in your patch, because you haven't made any provision to capture them in BeginParallelCopy. I suggest you look into lazy_parallel_vacuum_indexes() and begin_parallel_vacuum() to understand how the buffer/wal usage stats are accumulated. Also, please test this functionality using pg_stat_statements.

> > 0003-Allow-copy-from-command-to-process-data-from-file-ST
> > 10.
> > In the commit message, you have written "The leader does not
> > participate in the insertion of data, leaders only responsibility will
> > be to identify the lines as fast as possible for the workers to do the
> > actual copy operation. The leader waits till all the lines populated
> > are processed by the workers and exits."
> >
> > I think you should also mention that we have chosen this design based
> > on the reason "that everything stalls if the leader doesn't accept
> > further input data, as well as when there are no available splitted
> > chunks so it doesn't seem like a good idea to have the leader do other
> > work. This is backed by the performance data where we have seen that
> > with 1 worker there is just a 5-10% (or whatever percentage difference
> > you have seen) performance difference)".
>
> Fixed.

Make it one paragraph, starting from "The leader does not participate in the insertion of data .... just a 5-10% performance difference". Right now the two parts look a bit disconnected.

Few additional comments:
======================

v5-0001-Copy-code-readjustment-to-support-parallel-copy
---------------------------------------------------------------------------------
1.
+/*
+ * CLEAR_EOL_LINE - Wrapper for clearing EOL.
+ */
+#define CLEAR_EOL_LINE() \
+if (!result && !IsHeaderLine()) \
+ ClearEOLFromCopiedData(cstate, cstate->line_buf.data, \
+ cstate->line_buf.len, \
+ &cstate->line_buf.len) \

I don't like this macro. I think it is sufficient to move the common code to be called from the parallel and non-parallel paths into ClearEOLFromCopiedData, and the other checks can be done in place. I think having macros for such a thing makes the code less readable.

2.
-
+static void PopulateCommonCstateInfo(CopyState cstate, TupleDesc tup_desc,
+ List *attnamelist);

Spurious line removal.

v5-0002-Framework-for-leader-worker-in-parallel-copy
---------------------------------------------------------------------------
3.
+ FullTransactionId full_transaction_id; /* xid for copy from statement */
+ CommandId mycid; /* command id */
+ ParallelCopyLineBoundaries line_boundaries; /* line array */
+} ParallelCopyShmInfo;

We already serialize FullTransactionId and CommandId via InitializeParallelDSM->SerializeTransactionState. Can't we reuse that? I think the recent Parallel Insert patch has also done something for this [2], so you can refer to that if you want.

v5-0004-Documentation-for-parallel-copy
-----------------------------------------------------------
1.
Perform <command>COPY FROM</command> in parallel using <replaceable
+ class="parameter"> integer</replaceable> background workers.

No need for a space before "integer".

[1] - https://www.postgresql.org/message-id/20200911.155804.359271394064499501.horikyota.ntt%40gmail.com
[2] - https://www.postgresql.org/message-id/CAJcOf-fn1nhEtaU91NvRuA3EbvbJGACMd4_c%2BUu3XU5VMv37Aw%40mail.gmail.com

--
With Regards,
Amit Kapila.
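[Editor's note: for reference, a sketch of the accumulation pattern Amit describes, following lazy_parallel_vacuum_indexes()/begin_parallel_vacuum(). The PARALLEL_COPY_* key names appear in the patch fragment quoted above; the surrounding fragments below are illustrative, not actual patch code.]

BufferUsage *bufferusage;
WalUsage   *walusage;
int			i;

/* Leader, in BeginParallelCopy(), before InitializeParallelDSM(): */
shm_toc_estimate_chunk(&pcxt->estimator,
					   mul_size(sizeof(BufferUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
shm_toc_estimate_chunk(&pcxt->estimator,
					   mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);

/* Leader, after InitializeParallelDSM(): reserve per-worker slots. */
bufferusage = shm_toc_allocate(pcxt->toc,
							   mul_size(sizeof(BufferUsage), pcxt->nworkers));
shm_toc_insert(pcxt->toc, PARALLEL_COPY_BUFFER_USAGE, bufferusage);
walusage = shm_toc_allocate(pcxt->toc,
							mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_insert(pcxt->toc, PARALLEL_COPY_WAL_USAGE, walusage);

/* Worker: bracket the actual copy work so the deltas are captured. */
InstrStartParallelQuery();
/* ... perform the copy ... */
InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
					  &walusage[ParallelWorkerNumber]);

/* Leader, after WaitForParallelWorkersToFinish(): fold stats back in. */
for (i = 0; i < pcxt->nworkers_launched; i++)
	InstrAccumParallelQuery(&bufferusage[i], &walusage[i]);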
On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks Ashutosh for your comments.
>
> On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Hi Vignesh,
> >
> > I've spent some time today looking at your new set of patches, and I have
> > some thoughts and queries which I would like to put here:
> >
> > Why are these not part of the shared cstate structure?
> >
> > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> >
>
> I have used shared_cstate mainly to share the integer & bool data types
> from the leader to the worker process. The above variables are of char*
> data type, so I will not be able to share them the way I could for the
> integer types. So I preferred to send these as separate keys to the
> worker. Thoughts?
>

I think the way you have written will work, but if we go with Ashutosh's proposal it will look elegant, and in the future, if we need to share more strings as part of the cstate structure, that would be easier. You can probably refer to EstimateParamListSpace, SerializeParamList, and RestoreParamList to see how we can share different types of data in one key.

--
With Regards,
Amit Kapila.
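[Editor's note: to illustrate the ParamList-style approach Amit points to, a sketch of packing several length-prefixed strings into a single DSM key. The helper names are hypothetical, not patch code, and this assumes all four strings are non-NULL.]

/*
 * Sketch: pack each string as <int length including NUL><bytes>, so
 * one shm_toc key can carry all of them and the worker can walk the
 * buffer back out. Mirrors the SerializeParamList idea in miniature.
 */
static Size
EstimateCstateStringSpace(CopyState cstate)
{
	Size		size = 0;

	size = add_size(size, sizeof(int) + strlen(cstate->null_print) + 1);
	size = add_size(size, sizeof(int) + strlen(cstate->delim) + 1);
	size = add_size(size, sizeof(int) + strlen(cstate->quote) + 1);
	size = add_size(size, sizeof(int) + strlen(cstate->escape) + 1);
	return size;
}

static char *
SerializeOneString(char *ptr, const char *str)
{
	int			len = strlen(str) + 1;

	memcpy(ptr, &len, sizeof(int));
	ptr += sizeof(int);
	memcpy(ptr, str, len);
	return ptr + len;
}

static char *
RestoreOneString(char *ptr, char **str)
{
	int			len;

	memcpy(&len, ptr, sizeof(int));
	ptr += sizeof(int);
	*str = pstrdup(ptr);		/* copy out of the shared segment */
	return ptr + len;
}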
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > I have used shared_cstate mainly to share the integer & bool data types
> > from the leader to the worker process. The above variables are of char*
> > data type, so I will not be able to share them the way I could for the
> > integer types. So I preferred to send these as separate keys to the
> > worker. Thoughts?
> >
>
> I think the way you have written will work, but if we go with
> Ashutosh's proposal it will look elegant, and in the future, if we need
> to share more strings as part of the cstate structure, that would be
> easier. You can probably refer to EstimateParamListSpace,
> SerializeParamList, and RestoreParamList to see how we can share
> different types of data in one key.
>

Yeah. And in addition to that, it will also reduce the number of DSM keys that we need to maintain.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Hi Vignesh and Bharath,

It seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as parallel-unsafe. Can you explain why this is?

Regards,
Greg Nancarrow
Fujitsu Australia
On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few additional comments:
> ======================

Some more comments:

v5-0002-Framework-for-leader-worker-in-parallel-copy
===========================================
1.
These values
+ * help in handover of multiple records with significant size of data to be
+ * processed by each of the workers to make sure there is no context switch & the
+ * work is fairly distributed among the workers.

How about writing it as: "These values help in the handover of multiple records with the significant size of data to be processed by each of the workers. This also ensures there is no context switch and the work is fairly distributed among the workers."

2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024, and accordingly choose RINGSIZE. We do it that way at many places. I think it can sometimes help in faster processing due to cache size requirements, and in this case, I don't see a reason why we can't choose these values to be power-of-two. If you agree with this change then also do some performance testing after this change?

3.
+ bool curr_blk_completed;
+ char data[DATA_BLOCK_SIZE]; /* data read from file */
+ uint8 skip_bytes;
+} ParallelCopyDataBlock;

Is there a reason to keep skip_bytes after data? Normally the variable-size data is at the end of the structure. Also, there is no comment explaining the purpose of skip_bytes.

4.
+ * Copy data block information.
+ * ParallelCopyDataBlock's will be created in DSM. Data read from file will be
+ * copied in these DSM data blocks. The leader process identifies the records
+ * and the record information will be shared to the workers. The workers will
+ * insert the records into the table. There can be one or more number of records
+ * in each of the data block based on the record size.
+ */
+typedef struct ParallelCopyDataBlock

Keep one empty line after the description line like below. I also suggest a minor tweak in the above sentence, as follows:

 * Copy data block information.
 *
 * These data blocks are created in DSM. Data read ...

Try to follow a similar format in the other comments as well.

5. I think it is better to move the parallelism-related code to a new file (we can name it copyParallel.c or something like that).

6. copy.c(1648,25): warning C4133: 'function': incompatible types - from 'ParallelCopyLineState *' to 'uint32 *'

Getting the above compilation warning on Windows.

v5-0003-Allow-copy-from-command-to-process-data-from-file
==================================================
1.
@@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate,
 * only in text mode.
 */
 initStringInfo(&cstate->attribute_buf);
- cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
+ cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *) palloc(RAW_BUF_SIZE + 1);

Is there any way IsParallelCopy can be true by this time? AFAICS, we do everything about parallelism after this. If you want to save this allocation then we need to move this after we determine whether parallelism can be used or not, and accordingly the below code in the patch needs to be changed.

* ParallelCopyFrom - parallel copy leader's functionality.
 *
 * Leader executes the before statement for before statement trigger, if before
@@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate)
 ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
 ereport(DEBUG1, (errmsg("Running parallel copy leader")));
+ /* raw_buf is not used in parallel copy, instead data blocks are used.*/
+ pfree(cstate->raw_buf);
+ cstate->raw_buf = NULL;

Is there anything else the allocation of which depends on parallelism?

2.
+static pg_attribute_always_inline bool
+IsParallelCopyAllowed(CopyState cstate)
+{
+ /* Parallel copy not allowed for frontend (2.0 protocol) & binary option. */
+ if ((cstate->copy_dest == COPY_OLD_FE) || cstate->binary)
+ return false;
+
+ /* Check if copy is into foreign table or temporary table. */
+ if (cstate->rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
+ RelationUsesLocalBuffers(cstate->rel))
+ return false;
+
+ /* Check if trigger function is parallel safe. */
+ if (cstate->rel->trigdesc != NULL &&
+ !IsTriggerFunctionParallelSafe(cstate->rel->trigdesc))
+ return false;
+
+ /*
+ * Check if there is after statement or instead of trigger or transition
+ * table triggers.
+ */
+ if (cstate->rel->trigdesc != NULL &&
+ (cstate->rel->trigdesc->trig_insert_after_statement ||
+ cstate->rel->trigdesc->trig_insert_instead_row ||
+ cstate->rel->trigdesc->trig_insert_new_table))
+ return false;
+
+ /* Check if the volatile expressions are parallel safe, if present any. */
+ if (!CheckExprParallelSafety(cstate))
+ return false;
+
+ /* Check if the insertion mode is single. */
+ if (FindInsertMethod(cstate) == CIM_SINGLE)
+ return false;
+
+ return true;
+}

In the comments, we should write why parallelism is not allowed for a particular case. The cases where a parallel-unsafe clause is involved are okay, but it is not clear from the comments why it is not allowed in the other cases.

3.
+ ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
+ ParallelCopyLineBoundary *lineInfo;
+ uint32 line_first_block = pcshared_info->cur_block_pos;
+ line_pos = UpdateBlockInLineInfo(cstate,
+ line_first_block,
+ cstate->raw_buf_index, -1,
+ LINE_LEADER_POPULATING);
+ lineInfo = &pcshared_info->line_boundaries.ring[line_pos];
+ elog(DEBUG1, "[Leader] Adding - block:%d, offset:%d, line position:%d",
+ line_first_block, lineInfo->start_offset, line_pos);

Can we take all the code here inside the function UpdateBlockInLineInfo? I see that it is called from one other place, but I guess most of the surrounding code there can also be moved inside the function. Can we change the name of the function to UpdateSharedLineInfo or something like that and remove the inline marking from it? I am not sure we want to inline such big functions. If it makes a difference in performance then we can probably consider it.

4.
EndLineParallelCopy()
{
..
+ /* Update line size. */
+ pg_atomic_write_u32(&lineInfo->line_size, line_size);
+ pg_atomic_write_u32(&lineInfo->line_state, LINE_LEADER_POPULATED);
+ elog(DEBUG1, "[Leader] After adding - line position:%d, line_size:%d",
+ line_pos, line_size);
..
}

Can we instead call UpdateSharedLineInfo (the new name for UpdateBlockInLineInfo) to do this, and maybe have it only update the required info? The idea is to centralize the code for updating SharedLineInfo.

5.
+static uint32
+GetLinePosition(CopyState cstate)
+{
+ ParallelCopyData *pcdata = cstate->pcdata;
+ ParallelCopyShmInfo *pcshared_info = pcdata->pcshared_info;
+ uint32 previous_pos = pcdata->worker_processed_pos;
+ uint32 write_pos = (previous_pos == -1) ?
+ 0 : (previous_pos + 1) % RINGSIZE;

It seems to me that each worker has to hop through all the processed chunks before getting the chunk which it can process. This will work, but I think it is better if we have some shared counter which can tell us the next chunk to be processed and avoid all the unnecessary work of hopping to find the exact position.

v5-0004-Documentation-for-parallel-copy
-----------------------------------------
1. Can you add one or two examples towards the end of the page where we have examples for other Copy options?

Please run pgindent on all patches as that will make the code look better.

From the testing perspective,
1. Test by having something force_parallel_mode = regress which means that all existing Copy tests in the regression will be executed via new worker code. You can have this as a test-only patch for now and make sure all existing tests passed with this.
2. Do we have tests for toast tables? I think if you implement the previous point some existing tests might cover it, but I feel we should have at least one or two tests for the same.
3. Have we checked the code coverage of the newly added code with the existing tests?

--
With Regards,
Amit Kapila.
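[Editor's note: to make comment 5 above concrete, here is a minimal sketch (not from the patch) of the shared-counter idea: each worker claims the next ring position with an atomic fetch-and-add instead of hopping over already-processed entries. The next_line_pos field is a hypothetical addition to the patch's ParallelCopyShmInfo, and RINGSIZE is assumed to be the ring capacity as defined in the patch. A worker would still need to wait until the leader marks the claimed slot LINE_LEADER_POPULATED before consuming it.]

```c
#include "port/atomics.h"

/*
 * Hypothetical sketch: claim the next unprocessed line position via a
 * shared atomic counter.  Assumes a pg_atomic_uint32 next_line_pos were
 * added to ParallelCopyShmInfo (initialized with pg_atomic_init_u32).
 */
static uint32
ClaimNextLinePosition(ParallelCopyShmInfo *pcshared_info)
{
	/*
	 * pg_atomic_fetch_add_u32 returns the value before the increment, so
	 * every worker gets a distinct position without taking a lock.
	 */
	uint32		pos = pg_atomic_fetch_add_u32(&pcshared_info->next_line_pos, 1);

	return pos % RINGSIZE;
}
```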
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Few additional comments:
> > ======================
>
> Some more comments:
>
Thanks for the comments, Amit. I will work on them and provide a patch in the next few days.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Hi Vignesh and Bharath,
>
> Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?
>
I don't think we need to restrict this case, and even if there is some reason to do so, then probably it should be mentioned in the comments.

--
With Regards,
Amit Kapila.
Hello Vignesh,

I've done some basic benchmarking on the v4 version of the patches (but AFAIK the v5 should perform about the same), and some initial review.

For the benchmarking, I used the lineitem table from TPC-H - for the 75GB data set, this largest table is about 64GB once loaded, with another 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and NVMe storage.

The COPY duration with varying number of workers (specified using the parallel COPY option) looks like this:

    workers   duration
    ---------------------
          0       1366
          1       1255
          2        704
          3        526
          4        434
          5        385
          6        347
          7        322
          8        327

So this seems to work pretty well - initially we get almost linear speedup, then it slows down (likely due to contention for locks, I/O etc.). Not bad.

I've only done a quick review, but overall the patch looks in fairly good shape.

1) I don't quite understand why we need INCREMENTPROCESSED and RETURNPROCESSED, considering each just does ++ or return. It just obfuscates the code, I think.

2) I find it somewhat strange that BeginParallelCopy can just decide not to do parallel copy after all. Why not make this decision in the caller? Or maybe it's fine this way, not sure.

3) AFAIK we don't modify typedefs.list in patches, so these changes should be removed.

4) IsTriggerFunctionParallelSafe actually checks all triggers, not just one, so the comment needs minor rewording.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
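[Editor's note: on Tomas's point 2, a hedged sketch of what the caller-side decision could look like. IsParallelCopyAllowed, ParallelCopyFrom and CopyFrom are names that appear in the patch or in copy.c, while the nworkers field, the return types, and the exact call shape here are hypothetical, not taken from the patch.]

```c
/*
 * Hypothetical caller-side decision, per review comment 2: check the
 * parallel-safety conditions up front and only then enter the parallel
 * path, so BeginParallelCopy never has to silently decline.
 */
static uint64
DoCopyFromSketch(CopyState cstate)
{
	if (cstate->nworkers > 0 && IsParallelCopyAllowed(cstate))
		return ParallelCopyFrom(cstate);	/* leader + workers */

	return CopyFrom(cstate);				/* serial path */
}
```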
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hello Vignesh,
>
> I've done some basic benchmarking on the v4 version of the patches (but
> AFAIK the v5 should perform about the same), and some initial review.
>
> For the benchmarking, I used the lineitem table from TPC-H - for the 75GB
> data set, this largest table is about 64GB once loaded, with another
> 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
> NVMe storage.
>
> The COPY duration with varying number of workers (specified using the
> parallel COPY option) looks like this:
>
>     workers   duration
>     ---------------------
>           0       1366
>           1       1255
>           2        704
>           3        526
>           4        434
>           5        385
>           6        347
>           7        322
>           8        327
>
> So this seems to work pretty well - initially we get almost linear
> speedup, then it slows down (likely due to contention for locks, I/O
> etc.). Not bad.
>
+1. These numbers (> 4x speed up) look good to me.

--
With Regards,
Amit Kapila.
On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 7:48 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Review comments:
> > > ===================
> > >
> > > 0001-Copy-code-readjustment-to-support-parallel-copy
> > > 1.
> > > @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
> > > else
> > > nbytes = 0; /* no data need be saved */
> > >
> > > + if (cstate->copy_dest == COPY_NEW_FE)
> > > + minread = RAW_BUF_SIZE - nbytes;
> > > +
> > > inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > > - 1, RAW_BUF_SIZE - nbytes);
> > > + minread, RAW_BUF_SIZE - nbytes);
> > >
> > > No comment to explain why this change is done?
> > >
> > > 0002-Framework-for-leader-worker-in-parallel-copy
> >
> > Currently CopyGetData copies a lesser amount of data to the buffer even though space is available in the buffer, because minread was passed as 1 to CopyGetData. Because of this there are frequent calls to CopyGetData for fetching the data. In this case it will load only some data due to the below check:
> > while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
> > After reading some data, bytesread will be greater than minread (which is passed as 1) and it returns with a lesser amount of data, even though there is some space.
> > This change is required for the parallel copy feature, as each time we get a new DSM data block which is of 64K size and copy the data. If we copy less data into DSM data blocks we might end up consuming all the DSM data blocks.
> >
> > Why can't we reuse the DSM block which has unfilled space?
> >
> > I felt this issue can be fixed as part of HEAD. Have posted a separate thread [1] for this. I'm planning to remove that change once it gets committed. Can that go as a separate
> > patch or should we include it here?
> > [1] - https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com
> >
>
> I am convinced by the reason given by Kyotaro-San in that other
> thread [1] and the performance data shown by Peter that this can't be an
> independent improvement and rather in some cases it can do harm. Now,
> if you need it for the parallel-copy path then we can change it
> specifically to the parallel-copy code path, but I don't understand
> your reason completely.
>
Whenever we need data to be populated, we will get a new data block and pass it to CopyGetData to populate the data. In the case of file copy, the server will completely fill the data block; we expect the data to be filled completely. There is no scenario where a partial data block is returned even though data is present, except for EOF or no data being available. But in the case of STDIN data copy, even though there is 8K of space available in the data block and 8K of data available in STDIN, CopyGetData will return as soon as the libpq buffer data is more than the minread. We pass a new data block every time to load data, so every time we pass an 8K data block, CopyGetData loads only a few bytes into the new data block and returns. I wanted to keep the same data population logic for both file copy and STDIN copy, i.e. copy full 8K data blocks and then process the populated data. There is an alternative solution: I can have some special handling in case of STDIN wherein the existing data block can be passed with the index from where the data should be copied. Thoughts?

> > > 2.
> ..
> > > + */
> > > +typedef struct ParallelCopyLineBoundary
> > >
> > > Are we doing all this state management to avoid using locks while
> > > processing lines? If so, I think we can use either spinlock or LWLock
> > > to keep the main patch simple and then provide a later patch to make
> > > it lock-less. This will allow us to first focus on the main design of
> > > the patch rather than trying to make this datastructure processing
> > > lock-less in the best possible way.
> > >
> >
> > The steps will be more or less the same if we use a spinlock too. Step 1, step 3 & step 4 will be common; we have to use lock & unlock instead of step 2 & step 5. I feel we can retain the current implementation.
> >
>
> I'll study this in detail and let you know my opinion on the same but
> in the meantime, I don't follow one part of this comment: "If they
> don't follow this order the worker might process wrong line_size and
> leader might populate the information which worker has not yet
> processed or in the process of processing."
>
> Do you want to say that leader might overwrite some information which
> worker hasn't read yet? If so, it is not clear from the comment.
> Another minor point about this comment:
>
Here the leader and worker must follow these steps to avoid any corruption or hang issue. I changed it to:
 * The leader & worker process access the shared line information by following
 * the below steps to avoid any data corruption or hang:

> + * ParallelCopyLineBoundary is common data structure between leader & worker,
> + * Leader process will be populating data block, data block offset &
> the size of
>
> I think there should be a full stop after worker instead of a comma.
>
Changed it.

> > > 6.
> > > In function BeginParallelCopy(), you need to keep a provision to
> > > collect wal_usage and buf_usage stats. See _bt_begin_parallel for
> > > reference. Those will be required for pg_stat_statements.
> > >
> >
> > Fixed
> >
> How did you ensure that this is fixed? Have you tested it, if so
> please share the test? I see a basic problem with your fix.
>
> + /* Report WAL/buffer usage during parallel execution */
> + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> + &walusage[ParallelWorkerNumber]);
>
> You need to call InstrStartParallelQuery() before the actual operation
> starts; without that the stats won't be accurate. Also, after calling
> WaitForParallelWorkersToFinish(), you need to accumulate the stats
> collected from workers, which neither you have done nor is possible
> with the current code in your patch because you haven't made any
> provision to capture them in BeginParallelCopy.
>
> I suggest you look into lazy_parallel_vacuum_indexes() and
> begin_parallel_vacuum() to understand how the buffer/wal usage stats
> are accumulated. Also, please test this functionality using
> pg_stat_statements.
>
Made changes accordingly.

I have verified it using pg_stat_statements (the wide output is trimmed here to the relevant columns):

postgres=# select * from pg_stat_statements where query like '%copy%';

 query                                                                                                               | calls | total_exec_time | rows   | wal_records | wal_fpi | wal_bytes
---------------------------------------------------------------------------------------------------------------------+-------+-----------------+--------+-------------+---------+-----------
 copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',')               |     1 |      265.195105 | 175000 |        1116 |       0 |   3587203
 copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',', parallel '2') |     1 |    35668.402482 | 175000 |        1119 |       6 |   3624405
(2 rows)

> > > 0003-Allow-copy-from-command-to-process-data-from-file-ST
> > > 10.
> > > In the commit message, you have written "The leader does not
> > > participate in the insertion of data, leaders only responsibility will
> > > be to identify the lines as fast as possible for the workers to do the
> > > actual copy operation. The leader waits till all the lines populated
> > > are processed by the workers and exits."
> > >
> > > I think you should also mention that we have chosen this design based
> > > on the reason "that everything stalls if the leader doesn't accept
> > > further input data, as well as when there are no available splitted
> > > chunks so it doesn't seem like a good idea to have the leader do other
> > > work. This is backed by the performance data where we have seen that
> > > with 1 worker there is just a 5-10% (or whatever percentage difference
> > > you have seen) performance difference)".
> >
> > Fixed.
> >
>
> Make it a one-paragraph starting from "The leader does not participate
> in the insertion of data .... just a 5-10% performance difference".
> Right now both the parts look a bit disconnected.
>
I made the content starting from "The leader does not" into one paragraph.

> Few additional comments:
> ======================
> v5-0001-Copy-code-readjustment-to-support-parallel-copy
> ---------------------------------------------------------------------------------
> 1.
> +/*
> + * CLEAR_EOL_LINE - Wrapper for clearing EOL.
> + */
> +#define CLEAR_EOL_LINE() \
> +if (!result && !IsHeaderLine()) \
> + ClearEOLFromCopiedData(cstate, cstate->line_buf.data, \
> + cstate->line_buf.len, \
> + &cstate->line_buf.len) \
>
> I don't like this macro.
> I think it is sufficient to move the common
> code to be called from the parallel and non-parallel path in
> ClearEOLFromCopiedData but I think the other checks can be done
> in-place. I think having macros for such a thing makes code less
> readable.
>
I have removed the macro & called ClearEOLFromCopiedData directly wherever required.

> 2.
> -
> +static void PopulateCommonCstateInfo(CopyState cstate, TupleDesc tup_desc,
> + List *attnamelist);
>
> Spurious line removal.
>
I have modified it to keep it as it is.

> v5-0002-Framework-for-leader-worker-in-parallel-copy
> ---------------------------------------------------------------------------
> 3.
> + FullTransactionId full_transaction_id; /* xid for copy from statement */
> + CommandId mycid; /* command id */
> + ParallelCopyLineBoundaries line_boundaries; /* line array */
> +} ParallelCopyShmInfo;
>
> We already serialize FullTransactionId and CommandId via
> InitializeParallelDSM->SerializeTransactionState. Can't we reuse it? I
> think recently the Parallel Insert patch has also done something for this
> [2] so you can refer to that if you want.
>
Changed it to remove the setting of command id & full transaction id. Added a function SetCurrentCommandIdUsedForWorker to set currentCommandIdUsed to true & called GetCurrentCommandId by passing !IsParallelCopy().

> v5-0004-Documentation-for-parallel-copy
> -----------------------------------------------------------
> 1. Perform <command>COPY FROM</command> in parallel using <replaceable
> + class="parameter"> integer</replaceable> background workers.
>
> No need for space before integer.
>
I have removed it.

Attached v6 patch with the fixes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
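[Editor's note: for readers following the minread discussion above, a simplified, hypothetical sketch of the read-loop semantics being described; the real logic lives in CopyGetData in src/backend/commands/copy.c, and CopyReadFromSource below is an invented helper standing in for the per-source (file/frontend) read call.]

```c
/*
 * Simplified sketch (not the actual copy.c code): with minread = 1 the
 * loop can return after a single short read, even though maxread bytes
 * of buffer space remain -- which is why each fresh DSM data block can
 * end up holding only a few bytes in the STDIN case.
 */
static int
CopyGetDataSketch(CopyState cstate, char *databuf, int minread, int maxread)
{
	int			bytesread = 0;

	while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
	{
		/* Hypothetical stand-in for the real per-source read. */
		int			copied = CopyReadFromSource(cstate, databuf, maxread);

		databuf += copied;
		maxread -= copied;
		bytesread += copied;
	}
	return bytesread;
}
```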
Attachment
- v6-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v6-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v6-0004-Documentation-for-parallel-copy.patch
- v6-0005-Tests-for-parallel-copy.patch
- v6-0006-Parallel-Copy-For-Binary-Format-Files.patch
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Hi Vignesh and Bharath,
>
> Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?
>
Yes, we don't need to restrict parallelism for the RI_TRIGGER_PK case, as we don't do any command counter increments while performing PK checks, as opposed to RI_TRIGGER_FK/foreign key checks. We have modified this in the v6 patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
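[Editor's note: a minimal sketch of how such a check could distinguish the two RI trigger types, using the existing RI_FKey_trigger_type() helper from commands/trigger.h; the wrapper function and its use inside a loop over relation->trigdesc->triggers are hypothetical, not taken from the patch.]

```c
#include "commands/trigger.h"

/*
 * Hypothetical fragment: treat only FK-side RI triggers as blocking
 * parallel copy, since FK checks perform command counter increments
 * while PK-side checks do not.
 */
static bool
RITriggerBlocksParallelCopy(Trigger *trigger)
{
	int			ri_type = RI_FKey_trigger_type(trigger->tgfoid);

	if (ri_type == RI_TRIGGER_FK)
		return true;			/* FK checks increment the command counter */

	/* RI_TRIGGER_PK and RI_TRIGGER_NONE are fine for parallel copy. */
	return false;
}
```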
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote: > > > > Thanks Ashutosh for your comments. > > > > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > Hi Vignesh, > > > > > > I've spent some time today looking at your new set of patches and I've > > > some thoughts and queries which I would like to put here: > > > > > > Why are these not part of the shared cstate structure? > > > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape); > > > > > > > I have used shared_cstate mainly to share the integer & bool data > > types from the leader to worker process. The above data types are of > > char* data type, I will not be able to use it like how I could do it > > for integer type. So I preferred to send these as separate keys to the > > worker. Thoughts? > > > > I think the way you have written will work but if we go with > Ashutosh's proposal it will look elegant and in the future, if we need > to share more strings as part of cstate structure then that would be > easier. You can probably refer to EstimateParamListSpace, > SerializeParamList, and RestoreParamList to see how we can share > different types of data in one key. > Thanks for the solution Amit, I have fixed this and handled it in the v6 patch shared in my previous mail. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 8, 2020 at 5:44 AM vignesh C <vignesh21@gmail.com> wrote:
>
> Attached v6 patch with the fixes.
>

Hi Vignesh,

I noticed a couple of issues when scanning the code in the following patch:

v6-0003-Allow-copy-from-command-to-process-data-from-file.patch

In the following code, it will put a junk uint16 value into *destptr (and thus may well cause a crash) on a Big Endian architecture (Solaris Sparc, s390x, etc.): you're storing a (uint16) string length in a uint32 and then pulling out the lower two bytes of the uint32 and copying them into the location pointed to by destptr.

static void
+CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
+ uint32 *copiedsize)
+{
+ uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
+
+ memcpy(destptr, (uint16 *) &len, sizeof(uint16));
+ *copiedsize += sizeof(uint16);
+ if (len)
+ {
+ memcpy(destptr + sizeof(uint16), srcPtr, len);
+ *copiedsize += len;
+ }
+}

I suggest you change the code to:

uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
memcpy(destptr, &len, sizeof(uint16));

[I assume string length here can't ever exceed (65535 - 1), right?]

Looking a bit deeper into this, I'm wondering if in fact your EstimateStringSize() and EstimateNodeSize() functions should be using BUFFERALIGN() for EACH stored string/node (rather than just calling shm_toc_estimate_chunk() once at the end, after the length of packed strings and nodes has been estimated), to ensure alignment of the start of each string/node. Other Postgres code appears to be aligning each stored chunk using shm_toc_estimate_chunk(). See the definition of that macro and its current usages.

Then you could safely use:

uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
*(uint16 *)destptr = len;
*copiedsize += sizeof(uint16);
if (len)
{
    memcpy(destptr + sizeof(uint16), srcPtr, len);
    *copiedsize += len;
}

and in the CopyStringFromSharedMemory() function, you could then safely use:

len = *(uint16 *)srcPtr;

The compiler may be smart enough to optimize away the memcpy() in this case anyway, but there are issues in doing this for architectures that take a performance hit for unaligned access, or don't support unaligned access.

Also, in the CopyXXXXFromSharedMemory() functions, you should use palloc() instead of palloc0(), as you're filling the entire palloc'd buffer anyway, so there is no need to ask for an additional MemSet() of all buffer bytes to 0 prior to memcpy().

Regards,
Greg Nancarrow
Fujitsu Australia
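[Editor's note: to make the endianness hazard concrete, a small self-contained sketch (plain C, independent of the patch) showing why memcpy'ing the first two bytes of a uint32 length works on little-endian but yields junk on big-endian, and why storing the length in a uint16 as Greg suggests is byte-order independent.]

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	uint32_t	len32 = 42;		/* length stored in a uint32, as in the patch */
	uint16_t	len16 = 42;		/* length stored in a uint16, as suggested */
	uint16_t	out;

	/*
	 * Buggy pattern: copies the first two bytes of len32.  On little-endian
	 * those are the low-order bytes (out == 42); on big-endian they are the
	 * high-order bytes (out == 0 here, junk in general).
	 */
	memcpy(&out, &len32, sizeof(uint16_t));
	printf("from uint32: %u\n", (unsigned) out);

	/*
	 * Safe pattern: source and destination widths match, so the value
	 * survives on any byte order.
	 */
	memcpy(&out, &len16, sizeof(uint16_t));
	printf("from uint16: %u\n", (unsigned) out);

	return 0;
}
```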
On Mon, Sep 28, 2020 at 6:37 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > Thanks Ashutosh for your comments. > > > > > > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > > > Hi Vignesh, > > > > > > > > I've spent some time today looking at your new set of patches and I've > > > > some thoughts and queries which I would like to put here: > > > > > > > > Why are these not part of the shared cstate structure? > > > > > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print); > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim); > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote); > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape); > > > > > > > > > > I have used shared_cstate mainly to share the integer & bool data > > > types from the leader to worker process. The above data types are of > > > char* data type, I will not be able to use it like how I could do it > > > for integer type. So I preferred to send these as separate keys to the > > > worker. Thoughts? > > > > > > > I think the way you have written will work but if we go with > > Ashutosh's proposal it will look elegant and in the future, if we need > > to share more strings as part of cstate structure then that would be > > easier. You can probably refer to EstimateParamListSpace, > > SerializeParamList, and RestoreParamList to see how we can share > > different types of data in one key. > > > > Yeah. And in addition to that it will also reduce the number of DSM > keys that we need to maintain. > Thanks Ashutosh, This is handled as part of the v6 patch set. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Few additional comments:
> > ======================
>
> Some more comments:
>
> v5-0002-Framework-for-leader-worker-in-parallel-copy
> ===========================================
> 1.
> These values
> + * help in handover of multiple records with significant size of data to be
> + * processed by each of the workers to make sure there is no context
> switch & the
> + * work is fairly distributed among the workers.
>
> How about writing it as: "These values help in the handover of
> multiple records with the significant size of data to be processed by
> each of the workers. This also ensures there is no context switch and
> the work is fairly distributed among the workers."

Changed as suggested.

> 2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as
> power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024,
> and accordingly choose RINGSIZE. We do it that way at many places. I
> think it can sometimes help in faster processing due to cache size
> requirements, and in this case, I don't see a reason why we can't
> choose these values to be power-of-two. If you agree with this change
> then also do some performance testing after this change?
>
Modified as suggested. I have checked a few performance tests and verified there is no degradation. We will post a performance run of this separately in the coming days.

> 3.
> + bool curr_blk_completed;
> + char data[DATA_BLOCK_SIZE]; /* data read from file */
> + uint8 skip_bytes;
> +} ParallelCopyDataBlock;
>
> Is there a reason to keep skip_bytes after data? Normally the variable-size
> data is at the end of the structure. Also, there is no comment
> explaining the purpose of skip_bytes.
>
Modified as suggested and added comments.

> 4.
> + * Copy data block information.
> + * ParallelCopyDataBlock's will be created in DSM. Data read from file will be
> + * copied in these DSM data blocks. The leader process identifies the records
> + * and the record information will be shared to the workers. The workers will
> + * insert the records into the table. There can be one or more number
> of records
> + * in each of the data block based on the record size.
> + */
> +typedef struct ParallelCopyDataBlock
>
> Keep one empty line after the description line like below. I also
> suggest a minor tweak in the above sentence, as follows:
>
> * Copy data block information.
> *
> * These data blocks are created in DSM. Data read ...
>
> Try to follow a similar format in the other comments as well.
>
Modified as suggested.

> 5. I think it is better to move the parallelism-related code to a new file
> (we can name it copyParallel.c or something like that).
>
Modified: added a copyparallel.c file to hold the copy parallelism functionality; some of the function prototypes and data structures were moved to the copy.h header file so that they can be shared between copy.c and copyparallel.c.

> 6. copy.c(1648,25): warning C4133: 'function': incompatible types -
> from 'ParallelCopyLineState *' to 'uint32 *'
> Getting the above compilation warning on Windows.
>
Modified the data type.

> v5-0003-Allow-copy-from-command-to-process-data-from-file
> ==================================================
> 1.
> @@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate,
> * only in text mode.
> */
> initStringInfo(&cstate->attribute_buf);
> - cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
> + cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *)
> palloc(RAW_BUF_SIZE + 1);
>
> Is there any way IsParallelCopy can be true by this time? AFAICS, we do
> everything about parallelism after this. If you want to save this
> allocation then we need to move this after we determine whether
> parallelism can be used or not, and accordingly the below code in the
> patch needs to be changed.
>
> * ParallelCopyFrom - parallel copy leader's functionality.
> *
> * Leader executes the before statement for before statement trigger, if before
> @@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate)
> ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
> ereport(DEBUG1, (errmsg("Running parallel copy leader")));
>
> + /* raw_buf is not used in parallel copy, instead data blocks are used.*/
> + pfree(cstate->raw_buf);
> + cstate->raw_buf = NULL;
>
Removed the palloc change; raw_buf will be allocated both for parallel and non-parallel copy. One other solution I thought of was to move the memory allocation to CopyFrom, but that might affect FDWs which use BeginCopyFrom, NextCopyFrom & EndCopyFrom. So I have kept the allocation in BeginCopyFrom and the freeing for parallel copy in ParallelCopyFrom.

> Is there anything else the allocation of which depends on parallelism?
>
I felt this is the only allocated memory that sequential copy requires and which is not required in parallel copy.

> 2.
> +static pg_attribute_always_inline bool
> +IsParallelCopyAllowed(CopyState cstate)
> +{
> + /* Parallel copy not allowed for frontend (2.0 protocol) & binary option. */
> + if ((cstate->copy_dest == COPY_OLD_FE) || cstate->binary)
> + return false;
> +
> + /* Check if copy is into foreign table or temporary table. */
> + if (cstate->rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
> + RelationUsesLocalBuffers(cstate->rel))
> + return false;
> +
> + /* Check if trigger function is parallel safe. */
> + if (cstate->rel->trigdesc != NULL &&
> + !IsTriggerFunctionParallelSafe(cstate->rel->trigdesc))
> + return false;
> +
> + /*
> + * Check if there is after statement or instead of trigger or transition
> + * table triggers.
> + */
> + if (cstate->rel->trigdesc != NULL &&
> + (cstate->rel->trigdesc->trig_insert_after_statement ||
> + cstate->rel->trigdesc->trig_insert_instead_row ||
> + cstate->rel->trigdesc->trig_insert_new_table))
> + return false;
> +
> + /* Check if the volatile expressions are parallel safe, if present any. */
> + if (!CheckExprParallelSafety(cstate))
> + return false;
> +
> + /* Check if the insertion mode is single. */
> + if (FindInsertMethod(cstate) == CIM_SINGLE)
> + return false;
> +
> + return true;
> +}
>
> In the comments, we should write why parallelism is not allowed for a
> particular case. The cases where a parallel-unsafe clause is involved
> are okay, but it is not clear from the comments why it is not allowed in
> the other cases.
>
Added comments.

> 3.
> + ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
> + ParallelCopyLineBoundary *lineInfo;
> + uint32 line_first_block = pcshared_info->cur_block_pos;
> + line_pos = UpdateBlockInLineInfo(cstate,
> + line_first_block,
> + cstate->raw_buf_index, -1,
> + LINE_LEADER_POPULATING);
> + lineInfo = &pcshared_info->line_boundaries.ring[line_pos];
> + elog(DEBUG1, "[Leader] Adding - block:%d, offset:%d, line position:%d",
> + line_first_block, lineInfo->start_offset, line_pos);
>
> Can we take all the code here inside the function UpdateBlockInLineInfo? I
> see that it is called from one other place, but I guess most of the
> surrounding code there can also be moved inside the function. Can we
> change the name of the function to UpdateSharedLineInfo or something
> like that and remove the inline marking from it? I am not sure we want
> to inline such big functions. If it makes a difference in performance
> then we can probably consider it.
>
Changed as suggested.

> 4.
> EndLineParallelCopy()
> {
> ..
> + /* Update line size. */
> + pg_atomic_write_u32(&lineInfo->line_size, line_size);
> + pg_atomic_write_u32(&lineInfo->line_state, LINE_LEADER_POPULATED);
> + elog(DEBUG1, "[Leader] After adding - line position:%d, line_size:%d",
> + line_pos, line_size);
> ..
> }
>
> Can we instead call UpdateSharedLineInfo (the new name for
> UpdateBlockInLineInfo) to do this, and maybe have it only update the
> required info? The idea is to centralize the code for updating
> SharedLineInfo.
>
Updated as suggested.

> 5.
> +static uint32
> +GetLinePosition(CopyState cstate)
> +{
> + ParallelCopyData *pcdata = cstate->pcdata;
> + ParallelCopyShmInfo *pcshared_info = pcdata->pcshared_info;
> + uint32 previous_pos = pcdata->worker_processed_pos;
> + uint32 write_pos = (previous_pos == -1) ? 0 : (previous_pos + 1) % RINGSIZE;
>
> It seems to me that each worker has to hop through all the processed
> chunks before getting the chunk which it can process. This will work,
> but I think it is better if we have some shared counter which can tell
> us the next chunk to be processed and avoid all the unnecessary work
> of hopping to find the exact position.

I tried using a spinlock to track this position instead of hopping through the processed chunks, but I did not get the earlier performance results; there was a slight degradation:

Use case 2: 3 indexes on integer columns
Run on earlier patches without spinlock:
(220.680, 0, 1X), (185.096, 1, 1.19X), (134.811, 2, 1.64X), (114.585, 4, 1.92X), (107.707, 8, 2.05X), (101.253, 16, 2.18X), (100.749, 20, 2.19X), (100.656, 30, 2.19X)
Run on latest v6 patches with spinlock:
(216.059, 0, 1X), (177.639, 1, 1.22X), (145.213, 2, 1.49X), (126.370, 4, 1.71X), (121.013, 8, 1.78X), (102.933, 16, 2.1X), (103.000, 20, 2.1X), (100.308, 30, 2.15X)

I have not included these changes as there was some performance degradation. I will try to come up with a different solution for this and discuss it in the coming days. This point is not yet handled.

> v5-0004-Documentation-for-parallel-copy
> -----------------------------------------
> 1. Can you add one or two examples towards the end of the page where
> we have examples for other Copy options?
>
>
> Please run pgindent on all patches as that will make the code look better.

I have run pgindent on the latest patches.

> From the testing perspective,
> 1. Test by having something force_parallel_mode = regress which means
> that all existing Copy tests in the regression will be executed via
> new worker code. You can have this as a test-only patch for now and
> make sure all existing tests passed with this.
> 2. Do we have tests for toast tables? I think if you implement the
> previous point some existing tests might cover it, but I feel we should
> have at least one or two tests for the same.
> 3. Have we checked the code coverage of the newly added code with
> the existing tests?

These will be handled in the next few days.

These changes are present as part of the v6 patch set. I'm summarizing the pending open points so that I don't miss anything:
1) Performance test on the latest patch set.
2) Testing points suggested.
3) Support of parallel copy for COPY_OLD_FE.
4) Worker has to hop through all the processed chunks before getting the chunk which it can process.
5) Handling of Tomas's comments.
6) Handling of Greg's comments.
We plan to work on these and complete them in the next few days.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I am convinced by the reason given by Kyotaro-San in that another > > thread [1] and performance data shown by Peter that this can't be an > > independent improvement and rather in some cases it can do harm. Now, > > if you need it for a parallel-copy path then we can change it > > specifically to the parallel-copy code path but I don't understand > > your reason completely. > > > > Whenever we need data to be populated, we will get a new data block & > pass it to CopyGetData to populate the data. In case of file copy, the > server will completely fill the data block. We expect the data to be > filled completely. If data is available it will completely load the > complete data block in case of file copy. There is no scenario where > even if data is present a partial data block will be returned except > for EOF or no data available. But in case of STDIN data copy, even > though there is 8K data available in data block & 8K data available in > STDIN, CopyGetData will return as soon as libpq buffer data is more > than the minread. We will pass new data block every time to load data. > Every time we pass an 8K data block but CopyGetData loads a few bytes > in the new data block & returns. I wanted to keep the same data > population logic for both file copy & STDIN copy i.e copy full 8K data > blocks & then the populated data can be required. There is an > alternative solution I can have some special handling in case of STDIN > wherein the existing data block can be passed with the index from > where the data should be copied. Thoughts? > What you are proposing as an alternative solution, isn't that what we are doing without the patch? IIUC, you require this because of your corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that right? If so, what is the difficulty in making it behave similar to the non-parallel case? -- With Regards, Amit Kapila.
On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > + */
> > > > +typedef struct ParallelCopyLineBoundary
> > > >
> > > > Are we doing all this state management to avoid using locks while
> > > > processing lines? If so, I think we can use either spinlock or LWLock
> > > > to keep the main patch simple and then provide a later patch to make
> > > > it lock-less. This will allow us to first focus on the main design of
> > > > the patch rather than trying to make this datastructure processing
> > > > lock-less in the best possible way.
> > > >
> > >
> > > The steps will be more or less the same if we use a spinlock too. Step 1, step 3 & step 4 will be common; we have to use lock & unlock instead of step 2 & step 5. I feel we can retain the current implementation.
> > >
> >
> > I'll study this in detail and let you know my opinion on the same but
> > in the meantime, I don't follow one part of this comment: "If they
> > don't follow this order the worker might process wrong line_size and
> > leader might populate the information which worker has not yet
> > processed or in the process of processing."
> >
> > Do you want to say that leader might overwrite some information which
> > worker hasn't read yet? If so, it is not clear from the comment.
> > Another minor point about this comment:
> >
> Here the leader and worker must follow these steps to avoid any corruption or hang issue. Changed it to:
> * The leader & worker process access the shared line information by following
> * the below steps to avoid any data corruption or hang:
>
Actually, I wanted more along the lines of why such corruption or a hang can happen. It might help reviewers to understand why you have followed such a sequence.

> > How did you ensure that this is fixed? Have you tested it, if so
> > please share the test? I see a basic problem with your fix.
> >
> > + /* Report WAL/buffer usage during parallel execution */
> > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > + &walusage[ParallelWorkerNumber]);
> >
> > You need to call InstrStartParallelQuery() before the actual operation
> > starts; without that the stats won't be accurate. Also, after calling
> > WaitForParallelWorkersToFinish(), you need to accumulate the stats
> > collected from workers, which neither you have done nor is possible
> > with the current code in your patch because you haven't made any
> > provision to capture them in BeginParallelCopy.
> >
> > I suggest you look into lazy_parallel_vacuum_indexes() and
> > begin_parallel_vacuum() to understand how the buffer/wal usage stats
> > are accumulated. Also, please test this functionality using
> > pg_stat_statements.
> >
> Made changes accordingly.
>
> I have verified it using pg_stat_statements (the wide output is trimmed here to the relevant columns):
>
> postgres=# select * from pg_stat_statements where query like '%copy%';
>
>  query                                                                                                               | calls | total_exec_time | rows   | wal_records | wal_fpi | wal_bytes
> ---------------------------------------------------------------------------------------------------------------------+-------+-----------------+--------+-------------+---------+-----------
>  copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',')               |     1 |      265.195105 | 175000 |        1116 |       0 |   3587203
>  copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',', parallel '2') |     1 |    35668.402482 | 175000 |        1119 |       6 |   3624405
> (2 rows)
>
I am not able to properly parse the data, but if I understand correctly, the wal data for the non-parallel (1116 | 0 | 3587203) and parallel (1119 | 6 | 3624405) cases doesn't seem to be the same. Is that right? If so, why? Please ensure that no checkpoint happens for both cases.

--
With Regards,
Amit Kapila.
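[Editor's note: for context on the instrumentation pattern Amit refers to, a hedged sketch of the usual worker-side sequence, modeled on the parallel vacuum code. InstrStartParallelQuery/InstrEndParallelQuery, shm_toc_lookup and ParallelWorkerNumber are existing PostgreSQL APIs; the PARALLEL_COPY_* keys and the copy step come from the quoted patch, and the wrapper function itself is hypothetical.]

```c
static void
ParallelCopyWorkerBody(shm_toc *toc, CopyState cstate)
{
	BufferUsage *bufferusage;
	WalUsage   *walusage;

	/* Start tracking buffer/WAL usage before doing any real work. */
	InstrStartParallelQuery();

	/* ... the worker performs its share of the copy here ... */

	/*
	 * Report this worker's usage so the leader can accumulate it after
	 * WaitForParallelWorkersToFinish().
	 */
	bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
	walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
						  &walusage[ParallelWorkerNumber]);
}
```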
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C <vignesh21@gmail.com> wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
> v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>
Your suggestion makes sense to me if the assumption related to string length is correct. If we can't ensure that, then we probably need to use a four-byte uint32 to store the length.

> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of the start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>
I am not sure if this is required for the purpose of correctness. AFAIU, we store/estimate multiple parameters the same way at other places; see EstimateParamListSpace and SerializeParamList. Do you have something else in mind?

While looking at the latest code, I observed the below issue in patch v6-0003-Allow-copy-from-command-to-process-data-from-file:

+ /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
+ est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
+ shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
+ &rangeTableStr, &attnameListStr,
+ &notnullListStr, &nullListStr,
+ &convertListStr);

Here, do we need to separately estimate the size of SerializedParallelCopyState when it is also done in EstimateCstateSize?

--
With Regards,
Amit Kapila.
On Fri, Oct 9, 2020 at 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Looking a bit deeper into this, I'm wondering if in fact your > > EstimateStringSize() and EstimateNodeSize() functions should be using > > BUFFERALIGN() for EACH stored string/node (rather than just calling > > shm_toc_estimate_chunk() once at the end, after the length of packed > > strings and nodes has been estimated), to ensure alignment of start of > > each string/node. Other Postgres code appears to be aligning each > > stored chunk using shm_toc_estimate_chunk(). See the definition of > > that macro and its current usages. > > > > I am not sure if this required for the purpose of correctness. AFAIU, > we do store/estimate multiple parameters in same way at other places, > see EstimateParamListSpace and SerializeParamList. Do you have > something else in mind? > The point I was trying to make is that potentially more efficient code can be used if the individual strings/nodes are aligned, rather than packed (as they are now), but as you point out, there are already cases (e.g. SerializeParamList) where within the separately-aligned chunks the data is not aligned, so maybe not a big deal. Oh well, without alignment, that means use of memcpy() cannot really be avoided here for serializing/de-serializing ints etc., let's hope the compiler optimizes it as best it can. Regards, Greg Nancarrow Fujitsu Australia
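[Editor's note: as a small aside on the point above, a hedged illustration of the portable idiom for reading a length from an unaligned, packed buffer; this is generic C, not code from the patch.]

```c
#include <stdint.h>
#include <string.h>

/*
 * Read a uint16 length from a possibly unaligned position in a packed
 * buffer.  memcpy() is safe on every architecture; compilers typically
 * lower it to a single load where unaligned access is allowed, so the
 * cost of packing (vs. BUFFERALIGN'ing each chunk) is usually just this
 * idiom rather than a real function call.
 */
static uint16_t
read_packed_len(const char *srcptr)
{
	uint16_t	len;

	memcpy(&len, srcptr, sizeof(uint16_t));
	return len;
}
```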
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> From the testing perspective,
> 1. Test by having something force_parallel_mode = regress which means
> that all existing Copy tests in the regression will be executed via
> new worker code. You can have this as a test-only patch for now and
> make sure all existing tests passed with this.
>
I don't think all the existing copy test cases (except the new test cases added in the parallel copy patch set) would run inside the parallel worker if force_parallel_mode is on. This is because the parallelism will be picked up for parallel copy only if the parallel option is specified, unlike parallelism for select queries.
Anyway, I ran with force_parallel_mode set to both on and regress. All copy-related tests and make check/make check-world ran fine.
>
> 2. Do we have tests for toast tables? I think if you implement the
> previous point some existing tests might cover it but I feel we should
> have at least one or two tests for the same.
>
Toast table use case 1: 10000 tuples, 9.6GB data, 3 indexes: 2 on integer columns, 1 on a text column (not the toast column), csv file, each row is > 1320KB:
(222.767, 0, 1X), (134.171, 1, 1.66X), (93.749, 2, 2.38X), (93.672, 4, 2.38X), (94.827, 8, 2.35X), (93.766, 16, 2.37X), (98.153, 20, 2.27X), (122.721, 30, 1.81X)
Toast table use case 2: 100000 tuples, 96GB data, 3 indexes: 2 on integer columns, 1 on a text column (not the toast column), csv file, each row is > 1320KB:
(2255.032, 0, 1X), (1358.628, 1, 1.66X), (901.170, 2, 2.5X), (912.743, 4, 2.47X), (988.718, 8, 2.28X), (938.000, 16, 2.4X), (997.556, 20, 2.26X), (1000.586, 30, 2.25X)
Toast table use case 3: 10000 tuples, 9.6GB, no indexes, binary file, each row is > 1320KB:
(136.983, 0, 1X), (136.418, 1, 1X), (81.896, 2, 1.66X), (62.929, 4, 2.16X), (52.311, 8, 2.6X), (40.032, 16, 3.49X), (44.097, 20, 3.09X), (62.310, 30, 2.18X)
In the case of a Toast table, we could achieve up to 2.5X for csv files, and 3.5X for binary files. We are analyzing this point and will post an update on our findings soon.
While testing the Toast table case with a binary file, I discovered an issue with the earlier v6-0006-Parallel-Copy-For-Binary-Format-Files.patch from [1]; I fixed it and have added the updated v6-0006 patch here. Please note that I'm also attaching patches 1 to 5 from version 6 just for completeness; they have no change from what Vignesh sent earlier in [1].
>
> 3. Have we checked the code coverage of the newly added code with
> existing tests?
>
So far, we have manually ensured that most of the code paths are covered (see the list of test cases below). But we are also planning to measure the code coverage using some tool in the coming days.
Apart from the above tests, I also captured performance measurements on the latest v6 patch set.
Use case 1: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on a text column, csv file
(1168.484, 0, 1X), (1116.442, 1, 1.05X), (641.272, 2, 1.82X), (338.963, 4, 3.45X), (202.914, 8, 5.76X), (139.884, 16, 8.35X), (128.955, 20, 9.06X), (131.898, 30, 8.86X)
Use case 2: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on a text column, binary file
(1097.83, 0, 1X), (1095.735, 1, 1.002X), (625.610, 2, 1.75X), (319.833, 4, 3.43X), (186.908, 8, 5.87X), (132.115, 16, 8.31X), (128.854, 20, 8.52X), (134.965, 30, 8.13X)
Use case 3: 10 million rows, 5.2GB data, 3 indexes on integer columns, csv file
(218.227, 0, 1X), (182.815, 1, 1.19X), (135.500, 2, 1.61X), (113.954, 4, 1.91X), (106.243, 8, 2.05X), (101.222, 16, 2.15X), (100.378, 20, 2.17X), (100.351, 30, 2.17X)
All the above tests are performed on the latest v6 patch set (attached here in this thread) with a custom postgresql.conf[2]. The results are of the triplet form (exec time in sec, number of workers, gain).
Overall, we have the below test cases to cover the code and for performance measurements. We plan to run these tests whenever a new set of patches is posted.
1. csv
2. binary
3. force parallel mode = regress
4. toast data csv and binary
5. foreign key check, before row, after row, before statement, after statement, instead of triggers
6. partition case
7. foreign partitions and partitions having trigger cases
8. where clause having parallel unsafe and safe expression, default parallel unsafe and safe expression
9. temp, global, local, unlogged, inherited tables cases, foreign tables
[1] https://www.postgresql.org/message-id/CALDaNm29DJKy0-vozs8eeBRf2u3rbvPdZHCocrd0VjoWHS7h5A%40mail.gmail.com
[2]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v6-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v6-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v6-0004-Documentation-for-parallel-copy.patch
- v6-0005-Tests-for-parallel-copy.patch
- v6-0006-Parallel-Copy-For-Binary-Format-Files.patch
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > From the testing perspective,
> > 1. Test by having something force_parallel_mode = regress which means
> > that all existing Copy tests in the regression will be executed via
> > new worker code. You can have this as a test-only patch for now and
> > make sure all existing tests passed with this.
> >
> I don't think all the existing copy test cases (except the new test cases added in the parallel copy patch set) would run inside the parallel worker if force_parallel_mode is on. This is because the parallelism will be picked up for parallel copy only if the parallel option is specified, unlike parallelism for select queries.
>
Sure, you need to change the code such that when force_parallel_mode = 'regress' is specified then it always uses one worker. This is primarily for testing purposes and will help during the development of this patch, as it will make all existing Copy tests use quite a good portion of the parallel infrastructure.

> All the above tests are performed on the latest v6 patch set (attached here in this thread) with a custom postgresql.conf[2]. The results are of the triplet form (exec time in sec, number of workers, gain)
>
Okay, so I am assuming the performance is the same as we have seen with the earlier versions of the patches.

> Overall, we have the below test cases to cover the code and for performance measurements. We plan to run these tests whenever a new set of patches is posted.
>
> 1. csv
> 2. binary

Don't we need tests for plain text files as well?

> 3. force parallel mode = regress
> 4. toast data csv and binary
> 5. foreign key check, before row, after row, before statement, after statement, instead of triggers
> 6. partition case
> 7. foreign partitions and partitions having trigger cases
> 8. where clause having parallel unsafe and safe expression, default parallel unsafe and safe expression
> 9. temp, global, local, unlogged, inherited tables cases, foreign tables
>
Sounds like good coverage. So, are you doing all this testing manually? How are you maintaining these tests?

--
With Regards,
Amit Kapila.
On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > From the testing perspective,
> > > 1. Test by having something force_parallel_mode = regress which means
> > > that all existing Copy tests in the regression will be executed via
> > > new worker code. You can have this as a test-only patch for now and
> > > make sure all existing tests passed with this.
> > >
> >
> > I don't think all the existing copy test cases (except the new test
> > cases added in the parallel copy patch set) would run inside the
> > parallel worker if force_parallel_mode is on. This is because the
> > parallelism will be picked up for parallel copy only if the parallel
> > option is specified, unlike parallelism for select queries.
> >
>
> Sure, you need to change the code such that when force_parallel_mode =
> 'regress' is specified then it always uses one worker. This is
> primarily for testing purposes and will help during the development of
> this patch, as it will make all existing Copy tests use quite a good
> portion of the parallel infrastructure.
>

IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS as
the default value in guc.c, and then adjust the parallelism related code
in copy.c such that it always picks 1 worker and spawns it. This way, all
the existing copy test cases would be run in a parallel worker. Please let
me know if this is okay. If yes, I will do this and update here.

> > All the above tests are performed on the latest v6 patch set (attached
> > here in this thread) with custom postgresql.conf [1]. The results are
> > of the triplet form (exec time in sec, number of workers, gain)
> >
>
> Okay, so I am assuming the performance is the same as we have seen
> with the earlier versions of patches.
>

Yes. Most recent run on v5 patch set [1]

> > Overall, we have the below test cases to cover the code and to measure
> > performance. We plan to run these tests whenever a new set of patches
> > is posted.
> >
> > 1. csv
> > 2. binary
>
> Don't we need the tests for plain text files as well?
>

Will add one.

> > 3. force parallel mode = regress
> > 4. toast data csv and binary
> > 5. foreign key check, before row, after row, before statement, after
> > statement, instead of triggers
> > 6. partition case
> > 7. foreign partitions and partitions having trigger cases
> > 8. where clause having parallel unsafe and safe expression, default
> > parallel unsafe and safe expression
> > 9. temp, global, local, unlogged, inherited tables cases, foreign tables
> >
>
> Sounds like good coverage. So, are you doing all this testing
> manually? How are you maintaining these tests?
>

Yes, running them manually. A few of the tests (1, 2, 4) require huge
datasets for performance measurements, and the other test cases are to
ensure we don't choose parallelism. We will try to add the test cases that
are not meant for performance to the patch tests.

[1] - https://www.postgresql.org/message-id/CALj2ACW%3Djm5ri%2B7rXiQaFT_c5h2rVS%3DcJOQVFR5R%2Bbowt3QDkw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 9, 2020 at 3:50 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > I don't think all the existing copy test cases (except the new test
> > > cases added in the parallel copy patch set) would run inside the
> > > parallel worker if force_parallel_mode is on. This is because the
> > > parallelism will be picked up for parallel copy only if the parallel
> > > option is specified, unlike parallelism for select queries.
> > >
> >
> > Sure, you need to change the code such that when force_parallel_mode =
> > 'regress' is specified then it always uses one worker. This is
> > primarily for testing purposes and will help during the development of
> > this patch, as it will make all existing Copy tests use quite a good
> > portion of the parallel infrastructure.
> >
>
> IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS
> as the default value in guc.c,
>

No need to set this as the default value. You can change it in
postgresql.conf before running the tests.

> and then adjust the parallelism related
> code in copy.c such that it always picks 1 worker and spawns it. This
> way, all the existing copy test cases would be run in a parallel worker.
> Please let me know if this is okay.
>

Yeah, this sounds fine.

> If yes, I will do this and update
> here.
>

Okay, thanks, but ensure the difference in test execution before and after
your change. After your change, all the 'copy' tests should invoke the
worker to perform the copy.

> > > All the above tests are performed on the latest v6 patch set
> > > (attached here in this thread) with custom postgresql.conf [1]. The
> > > results are of the triplet form (exec time in sec, number of
> > > workers, gain)
> > >
> >
> > Okay, so I am assuming the performance is the same as we have seen
> > with the earlier versions of patches.
> >
>
> Yes. Most recent run on v5 patch set [1]
>

Okay, good to know that.

--
With Regards,
Amit Kapila.
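A minimal sketch of the test-only adjustment being discussed, assuming it lives near the start of BeginParallelCopy() and that IsParallelCopyAllowed() is the safety check from the patch set (the exact placement is illustrative, not from the patch):

    /*
     * Test-only adjustment: when force_parallel_mode = regress, run even a
     * plain COPY FROM through a single parallel worker so that the existing
     * regression tests exercise the parallel infrastructure.
     */
    int         nworkers = cstate->nworkers;

    if (nworkers <= 0 &&
        force_parallel_mode == FORCE_PARALLEL_REGRESS &&
        IsParallelCopyAllowed(cstate))
        nworkers = 1;           /* force exactly one worker for testing */

force_parallel_mode and FORCE_PARALLEL_REGRESS are the existing GUC and enum value; only the placement and the use of nworkers here are assumptions.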
On Fri, Oct 9, 2020 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> While looking at the latest code, I observed the below issue in patch
> v6-0003-Allow-copy-from-command-to-process-data-from-file:
>
> + /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
> + est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
> + shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
> + shm_toc_estimate_keys(&pcxt->estimator, 1);
> +
> + strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
> +                              &rangeTableStr, &attnameListStr,
> +                              &notnullListStr, &nullListStr,
> +                              &convertListStr);
>
> Here, do we need to separately estimate the size of
> SerializedParallelCopyState when it is also done in
> EstimateCstateSize?

This is not required; it has been removed in the attached patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v7-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v7-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v7-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v7-0004-Documentation-for-parallel-copy.patch
- v7-0005-Tests-for-parallel-copy.patch
- v7-0006-Parallel-Copy-For-Binary-Format-Files.patch
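For reference, the corrected estimation collapses to a single chunk/key pair, roughly like the sketch below (this assumes EstimateCstateSize() accounts for the MAXALIGN'd SerializedParallelCopyState header in its return value; whether the shm_toc calls live in the caller or inside the function is a detail of the patch):

    /*
     * One estimate covers the serialized cstate header plus all the
     * serialized strings; no separate estimate for the header is needed.
     */
    strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
                                 &rangeTableStr, &attnameListStr,
                                 &notnullListStr, &nullListStr,
                                 &convertListStr);
    shm_toc_estimate_chunk(&pcxt->estimator, strsize);
    shm_toc_estimate_keys(&pcxt->estimator, 1);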
I did performance testing on the v7 patch set [1] with custom postgresql.conf [2]. The results are of the triplet form (exec time in sec, number of workers, gain)

Use case 1: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, binary file:
(1104.898, 0, 1X), (1112.221, 1, 1X), (640.236, 2, 1.72X), (335.090, 4, 3.3X), (200.492, 8, 5.51X), (131.448, 16, 8.4X), (121.832, 20, 9.1X), (124.287, 30, 8.9X)

Use case 2: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, copy from stdin, csv format:
(1203.282, 0, 1X), (1135.517, 1, 1.06X), (655.140, 2, 1.84X), (343.688, 4, 3.5X), (203.742, 8, 5.9X), (144.793, 16, 8.31X), (133.339, 20, 9.02X), (136.672, 30, 8.8X)

Use case 3: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, text file:
(1165.991, 0, 1X), (1128.599, 1, 1.03X), (644.793, 2, 1.81X), (342.813, 4, 3.4X), (204.279, 8, 5.71X), (139.986, 16, 8.33X), (128.259, 20, 9.1X), (132.764, 30, 8.78X)

The above results are similar to the results with earlier versions of the patch set.

On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Sure, you need to change the code such that when force_parallel_mode =
> 'regress' is specified then it always uses one worker. This is
> primarily for testing purposes and will help during the development of
> this patch, as it will make all existing Copy tests use quite a good
> portion of the parallel infrastructure.
>

I performed force_parallel_mode = regress testing and found 2 issues; the fixes for the same are available in the v7 patch set [1].

> > Overall, we have the below test cases to cover the code and to measure
> > performance. We plan to run these tests whenever a new set of patches
> > is posted.
> >
> > 1. csv
> > 2. binary
>
> Don't we need the tests for plain text files as well?
>

I added a text use case; the above mentioned are the perf results on the v7 patch set [1].

> > 3. force parallel mode = regress
> > 4. toast data csv and binary
> > 5. foreign key check, before row, after row, before statement, after
> > statement, instead of triggers
> > 6. partition case
> > 7. foreign partitions and partitions having trigger cases
> > 8. where clause having parallel unsafe and safe expression, default
> > parallel unsafe and safe expression
> > 9. temp, global, local, unlogged, inherited tables cases, foreign tables
> >
>
> Sounds like good coverage. So, are you doing all this testing
> manually? How are you maintaining these tests?
>

All the test cases listed above, except for the cases that are meant to measure perf gain with huge data, are present in the v7-0005 patch in the v7 patch set [1].

[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com
[2]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 9, 2020 at 10:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > I am convinced by the reason given by Kyotaro-San in that another
> > > thread [1] and performance data shown by Peter that this can't be an
> > > independent improvement and rather in some cases it can do harm. Now,
> > > if you need it for a parallel-copy path then we can change it
> > > specifically to the parallel-copy code path but I don't understand
> > > your reason completely.
> > >
> >
> > Whenever we need data to be populated, we will get a new data block &
> > pass it to CopyGetData to populate the data. In case of file copy, the
> > server will completely fill the data block. We expect the data to be
> > filled completely. If data is available it will completely load the
> > complete data block in case of file copy. There is no scenario where
> > even if data is present a partial data block will be returned except
> > for EOF or no data available. But in case of STDIN data copy, even
> > though there is 8K data available in data block & 8K data available in
> > STDIN, CopyGetData will return as soon as libpq buffer data is more
> > than the minread. We will pass new data block every time to load data.
> > Every time we pass an 8K data block but CopyGetData loads a few bytes
> > in the new data block & returns. I wanted to keep the same data
> > population logic for both file copy & STDIN copy i.e copy full 8K data
> > blocks & then the populated data can be required. There is an
> > alternative solution I can have some special handling in case of STDIN
> > wherein the existing data block can be passed with the index from
> > where the data should be copied. Thoughts?
> >
>
> What you are proposing as an alternative solution, isn't that what we
> are doing without the patch? IIUC, you require this because of your
> corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that
> right? If so, what is the difficulty in making it behave similar to
> the non-parallel case?
>
The alternate solution is similar to how the existing copy handles STDIN copies. I have made changes in the v7 patch attached in [1] to have parallel copy handle STDIN data similar to non-parallel copy, so the original comment on why this change was required has been removed from the 0001 patch:
> > + if (cstate->copy_dest == COPY_NEW_FE)
> > + minread = RAW_BUF_SIZE - nbytes;
> > +
> > inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > - 1, RAW_BUF_SIZE - nbytes);
> > + minread, RAW_BUF_SIZE - nbytes);
> >
> > No comment to explain why this change is done?
[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 9, 2020 at 11:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > + */
> > > > > +typedef struct ParallelCopyLineBoundary
> > > > >
> > > > > Are we doing all this state management to avoid using locks while
> > > > > processing lines? If so, I think we can use either spinlock or LWLock
> > > > > to keep the main patch simple and then provide a later patch to make
> > > > > it lock-less. This will allow us to first focus on the main design of
> > > > > the patch rather than trying to make this datastructure processing
> > > > > lock-less in the best possible way.
> > > > >
> > > >
> > > > The steps will be more or less the same if we use spinlock too. Step 1,
> > > > step 3 & step 4 will be common; we have to use lock & unlock instead of
> > > > step 2 & step 5. I feel we can retain the current implementation.
> > > >
> > >
> > > I'll study this in detail and let you know my opinion on the same but
> > > in the meantime, I don't follow one part of this comment: "If they
> > > don't follow this order the worker might process wrong line_size and
> > > leader might populate the information which worker has not yet
> > > processed or in the process of processing."
> > >
> > > Do you want to say that leader might overwrite some information which
> > > worker hasn't read yet? If so, it is not clear from the comment.
> > > Another minor point about this comment:
> > >
> >
> > Here leader and worker must follow these steps to avoid any corruption
> > or hang issue. Changed it to:
> > * The leader & worker process access the shared line information by following
> > * the below steps to avoid any data corruption or hang:
> >
>
> Actually, I wanted more on the lines of why such corruption or hang can
> happen? It might help reviewers to understand why you have followed
> such a sequence.

There are 3 variables which the leader & worker are working on: line_size, line_state & data. The leader will update line_state, populate the data, and then update line_size & line_state. Workers will wait for line_state to be updated; once it is updated, the worker will read the data based on the line_size. If they are not synchronized, a wrong line_size can be set, a wrong amount of data read, and anything can happen. This is the usual concurrency case with readers/writers, so I felt that much detail need not be mentioned.

> > > How did you ensure that this is fixed? Have you tested it, if so
> > > please share the test? I see a basic problem with your fix.
> > >
> > > + /* Report WAL/buffer usage during parallel execution */
> > > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> > > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> > > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > > +                       &walusage[ParallelWorkerNumber]);
> > >
> > > You need to call InstrStartParallelQuery() before the actual operation
> > > starts; without that the stats won't be accurate. Also, after calling
> > > WaitForParallelWorkersToFinish(), you need to accumulate the stats
> > > collected from workers, which neither you have done nor is possible
> > > with the current code in your patch because you haven't made any
> > > provision to capture them in BeginParallelCopy.
> > >
> > > I suggest you look into lazy_parallel_vacuum_indexes() and
> > > begin_parallel_vacuum() to understand how the buffer/wal usage stats
> > > are accumulated. Also, please test this functionality using
> > > pg_stat_statements.
> > >
> >
> > Made changes accordingly. I have verified it using
> > "select * from pg_stat_statements where query like '%copy%';"
> > (showing only the relevant columns of the output):
> >
> > query: copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> >        with(format csv, delimiter ',')
> >   calls 1 | total_exec_time 265.195105 | rows 175000 |
> >   wal_records 1116 | wal_fpi 0 | wal_bytes 3587203
> >
> > query: copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> >        with(format csv, delimiter ',', parallel '2')
> >   calls 1 | total_exec_time 35668.402482 | rows 175000 |
> >   wal_records 1119 | wal_fpi 6 | wal_bytes 3624405
> >
>
> I am not able to properly parse the data but if I understand correctly,
> the wal data for the non-parallel (1116 | 0 | 3587203) and parallel
> (1119 | 6 | 3624405) case doesn't seem to be the same. Is that right?
> If so, why? Please ensure that no checkpoint happens for both cases.
>

I have disabled checkpoints; the results with checkpoints disabled are given below:

                         | wal_records | wal_fpi | wal_bytes
Sequential Copy          |        1116 |       0 |   3587669
Parallel Copy(1 worker)  |        1116 |       0 |   3587669
Parallel Copy(4 worker)  |        1121 |       0 |   3587668

I noticed that for 1 worker, wal_records & wal_bytes are the same as for sequential copy, but with a higher worker count there is a difference in wal_records & wal_bytes. I think the difference should be ok, because with more than 1 worker the order in which the records are processed differs based on which worker picks which records from the input file. In the case of sequential copy/1 worker the records are always processed in the same order, hence wal_bytes are the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
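The accumulation pattern being asked for above, modeled on what parallel vacuum does in lazy_parallel_vacuum_indexes()/begin_parallel_vacuum(), is roughly the following sketch (the bufferusage/walusage array names match the snippet quoted above; the surrounding code is illustrative, not the patch itself):

    /* Worker (ParallelCopyMain), before doing any copy work: */
    InstrStartParallelQuery();

    /* ... worker performs its share of the copy ... */

    /* Worker, once done: publish its counters into the shared arrays. */
    InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
                          &walusage[ParallelWorkerNumber]);

    /* Leader, after WaitForParallelWorkersToFinish(pcxt): */
    for (int i = 0; i < pcxt->nworkers_launched; i++)
        InstrAccumParallelQuery(&bufferusage[i], &walusage[i]);

InstrStartParallelQuery(), InstrEndParallelQuery() and InstrAccumParallelQuery() are the existing instrument.c entry points; only their placement in the copy code here is an assumption.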
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hello Vignesh,
>
> I've done some basic benchmarking on the v4 version of the patches (but
> AFAIKC the v5 should perform about the same), and some initial review.
>
> For the benchmarking, I used the lineitem table from TPC-H - for 75GB
> data set, this largest table is about 64GB once loaded, with another
> 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
> NVME storage.
>
> The COPY duration with varying number of workers (specified using the
> parallel COPY option) looks like this:
>
> workers duration
> ---------------------
> 0 1366
> 1 1255
> 2 704
> 3 526
> 4 434
> 5 385
> 6 347
> 7 322
> 8 327
>
> So this seems to work pretty well - initially we get almost linear
> speedup, then it slows down (likely due to contention for locks, I/O
> etc.). Not bad.
Thanks for testing with different workers & posting the results.
> I've only done a quick review, but overall the patch looks in fairly
> good shape.
>
> 1) I don't quite understand why we need INCREMENTPROCESSED and
> RETURNPROCESSED, considering it just does ++ or return. It just
> obfuscated the code, I think.
>
I have removed the macros.
> 2) I find it somewhat strange that BeginParallelCopy can just decide not
> to do parallel copy after all. Why not to do this decisions in the
> caller? Or maybe it's fine this way, not sure.
>
I have moved the check IsParallelCopyAllowed to the caller.
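With that move, the caller-side decision could look roughly like this sketch (the function names are from the patch set; the surrounding control flow and return conventions are illustrative assumptions):

    /* Try parallel copy only when the user asked for it and it is safe. */
    if (cstate->nworkers > 0 && IsParallelCopyAllowed(cstate))
        pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
                                 relid);

    if (pcxt)
        *processed = ParallelCopyFrom(cstate);  /* leader + workers */
    else
        *processed = CopyFrom(cstate);          /* fall back to serial copy */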
> 3) AFAIK we don't modify typedefs.list in patches, so these changes
> should be removed.
>
I have seen typedefs.list being changed in many commits, and it also helps in running pgindent, so I'm retaining this change.
> 4) IsTriggerFunctionParallelSafe actually checks all triggers, not just
> one, so the comment needs minor rewording.
>
Modified the comments.
Thanks for the comments & for sharing the test results, Tomas. These changes are fixed in one of my earlier mails [1].
[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C <vignesh21@gmail.com> wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
> v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>
> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>
I'm not handling this; it is similar to how it is handled in other places.
> Then you could safely use:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> *(uint16 *)destptr = len;
> *copiedsize += sizeof(uint16);
> if (len)
> {
> memcpy(destptr + sizeof(uint16), srcPtr, len);
> *copiedsize += len;
> }
>
> and in the CopyStringFromSharedMemory() function, then could safely use:
>
> len = *(uint16 *)srcPtr;
>
> The compiler may be smart enough to optimize-away the memcpy() in this
> case anyway, but there are issues in doing this for architectures that
> take a performance hit for unaligned access, or don't support
> unaligned access.
Changed it to uint32, so that there are no issues in case the length exceeds 65535, and also to avoid problems on Big Endian architectures.
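With that change, the helper would look roughly like the sketch below (matching uint32 types on both sides of the memcpy, so the store is safe regardless of endianness; the cstate parameter is kept only to mirror the signature quoted above):

    static void
    CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
                             uint32 *copiedsize)
    {
        uint32      len = srcPtr ? strlen(srcPtr) + 1 : 0;

        /* Length prefix and length variable now have the same width. */
        memcpy(destptr, &len, sizeof(uint32));
        *copiedsize += sizeof(uint32);
        if (len)
        {
            memcpy(destptr + sizeof(uint32), srcPtr, len);
            *copiedsize += len;
        }
    }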
> Also, in CopyXXXXFromSharedMemory() functions, you should use palloc()
> instead of palloc0(), as you're filling the entire palloc'd buffer
> anyway, so no need to ask for additional MemSet() of all buffer bytes
> to 0 prior to memcpy().
>
I have changed palloc0 to palloc.
Thanks Greg for reviewing & providing your comments. These changes are fixed in one of my earlier mail [1] that I sent.
[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 14, 2020 at 6:51 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 11:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I am not able to properly parse the data but if I understand correctly,
> > the wal data for the non-parallel (1116 | 0 | 3587203) and parallel
> > (1119 | 6 | 3624405) case doesn't seem to be the same. Is that right?
> > If so, why? Please ensure that no checkpoint happens for both cases.
> >
>
> I have disabled checkpoints; the results with checkpoints disabled
> are given below:
>                          | wal_records | wal_fpi | wal_bytes
> Sequential Copy          |        1116 |       0 |   3587669
> Parallel Copy(1 worker)  |        1116 |       0 |   3587669
> Parallel Copy(4 worker)  |        1121 |       0 |   3587668
> I noticed that for 1 worker, wal_records & wal_bytes are the same as
> for sequential copy, but with a higher worker count there is a
> difference in wal_records & wal_bytes. I think the difference should
> be ok, because with more than 1 worker the order in which the records
> are processed differs based on which worker picks which records from
> the input file. In the case of sequential copy/1 worker the records
> are always processed in the same order, hence wal_bytes are the same.
>

Are all records of the same size in your test? If so, then why should the
order matter? Also, even though the number of wal_records has increased,
wal_bytes has not increased; rather it is one byte less. Can we identify
what is going on here? I don't intend to say that it is a problem but we
should know the reason clearly.

--
With Regards,
Amit Kapila.
Hi Vignesh,

After having a look over the patch, I have some suggestions for
0003-Allow-copy-from-command-to-process-data-from-file.patch.

1.

+static uint32
+EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
+                   char **whereClauseStr, char **rangeTableStr,
+                   char **attnameListStr, char **notnullListStr,
+                   char **nullListStr, char **convertListStr)
+{
+    uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
+
+    strsize += EstimateStringSize(cstate->null_print);
+    strsize += EstimateStringSize(cstate->delim);
+    strsize += EstimateStringSize(cstate->quote);
+    strsize += EstimateStringSize(cstate->escape);

It uses the function EstimateStringSize to get the strlen of null_print, delim, quote and escape. But the length of null_print seems to have been stored in null_print_len already, and delim/quote/escape must be 1 byte, so calling strlen again seems unnecessary.

How about "strsize += sizeof(uint32) + cstate->null_print_len + 1"?

2.

+    strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr);

+    copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr,
+                                           shmptr + copiedsize);

Some string lengths are counted two times. For whereClauseStr, strlen is called once in EstimateNodeSize and again in CopyStringToSharedMemory. I don't know whether it's worth refactoring the code to avoid the duplicate strlen. What do you think?

Best regards,
houzj
On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi Vignesh,
>
> After having a look over the patch, I have some suggestions for
> 0003-Allow-copy-from-command-to-process-data-from-file.patch.
>
> 1.
>
> +static uint32
> +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
> +                   char **whereClauseStr, char **rangeTableStr,
> +                   char **attnameListStr, char **notnullListStr,
> +                   char **nullListStr, char **convertListStr)
> +{
> +    uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
> +
> +    strsize += EstimateStringSize(cstate->null_print);
> +    strsize += EstimateStringSize(cstate->delim);
> +    strsize += EstimateStringSize(cstate->quote);
> +    strsize += EstimateStringSize(cstate->escape);
>
> It uses the function EstimateStringSize to get the strlen of null_print,
> delim, quote and escape. But the length of null_print seems to have been
> stored in null_print_len already, and delim/quote/escape must be 1 byte,
> so calling strlen again seems unnecessary.
>
> How about "strsize += sizeof(uint32) + cstate->null_print_len + 1"?
>

+1. This seems like a good suggestion, but add comments for
delim/quote/escape to indicate that we are considering one byte for each.
I think this will obviate the need for the function EstimateStringSize.
Another thing in this regard is that we normally use the add_size function
to compute the size, but I don't see that being used in this and the nearby
computation. That helps us to detect overflow of addition, if any.

EstimateCstateSize()
{
..
+
+ strsize++;
..
}

Why do we need this additional one-byte increment? Does it make sense to
add a small comment for the same?

> 2.
>
> +    strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr);
>
> +    copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr,
> +                                           shmptr + copiedsize);
>
> Some string lengths are counted two times. For whereClauseStr, strlen is
> called once in EstimateNodeSize and again in CopyStringToSharedMemory.
> I don't know whether it's worth refactoring the code to avoid the
> duplicate strlen. What do you think?
>

It doesn't seem worth it to me. We would probably need to use additional
variables to save those lengths. I think it will add more code/complexity
than we will save. See EstimateParamListSpace and SerializeParamList where
we get the typeLen each time; that way the code looks neat to me, and we
are not going to save much by not following a similar thing here.

--
With Regards,
Amit Kapila.
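Combining Hou's suggestion with add_size(), the estimation could be written roughly like this (null_print_len is an existing CopyStateData field; add_size() is the existing overflow-checked helper operating on Size, so the counter type changes accordingly):

    Size        strsize = MAXALIGN(sizeof(SerializedParallelCopyState));

    /* length word + string bytes + terminator, with overflow checking */
    strsize = add_size(strsize, sizeof(uint32) + cstate->null_print_len + 1);
    strsize = add_size(strsize, sizeof(uint32) + 2);    /* delim: 1 byte + '\0' */
    strsize = add_size(strsize, sizeof(uint32) + 2);    /* quote: 1 byte + '\0' */
    strsize = add_size(strsize, sizeof(uint32) + 2);    /* escape: 1 byte + '\0' */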
On Thu, Oct 15, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 14, 2020 at 6:51 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > I have disabled checkpoints; the results with checkpoints disabled
> > are given below:
> >                          | wal_records | wal_fpi | wal_bytes
> > Sequential Copy          |        1116 |       0 |   3587669
> > Parallel Copy(1 worker)  |        1116 |       0 |   3587669
> > Parallel Copy(4 worker)  |        1121 |       0 |   3587668
> > I noticed that for 1 worker, wal_records & wal_bytes are the same as
> > for sequential copy, but with a higher worker count there is a
> > difference in wal_records & wal_bytes. I think the difference should
> > be ok, because with more than 1 worker the order in which the records
> > are processed differs based on which worker picks which records from
> > the input file. In the case of sequential copy/1 worker the records
> > are always processed in the same order, hence wal_bytes are the same.
> >
>
> Are all records of the same size in your test? If so, then why should
> the order matter? Also, even though the number of wal_records has
> increased, wal_bytes has not increased; rather it is one byte less.
> Can we identify what is going on here? I don't intend to say that it
> is a problem but we should know the reason clearly.

The earlier run that I executed was with varying record sizes. The below results are after modifying the records to all be the same size:

                          | wal_records | wal_fpi | wal_bytes
Sequential Copy           |        1307 |       0 |   4198526
Parallel Copy(1 worker)   |        1307 |       0 |   4198526
Parallel Copy(2 worker)   |        1308 |       0 |   4198836
Parallel Copy(4 worker)   |        1307 |       0 |   4199147
Parallel Copy(8 worker)   |        1312 |       0 |   4199735
Parallel Copy(16 worker)  |        1313 |       0 |   4200311

Still I noticed some difference in wal_records & wal_bytes. I feel the difference is because of the following: each worker prepares 1000 tuples and then does a heap_multi_insert for those 1000 tuples. In our case approximately 185 tuples fit in 1 page, so 925 tuples are stored in 5 WAL records and the remaining 75 tuples in the next WAL record. The wal dump is like below:

rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/0160EC80, prev 0/0160DDB0, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 0
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/0160FB28, prev 0/0160EC80, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 1
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/016109E8, prev 0/0160FB28, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 2
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/01611890, prev 0/016109E8, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 3
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/01612750, prev 0/01611890, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 4
rmgr: Heap2 len (rec/tot): 1550/ 1550, tx: 510, lsn: 0/016135F8, prev 0/01612750, desc: MULTI_INSERT+INIT 75 tuples flags 0x02, blkref #0: rel 1663/13751/16384 blk 5

After the first 1000 tuples are inserted, when the worker tries to insert another 1000 tuples it will reuse the last page, which had free space where 110 more tuples fit:

rmgr: Heap2 len (rec/tot): 2470/ 2470, tx: 510, lsn: 0/01613C08, prev 0/016135F8, desc: MULTI_INSERT 110 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 5
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/016145C8, prev 0/01613C08, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 6
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/01615470, prev 0/016145C8, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 7
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/01616330, prev 0/01615470, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 8
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/016171D8, prev 0/01616330, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 9
rmgr: Heap2 len (rec/tot): 3050/ 3050, tx: 510, lsn: 0/01618098, prev 0/016171D8, desc: MULTI_INSERT+INIT 150 tuples flags 0x02, blkref #0: rel 1663/13751/16384 blk 10

This behavior is the same for sequential copy and copy with 1 worker, as the sequence of inserts and the pages used to insert are in the same order. Two reasons together result in the varying wal_bytes & wal_records with multiple workers:
1) When more than 1 worker is involved, the sequence in which the pages will be selected is not guaranteed; the MULTI_INSERT tuple counts vary, and the MULTI_INSERT/MULTI_INSERT+INIT descriptions vary.
2) wal_records will increase with more workers, because when the tuples are split across the workers, one of the workers will have a few more WAL records; the last heap_multi_insert gets split across the workers and generates new wal records like:

rmgr: Heap2 len (rec/tot): 600/ 600, tx: 510, lsn: 0/019F8B08, prev 0/019F7C48, desc: MULTI_INSERT 25 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 1065

Attached is the tar of the wal file dump which was used for this analysis.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 2. Do we have tests for toast tables? I think if you implement the
> > previous point some existing tests might cover it but I feel we should
> > have at least one or two tests for the same.
> >
> Toast table use case 1: 10000 tuples, 9.6GB data, 3 indexes 2 on integer columns, 1 on text column(not the toast column), csv file, each row is > 1320KB:
> (222.767, 0, 1X), (134.171, 1, 1.66X), (93.749, 2, 2.38X), (93.672, 4, 2.38X), (94.827, 8, 2.35X), (93.766, 16, 2.37X), (98.153, 20, 2.27X), (122.721, 30, 1.81X)
>
> Toast table use case 2: 100000 tuples, 96GB data, 3 indexes 2 on integer columns, 1 on text column(not the toast column), csv file, each row is > 1320KB:
> (2255.032, 0, 1X), (1358.628, 1, 1.66X), (901.170, 2, 2.5X), (912.743, 4, 2.47X), (988.718, 8, 2.28X), (938.000, 16, 2.4X), (997.556, 20, 2.26X), (1000.586, 30, 2.25X)
>
> Toast table use case3: 10000 tuples, 9.6GB, no indexes, binary file, each row is > 1320KB:
> (136.983, 0, 1X), (136.418, 1, 1X), (81.896, 2, 1.66X), (62.929, 4, 2.16X), (52.311, 8, 2.6X), (40.032, 16, 3.49X), (44.097, 20, 3.09X), (62.310, 30, 2.18X)
>
> In the case of a Toast table, we could achieve upto 2.5X for csv files, and 3.5X for binary files. We are analyzing this point and will post an update on our findings soon.
>
I analyzed the above point of getting only up to 2.5X performance improvement for csv files with a toast table having 3 indexes - 2 on integer columns and 1 on a text column (not the toast column). The reason is that the workers are fast enough to do the work and they are waiting for the leader to fill in the data blocks, and in this case the leader is able to serve the workers at its maximum possible speed. Hence, most of the time the workers are waiting, not doing any beneficial work.
Having observed the above point, I tried to make the workers perform more work to avoid the waiting time. For this, I added a gist index on the toasted text column. The use case and results are as follows.
Toast table use case4: 10000 tuples, 9.6GB, 4 indexes - 2 on integer columns, 1 on non-toasted text column and 1 gist index on toasted text column, csv file, each row is ~ 12.2KB:
(1322.839, 0, 1X), (1261.176, 1, 1.05X), (632.296, 2, 2.09X), (321.941, 4, 4.11X), (181.796, 8, 7.27X), (105.750, 16, 12.51X), (107.099, 20, 12.35X), (123.262, 30, 10.73X)
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Oct 19, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
> >
> > Hi Vignesh,
> >
> > After having a look over the patch, I have some suggestions for
> > 0003-Allow-copy-from-command-to-process-data-from-file.patch.
> >
> > 1.
> >
> > +static uint32
> > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
> > +                   char **whereClauseStr, char **rangeTableStr,
> > +                   char **attnameListStr, char **notnullListStr,
> > +                   char **nullListStr, char **convertListStr)
> > +{
> > +    uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
> > +
> > +    strsize += EstimateStringSize(cstate->null_print);
> > +    strsize += EstimateStringSize(cstate->delim);
> > +    strsize += EstimateStringSize(cstate->quote);
> > +    strsize += EstimateStringSize(cstate->escape);
> >
> > It uses the function EstimateStringSize to get the strlen of null_print,
> > delim, quote and escape. But the length of null_print seems to have been
> > stored in null_print_len already, and delim/quote/escape must be 1 byte,
> > so calling strlen again seems unnecessary.
> >
> > How about "strsize += sizeof(uint32) + cstate->null_print_len + 1"?
> >
>
> +1. This seems like a good suggestion, but add comments for
> delim/quote/escape to indicate that we are considering one byte for
> each. I think this will obviate the need for the function
> EstimateStringSize. Another thing in this regard is that we normally
> use the add_size function to compute the size, but I don't see that
> being used in this and the nearby computation. That helps us to detect
> overflow of addition, if any.
>
> EstimateCstateSize()
> {
> ..
> +
> + strsize++;
> ..
> }
>
> Why do we need this additional one-byte increment? Does it make sense
> to add a small comment for the same?
>

Changed it to handle null_print, delim, quote & escape accordingly in the attached patch. The one-byte increment is not required, so I have removed it.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v8-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v8-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v8-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v8-0004-Documentation-for-parallel-copy.patch
- v8-0005-Tests-for-parallel-copy.patch
- v8-0006-Parallel-Copy-For-Binary-Format-Files.patch
On Thu, Oct 8, 2020 at 11:15 AM vignesh C <vignesh21@gmail.com> wrote:
>
> I'm summarizing the pending open points so that I don't miss anything:
> 1) Performance test on latest patch set.
It is tested and the results are shared by Bharath at [1]
> 2) Testing points suggested.
Tests are added as suggested and the details are shared by Bharath at [1]
> 3) Support of parallel copy for COPY_OLD_FE.
It is handled as part of v8 patch shared at [2]
> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.
Open
> 5) Handling of Tomas's comments.
I have fixed and updated the fix details as part of [3]
> 6) Handling of Greg's comments.
I have fixed and updated the fix details as part of [4]
Except for "4) Worker has to hop through all the processed chunks before getting the chunk which it can process", all open tasks are handled. I will work on this and provide an update shortly.
[1] https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CALDaNm2UcmCMozcbKL8B7az9oYd9hZ+fNDcZHSSiiQJ4v-xN0Q@mail.gmail.com
[3] https://www.postgresql.org/message-id/CALDaNm0_zUa9%2BS%3DpwCz3Yp43SY3r9bnO4v-9ucXUujEE%3D0Sd7g%40mail.gmail.com
[4] https://www.postgresql.org/message-id/CALDaNm31pGG%2BL9N4HbM0mO4iuceih4mJ5s87jEwOPaFLpmDKyQ%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Hi Vignesh,

I took a look at the v8 patch set. Here are some comments:

1. PopulateCommonCstateInfo() -- can we use PopulateCommonCStateInfo() or PopulateCopyStateInfo()? And also EstimateCstateSize() -- EstimateCStateSize(), PopulateCstateCatalogInfo() -- PopulateCStateCatalogInfo()?

2. Instead of mentioning numbers like 1024, 64K, 10240 in the comments, can we represent them in terms of macros?

/* It can hold 1024 blocks of 64K data in DSM to be processed by the worker. */
#define MAX_BLOCKS_COUNT 1024

/* It can hold upto 10240 record information for worker to process. */ RINGSIZE

3. How about

" Each worker at once will pick the WORKER_CHUNK_COUNT records from the DSM data blocks and store them in it's local memory. This is to make workers not contend much while getting record information from the DSM. Read RINGSIZE comments before changing this value. "

instead of

/*
 * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
 * block to process to avoid lock contention. Read RINGSIZE comments before
 * changing this value.
 */

4. How about one line of gap before and after the comments "Leader should operate in the following order:" and "Worker should operate in the following order:"?

5. Can we move the RAW_BUF_BYTES macro definition to the beginning of copy.h, where all the macros are defined?

6. I don't think we need the change in toast_internals.c with the temporary hack Assert(!(IsParallelWorker() && !currentCommandIdUsed)); in GetCurrentCommandId().

7. I think

/* Can't perform copy in parallel */
if (parallel_workers <= 0)
    return NULL;

can be

/* Can't perform copy in parallel */
if (parallel_workers == 0)
    return NULL;

as parallel_workers can never be < 0, since we enter BeginParallelCopy only if cstate->nworkers > 0, and we are also not allowed to have negative values for max_worker_processes.

8. Do we want to pfree(cstate->pcdata) in case we failed to start any parallel workers? We would have allocated a good amount of memory for it.

else
{
    /*
     * Reset nworkers to -1 here. This is useful in cases where user
     * specifies parallel workers, but, no worker is picked up, so go
     * back to non parallel mode value of nworkers.
     */
    cstate->nworkers = -1;
    *processed = CopyFrom(cstate);  /* copy from file to database */
}

9. Instead of calling CopyStringToSharedMemory() for each string variable, can't we just create a linked list of all the strings that need to be copied into shm and call CopyStringToSharedMemory() only once? We could avoid 5 function calls.

10. Similar to the above comment: can we fill all the required cstate->variables inside the function CopyNodeFromSharedMemory() and call it only once? In each worker we could save the overhead of 5 function calls.

11. Looks like CopyStringFromSharedMemory() and CopyNodeFromSharedMemory() do almost the same things except stringToNode() and pfree(destptr). Can we have a generic function CopyFromSharedMemory() or something else and handle the two use cases with a flag "bool isnode"?

12. Can we move the below check to the end in IsParallelCopyAllowed()?

/* Check parallel safety of the trigger functions. */
if (cstate->rel->trigdesc != NULL &&
    !CheckRelTrigFunParallelSafety(cstate->rel->trigdesc))
    return false;

13. CacheLineInfo(): Instead of goto empty_data_line_update; how about having this directly inside the if block, as it's being used only once?

14. GetWorkerLine(): How about avoiding the goto statements and replacing the common code with an always static inline function or a macro?

15. UpdateSharedLineInfo(): The below line is misaligned.

lineInfo->first_block = blk_pos;
lineInfo->start_offset = offset;

16. ParallelCopyFrom(): Do we need CHECK_FOR_INTERRUPTS(); at the start of for (;;)?

17. Remove the extra lines after #define IsHeaderLine() (cstate->header_line && cstate->cur_lineno == 1) in copy.h.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
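For comment 11, a merged helper could look roughly like the sketch below. The name comes from the review itself; the length-prefixed unpacking interface is an assumption based on the snippets in this thread, not the patch:

    /* Read one length-prefixed item; optionally deserialize it as a node. */
    static void *
    CopyFromSharedMemory(char *srcPtr, uint32 *copiedsize, bool isnode)
    {
        uint32      len;
        char       *ptr = NULL;

        memcpy(&len, srcPtr + *copiedsize, sizeof(uint32));
        *copiedsize += sizeof(uint32);

        if (len)
        {
            ptr = (char *) palloc(len);
            memcpy(ptr, srcPtr + *copiedsize, len);
            *copiedsize += len;

            if (isnode)
            {
                void       *node = stringToNode(ptr);

                pfree(ptr);     /* the string form is no longer needed */
                return node;
            }
        }
        return ptr;
    }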
On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> 9. Instead of calling CopyStringToSharedMemory() for each string
> variable, can't we just create a linked list of all the strings that
> need to be copied into shm and call CopyStringToSharedMemory() only
> once? We could avoid 5 function calls.
>

If we want to avoid the different function calls then can't we just store all these strings in a local structure and use it? That might improve the other parts of the code as well where we are using these as individual parameters.

> 10. Similar to the above comment: can we fill all the required
> cstate->variables inside the function CopyNodeFromSharedMemory() and
> call it only once? In each worker we could save the overhead of 5
> function calls.
>

Yeah, that makes sense.

--
With Regards,
Amit Kapila.
On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> 17. Remove the extra lines after #define IsHeaderLine()
> (cstate->header_line && cstate->cur_lineno == 1) in copy.h.
>

I missed one comment:

18. I think we need to treat the number of parallel workers as an integer, similar to the parallel option in vacuum.

postgres=# copy t1 from stdin with(parallel '1');    <<<<< we should not allow this
Enter data to be copied followed by a newline.

postgres=# vacuum (parallel '1') t1;
ERROR: parallel requires an integer value

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
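For reference, VACUUM gets that behaviour from defGetInt32(), which rejects non-integer option values with exactly the error quoted above. A sketch of the same approach in COPY's option processing (the placement in the DefElem loop and the parallel_given duplicate-check flag are assumptions):

    else if (strcmp(defel->defname, "parallel") == 0)
    {
        if (parallel_given)
            ereport(ERROR,
                    (errcode(ERRCODE_SYNTAX_ERROR),
                     errmsg("conflicting or redundant options")));
        parallel_given = true;
        /* errors with "parallel requires an integer value" for parallel '1' */
        cstate->nworkers = defGetInt32(defel);
    }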
I had a brief look at this patch. Important work! A couple of first impressions:

1. The split between patches 0002-Framework-for-leader-worker-in-parallel-copy.patch and 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite artificial. All the stuff introduced in the first is unused until the second patch is applied. The first patch introduces a forward declaration for ParallelCopyData(), but the function only comes in the second patch. The comments in the first patch talk about LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only comes in the second patch. I think these have to be merged into one. If you want to split it somehow, I'd suggest having a separate patch just to move CopyStateData from copy.c to copy.h. The subsequent patch would then be easier to read, as you could see more easily what's being added to CopyStateData. Actually, I think it would be better to have a new header file, copy_internal.h, to hold CopyStateData and the other structs, and keep copy.h as it is.

2. This desperately needs some kind of a high-level overview of how it works. What is a leader, what is a worker? Which process does each step of COPY processing, like reading from the file/socket, splitting the input into lines, handling escapes, calling input functions, and updating the heap and indexes? What data structures are used for the communication? How is the work synchronized between the processes? There are comments on those individual aspects scattered in the patch, but if you're not already familiar with it, you don't know where to start. There's some of that in the commit message, but it needs to be somewhere in the source code, maybe in a long comment at the top of copyparallel.c.

3. I'm surprised there's a separate ParallelCopyLineBoundary struct for every input line. Doesn't that incur a lot of synchronization overhead? I haven't done any testing, this is just my gut feeling, but I assumed you'd work in batches of, say, 100 or 1000 lines each.

- Heikki
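On point 3, a batched variant of the shared line information might look something like the sketch below; all names and sizes here are hypothetical, meant only to show that one state word could cover many lines instead of one:

    #define LINES_PER_BATCH 1000    /* hypothetical batch size */

    typedef struct ParallelCopyLineBatch
    {
        pg_atomic_uint32 batch_state;   /* populating / populated / processed */
        uint32      nlines;             /* valid entries in the arrays below */
        uint32      first_block;        /* DSM data block of the first line */
        /* per-line positions; synchronized once per batch, not per line */
        uint32      start_offset[LINES_PER_BATCH];
        uint32      line_size[LINES_PER_BATCH];
    } ParallelCopyLineBatch;

With this layout, the leader publishes a whole batch with one atomic state change and a worker claims a whole batch the same way, so the per-line handshakes Heikki is worried about collapse to one per LINES_PER_BATCH lines.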
Hi Vignesh, Thanks for the updated patches. Here are some more comments that I can find after reviewing your latest patches: +/* + * This structure helps in storing the common data from CopyStateData that are + * required by the workers. This information will then be allocated and stored + * into the DSM for the worker to retrieve and copy it to CopyStateData. + */ +typedef struct SerializedParallelCopyState +{ + /* low-level state data */ + CopyDest copy_dest; /* type of copy source/destination */ + int file_encoding; /* file or remote side's character encoding */ + bool need_transcoding; /* file encoding diff from server? */ + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */ + ... ... + + /* Working state for COPY FROM */ + AttrNumber num_defaults; + Oid relid; +} SerializedParallelCopyState; Can the above structure not be part of the CopyStateData structure? I am just asking this question because all the fields present in the above structure are also present in the CopyStateData structure. So, including it in the CopyStateData structure will reduce the code duplication and will also make CopyStateData a bit shorter. -- + pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist, + relid); Do we need to pass cstate->nworkers and relid to BeginParallelCopy() function when we are already passing cstate structure, using which both of these information can be retrieved ? -- +/* DSM keys for parallel copy. */ +#define PARALLEL_COPY_KEY_SHARED_INFO 1 +#define PARALLEL_COPY_KEY_CSTATE 2 +#define PARALLEL_COPY_WAL_USAGE 3 +#define PARALLEL_COPY_BUFFER_USAGE 4 DSM key names do not appear to be consistent. For shared info and cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY", but for WalUsage and BufferUsage structures, it is prefixed with "PARALLEL_COPY". I think it would be better to make them consistent. -- if (resultRelInfo->ri_TrigDesc != NULL && (resultRelInfo->ri_TrigDesc->trig_insert_before_row || resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) { /* * Can't support multi-inserts when there are any BEFORE/INSTEAD OF * triggers on the table. Such triggers might query the table we're * inserting into and act differently if the tuples that have already * been processed and prepared for insertion are not there. */ insertMethod = CIM_SINGLE; } else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL && resultRelInfo->ri_TrigDesc->trig_insert_new_table) { /* * For partitioned tables we can't support multi-inserts when there * are any statement level insert triggers. It might be possible to * allow partitioned tables with such triggers in the future, but for * now, CopyMultiInsertInfoFlush expects that any before row insert * and statement level insert triggers are on the same relation. */ insertMethod = CIM_SINGLE; } else if (resultRelInfo->ri_FdwRoutine != NULL || cstate->volatile_defexprs) { ... ... I think, if possible, all these if-else checks in CopyFrom() can be moved to a single function which can probably be named as IdentifyCopyInsertMethod() and this function can be called in IsParallelCopyAllowed(). This will ensure that in case of Parallel Copy when the leader has performed all these checks, the worker won't do it again. I also feel that it will make the code look a bit cleaner. -- +void +ParallelCopyMain(dsm_segment *seg, shm_toc *toc) +{ ... ... 
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], + &walusage[ParallelWorkerNumber]); + + MemoryContextSwitchTo(oldcontext); + pfree(cstate); + return; +} It seems like you also need to delete the memory context (cstate->copycontext) here. -- +void +ExecBeforeStmtTrigger(CopyState cstate) +{ + EState *estate = CreateExecutorState(); + ResultRelInfo *resultRelInfo; This function has a lot of comments which have been copied as it is from the CopyFrom function, I think it would be good to remove those comments from here and mention that this code changes done in this function has been taken from the CopyFrom function. If any queries people may refer to the CopyFrom function. This will again avoid the unnecessary code in the patch. -- As Heikki rightly pointed out in his previous email, we need some high level description of how Parallel Copy works somewhere in copyparallel.c file. For reference, please see how a brief description about parallel vacuum has been added in the vacuumlazy.c file. * Lazy vacuum supports parallel execution with parallel worker processes. In * a parallel vacuum, we perform both index vacuum and index cleanup with * parallel worker processes. Individual indexes are processed by one vacuum ... ... -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Wed, Oct 21, 2020 at 12:08 PM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, Oct 19, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > > > Hi Vignesh, > > > > > > After having a look over the patch, > > > I have some suggestions for > > > 0003-Allow-copy-from-command-to-process-data-from-file.patch. > > > > > > 1. > > > > > > +static uint32 > > > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist, > > > + char **whereClauseStr, char **rangeTableStr, > > > + char **attnameListStr, char **notnullListStr, > > > + char **nullListStr, char **convertListStr) > > > +{ > > > + uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState)); > > > + > > > + strsize += EstimateStringSize(cstate->null_print); > > > + strsize += EstimateStringSize(cstate->delim); > > > + strsize += EstimateStringSize(cstate->quote); > > > + strsize += EstimateStringSize(cstate->escape); > > > > > > > > > It use function EstimateStringSize to get the strlen of null_print, delim, quote and escape. > > > But the length of null_print seems has been stored in null_print_len. > > > And delim/quote/escape must be 1 byte, so I think call strlen again seems unnecessary. > > > > > > How about " strsize += sizeof(uint32) + cstate->null_print_len + 1" > > > > > > > +1. This seems like a good suggestion but add comments for > > delim/quote/escape to indicate that we are considering one-byte for > > each. I think this will obviate the need of function > > EstimateStringSize. Another thing in this regard is that we normally > > use add_size function to compute the size but I don't see that being > > used in this and nearby computation. That helps us to detect overflow > > of addition if any. > > > > EstimateCstateSize() > > { > > .. > > + > > + strsize++; > > .. > > } > > > > Why do we need this additional one-byte increment? Does it make sense > > to add a small comment for the same? > > > > Changed it to handle null_print, delim, quote & escape accordingly in > the attached patch, the one byte increment is not required, I have > removed it. 
> > Regards, > Vignesh > EnterpriseDB: http://www.enterprisedb.com
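To illustrate the IdentifyCopyInsertMethod() idea from the review above, here is a hedged sketch of how the CopyFrom() checks could be factored out. It is deliberately simplified: the partitioned-table branch also needs the 'proute' routing state, and CIM_MULTI_CONDITIONAL is left out; CopyInsertMethod and the CIM_* values are the existing enum in copy.c.

/*
 * Sketch only: collect CopyFrom()'s insert-method checks in one place so
 * that the leader can run them once and workers can skip them.
 */
static CopyInsertMethod
IdentifyCopyInsertMethod(CopyState cstate, ResultRelInfo *resultRelInfo)
{
    /* BEFORE/INSTEAD OF row triggers may query the target table. */
    if (resultRelInfo->ri_TrigDesc != NULL &&
        (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
         resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
        return CIM_SINGLE;

    /* Foreign tables and volatile default expressions rule out batching. */
    if (resultRelInfo->ri_FdwRoutine != NULL ||
        cstate->volatile_defexprs)
        return CIM_SINGLE;

    return CIM_MULTI;
}

IsParallelCopyAllowed() could then refuse parallelism whenever this returns CIM_SINGLE, rather than repeating the individual checks in every worker.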
On Fri, Oct 23, 2020 at 5:42 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Hi Vignesh, > > Thanks for the updated patches. Here are some more comments that I can > find after reviewing your latest patches: > > +/* > + * This structure helps in storing the common data from CopyStateData that are > + * required by the workers. This information will then be allocated and stored > + * into the DSM for the worker to retrieve and copy it to CopyStateData. > + */ > +typedef struct SerializedParallelCopyState > +{ > + /* low-level state data */ > + CopyDest copy_dest; /* type of copy source/destination */ > + int file_encoding; /* file or remote side's character encoding */ > + bool need_transcoding; /* file encoding diff from server? */ > + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */ > + > ... > ... > + > + /* Working state for COPY FROM */ > + AttrNumber num_defaults; > + Oid relid; > +} SerializedParallelCopyState; > > Can the above structure not be part of the CopyStateData structure? I > am just asking this question because all the fields present in the > above structure are also present in the CopyStateData structure. So, > including it in the CopyStateData structure will reduce the code > duplication and will also make CopyStateData a bit shorter. > > -- > > + pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist, > + relid); > > Do we need to pass cstate->nworkers and relid to BeginParallelCopy() > function when we are already passing cstate structure, using which > both of these information can be retrieved ? > > -- > > +/* DSM keys for parallel copy. */ > +#define PARALLEL_COPY_KEY_SHARED_INFO 1 > +#define PARALLEL_COPY_KEY_CSTATE 2 > +#define PARALLEL_COPY_WAL_USAGE 3 > +#define PARALLEL_COPY_BUFFER_USAGE 4 > > DSM key names do not appear to be consistent. For shared info and > cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY", > but for WalUsage and BufferUsage structures, it is prefixed with > "PARALLEL_COPY". I think it would be better to make them consistent. > > -- > > if (resultRelInfo->ri_TrigDesc != NULL && > (resultRelInfo->ri_TrigDesc->trig_insert_before_row || > resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) > { > /* > * Can't support multi-inserts when there are any BEFORE/INSTEAD OF > * triggers on the table. Such triggers might query the table we're > * inserting into and act differently if the tuples that have already > * been processed and prepared for insertion are not there. > */ > insertMethod = CIM_SINGLE; > } > else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL && > resultRelInfo->ri_TrigDesc->trig_insert_new_table) > { > /* > * For partitioned tables we can't support multi-inserts when there > * are any statement level insert triggers. It might be possible to > * allow partitioned tables with such triggers in the future, but for > * now, CopyMultiInsertInfoFlush expects that any before row insert > * and statement level insert triggers are on the same relation. > */ > insertMethod = CIM_SINGLE; > } > else if (resultRelInfo->ri_FdwRoutine != NULL || > cstate->volatile_defexprs) > { > ... > ... > > I think, if possible, all these if-else checks in CopyFrom() can be > moved to a single function which can probably be named as > IdentifyCopyInsertMethod() and this function can be called in > IsParallelCopyAllowed(). This will ensure that in case of Parallel > Copy when the leader has performed all these checks, the worker won't > do it again. 
I also feel that it will make the code look a bit > cleaner. > Just rewriting above comment to make it a bit more clear: I think, if possible, all these if-else checks in CopyFrom() should be moved to a separate function which can probably be named as IdentifyCopyInsertMethod() and this function called from IsParallelCopyAllowed() and CopyFrom() functions. It will only be called from CopyFrom() when IsParallelCopy() returns false. This will ensure that in case of Parallel Copy if the leader has performed all these checks, the worker won't do it again. I also feel that having a separate function containing all these checks will make the code look a bit cleaner. > -- > > +void > +ParallelCopyMain(dsm_segment *seg, shm_toc *toc) > +{ > ... > ... > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], > + &walusage[ParallelWorkerNumber]); > + > + MemoryContextSwitchTo(oldcontext); > + pfree(cstate); > + return; > +} > > It seems like you also need to delete the memory context > (cstate->copycontext) here. > > -- > > +void > +ExecBeforeStmtTrigger(CopyState cstate) > +{ > + EState *estate = CreateExecutorState(); > + ResultRelInfo *resultRelInfo; > > This function has a lot of comments which have been copied as it is > from the CopyFrom function, I think it would be good to remove those > comments from here and mention that this code changes done in this > function has been taken from the CopyFrom function. If any queries > people may refer to the CopyFrom function. This will again avoid the > unnecessary code in the patch. > > -- > > As Heikki rightly pointed out in his previous email, we need some high > level description of how Parallel Copy works somewhere in > copyparallel.c file. For reference, please see how a brief description > about parallel vacuum has been added in the vacuumlazy.c file. > > * Lazy vacuum supports parallel execution with parallel worker processes. In > * a parallel vacuum, we perform both index vacuum and index cleanup with > * parallel worker processes. Individual indexes are processed by one vacuum > ... > ... > > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com > > > On Wed, Oct 21, 2020 at 12:08 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Mon, Oct 19, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > > > > > Hi Vignesh, > > > > > > > > After having a look over the patch, > > > > I have some suggestions for > > > > 0003-Allow-copy-from-command-to-process-data-from-file.patch. > > > > > > > > 1. > > > > > > > > +static uint32 > > > > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist, > > > > + char **whereClauseStr, char **rangeTableStr, > > > > + char **attnameListStr, char **notnullListStr, > > > > + char **nullListStr, char **convertListStr) > > > > +{ > > > > + uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState)); > > > > + > > > > + strsize += EstimateStringSize(cstate->null_print); > > > > + strsize += EstimateStringSize(cstate->delim); > > > > + strsize += EstimateStringSize(cstate->quote); > > > > + strsize += EstimateStringSize(cstate->escape); > > > > > > > > > > > > It use function EstimateStringSize to get the strlen of null_print, delim, quote and escape. > > > > But the length of null_print seems has been stored in null_print_len. > > > > And delim/quote/escape must be 1 byte, so I think call strlen again seems unnecessary. 
> > > > > > > > How about " strsize += sizeof(uint32) + cstate->null_print_len + 1" > > > > > > > > > > +1. This seems like a good suggestion but add comments for > > > delim/quote/escape to indicate that we are considering one-byte for > > > each. I think this will obviate the need of function > > > EstimateStringSize. Another thing in this regard is that we normally > > > use add_size function to compute the size but I don't see that being > > > used in this and nearby computation. That helps us to detect overflow > > > of addition if any. > > > > > > EstimateCstateSize() > > > { > > > .. > > > + > > > + strsize++; > > > .. > > > } > > > > > > Why do we need this additional one-byte increment? Does it make sense > > > to add a small comment for the same? > > > > > > > Changed it to handle null_print, delim, quote & escape accordingly in > > the attached patch, the one byte increment is not required, I have > > removed it. > > > > Regards, > > Vignesh > > EnterpriseDB: http://www.enterprisedb.com
Thanks for the comments, please find my thoughts below. On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Hi Vignesh, > > I took a look at the v8 patch set. Here are some comments: > > 1. PopulateCommonCstateInfo() -- can we use PopulateCommonCStateInfo() > or PopulateCopyStateInfo()? And also EstimateCstateSize() -- > EstimateCStateSize(), PopulateCstateCatalogInfo() -- > PopulateCStateCatalogInfo()? > Changed as suggested. > 2. Instead of mentioning numbers like 1024, 64K, 10240 in the > comments, can we represent them in terms of macros? > /* It can hold 1024 blocks of 64K data in DSM to be processed by the worker. */ > #define MAX_BLOCKS_COUNT 1024 > /* > * It can hold upto 10240 record information for worker to process. RINGSIZE > Changed as suggested. > 3. How about > " > Each worker at once will pick the WORKER_CHUNK_COUNT records from the > DSM data blocks and store them in it's local memory. > This is to make workers not contend much while getting record > information from the DSM. Read RINGSIZE comments before > changing this value. > " > instead of > /* > * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data > * block to process to avoid lock contention. Read RINGSIZE comments before > * changing this value. > */ > Rephrased it. > 4. How about one line gap before and after for comments: "Leader > should operate in the following order:" and "Worker should operate in > the following order:" > Changed it. > 5. Can we move RAW_BUF_BYTES macro definition to the beginning of the > copy.h where all the macro are defined? > Change was done as part of another commit & we are using as it is. I preferred it to be as it is. > 6. I don't think we need the change in toast_internals.c with the > temporary hack Assert(!(IsParallelWorker() && !currentCommandIdUsed)); > in GetCurrentCommandId() > Modified it. > 7. I think > /* Can't perform copy in parallel */ > if (parallel_workers <= 0) > return NULL; > can be > /* Can't perform copy in parallel */ > if (parallel_workers == 0) > return NULL; > as parallel_workers can never be < 0 since we enter BeginParallelCopy > only if cstate->nworkers > 0 and also we are not allowed to have > negative values for max_worker_processes. > Modified it. > 8. Do we want to pfree(cstate->pcdata) in case we failed to start any > parallel workers, we would have allocated a good > else > { > /* > * Reset nworkers to -1 here. This is useful in cases where user > * specifies parallel workers, but, no worker is picked up, so go > * back to non parallel mode value of nworkers. > */ > cstate->nworkers = -1; > *processed = CopyFrom(cstate); /* copy from file to database */ > } > Added pfree. > 9. Instead of calling CopyStringToSharedMemory() for each string > variable, can't we just create a linked list of all the strings that > need to be copied into shm and call CopyStringToSharedMemory() only > once? We could avoid 5 function calls? > I feel keeping it this way makes the code more readable, and also this is not in a performance intensive tight loop. I'm retaining the change as is unless we feel this will make an impact. > 10. Similar to above comment: can we fill all the required > cstate->variables inside the function CopyNodeFromSharedMemory() and > call it only once? In each worker we could save overhead of 5 function > calls. > same as above. > 11. Looks like CopyStringFromSharedMemory() and > CopyNodeFromSharedMemory() do almost the same things except > stringToNode() and pfree(destptr);. 
Can we have a generic function > CopyFromSharedMemory() or something else and handle with flag "bool > isnode" to differentiate the two use cases? > Removed CopyStringFromSharedMemory & used CopyNodeFromSharedMemory appropriately. CopyNodeFromSharedMemory is renamed to RestoreNodeFromSharedMemory to keep the name consistent. > 12. Can we move below check to the end in IsParallelCopyAllowed()? > /* Check parallel safety of the trigger functions. */ > if (cstate->rel->trigdesc != NULL && > !CheckRelTrigFunParallelSafety(cstate->rel->trigdesc)) > return false; > Modified. > 13. CacheLineInfo(): Instead of goto empty_data_line_update; how about > having this directly inside the if block as it's being used only once? > Have removed the goto by using a macro. > 14. GetWorkerLine(): How about avoiding goto statements and replacing > the common code with a always static inline function or a macro? > Have removed the goto by using a macro. > 15. UpdateSharedLineInfo(): Below line is misaligned. > lineInfo->first_block = blk_pos; > lineInfo->start_offset = offset; > Changed it. > 16. ParallelCopyFrom(): Do we need CHECK_FOR_INTERRUPTS(); at the > start of for (;;)? > Added it. > 17. Remove extra lines after #define IsHeaderLine() > (cstate->header_line && cstate->cur_lineno == 1) in copy.h > Modified it. Attached v9 patches have the fixes for the above comments. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 21, 2020 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > 9. Instead of calling CopyStringToSharedMemory() for each string > > variable, can't we just create a linked list of all the strings that > > need to be copied into shm and call CopyStringToSharedMemory() only > > once? We could avoid 5 function calls? > > > > If we want to avoid different function calls then can't we just store > all these strings in a local structure and use it? That might improve > the other parts of code as well where we are using these as individual > parameters. > I have made one structure SerializedListToStrCState to store all the variables. The rest of the common variables are directly copied from & into cstate. > > 10. Similar to above comment: can we fill all the required > > cstate->variables inside the function CopyNodeFromSharedMemory() and > > call it only once? In each worker we could save overhead of 5 function > > calls. > > > > Yeah, that makes sense. > I feel keeping it this way makes the code more readable, and also this is not in a performance intensive tight loop. I'm retaining the change as is unless we feel this will make an impact. This is addressed in v9 patch shared at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
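For reference, a hedged guess at what that grouping structure might look like, based on the string parameters that EstimateCstateSize() takes earlier in this thread; the actual fields in the patch may differ:

/*
 * Hypothetical layout: one struct holding all the serialized node/list
 * strings, passed around as a unit instead of as five or six parameters.
 */
typedef struct SerializedListToStrCState
{
    char       *whereClauseStr;     /* serialized WHERE clause */
    char       *rangeTableStr;      /* serialized range table */
    char       *attnameListStr;     /* serialized attribute name list */
    char       *notnullListStr;     /* serialized FORCE_NOT_NULL list */
    char       *nullListStr;        /* serialized FORCE_NULL list */
    char       *convertListStr;     /* serialized convert-select list */
} SerializedListToStrCState;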
On Wed, Oct 21, 2020 at 4:20 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > 17. Remove extra lines after #define IsHeaderLine() > > (cstate->header_line && cstate->cur_lineno == 1) in copy.h > > > > I missed one comment: > > 18. I think we need to treat the number of parallel workers as an > integer similar to the parallel option in vacuum. > > postgres=# copy t1 from stdin with(parallel '1'); <<<<< - we > should not allow this. > Enter data to be copied followed by a newline. > > postgres=# vacuum (parallel '1') t1; > ERROR: parallel requires an integer value > I have made the behavior the same as vacuum. This is addressed in v9 patch shared at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Thanks Heikki for reviewing and providing your comments. Please find my thoughts below. On Fri, Oct 23, 2020 at 2:01 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I had a brief look at at this patch. Important work! A couple of first > impressions: > > 1. The split between patches > 0002-Framework-for-leader-worker-in-parallel-copy.patch and > 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite > artificial. All the stuff introduced in the first is unused until the > second patch is applied. The first patch introduces a forward > declaration for ParallelCopyData(), but the function only comes in the > second patch. The comments in the first patch talk about > LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only > comes in the second patch. I think these have to merged into one. If you > want to split it somehow, I'd suggest having a separate patch just to > move CopyStateData from copy.c to copy.h. The subsequent patch would > then be easier to read as you could see more easily what's being added > to CopyStateData. Actually I think it would be better to have a new > header file, copy_internal.h, to hold CopyStateData and the other > structs, and keep copy.h as it is. > I have merged 0002 & 0003 patch, I have moved few things like creation of copy_internal.h, moving of CopyStateData from copy.c into copy_internal.h into 0001 patch. > 2. This desperately needs some kind of a high-level overview of how it > works. What is a leader, what is a worker? Which process does each step > of COPY processing, like reading from the file/socket, splitting the > input into lines, handling escapes, calling input functions, and > updating the heap and indexes? What data structures are used for the > communication? How does is the work synchronized between the processes? > There are comments on those individual aspects scattered in the patch, > but if you're not already familiar with it, you don't know where to > start. There's some of that in the commit message, but it needs to be > somewhere in the source code, maybe in a long comment at the top of > copyparallel.c. > Added it in copyparallel.c > 3. I'm surprised there's a separate ParallelCopyLineBoundary struct for > every input line. Doesn't that incur a lot of synchronization overhead? > I haven't done any testing, this is just my gut feeling, but I assumed > you'd work in batches of, say, 100 or 1000 lines each. > Data read from the file will be stored in DSM which is of size 64k * 1024. Leader will parse and identify the line boundary like which line starts from which data block, what is the starting offset in the data block, what is the line size, this information will be present in ParallelCopyLineBoundary. Like you said, each worker processes WORKER_CHUNK_COUNT 64 lines at a time. Performance test results run for parallel copy are available at [1]. This is addressed in v9 patch shared at [2]. [1] https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Thanks Ashutosh for reviewing and providing your comments. On Fri, Oct 23, 2020 at 5:43 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Hi Vignesh, > > Thanks for the updated patches. Here are some more comments that I can > find after reviewing your latest patches: > > +/* > + * This structure helps in storing the common data from CopyStateData that are > + * required by the workers. This information will then be allocated and stored > + * into the DSM for the worker to retrieve and copy it to CopyStateData. > + */ > +typedef struct SerializedParallelCopyState > +{ > + /* low-level state data */ > + CopyDest copy_dest; /* type of copy source/destination */ > + int file_encoding; /* file or remote side's character encoding */ > + bool need_transcoding; /* file encoding diff from server? */ > + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */ > + > ... > ... > + > + /* Working state for COPY FROM */ > + AttrNumber num_defaults; > + Oid relid; > +} SerializedParallelCopyState; > > Can the above structure not be part of the CopyStateData structure? I > am just asking this question because all the fields present in the > above structure are also present in the CopyStateData structure. So, > including it in the CopyStateData structure will reduce the code > duplication and will also make CopyStateData a bit shorter. > I have removed the common members from the structure, now there are no common members between CopyStateData & the new structure. I'm using CopyStateData to copy to/from directly in the new patch. > -- > > + pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist, > + relid); > > Do we need to pass cstate->nworkers and relid to BeginParallelCopy() > function when we are already passing cstate structure, using which > both of these information can be retrieved ? > nworkers need not be passed as you have suggested but relid need to be passed as we will be setting it to pcdata, modified nworkers as suggested. > -- > > +/* DSM keys for parallel copy. */ > +#define PARALLEL_COPY_KEY_SHARED_INFO 1 > +#define PARALLEL_COPY_KEY_CSTATE 2 > +#define PARALLEL_COPY_WAL_USAGE 3 > +#define PARALLEL_COPY_BUFFER_USAGE 4 > > DSM key names do not appear to be consistent. For shared info and > cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY", > but for WalUsage and BufferUsage structures, it is prefixed with > "PARALLEL_COPY". I think it would be better to make them consistent. > Modified as suggested > -- > > if (resultRelInfo->ri_TrigDesc != NULL && > (resultRelInfo->ri_TrigDesc->trig_insert_before_row || > resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) > { > /* > * Can't support multi-inserts when there are any BEFORE/INSTEAD OF > * triggers on the table. Such triggers might query the table we're > * inserting into and act differently if the tuples that have already > * been processed and prepared for insertion are not there. > */ > insertMethod = CIM_SINGLE; > } > else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL && > resultRelInfo->ri_TrigDesc->trig_insert_new_table) > { > /* > * For partitioned tables we can't support multi-inserts when there > * are any statement level insert triggers. It might be possible to > * allow partitioned tables with such triggers in the future, but for > * now, CopyMultiInsertInfoFlush expects that any before row insert > * and statement level insert triggers are on the same relation. 
> */ > insertMethod = CIM_SINGLE; > } > else if (resultRelInfo->ri_FdwRoutine != NULL || > cstate->volatile_defexprs) > { > ... > ... > > I think, if possible, all these if-else checks in CopyFrom() can be > moved to a single function which can probably be named as > IdentifyCopyInsertMethod() and this function can be called in > IsParallelCopyAllowed(). This will ensure that in case of Parallel > Copy when the leader has performed all these checks, the worker won't > do it again. I also feel that it will make the code look a bit > cleaner. > In the recent patch posted we have changed it to simplify the check for parallel copy, it is not an exact match. I feel this comment is not applicable on the latest patch > -- > > +void > +ParallelCopyMain(dsm_segment *seg, shm_toc *toc) > +{ > ... > ... > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], > + &walusage[ParallelWorkerNumber]); > + > + MemoryContextSwitchTo(oldcontext); > + pfree(cstate); > + return; > +} > > It seems like you also need to delete the memory context > (cstate->copycontext) here. > Added it. > -- > > +void > +ExecBeforeStmtTrigger(CopyState cstate) > +{ > + EState *estate = CreateExecutorState(); > + ResultRelInfo *resultRelInfo; > > This function has a lot of comments which have been copied as it is > from the CopyFrom function, I think it would be good to remove those > comments from here and mention that this code changes done in this > function has been taken from the CopyFrom function. If any queries > people may refer to the CopyFrom function. This will again avoid the > unnecessary code in the patch. > Changed as suggested. > -- > > As Heikki rightly pointed out in his previous email, we need some high > level description of how Parallel Copy works somewhere in > copyparallel.c file. For reference, please see how a brief description > about parallel vacuum has been added in the vacuumlazy.c file. > > * Lazy vacuum supports parallel execution with parallel worker processes. In > * a parallel vacuum, we perform both index vacuum and index cleanup with > * parallel worker processes. Individual indexes are processed by one vacuum > ... Added it in copyparallel.c This is addressed in v9 patch shared at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 23, 2020 at 6:58 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > I think, if possible, all these if-else checks in CopyFrom() can be > > moved to a single function which can probably be named as > > IdentifyCopyInsertMethod() and this function can be called in > > IsParallelCopyAllowed(). This will ensure that in case of Parallel > > Copy when the leader has performed all these checks, the worker won't > > do it again. I also feel that it will make the code look a bit > > cleaner. > > > > Just rewriting above comment to make it a bit more clear: > > I think, if possible, all these if-else checks in CopyFrom() should be > moved to a separate function which can probably be named as > IdentifyCopyInsertMethod() and this function called from > IsParallelCopyAllowed() and CopyFrom() functions. It will only be > called from CopyFrom() when IsParallelCopy() returns false. This will > ensure that in case of Parallel Copy if the leader has performed all > these checks, the worker won't do it again. I also feel that having a > separate function containing all these checks will make the code look > a bit cleaner. > In the recent patch posted we have changed it to simplify the check for parallel copy, it is not an exact match. I feel this comment is not applicable on the latest patch Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Hi,

I found some issues in v9-0002.

1.
+ elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
+      write_pos, lineInfo->first_block,
+      pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
+      offset, pg_atomic_read_u32(&lineInfo->line_size));

write_pos and the other variables printed here are of type uint32, so I think it's better to use '%u' in the elog message.

2.
+ * line_size will be set. Read the line_size again to be sure if it is
+ * completed or partial block.
+ */
+ dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+ if (dataSize)

This uses dataSize (type int) to hold a uint32, which seems a little dangerous. Would it be better to declare dataSize as uint32 here?

3. Since functions with 'Cstate' in the name have been changed to 'CState', I think we can change the function PopulateCommonCstateInfo as well.

4.
+ if (pcdata->worker_line_buf_count)

I think checks like the above can be written as 'if (xxx > 0)', which seems easier to understand.

Best regards,
houzj
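For points 1 and 2, the corrected code might look roughly like this (a sketch reusing the variables from the quoted patch, so treat it as illustrative only):

/* Point 2: dataSize declared as uint32 to match pg_atomic_read_u32(). */
uint32      dataSize = pg_atomic_read_u32(&lineInfo->line_size);

/* Point 1: %u, not %d, for uint32 values. */
elog(DEBUG1, "[Worker] Processing - line position:%u, block:%u, unprocessed lines:%u, offset:%u, line size:%u",
     write_pos, lineInfo->first_block,
     pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
     offset, dataSize);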
On Tue, Oct 27, 2020 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote: > [latest version] I think the parallel-safety checks in this patch (v9-0002-Allow-copy-from-command-to-process-data-from-file) are incomplete and wrong. See below comments. 1. +static pg_attribute_always_inline bool +CheckExprParallelSafety(CopyState cstate) +{ + if (contain_volatile_functions(cstate->whereClause)) + { + if (max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE) + return false; + } I don't understand the above check. Why do we only need to check where clause for parallel-safety when it contains volatile functions? It should be checked otherwise as well, no? The similar comment applies to other checks in this function. Also, I don't think there is a need to make this function inline. 2. +/* + * IsParallelCopyAllowed + * + * Check if parallel copy can be allowed. + */ +bool +IsParallelCopyAllowed(CopyState cstate) { .. + * When there are BEFORE/AFTER/INSTEAD OF row triggers on the table. We do + * not allow parallelism in such cases because such triggers might query + * the table we are inserting into and act differently if the tuples that + * have already been processed and prepared for insertion are not there. + * Now, if we allow parallelism with such triggers the behaviour would + * depend on if the parallel worker has already inserted or not that + * particular tuples. + */ + if (cstate->rel->trigdesc != NULL && + (cstate->rel->trigdesc->trig_insert_after_statement || + cstate->rel->trigdesc->trig_insert_new_table || + cstate->rel->trigdesc->trig_insert_before_row || + cstate->rel->trigdesc->trig_insert_after_row || + cstate->rel->trigdesc->trig_insert_instead_row)) + return false; .. Why do we need to disable parallelism for before/after row triggers unless they have parallel-unsafe functions? I see a few lines down in this function you are checking parallel-safety of trigger functions, what is the use of the same if you are already disabling parallelism with the above check. 3. What about if the index on table has expressions that are parallel-unsafe? What is your strategy to check parallel-safety for partitioned tables? I suggest checking Greg's patch for parallel-safety of Inserts [1]. I think you will find that most of those checks are required here as well and see how we can use that patch (at least what is common). I feel the first patch should be just to have parallel-safety checks and we can test that by trying to enable Copy with force_parallel_mode. We can build the rest of the patch atop of it or in other words, let's move all parallel-safety work into a separate patch. Few assorted comments: ======================== 1. +/* + * ESTIMATE_NODE_SIZE - Estimate the size required for node type in shared + * memory. + */ +#define ESTIMATE_NODE_SIZE(list, listStr, strsize) \ +{ \ + uint32 estsize = sizeof(uint32); \ + if ((List *)list != NIL) \ + { \ + listStr = nodeToString(list); \ + estsize += strlen(listStr) + 1; \ + } \ + \ + strsize = add_size(strsize, estsize); \ +} This can be probably a function instead of a macro. 2. +/* + * ESTIMATE_1BYTE_STR_SIZE - Estimate the size required for 1Byte strings in + * shared memory. + */ +#define ESTIMATE_1BYTE_STR_SIZE(src, strsize) \ +{ \ + strsize = add_size(strsize, sizeof(uint8)); \ + strsize = add_size(strsize, (src) ? 1 : 0); \ +} This could be an inline function. 3. +/* + * SERIALIZE_1BYTE_STR - Copy 1Byte strings to shared memory. + */ +#define SERIALIZE_1BYTE_STR(dest, src, copiedsize) \ +{ \ + uint8 len = (src) ? 
1 : 0; \ + memcpy(dest + copiedsize, (uint8 *) &len, sizeof(uint8)); \ + copiedsize += sizeof(uint8); \ + if (src) \ + dest[copiedsize++] = src[0]; \ +} Similarly, this could be a function. I think keeping such things as macros in-between code makes it difficult to read. Please see if you can make these and similar macros as functions unless they are doing few memory instructions. Having functions makes it easier to debug the code as well. [1] - https://www.postgresql.org/message-id/CAJcOf-cgfjj0NfYPrNFGmQJxsnNW102LTXbzqxQJuziar1EKfQ%40mail.gmail.com -- With Regards, Amit Kapila.
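As an illustration of that last suggestion, a hedged sketch of the 1-byte-string macros rewritten as inline functions; the names and signatures are mine, not the patch's (add_size() is the overflow-checked addition mentioned above, declared in storage/shmem.h):

static inline Size
EstimateOneByteStrSize(const char *src, Size strsize)
{
    /* One length byte, plus the single character if the string exists. */
    strsize = add_size(strsize, sizeof(uint8));
    strsize = add_size(strsize, (src != NULL) ? 1 : 0);
    return strsize;
}

static inline Size
SerializeOneByteStr(char *dest, const char *src, Size copiedsize)
{
    uint8       len = (src != NULL) ? 1 : 0;

    memcpy(dest + copiedsize, &len, sizeof(uint8));
    copiedsize += sizeof(uint8);
    if (src != NULL)
        dest[copiedsize++] = src[0];
    return copiedsize;
}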
On 27/10/2020 15:36, vignesh C wrote:
> Attached v9 patches have the fixes for the above comments.

I did some testing:

/tmp/longdata.pl:
--------
#!/usr/bin/perl
#
# Generate three rows:
# foo
# longdatalongdatalongdata...
# bar
#
# The length of the middle row is given as command line arg.
#

my $bytes = $ARGV[0];

print "foo\n";
for(my $i = 0; $i < $bytes; $i+=8){
    print "longdata";
}
print "\n";
print "bar\n";
--------

postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000' with (parallel 2);

This gets stuck forever (or at least I didn't have the patience to wait for it to finish). Both worker processes are consuming 100% of CPU.

- Heikki
On 27/10/2020 15:36, vignesh C wrote:
>> Attached v9 patches have the fixes for the above comments.
> I did some testing:

I did some testing as well and have a cosmetic remark:

postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000000000);
ERROR: value 1000000000 out of bounds for option "parallel"
DETAIL: Valid values are between "1" and "1024".
postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 100000000000);
ERROR: parallel requires an integer value
postgres=#

Wouldn't it make more sense to only have one error message? The first one seems to be the better message.

Regards
Daniel
On Thu, Oct 29, 2020 at 11:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Oct 27, 2020 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote: > > > [latest version] > > I think the parallel-safety checks in this patch > (v9-0002-Allow-copy-from-command-to-process-data-from-file) are > incomplete and wrong. > One more point, I have noticed that some time back [1], I have given one suggestion related to the way workers process the set of lines (aka chunk). I think you can try by increasing the chunk size to say 100, 500, 1000 and use some shared counter to remember the number of chunks processed. [1] - https://www.postgresql.org/message-id/CAA4eK1L-Xgw1zZEbGePmhBBWmEmLFL6rCaiOMDPnq2GNMVz-sg%40mail.gmail.com -- With Regards, Amit Kapila.
On 27/10/2020 15:36, vignesh C wrote: > Attached v9 patches have the fixes for the above comments. I find this design to be very complicated. Why does the line-boundary information need to be in shared memory? I think this would be much simpler if each worker grabbed a fixed-size block of raw data, and processed that. In your patch, the leader process scans the input to find out where one line ends and another begins, and because of that decision, the leader needs to make the line boundaries available in shared memory, for the worker processes. If we moved that responsibility to the worker processes, you wouldn't need to keep the line boundaries in shared memory. A worker would only need to pass enough state to the next worker to tell it where to start scanning the next block. Whether the leader process finds the EOLs or the worker processes, it's pretty clear that it needs to be done ASAP, for a chunk at a time, because that cannot be done in parallel. I think some refactoring in CopyReadLine() and friends would be in order. It probably would be faster, or at least not slower, to find all the EOLs in a block in one tight loop, even when parallel copy is not used. - Heikki
On 30/10/2020 18:36, Heikki Linnakangas wrote:
> I find this design to be very complicated. Why does the line-boundary
> information need to be in shared memory? I think this would be much
> simpler if each worker grabbed a fixed-size block of raw data, and
> processed that.
>
> In your patch, the leader process scans the input to find out where one
> line ends and another begins, and because of that decision, the leader
> needs to make the line boundaries available in shared memory, for the
> worker processes. If we moved that responsibility to the worker
> processes, you wouldn't need to keep the line boundaries in shared
> memory. A worker would only need to pass enough state to the next worker
> to tell it where to start scanning the next block.

Here's a high-level sketch of how I'm imagining this to work:

The shared memory structure consists of a queue of blocks, arranged as a ring buffer. Each block is of fixed size, and contains 64 kB of data, and a few fields for coordination:

typedef struct
{
    /* Current state of the block */
    pg_atomic_uint32 state;

    /* starting offset of first line within the block */
    int startpos;

    char data[64 kB];
} ParallelCopyDataBlock;

Where state is one of:

enum {
    FREE,        /* buffer is empty */
    FILLED,      /* leader has filled the buffer with raw data */
    READY,       /* start pos has been filled in, but no worker process has claimed the block yet */
    PROCESSING,  /* worker has claimed the block, and is processing it */
}

State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the COPY progresses, the ring of blocks will always look something like this:

blk 0 startpos  0: PROCESSING [worker 1]
blk 1 startpos 12: PROCESSING [worker 2]
blk 2 startpos 10: READY
blk 3 startpos  -: FILLED
blk 4 startpos  -: FILLED
blk 5 startpos  -: FILLED
blk 6 startpos  -: FREE
blk 7 startpos  -: FREE

Typically, each worker process is busy processing a block. After the blocks being processed, there is one block in READY state, and after that, blocks in FILLED state.

Leader process:

The leader process is simple. It picks the next FREE buffer, fills it with raw data from the file, and marks it as FILLED. If no buffers are FREE, wait.

Worker process:

1. Claim next READY block from queue, by changing its state to PROCESSING. If the next block is not READY yet, wait until it is.

2. Start scanning the block from 'startpos', finding end-of-line markers. (in CSV mode, need to track when we're in-quotes).

3. When you reach the end of the block, if the last line continues to next block, wait for the next block to become FILLED. Peek into the next block, and copy the remaining part of the split line to a local buffer, and set the 'startpos' on the next block to point to the end of the split line. Mark the next block as READY.

4. Process all the lines in the block, call input functions, insert rows.

5. Mark the block as DONE.

In this design, you don't need to keep line boundaries in shared memory, because each worker process is responsible for finding the line boundaries of its own block.

There's a point of serialization here, in that the next block cannot be processed, until the worker working on the previous block has finished scanning the EOLs, and set the starting position on the next block, putting it in READY state. That's not very different from your patch, where you had a similar point of serialization because the leader scanned the EOLs, but I think the coordination between processes is simpler here.

- Heikki
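To make step 1 concrete, a minimal sketch of how a worker could claim blocks using PostgreSQL's atomics. The shared struct, the ring size, and the ticket scheme are additions for illustration only, and "marking the block as DONE" in step 5 corresponds to setting it back to FREE, per the state-change line above:

#define NBLOCKS 1024                /* assumed ring size */

typedef struct
{
    pg_atomic_uint64 next_block;    /* next block index to hand out */
    ParallelCopyDataBlock blocks[NBLOCKS];  /* the ring described above */
} ParallelCopyShared;

static ParallelCopyDataBlock *
ClaimNextReadyBlock(ParallelCopyShared *shared)
{
    /* Take a ticket so that each worker claims a distinct block. */
    uint64      idx = pg_atomic_fetch_add_u64(&shared->next_block, 1);
    ParallelCopyDataBlock *blk = &shared->blocks[idx % NBLOCKS];

    /* Wait for the block to become READY, then flip it to PROCESSING. */
    for (;;)
    {
        uint32      expected = READY;

        if (pg_atomic_compare_exchange_u32(&blk->state, &expected, PROCESSING))
            return blk;

        /*
         * Not READY yet.  A real implementation would sleep on a condition
         * variable here rather than busy-wait.
         */
        CHECK_FOR_INTERRUPTS();
    }
}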
On 30/10/2020 18:36, Heikki Linnakangas wrote:
> Whether the leader process finds the EOLs or the worker processes, it's
> pretty clear that it needs to be done ASAP, for a chunk at a time,
> because that cannot be done in parallel. I think some refactoring in
> CopyReadLine() and friends would be in order. It probably would be
> faster, or at least not slower, to find all the EOLs in a block in one
> tight loop, even when parallel copy is not used.

Something like the attached. It passes the regression tests, but it's quite incomplete. It's missing handling of "\." as an end-of-file marker, and I haven't tested encoding conversions at all, for starters.

Quick testing suggests that this is a little bit faster than the current code, but the difference is small; I had to use a "WHERE false" option to really see the difference.

The crucial thing here is that there's a new function, ParseLinesText(), to find all end-of-line characters in a buffer in one go. In this patch, it's used against 'raw_buf', but with parallel copy, you could point it at a block in shared memory instead.

- Heikki
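The attached patch itself isn't reproduced here, but the core idea can be sketched as follows; only the function name comes from the description above, while the signature and the offset-array output are assumptions:

/*
 * Record the offset of every newline in 'buf' in one tight pass, using
 * memchr().  Real COPY text mode also has to handle backslash escapes,
 * \r\n and encoding conversion, all ignored in this sketch.
 */
static int
ParseLinesText(const char *buf, int len, int *eol_offsets, int max_eols)
{
    int         neols = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (neols < max_eols)
    {
        const char *nl = memchr(p, '\n', end - p);

        if (nl == NULL)
            break;
        eol_offsets[neols++] = (int) (nl - buf);
        p = nl + 1;
    }
    return neols;
}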
Hi,

I've done a bit more testing today, and I think the parsing is busted in some way. Consider this:

test=# create extension random;
CREATE EXTENSION
test=# create table t (a text);
CREATE TABLE
test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,10000);
INSERT 0 10000
test=# copy t to '/mnt/data/t.csv';
COPY 10000
test=# truncate t;
TRUNCATE TABLE
test=# copy t from '/mnt/data/t.csv';
COPY 10000
test=# truncate t;
TRUNCATE TABLE
test=# copy t from '/mnt/data/t.csv' with (parallel 2);
ERROR: invalid byte sequence for encoding "UTF8": 0x00
CONTEXT: COPY t, line 485: "m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB>F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
parallel worker

The functions come from an extension I use to generate random data, I've pushed it to github [1]. The random_string() function generates a random string with ASCII characters, symbols and a couple of special characters (\r\n\t). The intent was to try loading data where a field may span multiple 64kB blocks and may contain newlines etc.

The non-parallel copy works fine, the parallel one fails. I haven't investigated the details, but I guess it gets confused about where a string starts/ends, or something like that.

[1] https://github.com/tvondra/random

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Oct 30, 2020 at 06:41:41PM +0200, Heikki Linnakangas wrote: >On 30/10/2020 18:36, Heikki Linnakangas wrote: >>I find this design to be very complicated. Why does the line-boundary >>information need to be in shared memory? I think this would be much >>simpler if each worker grabbed a fixed-size block of raw data, and >>processed that. >> >>In your patch, the leader process scans the input to find out where one >>line ends and another begins, and because of that decision, the leader >>needs to make the line boundaries available in shared memory, for the >>worker processes. If we moved that responsibility to the worker >>processes, you wouldn't need to keep the line boundaries in shared >>memory. A worker would only need to pass enough state to the next worker >>to tell it where to start scanning the next block. > >Here's a high-level sketch of how I'm imagining this to work: > >The shared memory structure consists of a queue of blocks, arranged as >a ring buffer. Each block is of fixed size, and contains 64 kB of >data, and a few fields for coordination: > >typedef struct >{ > /* Current state of the block */ > pg_atomic_uint32 state; > > /* starting offset of first line within the block */ > int startpos; > > char data[64 kB]; >} ParallelCopyDataBlock; > >Where state is one of: > >enum { > FREE, /* buffer is empty */ > FILLED, /* leader has filled the buffer with raw data */ > READY, /* start pos has been filled in, but no worker process >has claimed the block yet */ > PROCESSING, /* worker has claimed the block, and is processing it */ >} > >State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the >COPY progresses, the ring of blocks will always look something like >this: > >blk 0 startpos 0: PROCESSING [worker 1] >blk 1 startpos 12: PROCESSING [worker 2] >blk 2 startpos 10: READY >blk 3 starptos -: FILLED >blk 4 startpos -: FILLED >blk 5 starptos -: FILLED >blk 6 startpos -: FREE >blk 7 startpos -: FREE > >Typically, each worker process is busy processing a block. After the >blocks being processed, there is one block in READY state, and after >that, blocks in FILLED state. > >Leader process: > >The leader process is simple. It picks the next FREE buffer, fills it >with raw data from the file, and marks it as FILLED. If no buffers are >FREE, wait. > >Worker process: > >1. Claim next READY block from queue, by changing its state to > PROCESSING. If the next block is not READY yet, wait until it is. > >2. Start scanning the block from 'startpos', finding end-of-line > markers. (in CSV mode, need to track when we're in-quotes). > >3. When you reach the end of the block, if the last line continues to > next block, wait for the next block to become FILLED. Peek into the > next block, and copy the remaining part of the split line to a local > buffer, and set the 'startpos' on the next block to point to the end > of the split line. Mark the next block as READY. > >4. Process all the lines in the block, call input functions, insert > rows. > >5. Mark the block as DONE. > >In this design, you don't need to keep line boundaries in shared >memory, because each worker process is responsible for finding the >line boundaries of its own block. > >There's a point of serialization here, in that the next block cannot >be processed, until the worker working on the previous block has >finished scanning the EOLs, and set the starting position on the next >block, putting it in READY state. 
That's not very different from your >patch, where you had a similar point of serialization because the >leader scanned the EOLs, but I think the coordination between >processes is simpler here. > I agree this design looks simpler. I'm a bit worried about serializing the parsing like this, though. It's true the current approach (where the first phase of parsing happens in the leader) has a similar issue, but I think it would be easier to improve that in that design. My plan was to parallelize the parsing roughly like this: 1) split the input buffer into smaller chunks 2) let workers scan the buffers and record positions of interesting characters (delimiters, quotes, ...) and pass it back to the leader 3) use the information to actually parse the input data (we only need to look at the interesting characters, skipping large parts of data) 4) pass the parsed chunks to workers, just like in the current patch But maybe something like that would be possible even with the approach you propose - we could have a special parse phase for processing each buffer, where any worker could look for the special characters, record the positions in a bitmap next to the buffer. So the whole sequence of states would look something like this: EMPTY FILLED PARSED READY PROCESSING Of course, the question is whether parsing really is sufficiently expensive for this to be worth it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 30/10/2020 22:56, Tomas Vondra wrote: > I agree this design looks simpler. I'm a bit worried about serializing > the parsing like this, though. It's true the current approach (where the > first phase of parsing happens in the leader) has a similar issue, but I > think it would be easier to improve that in that design. > > My plan was to parallelize the parsing roughly like this: > > 1) split the input buffer into smaller chunks > > 2) let workers scan the buffers and record positions of interesting > characters (delimiters, quotes, ...) and pass it back to the leader > > 3) use the information to actually parse the input data (we only need to > look at the interesting characters, skipping large parts of data) > > 4) pass the parsed chunks to workers, just like in the current patch > > > But maybe something like that would be possible even with the approach > you propose - we could have a special parse phase for processing each > buffer, where any worker could look for the special characters, record > the positions in a bitmap next to the buffer. So the whole sequence of > states would look something like this: > > EMPTY > FILLED > PARSED > READY > PROCESSING I think it's even simpler than that. You don't need to communicate the "interesting positions" between processes, if the same worker takes care of the chunk through all states from FILLED to DONE. You can build the bitmap of interesting positions immediately in FILLED state, independently of all previous blocks. Once you've built the bitmap, you need to wait for the information on where the first line starts, but presumably finding the interesting positions is the expensive part. > Of course, the question is whether parsing really is sufficiently > expensive for this to be worth it. Yeah, I don't think it's worth it. Splitting the lines is pretty fast, I think we have many years to come before that becomes a bottleneck. But if it turns out I'm wrong and we need to implement that, the path is pretty straightforward. - Heikki
On Sat, Oct 31, 2020 at 12:09:32AM +0200, Heikki Linnakangas wrote: >On 30/10/2020 22:56, Tomas Vondra wrote: >>I agree this design looks simpler. I'm a bit worried about serializing >>the parsing like this, though. It's true the current approach (where the >>first phase of parsing happens in the leader) has a similar issue, but I >>think it would be easier to improve that in that design. >> >>My plan was to parallelize the parsing roughly like this: >> >>1) split the input buffer into smaller chunks >> >>2) let workers scan the buffers and record positions of interesting >>characters (delimiters, quotes, ...) and pass it back to the leader >> >>3) use the information to actually parse the input data (we only need to >>look at the interesting characters, skipping large parts of data) >> >>4) pass the parsed chunks to workers, just like in the current patch >> >> >>But maybe something like that would be possible even with the approach >>you propose - we could have a special parse phase for processing each >>buffer, where any worker could look for the special characters, record >>the positions in a bitmap next to the buffer. So the whole sequence of >>states would look something like this: >> >> EMPTY >> FILLED >> PARSED >> READY >> PROCESSING >I think it's even simpler than that. You don't need to communicate the >"interesting positions" between processes, if the same worker takes >care of the chunk through all states from FILLED to DONE. >You can build the bitmap of interesting positions immediately in >FILLED state, independently of all previous blocks. Once you've built >the bitmap, you need to wait for the information on where the first >line starts, but presumably finding the interesting positions is the >expensive part. > I don't think it's that simple. For example, the previous block may contain a very long value (say, 1MB), so a bunch of blocks have to be processed by the same worker. That probably makes the state transitions a bit more complex, and it also means the bitmap would need to be passed to the worker that actually processes the block. Or we might just ignore this, on the grounds that it's not a very common situation. >>Of course, the question is whether parsing really is sufficiently >>expensive for this to be worth it. >Yeah, I don't think it's worth it. Splitting the lines is pretty fast, >I think we have many years to come before that becomes a bottleneck. >But if it turns out I'm wrong and we need to implement that, the path >is pretty straightforward. > OK. I agree the parsing is relatively cheap, and I don't recall seeing CSV parsing as a bottleneck in production. I suspect that might be simply because we're hitting other bottlenecks first, though. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Leader process:
>
> The leader process is simple. It picks the next FREE buffer, fills it
> with raw data from the file, and marks it as FILLED. If no buffers are
> FREE, wait.
>
> Worker process:
>
> 1. Claim next READY block from queue, by changing its state to
> PROCESSING. If the next block is not READY yet, wait until it is.
>
> 2. Start scanning the block from 'startpos', finding end-of-line
> markers. (in CSV mode, need to track when we're in-quotes).
>
> 3. When you reach the end of the block, if the last line continues to
> next block, wait for the next block to become FILLED. Peek into the
> next block, and copy the remaining part of the split line to a local
> buffer, and set the 'startpos' on the next block to point to the end
> of the split line. Mark the next block as READY.
>
> 4. Process all the lines in the block, call input functions, insert
> rows.
>
> 5. Mark the block as DONE.
>
> In this design, you don't need to keep line boundaries in shared memory,
> because each worker process is responsible for finding the line
> boundaries of its own block.
>
> There's a point of serialization here, in that the next block cannot be
> processed, until the worker working on the previous block has finished
> scanning the EOLs, and set the starting position on the next block,
> putting it in READY state. That's not very different from your patch,
> where you had a similar point of serialization because the leader
> scanned the EOLs,
>

But in the design (single producer multiple consumer) used by the patch,
the worker doesn't need to wait till the complete block is processed; it
can start processing the lines already found. This will also allow
workers to start much earlier to process the data, as they don't need to
wait for all the offsets corresponding to the 64K block to be ready.
However, in the design where each worker is processing the 64K block, it
can lead to much longer waits. I think this will impact the Copy STDIN
case more, where in most cases (200-300 byte tuples) we receive data
line-by-line from the client and the line-endings are found by the
leader. If the leader doesn't find the line-endings, the workers need to
wait till the leader fills the entire 64K chunk; OTOH, with the current
approach the worker can start as soon as the leader is able to populate
some minimum number of line-endings.

The other point is that the leader backend won't be used completely as
it is only doing a very small part (primarily reading the file) of the
overall work.

We have discussed both these approaches (a) single producer multiple
consumer, and (b) all workers doing the processing as you are saying in
the beginning and concluded that (a) is better, see some of the relevant
emails [1][2][3].

[1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
[2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
[3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de

--
With Regards,
Amit Kapila.
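A standalone model of the single-producer-multiple-consumer hand-off
described above: the leader publishes each line ending as soon as it is
found, and workers claim lines without waiting for the whole 64K block
(C11 atomics; all names are illustrative, not from the patch):

#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 1024

typedef struct
{
    _Atomic uint32_t n_published;   /* line offsets made visible by the leader */
    _Atomic uint32_t n_claimed;     /* line offsets handed out to workers */
    uint32_t offsets[RING_SIZE];    /* start offset of each line in the raw data */
} LineRing;

/* Leader: publish one line ending as soon as it is found.  (A real version
 * must also avoid publishing more than RING_SIZE lines ahead of the workers;
 * that back-pressure check is elided here.) */
static void
publish_line(LineRing *r, uint32_t off)
{
    uint32_t n = atomic_load_explicit(&r->n_published, memory_order_relaxed);

    r->offsets[n % RING_SIZE] = off;
    atomic_store_explicit(&r->n_published, n + 1, memory_order_release);
}

/* Worker: claim the next unclaimed line, or return -1 if none is published
 * yet and the caller should retry; there is no need to wait for a full block. */
static int64_t
claim_line(LineRing *r)
{
    uint32_t c = atomic_load_explicit(&r->n_claimed, memory_order_relaxed);

    for (;;)
    {
        if (c >= atomic_load_explicit(&r->n_published, memory_order_acquire))
            return -1;
        if (atomic_compare_exchange_weak_explicit(&r->n_claimed, &c, c + 1,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
            return (int64_t) r->offsets[c % RING_SIZE];
    }
}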
On 02/11/2020 08:14, Amit Kapila wrote:
> On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Leader process:
>>
>> The leader process is simple. It picks the next FREE buffer, fills it
>> with raw data from the file, and marks it as FILLED. If no buffers are
>> FREE, wait.
>>
>> Worker process:
>>
>> 1. Claim next READY block from queue, by changing its state to
>> PROCESSING. If the next block is not READY yet, wait until it is.
>>
>> 2. Start scanning the block from 'startpos', finding end-of-line
>> markers. (in CSV mode, need to track when we're in-quotes).
>>
>> 3. When you reach the end of the block, if the last line continues to
>> next block, wait for the next block to become FILLED. Peek into the
>> next block, and copy the remaining part of the split line to a local
>> buffer, and set the 'startpos' on the next block to point to the end
>> of the split line. Mark the next block as READY.
>>
>> 4. Process all the lines in the block, call input functions, insert
>> rows.
>>
>> 5. Mark the block as DONE.
>>
>> In this design, you don't need to keep line boundaries in shared memory,
>> because each worker process is responsible for finding the line
>> boundaries of its own block.
>>
>> There's a point of serialization here, in that the next block cannot be
>> processed, until the worker working on the previous block has finished
>> scanning the EOLs, and set the starting position on the next block,
>> putting it in READY state. That's not very different from your patch,
>> where you had a similar point of serialization because the leader
>> scanned the EOLs,
>
> But in the design (single producer multiple consumer) used by the
> patch, the worker doesn't need to wait till the complete block is
> processed; it can start processing the lines already found. This will
> also allow workers to start much earlier to process the data, as they
> don't need to wait for all the offsets corresponding to the 64K block
> to be ready. However, in the design where each worker is processing
> the 64K block, it can lead to much longer waits. I think this will
> impact the Copy STDIN case more, where in most cases (200-300 byte
> tuples) we receive data line-by-line from the client and the
> line-endings are found by the leader. If the leader doesn't find the
> line-endings, the workers need to wait till the leader fills the
> entire 64K chunk; OTOH, with the current approach the worker can start
> as soon as the leader is able to populate some minimum number of
> line-endings.

You can use a smaller block size. However, the point of parallel copy is
to maximize bandwidth. If the workers ever have to sit idle, it means
that the bottleneck is in receiving data from the client, i.e. the
backend is fast enough, and you can't make the overall COPY finish any
faster no matter how you do it.

> The other point is that the leader backend won't be used completely as
> it is only doing a very small part (primarily reading the file) of the
> overall work.

An idle process doesn't cost anything. If you have free CPU resources,
use more workers.

> We have discussed both these approaches (a) single producer multiple
> consumer, and (b) all workers doing the processing as you are saying
> in the beginning and concluded that (a) is better, see some of the
> relevant emails [1][2][3].
>
> [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de

Sorry I'm late to the party. I don't think the design I proposed was
discussed in those threads. The alternative that's discussed in those
threads seems to be something much more fine-grained, where processes
claim individual lines. I'm not sure though, I didn't fully understand
the alternative designs.

I want to throw out one more idea. It's an interim step, not the final
solution we want, but a useful step in getting there:

Have the leader process scan the input for line-endings. Split the input
data into blocks of slightly under 64 kB in size, so that a line never
crosses a block. Put the blocks in shared memory.

A worker process claims a block from shared memory, processes it from
beginning to end. It *also* has to parse the input to split it into
lines.

In this design, the line-splitting is done twice. That's clearly not
optimal, and we want to avoid that in the final patch, but I think it
would be a useful milestone. After that patch is done, write another
patch to either a) implement the design I sketched, where blocks are
fixed-size and a worker notifies the next worker on where the first line
in next block begins, or b) have the leader process report the
line-ending positions in shared memory, so that workers don't need to
scan them again.

Even if we apply the patches together, I think splitting them like that
would make for easier review.

- Heikki
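The leader-side packing loop for this interim design could look roughly
like this (a standalone sketch with a stubbed hand-off; in the real
thing the blocks would live in shared memory):

#include <string.h>

#define BLOCK_SIZE 65536

static char block[BLOCK_SIZE];
static int  used;

/* Hand a finished block to a worker; stubbed out in this sketch. */
static void
emit_block(const char *data, int len)
{
    (void) data;
    (void) len;
}

/* Leader: append one whole line (len includes the newline); flush the current
 * block first if the line wouldn't fit, so no line ever crosses a block. */
static void
add_line(const char *line, int len)
{
    if (len > BLOCK_SIZE)
        return;                 /* an over-long line needs a separate fallback */

    if (used + len > BLOCK_SIZE)
    {
        emit_block(block, used);    /* block ends exactly on a line boundary */
        used = 0;
    }
    memcpy(block + used, line, len);
    used += len;
}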
On 02/11/2020 09:10, Heikki Linnakangas wrote:
> On 02/11/2020 08:14, Amit Kapila wrote:
>> We have discussed both these approaches (a) single producer multiple
>> consumer, and (b) all workers doing the processing as you are saying
>> in the beginning and concluded that (a) is better, see some of the
>> relevant emails [1][2][3].
>>
>> [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
>> [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
>> [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
>
> Sorry I'm late to the party. I don't think the design I proposed was
> discussed in those threads. The alternative that's discussed in those
> threads seems to be something much more fine-grained, where processes
> claim individual lines. I'm not sure though, I didn't fully understand
> the alternative designs.

I read the thread more carefully, and I think Robert had basically the
right idea here
(https://www.postgresql.org/message-id/CA%2BTgmoZMU4az9MmdJtg04pjRa0wmWQtmoMxttdxNrupYJNcR3w%40mail.gmail.com):

> I really think we don't want a single worker in charge of finding
> tuple boundaries for everybody. That adds a lot of unnecessary
> inter-process communication and synchronization. Each process should
> just get the next tuple starting after where the last one ended, and
> then advance the end pointer so that the next process can do the same
> thing. [...]

And here
(https://www.postgresql.org/message-id/CA%2BTgmoZw%2BF3y%2BoaxEsHEZBxdL1x1KAJ7pRMNgCqX0WjmjGNLrA%40mail.gmail.com):

> On Thu, Apr 9, 2020 at 2:55 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>> I'm fairly certain that we do *not* want to distribute input data
>> between processes on a single tuple basis. Probably not even below
>> a few hundred kb. If there's any sort of natural clustering in the
>> loaded data - extremely common, think timestamps - splitting on a
>> granular basis will make indexing much more expensive. And have a lot
>> more contention.
>
> That's a fair point. I think the solution ought to be that once any
> process starts finding line endings, it continues until it's grabbed
> at least a certain amount of data for itself. Then it stops and lets
> some other process grab a chunk of data.

Yes! That's pretty close to the design I sketched. I imagined that the
leader would divide the input into 64 kB blocks, and each block would
have a few metadata fields, notably the starting position of the first
line in the block. I think Robert envisioned having a single "next
starting position" field in shared memory. That works too, and is even
simpler, so +1 for that.

For some reason, the discussion took a different turn from there, to
discuss how the line-endings (called "chunks" in the discussion) should
be represented in shared memory. But none of that is necessary with
Robert's design.

- Heikki
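A standalone sketch of Robert's single shared "next starting position"
(all names illustrative; a flat in-memory input stands in for the stream
of raw-data blocks, and CSV quoting is ignored):

#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define MIN_GRAB 65536          /* keep naturally-clustered rows together */

static _Atomic size_t next_start;   /* shared: first unclaimed input byte */

/* Claim the next run of whole lines, at least MIN_GRAB bytes if available.
 * Only one process may run this at a time (a real version would serialize it
 * with a lock); the updated next_start then lets the next process continue
 * from exactly where this one stopped. */
static size_t
claim_chunk(const char *input, size_t input_len, size_t *chunk_start)
{
    size_t start = atomic_load(&next_start);
    size_t end = start;

    if (start >= input_len)
        return 0;               /* input exhausted */

    while (end < input_len && end - start < MIN_GRAB)
    {
        const char *nl = memchr(input + end, '\n', input_len - end);

        if (nl == NULL)
        {
            end = input_len;    /* last (unterminated) line */
            break;
        }
        end = (size_t) (nl - input) + 1;
    }

    atomic_store(&next_start, end);
    *chunk_start = start;
    return end - start;
}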
On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/11/2020 08:14, Amit Kapila wrote:
> > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> In this design, you don't need to keep line boundaries in shared memory,
> >> because each worker process is responsible for finding the line
> >> boundaries of its own block.
> >>
> >> There's a point of serialization here, in that the next block cannot be
> >> processed, until the worker working on the previous block has finished
> >> scanning the EOLs, and set the starting position on the next block,
> >> putting it in READY state. That's not very different from your patch,
> >> where you had a similar point of serialization because the leader
> >> scanned the EOLs,
> >
> > But in the design (single producer multiple consumer) used by the
> > patch, the worker doesn't need to wait till the complete block is
> > processed; it can start processing the lines already found. This will
> > also allow workers to start much earlier to process the data, as they
> > don't need to wait for all the offsets corresponding to the 64K block
> > to be ready. However, in the design where each worker is processing
> > the 64K block, it can lead to much longer waits. I think this will
> > impact the Copy STDIN case more, where in most cases (200-300 byte
> > tuples) we receive data line-by-line from the client and the
> > line-endings are found by the leader. If the leader doesn't find the
> > line-endings, the workers need to wait till the leader fills the
> > entire 64K chunk; OTOH, with the current approach the worker can
> > start as soon as the leader is able to populate some minimum number
> > of line-endings.
>
> You can use a smaller block size.
>

Sure, but the same problem can happen if the last line in that block is
too long and we need to peek into the next block. And then there could
be cases where a single line could be greater than 64K.

> However, the point of parallel copy is
> to maximize bandwidth.
>

Okay, but this first-phase (finding the line boundaries) anyway cannot
be done in parallel, and we have seen in some of the initial
benchmarking that this initial phase is a small part of the work,
especially when the table has indexes, constraints, etc. So, I think it
won't matter much whether this splitting is done in a single process or
in multiple processes.

> If the workers ever have to sit idle, it means
> that the bottleneck is in receiving data from the client, i.e. the
> backend is fast enough, and you can't make the overall COPY finish any
> faster no matter how you do it.
>
> > The other point is that the leader backend won't be used completely as
> > it is only doing a very small part (primarily reading the file) of the
> > overall work.
>
> An idle process doesn't cost anything. If you have free CPU resources,
> use more workers.
>
> > We have discussed both these approaches (a) single producer multiple
> > consumer, and (b) all workers doing the processing as you are saying
> > in the beginning and concluded that (a) is better, see some of the
> > relevant emails [1][2][3].
> >
> > [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> > [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> > [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
>
> Sorry I'm late to the party. I don't think the design I proposed was
> discussed in those threads.
>

I think something close to that is discussed, as you have noticed, in
your next email, but IIRC, because many people (Andres, Ants, myself and
the author) favoured the current approach (single reader and multiple
consumers) we decided to go with that. I feel this patch is very much in
the POC stage due to which the code doesn't look good, and as we move
forward we need to see what is the better way to improve it; maybe one
of the ways is to split it as you are suggesting so that it can be
easier to review.

I think the other important thing which this patch has not addressed
properly is the parallel-safety checks, as pointed out by me earlier.
There are two things to solve there: (a) the lower-level code (like
heap_* APIs, CommandCounterIncrement, xact.c APIs, etc.) has checks
which don't allow any writes; we need to see which of those we can open
now (or do some additional work to avoid hitting those checks) after
some of the work done for parallel-writes in PG-13 [1][2], and (b) in
which cases parallel-writes (parallel copy) are allowed; for example, we
need to identify whether the table or one of its partitions has any
constraint/expression which is parallel-unsafe.

[1] 85f6b49 Allow relation extension lock to conflict among parallel group members
[2] 3ba59cc Allow page lock to conflict among parallel group members

> I want to throw out one more idea. It's an interim step, not the final
> solution we want, but a useful step in getting there:
>
> Have the leader process scan the input for line-endings. Split the input
> data into blocks of slightly under 64 kB in size, so that a line never
> crosses a block. Put the blocks in shared memory.
>
> A worker process claims a block from shared memory, processes it from
> beginning to end. It *also* has to parse the input to split it into lines.
>
> In this design, the line-splitting is done twice. That's clearly not
> optimal, and we want to avoid that in the final patch, but I think it
> would be a useful milestone. After that patch is done, write another
> patch to either a) implement the design I sketched, where blocks are
> fixed-size and a worker notifies the next worker on where the first line
> in next block begins, or b) have the leader process report the
> line-ending positions in shared memory, so that workers don't need to
> scan them again.
>
> Even if we apply the patches together, I think splitting them like that
> would make for easier review.
>

I think this is worth exploring, especially if it makes the patch easier
to review.

--
With Regards,
Amit Kapila.
On 03/11/2020 10:59, Amit Kapila wrote:
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> However, the point of parallel copy is to maximize bandwidth.
>
> Okay, but this first-phase (finding the line boundaries) anyway cannot
> be done in parallel, and we have seen in some of the initial
> benchmarking that this initial phase is a small part of the work,
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much whether this splitting is done in a single
> process or in multiple processes.

Right, it won't matter performance-wise. That's not my point. The
difference is in the complexity. If you don't store the line boundaries
in shared memory, you get away with much simpler shared memory
structures.

> I think something close to that is discussed as you have noticed in
> your next email but IIRC, because many people (Andres, Ants, myself
> and the author) favoured the current approach (single reader and
> multiple consumers) we decided to go with that. I feel this patch is
> very much in the POC stage due to which the code doesn't look good,
> and as we move forward we need to see what is the better way to
> improve it; maybe one of the ways is to split it as you are suggesting
> so that it can be easier to review.

Sure. I think the roadmap here is:

1. Split copy.c [1]. Not strictly necessary, but I think it'd make this
nicer to review and work with.

2. Refactor CopyReadLine(), so that finding the line-endings and the
rest of the line-parsing are separated into separate functions.

3. Implement parallel copy.

> I think the other important thing which this patch has not addressed
> properly is the parallel-safety checks, as pointed out by me earlier.
> There are two things to solve there: (a) the lower-level code (like
> heap_* APIs, CommandCounterIncrement, xact.c APIs, etc.) has checks
> which don't allow any writes; we need to see which of those we can
> open now (or do some additional work to avoid hitting those checks)
> after some of the work done for parallel-writes in PG-13, and (b) in
> which cases parallel-writes (parallel copy) are allowed; for example,
> we need to identify whether the table or one of its partitions has any
> constraint/expression which is parallel-unsafe.

Agreed, that needs to be solved. I haven't given it any thought myself.

- Heikki

[1] https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi
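For step 2 of that roadmap, the split could have roughly this shape
(hypothetical signatures sketched against today's copy.c; CopyReadLine()
itself exists, the two helpers do not):

/* Scan the raw buffer for the next end-of-line, tracking CSV quoting state;
 * no field splitting or per-field work happens here. */
static bool CopyScanLineBoundary(CopyState cstate, int *line_start, int *line_end);

/* Everything else CopyReadLine() does once the boundary is known: copy the
 * line into cstate->line_buf and deal with encoding conversion. */
static bool CopyProcessLine(CopyState cstate, int line_start, int line_end);

/* CopyReadLine() itself would then become a thin wrapper: */
static bool
CopyReadLine(CopyState cstate)
{
    int     line_start;
    int     line_end;

    if (CopyScanLineBoundary(cstate, &line_start, &line_end))
        return true;            /* reached EOF, as today */
    return CopyProcessLine(cstate, line_start, line_end);
}

With that split, a parallel leader (or worker) can call
CopyScanLineBoundary() alone, which is exactly the part the competing
designs want to place differently.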
Hi

>
> my $bytes = $ARGV[0];
> for(my $i = 0; $i < $bytes; $i+=8){
> print "longdata";
> }
> print "\n";
> --------
>
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> with (parallel 2);
>
> This gets stuck forever (or at least I didn't have the patience to wait
> it finish). Both worker processes are consuming 100% of CPU.

I had a look over this problem.

The ParallelCopyDataBlock has a size limit:

    uint8   skip_bytes;
    char    data[DATA_BLOCK_SIZE];  /* data read from file */

It seems the input line is so long that the leader process runs out of
the shared memory among parallel copy workers, and the leader process
keeps waiting for a free block.

The worker process waits until line_state becomes
LINE_LEADER_POPULATED, but the leader process won't set the line_state
unless it has read the whole line.

So it is stuck forever. Maybe we should reconsider this situation.

The stacks are as follows:

Leader stack:
#3  0x000000000075f7a1 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
#4  0x00000000005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
#5  0x00000000005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536, copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
#6  0x00000000005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
#7  0x00000000005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863

Worker stack:
#0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
#1  0x00000000005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
#2  0x00000000005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
#3  0x00000000005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
#4  0x00000000005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
#5  0x00000000005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985

Best regards,
houzj
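In outline, the wait cycle above is: the leader, inside
CopyReadLineText(), blocks in WaitGetFreeCopyBlock() until some data
block is freed, while every worker spins in GetLinePosition() until the
line reaches LINE_LEADER_POPULATED; no block can be freed before the
line completes, and the line cannot complete before a block is freed.
One illustrative guard, not the actual fix that later went into v10,
would be to detect up front that a line cannot fit in the shared pool
and let the caller fall back to a serial path (MAX_BLOCKS is a
hypothetical pool size):

#include <stdbool.h>
#include <stdint.h>

#define DATA_BLOCK_SIZE 65536
#define MAX_BLOCKS      1024    /* hypothetical size of the shared block pool */

/* The deadlock requires a single line to be larger than everything the
 * leader can ever get back from the workers, so refuse to publish such a
 * line through shared memory in the first place. */
static bool
line_fits_in_shared_pool(uint64_t line_size)
{
    /* leave one block of slack for the partial line already written */
    return line_size <= (uint64_t) DATA_BLOCK_SIZE * (MAX_BLOCKS - 1);
}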
On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi
>
> >
> > my $bytes = $ARGV[0];
> > for(my $i = 0; $i < $bytes; $i+=8){
> > print "longdata";
> > }
> > print "\n";
> > --------
> >
> > postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> > with (parallel 2);
> >
> > This gets stuck forever (or at least I didn't have the patience to wait
> > it finish). Both worker processes are consuming 100% of CPU.
>
> I had a look over this problem.
>
> The ParallelCopyDataBlock has a size limit:
>
>     uint8   skip_bytes;
>     char    data[DATA_BLOCK_SIZE];  /* data read from file */
>
> It seems the input line is so long that the leader process runs out of
> the shared memory among parallel copy workers, and the leader process
> keeps waiting for a free block.
>
> The worker process waits until line_state becomes
> LINE_LEADER_POPULATED, but the leader process won't set the line_state
> unless it has read the whole line.
>
> So it is stuck forever. Maybe we should reconsider this situation.
>
> The stacks are as follows:
>
> Leader stack:
> #3  0x000000000075f7a1 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
> #4  0x00000000005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
> #5  0x00000000005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536, copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
> #6  0x00000000005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
> #7  0x00000000005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863
>
> Worker stack:
> #0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
> #1  0x00000000005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
> #2  0x00000000005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
> #3  0x00000000005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
> #4  0x00000000005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
> #5  0x00000000005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985
>

Thanks for providing your thoughts. I have analyzed this issue and I'm
working on the fix for this; I will be posting a patch for it shortly.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed, until the worker working on the previous block has finished
> > >> scanning the EOLs, and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer multiple consumer) used by the
> > > patch, the worker doesn't need to wait till the complete block is
> > > processed; it can start processing the lines already found. This
> > > will also allow workers to start much earlier to process the data,
> > > as they don't need to wait for all the offsets corresponding to the
> > > 64K block to be ready. However, in the design where each worker is
> > > processing the 64K block, it can lead to much longer waits. I think
> > > this will impact the Copy STDIN case more, where in most cases
> > > (200-300 byte tuples) we receive data line-by-line from the client
> > > and the line-endings are found by the leader. If the leader doesn't
> > > find the line-endings, the workers need to wait till the leader
> > > fills the entire 64K chunk; OTOH, with the current approach the
> > > worker can start as soon as the leader is able to populate some
> > > minimum number of line-endings.
> >
> > You can use a smaller block size.
> >
>
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block. And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
> >
>
> Okay, but this first-phase (finding the line boundaries) anyway cannot
> be done in parallel, and we have seen in some of the initial
> benchmarking that this initial phase is a small part of the work,
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much whether this splitting is done in a single
> process or in multiple processes.
>
> > If the workers ever have to sit idle, it means
> > that the bottleneck is in receiving data from the client, i.e. the
> > backend is fast enough, and you can't make the overall COPY finish any
> > faster no matter how you do it.
> >
> > > The other point is that the leader backend won't be used completely as
> > > it is only doing a very small part (primarily reading the file) of the
> > > overall work.
> >
> > An idle process doesn't cost anything. If you have free CPU resources,
> > use more workers.
> >
> > > We have discussed both these approaches (a) single producer multiple
> > > consumer, and (b) all workers doing the processing as you are saying
> > > in the beginning and concluded that (a) is better, see some of the
> > > relevant emails [1][2][3].
> > >
> > > [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> > > [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> > > [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
> >
> > Sorry I'm late to the party. I don't think the design I proposed was
> > discussed in those threads.
>
> I think something close to that is discussed, as you have noticed, in
> your next email, but IIRC, because many people (Andres, Ants, myself
> and the author) favoured the current approach (single reader and
> multiple consumers) we decided to go with that. I feel this patch is
> very much in the POC stage due to which the code doesn't look good,
> and as we move forward we need to see what is the better way to
> improve it; maybe one of the ways is to split it as you are suggesting
> so that it can be easier to review. I think the other important thing
> which this patch has not addressed properly is the parallel-safety
> checks, as pointed out by me earlier. There are two things to solve
> there: (a) the lower-level code (like heap_* APIs,
> CommandCounterIncrement, xact.c APIs, etc.) has checks which don't
> allow any writes; we need to see which of those we can open now (or do
> some additional work to avoid hitting those checks) after some of the
> work done for parallel-writes in PG-13, and (b) in which cases
> parallel-writes (parallel copy) are allowed; for example, we need to
> identify whether the table or one of its partitions has any
> constraint/expression which is parallel-unsafe.
>

I have worked to provide a patch for the parallel safety checks. It
checks whether parallel copy can be performed. Parallel copy cannot be
performed in the following cases:
a) if the relation is a temporary table,
b) if the relation is a foreign table,
c) if the relation has non-parallel-safe index expressions,
d) if the relation has triggers of any type other than a before
statement trigger,
e) if the relation has check constraints which are not parallel safe,
f) if the relation is partitioned and any partition falls under the
above types.
This patch has the checks for it and will be used by the parallel copy
implementation. Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
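Condensed into pseudo-C, using the field names quoted earlier in this
thread, the check has roughly this shape (a sketch of the shape only,
not the patch's code; the index-expression, check-constraint and
partition walks are elided):

static bool
IsParallelCopyAllowed(CopyState cstate)
{
    Relation    rel = cstate->rel;

    /* (a) temporary tables are backend-local, so workers cannot see them */
    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
        return false;

    /* (b) foreign tables */
    if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
        return false;

    /* (d) any trigger other than a BEFORE STATEMENT trigger */
    if (rel->trigdesc != NULL &&
        (rel->trigdesc->trig_insert_after_statement ||
         rel->trigdesc->trig_insert_new_table ||
         rel->trigdesc->trig_insert_before_row ||
         rel->trigdesc->trig_insert_after_row ||
         rel->trigdesc->trig_insert_instead_row))
        return false;

    /*
     * (c) index expressions and (e) check constraints must be parallel
     * safe, and (f) every partition must pass all of the above; those
     * walks are elided in this sketch.
     */
    return true;
}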
On Tue, Nov 10, 2020 at 7:12 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> I have worked to provide a patch for the parallel safety checks. It
> checks whether parallel copy can be performed. Parallel copy cannot be
> performed in the following cases: a) if the relation is a temporary
> table, b) if the relation is a foreign table, c) if the relation has
> non-parallel-safe index expressions, d) if the relation has triggers
> of any type other than a before statement trigger, e) if the relation
> has check constraints which are not parallel safe, f) if the relation
> is partitioned and any partition falls under the above types. This
> patch has the checks for it and will be used by the parallel copy
> implementation.

How did you ensure that this is sufficient? For the parallel-insert
patch we have enabled parallel-mode for Inserts and ran the tests with
force_parallel_mode to see if we are not missing anything. Also, it
seems there are many common things here w.r.t. the parallel-insert
patch; is it possible to prepare this atop that patch, or do you have
any reason to keep this separate?

--
With Regards,
Amit Kapila.
On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 10, 2020 at 7:12 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > I have worked to provide a patch for the parallel safety checks. It
> > checks whether parallel copy can be performed. Parallel copy cannot
> > be performed in the following cases: a) if the relation is a
> > temporary table, b) if the relation is a foreign table, c) if the
> > relation has non-parallel-safe index expressions, d) if the relation
> > has triggers of any type other than a before statement trigger, e)
> > if the relation has check constraints which are not parallel safe,
> > f) if the relation is partitioned and any partition falls under the
> > above types. This patch has the checks for it and will be used by
> > the parallel copy implementation.
>
> How did you ensure that this is sufficient? For the parallel-insert
> patch we have enabled parallel-mode for Inserts and ran the tests with
> force_parallel_mode to see if we are not missing anything. Also, it
> seems there are many common things here w.r.t. the parallel-insert
> patch; is it possible to prepare this atop that patch, or do you have
> any reason to keep this separate?
>

I have done similar testing for copy too: I set force_parallel_mode to
regress, hardcoded the code to pick parallel workers for the copy
operation, and ran make installcheck-world to verify. Many checks in
this patch are common between both patches, but I was not sure how to
handle it as both the projects are in progress and are being updated
based on the reviewers' opinions. How to handle this? Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 11, 2020 at 10:42 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > How did you ensure that this is sufficient? For the parallel-insert
> > patch we have enabled parallel-mode for Inserts and ran the tests
> > with force_parallel_mode to see if we are not missing anything.
> > Also, it seems there are many common things here w.r.t. the
> > parallel-insert patch; is it possible to prepare this atop that
> > patch, or do you have any reason to keep this separate?
> >
>
> I have done similar testing for copy too: I set force_parallel_mode
> to regress, hardcoded the code to pick parallel workers for the copy
> operation, and ran make installcheck-world to verify. Many checks in
> this patch are common between both patches, but I was not sure how to
> handle it as both the projects are in progress and are being updated
> based on the reviewers' opinions. How to handle this? Thoughts?
>

I have not studied the differences in detail, but if it is possible to
prepare it on top of that patch then there shouldn't be a problem. To
avoid confusion, if you want, you can always either post the latest
version of that patch with your patch or point to it.

--
With Regards,
Amit Kapila.
On Thu, Oct 29, 2020 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.
>
> One more point, I have noticed that some time back [1], I have given
> one suggestion related to the way workers process the set of lines
> (aka chunk). I think you can try by increasing the chunk size to say
> 100, 500, 1000 and use some shared counter to remember the number of
> chunks processed.
>
Hi, I did some analysis on using a spinlock-protected worker write position (i.e. each worker acquires a spinlock on a shared write position to choose the next available chunk) vs each worker hopping to get the next available chunk position:
Use Case: 10mn rows, 5.6GB data, 2 indexes on integer columns, 1 index on text column. Results are of the form (no of workers, total exec time in sec, index insertion time in sec, worker write pos get time in sec, buffer contention event count):
With spinlock:
(1,1126.443,1060.067,0.478,0), (2,669.343,630.769,0.306,26), (4,346.297,326.950,0.161,89), (8,209.600,196.417,0.088,291), (16,166.113,157.086,0.065,1468), (20,173.884,166.013,0.067,2700), (30,173.087,1166.565,0.0065,5346)
Without spinlock:
(1,1119.695,1054.586,0.496,0), (2,645.733,608.313,1.5,8), (4,340.620,320.344,1.6,58), (8,203.985,189.644,1.3,222), (16,142.997,133.045,1,813), (20,132.621,122.527,1.1,1215), (30,135.737,126.716,1.5,2901)
With the spinlock, each worker gets the required write position quickly and proceeds further till the index insertion (which becomes a single point of contention), where we observed more buffer lock contention. The reason is that all the workers reach the index insertion point at around the same time.
Without the spinlock, each worker spends some time hopping to get the write position while the other workers are inserting into the indexes. So basically, the workers do not all reach the index insertion point at the same time, and hence there is less buffer lock contention.
The same behaviour (explained above) is observed with different worker chunk counts (default 64, 128, 512 and 1024), i.e. the number of tuples each worker caches into its local memory before inserting into the table.
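To model the two claiming schemes side by side (a standalone C11 sketch, with an atomic counter standing in for the spinlock-protected position; names are illustrative):

#include <stdatomic.h>
#include <stdint.h>

/* Scheme 1: a single shared counter; each worker gets its chunk in one
 * atomic instruction instead of scanning past chunks taken by others. */
static _Atomic uint32_t next_chunk;

static uint32_t
claim_chunk_counter(void)
{
    return atomic_fetch_add(&next_chunk, 1);
}

/* Scheme 2: "hopping"; scan the chunk array for the first entry that is
 * still unclaimed, retrying entries lost to races. */
static _Atomic uint8_t chunk_taken[1024];   /* hypothetical chunk count */

static int32_t
claim_chunk_hopping(uint32_t nchunks)
{
    for (uint32_t i = 0; i < nchunks; i++)
    {
        uint8_t expected = 0;

        if (atomic_compare_exchange_strong(&chunk_taken[i], &expected, 1))
            return (int32_t) i;
    }
    return -1;                  /* all chunks claimed */
}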
In summary: with the spinlock, it looks like we are able to avoid workers waiting to get the next chunk, which also means that we are not creating any contention point inside the parallel copy code. However, this causes another choking point, i.e. index insertion if indexes are present on the table, which is out of scope of the parallel copy code. We think that it would be good to use a spinlock-protected worker write position, or an atomic variable for the worker write position (as it performs equal to the spinlock, or a little better on some platforms). Thoughts?
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 29, 2020 at 11:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Oct 27, 2020 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> [latest version]
>
> I think the parallel-safety checks in this patch
> (v9-0002-Allow-copy-from-command-to-process-data-from-file) are
> incomplete and wrong. See below comments.
> 1.
> +static pg_attribute_always_inline bool
> +CheckExprParallelSafety(CopyState cstate)
> +{
> +   if (contain_volatile_functions(cstate->whereClause))
> +   {
> +       if (max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE)
> +           return false;
> +   }
>
> I don't understand the above check. Why do we only need to check where
> clause for parallel-safety when it contains volatile functions? It
> should be checked otherwise as well, no? The similar comment applies
> to other checks in this function. Also, I don't think there is a need
> to make this function inline.
>

I felt we should check whether the where clause is parallel safe and
also check that it does not contain volatile functions; this is to
avoid cases where expressions may query the table we're inserting into.
Modified it accordingly.

> 2.
> +/*
> + * IsParallelCopyAllowed
> + *
> + * Check if parallel copy can be allowed.
> + */
> +bool
> +IsParallelCopyAllowed(CopyState cstate)
> {
> ..
> + * When there are BEFORE/AFTER/INSTEAD OF row triggers on the table. We do
> + * not allow parallelism in such cases because such triggers might query
> + * the table we are inserting into and act differently if the tuples that
> + * have already been processed and prepared for insertion are not there.
> + * Now, if we allow parallelism with such triggers the behaviour would
> + * depend on if the parallel worker has already inserted or not that
> + * particular tuples.
> + */
> + if (cstate->rel->trigdesc != NULL &&
> + (cstate->rel->trigdesc->trig_insert_after_statement ||
> + cstate->rel->trigdesc->trig_insert_new_table ||
> + cstate->rel->trigdesc->trig_insert_before_row ||
> + cstate->rel->trigdesc->trig_insert_after_row ||
> + cstate->rel->trigdesc->trig_insert_instead_row))
> + return false;
> ..
>
> Why do we need to disable parallelism for before/after row triggers
> unless they have parallel-unsafe functions? I see a few lines down in
> this function you are checking parallel-safety of trigger functions,
> what is the use of the same if you are already disabling parallelism
> with the above check.
>

Currently only the before statement trigger is supported; the rest of
the triggers are not supported, and comments for the same are mentioned
atop the checks. Removed the parallel-safe check which was not required.

> 3. What about if the index on table has expressions that are
> parallel-unsafe? What is your strategy to check parallel-safety for
> partitioned tables?
>
> I suggest checking Greg's patch for parallel-safety of Inserts [1]. I
> think you will find that most of those checks are required here as
> well and see how we can use that patch (at least what is common). I
> feel the first patch should be just to have parallel-safety checks and
> we can test that by trying to enable Copy with force_parallel_mode. We
> can build the rest of the patch atop of it or in other words, let's
> move all parallel-safety work into a separate patch.
>

I have made this a separate patch as of now. I will work on seeing if I
can use Greg's changes as they are; if required, I will provide a few
review comments on top of Greg's patch so that it is usable for
parallel copy too, and later post a separate patch with the changes on
top of it. I will retain it as a separate patch till that time.

> Few assorted comments:
> ========================
> 1.
> +/*
> + * ESTIMATE_NODE_SIZE - Estimate the size required for node type in shared
> + * memory.
> + */
> +#define ESTIMATE_NODE_SIZE(list, listStr, strsize) \
> +{ \
> +   uint32 estsize = sizeof(uint32); \
> +   if ((List *)list != NIL) \
> +   { \
> +       listStr = nodeToString(list); \
> +       estsize += strlen(listStr) + 1; \
> +   } \
> + \
> +   strsize = add_size(strsize, estsize); \
> +}
>
> This can probably be a function instead of a macro.
>

Changed it to a function.

> 2.
> +/*
> + * ESTIMATE_1BYTE_STR_SIZE - Estimate the size required for 1Byte strings in
> + * shared memory.
> + */
> +#define ESTIMATE_1BYTE_STR_SIZE(src, strsize) \
> +{ \
> +   strsize = add_size(strsize, sizeof(uint8)); \
> +   strsize = add_size(strsize, (src) ? 1 : 0); \
> +}
>
> This could be an inline function.
>

Changed it to an inline function.

> 3.
> +/*
> + * SERIALIZE_1BYTE_STR - Copy 1Byte strings to shared memory.
> + */
> +#define SERIALIZE_1BYTE_STR(dest, src, copiedsize) \
> +{ \
> +   uint8 len = (src) ? 1 : 0; \
> +   memcpy(dest + copiedsize, (uint8 *) &len, sizeof(uint8)); \
> +   copiedsize += sizeof(uint8); \
> +   if (src) \
> +       dest[copiedsize++] = src[0]; \
> +}
>
> Similarly, this could be a function. I think keeping such things as
> macros in-between code makes it difficult to read. Please see if you
> can make these and similar macros as functions unless they are doing
> few memory instructions. Having functions makes it easier to debug the
> code as well.
>

Changed it to a function.

Attached v10 patch has the fixes for the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v10-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v10-0002-Check-if-parallel-copy-can-be-performed.patch
- v10-0003-Allow-copy-from-command-to-process-data-from-fil.patch
- v10-0004-Documentation-for-parallel-copy.patch
- v10-0005-Parallel-Copy-For-Binary-Format-Files.patch
- v10-0006-Tests-for-parallel-copy.patch
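For example, the ESTIMATE_NODE_SIZE macro quoted above could become a
function of roughly this shape (a guess at the v10 form, not a quote
from it; the caller shown is hypothetical):

/* Estimate the size required to serialize a node list into shared memory. */
static uint32
EstimateNodeSize(List *list, char **listStr)
{
    uint32      estsize = sizeof(uint32);

    if (list != NIL)
    {
        *listStr = nodeToString(list);
        estsize += strlen(*listStr) + 1;
    }
    return estsize;
}

/* Hypothetical call site, replacing the old macro invocation: */
strsize = add_size(strsize, EstimateNodeSize(cstate->attnumlist, &attnumlist_str));

Unlike the macro, the function has an explicit output parameter for the
serialized string instead of assigning to a captured name, which makes
the data flow visible at the call site and lets a debugger step into it.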
On Thu, Oct 29, 2020 at 2:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> > Attached v9 patches have the fixes for the above comments.
>
> I did some testing:
>
> /tmp/longdata.pl:
> --------
> #!/usr/bin/perl
> #
> # Generate three rows:
> # foo
> # longdatalongdatalongdata...
> # bar
> #
> # The length of the middle row is given as command line arg.
> #
>
> my $bytes = $ARGV[0];
>
> print "foo\n";
> for(my $i = 0; $i < $bytes; $i+=8){
> print "longdata";
> }
> print "\n";
> print "bar\n";
> --------
>
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> with (parallel 2);
>
> This gets stuck forever (or at least I didn't have the patience to wait
> it finish). Both worker processes are consuming 100% of CPU.
>
Thanks for identifying this issue; it is fixed in the v10 patch posted at [1]
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com
On Wed, Oct 28, 2020 at 5:36 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi
>
> I found some issue in v9-0002
>
> 1.
> +
> + elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
> + write_pos, lineInfo->first_block,
> + pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
> + offset, pg_atomic_read_u32(&lineInfo->line_size));
> +
>
> write_pos and the other variables printed here are of type uint32; I think it's better to use '%u' in the elog msg.
>
Modified it.
> 2.
> + * line_size will be set. Read the line_size again to be sure if it is
> + * completed or partial block.
> + */
> + dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> + if (dataSize)
>
> It uses dataSize (type int) to receive a uint32, which seems a little dangerous.
> Is it better to define dataSize as uint32 here?
>
Modified it.
> 3.
> Since functions with 'Cstate' in the name have been changed to 'CState',
> I think we can change the function PopulateCommonCstateInfo as well.
>
Modified it.
> 4.
> + if (pcdata->worker_line_buf_count)
>
> I think some check like the above can be 'if (xxx > 0)', which seems easier to understand.
Modified it.
Thanks for the comments; these issues are fixed in the v10 patch posted at [1]
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 29, 2020 at 2:26 PM Daniel Westermann (DWE) <daniel.westermann@dbi-services.com> wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> >> Attached v9 patches have the fixes for the above comments.
>
> >I did some testing:
>
> I did some testing as well and have a cosmetic remark:
>
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000000000);
> ERROR:  value 1000000000 out of bounds for option "parallel"
> DETAIL:  Valid values are between "1" and "1024".
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 100000000000);
> ERROR:  parallel requires an integer value
> postgres=#
>
> Wouldn't it make more sense to only have one error message? The first
> one seems to be the better message.
>

I had seen similar behavior in other places too:

postgres=# vacuum (parallel 1000000000) t1;
ERROR:  parallel vacuum degree must be between 0 and 1024
LINE 1: vacuum (parallel 1000000000) t1;
                ^
postgres=# vacuum (parallel 100000000000) t1;
ERROR:  parallel requires an integer value

I'm not sure if we should fix this.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 13, 2020 at 2:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 11, 2020 at 10:42 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > How did you ensure that this is sufficient? For the parallel-insert
> > > patch we have enabled parallel-mode for Inserts and ran the tests
> > > with force_parallel_mode to see if we are not missing anything.
> > > Also, it seems there are many common things here w.r.t. the
> > > parallel-insert patch; is it possible to prepare this atop that
> > > patch, or do you have any reason to keep this separate?
> > >
> >
> > I have done similar testing for copy too: I set force_parallel_mode
> > to regress, hardcoded the code to pick parallel workers for the copy
> > operation, and ran make installcheck-world to verify. Many checks in
> > this patch are common between both patches, but I was not sure how
> > to handle it as both the projects are in progress and are being
> > updated based on the reviewers' opinions. How to handle this?
> > Thoughts?
> >
>
> I have not studied the differences in detail, but if it is possible to
> prepare it on top of that patch then there shouldn't be a problem. To
> avoid confusion, if you want, you can always either post the latest
> version of that patch with your patch or point to it.
>

I have made this a separate patch as of now. I will work on seeing if I
can use Greg's changes as they are; if required, I will provide a few
review comments on top of Greg's patch so that it is usable for parallel
copy too, and later post a separate patch with the changes on top of it.
I will retain it as a separate patch till that time.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Sat, Oct 31, 2020 at 2:07 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> I've done a bit more testing today, and I think the parsing is busted in
> some way. Consider this:
>
> test=# create extension random;
> CREATE EXTENSION
>
> test=# create table t (a text);
> CREATE TABLE
>
> test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,10000);
> INSERT 0 10000
>
> test=# copy t to '/mnt/data/t.csv';
> COPY 10000
>
> test=# truncate t;
> TRUNCATE TABLE
>
> test=# copy t from '/mnt/data/t.csv';
> COPY 10000
>
> test=# truncate t;
> TRUNCATE TABLE
>
> test=# copy t from '/mnt/data/t.csv' with (parallel 2);
> ERROR: invalid byte sequence for encoding "UTF8": 0x00
> CONTEXT: COPY t, line 485: "m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB>F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
> parallel worker
>
>
> The functions come from an extension I use to generate random data, I've
> pushed it to github [1]. The random_string() generates a random string
> with ASCII characters, symbols and a couple special characters (\r\n\t).
> The intent was to try loading data where a fields may span multiple 64kB
> blocks and may contain newlines etc.
>
> The non-parallel copy works fine, the parallel one fails. I haven't
> investigated the details, but I guess it gets confused about where a
> string starts/end, or something like that.
>
Thanks for identifying this issue; it is fixed in the v10 patch posted at [1]
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Sat, Nov 7, 2020 at 7:01 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
> >
> > Hi
> >
> > >
> > > my $bytes = $ARGV[0];
> > > for(my $i = 0; $i < $bytes; $i+=8){
> > > print "longdata";
> > > }
> > > print "\n";
> > > --------
> > >
> > > postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> > > with (parallel 2);
> > >
> > > This gets stuck forever (or at least I didn't have the patience to wait
> > > it finish). Both worker processes are consuming 100% of CPU.
> >
> > I had a look over this problem.
> >
> > the ParallelCopyDataBlock has size limit:
> > uint8 skip_bytes;
> > char data[DATA_BLOCK_SIZE]; /* data read from file */
> >
> > It seems the input line is so long that the leader process runs out of the shared memory among parallel copy workers,
> > and the leader process keeps waiting for a free block.
> >
> > The worker process waits until line_state becomes LINE_LEADER_POPULATED,
> > but the leader process won't set the line_state unless it has read the whole line.
> >
> > So it is stuck forever.
> > Maybe we should reconsider this situation.
> >
> > The stack is as follows:
> >
> > Leader stack:
> > #3 0x000000000075f7a1 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
> > #4 0x00000000005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
> > #5 0x00000000005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536,
> > copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
> > #6 0x00000000005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
> > #7 0x00000000005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863
> >
> > Worker stack:
> > #0 GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
> > #1 0x00000000005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
> > #2 0x00000000005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
> > #3 0x00000000005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
> > #4 0x00000000005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
> > #5 0x00000000005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985
> >
>
> Thanks for providing your thoughts. I have analyzed this issue and am
> working on a fix; I will post a patch shortly.
>
I have fixed this and provided a patch at [1].
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
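To make the failure mode concrete: the quoted struct bounds each shared
block at DATA_BLOCK_SIZE bytes, and the shared area holds a fixed pool
of such blocks. A sketch of the arithmetic (the pool size N and the
struct layout beyond the quoted fields are assumptions for
illustration):

    typedef struct ParallelCopyDataBlock
    {
        uint8 skip_bytes;
        char  data[DATA_BLOCK_SIZE];    /* data read from file */
    } ParallelCopyDataBlock;

    /*
     * With N blocks in the shared pool, a single input line longer than
     * N * DATA_BLOCK_SIZE fills every block before its terminator is
     * seen.  The leader then waits in WaitGetFreeCopyBlock() for a
     * block to be freed, while every worker waits for line_state to
     * become LINE_LEADER_POPULATED, which the leader only sets once the
     * whole line is in shared memory: neither side can proceed.
     */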
Hi Vignesh,

I took a look at the v10 patch set. Here are some comments:

1.
+/*
+ * CheckExprParallelSafety
+ *
+ * Determine if where cluase and default expressions are parallel safe & do not
+ * have volatile expressions, return true if condition satisfies else return
+ * false.
+ */

'cluase' seems a typo.

2.
+       /*
+        * Make sure that no worker has consumed this element, if this
+        * line is spread across multiple data blocks, worker would have
+        * started processing, no need to change the state to
+        * LINE_LEADER_POPULATING in this case.
+        */
+       (void) pg_atomic_compare_exchange_u32(&lineInfo->line_state,
+                                             &current_line_state,
+                                             LINE_LEADER_POPULATED);

About the comment:

+        * started processing, no need to change the state to
+        * LINE_LEADER_POPULATING in this case.

Does it mean no need to change the state to LINE_LEADER_POPULATED here?

3.
+ * 3) only one worker should choose one line for processing, this is handled by
+ *    using pg_atomic_compare_exchange_u32, worker will change the state to
+ *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.

In the latest patch, it will set the state to LINE_WORKER_PROCESSING if
line_state is LINE_LEADER_POPULATED or LINE_LEADER_POPULATING, so the
comment here seems wrong.

4.
A suggestion for CacheLineInfo.

It uses appendBinaryStringXXX to store the line in memory.
appendBinaryStringXXX doubles the str memory when there is not enough space.

How about calling enlargeStringInfo in advance, if we already know the
whole line size? It can avoid some memory waste and may improve
performance a little.

Best regards,
houzj
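As context for comments 2 and 3: the pattern under discussion is a
compare-and-swap handoff between leader and workers. A minimal sketch
of the worker side (pg_atomic_compare_exchange_u32 is the real
PostgreSQL API; the surrounding names follow the quoted hunks, and
ProcessLine is a hypothetical placeholder):

    uint32 expected = LINE_LEADER_POPULATED;

    /*
     * Exactly one process can move the line from LINE_LEADER_POPULATED
     * to LINE_WORKER_PROCESSING: the CAS swaps only if line_state still
     * holds the expected value, and returns true on success.
     */
    if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
                                       &expected,
                                       LINE_WORKER_PROCESSING))
        ProcessLine(lineInfo);  /* hypothetical: this worker owns the line */
    /* On failure, 'expected' holds whatever state another process set. */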
Thanks for the comments.

> 1.
> +/*
> + * CheckExprParallelSafety
> + *
> + * Determine if where cluase and default expressions are parallel safe & do not
> + * have volatile expressions, return true if condition satisfies else return
> + * false.
> + */
>
> 'cluase' seems a typo.

Changed.

> 2.
> +       /*
> +        * Make sure that no worker has consumed this element, if this
> +        * line is spread across multiple data blocks, worker would have
> +        * started processing, no need to change the state to
> +        * LINE_LEADER_POPULATING in this case.
> +        */
> +       (void) pg_atomic_compare_exchange_u32(&lineInfo->line_state,
> +                                             &current_line_state,
> +                                             LINE_LEADER_POPULATED);
>
> About the comment:
>
> +        * started processing, no need to change the state to
> +        * LINE_LEADER_POPULATING in this case.
>
> Does it mean no need to change the state to LINE_LEADER_POPULATED here?

Yes, it is LINE_LEADER_POPULATED; changed accordingly.

> 3.
> + * 3) only one worker should choose one line for processing, this is handled by
> + *    using pg_atomic_compare_exchange_u32, worker will change the state to
> + *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
>
> In the latest patch, it will set the state to LINE_WORKER_PROCESSING if
> line_state is LINE_LEADER_POPULATED or LINE_LEADER_POPULATING, so the
> comment here seems wrong.

Updated the comments.

> 4.
> A suggestion for CacheLineInfo.
>
> It uses appendBinaryStringXXX to store the line in memory.
> appendBinaryStringXXX doubles the str memory when there is not enough space.
>
> How about calling enlargeStringInfo in advance, if we already know the
> whole line size? It can avoid some memory waste and may improve
> performance a little.

Here we will not know the size beforehand; in some cases we start
processing the data as soon as the current block is populated and keep
processing block by block, so we only come to know the size at the end.
We cannot use enlargeStringInfo because of this.

The attached v11 patch has the fix for this; it also includes the
changes to rebase on top of head.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v11-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v11-0002-Check-if-parallel-copy-can-be-performed.patch
- v11-0003-Allow-copy-from-command-to-process-data-from-fil.patch
- v11-0004-Documentation-for-parallel-copy.patch
- v11-0005-Parallel-Copy-For-Binary-Format-Files.patch
- v11-0006-Tests-for-parallel-copy.patch
> > 4.
> > A suggestion for CacheLineInfo.
> >
> > It uses appendBinaryStringXXX to store the line in memory.
> > appendBinaryStringXXX doubles the str memory when there is not enough space.
> >
> > How about calling enlargeStringInfo in advance, if we already know the
> > whole line size? It can avoid some memory waste and may improve
> > performance a little.
>
> Here we will not know the size beforehand; in some cases we start
> processing the data as soon as the current block is populated and keep
> processing block by block, so we only come to know the size at the end.
> We cannot use enlargeStringInfo because of this.
>
> The attached v11 patch has the fix for this; it also includes the
> changes to rebase on top of head.

Thanks for the explanation.

I think there is still a chance we can know the size.

+    * line_size will be set. Read the line_size again to be sure if it is
+    * completed or partial block.
+    */
+   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+   if (dataSize != -1)
+   {

If I am not wrong, this seems to be the branch that processes the
populated block. I think we can check copiedSize here: if copiedSize == 0,
that means dataSize is the size of the whole line, and in this case we
can do the enlarge.

Best regards,
houzj
On Mon, Dec 7, 2020 at 3:00 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> > The attached v11 patch has the fix for this; it also includes the
> > changes to rebase on top of head.
>
> Thanks for the explanation.
>
> I think there is still a chance we can know the size.
>
> +    * line_size will be set. Read the line_size again to be sure if it is
> +    * completed or partial block.
> +    */
> +   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> +   if (dataSize != -1)
> +   {
>
> If I am not wrong, this seems to be the branch that processes the
> populated block. I think we can check copiedSize here: if copiedSize == 0,
> that means dataSize is the size of the whole line, and in this case we
> can do the enlarge.

Yes, this optimization can be done; I will handle it in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
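A sketch of houzj's suggestion in place (variable names follow the
quoted hunk; the surrounding control flow is assumed for illustration,
while enlargeStringInfo and appendBinaryStringInfo are the real
stringinfo APIs):

    dataSize = pg_atomic_read_u32(&lineInfo->line_size);
    if (dataSize != -1)
    {
        /*
         * copiedSize == 0 means nothing of this line has been copied
         * yet, so dataSize is the full line length: reserve it once up
         * front rather than letting appendBinaryStringInfo double the
         * buffer repeatedly as the line is appended block by block.
         */
        if (copiedSize == 0)
            enlargeStringInfo(&cstate->line_buf, dataSize);

        /* ... copy the line data as before ... */
    }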
Hi

> Yes, this optimization can be done; I will handle it in the next
> patch set.

I have a suggestion for the parallel safety checks.

As designed, the leader does not participate in the insertion of data.
If the user uses (PARALLEL 1), there is only one worker process, which
will do the insertion.

IMO, we can skip some of the safety checks in this case, because the
safety checks are there to limit parallel insert (except temporary
tables or ...).

So, how about checking (PARALLEL 1) separately? Although it looks a
bit complicated, (PARALLEL 1) does give a good performance improvement.

Best regards,
houzj
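A minimal sketch of what this split could look like (nworkers and both
check functions are hypothetical names; which checks can actually be
relaxed is the open question, not something this sketch decides):

    if (nworkers == 1)
    {
        /*
         * A single worker means no concurrent inserters, so checks that
         * exist only to guard concurrent insertion could be skipped.
         * Checks that apply to running in any background worker at all
         * (e.g. temporary tables, which are invisible outside the
         * owning backend) must still run.
         */
        safe = IsWorkerSafe(rel);           /* hypothetical, reduced check */
    }
    else
        safe = IsParallelCopySafe(rel);     /* hypothetical, full check */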
On Wed, Dec 23, 2020 at 3:05 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi
>
> > Yes, this optimization can be done; I will handle it in the next
> > patch set.
>
> I have a suggestion for the parallel safety checks.
>
> As designed, the leader does not participate in the insertion of data.
> If the user uses (PARALLEL 1), there is only one worker process, which
> will do the insertion.
>
> IMO, we can skip some of the safety checks in this case, because the
> safety checks are there to limit parallel insert (except temporary
> tables or ...).
>
> So, how about checking (PARALLEL 1) separately? Although it looks a
> bit complicated, (PARALLEL 1) does give a good performance improvement.

Thanks for the comments, Hou Zhijie. I will run a few tests with 1
worker and try to include this in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed until the worker working on the previous block has finished
> > >> scanning the EOLs and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer, multiple consumer) used by the
> > > patch, a worker doesn't need to wait till the complete block is
> > > processed; it can start processing the lines already found. This will
> > > also allow workers to start processing the data much earlier, as they
> > > don't need to wait for all the offsets corresponding to the 64K block
> > > to be ready. However, the design where each worker processes a 64K
> > > block can lead to much longer waits. I think this will impact the
> > > Copy STDIN case more, where in most cases (200-300 byte tuples) we
> > > receive data line by line from the client and the leader finds the
> > > line endings. If the leader doesn't find the line endings, the
> > > workers need to wait till the leader fills the entire 64K chunk;
> > > OTOH, with the current approach a worker can start as soon as the
> > > leader has populated some minimum number of line endings.
> >
> > You can use a smaller block size.
>
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block. And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
>
> Okay, but this first phase (finding the line boundaries) cannot be
> done in parallel anyway, and we have seen in some of the initial
> benchmarking that this initial phase is a small part of the work,
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much whether this splitting is done in a single
> process or in multiple processes.

I wrote a patch to compare the performance of the current
implementation (the leader identifying the line boundaries) against a
design where the workers identify the line boundaries. The results, as
parallel copy times in seconds for each design at a given worker
count, are given below.
Use case 1 - 10 million rows, 5.2GB data, 3 indexes on integer columns:

  workers | leader identifies (s) | workers identify (s)
  --------+-----------------------+---------------------
        1 |               211.206 |              632.583
        2 |               165.402 |              360.152
        4 |               137.608 |              219.623
        8 |               128.003 |              206.851
       16 |               114.518 |              177.790
       20 |               109.257 |              170.058
       30 |               102.050 |              158.376

Use case 2 - 10 million rows, 5.2GB data, 2 indexes on integer columns,
1 index on a text column, csv file:

  workers | leader identifies (s) | workers identify (s)
  --------+-----------------------+---------------------
        1 |              1212.356 |             1602.118
        2 |               707.191 |              849.105
        4 |               369.620 |              441.068
        8 |               221.359 |              252.775
       16 |               167.152 |              180.207
       20 |               168.804 |              181.986
       30 |               172.320 |              194.875

Use case 3 - 10 million rows, 5.2GB data, without index:

  workers | leader identifies (s) | workers identify (s)
  --------+-----------------------+---------------------
        1 |                96.317 |              437.453
        2 |                70.730 |              240.517
        4 |                64.436 |              197.604
        8 |                67.186 |              175.630
       16 |                76.561 |              156.015
       20 |                81.025 |              150.687
       30 |                86.578 |              148.481

Use case 4 - 10000 records, 9.6GB, toast data:

  workers | leader identifies (s) | workers identify (s)
  --------+-----------------------+---------------------
        1 |               147.076 |              276.323
        2 |               101.610 |              141.893
        4 |               100.703 |              134.096
        8 |               112.583 |              134.765
       16 |               101.898 |              135.789
       20 |               109.258 |              135.625
       30 |               109.219 |              136.144

Attached is the patch that was used for this comparison; it is written
on top of the parallel copy patch. The design that Amit, Andres and I
voted for (the leader identifying the line boundaries and sharing them
in shared memory) performs better in every case above.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 28, 2020 at 3:14 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Attached is the patch that was used for this comparison; it is written
> on top of the parallel copy patch. The design that Amit, Andres and I
> voted for (the leader identifying the line boundaries and sharing them
> in shared memory) performs better in every case above.

Hi Hackers,

I see the following as some of the open problems with the parallel copy
feature:

1) The leader identifying the line/tuple boundaries from the file and
letting the workers pick and insert in parallel, vs. the leader reading
the file and letting the workers identify line/tuple boundaries and
insert.
2) Determining the parallel safety of partitioned tables.
3) Bulk extension of the relation while inserting, i.e. adding more
than one extra block to the relation in RelationAddExtraBlocks.

Please let me know if I'm missing anything.

For (1) - Vignesh's experiments above show that the leader identifying
the line/tuple boundaries from the file and letting the workers pick
and insert in parallel fares better.

For (2) - while it's being discussed in another thread (I'm not sure
what the status of that thread is), how about taking this feature
without support for partitioned tables, i.e. parallel copy is disabled
for partitioned tables? Once the other discussion reaches a logical
end, we can come back and enable parallel copy for partitioned tables.

For (3) - we need a way to extend or add new blocks quickly; fallocate
might help here. I'm not sure who's working on it; others can comment
better here.

Can we take the "parallel copy" feature forward, of course with some
restrictions in place? Thoughts?

Regards,
Bharath Rupireddy.
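For (3), a minimal sketch of the fallocate idea at the file level (an
illustration of the concept, not PostgreSQL's actual smgr code; 8192
is the default BLCKSZ, and the function name is hypothetical):

    #include <fcntl.h>

    /*
     * Grow an open relation segment by nblocks zero-filled blocks in
     * one filesystem call, instead of extending one 8kB page at a
     * time.  Returns 0 on success, an errno value on failure.
     */
    static int
    bulk_extend_file(int fd, off_t current_size, unsigned nblocks)
    {
        return posix_fallocate(fd, current_size, (off_t) nblocks * 8192);
    }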