Re: Parallel copy - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Parallel copy |
Date | |
Msg-id | CAA4eK1LmRW6pN4cuLyyg5rsayjOHuqGOtpOsDgycQC1OGX9naA@mail.gmail.com |
In response to | Re: Parallel copy (Alastair Turner <minion@decodable.me>) |
Responses | Re: Parallel copy |
List | pgsql-hackers |
On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>
> On Fri, 14 Feb 2020 at 11:57, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> >
>> > On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> ...
>>
>> > > Another approach that came up during an offlist discussion with Robert
>> > > is that we have one dedicated worker for reading the chunks from file
>> > > and it copies the complete tuples of one chunk in the shared memory
>> > > and once that is done, it hands over that chunk to another worker which
>> > > can process tuples in that area.  We can imagine that the reader
>> > > worker is responsible to form some sort of work queue that can be
>> > > processed by the other workers.  In this idea, we won't be able to get
>> > > the benefit of initial tokenization (forming tuple boundaries) via
>> > > parallel workers and might need some additional memory processing as
>> > > after reader worker has handed the initial shared memory segment, we
>> > > need to somehow identify tuple boundaries and then process them.
>
> Parsing rows from the raw input (the work done by CopyReadLine()) in a
> single process would accommodate line returns in quoted fields. I don't
> think there's a way of getting parallel workers to manage the
> in-quote/out-of-quote state required.
>

AFAIU, the whole of this in-quote/out-of-quote state is managed inside
CopyReadLineText, which will be done by each of the parallel workers,
something along the lines of what Thomas did in his patch [1].
Basically, we need to invent a mechanism to allocate chunks to
individual workers, and then the whole processing will be done as we
are doing now, except for special handling for partial tuples, which I
have explained in my previous email.  Am I missing something here?

> ...
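To make the quoting concern concrete, here is a minimal Python sketch. It is
not taken from Thomas's patch or from copy.c, and the names scan_chunk and
rows_from_chunks are hypothetical. It assumes CSV-style input in which the
quote character also escapes itself, and it walks the chunks in order,
carrying the partial tuple and the in-quote flag from one chunk to the next.
How a parallel worker would learn that starting state without such a
sequential walk is the open question discussed above.

```python
def scan_chunk(chunk, in_quote=False, quote='"'):
    """Split one chunk into complete rows plus a trailing partial row.

    Returns (rows, partial, in_quote); in_quote is the state at chunk end.
    Assumption: the escape character equals the quote character, so a
    doubled quote toggles the flag twice and leaves it unchanged.
    """
    rows = []
    start = 0
    for i, ch in enumerate(chunk):
        if ch == quote:
            in_quote = not in_quote
        elif ch == "\n" and not in_quote:
            # A newline outside quotes ends a row.
            rows.append(chunk[start:i])
            start = i + 1
    return rows, chunk[start:], in_quote


def rows_from_chunks(data, chunk_size):
    """Reassemble rows from fixed-size chunks, carrying state across them."""
    rows, carry, in_quote = [], "", False
    for off in range(0, len(data), chunk_size):
        found, partial, in_quote = scan_chunk(data[off:off + chunk_size], in_quote)
        if found:
            # The first row of this chunk completes the carried partial tuple.
            found[0] = carry + found[0]
            carry = partial
        else:
            # No row boundary in this chunk: it is all part of one tuple.
            carry += partial
        rows.extend(found)
    if carry:
        rows.append(carry)
    return rows


# One row contains a newline inside a quoted field.
data = 'a,"x\ny"\nb,z\n'
print(rows_from_chunks(data, chunk_size=4))
```

With a chunk size of 4, the second chunk starts inside the quoted field.
Scanning it with in_quote=False would report a row boundary at the embedded
newline, which is why the starting state (or some equivalent handling of
partial tuples) has to be known for each chunk.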
>> > > Another thing we need to figure out is how many workers to use for
>> > > the copy command.  I think we can decide it based on the file size,
>> > > which needs some experiments, or maybe based on user input.
>> >
>> > It seems like we don't even really have a general model for that sort
>> > of thing in the rest of the system yet, and I guess some kind of
>> > fairly dumb explicit system would make sense in the early days...
>> >
>>
>> makes sense.
>
> The ratio between chunking or line parsing processes and the parallel
> worker pool would vary with the width of the table, complexity of the
> data or file (dates, encoding conversions), complexity of constraints
> and acceptable impact of the load. Being able to control it through
> user input would be great.
>

Okay, I think one simple way could be that we compute the number of
workers based on file size (some experiments are required to determine
this) unless the user has given the input.  If the user has provided
the input, then we can use that, with max_parallel_workers as the upper
limit.

[1] - https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
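The worker-count rule proposed above (derive it from the file size unless the
user specifies a value, and cap the result at max_parallel_workers) could be
sketched roughly as follows. The function name choose_copy_workers and the
bytes_per_worker default are hypothetical placeholders. The thread says
experiments are needed to find the real size-to-worker relationship, so the
64 MB figure is an assumption for illustration only.

```python
def choose_copy_workers(file_size_bytes, max_parallel_workers,
                        requested=None,
                        bytes_per_worker=64 * 1024 * 1024):
    """Pick a worker count for a parallel copy.

    requested: the user's explicit setting, or None if not given.
    bytes_per_worker: ASSUMED tuning constant, not a measured value.
    """
    if requested is not None:
        # User input wins, but never exceeds the configured upper limit.
        return max(0, min(requested, max_parallel_workers))
    # Otherwise scale with file size; small files get no parallel workers.
    by_size = file_size_bytes // bytes_per_worker
    return min(by_size, max_parallel_workers)


mb = 1024 * 1024
print(choose_copy_workers(200 * mb, max_parallel_workers=8))
print(choose_copy_workers(200 * mb, max_parallel_workers=8, requested=16))
```

A linear size-based rule is only the simplest option. As noted in the quoted
reply, table width, data complexity, and constraints also affect the best
ratio, which is the argument for letting user input override the computed
value.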