Re: Parallel copy - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Parallel copy
Date
Msg-id CAA4eK1LmRW6pN4cuLyyg5rsayjOHuqGOtpOsDgycQC1OGX9naA@mail.gmail.com
Whole thread Raw
In response to Re: Parallel copy  (Alastair Turner <minion@decodable.me>)
Responses Re: Parallel copy
List pgsql-hackers
On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>
> On Fri, 14 Feb 2020 at 11:57, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> >
>> > On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>  ...
>>
>> > > Another approach that came up during an offlist discussion with Robert
>> > > is that we have one dedicated worker for reading the chunks from file
>> > > and it copies the complete tuples of one chunk in the shared memory
>> > > and once that is done, a handover that chunks to another worker which
>> > > can process tuples in that area.  We can imagine that the reader
>> > > worker is responsible to form some sort of work queue that can be
>> > > processed by the other workers.  In this idea, we won't be able to get
>> > > the benefit of initial tokenization (forming tuple boundaries) via
>> > > parallel workers and might need some additional memory processing as
>> > > after reader worker has handed the initial shared memory segment, we
>> > > need to somehow identify tuple boundaries and then process them.
>
>
> Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns
inquoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state
required.
>

AFAIU, the whole of this in-quote/out-of-quote state is manged inside
CopyReadLineText which will be done by each of the parallel workers,
something on the lines of what Thomas did in his patch [1].
Basically, we need to invent a mechanism to allocate chunks to
individual workers and then the whole processing will be done as we
are doing now except for special handling for partial tuples which I
have explained in my previous email.  Am, I missing something here?

>>
>> >
>
> ...
>>
>>
>> > > Another thing we need to figure out is the how many workers to use for
>> > > the copy command.  I think we can use it based on the file size which
>> > > needs some experiments or may be based on user input.
>> >
>> > It seems like we don't even really have a general model for that sort
>> > of thing in the rest of the system yet, and I guess some kind of
>> > fairly dumb explicit system would make sense in the early days...
>> >
>>
>> makes sense.
>
> The ratio between chunking or line parsing processes and the parallel worker pool would vary with the width of the
table,complexity of the data or file (dates, encoding conversions), complexity of constraints and acceptable impact of
theload. Being able to control it through user input would be great. 
>

Okay, I think one simple way could be that we compute the number of
workers based on filesize (some experiments are required to determine
this) unless the user has given the input.  If the user has provided
the input then we can use that with an upper limit to
max_parallel_workers.


[1] - https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.
Next
From: Bryn Llewellyn
Date:
Subject: Re: jsonb_object() seems to be buggy. jsonb_build_object() is good.