Re: Parallel copy - Mailing list pgsql-hackers

From	Alastair Turner
Subject	Re: Parallel copy
Date	February 14, 2020 13:45:55
Msg-id	CAC0Gmyxf8xV9bbvPaJMEepDGC3cUoe=SQObzr4sMU8Ps8rptsg@mail.gmail.com
In response to	Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>)
Responses	Re: Parallel copy
List	pgsql-hackers

On Fri, 14 Feb 2020 at 11:57, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

...

> > Another approach that came up during an offlist discussion with Robert
> > is that we have one dedicated worker for reading the chunks from file
> > and it copies the complete tuples of one chunk in the shared memory
> > and once that is done, hands that chunk over to another worker which
> > can process tuples in that area. We can imagine that the reader
> > worker is responsible to form some sort of work queue that can be
> > processed by the other workers. In this idea, we won't be able to get
> > the benefit of initial tokenization (forming tuple boundaries) via
> > parallel workers and might need some additional memory processing as
> > after reader worker has handed the initial shared memory segment, we
> > need to somehow identify tuple boundaries and then process them.

Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line breaks in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required. A single worker could also process a stream without having to reread or rewind, so it would be able to process input from STDIN or PROGRAM sources, making the improvements applicable to load operations done by third-party tools and scripted \copy in psql.
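
To illustrate the quote-state point, here is a minimal sketch in Python rather than the backend's C. It is not CopyReadLine(); the function name split_rows, the buffer size and the simplified CSV rules (double-quote as the only quote character, "" as the escaped quote) are assumptions made for the example. It reads the stream strictly forward, so it would work on a pipe as well as on a file, and it carries the in-quote flag across buffer boundaries.

```python
import io

def split_rows(stream, quote='"', bufsize=8192):
    """Yield raw rows from a text stream in one forward pass.

    Simplified CSV rules (an assumption, not the full COPY grammar):
    a newline ends a row only when we are outside a quoted field.
    An escaped quote written as "" toggles the flag twice, so it
    leaves the in-quote state unchanged.
    """
    in_quote = False
    row = []
    while True:
        buf = stream.read(bufsize)
        if not buf:
            break
        for ch in buf:
            if ch == quote:
                in_quote = not in_quote
            if ch == "\n" and not in_quote:
                # Row boundary: hand the complete row to the consumer.
                yield "".join(row)
                row = []
            else:
                row.append(ch)
    if row:
        # Final row without a trailing newline.
        yield "".join(row)

data = 'id,note\n1,"first line\nsecond line"\n2,plain\n'
rows = list(split_rows(io.StringIO(data)))
naive = data.splitlines()  # splits inside the quoted field as well
```

Here rows holds three entries, with the quoted newline kept inside the second one, while naive holds four. A worker starting at an arbitrary byte offset has no way of knowing the value of in_quote at that point without scanning from the beginning, which is the difficulty with splitting the file before rows have been identified.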

>

...

> > Another thing we need to figure out is how many workers to use for
> > the copy command. I think we can choose it based on the file size, which
> > needs some experiments, or maybe based on user input.
>
> It seems like we don't even really have a general model for that sort
> of thing in the rest of the system yet, and I guess some kind of
> fairly dumb explicit system would make sense in the early days...
>

makes sense.

The ratio between chunking or line-parsing processes and the parallel worker pool would vary with the width of the table, the complexity of the data or file (dates, encoding conversions), the complexity of constraints and the acceptable impact of the load. Being able to control it through user input would be great.
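
As a rough sketch of what user control could look like, the following Python example uses one reader feeding a bounded queue and a pool whose size is passed in by the caller. The names parallel_load, n_workers and batch_size are hypothetical, threads stand in for backend worker processes, and splitting on commas stands in for the real per-row work (type conversion, constraint checks).

```python
import queue
import threading

def parallel_load(rows, n_workers, batch_size=2):
    """One reader batches rows; n_workers consumers process the batches."""
    # Bounded queue, so the reader cannot run far ahead of the workers.
    work = queue.Queue(maxsize=4 * n_workers)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            batch = work.get()
            if batch is None:  # sentinel: no more work for this worker
                break
            # Stand-in for the expensive per-row processing.
            parsed = [tuple(r.split(",")) for r in batch]
            with lock:
                results.extend(parsed)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    # The reader: group already-split rows into batches.
    batch = []
    for r in rows:
        batch.append(r)
        if len(batch) == batch_size:
            work.put(batch)
            batch = []
    if batch:
        work.put(batch)

    # One sentinel per worker, then wait for all of them to finish.
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return results

loaded = parallel_load(["1,a", "2,b", "3,c"], n_workers=2)
```

The order of loaded depends on scheduling, so callers that care about order would need to tag batches. With a wide table or expensive conversions the workers dominate and a larger n_workers helps; with cheap rows the single reader becomes the limit, which is why a fixed ratio is unlikely to suit every load.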

Alastair
