Home > mailing lists

Re: Parallel copy - Mailing list pgsql-hackers

From	Alastair Turner
Subject	Re: Parallel copy
Date	February 15, 2020 10:38:07
Msg-id	CAC0Gmyw4iHXLPJvvA1gPQXa01P=PrGznSgYXfxE-nA506A2RMg@mail.gmail.com Whole thread
In response to	Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>)
Responses	Re: Parallel copy
List	pgsql-hackers

Tree view

On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
> >
...
> >
> > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line
returnsin quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote
staterequired.

> >
>
> AFAIU, the whole of this in-quote/out-of-quote state is manged inside
> CopyReadLineText which will be done by each of the parallel workers,
> something on the lines of what Thomas did in his patch [1].
> Basically, we need to invent a mechanism to allocate chunks to
> individual workers and then the whole processing will be done as we
> are doing now except for special handling for partial tuples which I
> have explained in my previous email.  Am, I missing something here?
>
The problem case that I see is the chunk boundary falling in the
middle of a quoted field where
 - The quote opens in chunk 1
 - The quote closes in chunk 2
 - There is an EoL character between the start of chunk 2 and the closing quote

When the worker processing chunk 2 starts, it believes itself to be in
out-of-quote state, so only data between the start of the chunk and
the EoL is regarded as belonging to the partial line. From that point
on the parsing of the rest of the chunk goes off track.

Some of the resulting errors can be avoided by, for instance,
requiring a quote to be preceded by a delimiter or EoL. That answer
fails when fields end with EoL characters, which happens often enough
in the wild.

Recovering from an incorrect in-quote/out-of-quote state assumption at
the start of parsing a chunk just seems like a hole with no bottom. So
it looks to me like it's best done in a single process which can keep
track of that state reliably.

--
Aastair

pgsql-hackers by date:

From: Pavel Stehule
Date: 15 February 2020, 10:06:18
Subject: Re: [Proposal] Global temporary tables

From: Amit Kapila
Date: 15 February 2020, 12:32:06
Subject: Re: Parallel copy

Re: Parallel copy - Mailing list pgsql-hackers

Previous

Next