Re: Parallel copy - Mailing list pgsql-hackers

From: David Fetter
Subject: Re: Parallel copy
Date: 2020-02-15 17:51:05
Msg-id: 20200215175105.GY24870@fetter.org
In response to: Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Sat, Feb 15, 2020 at 06:02:06PM +0530, Amit Kapila wrote:
> On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> >
> > On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
> > > >
> > ...
> > > >
> > > > Parsing rows from the raw input (the work done by CopyReadLine())
> > > > in a single process would accommodate line returns in quoted
> > > > fields. I don't think there's a way of getting parallel workers to
> > > > manage the in-quote/out-of-quote state required.
> > > >
> > >
> > > AFAIU, the whole of this in-quote/out-of-quote state is managed inside
> > > CopyReadLineText, which will be done by each of the parallel workers,
> > > something along the lines of what Thomas did in his patch [1].
> > > Basically, we need to invent a mechanism to allocate chunks to
> > > individual workers, and then the whole processing will be done as we
> > > are doing now, except for special handling of partial tuples, which I
> > > have explained in my previous email.  Am I missing something here?
> > >
> > The problem case that I see is the chunk boundary falling in the
> > middle of a quoted field where
> >  - The quote opens in chunk 1
> >  - The quote closes in chunk 2
> >  - There is an EoL character between the start of chunk 2 and the closing quote
> >
> > When the worker processing chunk 2 starts, it believes itself to be in
> > out-of-quote state, so only data between the start of the chunk and
> > the EoL is regarded as belonging to the partial line. From that point
> > on the parsing of the rest of the chunk goes off track.
> >
> > Some of the resulting errors can be avoided by, for instance,
> > requiring a quote to be preceded by a delimiter or EoL. That answer
> > fails when fields end with EoL characters, which happens often enough
> > in the wild.
> >
> > Recovering from an incorrect in-quote/out-of-quote state assumption at
> > the start of parsing a chunk just seems like a hole with no bottom. So
> > it looks to me like it's best done in a single process which can keep
> > track of that state reliably.
> >
> 
> Good point, and I agree with you that having a single process would
> avoid any such issues.  However, I will think some more on it, and if
> you or anyone else has an idea on how to deal with this in a
> multi-worker system (where we can allow each worker to read and
> process the chunk), then feel free to share your thoughts.

I see two pieces of this puzzle: input formats we control, and ones
we don't.

In the former case, we could encode all fields with base85 (or
something similar that reduces the input alphabet efficiently), then
reserve bytes that denote delimiters of various types. ASCII has
separators for file, group, record, and unit that we could use as
inspiration.
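
Here's a rough sketch of the shape of that idea (in Python purely for
brevity; the helper names and byte choices are illustrative, not a
proposal for the actual COPY format). Because base85 output is confined
to printable ASCII, the separator bytes can never occur inside an
encoded field, so a worker can split its chunk on record separators
without tracking any quote state:

    import base64

    US = b"\x1f"   # ASCII unit separator: between fields
    RS = b"\x1e"   # ASCII record separator: between rows

    def encode_row(fields):
        # b85encode emits only printable ASCII, so US/RS can never
        # appear inside an encoded field.
        return US.join(base64.b85encode(f) for f in fields) + RS

    def split_chunk(chunk):
        # A worker can split on RS without tracking any quote state;
        # the trailing partial record is handed to the next worker.
        *complete, partial = chunk.split(RS)
        return complete, partial

    stream = encode_row([b'a,b', b'field with\na "newline"', b'c']) * 3
    complete, partial = split_chunk(stream[:50])

The obvious cost is that every field has to be decoded on the way in,
but the chunk-splitting step becomes trivial and embarrassingly
parallel.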

I don't have anything to offer for free-form input other than to agree
that it looks like a hole with no bottom, and maybe we should just
keep that process serial, at least until someone finds a bottom.
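
For what it's worth, here's a toy example of my own (again Python, and
not taken from Alastair's message) of why a worker starting mid-chunk
can't locally recover the quote state:

    # A quoted field containing a newline, split at an arbitrary offset:
    data = b'1,"field with\nan embedded newline",2\n3,x,4\n'
    chunk1, chunk2 = data[:8], data[8:]
    # chunk1 == b'1,"field'
    # chunk2 == b' with\nan embedded newline",2\n3,x,4\n'
    #
    # A worker handed chunk2, assuming it starts out-of-quote, treats
    # b' with' as the tail of a partial line and then parses
    # b'an embedded newline",2' as a fresh line.  The stray quote in the
    # middle of a field sends the parse off track, and nothing local to
    # chunk2 can tell the worker otherwise.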

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


