Re: Parallel copy - Mailing list pgsql-hackers
From | David Fetter |
---|---|
Subject | Re: Parallel copy |
Date | |
Msg-id | 20200215175105.GY24870@fetter.org |
In response to | Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-hackers |
On Sat, Feb 15, 2020 at 06:02:06PM +0530, Amit Kapila wrote:
> On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> >
> > On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
> > >
> > > ...
> > > >
> > > > Parsing rows from the raw input (the work done by CopyReadLine())
> > > > in a single process would accommodate line returns in quoted
> > > > fields. I don't think there's a way of getting parallel workers to
> > > > manage the in-quote/out-of-quote state required.
> > > >
> > >
> > > AFAIU, the whole of this in-quote/out-of-quote state is managed inside
> > > CopyReadLineText which will be done by each of the parallel workers,
> > > something on the lines of what Thomas did in his patch [1].
> > > Basically, we need to invent a mechanism to allocate chunks to
> > > individual workers and then the whole processing will be done as we
> > > are doing now except for special handling for partial tuples which I
> > > have explained in my previous email. Am I missing something here?
> > >
> > The problem case that I see is the chunk boundary falling in the
> > middle of a quoted field where
> > - The quote opens in chunk 1
> > - The quote closes in chunk 2
> > - There is an EoL character between the start of chunk 2 and the closing quote
> >
> > When the worker processing chunk 2 starts, it believes itself to be in
> > out-of-quote state, so only data between the start of the chunk and
> > the EoL is regarded as belonging to the partial line. From that point
> > on the parsing of the rest of the chunk goes off track.
> >
> > Some of the resulting errors can be avoided by, for instance,
> > requiring a quote to be preceded by a delimiter or EoL. That answer
> > fails when fields end with EoL characters, which happens often enough
> > in the wild.
> >
> > Recovering from an incorrect in-quote/out-of-quote state assumption at
> > the start of parsing a chunk just seems like a hole with no bottom. So
> > it looks to me like it's best done in a single process which can keep
> > track of that state reliably.
> >
>
> Good point and I agree with you that having a single process would
> avoid any such stuff. However, I will think some more on it and if
> you/anyone else gets some idea on how to deal with this in a
> multi-worker system (where we can allow each worker to read and
> process the chunk) then feel free to share your thoughts.

I see two pieces of this puzzle: an input format we control, and the
ones we don't.

In the former case, we could encode all fields with base85 (or
something similar that reduces the input alphabet efficiently), then
reserve bytes that denote delimiters of various types. ASCII has
separators for file, group, record, and unit that we could use as
inspiration.

I don't have anything to offer for free-form input other than to agree
that it looks like a hole with no bottom, and maybe we should just
keep that process serial, at least until someone finds a bottom.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
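
To make the two cases in this message concrete, here is a minimal Python sketch. It is an illustration only, not PostgreSQL code and not something proposed in the thread as an implementation. The function names (naive_csv_chunks, encode_rows, split_chunks, decode_chunk) are hypothetical, and using the ASCII unit separator (0x1F) between fields and the record separator (0x1E) between rows is an assumption made for the example. The first part shows free-form CSV going off track when a chunk boundary falls inside a quoted field. The second part shows that, once fields are base85-encoded and the separators are reserved bytes, a chunk can be cut at any record separator and decoded without knowing anything about earlier chunks.

```python
import base64
import csv
import io

US = b"\x1f"  # ASCII unit separator, used between fields (assumption for this sketch)
RS = b"\x1e"  # ASCII record separator, used between rows (assumption for this sketch)


def naive_csv_chunks(text, cut):
    """Parse two pieces of CSV text independently.

    Each piece is parsed as if it started in out-of-quote state, which is
    the assumption a worker handed an arbitrary chunk would have to make.
    """
    pieces = (text[:cut], text[cut:])
    return [list(csv.reader(io.StringIO(p, newline=""))) for p in pieces]


def encode_rows(rows):
    """Encode rows of byte-string fields.

    Base85 output uses printable characters only, so it can never contain
    the US or RS bytes; every record is terminated by RS.
    """
    return b"".join(US.join(base64.b85encode(f) for f in row) + RS for row in rows)


def split_chunks(data, n_chunks):
    """Cut data into roughly equal chunks, moving each cut to just past the next RS."""
    cuts = [0]
    for i in range(1, n_chunks):
        target = max(cuts[-1], len(data) * i // n_chunks)
        pos = data.find(RS, target)
        cuts.append(len(data) if pos == -1 else pos + 1)
    cuts.append(len(data))
    return [data[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]


def decode_chunk(chunk):
    """Decode one chunk on its own; no state from other chunks is required."""
    records = chunk.split(RS)[:-1]  # drop the empty piece after the final RS
    return [[base64.b85decode(f) for f in rec.split(US)] for rec in records]
```

A worker-per-chunk arrangement would then amount to calling decode_chunk on each element of split_chunks(encode_rows(rows), n) and concatenating the results in order. With naive_csv_chunks, by contrast, cutting the text 'a,"x\ny",b\nc,d,e\n' after the fourth character yields rows that differ from a single-pass parse, which is the failure described in the quoted text. Rows with zero fields are not handled by this sketch.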