Re: Parallel copy - Mailing list pgsql-hackers
From | Andrew Dunstan |
---|---|
Subject | Re: Parallel copy |
Date | |
Msg-id | c74e4d42-900c-26a8-df59-13684b154f74@2ndQuadrant.com |
In response to | Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Parallel copy
Re: Parallel copy |
List | pgsql-hackers |
On 2/15/20 7:32 AM, Amit Kapila wrote:
> On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
>> On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>> ...
>>>> Parsing rows from the raw input (the work done by CopyReadLine()) in a
>>>> single process would accommodate line returns in quoted fields. I don't
>>>> think there's a way of getting parallel workers to manage the
>>>> in-quote/out-of-quote state required.
>>>>
>>> AFAIU, the whole of this in-quote/out-of-quote state is managed inside
>>> CopyReadLineText which will be done by each of the parallel workers,
>>> something on the lines of what Thomas did in his patch [1].
>>> Basically, we need to invent a mechanism to allocate chunks to
>>> individual workers and then the whole processing will be done as we
>>> are doing now except for special handling for partial tuples which I
>>> have explained in my previous email. Am I missing something here?
>>>
>> The problem case that I see is the chunk boundary falling in the
>> middle of a quoted field where
>> - The quote opens in chunk 1
>> - The quote closes in chunk 2
>> - There is an EoL character between the start of chunk 2 and the closing quote
>>
>> When the worker processing chunk 2 starts, it believes itself to be in
>> out-of-quote state, so only data between the start of the chunk and
>> the EoL is regarded as belonging to the partial line. From that point
>> on the parsing of the rest of the chunk goes off track.
>>
>> Some of the resulting errors can be avoided by, for instance,
>> requiring a quote to be preceded by a delimiter or EoL. That answer
>> fails when fields end with EoL characters, which happens often enough
>> in the wild.
>>
>> Recovering from an incorrect in-quote/out-of-quote state assumption at
>> the start of parsing a chunk just seems like a hole with no bottom. So
>> it looks to me like it's best done in a single process which can keep
>> track of that state reliably.
>>
> Good point and I agree with you that having a single process would
> avoid any such stuff. However, I will think some more on it and if
> you/anyone else gets some idea on how to deal with this in a
> multi-worker system (where we can allow each worker to read and
> process the chunk) then feel free to share your thoughts.
>

IIRC, in_quote only matters here in CSV mode (because CSV fields can
have embedded newlines). So why not just forbid parallel copy in CSV
mode, at least for now? I guess it depends on the actual use case. If we
expect to be parallel loading humungous CSVs then that won't fly.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
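
To make the chunk-boundary problem discussed in this thread concrete, here is a minimal Python sketch. It is not PostgreSQL code and does not reflect how CopyReadLineText is implemented; the function names (find_line_ends, split_and_scan), the sample data, and the boundary offset are all hypothetical and chosen only for illustration. It assumes a simplified CSV dialect where the double quote is both the quote and the escape character, so each quote character simply toggles the in-quote state.

```python
def find_line_ends(data, in_quote=False, quote='"'):
    """Scan data and return (record_end_offsets, final_in_quote_state).

    A newline ends a record only when the scanner is outside a quoted
    field. Every quote character toggles the state, which also handles
    doubled quotes ("") because they toggle twice.
    """
    ends = []
    for i, ch in enumerate(data):
        if ch == quote:
            in_quote = not in_quote
        elif ch == "\n" and not in_quote:
            ends.append(i + 1)  # offset just past the terminating newline
    return ends, in_quote


def split_and_scan(data, boundary, carry_state):
    """Scan data as two chunks split at boundary.

    carry_state=True hands chunk 1's final quote state to chunk 2, which
    is what a single sequential process gets for free. carry_state=False
    models a worker that assumes it starts outside a quoted field.
    """
    first, second = data[:boundary], data[boundary:]
    ends1, state = find_line_ends(first)
    start_state = state if carry_state else False
    ends2, _ = find_line_ends(second, in_quote=start_state)
    return ends1 + [boundary + e for e in ends2]


# Three records; the second has a newline embedded in a quoted field.
data = 'id,note\n1,"first line\nsecond line"\n2,plain\n'
boundary = 16  # hypothetical chunk boundary, inside the quoted field

serial, _ = find_line_ends(data)
naive = split_and_scan(data, boundary, carry_state=False)
carried = split_and_scan(data, boundary, carry_state=True)

print("single process:      ", serial)
print("chunk 2 assumes out: ", naive)
print("state carried over:  ", carried)
```

In this sketch the worker that assumes out-of-quote state treats the embedded newline as a record end, then reads the closing quote as an opening one and misses every later record boundary in its chunk. Carrying the state forward fixes it, but that requires chunk 1 to be scanned first, which is the serialization the thread is trying to avoid.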