On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
> >
...
> >
> > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line
> > returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote
> > state required.
> >
>
> AFAIU, the whole of this in-quote/out-of-quote state is managed inside
> CopyReadLineText which will be done by each of the parallel workers,
> something on the lines of what Thomas did in his patch [1].
> Basically, we need to invent a mechanism to allocate chunks to
> individual workers and then the whole processing will be done as we
> are doing now except for special handling for partial tuples which I
> have explained in my previous email. Am I missing something here?
>
The problem case that I see is the chunk boundary falling in the
middle of a quoted field, where:
- The quote opens in chunk 1
- The quote closes in chunk 2
- There is an EoL character between the start of chunk 2 and the closing quote
When the worker processing chunk 2 starts, it believes itself to be in
out-of-quote state, so only data between the start of the chunk and
the EoL is regarded as belonging to the partial line. From that point
on the parsing of the rest of the chunk goes off track.
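To make the failure concrete, here is a minimal Python sketch. It is not
PostgreSQL code: split_lines() is a hypothetical stand-in for the line
splitting that CopyReadLineText does, it only tracks double quotes, and
the sample data and chunk boundary are made up for illustration.

```python
def split_lines(data, in_quote=False):
    """Split CSV text at newlines that fall outside double-quoted fields.

    Returns (lines, in_quote_at_end). The last element of lines is the
    trailing partial line (empty if the data ended on a line boundary).
    Simplification: only '"' is tracked, no configurable escape character.
    """
    lines, current = [], []
    for ch in data:
        if ch == '"':
            in_quote = not in_quote
        if ch == "\n" and not in_quote:
            lines.append("".join(current))
            current = []
        else:
            current.append(ch)
    lines.append("".join(current))
    return lines, in_quote


# Two rows; the first has a quoted field containing an EoL character.
data = 'a,"x\ny",b\nc,d,e\n'
boundary = 3  # hypothetical chunk boundary, just after the opening quote
chunk1, chunk2 = data[:boundary], data[boundary:]

# Reference result: one process scanning the whole input.
whole, _ = split_lines(data)

# Worker 2 starting cold, assuming it is out-of-quote.
naive, naive_state = split_lines(chunk2, in_quote=False)

# Worker 2 if it somehow knew the true state at the boundary.
_, state_after_chunk1 = split_lines(chunk1)
informed, _ = split_lines(chunk2, in_quote=state_after_chunk1)

print(whole)
print(naive, naive_state)
print(informed)
```

With the wrong starting assumption, worker 2 takes "x" as the tail of
the partial line, then treats the closing quote as an opening one and
swallows the rest of the chunk into a single unterminated field.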
Some of the resulting errors can be avoided by, for instance,
requiring a quote to be preceded by a delimiter or EoL. That answer
fails when fields end with EoL characters, which happens often enough
in the wild.
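A small Python sketch of why that heuristic breaks, under my reading of
it: a quote counts as an opening quote only if it follows a delimiter,
an EoL, or the start of the data. The function name and the sample row
are hypothetical.

```python
def looks_like_opening_quote(data, i, delimiter=","):
    """Heuristic: a quote at index i opens a field if it is preceded by
    a delimiter, an EoL character, or the start of the data."""
    if data[i] != '"':
        return False
    return i == 0 or data[i - 1] in (delimiter, "\n")


# One row whose quoted field ends with an EoL character: "note\n"
row = 'a,"note\n",b\n'
open_idx = row.index('"')    # the real opening quote
close_idx = row.rindex('"')  # the real closing quote

print(looks_like_opening_quote(row, open_idx))
print(looks_like_opening_quote(row, close_idx))
```

Both calls return True, so a worker relying on this test would take the
closing quote for an opening one and end up in the wrong state again.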
Recovering from an incorrect in-quote/out-of-quote state assumption at
the start of parsing a chunk just seems like a hole with no bottom. So
it looks to me like it's best done in a single process which can keep
track of that state reliably.
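As a rough illustration of the single-process approach, the Python
sketch below carries the in-quote state across chunk boundaries and
yields only complete rows, which could then be handed to workers. The
name iter_records() and the chunking are assumptions for the example,
not anything in the existing COPY code, and only double quotes are
tracked.

```python
def iter_records(chunks):
    """Yield complete rows from an iterable of raw text chunks.

    The in-quote state and the partial row survive from one chunk to
    the next, so a chunk boundary inside a quoted field is harmless.
    """
    in_quote = False
    current = []
    for chunk in chunks:
        for ch in chunk:
            if ch == '"':
                in_quote = not in_quote
            if ch == "\n" and not in_quote:
                yield "".join(current)
                current = []
            else:
                current.append(ch)
    if current:
        # Final row without a trailing EoL.
        yield "".join(current)


# The first boundary falls inside the quoted field, the second mid-row.
chunks = ['a,"', 'x\ny",b\nc,', 'd,e\n']
records = list(iter_records(chunks))
print(records)
```

The rows come out intact regardless of where the chunk boundaries
fall, because there is only one reader and it never has to guess its
starting state.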
--
Alastair