On Thu, 11 May 2023 at 10:04, Joel Jacobson <joel@compiler.org> wrote:
>
> The parser currently accepts quoting within an unquoted field. This can lead to
> data misinterpretation when the quote is part of the field data (e.g.,
> for inches, like in the example).
I think you're thinking about it differently than the parser. I think
the parser is treating this the way, say, the shell treats quotes.
That is, it sees a quoted "I bought this for my 6" followed by an
unquoted "a laptop but it didn't fit my 8" followed by a quoted "
tablet".
So for example, in that world you might only quote commas and newlines
so you might print something like
1,2,I bought this for my "6"" laptop
" but it "didn't" fit my "8""" laptop
The actual CSV spec https://datatracker.ietf.org/doc/html/rfc4180 only
allows fully quoted or fully unquoted fields and there can only be
escaped double-doublequote characters in quoted fields and no
doublequote characters in unquoted fields.
But it also says
Due to lack of a single specification, there are considerable
differences among implementations. Implementors should "be
conservative in what you do, be liberal in what you accept from
others" (RFC 793 [8]) when processing CSV files. An attempt at a
common definition can be found in Section 2.
So the real question is are there tools out there that generate
entries like this and what are their intentions?
> I think we should throw a parsing error for unescaped mid-field quotes,
> and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
> quotes are necessary. The error message could suggest this option when it
> encounters an unescaped mid-field quote.
>
> I think the convenience of not having to use an extra option doesn't outweigh
> the risk of undetected data integrity issues.
It's also a pretty annoying experience to get a message saying "error,
turn this option on to not get an error". I get what you're saying
too, which is more of a risk depends on whether turning off the error
is really the right thing most of the time or is just causing data to
be read incorrectly.
--
greg