Re: Should CSV parsing be stricter about mid-field quotes? - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Should CSV parsing be stricter about mid-field quotes?
Date
Msg-id CAM-w4HOEwz13f2aekQAORq+K7KFCO_iVEg3v0NtjZQ1XQgRhXQ@mail.gmail.com
In response to Should CSV parsing be stricter about mid-field quotes?  ("Joel Jacobson" <joel@compiler.org>)
List pgsql-hackers
On Thu, 11 May 2023 at 10:04, Joel Jacobson <joel@compiler.org> wrote:
>
> The parser currently accepts quoting within an unquoted field. This can lead to
> data misinterpretation when the quote is part of the field data (e.g.,
> for inches, like in the example).

I think you're thinking about it differently than the parser. I think
the parser is treating this the way, say, the shell treats quotes.
That is, it sees a quoted "I bought this for my 6" followed by an
unquoted "a laptop but it didn't fit my 8" followed by a quoted "
tablet".
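
To make that shell-style reading concrete, here is a rough Python sketch. It
is not PostgreSQL's actual parser code; the function name split_shell_style
and the sample input are made up for illustration, and the input is only my
guess at the kind of line under discussion. A quote anywhere in a field
toggles quoted mode, and a doubled quote inside quoted mode is a literal
quote.

```python
def split_shell_style(record, delim=",", quote='"'):
    """Split one record, letting quotes open and close anywhere in a field.

    Illustrative sketch only: this mimics the shell-like interpretation
    described above, not any real parser's source code.
    """
    fields = []
    buf = []
    in_quote = False
    i = 0
    n = len(record)
    while i < n:
        c = record[i]
        if in_quote:
            if c == quote:
                # A doubled quote inside quoted mode is one literal quote.
                if i + 1 < n and record[i + 1] == quote:
                    buf.append(quote)
                    i += 2
                    continue
                # Otherwise the quote ends quoted mode; the field continues.
                in_quote = False
            else:
                # Delimiters and newlines are ordinary data while quoted.
                buf.append(c)
        elif c == quote:
            # A quote starts quoted mode even in the middle of a field.
            in_quote = True
        elif c == delim:
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(c)
        i += 1
    fields.append("".join(buf))
    return fields


# Hypothetical input: the inch marks were meant as data, not as quoting.
line = '1,"I bought this for my 6" laptop but it didn\'t fit my 8" tablet"'
print(split_shell_style(line))
```

Under this reading the line parses without complaint, but the two inch marks
are consumed as quoting and silently disappear from the second field.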

So for example, in that world you might only quote commas and newlines
so you might print something like

1,2,I bought this for my "6"" laptop
" but it "didn't" fit my "8""" laptop

The actual CSV spec, https://datatracker.ietf.org/doc/html/rfc4180, only
allows fully quoted or fully unquoted fields. Doublequote characters may
appear only inside quoted fields, escaped by doubling them, and not at
all in unquoted fields.
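
As a rough illustration of that rule, a single raw field could be checked
against the RFC 4180 grammar like this. This is only a sketch: the names
RFC4180_FIELD and is_rfc4180_field are hypothetical, and the regular
expression is my paraphrase of the grammar in Section 2 of the RFC, not
anything taken from an existing implementation.

```python
import re

# Either an unquoted field with no quote, comma, CR or LF in it,
# or a fully quoted field in which every inner quote is doubled.
RFC4180_FIELD = re.compile(r'[^",\r\n]*|"(?:[^"]|"")*"')


def is_rfc4180_field(raw):
    """Return True if the raw (still-quoted) field text fits RFC 4180."""
    return RFC4180_FIELD.fullmatch(raw) is not None


print(is_rfc4180_field('plain text'))          # fully unquoted
print(is_rfc4180_field('"my 6"" laptop"'))     # fully quoted, quote doubled
print(is_rfc4180_field('my "6"" laptop"'))     # quote opens mid-field
```

The first two calls should report True and the third False, since its quote
opens in the middle of an otherwise unquoted field.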

But it also says

      Due to lack of a single specification, there are considerable
      differences among implementations.  Implementors should "be
      conservative in what you do, be liberal in what you accept from
      others" (RFC 793 [8]) when processing CSV files.  An attempt at a
      common definition can be found in Section 2.


So the real question is: are there tools out there that generate
entries like this, and if so, what are their intentions?

> I think we should throw a parsing error for unescaped mid-field quotes,
> and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
> quotes are necessary. The error message could suggest this option when it
> encounters an unescaped mid-field quote.
>
> I think the convenience of not having to use an extra option doesn't outweigh
> the risk of undetected data integrity issues.

It's also a pretty annoying experience to get a message saying "error,
turn this option on to not get an error". I get what you're saying
too, though. Which is the bigger risk depends on whether turning off the
error is really the right thing most of the time or is just causing data
to be read incorrectly.



-- 
greg


