Hi hackers,
I've come across an unexpected behavior in our CSV parser that I'd like to
bring up for discussion.
% cat example.csv
id,rating,review
1,5,"Great product, will buy again."
2,3,"I bought this for my 6" laptop but it didn't fit my 8" tablet"
% psql
CREATE TABLE reviews (id int, rating int, review text);
\COPY reviews FROM example.csv WITH CSV HEADER;
SELECT * FROM reviews;
This gives:
id | rating | review
----+--------+-------------------------------------------------------------
1 | 5 | Great product, will buy again.
2 | 3 | I bought this for my 6 laptop but it didn't fit my 8 tablet
(2 rows)
The parser currently accepts quoting within an unquoted field. This can lead to
data misinterpretation when the quote is part of the field data (e.g.,
for inches, like in the example).
Our CSV output rules quote an entire field or not at all. But the import of
fields with mid-field quotes might lead to surprising and undetected outcomes.
I think we should throw a parsing error for unescaped mid-field quotes,
and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
quotes are necessary. The error message could suggest this option when it
encounters an unescaped mid-field quote.
I think the convenience of not having to use an extra option doesn't outweigh
the risk of undetected data integrity issues.
Thoughts?
/Joel