Re: Should CSV parsing be stricter about mid-field quotes? - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Should CSV parsing be stricter about mid-field quotes?
Date
Msg-id 47b5c6a7-5c0e-40aa-8ea2-c7b95ccf296f@app.fastmail.com
Whole thread Raw
In response to Re: Should CSV parsing be stricter about mid-field quotes?  (Noah Misch <noah@leadboat.com>)
Responses Re: Should CSV parsing be stricter about mid-field quotes?
List pgsql-hackers
On Sun, Jul 2, 2023, at 07:45, Noah Misch wrote:
> On Sat, May 20, 2023 at 09:16:30AM +0200, Joel Jacobson wrote:
>> On Fri, May 19, 2023, at 18:06, Daniel Verite wrote:
>> > COPY FROM file CSV somewhat differs as your example shows,
>> > but it still mishandle \. when unquoted. For instance, consider this
>> > file to load with COPY    t FROM '/tmp/t.csv' WITH CSV
>> > $ cat /tmp/t.csv
>> > line 1
>> > \.
>> > line 3
>> > line 4
>> >
>> > It results in having only "line 1" being imported.
>> 
>> Hmm, this is a problem for one of the new use-cases I brought up that would be
>> possible with DELIMITER NONE QUOTE NONE, i.e. to import unstructured log files,
>> where each raw line should be imported "as is" into a single text column.
>> 
>> Is there a valid reason why \. is needed for COPY FROM filename?
>
> No.
>
>> It seems to me it would only be necessary for the COPY FROM STDIN case,
>> since files have a natural end-of-file and a known file size.
>
> Right.  Even for COPY FROM STDIN, it's not needed anymore since the 
> removal of
> protocol v2.  psql would still use it to find the end of inline COPY 
> data,
> though.  Here's another relevant thread:
> https://postgr.es/m/flat/bfcd57e4-8f23-4c3e-a5db-2571d09208e2%40beta.fastmail.com

I was very pleased to see commit 7702337:

    Do not treat \. as an EOF marker in CSV mode for COPY IN.

Great job!

Thanks to this fix, maybe there is now interest to resume the discussion on
the ideas discussed in this thread?

Recap of ideas:

1. Stricter parsing, reject mid-field quotes

The idea is to prevent balanced mid-field quotes from being silently removed.

Example:

% cat example.csv
id,post
1,<p>Hello there!</p>
2,<a href="http://example.com">Click me!</a>

% psql
# \copy posts from example.csv with csv header;
COPY 2
# SELECT * FROM posts;
 id |                   post
----+------------------------------------------
  1 | <p>Hello there!</p>
  2 | <a href=http://example.com>Click me!</a>
(2 rows)

Note how the quotes around the URL disappeared.

2. Avoid needing hacks like using E'\x01' as quoting char.

Introduce QUOTE NONE and DELIMITER NONE,
to allow raw lines to be imported "as is" into a single text column.

Best regards,

Joel



pgsql-hackers by date:

Previous
From: Greg Sabino Mullane
Date:
Subject: Re: New PostgreSQL Contributors
Next
From: Peter Geoghegan
Date:
Subject: Re: POC, WIP: OR-clause support for indexes