Re: multiline CSV fields - Mailing list pgsql-hackers

From Patrick B Kelly
Subject Re: multiline CSV fields
Date
Msg-id F82E5F5D-3435-11D9-B14C-000A958A3956@patrickbkelly.org
In response to Re: multiline CSV fields  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: multiline CSV fields  (Andrew Dunstan <andrew@dunslane.net>)
Re: multiline CSV fields  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Nov 11, 2004, at 2:56 PM, Andrew Dunstan wrote:

>
>
> Tom Lane wrote:
>
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>
>>> Patrick B Kelly wrote:
>>>
>>>> Actually, when I try to export a sheet with multi-line cells from 
>>>> excel, it tells me that this feature is incompatible with the CSV 
>>>> format and will not include them in the CSV file.
>>>>
>>
>>
>>> It probably depends on the version. I have just tested with Excel 
>>> 2000 on a WinXP machine and it both read and wrote these files.
>>>
>>
>> I'd be inclined to define Excel 2000 as broken, honestly, if it's
>> writing unescaped newlines as data.  To support this would mean throwing
>> away most of our ability to detect incorrectly formatted CSV files.
>> A simple error like a missing close quote would look to the machine like
>> the rest of the file is a single long data line where all the newlines
>> are embedded in data fields.  How likely is it that you'll get a useful
>> error message out of that?  Most likely the error message would point to
>> the end of the file, or at least someplace well removed from the actual
>> mistake.
>>
>> I would vote in favor of removing the current code that attempts to
>> support unquoted newlines, and waiting to see if there are complaints.
>>
>>
>>
>
> This feature was specifically requested when we discussed what sort of 
> CSVs we would handle.
>
> And it does in fact work as long as the newline style is the same.
>
> I just had an idea. How about if we add a new CSV option MULTILINE. If 
> absent, then on output we would not output unescaped LF/CR characters 
> and on input we would not allow fields with embedded unescaped LF/CR 
> characters. In both cases we could error out for now, with perhaps an 
> 8.1 TODO to provide some other behaviour.
>
> Or we could drop the whole multiline "feature" for now and make the 
> whole thing an 8.1 item, although it would be a bit of a pity when it 
> does work in what will surely be the most common case.
>

What about just coding an FSM into 
backend/commands/copy.c:CopyReadLine() that does not treat any flavor 
of NL character as a line terminator while it is inside a data field?
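
To make the idea concrete, here is a rough sketch of the kind of 
two-state scanner I mean. It is not the existing CopyReadLine() code, 
and the names csv_state and csv_next_record are made up for 
illustration. It assumes a single quote character, that an embedded 
quote is written as a doubled quote, and that a record may end in LF, 
CR, or CR LF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-state scanner; not the actual CopyReadLine() code. */
typedef enum
{
    CSV_OUTSIDE_QUOTES,
    CSV_INSIDE_QUOTES
} csv_state;

/*
 * Scan buf[0..len) for the first logical CSV record.
 *
 * On success, *rec_len is the record length without its terminator and
 * *consumed is the number of bytes used including the terminator.
 * Returns false if the input ends before a newline is seen outside
 * quotes (for example, because a close quote is missing); in that case
 * both outputs are set to len.
 */
bool
csv_next_record(const char *buf, size_t len, char quote,
                size_t *rec_len, size_t *consumed)
{
    csv_state   state = CSV_OUTSIDE_QUOTES;
    size_t      i;

    assert(buf != NULL && rec_len != NULL && consumed != NULL);

    for (i = 0; i < len; i++)
    {
        char        c = buf[i];

        if (c == quote)
        {
            /*
             * Toggle on every quote.  A doubled quote inside a field
             * toggles twice, so the state ends up unchanged.
             */
            state = (state == CSV_OUTSIDE_QUOTES)
                ? CSV_INSIDE_QUOTES : CSV_OUTSIDE_QUOTES;
        }
        else if (state == CSV_OUTSIDE_QUOTES && (c == '\n' || c == '\r'))
        {
            *rec_len = i;
            /* Treat CR LF as a single terminator. */
            if (c == '\r' && i + 1 < len && buf[i + 1] == '\n')
                i++;
            *consumed = i + 1;
            return true;
        }
        /* Any newline seen while inside quotes is plain field data. */
    }

    *rec_len = len;
    *consumed = len;
    return false;
}
```

Note that this does not answer Tom's objection: with a missing close 
quote the scanner stays in the inside-quotes state to the end of the 
input and only then reports failure, so the error still surfaces far 
from the actual mistake.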


Patrick B. Kelly
------------------------------------------------------
http://patrickbkelly.org


