Thread: multiline CSV fields

multiline CSV fields

From: Andrew Dunstan
Darcy Buskermolen has drawn my attention to unfortunate behaviour of 
COPY CSV with fields containing embedded line end chars if the embedded 
sequence isn't the same as those of the file containing the CSV data. In 
that case we error out when reading the data in. This means there are 
cases where we can produce a CSV data file which we can't read in, which 
is not at all pleasant.

Possible approaches to the problem:
. make it a documented limitation
. have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that 
relaxes some of the restrictions on inconsistent line endings
. escape embedded line end chars

The last really isn't an option, because the whole point of CSVs is to 
play with other programs, and my understanding is that those that 
understand multiline fields (e.g. Excel) expect them not to be escaped, 
and do not produce them escaped.

So right now I'm tossing up in my head between the first two options. Or 
maybe there's another solution I haven't thought of.

Thoughts?

cheers

andrew


Re: multiline CSV fields

From: Patrick B Kelly
On Nov 10, 2004, at 6:10 PM, Andrew Dunstan wrote:

> The last really isn't an option, because the whole point of CSVs is to 
> play with other programs, and my understanding is that those that 
> understand multiline fields (e.g. Excel) expect them not to be 
> escaped, and do not produce them escaped.
>

Actually, when I try to export a sheet with multi-line cells from 
excel, it tells me that this feature is incompatible with the CSV 
format and will not include them in the CSV file.


Patrick B. Kelly
------------------------------------------------------
http://patrickbkelly.org



Re: multiline CSV fields

From: Andrew Dunstan

Patrick B Kelly wrote:

>
> On Nov 10, 2004, at 6:10 PM, Andrew Dunstan wrote:
>
>> The last really isn't an option, because the whole point of CSVs is 
>> to play with other programs, and my understanding is that those that 
>> understand multiline fields (e.g. Excel) expect them not to be 
>> escaped, and do not produce them escaped.
>>
>
> Actually, when I try to export a sheet with multi-line cells from 
> excel, it tells me that this feature is incompatible with the CSV 
> format and will not include them in the CSV file.
>
>
>

It probably depends on the version. I have just tested with Excel 2000 
on a WinXP machine and it both read and wrote these files.

cheers

andrew


Re: multiline CSV fields

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> Patrick B Kelly wrote:
>> Actually, when I try to export a sheet with multi-line cells from 
>> excel, it tells me that this feature is incompatible with the CSV 
>> format and will not include them in the CSV file.

> It probably depends on the version. I have just tested with Excel 2000 
> on a WinXP machine and it both read and wrote these files.

I'd be inclined to define Excel 2000 as broken, honestly, if it's
writing unescaped newlines as data.  To support this would mean throwing
away most of our ability to detect incorrectly formatted CSV files.
A simple error like a missing close quote would look to the machine like
the rest of the file is a single long data line where all the newlines
are embedded in data fields.  How likely is it that you'll get a useful
error message out of that?  Most likely the error message would point to
the end of the file, or at least someplace well removed from the actual
mistake.
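
For instance, given a made-up fragment like

    1,"alpha",2
    3,"beta,4
    5,gamma,6
    7,delta,8

where the close quote after beta has been lost, a reader that accepts
unescaped newlines inside quoted fields sees one quoted field running
from "beta to the end of the file, and any error is reported at EOF
instead of at line 2.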

I would vote in favor of removing the current code that attempts to
support unquoted newlines, and waiting to see if there are complaints.
        regards, tom lane


Re: multiline CSV fields

From: Andrew Dunstan

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>  
>
>>Patrick B Kelly wrote:
>>    
>>
>>>Actually, when I try to export a sheet with multi-line cells from 
>>>excel, it tells me that this feature is incompatible with the CSV 
>>>format and will not include them in the CSV file.
>>>      
>>>
>
>  
>
>>It probably depends on the version. I have just tested with Excel 2000 
>>on a WinXP machine and it both read and wrote these files.
>>    
>>
>
>I'd be inclined to define Excel 2000 as broken, honestly, if it's
>writing unescaped newlines as data.  To support this would mean throwing
>away most of our ability to detect incorrectly formatted CSV files.
>A simple error like a missing close quote would look to the machine like
>the rest of the file is a single long data line where all the newlines
>are embedded in data fields.  How likely is it that you'll get a useful
>error message out of that?  Most likely the error message would point to
>the end of the file, or at least someplace well removed from the actual
>mistake.
>
>I would vote in favor of removing the current code that attempts to
>support unquoted newlines, and waiting to see if there are complaints.
>
>
>  
>

This feature was specifically requested when we discussed what sort of 
CSVs we would handle.

And it does in fact work as long as the newline style is the same.

I just had an idea. How about if we add a new CSV option MULTILINE. If 
absent, then on output we would not output unescaped LF/CR characters 
and on input we would not allow fields with embedded unescaped LF/CR 
characters. In both cases we could error out for now, with perhaps an 
8.1 TODO to provide some other behaviour.
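
To make that concrete, the syntax might look something like this
(hypothetical, of course):

copy mytable to '/tmp/mytable.csv' csv multiline;
copy mytable from '/tmp/mytable.csv' csv multiline;

and without MULTILINE an embedded LF/CR would be an error in both
directions, instead of something we can write but not read back.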

Or we could drop the whole multiline "feature" for now and make the 
whole thing an 8.1 item, although it would be a bit of a pity when it 
does work in what will surely be the most common case.

cheers

andrew




Re: multiline CSV fields

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I would vote in favor of removing the current code that attempts to
> support unquoted newlines, and waiting to see if there are complaints.

Uhm. *raises hand*

I agree with your argument but one way or another I have to load these CSVs
I'm given. And like it or not virtually all the CSVs people get are going to
be coming from Excel. So far with 7.4 I've just opened them up in Emacs and
removed the newlines, but it's a royal pain in the arse.

-- 
greg



Re: multiline CSV fields

From: David Fetter
On Thu, Nov 11, 2004 at 03:38:16PM -0500, Greg Stark wrote:
> 
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
> > I would vote in favor of removing the current code that attempts
> > to support unquoted newlines, and waiting to see if there are
> > complaints.
> 
> Uhm. *raises hand*
> 
> I agree with your argument but one way or another I have to load
> these CSVs I'm given. And like it or not virtually all the CSVs
> people get are going to be coming from Excel. So far with 7.4 I've
> just opened them up in Emacs and removed the newlines, but it's a
> royal pain in the arse.

Meanwhile, check out dbi-link.  It lets you query against DBI data
sources including DBD::Excel :)

http://pgfoundry.org/projects/dbi-link/

Bug reports welcome.

Cheers,
D
-- 
David Fetter david@fetter.org http://fetter.org/
phone: +1 510 893 6100   mobile: +1 415 235 3778

Remember to vote!


Re: multiline CSV fields

From: Patrick B Kelly
On Nov 11, 2004, at 2:56 PM, Andrew Dunstan wrote:

>
>
> Tom Lane wrote:
>
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>
>>> Patrick B Kelly wrote:
>>>
>>>> Actually, when I try to export a sheet with multi-line cells from 
>>>> excel, it tells me that this feature is incompatible with the CSV 
>>>> format and will not include them in the CSV file.
>>>>
>>
>>
>>> It probably depends on the version. I have just tested with Excel 
>>> 2000 on a WinXP machine and it both read and wrote these files.
>>>
>>
>> I'd be inclined to define Excel 2000 as broken, honestly, if it's
>> writing unescaped newlines as data.  To support this would mean 
>> throwing
>> away most of our ability to detect incorrectly formatted CSV files.
>> A simple error like a missing close quote would look to the machine 
>> like
>> the rest of the file is a single long data line where all the newlines
>> are embedded in data fields.  How likely is it that you'll get a 
>> useful
>> error message out of that?  Most likely the error message would point 
>> to
>> the end of the file, or at least someplace well removed from the 
>> actual
>> mistake.
>>
>> I would vote in favor of removing the current code that attempts to
>> support unquoted newlines, and waiting to see if there are complaints.
>>
>>
>>
>
> This feature was specifically requested when we discussed what sort of 
> CSVs we would handle.
>
> And it does in fact work as long as the newline style is the same.
>
> I just had an idea. How about if we add a new CSV option MULTILINE. If 
> absent, then on output we would not output unescaped LF/CR characters 
> and on input we would not allow fields with embedded unescaped LF/CR 
> characters. In both cases we could error out for now, with perhaps an 
> 8.1 TODO to provide some other behaviour.
>
> Or we could drop the whole multiline "feature" for now and make the 
> whole thing an 8.1 item, although it would be a bit of a pity when it 
> does work in what will surely be the most common case.
>

What about just coding a FSM into 
backend/commands/copy.c:CopyReadLine() that does not process any flavor 
of NL characters when it is inside of a data field?


Patrick B. Kelly
------------------------------------------------------
http://patrickbkelly.org



Re: multiline CSV fields

From: Andrew Dunstan

Patrick B Kelly wrote:

>
>
> What about just coding a FSM into 
> backend/commands/copy.c:CopyReadLine() that does not process any 
> flavor of NL characters when it is inside of a data field?
>
>
>
It would be a major change - the routine doesn't read data a field at a 
time, and has no idea if we are even in CSV mode at all. It would be 
rather late in the dev cycle to be making such changes, I suspect.

cheers

andrew


Re: multiline CSV fields

From: Tom Lane
Patrick B Kelly <pbk@patrickbkelly.org> writes:
> What about just coding a FSM into 
> backend/commands/copy.c:CopyReadLine() that does not process any flavor 
> of NL characters when it is inside of a data field?

CopyReadLine has no business tracking that.  One reason why not is that
it is dealing with data not yet converted out of the client's encoding,
which makes matching to user-specified quote/escape characters
difficult.
        regards, tom lane


Re: multiline CSV fields

From: Patrick B Kelly
On Nov 11, 2004, at 6:16 PM, Tom Lane wrote:

> Patrick B Kelly <pbk@patrickbkelly.org> writes:
>> What about just coding a FSM into
>> backend/commands/copy.c:CopyReadLine() that does not process any 
>> flavor
>> of NL characters when it is inside of a data field?
>
> CopyReadLine has no business tracking that.  One reason why not is that
> it is dealing with data not yet converted out of the client's encoding,
> which makes matching to user-specified quote/escape characters
> difficult.
>
>             regards, tom lane
>

I appreciate what you are saying about the encoding and you are, of 
course, right but CopyReadLine is already processing the NL characters 
and it is doing it without considering the context in which they 
appear. Unfortunately, the same character(s) are used for two different 
purposes in the files in question. Without considering whether they 
appear inside or outside of data fields, CopyReadLine will mistake one 
for the other and cannot correctly do what it is already trying to do 
which is break the input file into lines.

My suggestion is to simply have CopyReadLine recognize these two states 
(in-field and out-of-field) and execute the current logic only while in 
the second state. It would not be too hard but as you mentioned it is 
non-trivial.


Patrick B. Kelly
------------------------------------------------------
http://patrickbkelly.org



Re: multiline CSV fields

From: Andrew Dunstan

Patrick B Kelly wrote:

>
>
>
> My suggestion is to simply have CopyReadLine recognize these two 
> states (in-field and out-of-field) and execute the current logic only 
> while in the second state. It would not be too hard but as you 
> mentioned it is non-trivial.
>
>
>

We don't know what state we expect the end of line to be in until after 
we have actually read the line. To know how to treat the end of line on 
your scheme we would have to parse as we go rather than after reading 
the line as now. Changing this would not only be non-trivial but 
significantly invasive to the code.

cheers

andrew


Re: multiline CSV fields

From: Patrick B Kelly
On Nov 11, 2004, at 10:07 PM, Andrew Dunstan wrote:

>
>
> Patrick B Kelly wrote:
>
>>
>>
>>
>> My suggestion is to simply have CopyReadLine recognize these two 
>> states (in-field and out-of-field) and execute the current logic only 
>> while in the second state. It would not be too hard but as you 
>> mentioned it is non-trivial.
>>
>>
>>
>
> We don't know what state we expect the end of line to be in until 
> after we have actually read the line. To know how to treat the end of 
> line on your scheme we would have to parse as we go rather than after 
> reading the line as now. Changing this would not only be 
> non-trivial but significantly invasive to the code.
>
>

Perhaps I am misunderstanding the code. As I read it the code currently 
goes through the input character by character looking for NL and EOF 
characters. It appears to be very well structured for what I am 
proposing. The section in question is a small and clearly defined loop 
which reads the input one character at a time and decides when it has 
reached the end of the line or file. Each call of CopyReadLine attempts 
to get one more line. I would propose that each time it starts out in 
the out-of-field state and the state is toggled by each un-escaped 
quote that it encounters in the stream. When in the in-field state, it 
would only look for the next un-escaped quote and while in the 
out-of-field state, it would execute the existing logic as well as 
looking for the next un-escaped quote.

I may not be explaining myself well or I may fundamentally 
misunderstand how copy works. I would be happy to code the change and 
send it to you for review, if you would be interested in looking it 
over and it is felt to be a worthwhile capability.



Patrick B. Kelly
------------------------------------------------------
http://patrickbkelly.org



Re: multiline CSV fields

From: Bruce Momjian
Can I see an example of such a failure line?

---------------------------------------------------------------------------

Andrew Dunstan wrote:
> 
> Darcy Buskermolen has drawn my attention to unfortunate behaviour of 
> COPY CSV with fields containing embedded line end chars if the embedded 
> sequence isn't the same as those of the file containing the CSV data. In 
> that case we error out when reading the data in. This means there are 
> cases where we can produce a CSV data file which we can't read in, which 
> is not at all pleasant.
> 
> Possible approaches to the problem:
> . make it a documented limitation
> . have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that 
> relaxes some of the restrictions on inconsistent line endings
> . escape embedded line end chars
> 
> The last really isn't an option, because the whole point of CSVs is to 
> play with other programs, and my understanding is that those that 
> understand multiline fields (e.g. Excel) expect them not to be escaped, 
> and do not produce them escaped.
> 
> So right now I'm tossing up in my head between the first two options. Or 
> maybe there's another solution I haven't thought of.
> 
> Thoughts?
> 
> cheers
> 
> andrew
> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: multiline CSV fields

From: Tom Lane
Patrick B Kelly <pbk@patrickbkelly.org> writes:
> I may not be explaining myself well or I may fundamentally 
> misunderstand how copy works.

Well, you're definitely ignoring the character-set-conversion issue.
        regards, tom lane


Re: multiline CSV fields

From: Andrew Dunstan
This example should fail on data line 2 or 3 on any platform, 
regardless of the platform's line-end convention, although I haven't 
tested on Windows.

cheers

andrew

[andrew@aloysius inst]$ bin/psql -e -f csverr.sql ; od -c 
/tmp/csverrtest.csv
create table csverrtest (a int, b text, c int);
CREATE TABLE
insert into csverrtest values(1,'a',1);
INSERT 122471 1
insert into csverrtest values(2,'foo\r\nbar',2);
INSERT 122472 1
insert into csverrtest values(3,'baz\nblurfl',3);
INSERT 122473 1
insert into csverrtest values(4,'d',4);
INSERT 122474 1
insert into csverrtest values(5,'e',5);
INSERT 122475 1
copy csverrtest to '/tmp/csverrtest.csv' csv;
COPY
truncate csverrtest;
TRUNCATE TABLE
copy csverrtest from '/tmp/csverrtest.csv' csv;
psql:csverr.sql:9: ERROR:  literal carriage return found in data
HINT:  Use "\r" to represent carriage return.
CONTEXT:  COPY csverrtest, line 2: "2,"foo"
drop table csverrtest;
DROP TABLE
0000000   1   ,   a   ,   1  \n   2   ,   "   f   o   o  \r  \n   b   a
0000020   r   "   ,   2  \n   3   ,   "   b   a   z  \n   b   l   u   r
0000040   f   l   "   ,   3  \n   4   ,   d   ,   4  \n   5   ,   e   ,
0000060   5  \n
0000062
[andrew@aloysius inst]$

Bruce Momjian wrote:

>Can I see an example of such a failure line?
>
>---------------------------------------------------------------------------
>
>Andrew Dunstan wrote:
>  
>
>>Darcy Buskermolen has drawn my attention to unfortunate behaviour of 
>>COPY CSV with fields containing embedded line end chars if the embedded 
>>sequence isn't the same as those of the file containing the CSV data. In 
>>that case we error out when reading the data in. This means there are 
>>cases where we can produce a CSV data file which we can't read in, which 
>>is not at all pleasant.
>>
>>Possible approaches to the problem:
>>. make it a documented limitation
>>. have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that 
>>relaxes some of the restrictions on inconsistent line endings
>>. escape embedded line end chars
>>
>>The last really isn't an option, because the whole point of CSVs is to 
>>play with other programs, and my understanding is that those that 
>>understand multiline fields (e.g. Excel) expect them not to be escaped, 
>>and do not produce them escaped.
>>
>>So right now I'm tossing up in my head between the first two options. Or 
>>maybe there's another solution I haven't thought of.
>>
>>Thoughts?
>>
>>cheers
>>
>>andrew
>>
>
>  
>


Re: multiline CSV fields

From: Patrick B Kelly
On Nov 12, 2004, at 12:20 AM, Tom Lane wrote:

> Patrick B Kelly <pbk@patrickbkelly.org> writes:
>> I may not be explaining myself well or I may fundamentally
>> misunderstand how copy works.
>
> Well, you're definitely ignoring the character-set-conversion issue.
>

I was not trying to ignore the character set and encoding issues but 
perhaps my assumptions are naive or overly optimistic. I realized that 
quotes are not as consistent as the NL characters but I was assuming 
that some encodings would escape to ASCII or a similar encoding like 
JIS Roman that would simplify recognition of the quote character. 
Unicode files make recognizing other punctuation like the quote fairly 
straightforward and to the naive observer, the code in CopyReadLine as 
it is currently written appears to handle multi-byte encodings such as 
SJIS that may present characters below 127 in trailing bytes.
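
To make the conversion hazard concrete (my illustration, not copy.c):
in Shift-JIS the second byte of a two-byte character can land in the
ASCII range - the classic case is U+8868, encoded as 0x95 0x5C, whose
trailing byte is also ASCII backslash - so a byte-at-a-time scanner
that has not tracked character boundaries sees escapes that are not
there:

#include <stdio.h>

int
main(void)
{
    /* a Shift-JIS row: the two-byte character 0x95 0x5C, then ",1\n" */
    const unsigned char sjis[] = {0x95, 0x5C, ',', '1', '\n', 0};

    for (const unsigned char *p = sjis; *p; p++)
        if (*p == '\\')             /* 0x5C: a false escape character */
            printf("false escape at offset %ld\n", (long) (p - sjis));
    return 0;
}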

As I said, perhaps I was oversimplifying. Is there a regression test 
set of input files that I could review to see all of the supported 
encodings?


Patrick B. Kelly
------------------------------------------------------
http://patrickbkelly.org



Re: multiline CSV fields

From: Bruce Momjian
OK, what solutions do we have for this?  Not being able to load dumped
data is a serious bug.  I have added this to the open items list:
* fix COPY CSV with \r,\n in data

My feeling is that if we are in a quoted string we just process whatever
characters we find, even passing through an EOL.  I realize it might not
mark missing quote errors well but that seems minor compared to not
loading valid data.

---------------------------------------------------------------------------

Andrew Dunstan wrote:
> 
> This example should fail on data line 2 or 3 on any platform, 
> regardless of the platform's line-end convention, although I haven't 
> tested on Windows.
> 
> cheers
> 
> andrew
> 
> [andrew@aloysius inst]$ bin/psql -e -f csverr.sql ; od -c 
> /tmp/csverrtest.csv
> create table csverrtest (a int, b text, c int);
> CREATE TABLE
> insert into csverrtest values(1,'a',1);
> INSERT 122471 1
> insert into csverrtest values(2,'foo\r\nbar',2);
> INSERT 122472 1
> insert into csverrtest values(3,'baz\nblurfl',3);
> INSERT 122473 1
> insert into csverrtest values(4,'d',4);
> INSERT 122474 1
> insert into csverrtest values(5,'e',5);
> INSERT 122475 1
> copy csverrtest to '/tmp/csverrtest.csv' csv;
> COPY
> truncate csverrtest;
> TRUNCATE TABLE
> copy csverrtest from '/tmp/csverrtest.csv' csv;
> psql:csverr.sql:9: ERROR:  literal carriage return found in data
> HINT:  Use "\r" to represent carriage return.
> CONTEXT:  COPY csverrtest, line 2: "2,"foo"
> drop table csverrtest;
> DROP TABLE
> 0000000   1   ,   a   ,   1  \n   2   ,   "   f   o   o  \r  \n   b   a
> 0000020   r   "   ,   2  \n   3   ,   "   b   a   z  \n   b   l   u   r
> 0000040   f   l   "   ,   3  \n   4   ,   d   ,   4  \n   5   ,   e   ,
> 0000060   5  \n
> 0000062
> [andrew@aloysius inst]$
> 
> Bruce Momjian wrote:
> 
> >Can I see an example of such a failure line?
> >
> >---------------------------------------------------------------------------
> >
> >Andrew Dunstan wrote:
> >  
> >
> >>Darcy Buskermolen has drawn my attention to unfortunate behaviour of 
> >>COPY CSV with fields containing embedded line end chars if the embedded 
> >>sequence isn't the same as those of the file containing the CSV data. In 
> >>that case we error out when reading the data in. This means there are 
> >>cases where we can produce a CSV data file which we can't read in, which 
> >>is not at all pleasant.
> >>
> >>Possible approaches to the problem:
> >>. make it a documented limitation
> >>. have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that 
> >>relaxes some of the restrictions on inconsistent line endings
> >>. escape embedded line end chars
> >>
> >>The last really isn't an option, because the whole point of CSVs is to 
> >>play with other programs, and my understanding is that those that 
> >>understand multiline fields (e.g. Excel) expect them not to be escaped, 
> >>and do not produce them escaped.
> >>
> >>So right now I'm tossing up in my head between the first two options. Or 
> >>maybe there's another solution I haven't thought of.
> >>
> >>Thoughts?
> >>
> >>cheers
> >>
> >>andrew
> >>
> >
> >  
> >
> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: multiline CSV fields

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> OK, what solutions do we have for this?  Not being able to load dumped
> data is a serious bug.

Which we do not have, because pg_dump doesn't use CSV.  I do not think
this is a must-fix, especially not if the proposed fix introduces
inconsistencies elsewhere.
        regards, tom lane


Re: multiline CSV fields

From: Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > OK, what solutions do we have for this?  Not being able to load dumped
> > data is a serious bug.
> 
> Which we do not have, because pg_dump doesn't use CSV.  I do not think
> this is a must-fix, especially not if the proposed fix introduces
> inconsistencies elsewhere.

Sure, pg_dump doesn't use it but COPY should be able to load anything it
output.

Can this be fixed if we ignore the problem with reporting errors?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: multiline CSV fields

From: "Andrew Dunstan"
Bruce Momjian said:
> Tom Lane wrote:
>> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> > OK, what solutions do we have for this?  Not being able to load
>> > dumped data is a serious bug.
>>
>> Which we do not have, because pg_dump doesn't use CSV.  I do not think
>> this is a must-fix, especially not if the proposed fix introduces
>> inconsistencies elsewhere.
>
> Sure, pg_dump doesn't use it but COPY should be able to load anything
> it output.
>
> Can this be fixed if we ignore the problem with reporting errors?
>


When I looked at it I could not see any simple fix that was not worse than
the symptom. If the asymmetry offends you, then we could do as Tom suggested
and rip out the multiline processing completely for now. Personally I would
regard that as a pity, as it would disallow many cases that will work quite
happily as we are, and because this is a feature that was requested when we
did this work. The limitation has been documented - my inclination would be
to revisit this during the 8.1 dev cycle.

cheers

andrew




Re: multiline CSV fields

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> Which we do not have, because pg_dump doesn't use CSV.  I do not think
>> this is a must-fix, especially not if the proposed fix introduces
>> inconsistencies elsewhere.

> Sure, pg_dump doesn't use it but COPY should be able to load anything it
> output.

I'd buy into that proposition if CSV showed any evidence of being a
sanely defined format, but it shows every indication of being neither
well-defined, nor self-consistent, nor even particularly portable.
I suggest adjusting your expectations.  All I expect from that code is
being able to load the majority of data from the more popular Microsloth
applications.  Trying to achieve 100% consistency for corner cases is
just going to interfere with the real use-case for the feature, which is
coping with output from applications that aren't very consistent in the
first place.
        regards, tom lane


Re: multiline CSV fields

From: Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Which we do not have, because pg_dump doesn't use CSV.  I do not think
> >> this is a must-fix, especially not if the proposed fix introduces
> >> inconsistencies elsewhere.
> 
> > Sure, pg_dump doesn't use it but COPY should be able to load anything it
> > output.
> 
> I'd buy into that proposition if CSV showed any evidence of being a
> sanely defined format, but it shows every indication of being neither
> well-defined, nor self-consistent, nor even particularly portable.
> I suggest adjusting your expectations.  All I expect from that code is
> being able to load the majority of data from the more popular Microsloth
> applications.  Trying to achieve 100% consistency for corner cases is
> just going to interfere with the real use-case for the feature, which is
> coping with output from applications that aren't very consistent in the
> first place.

OK, then should we disallow dumping out data in CSV format that we can't
load?  Seems like the least we should do for 8.0.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: multiline CSV fields

From: Andrew Dunstan

Bruce Momjian wrote:

>Tom Lane wrote:
>  
>
>>Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>    
>>
>>>Tom Lane wrote:
>>>      
>>>
>>>>Which we do not have, because pg_dump doesn't use CSV.  I do not think
>>>>this is a must-fix, especially not if the proposed fix introduces
>>>>inconsistencies elsewhere.
>>>>        
>>>>
>>>Sure, pg_dump doesn't use it but COPY should be able to load anything it
>>>output.
>>>      
>>>
>>I'd buy into that proposition if CSV showed any evidence of being a
>>sanely defined format, but it shows every indication of being neither
>>well-defined, nor self-consistent, nor even particularly portable.
>>I suggest adjusting your expectations.  All I expect from that code is
>>being able to load the majority of data from the more popular Microsloth
>>applications.  Trying to achieve 100% consistency for corner cases is
>>just going to interfere with the real use-case for the feature, which is
>>coping with output from applications that aren't very consistent in the
>>first place.
>>    
>>
>
>OK, then should we disallow dumping out data in CSV format that we can't
>load?  Seems like the least we should do for 8.0.
>
>  
>

As Tom rightly points out, having data make the round trip was not the 
goal of the exercise. Excel, for example, has no trouble reading such 
data (or at least my installation of it).

Personally I consider CSVs with line end chars embedded in fields to be 
broken anyway, but this was something that was specifically mentioned 
when we were discussing requirements, which is why I coded for it.

cheers

andrew




Re: multiline CSV fields

From: Bruce Momjian
Andrew Dunstan wrote:
> >OK, then should we disallow dumping out data in CSV format that we can't
> >load?  Seems like the least we should do for 8.0.
> >
> >  
> >
> 
> As Tom rightly points out, having data make the round trip was not the 
> goal of the exercise. Excel, for example, has no trouble reading such 
> data (or at least my installation of it).
> 
> Personally I consider CSVs with line end chars embedded in fields to be 
> broken anyway, but this was something that was specifically mentioned 
> when we were discussing requirements, which is why I coded for it.

OK, I am pretty uncomfortable with this but you know this usage better
than I do.  Should we issue a warning message stating it will not be
able to be reloaded?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: multiline CSV fields

From: Bruce Momjian
Bruce Momjian wrote:
> Andrew Dunstan wrote:
> > >OK, then should we disallow dumping out data in CSV format that we can't
> > >load?  Seems like the least we should do for 8.0.
> > >
> > >  
> > >
> > 
> > As Tom rightly points out, having data make the round trip was not the 
> > goal of the exercise. Excel, for example, has no trouble reading such 
> > data (or at least my installation of it).
> > 
> > Personally I consider CSVs with line end chars embedded in fields to be 
> > broken anyway, but this was something that was specifically mentioned 
> > when we were discussing requirements, which is why I coded for it.
> 
> OK, I am pretty uncomfortable with this but you know this usage better
> than I do.  Should we issue a warning message stating it will not be
> able to be reloaded?

Also, can you explain why we can't read across a newline to the next
quote?  Is it a problem with the way our code is structured or is it a
logical problem?  Someone mentioned multibyte encodings but I don't
understand how that applies here.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: multiline CSV fields

From: Andrew Dunstan

Bruce Momjian wrote:

>Andrew Dunstan wrote:
>  
>
>>>OK, then should we disallow dumping out data in CSV format that we can't
>>>load?  Seems like the least we should do for 8.0.
>>>
>>> 
>>>
>>>      
>>>
>>As Tom rightly points out, having data make the round trip was not the 
>>goal of the exercise. Excel, for example, has no trouble reading such 
>>data (or at least my installation of it).
>>
>>Personally I consider CSVs with line end chars embedded in fields to be 
>>broken anyway, but this was something that was specifically mentioned 
>>when we were discussing requirements, which is why I coded for it.
>>    
>>
>
>OK, I am pretty uncomfortable with this but you know this usage better
>than I do.  Should we issue a warning message stating it will not be
>able to be reloaded?
>  
>

If it bothers you that much, I'd make a flag, cleared at the start of 
each COPY, and then where we test for CR or LF in CopyAttributeOutCSV, 
if the flag is not set then set it and issue the warning.

Longer term I'd like to be able to have a command parameter that 
specifies certain fields as multiline and for those relax the line end 
matching restriction (and for others forbid multiline altogether). That 
would be a TODO for 8.1 though, along with optional special handling for 
first line column headings.

cheers

andrew


Re: multiline CSV fields

From: Andrew Dunstan

Bruce Momjian wrote:

>
>Also, can you explain why we can't read across a newline to the next
>quote?  Is it a problem with the way our code is structured or is it a
>logical problem?  Someone mentioned multibyte encodings but I don't
>understand how that applies here.
>
>  
>

In a CSV file, each line is a record. Reading across a newline for the 
next quote (assuming the next field is quoted) would mean stealing 
fields from the next record.

I did see one complaint about missing or extra fields at the end of a 
record - I think it is reasonable for us to expect the data to be 
rectangular, and not ragged.

(I hope this answers your question - I am not 100% certain I understood 
it).

cheers

andrew


Re: multiline CSV fields

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Also, can you explain why we can't read across a newline to the next
> quote?  Is it a problem with the way our code is structured or is it a
> logical problem?

It's a structural issue in the sense that we separate the act of
dividing the input into rows from the act of dividing it into columns.
I do not think that separation is wrong however.
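
A toy model of that separation, to make the structural point concrete
(illustrative only, not the real code): phase one cuts the stream into
lines knowing nothing about quoting, so a quoted field containing a
newline reaches phase two already truncated.

#include <stdio.h>
#include <string.h>

static int
split_fields(char *line)            /* phase 2: ~ CopyReadAttribute */
{
    int     in_quotes = 0;
    char   *p;

    for (p = line; *p; p++)
        if (*p == '"')
            in_quotes = !in_quotes;
    return in_quotes;               /* nonzero => unterminated quote */
}

int
main(void)
{
    char    line[1024];
    long    lineno = 0;

    /* phase 1: ~ CopyReadLine, blind to quotes */
    while (fgets(line, sizeof(line), stdin))
    {
        lineno++;
        line[strcspn(line, "\r\n")] = '\0';
        if (split_fields(line))
            fprintf(stderr, "line %ld: unterminated quoted field\n", lineno);
    }
    return 0;
}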
        regards, tom lane


Re: multiline CSV fields

From: Kris Jurka

On Mon, 29 Nov 2004, Andrew Dunstan wrote:

> Longer term I'd like to be able to have a command parameter that 
> specifies certain fields as multiline and for those relax the line end 
> matching restriction (and for others forbid multiline altogether). That 
> would be a TODO for 8.1 though, along with optional special handling for 
> first line column headings.
> 

Endlessly extending the COPY command doesn't seem like a winning 
proposition to me and I think if we aren't comfortable telling every user 
to write a script to pre/post-process the data we should instead provide a 
bulk loader/unloader that transforms things to our limited COPY 
functionality.  There are all kinds of feature requests I've seen 
along these lines that would make COPY a million option mess if we try to 
support all of it directly.

- skipping header rows
- skipping certain data file columns
- specifying date formats
- ignoring duplicates
- outputting an arbitrary SELECT statement

Kris Jurka



Re: multiline CSV fields

From: Bruce Momjian
Kris Jurka wrote:
> 
> 
> On Mon, 29 Nov 2004, Andrew Dunstan wrote:
> 
> > Longer term I'd like to be able to have a command parameter that 
> > specifies certain fields as multiline and for those relax the line end 
> > matching restriction (and for others forbid multiline altogether). That 
> > would be a TODO for 8.1 though, along with optional special handling for 
> > first line column headings.
> > 
> 
> Endlessly extending the COPY command doesn't seem like a winning 
> proposition to me and I think if we aren't comfortable telling every user 
> to write a script to pre/post-process the data we should instead provide a 
> bulk loader/unloader that transforms things to our limited COPY 
> functionality.  There are all kinds of feature requests I've seen 
> along these lines that would make COPY a million option mess if we try to 
> support all of it directly.
> 
> - skipping header rows
> - skipping certain data file columns
> - specifying date formats
> - ignoring duplicates
> - outputting an arbitrary SELECT statement

Agreed. There are lots of wishes for COPY and it will become bloated if
we do them all.

I am concerned someone will say, "Oh, I know the CSV format and I might
load the data into another database someday so I will always use CSV"
not knowing it isn't a 100% consistent format.  I think we need to
issue a warning if a \r or \n is output by COPY CSV just so people
understand the limitation.  We can then reevaluate where we need to go
for 8.1.

Open item updated:
* warn on COPY TO ... CSV with \r,\n in data

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: multiline CSV fields

From: Tom Lane
Kris Jurka <books@ejurka.com> writes:
> Endlessly extending the COPY command doesn't seem like a winning 
> proposition to me and I think if we aren't comfortable telling every user 
> to write a script to pre/post-process the data we should instead provide a 
> bulk loader/unloader that transforms things to our limited COPY 
> functionality.  There are all kinds of feature requests I've seen 
> along these lines that would make COPY a million option mess if we try to 
> support all of it directly.

I agree completely --- personally I'd not have put CSV into the backend
either.

IIRC we already have a TODO item for a separate bulk loader, but no
one's stepped up to the plate yet :-(
        regards, tom lane


Re: multiline CSV fields

From: Bruce Momjian
Tom Lane wrote:
> Kris Jurka <books@ejurka.com> writes:
> > Endlessly extending the COPY command doesn't seem like a winning 
> > proposition to me and I think if we aren't comfortable telling every user 
> > to write a script to pre/post-process the data we should instead provide a 
> > bulk loader/unloader that transforms things to our limited COPY 
> > functionality.  There are all kinds of feature requests I've seen 
> > along these lines that would make COPY a million option mess if we try to 
> > support all of it directly.
> 
> I agree completely --- personally I'd not have put CSV into the backend
> either.

What pushed us was the large number of requests for it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: multiline CSV fields

From: Andrew Dunstan

Tom Lane wrote:

>Kris Jurka <books@ejurka.com> writes:
>  
>
>>Endlessly extending the COPY command doesn't seem like a winning 
>>proposition to me and I think if we aren't comfortable telling every user 
>>to write a script to pre/post-process the data we should instead provide a 
>>bulk loader/unloader that transforms things to our limited COPY 
>>functionality.  There are all kinds of feature requests I've seen 
>>along these lines that would make COPY a million option mess if we try to 
>>support all of it directly.
>>    
>>
>
>I agree completely --- personally I'd not have put CSV into the backend
>either.
>
>IIRC we already have a TODO item for a separate bulk loader, but no
>one's stepped up to the plate yet :-(
>

IIRC, the way it happened was that a proposal was made to do CSV 
import/export in a fairly radical way, I countered with a much more 
modest approach, which was generally accepted and which Bruce and I then 
implemented, not without some angst (as well as a little sturm und drang).


The advantage of having it in COPY is that it can be done serverside 
direct from the file system. For massive bulk loads that might be a 
plus, although I don't know what the protocol+socket overhead is. Maybe 
it would just be lost in the noise. Certainly I can see some sense in 
having COPY deal with straightforward cases and a bulk-load-unload 
program in bin to handle the hairier cases. Multiline fields would come 
into that category. The bulk-load-unload facility could possibly handle 
things other than CSV format too (XML anyone?). The nice thing about an 
external program is that it would not have to handle data embedded in an 
SQL stream, so the dangers from shifts in newline style, missing quotes, 
and the like would be far lower.

We do need to keep things in perspective a bit. The small wrinkle that 
has spawned this whole thread will not affect most users of the facility 
- and many many users will thank us for having provided it.

cheers

andrew


Re: multiline CSV fields

From: Greg Stark
Andrew Dunstan <andrew@dunslane.net> writes:

> The advantage of having it in COPY is that it can be done serverside direct
> from the file system. For massive bulk loads that might be a plus, although I
> don't know what the protocol+socket overhead is. 

Actually even if you use client-side COPY it's *still* more efficient than any
more general client-side alternative.

As Tom pointed out to me a while back, neither the protocol nor libpq allow
for having multiple queries in flight simultaneously. That makes it impossible
to stream large quantities of data to the server efficiently. Each record
requires a round-trip and context switch overhead.

In an ideal world the client should be able to queue up enough records to fill
the socket buffers and allow the kernel to switch to a more batch oriented
context switch mode where the server can process large numbers of records
before switching back to the client. Ideally this would apply to any kind of
query execution.

But as a kind of short cut towards this for bulk loading I'm curious whether
it would be possible to adopt a sort of batch execution mode where a statement
is prepared, then parameters to the statement are streamed to the server in a
kind of COPY mode. It would have to be some format that allowed for embedded
newlines though; there's just no point in an interface that can't handle
arbitrary data.

Personally I find the current CSV support inadequate. It seems pointless to
support CSV if it can't load data exported from Excel, which seems like the
main use case. But I always thought bulk loading should be from some external
application anyways. The problem is that there isn't any interface suitable
for an external application to use.

-- 
greg



Re: multiline CSV fields

From: Kris Jurka

On Tue, 30 Nov 2004, Greg Stark wrote:

> 
> Andrew Dunstan <andrew@dunslane.net> writes:
> 
> > The advantage of having it in COPY is that it can be done serverside
> > direct from the file system. For massive bulk loads that might be a
> > plus, although I don't know what the protocol+socket overhead is.
> 
> Actually even if you use client-side COPY it's *still* more efficient than any
> more general client-side alternative.

The idea would be to still use COPY just from a program that did 
additional processing, not as direct SQL.
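
For what it's worth, a minimal sketch of such a program using libpq's
COPY interface (illustrative only; the table name and the row are
placeholders, and error handling is pared to the bone):

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("");     /* connection parameters from environment */
    PGresult   *res;
    const char *row = "1,\"foo bar\",2\n";  /* one pre-processed CSV row */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    res = PQexec(conn, "COPY mytable FROM STDIN CSV");
    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }
    PQclear(res);

    /* the program can rewrite awkward rows here before sending them */
    PQputCopyData(conn, row, (int) strlen(row));
    PQputCopyEnd(conn, NULL);

    res = PQgetResult(conn);                /* collect COPY's completion status */
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "%s", PQerrorMessage(conn));
    PQclear(res);
    PQfinish(conn);
    return 0;
}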

> As Tom pointed out to me a while back, neither the protocol nor libpq allow
> for having multiple queries in flight simultaneously. That makes it impossible
> to stream large quantities of data to the server efficiently. Each record
> requires a round-trip and context switch overhead.

Multiplexing queries is different than having multiple queries in flight.  
You can have multiple queries on the wire now, they are just processed 
sequentially.

> In an ideal world the client should be able to queue up enough records to fill
> the socket buffers and allow the kernel to switch to a more batch oriented
> context switch mode where the server can process large numbers of records
> before switching back to the client. Ideally this would apply to any kind of
> query execution.

This is possible now with the V3 protocol (and used by the JDBC driver).  
For an executeBatch() on a PreparedStatement, the driver sends a parse
message, then any number of bind/execute pairs, but with a buffered stream
nothing happens until a Sync message is sent and the stream is flushed.  
Then it collects the results of all of the executes.

Kris Jurka


Re: multiline CSV fields

From: Andrew Dunstan

Greg Stark wrote:

>Personally I find the current CSV support inadequate. It seems pointless to
>support CSV if it can't load data exported from Excel, which seems like the
>main use case. 
>  
>

OK, I'm starting to get mildly annoyed now. We have identified one 
failure case connected with multiline fields. Even in the multiline 
case, I expect that the embedded newlines will usually match those of 
the CSV file, in which case there will be no failure. It's a very big 
step from there to the far more general "can't load data exported from 
Excel". Or did you have some other limitation in mind?

FWIW, I don't make a habit of using multiline fields in my spreadsheets 
- and some users I have spoken to aren't even aware that you can have 
them at all.


cheers

andrew


Re: multiline CSV fields

From: Ben.Young@risk.sungard.com
> 
> 
> Greg Stark wrote:
> 
> >Personally I find the current CSV support inadequate. It seems pointless to
> >support CSV if it can't load data exported from Excel, which seems like the
> >main use case. 
> > 
> >
> 
> OK, I'm starting to get mildly annoyed now. We have identified one 
> failure case connected with multiline fields. Even in the multiline 
> case, I expect that the embedded newlines will usually match those of 
> the CSV file, in which case there will be no failure. It's a very big 
> step from there to the far more general "can't load data exported from 
> Excel". Or did you have some other limitation in mind?
> 
> FWIW, I don't make a habit of using multiline fields in my spreadsheets 
> - and some users I have spoken to aren't even aware that you can have 
> them at all.
> 

I am normally more of a lurker on these lists, but I thought you had 
better know that when we developed CSV import/export for an application 
at my last company we discovered that Excel can't always even read the 
CSV that _it_ has output! (With embedded newlines a particular problem.)

It is far more reliable if you output your data as an HTML table, in 
which case it practically always gets it right. Perhaps Postgres could 
support this as an import/export mechanism as I have found it to be far 
less error prone than CSV?

Cheers,
Ben Young


> cheers
> 
> andrew
> 



Re: multiline CSV fields

From: Andrew Dunstan

Ben.Young@risk.sungard.com wrote:

>
>I am normally more of a lurker on these lists, but I thought you had 
>better know that when we developed CSV import/export for an application 
>at my last company we discovered that Excel can't always even read the 
>CSV that _it_ has output! (With embedded newlines a particular problem.)
>  
>


Quite so.  This is not about being perfect, and there will be failures. 
But, absent this problem the feature should work reasonably well.

Note that Excel is not the only kid on the block - if it were there 
would be a good case for avoiding CSV anyway and instead reading/writing 
the excel files directly. We included support for CSVs because, 
notwithstanding how braindead the format is, it is still widely used in 
data exchange.

>It is far more reliable if you output your data as an HTML table, in 
>which case it practically always gets it right. Perhaps Postgres could 
>support this as an import/export mechanism as I have found it to be far 
>less error prone than CSV?
>  
>

That's probably possible but rather ugly. A well defined XML format 
would be far better (don't we have something like that in contrib?). 
Things for a bulk-export facility, I think.

cheers

andrew




Re: multiline CSV fields

From: Greg Stark
Andrew Dunstan <andrew@dunslane.net> writes:

> FWIW, I don't make a habit of using multiline fields in my spreadsheets - and
> some users I have spoken to aren't even aware that you can have them at all.

Unfortunately I don't get a choice. I offer a field on the web site where
users can upload an excel sheet. Some of the fields of my database are
expected to have multi-line text in them. So I expect virtually all the
uploads to have multi-line fields in them.

So as far as I'm concerned, this import system is simply unusable. I have to
write a program to do the import. Since I was always planning to write such a
program I'm not too disappointed though.

-- 
greg



Re: multiline CSV fields

From: Bruce Momjian
Andrew Dunstan wrote:
> 
> 
> Greg Stark wrote:
> 
> >Personally I find the current CSV support inadequate. It seems pointless to
> >support CSV if it can't load data exported from Excel, which seems like the
> >main use case. 
> >  
> >
> 
> OK, I'm starting to get mildly annoyed now. We have identified one 
> failure case connected with multiline fields. Even in the multiline 
> case, I expect that the embedded newlines will usually match those of 
> the CSV file, in which case there will be no failure. It's a very big 
> step from there to the far more general "can't load data exported from 
> Excel". Or did you have some other limitation in mind?
> 
> FWIW, I don't make a habit of using multiline fields in my spreadsheets 
> - and some users I have spoken to aren't even aware that you can have 
> them at all.

I am wondering if one good solution would be to pre-process the input
stream in copy.c to convert newline to \n and carriage return to \r and
double data backslashes and tell copy.c to interpret those like it does
for normal text COPY files.  That way, the changes to copy.c might be
minimal; basically, place a filter in front of the CSV file that cleans
up the input so it can be more easily processed.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: multiline CSV fields

From: Andrew Dunstan

Bruce Momjian wrote:

>I am wondering if one good solution would be to pre-process the input
>stream in copy.c to convert newline to \n and carriage return to \r and
>double data backslashes and tell copy.c to interpret those like it does
>for normal text COPY files.  That way, the changes to copy.c might be
>minimal; basically, place a filter in front of the CSV file that cleans
>up the input so it can be more easily processed.
>  
>

This would have to parse the input stream, because you would need to 
know which CRs and LFs were part of the data stream and so should be 
escaped, and which really ended data lines and so should be left alone. 
However, while the idea is basically sound, parsing the stream twice 
seems crazy. My argument has been that at this stage in the dev cycle we 
should document the limitation, maybe issue a warning as you want, and 
make the more invasive code changes to fix it properly in 8.1. If you 
don't want to wait, then following your train of thought a bit, ISTM 
that the correct solution is a routine for CSV mode that combines the 
functions of CopyReadAttributeCSV() and CopyReadLine(). Then we'd have a 
genuine and fast fix for Greg's and Darcy's problem.

cheers

andrew


Re: multiline CSV fields

From: Bruce Momjian
Andrew Dunstan wrote:
> 
> 
> Bruce Momjian wrote:
> 
> >I am wondering if one good solution would be to pre-process the input
> >stream in copy.c to convert newline to \n and carriage return to \r and
> >double data backslashes and tell copy.c to interpret those like it does
> >for normal text COPY files.  That way, the changes to copy.c might be
> >minimal; basically, place a filter in front of the CSV file that cleans
> >up the input so it can be more easily processed.
> >  
> >
> 
> This would have to parse the input stream, because you would need to 
> know which CRs and LFs were part of the data stream and so should be 
> escaped, and which really ended data lines and so should be left alone. 
> However, while the idea is basically sound, parsing the stream twice 
> seems crazy. My argument has been that at this stage in the dev cycle we 
> should document the limitation, maybe issue a warning as you want, and 
> make the more invasive code changes to fix it properly in 8.1. If you 

OK, right.

> don't want to wait, then following your train of thought a bit, ISTM 
> that the correct solution is a routine for CSV mode that combines the 
> functions of CopyReadAttributeCSV() and CopyReadLine(). Then we'd have a 
> genuine and fast fix for Greg's and Darcy's problem.

We are fine for 8.0, except for the warning, and you think we can fix it
perfectly in 8.1, good.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: multiline CSV fields

From: Andrew Dunstan

I wrote:

>
> If it bothers you that much, I'd make a flag, cleared at the start of
> each COPY, and then where we test for CR or LF in CopyAttributeOutCSV,
> if the flag is not set then set it and issue the warning.



I didn't realise until Bruce told me just now that I was on the hook for
this. I guess I should keep my big mouth shut. (Yeah, that's gonna
happen ...)

Anyway, here's a tiny patch that does what I had in mind.

cheers

andrew
Index: copy.c
===================================================================
RCS file: /home/cvsmirror/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.234
diff -c -r1.234 copy.c
*** copy.c    6 Nov 2004 17:46:27 -0000    1.234
--- copy.c    2 Dec 2004 23:34:20 -0000
***************
*** 98,103 ****
--- 98,104 ----
  static EolType eol_type;        /* EOL type of input */
  static int    client_encoding;    /* remote side's character encoding */
  static int    server_encoding;    /* local encoding */
+ static bool embedded_line_warning;

  /* these are just for error messages, see copy_in_error_callback */
  static bool copy_binary;        /* is it a binary copy? */
***************
*** 1190,1195 ****
--- 1191,1197 ----
      attr = tupDesc->attrs;
      num_phys_attrs = tupDesc->natts;
      attr_count = list_length(attnumlist);
+     embedded_line_warning = false;

      /*
       * Get info about the columns we need to process.
***************
*** 2627,2632 ****
--- 2629,2653 ----
           !use_quote && (c = *test_string) != '\0';
           test_string += mblen)
      {
+         /*
+          * We don't know here what the surrounding line end characters
+          * might be. It might not even be under postgres' control. So
+          * we simply warn on ANY embedded line ending character.
+          *
+          * This warning will disappear when we make line parsing field-aware,
+          * so that we can reliably read in embedded line ending characters
+          * regardless of the file's line-end context.
+          *
+          */
+
+         if (!embedded_line_warning  && (c == '\n' || c == '\r') )
+         {
+             embedded_line_warning = true;
+             elog(WARNING,
+                  "CSV fields with embedded linefeed or carriage return "
+                  "characters might not be able to be reimported");
+         }
+
          if (c == delimc || c == quotec || c == '\n' || c == '\r')
              use_quote = true;
          if (!same_encoding)
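
With the patch applied, a session should look roughly like this
(sketched from the patch's message text, not a captured run):

copy csverrtest to '/tmp/csverrtest.csv' csv;
WARNING:  CSV fields with embedded linefeed or carriage return characters might not be able to be reimported
COPY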

Re: multiline CSV fields

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> +         if (!embedded_line_warning  && (c == '\n' || c == '\r') )
> +         {
> +             embedded_line_warning = true;
> +             elog(WARNING,
> +                  "CSV fields with embedded linefeed or carriage return "
> +                  "characters might not be able to be reimported");
> +         }

What about forcibly translating them to the two-character sequences \n
or \r?  Or is that not considered a CSV-compatible representation?

            regards, tom lane

Re: multiline CSV fields

From: Andrew Dunstan

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>
>>+         if (!embedded_line_warning  && (c == '\n' || c == '\r') )
>>+         {
>>+             embedded_line_warning = true;
>>+             elog(WARNING,
>>+                  "CSV fields with embedded linefeed or carriage return "
>>+                  "characters might not be able to be reimported");
>>+         }
>>
>>
>
>What about forcibly translating them to the two-character sequences \n
>or \r?  Or is that not considered a CSV-compatible representation?
>
>
>
>

Not compatible AFAIK. Certainly not portably. And the warning would
still be true, because we don't do this unescaping on the way back in. I
think the way the comment in the patch suggests and previous emails have
discussed is the right way to go with this - I will attend to it after
we branch ;-)

cheers

andrew

Re: [PATCHES] multiline CSV fields

From: Bruce Momjian
Patch applied.  Thanks.

---------------------------------------------------------------------------


Andrew Dunstan wrote:
>
>
> I wrote:
>
> >
> > If it bothers you that much, I'd make a flag, cleared at the start of
> > each COPY, and then where we test for CR or LF in CopyAttributeOutCSV,
> > if the flag is not set then set it and issue the warning.
>
>
>
> I didn't realise until Bruce told me just now that I was on the hook for
> this. I guess I should keep my big mouth shut. (Yeah, that's gonna
> happen ...)
>
> Anyway, here's a tiny patch that does what I had in mind.
>
> cheers
>
> andrew


> Index: copy.c
> ===================================================================
> RCS file: /home/cvsmirror/pgsql/src/backend/commands/copy.c,v
> retrieving revision 1.234
> diff -c -r1.234 copy.c
> *** copy.c    6 Nov 2004 17:46:27 -0000    1.234
> --- copy.c    2 Dec 2004 23:34:20 -0000
> ***************
> *** 98,103 ****
> --- 98,104 ----
>   static EolType eol_type;        /* EOL type of input */
>   static int    client_encoding;    /* remote side's character encoding */
>   static int    server_encoding;    /* local encoding */
> + static bool embedded_line_warning;
>
>   /* these are just for error messages, see copy_in_error_callback */
>   static bool copy_binary;        /* is it a binary copy? */
> ***************
> *** 1190,1195 ****
> --- 1191,1197 ----
>       attr = tupDesc->attrs;
>       num_phys_attrs = tupDesc->natts;
>       attr_count = list_length(attnumlist);
> +     embedded_line_warning = false;
>
>       /*
>        * Get info about the columns we need to process.
> ***************
> *** 2627,2632 ****
> --- 2629,2653 ----
>            !use_quote && (c = *test_string) != '\0';
>            test_string += mblen)
>       {
> +         /*
> +          * We don't know here what the surrounding line end characters
> +          * might be. It might not even be under postgres' control. So
> +          * we simply warn on ANY embedded line ending character.
> +          *
> +          * This warning will disappear when we make line parsing field-aware,
> +          * so that we can reliably read in embedded line ending characters
> +          * regardless of the file's line-end context.
> +          *
> +          */
> +
> +         if (!embedded_line_warning  && (c == '\n' || c == '\r') )
> +         {
> +             embedded_line_warning = true;
> +             elog(WARNING,
> +                  "CSV fields with embedded linefeed or carriage return "
> +                  "characters might not be able to be reimported");
> +         }
> +
>           if (c == delimc || c == quotec || c == '\n' || c == '\r')
>               use_quote = true;
>           if (!same_encoding)


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073