Thread: multiline CSV fields
Darcy Buskermolen has drawn my attention to unfortunate behaviour of COPY CSV with fields containing embedded line end chars if the embedded sequence isn't the same as those of the file containing the CSV data. In that case we error out when reading the data in. This means there are cases where we can produce a CSV data file which we can't read in, which is not at all pleasant.

Possible approaches to the problem:
. make it a documented limitation
. have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that relaxes some of the restrictions on inconsistent line endings
. escape embedded line end chars

The last really isn't an option, because the whole point of CSVs is to play with other programs, and my understanding is that those that understand multiline fields (e.g. Excel) expect them not to be escaped, and do not produce them escaped.

So right now I'm tossing up in my head between the first two options. Or maybe there's another solution I haven't thought of.

Thoughts?

cheers

andrew
On Nov 10, 2004, at 6:10 PM, Andrew Dunstan wrote: > The last really isn't an option, because the whole point of CSVs is to > play with other programs, and my understanding is that those that > understand multiline fields (e.g. Excel) expect them not to be > escaped, and do not produce them escaped. > Actually, when I try to export a sheet with multi-line cells from excel, it tells me that this feature is incompatible with the CSV format and will not include them in the CSV file. Patrick B. Kelly ------------------------------------------------------ http://patrickbkelly.org
Patrick B Kelly wrote: > > On Nov 10, 2004, at 6:10 PM, Andrew Dunstan wrote: > >> The last really isn't an option, because the whole point of CSVs is >> to play with other programs, and my understanding is that those that >> understand multiline fields (e.g. Excel) expect them not to be >> escaped, and do not produce them escaped. >> > > Actually, when I try to export a sheet with multi-line cells from > excel, it tells me that this feature is incompatible with the CSV > format and will not include them in the CSV file. > > > It probably depends on the version. I have just tested with Excel 2000 on a WinXP machine and it both read and wrote these files. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Patrick B Kelly wrote: >> Actually, when I try to export a sheet with multi-line cells from >> excel, it tells me that this feature is incompatible with the CSV >> format and will not include them in the CSV file. > It probably depends on the version. I have just tested with Excel 2000 > on a WinXP machine and it both read and wrote these files. I'd be inclined to define Excel 2000 as broken, honestly, if it's writing unescaped newlines as data. To support this would mean throwing away most of our ability to detect incorrectly formatted CSV files. A simple error like a missing close quote would look to the machine like the rest of the file is a single long data line where all the newlines are embedded in data fields. How likely is it that you'll get a useful error message out of that? Most likely the error message would point to the end of the file, or at least someplace well removed from the actual mistake. I would vote in favor of removing the current code that attempts to support unquoted newlines, and waiting to see if there are complaints. regards, tom lane
Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>Patrick B Kelly wrote: >> >> >>>Actually, when I try to export a sheet with multi-line cells from >>>excel, it tells me that this feature is incompatible with the CSV >>>format and will not include them in the CSV file. >>> >>> > > > >>It probably depends on the version. I have just tested with Excel 2000 >>on a WinXP machine and it both read and wrote these files. >> >> > >I'd be inclined to define Excel 2000 as broken, honestly, if it's >writing unescaped newlines as data. To support this would mean throwing >away most of our ability to detect incorrectly formatted CSV files. >A simple error like a missing close quote would look to the machine like >the rest of the file is a single long data line where all the newlines >are embedded in data fields. How likely is it that you'll get a useful >error message out of that? Most likely the error message would point to >the end of the file, or at least someplace well removed from the actual >mistake. > >I would vote in favor of removing the current code that attempts to >support unquoted newlines, and waiting to see if there are complaints. > > > > This feature was specifically requested when we discussed what sort of CSVs we would handle. And it does in fact work as long as the newline style is the same. I just had an idea. How about if we add a new CSV option MULTILINE. If absent, then on output we would not output unescaped LF/CR characters and on input we would not allow fields with embedded unescaped LF/CR characters. In both cases we could error out for now, with perhaps an 8.1 TODO to provide some other behaviour. Or we could drop the whole multiline "feature" for now and make the whole thing an 8.1 item, although it would be a bit of a pity when it does work in what will surely be the most common case. cheers andrew
Tom Lane <tgl@sss.pgh.pa.us> writes: > I would vote in favor of removing the current code that attempts to > support unquoted newlines, and waiting to see if there are complaints. Uhm. *raises hand* I agree with your argument but one way or another I have to load these CSVs I'm given. And like it or not virtually all the CSVs people get are going to be coming from Excel. So far with 7.4 I've just opened them up in Emacs and removed the newlines, but it's a royal pain in the arse. -- greg
On Thu, Nov 11, 2004 at 03:38:16PM -0500, Greg Stark wrote: > > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > I would vote in favor of removing the current code that attempts > > to support unquoted newlines, and waiting to see if there are > > complaints. > > Uhm. *raises hand* > > I agree with your argument but one way or another I have to load > these CSVs I'm given. And like it or not virtually all the CSVs > people get are going to be coming from Excel. So far with 7.4 I've > just opened them up in Emacs and removed the newlines, but it's a > royal pain in the arse. Meanwhile, check out dbi-link. It lets you query against DBI data sources including DBD::Excel :) http://pgfoundry.org/projects/dbi-link/ Bug reports welcome. Cheers, D -- David Fetter david@fetter.org http://fetter.org/ phone: +1 510 893 6100 mobile: +1 415 235 3778 Remember to vote!
On Nov 11, 2004, at 2:56 PM, Andrew Dunstan wrote: > > > Tom Lane wrote: > >> Andrew Dunstan <andrew@dunslane.net> writes: >> >>> Patrick B Kelly wrote: >>> >>>> Actually, when I try to export a sheet with multi-line cells from >>>> excel, it tells me that this feature is incompatible with the CSV >>>> format and will not include them in the CSV file. >>>> >> >> >>> It probably depends on the version. I have just tested with Excel >>> 2000 on a WinXP machine and it both read and wrote these files. >>> >> >> I'd be inclined to define Excel 2000 as broken, honestly, if it's >> writing unescaped newlines as data. To support this would mean >> throwing >> away most of our ability to detect incorrectly formatted CSV files. >> A simple error like a missing close quote would look to the machine >> like >> the rest of the file is a single long data line where all the newlines >> are embedded in data fields. How likely is it that you'll get a >> useful >> error message out of that? Most likely the error message would point >> to >> the end of the file, or at least someplace well removed from the >> actual >> mistake. >> >> I would vote in favor of removing the current code that attempts to >> support unquoted newlines, and waiting to see if there are complaints. >> >> >> > > This feature was specifically requested when we discussed what sort of > CSVs we would handle. > > And it does in fact work as long as the newline style is the same. > > I just had an idea. How about if we add a new CSV option MULTILINE. If > absent, then on output we would not output unescaped LF/CR characters > and on input we would not allow fields with embedded unescaped LF/CR > characters. In both cases we could error out for now, with perhaps an > 8.1 TODO to provide some other behaviour. > > Or we could drop the whole multiline "feature" for now and make the > whole thing an 8.1 item, although it would be a bit of a pity when it > does work in what will surely be the most common case. > What about just coding a FSM into backend/commands/copy.c:CopyReadLine() that does not process any flavor of NL characters when it is inside of a data field? Patrick B. Kelly ------------------------------------------------------ http://patrickbkelly.org
Patrick B Kelly wrote: > > > What about just coding a FSM into > backend/commands/copy.c:CopyReadLine() that does not process any > flavor of NL characters when it is inside of a data field? > > > It would be a major change - the routine doesn't read data a field at a time, and has no idea if we are even in CSV mode at all. It would be rather late in the dev cycle to be making such changes, I suspect. cheers andrew
Patrick B Kelly <pbk@patrickbkelly.org> writes: > What about just coding a FSM into > backend/commands/copy.c:CopyReadLine() that does not process any flavor > of NL characters when it is inside of a data field? CopyReadLine has no business tracking that. One reason why not is that it is dealing with data not yet converted out of the client's encoding, which makes matching to user-specified quote/escape characters difficult. regards, tom lane
On Nov 11, 2004, at 6:16 PM, Tom Lane wrote: > Patrick B Kelly <pbk@patrickbkelly.org> writes: >> What about just coding a FSM into >> backend/commands/copy.c:CopyReadLine() that does not process any >> flavor >> of NL characters when it is inside of a data field? > > CopyReadLine has no business tracking that. One reason why not is that > it is dealing with data not yet converted out of the client's encoding, > which makes matching to user-specified quote/escape characters > difficult. > > regards, tom lane I appreciate what you are saying about the encoding and you are, of course, right, but CopyReadLine is already processing the NL characters and it is doing it without considering the context in which they appear. Unfortunately, the same character(s) are used for two different purposes in the files in question. Without considering whether they appear inside or outside of data fields, CopyReadLine will mistake one for the other and cannot correctly do what it is already trying to do, which is break the input file into lines. My suggestion is to simply have CopyReadLine recognize these two states (in-field and out-of-field) and execute the current logic only while in the second state. It would not be too hard but as you mentioned it is non-trivial. Patrick B. Kelly ------------------------------------------------------ http://patrickbkelly.org
Patrick B Kelly wrote: > > > > My suggestion is to simply have CopyReadLine recognize these two > states (in-field and out-of-field) and execute the current logic only > while in the second state. It would not be too hard but as you > mentioned it is non-trivial. > > > We don't know what state we expect the end of line to be in until after we have actually read the line. To know how to treat the end of line under your scheme we would have to parse as we go rather than after reading the line as we do now. Changing this would not only be non-trivial but significantly invasive to the code. cheers andrew
On Nov 11, 2004, at 10:07 PM, Andrew Dunstan wrote: > > > Patrick B Kelly wrote: > >> >> >> >> My suggestion is to simply have CopyReadLine recognize these two >> states (in-field and out-of-field) and execute the current logic only >> while in the second state. It would not be too hard but as you >> mentioned it is non-trivial. >> >> >> > > We don't know what state we expect the end of line to be in until > after we have actually read the line. To know how to treat the end of > line on your scheme we would have to parse as we go rather than after > reading the line as now. Changing this would be not only be > non-trivial but significantly invasive to the code. > > Perhaps I am misunderstanding the code. As I read it the code currently goes through the input character by character looking for NL and EOF characters. It appears to be very well structured for what I am proposing. The section in question is a small and clearly defined loop which reads the input one character at a time and decides when it has reached the end of the line or file. Each call of CopyReadLine attempts to get one more line. I would propose that each time it starts out in the out-of-field state and the state is toggled by each un-escaped quote that it encounters in the stream. When in the in-field state, it would only look for the next un-escaped quote and while in the out-of-field state, it would execute the existing logic as well as looking for the next un-escaped quote. I may not be explaining myself well or I may fundamentally misunderstand how copy works. I would be happy to code the change and send it to you for review, if you would be interested in looking it over and it is felt to be a worthwhile capability. Patrick B. Kelly ------------------------------------------------------ http://patrickbkelly.org
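For illustration, here is a minimal sketch of the quote-aware record reader Patrick is describing. This is not the actual backend/commands/copy.c code, and the function name csv_read_record is invented for the sketch; it assumes a single-byte encoding and the default '"' CSV quote character, and it sidesteps the client-encoding conversion issue Tom raises in the next message.

#include <stdio.h>
#include <stdbool.h>

/*
 * Read one CSV record into buf, treating CR/LF inside a quoted field
 * as ordinary data rather than as an end-of-record marker.
 */
static bool
csv_read_record(FILE *fp, char *buf, size_t buflen)
{
    bool    in_quote = false;
    size_t  n = 0;
    int     c = EOF;

    while (n + 1 < buflen && (c = getc(fp)) != EOF)
    {
        if (c == '"')
            in_quote = !in_quote;   /* a doubled "" simply toggles twice */

        if (!in_quote && (c == '\n' || c == '\r'))
        {
            if (c == '\r')          /* swallow the LF of a CRLF pair */
            {
                int     next = getc(fp);

                if (next != '\n' && next != EOF)
                    ungetc(next, fp);
            }
            break;                  /* unquoted line end terminates the record */
        }
        buf[n++] = (char) c;
    }
    buf[n] = '\0';
    return n > 0 || c != EOF;
}

int
main(void)
{
    char    line[8192];

    while (csv_read_record(stdin, line, sizeof(line)))
        printf("record: [%s]\n", line);
    return 0;
}

A single in_quote flag is enough for this purpose because a doubled quote inside a field flips the flag off and straight back on, so only an unquoted CR or LF is ever treated as a record boundary; this is essentially the two-state machine proposed above.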
Can I see an example of such a failure line? --------------------------------------------------------------------------- Andrew Dunstan wrote: > > Darcy Buskermolen has drawn my attention to unfortunate behaviour of > COPY CSV with fields containing embedded line end chars if the embedded > sequence isn't the same as those of the file containing the CSV data. In > that case we error out when reading the data in. This means there are > cases where we can produce a CSV data file which we can't read in, which > is not at all pleasant. > > Possible approaches to the problem: > . make it a documented limitation > . have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that > relaxes some of the restrictions on inconsistent line endings > . escape embedded line end chars > > The last really isn't an option, because the whole point of CSVs is to > play with other programs, and my understanding is that those that > understand multiline fields (e.g. Excel) expect them not to be escaped, > and do not produce them escaped. > > So right now I'm tossing up in my head between the first two options. Or > maybe there's another solution I haven't thought of. > > Thoughts? > > cheers > > andrew > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Patrick B Kelly <pbk@patrickbkelly.org> writes: > I may not be explaining myself well or I may fundamentally > misunderstand how copy works. Well, you're definitely ignoring the character-set-conversion issue. regards, tom lane
This example should fail on data line 2 or 3 on any platform, regardless of the platform's line-end convention, although I haven't tested on Windows.

cheers

andrew

[andrew@aloysius inst]$ bin/psql -e -f csverr.sql ; od -c /tmp/csverrtest.csv
create table csverrtest (a int, b text, c int);
CREATE TABLE
insert into csverrtest values(1,'a',1);
INSERT 122471 1
insert into csverrtest values(2,'foo\r\nbar',2);
INSERT 122472 1
insert into csverrtest values(3,'baz\nblurfl',3);
INSERT 122473 1
insert into csverrtest values(4,'d',4);
INSERT 122474 1
insert into csverrtest values(5,'e',5);
INSERT 122475 1
copy csverrtest to '/tmp/csverrtest.csv' csv;
COPY
truncate csverrtest;
TRUNCATE TABLE
copy csverrtest from '/tmp/csverrtest.csv' csv;
psql:cvserr.sql:9: ERROR:  literal carriage return found in data
HINT:  Use "\r" to represent carriage return.
CONTEXT:  COPY csverrtest, line 2: "2,"foo"
drop table csverrtest;
DROP TABLE
0000000   1   ,   a   ,   1  \n   2   ,   "   f   o   o  \r  \n   b   a
0000020   r   "   ,   2  \n   3   ,   "   b   a   z  \n   b   l   u   r
0000040   f   l   "   ,   3  \n   4   ,   d   ,   4  \n   5   ,   e   ,
0000060   5  \n
0000062
[andrew@aloysius inst]$

Bruce Momjian wrote:
>Can I see an example of such a failure line?
>
>---------------------------------------------------------------------------
>
>Andrew Dunstan wrote:
>
>>Darcy Buskermolen has drawn my attention to unfortunate behaviour of
>>COPY CSV with fields containing embedded line end chars if the embedded
>>sequence isn't the same as those of the file containing the CSV data. In
>>that case we error out when reading the data in. This means there are
>>cases where we can produce a CSV data file which we can't read in, which
>>is not at all pleasant.
>>
>>Possible approaches to the problem:
>>. make it a documented limitation
>>. have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that
>>relaxes some of the restrictions on inconsistent line endings
>>. escape embedded line end chars
>>
>>The last really isn't an option, because the whole point of CSVs is to
>>play with other programs, and my understanding is that those that
>>understand multiline fields (e.g. Excel) expect them not to be escaped,
>>and do not produce them escaped.
>>
>>So right now I'm tossing up in my head between the first two options. Or
>>maybe there's another solution I haven't thought of.
>>
>>Thoughts?
>>
>>cheers
>>
>>andrew
On Nov 12, 2004, at 12:20 AM, Tom Lane wrote: > Patrick B Kelly <pbk@patrickbkelly.org> writes: >> I may not be explaining myself well or I may fundamentally >> misunderstand how copy works. > > Well, you're definitely ignoring the character-set-conversion issue. > I was not trying to ignore the character set and encoding issues, but perhaps my assumptions are naive or overly optimistic. I realized that quotes are not as consistent as the NL characters, but I was assuming that some encodings would escape to ASCII or a similar encoding like JIS Roman that would simplify recognition of the quote character. Unicode files make recognizing other punctuation like the quote fairly straightforward, and to the naive observer the code in CopyReadLine as it is currently written appears to handle multi-byte encodings such as SJIS that may present characters below 127 in trailing bytes. As I said, perhaps I was oversimplifying. Is there a regression test set of input files that I could review to see all of the supported encodings? Patrick B. Kelly ------------------------------------------------------ http://patrickbkelly.org
OK, what solutions do we have for this? Not being able to load dumped data is a serious bug. I have added this to the open items list: * fix COPY CSV with \r,\n in data My feeling is that if we are in a quoted string we just process whatever characters we find, even passing through an EOL. I realize it might not mark missing quote errors well but that seems minor compared to not loading valid data. --------------------------------------------------------------------------- Andrew Dunstan wrote: > > This example should fail on data line 2 or 3 on any platform, > regardless of the platform's line-end convention, although I haven't > tested on Windows. > > cheers > > andrew > > [andrew@aloysius inst]$ bin/psql -e -f csverr.sql ; od -c > /tmp/csverrtest.csv > create table csverrtest (a int, b text, c int); > CREATE TABLE > insert into csverrtest values(1,'a',1); > INSERT 122471 1 > insert into csverrtest values(2,'foo\r\nbar',2); > INSERT 122472 1 > insert into csverrtest values(3,'baz\nblurfl',3); > INSERT 122473 1 > insert into csverrtest values(4,'d',4); > INSERT 122474 1 > insert into csverrtest values(5,'e',5); > INSERT 122475 1 > copy csverrtest to '/tmp/csverrtest.csv' csv; > COPY > truncate csverrtest; > TRUNCATE TABLE > copy csverrtest from '/tmp/csverrtest.csv' csv; > psql:cvserr.sql:9: ERROR: literal carriage return found in data > HINT: Use "\r" to represent carriage return. > CONTEXT: COPY csverrtest, line 2: "2,"foo" > drop table csverrtest; > DROP TABLE > 0000000 1 , a , 1 \n 2 , " f o o \r \n b a > 0000020 r " , 2 \n 3 , " b a z \n b l u r > 0000040 f l " , 3 \n 4 , d , 4 \n 5 , e , > 0000060 5 \n > 0000062 > [andrew@aloysius inst]$ > > Bruce Momjian wrote: > > >Can I see an example of such a failure line? > > > >--------------------------------------------------------------------------- > > > >Andrew Dunstan wrote: > > > > > >>Darcy Buskermolen has drawn my attention to unfortunate behaviour of > >>COPY CSV with fields containing embedded line end chars if the embedded > >>sequence isn't the same as those of the file containing the CSV data. In > >>that case we error out when reading the data in. This means there are > >>cases where we can produce a CSV data file which we can't read in, which > >>is not at all pleasant. > >> > >>Possible approaches to the problem: > >>. make it a documented limitation > >>. have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that > >>relaxes some of the restrictions on inconsistent line endings > >>. escape embedded line end chars > >> > >>The last really isn't an option, because the whole point of CSVs is to > >>play with other programs, and my understanding is that those that > >>understand multiline fields (e.g. Excel) expect them not to be escaped, > >>and do not produce them escaped. > >> > >>So right now I'm tossing up in my head between the first two options. Or > >>maybe there's another solution I haven't thought of. > >> > >>Thoughts? > >> > >>cheers > >> > >>andrew > >> > >>---------------------------(end of broadcast)--------------------------- > >>TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > >> > >> > >> > > > > > > > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > OK, what solutions do we have for this? Not being able to load dumped > data is a serious bug. Which we do not have, because pg_dump doesn't use CSV. I do not think this is a must-fix, especially not if the proposed fix introduces inconsistencies elsewhere. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > OK, what solutions do we have for this? Not being able to load dumped > > data is a serious bug. > > Which we do not have, because pg_dump doesn't use CSV. I do not think > this is a must-fix, especially not if the proposed fix introduces > inconsistencies elsewhere. Sure, pg_dump doesn't use it but COPY should be able to load anything it output. Can this be fixed if we ignore the problem with reporting errors? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian said: > Tom Lane wrote: >> Bruce Momjian <pgman@candle.pha.pa.us> writes: >> > OK, what solutions do we have for this? Not being able to load >> > dumped data is a serious bug. >> >> Which we do not have, because pg_dump doesn't use CSV. I do not think >> this is a must-fix, especially not if the proposed fix introduces >> inconsistencies elsewhere. > > Sure, pg_dump doesn't use it but COPY should be able to load anything > it output. > > Can this be fixed if we ignore the problem with reporting errors? > When I looked at it I could not see any simple fix that was not worse than the symptom. If the asymmetry offends you, then we could do as Tom suggested and rip out the multiline processing completely for now. Personally I would regard that as a pity, as it would disallow many cases that will work quite happily as we are, and because this is a feature that was requested when we did this work. The limitation has been documented - my inclination would be to revisit this during the 8.1 dev cycle. cheers andrew
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> Which we do not have, because pg_dump doesn't use CSV. I do not think >> this is a must-fix, especially not if the proposed fix introduces >> inconsistencies elsewhere. > Sure, pg_dump doesn't use it but COPY should be able to load anything it > output. I'd buy into that proposition if CSV showed any evidence of being a sanely defined format, but it shows every indication of being neither well-defined, nor self-consistent, nor even particularly portable. I suggest adjusting your expectations. All I expect from that code is being able to load the majority of data from the more popular Microsloth applications. Trying to achieve 100% consistency for corner cases is just going to interfere with the real use-case for the feature, which is coping with output from applications that aren't very consistent in the first place. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> Which we do not have, because pg_dump doesn't use CSV. I do not think > >> this is a must-fix, especially not if the proposed fix introduces > >> inconsistencies elsewhere. > > > Sure, pg_dump doesn't use it but COPY should be able to load anything it > > output. > > I'd buy into that proposition if CSV showed any evidence of being a > sanely defined format, but it shows every indication of being neither > well-defined, nor self-consistent, nor even particularly portable. > I suggest adjusting your expectations. All I expect from that code is > being able to load the majority of data from the more popular Microsloth > applications. Trying to achieve 100% consistency for corner cases is > just going to interfere with the real use-case for the feature, which is > coping with output from applications that aren't very consistent in the > first place. OK, then should we disallow dumping out data in CSV format that we can't load? Seems like the least we should do for 8.0. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: >Tom Lane wrote: > > >>Bruce Momjian <pgman@candle.pha.pa.us> writes: >> >> >>>Tom Lane wrote: >>> >>> >>>>Which we do not have, because pg_dump doesn't use CSV. I do not think >>>>this is a must-fix, especially not if the proposed fix introduces >>>>inconsistencies elsewhere. >>>> >>>> >>>Sure, pg_dump doesn't use it but COPY should be able to load anything it >>>output. >>> >>> >>I'd buy into that proposition if CSV showed any evidence of being a >>sanely defined format, but it shows every indication of being neither >>well-defined, nor self-consistent, nor even particularly portable. >>I suggest adjusting your expectations. All I expect from that code is >>being able to load the majority of data from the more popular Microsloth >>applications. Trying to achieve 100% consistency for corner cases is >>just going to interfere with the real use-case for the feature, which is >>coping with output from applications that aren't very consistent in the >>first place. >> >> > >OK, then should we disallow dumping out data in CVS format that we can't >load? Seems like the least we should do for 8.0. > > > As Tom rightly points out, having data make the round trip was not the goal of the exercise. Excel, for example, has no trouble reading such data (or at least my installation of it). Personally I consider CSVs with line end chars embedded in fields to be broken anyway, but this was something that was specifically mentioned when we were discussing requirements, which is why I coded for it. cheers andrew
Andrew Dunstan wrote: > >OK, then should we disallow dumping out data in CVS format that we can't > >load? Seems like the least we should do for 8.0. > > > > > > > > As Tom rightly points out, having data make the round trip was not the > goal of the exercise. Excel, for example, has no trouble reading such > data (or at least my installation of it). > > Personally I consider CSVs with line end chars embedded in fields to be > broken anyway, but this was something that was specifically mentioned > when we were discussing requirements, which is why I coded for it. OK, I am pretty uncomfortable with this but you know this usage better than I do. Should we issue a warning message stating it will not be able to be reloaded? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: > Andrew Dunstan wrote: > > >OK, then should we disallow dumping out data in CVS format that we can't > > >load? Seems like the least we should do for 8.0. > > > > > > > > > > > > > As Tom rightly points out, having data make the round trip was not the > > goal of the exercise. Excel, for example, has no trouble reading such > > data (or at least my installation of it). > > > > Personally I consider CSVs with line end chars embedded in fields to be > > broken anyway, but this was something that was specifically mentioned > > when we were discussing requirements, which is why I coded for it. > > OK, I am pretty uncomforable with this but you know this usage better > than I do. Should we issue a warning message stating it will not be > able to be reloaded? Also, can you explain why we can't read across a newline to the next quote? Is it a problem with the way our code is structured or is it a logical problem? Someone mentioned multibyte encodings but I don't understand how that applies here. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: >Andrew Dunstan wrote: > > >>>OK, then should we disallow dumping out data in CVS format that we can't >>>load? Seems like the least we should do for 8.0. >>> >>> >>> >>> >>> >>As Tom rightly points out, having data make the round trip was not the >>goal of the exercise. Excel, for example, has no trouble reading such >>data (or at least my installation of it). >> >>Personally I consider CSVs with line end chars embedded in fields to be >>broken anyway, but this was something that was specifically mentioned >>when we were discussing requirements, which is why I coded for it. >> >> > >OK, I am pretty uncomforable with this but you know this usage better >than I do. Should we issue a warning message stating it will not be >able to be reloaded? > > If it bothers you that much. I'd make a flag, cleared at the start of each COPY, and then where we test for CR or LF in CopyAttributeOutCSV, if the flag is not set then set it and issue the warning. Longer term I'd like to be able to have a command parameter that specifies certain fields as multiline and for those relax the line end matching restriction (and for others forbid multiline altogether). That would be a TODO for 8.1 though, along with optional special handling for first line column headings. cheers andrew
Bruce Momjian wrote: > >Also, can you explain why we can't read across a newline to the next >quote? Is it a problem with the way our code is structured or is it a >logical problem? Someone mentioned multibyte encodings but I don't >understand how that applies here. > > > In a CSV file, each line is a record. Reading across a newline for the next quote (assuming the next field is quoted) would mean stealing fields from the next record. I did see one complaint about missing or extra fields at the end of a record - I think it is reasonable for us to expect the data to be rectangular, and not ragged. (I hope this answers your question - I am not 100% certain I understood it). cheers andrew
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Also, can you explain why we can't read across a newline to the next > quote? Is it a problem with the way our code is structured or is it a > logical problem? It's a structural issue in the sense that we separate the act of dividing the input into rows from the act of dividing it into columns. I do not think that separation is wrong however. regards, tom lane
On Mon, 29 Nov 2004, Andrew Dunstan wrote: > Longer term I'd like to be able to have a command parameter that > specifies certain fields as multiline and for those relax the line end > matching restriction (and for others forbid multiline altogether). That > would be a TODO for 8.1 though, along with optional special handling for > first line column headings. > Endlessly extending the COPY command doesn't seem like a winning proposition to me and I think if we aren't comfortable telling every user to write a script to pre/post-process the data we should instead provide a bulk loader/unloader that transforms things to our limited COPY functionality. There are all kinds of feature requests I've seen along these lines that would make COPY a million option mess if we try to support all of it directly.

- skipping header rows
- skipping certain data file columns
- specifying date formats
- ignoring duplicates
- outputting an arbitrary SELECT statement

Kris Jurka
Kris Jurka wrote: > > > On Mon, 29 Nov 2004, Andrew Dunstan wrote: > > > Longer term I'd like to be able to have a command parameter that > > specifies certain fields as multiline and for those relax the line end > > matching restriction (and for others forbid multiline altogether). That > > would be a TODO for 8.1 though, along with optional special handling for > > first line column headings. > > > > Endlessly extending the COPY command doesn't seem like a winning > proposition to me and I think if we aren't comfortable telling every user > to write a script to pre/post-process the data we should instead provide a > bulk loader/unloader that transforms things to our limited COPY > functionality. There are all kinds of feature requests I've seen > along these lines that would make COPY a million option mess if we try to > support all of it directly. > > - skipping header rows > - skipping certain data file columns > - specifying date formats > - ignoring duplicates > - outputting an arbitrary SELECT statement Agreed. There are lots of wishes for COPY and it will become bloated if we do them all. I am concerned someone will say, "Oh, I know the CSV format and I might load the data into another database someday so I will always use CSV" not knowing it isn't a 100% consistent format. I think we need to issue a warning if a \r or \n is output by COPY CSV just so people understand the limitation. We can then reevaluate where we need to go for 8.1. Open item updated: * warn on COPY TO ... CSV with \r,\n in data -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Kris Jurka <books@ejurka.com> writes: > Endlessly extending the COPY command doesn't seem like a winning > proposition to me and I think if we aren't comfortable telling every user > to write a script to pre/post-process the data we should instead provide a > bulk loader/unloader that transforms things to our limited COPY > functionality. There are all kinds of feature requests I've seen > along these lines that would make COPY a million option mess if we try to > support all of it directly. I agree completely --- personally I'd not have put CSV into the backend either. IIRC we already have a TODO item for a separate bulk loader, but no one's stepped up to the plate yet :-( regards, tom lane
Tom Lane wrote: > Kris Jurka <books@ejurka.com> writes: > > Endlessly extending the COPY command doesn't seem like a winning > > proposition to me and I think if we aren't comfortable telling every user > > to write a script to pre/post-process the data we should instead provide a > > bulk loader/unloader that transforms things to our limited COPY > > functionality. There are all kinds of feature requests I've seen > > along these lines that would make COPY a million option mess if we try to > > support all of it directly. > > I agree completely --- personally I'd not have put CSV into the backend > either. What pushed us was the large number of requests for it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote: >Kris Jurka <books@ejurka.com> writes: > > >>Endlessly extending the COPY command doesn't seem like a winning >>proposition to me and I think if we aren't comfortable telling every user >>to write a script to pre/post-process the data we should instead provide a >>bulk loader/unloader that transforms things to our limited COPY >>functionality. There are all kinds of feature requests I've seen >>along these lines that would make COPY a million option mess if we try to >>support all of it directly. >> >> > >I agree completely --- personally I'd not have put CSV into the backend >either. > >IIRC we already have a TODO item for a separate bulk loader, but no >one's stepped up to the plate yet :-( > IIRC, the way it happened was that a proposal was made to do CSV import/export in a fairly radical way, I countered with a much more modest approach, which was generally accepted and which Bruce and I then implemented, not without some angst (as well as a little sturm und drang). The advantage of having it in COPY is that it can be done server-side direct from the file system. For massive bulk loads that might be a plus, although I don't know what the protocol+socket overhead is. Maybe it would just be lost in the noise. Certainly I can see some sense in having COPY deal with straightforward cases and a bulk-load-unload program in bin to handle the hairier cases. Multiline fields would come into that category. The bulk-load-unload facility could possibly handle things other than CSV format too (XML anyone?). The nice thing about an external program is that it would not have to handle data embedded in an SQL stream, so the dangers from shifts in newline style, missing quotes, and the like would be far lower. We do need to keep things in perspective a bit. The small wrinkle that has spawned this whole thread will not affect most users of the facility - and many many users will thank us for having provided it.
Andrew Dunstan <andrew@dunslane.net> writes: > The advantage of having it in COPY is that it can be done serverside direct > from the file system. For massive bulk loads that might be a plus, although I > don't know what the protocol+socket overhead is. Actually even if you use client-side COPY it's *still* more efficient than any more general client-side alternative. As Tom pointed out to me a while back, neither the protocol nor libpq allow for having multiple queries in flight simultaneously. That makes it impossible to stream large quantities of data to the server efficiently. Each record requires a round-trip and context switch overhead. In an ideal world the client should be able to queue up enough records to fill the socket buffers and allow the kernel to switch to a more batch oriented context switch mode where the server can process large numbers of records before switching back to the client. Ideally this would apply to any kind of query execution. But as a kind of short cut towards this for bulk loading I'm curious whether it would be possible to adopt a sort of batch execution mode where a statement is prepared, then parameters to the statement are streamed to the server in a kind of COPY mode. It would have to be some format that allowed for embedded newlines though; there's just no point in an interface that can't handle arbitrary data. Personally I find the current CSV support inadequate. It seems pointless to support CSV if it can't load data exported from Excel, which seems like the main use case. But I always thought bulk loading should be from some external application anyways. The problem is that there isn't any interface suitable for an external application to use. -- greg
On Tue, 30 Nov 2004, Greg Stark wrote: > > Andrew Dunstan <andrew@dunslane.net> writes: > > > The advantage of having it in COPY is that it can be done serverside > > direct from the file system. For massive bulk loads that might be a > > plus, although I don't know what the protocol+socket overhead is. > > Actually even if you use client-side COPY it's *still* more efficient than any > more general client-side alternative. The idea would be to still use COPY just from a program that did additional processing, not as direct SQL. > As Tom pointed out to me a while back, neither the protocol nor libpq allow > for having multiple queries in flight simultaneously. That makes it impossible > to stream large quantities of data to the server efficiently. Each record > requires a round-trip and context switch overhead. Multiplexing queries is different than having multiple queries in flight. You can have multiple queries on the wire now, they are just processed sequentially. > In an ideal world the client should be able to queue up enough records to fill > the socket buffers and allow the kernel to switch to a more batch oriented > context switch mode where the server can process large numbers of records > before switching back to the client. Ideally this would apply to any kind of > query execution. This is possible now with the V3 protocol (and used by the JDBC driver). For an executeBatch() on a PreparedStatement, the driver sends a parse message, then any number of bind/execute pairs, but with a buffered stream nothing happens until a Sync message is sent and the stream is flushed. Then it collects the results of all of the executes. Kris Jurka
Greg Stark wrote: >Personally I find the current CSV support inadequate. It seems pointless to >support CSV if it can't load data exported from Excel, which seems like the >main use case. > > OK, I'm starting to get mildly annoyed now. We have identified one failure case connected with multiline fields. Even in the multiline case, I expect that the embedded newlines will usually match those of the CSV file, in which case there will be no failure. It's a very big step from there to the far more general "can't load data exported from Excel". Or did you have some other limitation in mind? FWIW, I don't make a habit of using multiline fields in my spreadsheets - and some users I have spoken to aren't even aware that you can have them at all. cheers andrew
> > > Greg Stark wrote: > > >Personally I find the current CSV support inadequate. It seems pointless to > >support CSV if it can't load data exported from Excel, which seems like the > >main use case. > > > > > > OK, I'm starting to get mildly annoyed now. We have identified one > failure case connected with multiline fields. Even in the multiline > case, I expect that the embedded newlines will usually match those of > the CSV file, in which case there will be no failure. It's a very big > step from there to the far more general "can't load data exported from > Excel". Or did you have some other limitation in mind? > > FWIW, I don't make a habit of using multiline fields in my spreadsheets > - and some users I have spoken to aren't even aware that you can have > them at all. > I am normally more of a lurker on these lists, but I thought you had better know that when we developed CSV import/export for an application at my last company we discovered that Excel can't always even read the CSV that _it_ has output! (With embedded newlines a particular problem) It is far more reliable if you output your data as an HTML table, in which case it practically always gets it right. Perhaps Postgres could support this as an import/export mechanism as I have found it to be far less error prone than CSV? Cheers, Ben Young > cheers > > andrew
Ben.Young@risk.sungard.com wrote: > >I am normally more of a lurker on these lists, but I thought you had >better know >that when we developed CSV import/export for an application at my last >company >we discovered that Excel can't always even read the CSV that _it_ has >output! >(With embedded newlines a particular problem) > > Quite so. This is not about being perfect, and there will be failures. But absent this problem, the feature should work reasonably well. Note that Excel is not the only kid on the block - if it were there would be a good case for avoiding CSV anyway and instead reading/writing the Excel files directly. We included support for CSVs because, notwithstanding how braindead the format is, it is still widely used in data exchange. >It is far more reliable if you output your data as an HTML table, in which >case it >practically always gets it right. Perhaps Postgres could support this as >an import/ >export mechanism as I have found it to be far less error prone than CSV? > > That's probably possible but rather ugly. A well defined XML format would be far better (don't we have something like that in contrib?). Things for a bulk-export facility, I think. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > FWIW, I don't make a habit of using multiline fields in my spreadsheets - and > some users I have spoken to aren't even aware that you can have them at all. Unfortunately I don't get a choice. I offer a field on the web site where users can upload an excel sheet. Some of the fields of my database are expected to have multi-line text in them. So I expect virtually all the uploads to have multi-line fields in them. So as far as I'm concerned, this import system is simply unusable. I have to write a program to do the import. Since I was always planning to write such a program I'm not too disappointed though. -- greg
Andrew Dunstan wrote: > > > Greg Stark wrote: > > >Personally I find the current CSV support inadequate. It seems pointless to > >support CSV if it can't load data exported from Excel, which seems like the > >main use case. > > > > > > OK, I'm starting to get mildly annoyed now. We have identified one > failure case connected with multiline fields. Even in the multiline > case, I expect that the embedded newlines will usually match those of > the CSV file, in which case there will be no failure. It's a very big > step from there to the far more general "can't load data exported from > Excel". Or did you have some other limitation in mind? > > FWIW, I don't make a habit of using multiline fields in my spreadsheets > - and some users I have spoken to aren't even aware that you can have > them at all. I am wondering if one good solution would be to pre-process the input stream in copy.c to convert newline to \n and carriage return to \r and double data backslashes and tell copy.c to interpret those like it does for normal text COPY files. That way, the changes to copy.c might be minimal; basically, place a filter in front of the CSV file that cleans up the input so it can be more easily processed. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
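For illustration, here is a rough sketch of the kind of pre-processing filter Bruce is describing; it is an assumption about how such a filter might look, not existing code. The name csv_escape_filter is invented, it assumes the default '"' quote character and no encoding conversion, and, as the reply below points out, the filter itself still has to track quoting in order to know which CRs and LFs are data.

#include <stdio.h>
#include <stdbool.h>

/*
 * Rewrite a raw CSV stream so that line ends embedded in quoted fields
 * become the \n and \r escapes that text-mode COPY already understands,
 * and literal backslashes in the data are doubled to protect them.
 */
static void
csv_escape_filter(FILE *in, FILE *out)
{
    bool    in_quote = false;
    int     c;

    while ((c = getc(in)) != EOF)
    {
        if (c == '"')
            in_quote = !in_quote;

        if (in_quote && c == '\n')
            fputs("\\n", out);      /* embedded linefeed -> backslash-n */
        else if (in_quote && c == '\r')
            fputs("\\r", out);      /* embedded carriage return -> backslash-r */
        else if (c == '\\')
            fputs("\\\\", out);     /* double data backslashes */
        else
            putc(c, out);
    }
}

int
main(void)
{
    csv_escape_filter(stdin, stdout);
    return 0;
}

Run over the data before it reaches COPY, this would leave no literal CR or LF inside fields, at the cost of parsing the stream an extra time, which is exactly the objection raised in the reply that follows.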
Bruce Momjian wrote: >I am wondering if one good solution would be to pre-process the input >stream in copy.c to convert newline to \n and carriage return to \r and >double data backslashes and tell copy.c to interpret those like it does >for normal text COPY files. That way, the changes to copy.c might be >minimal; basically, place a filter in front of the CSV file that cleans >up the input so it can be more easily processed. > > This would have to parse the input stream, because you would need to know which CRs and LFs were part of the data stream and so should be escaped, and which really ended data lines and so should be left alone. However, while the idea is basically sound, parsing the stream twice seems crazy. My argument has been that at this stage in the dev cycle we should document the limitation, maybe issue a warning as you want, and make the more invasive code changes to fix it properly in 8.1. If you don't want to wait, then following your train of thought a bit, ISTM that the correct solution is a routine for CSV mode that combines the functions of CopyReadAttributeCSV() and CopyReadLine(). Then we'd have a genuine and fast fix for Greg's and Darcy's problem. cheers andrew
Andrew Dunstan wrote: > > > Bruce Momjian wrote: > > >I am wondering if one good solution would be to pre-process the input > >stream in copy.c to convert newline to \n and carriage return to \r and > >double data backslashes and tell copy.c to interpret those like it does > >for normal text COPY files. That way, the changes to copy.c might be > >minimal; basically, place a filter in front of the CSV file that cleans > >up the input so it can be more easily processed. > > > > > > This would have to parse the input stream, because you would need to > know which CRs and LFs were part of the data stream and so should be > escaped, and which really ended data lines and so should be left alone. > However, while the idea is basically sound, parsing the stream twice > seems crazy. My argument has been that at this stage in the dev cycle we > should document the limitation, maybe issue a warning as you want, and > make the more invasive code changes to fix it properly in 8.1. If you OK, right. > don't want to wait, then following your train of thought a bit, ISTM > that the correct solution is a routine for CSV mode that combines the > functions of CopyReadAttributeCSV() and CopyReadLine(). Then we'd have a > genuine and fast fix for Greg's and Darcy's problem. We are fine for 8.0, except for the warning, and you think we can fix it perfectly in 8.1, good. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
I wrote:

> If it bothers you that much. I'd make a flag, cleared at the start of
> each COPY, and then where we test for CR or LF in CopyAttributeOutCSV,
> if the flag is not set then set it and issue the warning.

I didn't realise until Bruce told me just now that I was on the hook for this. I guess I should keep my big mouth shut. (Yeah, that's gonna happen ...)

Anyway, here's a tiny patch that does what I had in mind.

cheers

andrew

Index: copy.c
===================================================================
RCS file: /home/cvsmirror/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.234
diff -c -r1.234 copy.c
*** copy.c	6 Nov 2004 17:46:27 -0000	1.234
--- copy.c	2 Dec 2004 23:34:20 -0000
***************
*** 98,103 ****
--- 98,104 ----
  static EolType eol_type;        /* EOL type of input */
  static int     client_encoding; /* remote side's character encoding */
  static int     server_encoding; /* local encoding */
+ static bool embedded_line_warning;

  /* these are just for error messages, see copy_in_error_callback */
  static bool copy_binary;        /* is it a binary copy? */
***************
*** 1190,1195 ****
--- 1191,1197 ----
  	attr = tupDesc->attrs;
  	num_phys_attrs = tupDesc->natts;
  	attr_count = list_length(attnumlist);
+ 	embedded_line_warning = false;

  	/*
  	 * Get info about the columns we need to process.
***************
*** 2627,2632 ****
--- 2629,2653 ----
  			 !use_quote && (c = *test_string) != '\0';
  			 test_string += mblen)
  		{
+ 			/*
+ 			 * We don't know here what the surrounding line end characters
+ 			 * might be. It might not even be under postgres' control. So
+ 			 * we simple warn on ANY embedded line ending character.
+ 			 *
+ 			 * This warning will disappear when we make line parsing field-aware,
+ 			 * so that we can reliably read in embedded line ending characters
+ 			 * regardless of the file's line-end context.
+ 			 *
+ 			 */
+ 
+ 			if (!embedded_line_warning && (c == '\n' || c == '\r') )
+ 			{
+ 				embedded_line_warning = true;
+ 				elog(WARNING,
+ 					 "CSV fields with embedded linefeed or carriage return "
+ 					 "characters might not be able to be reimported");
+ 			}
+ 
  			if (c == delimc || c == quotec || c == '\n' || c == '\r')
  				use_quote = true;
  			if (!same_encoding)
Andrew Dunstan <andrew@dunslane.net> writes: > + if (!embedded_line_warning && (c == '\n' || c == '\r') ) > + { > + embedded_line_warning = true; > + elog(WARNING, > + "CSV fields with embedded linefeed or carriage return " > + "characters might not be able to be reimported"); > + } What about forcibly translating them to the two-character sequences \n or \r? Or is that not considered a CSV-compatible representation? regards, tom lane
Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>+ if (!embedded_line_warning && (c == '\n' || c == '\r') ) >>+ { >>+ embedded_line_warning = true; >>+ elog(WARNING, >>+ "CSV fields with embedded linefeed or carriage return " >>+ "characters might not be able to be reimported"); >>+ } >> >> > >What about forcibly translating them to the two-character sequences \n >or \r? Or is that not considered a CSV-compatible representation? > > > > Not compatible AFAIK. Certainly not portably. And the warning would still be true, because we don't do this unescaping on the way back in. I think the way the comment in the patch suggests and previous emails have discussed is the right way to go with this - I will attend to it after we branch ;-) cheers andrew
Patch applied. Thanks. --------------------------------------------------------------------------- Andrew Dunstan wrote: > > > I wrote: > > > > > If it bothers you that much. I'd make a flag, cleared at the start of > > each COPY, and then where we test for CR or LF in CopyAttributeOutCSV, > > if the flag is not set then set it and issue the warning. > > > > I didn't realise until Bruce told me just now that I was on the hook for > this. I guess i should keep my big mouth shut. (Yeah, that's gonna > happen ...) > > Anyway, here's a tiny patch that does what I had in mind. > > cheers > > andrew [ text/x-patch is unsupported, treating like TEXT/PLAIN ] > Index: copy.c > =================================================================== > RCS file: /home/cvsmirror/pgsql/src/backend/commands/copy.c,v > retrieving revision 1.234 > diff -c -r1.234 copy.c > *** copy.c 6 Nov 2004 17:46:27 -0000 1.234 > --- copy.c 2 Dec 2004 23:34:20 -0000 > *************** > *** 98,103 **** > --- 98,104 ---- > static EolType eol_type; /* EOL type of input */ > static int client_encoding; /* remote side's character encoding */ > static int server_encoding; /* local encoding */ > + static bool embedded_line_warning; > > /* these are just for error messages, see copy_in_error_callback */ > static bool copy_binary; /* is it a binary copy? */ > *************** > *** 1190,1195 **** > --- 1191,1197 ---- > attr = tupDesc->attrs; > num_phys_attrs = tupDesc->natts; > attr_count = list_length(attnumlist); > + embedded_line_warning = false; > > /* > * Get info about the columns we need to process. > *************** > *** 2627,2632 **** > --- 2629,2653 ---- > !use_quote && (c = *test_string) != '\0'; > test_string += mblen) > { > + /* > + * We don't know here what the surrounding line end characters > + * might be. It might not even be under postgres' control. So > + * we simple warn on ANY embedded line ending character. > + * > + * This warning will disappear when we make line parsing field-aware, > + * so that we can reliably read in embedded line ending characters > + * regardless of the file's line-end context. > + * > + */ > + > + if (!embedded_line_warning && (c == '\n' || c == '\r') ) > + { > + embedded_line_warning = true; > + elog(WARNING, > + "CSV fields with embedded linefeed or carriage return " > + "characters might not be able to be reimported"); > + } > + > if (c == delimc || c == quotec || c == '\n' || c == '\r') > use_quote = true; > if (!same_encoding) > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073