Thread: COPY for CSV documentation
Attached is a patch with documentation for the CSV mode of COPY patch submitted yesterday. cheers andrew Index: doc/src/sgml/ref/copy.sgml =================================================================== RCS file: /projects/cvsroot/pgsql-server/doc/src/sgml/ref/copy.sgml,v retrieving revision 1.55 diff -c -r1.55 copy.sgml *** doc/src/sgml/ref/copy.sgml 13 Dec 2003 23:59:07 -0000 1.55 --- doc/src/sgml/ref/copy.sgml 10 Apr 2004 16:22:23 -0000 *************** *** 136,142 **** <para> Specifies copying the OID for each row. (An error is raised if <literal>OIDS</literal> is specified for a table that does not ! have OIDs.) </para> </listitem> </varlistentry> --- 136,143 ---- <para> Specifies copying the OID for each row. (An error is raised if <literal>OIDS</literal> is specified for a table that does not ! have OIDs, or if CSV mode is selected by specifying ! <literal>DELIMITER</literal> longer then one character.) </para> </listitem> </varlistentry> *************** *** 145,152 **** <term><replaceable class="parameter">delimiter</replaceable></term> <listitem> <para> ! The single character that separates columns within each row ! (line) of the file. The default is a tab character. </para> </listitem> </varlistentry> --- 146,162 ---- <term><replaceable class="parameter">delimiter</replaceable></term> <listitem> <para> ! A one- to three-character string containing characters used in importing ! or exporting Text or Comma Separated Variable (CSV) files. ! With only one character, Text file format is used. ! If there are two or three characters, CSV format is used. ! The first character for either file format is the delimiter character ! used between values for a single row. ! The second character is the CSV quote character, ! and the third character, if present, is the CSV escape character ! used inside quoted values, and defaults to the quote character. ! The default for this parameter is a single tab character. (Thus TEXT ! file mode is the default.) </para> </listitem> </varlistentry> *************** *** 156,163 **** <listitem> <para> The string that represents a null value. The default is ! <literal>\N</literal> (backslash-N). You might prefer an empty ! string, for example. </para> <note> --- 166,174 ---- <listitem> <para> The string that represents a null value. The default is ! <literal>\N</literal> (backslash-N), unless a CSV is being processed, ! in which case it is an empty string. You might prefer an empty string ! in any case. </para> <note> *************** *** 168,173 **** --- 179,197 ---- <command>COPY TO</command>. </para> </note> + + <note> + <para> + If you don't want anything used as null when using + <command>COPY FROM</command>, you can specify some value that is very + unlikely to appear in the file, such as <literal>frobnitz</literal> or + <literal>d5f4074b254c76cd8ae37bf1731f4aed</literal> (which is + <literal>md5('frobnitz')</literal>). This could be especially useful + when importing a CSV file into a table with <literal>NOT NULL</> + columns. + </para> + </note> + </listitem> </varlistentry> </variablelist> *************** *** 252,259 **** <title>Text Format</title> <para> ! When <command>COPY</command> is used without the <literal>BINARY</literal> option, ! the data read or written is a text file with one line per table row. Columns in a row are separated by the delimiter character. The column values themselves are strings generated by the output function, or acceptable to the input function, of each --- 276,285 ---- <title>Text Format</title> <para> ! When <command>COPY</command> is used without the <literal>BINARY</literal> ! option, the data read or written is a text file with one line per table ! row unless <literal>DELIMITER</literal> is longer than one character, ! in which case CSV format is used. Columns in a row are separated by the delimiter character. The column values themselves are strings generated by the output function, or acceptable to the input function, of each *************** *** 380,385 **** --- 406,459 ---- </refsect2> <refsect2> + <title>CSV format</title> + + <para> + This format is used for importing from and exporting to the Comma + Separated Variable (CSV) file format used by many other programs, + such as spreadsheets. Instead of escaping in + <productname>PostgreSQL</productname>'s standard text mode, it produces + and recognises the common CSV escaping mechanism. + </para> + + <para> + The values in each record are separated by the delimiter character, which + is the first character in the <literal>DELIMITER</literal> parameter. + If the value contains the delimiter character, the quote character (the + second character in the <literal>DELIMITER</literal> parameter), or a + carriage return or line feed character, then the whole value is prefixed + and suffixed by the quote character, and any occurrence within the value + of a quote character or the escape character (the third character in the + <literal>DELIMITER</literal> parameter if present, or else just the quote + character itself) is preceded by the escape character. + </para> + + <para> + In the most common cases, simply specifying <literal>DELIMITER ',"'</> + or <literal>DELIMITER ',\''</> should suffice, both for + <command>COPY FROM</> and <command>COPY TO</>. + </para> + + <note> + <para> + CSV mode will both recognise and produce CSV files with quoted values + containing embedded carriage returns and line feeds. Thus the files are + not strictly one line per table row, as TEXT files are. + </para> + </note> + + <note> + <para> + Many programs produce strange and occasionally perverse CSV files, and + the file format is a convention rather than a standard. Thus you might + encounter some files that cannot be imported using this mechanism, and + you might produce files that other programs fail to recognise properly. + </para> + </note> + + </refsect2> + + <refsect2> <title>Binary Format</title> <para> *************** *** 532,537 **** --- 606,620 ---- using the vertical bar (<literal>|</literal>) as the field delimiter: <programlisting> COPY country TO STDOUT WITH DELIMITER '|'; + </programlisting> + </para> + + <para> + To do the same thing but instead produce a standard CSV file, using + <literal>,</> as the delimiter and <literal>"</> as both the quote and + escape characters: + <programlisting> + COPY country TO STDOUT WITH DELIMITER ',"'; </programlisting> </para>
I have reviewed this patch. Basically, CSV is enabled by specifying more than one delimiter character to COPY, e.g. DELIMITER ',"' or DELIMITER ',""'. Is this API good for folks? Prior to 7.2, a multi-character delimiter could be specified, but only the first character was used. 7.2 release notes state: * COPY DELIMITERS string must be exactly one character (Tom) I am a little worried about multibyte too, as you mentioned. It doesn't look like we support multi-byte delimiters. I see in the code: if (strlen(delim) != 1) ereport(ERROR, With this code, if you use multibyte delimiters, you get an error. However, with CVS, I can see cases where someone trying to use multibyte would get a very strange output file rather than an error. It would use byte one of the multibyte for the delimiter, and the second byte for the quote, and maybe a third byte for quoting quotes. We could use pg_mblen(). Maybe we should compute pg_mblen and strlen on the delimiter and throw an error if the they are different lengths. That would prevent multibyte delimiters. Comments? --------------------------------------------------------------------------- Andrew Dunstan wrote: > > Attached is a patch with documentation for the CSV mode of COPY patch > submitted yesterday. > > cheers > > andrew > Index: doc/src/sgml/ref/copy.sgml > =================================================================== > RCS file: /projects/cvsroot/pgsql-server/doc/src/sgml/ref/copy.sgml,v > retrieving revision 1.55 > diff -c -r1.55 copy.sgml > *** doc/src/sgml/ref/copy.sgml 13 Dec 2003 23:59:07 -0000 1.55 > --- doc/src/sgml/ref/copy.sgml 10 Apr 2004 16:22:23 -0000 > *************** > *** 136,142 **** > <para> > Specifies copying the OID for each row. (An error is raised if > <literal>OIDS</literal> is specified for a table that does not > ! have OIDs.) > </para> > </listitem> > </varlistentry> > --- 136,143 ---- > <para> > Specifies copying the OID for each row. (An error is raised if > <literal>OIDS</literal> is specified for a table that does not > ! have OIDs, or if CSV mode is selected by specifying > ! <literal>DELIMITER</literal> longer then one character.) > </para> > </listitem> > </varlistentry> > *************** > *** 145,152 **** > <term><replaceable class="parameter">delimiter</replaceable></term> > <listitem> > <para> > ! The single character that separates columns within each row > ! (line) of the file. The default is a tab character. > </para> > </listitem> > </varlistentry> > --- 146,162 ---- > <term><replaceable class="parameter">delimiter</replaceable></term> > <listitem> > <para> > ! A one- to three-character string containing characters used in importing > ! or exporting Text or Comma Separated Variable (CSV) files. > ! With only one character, Text file format is used. > ! If there are two or three characters, CSV format is used. > ! The first character for either file format is the delimiter character > ! used between values for a single row. > ! The second character is the CSV quote character, > ! and the third character, if present, is the CSV escape character > ! used inside quoted values, and defaults to the quote character. > ! The default for this parameter is a single tab character. (Thus TEXT > ! file mode is the default.) > </para> > </listitem> > </varlistentry> > *************** > *** 156,163 **** > <listitem> > <para> > The string that represents a null value. The default is > ! <literal>\N</literal> (backslash-N). You might prefer an empty > ! string, for example. > </para> > > <note> > --- 166,174 ---- > <listitem> > <para> > The string that represents a null value. The default is > ! <literal>\N</literal> (backslash-N), unless a CSV is being processed, > ! in which case it is an empty string. You might prefer an empty string > ! in any case. > </para> > > <note> > *************** > *** 168,173 **** > --- 179,197 ---- > <command>COPY TO</command>. > </para> > </note> > + > + <note> > + <para> > + If you don't want anything used as null when using > + <command>COPY FROM</command>, you can specify some value that is very > + unlikely to appear in the file, such as <literal>frobnitz</literal> or > + <literal>d5f4074b254c76cd8ae37bf1731f4aed</literal> (which is > + <literal>md5('frobnitz')</literal>). This could be especially useful > + when importing a CSV file into a table with <literal>NOT NULL</> > + columns. > + </para> > + </note> > + > </listitem> > </varlistentry> > </variablelist> > *************** > *** 252,259 **** > <title>Text Format</title> > > <para> > ! When <command>COPY</command> is used without the <literal>BINARY</literal> option, > ! the data read or written is a text file with one line per table row. > Columns in a row are separated by the delimiter character. > The column values themselves are strings generated by the > output function, or acceptable to the input function, of each > --- 276,285 ---- > <title>Text Format</title> > > <para> > ! When <command>COPY</command> is used without the <literal>BINARY</literal> > ! option, the data read or written is a text file with one line per table > ! row unless <literal>DELIMITER</literal> is longer than one character, > ! in which case CSV format is used. > Columns in a row are separated by the delimiter character. > The column values themselves are strings generated by the > output function, or acceptable to the input function, of each > *************** > *** 380,385 **** > --- 406,459 ---- > </refsect2> > > <refsect2> > + <title>CSV format</title> > + > + <para> > + This format is used for importing from and exporting to the Comma > + Separated Variable (CSV) file format used by many other programs, > + such as spreadsheets. Instead of escaping in > + <productname>PostgreSQL</productname>'s standard text mode, it produces > + and recognises the common CSV escaping mechanism. > + </para> > + > + <para> > + The values in each record are separated by the delimiter character, which > + is the first character in the <literal>DELIMITER</literal> parameter. > + If the value contains the delimiter character, the quote character (the > + second character in the <literal>DELIMITER</literal> parameter), or a > + carriage return or line feed character, then the whole value is prefixed > + and suffixed by the quote character, and any occurrence within the value > + of a quote character or the escape character (the third character in the > + <literal>DELIMITER</literal> parameter if present, or else just the quote > + character itself) is preceded by the escape character. > + </para> > + > + <para> > + In the most common cases, simply specifying <literal>DELIMITER ',"'</> > + or <literal>DELIMITER ',\''</> should suffice, both for > + <command>COPY FROM</> and <command>COPY TO</>. > + </para> > + > + <note> > + <para> > + CSV mode will both recognise and produce CSV files with quoted values > + containing embedded carriage returns and line feeds. Thus the files are > + not strictly one line per table row, as TEXT files are. > + </para> > + </note> > + > + <note> > + <para> > + Many programs produce strange and occasionally perverse CSV files, and > + the file format is a convention rather than a standard. Thus you might > + encounter some files that cannot be imported using this mechanism, and > + you might produce files that other programs fail to recognise properly. > + </para> > + </note> > + > + </refsect2> > + > + <refsect2> > <title>Binary Format</title> > > <para> > *************** > *** 532,537 **** > --- 606,620 ---- > using the vertical bar (<literal>|</literal>) as the field delimiter: > <programlisting> > COPY country TO STDOUT WITH DELIMITER '|'; > + </programlisting> > + </para> > + > + <para> > + To do the same thing but instead produce a standard CSV file, using > + <literal>,</> as the delimiter and <literal>"</> as both the quote and > + escape characters: > + <programlisting> > + COPY country TO STDOUT WITH DELIMITER ',"'; > </programlisting> > </para> > > > ---------------------------(end of broadcast)--------------------------- > TIP 8: explain analyze is your friend -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: >I have reviewed this patch. Basically, CSV is enabled by specifying >more than one delimiter character to COPY, e.g. DELIMITER ',"' or >DELIMITER ',""'. Is this API good for folks? > (if not I'll be displeased, as it is what was the general consensus about a month ago) > >Prior to 7.2, a multi-character delimiter could be specified, but only >the first character was used. 7.2 release notes state: > > * COPY DELIMITERS string must be exactly one character (Tom) > >I am a little worried about multibyte too, as you mentioned. > I am not sure that CSVs even make sense in a multibyte world. If not, it would make sense to disable it in such a case. (Is that your only worry?) cheers andrew
Andrew Dunstan wrote: > > > Bruce Momjian wrote: > > >I have reviewed this patch. Basically, CSV is enabled by specifying > >more than one delimiter character to COPY, e.g. DELIMITER ',"' or > >DELIMITER ',""'. Is this API good for folks? > > > > (if not I'll be displeased, as it is what was the general consensus > about a month ago) OK > >Prior to 7.2, a multi-character delimiter could be specified, but only > >the first character was used. 7.2 release notes state: > > > > * COPY DELIMITERS string must be exactly one character (Tom) > > > >I am a little worried about multibyte too, as you mentioned. > > > > I am not sure that CSVs even make sense in a multibyte world. If not, it > would make sense to disable it in such a case. > > (Is that your only worry?) Yes, my worry is that someone will use a multibyte character that the system sees as several bytes and enters CSV mode. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: >>>Prior to 7.2, a multi-character delimiter could be specified, but only >>>the first character was used. 7.2 release notes state: >>> >>> * COPY DELIMITERS string must be exactly one character (Tom) >>> >>>I am a little worried about multibyte too, as you mentioned. >>> >>> >>> >>I am not sure that CSVs even make sense in a multibyte world. If not, it >>would make sense to disable it in such a case. >> >>(Is that your only worry?) >> >> > >Yes, my worry is that someone will use a multibyte character that the >system sees as several bytes and enters CSV mode. > How about if we specify it explicitly, like BINARY, instead of it being implied by the length of DELIMITER? COPY a FROM stdin CSV DELIMITER ',"'; That would make the patch somewhat more extensive, but maybe not hugely more invasive (I tried to keep it as uninvasive as possible). I could do that, I think. cheers andrew
Andrew Dunstan wrote: > >>I am not sure that CSVs even make sense in a multibyte world. If not, it > >>would make sense to disable it in such a case. > >> > >>(Is that your only worry?) > >> > >> > > > >Yes, my worry is that someone will use a multibyte character that the > >system sees as several bytes and enters CSV mode. > > > > > How about if we specify it explicitly, like BINARY, instead of it being > implied by the length of DELIMITER? > > COPY a FROM stdin CSV DELIMITER ',"'; > > That would make the patch somewhat more extensive, but maybe not hugely > more invasive (I tried to keep it as uninvasive as possible). I could do > that, I think. That's what I was wondering. Is triggering CSV for multi-character delimiters a little too clever? This reminds me of the use of LIMIT X,Y with no indication which is limit and which is offset. We certainly could code to prevent the multibyte problem I mentioned, but should we? I am thinking just: > COPY a FROM stdin WITH CSV ',"'; or > COPY a FROM stdin WITH DELIMITER "," QUOTE '"' EQUOTE '"'; EQUOTE for embedded quote. These are used in very limited situations and don't have to be reserved words or anything. I can help with these changes if folks like them. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian said: >> >Yes, my worry is that someone will use a multibyte character that the >> >system sees as several bytes and enters CSV mode. >> > >> >> >> How about if we specify it explicitly, like BINARY, instead of it >> being implied by the length of DELIMITER? >> >> COPY a FROM stdin CSV DELIMITER ',"'; >> >> That would make the patch somewhat more extensive, but maybe not >> hugely more invasive (I tried to keep it as uninvasive as possible). >> I could do that, I think. > > That's what I was wondering. Is triggering CSV for multi-character > delimiters a little too clever? This reminds me of the use of LIMIT > X,Y with no indication which is limit and which is offset. > > We certainly could code to prevent the multibyte problem I mentioned, > but should we? I confess that in my anglocentric world I have remained lamentably ignorant of how MBCS works. Just reading up a little, and looking over some of our code (e.g. the scanner) it looks like the simple solution would be to check that the delimiter was 8-bit clean. (I assume that ASCII is a subset of every MBCS we support - is that correct?) However ... > > I am thinking just: > >> COPY a FROM stdin WITH CSV ',"'; > > or > >> COPY a FROM stdin WITH DELIMITER "," QUOTE '"' EQUOTE '"'; > > EQUOTE for embedded quote. These are used in very limited situations > and don't have to be reserved words or anything. > > I can help with these changes if folks like them. > I prefer either the first, because it ensures things are specified together. If you want to do that I will work on some regression tests. cheers andrew
Bruce Momjian <pgman@candle.pha.pa.us> writes: > That's what I was wondering. Is triggering CSV for multi-character > delimiters a little too clever? This reminds me of the use of LIMIT X,Y > with no indication which is limit and which is offset. I agree, this seems risky and not at all readable to someone who doesn't remember exactly how the parameter is defined. >> COPY a FROM stdin WITH DELIMITER "," QUOTE '"' EQUOTE '"'; > EQUOTE for embedded quote. ESCAPE would be better no? It's already a keyword ... BTW, don't forget that the syntax options have to be provided in psql's \copy as well. Did the patch cover that? regards, tom lane
Andrew Dunstan wrote: > > I am thinking just: > > > >> COPY a FROM stdin WITH CSV ',"'; > > > > or > > > >> COPY a FROM stdin WITH DELIMITER "," QUOTE '"' EQUOTE '"'; > > > > EQUOTE for embedded quote. These are used in very limited situations > > and don't have to be reserved words or anything. > > > > I can help with these changes if folks like them. > > > > > I prefer either the first, because it ensures things are specified > together. > > If you want to do that I will work on some regression tests. ( Jump to the bottom for my final solution.) I thought about this today. I am thinking of: > COPY a FROM stdin WITH CSV or > COPY a FROM stdin WITH CSV '""' DELIMITER ',' In other words, the string after CSV is optional. However, looking at the COPY syntax, there isn't any case where we have an optional string after a keyword. Is that OK? In this case, CVS is a mode that makes the delimiter ',' and the quote and quote escape '"'. This way, CVS doesn't require any special quote if you want the default. However, this still has CSV using a two-character string with special meaning for the first and second characters. What if we call it QUOTE mode: > COPY a FROM stdin WITH QUOTE that enables CVS with comma delimiters and '"' for quote escape, so the above would be equavalent to: > COPY a FROM stdin WITH QUOTE '"' ESCAPE '"' DELIMITER ',' (I have used ESCAPE because Tom suggested it and it is already a keyword.) I am a little worried that ESCAPE only has meaning with QUOTE, and QUOTE sets defaults for all the others. This makes the syntax addition pretty confusing for users. --------------------------------------------------------------------------- Thinking further, maybe we need to add CSV, QUOTE, and ESCAPE to COPY. QUOTE and ESCAPE are only available in CVS mode, so you can say: > COPY a FROM stdin WITH CSV or > COPY a FROM stdin WITH CSV ESCAPE '\\' This means that there is no optional string for keywords. Here is the line at the bottom we have to add to the COPY syntax. COPY tablename [ ( column [, ...] ) ] TO { 'filename' | STDOUT } [ [ WITH ] [ BINARY ] [ OIDS ] [ DELIMITER [ AS ] 'delimiter' ] [ NULL [ AS ] 'null string' ] ] [ CSV [ QUOTE 'quote' ] [ ESCAPE 'escape' ] ] DELIMITER default to tab, except in CSV mode, where it is a comma. That sounds very clear to me. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > That's what I was wondering. Is triggering CSV for multi-character > > delimiters a little too clever? This reminds me of the use of LIMIT X,Y > > with no indication which is limit and which is offset. > > I agree, this seems risky and not at all readable to someone who doesn't > remember exactly how the parameter is defined. Yep. > >> COPY a FROM stdin WITH DELIMITER "," QUOTE '"' EQUOTE '"'; > > EQUOTE for embedded quote. > > ESCAPE would be better no? It's already a keyword ... Yep. > BTW, don't forget that the syntax options have to be provided in psql's > \copy as well. Did the patch cover that? Oh, yea. That needs to be done too. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > In other words, the string after CSV is optional. However, looking at > the COPY syntax, there isn't any case where we have an optional string > after a keyword. Is that OK? Seems better to avoid it. > However, this still has CSV using a two-character string with special > meaning for the first and second characters. One point that I don't think was made before is that if we do any such thing, we'll be forever foreclosing any chance of allowing multi-character delimiters. ISTM that would not be forward-looking. regards, tom lane
Tom Lane said: >>> COPY a FROM stdin WITH DELIMITER "," QUOTE '"' EQUOTE '"'; >> EQUOTE for embedded quote. > > ESCAPE would be better no? It's already a keyword ... much better. > > BTW, don't forget that the syntax options have to be provided in psql's > \copy as well. Did the patch cover that? > No, because it didn't make a syntax change. I admit I didn't test that though. cheers andrew
Bruce Momjian said: > > COPY tablename [ ( column [, ...] ) ] > TO { 'filename' | STDOUT } > [ [ WITH ] > [ BINARY ] > [ OIDS ] > [ DELIMITER [ AS ] 'delimiter' ] > [ NULL [ AS ] 'null string' ] ] > > [ CSV [ QUOTE 'quote' ] [ ESCAPE 'escape' ] ] > > DELIMITER default to tab, except in CSV mode, where it is a comma. > > That sounds very clear to me. > TIMTOWTDI, as we say in the perl world, and we could debate it endlessly. This looks fine to me. a few points: . ESCAPE should default to same-as-quote . in CSV mode, NULL should default to '' - that was in what I sent in. . I assume that for now we will enforce the one byte rule for QUOTE and ESCAPE as well as for DELIMITER. cheers andrew
Andrew Dunstan wrote: > Bruce Momjian said: > > > > COPY tablename [ ( column [, ...] ) ] > > TO { 'filename' | STDOUT } > > [ [ WITH ] > > [ BINARY ] > > [ OIDS ] > > [ DELIMITER [ AS ] 'delimiter' ] > > [ NULL [ AS ] 'null string' ] ] > > > > [ CSV [ QUOTE 'quote' ] [ ESCAPE 'escape' ] ] > > > > DELIMITER default to tab, except in CSV mode, where it is a comma. > > > > That sounds very clear to me. > > > > > TIMTOWTDI, as we say in the perl world, and we could debate it endlessly. > This looks fine to me. > > a few points: > > . ESCAPE should default to same-as-quote > . in CSV mode, NULL should default to '' - that was in what I sent in. > . I assume that for now we will enforce the one byte rule for QUOTE and > ESCAPE as well as for DELIMITER. Yes, agreed. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian said: > Andrew Dunstan wrote: >> Bruce Momjian said: >> > >> > COPY tablename [ ( column [, ...] ) ] >> > TO { 'filename' | STDOUT } >> > [ [ WITH ] >> > [ BINARY ] >> > [ OIDS ] >> > [ DELIMITER [ AS ] 'delimiter' ] >> > [ NULL [ AS ] 'null string' ] ] >> > >> > [ CSV [ QUOTE 'quote' ] [ ESCAPE 'escape' ] ] >> > >> > DELIMITER default to tab, except in CSV mode, where it is a comma. >> > >> > That sounds very clear to me. >> > >> >> >> TIMTOWTDI, as we say in the perl world, and we could debate it >> endlessly. This looks fine to me. >> >> a few points: >> >> . ESCAPE should default to same-as-quote >> . in CSV mode, NULL should default to '' - that was in what I sent in. >> . I assume that for now we will enforce the one byte rule for QUOTE >> and ESCAPE as well as for DELIMITER. > > Andrew, would you like me to take your patch, implement the changes > discussed, add \copy, and post the result for your testing? > That would be great. Thanks. I don't think the changes to what I did with copy.c will be huge. BTW, how to do this is asked at least once a day on the #postgresql channel on freenode - a lot of people *really* want this. cheers andrew
On Mon, Apr 12, 2004 at 02:26:14 -0400, Andrew Dunstan <andrew@dunslane.net> wrote: > > a few points: > > . in CSV mode, NULL should default to '' - that was in what I sent in. Postgres normally treats an empty string as an empty string. Are you sure you really want it to be treated as a NULL by default in this one place?
Bruno Wolff III said: > On Mon, Apr 12, 2004 at 02:26:14 -0400, > Andrew Dunstan <andrew@dunslane.net> wrote: >> >> a few points: >> >> . in CSV mode, NULL should default to '' - that was in what I sent in. > > Postgres normally treats an empty string as an empty string. Are you > sure you really want it to be treated as a NULL by default in this one > place? > Yes ;-) Otherwise, what will happen when we try to import into some non-text field for which '' is not a valid value? Spreadsheets commonly represent missing values as empty strings when reading/writing CSVs - that's why this behaviour should be the default. If you want to force it to use an empty string instead, simply specify some unlikely value for NULL, like 'frobnitz'. But if you do, be prepared for lots of errors unless you are importing into fields where empty string is a valid text value. Of course, a NOT NULL constraint will also break things, unless we went to the MySQL method of handling insertion of NULL into a NOT NULL field ... no I really am kidding. cheers andrew
On Mon, Apr 12, 2004 at 08:07:12 -0400, Andrew Dunstan <andrew@dunslane.net> wrote: > > Otherwise, what will happen when we try to import into some non-text field > for which '' is not a valid value? I would expect the copy to fail as it does normally. > Spreadsheets commonly represent missing values as empty strings when > reading/writing CSVs - that's why this behaviour should be the default. That doesn't strike me as an overwhelming reason to do it that way. > If you want to force it to use an empty string instead, simply specify > some unlikely value for NULL, like 'frobnitz'. But if you do, be prepared > for lots of errors unless you are importing into fields where empty string > is a valid text value. This concerns me. If NULL is on by default, there should be some way to turn it off, not to change it to some value you hope won't show up in the input stream.
Bruno Wolff III <bruno@wolff.to> writes: > On Mon, Apr 12, 2004 at 02:26:14 -0400, > Andrew Dunstan <andrew@dunslane.net> wrote: >> a few points: >> . in CSV mode, NULL should default to '' - that was in what I sent in. > Postgres normally treats an empty string as an empty string. Are you sure > you really want it to be treated as a NULL by default in this one place? I think that's a spectacularly bad idea too. People who really want that can write "NULL ''", but it shouldn't be implied by CSV mode. regards, tom lane
Tom Lane wrote: >Bruno Wolff III <bruno@wolff.to> writes: > > >>On Mon, Apr 12, 2004 at 02:26:14 -0400, >> Andrew Dunstan <andrew@dunslane.net> wrote: >> >> >>>a few points: >>>. in CSV mode, NULL should default to '' - that was in what I sent in. >>> >>> > > > >>Postgres normally treats an empty string as an empty string. Are you sure >>you really want it to be treated as a NULL by default in this one place? >> >> > >I think that's a spectacularly bad idea too. People who really want >that can write "NULL ''", but it shouldn't be implied by CSV mode. > > > Spectacularly? Hmm. My approach was that the default should be the most common case. Perhaps on import it's a tossup, but on export a CSV containing lots of \N cells is likely to be ... unexpected. But, honestly, it's not worth dying in a ditch over. cheers andrew
Andrew Dunstan wrote: > Tom Lane wrote: > > >Bruno Wolff III <bruno@wolff.to> writes: > > > > > >>On Mon, Apr 12, 2004 at 02:26:14 -0400, > >> Andrew Dunstan <andrew@dunslane.net> wrote: > >> > >> > >>>a few points: > >>>. in CSV mode, NULL should default to '' - that was in what I sent in. > >>> > >>> > > > > > > > >>Postgres normally treats an empty string as an empty string. Are you sure > >>you really want it to be treated as a NULL by default in this one place? > >> > >> > > > >I think that's a spectacularly bad idea too. People who really want > >that can write "NULL ''", but it shouldn't be implied by CSV mode. > > > > > > > > Spectacularly? Hmm. > > My approach was that the default should be the most common case. Perhaps > on import it's a tossup, but on export a CSV containing lots of \N cells > is likely to be ... unexpected. > > But, honestly, it's not worth dying in a ditch over. It is my understanding that \N is a valid column value (no backslash escape in CSV, right?), so we can't use it for NULL. The only thing I can think of is for NULL to be: ,, (no quotes) and a zero-length string to be: ,"", How do most applications handle those two cases? If they accept either, can we use that so we can read our own CSV files without losing the NULL specification? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, Apr 12, 2004 at 10:30:22 -0400, Bruce Momjian <pgman@candle.pha.pa.us> wrote: > > It is my understanding that \N is a valid column value (no backslash > escape in CSV, right?), so we can't use it for NULL. > The only thing I can think of is for NULL to be: > > ,, > > (no quotes) and a zero-length string to be: > > ,"", > > How do most applications handle those two cases? If they accept either, > can we use that so we can read our own CSV files without losing the NULL > specification? I think the above are going to be treated as equvialent by most CSV parsers. There doesn't seem to be a standard for CSV. From what I found describing it, there isn't any feature for distinguishing NULLs from empty strings. So whatever gets done is going to be application specific.
Bruno Wolff III wrote: > On Mon, Apr 12, 2004 at 10:30:22 -0400, > Bruce Momjian <pgman@candle.pha.pa.us> wrote: > > > > It is my understanding that \N is a valid column value (no backslash > > escape in CSV, right?), so we can't use it for NULL. > > The only thing I can think of is for NULL to be: > > > > ,, > > > > (no quotes) and a zero-length string to be: > > > > ,"", > > > > How do most applications handle those two cases? If they accept either, > > can we use that so we can read our own CSV files without losing the NULL > > specification? > > I think the above are going to be treated as equvialent by most CSV parsers. > > There doesn't seem to be a standard for CSV. From what I found describing > it, there isn't any feature for distinguishing NULLs from empty strings. > So whatever gets done is going to be application specific. I am thinking we could enable this NULL handling by default, and allow it to be over-ridden with the NULL keyword. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: >It is my understanding that \N is a valid column value (no backslash >escape in CSV, right?), so we can't use it for NULL. > > \ is not conventionally magical in CSVs in the general case. That doesn't mean we couldn't use \N, but to me it would violate the principle of least surprise - no other application will interpret it in any special way, and the whole point about this facility is exchanging data with other applications. >The only thing I can think of is for NULL to be: > > ,, > >(no quotes) and a zero-length string to be: > > ,"", > >How do most applications handle those two cases? If they accept either, >can we use that so we can read our own CSV files without losing the NULL >specification? > > > In fact, in the patch I sent in, no quoted string is marked as null when being read (so even if you use \N as the null marker, "\N" will be that literal and not a null marker). And the null marker, whatever it is, should be made quote safe by us throwing an error if it contains the quote marker, just as we now make sure that the null marker is delimiter-safe. I will check on the write behaviour - it might need ammending too. I'll submit a revised patch based on the original syntax scheme, and then you (Bruce) can make the syntax/psql changes that seem to be agreed on now - is that ok? The default NULL value issue can be determined at the end of any exhaustive debate we have - in the end it's a one line code change ;-) cheers andrew
Andrew Dunstan wrote: > In fact, in the patch I sent in, no quoted string is marked as null when > being read (so even if you use \N as the null marker, "\N" will be that > literal and not a null marker). And the null marker, whatever it is, > should be made quote safe by us throwing an error if it contains the > quote marker, just as we now make sure that the null marker is > delimiter-safe. What value does an int column get if the input file has ',,'. Don't tell me zero? :-) Error? > I will check on the write behaviour - it might need ammending too. > > I'll submit a revised patch based on the original syntax scheme, and > then you (Bruce) can make the syntax/psql changes that seem to be agreed > on now - is that ok? OK, go as far as you want and post it. I will turn around a new patch in a few hours after you post. > The default NULL value issue can be determined at the end of any > exhaustive debate we have - in the end it's a one line code change ;-) Agreed. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: >Andrew Dunstan wrote: > > >>In fact, in the patch I sent in, no quoted string is marked as null when >>being read (so even if you use \N as the null marker, "\N" will be that >>literal and not a null marker). And the null marker, whatever it is, >>should be made quote safe by us throwing an error if it contains the >>quote marker, just as we now make sure that the null marker is >>delimiter-safe. >> >> > >What value does an int column get if the input file has ',,'. Don't >tell me zero? :-) Error? > > If the null marker is not an empty string, it gets an error, of course - if it is it gets a null: [andrew@marmaduke pginst]$ echo ',,' | bin/psql -c "create temp table foo (a int, b text, c text); copy foo from stdin delimiter ',\"' null '\\\\N';" ERROR: invalid input syntax for integer: "" CONTEXT: COPY foo, line 1, column a: "" [andrew@marmaduke pginst]$ echo ',,' | bin/psql -c "create temp table foo (a int, b text, c text); copy foo from stdin delimiter ',\"' ;" [andrew@marmaduke pginst]$ I hope that is expected behaviour - it's what *I* expect, at least. > > >>I will check on the write behaviour - it might need ammending too. >> >>I'll submit a revised patch based on the original syntax scheme, and >>then you (Bruce) can make the syntax/psql changes that seem to be agreed >>on now - is that ok? >> >> > >OK, go as far as you want and post it. I will turn around a new patch >in a few hours after you post. > > > >>The default NULL value issue can be determined at the end of any >>exhaustive debate we have - in the end it's a one line code change ;-) >> >> > >Agreed. > > > Attached patch has these additions to previously posted patch: . quote character may not appear in NULL marker . any non-null value that matches the NULL marker is forced to be quoted when written. cheers andrew Index: src/backend/commands/copy.c =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/backend/commands/copy.c,v retrieving revision 1.219 diff -c -r1.219 copy.c *** src/backend/commands/copy.c 6 Apr 2004 13:21:33 -0000 1.219 --- src/backend/commands/copy.c 12 Apr 2004 16:21:33 -0000 *************** *** 70,76 **** typedef enum CopyReadResult { NORMAL_ATTR, ! END_OF_LINE } CopyReadResult; /* --- 70,77 ---- typedef enum CopyReadResult { NORMAL_ATTR, ! END_OF_LINE, ! UNTERMINATED_FIELD } CopyReadResult; /* *************** *** 136,144 **** --- 137,148 ---- static bool CopyReadLine(void); static char *CopyReadAttribute(const char *delim, const char *null_print, CopyReadResult *result, bool *isnull); + static char *CopyReadAttributeCSV(const char *delim, const char *null_print, + CopyReadResult *result, bool *isnull); static Datum CopyReadBinaryAttribute(int column_no, FmgrInfo *flinfo, Oid typelem, bool *isnull); static void CopyAttributeOut(char *string, char *delim); + static void CopyAttributeOutCSV(char *string, char *delim, bool force_quote); static List *CopyGetAttnums(Relation rel, List *attnamelist); static void limit_printout_length(StringInfo buf); *************** *** 682,687 **** --- 686,692 ---- List *attnumlist; bool binary = false; bool oids = false; + bool csv_mode = false; char *delim = NULL; char *null_print = NULL; Relation rel; *************** *** 744,751 **** if (!delim) delim = "\t"; if (!null_print) ! null_print = "\\N"; /* * Open and lock the relation, using the appropriate lock type. --- 749,759 ---- if (!delim) delim = "\t"; + if (strlen(delim) > 1) + csv_mode = true; + if (!null_print) ! null_print = csv_mode ? "" : "\\N"; /* * Open and lock the relation, using the appropriate lock type. *************** *** 772,783 **** "psql's \\copy command also works for anyone."))); /* ! * Presently, only single-character delimiter strings are supported. */ ! if (strlen(delim) != 1) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("COPY delimiter must be a single character"))); /* * Don't allow the delimiter to appear in the null string. --- 780,806 ---- "psql's \\copy command also works for anyone."))); /* ! * Only single-character delimiter strings are supported, ! * except in CSV mode, where the string must be ! * delimiter-char quote-char [escape-char] */ ! if (!csv_mode && strlen(delim) != 1) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("COPY delimiter must be a single character"))); + else if (csv_mode) + { + if(strlen(delim) > 3) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("COPY delimiters for CSV must be a 2 or 3 characters"))); + if (delim[0] == delim[1] || + (strlen(delim) == 3 && delim[0] == delim[2])) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("CSV delimiter character must not be same as quote character or escape character"))); + + } /* * Don't allow the delimiter to appear in the null string. *************** *** 788,793 **** --- 811,833 ---- errmsg("COPY delimiter must not appear in the NULL specification"))); /* + * Don't allow the csv quote char to appear in the null string. + */ + if (csv_mode && strchr(null_print, delim[1]) != NULL) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("CSV quote character must not appear in the NULL specification"))); + + /* + * Don't allow OIDs in CSV mode + */ + + if (csv_mode && oids) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("Cannot specify OIDS in CSV mode "))); + + /* * Don't allow COPY w/ OIDs to or from a table without them */ if (oids && !rel->rd_rel->relhasoids) *************** *** 969,974 **** --- 1009,1015 ---- FmgrInfo *out_functions; Oid *elements; bool *isvarlena; + bool csv_mode; char *string; Snapshot mySnapshot; List *cur; *************** *** 979,984 **** --- 1020,1026 ---- attr = tupDesc->attrs; num_phys_attrs = tupDesc->natts; attr_count = length(attnumlist); + csv_mode = (strlen(delim) > 1); /* * Get info about the columns we need to process. *************** *** 1051,1057 **** while ((tuple = heap_getnext(scandesc, ForwardScanDirection)) != NULL) { bool need_delim = false; - CHECK_FOR_INTERRUPTS(); MemoryContextReset(mycontext); --- 1093,1098 ---- *************** *** 1113,1119 **** value, ObjectIdGetDatum(elements[attnum - 1]), Int32GetDatum(attr[attnum - 1]->atttypmod))); ! CopyAttributeOut(string, delim); } else { --- 1154,1167 ---- value, ObjectIdGetDatum(elements[attnum - 1]), Int32GetDatum(attr[attnum - 1]->atttypmod))); ! if (csv_mode) ! { ! bool force_quote = (strcmp(string,null_print) == 0); ! CopyAttributeOutCSV(string, delim, force_quote); ! } ! else ! CopyAttributeOut(string, delim); ! } else { *************** *** 1263,1268 **** --- 1311,1317 ---- Datum *values; char *nulls; bool done = false; + bool csv_mode; bool isnull; ResultRelInfo *resultRelInfo; EState *estate = CreateExecutorState(); /* for ExecConstraints() */ *************** *** 1280,1285 **** --- 1329,1335 ---- num_phys_attrs = tupDesc->natts; attr_count = length(attnumlist); num_defaults = 0; + csv_mode = (strlen(delim) > 1); /* * We need a ResultRelInfo so we can use the regular executor's *************** *** 1499,1504 **** --- 1549,1555 ---- if (file_has_oids) { + /* can't be in CSV mode here */ string = CopyReadAttribute(delim, null_print, &result, &isnull); *************** *** 1537,1544 **** errmsg("missing data for column \"%s\"", NameStr(attr[m]->attname)))); ! string = CopyReadAttribute(delim, null_print, ! &result, &isnull); if (isnull) { --- 1588,1608 ---- errmsg("missing data for column \"%s\"", NameStr(attr[m]->attname)))); ! if (csv_mode) ! { ! string = CopyReadAttributeCSV(delim, null_print, ! &result, &isnull); ! if (result == UNTERMINATED_FIELD) ! ereport(ERROR, ! (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), ! errmsg("unterminated CSV quoted field"))); ! } ! else ! { ! string = CopyReadAttribute(delim, null_print, ! &result, &isnull); ! } ! if (isnull) { *************** *** 2069,2074 **** --- 2133,2288 ---- return attribute_buf.data; } + + /* + * Read the value of a single attribute in CSV mode, + * performing de-escaping as needed. Escaping does not follow the normal + * PostgreSQL text mode, but instead "standard" (i.e. common) CSV usage. + * + * Quoted fields can span lines, in which case the line end is embedded + * in the returned string. + * + * delim is a 2- or 3-character string. The first character is the + * field delimiter, the second the quote character, the third is the + * escape character indise quotes, and defaults to the quote character. + * + * null_print is the null marker string. Note that this is compared to + * the pre-de-escaped input string (thus if it is quoted it is not a NULL). + * + * *result is set to indicate what terminated the read: + * NORMAL_ATTR: column delimiter + * END_OF_LINE: end of line + * UNTERMINATED_FIELD no quote detected at end of a quoted field + * + * In any case, the string read up to the terminator (or end of file) + * is returned. + * + * *isnull is set true or false depending on whether the input matched + * the null marker. Note that the caller cannot check this since the + * returned string will be the post-de-escaping equivalent, which may + * look the same as some valid data string. + *---------- + */ + + static char * + CopyReadAttributeCSV(const char *delim, const char *null_print, + CopyReadResult *result, bool *isnull) + { + char delimc = delim[0]; + char quotec = delim[1]; + char escapec = delim[2] ? delim[2] : delim[1]; + char c; + int start_cursor = line_buf.cursor; + int end_cursor = start_cursor;; + int input_len; + bool in_quote = false; + bool saw_quote = false; + + /* reset attribute_buf to empty */ + attribute_buf.len = 0; + attribute_buf.data[0] = '\0'; + + /* set default status */ + *result = END_OF_LINE; + + for (;;) + { + /* handle multiline quoted fields */ + if (in_quote && line_buf.cursor >= line_buf.len) + { + bool done; + + switch(eol_type) + { + case EOL_NL: + appendStringInfoString(&attribute_buf,"\n"); + break; + case EOL_CR: + appendStringInfoString(&attribute_buf,"\r"); + break; + case EOL_CRNL: + appendStringInfoString(&attribute_buf,"\r\n"); + break; + case EOL_UNKNOWN: + /* shouldn't happen - just keep going */ + break; + } + + copy_lineno++; + done = CopyReadLine(); + if (done && line_buf.len == 0) + break; + start_cursor = line_buf.cursor; + } + + end_cursor = line_buf.cursor; + if (line_buf.cursor >= line_buf.len) + break; + c = line_buf.data[line_buf.cursor++]; + /* + * unquoted field delimiter + */ + if (!in_quote && c == delimc) + { + *result = NORMAL_ATTR; + break; + } + /* + * start of quoted field (or part of field) + */ + if (!in_quote && c == quotec) + { + saw_quote = true; + in_quote = true; + continue; + } + /* + * escape within a quoted field + */ + if (in_quote && c == escapec) + { + /* + * peek at the next char if available, and escape it if it + * is an escape char or a quote char + */ + if (line_buf.cursor <= line_buf.len) + { + char nextc = line_buf.data[line_buf.cursor]; + if (nextc == escapec || nextc == quotec) + { + appendStringInfoCharMacro(&attribute_buf, nextc); + line_buf.cursor++; + continue; + } + } + } + /* + * end of quoted field. + * Must do this test after testing for escape in case quote char + * and escape char are the same (which is the common case). + */ + if(in_quote && c == quotec) + { + in_quote = false; + continue; + } + appendStringInfoCharMacro(&attribute_buf, c); + } + + if (in_quote) + *result = UNTERMINATED_FIELD; + + /* check whether raw input matched null marker */ + input_len = end_cursor - start_cursor; + if (!saw_quote && input_len == strlen(null_print) && + strncmp(&line_buf.data[start_cursor], null_print, input_len) == 0) + *isnull = true; + else + *isnull = false; + + return attribute_buf.data; + } + /* * Read a binary attribute */ *************** *** 2192,2197 **** --- 2406,2479 ---- break; } } + } + + /* + * Send CSV representation of one attribute, with conversion and + * CSV type escaping + */ + static void + CopyAttributeOutCSV(char *server_string, char *delim, bool force_quote) + { + char *string; + char c; + char delimc = delim[0]; + char quotec = delim[1]; + char escapec = delim[2] ? delim[2] : delim[1]; + bool need_quote = force_quote; + char *test_string; + bool same_encoding; + int mblen; + int i; + + same_encoding = (server_encoding == client_encoding); + if (!same_encoding) + string = (char *) pg_server_to_client((unsigned char *) server_string, + strlen(server_string)); + else + string = server_string; + + /* have to run through the string twice, + * first time to see if it needs quoting, second to actually send it + */ + + for(test_string = string; + !need_quote && (c = *test_string) != '\0'; + test_string += mblen) + { + if (c == delimc || c == quotec || c == '\n' || c == '\r') + { + need_quote = true; + } + if (!same_encoding) + mblen = pg_encoding_mblen(client_encoding, test_string); + else + mblen = 1; + } + + if (need_quote) + CopySendChar(quotec); + + for (; (c = *string) != '\0'; string += mblen) + { + if (c == quotec || c == escapec) + CopySendChar(escapec); + + CopySendChar(c); + + if (!same_encoding) + { + /* send additional bytes of the char, if any */ + mblen = pg_encoding_mblen(client_encoding, string); + for (i = 1; i < mblen; i++) + CopySendChar(string[i]); + } + else + mblen = 1; + } + + if (need_quote) + CopySendChar(quotec); } /*
Andrew Dunstan wrote: > Bruce Momjian wrote: > > >Andrew Dunstan wrote: > > > > > >>In fact, in the patch I sent in, no quoted string is marked as null when > >>being read (so even if you use \N as the null marker, "\N" will be that > >>literal and not a null marker). And the null marker, whatever it is, > >>should be made quote safe by us throwing an error if it contains the > >>quote marker, just as we now make sure that the null marker is > >>delimiter-safe. > >> > >> > > > >What value does an int column get if the input file has ',,'. Don't > >tell me zero? :-) Error? > > > > > > If the null marker is not an empty string, it gets an error, of course - > if it is it gets a null: > > [andrew@marmaduke pginst]$ echo ',,' | bin/psql -c "create temp table > foo (a int, b text, c text); copy foo from stdin delimiter ',\"' null > '\\\\N';" > ERROR: invalid input syntax for integer: "" > CONTEXT: COPY foo, line 1, column a: "" > [andrew@marmaduke pginst]$ echo ',,' | bin/psql -c "create temp table > foo (a int, b text, c text); copy foo from stdin delimiter ',\"' ;" > [andrew@marmaduke pginst]$ > > > I hope that is expected behaviour - it's what *I* expect, at least. > Yea, sorry, I misread your "no quoted string is marked as null" as not handling ,, as null. > Attached patch has these additions to previously posted patch: > . quote character may not appear in NULL marker > . any non-null value that matches the NULL marker is forced to be quoted > when written. OK, let me work on this now and repost. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
FYI, this CVS is turning into quite a job, but doing it right takes this kind of effort. --------------------------------------------------------------------------- Andrew Dunstan wrote: > Bruce Momjian wrote: > > >Andrew Dunstan wrote: > > > > > >>In fact, in the patch I sent in, no quoted string is marked as null when > >>being read (so even if you use \N as the null marker, "\N" will be that > >>literal and not a null marker). And the null marker, whatever it is, > >>should be made quote safe by us throwing an error if it contains the > >>quote marker, just as we now make sure that the null marker is > >>delimiter-safe. > >> > >> > > > >What value does an int column get if the input file has ',,'. Don't > >tell me zero? :-) Error? > > > > > > If the null marker is not an empty string, it gets an error, of course - > if it is it gets a null: > > [andrew@marmaduke pginst]$ echo ',,' | bin/psql -c "create temp table > foo (a int, b text, c text); copy foo from stdin delimiter ',\"' null > '\\\\N';" > ERROR: invalid input syntax for integer: "" > CONTEXT: COPY foo, line 1, column a: "" > [andrew@marmaduke pginst]$ echo ',,' | bin/psql -c "create temp table > foo (a int, b text, c text); copy foo from stdin delimiter ',\"' ;" > [andrew@marmaduke pginst]$ > > > I hope that is expected behaviour - it's what *I* expect, at least. > > > > > > > >>I will check on the write behaviour - it might need ammending too. > >> > >>I'll submit a revised patch based on the original syntax scheme, and > >>then you (Bruce) can make the syntax/psql changes that seem to be agreed > >>on now - is that ok? > >> > >> > > > >OK, go as far as you want and post it. I will turn around a new patch > >in a few hours after you post. > > > > > > > >>The default NULL value issue can be determined at the end of any > >>exhaustive debate we have - in the end it's a one line code change ;-) > >> > >> > > > >Agreed. > > > > > > > > Attached patch has these additions to previously posted patch: > . quote character may not appear in NULL marker > . any non-null value that matches the NULL marker is forced to be quoted > when written. > > > cheers > > andrew > > Index: src/backend/commands/copy.c > =================================================================== > RCS file: /projects/cvsroot/pgsql-server/src/backend/commands/copy.c,v > retrieving revision 1.219 > diff -c -r1.219 copy.c > *** src/backend/commands/copy.c 6 Apr 2004 13:21:33 -0000 1.219 > --- src/backend/commands/copy.c 12 Apr 2004 16:21:33 -0000 > *************** > *** 70,76 **** > typedef enum CopyReadResult > { > NORMAL_ATTR, > ! END_OF_LINE > } CopyReadResult; > > /* > --- 70,77 ---- > typedef enum CopyReadResult > { > NORMAL_ATTR, > ! END_OF_LINE, > ! UNTERMINATED_FIELD > } CopyReadResult; > > /* > *************** > *** 136,144 **** > --- 137,148 ---- > static bool CopyReadLine(void); > static char *CopyReadAttribute(const char *delim, const char *null_print, > CopyReadResult *result, bool *isnull); > + static char *CopyReadAttributeCSV(const char *delim, const char *null_print, > + CopyReadResult *result, bool *isnull); > static Datum CopyReadBinaryAttribute(int column_no, FmgrInfo *flinfo, > Oid typelem, bool *isnull); > static void CopyAttributeOut(char *string, char *delim); > + static void CopyAttributeOutCSV(char *string, char *delim, bool force_quote); > static List *CopyGetAttnums(Relation rel, List *attnamelist); > static void limit_printout_length(StringInfo buf); > > *************** > *** 682,687 **** > --- 686,692 ---- > List *attnumlist; > bool binary = false; > bool oids = false; > + bool csv_mode = false; > char *delim = NULL; > char *null_print = NULL; > Relation rel; > *************** > *** 744,751 **** > if (!delim) > delim = "\t"; > > if (!null_print) > ! null_print = "\\N"; > > /* > * Open and lock the relation, using the appropriate lock type. > --- 749,759 ---- > if (!delim) > delim = "\t"; > > + if (strlen(delim) > 1) > + csv_mode = true; > + > if (!null_print) > ! null_print = csv_mode ? "" : "\\N"; > > /* > * Open and lock the relation, using the appropriate lock type. > *************** > *** 772,783 **** > "psql's \\copy command also works for anyone."))); > > /* > ! * Presently, only single-character delimiter strings are supported. > */ > ! if (strlen(delim) != 1) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("COPY delimiter must be a single character"))); > > /* > * Don't allow the delimiter to appear in the null string. > --- 780,806 ---- > "psql's \\copy command also works for anyone."))); > > /* > ! * Only single-character delimiter strings are supported, > ! * except in CSV mode, where the string must be > ! * delimiter-char quote-char [escape-char] > */ > ! if (!csv_mode && strlen(delim) != 1) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("COPY delimiter must be a single character"))); > + else if (csv_mode) > + { > + if(strlen(delim) > 3) > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("COPY delimiters for CSV must be a 2 or 3 characters"))); > + if (delim[0] == delim[1] || > + (strlen(delim) == 3 && delim[0] == delim[2])) > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("CSV delimiter character must not be same as quote character or escape character"))); > + > + } > > /* > * Don't allow the delimiter to appear in the null string. > *************** > *** 788,793 **** > --- 811,833 ---- > errmsg("COPY delimiter must not appear in the NULL specification"))); > > /* > + * Don't allow the csv quote char to appear in the null string. > + */ > + if (csv_mode && strchr(null_print, delim[1]) != NULL) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("CSV quote character must not appear in the NULL specification"))); > + > + /* > + * Don't allow OIDs in CSV mode > + */ > + > + if (csv_mode && oids) > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("Cannot specify OIDS in CSV mode "))); > + > + /* > * Don't allow COPY w/ OIDs to or from a table without them > */ > if (oids && !rel->rd_rel->relhasoids) > *************** > *** 969,974 **** > --- 1009,1015 ---- > FmgrInfo *out_functions; > Oid *elements; > bool *isvarlena; > + bool csv_mode; > char *string; > Snapshot mySnapshot; > List *cur; > *************** > *** 979,984 **** > --- 1020,1026 ---- > attr = tupDesc->attrs; > num_phys_attrs = tupDesc->natts; > attr_count = length(attnumlist); > + csv_mode = (strlen(delim) > 1); > > /* > * Get info about the columns we need to process. > *************** > *** 1051,1057 **** > while ((tuple = heap_getnext(scandesc, ForwardScanDirection)) != NULL) > { > bool need_delim = false; > - > CHECK_FOR_INTERRUPTS(); > > MemoryContextReset(mycontext); > --- 1093,1098 ---- > *************** > *** 1113,1119 **** > value, > ObjectIdGetDatum(elements[attnum - 1]), > Int32GetDatum(attr[attnum - 1]->atttypmod))); > ! CopyAttributeOut(string, delim); > } > else > { > --- 1154,1167 ---- > value, > ObjectIdGetDatum(elements[attnum - 1]), > Int32GetDatum(attr[attnum - 1]->atttypmod))); > ! if (csv_mode) > ! { > ! bool force_quote = (strcmp(string,null_print) == 0); > ! CopyAttributeOutCSV(string, delim, force_quote); > ! } > ! else > ! CopyAttributeOut(string, delim); > ! > } > else > { > *************** > *** 1263,1268 **** > --- 1311,1317 ---- > Datum *values; > char *nulls; > bool done = false; > + bool csv_mode; > bool isnull; > ResultRelInfo *resultRelInfo; > EState *estate = CreateExecutorState(); /* for ExecConstraints() */ > *************** > *** 1280,1285 **** > --- 1329,1335 ---- > num_phys_attrs = tupDesc->natts; > attr_count = length(attnumlist); > num_defaults = 0; > + csv_mode = (strlen(delim) > 1); > > /* > * We need a ResultRelInfo so we can use the regular executor's > *************** > *** 1499,1504 **** > --- 1549,1555 ---- > > if (file_has_oids) > { > + /* can't be in CSV mode here */ > string = CopyReadAttribute(delim, null_print, > &result, &isnull); > > *************** > *** 1537,1544 **** > errmsg("missing data for column \"%s\"", > NameStr(attr[m]->attname)))); > > ! string = CopyReadAttribute(delim, null_print, > ! &result, &isnull); > > if (isnull) > { > --- 1588,1608 ---- > errmsg("missing data for column \"%s\"", > NameStr(attr[m]->attname)))); > > ! if (csv_mode) > ! { > ! string = CopyReadAttributeCSV(delim, null_print, > ! &result, &isnull); > ! if (result == UNTERMINATED_FIELD) > ! ereport(ERROR, > ! (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > ! errmsg("unterminated CSV quoted field"))); > ! } > ! else > ! { > ! string = CopyReadAttribute(delim, null_print, > ! &result, &isnull); > ! } > ! > > if (isnull) > { > *************** > *** 2069,2074 **** > --- 2133,2288 ---- > return attribute_buf.data; > } > > + > + /* > + * Read the value of a single attribute in CSV mode, > + * performing de-escaping as needed. Escaping does not follow the normal > + * PostgreSQL text mode, but instead "standard" (i.e. common) CSV usage. > + * > + * Quoted fields can span lines, in which case the line end is embedded > + * in the returned string. > + * > + * delim is a 2- or 3-character string. The first character is the > + * field delimiter, the second the quote character, the third is the > + * escape character indise quotes, and defaults to the quote character. > + * > + * null_print is the null marker string. Note that this is compared to > + * the pre-de-escaped input string (thus if it is quoted it is not a NULL). > + * > + * *result is set to indicate what terminated the read: > + * NORMAL_ATTR: column delimiter > + * END_OF_LINE: end of line > + * UNTERMINATED_FIELD no quote detected at end of a quoted field > + * > + * In any case, the string read up to the terminator (or end of file) > + * is returned. > + * > + * *isnull is set true or false depending on whether the input matched > + * the null marker. Note that the caller cannot check this since the > + * returned string will be the post-de-escaping equivalent, which may > + * look the same as some valid data string. > + *---------- > + */ > + > + static char * > + CopyReadAttributeCSV(const char *delim, const char *null_print, > + CopyReadResult *result, bool *isnull) > + { > + char delimc = delim[0]; > + char quotec = delim[1]; > + char escapec = delim[2] ? delim[2] : delim[1]; > + char c; > + int start_cursor = line_buf.cursor; > + int end_cursor = start_cursor;; > + int input_len; > + bool in_quote = false; > + bool saw_quote = false; > + > + /* reset attribute_buf to empty */ > + attribute_buf.len = 0; > + attribute_buf.data[0] = '\0'; > + > + /* set default status */ > + *result = END_OF_LINE; > + > + for (;;) > + { > + /* handle multiline quoted fields */ > + if (in_quote && line_buf.cursor >= line_buf.len) > + { > + bool done; > + > + switch(eol_type) > + { > + case EOL_NL: > + appendStringInfoString(&attribute_buf,"\n"); > + break; > + case EOL_CR: > + appendStringInfoString(&attribute_buf,"\r"); > + break; > + case EOL_CRNL: > + appendStringInfoString(&attribute_buf,"\r\n"); > + break; > + case EOL_UNKNOWN: > + /* shouldn't happen - just keep going */ > + break; > + } > + > + copy_lineno++; > + done = CopyReadLine(); > + if (done && line_buf.len == 0) > + break; > + start_cursor = line_buf.cursor; > + } > + > + end_cursor = line_buf.cursor; > + if (line_buf.cursor >= line_buf.len) > + break; > + c = line_buf.data[line_buf.cursor++]; > + /* > + * unquoted field delimiter > + */ > + if (!in_quote && c == delimc) > + { > + *result = NORMAL_ATTR; > + break; > + } > + /* > + * start of quoted field (or part of field) > + */ > + if (!in_quote && c == quotec) > + { > + saw_quote = true; > + in_quote = true; > + continue; > + } > + /* > + * escape within a quoted field > + */ > + if (in_quote && c == escapec) > + { > + /* > + * peek at the next char if available, and escape it if it > + * is an escape char or a quote char > + */ > + if (line_buf.cursor <= line_buf.len) > + { > + char nextc = line_buf.data[line_buf.cursor]; > + if (nextc == escapec || nextc == quotec) > + { > + appendStringInfoCharMacro(&attribute_buf, nextc); > + line_buf.cursor++; > + continue; > + } > + } > + } > + /* > + * end of quoted field. > + * Must do this test after testing for escape in case quote char > + * and escape char are the same (which is the common case). > + */ > + if(in_quote && c == quotec) > + { > + in_quote = false; > + continue; > + } > + appendStringInfoCharMacro(&attribute_buf, c); > + } > + > + if (in_quote) > + *result = UNTERMINATED_FIELD; > + > + /* check whether raw input matched null marker */ > + input_len = end_cursor - start_cursor; > + if (!saw_quote && input_len == strlen(null_print) && > + strncmp(&line_buf.data[start_cursor], null_print, input_len) == 0) > + *isnull = true; > + else > + *isnull = false; > + > + return attribute_buf.data; > + } > + > /* > * Read a binary attribute > */ > *************** > *** 2192,2197 **** > --- 2406,2479 ---- > break; > } > } > + } > + > + /* > + * Send CSV representation of one attribute, with conversion and > + * CSV type escaping > + */ > + static void > + CopyAttributeOutCSV(char *server_string, char *delim, bool force_quote) > + { > + char *string; > + char c; > + char delimc = delim[0]; > + char quotec = delim[1]; > + char escapec = delim[2] ? delim[2] : delim[1]; > + bool need_quote = force_quote; > + char *test_string; > + bool same_encoding; > + int mblen; > + int i; > + > + same_encoding = (server_encoding == client_encoding); > + if (!same_encoding) > + string = (char *) pg_server_to_client((unsigned char *) server_string, > + strlen(server_string)); > + else > + string = server_string; > + > + /* have to run through the string twice, > + * first time to see if it needs quoting, second to actually send it > + */ > + > + for(test_string = string; > + !need_quote && (c = *test_string) != '\0'; > + test_string += mblen) > + { > + if (c == delimc || c == quotec || c == '\n' || c == '\r') > + { > + need_quote = true; > + } > + if (!same_encoding) > + mblen = pg_encoding_mblen(client_encoding, test_string); > + else > + mblen = 1; > + } > + > + if (need_quote) > + CopySendChar(quotec); > + > + for (; (c = *string) != '\0'; string += mblen) > + { > + if (c == quotec || c == escapec) > + CopySendChar(escapec); > + > + CopySendChar(c); > + > + if (!same_encoding) > + { > + /* send additional bytes of the char, if any */ > + mblen = pg_encoding_mblen(client_encoding, string); > + for (i = 1; i < mblen; i++) > + CopySendChar(string[i]); > + } > + else > + mblen = 1; > + } > + > + if (need_quote) > + CopySendChar(quotec); > } > > /* > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Andrew Dunstan wrote: > If the null marker is not an empty string, it gets an error, of > course - if it is it gets a null: > > [andrew@marmaduke pginst]$ echo ',,' | bin/psql -c "create temp > table foo (a int, b text, c text); copy foo from stdin delimiter > ',\"' null '\\\\N';" ERROR: invalid input syntax for integer: > "" CONTEXT: COPY foo, line 1, column a: "" [andrew@marmaduke > pginst]$ echo ',,' | bin/psql -c "create temp table foo (a int, > b text, c text); copy foo from stdin delimiter ',\"' ;" > [andrew@marmaduke pginst]$ > > > I hope that is expected behaviour - it's what *I* expect, at > least. > > > > Attached patch has these additions to previously posted patch: > . quote character may not appear in NULL marker > . any non-null value that matches the NULL marker is forced to be quoted > when written. OK, here is a new version of the patch that includes the grammar changes we agreed upon, SGML changes, and \copy support. I will not make any more changes without contacting you so feel free to make adjustments and repost. I have two open issues. First, CSV should support WITH OIDS, no? Second, I found a problem with NULLs. If I do: . test=> create table test (x text, y text); CREATE TABLE test=> insert into test values ('', NULL); INSERT 17221 1 test=> then this: test=> copy test to '/tmp/b' with csv; creates: "", and this: test=> copy test to '/tmp/b' with csv NULL 'fred'; creates: ,fred Is that logical? A non-null field went from "" to nothing. I think it is caused by this code: bool force_quote = (strcmp(string, null_print) == 0); CopyAttributeOutCSV(string, delim, quote, escape, force_quote); The reason it happens is that when the null string is '', it matches a zero-length string, so the value is quoted. When the null stirng isn't blank, a zero-length string doesn't match the null string so it isn't quoted. I think we need to add special logic for zero-length strings so they are always quoted, even if there is a special null string. This will make our dumps more consistent, I think, or maybe the current behavior is OK. It just struck me as strange. I did a dump/reload test with a null string and null, and it worked fine. Is there any data that can not be dumped/reloaded via CSV? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 Index: doc/src/sgml/ref/copy.sgml =================================================================== RCS file: /cvsroot/pgsql-server/doc/src/sgml/ref/copy.sgml,v retrieving revision 1.55 diff -c -c -r1.55 copy.sgml *** doc/src/sgml/ref/copy.sgml 13 Dec 2003 23:59:07 -0000 1.55 --- doc/src/sgml/ref/copy.sgml 13 Apr 2004 04:18:22 -0000 *************** *** 26,32 **** [ BINARY ] [ OIDS ] [ DELIMITER [ AS ] '<replaceable class="parameter">delimiter</replaceable>' ] ! [ NULL [ AS ] '<replaceable class="parameter">null string</replaceable>' ] ] COPY <replaceable class="parameter">tablename</replaceable> [ ( <replaceable class="parameter">column</replaceable> [,...] ) ] TO { '<replaceable class="parameter">filename</replaceable>' | STDOUT } --- 26,34 ---- [ BINARY ] [ OIDS ] [ DELIMITER [ AS ] '<replaceable class="parameter">delimiter</replaceable>' ] ! [ NULL [ AS ] '<replaceable class="parameter">null string</replaceable>' ] ! [ CSV [ QUOTE [ AS ] '<replaceable class="parameter">quote</replaceable>' ] ! [ ESCAPE [ AS ] '<replaceable class="parameter">escape</replaceable>' ] ] COPY <replaceable class="parameter">tablename</replaceable> [ ( <replaceable class="parameter">column</replaceable> [,...] ) ] TO { '<replaceable class="parameter">filename</replaceable>' | STDOUT } *************** *** 34,40 **** [ BINARY ] [ OIDS ] [ DELIMITER [ AS ] '<replaceable class="parameter">delimiter</replaceable>' ] ! [ NULL [ AS ] '<replaceable class="parameter">null string</replaceable>' ] ] </synopsis> </refsynopsisdiv> --- 36,44 ---- [ BINARY ] [ OIDS ] [ DELIMITER [ AS ] '<replaceable class="parameter">delimiter</replaceable>' ] ! [ NULL [ AS ] '<replaceable class="parameter">null string</replaceable>' ] ! [ CSV [ QUOTE [ AS ] '<replaceable class="parameter">quote</replaceable>' ] ! [ ESCAPE [ AS ] '<replaceable class="parameter">escape</replaceable>' ] ] </synopsis> </refsynopsisdiv> *************** *** 136,142 **** <para> Specifies copying the OID for each row. (An error is raised if <literal>OIDS</literal> is specified for a table that does not ! have OIDs.) </para> </listitem> </varlistentry> --- 140,146 ---- <para> Specifies copying the OID for each row. (An error is raised if <literal>OIDS</literal> is specified for a table that does not ! have OIDs.) FIX CSV FOR OIDS! </para> </listitem> </varlistentry> *************** *** 146,152 **** <listitem> <para> The single character that separates columns within each row ! (line) of the file. The default is a tab character. </para> </listitem> </varlistentry> --- 150,157 ---- <listitem> <para> The single character that separates columns within each row ! (line) of the file. The default is a tab character in normal mode, ! a comma in <literal>CSV</> mode. </para> </listitem> </varlistentry> *************** *** 156,175 **** <listitem> <para> The string that represents a null value. The default is ! <literal>\N</literal> (backslash-N). You might prefer an empty ! string, for example. </para> <note> <para> ! On a <command>COPY FROM</command>, any data item that matches this string will be stored as a null value, so you should make sure that you use the same string as you used with <command>COPY TO</command>. </para> </note> </listitem> </varlistentry> </variablelist> </refsect1> --- 161,225 ---- <listitem> <para> The string that represents a null value. The default is ! <literal>\N</literal> (backslash-N) in normal mode, and a missing ! value (no quotes) in <literal>CSV</> mode. You might prefer an empty ! string in cases where you don't want to distinguish nulls from ! empty strings. </para> <note> <para> ! When using <command>COPY FROM</command>, any data item that matches this string will be stored as a null value, so you should make sure that you use the same string as you used with <command>COPY TO</command>. </para> + + <para> + If you do not want anything used as null when using + <command>COPY FROM</command>, you can specify some value that is very + unlikely to appear in the file, such as <literal>frobnitz</literal> or + <literal>d5f4074b254c76cd8ae37bf1731f4aed</literal> (which is + <literal>md5('frobnitz')</literal>). This could be especially useful + when importing a <literal>CSV</> file into a table with <literal>NOT NULL</> + columns. + </para> </note> + </listitem> </varlistentry> + + <varlistentry> + <term><literal>CSV</literal></term> + <listitem> + <para> + Enables Comma Separated Variable (<literal>CSV</>) mode. It sets the + default <literal>DELIMITER</> to comma, and <literal>QUOTE</> and + <literal>ESCAPE</> values to double-quote. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term><replaceable class="parameter">quote string</replaceable></term> + <listitem> + <para> + Specifies the quotation character in <literal>CSV</> mode. + The default is double-quote. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term><replaceable class="parameter">escape string</replaceable></term> + <listitem> + <para> + Specifies the character that should appear before a <literal>QUOTE</> + data character value in <literal>CSV</> mode. The default is double-quote. + </para> + </listitem> + </varlistentry> + </variablelist> </refsect1> *************** *** 253,259 **** <para> When <command>COPY</command> is used without the <literal>BINARY</literal> option, ! the data read or written is a text file with one line per table row. Columns in a row are separated by the delimiter character. The column values themselves are strings generated by the output function, or acceptable to the input function, of each --- 303,310 ---- <para> When <command>COPY</command> is used without the <literal>BINARY</literal> option, ! the data read or written is a text file with one line per table row, ! unless <literal>CSV</> mode is used. Columns in a row are separated by the delimiter character. The column values themselves are strings generated by the output function, or acceptable to the input function, of each *************** *** 377,382 **** --- 428,473 ---- meant as data, <command>COPY FROM</command> will complain if the line endings in the input are not all alike. </para> + </refsect2> + + <refsect2> + <title>CSV Format</title> + + <para> + This format is used for importing from and exporting to the Comma + Separated Variable (<literal>CSV</>) file format used by many other programs, + such as spreadsheets. Instead of the escaping used by + <productname>PostgreSQL</productname>'s standard text mode, it produces + and recognises the common CSV escaping mechanism. + </para> + + <para> + The values in each record are separated by the delimiter character. + If the value contains the delimiter character, the <literal>QUOTE</> character + or a carriage return or line feed character, then the whole value is prefixed + and suffixed by the quote character, and any occurrence within the value + of a quote character or the <literal>ESCAPE</> character is preceded + by the escape character. + </para> + + <note> + <para> + CSV mode will both recognise and produce CSV files with quoted values + containing embedded carriage returns and line feeds. Thus the files are + not strictly one line per table row like non-CSV files. + </para> + </note> + + <note> + <para> + Many programs produce strange and occasionally perverse CSV files, so + the file format is more a convention than a standard. Thus you might + encounter some files that cannot be imported using this mechanism, and + <command>COPY</> might produce files that other programs can not + process. + </para> + </note> + </refsect2> <refsect2> Index: doc/src/sgml/ref/psql-ref.sgml =================================================================== RCS file: /cvsroot/pgsql-server/doc/src/sgml/ref/psql-ref.sgml,v retrieving revision 1.110 diff -c -c -r1.110 psql-ref.sgml *** doc/src/sgml/ref/psql-ref.sgml 12 Apr 2004 15:58:52 -0000 1.110 --- doc/src/sgml/ref/psql-ref.sgml 13 Apr 2004 04:18:25 -0000 *************** *** 711,716 **** --- 711,718 ---- [ <literal>oids</literal> ] [ <literal>delimiter [as] </literal> '<replaceable class="parameter">character</replaceable>' ] [ <literal>null [as] </literal> '<replaceable class="parameter">string</replaceable>' ]</literal> + [ <literal>csv [ quote [as] </literal> '<replaceable class="parameter">string</replaceable>' ] + [ <literal>escape [as] </literal> '<replaceable class="parameter">string</replaceable>' ] ] </term> <listitem> Index: src/backend/commands/copy.c =================================================================== RCS file: /cvsroot/pgsql-server/src/backend/commands/copy.c,v retrieving revision 1.219 diff -c -c -r1.219 copy.c *** src/backend/commands/copy.c 6 Apr 2004 13:21:33 -0000 1.219 --- src/backend/commands/copy.c 13 Apr 2004 04:18:27 -0000 *************** *** 70,76 **** typedef enum CopyReadResult { NORMAL_ATTR, ! END_OF_LINE } CopyReadResult; /* --- 70,77 ---- typedef enum CopyReadResult { NORMAL_ATTR, ! END_OF_LINE, ! UNTERMINATED_FIELD } CopyReadResult; /* *************** *** 130,144 **** /* non-export function prototypes */ static void CopyTo(Relation rel, List *attnumlist, bool binary, bool oids, ! char *delim, char *null_print); static void CopyFrom(Relation rel, List *attnumlist, bool binary, bool oids, ! char *delim, char *null_print); static bool CopyReadLine(void); static char *CopyReadAttribute(const char *delim, const char *null_print, CopyReadResult *result, bool *isnull); static Datum CopyReadBinaryAttribute(int column_no, FmgrInfo *flinfo, Oid typelem, bool *isnull); static void CopyAttributeOut(char *string, char *delim); static List *CopyGetAttnums(Relation rel, List *attnamelist); static void limit_printout_length(StringInfo buf); --- 131,150 ---- /* non-export function prototypes */ static void CopyTo(Relation rel, List *attnumlist, bool binary, bool oids, ! char *delim, char *null_print, bool csv_mode, char *quote, char *escape); static void CopyFrom(Relation rel, List *attnumlist, bool binary, bool oids, ! char *delim, char *null_print, bool csv_mode, char *quote, char *escape); static bool CopyReadLine(void); static char *CopyReadAttribute(const char *delim, const char *null_print, CopyReadResult *result, bool *isnull); + static char *CopyReadAttributeCSV(const char *delim, const char *null_print, + char *quote, char *escape, + CopyReadResult *result, bool *isnull); static Datum CopyReadBinaryAttribute(int column_no, FmgrInfo *flinfo, Oid typelem, bool *isnull); static void CopyAttributeOut(char *string, char *delim); + static void CopyAttributeOutCSV(char *string, char *delim, char *quote, + char *escape, bool force_quote); static List *CopyGetAttnums(Relation rel, List *attnamelist); static void limit_printout_length(StringInfo buf); *************** *** 682,688 **** --- 688,697 ---- List *attnumlist; bool binary = false; bool oids = false; + bool csv_mode = false; char *delim = NULL; + char *quote = NULL; + char *escape = NULL; char *null_print = NULL; Relation rel; AclMode required_access = (is_from ? ACL_INSERT : ACL_SELECT); *************** *** 725,730 **** --- 734,763 ---- errmsg("conflicting or redundant options"))); null_print = strVal(defel->arg); } + else if (strcmp(defel->defname, "csv") == 0) + { + if (csv_mode) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); + csv_mode = intVal(defel->arg); + } + else if (strcmp(defel->defname, "quote") == 0) + { + if (quote) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); + quote = strVal(defel->arg); + } + else if (strcmp(defel->defname, "escape") == 0) + { + if (escape) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); + escape = strVal(defel->arg); + } else elog(ERROR, "option \"%s\" not recognized", defel->defname); *************** *** 735,740 **** --- 768,778 ---- (errcode(ERRCODE_SYNTAX_ERROR), errmsg("cannot specify DELIMITER in BINARY mode"))); + if (binary && csv_mode) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("cannot specify CSV in BINARY mode"))); + if (binary && null_print) ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), *************** *** 742,751 **** /* Set defaults */ if (!delim) ! delim = "\t"; ! if (!null_print) ! null_print = "\\N"; /* * Open and lock the relation, using the appropriate lock type. --- 780,847 ---- /* Set defaults */ if (!delim) ! delim = csv_mode ? "," : "\t"; ! if (!null_print) ! null_print = csv_mode ? "" : "\\N"; ! ! if (csv_mode) ! { ! if (!quote) ! quote = "\""; ! if (!escape) ! escape = "\""; ! } ! ! /* ! * Only single-character delimiter strings are supported. ! */ ! if (strlen(delim) != 1) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! errmsg("COPY delimiter must be a single character"))); ! ! /* ! * Check quote ! */ ! if (!csv_mode && quote != NULL) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! errmsg("COPY quote available only in CSV mode"))); ! ! if (csv_mode && strlen(quote) != 1) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! errmsg("COPY quote must be a single character"))); ! ! /* ! * Check escape ! */ ! if (!csv_mode && escape != NULL) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! errmsg("COPY escape available only in CSV mode"))); ! ! if (csv_mode && strlen(escape) != 1) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! errmsg("COPY escape must be a single character"))); ! ! /* ! * Don't allow the delimiter to appear in the null string. ! */ ! if (strchr(null_print, delim[0]) != NULL) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! errmsg("COPY delimiter must not appear in the NULL specification"))); ! ! /* ! * Don't allow the csv quote char to appear in the null string. ! */ ! if (csv_mode && strchr(null_print, quote[0]) != NULL) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! errmsg("CSV quote character must not appear in the NULL specification"))); /* * Open and lock the relation, using the appropriate lock type. *************** *** 772,791 **** "psql's \\copy command also works for anyone."))); /* ! * Presently, only single-character delimiter strings are supported. */ - if (strlen(delim) != 1) - ereport(ERROR, - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY delimiter must be a single character"))); ! /* ! * Don't allow the delimiter to appear in the null string. ! */ ! if (strchr(null_print, delim[0]) != NULL) ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! errmsg("COPY delimiter must not appear in the NULL specification"))); /* * Don't allow COPY w/ OIDs to or from a table without them --- 868,880 ---- "psql's \\copy command also works for anyone."))); /* ! * Don't allow OIDs in CSV mode */ ! if (csv_mode && oids) // FIX ME bjm ereport(ERROR, ! (errcode(ERRCODE_SYNTAX_ERROR), ! errmsg("Cannot specify OIDS in CSV mode "))); /* * Don't allow COPY w/ OIDs to or from a table without them *************** *** 864,870 **** errmsg("\"%s\" is a directory", filename))); } } ! CopyFrom(rel, attnumlist, binary, oids, delim, null_print); } else { /* copy from database to file */ --- 953,960 ---- errmsg("\"%s\" is a directory", filename))); } } ! CopyFrom(rel, attnumlist, binary, oids, delim, null_print, csv_mode, ! quote, escape); } else { /* copy from database to file */ *************** *** 926,932 **** errmsg("\"%s\" is a directory", filename))); } } ! CopyTo(rel, attnumlist, binary, oids, delim, null_print); } if (!pipe) --- 1016,1023 ---- errmsg("\"%s\" is a directory", filename))); } } ! CopyTo(rel, attnumlist, binary, oids, delim, null_print, csv_mode, ! quote, escape); } if (!pipe) *************** *** 958,964 **** */ static void CopyTo(Relation rel, List *attnumlist, bool binary, bool oids, ! char *delim, char *null_print) { HeapTuple tuple; TupleDesc tupDesc; --- 1049,1056 ---- */ static void CopyTo(Relation rel, List *attnumlist, bool binary, bool oids, ! char *delim, char *null_print, bool csv_mode, char *quote, ! char *escape) { HeapTuple tuple; TupleDesc tupDesc; *************** *** 1051,1057 **** while ((tuple = heap_getnext(scandesc, ForwardScanDirection)) != NULL) { bool need_delim = false; - CHECK_FOR_INTERRUPTS(); MemoryContextReset(mycontext); --- 1143,1148 ---- *************** *** 1113,1119 **** value, ObjectIdGetDatum(elements[attnum - 1]), Int32GetDatum(attr[attnum - 1]->atttypmod))); ! CopyAttributeOut(string, delim); } else { --- 1204,1218 ---- value, ObjectIdGetDatum(elements[attnum - 1]), Int32GetDatum(attr[attnum - 1]->atttypmod))); ! if (csv_mode) ! { ! bool force_quote = (strcmp(string, null_print) == 0); ! CopyAttributeOutCSV(string, delim, quote, escape, ! force_quote); ! } ! else ! CopyAttributeOut(string, delim); ! } else { *************** *** 1243,1249 **** */ static void CopyFrom(Relation rel, List *attnumlist, bool binary, bool oids, ! char *delim, char *null_print) { HeapTuple tuple; TupleDesc tupDesc; --- 1342,1349 ---- */ static void CopyFrom(Relation rel, List *attnumlist, bool binary, bool oids, ! char *delim, char *null_print, bool csv_mode, char *quote, ! char *escape) { HeapTuple tuple; TupleDesc tupDesc; *************** *** 1388,1396 **** ExecBSInsertTriggers(estate, resultRelInfo); if (!binary) - { file_has_oids = oids; /* must rely on user to tell us this... */ - } else { /* Read and verify binary header */ --- 1488,1494 ---- *************** *** 1499,1504 **** --- 1597,1603 ---- if (file_has_oids) { + /* can't be in CSV mode here */ string = CopyReadAttribute(delim, null_print, &result, &isnull); *************** *** 1537,1550 **** errmsg("missing data for column \"%s\"", NameStr(attr[m]->attname)))); ! string = CopyReadAttribute(delim, null_print, ! &result, &isnull); ! ! if (isnull) { ! /* we read an SQL NULL, no need to do anything */ } else { copy_attname = NameStr(attr[m]->attname); values[m] = FunctionCall3(&in_functions[m], --- 1636,1658 ---- errmsg("missing data for column \"%s\"", NameStr(attr[m]->attname)))); ! if (csv_mode) { ! string = CopyReadAttributeCSV(delim, null_print, quote, ! escape, &result, &isnull); ! if (result == UNTERMINATED_FIELD) ! ereport(ERROR, ! (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), ! errmsg("unterminated CSV quoted field"))); } else + string = CopyReadAttribute(delim, null_print, + &result, &isnull); + + + + /* we read an SQL NULL, no need to do anything */ + if (!isnull) { copy_attname = NameStr(attr[m]->attname); values[m] = FunctionCall3(&in_functions[m], *************** *** 2069,2074 **** --- 2177,2328 ---- return attribute_buf.data; } + + /* + * Read the value of a single attribute in CSV mode, + * performing de-escaping as needed. Escaping does not follow the normal + * PostgreSQL text mode, but instead "standard" (i.e. common) CSV usage. + * + * Quoted fields can span lines, in which case the line end is embedded + * in the returned string. + * + * null_print is the null marker string. Note that this is compared to + * the pre-de-escaped input string (thus if it is quoted it is not a NULL). + * + * *result is set to indicate what terminated the read: + * NORMAL_ATTR: column delimiter + * END_OF_LINE: end of line + * UNTERMINATED_FIELD no quote detected at end of a quoted field + * + * In any case, the string read up to the terminator (or end of file) + * is returned. + * + * *isnull is set true or false depending on whether the input matched + * the null marker. Note that the caller cannot check this since the + * returned string will be the post-de-escaping equivalent, which may + * look the same as some valid data string. + *---------- + */ + + static char * + CopyReadAttributeCSV(const char *delim, const char *null_print, char *quote, + char *escape, CopyReadResult *result, bool *isnull) + { + char delimc = delim[0]; + char quotec = quote[0]; + char escapec = escape[0]; + char c; + int start_cursor = line_buf.cursor; + int end_cursor = start_cursor; + int input_len; + bool in_quote = false; + bool saw_quote = false; + + /* reset attribute_buf to empty */ + attribute_buf.len = 0; + attribute_buf.data[0] = '\0'; + + /* set default status */ + *result = END_OF_LINE; + + for (;;) + { + /* handle multiline quoted fields */ + if (in_quote && line_buf.cursor >= line_buf.len) + { + bool done; + + switch(eol_type) + { + case EOL_NL: + appendStringInfoString(&attribute_buf,"\n"); + break; + case EOL_CR: + appendStringInfoString(&attribute_buf,"\r"); + break; + case EOL_CRNL: + appendStringInfoString(&attribute_buf,"\r\n"); + break; + case EOL_UNKNOWN: + /* shouldn't happen - just keep going */ + break; + } + + copy_lineno++; + done = CopyReadLine(); + if (done && line_buf.len == 0) + break; + start_cursor = line_buf.cursor; + } + + end_cursor = line_buf.cursor; + if (line_buf.cursor >= line_buf.len) + break; + c = line_buf.data[line_buf.cursor++]; + /* + * unquoted field delimiter + */ + if (!in_quote && c == delimc) + { + *result = NORMAL_ATTR; + break; + } + /* + * start of quoted field (or part of field) + */ + if (!in_quote && c == quotec) + { + saw_quote = true; + in_quote = true; + continue; + } + /* + * escape within a quoted field + */ + if (in_quote && c == escapec) + { + /* + * peek at the next char if available, and escape it if it + * is an escape char or a quote char + */ + if (line_buf.cursor <= line_buf.len) + { + char nextc = line_buf.data[line_buf.cursor]; + if (nextc == escapec || nextc == quotec) + { + appendStringInfoCharMacro(&attribute_buf, nextc); + line_buf.cursor++; + continue; + } + } + } + /* + * end of quoted field. + * Must do this test after testing for escape in case quote char + * and escape char are the same (which is the common case). + */ + if (in_quote && c == quotec) + { + in_quote = false; + continue; + } + appendStringInfoCharMacro(&attribute_buf, c); + } + + if (in_quote) + *result = UNTERMINATED_FIELD; + + /* check whether raw input matched null marker */ + input_len = end_cursor - start_cursor; + if (!saw_quote && input_len == strlen(null_print) && + strncmp(&line_buf.data[start_cursor], null_print, input_len) == 0) + *isnull = true; + else + *isnull = false; + + return attribute_buf.data; + } + /* * Read a binary attribute */ *************** *** 2192,2197 **** --- 2446,2518 ---- break; } } + } + + /* + * Send CSV representation of one attribute, with conversion and + * CSV type escaping + */ + static void + CopyAttributeOutCSV(char *server_string, char *delim, char *quote, + char *escape, bool force_quote) + { + char *string; + char c; + char delimc = delim[0]; + char quotec = quote[0]; + char escapec = escape[0]; + bool need_quote = force_quote; + char *test_string; + bool same_encoding; + int mblen; + int i; + + same_encoding = (server_encoding == client_encoding); + if (!same_encoding) + string = (char *) pg_server_to_client((unsigned char *) server_string, + strlen(server_string)); + else + string = server_string; + + /* have to run through the string twice, + * first time to see if it needs quoting, second to actually send it + */ + + for(test_string = string; + !need_quote && (c = *test_string) != '\0'; + test_string += mblen) + { + if (c == delimc || c == quotec || c == '\n' || c == '\r') + need_quote = true; + if (!same_encoding) + mblen = pg_encoding_mblen(client_encoding, test_string); + else + mblen = 1; + } + + if (need_quote) + CopySendChar(quotec); + + for (; (c = *string) != '\0'; string += mblen) + { + if (c == quotec || c == escapec) + CopySendChar(escapec); + + CopySendChar(c); + + if (!same_encoding) + { + /* send additional bytes of the char, if any */ + mblen = pg_encoding_mblen(client_encoding, string); + for (i = 1; i < mblen; i++) + CopySendChar(string[i]); + } + else + mblen = 1; + } + + if (need_quote) + CopySendChar(quotec); } /* Index: src/backend/parser/gram.y =================================================================== RCS file: /cvsroot/pgsql-server/src/backend/parser/gram.y,v retrieving revision 2.450 diff -c -c -r2.450 gram.y *** src/backend/parser/gram.y 5 Apr 2004 03:07:26 -0000 2.450 --- src/backend/parser/gram.y 13 Apr 2004 04:18:31 -0000 *************** *** 343,349 **** CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE CLUSTER COALESCE COLLATE COLUMN COMMENT COMMIT COMMITTED CONSTRAINT CONSTRAINTS CONVERSION_P CONVERT COPY CREATE CREATEDB ! CREATEUSER CROSS CURRENT_DATE CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS --- 343,349 ---- CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE CLUSTER COALESCE COLLATE COLUMN COMMENT COMMIT COMMITTED CONSTRAINT CONSTRAINTS CONVERSION_P CONVERT COPY CREATE CREATEDB ! CREATEUSER CROSS CSV CURRENT_DATE CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS *************** *** 386,391 **** --- 386,393 ---- PRECISION PRESERVE PREPARE PRIMARY PRIOR PRIVILEGES PROCEDURAL PROCEDURE + QUOTE + READ REAL RECHECK REFERENCES REINDEX RELATIVE_P RENAME REPEATABLE REPLACE RESET RESTART RESTRICT RETURNS REVOKE RIGHT ROLLBACK ROW ROWS RULE *************** *** 1356,1361 **** --- 1358,1375 ---- { $$ = makeDefElem("delimiter", (Node *)makeString($3)); } + | CSV + { + $$ = makeDefElem("csv", (Node *)makeInteger(TRUE)); + } + | QUOTE opt_as Sconst + { + $$ = makeDefElem("quote", (Node *)makeString($3)); + } + | ESCAPE opt_as Sconst + { + $$ = makeDefElem("escape", (Node *)makeString($3)); + } | NULL_P opt_as Sconst { $$ = makeDefElem("null", (Node *)makeString($3)); *************** *** 7420,7425 **** --- 7434,7440 ---- | COPY | CREATEDB | CREATEUSER + | CSV | CURSOR | CYCLE | DATABASE *************** *** 7507,7512 **** --- 7522,7528 ---- | PRIVILEGES | PROCEDURAL | PROCEDURE + | QUOTE | READ | RECHECK | REINDEX Index: src/backend/parser/keywords.c =================================================================== RCS file: /cvsroot/pgsql-server/src/backend/parser/keywords.c,v retrieving revision 1.147 diff -c -c -r1.147 keywords.c *** src/backend/parser/keywords.c 11 Mar 2004 01:47:40 -0000 1.147 --- src/backend/parser/keywords.c 13 Apr 2004 04:18:31 -0000 *************** *** 90,95 **** --- 90,96 ---- {"createdb", CREATEDB}, {"createuser", CREATEUSER}, {"cross", CROSS}, + {"csv", CSV}, {"current_date", CURRENT_DATE}, {"current_time", CURRENT_TIME}, {"current_timestamp", CURRENT_TIMESTAMP}, *************** *** 248,253 **** --- 249,255 ---- {"privileges", PRIVILEGES}, {"procedural", PROCEDURAL}, {"procedure", PROCEDURE}, + {"quote", QUOTE}, {"read", READ}, {"real", REAL}, {"recheck", RECHECK}, Index: src/bin/psql/copy.c =================================================================== RCS file: /cvsroot/pgsql-server/src/bin/psql/copy.c,v retrieving revision 1.43 diff -c -c -r1.43 copy.c *** src/bin/psql/copy.c 12 Apr 2004 15:58:52 -0000 1.43 --- src/bin/psql/copy.c 13 Apr 2004 04:18:32 -0000 *************** *** 66,73 **** --- 66,76 ---- bool from; bool binary; bool oids; + bool csv_mode; char *delim; char *null; + char *quote; + char *escape; }; *************** *** 81,86 **** --- 84,91 ---- free(ptr->file); free(ptr->delim); free(ptr->null); + free(ptr->quote); + free(ptr->escape); free(ptr); } *************** *** 277,282 **** --- 282,291 ---- { result->oids = true; } + else if (strcasecmp(token, "csv") == 0) + { + result->csv_mode = true; + } else if (strcasecmp(token, "delimiter") == 0) { token = strtokx(NULL, whitespace, NULL, "'", *************** *** 301,306 **** --- 310,339 ---- else goto error; } + else if (strcasecmp(token, "quote") == 0) + { + token = strtokx(NULL, whitespace, NULL, "'", + '\\', false, pset.encoding); + if (token && strcasecmp(token, "as") == 0) + token = strtokx(NULL, whitespace, NULL, "'", + '\\', false, pset.encoding); + if (token) + result->quote = pg_strdup(token); + else + goto error; + } + else if (strcasecmp(token, "escape") == 0) + { + token = strtokx(NULL, whitespace, NULL, "'", + '\\', false, pset.encoding); + if (token && strcasecmp(token, "as") == 0) + token = strtokx(NULL, whitespace, NULL, "'", + '\\', false, pset.encoding); + if (token) + result->escape = pg_strdup(token); + else + goto error; + } else goto error; *************** *** 340,346 **** PGresult *result; bool success; struct stat st; ! /* parse options */ options = parse_slash_copy(args); --- 373,380 ---- PGresult *result; bool success; struct stat st; ! bool with_output = false; ! /* parse options */ options = parse_slash_copy(args); *************** *** 379,390 **** --- 413,454 ---- options->delim); } + /* There is no backward-compatible CSV syntax */ if (options->null) { if (options->null[0] == '\'') appendPQExpBuffer(&query, " WITH NULL AS %s", options->null); else appendPQExpBuffer(&query, " WITH NULL AS '%s'", options->null); + with_output = true; + } + + if (options->csv_mode) + { + appendPQExpBuffer(&query, " %sCSV ", with_output ? "" : "WITH "); + with_output = true; + } + + if (options->quote) + { + if (options->quote[0] == '\'') + appendPQExpBuffer(&query, " %sQUOTE AS %s", + with_output ? "" : "WITH ", options->quote); + else + appendPQExpBuffer(&query, " %sQUOTE AS '%s'", + with_output ? "" : "WITH ", options->quote); + with_output = true; + } + + if (options->escape) + { + if (options->escape[0] == '\'') + appendPQExpBuffer(&query, " %sESCAPE AS %s", + with_output ? "" : "WITH ", options->escape); + else + appendPQExpBuffer(&query, " %sESCAPE AS '%s'", + with_output ? "" : "WITH ", options->escape); + with_output = true; } if (options->from)
Bruce Momjian said: > Second, I found a problem with NULLs. If I do: > . > test=> create table test (x text, y text); > CREATE TABLE > test=> insert into test values ('', NULL); > INSERT 17221 1 > test=> > > then this: > > test=> copy test to '/tmp/b' with csv; > > creates: > > "", > > and this: > > test=> copy test to '/tmp/b' with csv NULL 'fred'; > > creates: > > ,fred > > Is that logical? A non-null field went from "" to nothing. > One more point about this - we can't force quoting of every non-null value, which would remove the "inconsistency" you see here, because spreadsheets especially infer information from whether or not a CSV value is quoted. In particular, they will not usually treat a quoted numeric value as numeric, which would be a very undesirable effect. cheers andrew
Andrew Dunstan wrote: >Bruce Momjian said: > > >>Second, I found a problem with NULLs. If I do: >>. >> test=> create table test (x text, y text); >> CREATE TABLE >> test=> insert into test values ('', NULL); >> INSERT 17221 1 >> test=> >> >>then this: >> >> test=> copy test to '/tmp/b' with csv; >> >>creates: >> >> "", >> >>and this: >> >> test=> copy test to '/tmp/b' with csv NULL 'fred'; >> >>creates: >> >> ,fred >> >>Is that logical? A non-null field went from "" to nothing. >> >> >> > >One more point about this - we can't force quoting of every non-null >value, which would remove the "inconsistency" you see here, because >spreadsheets especially infer information from whether or not a CSV value >is quoted. In particular, they will not usually treat a quoted numeric >value as numeric, which would be a very undesirable effect. > > > Thinking about this some more .... maybe the right rule would be "quote all non-numeric non-null values". That would fix the odd effect that Bruce saw, and it would also stop a spreadsheet from interpreting a date like 2002-03-04 as an arithmetic expression. Note that we don't care if a value is quoted or not - the only inference we draw from it is that if it is quoted it can't be null. We don't need to draw type inferences from it because we know the type we are trying to import into from the table defn. This fits nicely with the rule about being liberal with what you accept. Thoughts? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Thinking about this some more .... maybe the right rule would be "quote > all non-numeric non-null values". And how would you define "numeric"? I do *not* like putting data-type-specific knowledge into COPY. regards, tom lane
Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>Thinking about this some more .... maybe the right rule would be "quote >>all non-numeric non-null values". >> >> > >And how would you define "numeric"? > > At least the following: int8 int2 int4 float4 float8 numeric money and domains based on them. >I do *not* like putting data-type-specific knowledge into COPY. > > I'm trying to keep this as simple as possible. But we have to be a bit smart if we want to be able to export nicely. Here's the problem: say you have a text field that stores something that has numeric form (phone number, SSN, whatever). You want that exported as text (i.e. quoted). Otherwise, things like leading zeros will get lost by the importing program. However, you *must* not quote genuine number values, or they will not be imported correctly either. So, either we get a little bit smart about the types we are exporting, or our export is going to be broken in some cases. We need to be able to use more information than is contained in the text representation of the value. I would certainly hope to keep all this confined to the CSV code path, and not intrude on any existing functionality in any way. Exporting nicely has a lot more wrinkles than importing nicely, because predicting the behaviour of the program we might be exporting to is difficult. If you can suggest a better way I'm all ears. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> And how would you define "numeric"? > At least the following: > int8 > int2 > int4 > float4 > float8 > numeric > money > and domains based on them. Wrong answer, as this excludes user-defined types. COPY should not discriminate on the basis of recognizing particular data types. > I'm trying to keep this as simple as possible. But we have to be a bit > smart if we want to be able to export nicely. Here's the problem: say > you have a text field that stores something that has numeric form (phone > number, SSN, whatever). You want that exported as text (i.e. quoted). > Otherwise, things like leading zeros will get lost by the importing > program. However, you *must* not quote genuine number values, or they > will not be imported correctly either. Again, you are trying to make COPY into something it isn't and shouldn't be. > Exporting nicely has a lot more wrinkles than importing nicely, because > predicting the behaviour of the program we might be exporting to is > difficult. s/difficult/impossible/. I might be willing to accept this sort of cruft if it were well-defined cruft, but in point of fact you are trying to set up expectations that will be impossible to satisfy. We will be forever making more little tweaks to COPY that render its behavior ever less predictable, in the vain hope of reclosing the can of worms you want to open. It would be a lot wiser to implement this sort of behavior outside the backend, in code that is easily hacked by users. regards, tom lane
Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>Tom Lane wrote: >> >> >>>And how would you define "numeric"? >>> >>> > > > >>At least the following: >> >> > > > >> int8 >> int2 >> int4 >> float4 >> float8 >> numeric >> money >> >> > > > >>and domains based on them. >> >> > >Wrong answer, as this excludes user-defined types. COPY should not >discriminate on the basis of recognizing particular data types. > > > >>I'm trying to keep this as simple as possible. But we have to be a bit >>smart if we want to be able to export nicely. Here's the problem: say >>you have a text field that stores something that has numeric form (phone >>number, SSN, whatever). You want that exported as text (i.e. quoted). >>Otherwise, things like leading zeros will get lost by the importing >>program. However, you *must* not quote genuine number values, or they >>will not be imported correctly either. >> >> > >Again, you are trying to make COPY into something it isn't and shouldn't >be. > > > >>Exporting nicely has a lot more wrinkles than importing nicely, because >>predicting the behaviour of the program we might be exporting to is >>difficult. >> >> > >s/difficult/impossible/. I might be willing to accept this sort of >cruft if it were well-defined cruft, but in point of fact you are trying >to set up expectations that will be impossible to satisfy. We will be >forever making more little tweaks to COPY that render its behavior ever >less predictable, in the vain hope of reclosing the can of worms you >want to open. It would be a lot wiser to implement this sort of >behavior outside the backend, in code that is easily hacked by users. > > > I have neither the time nor the inclination for a fight over this. I hope I never have to use this anyway. Plan B would be to provide a libpq client for importing/exporting CSVs. This is such a basic function that I'd like to see it in the core distribution. Most of the required logic is in what has already been done. It would mean that it had to be done via a client connection, which might have speed implications for very large imports that could run faster direct from the server's file system. I thought there was an agreement on how to do this, but the straight jacket into which it is being forced will significantly limit its utility, IMNSHO. cheers andrew
On Tue, Apr 13, 2004 at 10:43:35AM -0400, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > Thinking about this some more .... maybe the right rule would be "quote > > all non-numeric non-null values". > > And how would you define "numeric"? And don't forget that number format depend on locale. Now we ignore LC_NUMERIC, but maybe at some time in future we will support it. I think the best solution is quote all values including numerics too. 123,456 (in Czech this is one number with decimal "point" ;-) Karel -- Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/
Karel Zak said: > On Tue, Apr 13, 2004 at 10:43:35AM -0400, Tom Lane wrote: >> Andrew Dunstan <andrew@dunslane.net> writes: >> > Thinking about this some more .... maybe the right rule would be >> > "quote all non-numeric non-null values". >> >> And how would you define "numeric"? > > And don't forget that number format depend on locale. Now we ignore > LC_NUMERIC, but maybe at some time in future we will support it. I > think the best solution is quote all values including numerics too. > > 123,456 (in Czech this is one number with decimal "point" ;-) > I am assuming that such users would use a delimiter other than a comma. Any string that contains the delimiter *must* be quoted, regardless of type. That's just the way CSVs work - it's not an artifact of ours. cheers andrew