Re: COPY for CSV documentation - Mailing list pgsql-patches

From Bruce Momjian
Subject Re: COPY for CSV documentation
Date
Msg-id 200404110031.i3B0VGC20184@candle.pha.pa.us
Whole thread Raw
In response to COPY for CSV documentation  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: COPY for CSV documentation  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-patches
I have reviewed this patch.  Basically, CSV is enabled by specifying
more than one delimiter character to COPY, e.g. DELIMITER ',"' or
DELIMITER ',""'.  Is this API good for folks?

Prior to 7.2, a multi-character delimiter could be specified, but only
the first character was used.  7.2 release notes state:

     * COPY DELIMITERS string must be exactly one character (Tom)

I am a little worried about multibyte too, as you mentioned.  It doesn't
look like we support multi-byte delimiters.  I see in the code:

    if (strlen(delim) != 1)
        ereport(ERROR,

With this code, if you use multibyte delimiters, you get an error.
However, with CVS, I can see cases where someone trying to use multibyte
would get a very strange output file rather than an error.  It would use
byte one of the multibyte for the delimiter, and the second byte for the
quote, and maybe a third byte for quoting quotes.

We could use pg_mblen().  Maybe we should compute pg_mblen and strlen on
the delimiter and throw an error if the they are different lengths.
That would prevent multibyte delimiters.

Comments?

---------------------------------------------------------------------------

Andrew Dunstan wrote:
>
> Attached is a patch with documentation for the CSV mode of COPY patch
> submitted yesterday.
>
> cheers
>
> andrew

> Index: doc/src/sgml/ref/copy.sgml
> ===================================================================
> RCS file: /projects/cvsroot/pgsql-server/doc/src/sgml/ref/copy.sgml,v
> retrieving revision 1.55
> diff -c -r1.55 copy.sgml
> *** doc/src/sgml/ref/copy.sgml    13 Dec 2003 23:59:07 -0000    1.55
> --- doc/src/sgml/ref/copy.sgml    10 Apr 2004 16:22:23 -0000
> ***************
> *** 136,142 ****
>        <para>
>         Specifies copying the OID for each row.  (An error is raised if
>         <literal>OIDS</literal> is specified for a table that does not
> !       have OIDs.)
>        </para>
>       </listitem>
>      </varlistentry>
> --- 136,143 ----
>        <para>
>         Specifies copying the OID for each row.  (An error is raised if
>         <literal>OIDS</literal> is specified for a table that does not
> !       have OIDs, or if CSV mode is selected by specifying
> !       <literal>DELIMITER</literal> longer then one character.)
>        </para>
>       </listitem>
>      </varlistentry>
> ***************
> *** 145,152 ****
>       <term><replaceable class="parameter">delimiter</replaceable></term>
>       <listitem>
>        <para>
> !       The single character that separates columns within each row
> !       (line) of the file.  The default is a tab character.
>        </para>
>       </listitem>
>      </varlistentry>
> --- 146,162 ----
>       <term><replaceable class="parameter">delimiter</replaceable></term>
>       <listitem>
>        <para>
> !       A one- to three-character string containing characters used in importing
> !       or exporting Text or Comma Separated Variable (CSV) files.
> !       With only one character, Text file format is used.
> !       If there are two or three characters, CSV format is used.
> !       The first character for either file format is the delimiter character
> !       used between values for a single row.
> !       The second character is the CSV quote character,
> !       and the third character, if present, is the CSV escape character
> !       used inside quoted values, and defaults to the quote character.
> !       The default for this parameter is a single tab character. (Thus TEXT
> !       file mode is the default.)
>        </para>
>       </listitem>
>      </varlistentry>
> ***************
> *** 156,163 ****
>       <listitem>
>        <para>
>         The string that represents a null value. The default is
> !       <literal>\N</literal> (backslash-N). You might prefer an empty
> !       string, for example.
>        </para>
>
>        <note>
> --- 166,174 ----
>       <listitem>
>        <para>
>         The string that represents a null value. The default is
> !       <literal>\N</literal> (backslash-N), unless a CSV is being processed,
> !       in which case it is an empty string. You might prefer an empty string
> !       in any case.
>        </para>
>
>        <note>
> ***************
> *** 168,173 ****
> --- 179,197 ----
>          <command>COPY TO</command>.
>         </para>
>        </note>
> +
> +      <note>
> +       <para>
> +        If you don't want anything used as null when using
> +        <command>COPY FROM</command>, you can specify some value that is very
> +        unlikely to appear in the file, such as <literal>frobnitz</literal> or
> +        <literal>d5f4074b254c76cd8ae37bf1731f4aed</literal> (which is
> +        <literal>md5('frobnitz')</literal>). This could be especially useful
> +        when importing a CSV file into a table with <literal>NOT NULL</>
> +        columns.
> +       </para>
> +      </note>
> +
>       </listitem>
>      </varlistentry>
>     </variablelist>
> ***************
> *** 252,259 ****
>      <title>Text Format</title>
>
>      <para>
> !     When <command>COPY</command> is used without the <literal>BINARY</literal> option,
> !     the data read or written is a text file with one line per table row.
>       Columns in a row are separated by the delimiter character.
>       The column values themselves are strings generated by the
>       output function, or acceptable to the input function, of each
> --- 276,285 ----
>      <title>Text Format</title>
>
>      <para>
> !     When <command>COPY</command> is used without the <literal>BINARY</literal>
> !     option, the data read or written is a text file with one line per table
> !     row unless <literal>DELIMITER</literal> is longer than one character,
> !     in which case CSV format is used.
>       Columns in a row are separated by the delimiter character.
>       The column values themselves are strings generated by the
>       output function, or acceptable to the input function, of each
> ***************
> *** 380,385 ****
> --- 406,459 ----
>     </refsect2>
>
>     <refsect2>
> +    <title>CSV format</title>
> +
> +    <para>
> +     This format is used for importing from and exporting to the Comma
> +     Separated Variable (CSV) file format used by many other programs,
> +     such as spreadsheets. Instead of escaping in
> +     <productname>PostgreSQL</productname>'s standard text mode, it produces
> +     and recognises the common CSV escaping mechanism.
> +    </para>
> +
> +    <para>
> +     The values in each record are separated by the delimiter character, which
> +     is the first character in the <literal>DELIMITER</literal> parameter.
> +     If the value contains the delimiter character, the quote character (the
> +     second character in the <literal>DELIMITER</literal> parameter), or a
> +     carriage return or line feed character, then the whole value is prefixed
> +     and suffixed by the quote character, and any occurrence within the value
> +     of a quote character or the escape character (the third character in the
> +     <literal>DELIMITER</literal> parameter if present, or else just the quote
> +     character itself) is preceded by the escape character.
> +    </para>
> +
> +    <para>
> +     In the most common cases, simply specifying <literal>DELIMITER ',"'</>
> +     or <literal>DELIMITER ',\''</> should suffice, both for
> +     <command>COPY FROM</> and <command>COPY TO</>.
> +    </para>
> +
> +    <note>
> +     <para>
> +      CSV mode will both recognise and produce CSV files with quoted values
> +      containing embedded carriage returns and line feeds. Thus the files are
> +      not strictly one line per table row, as TEXT files are.
> +     </para>
> +    </note>
> +
> +    <note>
> +     <para>
> +      Many programs produce strange and occasionally perverse CSV files, and
> +      the file format is a convention rather than a standard. Thus you might
> +      encounter some files that cannot be imported using this mechanism, and
> +      you might produce files that other programs fail to recognise properly.
> +     </para>
> +    </note>
> +
> +   </refsect2>
> +
> +   <refsect2>
>      <title>Binary Format</title>
>
>      <para>
> ***************
> *** 532,537 ****
> --- 606,620 ----
>      using the vertical bar (<literal>|</literal>) as the field delimiter:
>   <programlisting>
>   COPY country TO STDOUT WITH DELIMITER '|';
> + </programlisting>
> +   </para>
> +
> +   <para>
> +    To do the same thing but instead produce a standard CSV file, using
> +    <literal>,</> as the delimiter and <literal>"</> as both the quote and
> +    escape characters:
> + <programlisting>
> + COPY country TO STDOUT WITH DELIMITER ',"';
>   </programlisting>
>     </para>
>

>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

pgsql-patches by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: sound extract century/millennium date_part
Next
From: Andrew Dunstan
Date:
Subject: Re: COPY for CSV documentation