Re: New "raw" COPY format - Mailing list pgsql-hackers

From jian he
Subject Re: New "raw" COPY format
Date
Msg-id CACJufxGWet+n+E7-ymwMxA8cFPGc65CmBpxOfT_hi9OPnou3Gg@mail.gmail.com
In response to Re: New "raw" COPY format  (Tatsuo Ishii <ishii@postgresql.org>)
List pgsql-hackers
On Wed, Oct 16, 2024 at 2:37 PM Joel Jacobson <joel@compiler.org> wrote:
>
> On Wed, Oct 16, 2024, at 05:31, jian he wrote:
> > Hi.
> > I only checked 0001, 0002, 0003.
> > the raw format patch is v9-0016.
> > 003-0016 is a lot of small patches, maybe you can consolidate it to
> > make the review more easier.
>
> Thanks for reviewing.
>
> OK, I've consolidated the v9 0003-0016 into a single patch.
>

+  <refsect2>
+   <title>Raw Format</title>
+
+   <para>
+    This format option is used for importing and exporting files containing
+    unstructured text, where each line is treated as a single field. It is
+    ideal for data that does not conform to a structured, tabular format and
+    lacks delimiters.
+   </para>
+
+   <para>
+    In the <literal>raw</literal> format, each line of the input or output is
+    considered a complete value without any field separation. There are no
+    field delimiters, and all characters are taken literally. There is no
+    special handling for quotes, backslashes, or escape sequences. All
+    characters, including whitespace and special characters, are preserved
+    exactly as they appear in the file. However, it's important to note that
+    the text is still interpreted according to the specified <literal>ENCODING</literal>
+    option or the current client encoding for input, and encoded using the
+    specified <literal>ENCODING</literal> or the current client encoding for output.
+   </para>
+
+   <para>
+    When using this format, the <command>COPY</command> command must specify
+    exactly one column. Specifying multiple columns will result in an error.
+    If the table has multiple columns and no column list is provided, an error
+    will occur.
+   </para>
+
+   <para>
+    The <literal>raw</literal> format does not distinguish a <literal>NULL</literal>
+    value from an empty string. Empty lines are imported as empty strings, not
+    as <literal>NULL</literal> values.
+   </para>
+
+   <para>
+    Encoding works the same as in the <literal>text</literal> and <literal>CSV</literal> formats.
+   </para>
+
+  </refsect2>
+
+  <refsect2>
+   <title>Raw Format</title>
+
+   <para>
+    This format option is used for importing and exporting files containing
+    unstructured text, where each line is treated as a single field. It is
+    ideal for data that does not conform to a structured, tabular format and
+    lacks delimiters.
+   </para>
+
+   <para>
+    In the <literal>raw</literal> format, each line of the input or output is
+    considered a complete value without any field separation. There are no
+    field delimiters, and all characters are taken literally. There is no
+    special handling for quotes, backslashes, or escape sequences. All
+    characters, including whitespace and special characters, are preserved
+    exactly as they appear in the file. However, it's important to note that
+    the text is still interpreted according to the specified <literal>ENCODING</literal>
+    option or the current client encoding for input, and encoded using the
+    specified <literal>ENCODING</literal> or the current client encoding for output.
+   </para>
+
+   <para>
+    When using this format, the <command>COPY</command> command must specify
+    exactly one column. Specifying multiple columns will result in an error.
+    If the table has multiple columns and no column list is provided, an error
+    will occur.
+   </para>
+
+   <para>
+    The <literal>raw</literal> format does not distinguish a <literal>NULL</literal>
+    value from an empty string. Empty lines are imported as empty strings, not
+    as <literal>NULL</literal> values.
+   </para>
+
+   <para>
+    Encoding works the same as in the <literal>text</literal> and <literal>CSV</literal> formats.
+   </para>
+
+  </refsect2>
+
   <refsect2 id="sql-copy-binary-format" xreflabel="Binary Format">
    <title>Binary Format</title>

The <refsect2> <title>Raw Format</title> section is duplicated.
The Raw Format section also doesn't mention the special handling of the
end-of-data marker.


+COPY copy_raw_test (col) FROM :'filename' RAW;
We may not need to support this.
Since we do not allow
COPY x from stdin text;
COPY x to stdout text;
I think adding the RAW keyword in gram.y may not be necessary.


    /* Complete COPY <sth> FROM|TO filename WITH (FORMAT */
    else if (Matches("COPY|\\copy", MatchAny, "FROM|TO", MatchAny, "WITH", "(", "FORMAT"))
        COMPLETE_WITH("binary", "csv", "text");
In src/bin/psql/tab-complete.in.c, we can also add "raw" to this list.



    /* --- ESCAPE option --- */
    if (opts_out->escape)
    {
        if (opts_out->format != COPY_FORMAT_CSV)
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
            /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
                    errmsg("COPY %s requires CSV mode", "ESCAPE")));
    }
The ESCAPE option has no regression test.


    /* --- QUOTE option --- */
    if (opts_out->quote)
    {
        if (opts_out->format != COPY_FORMAT_CSV)
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
            /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
                    errmsg("COPY %s requires CSV mode", "QUOTE")));
    }
The QUOTE option has no regression test either.


CopyOneRowTo
    else if (cstate->opts.format == COPY_FORMAT_RAW)
    {
        int            attnum;
        Datum        value;
        bool        isnull;
        /* Ensure only one column is being copied */
        if (list_length(cstate->attnumlist) != 1)
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("COPY with format 'raw' must specify exactly one column")));
        attnum = linitial_int(cstate->attnumlist);
        value = slot->tts_values[attnum - 1];
        isnull = slot->tts_isnull[attnum - 1];
        if (!isnull)
        {
            char       *string = OutputFunctionCall(&out_functions[attnum - 1],
                                                    value);
            CopyAttributeOutRaw(cstate, string);
        }
        /* For RAW format, we don't send anything for NULL values */
    }
We already check the number of columns in BeginCopyTo, so is the
"if (list_length(cstate->attnumlist) != 1)" error check still needed in
CopyOneRowTo?
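
To illustrate the point, here is a minimal standalone sketch of doing the
user-facing check once at setup and keeping only an assertion in the per-row
path. This is not PostgreSQL code: SketchCopyState, sketch_begin_copy_to and
sketch_copy_one_row_to are hypothetical stand-ins for the COPY state,
BeginCopyTo and CopyOneRowTo, and a boolean return value stands in for
ereport(ERROR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the COPY state; not the real PostgreSQL struct. */
typedef struct SketchCopyState
{
    bool        raw_format;     /* true when FORMAT raw was requested */
    int         ncolumns;       /* number of columns selected for output */
} SketchCopyState;

/*
 * Stand-in for BeginCopyTo: validate the column count once, before any row
 * is processed.  Returns false where the real code would ereport(ERROR).
 */
static bool
sketch_begin_copy_to(const SketchCopyState *cstate)
{
    if (cstate->raw_format && cstate->ncolumns != 1)
    {
        fprintf(stderr,
                "COPY with format 'raw' must specify exactly one column\n");
        return false;
    }
    return true;
}

/*
 * Stand-in for CopyOneRowTo: the per-row path relies on the earlier check
 * and keeps only a debug assertion.  A NULL value produces empty output,
 * mirroring the quoted code, which sends nothing for NULL values.
 * Returns the number of bytes written to buf, excluding the terminator.
 */
static int
sketch_copy_one_row_to(const SketchCopyState *cstate, const char *value,
                       char *buf, size_t buflen)
{
    assert(!cstate->raw_format || cstate->ncolumns == 1);

    return snprintf(buf, buflen, "%s", value != NULL ? value : "");
}
```

With this shape, a bad column list is reported once to the user before any
output is produced, and the per-row function only documents the invariant.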


