Thread: Re: New "raw" COPY format

Re: New "raw" COPY format

From

Tatsuo Ishii

Date:

13 October 2024, 12:52:04

> Hi hackers,
> 
> This thread is about implementing a new "raw" COPY format.
> 
> This idea came up in a different thread [1], moved here.
> 
> [1] https://postgr.es/m/47b5c6a7-5c0e-40aa-8ea2-c7b95ccf296f%40app.fastmail.com
> 
> The main use-case for the raw format, is when needing to import arbitrary
> unstructured text files, such as log files, into a single text column
> of a table.

After copy imported the "unstructured text file" in "row" COPY format,
what the column type is?  text? or bytea? If it's text, how do you
handle encoding conversion if the "unstructured text file" is encoded
in server side unsafe encoding such as SJIS?

>  All characters are taken literally.
>  There is no special handling for quotes, backslashes, or escape sequences.

If SJIS text is imported "literally" (i.e. no encoding conversion), it
should be rejected.

Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Re: New "raw" COPY format

From

"Joel Jacobson"

Date:

13 October 2024, 15:39:54

On Sun, Oct 13, 2024, at 11:52, Tatsuo Ishii wrote:
> After copy imported the "unstructured text file" in "row" COPY format,
> what the column type is?  text? or bytea? If it's text, how do you
> handle encoding conversion if the "unstructured text file" is encoded
> in server side unsafe encoding such as SJIS?
>
>>  All characters are taken literally.
>>  There is no special handling for quotes, backslashes, or escape sequences.
>
> If SJIS text is imported "literally" (i.e. no encoding conversion), it
> should be rejected.

I think encoding conversion is still necessary,
and should work the same as for the COPY formats "text" and "csv".

/Joel

Re: New "raw" COPY format

From

Jacob Champion

Date:

15 October 2024, 20:30:26

Hi,

Idle thoughts from a design perspective -- feel free to ignore, since
I'm not the target audience for the feature:

- If the column data stored in Postgres contains newlines, it seems
like COPY TO won't work "correctly". Is that acceptable?
- RAW seems like an okay-ish label, but for something that's doing as
much magic end-of-line detection as this patch is, I'd personally
prefer SINGLE (as in, "single column").
- Speaking of magic end-of-line detection, can there be a way to turn
that off? Say, via DELIMITER?
- Generic DELIMITER support, for any single-byte separator at all,
might make a "single-column" format more generally applicable. But I
might be over-architecting. And it would make the COPY TO issue even
worse...

Thanks,
--Jacob

Re: New "raw" COPY format

From

"Joel Jacobson"

Date:

15 October 2024, 23:38:06

On Tue, Oct 15, 2024, at 19:30, Jacob Champion wrote:
> Hi,
>
> Idle thoughts from a design perspective -- feel free to ignore, since
> I'm not the target audience for the feature:

Many thanks for looking at this!

> - If the column data stored in Postgres contains newlines, it seems
> like COPY TO won't work "correctly". Is that acceptable?

That's an interesting edge-case to think about.

Rejecting such column data, to ensure all data that can be COPY TO'd,
can be loaded back using COPY FROM, preserving the data intact,
seems attractive because it protects users against unintentional mistakes,
and all the other COPY formats are able to preserve the data unchanged,
when doing a COPY FROM of a file created with COPY TO.

OTOH, if we think of COPY TO in this case as a way of `cat`-ing
text values, it might be more pragmatic to allow it.
With `cat`-ing I mean like if in Unix doing...
    cat file1.txt file2.txt > file3.txt
...there is no way to reverse that operation,
that is, to reconstruct file1.txt and file2.txt from file3.txt.

However, I thinking rejecting such column data seems like the
better alternative, to ensure data exported with COPY TO
can always be imported back using COPY FROM,
for the same format. If text column data contains newlines,
users probably ought to be using the text or csv format instead.

> - RAW seems like an okay-ish label, but for something that's doing as
> much magic end-of-line detection as this patch is, I'd personally
> prefer SINGLE (as in, "single column").

It's actually the same end-of-line detection as the text format
in copyfromparse.c's CopyReadLineText(), except the code
is simpler thanks to not having to deal with quotes or escapes.

It basically just learns the newline sequence based on the first
occurrence, and then require it to be the same throughout the file.

The same data files can be tested with the text format,
since they don't contain any escape, quote or delimiter characters.

Different newline styles detected automatically also by format text:

    COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_lr.data' (FORMAT text);
    COPY 2

    COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_cr.data' (FORMAT text);
    COPY 2

    COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_cr_lr.data' (FORMAT text);
    COPY 2

The mixed newline style causes errors also for the text format:

    COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_1.data' (FORMAT text);
    ERROR:  literal newline found in data
    HINT:  Use "\n" to represent newline.
    CONTEXT:  COPY t, line 2

    COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_2.data' (FORMAT text);
    ERROR:  literal newline found in data
    HINT:  Use "\n" to represent newline.
    CONTEXT:  COPY t, line 2

    COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_3.data' (FORMAT text);
    ERROR:  literal carriage return found in data
    HINT:  Use "\r" to represent carriage return.
    CONTEXT:  COPY t, line 2

    COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_4.data' (FORMAT text);
    ERROR:  literal carriage return found in data
    HINT:  Use "\r" to represent carriage return.
    CONTEXT:  COPY t, line 2

    COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_5.data' (FORMAT text);
    ERROR:  literal carriage return found in data
    HINT:  Use "\r" to represent carriage return.
    CONTEXT:  COPY t, line 2

I must confess, I didn't know about this newline detection before reading the
copyfromparse.c source code.

> - Speaking of magic end-of-line detection, can there be a way to turn
> that off? Say, via DELIMITER?
> - Generic DELIMITER support, for any single-byte separator at all,
> might make a "single-column" format more generally applicable. But I
> might be over-architecting. And it would make the COPY TO issue even
> worse...

That's an interesting idea that would provide more flexibility,
though, at the cost of complicating things by overloading the meaning
of DELIMITER.

If aiming to make this more generally applicable,
then at least DELIMITER would need to be multi-byte,
since otherwise the Windows case \r\n couldn't be specified.

But I feel COPY already has quite a lot of options, and I fear it's
quite complicated for users as it is.

What I found appealing with the idea of a new COPY format,
was that instead of overloading the existing options
with more complexity, a new format wouldn't need to affect
the existing options, and the new format could be explained
separately, without making things worse for users not
using this format.

/Joel

Re: New "raw" COPY format

From

jian he

Date:

16 October 2024, 06:31:56

On Tue, Oct 15, 2024 at 8:50 PM Joel Jacobson <joel@compiler.org> wrote:
>

Hi.
I only checked 0001, 0002, 0003.
the raw format patch is v9-0016.
003-0016 is a lot of small patches, maybe you can consolidate it to
make the review more easier.

-COPY x to stdin (format TEXT, force_quote(a));
+COPY x to stdout (format TEXT, force_quote(a));
0001 make sense to me, i think generally we do "to stdout", "from stdin"

v9-0002-Fix-validation-of-FORCE_NOT_NULL-FORCE_NULL-for-all-.patch
looks good.


typedef enum CopyLogVerbosityChoice
{
    COPY_LOG_VERBOSITY_SILENT = -1, /* logs none */
    COPY_LOG_VERBOSITY_DEFAULT = 0, /* logs no additional messages. As this is
                                     * the default, assign 0 */
    COPY_LOG_VERBOSITY_VERBOSE, /* logs additional messages */
} CopyLogVerbosityChoice;
/*
 * Represents the format of the COPY operation.
 */
typedef enum CopyFormat
{
    COPY_FORMAT_TEXT,
    COPY_FORMAT_BINARY,
    COPY_FORMAT_CSV,
} CopyFormat;

BeginCopyTo
cstate = (CopyToStateData *) palloc0(sizeof(CopyToStateData));
ProcessCopyOptions(pstate, &cstate->opts, false /* is_from */ , options);

palloc0(sizeof(CopyToStateData)); makes the default format to COPY_FORMAT_TEXT.

I think you may need COPY_FORMAT_TEXT = 0, even though based on [1],
it seems not required.

[1] https://stackoverflow.com/questions/6434105/are-default-enum-values-in-c-the-same-for-all-compilers

Re: New "raw" COPY format

From

Jacob Champion

Date:

16 October 2024, 19:04:50

On Tue, Oct 15, 2024 at 1:38 PM Joel Jacobson <joel@compiler.org> wrote:
>
> However, I thinking rejecting such column data seems like the
> better alternative, to ensure data exported with COPY TO
> can always be imported back using COPY FROM,
> for the same format. If text column data contains newlines,
> users probably ought to be using the text or csv format instead.

Yeah. I think _someone's_ going to have strong opinions one way or the
other, but that person is not me. And I assume a contents check during
COPY TO is going to have a noticeable performance impact...

> > - RAW seems like an okay-ish label, but for something that's doing as
> > much magic end-of-line detection as this patch is, I'd personally
> > prefer SINGLE (as in, "single column").
>
> It's actually the same end-of-line detection as the text format
> in copyfromparse.c's CopyReadLineText(), except the code
> is simpler thanks to not having to deal with quotes or escapes.

Right, sorry, I hadn't meant to imply that you made it up. :D Just
that a "raw" format that is actually automagically detecting things
doesn't seem very "raw" to me, so I prefer the other name.

> It basically just learns the newline sequence based on the first
> occurrence, and then require it to be the same throughout the file.

A hypothetical type whose text representation can contain '\r' but not
'\n' still can't be unambiguously round-tripped under this scheme:
COPY FROM will see the "mixed" line endings and complain, even though
there's no ambiguity.

Maybe no one will run into that problem in practice? But if they did,
I think that'd be a pretty frustrating limitation. It'd be nice to
override the behavior, to change it from "do what you think I mean" to
"do what I say".

> > - Speaking of magic end-of-line detection, can there be a way to turn
> > that off? Say, via DELIMITER?
> > - Generic DELIMITER support, for any single-byte separator at all,
> > might make a "single-column" format more generally applicable. But I
> > might be over-architecting. And it would make the COPY TO issue even
> > worse...
>
> That's an interesting idea that would provide more flexibility,
> though, at the cost of complicating things by overloading the meaning
> of DELIMITER.

I think that'd be a docs issue rather than a conceptual one, though...
it's still a delimiter. I wouldn't really expect end-user confusion.

> If aiming to make this more generally applicable,
> then at least DELIMITER would need to be multi-byte,
> since otherwise the Windows case \r\n couldn't be specified.

True.

> What I found appealing with the idea of a new COPY format,
> was that instead of overloading the existing options
> with more complexity, a new format wouldn't need to affect
> the existing options, and the new format could be explained
> separately, without making things worse for users not
> using this format.

I agree that we should not touch the existing formats. If
RAW/SINGLE/whatever needed a multibyte line delimiter, I'm not
proposing that the other formats should change.

--Jacob

Re: New "raw" COPY format

From

"Daniel Verite"

Date:

16 October 2024, 19:34:13

    Joel Jacobson wrote:

> However, I thinking rejecting such column data seems like the
> better alternative, to ensure data exported with COPY TO
> can always be imported back using COPY FROM,
> for the same format.

On the other hand, that might prevent cases where we
want to export, for instance, a valid json array:

  copy (select json_agg(col) from table ) to 'file' RAW

This is a variant of the discussion in [1] where the OP does:

 copy (select json_agg(row_to_json(t)) from <query> t) TO 'file'

and he complains that both text and csv "break the JSON".
That discussion morphed into a proposed patch adding JSON
format to COPY,  but RAW would work directly as the OP
expected.

That is, unless <query> happens to include JSON fields with LF/CRLF
in them, and the RAW format says this is an error condition.
In that case it's quite annoying to make it an error, rather than
simply let it pass.

[1]
https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: New "raw" COPY format

From

"Joel Jacobson"

Date:

16 October 2024, 20:36:24

On Wed, Oct 16, 2024, at 18:04, Jacob Champion wrote:
> A hypothetical type whose text representation can contain '\r' but not
> '\n' still can't be unambiguously round-tripped under this scheme:
> COPY FROM will see the "mixed" line endings and complain, even though
> there's no ambiguity.

Yeah, that's quite an ugly limitation.

> Maybe no one will run into that problem in practice? But if they did,
> I think that'd be a pretty frustrating limitation. It'd be nice to
> override the behavior, to change it from "do what you think I mean" to
> "do what I say".

That would be nice.

>> That's an interesting idea that would provide more flexibility,
>> though, at the cost of complicating things by overloading the meaning
>> of DELIMITER.
>
> I think that'd be a docs issue rather than a conceptual one, though...
> it's still a delimiter. I wouldn't really expect end-user confusion.

Yeah, I meant the docs, but that's probably fine,
we could just add <note> to DELIMITER.

>> What I found appealing with the idea of a new COPY format,
>> was that instead of overloading the existing options
>> with more complexity, a new format wouldn't need to affect
>> the existing options, and the new format could be explained
>> separately, without making things worse for users not
>> using this format.
>
> I agree that we should not touch the existing formats. If
> RAW/SINGLE/whatever needed a multibyte line delimiter, I'm not
> proposing that the other formats should change.

Right, I didn't think you did either, I meant overloading the existing
options, from a docs perspective.

But I agree it's probably fine if we just overload DELIMITER in the docs,
that should be possible to explain in a pedagogic way,
without causing confusion.

/Joel

Re: New "raw" COPY format

From

"Joel Jacobson"

Date:

16 October 2024, 21:30:47

On Wed, Oct 16, 2024, at 18:34, Daniel Verite wrote:
> Joel Jacobson wrote:
>
>> However, I thinking rejecting such column data seems like the
>> better alternative, to ensure data exported with COPY TO
>> can always be imported back using COPY FROM,
>> for the same format. 
>
> On the other hand, that might prevent cases where we
> want to export, for instance, a valid json array:
>
>   copy (select json_agg(col) from table ) to 'file' RAW
>
> This is a variant of the discussion in [1] where the OP does:
>
>  copy (select json_agg(row_to_json(t)) from <query> t) TO 'file' 
>
> and he complains that both text and csv "break the JSON".
> That discussion morphed into a proposed patch adding JSON
> format to COPY,  but RAW would work directly as the OP
> expected.
>
> That is, unless <query> happens to include JSON fields with LF/CRLF
> in them, and the RAW format says this is an error condition.
> In that case it's quite annoying to make it an error, rather than
> simply let it pass.
>
>
> [1]
> https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com

Thanks for finding this related thread.

Very good example, that would be solved with RAW.

I've used a different hack myself many times to export text "as is",
which is to use COPY TO ... BINARY, and then to manually edit the
file using vim, stripping away the header and footer from the file. :D

But I don't see how JSON fields that come from PostgreSQL
could contain LF/CRLF though, could they?
Since LF/CR must be escaped inside a JSON field,
and when casting JSONB to text, there are no newlines
injected anywhere in between fields.
I can only see how a value of the legacy JSON type could
have newlines in it.

That doesn't matter though, this is still a very good example
on the need to export text "as is" from PostgreSQL to a file.

I like Jacob's idea of letting the user specify a DELIMITER
also for the RAW format, to override the automagical
newline detection. This would e.g. allow importing values
containing e.g. \r when the newline delimiter is \n,
which would otherwise be reject with an
"inconsistent newline style" error.

I think it would also be useful to allow specifying
DELIMITER NONE, which would allow importing an
entire text file, "as is", into a single column and single row.

To export a single column single row, the delimiter
wouldn't matter, since there wouldn't be any.
But DELIMITER NONE would be useful also for COPY TO,
if wanting to concatenate text "as is" without adding
newlines in between, to a file, similar to the
UNIX "cat" command.

Regarding the name for the format, I thought SINGLE
was nice before I read this message, but now since
the need for more rawness seems desirable,
I think I like RAW the most again, since if the
automagical newline detection can be overridden,
then it's really raw for real.

A final thought is to maybe consider just skipping
the automagical newline detection for RAW?

Instead of the automagical detection,
the default newline delimiter could be the OS default,
similar to how COPY TO works.

That way, it would almost always just work for most users,
as long as processing files within their OS,
and when not, they would just need to specify the DELIMITER.

/Joel

Re: New "raw" COPY format

From

"Joel Jacobson"

Date:

16 October 2024, 22:13:55

On Wed, Oct 16, 2024, at 20:30, Joel Jacobson wrote:
> A final thought is to maybe consider just skipping
> the automagical newline detection for RAW?
>
> Instead of the automagical detection,
> the default newline delimiter could be the OS default,
> similar to how COPY TO works.
>
> That way, it would almost always just work for most users,
> as long as processing files within their OS,
> and when not, they would just need to specify the DELIMITER.

I would guess that nowadays, dealing with unstructured text files
are probably less common, than dealing with structured text files,
such as JSON, YAML, TOML, XML, etc.

Therefore, maybe DELIMITER NONE would be a better default
for RAW? Especially since it's then also more honest in being "raw".

If needing to import an unstructured text file that is just newline
delimited, and not wanting the entire file as a single value,
the newline style would then just need to be specified
using the DELIMITER option.

/Joel

Re: New "raw" COPY format

From

"Joel Jacobson"

Date:

17 October 2024, 14:26:59

On Wed, Oct 16, 2024, at 21:13, Joel Jacobson wrote:
> Therefore, maybe DELIMITER NONE would be a better default
> for RAW? Especially since it's then also more honest in being "raw".
>
> If needing to import an unstructured text file that is just newline
> delimited, and not wanting the entire file as a single value,
> the newline style would then just need to be specified
> using the DELIMITER option.

I realize the DELIMITER NONE syntax is unnecessary,
since if that's the default for RAW, we would just not specify any delimiter.

/Joel

Re: New "raw" COPY format

From

jian he

Date:

18 October 2024, 16:52:43

On Wed, Oct 16, 2024 at 2:37 PM Joel Jacobson <joel@compiler.org> wrote:
>
> On Wed, Oct 16, 2024, at 05:31, jian he wrote:
> > Hi.
> > I only checked 0001, 0002, 0003.
> > the raw format patch is v9-0016.
> > 003-0016 is a lot of small patches, maybe you can consolidate it to
> > make the review more easier.
>
> Thanks for reviewing.
>
> OK, I've consolidated the v9 0003-0016 into a single patch.
>

+  <refsect2>
+   <title>Raw Format</title>
+
+   <para>
+    This format option is used for importing and exporting files containing
+    unstructured text, where each line is treated as a single field. It is
+    ideal for data that does not conform to a structured, tabular format and
+    lacks delimiters.
+   </para>
+
+   <para>
+    In the <literal>raw</literal> format, each line of the input or output is
+    considered a complete value without any field separation. There are no
+    field delimiters, and all characters are taken literally. There is no
+    special handling for quotes, backslashes, or escape sequences. All
+    characters, including whitespace and special characters, are preserved
+    exactly as they appear in the file. However, it's important to note that
+    the text is still interpreted according to the specified
<literal>ENCODING</literal>
+    option or the current client encoding for input, and encoded using the
+    specified <literal>ENCODING</literal> or the current client
encoding for output.
+   </para>
+
+   <para>
+    When using this format, the <command>COPY</command> command must specify
+    exactly one column. Specifying multiple columns will result in an error.
+    If the table has multiple columns and no column list is provided, an error
+    will occur.
+   </para>
+
+   <para>
+    The <literal>raw</literal> format does not distinguish a
<literal>NULL</literal>
+    value from an empty string. Empty lines are imported as empty strings, not
+    as <literal>NULL</literal> values.
+   </para>
+
+   <para>
+    Encoding works the same as in the <literal>text</literal> and
<literal>CSV</literal> formats.
+   </para>
+
+  </refsect2>
+
+  <refsect2>
+   <title>Raw Format</title>
+
+   <para>
+    This format option is used for importing and exporting files containing
+    unstructured text, where each line is treated as a single field. It is
+    ideal for data that does not conform to a structured, tabular format and
+    lacks delimiters.
+   </para>
+
+   <para>
+    In the <literal>raw</literal> format, each line of the input or output is
+    considered a complete value without any field separation. There are no
+    field delimiters, and all characters are taken literally. There is no
+    special handling for quotes, backslashes, or escape sequences. All
+    characters, including whitespace and special characters, are preserved
+    exactly as they appear in the file. However, it's important to note that
+    the text is still interpreted according to the specified
<literal>ENCODING</literal>
+    option or the current client encoding for input, and encoded using the
+    specified <literal>ENCODING</literal> or the current client
encoding for output.
+   </para>
+
+   <para>
+    When using this format, the <command>COPY</command> command must specify
+    exactly one column. Specifying multiple columns will result in an error.
+    If the table has multiple columns and no column list is provided, an error
+    will occur.
+   </para>
+
+   <para>
+    The <literal>raw</literal> format does not distinguish a
<literal>NULL</literal>
+    value from an empty string. Empty lines are imported as empty strings, not
+    as <literal>NULL</literal> values.
+   </para>
+
+   <para>
+    Encoding works the same as in the <literal>text</literal> and
<literal>CSV</literal> formats.
+   </para>
+
+  </refsect2>
+
   <refsect2 id="sql-copy-binary-format" xreflabel="Binary Format">
    <title>Binary Format</title>

<refsect2> <title>Raw Format</title> is duplicated
<title>Raw Format</title> didn't mention the special handling of
end-of-data marker.


+COPY copy_raw_test (col) FROM :'filename' RAW;
we may need to support this.
since we not allow
COPY x from stdin text;
COPY x to stdout text;
so I think adding the RAW keyword in gram.y may not be necessary.


    /* Complete COPY <sth> FROM|TO filename WITH (FORMAT */
    else if (Matches("COPY|\\copy", MatchAny, "FROM|TO", MatchAny,
"WITH", "(", "FORMAT"))
        COMPLETE_WITH("binary", "csv", "text");
src/bin/psql/tab-complete.in.c, we can also add "raw".



    /* --- ESCAPE option --- */
    if (opts_out->escape)
    {
        if (opts_out->format != COPY_FORMAT_CSV)
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
            /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
                    errmsg("COPY %s requires CSV mode", "ESCAPE")));
}
escape option no regress test.


    /* --- QUOTE option --- */
    if (opts_out->quote)
    {
        if (opts_out->format != COPY_FORMAT_CSV)
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
            /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
                    errmsg("COPY %s requires CSV mode", "QUOTE")));
}
escape option no regress test.


CopyOneRowTo
    else if (cstate->opts.format == COPY_FORMAT_RAW)
    {
        int            attnum;
        Datum        value;
        bool        isnull;
        /* Ensure only one column is being copied */
        if (list_length(cstate->attnumlist) != 1)
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("COPY with format 'raw' must specify
exactly one column")));
        attnum = linitial_int(cstate->attnumlist);
        value = slot->tts_values[attnum - 1];
        isnull = slot->tts_isnull[attnum - 1];
        if (!isnull)
        {
            char       *string = OutputFunctionCall(&out_functions[attnum - 1],
                                                    value);
            CopyAttributeOutRaw(cstate, string);
        }
        /* For RAW format, we don't send anything for NULL values */
    }
We already did column length checking at BeginCopyTo.
no need to  "if (list_length(cstate->attnumlist) != 1)" error check in
CopyOneRowTo?

Re: New "raw" COPY format

From

"Joel Jacobson"

Date:

19 October 2024, 10:55:15

On Fri, Oct 18, 2024, at 19:24, Joel Jacobson wrote:
> Attachments:
> * v11-0001-Refactor-ProcessCopyOptions-introduce-CopyFormat-enu.patch
> * v11-0002-Add-raw-format-to-COPY-command.patch

Here is a demo of a importing a decently sized real text file,
that can't currently be imported without the CSV hack:

$ head /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
.package-cache-mutate                        devel/cargo
bin                                admin/base-files
bin/archdetect                            admin/ubiquity
bin/autopartition                        admin/ubiquity
bin/autopartition-crypto                    admin/ubiquity
bin/autopartition-loop                        admin/ubiquity
bin/autopartition-lvm                        admin/ubiquity
bin/block-attr                            admin/ubiquity
bin/blockdev-keygen                        admin/ubiquity
bin/blockdev-wipe                        admin/ubiquity

This file uses a combination of tabs and spaces, in between the two columns,
so none of the existing formats are suitable to deal with this file.

$ ls -lah /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
-rw-r--r-- 1 root root 791M Apr 24 02:07 /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64

To import using the CSV hack, we first have find two bytes that don't exist anyway,
which can be done using e.g. ripgrep. The below command verifies \x01 and \x02
don't exist anywhere:

$ rg -uuu --multiline '(?-u)[\x01|\x02]' /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
$

Knowing these bytes don't exist anywhere,
we can then safely use these as delimiter and quote characters,
as a hack to disable these features:

CREATE TABLE package_contents (raw_line text);

COPY package_contents FROM '/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' (FORMAT CSV,
DELIMITERE'\x01', QUOTE E'\x02');
 
COPY 8443588
Time: 3882.100 ms (00:03.882)
Time: 3552.991 ms (00:03.553)
Time: 3748.038 ms (00:03.748)
Time: 3775.947 ms (00:03.776)
Time: 3729.020 ms (00:03.729)

I tested writing a Rust program that would read the file line-by-line and INSERT each line instead.
This is of course a lot slower, since it has to execute each insert separately:

$ cargo run --release
   Compiling insert_package_contents v0.1.0 (/home/joel/insert_package_contents)
    Finished `release` profile [optimized] target(s) in 0.70s
     Running `target/release/insert_package_contents`
Connecting to the PostgreSQL database...
Successfully connected to the database.
Starting to insert lines from the file...
Successfully inserted 8443588 lines into package_contents in 134.65s.

New approach using the RAW format:

COPY package_contents FROM '/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' (FORMAT RAW,
DELIMITERE'\n');
 
COPY 8443588
Time: 2918.489 ms (00:02.918)
Time: 3020.372 ms (00:03.020)
Time: 3336.589 ms (00:03.337)
Time: 3067.268 ms (00:03.067)
Time: 3343.694 ms (00:03.344)

Apart from the convenience improvement,
it seems to be somewhat faster already.

/Joel

Re: New "raw" COPY format

From

jian he

Date:

19 October 2024, 13:13:50

On Sat, Oct 19, 2024 at 1:24 AM Joel Jacobson <joel@compiler.org> wrote:
>>
> Handling of e.g. JSON and other structured text files that could contain
> newlines, in a seamless way seems important, so therefore the default is
> no delimiter for the raw format, so that the entire input is read as one data
> value for COPY FROM, and all column data is concatenated without delimiter
> for COPY TO.
>
> When specifying a delimiter for the raw format, it separates *rows*, and can be
> a multi-byte string, such as E'\r\n' to handle Windows text files.
>
> This has been documented under the DELIMITER option, as well as under the
> Raw Format section.
>
We already make RAW and can only have one column.
if RAW has no default delimiter, then COPY FROM a text file will
become one datum value;
which makes it looks like importing a Large Object.
(https://www.postgresql.org/docs/17/lo-funcs.html)

i think, most of the time, you have more than one row/value to import
and export?

> The refactoring is now in a separate first single commit, which seems
> necessary, to separate the new functionality, from the refactoring.
I agree.

ProcessCopyOptions
/* Extract options from the statement node tree */
foreach(option, options)
{
}
/* --- DELIMITER option --- */
/* --- NULL option --- */
/* --- QUOTE option --- */
Currently the regress test passed, i think that means your refactor is fine.

in ProcessCopyOptions, maybe we can rearrange the code after the
foreach loop (foreach(option, options)
based on the parameters order in
https://www.postgresql.org/docs/devel/sql-copy.html Parameters section.
so we can review it by comparing the refactoring with the
sql-copy.html Parameters section's description.

>
> > We already did column length checking at BeginCopyTo.
> > no need to  "if (list_length(cstate->attnumlist) != 1)" error check in
> > CopyOneRowTo?
>
> Hmm, not sure really, since DoCopy() calls both BeginCopyTo()
> and DoCopyTo() which in turn calls CopyOneRowTo(),
> but CopyOneRowTo() is also being called from copy_dest_receive().
>

BeginCopyTo do the preparation work.
cstate->attnumlist = CopyGetAttnums(tupDesc, cstate->rel, attnamelist);

After CopyGetAttnums, the number of attributes for COPY TO cannot be changed.
right after CopyGetAttnums call then check the length of cstate->attnumlist
seems fine for me.
I think in CopyOneRowTo, we can actually
Assert(list_length(cstate->attnumlist) == 1).
for raw format.

src10=# drop table if exists x;
create table x(a int);
COPY x from stdin (FORMAT raw);
DROP TABLE
CREATE TABLE
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 11
>> 12
>> \.
ERROR:  invalid input syntax for type integer: "11
12
"
CONTEXT:  COPY x, line 1, column a: "11
12
"

The above case means COPY FROM STDIN (FORMAT RAW) can only import one
single value (when successful).
user need to specify like:

COPY x from stdin (FORMAT raw, delimiter E'\n');

seems raw format default no delimiter is not user friendly.

Re: New "raw" COPY format

From

"Joel Jacobson"

Date:

21 October 2024, 18:11:16

On Mon, Oct 21, 2024, at 16:35, jian he wrote:
> make the ProcessCopyOptions process in following order:
> 1. Extract options from the statement node tree
> 2. checking each option, if not there set default value.
> 3. checking for interdependent options
>
> I still think
> making step2 aligned with the doc parameter section order will make it
> more readable.
>
> based on your patch
> (v12-0001-Refactor-ProcessCopyOptions-introduce-CopyFormat-enu.patch)
> I put some checking to step3, make step2 checking order aligned with doc.

Smart to move the interdependent check to the designated section for it,
that's exactly the right place for it.

Really nice the order in the code now is aligned with the doc order.

/Joel

Re: New "raw" COPY format

From

Masahiko Sawada

Date:

24 October 2024, 03:54:39

Hi,

On Sat, Oct 19, 2024 at 8:33 AM Joel Jacobson <joel@compiler.org> wrote:
>
> On Sat, Oct 19, 2024, at 12:13, jian he wrote:
> > We already make RAW and can only have one column.
> > if RAW has no default delimiter, then COPY FROM a text file will
> > become one datum value;
> > which makes it looks like importing a Large Object.
> > (https://www.postgresql.org/docs/17/lo-funcs.html)
>
> The single datum value might not come from a physical column; it could be
> an aggregated JSON value, as in the example Daniel mentioned:
>
> On Wed, Oct 16, 2024, at 18:34, Daniel Verite wrote:
> >   copy (select json_agg(col) from table ) to 'file' RAW
> >
> > This is a variant of the discussion in [1] where the OP does:
> >
> >  copy (select json_agg(row_to_json(t)) from <query> t) TO 'file'
> >
> > and he complains that both text and csv "break the JSON".
> > That discussion morphed into a proposed patch adding JSON
> > format to COPY,  but RAW would work directly as the OP
> > expected.
> >
> > That is, unless <query> happens to include JSON fields with LF/CRLF
> > in them, and the RAW format says this is an error condition.
> > In that case it's quite annoying to make it an error, rather than
> > simply let it pass.
> >
> > [1]
> > https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com
>
> In such cases, a user could perform the following:
>
>     CREATE TABLE customers (id int, name text, email text);
>
>     INSERT INTO customers (id, name, email) VALUES
>     (1, 'John Doe', 'john.doe@example.com'),
>     (2, 'Jane Smith', 'jane.smith@example.com'),
>     (3, 'Alice Johnson', 'alice.johnson@example.com');
>
>     COPY (SELECT json_agg(row_to_json(t)) FROM customers t) TO '/tmp/file' (FORMAT raw);
>
> % cat /tmp/file
> [{"id":1,"name":"John Doe","email":"john.doe@example.com"}, {"id":2,"name":"Jane
Smith","email":"jane.smith@example.com"},{"id":3,"name":"Alice Johnson","email":"alice.johnson@example.com"}]% 
>
> > i think, most of the time, you have more than one row/value to import
> > and export?
>
> Yes, probably, but it might not be a physical row. It could be an aggregated
> one, like in the example above. When importing, it might be a large JSON array
> of objects that is imported into a temporary table and then deserialized into
> a proper schema.
>
> The need to load entire files is already fulfilled by pg_read_file(text) -> text,
> but there is no pg_write_file(), likely for security reasons.
> So COPY TO ... (FORMAT RAW) with no delimiter seems necessary,
> and then COPY FROM also needs to work accordingly.
>
> >> The refactoring is now in a separate first single commit, which seems
> >> necessary, to separate the new functionality, from the refactoring.
> > I agree.
>
> > ProcessCopyOptions
> > /* Extract options from the statement node tree */
> > foreach(option, options)
> > {
> > }
> > /* --- DELIMITER option --- */
> > /* --- NULL option --- */
> > /* --- QUOTE option --- */
> > Currently the regress test passed, i think that means your refactor is fine.
>
> I believe that a passing test indicates it might be okay,
> but a failing test definitely means it's not. :D
>
> I've meticulously refactored one option at a time, checking which code in
> ProcessCopyOptions depends on each option field to ensure the semantics
> are preserved.
>
> I think the changes are easy to follow, and it's clear that each change is
> correct when looking at them individually, though it might be more challenging
> when viewing the total change.
>
> I've tried to minimize code movement, preserving as much of the original
> code placement as possible.
>
> > in ProcessCopyOptions, maybe we can rearrange the code after the
> > foreach loop (foreach(option, options)
> > based on the parameters order in
> > https://www.postgresql.org/docs/devel/sql-copy.html Parameters section.
> > so we can review it by comparing the refactoring with the
> > sql-copy.html Parameters section's description.
>
> That would be nice, but unfortunately, it's not possible because the order of
> the option code blocks matters due to the setting of defaults in else/else
> if branches when an option is not specified.
>
> For example, in the documentation, DEFAULT precedes QUOTE,
> but in ProcessCopyOptions, the QUOTE code block must come before
> the DEFAULT code block due to the check:
>
>    /* Don't allow the CSV quote char to appear in the default string. */
>
> I also believe there's value in minimizing code movement.
>
>
> >> > We already did column length checking at BeginCopyTo.
> >> > no need to  "if (list_length(cstate->attnumlist) != 1)" error check in
> >> > CopyOneRowTo?
> >>
> >> Hmm, not sure really, since DoCopy() calls both BeginCopyTo()
> >> and DoCopyTo() which in turn calls CopyOneRowTo(),
> >> but CopyOneRowTo() is also being called from copy_dest_receive().
> >>
> >
> > BeginCopyTo do the preparation work.
> > cstate->attnumlist = CopyGetAttnums(tupDesc, cstate->rel, attnamelist);
> >
> > After CopyGetAttnums, the number of attributes for COPY TO cannot be changed.
> > right after CopyGetAttnums call then check the length of cstate->attnumlist
> > seems fine for me.
> > I think in CopyOneRowTo, we can actually
> > Assert(list_length(cstate->attnumlist) == 1).
> > for raw format.
>
> Right, I've changed it to an Assert instead.
>
> > src10=# drop table if exists x;
> > create table x(a int);
> > COPY x from stdin (FORMAT raw);
> > DROP TABLE
> > CREATE TABLE
> > Enter data to be copied followed by a newline.
> > End with a backslash and a period on a line by itself, or an EOF signal.
> >>> 11
> >>> 12
> >>> \.
> > ERROR:  invalid input syntax for type integer: "11
> > 12
> > "
> > CONTEXT:  COPY x, line 1, column a: "11
> > 12
> > "
> >
> > The above case means COPY FROM STDIN (FORMAT RAW) can only import one
> > single value (when successful).
> > user need to specify like:
> >
> > COPY x from stdin (FORMAT raw, delimiter E'\n');
> >
> > seems raw format default no delimiter is not user friendly.
>
> I have no idea if dealing with .json files that would contain newlines
> in between fields, and would therefore need to be imported "as is",
> is more common than dealing with e.g. .jsonl files where it's guaranteed
> each json value is on a single line.
>
> I think Jacob raised some valid concerns on automagically detecting
> newlines, that is how text/csv works, so I don't think we want that.
>
> Maybe the OS default EOL would be an OK default,
> if we want it to be the default delimiter, that is.
>
> I have no strong opinion here, except automagical newline detection seems
> like a bad idea.
>
> I'm fine with OS default EOL as the default for the delimiter,
> or no delimiter as the default.
>
> New patch attached.

I have one question:

From the 0001 patch's commit message:

No behavioral changes are intended; this is a pure refactoring to improve code
clarity and maintainability.

Does the reorganization of the option validation done by this patch
also help make the 0002 patch simple or small? If not much, while it
makes sense to me that introducing the CopyFormat enum is required by
the 0002 patch, I think we can discuss the reorganization part
separately. And I'd suggest the patch organization would be:

0001: introduce CopyFormat and replace csv_mode and binary fields with it.
0002: add new 'raw' format.
0003: reorganize option validations.

One benefit would be that we can remove the simple replacements like
'->binary' with '.format == COPY_FORMAT_BINARY' from the
reorganization patch, which makes it small. The 0001 and 0002 patches
that seem to be more straightforward are independent from the 0003
patch, and we can discuss how to make the option validations including
the new 'raw' format better.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: New "raw" COPY format

From

jian he

Date:

28 October 2024, 09:56:12

On Thu, Oct 24, 2024 at 2:30 PM Joel Jacobson <joel@compiler.org> wrote:
>
> On Thu, Oct 24, 2024, at 03:54, Masahiko Sawada wrote:
> > I have one question:
> >
> > From the 0001 patch's commit message:
> >
> > No behavioral changes are intended; this is a pure refactoring to improve code
> > clarity and maintainability.
> >
> > Does the reorganization of the option validation done by this patch
> > also help make the 0002 patch simple or small?
>
> Thanks for the review. No, not much, except the changes necessary to
> ProcessCopyOptions for raw, without also refactoring it, makes it
> more complicated.
>
> > If not much, while it
> > makes sense to me that introducing the CopyFormat enum is required by
> > the 0002 patch, I think we can discuss the reorganization part
> > separately. And I'd suggest the patch organization would be:
> >
> > 0001: introduce CopyFormat and replace csv_mode and binary fields with it.
> > 0002: add new 'raw' format.
> > 0003: reorganize option validations.
> >


  /* Check force_quote */
- if (!opts_out->csv_mode && (opts_out->force_quote ||
opts_out->force_quote_all))
+ if (opts_out->format != COPY_FORMAT_CSV && (opts_out->force_quote ||
+ opts_out->force_quote_all))
  ereport(ERROR,
  (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */

maybe this has a code indentation issue.
since "if" and "opts_out" in the same column position.

It came to my mind,
change errmsg occurrence of  "BINARY mode", "CSV mode" to "binary
format",  "csv format" respectively.
I think "format" would be more accurate.
but the message seems invasive,
so i guess we need to use "mode".

overall v13-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch
looks good to me.

Re: New "raw" COPY format

From

Masahiko Sawada

Date:

28 October 2024, 20:50:37

On Mon, Oct 28, 2024 at 3:21 AM Joel Jacobson <joel@compiler.org> wrote:
>
> On Mon, Oct 28, 2024, at 10:30, Joel Jacobson wrote:
> > On Mon, Oct 28, 2024, at 08:56, jian he wrote:
> >>   /* Check force_quote */
> >> - if (!opts_out->csv_mode && (opts_out->force_quote ||
> >> opts_out->force_quote_all))
> >> + if (opts_out->format != COPY_FORMAT_CSV && (opts_out->force_quote ||
> >> + opts_out->force_quote_all))
> >>   ereport(ERROR,
> >>   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >>   /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
> >>
> >> maybe this has a code indentation issue.
> >> since "if" and "opts_out" in the same column position.
> >
> > Thanks for review.
> >
> > I've fixed the indentation issues.
>
> I've now installed pgindent, and will use it from hereon, to avoid this class of problems.
>
> New version where all three patches are now indented using pgindent.

Thank you for updating the patch. Here are review comments on the v15
0002 patch:

When testing the patch with an empty delimiter, I got the following failure:

postgres(1:903898)=# copy hoge from '/tmp/tmp.raw' with (format 'raw',
delimiter '');
TRAP: failed Assert("delim_len > 0"), File: "copyfromparse.c", Line:
1173, PID: 903898

---
-           else
+           else if (cstate->opts.format == COPY_FORMAT_TEXT)
                fldct = CopyReadAttributesText(cstate);
+           else
+           {
+               elog(ERROR, "unexpected COPY format: %d", cstate->opts.format);
+               pg_unreachable();
+           }

Since we already check the incompatible options with COPY_FORMAT_RAW
and default_print, I think it's better to add an assertion to make
sure the format is either COPY_FORMAT_CSV or COPY_FORMAT_TEXT, instead
of using elog(ERROR).

---
+/*
+ * CopyReadLineRawText - inner loop of CopyReadLine for raw text mode
+ */
+static bool
+CopyReadLineRawText(CopyFromState cstate)

This function has a lot of duplication with CopyReadLineText(). I
think it's better to modify CopyReadLineText() to support 'raw'
format, rather than adding a separate function.

---
+   bool        read_entire_file = (cstate->opts.delim == NULL);
+   int         delim_len = cstate->opts.delim ? strlen(cstate->opts.delim) : 0;

I think we can use 'delim_len == 0' instead of read_entire_file.

---
+       if (read_entire_file)
+       {
+           /* Continue until EOF if reading entire file */
+           input_buf_ptr++;
+           continue;
+       }

In the case where we're reading the entire file as a single tuple, we
don't need to advance the input_buf_ptr one by one. Instead,
input_buf_ptr can jump to copy_buf_len, which is faster.

---
+           /* Check for delimiter, possibly multi-byte */
+           IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(delim_len - 1);
+           if (strncmp(©_input_buf[input_buf_ptr], cstate->opts.delim,
+                       delim_len) == 0)
+           {
+               cstate->eol_type = EOL_CUSTOM;
+               input_buf_ptr += delim_len;
+               break;
+           }
+           input_buf_ptr++;

Similar to the above comment, I think we don't need to check the char
one by one. I guess that it would be faster if we locate the delimiter
string in the intput_buf (e.g. using strstr()), and then move
input_buf_ptr to the detected position.

---
+   /* Copy the entire line into attribute_buf */
+   memcpy(cstate->attribute_buf.data, cstate->line_buf.data,
+          cstate->line_buf.len);
+   cstate->attribute_buf.data[cstate->line_buf.len] = '\0';
+   cstate->attribute_buf.len = cstate->line_buf.len;

The CopyReadAttributesRaw() just copies line_buf data to
attirbute_buf, which seems to be a waste. I think we can have
attribute_buf point to the line_buf. That way, we can skip the whole
step 4 that is described in the comment on top o f copyfromparse.c:

* [data source] --> raw_buf --> input_buf --> line_buf --> attribute_buf
*                1.          2.            3.           4.

---
+static int
+CopyReadAttributesRaw(CopyFromState cstate)
+{
+   /* Enforce single column requirement */
+   if (cstate->max_fields != 1)
+   {
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("COPY with format 'raw' must specify exactly
one column")));
+   }

This check should have already been done in BeginCopyFrom(). Is there
any case where max_fields gets to != 1 during reading the input?

---
It's a bit odd to me to use the delimiter as a EOL marker in raw
format, but probably it's okay.

---
-           if (cstate->opts.format != COPY_FORMAT_BINARY)
+           if (cstate->opts.format == COPY_FORMAT_RAW &&
+               cstate->opts.delim != NULL)
+           {
+               /* Output the user-specified delimiter between rows */
+               CopySendString(cstate, cstate->opts.delim);
+           }
+           else if (cstate->opts.format == COPY_FORMAT_TEXT ||
+                    cstate->opts.format == COPY_FORMAT_CSV)

Since it sends the delimiter as a string, even if we specify the
delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10).
For example,

postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with
(format 'raw', delimiter '\n');
{"j": 1}\npostgres(1:904427)=#

I think there is a similar problem in COPY FROM; if we set a delimiter
to '\n' when doing COPY FROM in raw format, it expects the string '\n'
as a line termination but not ASCII LF(10). I think that input data
normally doesn't use the string '\n' as a line termination.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: New "raw" COPY format

From

Masahiko Sawada

Date:

31 October 2024, 01:29:46

On Tue, Oct 29, 2024 at 9:48 AM Joel Jacobson <joel@compiler.org> wrote:
>
> > ---
> > It's a bit odd to me to use the delimiter as a EOL marker in raw
> > format, but probably it's okay.
> >
> > ---
> > -           if (cstate->opts.format != COPY_FORMAT_BINARY)
> > +           if (cstate->opts.format == COPY_FORMAT_RAW &&
> > +               cstate->opts.delim != NULL)
> > +           {
> > +               /* Output the user-specified delimiter between rows */
> > +               CopySendString(cstate, cstate->opts.delim);
> > +           }
> > +           else if (cstate->opts.format == COPY_FORMAT_TEXT ||
> > +                    cstate->opts.format == COPY_FORMAT_CSV)
> >
> > Since it sends the delimiter as a string, even if we specify the
> > delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10).
> > For example,
> >
> > postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with
> > (format 'raw', delimiter '\n');
> > {"j": 1}\npostgres(1:904427)=#
> >
> > I think there is a similar problem in COPY FROM; if we set a delimiter
> > to '\n' when doing COPY FROM in raw format, it expects the string '\n'
> > as a line termination but not ASCII LF(10). I think that input data
> > normally doesn't use the string '\n' as a line termination.
>
> You need to use E'\n' to get ASCII LF(10), since '\n' is just a delimiter
> consisting of backslash followed by "n".
>
> Is this a problem? Since any string can be used as delimiter,
> I think it would be strange if we parsed it and replaced the string
> with a different string.
>
> Another thought:
>
> Maybe we shouldn't default to no delimiter after all,
> maybe it would be better to default to the OS default EOL,

It seems to be useful for loading unstructured non-delimited text
files such as log files. Users can set the delimiter an empty when
loading the entire file to a single column. On the other hand, I think
that If we make it default it might not be necessary to allow the
delimiter to be multi bytes. It would be flexible but it would not
necessarily be necessary. Also, it would be somewhat odd to me that we
can use multi bytes characters as the delimiter only in 'raw' mode,
but not in 'text' mode.

> maybe a final delimiter should always be written at the end,
> so that when exporting a single json field, it would get exported
> to the text file with \n at the end, which is what most text editor
> does when saving a .json file.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: New "raw" COPY format

From

Masahiko Sawada

Date:

02 November 2024, 00:28:35

On Wed, Oct 30, 2024 at 4:54 AM Joel Jacobson <joel@compiler.org> wrote:
>
> On Wed, Oct 30, 2024, at 09:14, Joel Jacobson wrote:
> > $ psql -f bench_result.sql
>
> Ops, I realized I benchmarked a debug build,
> reran the benchmark with `meson setup build --buildtype=release`,
> and also added benchmarking of HEAD:
>
> [plot image showing time distribution by format and version]
>
>
> For those unable to see the image, here is the same data as a text table:
>
> select format, version, min(time_ms), round(avg(time_ms),2) avg, max(time_ms), round(stddev(time_ms)) as stddev from
benchgroup by 1,2 order by 1,2; 
>
> format | version |   min    |   avg   |   max    | stddev
> --------+---------+----------+---------+----------+--------
> csv    | HEAD    | 2862.537 | 2895.91 | 2968.420 |     27
> csv    | v16     | 2834.663 | 2869.02 | 2926.489 |     28
> csv    | v17     | 2958.570 | 3002.71 | 3056.949 |     31
> raw    | v16     | 1697.899 | 1723.48 | 1798.020 |     22
> raw    | v17     | 1720.251 | 1753.56 | 1811.399 |     28
> text   | HEAD    | 2442.845 | 2476.60 | 2543.767 |     33
> text   | v16     | 2358.062 | 2389.06 | 2472.382 |     32
> text   | v17     | 2517.778 | 2552.65 | 2621.161 |     30
> (8 rows)
>
> I think the improvement for the 'text' format between HEAD and v16 could be just noise,
> possibly due to different binary layout when compiling.
>
> v17 (single function) seems slower than v16 (separate functions), also in release build.
>

Here are review comments on v17 patches:

+    data values. Unlike in other formats, the delimiter in
+    <literal>raw</literal> format can be any string, including multi-byte
+    characters. If no <literal>DELIMITER</literal> is specified, the entire
+    input or output is treated as a single data value.

As I mentioned in a separate email, if we use the OS default EOL as
the default EOL in raw format, it would not be necessary to allow it
to be multi characters. I think it's worth considering it.

---
 * copyfromparse.c
 *      Parse CSV/text/binary format for COPY FROM.

We need to update here as well.

--
- * 4. CopyReadAttributesText/CSV() function takes the input line from
+ * 4. CopyReadAttributesText/CSV/Raw() function takes the input line from

I think we don't have CopyReadAttributesRaw() function now.

---
I think it would be better to explain how to parse data in raw mode,
especially which steps in the pipeline we skip, in the comment at the
top of copyfromparse.c.

---
-       if (cstate->opts.format == COPY_FORMAT_CSV)
+       if (cstate->opts.format == COPY_FORMAT_RAW)
        {
-           /*
-            * If character is '\r', we may need to look ahead below.  Force
-            * fetch of the next character if we don't already have it.  We
-            * need to do this before changing CSV state, in case '\r' is also
-            * the quote or escape character.
-            */
-           if (c == '\r')
+           if (delim_len == 0)
            {
-               IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
+               /* When reading entire file, consume all remaining
bytes at once */
+               input_buf_ptr = copy_buf_len;
+               continue;
            }
+           else

I think that the code become much simpler if we do 'continue' at the
end of  'if (cstate->opts.format == COPY_FORMAT_RAW)' branch, instead
of doing 'if (cstate->opts.format == COPY_FORMAT_RAW) {} else ...'.

---
@@ -1461,6 +1536,7 @@ CopyReadLineText(CopyFromState cstate)
    return result;
 }

+
 /*
  * Return decimal value for a hexadecimal digit
  */
@@ -1937,7 +2013,6 @@ endfield:
    return fieldno;
 }

-
 /*
  * Read a binary attribute
  */

Unnecessary line addition and removal.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: New "raw" COPY format

From

"Joel Jacobson"

Date:

02 November 2024, 14:08:31

On Fri, Nov 1, 2024, at 22:28, Masahiko Sawada wrote:
> As I mentioned in a separate email, if we use the OS default EOL as
> the default EOL in raw format, it would not be necessary to allow it
> to be multi characters. I think it's worth considering it.

I like the idea, but not sure I understand how it would work.

What if a user's OS default is \n (LF) and this user wants
to import a Windows text file \r\n (CR LR), which is a
multi characters EOL delimiter.

Was your idea to make an exception for that particular EOL,
or to simply not support that edge case?

> ---
>  * copyfromparse.c
>  *      Parse CSV/text/binary format for COPY FROM.
>
> We need to update here as well.
>
> --
> - * 4. CopyReadAttributesText/CSV() function takes the input line from
> + * 4. CopyReadAttributesText/CSV/Raw() function takes the input line from
>
> I think we don't have CopyReadAttributesRaw() function now.
>
> ---
> I think it would be better to explain how to parse data in raw mode,
> especially which steps in the pipeline we skip, in the comment at the
> top of copyfromparse.c.
>
> ---
> -       if (cstate->opts.format == COPY_FORMAT_CSV)
> +       if (cstate->opts.format == COPY_FORMAT_RAW)
>         {
> -           /*
> -            * If character is '\r', we may need to look ahead below.  Force
> -            * fetch of the next character if we don't already have it.  We
> -            * need to do this before changing CSV state, in case '\r' is also
> -            * the quote or escape character.
> -            */
> -           if (c == '\r')
> +           if (delim_len == 0)
>             {
> -               IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
> +               /* When reading entire file, consume all remaining
> bytes at once */
> +               input_buf_ptr = copy_buf_len;
> +               continue;
>             }
> +           else
>
> I think that the code become much simpler if we do 'continue' at the
> end of  'if (cstate->opts.format == COPY_FORMAT_RAW)' branch, instead
> of doing 'if (cstate->opts.format == COPY_FORMAT_RAW) {} else ...'.
>
> ---
> @@ -1461,6 +1536,7 @@ CopyReadLineText(CopyFromState cstate)
>     return result;
>  }
>
> +
>  /*
>   * Return decimal value for a hexadecimal digit
>   */
> @@ -1937,7 +2013,6 @@ endfield:
>     return fieldno;
>  }
>
> -
>  /*
>   * Read a binary attribute
>   */
>
> Unnecessary line addition and removal.

Thanks for review, I agree on all comments,
awaiting your comment on EOL before implementing changes.

/Joel

Re: New "raw" COPY format

From

Masahiko Sawada

Date:

04 November 2024, 21:34:28

On Sat, Nov 2, 2024 at 4:08 AM Joel Jacobson <joel@compiler.org> wrote:
>
> On Fri, Nov 1, 2024, at 22:28, Masahiko Sawada wrote:
> > As I mentioned in a separate email, if we use the OS default EOL as
> > the default EOL in raw format, it would not be necessary to allow it
> > to be multi characters. I think it's worth considering it.
>
> I like the idea, but not sure I understand how it would work.
>
> What if a user's OS default is \n (LF) and this user wants
> to import a Windows text file \r\n (CR LR), which is a
> multi characters EOL delimiter.
>
> Was your idea to make an exception for that particular EOL,
> or to simply not support that edge case?

IIUC the text and csv formats already support it. We start from the
EOL_UNKNOWN state and guess the EOL marker while parsing the line. I
think we can do something similar to what we do in the text and csv
formats but we won't need to care about quotes and escapes in the raw
format.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: New "raw" COPY format

From

"Joel Jacobson"

Date:

05 November 2024, 06:22:13

On Mon, Nov 4, 2024, at 19:34, Masahiko Sawada wrote:
> On Sat, Nov 2, 2024 at 4:08 AM Joel Jacobson <joel@compiler.org> wrote:
>>
>> On Fri, Nov 1, 2024, at 22:28, Masahiko Sawada wrote:
>> > As I mentioned in a separate email, if we use the OS default EOL as
>> > the default EOL in raw format, it would not be necessary to allow it
>> > to be multi characters. I think it's worth considering it.
>>
>> I like the idea, but not sure I understand how it would work.
>>
>> What if a user's OS default is \n (LF) and this user wants
>> to import a Windows text file \r\n (CR LR), which is a
>> multi characters EOL delimiter.
>>
>> Was your idea to make an exception for that particular EOL,
>> or to simply not support that edge case?
>
> IIUC the text and csv formats already support it. We start from the
> EOL_UNKNOWN state and guess the EOL marker while parsing the line. I
> think we can do something similar to what we do in the text and csv
> formats but we won't need to care about quotes and escapes in the raw
> format.

Ah, OK, then I see what you mean.

That's actually how the patch worked initially, but due to comments by
Jacob Champion, the magic EOL detection was removed.

I have no strong opinion, maybe it's fine, since that's how most
text editor seems to work, they detect the EOL automatically.

Maybe we should then also rename the format to SINGLE, like suggested by
Jacob and Andrew, since it perhaps wouldn't be fair to say it's RAW when
it does magic detection.

Below is the relevant part of the discussion earlier in this thread.

I'll await your comments on this before making any changes.

On Tue, Oct 15, 2024, at 19:30, Jacob Champion wrote:
> Hi,
>
> Idle thoughts from a design perspective -- feel free to ignore, since
> I'm not the target audience for the feature:
>
> - If the column data stored in Postgres contains newlines, it seems
> like COPY TO won't work "correctly". Is that acceptable?
> - RAW seems like an okay-ish label, but for something that's doing as
> much magic end-of-line detection as this patch is, I'd personally
> prefer SINGLE (as in, "single column").
> - Speaking of magic end-of-line detection, can there be a way to turn
> that off? Say, via DELIMITER?
> - Generic DELIMITER support, for any single-byte separator at all,
> might make a "single-column" format more generally applicable. But I
> might be over-architecting. And it would make the COPY TO issue even
> worse...
>
> Thanks,
> --Jacob


On Wed, Oct 16, 2024, at 18:04, Jacob Champion wrote:
> On Tue, Oct 15, 2024 at 1:38 PM Joel Jacobson <joel@compiler.org> wrote:
>> It's actually the same end-of-line detection as the text format
>> in copyfromparse.c's CopyReadLineText(), except the code
>> is simpler thanks to not having to deal with quotes or escapes.
>
> Right, sorry, I hadn't meant to imply that you made it up. :D Just
> that a "raw" format that is actually automagically detecting things
> doesn't seem very "raw" to me, so I prefer the other name.
>
>> It basically just learns the newline sequence based on the first
>> occurrence, and then require it to be the same throughout the file.
>
> A hypothetical type whose text representation can contain '\r' but not
> '\n' still can't be unambiguously round-tripped under this scheme:
> COPY FROM will see the "mixed" line endings and complain, even though
> there's no ambiguity.
>
> Maybe no one will run into that problem in practice? But if they did,
> I think that'd be a pretty frustrating limitation. It'd be nice to
> override the behavior, to change it from "do what you think I mean" to
> "do what I say".
>
>> > - Speaking of magic end-of-line detection, can there be a way to turn
>> > that off? Say, via DELIMITER?
>> > - Generic DELIMITER support, for any single-byte separator at all,
>> > might make a "single-column" format more generally applicable. But I
>> > might be over-architecting. And it would make the COPY TO issue even
>> > worse...
>>
>> That's an interesting idea that would provide more flexibility,
>> though, at the cost of complicating things by overloading the meaning
>> of DELIMITER.
>
> I think that'd be a docs issue rather than a conceptual one, though...
> it's still a delimiter. I wouldn't really expect end-user confusion.
>
>> If aiming to make this more generally applicable,
>> then at least DELIMITER would need to be multi-byte,
>> since otherwise the Windows case \r\n couldn't be specified.
>
> True.
...
> --Jacob

/Joel

Re: New "raw" COPY format

From

Masahiko Sawada

Date:

05 November 2024, 10:01:23

On Mon, Nov 4, 2024 at 7:22 PM Joel Jacobson <joel@compiler.org> wrote:
>
> On Mon, Nov 4, 2024, at 19:34, Masahiko Sawada wrote:
> > On Sat, Nov 2, 2024 at 4:08 AM Joel Jacobson <joel@compiler.org> wrote:
> >>
> >> On Fri, Nov 1, 2024, at 22:28, Masahiko Sawada wrote:
> >> > As I mentioned in a separate email, if we use the OS default EOL as
> >> > the default EOL in raw format, it would not be necessary to allow it
> >> > to be multi characters. I think it's worth considering it.
> >>
> >> I like the idea, but not sure I understand how it would work.
> >>
> >> What if a user's OS default is \n (LF) and this user wants
> >> to import a Windows text file \r\n (CR LR), which is a
> >> multi characters EOL delimiter.
> >>
> >> Was your idea to make an exception for that particular EOL,
> >> or to simply not support that edge case?
> >
> > IIUC the text and csv formats already support it. We start from the
> > EOL_UNKNOWN state and guess the EOL marker while parsing the line. I
> > think we can do something similar to what we do in the text and csv
> > formats but we won't need to care about quotes and escapes in the raw
> > format.
>
> Ah, OK, then I see what you mean.
>
> That's actually how the patch worked initially, but due to comments by
> Jacob Champion, the magic EOL detection was removed.
>
> I have no strong opinion, maybe it's fine, since that's how most
> text editor seems to work, they detect the EOL automatically.
>
> Maybe we should then also rename the format to SINGLE, like suggested by
> Jacob and Andrew, since it perhaps wouldn't be fair to say it's RAW when
> it does magic detection.
>
> Below is the relevant part of the discussion earlier in this thread.
>
> I'll await your comments on this before making any changes.

If I understand this feature correctly, users can fully benefit from
the raw format even without a multi-byte magic EOL. If a multi-byte
magic EOL improves user experiences further, we can develop it later
as an improvement. That way, we can keep the first patch simple and
small.

As an alternative idea, as with the Text and CSV formats, I think it
would be possible that we can use the newline EOL ('\r' and/or '\n')
as the default and can specify either a single-byte EOL as delimiter
or none/empty (i.e. loading the whole file into a single tuple).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: New "raw" COPY format

From

"Joel Jacobson"

Date:

07 November 2024, 19:19:50

Thread renamed to: New "single" COPY format [1]

[1] https://postgr.es/m/1db18e33-f1cf-4f2c-9d52-b6d7ff242295@app.fastmail.com

/Joel