Thread: Re: New "single" COPY format

Re: New "single" COPY format

From

Masahiko Sawada

Date:

08 November 2024, 02:13:32

Hi,

On Thu, Nov 7, 2024 at 8:16 AM Joel Jacobson <joel@compiler.org> wrote:
>
> Hi hackers,
>
> Thread [1] renamed, since the format name has now been changed from 'raw' to
> 'single', as suggested by Andrew Dunstan and Jacob Champion.
>
> [1] https://postgr.es/m/c12516b1-77dc-4ad3-94a7-88527360aee0@app.fastmail.com
>
> Recap: This is about adding support to import/export text-based formats such as
> JSONL, or any unstructured text file, where wanting to import each line "as is"
> into a single column, or wanting to export a single column to a text file.
>
> Example importing the meson-logs/testlog.json file Meson generates
> when building PostgreSQL, which is in JSONL format:
>
> # create table meson_log (log_line jsonb);
> # \copy meson_log from meson-logs/testlog.json (format single);
> COPY 306
> # select log_line->'name' name, log_line->'result' result from meson_log limit 3;
>                   name                   | result
> -----------------------------------------+--------
>  "postgresql:setup / tmp_install"        | "OK"
>  "postgresql:setup / install_test_files" | "OK"
>  "postgresql:setup / initdb_cache"       | "OK"
> (3 rows)
>
> Changes since v16:
>
> * EOL handling now works the same as for 'text' and 'csv'.
> In v16, we supported multi-byte delimiters to allow specifying
> e.g. Windows EOL (\r\n), but this seemed unnecessary, if we just do what we do
> for text/csv, that is, to auto-detect the EOL for COPY FROM, and use
> the OS default EOL for COPY TO.
> The DELIMITER option is therefore invalid for the 'single' format.
> This is the biggest change in the code, between v16 and v18.
> CopyReadLineRawText() has been renamed to CopyReadLineSingleText(),
> and changed accordingly.

In earlier versions, we supported loading the whole file into a single
tuple. Is there any reason that it doesn't support it in v18? I think
if it's useful we can improve it in a separate patch.

>
> * A final EOL is now emitted to the last record in COPY TO.
> So now it works just like 'text' and 'csv'.
>

+1

> * HEADER [ boolean | MATCH ] now supported
> This is now again supported, as previously suggested by Daniel Verite,
> possible thanks to the EOL handling.

It makes sense to support it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

08 November 2024, 09:02:34

On Fri, Nov 8, 2024, at 00:13, Masahiko Sawada wrote:
> In earlier versions, we supported loading the whole file into a single
> tuple. Is there any reason that it doesn't support it in v18? I think
> if it's useful we can improve it in a separate patch.

Not sure how useful it is, since we already have pg_read_file().

Also, I think it's out of scope for the 'single' format, since I think it should
be about about processing text line by line, in the same way 'csv' and 'text'
work.

The implementation depended on the delimiter option, where the default was
no delimiter, which then read the entire file, and since we don't have
the delimiter option anymore, that approach won't work.

From a docs perspective, it would get quite ugly and confusing, since we would
need to rephrase sentences like the below, since they would then not always
be true:

    "each line is treated as a single field"

    "each line of the input or output is considered a
    complete value without any field separation"

>> * A final EOL is now emitted to the last record in COPY TO.
>> So now it works just like 'text' and 'csv'.
>>
>
> +1
>
>> * HEADER [ boolean | MATCH ] now supported
>> This is now again supported, as previously suggested by Daniel Verite,
>> possible thanks to the EOL handling.
>
> It makes sense to support it.

/Joel

Re: New "single" COPY format

From

"David G. Johnston"

Date:

08 November 2024, 09:14:02

On Thursday, November 7, 2024, Joel Jacobson <joel@compiler.org> wrote:

On Fri, Nov 8, 2024, at 00:13, Masahiko Sawada wrote:
> In earlier versions, we supported loading the whole file into a single
> tuple. Is there any reason that it doesn't support it in v18? I think
> if it's useful we can improve it in a separate patch.

Not sure how useful it is, since we already have pg_read_file().

Being forced to have the file server-readable, non-stdin, destroys quite a bit of the usefulness of pg_read_file.

If we want clients to be able to pass the effort here to the server, copy is definitely the most useful way to do so.

I’d be concerned choosing “single” given this future possibility. I do agree that such an enhancement would be best done in its own patch.

David J.

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

08 November 2024, 10:42:43

On Fri, Nov 8, 2024, at 07:14, David G. Johnston wrote:
> On Thursday, November 7, 2024, Joel Jacobson <joel@compiler.org> wrote:
>> On Fri, Nov 8, 2024, at 00:13, Masahiko Sawada wrote:
>> > In earlier versions, we supported loading the whole file into a single
>> > tuple. Is there any reason that it doesn't support it in v18? I think
>> > if it's useful we can improve it in a separate patch.
>>
>> Not sure how useful it is, since we already have pg_read_file().
>
> Being forced to have the file server-readable, non-stdin, destroys
> quite a bit of the usefulness of pg_read_file.
>
> If we want clients to be able to pass the effort here to the server,
> copy is definitely the most useful way to do so.

Right, good point, I agree.

> I’d be concerned choosing “single” given this future possibility.  I do
> agree that such an enhancement would be best done in its own patch.

OK, sounds good to do it in its own patch.

If the name "single" doesn't work for this reason, I see at least
two alternatives:

1) Keep "single" as format name, and let it only be concerned about line by line
processing, and introduce a different format for entire file processing,
in its own patch.

2) Some other format name ("raw"?) that allows such future enhancement to
be done within the same format, in its own patch.

Other ideas?

/Joel

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

08 November 2024, 11:04:30

On Fri, Nov 8, 2024, at 08:42, Joel Jacobson wrote:
>> I’d be concerned choosing “single” given this future possibility.  I do
>> agree that such an enhancement would be best done in its own patch.
>
> OK, sounds good to do it in its own patch.
>
> If the name "single" doesn't work for this reason, I see at least
> two alternatives:
>
> 1) Keep "single" as format name, and let it only be concerned about line by line
> processing, and introduce a different format for entire file processing,
> in its own patch.
>
> 2) Some other format name ("raw"?) that allows such future enhancement to
> be done within the same format, in its own patch.
>
> Other ideas?

Sorry for noise, "raw" is of course not an option given how v18 works,
due to the auto-magic EOL detection, same as in "text" and "csv",
as pointed out by others earlier in the thread.

How about "single_column"?

Then, a future patch could implement a "single_value" format,
to process an entire file or value. Such format could then also
support binary data, by detecting if the column type is "bytea".

/Joel