Thread: Re: New "single" COPY format

Re: New "single" COPY format

From

Aleksander Alekseev

Date:

08 November 2024, 14:18:00

Hi Joel,

> Recap: This is about adding support to import/export text-based formats such as
> JSONL, or any unstructured text file, where wanting to import each line "as is"
> into a single column, or wanting to export a single column to a text file.

Sorry for being late for the discussion.

I disagree with the idea of adding a new format name for this. Mostly
because it is *not* a new format and pretending that it is will be
just a poor and/or confusing user interface.

IMO it should be 'text' we already have with special options e.g.
DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
and column delimiters (and no NULLs designations, and what else I
forgot) then your text file just contains one tuple per line.

-- 
Best regards,
Aleksander Alekseev

Re: New "single" COPY format

From

Aleksander Alekseev

Date:

08 November 2024, 14:25:18

Hi,

> > Recap: This is about adding support to import/export text-based formats such as
> > JSONL, or any unstructured text file, where wanting to import each line "as is"
> > into a single column, or wanting to export a single column to a text file.
>
> Sorry for being late for the discussion.
>
> I disagree with the idea of adding a new format name for this. Mostly
> because it is *not* a new format and pretending that it is will be
> just a poor and/or confusing user interface.
>
> IMO it should be 'text' we already have with special options e.g.
> DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
> and column delimiters (and no NULLs designations, and what else I
> forgot) then your text file just contains one tuple per line.

Personally I wouldn't mind a special syntax such as LINES AS IS or
maybe COPY AS IS for convenience. Perhaps we should discuss it
separately though as a syntax sugar for a long list of options we
already support.

-- 
Best regards,
Aleksander Alekseev

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

08 November 2024, 15:31:30

On Fri, Nov 8, 2024, at 12:25, Aleksander Alekseev wrote:
> Sorry for being late for the discussion.

No worries, better late than never, thanks for chiming in.

> I disagree with the idea of adding a new format name for this. Mostly
> because it is *not* a new format and pretending that it is will be
> just a poor and/or confusing user interface.
>
> IMO it should be 'text' we already have with special options e.g.
> DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
> and column delimiters (and no NULLs designations, and what else I
> forgot) then your text file just contains one tuple per line.
>
> Personally I wouldn't mind a special syntax such as LINES AS IS or
> maybe COPY AS IS for convenience. Perhaps we should discuss it
> separately though as a syntax sugar for a long list of options we
> already support.

From an implementation perspective, I agree with you that this could
be handled by tweaking the existing code for the 'text' and 'csv' formats,
although with a performance penalty for the existing formats.

But from a user-perspective, the implementation is of course irrelevant,
then what I think is important, is that the format should have an intuitive name,
where the default behaviour should match a typical file in the format,
as closely as possible.

For this reason, the 'text' format is unfortunately a poor name,
since it gives the impression it's a generic format for text files,
which it's not, it's a PostgreSQL-specific format, where "\." on a
single line has special meaning, and other defaults such as \N
are also PostgreSQL-specific, and needs to be overriden, if dealing
with a non-PostgreSQL specific text file.
Users who fail to understand these details, risks being surprised.

In contrast, the 'csv' format, works quite as expected.

So for this reason, I think a new format, is a good idea, not only
because it makes it much clearer how to have a fast parsing path
in the implementation, but also because it will increase the chances
users will get things right, when dealing with non-PostgreSQL specific
text files, such as JSONL and log files.

Sure, from an implementation perspective, we could have separate
specialized functions, to allow for fast parsing paths, even if
just overloading the existing options, but that would be a bit awkward I think.

The "DELIMITER AS NULL ESCAPE AS NULL" idea was proposed in the old thread
"Should CSV parsing be stricter about mid-field quotes?" [1]

[1] https://postgr.es/m/8aeab305-5e94-4fa5-82bf-6da6baee6e05@app.fastmail.com

However, some of us came to the conclusion that it would be
better to introduce a new format, for reasons explained below,
quoted from the old thread [1]:

On Wed, Oct 9, 2024, at 18:14, Andrew Dunstan wrote:
> On 2024-10-09 We 11:58 AM, Tom Lane wrote:
>> "Joel Jacobson" <joel@compiler.org> writes:
>>> I think it would be nicest to introduce a new "raw" FORMAT,
>>> that clearly get things right already at the top-level,
>>> instead of trying to customize any of the existing formats.
>>
>> FWIW, I like this idea.  It makes it much clearer how to have a
>> fast parsing path.
>
> WFM, so something like FORMAT {SIMPLE, RAW, FAST, SINGLE}? We don't seem
> to have an existing keyword that we could sanely reuse here.

To add to that, I think there is value of a new format, from a user-friendiness
perspective, by keeping the existing formats and their options intact,
and instead just add a new format, with a clear name, with well-defined
semantics, explained in the docs under its own section, to avoid cluttering
the documentation further, where users would need to assemble various options,
and understand their intricate details, in order to get things right.

/Joel

Re: New "single" COPY format

From

"Daniel Verite"

Date:

08 November 2024, 22:44:30

    Aleksander Alekseev wrote:

> IMO it should be 'text' we already have with special options e.g.
> DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
> and column delimiters (and no NULLs designations, and what else I
> forgot) then your text file just contains one tuple per line.

+1 for the idea that accepting "no delimiter"  and "no escape"
as a valid combination for the text format seems better
than adding a new format.
However inviting "NULL" into that syntax when it has nothing to do
with the SQL "NULL" does not look like a good idea.
Maybe DELIMITER '' ESCAPE '', or DELIMITER NONE ESCAPE NONE.

Besides, "single" as a format name does not sound right.
Generally the name for a text format designates a set
of characteristics meaning that certain combinations of
characters have specific behaviors.
Sometimes "plain" is used in the context of text formats
to indicate that no character is special ("plain" is also the
default subtype of "text" in MIME types).

"single" as proposed is to be understood as "single-column",
which is a consequence of the lack of a field delimiter, but
not an intrinsic characteristic of the format.
If COPY accepted fixed-length fields, it could be in a
no-delimiter no-escape mode and still handle multiple
columns, in opposition to what "single" suggests.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

09 November 2024, 00:19:19

On Fri, Nov 8, 2024, at 20:44, Daniel Verite wrote:
> Aleksander Alekseev wrote:
>
>> IMO it should be 'text' we already have with special options e.g.
>> DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
>> and column delimiters (and no NULLs designations, and what else I
>> forgot) then your text file just contains one tuple per line.
>
> +1 for the idea that accepting "no delimiter"  and "no escape"
> as a valid combination for the text format seems better
> than adding a new format.
> However inviting "NULL" into that syntax when it has nothing to do
> with the SQL "NULL" does not look like a good idea.
> Maybe DELIMITER '' ESCAPE '', or DELIMITER NONE ESCAPE NONE.

Okay, let's see if we can solve all problems I see with
overloading the 'text' format:

1. Text files containing \. in the middle of the file
% cat /tmp/test.txt
foo
\.
bar

How do we import such a file?
Is it not supported?
Or another option to turn off the special meaning of \.?
Both seems like bad ideas to me, maybe there is a nice idea I fail to see?

2. NULL option is \N for 'text', so to import a plain text
file safely, where \N lines should not be converted to NULL,
users would need to also specify NULL '', which seems
like a footgun to me.

3. What should happen if specifying DELIMITER NONE, and:
- specifying a column list with more than one column?
- not also specifying ESCAPE NONE?

4. What should happen if specifying ESCAPE NONE, and
- specifying a column list with more than one column?

5. What about the isomorphism violation, I brought up in my
previous email, that is, the non-bijective mapping and irreversibility,
for records with embedded newlines?
This is also a problem with a separate format,
but I wonder what you think about the problem,
if it's acceptable, or needs to be solved, and if so,
if you see any solutions.

> Besides, "single" as a format name does not sound right.
> Generally the name for a text format designates a set
> of characteristics meaning that certain combinations of
> characters have specific behaviors.
> Sometimes "plain" is used in the context of text formats
> to indicate that no character is special ("plain" is also the
> default subtype of "text" in MIME types).
>
> "single" as proposed is to be understood as "single-column",
> which is a consequence of the lack of a field delimiter, but
> not an intrinsic characteristic of the format.
> If COPY accepted fixed-length fields, it could be in a
> no-delimiter no-escape mode and still handle multiple
> columns, in opposition to what "single" suggests.

Good points. I agree "plain" is a better name.

/Joel

Re: New "single" COPY format

From

"David G. Johnston"

Date:

09 November 2024, 00:47:22

On Fri, Nov 8, 2024 at 2:20 PM Joel Jacobson <joel@compiler.org> wrote:

1. Text files containing \. in the middle of the file
% cat /tmp/test.txt
foo
\.
bar

Or another option to turn off the special meaning of \.?

This does seem like an orthogonal option worth considering.

> Besides, "single" as a format name does not sound right.
> Generally the name for a text format designates a set
> of characteristics meaning that certain combinations of
> characters have specific behaviors.
> Sometimes "plain" is used in the context of text formats
> to indicate that no character is special ("plain" is also the
> default subtype of "text" in MIME types).
>
> "single" as proposed is to be understood as "single-column",
> which is a consequence of the lack of a field delimiter, but
> not an intrinsic characteristic of the format.
> If COPY accepted fixed-length fields, it could be in a
> no-delimiter no-escape mode and still handle multiple
> columns, in opposition to what "single" suggests.

Good points. I agree "plain" is a better name.

I'm on board with a new named format that selects the desired defaults instead of requiring the user to know and change them all manually.

This seems to me like a "list" format. Implying each row is a list entry. Since we have tables the concept of list would likewise reasonably imply a single column.

Since newlines are special, i.e., record delimiters, "plain" thus would remain misleading. It could be used for a case where the entire file is loaded into a new row, single column.

David J.

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

09 November 2024, 11:04:53

On Sat, Nov 9, 2024, at 08:07, Joel Jacobson wrote:
> Here's a draft of an idea I'm considering (not yet implemented):

I realize the last part about optional quoting is unnecessary,
since if quoting is desired, users could just use the 'csv' format.

Revised draft of the idea (not yet implemented):

- Fast path for newline-free types:
For the list of built-in types where we know the ::text representation cannot
contain newlines, we take the fast path in NextCopyFromRawFields(),
pointing cstate->raw_fields[0] directly to cstate->line_buf.data.

- Handling newlines for other types:
For any other types, we would need to scan the string for newline characters.
If a newline is encountered, it results in an error.

This brings up the question on what to offer users wanting to export text
values containing newlines.

To address this need, I think that's out of scope for the 'list' format,
and is better handled by a separate 'value' format:
- Such a format would be specialized for exporting a value "as is" to a file,
or importing en entire file as a single value.
- Such a value could be a physical single-column single-row,
but could also be constructed using e.g. string_agg().
- The 'value' format could also easily support import/export
binary data (bytea), to e.g. allow importing/exporting images, etc.

Dimensionality perspective on formats:

2D formats: 'text', 'csv', 'binary' (tabular formats)
1D format: 'list' (single-column)
0D format: 'value' (single-column, single-row)

/Joel

Re: New "single" COPY format

From

"David G. Johnston"

Date:

09 November 2024, 17:13:47

On Saturday, November 9, 2024, Joel Jacobson <joel@compiler.org> wrote:

On Sat, Nov 9, 2024, at 08:07, Joel Jacobson wrote:
> Here's a draft of an idea I'm considering (not yet implemented):

I realize the last part about optional quoting is unnecessary,
since if quoting is desired, users could just use the 'csv' format.

Revised draft of the idea (not yet implemented):

- Fast path for newline-free types:
For the list of built-in types where we know the ::text representation cannot
contain newlines, we take the fast path in NextCopyFromRawFields(),
pointing cstate->raw_fields[0] directly to cstate->line_buf.data.

- Handling newlines for other types:
For any other types, we would need to scan the string for newline characters.
If a newline is encountered, it results in an error.

Make sense to me.

David J.

Re: New "single" COPY format

From

"David G. Johnston"

Date:

09 November 2024, 17:28:18

On Saturday, November 9, 2024, Joel Jacobson <joel@compiler.org> wrote:

On Fri, Nov 8, 2024, at 22:47, David G. Johnston wrote:
> On Fri, Nov 8, 2024 at 2:20 PM Joel Jacobson <joel@compiler.org> wrote:
>>
>> 1. Text files containing \. in the middle of the file
>> % cat /tmp/test.txt
>> foo
>> \.
>> bar
>>
>> Or another option to turn off the special meaning of \.?
>
> This does seem like an orthogonal option worth considering.

I agree; if we want to integrate this into 'text', it's an option worth considering.

PostgreSQL cannot store the NUL byte. Would that be an option for the record separator. Default to new line but accept NUL if one needs to input/output lists containing newlines. Or whatever character the user believes is not part of their data - tab probably being a popular option.

David J.

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

09 November 2024, 23:48:07

On Sat, Nov 9, 2024, at 15:28, David G. Johnston wrote:
> PostgreSQL cannot store the NUL byte.  Would that be an option for the 
> record separator.  Default to new line but accept NUL if one needs to 
> input/output lists containing newlines.  Or whatever character the user 
> believes is not part of their data - tab probably being a popular 
> option.

Clever idea, could work, but using NUL bytes in text files feels a bit
unorthodox, and I can imagine surprising results in other systems having to deal
with such files.

I have no idea how useful such file format would be, but some googling suggest
it's a trick that's used out there, so I won't exclude the idea entirely, just
feels like a type of hack where it's difficult to foresee the consequences
of allowing it.

/Joel

Re: New "single" COPY format

From

"David G. Johnston"

Date:

09 November 2024, 23:52:06

On Sat, Nov 9, 2024 at 1:48 PM Joel Jacobson <joel@compiler.org> wrote:

On Sat, Nov 9, 2024, at 15:28, David G. Johnston wrote:
> PostgreSQL cannot store the NUL byte. Would that be an option for the
> record separator. Default to new line but accept NUL if one needs to
> input/output lists containing newlines. Or whatever character the user
> believes is not part of their data - tab probably being a popular
> option.

Clever idea, could work, but using NUL bytes in text files feels a bit
unorthodox, and I can imagine surprising results in other systems having to deal
with such files.

Yeah. I was inspired by xargs and find but for a permanent file it is a bit different.

David J.

Re: New "single" COPY format

From

jian he

Date:

10 November 2024, 07:26:21

On Sun, Nov 10, 2024 at 3:29 AM Joel Jacobson <joel@compiler.org> wrote:
>
> Cool. I've drafted a new patch on this approach.
> The list of newline-free built-in types is not exhaustive, yet.


do we care that COPY back and forth always work?
doc not mentioned, but seems it's an implicit idea.

copy the_table  to '/tmp/3.txt' with (format whatever_format);
truncate the_table;
copy the_table from '/tmp/3.txt' with (format whatever_format);

but v20, will not work for an non-text column with SQL NULL data in it.

example:
drop table if exists x1;
create table x1(a int);
insert into x1 select null;
copy x1 to '/tmp/3.txt' with (format list);
copy x1 from '/tmp/3.txt' with (format list);
ERROR:  invalid input syntax for type integer: ""
CONTEXT:  COPY x1, line 1, column a: ""


   <para>
    The <literal>list</literal> format does not distinguish a
<literal>NULL</literal>
    value from an empty string. Empty lines are imported as empty strings, not
    as <literal>NULL</literal> values.
   </para>
we only mentioned import, not export (COPY TO) dealing with
NULL value.


+ if (c == '\n' || c == '\r')
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("list format doesn't support newlines in field values"),
+ errhint("Consider using csv or text format for data containing newlines.")));

"list format doesn't support newlines in field values"
word list need single or double quote?


        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("Unsupported COPY format")));
should be "unsupported" per
https://www.postgresql.org/docs/current/error-style-guide.html#ERROR-STYLE-GUIDE-CASE

Re: New "single" COPY format

From

"David G. Johnston"

Date:

10 November 2024, 07:55:15

On Saturday, November 9, 2024, jian he <jian.universality@gmail.com> wrote:

<para>
The <literal>list</literal> format does not distinguish a
<literal>NULL</literal>
value from an empty string. Empty lines are imported as empty strings, not
as <literal>NULL</literal> values.
</para>
we only mentioned import, not export (COPY TO) dealing with
NULL value.

Yeah, while not being able to distinguish between the two is consistent with the list format’s premise/design the choice would need to resolve to the null value in order to continue to be data-type agnostic. We’d simply have to note for the text types that empty strings in lists are not supported, and if encountered will be resolved to a null value.

David J.

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

10 November 2024, 10:32:30

On Sun, Nov 10, 2024, at 05:55, David G. Johnston wrote:
> On Saturday, November 9, 2024, jian he <jian.universality@gmail.com> wrote:
>>    <para>
>>     The <literal>list</literal> format does not distinguish a
>> <literal>NULL</literal>
>>     value from an empty string. Empty lines are imported as empty strings, not
>>     as <literal>NULL</literal> values.
>>    </para>
>> we only mentioned import, not export (COPY TO) dealing with
>> NULL value.
>>
>
> Yeah, while not being able to distinguish between the two is consistent
> with the list format’s premise/design the choice would need to resolve
> to the null value in order to continue to be data-type agnostic.  We’d
> simply have to note for the text types that empty strings in lists are
> not supported, and if encountered will be resolved to a null value.

Seems like we have two options to decide between, both with pros and cons.

For full reversibility, we can't support both NULL values and the empty string.

To make a sound design decision here, I think we should test both options
against all real-world use-cases we can come up with.

The use-cases I can think of are:

1) Arbitrary unstructured text lists, where each line could be any text string
2) JSONL, where each line is a valid JSON value, and cannot be an empty string

Option A:
COPY TO: Empty string field gets exported as an empty line. NULL field is an error.
COPY FROM: Empty line is imported as an empty string.

Option B:
COPY TO: NULL field gets exported as an empty line. Empty string field is an error.
COPY FROM: Empty line is imported as a NULL value.

I think Option A seems more useful, because:

1) Arbitrary text files, very often contain empty lines to separate sections from each other.
2) JSONL cannot contain empty lines, they are an error: https://jsonlines.org/validator/

Nothing implemented yet, awaiting opinions.

/Joel

[1] https://jsonlines.org/validator/

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

10 November 2024, 10:48:56

On Sun, Nov 10, 2024, at 08:32, Joel Jacobson wrote:
> Option A:
> COPY TO: Empty string field gets exported as an empty line. NULL field 
> is an error.
> COPY FROM: Empty line is imported as an empty string.
>
> Option B:
> COPY TO: NULL field gets exported as an empty line. Empty string field 
> is an error.
> COPY FROM: Empty line is imported as a NULL value.
>
> I think Option A seems more useful, because:
>
> 1) Arbitrary text files, very often contain empty lines to separate 
> sections from each other.
> 2) JSONL cannot contain empty lines, they are an error: 
> https://jsonlines.org/validator/

David, I forgot about your NUL idea, so there is also a third option.

Option C:
COPY TO: NULL field gets exported as a the NUL byte. Empty string field is an empty line.
COPY FROM: Empty line is imported as an empty string. NUL byte is imported as a NULL value.

For arbitrary text files, Option C would work fine, since they usually don't contain NUL bytes, and if they do, then it
seemsuseful to be handle to deal with such files in some way, even if it can't be known NUL always means NULL, then we
couldat least import such files, and then do some post-processing of the imported data, to get the desired result.

For JSONL, Option C would also work fine, since they can't contain NUL bytes.

It's a bit of a hack, but I kinda like it., since it seems like the only option without an error situation.
Maybe OK if we add one a cautionary <note> the docs?

/Joel

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

10 November 2024, 11:00:44

On Sun, Nov 10, 2024, at 08:48, Joel Jacobson wrote:
> On Sun, Nov 10, 2024, at 08:32, Joel Jacobson wrote:
>> Option A:
>> COPY TO: Empty string field gets exported as an empty line. NULL field 
>> is an error.
>> COPY FROM: Empty line is imported as an empty string.
>>
>> Option B:
>> COPY TO: NULL field gets exported as an empty line. Empty string field 
>> is an error.
>> COPY FROM: Empty line is imported as a NULL value.
>>
>> I think Option A seems more useful, because:
>>
>> 1) Arbitrary text files, very often contain empty lines to separate 
>> sections from each other.
>> 2) JSONL cannot contain empty lines, they are an error: 
>> https://jsonlines.org/validator/
>
> David, I forgot about your NUL idea, so there is also a third option.

To avoid confusion, I should have been clear that the below idea
is just based on your NUL idea, it's not the same idea per se,
since your was about newline handling in textual types.

> Option C:
> COPY TO: NULL field gets exported as a the NUL byte. Empty string field 
> is an empty line.
> COPY FROM: Empty line is imported as an empty string. NUL byte is 
> imported as a NULL value.
>
> For arbitrary text files, Option C would work fine, since they usually 
> don't contain NUL bytes, and if they do, then it seems useful to be 
> handle to deal with such files in some way, even if it can't be known 
> NUL always means NULL, then we could at least import such files, and 
> then do some post-processing of the imported data, to get the desired 
> result.
>
> For JSONL, Option C would also work fine, since they can't contain NUL bytes.
>
> It's a bit of a hack, but I kinda like it., since it seems like the 
> only option without an error situation.
> Maybe OK if we add one a cautionary <note> the docs?

Also, we could emit NOTICE messages,
upon both COPY TO and COPY FROM,
to increase the chances of users understanding the semantics:

COPY TO:
NOTICE: NULL values encountered in data, represented as NUL bytes in output

COPY FROM:
NOTICE: NUL bytes encountered in data, stored as NULL values

/Joel

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

10 November 2024, 11:32:28

On Sun, Nov 10, 2024, at 09:00, Joel Jacobson wrote:
>> It's a bit of a hack, but I kinda like it., since it seems like the 
>> only option without an error situation.

I forgot about the error situation when a textual contain
newline characters, that remains the same for option A, B and C.

/Joel

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

16 December 2024, 18:09:36

Hi hackers,

After further consideration, I'm withdrawing the patch.
Some fundamental questions remain unresolved:

- Should round-trip fidelity be a strict goal? By "round-trip fidelity",
  I mean that data exported and then re-imported should yield exactly
  the original values, including the distinction between NULL and empty strings.
- If round-trip fidelity is a requirement, how do we distinguish NULL from empty
  strings without delimiters or escapes?
- Is automatic newline detection (as in "csv" and "text") more valuable than
  the ability to embed \r (CR) characters?
- Would it be better to extend the existing COPY options rather than introducing
  a new format?
- Or should we consider a JSONL format instead, one that avoids the NULL/empty
  string problem entirely?

No clear solution or consensus has emerged. For now, I'll step back from the
proposal. If someone wants to revisit this later, I'd be happy to contribute.

Thanks again for all the feedback and consideration.

/Joel

Re: New "single" COPY format

From

jian he

Date:

19 December 2024, 09:48:05

I have reviewed v21-0001 again.

v21-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch
is a good refactor.


overall looks good to me.

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

19 December 2024, 13:17:12

On Thu, Dec 19, 2024, at 07:48, jian he wrote:
> I have reviewed v21-0001 again.
>
> v21-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch
> is a good refactor.
>
>
> overall looks good to me.

OK, I could submit it as a separate patch.

Would we also want the reorganization of existing copy option validations?
That is, v21-0003-Reorganize-option-validations.patch minus v21-0002-Add-COPY-format-list.patch.
Nice that you managed to arrange them in the same order as in the documentation. Made the code easier to follow.

/Joel

Re: New "single" COPY format

From

Andrew Dunstan

Date:

19 December 2024, 16:40:05

On 2024-12-16 Mo 10:09 AM, Joel Jacobson wrote:
> Hi hackers,
>
> After further consideration, I'm withdrawing the patch.
> Some fundamental questions remain unresolved:
>
> - Should round-trip fidelity be a strict goal? By "round-trip fidelity",
>    I mean that data exported and then re-imported should yield exactly
>    the original values, including the distinction between NULL and empty strings.
> - If round-trip fidelity is a requirement, how do we distinguish NULL from empty
>    strings without delimiters or escapes?
> - Is automatic newline detection (as in "csv" and "text") more valuable than
>    the ability to embed \r (CR) characters?
> - Would it be better to extend the existing COPY options rather than introducing
>    a new format?
> - Or should we consider a JSONL format instead, one that avoids the NULL/empty
>    string problem entirely?
>
> No clear solution or consensus has emerged. For now, I'll step back from the
> proposal. If someone wants to revisit this later, I'd be happy to contribute.
>
> Thanks again for all the feedback and consideration.
>

We seem to have got seriously into the weeds, here. I'd be sorry to see 
this dropped. After all, it's not something new, and while we have a 
sort of workaround for "one json doc per line" it's far from obvious, 
and except in a few blog posts undocumented.

I think we're trying to be far too general here but in the absence of 
more general use cases. The ones I recall having encountered in the wild 
are:

   . one json datum per line

   . one json document per file

   . a sequence of json documents per file

The last one is hard to deal with, and I think I've only seen it once or 
twice, so I suggest leaving it aside for now.

Notice these are all JSON. I could imagine XML might have similar 
requirements, but I encounter it extremely rarely.

Regarding NULL, an empty string is not a valid JSON literal, so there 
should be no confusion there. It is valid for XML, though.

Given all that I think restricting ourselves to just the JSON cases, and 
possibly just to JSONL, would be perfectly reasonable.

Regarding CR, it's not a valid character in a JSON string item, although 
it is valid in JSON whitespace. I would not treat it as magical unless 
it immediately precedes an NL. That gives rise to a very sight 
ambiguity, but I think it's one we could live with.

As for what the format is called, I don't like the "LIST" proposal much, 
even for the general case. Seems too close to an array.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: New "single" COPY format

From

"Joel Jacobson"

Date:

21 December 2024, 11:56:38

On Thu, Dec 19, 2024, at 14:40, Andrew Dunstan wrote:
> We seem to have got seriously into the weeds, here. I'd be sorry to see
> this dropped. After all, it's not something new, and while we have a
> sort of workaround for "one json doc per line" it's far from obvious,
> and except in a few blog posts undocumented.
>
> I think we're trying to be far too general here but in the absence of
> more general use cases. The ones I recall having encountered in the wild
> are:
>
>    . one json datum per line
>
>    . one json document per file
>
>    . a sequence of json documents per file
>
> The last one is hard to deal with, and I think I've only seen it once or
> twice, so I suggest leaving it aside for now.
>
> Notice these are all JSON. I could imagine XML might have similar
> requirements, but I encounter it extremely rarely.
>
> Regarding NULL, an empty string is not a valid JSON literal, so there
> should be no confusion there. It is valid for XML, though.
>
> Given all that I think restricting ourselves to just the JSON cases, and
> possibly just to JSONL, would be perfectly reasonable.
>
> Regarding CR, it's not a valid character in a JSON string item, although
> it is valid in JSON whitespace. I would not treat it as magical unless
> it immediately precedes an NL. That gives rise to a very sight
> ambiguity, but I think it's one we could live with.
>
> As for what the format is called, I don't like the "LIST" proposal much,
> even for the general case. Seems too close to an array.

On Thu, Dec 19, 2024, at 14:40, Andrew Dunstan wrote:
> We seem to have got seriously into the weeds, here. I'd be sorry to see
> this dropped. After all, it's not something new, and while we have a
> sort of workaround for "one json doc per line" it's far from obvious,
> and except in a few blog posts undocumented.
>
> I think we're trying to be far too general here but in the absence of
> more general use cases. The ones I recall having encountered in the wild
> are:
>
>    . one json datum per line
>
>    . one json document per file
>
>    . a sequence of json documents per file
>
> The last one is hard to deal with, and I think I've only seen it once or
> twice, so I suggest leaving it aside for now.
>
> Notice these are all JSON. I could imagine XML might have similar
> requirements, but I encounter it extremely rarely.
>
> Regarding NULL, an empty string is not a valid JSON literal, so there
> should be no confusion there. It is valid for XML, though.
>
> Given all that I think restricting ourselves to just the JSON cases, and
> possibly just to JSONL, would be perfectly reasonable.
>
> Regarding CR, it's not a valid character in a JSON string item, although
> it is valid in JSON whitespace. I would not treat it as magical unless
> it immediately precedes an NL. That gives rise to a very sight
> ambiguity, but I think it's one we could live with.
>
> As for what the format is called, I don't like the "LIST" proposal much,
> even for the general case. Seems too close to an array.

Thanks for weighing in.

OK, let's try to restrict ourselves to dealing with json lines and see
if we can work out the precise semantics.

The JSONL spec, or at least its validator [1], forbid empty lines.

So we would need to extend the JSONL spec to:
- Export a NULL::jsonb column value as an empty line.
- Import an empty line as a NULL::jsonb column value.

Could we also restrict ourselves to PostgreSQL's jsonb type?
I fear trying to also support the PostgreSQL's json type will be another
rabbit hole, since json values can contain LF characters in whitespace.

Due to the necessary "empty line = NULL" extension,
I think we should simply call the format 'jsonb', and not 'jsonl'.

How about:

COPY table_name [ ( column_name ) ] { FROM | TO } 'filename' (FORMAT jsonb);

- If column list is omitted, table_name must have exactly one column.
- If column list is specified, it must be of length one.
- The column type must be jsonb.
- Each line is a single jsonb value; no multi-line json permitted.
- Non-LF whitespace, i.e. [ \r\t], are allowed anywhere where
  whitespace can exist according to the json spec, but are discarded
  upon import, since jsonb doesn't store whitepsace.
- The LF character, i.e. [\n], determine the end of a jsonb value.
- Empty line are imported as NULL::jsonb values
- NULL::jsonb values are exported as empty lines

Note: Lines can end with LR or CR+LF, since the CR is just whitespace,
so even e.g. CR+CR+LF at the end would be allowed.

Naming the new COPY format 'jsonb' would signal that we're firmly
in PostgreSQL territory, not trying to produce a "pure" JSONL file.

All JSONL files can be imported though, since the "empty line = NULL" extension
just risks break other consumers of JSONL.
If that's a problem, users can just filter out NULL values upon export,
which seems acceptable to me. Just seems important to not call it JSONL.

/Joel

[1] https://jsonlines.org/validator/

Re: New "single" COPY format

From

"David G. Johnston"

Date:

18 February, 03:05:03

On Sat, Dec 21, 2024 at 1:57 AM Joel Jacobson <joel@compiler.org> wrote:

How about:

COPY table_name [ ( column_name ) ] { FROM | TO } 'filename' (FORMAT jsonb);

- If column list is omitted, table_name must have exactly one column.
- If column list is specified, it must be of length one.
- The column type must be jsonb.
- Each line is a single jsonb value; no multi-line json permitted.
- Non-LF whitespace, i.e. [ \r\t], are allowed anywhere where
whitespace can exist according to the json spec, but are discarded
upon import, since jsonb doesn't store whitepsace.
- The LF character, i.e. [\n], determine the end of a jsonb value.
- Empty line are imported as NULL::jsonb values
- NULL::jsonb values are exported as empty lines

Note: Lines can end with LR or CR+LF, since the CR is just whitespace,
so even e.g. CR+CR+LF at the end would be allowed.

My first impression of this is positive. I like the scoping and the rules make sense.

I know this is outside of COPY charter as it stands today but I'd like to suggest allowing for some way of saying "please put input line numbers into this column". Think of it as adding "with ordinality" to copy. In particular it would help to deal with potential end-of-file situations where the last line is imported as a null value.

David J.

Re: New "single" COPY format

From

Andrew Dunstan

Date:

18 February, 17:57:42

On 2025-02-17 Mo 7:05 PM, David G. Johnston wrote:

On Sat, Dec 21, 2024 at 1:57 AM Joel Jacobson <joel@compiler.org> wrote:
How about:

COPY table_name [ ( column_name ) ] { FROM | TO } 'filename' (FORMAT jsonb);

- If column list is omitted, table_name must have exactly one column.
- If column list is specified, it must be of length one.
- The column type must be jsonb.
- Each line is a single jsonb value; no multi-line json permitted.
- Non-LF whitespace, i.e. [ \r\t], are allowed anywhere where
whitespace can exist according to the json spec, but are discarded
upon import, since jsonb doesn't store whitepsace.
- The LF character, i.e. [\n], determine the end of a jsonb value.
- Empty line are imported as NULL::jsonb values
- NULL::jsonb values are exported as empty lines

Note: Lines can end with LR or CR+LF, since the CR is just whitespace,
so even e.g. CR+CR+LF at the end would be allowed.

I'd tweak this a bit:

. call the format jsonl

. allow json - on output I would just eat any LF, but I could per persuaded to do something else.

My first impression of this is positive. I like the scoping and the rules make sense.

I know this is outside of COPY charter as it stands today but I'd like to suggest allowing for some way of saying "please put input line numbers into this column". Think of it as adding "with ordinality" to copy. In particular it would help to deal with potential end-of-file situations where the last line is imported as a null value.

Let's not add feature creep. If you want a feature like this there's no reason it should belong only to JSON input.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com