Thread: Re: New "single" COPY format
Hi Joel, > Recap: This is about adding support to import/export text-based formats such as > JSONL, or any unstructured text file, where wanting to import each line "as is" > into a single column, or wanting to export a single column to a text file. Sorry for being late for the discussion. I disagree with the idea of adding a new format name for this. Mostly because it is *not* a new format and pretending that it is will be just a poor and/or confusing user interface. IMO it should be 'text' we already have with special options e.g. DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters and column delimiters (and no NULLs designations, and what else I forgot) then your text file just contains one tuple per line. -- Best regards, Aleksander Alekseev
Hi, > > Recap: This is about adding support to import/export text-based formats such as > > JSONL, or any unstructured text file, where wanting to import each line "as is" > > into a single column, or wanting to export a single column to a text file. > > Sorry for being late for the discussion. > > I disagree with the idea of adding a new format name for this. Mostly > because it is *not* a new format and pretending that it is will be > just a poor and/or confusing user interface. > > IMO it should be 'text' we already have with special options e.g. > DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters > and column delimiters (and no NULLs designations, and what else I > forgot) then your text file just contains one tuple per line. Personally I wouldn't mind a special syntax such as LINES AS IS or maybe COPY AS IS for convenience. Perhaps we should discuss it separately though as a syntax sugar for a long list of options we already support. -- Best regards, Aleksander Alekseev
On Fri, Nov 8, 2024, at 12:25, Aleksander Alekseev wrote: > Sorry for being late for the discussion. No worries, better late than never, thanks for chiming in. > I disagree with the idea of adding a new format name for this. Mostly > because it is *not* a new format and pretending that it is will be > just a poor and/or confusing user interface. > > IMO it should be 'text' we already have with special options e.g. > DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters > and column delimiters (and no NULLs designations, and what else I > forgot) then your text file just contains one tuple per line. > > Personally I wouldn't mind a special syntax such as LINES AS IS or > maybe COPY AS IS for convenience. Perhaps we should discuss it > separately though as a syntax sugar for a long list of options we > already support. From an implementation perspective, I agree with you that this could be handled by tweaking the existing code for the 'text' and 'csv' formats, although with a performance penalty for the existing formats. But from a user-perspective, the implementation is of course irrelevant, then what I think is important, is that the format should have an intuitive name, where the default behaviour should match a typical file in the format, as closely as possible. For this reason, the 'text' format is unfortunately a poor name, since it gives the impression it's a generic format for text files, which it's not, it's a PostgreSQL-specific format, where "\." on a single line has special meaning, and other defaults such as \N are also PostgreSQL-specific, and needs to be overriden, if dealing with a non-PostgreSQL specific text file. Users who fail to understand these details, risks being surprised. In contrast, the 'csv' format, works quite as expected. So for this reason, I think a new format, is a good idea, not only because it makes it much clearer how to have a fast parsing path in the implementation, but also because it will increase the chances users will get things right, when dealing with non-PostgreSQL specific text files, such as JSONL and log files. Sure, from an implementation perspective, we could have separate specialized functions, to allow for fast parsing paths, even if just overloading the existing options, but that would be a bit awkward I think. The "DELIMITER AS NULL ESCAPE AS NULL" idea was proposed in the old thread "Should CSV parsing be stricter about mid-field quotes?" [1] [1] https://postgr.es/m/8aeab305-5e94-4fa5-82bf-6da6baee6e05@app.fastmail.com However, some of us came to the conclusion that it would be better to introduce a new format, for reasons explained below, quoted from the old thread [1]: On Wed, Oct 9, 2024, at 18:14, Andrew Dunstan wrote: > On 2024-10-09 We 11:58 AM, Tom Lane wrote: >> "Joel Jacobson" <joel@compiler.org> writes: >>> I think it would be nicest to introduce a new "raw" FORMAT, >>> that clearly get things right already at the top-level, >>> instead of trying to customize any of the existing formats. >> >> FWIW, I like this idea. It makes it much clearer how to have a >> fast parsing path. > > WFM, so something like FORMAT {SIMPLE, RAW, FAST, SINGLE}? We don't seem > to have an existing keyword that we could sanely reuse here. To add to that, I think there is value of a new format, from a user-friendiness perspective, by keeping the existing formats and their options intact, and instead just add a new format, with a clear name, with well-defined semantics, explained in the docs under its own section, to avoid cluttering the documentation further, where users would need to assemble various options, and understand their intricate details, in order to get things right. /Joel
Aleksander Alekseev wrote: > IMO it should be 'text' we already have with special options e.g. > DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters > and column delimiters (and no NULLs designations, and what else I > forgot) then your text file just contains one tuple per line. +1 for the idea that accepting "no delimiter" and "no escape" as a valid combination for the text format seems better than adding a new format. However inviting "NULL" into that syntax when it has nothing to do with the SQL "NULL" does not look like a good idea. Maybe DELIMITER '' ESCAPE '', or DELIMITER NONE ESCAPE NONE. Besides, "single" as a format name does not sound right. Generally the name for a text format designates a set of characteristics meaning that certain combinations of characters have specific behaviors. Sometimes "plain" is used in the context of text formats to indicate that no character is special ("plain" is also the default subtype of "text" in MIME types). "single" as proposed is to be understood as "single-column", which is a consequence of the lack of a field delimiter, but not an intrinsic characteristic of the format. If COPY accepted fixed-length fields, it could be in a no-delimiter no-escape mode and still handle multiple columns, in opposition to what "single" suggests. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Fri, Nov 8, 2024, at 20:44, Daniel Verite wrote: > Aleksander Alekseev wrote: > >> IMO it should be 'text' we already have with special options e.g. >> DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters >> and column delimiters (and no NULLs designations, and what else I >> forgot) then your text file just contains one tuple per line. > > +1 for the idea that accepting "no delimiter" and "no escape" > as a valid combination for the text format seems better > than adding a new format. > However inviting "NULL" into that syntax when it has nothing to do > with the SQL "NULL" does not look like a good idea. > Maybe DELIMITER '' ESCAPE '', or DELIMITER NONE ESCAPE NONE. Okay, let's see if we can solve all problems I see with overloading the 'text' format: 1. Text files containing \. in the middle of the file % cat /tmp/test.txt foo \. bar How do we import such a file? Is it not supported? Or another option to turn off the special meaning of \.? Both seems like bad ideas to me, maybe there is a nice idea I fail to see? 2. NULL option is \N for 'text', so to import a plain text file safely, where \N lines should not be converted to NULL, users would need to also specify NULL '', which seems like a footgun to me. 3. What should happen if specifying DELIMITER NONE, and: - specifying a column list with more than one column? - not also specifying ESCAPE NONE? 4. What should happen if specifying ESCAPE NONE, and - specifying a column list with more than one column? 5. What about the isomorphism violation, I brought up in my previous email, that is, the non-bijective mapping and irreversibility, for records with embedded newlines? This is also a problem with a separate format, but I wonder what you think about the problem, if it's acceptable, or needs to be solved, and if so, if you see any solutions. > Besides, "single" as a format name does not sound right. > Generally the name for a text format designates a set > of characteristics meaning that certain combinations of > characters have specific behaviors. > Sometimes "plain" is used in the context of text formats > to indicate that no character is special ("plain" is also the > default subtype of "text" in MIME types). > > "single" as proposed is to be understood as "single-column", > which is a consequence of the lack of a field delimiter, but > not an intrinsic characteristic of the format. > If COPY accepted fixed-length fields, it could be in a > no-delimiter no-escape mode and still handle multiple > columns, in opposition to what "single" suggests. Good points. I agree "plain" is a better name. /Joel
On Fri, Nov 8, 2024 at 2:20 PM Joel Jacobson <joel@compiler.org> wrote:
1. Text files containing \. in the middle of the file
% cat /tmp/test.txt
foo
\.
bar
Or another option to turn off the special meaning of \.?
This does seem like an orthogonal option worth considering.
> Besides, "single" as a format name does not sound right.
> Generally the name for a text format designates a set
> of characteristics meaning that certain combinations of
> characters have specific behaviors.
> Sometimes "plain" is used in the context of text formats
> to indicate that no character is special ("plain" is also the
> default subtype of "text" in MIME types).
>
> "single" as proposed is to be understood as "single-column",
> which is a consequence of the lack of a field delimiter, but
> not an intrinsic characteristic of the format.
> If COPY accepted fixed-length fields, it could be in a
> no-delimiter no-escape mode and still handle multiple
> columns, in opposition to what "single" suggests.
Good points. I agree "plain" is a better name.
I'm on board with a new named format that selects the desired defaults instead of requiring the user to know and change them all manually.
This seems to me like a "list" format. Implying each row is a list entry. Since we have tables the concept of list would likewise reasonably imply a single column.
Since newlines are special, i.e., record delimiters, "plain" thus would remain misleading. It could be used for a case where the entire file is loaded into a new row, single column.
David J.
On Sat, Nov 9, 2024, at 08:07, Joel Jacobson wrote: > Here's a draft of an idea I'm considering (not yet implemented): I realize the last part about optional quoting is unnecessary, since if quoting is desired, users could just use the 'csv' format. Revised draft of the idea (not yet implemented): - Fast path for newline-free types: For the list of built-in types where we know the ::text representation cannot contain newlines, we take the fast path in NextCopyFromRawFields(), pointing cstate->raw_fields[0] directly to cstate->line_buf.data. - Handling newlines for other types: For any other types, we would need to scan the string for newline characters. If a newline is encountered, it results in an error. This brings up the question on what to offer users wanting to export text values containing newlines. To address this need, I think that's out of scope for the 'list' format, and is better handled by a separate 'value' format: - Such a format would be specialized for exporting a value "as is" to a file, or importing en entire file as a single value. - Such a value could be a physical single-column single-row, but could also be constructed using e.g. string_agg(). - The 'value' format could also easily support import/export binary data (bytea), to e.g. allow importing/exporting images, etc. Dimensionality perspective on formats: 2D formats: 'text', 'csv', 'binary' (tabular formats) 1D format: 'list' (single-column) 0D format: 'value' (single-column, single-row) /Joel
On Saturday, November 9, 2024, Joel Jacobson <joel@compiler.org> wrote:
On Sat, Nov 9, 2024, at 08:07, Joel Jacobson wrote:
> Here's a draft of an idea I'm considering (not yet implemented):
I realize the last part about optional quoting is unnecessary,
since if quoting is desired, users could just use the 'csv' format.
Revised draft of the idea (not yet implemented):
- Fast path for newline-free types:
For the list of built-in types where we know the ::text representation cannot
contain newlines, we take the fast path in NextCopyFromRawFields(),
pointing cstate->raw_fields[0] directly to cstate->line_buf.data.
- Handling newlines for other types:
For any other types, we would need to scan the string for newline characters.
If a newline is encountered, it results in an error.
Make sense to me.
David J.
On Saturday, November 9, 2024, Joel Jacobson <joel@compiler.org> wrote:
On Fri, Nov 8, 2024, at 22:47, David G. Johnston wrote:
> On Fri, Nov 8, 2024 at 2:20 PM Joel Jacobson <joel@compiler.org> wrote:
>>
>> 1. Text files containing \. in the middle of the file
>> % cat /tmp/test.txt
>> foo
>> \.
>> bar
>>
>> Or another option to turn off the special meaning of \.?
>
> This does seem like an orthogonal option worth considering.
I agree; if we want to integrate this into 'text', it's an option worth considering.
PostgreSQL cannot store the NUL byte. Would that be an option for the record separator. Default to new line but accept NUL if one needs to input/output lists containing newlines. Or whatever character the user believes is not part of their data - tab probably being a popular option.
David J.
On Sat, Nov 9, 2024, at 15:28, David G. Johnston wrote: > PostgreSQL cannot store the NUL byte. Would that be an option for the > record separator. Default to new line but accept NUL if one needs to > input/output lists containing newlines. Or whatever character the user > believes is not part of their data - tab probably being a popular > option. Clever idea, could work, but using NUL bytes in text files feels a bit unorthodox, and I can imagine surprising results in other systems having to deal with such files. I have no idea how useful such file format would be, but some googling suggest it's a trick that's used out there, so I won't exclude the idea entirely, just feels like a type of hack where it's difficult to foresee the consequences of allowing it. /Joel
On Sat, Nov 9, 2024 at 1:48 PM Joel Jacobson <joel@compiler.org> wrote:
On Sat, Nov 9, 2024, at 15:28, David G. Johnston wrote:
> PostgreSQL cannot store the NUL byte. Would that be an option for the
> record separator. Default to new line but accept NUL if one needs to
> input/output lists containing newlines. Or whatever character the user
> believes is not part of their data - tab probably being a popular
> option.
Clever idea, could work, but using NUL bytes in text files feels a bit
unorthodox, and I can imagine surprising results in other systems having to deal
with such files.
Yeah. I was inspired by xargs and find but for a permanent file it is a bit different.
David J.
On Sun, Nov 10, 2024 at 3:29 AM Joel Jacobson <joel@compiler.org> wrote: > > Cool. I've drafted a new patch on this approach. > The list of newline-free built-in types is not exhaustive, yet. do we care that COPY back and forth always work? doc not mentioned, but seems it's an implicit idea. copy the_table to '/tmp/3.txt' with (format whatever_format); truncate the_table; copy the_table from '/tmp/3.txt' with (format whatever_format); but v20, will not work for an non-text column with SQL NULL data in it. example: drop table if exists x1; create table x1(a int); insert into x1 select null; copy x1 to '/tmp/3.txt' with (format list); copy x1 from '/tmp/3.txt' with (format list); ERROR: invalid input syntax for type integer: "" CONTEXT: COPY x1, line 1, column a: "" <para> The <literal>list</literal> format does not distinguish a <literal>NULL</literal> value from an empty string. Empty lines are imported as empty strings, not as <literal>NULL</literal> values. </para> we only mentioned import, not export (COPY TO) dealing with NULL value. + if (c == '\n' || c == '\r') + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("list format doesn't support newlines in field values"), + errhint("Consider using csv or text format for data containing newlines."))); "list format doesn't support newlines in field values" word list need single or double quote? ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("Unsupported COPY format"))); should be "unsupported" per https://www.postgresql.org/docs/current/error-style-guide.html#ERROR-STYLE-GUIDE-CASE
On Saturday, November 9, 2024, jian he <jian.universality@gmail.com> wrote:
<para>
The <literal>list</literal> format does not distinguish a
<literal>NULL</literal>
value from an empty string. Empty lines are imported as empty strings, not
as <literal>NULL</literal> values.
</para>
we only mentioned import, not export (COPY TO) dealing with
NULL value.
Yeah, while not being able to distinguish between the two is consistent with the list format’s premise/design the choice would need to resolve to the null value in order to continue to be data-type agnostic. We’d simply have to note for the text types that empty strings in lists are not supported, and if encountered will be resolved to a null value.
David J.
On Sun, Nov 10, 2024, at 05:55, David G. Johnston wrote: > On Saturday, November 9, 2024, jian he <jian.universality@gmail.com> wrote: >> <para> >> The <literal>list</literal> format does not distinguish a >> <literal>NULL</literal> >> value from an empty string. Empty lines are imported as empty strings, not >> as <literal>NULL</literal> values. >> </para> >> we only mentioned import, not export (COPY TO) dealing with >> NULL value. >> > > Yeah, while not being able to distinguish between the two is consistent > with the list format’s premise/design the choice would need to resolve > to the null value in order to continue to be data-type agnostic. We’d > simply have to note for the text types that empty strings in lists are > not supported, and if encountered will be resolved to a null value. Seems like we have two options to decide between, both with pros and cons. For full reversibility, we can't support both NULL values and the empty string. To make a sound design decision here, I think we should test both options against all real-world use-cases we can come up with. The use-cases I can think of are: 1) Arbitrary unstructured text lists, where each line could be any text string 2) JSONL, where each line is a valid JSON value, and cannot be an empty string Option A: COPY TO: Empty string field gets exported as an empty line. NULL field is an error. COPY FROM: Empty line is imported as an empty string. Option B: COPY TO: NULL field gets exported as an empty line. Empty string field is an error. COPY FROM: Empty line is imported as a NULL value. I think Option A seems more useful, because: 1) Arbitrary text files, very often contain empty lines to separate sections from each other. 2) JSONL cannot contain empty lines, they are an error: https://jsonlines.org/validator/ Nothing implemented yet, awaiting opinions. /Joel [1] https://jsonlines.org/validator/
On Sun, Nov 10, 2024, at 08:32, Joel Jacobson wrote: > Option A: > COPY TO: Empty string field gets exported as an empty line. NULL field > is an error. > COPY FROM: Empty line is imported as an empty string. > > Option B: > COPY TO: NULL field gets exported as an empty line. Empty string field > is an error. > COPY FROM: Empty line is imported as a NULL value. > > I think Option A seems more useful, because: > > 1) Arbitrary text files, very often contain empty lines to separate > sections from each other. > 2) JSONL cannot contain empty lines, they are an error: > https://jsonlines.org/validator/ David, I forgot about your NUL idea, so there is also a third option. Option C: COPY TO: NULL field gets exported as a the NUL byte. Empty string field is an empty line. COPY FROM: Empty line is imported as an empty string. NUL byte is imported as a NULL value. For arbitrary text files, Option C would work fine, since they usually don't contain NUL bytes, and if they do, then it seemsuseful to be handle to deal with such files in some way, even if it can't be known NUL always means NULL, then we couldat least import such files, and then do some post-processing of the imported data, to get the desired result. For JSONL, Option C would also work fine, since they can't contain NUL bytes. It's a bit of a hack, but I kinda like it., since it seems like the only option without an error situation. Maybe OK if we add one a cautionary <note> the docs? /Joel
On Sun, Nov 10, 2024, at 08:48, Joel Jacobson wrote: > On Sun, Nov 10, 2024, at 08:32, Joel Jacobson wrote: >> Option A: >> COPY TO: Empty string field gets exported as an empty line. NULL field >> is an error. >> COPY FROM: Empty line is imported as an empty string. >> >> Option B: >> COPY TO: NULL field gets exported as an empty line. Empty string field >> is an error. >> COPY FROM: Empty line is imported as a NULL value. >> >> I think Option A seems more useful, because: >> >> 1) Arbitrary text files, very often contain empty lines to separate >> sections from each other. >> 2) JSONL cannot contain empty lines, they are an error: >> https://jsonlines.org/validator/ > > David, I forgot about your NUL idea, so there is also a third option. To avoid confusion, I should have been clear that the below idea is just based on your NUL idea, it's not the same idea per se, since your was about newline handling in textual types. > Option C: > COPY TO: NULL field gets exported as a the NUL byte. Empty string field > is an empty line. > COPY FROM: Empty line is imported as an empty string. NUL byte is > imported as a NULL value. > > For arbitrary text files, Option C would work fine, since they usually > don't contain NUL bytes, and if they do, then it seems useful to be > handle to deal with such files in some way, even if it can't be known > NUL always means NULL, then we could at least import such files, and > then do some post-processing of the imported data, to get the desired > result. > > For JSONL, Option C would also work fine, since they can't contain NUL bytes. > > It's a bit of a hack, but I kinda like it., since it seems like the > only option without an error situation. > Maybe OK if we add one a cautionary <note> the docs? Also, we could emit NOTICE messages, upon both COPY TO and COPY FROM, to increase the chances of users understanding the semantics: COPY TO: NOTICE: NULL values encountered in data, represented as NUL bytes in output COPY FROM: NOTICE: NUL bytes encountered in data, stored as NULL values /Joel
On Sun, Nov 10, 2024, at 09:00, Joel Jacobson wrote: >> It's a bit of a hack, but I kinda like it., since it seems like the >> only option without an error situation. I forgot about the error situation when a textual contain newline characters, that remains the same for option A, B and C. /Joel
Hi hackers, After further consideration, I'm withdrawing the patch. Some fundamental questions remain unresolved: - Should round-trip fidelity be a strict goal? By "round-trip fidelity", I mean that data exported and then re-imported should yield exactly the original values, including the distinction between NULL and empty strings. - If round-trip fidelity is a requirement, how do we distinguish NULL from empty strings without delimiters or escapes? - Is automatic newline detection (as in "csv" and "text") more valuable than the ability to embed \r (CR) characters? - Would it be better to extend the existing COPY options rather than introducing a new format? - Or should we consider a JSONL format instead, one that avoids the NULL/empty string problem entirely? No clear solution or consensus has emerged. For now, I'll step back from the proposal. If someone wants to revisit this later, I'd be happy to contribute. Thanks again for all the feedback and consideration. /Joel
I have reviewed v21-0001 again. v21-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch is a good refactor. overall looks good to me.
On Thu, Dec 19, 2024, at 07:48, jian he wrote: > I have reviewed v21-0001 again. > > v21-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch > is a good refactor. > > > overall looks good to me. OK, I could submit it as a separate patch. Would we also want the reorganization of existing copy option validations? That is, v21-0003-Reorganize-option-validations.patch minus v21-0002-Add-COPY-format-list.patch. Nice that you managed to arrange them in the same order as in the documentation. Made the code easier to follow. /Joel
On 2024-12-16 Mo 10:09 AM, Joel Jacobson wrote: > Hi hackers, > > After further consideration, I'm withdrawing the patch. > Some fundamental questions remain unresolved: > > - Should round-trip fidelity be a strict goal? By "round-trip fidelity", > I mean that data exported and then re-imported should yield exactly > the original values, including the distinction between NULL and empty strings. > - If round-trip fidelity is a requirement, how do we distinguish NULL from empty > strings without delimiters or escapes? > - Is automatic newline detection (as in "csv" and "text") more valuable than > the ability to embed \r (CR) characters? > - Would it be better to extend the existing COPY options rather than introducing > a new format? > - Or should we consider a JSONL format instead, one that avoids the NULL/empty > string problem entirely? > > No clear solution or consensus has emerged. For now, I'll step back from the > proposal. If someone wants to revisit this later, I'd be happy to contribute. > > Thanks again for all the feedback and consideration. > We seem to have got seriously into the weeds, here. I'd be sorry to see this dropped. After all, it's not something new, and while we have a sort of workaround for "one json doc per line" it's far from obvious, and except in a few blog posts undocumented. I think we're trying to be far too general here but in the absence of more general use cases. The ones I recall having encountered in the wild are: . one json datum per line . one json document per file . a sequence of json documents per file The last one is hard to deal with, and I think I've only seen it once or twice, so I suggest leaving it aside for now. Notice these are all JSON. I could imagine XML might have similar requirements, but I encounter it extremely rarely. Regarding NULL, an empty string is not a valid JSON literal, so there should be no confusion there. It is valid for XML, though. Given all that I think restricting ourselves to just the JSON cases, and possibly just to JSONL, would be perfectly reasonable. Regarding CR, it's not a valid character in a JSON string item, although it is valid in JSON whitespace. I would not treat it as magical unless it immediately precedes an NL. That gives rise to a very sight ambiguity, but I think it's one we could live with. As for what the format is called, I don't like the "LIST" proposal much, even for the general case. Seems too close to an array. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Thu, Dec 19, 2024, at 14:40, Andrew Dunstan wrote: > We seem to have got seriously into the weeds, here. I'd be sorry to see > this dropped. After all, it's not something new, and while we have a > sort of workaround for "one json doc per line" it's far from obvious, > and except in a few blog posts undocumented. > > I think we're trying to be far too general here but in the absence of > more general use cases. The ones I recall having encountered in the wild > are: > > . one json datum per line > > . one json document per file > > . a sequence of json documents per file > > The last one is hard to deal with, and I think I've only seen it once or > twice, so I suggest leaving it aside for now. > > Notice these are all JSON. I could imagine XML might have similar > requirements, but I encounter it extremely rarely. > > Regarding NULL, an empty string is not a valid JSON literal, so there > should be no confusion there. It is valid for XML, though. > > Given all that I think restricting ourselves to just the JSON cases, and > possibly just to JSONL, would be perfectly reasonable. > > Regarding CR, it's not a valid character in a JSON string item, although > it is valid in JSON whitespace. I would not treat it as magical unless > it immediately precedes an NL. That gives rise to a very sight > ambiguity, but I think it's one we could live with. > > As for what the format is called, I don't like the "LIST" proposal much, > even for the general case. Seems too close to an array. On Thu, Dec 19, 2024, at 14:40, Andrew Dunstan wrote: > We seem to have got seriously into the weeds, here. I'd be sorry to see > this dropped. After all, it's not something new, and while we have a > sort of workaround for "one json doc per line" it's far from obvious, > and except in a few blog posts undocumented. > > I think we're trying to be far too general here but in the absence of > more general use cases. The ones I recall having encountered in the wild > are: > > . one json datum per line > > . one json document per file > > . a sequence of json documents per file > > The last one is hard to deal with, and I think I've only seen it once or > twice, so I suggest leaving it aside for now. > > Notice these are all JSON. I could imagine XML might have similar > requirements, but I encounter it extremely rarely. > > Regarding NULL, an empty string is not a valid JSON literal, so there > should be no confusion there. It is valid for XML, though. > > Given all that I think restricting ourselves to just the JSON cases, and > possibly just to JSONL, would be perfectly reasonable. > > Regarding CR, it's not a valid character in a JSON string item, although > it is valid in JSON whitespace. I would not treat it as magical unless > it immediately precedes an NL. That gives rise to a very sight > ambiguity, but I think it's one we could live with. > > As for what the format is called, I don't like the "LIST" proposal much, > even for the general case. Seems too close to an array. Thanks for weighing in. OK, let's try to restrict ourselves to dealing with json lines and see if we can work out the precise semantics. The JSONL spec, or at least its validator [1], forbid empty lines. So we would need to extend the JSONL spec to: - Export a NULL::jsonb column value as an empty line. - Import an empty line as a NULL::jsonb column value. Could we also restrict ourselves to PostgreSQL's jsonb type? I fear trying to also support the PostgreSQL's json type will be another rabbit hole, since json values can contain LF characters in whitespace. Due to the necessary "empty line = NULL" extension, I think we should simply call the format 'jsonb', and not 'jsonl'. How about: COPY table_name [ ( column_name ) ] { FROM | TO } 'filename' (FORMAT jsonb); - If column list is omitted, table_name must have exactly one column. - If column list is specified, it must be of length one. - The column type must be jsonb. - Each line is a single jsonb value; no multi-line json permitted. - Non-LF whitespace, i.e. [ \r\t], are allowed anywhere where whitespace can exist according to the json spec, but are discarded upon import, since jsonb doesn't store whitepsace. - The LF character, i.e. [\n], determine the end of a jsonb value. - Empty line are imported as NULL::jsonb values - NULL::jsonb values are exported as empty lines Note: Lines can end with LR or CR+LF, since the CR is just whitespace, so even e.g. CR+CR+LF at the end would be allowed. Naming the new COPY format 'jsonb' would signal that we're firmly in PostgreSQL territory, not trying to produce a "pure" JSONL file. All JSONL files can be imported though, since the "empty line = NULL" extension just risks break other consumers of JSONL. If that's a problem, users can just filter out NULL values upon export, which seems acceptable to me. Just seems important to not call it JSONL. /Joel [1] https://jsonlines.org/validator/
On Sat, Dec 21, 2024 at 1:57 AM Joel Jacobson <joel@compiler.org> wrote:
How about:
COPY table_name [ ( column_name ) ] { FROM | TO } 'filename' (FORMAT jsonb);
- If column list is omitted, table_name must have exactly one column.
- If column list is specified, it must be of length one.
- The column type must be jsonb.
- Each line is a single jsonb value; no multi-line json permitted.
- Non-LF whitespace, i.e. [ \r\t], are allowed anywhere where
whitespace can exist according to the json spec, but are discarded
upon import, since jsonb doesn't store whitepsace.
- The LF character, i.e. [\n], determine the end of a jsonb value.
- Empty line are imported as NULL::jsonb values
- NULL::jsonb values are exported as empty lines
Note: Lines can end with LR or CR+LF, since the CR is just whitespace,
so even e.g. CR+CR+LF at the end would be allowed.
My first impression of this is positive. I like the scoping and the rules make sense.
I know this is outside of COPY charter as it stands today but I'd like to suggest allowing for some way of saying "please put input line numbers into this column". Think of it as adding "with ordinality" to copy. In particular it would help to deal with potential end-of-file situations where the last line is imported as a null value.
David J.
On 2025-02-17 Mo 7:05 PM, David G. Johnston wrote:
On Sat, Dec 21, 2024 at 1:57 AM Joel Jacobson <joel@compiler.org> wrote:How about:
COPY table_name [ ( column_name ) ] { FROM | TO } 'filename' (FORMAT jsonb);
- If column list is omitted, table_name must have exactly one column.
- If column list is specified, it must be of length one.
- The column type must be jsonb.
- Each line is a single jsonb value; no multi-line json permitted.
- Non-LF whitespace, i.e. [ \r\t], are allowed anywhere where
whitespace can exist according to the json spec, but are discarded
upon import, since jsonb doesn't store whitepsace.
- The LF character, i.e. [\n], determine the end of a jsonb value.
- Empty line are imported as NULL::jsonb values
- NULL::jsonb values are exported as empty lines
Note: Lines can end with LR or CR+LF, since the CR is just whitespace,
so even e.g. CR+CR+LF at the end would be allowed.
I'd tweak this a bit:
. call the format jsonl
. allow json - on output I would just eat any LF, but I could per persuaded to do something else.
My first impression of this is positive. I like the scoping and the rules make sense.I know this is outside of COPY charter as it stands today but I'd like to suggest allowing for some way of saying "please put input line numbers into this column". Think of it as adding "with ordinality" to copy. In particular it would help to deal with potential end-of-file situations where the last line is imported as a null value.
Let's not add feature creep. If you want a feature like this there's no reason it should belong only to JSON input.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com