Re: New "single" COPY format - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: New "single" COPY format
Date
Msg-id 0b70a518-f6cc-483b-8e1c-51a8585f0f72@dunslane.net
In response to Re: New "single" COPY format  ("Joel Jacobson" <joel@compiler.org>)
Responses Re: New "single" COPY format
List pgsql-hackers
On 2024-12-16 Mo 10:09 AM, Joel Jacobson wrote:
> Hi hackers,
>
> After further consideration, I'm withdrawing the patch.
> Some fundamental questions remain unresolved:
>
> - Should round-trip fidelity be a strict goal? By "round-trip fidelity",
>    I mean that data exported and then re-imported should yield exactly
>    the original values, including the distinction between NULL and empty strings.
> - If round-trip fidelity is a requirement, how do we distinguish NULL from empty
>    strings without delimiters or escapes?
> - Is automatic newline detection (as in "csv" and "text") more valuable than
>    the ability to embed \r (CR) characters?
> - Would it be better to extend the existing COPY options rather than introducing
>    a new format?
> - Or should we consider a JSONL format instead, one that avoids the NULL/empty
>    string problem entirely?
>
> No clear solution or consensus has emerged. For now, I'll step back from the
> proposal. If someone wants to revisit this later, I'd be happy to contribute.
>
> Thanks again for all the feedback and consideration.
>

We seem to have got seriously into the weeds here. I'd be sorry to see
this dropped. After all, it's not something new, and while we have a
sort of workaround for "one json doc per line", it's far from obvious
and, except in a few blog posts, undocumented.

I think we're trying to be far too general here, in the absence of
more general use cases. The ones I recall having encountered in the wild
are:

   . one json datum per line

   . one json document per file

   . a sequence of json documents per file

The last one is hard to deal with, and I think I've only seen it once or 
twice, so I suggest leaving it aside for now.
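
To make the three shapes concrete, here is a minimal client-side sketch in
Python. It is only an illustration of the input shapes, not a proposal for
how COPY would parse them; the function names are hypothetical. It also
hints at why the third case is the awkward one: the reader has to find
document boundaries itself instead of relying on newlines or end of file.

```python
import json

def parse_jsonl(text):
    """Case 1: one JSON datum per line (blank lines skipped in this sketch)."""
    return [json.loads(line) for line in text.split("\n") if line]

def parse_single(text):
    """Case 2: the whole file is one JSON document."""
    return json.loads(text)

def parse_sequence(text):
    """Case 3: several JSON documents back to back in one file."""
    decoder = json.JSONDecoder()
    docs, pos = [], 0
    while True:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos] in " \t\r\n":
            pos += 1
        if pos == len(text):
            break
        # raw_decode returns the value and the index just past it.
        doc, pos = decoder.raw_decode(text, pos)
        docs.append(doc)
    return docs
```

Note that in the third case something like "12" cannot be told apart from
"1" followed by "2" without a separator, which is part of why I'd leave it
aside.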

Notice these are all JSON. I could imagine XML might have similar 
requirements, but I encounter it extremely rarely.

Regarding NULL, an empty string is not a valid JSON literal, so there 
should be no confusion there. It is valid for XML, though.

Given all that I think restricting ourselves to just the JSON cases, and 
possibly just to JSONL, would be perfectly reasonable.

Regarding CR, it's not a valid character in a JSON string item, although 
it is valid in JSON whitespace. I would not treat it as magical unless 
it immediately precedes an NL. That gives rise to a very slight
ambiguity, but I think it's one we could live with.
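
Roughly what I have in mind, as a Python sketch rather than anything
resembling the actual COPY code (split_records is a hypothetical name):
split on NL only, and drop a CR only when it is the last byte before that
NL. A CR anywhere else is left alone as ordinary JSON whitespace. The
ambiguity, as I read it, is a datum whose own trailing whitespace ends in
CR: that CR gets eaten, but the JSON value is unchanged.

```python
def split_records(data: bytes):
    """Split JSONL input into records, treating CR NL and NL as terminators."""
    records = data.split(b"\n")
    # A final NL leaves an empty trailing piece; it is not a record.
    if records and records[-1] == b"":
        records.pop()
    # Strip a CR only when it immediately preceded the NL.
    return [r[:-1] if r.endswith(b"\r") else r for r in records]
```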

As for what the format is called, I don't like the "LIST" proposal much, 
even for the general case. Seems too close to an array.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com



