Re: New "single" COPY format - Mailing list pgsql-hackers
From | Joel Jacobson |
---|---|
Subject | Re: New "single" COPY format |
Date | |
Msg-id | 90e59fd6-3f43-4edd-8b71-693347725d4a@app.fastmail.com Whole thread Raw |
In response to | Re: New "single" COPY format (Andrew Dunstan <andrew@dunslane.net>) |
List | pgsql-hackers |
On Thu, Dec 19, 2024, at 14:40, Andrew Dunstan wrote: > We seem to have got seriously into the weeds, here. I'd be sorry to see > this dropped. After all, it's not something new, and while we have a > sort of workaround for "one json doc per line" it's far from obvious, > and except in a few blog posts undocumented. > > I think we're trying to be far too general here but in the absence of > more general use cases. The ones I recall having encountered in the wild > are: > > . one json datum per line > > . one json document per file > > . a sequence of json documents per file > > The last one is hard to deal with, and I think I've only seen it once or > twice, so I suggest leaving it aside for now. > > Notice these are all JSON. I could imagine XML might have similar > requirements, but I encounter it extremely rarely. > > Regarding NULL, an empty string is not a valid JSON literal, so there > should be no confusion there. It is valid for XML, though. > > Given all that I think restricting ourselves to just the JSON cases, and > possibly just to JSONL, would be perfectly reasonable. > > Regarding CR, it's not a valid character in a JSON string item, although > it is valid in JSON whitespace. I would not treat it as magical unless > it immediately precedes an NL. That gives rise to a very sight > ambiguity, but I think it's one we could live with. > > As for what the format is called, I don't like the "LIST" proposal much, > even for the general case. Seems too close to an array. On Thu, Dec 19, 2024, at 14:40, Andrew Dunstan wrote: > We seem to have got seriously into the weeds, here. I'd be sorry to see > this dropped. After all, it's not something new, and while we have a > sort of workaround for "one json doc per line" it's far from obvious, > and except in a few blog posts undocumented. > > I think we're trying to be far too general here but in the absence of > more general use cases. The ones I recall having encountered in the wild > are: > > . one json datum per line > > . one json document per file > > . a sequence of json documents per file > > The last one is hard to deal with, and I think I've only seen it once or > twice, so I suggest leaving it aside for now. > > Notice these are all JSON. I could imagine XML might have similar > requirements, but I encounter it extremely rarely. > > Regarding NULL, an empty string is not a valid JSON literal, so there > should be no confusion there. It is valid for XML, though. > > Given all that I think restricting ourselves to just the JSON cases, and > possibly just to JSONL, would be perfectly reasonable. > > Regarding CR, it's not a valid character in a JSON string item, although > it is valid in JSON whitespace. I would not treat it as magical unless > it immediately precedes an NL. That gives rise to a very sight > ambiguity, but I think it's one we could live with. > > As for what the format is called, I don't like the "LIST" proposal much, > even for the general case. Seems too close to an array. Thanks for weighing in. OK, let's try to restrict ourselves to dealing with json lines and see if we can work out the precise semantics. The JSONL spec, or at least its validator [1], forbid empty lines. So we would need to extend the JSONL spec to: - Export a NULL::jsonb column value as an empty line. - Import an empty line as a NULL::jsonb column value. Could we also restrict ourselves to PostgreSQL's jsonb type? I fear trying to also support the PostgreSQL's json type will be another rabbit hole, since json values can contain LF characters in whitespace. Due to the necessary "empty line = NULL" extension, I think we should simply call the format 'jsonb', and not 'jsonl'. How about: COPY table_name [ ( column_name ) ] { FROM | TO } 'filename' (FORMAT jsonb); - If column list is omitted, table_name must have exactly one column. - If column list is specified, it must be of length one. - The column type must be jsonb. - Each line is a single jsonb value; no multi-line json permitted. - Non-LF whitespace, i.e. [ \r\t], are allowed anywhere where whitespace can exist according to the json spec, but are discarded upon import, since jsonb doesn't store whitepsace. - The LF character, i.e. [\n], determine the end of a jsonb value. - Empty line are imported as NULL::jsonb values - NULL::jsonb values are exported as empty lines Note: Lines can end with LR or CR+LF, since the CR is just whitespace, so even e.g. CR+CR+LF at the end would be allowed. Naming the new COPY format 'jsonb' would signal that we're firmly in PostgreSQL territory, not trying to produce a "pure" JSONL file. All JSONL files can be imported though, since the "empty line = NULL" extension just risks break other consumers of JSONL. If that's a problem, users can just filter out NULL values upon export, which seems acceptable to me. Just seems important to not call it JSONL. /Joel [1] https://jsonlines.org/validator/
pgsql-hackers by date: