Thread: Re: New "single" COPY format
Hi, On Thu, Nov 7, 2024 at 8:16 AM Joel Jacobson <joel@compiler.org> wrote: > > Hi hackers, > > Thread [1] renamed, since the format name has now been changed from 'raw' to > 'single', as suggested by Andrew Dunstan and Jacob Champion. > > [1] https://postgr.es/m/c12516b1-77dc-4ad3-94a7-88527360aee0@app.fastmail.com > > Recap: This is about adding support to import/export text-based formats such as > JSONL, or any unstructured text file, where wanting to import each line "as is" > into a single column, or wanting to export a single column to a text file. > > Example importing the meson-logs/testlog.json file Meson generates > when building PostgreSQL, which is in JSONL format: > > # create table meson_log (log_line jsonb); > # \copy meson_log from meson-logs/testlog.json (format single); > COPY 306 > # select log_line->'name' name, log_line->'result' result from meson_log limit 3; > name | result > -----------------------------------------+-------- > "postgresql:setup / tmp_install" | "OK" > "postgresql:setup / install_test_files" | "OK" > "postgresql:setup / initdb_cache" | "OK" > (3 rows) > > Changes since v16: > > * EOL handling now works the same as for 'text' and 'csv'. > In v16, we supported multi-byte delimiters to allow specifying > e.g. Windows EOL (\r\n), but this seemed unnecessary, if we just do what we do > for text/csv, that is, to auto-detect the EOL for COPY FROM, and use > the OS default EOL for COPY TO. > The DELIMITER option is therefore invalid for the 'single' format. > This is the biggest change in the code, between v16 and v18. > CopyReadLineRawText() has been renamed to CopyReadLineSingleText(), > and changed accordingly. In earlier versions, we supported loading the whole file into a single tuple. Is there any reason that it doesn't support it in v18? I think if it's useful we can improve it in a separate patch. > > * A final EOL is now emitted to the last record in COPY TO. > So now it works just like 'text' and 'csv'. > +1 > * HEADER [ boolean | MATCH ] now supported > This is now again supported, as previously suggested by Daniel Verite, > possible thanks to the EOL handling. It makes sense to support it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Nov 8, 2024, at 00:13, Masahiko Sawada wrote: > In earlier versions, we supported loading the whole file into a single > tuple. Is there any reason that it doesn't support it in v18? I think > if it's useful we can improve it in a separate patch. Not sure how useful it is, since we already have pg_read_file(). Also, I think it's out of scope for the 'single' format, since I think it should be about about processing text line by line, in the same way 'csv' and 'text' work. The implementation depended on the delimiter option, where the default was no delimiter, which then read the entire file, and since we don't have the delimiter option anymore, that approach won't work. From a docs perspective, it would get quite ugly and confusing, since we would need to rephrase sentences like the below, since they would then not always be true: "each line is treated as a single field" "each line of the input or output is considered a complete value without any field separation" >> * A final EOL is now emitted to the last record in COPY TO. >> So now it works just like 'text' and 'csv'. >> > > +1 > >> * HEADER [ boolean | MATCH ] now supported >> This is now again supported, as previously suggested by Daniel Verite, >> possible thanks to the EOL handling. > > It makes sense to support it. /Joel
On Thursday, November 7, 2024, Joel Jacobson <joel@compiler.org> wrote:
On Fri, Nov 8, 2024, at 00:13, Masahiko Sawada wrote:
> In earlier versions, we supported loading the whole file into a single
> tuple. Is there any reason that it doesn't support it in v18? I think
> if it's useful we can improve it in a separate patch.
Not sure how useful it is, since we already have pg_read_file().
Being forced to have the file server-readable, non-stdin, destroys quite a bit of the usefulness of pg_read_file.
If we want clients to be able to pass the effort here to the server, copy is definitely the most useful way to do so.
I’d be concerned choosing “single” given this future possibility. I do agree that such an enhancement would be best done in its own patch.
David J.
On Fri, Nov 8, 2024, at 07:14, David G. Johnston wrote: > On Thursday, November 7, 2024, Joel Jacobson <joel@compiler.org> wrote: >> On Fri, Nov 8, 2024, at 00:13, Masahiko Sawada wrote: >> > In earlier versions, we supported loading the whole file into a single >> > tuple. Is there any reason that it doesn't support it in v18? I think >> > if it's useful we can improve it in a separate patch. >> >> Not sure how useful it is, since we already have pg_read_file(). > > Being forced to have the file server-readable, non-stdin, destroys > quite a bit of the usefulness of pg_read_file. > > If we want clients to be able to pass the effort here to the server, > copy is definitely the most useful way to do so. Right, good point, I agree. > I’d be concerned choosing “single” given this future possibility. I do > agree that such an enhancement would be best done in its own patch. OK, sounds good to do it in its own patch. If the name "single" doesn't work for this reason, I see at least two alternatives: 1) Keep "single" as format name, and let it only be concerned about line by line processing, and introduce a different format for entire file processing, in its own patch. 2) Some other format name ("raw"?) that allows such future enhancement to be done within the same format, in its own patch. Other ideas? /Joel
On Fri, Nov 8, 2024, at 08:42, Joel Jacobson wrote: >> I’d be concerned choosing “single” given this future possibility. I do >> agree that such an enhancement would be best done in its own patch. > > OK, sounds good to do it in its own patch. > > If the name "single" doesn't work for this reason, I see at least > two alternatives: > > 1) Keep "single" as format name, and let it only be concerned about line by line > processing, and introduce a different format for entire file processing, > in its own patch. > > 2) Some other format name ("raw"?) that allows such future enhancement to > be done within the same format, in its own patch. > > Other ideas? Sorry for noise, "raw" is of course not an option given how v18 works, due to the auto-magic EOL detection, same as in "text" and "csv", as pointed out by others earlier in the thread. How about "single_column"? Then, a future patch could implement a "single_value" format, to process an entire file or value. Such format could then also support binary data, by detecting if the column type is "bytea". /Joel