Thread: Re: New "raw" COPY format
> Hi hackers,
>
> This thread is about implementing a new "raw" COPY format.
>
> This idea came up in a different thread [1], moved here.
>
> [1] https://postgr.es/m/47b5c6a7-5c0e-40aa-8ea2-c7b95ccf296f%40app.fastmail.com
>
> The main use-case for the raw format is when needing to import arbitrary
> unstructured text files, such as log files, into a single text column
> of a table.

After COPY has imported the "unstructured text file" in the "raw" COPY
format, what is the column type? text? or bytea? If it's text, how do you
handle encoding conversion if the "unstructured text file" is encoded in a
server-side-unsafe encoding such as SJIS?

> All characters are taken literally.
> There is no special handling for quotes, backslashes, or escape sequences.

If SJIS text is imported "literally" (i.e. no encoding conversion), it
should be rejected.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
On Sun, Oct 13, 2024, at 11:52, Tatsuo Ishii wrote:
> After COPY has imported the "unstructured text file" in the "raw" COPY
> format, what is the column type? text? or bytea? If it's text, how do
> you handle encoding conversion if the "unstructured text file" is
> encoded in a server-side-unsafe encoding such as SJIS?
>
>> All characters are taken literally.
>> There is no special handling for quotes, backslashes, or escape sequences.
>
> If SJIS text is imported "literally" (i.e. no encoding conversion), it
> should be rejected.

I think encoding conversion is still necessary, and should work the same
as for the COPY formats "text" and "csv".

/Joel
Hi,

Idle thoughts from a design perspective -- feel free to ignore, since I'm
not the target audience for the feature:

- If the column data stored in Postgres contains newlines, it seems like
  COPY TO won't work "correctly". Is that acceptable?
- RAW seems like an okay-ish label, but for something that's doing as much
  magic end-of-line detection as this patch is, I'd personally prefer
  SINGLE (as in, "single column").
- Speaking of magic end-of-line detection, can there be a way to turn that
  off? Say, via DELIMITER?
- Generic DELIMITER support, for any single-byte separator at all, might
  make a "single-column" format more generally applicable. But I might be
  over-architecting. And it would make the COPY TO issue even worse...

Thanks,
--Jacob
On Tue, Oct 15, 2024, at 19:30, Jacob Champion wrote:
> Hi,
>
> Idle thoughts from a design perspective -- feel free to ignore, since
> I'm not the target audience for the feature:

Many thanks for looking at this!

> - If the column data stored in Postgres contains newlines, it seems
> like COPY TO won't work "correctly". Is that acceptable?

That's an interesting edge-case to think about.

Rejecting such column data, to ensure all data that can be COPY TO'd can
be loaded back using COPY FROM with the data intact, seems attractive,
because it protects users against unintentional mistakes, and all the
other COPY formats are able to preserve the data unchanged when doing a
COPY FROM of a file created with COPY TO.

OTOH, if we think of COPY TO in this case as a way of `cat`-ing text
values, it might be more pragmatic to allow it. By `cat`-ing I mean that
if, in Unix, you do...

    cat file1.txt file2.txt > file3.txt

...there is no way to reverse that operation, that is, to reconstruct
file1.txt and file2.txt from file3.txt.

However, I think rejecting such column data is the better alternative, to
ensure data exported with COPY TO can always be imported back using COPY
FROM with the same format. If text column data contains newlines, users
probably ought to be using the text or csv format instead.

> - RAW seems like an okay-ish label, but for something that's doing as
> much magic end-of-line detection as this patch is, I'd personally
> prefer SINGLE (as in, "single column").

It's actually the same end-of-line detection as the text format in
copyfromparse.c's CopyReadLineText(), except the code is simpler thanks
to not having to deal with quotes or escapes. It basically just learns
the newline sequence based on its first occurrence, and then requires it
to be the same throughout the file.

The same data files can be tested with the text format, since they don't
contain any escape, quote or delimiter characters.
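For illustration, that first-occurrence detection can be sketched in
Python roughly like this (a simplified editorial model, not the actual C
logic in CopyReadLineText()):

```python
def split_lines_strict(data):
    """Learn the newline style ("\\r\\n", "\\r", or "\\n") from its first
    occurrence, then require that same style throughout; a different
    style later in the input raises an error, roughly mirroring the
    "literal newline / carriage return found in data" errors."""
    lines = []
    style = None          # learned lazily from the first newline seen
    buf = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch in ("\r", "\n"):
            seq = "\r\n" if ch == "\r" and data[i:i + 2] == "\r\n" else ch
            if style is None:
                style = seq       # first occurrence fixes the style
            elif seq != style:
                raise ValueError("inconsistent newline style: "
                                 f"expected {style!r}, found {seq!r}")
            lines.append("".join(buf))
            buf = []
            i += len(seq)
        else:
            buf.append(ch)
            i += 1
    if buf:
        lines.append("".join(buf))
    return lines
```

So "a\nb\nc" splits into three rows, while "a\nb\rc" fails once the lone
"\r" contradicts the learned "\n" style.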
Different newline styles are detected automatically also by format text:

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_lr.data' (FORMAT text);
COPY 2
COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_cr.data' (FORMAT text);
COPY 2
COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_cr_lr.data' (FORMAT text);
COPY 2

A mixed newline style causes errors also for the text format:

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_1.data' (FORMAT text);
ERROR:  literal newline found in data
HINT:  Use "\n" to represent newline.
CONTEXT:  COPY t, line 2
COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_2.data' (FORMAT text);
ERROR:  literal newline found in data
HINT:  Use "\n" to represent newline.
CONTEXT:  COPY t, line 2
COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_3.data' (FORMAT text);
ERROR:  literal carriage return found in data
HINT:  Use "\r" to represent carriage return.
CONTEXT:  COPY t, line 2
COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_4.data' (FORMAT text);
ERROR:  literal carriage return found in data
HINT:  Use "\r" to represent carriage return.
CONTEXT:  COPY t, line 2
COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_5.data' (FORMAT text);
ERROR:  literal carriage return found in data
HINT:  Use "\r" to represent carriage return.
CONTEXT:  COPY t, line 2

I must confess, I didn't know about this newline detection before reading
the copyfromparse.c source code.

> - Speaking of magic end-of-line detection, can there be a way to turn
> that off? Say, via DELIMITER?
> - Generic DELIMITER support, for any single-byte separator at all,
> might make a "single-column" format more generally applicable. But I
> might be over-architecting. And it would make the COPY TO issue even
> worse...
That's an interesting idea that would provide more flexibility, though at
the cost of complicating things by overloading the meaning of DELIMITER.

If aiming to make this more generally applicable, then at least DELIMITER
would need to be multi-byte, since otherwise the Windows case \r\n
couldn't be specified.

But I feel COPY already has quite a lot of options, and I fear it's quite
complicated for users as it is.

What I found appealing with the idea of a new COPY format was that,
instead of overloading the existing options with more complexity, a new
format wouldn't need to affect the existing options, and the new format
could be explained separately, without making things worse for users not
using this format.

/Joel
On Tue, Oct 15, 2024 at 8:50 PM Joel Jacobson <joel@compiler.org> wrote:

Hi. I only checked 0001, 0002, 0003.
The raw format patch is v9-0016. 0003-0016 is a lot of small patches;
maybe you can consolidate them to make the review easier.

-COPY x to stdin (format TEXT, force_quote(a));
+COPY x to stdout (format TEXT, force_quote(a));

0001 makes sense to me; I think generally we do "to stdout", "from stdin".

v9-0002-Fix-validation-of-FORCE_NOT_NULL-FORCE_NULL-for-all-.patch
looks good.

typedef enum CopyLogVerbosityChoice
{
    COPY_LOG_VERBOSITY_SILENT = -1, /* logs none */
    COPY_LOG_VERBOSITY_DEFAULT = 0, /* logs no additional messages. As this is
                                     * the default, assign 0 */
    COPY_LOG_VERBOSITY_VERBOSE, /* logs additional messages */
} CopyLogVerbosityChoice;

/*
 * Represents the format of the COPY operation.
 */
typedef enum CopyFormat
{
    COPY_FORMAT_TEXT,
    COPY_FORMAT_BINARY,
    COPY_FORMAT_CSV,
} CopyFormat;

In BeginCopyTo:

    cstate = (CopyToStateData *) palloc0(sizeof(CopyToStateData));
    ProcessCopyOptions(pstate, &cstate->opts, false /* is_from */ , options);

palloc0(sizeof(CopyToStateData)) makes the default format
COPY_FORMAT_TEXT. I think you may need COPY_FORMAT_TEXT = 0, even though
based on [1] it seems not required.

[1] https://stackoverflow.com/questions/6434105/are-default-enum-values-in-c-the-same-for-all-compilers
On Tue, Oct 15, 2024 at 1:38 PM Joel Jacobson <joel@compiler.org> wrote:
>
> However, I think rejecting such column data is the better
> alternative, to ensure data exported with COPY TO
> can always be imported back using COPY FROM,
> for the same format. If text column data contains newlines,
> users probably ought to be using the text or csv format instead.

Yeah. I think _someone's_ going to have strong opinions one way or the
other, but that person is not me. And I assume a contents check during
COPY TO is going to have a noticeable performance impact...

> > - RAW seems like an okay-ish label, but for something that's doing as
> > much magic end-of-line detection as this patch is, I'd personally
> > prefer SINGLE (as in, "single column").
>
> It's actually the same end-of-line detection as the text format
> in copyfromparse.c's CopyReadLineText(), except the code
> is simpler thanks to not having to deal with quotes or escapes.

Right, sorry, I hadn't meant to imply that you made it up. :D Just that
a "raw" format that is actually automagically detecting things doesn't
seem very "raw" to me, so I prefer the other name.

> It basically just learns the newline sequence based on its first
> occurrence, and then requires it to be the same throughout the file.

A hypothetical type whose text representation can contain '\r' but not
'\n' still can't be unambiguously round-tripped under this scheme: COPY
FROM will see the "mixed" line endings and complain, even though there's
no ambiguity.

Maybe no one will run into that problem in practice? But if they did, I
think that'd be a pretty frustrating limitation. It'd be nice to
override the behavior, to change it from "do what you think I mean" to
"do what I say".

> > - Speaking of magic end-of-line detection, can there be a way to turn
> > that off? Say, via DELIMITER?
> > - Generic DELIMITER support, for any single-byte separator at all,
> > might make a "single-column" format more generally applicable. But I
> > might be over-architecting. And it would make the COPY TO issue even
> > worse...
>
> That's an interesting idea that would provide more flexibility,
> though at the cost of complicating things by overloading the meaning
> of DELIMITER.

I think that'd be a docs issue rather than a conceptual one, though...
it's still a delimiter. I wouldn't really expect end-user confusion.

> If aiming to make this more generally applicable,
> then at least DELIMITER would need to be multi-byte,
> since otherwise the Windows case \r\n couldn't be specified.

True.

> What I found appealing with the idea of a new COPY format
> was that, instead of overloading the existing options
> with more complexity, a new format wouldn't need to affect
> the existing options, and the new format could be explained
> separately, without making things worse for users not
> using this format.

I agree that we should not touch the existing formats. If
RAW/SINGLE/whatever needed a multibyte line delimiter, I'm not proposing
that the other formats should change.

--Jacob
Joel Jacobson wrote:

> However, I think rejecting such column data is the better
> alternative, to ensure data exported with COPY TO
> can always be imported back using COPY FROM,
> for the same format.

On the other hand, that might prevent cases where we want to export, for
instance, a valid json array:

  copy (select json_agg(col) from table) to 'file' RAW

This is a variant of the discussion in [1] where the OP does:

  copy (select json_agg(row_to_json(t)) from <query> t) TO 'file'

and he complains that both text and csv "break the JSON". That
discussion morphed into a proposed patch adding JSON format to COPY, but
RAW would work directly as the OP expected.

That is, unless <query> happens to include JSON fields with LF/CRLF in
them, and the RAW format says this is an error condition. In that case
it's quite annoying to make it an error, rather than simply let it pass.

[1] https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Wed, Oct 16, 2024, at 18:04, Jacob Champion wrote:
> A hypothetical type whose text representation can contain '\r' but not
> '\n' still can't be unambiguously round-tripped under this scheme:
> COPY FROM will see the "mixed" line endings and complain, even though
> there's no ambiguity.

Yeah, that's quite an ugly limitation.

> Maybe no one will run into that problem in practice? But if they did,
> I think that'd be a pretty frustrating limitation. It'd be nice to
> override the behavior, to change it from "do what you think I mean" to
> "do what I say".

That would be nice.

>> That's an interesting idea that would provide more flexibility,
>> though at the cost of complicating things by overloading the meaning
>> of DELIMITER.
>
> I think that'd be a docs issue rather than a conceptual one, though...
> it's still a delimiter. I wouldn't really expect end-user confusion.

Yeah, I meant the docs, but that's probably fine, we could just add a
<note> to DELIMITER.

>> What I found appealing with the idea of a new COPY format
>> was that, instead of overloading the existing options
>> with more complexity, a new format wouldn't need to affect
>> the existing options, and the new format could be explained
>> separately, without making things worse for users not
>> using this format.
>
> I agree that we should not touch the existing formats. If
> RAW/SINGLE/whatever needed a multibyte line delimiter, I'm not
> proposing that the other formats should change.

Right, I didn't think you did either; I meant overloading the existing
options from a docs perspective. But I agree it's probably fine if we
just overload DELIMITER in the docs, that should be possible to explain
in a pedagogic way, without causing confusion.

/Joel
On Wed, Oct 16, 2024, at 18:34, Daniel Verite wrote:
> Joel Jacobson wrote:
>
>> However, I think rejecting such column data is the better
>> alternative, to ensure data exported with COPY TO
>> can always be imported back using COPY FROM,
>> for the same format.
>
> On the other hand, that might prevent cases where we
> want to export, for instance, a valid json array:
>
>   copy (select json_agg(col) from table) to 'file' RAW
>
> This is a variant of the discussion in [1] where the OP does:
>
>   copy (select json_agg(row_to_json(t)) from <query> t) TO 'file'
>
> and he complains that both text and csv "break the JSON".
> That discussion morphed into a proposed patch adding JSON
> format to COPY, but RAW would work directly as the OP
> expected.
>
> That is, unless <query> happens to include JSON fields with LF/CRLF
> in them, and the RAW format says this is an error condition.
> In that case it's quite annoying to make it an error, rather than
> simply let it pass.
>
> [1]
> https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com

Thanks for finding this related thread. Very good example, that would be
solved with RAW.

I've used a different hack myself many times to export text "as is",
which is to use COPY TO ... BINARY, and then to manually edit the file
using vim, stripping away the header and footer from the file. :D

But I don't see how JSON fields that come from PostgreSQL could contain
LF/CRLF though, could they? Since LF/CR must be escaped inside a JSON
field, and when casting JSONB to text, there are no newlines injected
anywhere in between fields. I can only see how a value of the legacy
JSON type could have newlines in it.

That doesn't matter though; this is still a very good example of the
need to export text "as is" from PostgreSQL to a file.

I like Jacob's idea of letting the user specify a DELIMITER also for the
RAW format, to override the automagical newline detection. This would
e.g. allow importing values containing \r when the newline delimiter is
\n, which would otherwise be rejected with an "inconsistent newline
style" error.

I think it would also be useful to allow specifying DELIMITER NONE,
which would allow importing an entire text file, "as is", into a single
column and single row. To export a single column and single row, the
delimiter wouldn't matter, since there wouldn't be any. But DELIMITER
NONE would be useful also for COPY TO, if wanting to concatenate text
"as is" to a file without adding newlines in between, similar to the
UNIX "cat" command.

Regarding the name for the format: I thought SINGLE was nice before I
read this message, but now since more rawness seems desirable, I think I
like RAW the most again, since if the automagical newline detection can
be overridden, then it's really raw for real.

A final thought is to maybe consider just skipping the automagical
newline detection for RAW. Instead of the automagical detection, the
default newline delimiter could be the OS default, similar to how COPY
TO works. That way, it would almost always just work for most users, as
long as they process files within their OS, and when not, they would
just need to specify the DELIMITER.

/Joel
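As a rough sketch of the semantics being discussed (hypothetical, since
neither the overridable multi-byte DELIMITER nor DELIMITER NONE exists in
the patch yet), the COPY FROM side would behave something like:

```python
def raw_copy_from(data, delimiter=None):
    """Sketch of the proposed raw-format COPY FROM semantics:
    an explicit delimiter splits rows on exactly that byte sequence;
    no delimiter (DELIMITER NONE) reads the entire input as one value."""
    if delimiter is None:
        return [data]           # whole file -> single column, single row
    return data.split(delimiter)

# A Windows-style file, split row by row on the explicit delimiter:
raw_copy_from("line1\r\nline2", "\r\n")   # ["line1", "line2"]
# A value containing \r is fine when the delimiter is explicitly \n:
raw_copy_from("a\rb\nc", "\n")            # ["a\rb", "c"]
# A whole JSON document imported as one value:
raw_copy_from('[{"id":1},\n{"id":2}]')    # ['[{"id":1},\n{"id":2}]']
```

The second call is exactly the case the automagical detection would
reject, while an explicit delimiter lets it pass.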
On Wed, Oct 16, 2024, at 20:30, Joel Jacobson wrote:
> A final thought is to maybe consider just skipping
> the automagical newline detection for RAW.
>
> Instead of the automagical detection,
> the default newline delimiter could be the OS default,
> similar to how COPY TO works.
>
> That way, it would almost always just work for most users,
> as long as they process files within their OS,
> and when not, they would just need to specify the DELIMITER.

I would guess that nowadays, dealing with unstructured text files is
probably less common than dealing with structured text files, such as
JSON, YAML, TOML, XML, etc.

Therefore, maybe DELIMITER NONE would be a better default for RAW?
Especially since it's then also more honest in being "raw".

If needing to import an unstructured text file that is just newline
delimited, and not wanting the entire file as a single value, the
newline style would then just need to be specified using the DELIMITER
option.

/Joel
On Wed, Oct 16, 2024, at 21:13, Joel Jacobson wrote:
> Therefore, maybe DELIMITER NONE would be a better default
> for RAW? Especially since it's then also more honest in being "raw".
>
> If needing to import an unstructured text file that is just newline
> delimited, and not wanting the entire file as a single value,
> the newline style would then just need to be specified
> using the DELIMITER option.

I realize the DELIMITER NONE syntax is unnecessary, since if that's the
default for RAW, we would just not specify any delimiter.

/Joel
On Wed, Oct 16, 2024 at 2:37 PM Joel Jacobson <joel@compiler.org> wrote:
>
> On Wed, Oct 16, 2024, at 05:31, jian he wrote:
> > Hi.
> > I only checked 0001, 0002, 0003.
> > The raw format patch is v9-0016. 0003-0016 is a lot of small patches;
> > maybe you can consolidate them to make the review easier.
>
> Thanks for reviewing.
>
> OK, I've consolidated the v9 0003-0016 into a single patch.

+  <refsect2>
+   <title>Raw Format</title>
+
+   <para>
+    This format option is used for importing and exporting files containing
+    unstructured text, where each line is treated as a single field. It is
+    ideal for data that does not conform to a structured, tabular format and
+    lacks delimiters.
+   </para>
+
+   <para>
+    In the <literal>raw</literal> format, each line of the input or output is
+    considered a complete value without any field separation. There are no
+    field delimiters, and all characters are taken literally. There is no
+    special handling for quotes, backslashes, or escape sequences. All
+    characters, including whitespace and special characters, are preserved
+    exactly as they appear in the file. However, it's important to note that
+    the text is still interpreted according to the specified <literal>ENCODING</literal>
+    option or the current client encoding for input, and encoded using the
+    specified <literal>ENCODING</literal> or the current client encoding for output.
+   </para>
+
+   <para>
+    When using this format, the <command>COPY</command> command must specify
+    exactly one column. Specifying multiple columns will result in an error.
+    If the table has multiple columns and no column list is provided, an error
+    will occur.
+   </para>
+
+   <para>
+    The <literal>raw</literal> format does not distinguish a <literal>NULL</literal>
+    value from an empty string. Empty lines are imported as empty strings, not
+    as <literal>NULL</literal> values.
+   </para>
+
+   <para>
+    Encoding works the same as in the <literal>text</literal> and <literal>CSV</literal> formats.
+   </para>
+
+  </refsect2>
+
+  <refsect2>
+   <title>Raw Format</title>
+
+   <para>
+    This format option is used for importing and exporting files containing
+    unstructured text, where each line is treated as a single field. It is
+    ideal for data that does not conform to a structured, tabular format and
+    lacks delimiters.
+   </para>
+
+   <para>
+    In the <literal>raw</literal> format, each line of the input or output is
+    considered a complete value without any field separation. There are no
+    field delimiters, and all characters are taken literally. There is no
+    special handling for quotes, backslashes, or escape sequences. All
+    characters, including whitespace and special characters, are preserved
+    exactly as they appear in the file. However, it's important to note that
+    the text is still interpreted according to the specified <literal>ENCODING</literal>
+    option or the current client encoding for input, and encoded using the
+    specified <literal>ENCODING</literal> or the current client encoding for output.
+   </para>
+
+   <para>
+    When using this format, the <command>COPY</command> command must specify
+    exactly one column. Specifying multiple columns will result in an error.
+    If the table has multiple columns and no column list is provided, an error
+    will occur.
+   </para>
+
+   <para>
+    The <literal>raw</literal> format does not distinguish a <literal>NULL</literal>
+    value from an empty string. Empty lines are imported as empty strings, not
+    as <literal>NULL</literal> values.
+   </para>
+
+   <para>
+    Encoding works the same as in the <literal>text</literal> and <literal>CSV</literal> formats.
+   </para>
+
+  </refsect2>
+
  <refsect2 id="sql-copy-binary-format" xreflabel="Binary Format">
   <title>Binary Format</title>

The <refsect2><title>Raw Format</title> section is duplicated.

The Raw Format section didn't mention the special handling of the
end-of-data marker.

+COPY copy_raw_test (col) FROM :'filename' RAW;

We may need to support this. But since we do not allow

    COPY x from stdin text;
    COPY x to stdout text;

I think adding the RAW keyword in gram.y may not be necessary.

/* Complete COPY <sth> FROM|TO filename WITH (FORMAT */
else if (Matches("COPY|\\copy", MatchAny, "FROM|TO", MatchAny, "WITH", "(", "FORMAT"))
    COMPLETE_WITH("binary", "csv", "text");

In src/bin/psql/tab-complete.in.c, we can also add "raw".

/* --- ESCAPE option --- */
if (opts_out->escape)
{
    if (opts_out->format != COPY_FORMAT_CSV)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
        /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
                 errmsg("COPY %s requires CSV mode", "ESCAPE")));
}

The ESCAPE option has no regress test.

/* --- QUOTE option --- */
if (opts_out->quote)
{
    if (opts_out->format != COPY_FORMAT_CSV)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
        /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
                 errmsg("COPY %s requires CSV mode", "QUOTE")));
}

The QUOTE option has no regress test.

In CopyOneRowTo:

else if (cstate->opts.format == COPY_FORMAT_RAW)
{
    int         attnum;
    Datum       value;
    bool        isnull;

    /* Ensure only one column is being copied */
    if (list_length(cstate->attnumlist) != 1)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("COPY with format 'raw' must specify exactly one column")));

    attnum = linitial_int(cstate->attnumlist);
    value = slot->tts_values[attnum - 1];
    isnull = slot->tts_isnull[attnum - 1];

    if (!isnull)
    {
        char *string = OutputFunctionCall(&out_functions[attnum - 1], value);
        CopyAttributeOutRaw(cstate, string);
    }
    /* For RAW format, we don't send anything for NULL values */
}

We already did column length checking at BeginCopyTo; no need for the
"if (list_length(cstate->attnumlist) != 1)" error check in CopyOneRowTo?
On Fri, Oct 18, 2024, at 19:24, Joel Jacobson wrote:
> Attachments:
> * v11-0001-Refactor-ProcessCopyOptions-introduce-CopyFormat-enu.patch
> * v11-0002-Add-raw-format-to-COPY-command.patch

Here is a demo of importing a decently sized real text file, that can't
currently be imported without the CSV hack:

$ head /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
.package-cache-mutate         devel/cargo
bin                           admin/base-files
bin/archdetect                admin/ubiquity
bin/autopartition             admin/ubiquity
bin/autopartition-crypto      admin/ubiquity
bin/autopartition-loop        admin/ubiquity
bin/autopartition-lvm         admin/ubiquity
bin/block-attr                admin/ubiquity
bin/blockdev-keygen           admin/ubiquity
bin/blockdev-wipe             admin/ubiquity

This file uses a combination of tabs and spaces in between the two
columns, so none of the existing formats are suitable to deal with this
file.

$ ls -lah /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
-rw-r--r-- 1 root root 791M Apr 24 02:07 /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64

To import using the CSV hack, we first have to find two bytes that don't
exist anywhere in the file, which can be done using e.g. ripgrep. The
below command verifies \x01 and \x02 don't exist anywhere:

$ rg -uuu --multiline '(?-u)[\x01|\x02]' /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
$

Knowing these bytes don't exist anywhere, we can then safely use them as
delimiter and quote characters, as a hack to disable these features:

CREATE TABLE package_contents (raw_line text);

COPY package_contents FROM '/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' (FORMAT CSV, DELIMITER E'\x01', QUOTE E'\x02');
COPY 8443588

Time: 3882.100 ms (00:03.882)
Time: 3552.991 ms (00:03.553)
Time: 3748.038 ms (00:03.748)
Time: 3775.947 ms (00:03.776)
Time: 3729.020 ms (00:03.729)

I tested writing a Rust program that would read the file line-by-line
and INSERT each line instead. This is of course a lot slower, since it
has to execute each insert separately:

$ cargo run --release
   Compiling insert_package_contents v0.1.0 (/home/joel/insert_package_contents)
    Finished `release` profile [optimized] target(s) in 0.70s
     Running `target/release/insert_package_contents`
Connecting to the PostgreSQL database...
Successfully connected to the database.
Starting to insert lines from the file...
Successfully inserted 8443588 lines into package_contents in 134.65s.

New approach using the RAW format:

COPY package_contents FROM '/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' (FORMAT RAW, DELIMITER E'\n');
COPY 8443588

Time: 2918.489 ms (00:02.918)
Time: 3020.372 ms (00:03.020)
Time: 3336.589 ms (00:03.337)
Time: 3067.268 ms (00:03.067)
Time: 3343.694 ms (00:03.344)

Apart from the convenience improvement, it seems to be somewhat faster
already.

/Joel
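The ripgrep pre-check above (verifying that the chosen delimiter and
quote bytes never occur in the file) could equally be done with a small
script; a sketch in Python, streaming so the 791M file isn't read into
memory at once:

```python
def bytes_absent(path, candidates=(b"\x01", b"\x02")):
    """Return True if none of the candidate bytes occur in the file,
    meaning they are safe to use as the CSV delimiter/quote characters
    in the FORMAT CSV hack."""
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):  # stream in 1 MiB chunks
            if any(c in chunk for c in candidates):
                return False
    return True
```

Only if this returns True is the hack with DELIMITER E'\x01',
QUOTE E'\x02' safe to use on that file.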
On Sat, Oct 19, 2024 at 1:24 AM Joel Jacobson <joel@compiler.org> wrote:
>
> Handling of e.g. JSON and other structured text files that could contain
> newlines, in a seamless way seems important, so therefore the default is
> no delimiter for the raw format, so that the entire input is read as one data
> value for COPY FROM, and all column data is concatenated without delimiter
> for COPY TO.
>
> When specifying a delimiter for the raw format, it separates *rows*, and can be
> a multi-byte string, such as E'\r\n' to handle Windows text files.
>
> This has been documented under the DELIMITER option, as well as under the
> Raw Format section.

We already restrict RAW to a single column. If RAW has no default
delimiter, then COPY FROM a text file will become one datum value, which
makes it look like importing a Large Object.
(https://www.postgresql.org/docs/17/lo-funcs.html)

I think, most of the time, you have more than one row/value to import
and export?

> The refactoring is now in a separate first single commit, which seems
> necessary, to separate the new functionality from the refactoring.

I agree.

In ProcessCopyOptions:

    /* Extract options from the statement node tree */
    foreach(option, options)
    {
    }
    /* --- DELIMITER option --- */
    /* --- NULL option --- */
    /* --- QUOTE option --- */

Currently the regress tests pass; I think that means your refactoring is
fine.

In ProcessCopyOptions, maybe we can rearrange the code after the foreach
loop (foreach(option, options)) based on the parameter order in the
Parameters section of
https://www.postgresql.org/docs/devel/sql-copy.html, so we can review it
by comparing the refactoring with that section's description.

> > We already did column length checking at BeginCopyTo.
> > no need to "if (list_length(cstate->attnumlist) != 1)" error check in
> > CopyOneRowTo?
>
> Hmm, not sure really, since DoCopy() calls both BeginCopyTo()
> and DoCopyTo() which in turn calls CopyOneRowTo(),
> but CopyOneRowTo() is also being called from copy_dest_receive().

BeginCopyTo does the preparation work:

    cstate->attnumlist = CopyGetAttnums(tupDesc, cstate->rel, attnamelist);

After CopyGetAttnums, the number of attributes for COPY TO cannot be
changed, so checking the length of cstate->attnumlist right after the
CopyGetAttnums call seems fine to me. I think in CopyOneRowTo, we can
actually Assert(list_length(cstate->attnumlist) == 1) for the raw
format.

src10=# drop table if exists x;
create table x(a int);
COPY x from stdin (FORMAT raw);
DROP TABLE
CREATE TABLE
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 11
>> 12
>> \.
ERROR:  invalid input syntax for type integer: "11
12
"
CONTEXT:  COPY x, line 1, column a: "11
12
"

The above case means COPY FROM STDIN (FORMAT RAW) can only import one
single value (when successful). The user needs to specify something
like:

    COPY x from stdin (FORMAT raw, delimiter E'\n');

It seems the raw format defaulting to no delimiter is not user friendly.
On Mon, Oct 21, 2024, at 16:35, jian he wrote:
> make the ProcessCopyOptions process in the following order:
> 1. Extract options from the statement node tree
> 2. checking each option, if not there set default value.
> 3. checking for interdependent options
>
> I still think
> making step 2 aligned with the doc parameter section order will make it
> more readable.
>
> based on your patch
> (v12-0001-Refactor-ProcessCopyOptions-introduce-CopyFormat-enu.patch)
> I put some checking into step 3, making the step 2 checking order
> aligned with the doc.

Smart to move the interdependent checks to the designated section for
them; that's exactly the right place.

Really nice that the order in the code is now aligned with the doc
order.

/Joel
Hi,

On Sat, Oct 19, 2024 at 8:33 AM Joel Jacobson <joel@compiler.org> wrote:
>
> On Sat, Oct 19, 2024, at 12:13, jian he wrote:
> > We already restrict RAW to a single column.
> > if RAW has no default delimiter, then COPY FROM a text file will
> > become one datum value,
> > which makes it look like importing a Large Object.
> > (https://www.postgresql.org/docs/17/lo-funcs.html)
>
> The single datum value might not come from a physical column; it could be
> an aggregated JSON value, as in the example Daniel mentioned:
>
> On Wed, Oct 16, 2024, at 18:34, Daniel Verite wrote:
> > copy (select json_agg(col) from table) to 'file' RAW
> >
> > This is a variant of the discussion in [1] where the OP does:
> >
> > copy (select json_agg(row_to_json(t)) from <query> t) TO 'file'
> >
> > and he complains that both text and csv "break the JSON".
> > That discussion morphed into a proposed patch adding JSON
> > format to COPY, but RAW would work directly as the OP
> > expected.
> >
> > That is, unless <query> happens to include JSON fields with LF/CRLF
> > in them, and the RAW format says this is an error condition.
> > In that case it's quite annoying to make it an error, rather than
> > simply let it pass.
> >
> > [1]
> > https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com
>
> In such cases, a user could perform the following:
>
> CREATE TABLE customers (id int, name text, email text);
>
> INSERT INTO customers (id, name, email) VALUES
> (1, 'John Doe', 'john.doe@example.com'),
> (2, 'Jane Smith', 'jane.smith@example.com'),
> (3, 'Alice Johnson', 'alice.johnson@example.com');
>
> COPY (SELECT json_agg(row_to_json(t)) FROM customers t) TO '/tmp/file' (FORMAT raw);
>
> % cat /tmp/file
> [{"id":1,"name":"John Doe","email":"john.doe@example.com"}, {"id":2,"name":"Jane Smith","email":"jane.smith@example.com"},{"id":3,"name":"Alice Johnson","email":"alice.johnson@example.com"}]%
>
> > i think, most of the time, you have more than one row/value to import
> > and export?
>
> Yes, probably, but it might not be a physical row. It could be an aggregated
> one, like in the example above. When importing, it might be a large JSON array
> of objects that is imported into a temporary table and then deserialized into
> a proper schema.
>
> The need to load entire files is already fulfilled by pg_read_file(text) -> text,
> but there is no pg_write_file(), likely for security reasons.
> So COPY TO ... (FORMAT RAW) with no delimiter seems necessary,
> and then COPY FROM also needs to work accordingly.
>
> >> The refactoring is now in a separate first single commit, which seems
> >> necessary, to separate the new functionality from the refactoring.
> >
> > I agree.
> >
> > ProcessCopyOptions
> > /* Extract options from the statement node tree */
> > foreach(option, options)
> > {
> > }
> > /* --- DELIMITER option --- */
> > /* --- NULL option --- */
> > /* --- QUOTE option --- */
> > Currently the regress tests pass; i think that means your refactoring is fine.
>
> I believe that a passing test indicates it might be okay,
> but a failing test definitely means it's not. :D
>
> I've meticulously refactored one option at a time, checking which code in
> ProcessCopyOptions depends on each option field to ensure the semantics
> are preserved.
>
> I think the changes are easy to follow, and it's clear that each change is
> correct when looking at them individually, though it might be more challenging
> when viewing the total change.
>
> I've tried to minimize code movement, preserving as much of the original
> code placement as possible.
>
> > in ProcessCopyOptions, maybe we can rearrange the code after the
> > foreach loop (foreach(option, options))
> > based on the parameter order in the
> > https://www.postgresql.org/docs/devel/sql-copy.html Parameters section.
> > so we can review it by comparing the refactoring with the
> > sql-copy.html Parameters section's description.
>
> That would be nice, but unfortunately, it's not possible because the order of
> the option code blocks matters due to the setting of defaults in else/else
> if branches when an option is not specified.
>
> For example, in the documentation, DEFAULT precedes QUOTE,
> but in ProcessCopyOptions, the QUOTE code block must come before
> the DEFAULT code block due to the check:
>
> /* Don't allow the CSV quote char to appear in the default string. */
>
> I also believe there's value in minimizing code movement.
>
> >> > We already did column length checking at BeginCopyTo.
> >> > no need to "if (list_length(cstate->attnumlist) != 1)" error check in
> >> > CopyOneRowTo?
> >>
> >> Hmm, not sure really, since DoCopy() calls both BeginCopyTo()
> >> and DoCopyTo() which in turn calls CopyOneRowTo(),
> >> but CopyOneRowTo() is also being called from copy_dest_receive().
> >
> > BeginCopyTo does the preparation work.
> > cstate->attnumlist = CopyGetAttnums(tupDesc, cstate->rel, attnamelist);
> >
> > After CopyGetAttnums, the number of attributes for COPY TO cannot be changed.
> > right after CopyGetAttnums call then check the length of cstate->attnumlist
> > seems fine for me.
> > I think in CopyOneRowTo, we can actually
> > Assert(list_length(cstate->attnumlist) == 1).
> > for raw format.
>
> Right, I've changed it to an Assert instead.
>
> > src10=# drop table if exists x;
> > create table x(a int);
> > COPY x from stdin (FORMAT raw);
> > DROP TABLE
> > CREATE TABLE
> > Enter data to be copied followed by a newline.
> > End with a backslash and a period on a line by itself, or an EOF signal.
> >>> 11
> >>> 12
> >>> \.
> > ERROR:  invalid input syntax for type integer: "11
> > 12
> > "
> > CONTEXT:  COPY x, line 1, column a: "11
> > 12
> > "
> >
> > The above case means COPY FROM STDIN (FORMAT RAW) can only import one
> > single value (when successful).
> > user need to specify like:
> >
> > COPY x from stdin (FORMAT raw, delimiter E'\n');
> >
> > seems raw format default no delimiter is not user friendly.
>
> I have no idea if dealing with .json files that would contain newlines
> in between fields, and would therefore need to be imported "as is",
> is more common than dealing with e.g. .jsonl files where it's guaranteed
> each json value is on a single line.
>
> I think Jacob raised some valid concerns on automagically detecting
> newlines, that is how text/csv works, so I don't think we want that.
>
> Maybe the OS default EOL would be an OK default,
> if we want it to be the default delimiter, that is.
>
> I have no strong opinion here, except automagical newline detection seems
> like a bad idea.
>
> I'm fine with OS default EOL as the default for the delimiter,
> or no delimiter as the default.
>
> New patch attached.

I have one question. From the 0001 patch's commit message:

    No behavioral changes are intended; this is a pure refactoring to
    improve code clarity and maintainability.

Does the reorganization of the option validation done by this patch
also help make the 0002 patch simpler or smaller?
If not much, while it makes sense to me that introducing the CopyFormat
enum is required by the 0002 patch, I think we can discuss the
reorganization part separately. And I'd suggest the patch organization
would be:

0001: introduce CopyFormat and replace the csv_mode and binary fields with it.
0002: add the new 'raw' format.
0003: reorganize the option validations.

One benefit would be that we can remove the simple replacements like
'->binary' with '.format == COPY_FORMAT_BINARY' from the reorganization
patch, which makes it smaller. The 0001 and 0002 patches, which seem more
straightforward, are independent of the 0003 patch, and we can discuss
separately how to improve the option validations, including the new 'raw'
format.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 24, 2024 at 2:30 PM Joel Jacobson <joel@compiler.org> wrote:
>
> On Thu, Oct 24, 2024, at 03:54, Masahiko Sawada wrote:
> > I have one question:
> >
> > From the 0001 patch's commit message:
> >
> > No behavioral changes are intended; this is a pure refactoring to improve code
> > clarity and maintainability.
> >
> > Does the reorganization of the option validation done by this patch
> > also help make the 0002 patch simple or small?
>
> Thanks for the review. No, not much, except the changes necessary to
> ProcessCopyOptions for raw, without also refactoring it, makes it
> more complicated.
>
> > If not much, while it
> > makes sense to me that introducing the CopyFormat enum is required by
> > the 0002 patch, I think we can discuss the reorganization part
> > separately. And I'd suggest the patch organization would be:
> >
> > 0001: introduce CopyFormat and replace csv_mode and binary fields with it.
> > 0002: add new 'raw' format.
> > 0003: reorganize option validations.

    /* Check force_quote */
-   if (!opts_out->csv_mode && (opts_out->force_quote || opts_out->force_quote_all))
+   if (opts_out->format != COPY_FORMAT_CSV && (opts_out->force_quote ||
+       opts_out->force_quote_all))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
        /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */

Maybe this has a code indentation issue, since "if" and "opts_out" are in
the same column position.

It came to my mind to change the errmsg occurrences of "BINARY mode" and
"CSV mode" to "binary format" and "csv format" respectively. I think
"format" would be more accurate, but the change seems invasive, so I
guess we need to keep "mode".

Overall, v13-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch
looks good to me.
On Mon, Oct 28, 2024 at 3:21 AM Joel Jacobson <joel@compiler.org> wrote:
>
> On Mon, Oct 28, 2024, at 10:30, Joel Jacobson wrote:
> > On Mon, Oct 28, 2024, at 08:56, jian he wrote:
> >> /* Check force_quote */
> >> - if (!opts_out->csv_mode && (opts_out->force_quote ||
> >> opts_out->force_quote_all))
> >> + if (opts_out->format != COPY_FORMAT_CSV && (opts_out->force_quote ||
> >> + opts_out->force_quote_all))
> >> ereport(ERROR,
> >> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >> /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
> >>
> >> maybe this has a code indentation issue.
> >> since "if" and "opts_out" in the same column position.
> >
> > Thanks for review.
> >
> > I've fixed the indentation issues.
>
> I've now installed pgindent, and will use it from hereon, to avoid this
> class of problems.
>
> New version where all three patches are now indented using pgindent.

Thank you for updating the patch. Here are review comments on the v15
0002 patch:

When testing the patch with an empty delimiter, I got the following failure:

postgres(1:903898)=# copy hoge from '/tmp/tmp.raw' with (format 'raw', delimiter '');
TRAP: failed Assert("delim_len > 0"), File: "copyfromparse.c", Line: 1173, PID: 903898

---

-    else
+    else if (cstate->opts.format == COPY_FORMAT_TEXT)
         fldct = CopyReadAttributesText(cstate);
+    else
+    {
+        elog(ERROR, "unexpected COPY format: %d", cstate->opts.format);
+        pg_unreachable();
+    }

Since we already check the incompatible options with COPY_FORMAT_RAW and
default_print, I think it's better to add an assertion to make sure the
format is either COPY_FORMAT_CSV or COPY_FORMAT_TEXT, instead of using
elog(ERROR).

---

+/*
+ * CopyReadLineRawText - inner loop of CopyReadLine for raw text mode
+ */
+static bool
+CopyReadLineRawText(CopyFromState cstate)

This function has a lot of duplication with CopyReadLineText(). I think
it's better to modify CopyReadLineText() to support the 'raw' format,
rather than adding a separate function.
---

+    bool        read_entire_file = (cstate->opts.delim == NULL);
+    int         delim_len = cstate->opts.delim ? strlen(cstate->opts.delim) : 0;

I think we can use 'delim_len == 0' instead of read_entire_file.

---

+    if (read_entire_file)
+    {
+        /* Continue until EOF if reading entire file */
+        input_buf_ptr++;
+        continue;
+    }

In the case where we're reading the entire file as a single tuple, we
don't need to advance input_buf_ptr one byte at a time. Instead,
input_buf_ptr can jump to copy_buf_len, which is faster.

---

+    /* Check for delimiter, possibly multi-byte */
+    IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(delim_len - 1);
+    if (strncmp(&copy_input_buf[input_buf_ptr], cstate->opts.delim,
+                delim_len) == 0)
+    {
+        cstate->eol_type = EOL_CUSTOM;
+        input_buf_ptr += delim_len;
+        break;
+    }
+    input_buf_ptr++;

Similar to the above comment, I think we don't need to check the input
one character at a time. I guess it would be faster if we located the
delimiter string in the input_buf (e.g. using strstr()), and then moved
input_buf_ptr to the detected position.

---

+    /* Copy the entire line into attribute_buf */
+    memcpy(cstate->attribute_buf.data, cstate->line_buf.data,
+           cstate->line_buf.len);
+    cstate->attribute_buf.data[cstate->line_buf.len] = '\0';
+    cstate->attribute_buf.len = cstate->line_buf.len;

CopyReadAttributesRaw() just copies line_buf data to attribute_buf, which
seems to be a waste. I think we can have attribute_buf point to line_buf.
That way, we can skip the whole step 4 that is described in the comment
on top of copyfromparse.c:

 * [data source] --> raw_buf --> input_buf --> line_buf --> attribute_buf
 *                1.          2.            3.            4.

---

+static int
+CopyReadAttributesRaw(CopyFromState cstate)
+{
+    /* Enforce single column requirement */
+    if (cstate->max_fields != 1)
+    {
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("COPY with format 'raw' must specify exactly one column")));
+    }

This check should have already been done in BeginCopyFrom().
Is there any case where max_fields becomes != 1 while reading the input?

---

It's a bit odd to me to use the delimiter as an EOL marker in raw format,
but probably it's okay.

---

-    if (cstate->opts.format != COPY_FORMAT_BINARY)
+    if (cstate->opts.format == COPY_FORMAT_RAW &&
+        cstate->opts.delim != NULL)
+    {
+        /* Output the user-specified delimiter between rows */
+        CopySendString(cstate, cstate->opts.delim);
+    }
+    else if (cstate->opts.format == COPY_FORMAT_TEXT ||
+             cstate->opts.format == COPY_FORMAT_CSV)

Since it sends the delimiter as a string, even if we specify the
delimiter as '\n', it doesn't send a newline (i.e. ASCII LF, 10). For
example:

postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with (format 'raw', delimiter '\n');
{"j": 1}\npostgres(1:904427)=#

I think there is a similar problem in COPY FROM; if we set the delimiter
to '\n' when doing COPY FROM in raw format, it expects the string '\n'
as the line termination, not ASCII LF (10). Input data normally doesn't
use the string '\n' as a line termination.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 29, 2024 at 9:48 AM Joel Jacobson <joel@compiler.org> wrote:
>
> > ---
> > It's a bit odd to me to use the delimiter as a EOL marker in raw
> > format, but probably it's okay.
> >
> > ---
> > -    if (cstate->opts.format != COPY_FORMAT_BINARY)
> > +    if (cstate->opts.format == COPY_FORMAT_RAW &&
> > +        cstate->opts.delim != NULL)
> > +    {
> > +        /* Output the user-specified delimiter between rows */
> > +        CopySendString(cstate, cstate->opts.delim);
> > +    }
> > +    else if (cstate->opts.format == COPY_FORMAT_TEXT ||
> > +             cstate->opts.format == COPY_FORMAT_CSV)
> >
> > Since it sends the delimiter as a string, even if we specify the
> > delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10).
> > For example,
> >
> > postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with
> > (format 'raw', delimiter '\n');
> > {"j": 1}\npostgres(1:904427)=#
> >
> > I think there is a similar problem in COPY FROM; if we set a delimiter
> > to '\n' when doing COPY FROM in raw format, it expects the string '\n'
> > as a line termination but not ASCII LF(10). I think that input data
> > normally doesn't use the string '\n' as a line termination.
>
> You need to use E'\n' to get ASCII LF(10), since '\n' is just a delimiter
> consisting of backslash followed by "n".
>
> Is this a problem? Since any string can be used as delimiter,
> I think it would be strange if we parsed it and replaced the string
> with a different string.
>
> Another thought:
>
> Maybe we shouldn't default to no delimiter after all,
> maybe it would be better to default to the OS default EOL,

It seems useful for loading unstructured, non-delimited text files such
as log files. Users can set the delimiter to empty when loading an entire
file into a single column. On the other hand, if we make it the default,
it might not be necessary to allow the delimiter to be multi-byte. That
would be flexible, but not necessarily needed.
Also, it would be somewhat odd to me that we could use multi-byte
characters as the delimiter only in 'raw' mode, but not in 'text' mode.

> maybe a final delimiter should always be written at the end,
> so that when exporting a single json field, it would get exported
> to the text file with \n at the end, which is what most text editors
> do when saving a .json file.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com