Thread: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
Hi,
The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic. There would seem to be two relevant ways to ignore bad column input data - drop the entire row or just set the column value to null. I can see us wanting to provide the set to null option and in any case having the option name be explicit that it ignores the row seems like a good idea.
David J.
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston <david.g.johnston@gmail.com> wrote: > > Hi, > > The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic. There would seem to be two relevant waysto ignore bad column input data - drop the entire row or just set the column value to null. I can see us wanting toprovide the set to null option and in any case having the option name be explicit that it ignores the row seems like agood idea. two issue I found out while playing around with it; create table x1(a int not null, b int not null ); you can only do: COPY x1 from stdin (on_error 'null'); but you cannot do COPY x1 from stdin (on_error null); we need to hack the gram.y to escape the "null". I don't know how to make it work. related post I found: https://stackoverflow.com/questions/31786611/how-to-escape-flex-keyword another issue: COPY x1 from stdin (on_error null); when we already have `not null` top level constraint for table x1. Do we need an error immediately? "on_error null" seems to conflict with `not null` constraint (assume refers to the same column). it may fail while doing bulk inserts while on_error is set to null because of violating a not null constraint.
On Sun, Jan 28, 2024 at 4:51 PM jian he <jian.universality@gmail.com> wrote:
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> Hi,
>
> The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic. There would seem to be two relevant ways to ignore bad column input data - drop the entire row or just set the column value to null. I can see us wanting to provide the set to null option and in any case having the option name be explicit that it ignores the row seems like a good idea.
two issue I found out while playing around with it;
create table x1(a int not null, b int not null );
another issue:
COPY x1 from stdin (on_error null);
when we already have `not null` top level constraint for table x1.
Do we need an error immediately?
"on_error null" seems to conflict with `not null` constraint (assume
refers to the same column).
it may fail while doing bulk inserts while on_error is set to null
because of violating a not null constraint.
You should not error immediately since whether or not there is a problem is table and data dependent. I would not check for the case of all columns being defined not null and just let the mismatch happen.
That said, maybe with this being a string we can accept something like: 'null, ignore'
And so if attempting to place any one null fails, assuming we can make that a soft error too, we would then ignore the entire row.
David J.
On Fri, 26 Jan 2024 08:08:29 -0700 "David G. Johnston" <david.g.johnston@gmail.com> wrote: > Hi, > > The option choice of "ignore" in the COPY ON_ERROR clause seems overly > generic. There would seem to be two relevant ways to ignore bad column > input data - drop the entire row or just set the column value to null. I > can see us wanting to provide the set to null option and in any case having > the option name be explicit that it ignores the row seems like a good idea. I am not in favour of renaming the option name "ignore", instead we can use another style of name for the option to set the column value to NULL, for example, "set_to_null". (Maybe, we can make a more generic option "set_to (col, val)" that can set the value of column specified by "col" value to the specified value "val" (e.g. 'N/A') on a soft error, although the syntax would be a bit complex...) IMO, it is more simple to define "ignore" as to skip the entire row rather than having variety of "ignore". Once defined it so, the option to set the column value to NULL should not be called "ignore" because values in other columns will be inserted. Regards, Yugo Nagata > > David J. -- Yugo NAGATA <nagata@sraoss.co.jp>
The idea of on_error is to tolerate errors, I think. if a column has a not null constraint, let it cannot be used with (on_error 'null') Based on this, I've made a patch. based on COPY Synopsis: ON_ERROR 'error_action' on_error 'null', the keyword NULL should be single quoted. demo: COPY check_ign_err FROM STDIN WITH (on_error 'null'); 1 {1} a 2 {2} 1 3 {3} 2 4 {4} b a {5} c \. \pset null NULL SELECT * FROM check_ign_err; n | m | k ------+-----+------ 1 | {1} | NULL 2 | {2} | 1 3 | {3} | 2 4 | {4} | NULL NULL | {5} | NULL
Attachment
Hi, On 2024-02-03 15:22, jian he wrote: > The idea of on_error is to tolerate errors, I think. > if a column has a not null constraint, let it cannot be used with > (on_error 'null') > + /* > + * we can specify on_error 'null', but it can only apply to > columns > + * don't have not null constraint. > + */ > + if (att->attnotnull && cstate->opts.on_error == > COPY_ON_ERROR_NULL) > + ereport(ERROR, > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > + errmsg("copy on_error 'null' cannot be used with > not null constraint column"))); This means we cannot use ON_ERROR 'null' even when there is one column which have NOT NULL constraint, i.e. primary key, right? IMHO this is strong constraint and will decrease the opportunity to use this feature. It might be better to allow error_action 'null' for tables which have NOT NULL constraint columns, and when facing soft errors for those rows, skip that row or stop COPY. > Based on this, I've made a patch. > based on COPY Synopsis: ON_ERROR 'error_action' > on_error 'null', the keyword NULL should be single quoted. As you mentioned, single quotation seems a little odd.. I'm not sure what is the best name and syntax for this feature, but since current error_action are verbs('stop' and 'ignore'), I feel 'null' might not be appropriate. > demo: > COPY check_ign_err FROM STDIN WITH (on_error 'null'); > 1 {1} a > 2 {2} 1 > 3 {3} 2 > 4 {4} b > a {5} c > \. > > \pset null NULL > > SELECT * FROM check_ign_err; > n | m | k > ------+-----+------ > 1 | {1} | NULL > 2 | {2} | 1 > 3 | {3} | 2 > 4 | {4} | NULL > NULL | {5} | NULL Since we notice the number of ignored rows when ON_ERROR is 'ignore', users may want to know the number of rows which was changed to NULL when using ON_ERROR 'null'. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > Hi, > > On 2024-02-03 15:22, jian he wrote: > > The idea of on_error is to tolerate errors, I think. > > if a column has a not null constraint, let it cannot be used with > > (on_error 'null') > > > + /* > > + * we can specify on_error 'null', but it can only apply to > > columns > > + * don't have not null constraint. > > + */ > > + if (att->attnotnull && cstate->opts.on_error == > > COPY_ON_ERROR_NULL) > > + ereport(ERROR, > > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > > + errmsg("copy on_error 'null' cannot be used with > > not null constraint column"))); > > This means we cannot use ON_ERROR 'null' even when there is one column > which have NOT NULL constraint, i.e. primary key, right? > IMHO this is strong constraint and will decrease the opportunity to use > this feature. I don't want to fail in the middle of bulk inserts, so I thought immediately erroring out would be a great idea. Let's see what other people think.
On Mon, 05 Feb 2024 11:28:59 +0900 torikoshia <torikoshia@oss.nttdata.com> wrote: > > Based on this, I've made a patch. > > based on COPY Synopsis: ON_ERROR 'error_action' > > on_error 'null', the keyword NULL should be single quoted. > > As you mentioned, single quotation seems a little odd.. > > I'm not sure what is the best name and syntax for this feature, but > since current error_action are verbs('stop' and 'ignore'), I feel 'null' > might not be appropriate. I am not in favour of using 'null' either, so I suggested to use "set_to_null" or more generic syntax like "set_to (col, val)" in my previous post[1], although I'm not convinced what is the best either. [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in > On Mon, 05 Feb 2024 11:28:59 +0900 > torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > Based on this, I've made a patch. > > > based on COPY Synopsis: ON_ERROR 'error_action' > > > on_error 'null', the keyword NULL should be single quoted. > > > > As you mentioned, single quotation seems a little odd.. > > > > I'm not sure what is the best name and syntax for this feature, but > > since current error_action are verbs('stop' and 'ignore'), I feel 'null' > > might not be appropriate. > > I am not in favour of using 'null' either, so I suggested to use > "set_to_null" or more generic syntax like "set_to (col, val)" in my > previous post[1], although I'm not convinced what is the best either. > > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp Tom sugggested using a separate option, and I agree with the suggestion. Taking this into consideration, I imagined something like the following, for example. Although I'm not sure we are actually going to do whole-tuple replacement, the action name in this example has the suffix '-column'. COPY (on_error 'replace-colomn', replacement 'null') .. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, 06 Feb 2024 09:39:09 +0900 (JST) Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in > > On Mon, 05 Feb 2024 11:28:59 +0900 > > torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > > > Based on this, I've made a patch. > > > > based on COPY Synopsis: ON_ERROR 'error_action' > > > > on_error 'null', the keyword NULL should be single quoted. > > > > > > As you mentioned, single quotation seems a little odd.. > > > > > > I'm not sure what is the best name and syntax for this feature, but > > > since current error_action are verbs('stop' and 'ignore'), I feel 'null' > > > might not be appropriate. > > > > I am not in favour of using 'null' either, so I suggested to use > > "set_to_null" or more generic syntax like "set_to (col, val)" in my > > previous post[1], although I'm not convinced what is the best either. > > > > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp > > Tom sugggested using a separate option, and I agree with the > suggestion. Taking this into consideration, I imagined something like > the following, for example. Although I'm not sure we are actually > going to do whole-tuple replacement, the action name in this example > has the suffix '-column'. > > COPY (on_error 'replace-colomn', replacement 'null') .. Thank you for your information. I've found a post[1] you mentioned, where adding a separate option for error log destination was suggested. Considering consistency with other options, adding a separate option would be better if we want to specify a value to replace the invalid value, without introducing a complex syntax that allows options with more than one parameters. Maybe, if we allow to use values for the replacement other than NULL, we have to also add a option to specify a column (or a type) for each replacement value. Or, we may add a option to specify a list of replacement values as many as the number of columns, each of whose default is NULL. Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option value. [1] https://www.postgresql.org/message-id/2070915.1705527477%40sss.pgh.pa.us Regards, Yugo Nagata > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center -- Yugo NAGATA <nagata@sraoss.co.jp>
On Tue, Feb 6, 2024 at 3:46 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote: > > On Tue, 06 Feb 2024 09:39:09 +0900 (JST) > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in > > > On Mon, 05 Feb 2024 11:28:59 +0900 > > > torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > > > > > Based on this, I've made a patch. > > > > > based on COPY Synopsis: ON_ERROR 'error_action' > > > > > on_error 'null', the keyword NULL should be single quoted. > > > > > > > > As you mentioned, single quotation seems a little odd.. > > > > > > > > I'm not sure what is the best name and syntax for this feature, but > > > > since current error_action are verbs('stop' and 'ignore'), I feel 'null' > > > > might not be appropriate. > > > > > > I am not in favour of using 'null' either, so I suggested to use > > > "set_to_null" or more generic syntax like "set_to (col, val)" in my > > > previous post[1], although I'm not convinced what is the best either. > > > > > > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp > > > > Tom sugggested using a separate option, and I agree with the > > suggestion. Taking this into consideration, I imagined something like > > the following, for example. Although I'm not sure we are actually > > going to do whole-tuple replacement, the action name in this example > > has the suffix '-column'. > > > > COPY (on_error 'replace-colomn', replacement 'null') .. > > Thank you for your information. I've found a post[1] you mentioned, > where adding a separate option for error log destination was suggested. > > Considering consistency with other options, adding a separate option > would be better if we want to specify a value to replace the invalid > value, without introducing a complex syntax that allows options with > more than one parameters. Maybe, if we allow to use values for the > replacement other than NULL, we have to also add a option to specify > a column (or a type) for each replacement value. Or, we may add a > option to specify a list of replacement values as many as the number of > columns, each of whose default is NULL. > > Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option > value. > Let's say tabe t column (a,b,c) if we support set_to_null(a,b), what should we do if column c has an error. should we ignore this row or error out immediately? also I am not sure it's doable to just extract columnList from the function defGetCopyOnErrorChoice. to make `COPY x from stdin (on_error set_to_null(a,b);` work, we may need to refactor to gram.y, in a similar way we do force null i am ok with COPY x from stdin (on_error set_to_null);
On Mon, 5 Feb 2024 14:26:46 +0800 jian he <jian.universality@gmail.com> wrote: > On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > Hi, > > > > On 2024-02-03 15:22, jian he wrote: > > > The idea of on_error is to tolerate errors, I think. > > > if a column has a not null constraint, let it cannot be used with > > > (on_error 'null') > > > > > + /* > > > + * we can specify on_error 'null', but it can only apply to > > > columns > > > + * don't have not null constraint. > > > + */ > > > + if (att->attnotnull && cstate->opts.on_error == > > > COPY_ON_ERROR_NULL) > > > + ereport(ERROR, > > > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > > > + errmsg("copy on_error 'null' cannot be used with > > > not null constraint column"))); > > > > This means we cannot use ON_ERROR 'null' even when there is one column > > which have NOT NULL constraint, i.e. primary key, right? > > IMHO this is strong constraint and will decrease the opportunity to use > > this feature. > > I don't want to fail in the middle of bulk inserts, > so I thought immediately erroring out would be a great idea. > Let's see what other people think. I also think this restriction is too strong because it is very common that a table has a primary key, unless there is some way to specify columns that can be set to NULL. Even when ON_ERROR is specified, any constraint violation errors cannot be generally ignored, so we cannot elimiate the posibility COPY FROM fails in the middle due to invalid data, anyway. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
attached v2. syntax: `on_error set_to_null` based on upthread discussion, now if you specified `on_error set_to_null` and your column has `not null` constraint, we convert the error field to null, so it may error while bulk inserting for violating NOT NULL constraint.
Attachment
Hi! On 12.02.24 01:00, jian he wrote: > attached v2. > syntax: `on_error set_to_null` > based on upthread discussion, now if you specified `on_error > set_to_null` and your column has `not > null` constraint, we convert the error field to null, so it may error > while bulk inserting for violating NOT NULL constraint. That's a very nice feature. Thanks for implementing it! v2 applies cleanly and works as described. \pset null '(NULL)' CREATE TEMPORARY TABLE t1 (a int, b int); COPY t1 (a,b) FROM STDIN; 1 a 2 1 3 2 4 b a c \. SELECT * FROM t1; CONTEXT: COPY t1, line 1, column b: "a" a | b ---+--- (0 rows) CREATE TEMPORARY TABLE t2 (a int, b int); COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null); 1 a 2 1 3 2 4 b a c \. SELECT * FROM t2; psql:test-copy-on_error-2.sql:12: NOTICE: some columns of 3 rows, value were converted to NULL due to data type incompatibility COPY 5 a | b --------+-------- 1 | (NULL) 2 | 1 3 | 2 4 | (NULL) (NULL) | (NULL) (5 rows) I have one question though: In case all columns of a record have been set to null due to data type incompatibility, should we insert it at all? See t2 example above. I'm not sure if these records would be of any use in the table. What do you think? Since the parameter is already called "set_to_null", maybe it is not necessary to mention in the NOTICE message that the values have been set to null. Perhaps something like "XX records were only partially copied due to data type incompatibility" -- Jim
On Fri, Feb 16, 2024 at 1:16 PM Jim Jones <jim.jones@uni-muenster.de> wrote:
In case all columns of a record have been set to null due to data type
incompatibility, should we insert it at all?
Yes. In particular not all columns in the table need be specified in the copy command so while the parsed input data is all nulls the record itself may not be.
The system should allow the user to exclude rows with incomplete data by ignoring a not null constraint violation.
In short we shouldn't judge non-usefulness and instead give tools to the user to decide for themselves.
David J.
On 16.02.24 21:31, David G. Johnston wrote: > Yes. In particular not all columns in the table need be specified in > the copy command so while the parsed input data is all nulls the > record itself may not be. Yeah, you have a point there. I guess if users want to avoid it to happen they can rely on NOT NULL constraints. Thanks -- Jim
Hi there On 26.08.24 02:00, jian he wrote: > hi all. > patch updated. > simplified the code a lot. > > idea is same: > COPY t_on_error_null FROM STDIN WITH (on_error set_to_null); > > If the STDIN number of columns is the same as the target table, then > InputFunctionCallSafe > call failure will make that column values be null. > > > If the STDIN number of columns is not the same as the target table, then error > ERROR: missing data for column \"%s\" > ERROR: extra data after last expected column > which is status quo. I wanted to give it another try, but the patch does not apply ... $ git apply ~/patches/copy_on_error/v3-0001-on_error-set_to_null.patch -v Checking patch doc/src/sgml/ref/copy.sgml... Checking patch src/backend/commands/copy.c... Checking patch src/backend/commands/copyfrom.c... Checking patch src/backend/commands/copyfromparse.c... Checking patch src/include/commands/copy.h... Checking patch src/test/regress/expected/copy2.out... error: while searching for: NOTICE: skipping row due to data type incompatibility at line 8 for column k: "a" CONTEXT: COPY check_ign_err NOTICE: 6 rows were skipped due to data type incompatibility -- tests for on_error option with log_verbosity and null constraint via domain CREATE DOMAIN dcheck_ign_err2 varchar(15) NOT NULL; CREATE TABLE check_ign_err2 (n int, m int[], k int, l dcheck_ign_err2); error: patch failed: src/test/regress/expected/copy2.out:753 error: src/test/regress/expected/copy2.out: patch does not apply Checking patch src/test/regress/sql/copy2.sql... -- Jim
On 12.09.24 12:13, jian he wrote: > please check the attached file. v4 applies cleanly, it works as expected, and all tests pass. postgres=# \pset null '(NULL)' Null display is "(NULL)". postgres=# CREATE TEMPORARY TABLE t2 (a int, b int); CREATE TABLE postgres=# COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null, format csv); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. >> 1,a >> 2,1 >> 3,2 >> 4,b >> a,c >> \. COPY 5 postgres=# SELECT * FROM t2; a | b --------+-------- 1 | (NULL) 2 | 1 3 | 2 4 | (NULL) (NULL) | (NULL) (5 rows) Perhaps small changes in the docs: <literal>set_to_null</literal> means the input value will set to <literal>null</literal> and continue with the next one. "will set" -> "will be set" "and continue with" -> "and will continue with" Other than that, LGTM. Thanks! -- Jim
On 2024/10/21 18:30, Kirill Reshke wrote: > v4 no longer applies. It now conflicts with > e7834a1a251d4a28245377f383ff20a657ba8262. > Also, there were review comments. > > So, I decided to rebase. Thanks for the patch! Here are my review comments: I noticed that on_error=set_to_null does not trigger NOTICE messages for rows and columns with errors. It's "unexpected" thing for columns to be silently replaced with NULL due to on_error=set_to_null. So, similar to on_error=ignore, there should be NOTICE messages indicating which input records had columns set to NULL because of data type incompatibility. Without these messages, users might not realize that some columns were set to NULL. How should on_error=set_to_null behave when reject_limit is set? It seems intuitive to trigger an error if the number of rows with columns' data type issues exceeds reject_limit, similar to on_error=ignore. This is open to discussion. psql's tab-completion should be updated to include SET_TO_NULL. An <replaceable class="parameter">error_action</replaceable> value of <literal>stop</literal> means fail the command, while <literal>ignore</literal> means discard the input row and continue with the next one. + <literal>set_to_null</literal> means the input value will be set to <literal>null</literal> and continue with thenext one. How about merging these two descriptions to one and updating it to the following? ------------------- An error_action value of stop means fail the command, ignore means discard the input row and continue with the next one, and set_to_null means replace columns with invalid input values with NULL and move to the next row. ------------------- The <literal>ignore</literal> option is applicable only for <command>COPY FROM</command> This should be "... ignore and set_to_null options are ..."? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Mon, Oct 21, 2024 at 8:39 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > On 2024/10/21 18:30, Kirill Reshke wrote: > > v4 no longer applies. It now conflicts with > > e7834a1a251d4a28245377f383ff20a657ba8262. > > Also, there were review comments. > > > > So, I decided to rebase. > > Thanks for the patch! Here are my review comments: > > I noticed that on_error=set_to_null does not trigger NOTICE messages for rows > and columns with errors. It's "unexpected" thing for columns to be silently > replaced with NULL due to on_error=set_to_null. So, similar to on_error=ignore, > there should be NOTICE messages indicating which input records had columns > set to NULL because of data type incompatibility. Without these messages, > users might not realize that some columns were set to NULL. > on_error=set_to_null, we have two options for CopyFromStateData->num_errors. A. Counting the number of rows that on_error set_to_null happened. B. Counting number of times that on_error set_to_null happened let's say optionA: ereport(NOTICE, errmsg_plural("%llu row was converted to NULL due to data type incompatibility", "%llu rows were converted to NULL due to data type incompatibility", (unsigned long long) cstate->num_errors, (unsigned long long) cstate->num_errors)); I doubt the above message is accurate. "%llu row was converted to NULL" can mean "%llu row, for each row, all columns was converted to NULL" but here we are "%llu row, for each row, some column (can be all columns) was converted to NULL" optionB: the message can be: errmsg_plural("converted to NULL due to data type incompatibility happened %llu time") but I aslo feel the wording is not perfect also. So overall I am not sure how to construct the NOTICE messages.
On 2024/10/26 6:03, Kirill Reshke wrote: > when the REJECT LIMIT is set to some non-zero number and the number of > row NULL replacements exceeds the limit, is it OK to fail. Because > there WAS errors, and we should not tolerate more than $limit errors . > I do find this behavior to be consistent. +1 > But what if we don't set a REJECT LIMIT, it is sane to do all > replacements, as if REJECT LIMIT is inf. +1 > But our REJECT LIMIT is zero > (not set). > So, we ignore zero REJECT LIMIT if set_to_null is set. REJECT_LIMIT currently has to be greater than zero, so it won’t ever be zero. > But while I was trying to implement that, I realized that I don't > understand v4 of this patch. My misunderstanding is about > `t_on_error_null` tests. We are allowed to insert a NULL value for the > first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why > do we do that? My thought is we should try to execute > InputFunctionCallSafe with NULL value (i mean, here [1]) for the > column after we failed to insert the input value. And, if this second > call is successful, we do replacement, otherwise we count the row as > erroneous. Your concern is valid. Allowing NULL to be stored in a column with a NOT NULL constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you suggested, NULL values set by SET_TO_NULL should probably be re-evaluated. > Hm, good catch. Applied almost as you suggested. I did tweak this > "replace columns with invalid input values with " into "replace > columns containing erroneous input values with". Is that OK? Yes, sounds good. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2024-11-09 21:55, Kirill Reshke wrote: Thanks for working on this! > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com> > wrote: >> >> >> >> On 2024/10/26 6:03, Kirill Reshke wrote: >> > when the REJECT LIMIT is set to some non-zero number and the number of >> > row NULL replacements exceeds the limit, is it OK to fail. Because >> > there WAS errors, and we should not tolerate more than $limit errors . >> > I do find this behavior to be consistent. >> >> +1 >> >> >> > But what if we don't set a REJECT LIMIT, it is sane to do all >> > replacements, as if REJECT LIMIT is inf. >> >> +1 > > After thinking for a while, I'm now more opposed to this approach. I > think we should count rows with erroneous data as errors only if > null substitution for these rows failed, not the total number of rows > which were modified. > Then, to respect the REJECT LIMIT option, we compare this number with > the limit. This is actually simpler approach IMHO. What do You think? IMHO I prefer the previous interpretation. I'm not sure this is what people expect, but I assume that REJECT_LIMIT is used to specify how many malformed rows are acceptable in the "original" data source. >> > But while I was trying to implement that, I realized that I don't >> > understand v4 of this patch. My misunderstanding is about >> > `t_on_error_null` tests. We are allowed to insert a NULL value for the >> > first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why >> > do we do that? My thought is we should try to execute >> > InputFunctionCallSafe with NULL value (i mean, here [1]) for the >> > column after we failed to insert the input value. And, if this second >> > call is successful, we do replacement, otherwise we count the row as >> > erroneous. >> >> Your concern is valid. Allowing NULL to be stored in a column with a >> NOT NULL >> constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you >> suggested, >> NULL values set by SET_TO_NULL should probably be re-evaluated. > > Thank you. I updated the patch with a NULL re-evaluation. > > > PFA v7. I did not yet update the doc for this patch version, waiting > for feedback about REJECT LIMIT + SET_TO_NULL behaviour. > There were warnings when I applied the patch: $ git apply v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:170: trailing whitespace. /* v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:181: trailing whitespace. v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:189: trailing whitespace. /* If datatype if okay with NULL, replace v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:196: trailing whitespace. v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:212: trailing whitespace. /* > @@ -403,12 +403,14 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState > *pstate, bool is_from) > parser_errposition(pstate, def->location))); ... > > - if (opts_out->reject_limit && !opts_out->on_error) > + if (opts_out->reject_limit && !(opts_out->on_error == > COPY_ON_ERROR_NULL || opts_out->on_error == COPY_ON_ERROR_IGNORE)) Hmm, is this change necessary? Personally, I feel the previous code is easier to read. "REJECT LIMIT" should be "REJECT_LIMIT"? > 1037 errhint("Consider specifying the > REJECT LIMIT option to skip erroneous rows."))); SET_TO_NULL is one of the value for ON_ERROR, but the patch treats SET_TO_NULL as option for COPY: 221 --- a/src/bin/psql/tab-complete.in.c 222 +++ b/src/bin/psql/tab-complete.in.c 223 @@ -3235,7 +3235,7 @@ match_previous_words(int pattern_id, 224 COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL", 225 "HEADER", "QUOTE", "ESCAPE", "FORCE_QUOTE", 226 "FORCE_NOT_NULL", "FORCE_NULL", "ENCODING", "DEFAULT", 227 - "ON_ERROR", "LOG_VERBOSITY"); 228 + "ON_ERROR", "SET_TO_NULL", "LOG_VERBOSITY"); > Best regards, > Kirill Reshke -- Regards, -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2024-11-09 21:55, Kirill Reshke wrote: > > Thanks for working on this! Thanks for reviewing the v7 patch series! > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com> > > wrote: > >> > >> > >> > >> On 2024/10/26 6:03, Kirill Reshke wrote: > >> > when the REJECT LIMIT is set to some non-zero number and the number of > >> > row NULL replacements exceeds the limit, is it OK to fail. Because > >> > there WAS errors, and we should not tolerate more than $limit errors . > >> > I do find this behavior to be consistent. > >> > >> +1 > >> > >> > >> > But what if we don't set a REJECT LIMIT, it is sane to do all > >> > replacements, as if REJECT LIMIT is inf. > >> > >> +1 > > > > After thinking for a while, I'm now more opposed to this approach. I > > think we should count rows with erroneous data as errors only if > > null substitution for these rows failed, not the total number of rows > > which were modified. > > Then, to respect the REJECT LIMIT option, we compare this number with > > the limit. This is actually simpler approach IMHO. What do You think? > > IMHO I prefer the previous interpretation. > I'm not sure this is what people expect, but I assume that REJECT_LIMIT > is used to specify how many malformed rows are acceptable in the > "original" data source. I do like the first version of interpretation, but I have a struggle with it. According to this interpretation, we will fail COPY command if the number of malformed data rows exceeds the limit, not the number of rejected rows (some percentage of malformed rows are accepted with null substitution) So, a proper name for the limit will be MALFORMED_LIMIT, or something. However, we are unable to change the name since the REJECT_LIMIT option has already been committed. I guess I'll just have to put up with this contradiction. I will send an updated patch shortly... > >> > But while I was trying to implement that, I realized that I don't > >> > understand v4 of this patch. My misunderstanding is about > >> > `t_on_error_null` tests. We are allowed to insert a NULL value for the > >> > first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why > >> > do we do that? My thought is we should try to execute > >> > InputFunctionCallSafe with NULL value (i mean, here [1]) for the > >> > column after we failed to insert the input value. And, if this second > >> > call is successful, we do replacement, otherwise we count the row as > >> > erroneous. > >> > >> Your concern is valid. Allowing NULL to be stored in a column with a > >> NOT NULL > >> constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you > >> suggested, > >> NULL values set by SET_TO_NULL should probably be re-evaluated. > > > > Thank you. I updated the patch with a NULL re-evaluation. > > > > > > PFA v7. I did not yet update the doc for this patch version, waiting > > for feedback about REJECT LIMIT + SET_TO_NULL behaviour. > > > > > There were warnings when I applied the patch: > > $ git apply > v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch > v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:170: > trailing whitespace. > /* > v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:181: > trailing whitespace. > > v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:189: > trailing whitespace. > /* If datatype if okay with NULL, > replace > v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:196: > trailing whitespace. > > v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:212: > trailing whitespace. > /* > > > @@ -403,12 +403,14 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState > > *pstate, bool is_from) > > parser_errposition(pstate, def->location))); > ... > > > > - if (opts_out->reject_limit && !opts_out->on_error) > > + if (opts_out->reject_limit && !(opts_out->on_error == > > COPY_ON_ERROR_NULL || opts_out->on_error == COPY_ON_ERROR_IGNORE)) > > Hmm, is this change necessary? > Personally, I feel the previous code is easier to read. > > > "REJECT LIMIT" should be "REJECT_LIMIT"? > > 1037 errhint("Consider specifying the > > REJECT LIMIT option to skip erroneous rows."))); > > > SET_TO_NULL is one of the value for ON_ERROR, but the patch treats > SET_TO_NULL as option for COPY: > > 221 --- a/src/bin/psql/tab-complete.in.c > 222 +++ b/src/bin/psql/tab-complete.in.c > 223 @@ -3235,7 +3235,7 @@ match_previous_words(int pattern_id, > 224 COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL", > 225 "HEADER", "QUOTE", "ESCAPE", "FORCE_QUOTE", > 226 "FORCE_NOT_NULL", "FORCE_NULL", "ENCODING", > "DEFAULT", > 227 - "ON_ERROR", "LOG_VERBOSITY"); > 228 + "ON_ERROR", "SET_TO_NULL", "LOG_VERBOSITY"); > > > > Best regards, > > Kirill Reshke > > -- > Regards, > > -- > Atsushi Torikoshi > Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K. -- Best regards, Kirill Reshke
On Tue, 12 Nov 2024 01:27:53 +0500 Kirill Reshke <reshkekirill@gmail.com> wrote: > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > On 2024-11-09 21:55, Kirill Reshke wrote: > > > > Thanks for working on this! > > Thanks for reviewing the v7 patch series! > > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com> > > > wrote: > > >> > > >> > > >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote: > > >> > when the REJECT LIMIT is set to some non-zero number and the number of > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because > > >> > there WAS errors, and we should not tolerate more than $limit errors . > > >> > I do find this behavior to be consistent. > > >> > > >> +1 > > >> > > >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all > > >> > replacements, as if REJECT LIMIT is inf. > > >> > > >> +1 > > > > > > After thinking for a while, I'm now more opposed to this approach. I > > > think we should count rows with erroneous data as errors only if > > > null substitution for these rows failed, not the total number of rows > > > which were modified. > > > Then, to respect the REJECT LIMIT option, we compare this number with > > > the limit. This is actually simpler approach IMHO. What do You think? > > > > IMHO I prefer the previous interpretation. > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT > > is used to specify how many malformed rows are acceptable in the > > "original" data source. I also prefer the previous version. > I do like the first version of interpretation, but I have a struggle > with it. According to this interpretation, we will fail COPY command > if the number > of malformed data rows exceeds the limit, not the number of rejected > rows (some percentage of malformed rows are accepted with null > substitution) > So, a proper name for the limit will be MALFORMED_LIMIT, or something. > However, we are unable to change the name since the REJECT_LIMIT > option has already been committed. > I guess I'll just have to put up with this contradiction. I will send > an updated patch shortly... I think we can rename the REJECT_LIMIT option because it is not yet released. The documentation says that REJECT_LIMIT "Specifies the maximum number of errors", and there are no wording "reject" in the description, so I wonder it is unclear what means in "REJECT" in REJECT_LIMIT. It may be proper to use ERROR_LIMIT since it is supposed to be used with ON_ERROR. Alternatively, if we emphasize that errors are handled other than terminating the command,perhaps MALFORMED_LIMIT as proposed above or TOLERANCE_LIMIT may be good, for example. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
On Tue, 12 Nov 2024 14:03:50 +0900 Yugo Nagata <nagata@sraoss.co.jp> wrote: > On Tue, 12 Nov 2024 01:27:53 +0500 > Kirill Reshke <reshkekirill@gmail.com> wrote: > > > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > > > On 2024-11-09 21:55, Kirill Reshke wrote: > > > > > > Thanks for working on this! > > > > Thanks for reviewing the v7 patch series! > > > > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com> > > > > wrote: > > > >> > > > >> > > > >> > > > >> On 2024/10/26 6:03, Kirill Reshke wrote: > > > >> > when the REJECT LIMIT is set to some non-zero number and the number of > > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because > > > >> > there WAS errors, and we should not tolerate more than $limit errors . > > > >> > I do find this behavior to be consistent. > > > >> > > > >> +1 > > > >> > > > >> > > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all > > > >> > replacements, as if REJECT LIMIT is inf. > > > >> > > > >> +1 > > > > > > > > After thinking for a while, I'm now more opposed to this approach. I > > > > think we should count rows with erroneous data as errors only if > > > > null substitution for these rows failed, not the total number of rows > > > > which were modified. > > > > Then, to respect the REJECT LIMIT option, we compare this number with > > > > the limit. This is actually simpler approach IMHO. What do You think? > > > > > > IMHO I prefer the previous interpretation. > > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT > > > is used to specify how many malformed rows are acceptable in the > > > "original" data source. > > I also prefer the previous version. > > > I do like the first version of interpretation, but I have a struggle > > with it. According to this interpretation, we will fail COPY command > > if the number > > of malformed data rows exceeds the limit, not the number of rejected > > rows (some percentage of malformed rows are accepted with null > > substitution) > > So, a proper name for the limit will be MALFORMED_LIMIT, or something. > > However, we are unable to change the name since the REJECT_LIMIT > > option has already been committed. > > I guess I'll just have to put up with this contradiction. I will send > > an updated patch shortly... > > I think we can rename the REJECT_LIMIT option because it is not yet released. > > The documentation says that REJECT_LIMIT "Specifies the maximum number of errors", > and there are no wording "reject" in the description, so I wonder it is unclear > what means in "REJECT" in REJECT_LIMIT. It may be proper to use ERROR_LIMIT > since it is supposed to be used with ON_ERROR. > > Alternatively, if we emphasize that errors are handled other than terminating > the command,perhaps MALFORMED_LIMIT as proposed above or TOLERANCE_LIMIT may be > good, for example. I might misunderstand the meaning of the name. If REJECT_LIMIT means "a limit on the number of rows with any malformed value allowed before the COPY command is rejected", we would not have to rename it. -- Yugo Nagata <nagata@sraoss.co.jp>
On 2024-11-12 14:17, Yugo Nagata wrote: > On Tue, 12 Nov 2024 14:03:50 +0900 > Yugo Nagata <nagata@sraoss.co.jp> wrote: > >> On Tue, 12 Nov 2024 01:27:53 +0500 >> Kirill Reshke <reshkekirill@gmail.com> wrote: >> >> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote: >> > > >> > > On 2024-11-09 21:55, Kirill Reshke wrote: >> > > >> > > Thanks for working on this! >> > >> > Thanks for reviewing the v7 patch series! >> > >> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com> >> > > > wrote: >> > > >> >> > > >> >> > > >> >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote: >> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of >> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because >> > > >> > there WAS errors, and we should not tolerate more than $limit errors . >> > > >> > I do find this behavior to be consistent. >> > > >> >> > > >> +1 >> > > >> >> > > >> >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all >> > > >> > replacements, as if REJECT LIMIT is inf. >> > > >> >> > > >> +1 >> > > > >> > > > After thinking for a while, I'm now more opposed to this approach. I >> > > > think we should count rows with erroneous data as errors only if >> > > > null substitution for these rows failed, not the total number of rows >> > > > which were modified. >> > > > Then, to respect the REJECT LIMIT option, we compare this number with >> > > > the limit. This is actually simpler approach IMHO. What do You think? >> > > >> > > IMHO I prefer the previous interpretation. >> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT >> > > is used to specify how many malformed rows are acceptable in the >> > > "original" data source. >> >> I also prefer the previous version. >> >> > I do like the first version of interpretation, but I have a struggle >> > with it. According to this interpretation, we will fail COPY command >> > if the number >> > of malformed data rows exceeds the limit, not the number of rejected >> > rows (some percentage of malformed rows are accepted with null >> > substitution) I feel your concern is valid. Currently 'reject' can occur only when converting a column's input value to its data type, but if we introduce set_to_null option 'reject' also occurs when inserting null, i.e. not null constraint. >> > So, a proper name for the limit will be MALFORMED_LIMIT, or something. >> > However, we are unable to change the name since the REJECT_LIMIT >> > option has already been committed. >> > I guess I'll just have to put up with this contradiction. I will send >> > an updated patch shortly... >> I think we can rename the REJECT_LIMIT option because it is not yet >> released. +1 >> The documentation says that REJECT_LIMIT "Specifies the maximum number >> of errors", >> and there are no wording "reject" in the description, so I wonder it >> is unclear >> what means in "REJECT" in REJECT_LIMIT. It may be proper to use >> ERROR_LIMIT >> since it is supposed to be used with ON_ERROR. >> >> Alternatively, if we emphasize that errors are handled other than >> terminating >> the command,perhaps MALFORMED_LIMIT as proposed above or >> TOLERANCE_LIMIT may be >> good, for example. > > I might misunderstand the meaning of the name. If REJECT_LIMIT means "a > limit on > the number of rows with any malformed value allowed before the COPY > command is > rejected", we would not have to rename it. The meaning of REJECT_LIMIT is what you described, and I think Kirill worries about cases when malformed rows are accepted(=not REJECTed) with null substitution. REJECT_LIMIT counts this case as REJECTed. -- Regards, -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
On Tue, 12 Nov 2024 17:38:25 +0900 torikoshia <torikoshia@oss.nttdata.com> wrote: > On 2024-11-12 14:17, Yugo Nagata wrote: > > On Tue, 12 Nov 2024 14:03:50 +0900 > > Yugo Nagata <nagata@sraoss.co.jp> wrote: > > > >> On Tue, 12 Nov 2024 01:27:53 +0500 > >> Kirill Reshke <reshkekirill@gmail.com> wrote: > >> > >> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote: > >> > > > >> > > On 2024-11-09 21:55, Kirill Reshke wrote: > >> > > > >> > > Thanks for working on this! > >> > > >> > Thanks for reviewing the v7 patch series! > >> > > >> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com> > >> > > > wrote: > >> > > >> > >> > > >> > >> > > >> > >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote: > >> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of > >> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because > >> > > >> > there WAS errors, and we should not tolerate more than $limit errors . > >> > > >> > I do find this behavior to be consistent. > >> > > >> > >> > > >> +1 > >> > > >> > >> > > >> > >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all > >> > > >> > replacements, as if REJECT LIMIT is inf. > >> > > >> > >> > > >> +1 > >> > > > > >> > > > After thinking for a while, I'm now more opposed to this approach. I > >> > > > think we should count rows with erroneous data as errors only if > >> > > > null substitution for these rows failed, not the total number of rows > >> > > > which were modified. > >> > > > Then, to respect the REJECT LIMIT option, we compare this number with > >> > > > the limit. This is actually simpler approach IMHO. What do You think? > >> > > > >> > > IMHO I prefer the previous interpretation. > >> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT > >> > > is used to specify how many malformed rows are acceptable in the > >> > > "original" data source. > >> > >> I also prefer the previous version. > >> > >> > I do like the first version of interpretation, but I have a struggle > >> > with it. According to this interpretation, we will fail COPY command > >> > if the number > >> > of malformed data rows exceeds the limit, not the number of rejected > >> > rows (some percentage of malformed rows are accepted with nulnot-null constraintl > >> > substitution) > > I feel your concern is valid. > Currently 'reject' can occur only when converting a column's input value > to its data type, but if we introduce set_to_null option 'reject' also > occurs when inserting null, i.e. not null constraint. I can suppose "reject" means failure of COPY command here, that is, a reject of executing the command, not an error of data input. If so, we can interpret REJECT_LIMIT as "the number of malformed rows allowed before the COPY command is REJECTed" (not the number of rejected rows). In this case, I think we don't have to rename the option name. Of course, if there is more proper name that makes it easy for users to understand the behaviour of the option, renaming should be nice. > >> The documentation says that REJECT_LIMIT "Specifies the maximum number > >> of errors", > >> and there are no wording "reject" in the description, so I wonder it > >> is unclear > >> what means in "REJECT" in REJECT_LIMIT. It may be proper to use > >> ERROR_LIMIT > >> since it is supposed to be used with ON_ERROR. > >> > >> Alternatively, if we emphasize that errors are handled other than > >> terminating > >> the command,perhaps MALFORMED_LIMIT as proposed above or > >> TOLERANCE_LIMIT may be > >> good, for example. > > > > I might misunderstand the meaning of the name. If REJECT_LIMIT means "a > > limit on > > the number of rows with any malformed value allowed before the COPY > > command is > > rejected", we would not have to rename it. > > The meaning of REJECT_LIMIT is what you described, and I think Kirill > worries about cases when malformed rows are accepted(=not REJECTed) with > null substitution. REJECT_LIMIT counts this case as REJECTed. I am a bit confused. You mean "REJECT" is raising a soft error of data input here instead of terminating COPY? Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
On 2024-11-13 22:02, Yugo NAGATA wrote: > On Tue, 12 Nov 2024 17:38:25 +0900 > torikoshia <torikoshia@oss.nttdata.com> wrote: > >> On 2024-11-12 14:17, Yugo Nagata wrote: >> > On Tue, 12 Nov 2024 14:03:50 +0900 >> > Yugo Nagata <nagata@sraoss.co.jp> wrote: >> > >> >> On Tue, 12 Nov 2024 01:27:53 +0500 >> >> Kirill Reshke <reshkekirill@gmail.com> wrote: >> >> >> >> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote: >> >> > > >> >> > > On 2024-11-09 21:55, Kirill Reshke wrote: >> >> > > >> >> > > Thanks for working on this! >> >> > >> >> > Thanks for reviewing the v7 patch series! >> >> > >> >> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com> >> >> > > > wrote: >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote: >> >> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of >> >> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because >> >> > > >> > there WAS errors, and we should not tolerate more than $limit errors . >> >> > > >> > I do find this behavior to be consistent. >> >> > > >> >> >> > > >> +1 >> >> > > >> >> >> > > >> >> >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all >> >> > > >> > replacements, as if REJECT LIMIT is inf. >> >> > > >> >> >> > > >> +1 >> >> > > > >> >> > > > After thinking for a while, I'm now more opposed to this approach. I >> >> > > > think we should count rows with erroneous data as errors only if >> >> > > > null substitution for these rows failed, not the total number of rows >> >> > > > which were modified. >> >> > > > Then, to respect the REJECT LIMIT option, we compare this number with >> >> > > > the limit. This is actually simpler approach IMHO. What do You think? >> >> > > >> >> > > IMHO I prefer the previous interpretation. >> >> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT >> >> > > is used to specify how many malformed rows are acceptable in the >> >> > > "original" data source. >> >> >> >> I also prefer the previous version. >> >> >> >> > I do like the first version of interpretation, but I have a struggle >> >> > with it. According to this interpretation, we will fail COPY command >> >> > if the number >> >> > of malformed data rows exceeds the limit, not the number of rejected >> >> > rows (some percentage of malformed rows are accepted with nulnot-null constraintl >> >> > substitution) >> >> I feel your concern is valid. >> Currently 'reject' can occur only when converting a column's input >> value >> to its data type, but if we introduce set_to_null option 'reject' also >> occurs when inserting null, i.e. not null constraint. > > I can suppose "reject" means failure of COPY command here, that is, a > reject > of executing the command, not an error of data input. If so, we can > interpret > REJECT_LIMIT as "the number of malformed rows allowed before the COPY > command > is REJECTed" (not the number of rejected rows). In this case, I think > we don't > have to rename the option name. > > Of course, if there is more proper name that makes it easy for users to > understand the behaviour of the option, renaming should be nice. > >> >> The documentation says that REJECT_LIMIT "Specifies the maximum number >> >> of errors", >> >> and there are no wording "reject" in the description, so I wonder it >> >> is unclear >> >> what means in "REJECT" in REJECT_LIMIT. It may be proper to use >> >> ERROR_LIMIT >> >> since it is supposed to be used with ON_ERROR. >> >> >> >> Alternatively, if we emphasize that errors are handled other than >> >> terminating >> >> the command,perhaps MALFORMED_LIMIT as proposed above or >> >> TOLERANCE_LIMIT may be >> >> good, for example. >> > >> > I might misunderstand the meaning of the name. If REJECT_LIMIT means "a >> > limit on >> > the number of rows with any malformed value allowed before the COPY >> > command is >> > rejected", we would not have to rename it. >> >> The meaning of REJECT_LIMIT is what you described, and I think Kirill >> worries about cases when malformed rows are accepted(=not REJECTed) >> with >> null substitution. REJECT_LIMIT counts this case as REJECTed. > > I am a bit confused. Me too:) Let me explain my understanding. I believe there are now two candidates that count as REJECT_LIMIT number: (1) error converting a column's input value into its data type(soft error) (2) NULL substitution failure(this comes from the proposed patch) And I understood Kirill's idea to be the following: 1st idea: REJECT_LIMIT counts (1) 2nd idea: REJECT_LIMIT counts (2) And I've agreed with the 1st one. > You mean "REJECT" is raising a soft error of data > input here instead of terminating COPY? Yes. -- Regards, -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
On Sat, Nov 16, 2024 at 5:55 PM Kirill Reshke <reshkekirill@gmail.com> wrote: > > I am attaching my v8 for reference. > in your v8. <varlistentry> <term><literal>REJECT_LIMIT</literal></term> <listitem> <para> Specifies the maximum number of errors tolerated while converting a column's input value to its data type, when <literal>ON_ERROR</literal> is set to <literal>ignore</literal>. If the input contains more erroneous rows than the specified value, the <command>COPY</command> command fails, even with <literal>ON_ERROR</literal> set to <literal>ignore</literal>. </para> </listitem> </varlistentry> then above description not meet with following example, (i think) create table t(a int not null); COPY t FROM STDIN WITH (on_error set_to_null, reject_limit 2); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. >> a >> \. ERROR: null value in column "a" of relation "t" violates not-null constraint DETAIL: Failing row contains (null). CONTEXT: COPY t, line 1, column a: "a" Overall, I think making the domain not-null align with column level not-null would be a good thing. <para> Specifies how to behave when encountering an error converting a column's input value into its data type. An <replaceable class="parameter">error_action</replaceable> value of <literal>stop</literal> means fail the command, <literal>ignore</literal> means discard the input row and continue with the next one, and <literal>set_to_null</literal> means replace columns containing erroneous input values with <literal>null</literal> and move to the next row. "and move to the next row" is wrong? I think it should be " and move to the next field".
On Tue, 19 Nov 2024, 13:52 jian he, <jian.universality@gmail.com> wrote: > > On Sat, Nov 16, 2024 at 5:55 PM Kirill Reshke <reshkekirill@gmail.com> wrote: > > > > I am attaching my v8 for reference. > > > > in your v8. > > <varlistentry> > <term><literal>REJECT_LIMIT</literal></term> > <listitem> > <para> > Specifies the maximum number of errors tolerated while converting a > column's input value to its data type, when <literal>ON_ERROR</literal> is > set to <literal>ignore</literal>. > If the input contains more erroneous rows than the specified > value, the <command>COPY</command> > command fails, even with <literal>ON_ERROR</literal> set to > <literal>ignore</literal>. > </para> > </listitem> > </varlistentry> > > then above description not meet with following example, (i think) > > create table t(a int not null); > COPY t FROM STDIN WITH (on_error set_to_null, reject_limit 2); > Enter data to be copied followed by a newline. > End with a backslash and a period on a line by itself, or an EOF signal. > >> a > >> \. > ERROR: null value in column "a" of relation "t" violates not-null constraint > DETAIL: Failing row contains (null). > CONTEXT: COPY t, line 1, column a: "a" Sure, my v8 does not helps with column level NOT NULL constraint (or other constraint) > Overall, I think > making the domain not-null align with column level not-null would be a > good thing. While this looks sane, it's actually a separate topic. Even on current HEAD we have domain not-null vs column level not-null unalignment. consider this example ``` reshke=# create table ftt2 (i int not null); CREATE TABLE reshke=# copy ftt2 from stdin with (reject_limit 1000, on_error ignore); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. >> \N >> \. ERROR: null value in column "i" of relation "ftt2" violates not-null constraint DETAIL: Failing row contains (null). CONTEXT: COPY ftt2, line 1: "\N" reshke=# create domain dd as int not null ; CREATE DOMAIN reshke=# create table ftt3(i dd); CREATE TABLE reshke=# copy ftt3 from stdin with (reject_limit 1000, on_error ignore); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. >> \N >> \. NOTICE: 1 row was skipped due to data type incompatibility COPY 0 reshke=# ``` So, if we want this, we need to start another thread and deal with REJECT_LIMIT + on_error ignore first. The ExecConstraints function is the source of the error in scenario 1. Therefore, we require something like "ExecConstraintsSafe" to accommodate the aforementioned. That is a significant change. Not sure it will be accepted by the community. > > <para> > Specifies how to behave when encountering an error converting a column's > input value into its data type. > An <replaceable class="parameter">error_action</replaceable> value of > <literal>stop</literal> means fail the command, > <literal>ignore</literal> means discard the input row and > continue with the next one, and > <literal>set_to_null</literal> means replace columns containing > erroneous input values with <literal>null</literal> and move to the > next row. > > "and move to the next row" is wrong? > I think it should be " and move to the next field". Yes, "and move to the next field" is correct.