Thread: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
Hi,
The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic. There would seem to be two relevant ways to ignore bad column input data - drop the entire row or just set the column value to null. I can see us wanting to provide the set to null option and in any case having the option name be explicit that it ignores the row seems like a good idea.
David J.
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston <david.g.johnston@gmail.com> wrote: > > Hi, > > The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic. There would seem to be two relevant waysto ignore bad column input data - drop the entire row or just set the column value to null. I can see us wanting toprovide the set to null option and in any case having the option name be explicit that it ignores the row seems like agood idea. two issue I found out while playing around with it; create table x1(a int not null, b int not null ); you can only do: COPY x1 from stdin (on_error 'null'); but you cannot do COPY x1 from stdin (on_error null); we need to hack the gram.y to escape the "null". I don't know how to make it work. related post I found: https://stackoverflow.com/questions/31786611/how-to-escape-flex-keyword another issue: COPY x1 from stdin (on_error null); when we already have `not null` top level constraint for table x1. Do we need an error immediately? "on_error null" seems to conflict with `not null` constraint (assume refers to the same column). it may fail while doing bulk inserts while on_error is set to null because of violating a not null constraint.
On Sun, Jan 28, 2024 at 4:51 PM jian he <jian.universality@gmail.com> wrote:
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> Hi,
>
> The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic. There would seem to be two relevant ways to ignore bad column input data - drop the entire row or just set the column value to null. I can see us wanting to provide the set to null option and in any case having the option name be explicit that it ignores the row seems like a good idea.
two issue I found out while playing around with it;
create table x1(a int not null, b int not null );
another issue:
COPY x1 from stdin (on_error null);
when we already have `not null` top level constraint for table x1.
Do we need an error immediately?
"on_error null" seems to conflict with `not null` constraint (assume
refers to the same column).
it may fail while doing bulk inserts while on_error is set to null
because of violating a not null constraint.
You should not error immediately since whether or not there is a problem is table and data dependent. I would not check for the case of all columns being defined not null and just let the mismatch happen.
That said, maybe with this being a string we can accept something like: 'null, ignore'
And so if attempting to place any one null fails, assuming we can make that a soft error too, we would then ignore the entire row.
David J.
On Fri, 26 Jan 2024 08:08:29 -0700 "David G. Johnston" <david.g.johnston@gmail.com> wrote: > Hi, > > The option choice of "ignore" in the COPY ON_ERROR clause seems overly > generic. There would seem to be two relevant ways to ignore bad column > input data - drop the entire row or just set the column value to null. I > can see us wanting to provide the set to null option and in any case having > the option name be explicit that it ignores the row seems like a good idea. I am not in favour of renaming the option name "ignore", instead we can use another style of name for the option to set the column value to NULL, for example, "set_to_null". (Maybe, we can make a more generic option "set_to (col, val)" that can set the value of column specified by "col" value to the specified value "val" (e.g. 'N/A') on a soft error, although the syntax would be a bit complex...) IMO, it is more simple to define "ignore" as to skip the entire row rather than having variety of "ignore". Once defined it so, the option to set the column value to NULL should not be called "ignore" because values in other columns will be inserted. Regards, Yugo Nagata > > David J. -- Yugo NAGATA <nagata@sraoss.co.jp>
The idea of on_error is to tolerate errors, I think. if a column has a not null constraint, let it cannot be used with (on_error 'null') Based on this, I've made a patch. based on COPY Synopsis: ON_ERROR 'error_action' on_error 'null', the keyword NULL should be single quoted. demo: COPY check_ign_err FROM STDIN WITH (on_error 'null'); 1 {1} a 2 {2} 1 3 {3} 2 4 {4} b a {5} c \. \pset null NULL SELECT * FROM check_ign_err; n | m | k ------+-----+------ 1 | {1} | NULL 2 | {2} | 1 3 | {3} | 2 4 | {4} | NULL NULL | {5} | NULL
Attachment
Hi, On 2024-02-03 15:22, jian he wrote: > The idea of on_error is to tolerate errors, I think. > if a column has a not null constraint, let it cannot be used with > (on_error 'null') > + /* > + * we can specify on_error 'null', but it can only apply to > columns > + * don't have not null constraint. > + */ > + if (att->attnotnull && cstate->opts.on_error == > COPY_ON_ERROR_NULL) > + ereport(ERROR, > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > + errmsg("copy on_error 'null' cannot be used with > not null constraint column"))); This means we cannot use ON_ERROR 'null' even when there is one column which have NOT NULL constraint, i.e. primary key, right? IMHO this is strong constraint and will decrease the opportunity to use this feature. It might be better to allow error_action 'null' for tables which have NOT NULL constraint columns, and when facing soft errors for those rows, skip that row or stop COPY. > Based on this, I've made a patch. > based on COPY Synopsis: ON_ERROR 'error_action' > on_error 'null', the keyword NULL should be single quoted. As you mentioned, single quotation seems a little odd.. I'm not sure what is the best name and syntax for this feature, but since current error_action are verbs('stop' and 'ignore'), I feel 'null' might not be appropriate. > demo: > COPY check_ign_err FROM STDIN WITH (on_error 'null'); > 1 {1} a > 2 {2} 1 > 3 {3} 2 > 4 {4} b > a {5} c > \. > > \pset null NULL > > SELECT * FROM check_ign_err; > n | m | k > ------+-----+------ > 1 | {1} | NULL > 2 | {2} | 1 > 3 | {3} | 2 > 4 | {4} | NULL > NULL | {5} | NULL Since we notice the number of ignored rows when ON_ERROR is 'ignore', users may want to know the number of rows which was changed to NULL when using ON_ERROR 'null'. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > Hi, > > On 2024-02-03 15:22, jian he wrote: > > The idea of on_error is to tolerate errors, I think. > > if a column has a not null constraint, let it cannot be used with > > (on_error 'null') > > > + /* > > + * we can specify on_error 'null', but it can only apply to > > columns > > + * don't have not null constraint. > > + */ > > + if (att->attnotnull && cstate->opts.on_error == > > COPY_ON_ERROR_NULL) > > + ereport(ERROR, > > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > > + errmsg("copy on_error 'null' cannot be used with > > not null constraint column"))); > > This means we cannot use ON_ERROR 'null' even when there is one column > which have NOT NULL constraint, i.e. primary key, right? > IMHO this is strong constraint and will decrease the opportunity to use > this feature. I don't want to fail in the middle of bulk inserts, so I thought immediately erroring out would be a great idea. Let's see what other people think.
On Mon, 05 Feb 2024 11:28:59 +0900 torikoshia <torikoshia@oss.nttdata.com> wrote: > > Based on this, I've made a patch. > > based on COPY Synopsis: ON_ERROR 'error_action' > > on_error 'null', the keyword NULL should be single quoted. > > As you mentioned, single quotation seems a little odd.. > > I'm not sure what is the best name and syntax for this feature, but > since current error_action are verbs('stop' and 'ignore'), I feel 'null' > might not be appropriate. I am not in favour of using 'null' either, so I suggested to use "set_to_null" or more generic syntax like "set_to (col, val)" in my previous post[1], although I'm not convinced what is the best either. [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in > On Mon, 05 Feb 2024 11:28:59 +0900 > torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > Based on this, I've made a patch. > > > based on COPY Synopsis: ON_ERROR 'error_action' > > > on_error 'null', the keyword NULL should be single quoted. > > > > As you mentioned, single quotation seems a little odd.. > > > > I'm not sure what is the best name and syntax for this feature, but > > since current error_action are verbs('stop' and 'ignore'), I feel 'null' > > might not be appropriate. > > I am not in favour of using 'null' either, so I suggested to use > "set_to_null" or more generic syntax like "set_to (col, val)" in my > previous post[1], although I'm not convinced what is the best either. > > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp Tom sugggested using a separate option, and I agree with the suggestion. Taking this into consideration, I imagined something like the following, for example. Although I'm not sure we are actually going to do whole-tuple replacement, the action name in this example has the suffix '-column'. COPY (on_error 'replace-colomn', replacement 'null') .. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, 06 Feb 2024 09:39:09 +0900 (JST) Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in > > On Mon, 05 Feb 2024 11:28:59 +0900 > > torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > > > Based on this, I've made a patch. > > > > based on COPY Synopsis: ON_ERROR 'error_action' > > > > on_error 'null', the keyword NULL should be single quoted. > > > > > > As you mentioned, single quotation seems a little odd.. > > > > > > I'm not sure what is the best name and syntax for this feature, but > > > since current error_action are verbs('stop' and 'ignore'), I feel 'null' > > > might not be appropriate. > > > > I am not in favour of using 'null' either, so I suggested to use > > "set_to_null" or more generic syntax like "set_to (col, val)" in my > > previous post[1], although I'm not convinced what is the best either. > > > > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp > > Tom sugggested using a separate option, and I agree with the > suggestion. Taking this into consideration, I imagined something like > the following, for example. Although I'm not sure we are actually > going to do whole-tuple replacement, the action name in this example > has the suffix '-column'. > > COPY (on_error 'replace-colomn', replacement 'null') .. Thank you for your information. I've found a post[1] you mentioned, where adding a separate option for error log destination was suggested. Considering consistency with other options, adding a separate option would be better if we want to specify a value to replace the invalid value, without introducing a complex syntax that allows options with more than one parameters. Maybe, if we allow to use values for the replacement other than NULL, we have to also add a option to specify a column (or a type) for each replacement value. Or, we may add a option to specify a list of replacement values as many as the number of columns, each of whose default is NULL. Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option value. [1] https://www.postgresql.org/message-id/2070915.1705527477%40sss.pgh.pa.us Regards, Yugo Nagata > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center -- Yugo NAGATA <nagata@sraoss.co.jp>
On Tue, Feb 6, 2024 at 3:46 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote: > > On Tue, 06 Feb 2024 09:39:09 +0900 (JST) > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in > > > On Mon, 05 Feb 2024 11:28:59 +0900 > > > torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > > > > > Based on this, I've made a patch. > > > > > based on COPY Synopsis: ON_ERROR 'error_action' > > > > > on_error 'null', the keyword NULL should be single quoted. > > > > > > > > As you mentioned, single quotation seems a little odd.. > > > > > > > > I'm not sure what is the best name and syntax for this feature, but > > > > since current error_action are verbs('stop' and 'ignore'), I feel 'null' > > > > might not be appropriate. > > > > > > I am not in favour of using 'null' either, so I suggested to use > > > "set_to_null" or more generic syntax like "set_to (col, val)" in my > > > previous post[1], although I'm not convinced what is the best either. > > > > > > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp > > > > Tom sugggested using a separate option, and I agree with the > > suggestion. Taking this into consideration, I imagined something like > > the following, for example. Although I'm not sure we are actually > > going to do whole-tuple replacement, the action name in this example > > has the suffix '-column'. > > > > COPY (on_error 'replace-colomn', replacement 'null') .. > > Thank you for your information. I've found a post[1] you mentioned, > where adding a separate option for error log destination was suggested. > > Considering consistency with other options, adding a separate option > would be better if we want to specify a value to replace the invalid > value, without introducing a complex syntax that allows options with > more than one parameters. Maybe, if we allow to use values for the > replacement other than NULL, we have to also add a option to specify > a column (or a type) for each replacement value. Or, we may add a > option to specify a list of replacement values as many as the number of > columns, each of whose default is NULL. > > Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option > value. > Let's say tabe t column (a,b,c) if we support set_to_null(a,b), what should we do if column c has an error. should we ignore this row or error out immediately? also I am not sure it's doable to just extract columnList from the function defGetCopyOnErrorChoice. to make `COPY x from stdin (on_error set_to_null(a,b);` work, we may need to refactor to gram.y, in a similar way we do force null i am ok with COPY x from stdin (on_error set_to_null);
On Mon, 5 Feb 2024 14:26:46 +0800 jian he <jian.universality@gmail.com> wrote: > On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > > > Hi, > > > > On 2024-02-03 15:22, jian he wrote: > > > The idea of on_error is to tolerate errors, I think. > > > if a column has a not null constraint, let it cannot be used with > > > (on_error 'null') > > > > > + /* > > > + * we can specify on_error 'null', but it can only apply to > > > columns > > > + * don't have not null constraint. > > > + */ > > > + if (att->attnotnull && cstate->opts.on_error == > > > COPY_ON_ERROR_NULL) > > > + ereport(ERROR, > > > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > > > + errmsg("copy on_error 'null' cannot be used with > > > not null constraint column"))); > > > > This means we cannot use ON_ERROR 'null' even when there is one column > > which have NOT NULL constraint, i.e. primary key, right? > > IMHO this is strong constraint and will decrease the opportunity to use > > this feature. > > I don't want to fail in the middle of bulk inserts, > so I thought immediately erroring out would be a great idea. > Let's see what other people think. I also think this restriction is too strong because it is very common that a table has a primary key, unless there is some way to specify columns that can be set to NULL. Even when ON_ERROR is specified, any constraint violation errors cannot be generally ignored, so we cannot elimiate the posibility COPY FROM fails in the middle due to invalid data, anyway. Regards, Yugo Nagata -- Yugo NAGATA <nagata@sraoss.co.jp>
attached v2. syntax: `on_error set_to_null` based on upthread discussion, now if you specified `on_error set_to_null` and your column has `not null` constraint, we convert the error field to null, so it may error while bulk inserting for violating NOT NULL constraint.
Attachment
Hi! On 12.02.24 01:00, jian he wrote: > attached v2. > syntax: `on_error set_to_null` > based on upthread discussion, now if you specified `on_error > set_to_null` and your column has `not > null` constraint, we convert the error field to null, so it may error > while bulk inserting for violating NOT NULL constraint. That's a very nice feature. Thanks for implementing it! v2 applies cleanly and works as described. \pset null '(NULL)' CREATE TEMPORARY TABLE t1 (a int, b int); COPY t1 (a,b) FROM STDIN; 1 a 2 1 3 2 4 b a c \. SELECT * FROM t1; CONTEXT: COPY t1, line 1, column b: "a" a | b ---+--- (0 rows) CREATE TEMPORARY TABLE t2 (a int, b int); COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null); 1 a 2 1 3 2 4 b a c \. SELECT * FROM t2; psql:test-copy-on_error-2.sql:12: NOTICE: some columns of 3 rows, value were converted to NULL due to data type incompatibility COPY 5 a | b --------+-------- 1 | (NULL) 2 | 1 3 | 2 4 | (NULL) (NULL) | (NULL) (5 rows) I have one question though: In case all columns of a record have been set to null due to data type incompatibility, should we insert it at all? See t2 example above. I'm not sure if these records would be of any use in the table. What do you think? Since the parameter is already called "set_to_null", maybe it is not necessary to mention in the NOTICE message that the values have been set to null. Perhaps something like "XX records were only partially copied due to data type incompatibility" -- Jim
On Fri, Feb 16, 2024 at 1:16 PM Jim Jones <jim.jones@uni-muenster.de> wrote:
In case all columns of a record have been set to null due to data type
incompatibility, should we insert it at all?
Yes. In particular not all columns in the table need be specified in the copy command so while the parsed input data is all nulls the record itself may not be.
The system should allow the user to exclude rows with incomplete data by ignoring a not null constraint violation.
In short we shouldn't judge non-usefulness and instead give tools to the user to decide for themselves.
David J.
On 16.02.24 21:31, David G. Johnston wrote: > Yes. In particular not all columns in the table need be specified in > the copy command so while the parsed input data is all nulls the > record itself may not be. Yeah, you have a point there. I guess if users want to avoid it to happen they can rely on NOT NULL constraints. Thanks -- Jim
Hi there On 26.08.24 02:00, jian he wrote: > hi all. > patch updated. > simplified the code a lot. > > idea is same: > COPY t_on_error_null FROM STDIN WITH (on_error set_to_null); > > If the STDIN number of columns is the same as the target table, then > InputFunctionCallSafe > call failure will make that column values be null. > > > If the STDIN number of columns is not the same as the target table, then error > ERROR: missing data for column \"%s\" > ERROR: extra data after last expected column > which is status quo. I wanted to give it another try, but the patch does not apply ... $ git apply ~/patches/copy_on_error/v3-0001-on_error-set_to_null.patch -v Checking patch doc/src/sgml/ref/copy.sgml... Checking patch src/backend/commands/copy.c... Checking patch src/backend/commands/copyfrom.c... Checking patch src/backend/commands/copyfromparse.c... Checking patch src/include/commands/copy.h... Checking patch src/test/regress/expected/copy2.out... error: while searching for: NOTICE: skipping row due to data type incompatibility at line 8 for column k: "a" CONTEXT: COPY check_ign_err NOTICE: 6 rows were skipped due to data type incompatibility -- tests for on_error option with log_verbosity and null constraint via domain CREATE DOMAIN dcheck_ign_err2 varchar(15) NOT NULL; CREATE TABLE check_ign_err2 (n int, m int[], k int, l dcheck_ign_err2); error: patch failed: src/test/regress/expected/copy2.out:753 error: src/test/regress/expected/copy2.out: patch does not apply Checking patch src/test/regress/sql/copy2.sql... -- Jim
On 12.09.24 12:13, jian he wrote: > please check the attached file. v4 applies cleanly, it works as expected, and all tests pass. postgres=# \pset null '(NULL)' Null display is "(NULL)". postgres=# CREATE TEMPORARY TABLE t2 (a int, b int); CREATE TABLE postgres=# COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null, format csv); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. >> 1,a >> 2,1 >> 3,2 >> 4,b >> a,c >> \. COPY 5 postgres=# SELECT * FROM t2; a | b --------+-------- 1 | (NULL) 2 | 1 3 | 2 4 | (NULL) (NULL) | (NULL) (5 rows) Perhaps small changes in the docs: <literal>set_to_null</literal> means the input value will set to <literal>null</literal> and continue with the next one. "will set" -> "will be set" "and continue with" -> "and will continue with" Other than that, LGTM. Thanks! -- Jim