Thread: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
"David G. Johnston"
Date:
Hi,

The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic.  There would seem to be two relevant ways to ignore bad column input data - drop the entire row or just set the column value to null.  I can see us wanting to provide the set to null option and in any case having the option name be explicit that it ignores the row seems like a good idea.

David J.

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> Hi,
>
> The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic.  There would seem to be two relevant
waysto ignore bad column input data - drop the entire row or just set the column value to null.  I can see us wanting
toprovide the set to null option and in any case having the option name be explicit that it ignores the row seems like
agood idea. 

two issue I found out while playing around with it;
create table x1(a int not null, b int not null );

you can only do:
COPY x1 from stdin (on_error 'null');
but you cannot do
COPY x1 from stdin (on_error null);
we need to hack the gram.y to escape the "null". I don't know how to
make it work.
related post I found:
https://stackoverflow.com/questions/31786611/how-to-escape-flex-keyword

another issue:
COPY x1 from stdin (on_error null);

when we already have `not null` top level constraint for table x1.
Do we need an error immediately?
"on_error null" seems to conflict with `not null` constraint (assume
refers to the same column).
it may fail while doing bulk inserts while on_error is set to null
because of violating a not null constraint.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
"David G. Johnston"
Date:


On Sun, Jan 28, 2024 at 4:51 PM jian he <jian.universality@gmail.com> wrote:
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> Hi,
>
> The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic.  There would seem to be two relevant ways to ignore bad column input data - drop the entire row or just set the column value to null.  I can see us wanting to provide the set to null option and in any case having the option name be explicit that it ignores the row seems like a good idea.

two issue I found out while playing around with it;
create table x1(a int not null, b int not null );

another issue:
COPY x1 from stdin (on_error null);

when we already have `not null` top level constraint for table x1.
Do we need an error immediately?
"on_error null" seems to conflict with `not null` constraint (assume
refers to the same column).
it may fail while doing bulk inserts while on_error is set to null
because of violating a not null constraint.

You should not error immediately since whether or not there is a problem is table and data dependent.  I would not check for the case of all columns being defined not null and just let the mismatch happen.

That said, maybe with this being a string we can accept something like: 'null, ignore'

And so if attempting to place any one null fails, assuming we can make that a soft error too, we would then ignore the entire row.

David J.

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo NAGATA
Date:
On Fri, 26 Jan 2024 08:08:29 -0700
"David G. Johnston" <david.g.johnston@gmail.com> wrote:

> Hi,
> 
> The option choice of "ignore" in the COPY ON_ERROR clause seems overly
> generic.  There would seem to be two relevant ways to ignore bad column
> input data - drop the entire row or just set the column value to null.  I
> can see us wanting to provide the set to null option and in any case having
> the option name be explicit that it ignores the row seems like a good idea.

I am not in favour of renaming the option name "ignore", instead we can
use another style of name for the option to set the column value to NULL,
for example, "set_to_null".

(Maybe, we can make a more generic option "set_to (col, val)" that can set
the value of column specified by "col" value to the specified value "val" 
(e.g. 'N/A') on a soft error, although the syntax would be a bit complex...) 

IMO, it is more simple to define "ignore" as to skip the entire row rather
than having variety of "ignore". Once defined it so, the option to set the
column value to NULL should not be called "ignore" because values in other
columns will be inserted.

Regards,
Yugo Nagata

> 
> David J.


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
The idea of on_error is to tolerate errors, I think.
if a column has a not null constraint, let it cannot be used with
(on_error 'null')

Based on this, I've made a patch.
based on COPY Synopsis: ON_ERROR 'error_action'
on_error 'null', the  keyword NULL should be single quoted.

demo:
COPY check_ign_err FROM STDIN WITH (on_error 'null');
1 {1} a
2 {2} 1
3 {3} 2
4 {4} b
a {5} c
\.

\pset null NULL

SELECT * FROM check_ign_err;
  n   |  m  |  k
------+-----+------
    1 | {1} | NULL
    2 | {2} |    1
    3 | {3} |    2
    4 | {4} | NULL
 NULL | {5} | NULL

Attachment

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
torikoshia
Date:
Hi,

On 2024-02-03 15:22, jian he wrote:
> The idea of on_error is to tolerate errors, I think.
> if a column has a not null constraint, let it cannot be used with
> (on_error 'null')

> +       /*
> +        * we can specify on_error 'null', but it can only apply to 
> columns
> +        * don't have not null constraint.
> +       */
> +       if (att->attnotnull && cstate->opts.on_error == 
> COPY_ON_ERROR_NULL)
> +           ereport(ERROR,
> +                   (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> +                    errmsg("copy on_error 'null' cannot be used with 
> not null constraint column")));

This means we cannot use ON_ERROR 'null' even when there is one column 
which have NOT NULL constraint, i.e. primary key, right?
IMHO this is strong constraint and will decrease the opportunity to use 
this feature.

It might be better to allow error_action 'null' for tables which have 
NOT NULL constraint columns, and when facing soft errors for those rows, 
skip that row or stop COPY.

> Based on this, I've made a patch.
> based on COPY Synopsis: ON_ERROR 'error_action'
> on_error 'null', the  keyword NULL should be single quoted.

As you mentioned, single quotation seems a little odd..

I'm not sure what is the best name and syntax for this feature, but 
since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
might not be appropriate.

> demo:
> COPY check_ign_err FROM STDIN WITH (on_error 'null');
> 1 {1} a
> 2 {2} 1
> 3 {3} 2
> 4 {4} b
> a {5} c
> \.
> 
> \pset null NULL
> 
> SELECT * FROM check_ign_err;
>   n   |  m  |  k
> ------+-----+------
>     1 | {1} | NULL
>     2 | {2} |    1
>     3 | {3} |    2
>     4 | {4} | NULL
>  NULL | {5} | NULL

Since we notice the number of ignored rows when ON_ERROR is 'ignore', 
users may want to know the number of rows which was changed to NULL when 
using ON_ERROR 'null'.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> Hi,
>
> On 2024-02-03 15:22, jian he wrote:
> > The idea of on_error is to tolerate errors, I think.
> > if a column has a not null constraint, let it cannot be used with
> > (on_error 'null')
>
> > +       /*
> > +        * we can specify on_error 'null', but it can only apply to
> > columns
> > +        * don't have not null constraint.
> > +       */
> > +       if (att->attnotnull && cstate->opts.on_error ==
> > COPY_ON_ERROR_NULL)
> > +           ereport(ERROR,
> > +                   (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> > +                    errmsg("copy on_error 'null' cannot be used with
> > not null constraint column")));
>
> This means we cannot use ON_ERROR 'null' even when there is one column
> which have NOT NULL constraint, i.e. primary key, right?
> IMHO this is strong constraint and will decrease the opportunity to use
> this feature.

I don't want to fail in the middle of bulk inserts,
so I thought immediately erroring out would be a great idea.
Let's see what other people think.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo NAGATA
Date:
On Mon, 05 Feb 2024 11:28:59 +0900
torikoshia <torikoshia@oss.nttdata.com> wrote:

> > Based on this, I've made a patch.
> > based on COPY Synopsis: ON_ERROR 'error_action'
> > on_error 'null', the  keyword NULL should be single quoted.
> 
> As you mentioned, single quotation seems a little odd..
> 
> I'm not sure what is the best name and syntax for this feature, but 
> since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
> might not be appropriate.

I am not in favour of using 'null' either, so I suggested to use
"set_to_null" or more generic syntax like  "set_to (col, val)" in my
previous post[1], although I'm not convinced what is the best either.

[1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Kyotaro Horiguchi
Date:
At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in 
> On Mon, 05 Feb 2024 11:28:59 +0900
> torikoshia <torikoshia@oss.nttdata.com> wrote:
> 
> > > Based on this, I've made a patch.
> > > based on COPY Synopsis: ON_ERROR 'error_action'
> > > on_error 'null', the  keyword NULL should be single quoted.
> > 
> > As you mentioned, single quotation seems a little odd..
> > 
> > I'm not sure what is the best name and syntax for this feature, but 
> > since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
> > might not be appropriate.
> 
> I am not in favour of using 'null' either, so I suggested to use
> "set_to_null" or more generic syntax like  "set_to (col, val)" in my
> previous post[1], although I'm not convinced what is the best either.
> 
> [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp

Tom sugggested using a separate option, and I agree with the
suggestion. Taking this into consideration, I imagined something like
the following, for example.  Although I'm not sure we are actually
going to do whole-tuple replacement, the action name in this example
has the suffix '-column'.

COPY (on_error 'replace-colomn', replacement 'null') ..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo NAGATA
Date:
On Tue, 06 Feb 2024 09:39:09 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in 
> > On Mon, 05 Feb 2024 11:28:59 +0900
> > torikoshia <torikoshia@oss.nttdata.com> wrote:
> > 
> > > > Based on this, I've made a patch.
> > > > based on COPY Synopsis: ON_ERROR 'error_action'
> > > > on_error 'null', the  keyword NULL should be single quoted.
> > > 
> > > As you mentioned, single quotation seems a little odd..
> > > 
> > > I'm not sure what is the best name and syntax for this feature, but 
> > > since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
> > > might not be appropriate.
> > 
> > I am not in favour of using 'null' either, so I suggested to use
> > "set_to_null" or more generic syntax like  "set_to (col, val)" in my
> > previous post[1], although I'm not convinced what is the best either.
> > 
> > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp
> 
> Tom sugggested using a separate option, and I agree with the
> suggestion. Taking this into consideration, I imagined something like
> the following, for example.  Although I'm not sure we are actually
> going to do whole-tuple replacement, the action name in this example
> has the suffix '-column'.
> 
> COPY (on_error 'replace-colomn', replacement 'null') ..

Thank you for your information. I've found a post[1] you mentioned, 
where adding a separate option for error log destination was suggested. 

Considering consistency with other options, adding a separate option
would be better if we want to specify a value to replace the invalid
value, without introducing a complex syntax that allows options with
more than one parameters. Maybe, if we allow to use values for the
replacement other than NULL, we have to also add a option to specify
a column (or a type)  for each replacement value. Or,  we may add a
option to specify a list of replacement values as many as the number of
columns, each of whose default is NULL.

Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option
value.

[1] https://www.postgresql.org/message-id/2070915.1705527477%40sss.pgh.pa.us

Regards,
Yugo Nagata


> regards.
> 
> -- 
> Kyotaro Horiguchi
> NTT Open Source Software Center


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Tue, Feb 6, 2024 at 3:46 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote:
>
> On Tue, 06 Feb 2024 09:39:09 +0900 (JST)
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> > At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in
> > > On Mon, 05 Feb 2024 11:28:59 +0900
> > > torikoshia <torikoshia@oss.nttdata.com> wrote:
> > >
> > > > > Based on this, I've made a patch.
> > > > > based on COPY Synopsis: ON_ERROR 'error_action'
> > > > > on_error 'null', the  keyword NULL should be single quoted.
> > > >
> > > > As you mentioned, single quotation seems a little odd..
> > > >
> > > > I'm not sure what is the best name and syntax for this feature, but
> > > > since current error_action are verbs('stop' and 'ignore'), I feel 'null'
> > > > might not be appropriate.
> > >
> > > I am not in favour of using 'null' either, so I suggested to use
> > > "set_to_null" or more generic syntax like  "set_to (col, val)" in my
> > > previous post[1], although I'm not convinced what is the best either.
> > >
> > > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp
> >
> > Tom sugggested using a separate option, and I agree with the
> > suggestion. Taking this into consideration, I imagined something like
> > the following, for example.  Although I'm not sure we are actually
> > going to do whole-tuple replacement, the action name in this example
> > has the suffix '-column'.
> >
> > COPY (on_error 'replace-colomn', replacement 'null') ..
>
> Thank you for your information. I've found a post[1] you mentioned,
> where adding a separate option for error log destination was suggested.
>
> Considering consistency with other options, adding a separate option
> would be better if we want to specify a value to replace the invalid
> value, without introducing a complex syntax that allows options with
> more than one parameters. Maybe, if we allow to use values for the
> replacement other than NULL, we have to also add a option to specify
> a column (or a type)  for each replacement value. Or,  we may add a
> option to specify a list of replacement values as many as the number of
> columns, each of whose default is NULL.
>
> Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option
> value.
>

Let's say tabe t column (a,b,c)
if we support set_to_null(a,b), what should we do if column c has an error.
should we ignore this row or error out immediately?
also I am not sure it's doable to just extract columnList from the
function defGetCopyOnErrorChoice.

to make `COPY x from stdin (on_error set_to_null(a,b);` work,
we may need to refactor to gram.y, in a similar way we do force null

i am ok with
COPY x from stdin (on_error set_to_null);



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo NAGATA
Date:
On Mon, 5 Feb 2024 14:26:46 +0800
jian he <jian.universality@gmail.com> wrote:

> On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> >
> > Hi,
> >
> > On 2024-02-03 15:22, jian he wrote:
> > > The idea of on_error is to tolerate errors, I think.
> > > if a column has a not null constraint, let it cannot be used with
> > > (on_error 'null')
> >
> > > +       /*
> > > +        * we can specify on_error 'null', but it can only apply to
> > > columns
> > > +        * don't have not null constraint.
> > > +       */
> > > +       if (att->attnotnull && cstate->opts.on_error ==
> > > COPY_ON_ERROR_NULL)
> > > +           ereport(ERROR,
> > > +                   (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> > > +                    errmsg("copy on_error 'null' cannot be used with
> > > not null constraint column")));
> >
> > This means we cannot use ON_ERROR 'null' even when there is one column
> > which have NOT NULL constraint, i.e. primary key, right?
> > IMHO this is strong constraint and will decrease the opportunity to use
> > this feature.
> 
> I don't want to fail in the middle of bulk inserts,
> so I thought immediately erroring out would be a great idea.
> Let's see what other people think.

I also think this restriction is too strong because it is very
common that a table has a primary key, unless there is some way
to specify columns that can be set to NULL. Even when ON_ERROR
is specified, any constraint violation errors cannot be generally
ignored, so we cannot elimiate the posibility COPY FROM fails in
the middle due to invalid data, anyway.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
attached v2.
syntax: `on_error set_to_null`
based on upthread discussion, now if you specified `on_error
set_to_null` and your column has `not
null` constraint, we convert the error field to null, so it may error
while bulk inserting for violating NOT NULL constraint.

Attachment

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Jim Jones
Date:
Hi!

On 12.02.24 01:00, jian he wrote:
> attached v2.
> syntax: `on_error set_to_null`
> based on upthread discussion, now if you specified `on_error
> set_to_null` and your column has `not
> null` constraint, we convert the error field to null, so it may error
> while bulk inserting for violating NOT NULL constraint.
That's a very nice feature. Thanks for implementing it!

v2 applies cleanly and works as described.

\pset null '(NULL)'

CREATE TEMPORARY TABLE t1 (a int, b int);
COPY t1 (a,b) FROM STDIN;
1    a
2    1
3    2
4    b
a    c
\.
SELECT * FROM t1;

CONTEXT:  COPY t1, line 1, column b: "a"
 a | b
---+---
(0 rows)


CREATE TEMPORARY TABLE t2 (a int, b int);
COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null);
1    a
2    1
3    2
4    b
a    c
\.
SELECT * FROM t2;

psql:test-copy-on_error-2.sql:12: NOTICE:  some columns of 3 rows, value
were converted to NULL due to data type incompatibility
COPY 5
   a    |   b    
--------+--------
      1 | (NULL)
      2 |      1
      3 |      2
      4 | (NULL)
 (NULL) | (NULL)
(5 rows)


I have one question though:

In case all columns of a record have been set to null due to data type
incompatibility, should we insert it at all? See t2 example above.
I'm not sure if these records would be of any use in the table. What do
you think?

Since the parameter is already called "set_to_null", maybe it is not
necessary to mention in the NOTICE message that the values have been set
to null.
Perhaps something like "XX records were only partially copied due to
data type incompatibility"


-- 
Jim




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
"David G. Johnston"
Date:
On Fri, Feb 16, 2024 at 1:16 PM Jim Jones <jim.jones@uni-muenster.de> wrote:
In case all columns of a record have been set to null due to data type
incompatibility, should we insert it at all?

Yes.  In particular not all columns in the table need be specified in the copy command so while the parsed input data is all nulls the record itself may not be.

The system should allow the user to exclude rows with incomplete data by ignoring a not null constraint violation.

In short we shouldn't judge non-usefulness and instead give tools to the user to decide for themselves.

David J.

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Jim Jones
Date:

On 16.02.24 21:31, David G. Johnston wrote:
> Yes.  In particular not all columns in the table need be specified in
> the copy command so while the parsed input data is all nulls the
> record itself may not be.

Yeah, you have a point there.
I guess if users want to avoid it to happen they can rely on NOT NULL
constraints.

Thanks

-- 
Jim




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Jim Jones
Date:
Hi there

On 26.08.24 02:00, jian he wrote:
> hi all.
> patch updated.
> simplified the code a lot.
>
> idea is same:
> COPY t_on_error_null FROM STDIN WITH (on_error set_to_null);
>
> If the STDIN number of columns is the same as the target table, then
> InputFunctionCallSafe
> call failure will make that column values be null.
>
>
> If the STDIN number of columns is not the same as the target table, then error
> ERROR:  missing data for column \"%s\"
> ERROR:  extra data after last expected column
> which is status quo.

I wanted to give it another try, but the patch does not apply ...

$ git apply ~/patches/copy_on_error/v3-0001-on_error-set_to_null.patch -v
Checking patch doc/src/sgml/ref/copy.sgml...
Checking patch src/backend/commands/copy.c...
Checking patch src/backend/commands/copyfrom.c...
Checking patch src/backend/commands/copyfromparse.c...
Checking patch src/include/commands/copy.h...
Checking patch src/test/regress/expected/copy2.out...
error: while searching for:
NOTICE:  skipping row due to data type incompatibility at line 8 for
column k: "a"
CONTEXT:  COPY check_ign_err
NOTICE:  6 rows were skipped due to data type incompatibility
-- tests for on_error option with log_verbosity and null constraint via
domain
CREATE DOMAIN dcheck_ign_err2 varchar(15) NOT NULL;
CREATE TABLE check_ign_err2 (n int, m int[], k int, l dcheck_ign_err2);

error: patch failed: src/test/regress/expected/copy2.out:753
error: src/test/regress/expected/copy2.out: patch does not apply
Checking patch src/test/regress/sql/copy2.sql...


-- 
Jim




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Jim Jones
Date:

On 12.09.24 12:13, jian he wrote:
> please check the attached file.

v4 applies cleanly, it works as expected, and all tests pass.

postgres=# \pset null '(NULL)'
Null display is "(NULL)".

postgres=# CREATE TEMPORARY TABLE t2 (a int, b int);
CREATE TABLE

postgres=# COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null, format csv);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,a
>> 2,1
>> 3,2
>> 4,b
>> a,c
>> \.
COPY 5

postgres=# SELECT * FROM t2;
   a    |   b    
--------+--------
      1 | (NULL)
      2 |      1
      3 |      2
      4 | (NULL)
 (NULL) | (NULL)
(5 rows)


Perhaps small changes in the docs:

<literal>set_to_null</literal> means the input value will set to
<literal>null</literal> and continue with the next one.

"will set" -> "will be set"
"and continue with" -> "and will continue with"

Other than that, LGTM.

Thanks!

-- 
Jim




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Fujii Masao
Date:

On 2024/10/21 18:30, Kirill Reshke wrote:
> v4 no longer applies. It now conflicts with
> e7834a1a251d4a28245377f383ff20a657ba8262.
> Also, there were review comments.
> 
> So, I decided to rebase.

Thanks for the patch! Here are my review comments:

I noticed that on_error=set_to_null does not trigger NOTICE messages for rows
and columns with errors. It's "unexpected" thing for columns to be silently
replaced with NULL due to on_error=set_to_null. So, similar to on_error=ignore,
there should be NOTICE messages indicating which input records had columns
set to NULL because of data type incompatibility. Without these messages,
users might not realize that some columns were set to NULL.


How should on_error=set_to_null behave when reject_limit is set? It seems
intuitive to trigger an error if the number of rows with columns' data type
issues exceeds reject_limit, similar to on_error=ignore. This is open to discussion.


psql's tab-completion should be updated to include SET_TO_NULL.


        An <replaceable class="parameter">error_action</replaceable> value of
        <literal>stop</literal> means fail the command, while
        <literal>ignore</literal> means discard the input row and continue with the next one.
+      <literal>set_to_null</literal> means the input value will be set to <literal>null</literal> and continue with
thenext one.
 

How about merging these two descriptions to one and updating it to the following?

-------------------
An error_action value of stop means fail the command, ignore means discard the input
row and continue with the next one, and set_to_null means replace columns with invalid
input values with NULL and move to the next row.
-------------------


       The <literal>ignore</literal> option is applicable only for <command>COPY FROM</command>

This should be "... ignore and set_to_null options are ..."?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Mon, Oct 21, 2024 at 8:39 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> On 2024/10/21 18:30, Kirill Reshke wrote:
> > v4 no longer applies. It now conflicts with
> > e7834a1a251d4a28245377f383ff20a657ba8262.
> > Also, there were review comments.
> >
> > So, I decided to rebase.
>
> Thanks for the patch! Here are my review comments:
>
> I noticed that on_error=set_to_null does not trigger NOTICE messages for rows
> and columns with errors. It's "unexpected" thing for columns to be silently
> replaced with NULL due to on_error=set_to_null. So, similar to on_error=ignore,
> there should be NOTICE messages indicating which input records had columns
> set to NULL because of data type incompatibility. Without these messages,
> users might not realize that some columns were set to NULL.
>

on_error=set_to_null,
we have two options for CopyFromStateData->num_errors.
A. Counting the number of rows that on_error set_to_null happened.
B. Counting number of times that on_error set_to_null happened

let's say optionA:
        ereport(NOTICE,
                errmsg_plural("%llu row was converted to NULL due to
data type incompatibility",
                              "%llu rows were converted to NULL due to
data type incompatibility",
                              (unsigned long long) cstate->num_errors,
                              (unsigned long long) cstate->num_errors));

I doubt the above message is accurate.
"%llu row was converted to NULL"
can mean "%llu row, for each row, all columns was converted to NULL"
but here we are
"%llu row, for each row, some column (can be all columns) was converted to NULL"


optionB: the message can be:
errmsg_plural("converted to NULL due to data type incompatibility
happened %llu time")
but I aslo feel the wording is not perfect also.

So overall I am not sure how to construct the NOTICE messages.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Fujii Masao
Date:

On 2024/10/26 6:03, Kirill Reshke wrote:
> when the REJECT LIMIT is set to some non-zero number and the number of
> row NULL replacements exceeds the limit, is it OK to fail. Because
> there WAS errors, and we should not tolerate more than $limit errors .
> I do find this behavior to be consistent.

+1


> But what if we don't set a REJECT LIMIT, it is sane to do all
> replacements, as if REJECT LIMIT is inf.

+1


> But our REJECT LIMIT is zero
> (not set).
> So, we ignore zero REJECT LIMIT if set_to_null is set.

REJECT_LIMIT currently has to be greater than zero, so it won’t ever be zero.


> But while I was trying to implement that, I realized that I don't
> understand v4 of this patch. My misunderstanding is about
> `t_on_error_null` tests. We are allowed to insert a NULL value for the
> first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why
> do we do that? My thought is we should try to execute
> InputFunctionCallSafe with NULL value  (i mean, here [1]) for the
> column after we failed to insert the input value. And, if this second
> call is successful, we do replacement, otherwise we count the row as
> erroneous.

Your concern is valid. Allowing NULL to be stored in a column with a NOT NULL
constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you suggested,
NULL values set by SET_TO_NULL should probably be re-evaluated.


> Hm, good catch. Applied almost as you suggested. I did tweak this
> "replace columns with invalid input values with " into "replace
> columns containing erroneous input values with". Is that OK?

Yes, sounds good.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
torikoshia
Date:
On 2024-11-09 21:55, Kirill Reshke wrote:

Thanks for working on this!

> On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com> 
> wrote:
>> 
>> 
>> 
>> On 2024/10/26 6:03, Kirill Reshke wrote:
>> > when the REJECT LIMIT is set to some non-zero number and the number of
>> > row NULL replacements exceeds the limit, is it OK to fail. Because
>> > there WAS errors, and we should not tolerate more than $limit errors .
>> > I do find this behavior to be consistent.
>> 
>> +1
>> 
>> 
>> > But what if we don't set a REJECT LIMIT, it is sane to do all
>> > replacements, as if REJECT LIMIT is inf.
>> 
>> +1
> 
> After thinking for a while, I'm now more opposed to this approach. I
> think we should count rows with erroneous data as errors only if
> null substitution for these rows failed, not the total number of rows
> which were modified.
> Then, to respect the REJECT LIMIT option, we compare this number with
> the limit. This is actually simpler approach IMHO. What do You think?

IMHO I prefer the previous interpretation.
I'm not sure this is what people expect, but I assume that REJECT_LIMIT 
is used to specify how many malformed rows are acceptable in the 
"original" data source.

>> > But while I was trying to implement that, I realized that I don't
>> > understand v4 of this patch. My misunderstanding is about
>> > `t_on_error_null` tests. We are allowed to insert a NULL value for the
>> > first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why
>> > do we do that? My thought is we should try to execute
>> > InputFunctionCallSafe with NULL value  (i mean, here [1]) for the
>> > column after we failed to insert the input value. And, if this second
>> > call is successful, we do replacement, otherwise we count the row as
>> > erroneous.
>> 
>> Your concern is valid. Allowing NULL to be stored in a column with a 
>> NOT NULL
>> constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you 
>> suggested,
>> NULL values set by SET_TO_NULL should probably be re-evaluated.
> 
> Thank you. I updated the patch with a NULL re-evaluation.
> 
> 
> PFA v7. I did not yet update the doc for this patch version, waiting
> for feedback about REJECT LIMIT + SET_TO_NULL behaviour.
> 


There were warnings when I applied the patch:

$ git apply 
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:170: 
trailing whitespace.
                    /*
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:181: 
trailing whitespace.

v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:189: 
trailing whitespace.
                                    /* If datatype if okay with NULL, 
replace
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:196: 
trailing whitespace.

v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:212: 
trailing whitespace.
                                 /*

>  @@ -403,12 +403,14 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState 
> *pstate, bool is_from)
>                   parser_errposition(pstate, def->location)));
...
> 
>  -   if (opts_out->reject_limit && !opts_out->on_error)
>  +   if (opts_out->reject_limit && !(opts_out->on_error == 
> COPY_ON_ERROR_NULL || opts_out->on_error == COPY_ON_ERROR_IGNORE))

Hmm, is this change necessary?
Personally, I feel the previous code is easier to read.


"REJECT LIMIT" should be "REJECT_LIMIT"?
> 1037                             errhint("Consider specifying the 
> REJECT LIMIT option to skip erroneous rows.")));


SET_TO_NULL is one of the value for ON_ERROR, but the patch treats 
SET_TO_NULL as option for COPY:

221 --- a/src/bin/psql/tab-complete.in.c
222 +++ b/src/bin/psql/tab-complete.in.c
223 @@ -3235,7 +3235,7 @@ match_previous_words(int pattern_id,
224         COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL",
225                       "HEADER", "QUOTE", "ESCAPE", "FORCE_QUOTE",
226                       "FORCE_NOT_NULL", "FORCE_NULL", "ENCODING", 
"DEFAULT",
227 -                     "ON_ERROR", "LOG_VERBOSITY");
228 +                     "ON_ERROR", "SET_TO_NULL", "LOG_VERBOSITY");


> Best regards,
> Kirill Reshke

-- 
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Kirill Reshke
Date:
On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2024-11-09 21:55, Kirill Reshke wrote:
>
> Thanks for working on this!

Thanks for reviewing the v7 patch series!

> > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
> > wrote:
> >>
> >>
> >>
> >> On 2024/10/26 6:03, Kirill Reshke wrote:
> >> > when the REJECT LIMIT is set to some non-zero number and the number of
> >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> >> > there WAS errors, and we should not tolerate more than $limit errors .
> >> > I do find this behavior to be consistent.
> >>
> >> +1
> >>
> >>
> >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> >> > replacements, as if REJECT LIMIT is inf.
> >>
> >> +1
> >
> > After thinking for a while, I'm now more opposed to this approach. I
> > think we should count rows with erroneous data as errors only if
> > null substitution for these rows failed, not the total number of rows
> > which were modified.
> > Then, to respect the REJECT LIMIT option, we compare this number with
> > the limit. This is actually simpler approach IMHO. What do You think?
>
> IMHO I prefer the previous interpretation.
> I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> is used to specify how many malformed rows are acceptable in the
> "original" data source.

I do like the first version of interpretation, but I have a struggle
with it. According to this interpretation, we will fail COPY command
if the number
of malformed data rows exceeds the limit, not the number of rejected
rows (some percentage of malformed rows are accepted with null
substitution)
So, a proper name for the limit will be MALFORMED_LIMIT, or something.
However, we are unable to change the name since the REJECT_LIMIT
option has already been committed.
I guess I'll just have to put up with this contradiction. I will send
an updated patch shortly...


> >> > But while I was trying to implement that, I realized that I don't
> >> > understand v4 of this patch. My misunderstanding is about
> >> > `t_on_error_null` tests. We are allowed to insert a NULL value for the
> >> > first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why
> >> > do we do that? My thought is we should try to execute
> >> > InputFunctionCallSafe with NULL value  (i mean, here [1]) for the
> >> > column after we failed to insert the input value. And, if this second
> >> > call is successful, we do replacement, otherwise we count the row as
> >> > erroneous.
> >>
> >> Your concern is valid. Allowing NULL to be stored in a column with a
> >> NOT NULL
> >> constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you
> >> suggested,
> >> NULL values set by SET_TO_NULL should probably be re-evaluated.
> >
> > Thank you. I updated the patch with a NULL re-evaluation.
> >
> >
> > PFA v7. I did not yet update the doc for this patch version, waiting
> > for feedback about REJECT LIMIT + SET_TO_NULL behaviour.
> >
>
>
> There were warnings when I applied the patch:
>
> $ git apply
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:170:
> trailing whitespace.
>                     /*
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:181:
> trailing whitespace.
>
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:189:
> trailing whitespace.
>                                     /* If datatype if okay with NULL,
> replace
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:196:
> trailing whitespace.
>
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:212:
> trailing whitespace.
>                                  /*
>
> >  @@ -403,12 +403,14 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState
> > *pstate, bool is_from)
> >                   parser_errposition(pstate, def->location)));
> ...
> >
> >  -   if (opts_out->reject_limit && !opts_out->on_error)
> >  +   if (opts_out->reject_limit && !(opts_out->on_error ==
> > COPY_ON_ERROR_NULL || opts_out->on_error == COPY_ON_ERROR_IGNORE))
>
> Hmm, is this change necessary?
> Personally, I feel the previous code is easier to read.
>
>
> "REJECT LIMIT" should be "REJECT_LIMIT"?
> > 1037                             errhint("Consider specifying the
> > REJECT LIMIT option to skip erroneous rows.")));
>
>
> SET_TO_NULL is one of the value for ON_ERROR, but the patch treats
> SET_TO_NULL as option for COPY:
>
> 221 --- a/src/bin/psql/tab-complete.in.c
> 222 +++ b/src/bin/psql/tab-complete.in.c
> 223 @@ -3235,7 +3235,7 @@ match_previous_words(int pattern_id,
> 224         COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL",
> 225                       "HEADER", "QUOTE", "ESCAPE", "FORCE_QUOTE",
> 226                       "FORCE_NOT_NULL", "FORCE_NULL", "ENCODING",
> "DEFAULT",
> 227 -                     "ON_ERROR", "LOG_VERBOSITY");
> 228 +                     "ON_ERROR", "SET_TO_NULL", "LOG_VERBOSITY");
>
>
> > Best regards,
> > Kirill Reshke
>
> --
> Regards,
>
> --
> Atsushi Torikoshi
> Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



-- 
Best regards,
Kirill Reshke



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo Nagata
Date:
On Tue, 12 Nov 2024 01:27:53 +0500
Kirill Reshke <reshkekirill@gmail.com> wrote:

> On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
> >
> > On 2024-11-09 21:55, Kirill Reshke wrote:
> >
> > Thanks for working on this!
> 
> Thanks for reviewing the v7 patch series!
> 
> > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
> > > wrote:
> > >>
> > >>
> > >>
> > >> On 2024/10/26 6:03, Kirill Reshke wrote:
> > >> > when the REJECT LIMIT is set to some non-zero number and the number of
> > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> > >> > there WAS errors, and we should not tolerate more than $limit errors .
> > >> > I do find this behavior to be consistent.
> > >>
> > >> +1
> > >>
> > >>
> > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> > >> > replacements, as if REJECT LIMIT is inf.
> > >>
> > >> +1
> > >
> > > After thinking for a while, I'm now more opposed to this approach. I
> > > think we should count rows with erroneous data as errors only if
> > > null substitution for these rows failed, not the total number of rows
> > > which were modified.
> > > Then, to respect the REJECT LIMIT option, we compare this number with
> > > the limit. This is actually simpler approach IMHO. What do You think?
> >
> > IMHO I prefer the previous interpretation.
> > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> > is used to specify how many malformed rows are acceptable in the
> > "original" data source.

I also prefer the previous version.
 
> I do like the first version of interpretation, but I have a struggle
> with it. According to this interpretation, we will fail COPY command
> if the number
> of malformed data rows exceeds the limit, not the number of rejected
> rows (some percentage of malformed rows are accepted with null
> substitution)
> So, a proper name for the limit will be MALFORMED_LIMIT, or something.
> However, we are unable to change the name since the REJECT_LIMIT
> option has already been committed.
> I guess I'll just have to put up with this contradiction. I will send
> an updated patch shortly...

I think we can rename the REJECT_LIMIT option because it is not yet released.

The documentation says that REJECT_LIMIT "Specifies the maximum number of errors",
and there are no wording "reject" in the description, so I wonder it is unclear
what means in "REJECT" in REJECT_LIMIT. It may be proper to use ERROR_LIMIT
since it is supposed to be used with ON_ERROR. 

Alternatively, if we emphasize that errors are handled other than terminating
the command,perhaps MALFORMED_LIMIT as proposed above or TOLERANCE_LIMIT may be
good, for example.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo Nagata
Date:
On Tue, 12 Nov 2024 14:03:50 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

> On Tue, 12 Nov 2024 01:27:53 +0500
> Kirill Reshke <reshkekirill@gmail.com> wrote:
> 
> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
> > >
> > > On 2024-11-09 21:55, Kirill Reshke wrote:
> > >
> > > Thanks for working on this!
> > 
> > Thanks for reviewing the v7 patch series!
> > 
> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
> > > > wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
> > > >> > I do find this behavior to be consistent.
> > > >>
> > > >> +1
> > > >>
> > > >>
> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> > > >> > replacements, as if REJECT LIMIT is inf.
> > > >>
> > > >> +1
> > > >
> > > > After thinking for a while, I'm now more opposed to this approach. I
> > > > think we should count rows with erroneous data as errors only if
> > > > null substitution for these rows failed, not the total number of rows
> > > > which were modified.
> > > > Then, to respect the REJECT LIMIT option, we compare this number with
> > > > the limit. This is actually simpler approach IMHO. What do You think?
> > >
> > > IMHO I prefer the previous interpretation.
> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> > > is used to specify how many malformed rows are acceptable in the
> > > "original" data source.
> 
> I also prefer the previous version.
>  
> > I do like the first version of interpretation, but I have a struggle
> > with it. According to this interpretation, we will fail COPY command
> > if the number
> > of malformed data rows exceeds the limit, not the number of rejected
> > rows (some percentage of malformed rows are accepted with null
> > substitution)
> > So, a proper name for the limit will be MALFORMED_LIMIT, or something.
> > However, we are unable to change the name since the REJECT_LIMIT
> > option has already been committed.
> > I guess I'll just have to put up with this contradiction. I will send
> > an updated patch shortly...
> 
> I think we can rename the REJECT_LIMIT option because it is not yet released.
> 
> The documentation says that REJECT_LIMIT "Specifies the maximum number of errors",
> and there are no wording "reject" in the description, so I wonder it is unclear
> what means in "REJECT" in REJECT_LIMIT. It may be proper to use ERROR_LIMIT
> since it is supposed to be used with ON_ERROR. 
> 
> Alternatively, if we emphasize that errors are handled other than terminating
> the command,perhaps MALFORMED_LIMIT as proposed above or TOLERANCE_LIMIT may be
> good, for example.

I might misunderstand the meaning of the name. If REJECT_LIMIT means "a limit on
the number of rows with any malformed value allowed before the COPY command is
rejected", we would not have to rename it.

-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
torikoshia
Date:
On 2024-11-12 14:17, Yugo Nagata wrote:
> On Tue, 12 Nov 2024 14:03:50 +0900
> Yugo Nagata <nagata@sraoss.co.jp> wrote:
> 
>> On Tue, 12 Nov 2024 01:27:53 +0500
>> Kirill Reshke <reshkekirill@gmail.com> wrote:
>> 
>> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
>> > >
>> > > On 2024-11-09 21:55, Kirill Reshke wrote:
>> > >
>> > > Thanks for working on this!
>> >
>> > Thanks for reviewing the v7 patch series!
>> >
>> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
>> > > > wrote:
>> > > >>
>> > > >>
>> > > >>
>> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
>> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
>> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
>> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
>> > > >> > I do find this behavior to be consistent.
>> > > >>
>> > > >> +1
>> > > >>
>> > > >>
>> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
>> > > >> > replacements, as if REJECT LIMIT is inf.
>> > > >>
>> > > >> +1
>> > > >
>> > > > After thinking for a while, I'm now more opposed to this approach. I
>> > > > think we should count rows with erroneous data as errors only if
>> > > > null substitution for these rows failed, not the total number of rows
>> > > > which were modified.
>> > > > Then, to respect the REJECT LIMIT option, we compare this number with
>> > > > the limit. This is actually simpler approach IMHO. What do You think?
>> > >
>> > > IMHO I prefer the previous interpretation.
>> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
>> > > is used to specify how many malformed rows are acceptable in the
>> > > "original" data source.
>> 
>> I also prefer the previous version.
>> 
>> > I do like the first version of interpretation, but I have a struggle
>> > with it. According to this interpretation, we will fail COPY command
>> > if the number
>> > of malformed data rows exceeds the limit, not the number of rejected
>> > rows (some percentage of malformed rows are accepted with null
>> > substitution)

I feel your concern is valid.
Currently 'reject' can occur only when converting a column's input value 
to its data type, but if we introduce set_to_null option 'reject' also 
occurs when inserting null, i.e. not null constraint.

>> > So, a proper name for the limit will be MALFORMED_LIMIT, or something.
>> > However, we are unable to change the name since the REJECT_LIMIT
>> > option has already been committed.
>> > I guess I'll just have to put up with this contradiction. I will send
>> > an updated patch shortly...
>> I think we can rename the REJECT_LIMIT option because it is not yet 
>> released.

+1

>> The documentation says that REJECT_LIMIT "Specifies the maximum number 
>> of errors",
>> and there are no wording "reject" in the description, so I wonder it 
>> is unclear
>> what means in "REJECT" in REJECT_LIMIT. It may be proper to use 
>> ERROR_LIMIT
>> since it is supposed to be used with ON_ERROR.
>> 
>> Alternatively, if we emphasize that errors are handled other than 
>> terminating
>> the command,perhaps MALFORMED_LIMIT as proposed above or 
>> TOLERANCE_LIMIT may be
>> good, for example.
> 
> I might misunderstand the meaning of the name. If REJECT_LIMIT means "a 
> limit on
> the number of rows with any malformed value allowed before the COPY 
> command is
> rejected", we would not have to rename it.

The meaning of REJECT_LIMIT is what you described, and I think Kirill 
worries about cases when malformed rows are accepted(=not REJECTed) with 
null substitution. REJECT_LIMIT counts this case as REJECTed.

-- 
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo NAGATA
Date:
On Tue, 12 Nov 2024 17:38:25 +0900
torikoshia <torikoshia@oss.nttdata.com> wrote:

> On 2024-11-12 14:17, Yugo Nagata wrote:
> > On Tue, 12 Nov 2024 14:03:50 +0900
> > Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > 
> >> On Tue, 12 Nov 2024 01:27:53 +0500
> >> Kirill Reshke <reshkekirill@gmail.com> wrote:
> >> 
> >> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
> >> > >
> >> > > On 2024-11-09 21:55, Kirill Reshke wrote:
> >> > >
> >> > > Thanks for working on this!
> >> >
> >> > Thanks for reviewing the v7 patch series!
> >> >
> >> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
> >> > > > wrote:
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
> >> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
> >> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> >> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
> >> > > >> > I do find this behavior to be consistent.
> >> > > >>
> >> > > >> +1
> >> > > >>
> >> > > >>
> >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> >> > > >> > replacements, as if REJECT LIMIT is inf.
> >> > > >>
> >> > > >> +1
> >> > > >
> >> > > > After thinking for a while, I'm now more opposed to this approach. I
> >> > > > think we should count rows with erroneous data as errors only if
> >> > > > null substitution for these rows failed, not the total number of rows
> >> > > > which were modified.
> >> > > > Then, to respect the REJECT LIMIT option, we compare this number with
> >> > > > the limit. This is actually simpler approach IMHO. What do You think?
> >> > >
> >> > > IMHO I prefer the previous interpretation.
> >> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> >> > > is used to specify how many malformed rows are acceptable in the
> >> > > "original" data source.
> >> 
> >> I also prefer the previous version.
> >> 
> >> > I do like the first version of interpretation, but I have a struggle
> >> > with it. According to this interpretation, we will fail COPY command
> >> > if the number
> >> > of malformed data rows exceeds the limit, not the number of rejected
> >> > rows (some percentage of malformed rows are accepted with nulnot-null constraintl
> >> > substitution)
> 
> I feel your concern is valid.
> Currently 'reject' can occur only when converting a column's input value 
> to its data type, but if we introduce set_to_null option 'reject' also 
> occurs when inserting null, i.e. not null constraint.

I can suppose "reject" means failure of COPY command here, that is, a reject
of executing the command, not an error of data input. If so, we can interpret
REJECT_LIMIT as "the number of malformed rows allowed before the COPY command
is REJECTed" (not the number of rejected rows). In this case, I think we don't
have to rename the option name.

Of course, if there is more proper name that makes it easy for users to
understand the behaviour of the option, renaming should be nice.

> >> The documentation says that REJECT_LIMIT "Specifies the maximum number 
> >> of errors",
> >> and there are no wording "reject" in the description, so I wonder it 
> >> is unclear
> >> what means in "REJECT" in REJECT_LIMIT. It may be proper to use 
> >> ERROR_LIMIT
> >> since it is supposed to be used with ON_ERROR.
> >> 
> >> Alternatively, if we emphasize that errors are handled other than 
> >> terminating
> >> the command,perhaps MALFORMED_LIMIT as proposed above or 
> >> TOLERANCE_LIMIT may be
> >> good, for example.
> > 
> > I might misunderstand the meaning of the name. If REJECT_LIMIT means "a 
> > limit on
> > the number of rows with any malformed value allowed before the COPY 
> > command is
> > rejected", we would not have to rename it.
> 
> The meaning of REJECT_LIMIT is what you described, and I think Kirill 
> worries about cases when malformed rows are accepted(=not REJECTed) with 
> null substitution. REJECT_LIMIT counts this case as REJECTed.

I am a bit confused. You mean "REJECT" is raising a soft error of data
input here instead of terminating COPY?

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
torikoshia
Date:
On 2024-11-13 22:02, Yugo NAGATA wrote:
> On Tue, 12 Nov 2024 17:38:25 +0900
> torikoshia <torikoshia@oss.nttdata.com> wrote:
> 
>> On 2024-11-12 14:17, Yugo Nagata wrote:
>> > On Tue, 12 Nov 2024 14:03:50 +0900
>> > Yugo Nagata <nagata@sraoss.co.jp> wrote:
>> >
>> >> On Tue, 12 Nov 2024 01:27:53 +0500
>> >> Kirill Reshke <reshkekirill@gmail.com> wrote:
>> >>
>> >> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
>> >> > >
>> >> > > On 2024-11-09 21:55, Kirill Reshke wrote:
>> >> > >
>> >> > > Thanks for working on this!
>> >> >
>> >> > Thanks for reviewing the v7 patch series!
>> >> >
>> >> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
>> >> > > > wrote:
>> >> > > >>
>> >> > > >>
>> >> > > >>
>> >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
>> >> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
>> >> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
>> >> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
>> >> > > >> > I do find this behavior to be consistent.
>> >> > > >>
>> >> > > >> +1
>> >> > > >>
>> >> > > >>
>> >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
>> >> > > >> > replacements, as if REJECT LIMIT is inf.
>> >> > > >>
>> >> > > >> +1
>> >> > > >
>> >> > > > After thinking for a while, I'm now more opposed to this approach. I
>> >> > > > think we should count rows with erroneous data as errors only if
>> >> > > > null substitution for these rows failed, not the total number of rows
>> >> > > > which were modified.
>> >> > > > Then, to respect the REJECT LIMIT option, we compare this number with
>> >> > > > the limit. This is actually simpler approach IMHO. What do You think?
>> >> > >
>> >> > > IMHO I prefer the previous interpretation.
>> >> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
>> >> > > is used to specify how many malformed rows are acceptable in the
>> >> > > "original" data source.
>> >>
>> >> I also prefer the previous version.
>> >>
>> >> > I do like the first version of interpretation, but I have a struggle
>> >> > with it. According to this interpretation, we will fail COPY command
>> >> > if the number
>> >> > of malformed data rows exceeds the limit, not the number of rejected
>> >> > rows (some percentage of malformed rows are accepted with nulnot-null constraintl
>> >> > substitution)
>> 
>> I feel your concern is valid.
>> Currently 'reject' can occur only when converting a column's input 
>> value
>> to its data type, but if we introduce set_to_null option 'reject' also
>> occurs when inserting null, i.e. not null constraint.
> 
> I can suppose "reject" means failure of COPY command here, that is, a 
> reject
> of executing the command, not an error of data input. If so, we can 
> interpret
> REJECT_LIMIT as "the number of malformed rows allowed before the COPY 
> command
> is REJECTed" (not the number of rejected rows). In this case, I think 
> we don't
> have to rename the option name.
> 
> Of course, if there is more proper name that makes it easy for users to
> understand the behaviour of the option, renaming should be nice.
> 
>> >> The documentation says that REJECT_LIMIT "Specifies the maximum number
>> >> of errors",
>> >> and there are no wording "reject" in the description, so I wonder it
>> >> is unclear
>> >> what means in "REJECT" in REJECT_LIMIT. It may be proper to use
>> >> ERROR_LIMIT
>> >> since it is supposed to be used with ON_ERROR.
>> >>
>> >> Alternatively, if we emphasize that errors are handled other than
>> >> terminating
>> >> the command,perhaps MALFORMED_LIMIT as proposed above or
>> >> TOLERANCE_LIMIT may be
>> >> good, for example.
>> >
>> > I might misunderstand the meaning of the name. If REJECT_LIMIT means "a
>> > limit on
>> > the number of rows with any malformed value allowed before the COPY
>> > command is
>> > rejected", we would not have to rename it.
>> 
>> The meaning of REJECT_LIMIT is what you described, and I think Kirill
>> worries about cases when malformed rows are accepted(=not REJECTed) 
>> with
>> null substitution. REJECT_LIMIT counts this case as REJECTed.
> 
> I am a bit confused.

Me too:)
Let me explain my understanding.

I believe there are now two candidates that count as REJECT_LIMIT 
number:

   (1) error converting a column's input value into its data type(soft 
error)
   (2) NULL substitution failure(this comes from the proposed patch)

And I understood Kirill's idea to be the following:

   1st idea: REJECT_LIMIT counts (1)
   2nd idea: REJECT_LIMIT counts (2)

And I've agreed with the 1st one.

> You mean "REJECT" is raising a soft error of data
> input here instead of terminating COPY?

Yes.


-- 
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Sat, Nov 16, 2024 at 5:55 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> I am attaching my v8 for reference.
>

in your v8.

   <varlistentry>
    <term><literal>REJECT_LIMIT</literal></term>
    <listitem>
     <para>
      Specifies the maximum number of errors tolerated while converting a
      column's input value to its data type, when <literal>ON_ERROR</literal> is
      set to <literal>ignore</literal>.
      If the input contains more erroneous rows than the specified
value, the <command>COPY</command>
      command fails, even with <literal>ON_ERROR</literal> set to
<literal>ignore</literal>.
     </para>
    </listitem>
   </varlistentry>

then above description not meet with following example, (i think)

create table t(a int not null);
COPY t FROM STDIN WITH (on_error set_to_null, reject_limit 2);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> a
>> \.
ERROR:  null value in column "a" of relation "t" violates not-null constraint
DETAIL:  Failing row contains (null).
CONTEXT:  COPY t, line 1, column a: "a"

Overall, I think
making the domain not-null align with column level not-null would be a
good thing.


     <para>
      Specifies how to behave when encountering an error converting a column's
      input value into its data type.
      An <replaceable class="parameter">error_action</replaceable> value of
      <literal>stop</literal> means fail the command,
      <literal>ignore</literal> means discard the input row and
continue with the next one, and
      <literal>set_to_null</literal> means replace columns containing
erroneous input values with <literal>null</literal> and move to the
next row.

"and move to the next row" is wrong?
I think it should be " and move to the next field".



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Kirill Reshke
Date:
On Tue, 19 Nov 2024, 13:52 jian he, <jian.universality@gmail.com> wrote:
>
> On Sat, Nov 16, 2024 at 5:55 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > I am attaching my v8 for reference.
> >
>
> in your v8.
>
>    <varlistentry>
>     <term><literal>REJECT_LIMIT</literal></term>
>     <listitem>
>      <para>
>       Specifies the maximum number of errors tolerated while converting a
>       column's input value to its data type, when <literal>ON_ERROR</literal> is
>       set to <literal>ignore</literal>.
>       If the input contains more erroneous rows than the specified
> value, the <command>COPY</command>
>       command fails, even with <literal>ON_ERROR</literal> set to
> <literal>ignore</literal>.
>      </para>
>     </listitem>
>    </varlistentry>
>
> then above description not meet with following example, (i think)
>
> create table t(a int not null);
> COPY t FROM STDIN WITH (on_error set_to_null, reject_limit 2);
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> a
> >> \.
> ERROR:  null value in column "a" of relation "t" violates not-null constraint
> DETAIL:  Failing row contains (null).
> CONTEXT:  COPY t, line 1, column a: "a"


Sure, my v8 does not helps with column level NOT NULL constraint (or
other constraint)

> Overall, I think
> making the domain not-null align with column level not-null would be a
> good thing.

While this looks sane, it's actually a separate topic. Even on current
HEAD we have domain not-null vs column level not-null unalignment.

consider this example


```
reshke=# create table ftt2 (i int not null);
CREATE TABLE
reshke=# copy ftt2 from stdin with (reject_limit 1000, on_error ignore);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> \N
>> \.
ERROR:  null value in column "i" of relation "ftt2" violates not-null constraint
DETAIL:  Failing row contains (null).
CONTEXT:  COPY ftt2, line 1: "\N"
reshke=# create domain dd as int not null ;
CREATE DOMAIN
reshke=# create table ftt3(i dd);
CREATE TABLE
reshke=# copy ftt3 from stdin with (reject_limit 1000, on_error ignore);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> \N
>> \.
NOTICE:  1 row was skipped due to data type incompatibility
COPY 0
reshke=#
```
So, if we want this, we need to start another thread and deal with
REJECT_LIMIT + on_error ignore first.

The ExecConstraints function is the source of the error in scenario 1.
Therefore, we require something like "ExecConstraintsSafe" to
accommodate the aforementioned. That is a significant change. Not sure
it will be accepted by the community.

>
>      <para>
>       Specifies how to behave when encountering an error converting a column's
>       input value into its data type.
>       An <replaceable class="parameter">error_action</replaceable> value of
>       <literal>stop</literal> means fail the command,
>       <literal>ignore</literal> means discard the input row and
> continue with the next one, and
>       <literal>set_to_null</literal> means replace columns containing
> erroneous input values with <literal>null</literal> and move to the
> next row.
>
> "and move to the next row" is wrong?
> I think it should be " and move to the next field".

Yes, "and move to the next field" is correct.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Kirill Reshke
Date:
On Tue, 19 Nov 2024 at 13:52, jian he <jian.universality@gmail.com> wrote:
>
> On Sat, Nov 16, 2024 at 5:55 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > I am attaching my v8 for reference.
> >
>
> in your v8.
>
>    <varlistentry>
>     <term><literal>REJECT_LIMIT</literal></term>
>     <listitem>
>      <para>
>       Specifies the maximum number of errors tolerated while converting a
>       column's input value to its data type, when <literal>ON_ERROR</literal> is
>       set to <literal>ignore</literal>.
>       If the input contains more erroneous rows than the specified
> value, the <command>COPY</command>
>       command fails, even with <literal>ON_ERROR</literal> set to
> <literal>ignore</literal>.
>      </para>
>     </listitem>
>    </varlistentry>
>
> then above description not meet with following example, (i think)
>
> create table t(a int not null);
> COPY t FROM STDIN WITH (on_error set_to_null, reject_limit 2);
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> a
> >> \.
> ERROR:  null value in column "a" of relation "t" violates not-null constraint
> DETAIL:  Failing row contains (null).
> CONTEXT:  COPY t, line 1, column a: "a"
>
> Overall, I think
> making the domain not-null align with column level not-null would be a
> good thing.
>
>
>      <para>
>       Specifies how to behave when encountering an error converting a column's
>       input value into its data type.
>       An <replaceable class="parameter">error_action</replaceable> value of
>       <literal>stop</literal> means fail the command,
>       <literal>ignore</literal> means discard the input row and
> continue with the next one, and
>       <literal>set_to_null</literal> means replace columns containing
> erroneous input values with <literal>null</literal> and move to the
> next row.
>
> "and move to the next row" is wrong?
> I think it should be " and move to the next field".

Hi! There is not too much time left in this CF, so I moved to the next one.
If you are going to work on this patch, I'm waiting on your feedback
or a v9 patch that answers the issues brought up in this discussion.

--
Best regards,
Kirill Reshke