Thread: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
"David G. Johnston"
Date:
Hi,

The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic.  There would seem to be two relevant ways to ignore bad column input data - drop the entire row or just set the column value to null.  I can see us wanting to provide the set to null option and in any case having the option name be explicit that it ignores the row seems like a good idea.

David J.

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> Hi,
>
> The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic.  There would seem to be two relevant
waysto ignore bad column input data - drop the entire row or just set the column value to null.  I can see us wanting
toprovide the set to null option and in any case having the option name be explicit that it ignores the row seems like
agood idea. 

two issue I found out while playing around with it;
create table x1(a int not null, b int not null );

you can only do:
COPY x1 from stdin (on_error 'null');
but you cannot do
COPY x1 from stdin (on_error null);
we need to hack the gram.y to escape the "null". I don't know how to
make it work.
related post I found:
https://stackoverflow.com/questions/31786611/how-to-escape-flex-keyword

another issue:
COPY x1 from stdin (on_error null);

when we already have `not null` top level constraint for table x1.
Do we need an error immediately?
"on_error null" seems to conflict with `not null` constraint (assume
refers to the same column).
it may fail while doing bulk inserts while on_error is set to null
because of violating a not null constraint.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
"David G. Johnston"
Date:


On Sun, Jan 28, 2024 at 4:51 PM jian he <jian.universality@gmail.com> wrote:
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> Hi,
>
> The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic.  There would seem to be two relevant ways to ignore bad column input data - drop the entire row or just set the column value to null.  I can see us wanting to provide the set to null option and in any case having the option name be explicit that it ignores the row seems like a good idea.

two issue I found out while playing around with it;
create table x1(a int not null, b int not null );

another issue:
COPY x1 from stdin (on_error null);

when we already have `not null` top level constraint for table x1.
Do we need an error immediately?
"on_error null" seems to conflict with `not null` constraint (assume
refers to the same column).
it may fail while doing bulk inserts while on_error is set to null
because of violating a not null constraint.

You should not error immediately since whether or not there is a problem is table and data dependent.  I would not check for the case of all columns being defined not null and just let the mismatch happen.

That said, maybe with this being a string we can accept something like: 'null, ignore'

And so if attempting to place any one null fails, assuming we can make that a soft error too, we would then ignore the entire row.

David J.

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo NAGATA
Date:
On Fri, 26 Jan 2024 08:08:29 -0700
"David G. Johnston" <david.g.johnston@gmail.com> wrote:

> Hi,
> 
> The option choice of "ignore" in the COPY ON_ERROR clause seems overly
> generic.  There would seem to be two relevant ways to ignore bad column
> input data - drop the entire row or just set the column value to null.  I
> can see us wanting to provide the set to null option and in any case having
> the option name be explicit that it ignores the row seems like a good idea.

I am not in favour of renaming the option name "ignore", instead we can
use another style of name for the option to set the column value to NULL,
for example, "set_to_null".

(Maybe, we can make a more generic option "set_to (col, val)" that can set
the value of column specified by "col" value to the specified value "val" 
(e.g. 'N/A') on a soft error, although the syntax would be a bit complex...) 

IMO, it is more simple to define "ignore" as to skip the entire row rather
than having variety of "ignore". Once defined it so, the option to set the
column value to NULL should not be called "ignore" because values in other
columns will be inserted.

Regards,
Yugo Nagata

> 
> David J.


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
The idea of on_error is to tolerate errors, I think.
if a column has a not null constraint, let it cannot be used with
(on_error 'null')

Based on this, I've made a patch.
based on COPY Synopsis: ON_ERROR 'error_action'
on_error 'null', the  keyword NULL should be single quoted.

demo:
COPY check_ign_err FROM STDIN WITH (on_error 'null');
1 {1} a
2 {2} 1
3 {3} 2
4 {4} b
a {5} c
\.

\pset null NULL

SELECT * FROM check_ign_err;
  n   |  m  |  k
------+-----+------
    1 | {1} | NULL
    2 | {2} |    1
    3 | {3} |    2
    4 | {4} | NULL
 NULL | {5} | NULL

Attachment

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
torikoshia
Date:
Hi,

On 2024-02-03 15:22, jian he wrote:
> The idea of on_error is to tolerate errors, I think.
> if a column has a not null constraint, let it cannot be used with
> (on_error 'null')

> +       /*
> +        * we can specify on_error 'null', but it can only apply to 
> columns
> +        * don't have not null constraint.
> +       */
> +       if (att->attnotnull && cstate->opts.on_error == 
> COPY_ON_ERROR_NULL)
> +           ereport(ERROR,
> +                   (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> +                    errmsg("copy on_error 'null' cannot be used with 
> not null constraint column")));

This means we cannot use ON_ERROR 'null' even when there is one column 
which have NOT NULL constraint, i.e. primary key, right?
IMHO this is strong constraint and will decrease the opportunity to use 
this feature.

It might be better to allow error_action 'null' for tables which have 
NOT NULL constraint columns, and when facing soft errors for those rows, 
skip that row or stop COPY.

> Based on this, I've made a patch.
> based on COPY Synopsis: ON_ERROR 'error_action'
> on_error 'null', the  keyword NULL should be single quoted.

As you mentioned, single quotation seems a little odd..

I'm not sure what is the best name and syntax for this feature, but 
since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
might not be appropriate.

> demo:
> COPY check_ign_err FROM STDIN WITH (on_error 'null');
> 1 {1} a
> 2 {2} 1
> 3 {3} 2
> 4 {4} b
> a {5} c
> \.
> 
> \pset null NULL
> 
> SELECT * FROM check_ign_err;
>   n   |  m  |  k
> ------+-----+------
>     1 | {1} | NULL
>     2 | {2} |    1
>     3 | {3} |    2
>     4 | {4} | NULL
>  NULL | {5} | NULL

Since we notice the number of ignored rows when ON_ERROR is 'ignore', 
users may want to know the number of rows which was changed to NULL when 
using ON_ERROR 'null'.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> Hi,
>
> On 2024-02-03 15:22, jian he wrote:
> > The idea of on_error is to tolerate errors, I think.
> > if a column has a not null constraint, let it cannot be used with
> > (on_error 'null')
>
> > +       /*
> > +        * we can specify on_error 'null', but it can only apply to
> > columns
> > +        * don't have not null constraint.
> > +       */
> > +       if (att->attnotnull && cstate->opts.on_error ==
> > COPY_ON_ERROR_NULL)
> > +           ereport(ERROR,
> > +                   (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> > +                    errmsg("copy on_error 'null' cannot be used with
> > not null constraint column")));
>
> This means we cannot use ON_ERROR 'null' even when there is one column
> which have NOT NULL constraint, i.e. primary key, right?
> IMHO this is strong constraint and will decrease the opportunity to use
> this feature.

I don't want to fail in the middle of bulk inserts,
so I thought immediately erroring out would be a great idea.
Let's see what other people think.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo NAGATA
Date:
On Mon, 05 Feb 2024 11:28:59 +0900
torikoshia <torikoshia@oss.nttdata.com> wrote:

> > Based on this, I've made a patch.
> > based on COPY Synopsis: ON_ERROR 'error_action'
> > on_error 'null', the  keyword NULL should be single quoted.
> 
> As you mentioned, single quotation seems a little odd..
> 
> I'm not sure what is the best name and syntax for this feature, but 
> since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
> might not be appropriate.

I am not in favour of using 'null' either, so I suggested to use
"set_to_null" or more generic syntax like  "set_to (col, val)" in my
previous post[1], although I'm not convinced what is the best either.

[1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Kyotaro Horiguchi
Date:
At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in 
> On Mon, 05 Feb 2024 11:28:59 +0900
> torikoshia <torikoshia@oss.nttdata.com> wrote:
> 
> > > Based on this, I've made a patch.
> > > based on COPY Synopsis: ON_ERROR 'error_action'
> > > on_error 'null', the  keyword NULL should be single quoted.
> > 
> > As you mentioned, single quotation seems a little odd..
> > 
> > I'm not sure what is the best name and syntax for this feature, but 
> > since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
> > might not be appropriate.
> 
> I am not in favour of using 'null' either, so I suggested to use
> "set_to_null" or more generic syntax like  "set_to (col, val)" in my
> previous post[1], although I'm not convinced what is the best either.
> 
> [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp

Tom sugggested using a separate option, and I agree with the
suggestion. Taking this into consideration, I imagined something like
the following, for example.  Although I'm not sure we are actually
going to do whole-tuple replacement, the action name in this example
has the suffix '-column'.

COPY (on_error 'replace-colomn', replacement 'null') ..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo NAGATA
Date:
On Tue, 06 Feb 2024 09:39:09 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in 
> > On Mon, 05 Feb 2024 11:28:59 +0900
> > torikoshia <torikoshia@oss.nttdata.com> wrote:
> > 
> > > > Based on this, I've made a patch.
> > > > based on COPY Synopsis: ON_ERROR 'error_action'
> > > > on_error 'null', the  keyword NULL should be single quoted.
> > > 
> > > As you mentioned, single quotation seems a little odd..
> > > 
> > > I'm not sure what is the best name and syntax for this feature, but 
> > > since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
> > > might not be appropriate.
> > 
> > I am not in favour of using 'null' either, so I suggested to use
> > "set_to_null" or more generic syntax like  "set_to (col, val)" in my
> > previous post[1], although I'm not convinced what is the best either.
> > 
> > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp
> 
> Tom sugggested using a separate option, and I agree with the
> suggestion. Taking this into consideration, I imagined something like
> the following, for example.  Although I'm not sure we are actually
> going to do whole-tuple replacement, the action name in this example
> has the suffix '-column'.
> 
> COPY (on_error 'replace-colomn', replacement 'null') ..

Thank you for your information. I've found a post[1] you mentioned, 
where adding a separate option for error log destination was suggested. 

Considering consistency with other options, adding a separate option
would be better if we want to specify a value to replace the invalid
value, without introducing a complex syntax that allows options with
more than one parameters. Maybe, if we allow to use values for the
replacement other than NULL, we have to also add a option to specify
a column (or a type)  for each replacement value. Or,  we may add a
option to specify a list of replacement values as many as the number of
columns, each of whose default is NULL.

Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option
value.

[1] https://www.postgresql.org/message-id/2070915.1705527477%40sss.pgh.pa.us

Regards,
Yugo Nagata


> regards.
> 
> -- 
> Kyotaro Horiguchi
> NTT Open Source Software Center


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Tue, Feb 6, 2024 at 3:46 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote:
>
> On Tue, 06 Feb 2024 09:39:09 +0900 (JST)
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> > At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in
> > > On Mon, 05 Feb 2024 11:28:59 +0900
> > > torikoshia <torikoshia@oss.nttdata.com> wrote:
> > >
> > > > > Based on this, I've made a patch.
> > > > > based on COPY Synopsis: ON_ERROR 'error_action'
> > > > > on_error 'null', the  keyword NULL should be single quoted.
> > > >
> > > > As you mentioned, single quotation seems a little odd..
> > > >
> > > > I'm not sure what is the best name and syntax for this feature, but
> > > > since current error_action are verbs('stop' and 'ignore'), I feel 'null'
> > > > might not be appropriate.
> > >
> > > I am not in favour of using 'null' either, so I suggested to use
> > > "set_to_null" or more generic syntax like  "set_to (col, val)" in my
> > > previous post[1], although I'm not convinced what is the best either.
> > >
> > > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp
> >
> > Tom sugggested using a separate option, and I agree with the
> > suggestion. Taking this into consideration, I imagined something like
> > the following, for example.  Although I'm not sure we are actually
> > going to do whole-tuple replacement, the action name in this example
> > has the suffix '-column'.
> >
> > COPY (on_error 'replace-colomn', replacement 'null') ..
>
> Thank you for your information. I've found a post[1] you mentioned,
> where adding a separate option for error log destination was suggested.
>
> Considering consistency with other options, adding a separate option
> would be better if we want to specify a value to replace the invalid
> value, without introducing a complex syntax that allows options with
> more than one parameters. Maybe, if we allow to use values for the
> replacement other than NULL, we have to also add a option to specify
> a column (or a type)  for each replacement value. Or,  we may add a
> option to specify a list of replacement values as many as the number of
> columns, each of whose default is NULL.
>
> Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option
> value.
>

Let's say tabe t column (a,b,c)
if we support set_to_null(a,b), what should we do if column c has an error.
should we ignore this row or error out immediately?
also I am not sure it's doable to just extract columnList from the
function defGetCopyOnErrorChoice.

to make `COPY x from stdin (on_error set_to_null(a,b);` work,
we may need to refactor to gram.y, in a similar way we do force null

i am ok with
COPY x from stdin (on_error set_to_null);



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo NAGATA
Date:
On Mon, 5 Feb 2024 14:26:46 +0800
jian he <jian.universality@gmail.com> wrote:

> On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> >
> > Hi,
> >
> > On 2024-02-03 15:22, jian he wrote:
> > > The idea of on_error is to tolerate errors, I think.
> > > if a column has a not null constraint, let it cannot be used with
> > > (on_error 'null')
> >
> > > +       /*
> > > +        * we can specify on_error 'null', but it can only apply to
> > > columns
> > > +        * don't have not null constraint.
> > > +       */
> > > +       if (att->attnotnull && cstate->opts.on_error ==
> > > COPY_ON_ERROR_NULL)
> > > +           ereport(ERROR,
> > > +                   (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> > > +                    errmsg("copy on_error 'null' cannot be used with
> > > not null constraint column")));
> >
> > This means we cannot use ON_ERROR 'null' even when there is one column
> > which have NOT NULL constraint, i.e. primary key, right?
> > IMHO this is strong constraint and will decrease the opportunity to use
> > this feature.
> 
> I don't want to fail in the middle of bulk inserts,
> so I thought immediately erroring out would be a great idea.
> Let's see what other people think.

I also think this restriction is too strong because it is very
common that a table has a primary key, unless there is some way
to specify columns that can be set to NULL. Even when ON_ERROR
is specified, any constraint violation errors cannot be generally
ignored, so we cannot elimiate the posibility COPY FROM fails in
the middle due to invalid data, anyway.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
attached v2.
syntax: `on_error set_to_null`
based on upthread discussion, now if you specified `on_error
set_to_null` and your column has `not
null` constraint, we convert the error field to null, so it may error
while bulk inserting for violating NOT NULL constraint.

Attachment

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Jim Jones
Date:
Hi!

On 12.02.24 01:00, jian he wrote:
> attached v2.
> syntax: `on_error set_to_null`
> based on upthread discussion, now if you specified `on_error
> set_to_null` and your column has `not
> null` constraint, we convert the error field to null, so it may error
> while bulk inserting for violating NOT NULL constraint.
That's a very nice feature. Thanks for implementing it!

v2 applies cleanly and works as described.

\pset null '(NULL)'

CREATE TEMPORARY TABLE t1 (a int, b int);
COPY t1 (a,b) FROM STDIN;
1    a
2    1
3    2
4    b
a    c
\.
SELECT * FROM t1;

CONTEXT:  COPY t1, line 1, column b: "a"
 a | b
---+---
(0 rows)


CREATE TEMPORARY TABLE t2 (a int, b int);
COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null);
1    a
2    1
3    2
4    b
a    c
\.
SELECT * FROM t2;

psql:test-copy-on_error-2.sql:12: NOTICE:  some columns of 3 rows, value
were converted to NULL due to data type incompatibility
COPY 5
   a    |   b    
--------+--------
      1 | (NULL)
      2 |      1
      3 |      2
      4 | (NULL)
 (NULL) | (NULL)
(5 rows)


I have one question though:

In case all columns of a record have been set to null due to data type
incompatibility, should we insert it at all? See t2 example above.
I'm not sure if these records would be of any use in the table. What do
you think?

Since the parameter is already called "set_to_null", maybe it is not
necessary to mention in the NOTICE message that the values have been set
to null.
Perhaps something like "XX records were only partially copied due to
data type incompatibility"


-- 
Jim




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
"David G. Johnston"
Date:
On Fri, Feb 16, 2024 at 1:16 PM Jim Jones <jim.jones@uni-muenster.de> wrote:
In case all columns of a record have been set to null due to data type
incompatibility, should we insert it at all?

Yes.  In particular not all columns in the table need be specified in the copy command so while the parsed input data is all nulls the record itself may not be.

The system should allow the user to exclude rows with incomplete data by ignoring a not null constraint violation.

In short we shouldn't judge non-usefulness and instead give tools to the user to decide for themselves.

David J.

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Jim Jones
Date:

On 16.02.24 21:31, David G. Johnston wrote:
> Yes.  In particular not all columns in the table need be specified in
> the copy command so while the parsed input data is all nulls the
> record itself may not be.

Yeah, you have a point there.
I guess if users want to avoid it to happen they can rely on NOT NULL
constraints.

Thanks

-- 
Jim




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Jim Jones
Date:
Hi there

On 26.08.24 02:00, jian he wrote:
> hi all.
> patch updated.
> simplified the code a lot.
>
> idea is same:
> COPY t_on_error_null FROM STDIN WITH (on_error set_to_null);
>
> If the STDIN number of columns is the same as the target table, then
> InputFunctionCallSafe
> call failure will make that column values be null.
>
>
> If the STDIN number of columns is not the same as the target table, then error
> ERROR:  missing data for column \"%s\"
> ERROR:  extra data after last expected column
> which is status quo.

I wanted to give it another try, but the patch does not apply ...

$ git apply ~/patches/copy_on_error/v3-0001-on_error-set_to_null.patch -v
Checking patch doc/src/sgml/ref/copy.sgml...
Checking patch src/backend/commands/copy.c...
Checking patch src/backend/commands/copyfrom.c...
Checking patch src/backend/commands/copyfromparse.c...
Checking patch src/include/commands/copy.h...
Checking patch src/test/regress/expected/copy2.out...
error: while searching for:
NOTICE:  skipping row due to data type incompatibility at line 8 for
column k: "a"
CONTEXT:  COPY check_ign_err
NOTICE:  6 rows were skipped due to data type incompatibility
-- tests for on_error option with log_verbosity and null constraint via
domain
CREATE DOMAIN dcheck_ign_err2 varchar(15) NOT NULL;
CREATE TABLE check_ign_err2 (n int, m int[], k int, l dcheck_ign_err2);

error: patch failed: src/test/regress/expected/copy2.out:753
error: src/test/regress/expected/copy2.out: patch does not apply
Checking patch src/test/regress/sql/copy2.sql...


-- 
Jim




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Jim Jones
Date:

On 12.09.24 12:13, jian he wrote:
> please check the attached file.

v4 applies cleanly, it works as expected, and all tests pass.

postgres=# \pset null '(NULL)'
Null display is "(NULL)".

postgres=# CREATE TEMPORARY TABLE t2 (a int, b int);
CREATE TABLE

postgres=# COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null, format csv);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,a
>> 2,1
>> 3,2
>> 4,b
>> a,c
>> \.
COPY 5

postgres=# SELECT * FROM t2;
   a    |   b    
--------+--------
      1 | (NULL)
      2 |      1
      3 |      2
      4 | (NULL)
 (NULL) | (NULL)
(5 rows)


Perhaps small changes in the docs:

<literal>set_to_null</literal> means the input value will set to
<literal>null</literal> and continue with the next one.

"will set" -> "will be set"
"and continue with" -> "and will continue with"

Other than that, LGTM.

Thanks!

-- 
Jim




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Fujii Masao
Date:

On 2024/10/21 18:30, Kirill Reshke wrote:
> v4 no longer applies. It now conflicts with
> e7834a1a251d4a28245377f383ff20a657ba8262.
> Also, there were review comments.
> 
> So, I decided to rebase.

Thanks for the patch! Here are my review comments:

I noticed that on_error=set_to_null does not trigger NOTICE messages for rows
and columns with errors. It's "unexpected" thing for columns to be silently
replaced with NULL due to on_error=set_to_null. So, similar to on_error=ignore,
there should be NOTICE messages indicating which input records had columns
set to NULL because of data type incompatibility. Without these messages,
users might not realize that some columns were set to NULL.


How should on_error=set_to_null behave when reject_limit is set? It seems
intuitive to trigger an error if the number of rows with columns' data type
issues exceeds reject_limit, similar to on_error=ignore. This is open to discussion.


psql's tab-completion should be updated to include SET_TO_NULL.


        An <replaceable class="parameter">error_action</replaceable> value of
        <literal>stop</literal> means fail the command, while
        <literal>ignore</literal> means discard the input row and continue with the next one.
+      <literal>set_to_null</literal> means the input value will be set to <literal>null</literal> and continue with
thenext one.
 

How about merging these two descriptions to one and updating it to the following?

-------------------
An error_action value of stop means fail the command, ignore means discard the input
row and continue with the next one, and set_to_null means replace columns with invalid
input values with NULL and move to the next row.
-------------------


       The <literal>ignore</literal> option is applicable only for <command>COPY FROM</command>

This should be "... ignore and set_to_null options are ..."?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Mon, Oct 21, 2024 at 8:39 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> On 2024/10/21 18:30, Kirill Reshke wrote:
> > v4 no longer applies. It now conflicts with
> > e7834a1a251d4a28245377f383ff20a657ba8262.
> > Also, there were review comments.
> >
> > So, I decided to rebase.
>
> Thanks for the patch! Here are my review comments:
>
> I noticed that on_error=set_to_null does not trigger NOTICE messages for rows
> and columns with errors. It's "unexpected" thing for columns to be silently
> replaced with NULL due to on_error=set_to_null. So, similar to on_error=ignore,
> there should be NOTICE messages indicating which input records had columns
> set to NULL because of data type incompatibility. Without these messages,
> users might not realize that some columns were set to NULL.
>

on_error=set_to_null,
we have two options for CopyFromStateData->num_errors.
A. Counting the number of rows that on_error set_to_null happened.
B. Counting number of times that on_error set_to_null happened

let's say optionA:
        ereport(NOTICE,
                errmsg_plural("%llu row was converted to NULL due to
data type incompatibility",
                              "%llu rows were converted to NULL due to
data type incompatibility",
                              (unsigned long long) cstate->num_errors,
                              (unsigned long long) cstate->num_errors));

I doubt the above message is accurate.
"%llu row was converted to NULL"
can mean "%llu row, for each row, all columns was converted to NULL"
but here we are
"%llu row, for each row, some column (can be all columns) was converted to NULL"


optionB: the message can be:
errmsg_plural("converted to NULL due to data type incompatibility
happened %llu time")
but I aslo feel the wording is not perfect also.

So overall I am not sure how to construct the NOTICE messages.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Fujii Masao
Date:

On 2024/10/26 6:03, Kirill Reshke wrote:
> when the REJECT LIMIT is set to some non-zero number and the number of
> row NULL replacements exceeds the limit, is it OK to fail. Because
> there WAS errors, and we should not tolerate more than $limit errors .
> I do find this behavior to be consistent.

+1


> But what if we don't set a REJECT LIMIT, it is sane to do all
> replacements, as if REJECT LIMIT is inf.

+1


> But our REJECT LIMIT is zero
> (not set).
> So, we ignore zero REJECT LIMIT if set_to_null is set.

REJECT_LIMIT currently has to be greater than zero, so it won’t ever be zero.


> But while I was trying to implement that, I realized that I don't
> understand v4 of this patch. My misunderstanding is about
> `t_on_error_null` tests. We are allowed to insert a NULL value for the
> first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why
> do we do that? My thought is we should try to execute
> InputFunctionCallSafe with NULL value  (i mean, here [1]) for the
> column after we failed to insert the input value. And, if this second
> call is successful, we do replacement, otherwise we count the row as
> erroneous.

Your concern is valid. Allowing NULL to be stored in a column with a NOT NULL
constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you suggested,
NULL values set by SET_TO_NULL should probably be re-evaluated.


> Hm, good catch. Applied almost as you suggested. I did tweak this
> "replace columns with invalid input values with " into "replace
> columns containing erroneous input values with". Is that OK?

Yes, sounds good.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
torikoshia
Date:
On 2024-11-09 21:55, Kirill Reshke wrote:

Thanks for working on this!

> On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com> 
> wrote:
>> 
>> 
>> 
>> On 2024/10/26 6:03, Kirill Reshke wrote:
>> > when the REJECT LIMIT is set to some non-zero number and the number of
>> > row NULL replacements exceeds the limit, is it OK to fail. Because
>> > there WAS errors, and we should not tolerate more than $limit errors .
>> > I do find this behavior to be consistent.
>> 
>> +1
>> 
>> 
>> > But what if we don't set a REJECT LIMIT, it is sane to do all
>> > replacements, as if REJECT LIMIT is inf.
>> 
>> +1
> 
> After thinking for a while, I'm now more opposed to this approach. I
> think we should count rows with erroneous data as errors only if
> null substitution for these rows failed, not the total number of rows
> which were modified.
> Then, to respect the REJECT LIMIT option, we compare this number with
> the limit. This is actually simpler approach IMHO. What do You think?

IMHO I prefer the previous interpretation.
I'm not sure this is what people expect, but I assume that REJECT_LIMIT 
is used to specify how many malformed rows are acceptable in the 
"original" data source.

>> > But while I was trying to implement that, I realized that I don't
>> > understand v4 of this patch. My misunderstanding is about
>> > `t_on_error_null` tests. We are allowed to insert a NULL value for the
>> > first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why
>> > do we do that? My thought is we should try to execute
>> > InputFunctionCallSafe with NULL value  (i mean, here [1]) for the
>> > column after we failed to insert the input value. And, if this second
>> > call is successful, we do replacement, otherwise we count the row as
>> > erroneous.
>> 
>> Your concern is valid. Allowing NULL to be stored in a column with a 
>> NOT NULL
>> constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you 
>> suggested,
>> NULL values set by SET_TO_NULL should probably be re-evaluated.
> 
> Thank you. I updated the patch with a NULL re-evaluation.
> 
> 
> PFA v7. I did not yet update the doc for this patch version, waiting
> for feedback about REJECT LIMIT + SET_TO_NULL behaviour.
> 


There were warnings when I applied the patch:

$ git apply 
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:170: 
trailing whitespace.
                    /*
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:181: 
trailing whitespace.

v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:189: 
trailing whitespace.
                                    /* If datatype if okay with NULL, 
replace
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:196: 
trailing whitespace.

v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:212: 
trailing whitespace.
                                 /*

>  @@ -403,12 +403,14 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState 
> *pstate, bool is_from)
>                   parser_errposition(pstate, def->location)));
...
> 
>  -   if (opts_out->reject_limit && !opts_out->on_error)
>  +   if (opts_out->reject_limit && !(opts_out->on_error == 
> COPY_ON_ERROR_NULL || opts_out->on_error == COPY_ON_ERROR_IGNORE))

Hmm, is this change necessary?
Personally, I feel the previous code is easier to read.


"REJECT LIMIT" should be "REJECT_LIMIT"?
> 1037                             errhint("Consider specifying the 
> REJECT LIMIT option to skip erroneous rows.")));


SET_TO_NULL is one of the value for ON_ERROR, but the patch treats 
SET_TO_NULL as option for COPY:

221 --- a/src/bin/psql/tab-complete.in.c
222 +++ b/src/bin/psql/tab-complete.in.c
223 @@ -3235,7 +3235,7 @@ match_previous_words(int pattern_id,
224         COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL",
225                       "HEADER", "QUOTE", "ESCAPE", "FORCE_QUOTE",
226                       "FORCE_NOT_NULL", "FORCE_NULL", "ENCODING", 
"DEFAULT",
227 -                     "ON_ERROR", "LOG_VERBOSITY");
228 +                     "ON_ERROR", "SET_TO_NULL", "LOG_VERBOSITY");


> Best regards,
> Kirill Reshke

-- 
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Kirill Reshke
Date:
On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2024-11-09 21:55, Kirill Reshke wrote:
>
> Thanks for working on this!

Thanks for reviewing the v7 patch series!

> > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
> > wrote:
> >>
> >>
> >>
> >> On 2024/10/26 6:03, Kirill Reshke wrote:
> >> > when the REJECT LIMIT is set to some non-zero number and the number of
> >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> >> > there WAS errors, and we should not tolerate more than $limit errors .
> >> > I do find this behavior to be consistent.
> >>
> >> +1
> >>
> >>
> >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> >> > replacements, as if REJECT LIMIT is inf.
> >>
> >> +1
> >
> > After thinking for a while, I'm now more opposed to this approach. I
> > think we should count rows with erroneous data as errors only if
> > null substitution for these rows failed, not the total number of rows
> > which were modified.
> > Then, to respect the REJECT LIMIT option, we compare this number with
> > the limit. This is actually simpler approach IMHO. What do You think?
>
> IMHO I prefer the previous interpretation.
> I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> is used to specify how many malformed rows are acceptable in the
> "original" data source.

I do like the first version of interpretation, but I have a struggle
with it. According to this interpretation, we will fail COPY command
if the number
of malformed data rows exceeds the limit, not the number of rejected
rows (some percentage of malformed rows are accepted with null
substitution)
So, a proper name for the limit will be MALFORMED_LIMIT, or something.
However, we are unable to change the name since the REJECT_LIMIT
option has already been committed.
I guess I'll just have to put up with this contradiction. I will send
an updated patch shortly...


> >> > But while I was trying to implement that, I realized that I don't
> >> > understand v4 of this patch. My misunderstanding is about
> >> > `t_on_error_null` tests. We are allowed to insert a NULL value for the
> >> > first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why
> >> > do we do that? My thought is we should try to execute
> >> > InputFunctionCallSafe with NULL value  (i mean, here [1]) for the
> >> > column after we failed to insert the input value. And, if this second
> >> > call is successful, we do replacement, otherwise we count the row as
> >> > erroneous.
> >>
> >> Your concern is valid. Allowing NULL to be stored in a column with a
> >> NOT NULL
> >> constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you
> >> suggested,
> >> NULL values set by SET_TO_NULL should probably be re-evaluated.
> >
> > Thank you. I updated the patch with a NULL re-evaluation.
> >
> >
> > PFA v7. I did not yet update the doc for this patch version, waiting
> > for feedback about REJECT LIMIT + SET_TO_NULL behaviour.
> >
>
>
> There were warnings when I applied the patch:
>
> $ git apply
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:170:
> trailing whitespace.
>                     /*
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:181:
> trailing whitespace.
>
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:189:
> trailing whitespace.
>                                     /* If datatype if okay with NULL,
> replace
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:196:
> trailing whitespace.
>
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:212:
> trailing whitespace.
>                                  /*
>
> >  @@ -403,12 +403,14 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState
> > *pstate, bool is_from)
> >                   parser_errposition(pstate, def->location)));
> ...
> >
> >  -   if (opts_out->reject_limit && !opts_out->on_error)
> >  +   if (opts_out->reject_limit && !(opts_out->on_error ==
> > COPY_ON_ERROR_NULL || opts_out->on_error == COPY_ON_ERROR_IGNORE))
>
> Hmm, is this change necessary?
> Personally, I feel the previous code is easier to read.
>
>
> "REJECT LIMIT" should be "REJECT_LIMIT"?
> > 1037                             errhint("Consider specifying the
> > REJECT LIMIT option to skip erroneous rows.")));
>
>
> SET_TO_NULL is one of the value for ON_ERROR, but the patch treats
> SET_TO_NULL as option for COPY:
>
> 221 --- a/src/bin/psql/tab-complete.in.c
> 222 +++ b/src/bin/psql/tab-complete.in.c
> 223 @@ -3235,7 +3235,7 @@ match_previous_words(int pattern_id,
> 224         COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL",
> 225                       "HEADER", "QUOTE", "ESCAPE", "FORCE_QUOTE",
> 226                       "FORCE_NOT_NULL", "FORCE_NULL", "ENCODING",
> "DEFAULT",
> 227 -                     "ON_ERROR", "LOG_VERBOSITY");
> 228 +                     "ON_ERROR", "SET_TO_NULL", "LOG_VERBOSITY");
>
>
> > Best regards,
> > Kirill Reshke
>
> --
> Regards,
>
> --
> Atsushi Torikoshi
> Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



-- 
Best regards,
Kirill Reshke



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo Nagata
Date:
On Tue, 12 Nov 2024 01:27:53 +0500
Kirill Reshke <reshkekirill@gmail.com> wrote:

> On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
> >
> > On 2024-11-09 21:55, Kirill Reshke wrote:
> >
> > Thanks for working on this!
> 
> Thanks for reviewing the v7 patch series!
> 
> > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
> > > wrote:
> > >>
> > >>
> > >>
> > >> On 2024/10/26 6:03, Kirill Reshke wrote:
> > >> > when the REJECT LIMIT is set to some non-zero number and the number of
> > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> > >> > there WAS errors, and we should not tolerate more than $limit errors .
> > >> > I do find this behavior to be consistent.
> > >>
> > >> +1
> > >>
> > >>
> > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> > >> > replacements, as if REJECT LIMIT is inf.
> > >>
> > >> +1
> > >
> > > After thinking for a while, I'm now more opposed to this approach. I
> > > think we should count rows with erroneous data as errors only if
> > > null substitution for these rows failed, not the total number of rows
> > > which were modified.
> > > Then, to respect the REJECT LIMIT option, we compare this number with
> > > the limit. This is actually simpler approach IMHO. What do You think?
> >
> > IMHO I prefer the previous interpretation.
> > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> > is used to specify how many malformed rows are acceptable in the
> > "original" data source.

I also prefer the previous version.
 
> I do like the first version of interpretation, but I have a struggle
> with it. According to this interpretation, we will fail COPY command
> if the number
> of malformed data rows exceeds the limit, not the number of rejected
> rows (some percentage of malformed rows are accepted with null
> substitution)
> So, a proper name for the limit will be MALFORMED_LIMIT, or something.
> However, we are unable to change the name since the REJECT_LIMIT
> option has already been committed.
> I guess I'll just have to put up with this contradiction. I will send
> an updated patch shortly...

I think we can rename the REJECT_LIMIT option because it is not yet released.

The documentation says that REJECT_LIMIT "Specifies the maximum number of errors",
and there are no wording "reject" in the description, so I wonder it is unclear
what means in "REJECT" in REJECT_LIMIT. It may be proper to use ERROR_LIMIT
since it is supposed to be used with ON_ERROR. 

Alternatively, if we emphasize that errors are handled other than terminating
the command,perhaps MALFORMED_LIMIT as proposed above or TOLERANCE_LIMIT may be
good, for example.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo Nagata
Date:
On Tue, 12 Nov 2024 14:03:50 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

> On Tue, 12 Nov 2024 01:27:53 +0500
> Kirill Reshke <reshkekirill@gmail.com> wrote:
> 
> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
> > >
> > > On 2024-11-09 21:55, Kirill Reshke wrote:
> > >
> > > Thanks for working on this!
> > 
> > Thanks for reviewing the v7 patch series!
> > 
> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
> > > > wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
> > > >> > I do find this behavior to be consistent.
> > > >>
> > > >> +1
> > > >>
> > > >>
> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> > > >> > replacements, as if REJECT LIMIT is inf.
> > > >>
> > > >> +1
> > > >
> > > > After thinking for a while, I'm now more opposed to this approach. I
> > > > think we should count rows with erroneous data as errors only if
> > > > null substitution for these rows failed, not the total number of rows
> > > > which were modified.
> > > > Then, to respect the REJECT LIMIT option, we compare this number with
> > > > the limit. This is actually simpler approach IMHO. What do You think?
> > >
> > > IMHO I prefer the previous interpretation.
> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> > > is used to specify how many malformed rows are acceptable in the
> > > "original" data source.
> 
> I also prefer the previous version.
>  
> > I do like the first version of interpretation, but I have a struggle
> > with it. According to this interpretation, we will fail COPY command
> > if the number
> > of malformed data rows exceeds the limit, not the number of rejected
> > rows (some percentage of malformed rows are accepted with null
> > substitution)
> > So, a proper name for the limit will be MALFORMED_LIMIT, or something.
> > However, we are unable to change the name since the REJECT_LIMIT
> > option has already been committed.
> > I guess I'll just have to put up with this contradiction. I will send
> > an updated patch shortly...
> 
> I think we can rename the REJECT_LIMIT option because it is not yet released.
> 
> The documentation says that REJECT_LIMIT "Specifies the maximum number of errors",
> and there are no wording "reject" in the description, so I wonder it is unclear
> what means in "REJECT" in REJECT_LIMIT. It may be proper to use ERROR_LIMIT
> since it is supposed to be used with ON_ERROR. 
> 
> Alternatively, if we emphasize that errors are handled other than terminating
> the command,perhaps MALFORMED_LIMIT as proposed above or TOLERANCE_LIMIT may be
> good, for example.

I might misunderstand the meaning of the name. If REJECT_LIMIT means "a limit on
the number of rows with any malformed value allowed before the COPY command is
rejected", we would not have to rename it.

-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
torikoshia
Date:
On 2024-11-12 14:17, Yugo Nagata wrote:
> On Tue, 12 Nov 2024 14:03:50 +0900
> Yugo Nagata <nagata@sraoss.co.jp> wrote:
> 
>> On Tue, 12 Nov 2024 01:27:53 +0500
>> Kirill Reshke <reshkekirill@gmail.com> wrote:
>> 
>> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
>> > >
>> > > On 2024-11-09 21:55, Kirill Reshke wrote:
>> > >
>> > > Thanks for working on this!
>> >
>> > Thanks for reviewing the v7 patch series!
>> >
>> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
>> > > > wrote:
>> > > >>
>> > > >>
>> > > >>
>> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
>> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
>> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
>> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
>> > > >> > I do find this behavior to be consistent.
>> > > >>
>> > > >> +1
>> > > >>
>> > > >>
>> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
>> > > >> > replacements, as if REJECT LIMIT is inf.
>> > > >>
>> > > >> +1
>> > > >
>> > > > After thinking for a while, I'm now more opposed to this approach. I
>> > > > think we should count rows with erroneous data as errors only if
>> > > > null substitution for these rows failed, not the total number of rows
>> > > > which were modified.
>> > > > Then, to respect the REJECT LIMIT option, we compare this number with
>> > > > the limit. This is actually simpler approach IMHO. What do You think?
>> > >
>> > > IMHO I prefer the previous interpretation.
>> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
>> > > is used to specify how many malformed rows are acceptable in the
>> > > "original" data source.
>> 
>> I also prefer the previous version.
>> 
>> > I do like the first version of interpretation, but I have a struggle
>> > with it. According to this interpretation, we will fail COPY command
>> > if the number
>> > of malformed data rows exceeds the limit, not the number of rejected
>> > rows (some percentage of malformed rows are accepted with null
>> > substitution)

I feel your concern is valid.
Currently 'reject' can occur only when converting a column's input value 
to its data type, but if we introduce set_to_null option 'reject' also 
occurs when inserting null, i.e. not null constraint.

>> > So, a proper name for the limit will be MALFORMED_LIMIT, or something.
>> > However, we are unable to change the name since the REJECT_LIMIT
>> > option has already been committed.
>> > I guess I'll just have to put up with this contradiction. I will send
>> > an updated patch shortly...
>> I think we can rename the REJECT_LIMIT option because it is not yet 
>> released.

+1

>> The documentation says that REJECT_LIMIT "Specifies the maximum number 
>> of errors",
>> and there are no wording "reject" in the description, so I wonder it 
>> is unclear
>> what means in "REJECT" in REJECT_LIMIT. It may be proper to use 
>> ERROR_LIMIT
>> since it is supposed to be used with ON_ERROR.
>> 
>> Alternatively, if we emphasize that errors are handled other than 
>> terminating
>> the command,perhaps MALFORMED_LIMIT as proposed above or 
>> TOLERANCE_LIMIT may be
>> good, for example.
> 
> I might misunderstand the meaning of the name. If REJECT_LIMIT means "a 
> limit on
> the number of rows with any malformed value allowed before the COPY 
> command is
> rejected", we would not have to rename it.

The meaning of REJECT_LIMIT is what you described, and I think Kirill 
worries about cases when malformed rows are accepted(=not REJECTed) with 
null substitution. REJECT_LIMIT counts this case as REJECTed.

-- 
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Yugo NAGATA
Date:
On Tue, 12 Nov 2024 17:38:25 +0900
torikoshia <torikoshia@oss.nttdata.com> wrote:

> On 2024-11-12 14:17, Yugo Nagata wrote:
> > On Tue, 12 Nov 2024 14:03:50 +0900
> > Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > 
> >> On Tue, 12 Nov 2024 01:27:53 +0500
> >> Kirill Reshke <reshkekirill@gmail.com> wrote:
> >> 
> >> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
> >> > >
> >> > > On 2024-11-09 21:55, Kirill Reshke wrote:
> >> > >
> >> > > Thanks for working on this!
> >> >
> >> > Thanks for reviewing the v7 patch series!
> >> >
> >> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
> >> > > > wrote:
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
> >> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
> >> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> >> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
> >> > > >> > I do find this behavior to be consistent.
> >> > > >>
> >> > > >> +1
> >> > > >>
> >> > > >>
> >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> >> > > >> > replacements, as if REJECT LIMIT is inf.
> >> > > >>
> >> > > >> +1
> >> > > >
> >> > > > After thinking for a while, I'm now more opposed to this approach. I
> >> > > > think we should count rows with erroneous data as errors only if
> >> > > > null substitution for these rows failed, not the total number of rows
> >> > > > which were modified.
> >> > > > Then, to respect the REJECT LIMIT option, we compare this number with
> >> > > > the limit. This is actually simpler approach IMHO. What do You think?
> >> > >
> >> > > IMHO I prefer the previous interpretation.
> >> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> >> > > is used to specify how many malformed rows are acceptable in the
> >> > > "original" data source.
> >> 
> >> I also prefer the previous version.
> >> 
> >> > I do like the first version of interpretation, but I have a struggle
> >> > with it. According to this interpretation, we will fail COPY command
> >> > if the number
> >> > of malformed data rows exceeds the limit, not the number of rejected
> >> > rows (some percentage of malformed rows are accepted with nulnot-null constraintl
> >> > substitution)
> 
> I feel your concern is valid.
> Currently 'reject' can occur only when converting a column's input value 
> to its data type, but if we introduce set_to_null option 'reject' also 
> occurs when inserting null, i.e. not null constraint.

I can suppose "reject" means failure of COPY command here, that is, a reject
of executing the command, not an error of data input. If so, we can interpret
REJECT_LIMIT as "the number of malformed rows allowed before the COPY command
is REJECTed" (not the number of rejected rows). In this case, I think we don't
have to rename the option name.

Of course, if there is more proper name that makes it easy for users to
understand the behaviour of the option, renaming should be nice.

> >> The documentation says that REJECT_LIMIT "Specifies the maximum number 
> >> of errors",
> >> and there are no wording "reject" in the description, so I wonder it 
> >> is unclear
> >> what means in "REJECT" in REJECT_LIMIT. It may be proper to use 
> >> ERROR_LIMIT
> >> since it is supposed to be used with ON_ERROR.
> >> 
> >> Alternatively, if we emphasize that errors are handled other than 
> >> terminating
> >> the command,perhaps MALFORMED_LIMIT as proposed above or 
> >> TOLERANCE_LIMIT may be
> >> good, for example.
> > 
> > I might misunderstand the meaning of the name. If REJECT_LIMIT means "a 
> > limit on
> > the number of rows with any malformed value allowed before the COPY 
> > command is
> > rejected", we would not have to rename it.
> 
> The meaning of REJECT_LIMIT is what you described, and I think Kirill 
> worries about cases when malformed rows are accepted(=not REJECTed) with 
> null substitution. REJECT_LIMIT counts this case as REJECTed.

I am a bit confused. You mean "REJECT" is raising a soft error of data
input here instead of terminating COPY?

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
torikoshia
Date:
On 2024-11-13 22:02, Yugo NAGATA wrote:
> On Tue, 12 Nov 2024 17:38:25 +0900
> torikoshia <torikoshia@oss.nttdata.com> wrote:
> 
>> On 2024-11-12 14:17, Yugo Nagata wrote:
>> > On Tue, 12 Nov 2024 14:03:50 +0900
>> > Yugo Nagata <nagata@sraoss.co.jp> wrote:
>> >
>> >> On Tue, 12 Nov 2024 01:27:53 +0500
>> >> Kirill Reshke <reshkekirill@gmail.com> wrote:
>> >>
>> >> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
>> >> > >
>> >> > > On 2024-11-09 21:55, Kirill Reshke wrote:
>> >> > >
>> >> > > Thanks for working on this!
>> >> >
>> >> > Thanks for reviewing the v7 patch series!
>> >> >
>> >> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
>> >> > > > wrote:
>> >> > > >>
>> >> > > >>
>> >> > > >>
>> >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
>> >> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
>> >> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
>> >> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
>> >> > > >> > I do find this behavior to be consistent.
>> >> > > >>
>> >> > > >> +1
>> >> > > >>
>> >> > > >>
>> >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
>> >> > > >> > replacements, as if REJECT LIMIT is inf.
>> >> > > >>
>> >> > > >> +1
>> >> > > >
>> >> > > > After thinking for a while, I'm now more opposed to this approach. I
>> >> > > > think we should count rows with erroneous data as errors only if
>> >> > > > null substitution for these rows failed, not the total number of rows
>> >> > > > which were modified.
>> >> > > > Then, to respect the REJECT LIMIT option, we compare this number with
>> >> > > > the limit. This is actually simpler approach IMHO. What do You think?
>> >> > >
>> >> > > IMHO I prefer the previous interpretation.
>> >> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
>> >> > > is used to specify how many malformed rows are acceptable in the
>> >> > > "original" data source.
>> >>
>> >> I also prefer the previous version.
>> >>
>> >> > I do like the first version of interpretation, but I have a struggle
>> >> > with it. According to this interpretation, we will fail COPY command
>> >> > if the number
>> >> > of malformed data rows exceeds the limit, not the number of rejected
>> >> > rows (some percentage of malformed rows are accepted with nulnot-null constraintl
>> >> > substitution)
>> 
>> I feel your concern is valid.
>> Currently 'reject' can occur only when converting a column's input 
>> value
>> to its data type, but if we introduce set_to_null option 'reject' also
>> occurs when inserting null, i.e. not null constraint.
> 
> I can suppose "reject" means failure of COPY command here, that is, a 
> reject
> of executing the command, not an error of data input. If so, we can 
> interpret
> REJECT_LIMIT as "the number of malformed rows allowed before the COPY 
> command
> is REJECTed" (not the number of rejected rows). In this case, I think 
> we don't
> have to rename the option name.
> 
> Of course, if there is more proper name that makes it easy for users to
> understand the behaviour of the option, renaming should be nice.
> 
>> >> The documentation says that REJECT_LIMIT "Specifies the maximum number
>> >> of errors",
>> >> and there are no wording "reject" in the description, so I wonder it
>> >> is unclear
>> >> what means in "REJECT" in REJECT_LIMIT. It may be proper to use
>> >> ERROR_LIMIT
>> >> since it is supposed to be used with ON_ERROR.
>> >>
>> >> Alternatively, if we emphasize that errors are handled other than
>> >> terminating
>> >> the command,perhaps MALFORMED_LIMIT as proposed above or
>> >> TOLERANCE_LIMIT may be
>> >> good, for example.
>> >
>> > I might misunderstand the meaning of the name. If REJECT_LIMIT means "a
>> > limit on
>> > the number of rows with any malformed value allowed before the COPY
>> > command is
>> > rejected", we would not have to rename it.
>> 
>> The meaning of REJECT_LIMIT is what you described, and I think Kirill
>> worries about cases when malformed rows are accepted(=not REJECTed) 
>> with
>> null substitution. REJECT_LIMIT counts this case as REJECTed.
> 
> I am a bit confused.

Me too:)
Let me explain my understanding.

I believe there are now two candidates that count as REJECT_LIMIT 
number:

   (1) error converting a column's input value into its data type(soft 
error)
   (2) NULL substitution failure(this comes from the proposed patch)

And I understood Kirill's idea to be the following:

   1st idea: REJECT_LIMIT counts (1)
   2nd idea: REJECT_LIMIT counts (2)

And I've agreed with the 1st one.

> You mean "REJECT" is raising a soft error of data
> input here instead of terminating COPY?

Yes.


-- 
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Sat, Nov 16, 2024 at 5:55 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> I am attaching my v8 for reference.
>

in your v8.

   <varlistentry>
    <term><literal>REJECT_LIMIT</literal></term>
    <listitem>
     <para>
      Specifies the maximum number of errors tolerated while converting a
      column's input value to its data type, when <literal>ON_ERROR</literal> is
      set to <literal>ignore</literal>.
      If the input contains more erroneous rows than the specified
value, the <command>COPY</command>
      command fails, even with <literal>ON_ERROR</literal> set to
<literal>ignore</literal>.
     </para>
    </listitem>
   </varlistentry>

then above description not meet with following example, (i think)

create table t(a int not null);
COPY t FROM STDIN WITH (on_error set_to_null, reject_limit 2);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> a
>> \.
ERROR:  null value in column "a" of relation "t" violates not-null constraint
DETAIL:  Failing row contains (null).
CONTEXT:  COPY t, line 1, column a: "a"

Overall, I think
making the domain not-null align with column level not-null would be a
good thing.


     <para>
      Specifies how to behave when encountering an error converting a column's
      input value into its data type.
      An <replaceable class="parameter">error_action</replaceable> value of
      <literal>stop</literal> means fail the command,
      <literal>ignore</literal> means discard the input row and
continue with the next one, and
      <literal>set_to_null</literal> means replace columns containing
erroneous input values with <literal>null</literal> and move to the
next row.

"and move to the next row" is wrong?
I think it should be " and move to the next field".



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Kirill Reshke
Date:
On Tue, 19 Nov 2024, 13:52 jian he, <jian.universality@gmail.com> wrote:
>
> On Sat, Nov 16, 2024 at 5:55 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > I am attaching my v8 for reference.
> >
>
> in your v8.
>
>    <varlistentry>
>     <term><literal>REJECT_LIMIT</literal></term>
>     <listitem>
>      <para>
>       Specifies the maximum number of errors tolerated while converting a
>       column's input value to its data type, when <literal>ON_ERROR</literal> is
>       set to <literal>ignore</literal>.
>       If the input contains more erroneous rows than the specified
> value, the <command>COPY</command>
>       command fails, even with <literal>ON_ERROR</literal> set to
> <literal>ignore</literal>.
>      </para>
>     </listitem>
>    </varlistentry>
>
> then above description not meet with following example, (i think)
>
> create table t(a int not null);
> COPY t FROM STDIN WITH (on_error set_to_null, reject_limit 2);
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> a
> >> \.
> ERROR:  null value in column "a" of relation "t" violates not-null constraint
> DETAIL:  Failing row contains (null).
> CONTEXT:  COPY t, line 1, column a: "a"


Sure, my v8 does not helps with column level NOT NULL constraint (or
other constraint)

> Overall, I think
> making the domain not-null align with column level not-null would be a
> good thing.

While this looks sane, it's actually a separate topic. Even on current
HEAD we have domain not-null vs column level not-null unalignment.

consider this example


```
reshke=# create table ftt2 (i int not null);
CREATE TABLE
reshke=# copy ftt2 from stdin with (reject_limit 1000, on_error ignore);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> \N
>> \.
ERROR:  null value in column "i" of relation "ftt2" violates not-null constraint
DETAIL:  Failing row contains (null).
CONTEXT:  COPY ftt2, line 1: "\N"
reshke=# create domain dd as int not null ;
CREATE DOMAIN
reshke=# create table ftt3(i dd);
CREATE TABLE
reshke=# copy ftt3 from stdin with (reject_limit 1000, on_error ignore);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> \N
>> \.
NOTICE:  1 row was skipped due to data type incompatibility
COPY 0
reshke=#
```
So, if we want this, we need to start another thread and deal with
REJECT_LIMIT + on_error ignore first.

The ExecConstraints function is the source of the error in scenario 1.
Therefore, we require something like "ExecConstraintsSafe" to
accommodate the aforementioned. That is a significant change. Not sure
it will be accepted by the community.

>
>      <para>
>       Specifies how to behave when encountering an error converting a column's
>       input value into its data type.
>       An <replaceable class="parameter">error_action</replaceable> value of
>       <literal>stop</literal> means fail the command,
>       <literal>ignore</literal> means discard the input row and
> continue with the next one, and
>       <literal>set_to_null</literal> means replace columns containing
> erroneous input values with <literal>null</literal> and move to the
> next row.
>
> "and move to the next row" is wrong?
> I think it should be " and move to the next field".

Yes, "and move to the next field" is correct.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Kirill Reshke
Date:
On Tue, 19 Nov 2024 at 13:52, jian he <jian.universality@gmail.com> wrote:
>
> On Sat, Nov 16, 2024 at 5:55 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > I am attaching my v8 for reference.
> >
>
> in your v8.
>
>    <varlistentry>
>     <term><literal>REJECT_LIMIT</literal></term>
>     <listitem>
>      <para>
>       Specifies the maximum number of errors tolerated while converting a
>       column's input value to its data type, when <literal>ON_ERROR</literal> is
>       set to <literal>ignore</literal>.
>       If the input contains more erroneous rows than the specified
> value, the <command>COPY</command>
>       command fails, even with <literal>ON_ERROR</literal> set to
> <literal>ignore</literal>.
>      </para>
>     </listitem>
>    </varlistentry>
>
> then above description not meet with following example, (i think)
>
> create table t(a int not null);
> COPY t FROM STDIN WITH (on_error set_to_null, reject_limit 2);
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> a
> >> \.
> ERROR:  null value in column "a" of relation "t" violates not-null constraint
> DETAIL:  Failing row contains (null).
> CONTEXT:  COPY t, line 1, column a: "a"
>
> Overall, I think
> making the domain not-null align with column level not-null would be a
> good thing.
>
>
>      <para>
>       Specifies how to behave when encountering an error converting a column's
>       input value into its data type.
>       An <replaceable class="parameter">error_action</replaceable> value of
>       <literal>stop</literal> means fail the command,
>       <literal>ignore</literal> means discard the input row and
> continue with the next one, and
>       <literal>set_to_null</literal> means replace columns containing
> erroneous input values with <literal>null</literal> and move to the
> next row.
>
> "and move to the next row" is wrong?
> I think it should be " and move to the next field".

Hi! There is not too much time left in this CF, so I moved to the next one.
If you are going to work on this patch, I'm waiting on your feedback
or a v9 patch that answers the issues brought up in this discussion.

--
Best regards,
Kirill Reshke



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Kirill Reshke
Date:
Thank you for your update.

On Fri, 13 Dec 2024 at 08:15, jian he <jian.universality@gmail.com> wrote:
>
> + /*
> + * Here we end processing of current COPY row.
> + * Update copy state counter for number of erroneous rows.
> + */
> + cstate->num_errors++;
> + cstate->escontext->error_occurred = true;
> +
> + /* Only print this NOTICE message, if it will not be followed by ERROR */
> + if (cstate->opts.log_verbosity == COPY_LOG_VERBOSITY_VERBOSE &&
> + (
> + (cstate->opts.on_error == COPY_ON_ERROR_NULL &&
> cstate->opts.reject_limit > 0 && cstate->num_errors <=
> cstate->opts.reject_limit) ||
> + (cstate->opts.on_error == COPY_ON_ERROR_IGNORE &&
> (cstate->opts.reject_limit ==  0 || cstate->num_errors <=
> cstate->opts.reject_limit))
> + ))
>   {
> this is kind of hard to comprehend.
> so attached is a simple version of it based on v8.

+1

So, in this version you essentially removed support for REJECT_LIMIT +
SET_TO_NULL feature? Looks like a promising change. It is more likely
to see this committed.
So, +1 on that too.

However, v9 lacks tests for REJECT_LIMIT vs erroneous rows tests.
In short, we need  this message somewhere in a regression test.
```
ERROR:  skipped more than REJECT_LIMIT (xxx) rows due to data type
incompatibility
```

Also, please update commit msg with all authors and reviewers. This
will make committer job a little bit easier

--
Best regards,
Kirill Reshke



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
hi.
rebase only.

Attachment

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Jim Jones
Date:
Hi Jian

On 07.03.25 04:48, jian he wrote:
> hi.
> rebase only.


I revisited this patch today. It applies and builds cleanly, and it
works as expected.

Some tests and minor comments:

====

1) WARNING might be a better fit than NOTICE here.

postgres=# \pset null NULL
Null display is "NULL".
postgres=# CREATE TEMPORARY TABLE t (x int, y int, z text);
CREATE TABLE
postgres=# COPY t (x,y) FROM STDIN WITH (on_error set_to_null, format csv);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,a
>> 2,1
>> 3,2
>> 4,b
>> a,c
>> \.
NOTICE:  erroneous values in 3 rows were replaced with null
COPY 5
postgres=# SELECT * FROM t;
  x   |  y   |  z   
------+------+------
    1 | NULL | NULL
    2 |    1 | NULL
    3 |    2 | NULL
    4 | NULL | NULL
 NULL | NULL | NULL
(5 rows)


postgres=# \pset null NULL
Null display is "NULL".
postgres=# CREATE TEMPORARY TABLE t (x int, y int, z text);
CREATE TABLE
postgres=# COPY t (x,y) FROM STDIN WITH (on_error set_to_null, format
csv, log_verbosity verbose);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,a
>> 2,1
>> 3,2
>> 4,b
>> a,c
>> \.
NOTICE:  column "y" was set to null due to data type incompatibility at
line 1
NOTICE:  column "y" was set to null due to data type incompatibility at
line 4
NOTICE:  column "x" was set to null due to data type incompatibility at
line 5
NOTICE:  column "y" was set to null due to data type incompatibility at
line 5
NOTICE:  erroneous values in 3 rows were replaced with null
COPY 5
postgres=# SELECT * FROM t;
  x   |  y   |  z   
------+------+------
    1 | NULL | NULL
    2 |    1 | NULL
    3 |    2 | NULL
    4 | NULL | NULL
 NULL | NULL | NULL
(5 rows)


I would still leave the extra messages from "log_verbosity verbose" as
NOTICE though. What do you think?

====

2) Inconsistent terminology. Invalid values in "on_error set_to_null"
mode are names as "erroneous", but as "invalid" in "on_error stop" mode.
I don't want to get into the semantics of erroneous or invalid, but
sticking to one terminology would IMHO look better.

postgres=# \pset null NULL
Null display is "NULL".
postgres=# CREATE TEMPORARY TABLE t (x int, y int, z text);
CREATE TABLE
postgres=# COPY t (x,y) FROM STDIN WITH (on_error stop, format csv,
log_verbosity verbose);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,a
>> 2,1
>> 3,2
>> 4,b
>> a,c
>> \.
ERROR:  invalid input syntax for type integer: "a"
CONTEXT:  COPY t, line 1, column y: "a"
postgres=# SELECT * FROM t;
 x | y | z
---+---+---
(0 rows)

====

3) same as in 1)

postgres=# \pset null NULL
Null display is "NULL".
postgres=# CREATE TEMPORARY TABLE t (x int, y int, z text);
CREATE TABLE
postgres=# COPY t (x,y) FROM STDIN WITH (on_error ignore, format csv,
log_verbosity verbose);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,a
>> 2,1
>> 3,2
>> 4,b
>> a,c
>> \.
NOTICE:  skipping row due to data type incompatibility at line 1 for
column "y": "a"
NOTICE:  skipping row due to data type incompatibility at line 4 for
column "y": "b"
NOTICE:  skipping row due to data type incompatibility at line 5 for
column "x": "a"
NOTICE:  3 rows were skipped due to data type incompatibility
COPY 2
postgres=# SELECT * FROM t;
 x | y |  z   
---+---+------
 2 | 1 | NULL
 3 | 2 | NULL
(2 rows)====

====

"on_error ignore" works well with "reject_limit #"

postgres=# \pset null NULL
Null display is "NULL".
postgres=# CREATE TEMPORARY TABLE t (x int, y int, z text);
CREATE TABLE
postgres=# COPY t (x,y) FROM STDIN WITH (on_error ignore, format csv,
log_verbosity verbose, reject_limit 1);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,a
>> 2,1
>> 3,2
>> 4,b
>> a,c
>> \.
NOTICE:  skipping row due to data type incompatibility at line 1 for
column "y": "a"
NOTICE:  skipping row due to data type incompatibility at line 4 for
column "y": "b"
ERROR:  skipped more than REJECT_LIMIT (1) rows due to data type
incompatibility
CONTEXT:  COPY t, line 4, column y: "b"
postgres=# SELECT * FROM t;
 x | y | z
---+---+---
(0 rows)

best regards, Jim



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Tue, Mar 11, 2025 at 6:31 PM Jim Jones <jim.jones@uni-muenster.de> wrote:
>
>
> I revisited this patch today. It applies and builds cleanly, and it
> works as expected.
>
> Some tests and minor comments:
>

hi. Jim Jones.
thanks for testsing it again!


> ====
>
> 1) WARNING might be a better fit than NOTICE here.
>

but NOTICE, on_errror set_to_null is aligned with on_errror ignore.

>
> I would still leave the extra messages from "log_verbosity verbose" as
> NOTICE though. What do you think?
>
> ====

When LOG_VERBOSITY option is set to verbose,
for ignore option, a NOTICE message containing the line of the input
file and the column name
whose input conversion has failed is emitted for each discarded row;
for set_to_null option, a NOTICE message containing the line of the
input file and the column name
where value was replaced with NULL for each input conversion failure.

see the above desciption,
on_errror set_to_null is aligned with on_errror ignore.
it's just on_errror ignore is per row, on_errror set_to_null is per
column/field.
so NOTICE is aligned with other on_error option.

>
> 2) Inconsistent terminology. Invalid values in "on_error set_to_null"
> mode are names as "erroneous", but as "invalid" in "on_error stop" mode.
> I don't want to get into the semantics of erroneous or invalid, but
> sticking to one terminology would IMHO look better.
>
I am open to changing it.
what do you think "invalid values in %llu row was replaced with null"?

> ====
>
> "on_error ignore" works well with "reject_limit #"
>
i remember there was some confusion about on_error set_to_null with
reject_limit option.
I choose to not suport it.
obviously, if there is consenses, we can support it later.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Jim Jones
Date:
On 12.03.25 09:00, jian he wrote:
>> 1) WARNING might be a better fit than NOTICE here.
>>
> but NOTICE, on_errror set_to_null is aligned with on_errror ignore.
>
>> I would still leave the extra messages from "log_verbosity verbose" as
>> NOTICE though. What do you think?
>>
>> ====
> When LOG_VERBOSITY option is set to verbose,
> for ignore option, a NOTICE message containing the line of the input
> file and the column name
> whose input conversion has failed is emitted for each discarded row;
> for set_to_null option, a NOTICE message containing the line of the
> input file and the column name
> where value was replaced with NULL for each input conversion failure.
>
> see the above desciption,
> on_errror set_to_null is aligned with on_errror ignore.
> it's just on_errror ignore is per row, on_errror set_to_null is per
> column/field.
> so NOTICE is aligned with other on_error option.


I considered using a WARNING due to the severity of the issue - the
failure to import data - but either NOTICE or WARNING works for me.


>> 2) Inconsistent terminology. Invalid values in "on_error set_to_null"
>> mode are names as "erroneous", but as "invalid" in "on_error stop" mode.
>> I don't want to get into the semantics of erroneous or invalid, but
>> sticking to one terminology would IMHO look better.
>>
> I am open to changing it.
> what do you think "invalid values in %llu row was replaced with null"?


LGTM: "invalid values in %llu rows were replaced with null"

Thanks for the patch!

Best, Jim




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Wed, Mar 12, 2025 at 4:26 PM Jim Jones <jim.jones@uni-muenster.de> wrote:
>
> >> 2) Inconsistent terminology. Invalid values in "on_error set_to_null"
> >> mode are names as "erroneous", but as "invalid" in "on_error stop" mode.
> >> I don't want to get into the semantics of erroneous or invalid, but
> >> sticking to one terminology would IMHO look better.
> >>
> > I am open to changing it.
> > what do you think "invalid values in %llu row was replaced with null"?
>
> LGTM: "invalid values in %llu rows were replaced with null"
>
changed based on this.

also minor documentation tweaks.

Attachment

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
vignesh C
Date:
On Tue, 18 Mar 2025 at 09:26, jian he <jian.universality@gmail.com> wrote:
>
> changed based on this.
>
> also minor documentation tweaks.

Few comments:
1) I felt this is wrong:
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 9a4d993e2bc..7980513a9bd 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -3280,7 +3280,7 @@ match_previous_words(int pattern_id,
                COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL",
                                          "HEADER", "QUOTE", "ESCAPE",
"FORCE_QUOTE",
                                          "FORCE_NOT_NULL",
"FORCE_NULL", "ENCODING", "DEFAULT",
-                                         "ON_ERROR", "LOG_VERBOSITY");
+                                         "ON_ERROR", "SET_TO_NULL",
"LOG_VERBOSITY");

as the following fails:
postgres=# copy t_on_error_null from stdin WITH ( set_to_null );
ERROR:  option "set_to_null" not recognized
LINE 1: copy t_on_error_null from stdin WITH ( set_to_null );

2) Can you limit this to 80 chars if possible to improve the readability:
+      <literal>stop</literal> means fail the command,
+      <literal>ignore</literal> means discard the input row and
continue with the next one, and
+      <literal>set_to_null</literal> means replace columns containing
invalid input values with
+      <literal>NULL</literal> and move to the next field.

3) similarly here too:
+      For <literal>ignore</literal> option,
+      a <literal>NOTICE</literal> message containing the ignored row count is
+      emitted at the end of the <command>COPY FROM</command> if at
least one row was discarded.
+      For <literal>set_to_null</literal> option,
+      a <literal>NOTICE</literal> message indicating the number of
rows where invalid input values were replaced with
+      null is emitted at the end of the <command>COPY FROM</command>
if at least one row was replaced.

4) Could you mention a brief one line in the commit message as to why
"on_error null" cannot be used:
Extent "on_error action", introduce new option:  on_error set_to_null.
Current grammar makes us unable to use "on_error null", so we choose
"on_error set_to_null".

Regards,
Vignesh



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Fri, Mar 21, 2025 at 2:34 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Few comments:
> 1) I felt this is wrong:
> diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
> index 9a4d993e2bc..7980513a9bd 100644
> --- a/src/bin/psql/tab-complete.in.c
> +++ b/src/bin/psql/tab-complete.in.c
> @@ -3280,7 +3280,7 @@ match_previous_words(int pattern_id,
>                 COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL",
>                                           "HEADER", "QUOTE", "ESCAPE",
> "FORCE_QUOTE",
>                                           "FORCE_NOT_NULL",
> "FORCE_NULL", "ENCODING", "DEFAULT",
> -                                         "ON_ERROR", "LOG_VERBOSITY");
> +                                         "ON_ERROR", "SET_TO_NULL",
> "LOG_VERBOSITY");
>
> as the following fails:
> postgres=# copy t_on_error_null from stdin WITH ( set_to_null );
> ERROR:  option "set_to_null" not recognized
> LINE 1: copy t_on_error_null from stdin WITH ( set_to_null );
>

- COMPLETE_WITH("stop", "ignore");
+ COMPLETE_WITH("stop", "ignore", "set_to_null");
yech. I think I fixed this.

> 2) Can you limit this to 80 chars if possible to improve the readability:
> +      <literal>stop</literal> means fail the command,
> +      <literal>ignore</literal> means discard the input row and
> continue with the next one, and
> +      <literal>set_to_null</literal> means replace columns containing
> invalid input values with
> +      <literal>NULL</literal> and move to the next field.
>
> 3) similarly here too:
> +      For <literal>ignore</literal> option,
> +      a <literal>NOTICE</literal> message containing the ignored row count is
> +      emitted at the end of the <command>COPY FROM</command> if at
> least one row was discarded.
> +      For <literal>set_to_null</literal> option,
> +      a <literal>NOTICE</literal> message indicating the number of
> rows where invalid input values were replaced with
> +      null is emitted at the end of the <command>COPY FROM</command>
> if at least one row was replaced.
>
sure.

> 4) Could you mention a brief one line in the commit message as to why
> "on_error null" cannot be used:
> Extent "on_error action", introduce new option:  on_error set_to_null.
> Current grammar makes us unable to use "on_error null", so we choose
> "on_error set_to_null".

by the following changes, we can change to (on_error null).
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3579,6 +3579,7 @@ copy_generic_opt_elem:

 copy_generic_opt_arg:
                        opt_boolean_or_string                   { $$ =
(Node *) makeString($1); }
+                       | NULL_P
         { $$ = (Node *) makeString("null"); }
                        | NumericOnly
 { $$ = (Node *) $1; }
                        | '*'
         { $$ = (Node *) makeNode(A_Star); }
                        | DEFAULT                       { $$ = (Node
*) makeString("default"); }

COPY x from stdin (format null);
ERROR:  syntax error at or near "null"
LINE 1: COPY x from stdin (format null);
                                  ^
will become

COPY x from stdin (format null);
ERROR:  COPY format "null" not recognized
LINE 1: COPY x from stdin (format null);
                           ^

it will cause NULL_P from reserved word to
non-reserved word in the COPY related command.


I am not sure this is what we want.
Anyway, I attached both two version
(ON_ERROR SET_TO_NULL) (ON_ERROR NULL).

Attachment

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
vignesh C
Date:
On Mon, 24 Mar 2025 at 13:21, jian he <jian.universality@gmail.com> wrote:
>
> I am not sure this is what we want.
> Anyway, I attached both two version

Few comments
1) I understood the problem, your first approach is ok for me.

2) Here in error we say column c1 violates not-null constraint and in
the context we show column c2, should the context also display c2
column:
postgres=# create table t3(c1 int not null, c2 int, check (c1 > 10));
CREATE TABLE
postgres=# COPY t3 FROM STDIN WITH (on_error set_to_null);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> a  b
>> \.
ERROR:  null value in column "c1" of relation "t3" violates not-null constraint
DETAIL:  Failing row contains (null, null).
CONTEXT:  COPY t3, line 1, column c2: "b"

3) typo becomen should be become:
null will becomen reserved to non-reserved

4) There is a whitespace error while applying patch
Applying: COPY (on_error set_to_null)
.git/rebase-apply/patch:39: trailing whitespace.
      a <literal>NOTICE</literal> message indicating the number of rows
warning: 1 line adds whitespace errors.

Regards,
Vignesh



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Tue, Mar 25, 2025 at 2:31 PM vignesh C <vignesh21@gmail.com> wrote:
>
> 2) Here in error we say column c1 violates not-null constraint and in
> the context we show column c2, should the context also display c2
> column:
> postgres=# create table t3(c1 int not null, c2 int, check (c1 > 10));
> CREATE TABLE
> postgres=# COPY t3 FROM STDIN WITH (on_error set_to_null);
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> a  b
> >> \.
> ERROR:  null value in column "c1" of relation "t3" violates not-null constraint
> DETAIL:  Failing row contains (null, null).
> CONTEXT:  COPY t3, line 1, column c2: "b"
>

It took me a while to figure out why.
with the attached, now the error message becomes:

ERROR:  null value in column "c1" of relation "t3" violates not-null constraint
DETAIL:  Failing row contains (null, null).
CONTEXT:  COPY t3, line 1: "a,b"

while at it,
(on_error set_to_null, log_verbosity verbose)
error message CONTEXT will only emit out relation name,
this aligns with (on_error ignore, log_verbosity verbose).

one of the message out example:
+NOTICE:  column "b" was set to null due to data type incompatibility at line 2
+CONTEXT:  COPY t_on_error_null



> 3) typo becomen should be become:
> null will becomen reserved to non-reserved
fixed.

> 4) There is a whitespace error while applying patch
> Applying: COPY (on_error set_to_null)
> .git/rebase-apply/patch:39: trailing whitespace.
>       a <literal>NOTICE</literal> message indicating the number of rows
> warning: 1 line adds whitespace errors.
fixed.

Attachment

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Masahiko Sawada
Date:
On Fri, Apr 4, 2025 at 4:55 AM jian he <jian.universality@gmail.com> wrote:
>
> On Tue, Mar 25, 2025 at 2:31 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > 2) Here in error we say column c1 violates not-null constraint and in
> > the context we show column c2, should the context also display c2
> > column:
> > postgres=# create table t3(c1 int not null, c2 int, check (c1 > 10));
> > CREATE TABLE
> > postgres=# COPY t3 FROM STDIN WITH (on_error set_to_null);
> > Enter data to be copied followed by a newline.
> > End with a backslash and a period on a line by itself, or an EOF signal.
> > >> a  b
> > >> \.
> > ERROR:  null value in column "c1" of relation "t3" violates not-null constraint
> > DETAIL:  Failing row contains (null, null).
> > CONTEXT:  COPY t3, line 1, column c2: "b"
> >
>
> It took me a while to figure out why.
> with the attached, now the error message becomes:
>
> ERROR:  null value in column "c1" of relation "t3" violates not-null constraint
> DETAIL:  Failing row contains (null, null).
> CONTEXT:  COPY t3, line 1: "a,b"
>
> while at it,
> (on_error set_to_null, log_verbosity verbose)
> error message CONTEXT will only emit out relation name,
> this aligns with (on_error ignore, log_verbosity verbose).
>
> one of the message out example:
> +NOTICE:  column "b" was set to null due to data type incompatibility at line 2
> +CONTEXT:  COPY t_on_error_null
>
>
>
> > 3) typo becomen should be become:
> > null will becomen reserved to non-reserved
> fixed.
>
> > 4) There is a whitespace error while applying patch
> > Applying: COPY (on_error set_to_null)
> > .git/rebase-apply/patch:39: trailing whitespace.
> >       a <literal>NOTICE</literal> message indicating the number of rows
> > warning: 1 line adds whitespace errors.
> fixed.

I've reviewed the v15 patch and here are some comments:

How about renaming the new option value to 'set_null"? The 'to' in the
value name seems redundant to me.

---
+        COPY_ON_ERROR_NULL,                    /* set error field to null */

I think it's better to rename COPY_ON_ERROR_SET_TO_NULL (or
COPY_ON_ERROR_SET_NULL if we change the option value name) for
consistency with the value name.

---
+                else if (cstate->opts.on_error == COPY_ON_ERROR_NULL)
+                        ereport(NOTICE,
+                                        errmsg_plural("invalid values
in %" PRIu64 " row was replaced with null",
+
"invalid values in %" PRIu64 " rows were replaced with null",
+
cstate->num_errors,
+
cstate->num_errors));

How about adding "due to data type incompatibility" at the end of the message?

---
+                                    ereport(NOTICE,
+                                                    errmsg("column
\"%s\" was set to null due to data type incompatibility at line %"
PRIu64 "",
+
cstate->cur_attname,
+
cstate->cur_lineno));

Similar to the IGNORE case, we can show the data in question in the message.

---
+                    else
+                            ereport(ERROR,
+
errcode(ERRCODE_NOT_NULL_VIOLATION),
+                                            errmsg("domain %s does
not allow null values", format_type_be(typioparams[m])),
+                                            errdatatype(typioparams[m]));

If domain data type is the sole case where not to accept NULL, can we
check it beforehand to avoid calling the second
InputFunctionCallSafe() for non-domain data types? Also, if we want to
end up with an error when setting NULL to a domain type with NOT NULL,
I think we don't need to try to handle a soft error by passing
econtext to InputFunctionCallSafe().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Sat, Apr 5, 2025 at 5:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Apr 4, 2025 at 4:55 AM jian he <jian.universality@gmail.com> wrote:
> >
> > On Tue, Mar 25, 2025 at 2:31 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > 2) Here in error we say column c1 violates not-null constraint and in
> > > the context we show column c2, should the context also display c2
> > > column:
> > > postgres=# create table t3(c1 int not null, c2 int, check (c1 > 10));
> > > CREATE TABLE
> > > postgres=# COPY t3 FROM STDIN WITH (on_error set_to_null);
> > > Enter data to be copied followed by a newline.
> > > End with a backslash and a period on a line by itself, or an EOF signal.
> > > >> a  b
> > > >> \.
> > > ERROR:  null value in column "c1" of relation "t3" violates not-null constraint
> > > DETAIL:  Failing row contains (null, null).
> > > CONTEXT:  COPY t3, line 1, column c2: "b"
> > >
> >
> > It took me a while to figure out why.
> > with the attached, now the error message becomes:
> >
> > ERROR:  null value in column "c1" of relation "t3" violates not-null constraint
> > DETAIL:  Failing row contains (null, null).
> > CONTEXT:  COPY t3, line 1: "a,b"
> >
> > while at it,
> > (on_error set_to_null, log_verbosity verbose)
> > error message CONTEXT will only emit out relation name,
> > this aligns with (on_error ignore, log_verbosity verbose).
> >
> > one of the message out example:
> > +NOTICE:  column "b" was set to null due to data type incompatibility at line 2
> > +CONTEXT:  COPY t_on_error_null
> >
> >
> >
> > > 3) typo becomen should be become:
> > > null will becomen reserved to non-reserved
> > fixed.
> >
> > > 4) There is a whitespace error while applying patch
> > > Applying: COPY (on_error set_to_null)
> > > .git/rebase-apply/patch:39: trailing whitespace.
> > >       a <literal>NOTICE</literal> message indicating the number of rows
> > > warning: 1 line adds whitespace errors.
> > fixed.
>
> I've reviewed the v15 patch and here are some comments:
>
> How about renaming the new option value to 'set_null"? The 'to' in the
> value name seems redundant to me.
>
> ---
> +        COPY_ON_ERROR_NULL,                    /* set error field to null */
>
> I think it's better to rename COPY_ON_ERROR_SET_TO_NULL (or
> COPY_ON_ERROR_SET_NULL if we change the option value name) for
> consistency with the value name.
>
> ---
> +                else if (cstate->opts.on_error == COPY_ON_ERROR_NULL)
> +                        ereport(NOTICE,
> +                                        errmsg_plural("invalid values
> in %" PRIu64 " row was replaced with null",
> +
> "invalid values in %" PRIu64 " rows were replaced with null",
> +
> cstate->num_errors,
> +
> cstate->num_errors));
>
> How about adding "due to data type incompatibility" at the end of the message?
>
> ---
> +                                    ereport(NOTICE,
> +                                                    errmsg("column
> \"%s\" was set to null due to data type incompatibility at line %"
> PRIu64 "",
> +
> cstate->cur_attname,
> +
> cstate->cur_lineno));
>
> Similar to the IGNORE case, we can show the data in question in the message.
>
> ---
> +                    else
> +                            ereport(ERROR,
> +
> errcode(ERRCODE_NOT_NULL_VIOLATION),
> +                                            errmsg("domain %s does
> not allow null values", format_type_be(typioparams[m])),
> +                                            errdatatype(typioparams[m]));
>
> If domain data type is the sole case where not to accept NULL, can we
> check it beforehand to avoid calling the second
> InputFunctionCallSafe() for non-domain data types? Also, if we want to
> end up with an error when setting NULL to a domain type with NOT NULL,
> I think we don't need to try to handle a soft error by passing
> econtext to InputFunctionCallSafe().
>

please check attached, hope i have addressed all the points you've mentioned.


> If domain data type is the sole case where not to accept NULL, can we
> check it beforehand to avoid calling the second
> InputFunctionCallSafe() for non-domain data types?

I doubt it.

we have
InputFunctionCallSafe(FmgrInfo *flinfo, char *str,
                      Oid typioparam, int32 typmod,
                      fmNodePtr escontext,
                      Datum *result)
{
    LOCAL_FCINFO(fcinfo, 3);
    if (str == NULL && flinfo->fn_strict)
    {
        *result = (Datum) 0;    /* just return null result */
        return true;
    }
}

Most of the non-domain type input functions are strict.
see query result:

select proname, pt.typname, proisstrict,pt.typtype
from pg_type pt
join pg_proc pp on pp.oid = pt.typinput
where pt.typtype <> 'd'
and pt.typtype <> 'p'
and proisstrict is false;

so the second InputFunctionCallSafe will be faster for non-domain types.

before CopyFromTextLikeOneRow we don't know if this type is
domain_with_constraint or not.
Beforehand, we can conditionally call DomainHasConstraints to find out.
but DomainHasConstraints is expensive, which may carry extra
performance issues for non-domain types.

but the second InputFunctionCallSafe call will not be a big issue for
domain_with_constraint,
because the first time domain_in call already cached related structs.

Attachment

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
Masahiko Sawada
Date:
On Sat, Apr 5, 2025 at 1:31 AM jian he <jian.universality@gmail.com> wrote:
>
> On Sat, Apr 5, 2025 at 5:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Apr 4, 2025 at 4:55 AM jian he <jian.universality@gmail.com> wrote:
> > >
> > > On Tue, Mar 25, 2025 at 2:31 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > 2) Here in error we say column c1 violates not-null constraint and in
> > > > the context we show column c2, should the context also display c2
> > > > column:
> > > > postgres=# create table t3(c1 int not null, c2 int, check (c1 > 10));
> > > > CREATE TABLE
> > > > postgres=# COPY t3 FROM STDIN WITH (on_error set_to_null);
> > > > Enter data to be copied followed by a newline.
> > > > End with a backslash and a period on a line by itself, or an EOF signal.
> > > > >> a  b
> > > > >> \.
> > > > ERROR:  null value in column "c1" of relation "t3" violates not-null constraint
> > > > DETAIL:  Failing row contains (null, null).
> > > > CONTEXT:  COPY t3, line 1, column c2: "b"
> > > >
> > >
> > > It took me a while to figure out why.
> > > with the attached, now the error message becomes:
> > >
> > > ERROR:  null value in column "c1" of relation "t3" violates not-null constraint
> > > DETAIL:  Failing row contains (null, null).
> > > CONTEXT:  COPY t3, line 1: "a,b"
> > >
> > > while at it,
> > > (on_error set_to_null, log_verbosity verbose)
> > > error message CONTEXT will only emit out relation name,
> > > this aligns with (on_error ignore, log_verbosity verbose).
> > >
> > > one of the message out example:
> > > +NOTICE:  column "b" was set to null due to data type incompatibility at line 2
> > > +CONTEXT:  COPY t_on_error_null
> > >
> > >
> > >
> > > > 3) typo becomen should be become:
> > > > null will becomen reserved to non-reserved
> > > fixed.
> > >
> > > > 4) There is a whitespace error while applying patch
> > > > Applying: COPY (on_error set_to_null)
> > > > .git/rebase-apply/patch:39: trailing whitespace.
> > > >       a <literal>NOTICE</literal> message indicating the number of rows
> > > > warning: 1 line adds whitespace errors.
> > > fixed.
> >
> > I've reviewed the v15 patch and here are some comments:
> >
> > How about renaming the new option value to 'set_null"? The 'to' in the
> > value name seems redundant to me.
> >
> > ---
> > +        COPY_ON_ERROR_NULL,                    /* set error field to null */
> >
> > I think it's better to rename COPY_ON_ERROR_SET_TO_NULL (or
> > COPY_ON_ERROR_SET_NULL if we change the option value name) for
> > consistency with the value name.
> >
> > ---
> > +                else if (cstate->opts.on_error == COPY_ON_ERROR_NULL)
> > +                        ereport(NOTICE,
> > +                                        errmsg_plural("invalid values
> > in %" PRIu64 " row was replaced with null",
> > +
> > "invalid values in %" PRIu64 " rows were replaced with null",
> > +
> > cstate->num_errors,
> > +
> > cstate->num_errors));
> >
> > How about adding "due to data type incompatibility" at the end of the message?
> >
> > ---
> > +                                    ereport(NOTICE,
> > +                                                    errmsg("column
> > \"%s\" was set to null due to data type incompatibility at line %"
> > PRIu64 "",
> > +
> > cstate->cur_attname,
> > +
> > cstate->cur_lineno));
> >
> > Similar to the IGNORE case, we can show the data in question in the message.
> >
> > ---
> > +                    else
> > +                            ereport(ERROR,
> > +
> > errcode(ERRCODE_NOT_NULL_VIOLATION),
> > +                                            errmsg("domain %s does
> > not allow null values", format_type_be(typioparams[m])),
> > +                                            errdatatype(typioparams[m]));
> >
> > If domain data type is the sole case where not to accept NULL, can we
> > check it beforehand to avoid calling the second
> > InputFunctionCallSafe() for non-domain data types? Also, if we want to
> > end up with an error when setting NULL to a domain type with NOT NULL,
> > I think we don't need to try to handle a soft error by passing
> > econtext to InputFunctionCallSafe().
> >
>
> please check attached, hope i have addressed all the points you've mentioned.
>
>
> > If domain data type is the sole case where not to accept NULL, can we
> > check it beforehand to avoid calling the second
> > InputFunctionCallSafe() for non-domain data types?
>
> I doubt it.
>
> we have
> InputFunctionCallSafe(FmgrInfo *flinfo, char *str,
>                       Oid typioparam, int32 typmod,
>                       fmNodePtr escontext,
>                       Datum *result)
> {
>     LOCAL_FCINFO(fcinfo, 3);
>     if (str == NULL && flinfo->fn_strict)
>     {
>         *result = (Datum) 0;    /* just return null result */
>         return true;
>     }
> }
>
> Most of the non-domain type input functions are strict.
> see query result:
>
> select proname, pt.typname, proisstrict,pt.typtype
> from pg_type pt
> join pg_proc pp on pp.oid = pt.typinput
> where pt.typtype <> 'd'
> and pt.typtype <> 'p'
> and proisstrict is false;
>
> so the second InputFunctionCallSafe will be faster for non-domain types.

Agreed.

BTW have you measured the overheads of calling InputFunctionCallSafe
twice? If it's significant, we might want to find other ways to
achieve it as it would not be good to incur overhead just for
relatively rare cases.

Here are some comments:

+               if (InputFunctionCallSafe(&in_functions[m],
+                                         NULL,
+                                         typioparams[m],
+                                         att->atttypmod,
+                                         NULL,
+                                         &values[m]))

Given that we pass NULL to escontext, does this function return false
in an error case? Or can we use InputFunctionCall instead?

I think we should mention that SET_NULL still could fail if the data
type of the column doesn't accept NULL.

How about restructuring the codes around handling data incompatibility
errors like:

else if (!InputFunctionCallSafe(...))
{
    if (cstate->opts.on_error == IGNORE)
    {
        cstate->num_errors++;
        if (cstate->opts.log_verbosity == VERBOSE)
            write a NOTICE message;
        return true; // ignore whole row.
    }
    else if (cstate->opts.on_error == SET_NULL)
    {
        current_row_erroneous = true;
        set NULL to the column;
        if (cstate->opts.log_verbosity == VERBOSE)
            write a NOTICE message;
        continue; // go to the next column.
}

That way, we have similar structures for both on_error handling and
don't need to reset cstate->cur_attname at the end of SET_NULL
handling.

---
From the regression tests:

--fail, column a is domain with not-null constraint
COPY t_on_error_null FROM STDIN WITH (on_error set_null);
a       11      14
\.
ERROR:  domain d_int_not_null does not allow null values
CONTEXT:  COPY t_on_error_null, line 1, column a: "a"

I guess that the log messages could confuse users since while the
actual error was caused by setting NULL to the non-NULL domain type
column, the context message says the data 'a' was erroneous.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From
jian he
Date:
On Tue, Apr 8, 2025 at 6:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> BTW have you measured the overheads of calling InputFunctionCallSafe
> twice? If it's significant, we might want to find other ways to
> achieve it as it would not be good to incur overhead just for
> relatively rare cases.
>

Please check the attached two patches
v17-0001-COPY-on_error-set_null.original,
v17-0001-COPY-on_error-set_null.patch

for non-domain types, (on_error set_null), the performance of these
two are the same.
for domain type with or without constraint,
(on_error set_null): v17.original is slower than v17.patch.


test script:

create unlogged table t2(a text);
insert into t2 select 'a' from generate_Series(1, 10_000_000) g;
copy t2 to '/tmp/2.txt';
CREATE DOMAIN d1 AS INT ;
CREATE DOMAIN d2 AS INT check (value > 0);
create unlogged table t3(a int);
create unlogged table t4(a d1);
create unlogged table t5(a d2);


performance result:
v17-0001-COPY-on_error-set_null.patch
-- 764.903 ms
copy t3 from '/tmp/2.txt' (on_error set_null) \watch c=10 i=0.1
-- 779.253 ms
copy t4 from '/tmp/2.txt' (on_error set_null) \watch c=10 i=0.1
-- Time: 750.390 ms
copy t5 from '/tmp/2.txt' (on_error set_null) \watch c=10 i=0.1

v17-0001-COPY-on_error-set_null.original
-- 774.943 ms
copy t3 from '/tmp/2.txt' (on_error set_null) \watch c=10 i=0.1
-- 867.671 ms
copy t4 from '/tmp/2.txt' (on_error set_null) \watch c=10 i=0.1
-- 927.685 ms
copy t5 from '/tmp/2.txt' (on_error set_null) \watch c=10 i=0.1


> Here are some comments:
>
> +               if (InputFunctionCallSafe(&in_functions[m],
> +                                         NULL,
> +                                         typioparams[m],
> +                                         att->atttypmod,
> +                                         NULL,
> +                                         &values[m]))
>
> Given that we pass NULL to escontext, does this function return false
> in an error case? Or can we use InputFunctionCall instead?
>
> I think we should mention that SET_NULL still could fail if the data
> type of the column doesn't accept NULL.
>
> How about restructuring the codes around handling data incompatibility
> errors like:
>
> else if (!InputFunctionCallSafe(...))
> {
>     if (cstate->opts.on_error == IGNORE)
>     {
>         cstate->num_errors++;
>         if (cstate->opts.log_verbosity == VERBOSE)
>             write a NOTICE message;
>         return true; // ignore whole row.
>     }
>     else if (cstate->opts.on_error == SET_NULL)
>     {
>         current_row_erroneous = true;
>         set NULL to the column;
>         if (cstate->opts.log_verbosity == VERBOSE)
>             write a NOTICE message;
>         continue; // go to the next column.
> }
>
> That way, we have similar structures for both on_error handling and
> don't need to reset cstate->cur_attname at the end of SET_NULL
> handling.
>

I think we still need to reset cstate->cur_attname.
the current code structure is
``
foreach(cur, cstate->attnumlist)
{
       if (condition x)
            continue;
        cstate->cur_attname = NULL;
        cstate->cur_attval = NULL;
}
``
In some cases (last column , condition x is satisfied), once we reach
the ``continue``, then we cannot reach.
``
        cstate->cur_attname = NULL;
        cstate->cur_attval = NULL;
``



> ---
> From the regression tests:
>
> --fail, column a is domain with not-null constraint
> COPY t_on_error_null FROM STDIN WITH (on_error set_null);
> a       11      14
> \.
> ERROR:  domain d_int_not_null does not allow null values
> CONTEXT:  COPY t_on_error_null, line 1, column a: "a"
>
> I guess that the log messages could confuse users since while the
> actual error was caused by setting NULL to the non-NULL domain type
> column, the context message says the data 'a' was erroneous.
>

if the second function is InputFunctionCall, then we cannot customize
the error message.
we can't have both.
I guess we need a second InputFunctionCallSafe with escontext NOT NULL.

now i change it to
                if (!cstate->domain_with_constraint[m] ||
                    InputFunctionCallSafe(&in_functions[m],
                                          NULL,
                                          typioparams[m],
                                          att->atttypmod,
                                          (Node *) cstate->escontext,
                                          &values[m]))
                else if (string == NULL)
                    ereport(ERROR,
                            errcode(ERRCODE_NOT_NULL_VIOLATION),
                            errmsg("domain %s does not allow null
values", format_type_be(typioparams[m])),
                            errdatatype(typioparams[m]));
                else
                    ereport(ERROR,
                            errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                            errmsg("invalid input value for domain %s: \"%s\"",
                                   format_type_be(typioparams[m]), string));


do these ``ELSE IF``, ``ELSE`` error report messages make sense to you?

Attachment