Re: Add new error_action COPY ON_ERROR "log" - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Add new error_action COPY ON_ERROR "log"
Date
Msg-id Ze6aqaGJlxr9gH-2@paquier.xyz
Whole thread Raw
In response to Re: Add new error_action COPY ON_ERROR "log"  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Add new error_action COPY ON_ERROR "log"
List pgsql-hackers
On Sat, Mar 09, 2024 at 12:01:49AM +0530, Bharath Rupireddy wrote:
> If NOTICE per attribute and incrementing num_errors per row is
> implemented, it ends up erroring out with ERROR:  missing data for
> column "m"  for all-column-empty-row. Shall we treat this ERROR softly
> too if on_error ignore is specified? Or shall we discuss this idea
> separately?
>
> ERROR:  missing data for column "m"
> CONTEXT:  COPY check_ign_err, line 5: ""

Hmm.  I have spent some time looking at the bevahior of ON_ERROR, and
there are two tests in copy2.sql, one for the case where there is more
data than attributes and a second where there is not enough data in a
row that checks for errors.

For example, take this table:
=# create table tab (a int, b int, c int);
CREATE TABLE

This case works, even if the row has clearly not enough attributes.
The first attribute can be parsed, not the second one and this causes
the remaining data of the row to be skipped:
=# copy tab from stdin (delimiter ',', on_error ignore);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,
>> \.
NOTICE:  00000: 1 row was skipped due to data type incompatibility
LOCATION:  CopyFrom, copyfrom.c:1314
COPY 0

This case fails.  The first and the second attributes can be parsed,
and the line fails because we are missing the last attribute as of a
lack of delimiter:
=# copy tab from stdin (delimiter ',', on_error ignore);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,1
>> \.
ERROR:  22P04: missing data for column "c"
CONTEXT:  COPY tab, line 1: "1,1"

This brings a weird edge case for the all-column-empty-row case you
are mentioning once if we try to get information about all the rows we
should expect, but this has as side effect to break a case that's
intended ro work with ON_ERROR, as far as I understand, which is to
skip entirely a raw on the first conversion error we find, even if
there is no data afterwards.  I was a bit confused by that first, but
I can also see why it is useful as-is on HEAD.

At the end of the day, this comes down to what is more helpful to the
user.  And I'd agree on the side what ON_ERROR does currently, which
is what your patch relies on: on the first conversion failure, give up
and skip the rest of the row because we cannot trust its contents.
That's my way of saying that I am fine with the proposal of your
patch, and that we cannot provide the full state of a row without
making the error stack of COPY more invasive.

Perhaps we could discuss this idea of ensuring that all the attributes
are on a row in a different thread, as you say, but I am not really
convinced that there's a strong need for it at this stage as ON_ERROR
is new to v17.  So it does not sound like a bad thing to let it brew
more before implementing more options and make the COPY paths more
complicated than they already are.  I suspect that this may trigger
some discussion during the beta/stability phase depending on the
initial feedback.  Perhaps I'm wrong, though.

+      <literal>verbose</literal>, a <literal>NOTICE</literal> message
+      containing the line number and column name for each discarded row is
+      emitted.

This should clarify that the column name refers to the attribute where
the input conversion has failed, I guess.  Specifying only "column
name" without more context is a bit confusing.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Tender Wang
Date:
Subject: Re: "type with xxxx does not exist" when doing ExecMemoize()
Next
From: Amit Kapila
Date:
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation