Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row - Mailing list pgsql-hackers

From Yugo Nagata
Subject Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
Date
Msg-id 20241112141733.49e8a4ce1244c660824cff31@sraoss.co.jp
Whole thread Raw
In response to Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row  (Yugo Nagata <nagata@sraoss.co.jp>)
Responses Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
List pgsql-hackers
On Tue, 12 Nov 2024 14:03:50 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

> On Tue, 12 Nov 2024 01:27:53 +0500
> Kirill Reshke <reshkekirill@gmail.com> wrote:
> 
> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
> > >
> > > On 2024-11-09 21:55, Kirill Reshke wrote:
> > >
> > > Thanks for working on this!
> > 
> > Thanks for reviewing the v7 patch series!
> > 
> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
> > > > wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
> > > >> > I do find this behavior to be consistent.
> > > >>
> > > >> +1
> > > >>
> > > >>
> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> > > >> > replacements, as if REJECT LIMIT is inf.
> > > >>
> > > >> +1
> > > >
> > > > After thinking for a while, I'm now more opposed to this approach. I
> > > > think we should count rows with erroneous data as errors only if
> > > > null substitution for these rows failed, not the total number of rows
> > > > which were modified.
> > > > Then, to respect the REJECT LIMIT option, we compare this number with
> > > > the limit. This is actually simpler approach IMHO. What do You think?
> > >
> > > IMHO I prefer the previous interpretation.
> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> > > is used to specify how many malformed rows are acceptable in the
> > > "original" data source.
> 
> I also prefer the previous version.
>  
> > I do like the first version of interpretation, but I have a struggle
> > with it. According to this interpretation, we will fail COPY command
> > if the number
> > of malformed data rows exceeds the limit, not the number of rejected
> > rows (some percentage of malformed rows are accepted with null
> > substitution)
> > So, a proper name for the limit will be MALFORMED_LIMIT, or something.
> > However, we are unable to change the name since the REJECT_LIMIT
> > option has already been committed.
> > I guess I'll just have to put up with this contradiction. I will send
> > an updated patch shortly...
> 
> I think we can rename the REJECT_LIMIT option because it is not yet released.
> 
> The documentation says that REJECT_LIMIT "Specifies the maximum number of errors",
> and there are no wording "reject" in the description, so I wonder it is unclear
> what means in "REJECT" in REJECT_LIMIT. It may be proper to use ERROR_LIMIT
> since it is supposed to be used with ON_ERROR. 
> 
> Alternatively, if we emphasize that errors are handled other than terminating
> the command,perhaps MALFORMED_LIMIT as proposed above or TOLERANCE_LIMIT may be
> good, for example.

I might misunderstand the meaning of the name. If REJECT_LIMIT means "a limit on
the number of rows with any malformed value allowed before the COPY command is
rejected", we would not have to rename it.

-- 
Yugo Nagata <nagata@sraoss.co.jp>



pgsql-hackers by date:

Previous
From: Yugo Nagata
Date:
Subject: Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
Next
From: Amit Kapila
Date:
Subject: Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY