
From: torikoshia
Subject: Re: Add new COPY option REJECT_LIMIT
Msg-id: f9feda0ecd4e2e09620a6f9ca8f4aac1@oss.nttdata.com
In response to: Re: Add new COPY option REJECT_LIMIT (Fujii Masao <masao.fujii@oss.nttdata.com>)
List: pgsql-hackers
On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com> wrote:
Thanks for the comment.

> In patch 0002, the ratio is calculated by the already skipped/processed
> rows, but what if a user wants to copy 1000 rows, and he/she can tolerate
> 10 error rows, so he/she might set *reject_limit 0.01*, but one bad row
> in the first 100 rows will fail the entire command, this might surprise
> the user.

Since the ratio is calculated after all data is processed, the case "one 
bad row in the first 100 rows will fail the entire command" doesn't 
happen:


   =# \! wc -l 1000rows-with-10err.data
       1000 1000rows-with-10err.data

   =# COPY t1 from '1000rows-with-10err.data' with (log_verbosity verbose, reject_limit 0.01);
   NOTICE:  skipping row due to data type incompatibility at line 10 for column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 11 for column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 12 for column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 13 for column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 14 for column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 15 for column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 16 for column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 17 for column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 18 for column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 19 for column i: "a"
   NOTICE:  10 rows were skipped due to data type incompatibility
   COPY 990


On 2024-07-20 02:08, Fujii Masao wrote:
> On 2024/07/19 22:03, Fujii Masao wrote:
>> 
>> 
>> On 2024/07/17 22:21, torikoshia wrote:
>>> On 2024-07-03 02:07, Fujii Masao wrote:
>>>> However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR 
>>>> option is still necessary.
>>> 
>>> I remembered another reason for the necessity of ON_ERROR.
>>> 
>>> ON_ERROR defines how to behave when encountering an error and it just 
>>> accepts 'ignore' and 'stop' currently, but is expected to support 
>>> other options such as saving details of errors to a table[1].
>> 
>> Wouldn't it be better to separate the option specifying where
>> error details are output from the ON_ERROR option
>> (which determines behavior when encountering errors)?
>> "table" seems valid for both ON_ERROR=ignore and ON_ERROR=stop.
> 
> I still find it odd to accept "table" as a value for ON_ERROR. However,
> "set_to_null" or "replace-column" proposed in [1] seem valid for
> ON_ERROR. So, I'm okay with keeping the ON_ERROR option.

Agreed.

>> On second thought, whatever value ON_ERROR is specified (e.g.
>> ignore, stop, table), it seems fine to use REJECT_LIMIT.
>> I feel REJECT_LIMIT has both "ignore" and "stop" characteristics,
>> meaning it ignores errors until it reaches REJECT_LIMIT and stops when
>> it exceeds the REJECT_LIMIT.
> 
> ON_ERROR specifies how to handle errors, and "stop" means to fail
> the command. So, if ON_ERROR=stop, REJECT_LIMIT should have no effect,
> and the command should fail immediately upon encountering an error.
> 
> As in your original proposal, I now think REJECT_LIMIT should only
> apply when ON_ERROR=ignore. The command would ignore errors and
> continue processing, but if the number of errors exceeds REJECT_LIMIT,
> the command should fail. Thoughts?

Makes sense.
Updated the patch.
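
To illustrate the intended interaction under the updated patch, here is a rough sketch using the same data file as the example above (the exact error wording below is illustrative, not copied from the patch): REJECT_LIMIT only takes effect together with ON_ERROR=ignore, and the command fails once the number of skipped rows exceeds the limit.

   -- illustrative transcript; error wording is approximate
   =# COPY t1 from '1000rows-with-10err.data' with (on_error ignore, reject_limit 5);
   ERROR:  skipped more than REJECT_LIMIT (5) rows due to data type incompatibility

   =# COPY t1 from '1000rows-with-10err.data' with (on_error ignore, reject_limit 10);
   NOTICE:  10 rows were skipped due to data type incompatibility
   COPY 990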

> BTW if "set_to_null" is supported someday, REJECT_LIMIT can also
> apply. The command would insert NULL into the target table upon
> encountering errors and continue, but fail if the number of errors
> exceeds REJECT_LIMIT.

Agreed.
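
Purely as a hypothetical sketch of that combination ("set_to_null" is only a proposal in [1], so both the option value and the behavior described are assumptions, not anything in the current patch):

   -- hypothetical: "set_to_null" does not exist yet; shown only to sketch the semantics
   -- rows with conversion errors would be inserted as NULL and counted against reject_limit,
   -- failing the command once that count exceeds the limit
   =# COPY t1 from '1000rows-with-10err.data' with (on_error set_to_null, reject_limit 5);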


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
