Thread: Re: Add reject_limit option to file_fdw
On 2024/10/17 22:45, torikoshia wrote: > Hi, > > 4ac2a9bec introduced reject_limit option to the COPY command, and I was wondering if it might be beneficial to add thesame option to file_fdw. > > Although there may be fewer practical use cases compared to COPY, it could still be useful in situations where the filebeing read via file_fdw is subject to modifications and there is a need to tolerate a limited number of errors. Agreed. > > What do you think? > > I've attached a patch. Thanks for the patch! Could you add it to the next CommitFest? +ALTER FOREIGN TABLE agg_bad OPTIONS (reject_limit '1'); +SELECT * FROM agg_bad; + a | b +-----+-------- + 100 | 99.097 + 42 | 324.78 +(2 rows) Wouldn't it be better to include a test where a SELECT query fails, even with on_error set to "ignore," because the number of errors exceeds reject_limit? + if (cstate->opts.reject_limit > 0 && \ The trailing backslash isn't needed here. + <varlistentry> + <term><literal>reject_limit</literal></term> This entry should be placed right after the on_error option, following the same order as in the COPY command documentation. > Based on the synopsis of the CREATE/ALTER FOREIGN TABLE commands, the value for the foreign table's option must be single-quoted.I’m not entirely sure if this is the correct approach, but in order to accommodate this, the patch modifiesthe value of reject_limit option to accept not only numeric values but also strings. > I don't have a better approach for this, so I'm okay with your solution. Just one note: it would be helpful to explain and comment why defGetCopyRejectLimitOption() accepts and parses both int64 and string values. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2024/11/05 22:30, torikoshia wrote: >> Thanks for the patch! Could you add it to the next CommitFest? > > Added an entry for this patch: > https://commitfest.postgresql.org/50/5331/ Thanks! >> +ALTER FOREIGN TABLE agg_bad OPTIONS (reject_limit '1'); >> +SELECT * FROM agg_bad; >> + a | b >> +-----+-------- >> + 100 | 99.097 >> + 42 | 324.78 >> +(2 rows) >> >> Wouldn't it be better to include a test where a SELECT query fails, even with >> on_error set to "ignore," because the number of errors exceeds reject_limit? > > Agreed. As for the agg.bad2 file that your latest patch added: Instead of adding a new bad input file, what do you think about simply appending "1;@bbb@" to the existing agg.bad file and adjusting sql/file_fdw.sql as follows? This approach might keep things simpler. ------------------- -\set filename :abs_srcdir '/data/agg.bad2' -ALTER FOREIGN TABLE agg_bad OPTIONS (SET filename :'filename', ADD reject_limit '1'); +ALTER FOREIGN TABLE agg_bad OPTIONS (ADD reject_limit '1'); SELECT * FROM agg_bad; ALTER FOREIGN TABLE agg_bad OPTIONS (SET reject_limit '2'); ------------------- >> + <varlistentry> >> + <term><literal>reject_limit</literal></term> >> >> This entry should be placed right after the on_error option, >> following the same order as in the COPY command documentation. >> >> >>> Based on the synopsis of the CREATE/ALTER FOREIGN TABLE commands, the value for the foreign table's option must be single-quoted.I’m not entirely sure if this is the correct approach, but in order to accommodate this, the patch modifiesthe value of reject_limit option to accept not only numeric values but also strings. >>> >> >> I don't have a better approach for this, so I'm okay with your solution. >> Just one note: it would be helpful to explain and comment >> why defGetCopyRejectLimitOption() accepts and parses both int64 and >> string values. > > Added a comment for this. Thanks for adding the comment. It clearly states that REJECT_LIMIT can be a single-quoted string. However, it might also be helpful to mention that it can be provided as an int64 in the COPY command option. How about updating it like this? ------------------------------------ REJECT_LIMIT can be specified in two ways: as an int64 for the COPY command option or as a single-quoted string for the foreign table option using file_fdw. Therefore this function needs to handle both formats. ------------------------------------ Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2024-11-08 01:44, Fujii Masao wrote: Thanks for your review! > On 2024/11/05 22:30, torikoshia wrote: >>> Thanks for the patch! Could you add it to the next CommitFest? >> >> Added an entry for this patch: >> https://commitfest.postgresql.org/50/5331/ > > Thanks! > > >>> +ALTER FOREIGN TABLE agg_bad OPTIONS (reject_limit '1'); >>> +SELECT * FROM agg_bad; >>> + a | b >>> +-----+-------- >>> + 100 | 99.097 >>> + 42 | 324.78 >>> +(2 rows) >>> >>> Wouldn't it be better to include a test where a SELECT query fails, >>> even with >>> on_error set to "ignore," because the number of errors exceeds >>> reject_limit? >> >> Agreed. > > As for the agg.bad2 file that your latest patch added: > > Instead of adding a new bad input file, what do you think about simply > appending > "1;@bbb@" to the existing agg.bad file and adjusting sql/file_fdw.sql > as follows? > This approach might keep things simpler. That seems better. Agreed. > ------------------- > -\set filename :abs_srcdir '/data/agg.bad2' > -ALTER FOREIGN TABLE agg_bad OPTIONS (SET filename :'filename', ADD > reject_limit '1'); > +ALTER FOREIGN TABLE agg_bad OPTIONS (ADD reject_limit '1'); > SELECT * FROM agg_bad; > ALTER FOREIGN TABLE agg_bad OPTIONS (SET reject_limit '2'); > ------------------- > >>> + <varlistentry> >>> + <term><literal>reject_limit</literal></term> >>> >>> This entry should be placed right after the on_error option, >>> following the same order as in the COPY command documentation. >>> >>> >>>> Based on the synopsis of the CREATE/ALTER FOREIGN TABLE commands, >>>> the value for the foreign table's option must be single-quoted. I’m >>>> not entirely sure if this is the correct approach, but in order to >>>> accommodate this, the patch modifies the value of reject_limit >>>> option to accept not only numeric values but also strings. >>>> >>> >>> I don't have a better approach for this, so I'm okay with your >>> solution. >>> Just one note: it would be helpful to explain and comment >>> why defGetCopyRejectLimitOption() accepts and parses both int64 and >>> string values. >> >> Added a comment for this. > > Thanks for adding the comment. It clearly states that REJECT_LIMIT can > be > a single-quoted string. However, it might also be helpful to mention > that > it can be provided as an int64 in the COPY command option. How about > updating it like this? > > ------------------------------------ > REJECT_LIMIT can be specified in two ways: as an int64 for the COPY > command > option or as a single-quoted string for the foreign table option using > file_fdw. Therefore this function needs to handle both formats. > ------------------------------------ Thanks! it seems better. Attached v3 patch. -- Regards, -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
On 2024/11/11 21:45, torikoshia wrote: >> Thanks for adding the comment. It clearly states that REJECT_LIMIT can be >> a single-quoted string. However, it might also be helpful to mention that >> it can be provided as an int64 in the COPY command option. How about >> updating it like this? >> >> ------------------------------------ >> REJECT_LIMIT can be specified in two ways: as an int64 for the COPY command >> option or as a single-quoted string for the foreign table option using >> file_fdw. Therefore this function needs to handle both formats. >> ------------------------------------ > > Thanks! it seems better. > > > Attached v3 patch. Thanks for updating the patch! It looks like you forgot to attach it, though. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Tue, 12 Nov 2024 at 06:17, torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2024-11-12 01:49, Fujii Masao wrote: > > On 2024/11/11 21:45, torikoshia wrote: > >>> Thanks for adding the comment. It clearly states that REJECT_LIMIT > >>> can be > >>> a single-quoted string. However, it might also be helpful to mention > >>> that > >>> it can be provided as an int64 in the COPY command option. How about > >>> updating it like this? > >>> > >>> ------------------------------------ > >>> REJECT_LIMIT can be specified in two ways: as an int64 for the COPY > >>> command > >>> option or as a single-quoted string for the foreign table option > >>> using > >>> file_fdw. Therefore this function needs to handle both formats. > >>> ------------------------------------ > >> > >> Thanks! it seems better. > >> > >> > >> Attached v3 patch. > > > > Thanks for updating the patch! It looks like you forgot to attach it, > > though. > > Oops, thanks for pointing it out. > Here it is. > > > -- > Regards, > > -- > Atsushi Torikoshi > Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K. Hi! A little question from me. This is your doc for reject_limit: + <varlistentry> + <term><literal>reject_limit</literal></term> + + <listitem> + <para> + Specifies the maximum number of errors tolerated while converting a column's + input value to its data type, the same as <command>COPY</command>'s + <literal>REJECT_LIMIT</literal> option. + </para> + </listitem> + </varlistentry> + This is how it looks on the current HEAD for copy. <varlistentry> <term><literal>REJECT_LIMIT</literal></term> <listitem> <para> Specifies the maximum number of errors tolerated while converting a column's input value to its data type, when <literal>ON_ERROR</literal> is set to <literal>ignore</literal>. If the input causes more errors than the specified value, the <command>COPY</command> command fails, even with <literal>ON_ERROR</literal> set to <literal>ignore</literal>. This clause must be used with <literal>ON_ERROR</literal>=<literal>ignore</literal> and <replaceable class="parameter">maxerror</replaceable> must be positive <type>bigint</type>. If not specified, <literal>ON_ERROR</literal>=<literal>ignore</literal> allows an unlimited number of errors, meaning <command>COPY</command> will skip all erroneous data. </para> </listitem> </varlistentry> There is a difference. Should we add REJECT_LIMIT vs ON_ERROR clarification for file_fdw too? or maybe we put a reference for COPY doc. -- Best regards, Kirill Reshke
On 2024-11-12 15:23, Kirill Reshke wrote: Thanks for your review! > On Tue, 12 Nov 2024 at 06:17, torikoshia <torikoshia@oss.nttdata.com> > wrote: >> >> On 2024-11-12 01:49, Fujii Masao wrote: >> > On 2024/11/11 21:45, torikoshia wrote: >> >>> Thanks for adding the comment. It clearly states that REJECT_LIMIT >> >>> can be >> >>> a single-quoted string. However, it might also be helpful to mention >> >>> that >> >>> it can be provided as an int64 in the COPY command option. How about >> >>> updating it like this? >> >>> >> >>> ------------------------------------ >> >>> REJECT_LIMIT can be specified in two ways: as an int64 for the COPY >> >>> command >> >>> option or as a single-quoted string for the foreign table option >> >>> using >> >>> file_fdw. Therefore this function needs to handle both formats. >> >>> ------------------------------------ >> >> >> >> Thanks! it seems better. >> >> >> >> >> >> Attached v3 patch. >> > >> > Thanks for updating the patch! It looks like you forgot to attach it, >> > though. >> >> Oops, thanks for pointing it out. >> Here it is. >> >> >> -- >> Regards, >> >> -- >> Atsushi Torikoshi >> Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K. > > Hi! > > A little question from me. > > This is your doc for reject_limit: > > + <varlistentry> > + <term><literal>reject_limit</literal></term> > + > + <listitem> > + <para> > + Specifies the maximum number of errors tolerated while > converting a column's > + input value to its data type, the same as > <command>COPY</command>'s > + <literal>REJECT_LIMIT</literal> option. > + </para> > + </listitem> > + </varlistentry> > + > > This is how it looks on the current HEAD for copy. > > <varlistentry> > <term><literal>REJECT_LIMIT</literal></term> > <listitem> > <para> > Specifies the maximum number of errors tolerated while converting > a > column's input value to its data type, when > <literal>ON_ERROR</literal> is > set to <literal>ignore</literal>. > If the input causes more errors than the specified value, the > <command>COPY</command> > command fails, even with <literal>ON_ERROR</literal> set to > <literal>ignore</literal>. > This clause must be used with > <literal>ON_ERROR</literal>=<literal>ignore</literal> > and <replaceable class="parameter">maxerror</replaceable> must > be positive <type>bigint</type>. > If not specified, > <literal>ON_ERROR</literal>=<literal>ignore</literal> > allows an unlimited number of errors, meaning > <command>COPY</command> will > skip all erroneous data. > </para> > </listitem> > </varlistentry> > > There is a difference. Should we add REJECT_LIMIT vs ON_ERROR > clarification for file_fdw too? or maybe we put a reference for COPY > doc. As you may know, some options for file_fdw are the same as those for COPY. While the manual provides detailed descriptions of these options in the COPY section, the explanations in file_fdw are shorter and less detailed. I intended to follow this approach, but do you think it should be changed? -- Regards, -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
On 2024-11-21 00:43, Fujii Masao wrote: > On 2024/11/19 21:40, torikoshia wrote: >>> These messages may be unexpected for some users because the >>> documentation of >>> fild_fdw does not explicitly describe that file_fdw uses COPY >>> internally. >>> (I can find several wordings like "as COPY", though.) >>> However, since the current file_fdw already has such messages, I came >>> to think >>> making the messages on "reject_limit" consistent with COPY is >>> reasonable. >>> I mean, the previous version of the message seems fine. >> >> Agreed. Thanks for your investigation. > > I've pushed the 0001 patch. Thanks! Thanks a lot! >>> As another comment, do we need to add validator test for on_error and >>> reject_limit as similar to other options? >> >> That might be better. >> Added some queries which had wrong options to cause errors. >> >> I was unsure whether to add it to "validator test" section or >> "on_error, log_verbosity and reject_limit tests" section, but I chose >> "validator test" as I believe it makes things clearer. >> >> Added 0002 patch since some of the tests are not related to >> reject_limit but just on_error. > > How about adding validator tests for log_verbosity and reject_limit=0, > similar to the tests in the copy2.sql regression test? I've added those > two tests, and the latest version of the patch is attached. Thanks for the update! Agreed. I think the validator tests related to the on_error option you added are sufficient. As for the other file_fdw options, I'm a bit concerned whether it would be better to add validator tests for cases like when 'encoding' is 'unsupported', but that should be discussed in another thread. > Regards, -- Regards, -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
On 2024/11/21 15:43, torikoshia wrote: > Thanks for the update! Agreed. > I think the validator tests related to the on_error option you added are sufficient. I've pushed the patch. Thanks! > As for the other file_fdw options, I'm a bit concerned whether it would be better to add validator tests for cases likewhen 'encoding' is 'unsupported', but that should be discussed in another thread. I'm not sure how valuable it is to add that test, but yes, it would be better to start a new thread for discussion if needed. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION