Thread: Add on_error and log_verbosity options to file_fdw

Add on_error and log_verbosity options to file_fdw

From
torikoshia
Date:
Hi,

With the current file_fdw, if even one line of data conversion fails, 
the contents of the file cannot be referenced at all:

   =# \! cat data/test.data
   1,a
   2,b
   a,c
   =# create foreign table f_fdw_test_1 (i int, t text) server f_fdw 
options (filename 'test.data', format 'csv');
   CREATE FOREIGN TABLE

   =# table f_fdw_test_1;
   ERROR:  invalid input syntax for type integer: "a"
   CONTEXT:  COPY f_fdw_test, line 3, column i: "a"

Since we'll support ON_ERROR option which tolerates data conversion 
errors in COPY FROM and LOG_VERBOSITY option at v17[1], how about 
supporting them on file_fdw?

This idea comes from Fujii-san[2], and I think it'd be useful when 
reading a bit dirty data.

Attached PoC patch works like below:

   =# create foreign table f_fdw_test_2 (i int, t text) server f_fdw 
options (filename 'test.data', format 'csv', on_error 'ignore');
   CREATE FOREIGN TABLE

   =# table f_fdw_test_2;
   NOTICE:  1 row was skipped due to data type incompatibility
    i | t
   ---+---
    1 | a
    2 | b
   (2 rows)


   =# create foreign table f_fdw_test_3 (i int, t text) server f_fdw 
options (filename 'test.data', format 'csv', on_error 'ignore', 
log_verbosity 'verbose');
CREATE FOREIGN TABLE

   =# table f_fdw_test_3 ;
   NOTICE:  skipping row due to data type incompatibility at line 3 for 
column i: "a"
   NOTICE:  1 row was skipped due to data type incompatibility
    i | t
   ---+---
    1 | a
    2 | b
   (2 rows)


I'm going to continue developing the patch(e.g. add doc, measure 
performance degradation) when people also think this feature is worth 
adding.


What do you think?


[1] https://www.postgresql.org/docs/devel/sql-copy.html
[2] https://x.com/fujii_masao/status/1808178032219509041

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment

Re: Add on_error and log_verbosity options to file_fdw

From
torikoshia
Date:
On 2024-07-05 00:27, torikoshia wrote:
> Hi,
> 
> With the current file_fdw, if even one line of data conversion fails,
> the contents of the file cannot be referenced at all:
> 
>   =# \! cat data/test.data
>   1,a
>   2,b
>   a,c
>   =# create foreign table f_fdw_test_1 (i int, t text) server f_fdw
> options (filename 'test.data', format 'csv');
>   CREATE FOREIGN TABLE
> 
>   =# table f_fdw_test_1;
>   ERROR:  invalid input syntax for type integer: "a"
>   CONTEXT:  COPY f_fdw_test, line 3, column i: "a"
> 
> Since we'll support ON_ERROR option which tolerates data conversion
> errors in COPY FROM and LOG_VERBOSITY option at v17[1], how about
> supporting them on file_fdw?
> 
> This idea comes from Fujii-san[2], and I think it'd be useful when
> reading a bit dirty data.
> 
> Attached PoC patch works like below:
> 
>   =# create foreign table f_fdw_test_2 (i int, t text) server f_fdw
> options (filename 'test.data', format 'csv', on_error 'ignore');
>   CREATE FOREIGN TABLE
> 
>   =# table f_fdw_test_2;
>   NOTICE:  1 row was skipped due to data type incompatibility
>    i | t
>   ---+---
>    1 | a
>    2 | b
>   (2 rows)
> 
> 
>   =# create foreign table f_fdw_test_3 (i int, t text) server f_fdw
> options (filename 'test.data', format 'csv', on_error 'ignore',
> log_verbosity 'verbose');
> CREATE FOREIGN TABLE
> 
>   =# table f_fdw_test_3 ;
>   NOTICE:  skipping row due to data type incompatibility at line 3 for
> column i: "a"
>   NOTICE:  1 row was skipped due to data type incompatibility
>    i | t
>   ---+---
>    1 | a
>    2 | b
>   (2 rows)
> 
> 
> I'm going to continue developing the patch(e.g. add doc, measure
> performance degradation) when people also think this feature is worth
> adding.
> 
> 
> What do you think?
> 
> 
> [1] https://www.postgresql.org/docs/devel/sql-copy.html
> [2] https://x.com/fujii_masao/status/1808178032219509041

Update the patch since v1 patch caused compiler warning.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment

Re: Add on_error and log_verbosity options to file_fdw

From
Masahiko Sawada
Date:
Hi,

On Thu, Jul 18, 2024 at 6:38 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2024-07-05 00:27, torikoshia wrote:
> > Hi,
> >
> > With the current file_fdw, if even one line of data conversion fails,
> > the contents of the file cannot be referenced at all:
> >
> >   =# \! cat data/test.data
> >   1,a
> >   2,b
> >   a,c
> >   =# create foreign table f_fdw_test_1 (i int, t text) server f_fdw
> > options (filename 'test.data', format 'csv');
> >   CREATE FOREIGN TABLE
> >
> >   =# table f_fdw_test_1;
> >   ERROR:  invalid input syntax for type integer: "a"
> >   CONTEXT:  COPY f_fdw_test, line 3, column i: "a"
> >
> > Since we'll support ON_ERROR option which tolerates data conversion
> > errors in COPY FROM and LOG_VERBOSITY option at v17[1], how about
> > supporting them on file_fdw?

+1

> >
> > This idea comes from Fujii-san[2], and I think it'd be useful when
> > reading a bit dirty data.
> >
> > Attached PoC patch works like below:
> >
> >   =# create foreign table f_fdw_test_2 (i int, t text) server f_fdw
> > options (filename 'test.data', format 'csv', on_error 'ignore');
> >   CREATE FOREIGN TABLE
> >
> >   =# table f_fdw_test_2;
> >   NOTICE:  1 row was skipped due to data type incompatibility
> >    i | t
> >   ---+---
> >    1 | a
> >    2 | b
> >   (2 rows)

I'm slightly concerned that users might not want to see the NOTICE
message for every scan. Unlike COPY FROM, scanning a file via file_fdw
could be frequent. An alternative idea of place to write the
information of the number of malformed rows would be the EXPLAIN
command as follow:

                           QUERY PLAN
----------------------------------------------------------------
 Foreign Scan on public.test  (cost=0.00..1.10 rows=1 width=12)
   Output: a, b, c
   Foreign File: test.csv
   Foreign File Size: 12 b
   Skipped Rows: 10

> >
> >
> >   =# create foreign table f_fdw_test_3 (i int, t text) server f_fdw
> > options (filename 'test.data', format 'csv', on_error 'ignore',
> > log_verbosity 'verbose');
> > CREATE FOREIGN TABLE
> >
> >   =# table f_fdw_test_3 ;
> >   NOTICE:  skipping row due to data type incompatibility at line 3 for
> > column i: "a"
> >   NOTICE:  1 row was skipped due to data type incompatibility
> >    i | t
> >   ---+---
> >    1 | a
> >    2 | b
> >   (2 rows)

IIUC we have to execute ALTER FOREIGN TABLE to change the
log_verbosity value and which requires to be the owner. Which seems
not to be user-friendly.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Add on_error and log_verbosity options to file_fdw

From
Michael Paquier
Date:
On Mon, Jul 22, 2024 at 03:07:46PM -0700, Masahiko Sawada wrote:
> I'm slightly concerned that users might not want to see the NOTICE
> message for every scan. Unlike COPY FROM, scanning a file via file_fdw
> could be frequent. An alternative idea of place to write the
> information of the number of malformed rows would be the EXPLAIN
> command as follow:

Yeah, I also have some concerns regarding the noise that this could
produce if called on a foreign table on a regular basis.  The verbose
mode is disabled by default so I don't see why we should not allow it
if the relation owner wants to show it.

Perhaps we should first do a silence mode for log_verbosity to skip
the NOTICE produced at the end of the COPY FROM summarizing the whole?
It would be confusing to have different defaults between COPY and
file_fdw, but having the option to silence that completely is also
appealing from the user point of view.

>                            QUERY PLAN
> ----------------------------------------------------------------
>  Foreign Scan on public.test  (cost=0.00..1.10 rows=1 width=12)
>    Output: a, b, c
>    Foreign File: test.csv
>    Foreign File Size: 12 b
>    Skipped Rows: 10

Interesting idea linked to the idea of pushing the error state to
something else than the logs.  Sounds like a separate feature.

> IIUC we have to execute ALTER FOREIGN TABLE to change the
> log_verbosity value and which requires to be the owner. Which seems
> not to be user-friendly.

I am not sure about allowing scans to force an option to be a
different thing at runtime vs what's been defined in the relation
itself with CREATE/ALTER.
--
Michael

Attachment

Re: Add on_error and log_verbosity options to file_fdw

From
torikoshia
Date:
On 2024-07-23 08:57, Michael Paquier wrote:
> On Mon, Jul 22, 2024 at 03:07:46PM -0700, Masahiko Sawada wrote:
>> I'm slightly concerned that users might not want to see the NOTICE
>> message for every scan. Unlike COPY FROM, scanning a file via file_fdw
>> could be frequent.

Agreed.

> Yeah, I also have some concerns regarding the noise that this could
> produce if called on a foreign table on a regular basis.  The verbose
> mode is disabled by default so I don't see why we should not allow it
> if the relation owner wants to show it.
> 
> Perhaps we should first do a silence mode for log_verbosity to skip
> the NOTICE produced at the end of the COPY FROM summarizing the whole?

I like this idea.
If there are no objections, I'm going to make a patch for this.

> It would be confusing to have different defaults between COPY and
> file_fdw, but having the option to silence that completely is also
> appealing from the user point of view.

I'm not sure we should change the defaults.
If the default of file_fdw is silence mode, I am a little concerned that 
there may be cases where people think they have no errors, but in fact 
they have.

>>                            QUERY PLAN
>> ----------------------------------------------------------------
>>  Foreign Scan on public.test  (cost=0.00..1.10 rows=1 width=12)
>>    Output: a, b, c
>>    Foreign File: test.csv
>>    Foreign File Size: 12 b
>>    Skipped Rows: 10
> 
> Interesting idea linked to the idea of pushing the error state to
> something else than the logs.  Sounds like a separate feature.

+1


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Add on_error and log_verbosity options to file_fdw

From
torikoshia
Date:
On 2024-07-24 19:43, torikoshia wrote:
> On 2024-07-23 08:57, Michael Paquier wrote:
>> On Mon, Jul 22, 2024 at 03:07:46PM -0700, Masahiko Sawada wrote:
>>> I'm slightly concerned that users might not want to see the NOTICE
>>> message for every scan. Unlike COPY FROM, scanning a file via 
>>> file_fdw
>>> could be frequent.
> 
> Agreed.
> 
>> Yeah, I also have some concerns regarding the noise that this could
>> produce if called on a foreign table on a regular basis.  The verbose
>> mode is disabled by default so I don't see why we should not allow it
>> if the relation owner wants to show it.
>> 
>> Perhaps we should first do a silence mode for log_verbosity to skip
>> the NOTICE produced at the end of the COPY FROM summarizing the whole?
> 
> I like this idea.
> If there are no objections, I'm going to make a patch for this.

Attached patches.
0001 adds new option 'silent' to log_verbosity and 0002 adds on_error 
and log_verbosity options to file_fdw.


> I'm going to continue developing the patch(e.g. add doc, measure
performance degradation) when people also think this feature is worth 
adding.

Here is a quick performance test result.

I loaded 1,000,000 rows using pgbench_accounts to a file and counted the 
number of rows using file_fdw on different conditions and compared the 
execution times on my laptop.

The changed conditions are below:
- source code: HEAD/patch applied
- data: no error data/all row occur data conversion error at the 1st 
column
- file_fdw options: on_error=stop/on_error=ignore

There seems no significant difference in performance between HEAD and 
the patch applied with on_error option specified as either ignore/stop 
when data has no error.
OTOH when all rows occur data conversion error, it is significantly 
faster than other cases:

# HAED(e950fe58bd0)
## data:no error

=# create foreign table t1 (a int, b int, c int, t text) server f_fdw 
options (filename 'pgb_ac', format 'csv');
=# select count(*) from t1;

1567.569 ms
1675.112 ms
1555.782 ms
1547.676 ms
1660.221 ms

# patch applied
## data:no error, on_error:stop

=# create foreign table t1 (a int, b int, c int, t text) server f_fdw 
options (filename 'pgb_ac', format 'csv', on_error 'stop');
=# select count(*) from t1;

1580.656 ms
1623.784 ms
1596.947 ms
1652.307 ms
1613.607 ms

## data:no error, on_error:ignore

=# create foreign table t1 (a int, b int, c int, t text) server f_fdw 
options (filename 'pgb_ac', format 'csv', on_error 'ignore');
=# select count(*) from t1;

1575.718 ms
1597.464 ms
1596.540 ms
1665.818 ms
1595.453 ms

#### data:all rows contain error, on_error:ignore

=# create foreign table t1 (a int, b int, c int, t text) server f_fdw 
options (filename 'pgb_ac', format 'csv', on_error 'ignore');
=# select count(*) from t1;

914.537 ms
907.506 ms
912.768 ms
913.769 ms
914.327 ms


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment