Thread: Add on_error and log_verbosity options to file_fdw
Hi, With the current file_fdw, if even one line of data conversion fails, the contents of the file cannot be referenced at all: =# \! cat data/test.data 1,a 2,b a,c =# create foreign table f_fdw_test_1 (i int, t text) server f_fdw options (filename 'test.data', format 'csv'); CREATE FOREIGN TABLE =# table f_fdw_test_1; ERROR: invalid input syntax for type integer: "a" CONTEXT: COPY f_fdw_test, line 3, column i: "a" Since we'll support ON_ERROR option which tolerates data conversion errors in COPY FROM and LOG_VERBOSITY option at v17[1], how about supporting them on file_fdw? This idea comes from Fujii-san[2], and I think it'd be useful when reading a bit dirty data. Attached PoC patch works like below: =# create foreign table f_fdw_test_2 (i int, t text) server f_fdw options (filename 'test.data', format 'csv', on_error 'ignore'); CREATE FOREIGN TABLE =# table f_fdw_test_2; NOTICE: 1 row was skipped due to data type incompatibility i | t ---+--- 1 | a 2 | b (2 rows) =# create foreign table f_fdw_test_3 (i int, t text) server f_fdw options (filename 'test.data', format 'csv', on_error 'ignore', log_verbosity 'verbose'); CREATE FOREIGN TABLE =# table f_fdw_test_3 ; NOTICE: skipping row due to data type incompatibility at line 3 for column i: "a" NOTICE: 1 row was skipped due to data type incompatibility i | t ---+--- 1 | a 2 | b (2 rows) I'm going to continue developing the patch(e.g. add doc, measure performance degradation) when people also think this feature is worth adding. What do you think? [1] https://www.postgresql.org/docs/devel/sql-copy.html [2] https://x.com/fujii_masao/status/1808178032219509041 -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Attachment
On 2024-07-05 00:27, torikoshia wrote: > Hi, > > With the current file_fdw, if even one line of data conversion fails, > the contents of the file cannot be referenced at all: > > =# \! cat data/test.data > 1,a > 2,b > a,c > =# create foreign table f_fdw_test_1 (i int, t text) server f_fdw > options (filename 'test.data', format 'csv'); > CREATE FOREIGN TABLE > > =# table f_fdw_test_1; > ERROR: invalid input syntax for type integer: "a" > CONTEXT: COPY f_fdw_test, line 3, column i: "a" > > Since we'll support ON_ERROR option which tolerates data conversion > errors in COPY FROM and LOG_VERBOSITY option at v17[1], how about > supporting them on file_fdw? > > This idea comes from Fujii-san[2], and I think it'd be useful when > reading a bit dirty data. > > Attached PoC patch works like below: > > =# create foreign table f_fdw_test_2 (i int, t text) server f_fdw > options (filename 'test.data', format 'csv', on_error 'ignore'); > CREATE FOREIGN TABLE > > =# table f_fdw_test_2; > NOTICE: 1 row was skipped due to data type incompatibility > i | t > ---+--- > 1 | a > 2 | b > (2 rows) > > > =# create foreign table f_fdw_test_3 (i int, t text) server f_fdw > options (filename 'test.data', format 'csv', on_error 'ignore', > log_verbosity 'verbose'); > CREATE FOREIGN TABLE > > =# table f_fdw_test_3 ; > NOTICE: skipping row due to data type incompatibility at line 3 for > column i: "a" > NOTICE: 1 row was skipped due to data type incompatibility > i | t > ---+--- > 1 | a > 2 | b > (2 rows) > > > I'm going to continue developing the patch(e.g. add doc, measure > performance degradation) when people also think this feature is worth > adding. > > > What do you think? > > > [1] https://www.postgresql.org/docs/devel/sql-copy.html > [2] https://x.com/fujii_masao/status/1808178032219509041 Update the patch since v1 patch caused compiler warning. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Attachment
Hi, On Thu, Jul 18, 2024 at 6:38 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2024-07-05 00:27, torikoshia wrote: > > Hi, > > > > With the current file_fdw, if even one line of data conversion fails, > > the contents of the file cannot be referenced at all: > > > > =# \! cat data/test.data > > 1,a > > 2,b > > a,c > > =# create foreign table f_fdw_test_1 (i int, t text) server f_fdw > > options (filename 'test.data', format 'csv'); > > CREATE FOREIGN TABLE > > > > =# table f_fdw_test_1; > > ERROR: invalid input syntax for type integer: "a" > > CONTEXT: COPY f_fdw_test, line 3, column i: "a" > > > > Since we'll support ON_ERROR option which tolerates data conversion > > errors in COPY FROM and LOG_VERBOSITY option at v17[1], how about > > supporting them on file_fdw? +1 > > > > This idea comes from Fujii-san[2], and I think it'd be useful when > > reading a bit dirty data. > > > > Attached PoC patch works like below: > > > > =# create foreign table f_fdw_test_2 (i int, t text) server f_fdw > > options (filename 'test.data', format 'csv', on_error 'ignore'); > > CREATE FOREIGN TABLE > > > > =# table f_fdw_test_2; > > NOTICE: 1 row was skipped due to data type incompatibility > > i | t > > ---+--- > > 1 | a > > 2 | b > > (2 rows) I'm slightly concerned that users might not want to see the NOTICE message for every scan. Unlike COPY FROM, scanning a file via file_fdw could be frequent. An alternative idea of place to write the information of the number of malformed rows would be the EXPLAIN command as follow: QUERY PLAN ---------------------------------------------------------------- Foreign Scan on public.test (cost=0.00..1.10 rows=1 width=12) Output: a, b, c Foreign File: test.csv Foreign File Size: 12 b Skipped Rows: 10 > > > > > > =# create foreign table f_fdw_test_3 (i int, t text) server f_fdw > > options (filename 'test.data', format 'csv', on_error 'ignore', > > log_verbosity 'verbose'); > > CREATE FOREIGN TABLE > > > > =# table f_fdw_test_3 ; > > NOTICE: skipping row due to data type incompatibility at line 3 for > > column i: "a" > > NOTICE: 1 row was skipped due to data type incompatibility > > i | t > > ---+--- > > 1 | a > > 2 | b > > (2 rows) IIUC we have to execute ALTER FOREIGN TABLE to change the log_verbosity value and which requires to be the owner. Which seems not to be user-friendly. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Jul 22, 2024 at 03:07:46PM -0700, Masahiko Sawada wrote: > I'm slightly concerned that users might not want to see the NOTICE > message for every scan. Unlike COPY FROM, scanning a file via file_fdw > could be frequent. An alternative idea of place to write the > information of the number of malformed rows would be the EXPLAIN > command as follow: Yeah, I also have some concerns regarding the noise that this could produce if called on a foreign table on a regular basis. The verbose mode is disabled by default so I don't see why we should not allow it if the relation owner wants to show it. Perhaps we should first do a silence mode for log_verbosity to skip the NOTICE produced at the end of the COPY FROM summarizing the whole? It would be confusing to have different defaults between COPY and file_fdw, but having the option to silence that completely is also appealing from the user point of view. > QUERY PLAN > ---------------------------------------------------------------- > Foreign Scan on public.test (cost=0.00..1.10 rows=1 width=12) > Output: a, b, c > Foreign File: test.csv > Foreign File Size: 12 b > Skipped Rows: 10 Interesting idea linked to the idea of pushing the error state to something else than the logs. Sounds like a separate feature. > IIUC we have to execute ALTER FOREIGN TABLE to change the > log_verbosity value and which requires to be the owner. Which seems > not to be user-friendly. I am not sure about allowing scans to force an option to be a different thing at runtime vs what's been defined in the relation itself with CREATE/ALTER. -- Michael
Attachment
On 2024-07-23 08:57, Michael Paquier wrote: > On Mon, Jul 22, 2024 at 03:07:46PM -0700, Masahiko Sawada wrote: >> I'm slightly concerned that users might not want to see the NOTICE >> message for every scan. Unlike COPY FROM, scanning a file via file_fdw >> could be frequent. Agreed. > Yeah, I also have some concerns regarding the noise that this could > produce if called on a foreign table on a regular basis. The verbose > mode is disabled by default so I don't see why we should not allow it > if the relation owner wants to show it. > > Perhaps we should first do a silence mode for log_verbosity to skip > the NOTICE produced at the end of the COPY FROM summarizing the whole? I like this idea. If there are no objections, I'm going to make a patch for this. > It would be confusing to have different defaults between COPY and > file_fdw, but having the option to silence that completely is also > appealing from the user point of view. I'm not sure we should change the defaults. If the default of file_fdw is silence mode, I am a little concerned that there may be cases where people think they have no errors, but in fact they have. >> QUERY PLAN >> ---------------------------------------------------------------- >> Foreign Scan on public.test (cost=0.00..1.10 rows=1 width=12) >> Output: a, b, c >> Foreign File: test.csv >> Foreign File Size: 12 b >> Skipped Rows: 10 > > Interesting idea linked to the idea of pushing the error state to > something else than the logs. Sounds like a separate feature. +1 -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On 2024-07-24 19:43, torikoshia wrote: > On 2024-07-23 08:57, Michael Paquier wrote: >> On Mon, Jul 22, 2024 at 03:07:46PM -0700, Masahiko Sawada wrote: >>> I'm slightly concerned that users might not want to see the NOTICE >>> message for every scan. Unlike COPY FROM, scanning a file via >>> file_fdw >>> could be frequent. > > Agreed. > >> Yeah, I also have some concerns regarding the noise that this could >> produce if called on a foreign table on a regular basis. The verbose >> mode is disabled by default so I don't see why we should not allow it >> if the relation owner wants to show it. >> >> Perhaps we should first do a silence mode for log_verbosity to skip >> the NOTICE produced at the end of the COPY FROM summarizing the whole? > > I like this idea. > If there are no objections, I'm going to make a patch for this. Attached patches. 0001 adds new option 'silent' to log_verbosity and 0002 adds on_error and log_verbosity options to file_fdw. > I'm going to continue developing the patch(e.g. add doc, measure performance degradation) when people also think this feature is worth adding. Here is a quick performance test result. I loaded 1,000,000 rows using pgbench_accounts to a file and counted the number of rows using file_fdw on different conditions and compared the execution times on my laptop. The changed conditions are below: - source code: HEAD/patch applied - data: no error data/all row occur data conversion error at the 1st column - file_fdw options: on_error=stop/on_error=ignore There seems no significant difference in performance between HEAD and the patch applied with on_error option specified as either ignore/stop when data has no error. OTOH when all rows occur data conversion error, it is significantly faster than other cases: # HAED(e950fe58bd0) ## data:no error =# create foreign table t1 (a int, b int, c int, t text) server f_fdw options (filename 'pgb_ac', format 'csv'); =# select count(*) from t1; 1567.569 ms 1675.112 ms 1555.782 ms 1547.676 ms 1660.221 ms # patch applied ## data:no error, on_error:stop =# create foreign table t1 (a int, b int, c int, t text) server f_fdw options (filename 'pgb_ac', format 'csv', on_error 'stop'); =# select count(*) from t1; 1580.656 ms 1623.784 ms 1596.947 ms 1652.307 ms 1613.607 ms ## data:no error, on_error:ignore =# create foreign table t1 (a int, b int, c int, t text) server f_fdw options (filename 'pgb_ac', format 'csv', on_error 'ignore'); =# select count(*) from t1; 1575.718 ms 1597.464 ms 1596.540 ms 1665.818 ms 1595.453 ms #### data:all rows contain error, on_error:ignore =# create foreign table t1 (a int, b int, c int, t text) server f_fdw options (filename 'pgb_ac', format 'csv', on_error 'ignore'); =# select count(*) from t1; 914.537 ms 907.506 ms 912.768 ms 913.769 ms 914.327 ms -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation