Thread: Add tuples_skipped to pg_stat_progress_copy
Hi, 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to skip malformed data, but there is no way to watch the number of skipped rows during COPY. Attached patch adds tuples_skipped to pg_stat_progress_copy, which counts the number of skipped tuples because source data is malformed. If SAVE_ERROR_TO is not specified, this column remains zero. The advantage would be that users can quickly notice and stop COPYing when there is a larger amount of skipped data than expected, for example. As described in commit log, it is expected to add more choices for SAVE_ERROR_TO like 'log' and using such options may enable us to know the number of skipped tuples during COPY, but exposed in pg_stat_progress_copy would be easier to monitor. What do you think? -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Attachment
On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > > Hi, > > 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to > skip malformed data, but there is no way to watch the number of skipped > rows during COPY. > > Attached patch adds tuples_skipped to pg_stat_progress_copy, which > counts the number of skipped tuples because source data is malformed. > If SAVE_ERROR_TO is not specified, this column remains zero. > > The advantage would be that users can quickly notice and stop COPYing > when there is a larger amount of skipped data than expected, for > example. > > As described in commit log, it is expected to add more choices for > SAVE_ERROR_TO like 'log' and using such options may enable us to know > the number of skipped tuples during COPY, but exposed in > pg_stat_progress_copy would be easier to monitor. > > > What do you think? +1 The patch is pretty simple. Here is a comment: + (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise zero). + </para></entry> + </row> To be precise, this counter only advances when a value other than 'ERROR' is specified to SAVE_ERROR_TO option. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2024-01-17 14:47, Masahiko Sawada wrote: > On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com> > wrote: >> >> Hi, >> >> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to >> skip malformed data, but there is no way to watch the number of >> skipped >> rows during COPY. >> >> Attached patch adds tuples_skipped to pg_stat_progress_copy, which >> counts the number of skipped tuples because source data is malformed. >> If SAVE_ERROR_TO is not specified, this column remains zero. >> >> The advantage would be that users can quickly notice and stop COPYing >> when there is a larger amount of skipped data than expected, for >> example. >> >> As described in commit log, it is expected to add more choices for >> SAVE_ERROR_TO like 'log' and using such options may enable us to know >> the number of skipped tuples during COPY, but exposed in >> pg_stat_progress_copy would be easier to monitor. >> >> >> What do you think? > > +1 > > The patch is pretty simple. Here is a comment: > > + (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise > zero). > + </para></entry> > + </row> > > To be precise, this counter only advances when a value other than > 'ERROR' is specified to SAVE_ERROR_TO option. Thanks for your comment and review! Updated the patch according to your comment and option name change by b725b7eec. BTW, based on this patch, I think we can add another option which specifies the maximum tolerable number of malformed rows. I remember this was discussed in [1], and feel it would be useful when loading 'dirty' data but there is a limit to how dirty it can be. Attached 0002 is WIP patch for this(I haven't added doc yet). This may be better discussed in another thread, but any comments(e.g. necessity of this option, option name) are welcome. [1] https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Attachment
On Tue, Jan 23, 2024 at 1:02 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2024-01-17 14:47, Masahiko Sawada wrote: > > On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com> > > wrote: > >> > >> Hi, > >> > >> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to > >> skip malformed data, but there is no way to watch the number of > >> skipped > >> rows during COPY. > >> > >> Attached patch adds tuples_skipped to pg_stat_progress_copy, which > >> counts the number of skipped tuples because source data is malformed. > >> If SAVE_ERROR_TO is not specified, this column remains zero. > >> > >> The advantage would be that users can quickly notice and stop COPYing > >> when there is a larger amount of skipped data than expected, for > >> example. > >> > >> As described in commit log, it is expected to add more choices for > >> SAVE_ERROR_TO like 'log' and using such options may enable us to know > >> the number of skipped tuples during COPY, but exposed in > >> pg_stat_progress_copy would be easier to monitor. > >> > >> > >> What do you think? > > > > +1 > > > > The patch is pretty simple. Here is a comment: > > > > + (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise > > zero). > > + </para></entry> > > + </row> > > > > To be precise, this counter only advances when a value other than > > 'ERROR' is specified to SAVE_ERROR_TO option. > > Thanks for your comment and review! > > Updated the patch according to your comment and option name change by > b725b7eec. Thanks! The patch looks good to me. I'm going to push it tomorrow, barring any objections. > > > BTW, based on this patch, I think we can add another option which > specifies the maximum tolerable number of malformed rows. > I remember this was discussed in [1], and feel it would be useful when > loading 'dirty' data but there is a limit to how dirty it can be. > Attached 0002 is WIP patch for this(I haven't added doc yet). Yeah, it could be a good option. > This may be better discussed in another thread, but any comments(e.g. > necessity of this option, option name) are welcome. I'd recommend forking a new thread for this option. As far as I remember, there also was an opinion that "reject limit" stuff is not very useful. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2024-01-24 17:05, Masahiko Sawada wrote: > On Tue, Jan 23, 2024 at 1:02 AM torikoshia <torikoshia@oss.nttdata.com> > wrote: >> >> On 2024-01-17 14:47, Masahiko Sawada wrote: >> > On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com> >> > wrote: >> >> >> >> Hi, >> >> >> >> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to >> >> skip malformed data, but there is no way to watch the number of >> >> skipped >> >> rows during COPY. >> >> >> >> Attached patch adds tuples_skipped to pg_stat_progress_copy, which >> >> counts the number of skipped tuples because source data is malformed. >> >> If SAVE_ERROR_TO is not specified, this column remains zero. >> >> >> >> The advantage would be that users can quickly notice and stop COPYing >> >> when there is a larger amount of skipped data than expected, for >> >> example. >> >> >> >> As described in commit log, it is expected to add more choices for >> >> SAVE_ERROR_TO like 'log' and using such options may enable us to know >> >> the number of skipped tuples during COPY, but exposed in >> >> pg_stat_progress_copy would be easier to monitor. >> >> >> >> >> >> What do you think? >> > >> > +1 >> > >> > The patch is pretty simple. Here is a comment: >> > >> > + (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise >> > zero). >> > + </para></entry> >> > + </row> >> > >> > To be precise, this counter only advances when a value other than >> > 'ERROR' is specified to SAVE_ERROR_TO option. >> >> Thanks for your comment and review! >> >> Updated the patch according to your comment and option name change by >> b725b7eec. > > Thanks! The patch looks good to me. I'm going to push it tomorrow, > barring any objections. Thanks! >> >> BTW, based on this patch, I think we can add another option which >> specifies the maximum tolerable number of malformed rows. >> I remember this was discussed in [1], and feel it would be useful when >> loading 'dirty' data but there is a limit to how dirty it can be. >> Attached 0002 is WIP patch for this(I haven't added doc yet). > > Yeah, it could be a good option. > >> This may be better discussed in another thread, but any comments(e.g. >> necessity of this option, option name) are welcome. > > I'd recommend forking a new thread for this option. As far as I > remember, there also was an opinion that "reject limit" stuff is not > very useful. OK, I'll make another thread for this. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On Thu, Jan 25, 2024 at 11:25 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2024-01-24 17:05, Masahiko Sawada wrote: > > On Tue, Jan 23, 2024 at 1:02 AM torikoshia <torikoshia@oss.nttdata.com> > > wrote: > >> > >> On 2024-01-17 14:47, Masahiko Sawada wrote: > >> > On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com> > >> > wrote: > >> >> > >> >> Hi, > >> >> > >> >> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to > >> >> skip malformed data, but there is no way to watch the number of > >> >> skipped > >> >> rows during COPY. > >> >> > >> >> Attached patch adds tuples_skipped to pg_stat_progress_copy, which > >> >> counts the number of skipped tuples because source data is malformed. > >> >> If SAVE_ERROR_TO is not specified, this column remains zero. > >> >> > >> >> The advantage would be that users can quickly notice and stop COPYing > >> >> when there is a larger amount of skipped data than expected, for > >> >> example. > >> >> > >> >> As described in commit log, it is expected to add more choices for > >> >> SAVE_ERROR_TO like 'log' and using such options may enable us to know > >> >> the number of skipped tuples during COPY, but exposed in > >> >> pg_stat_progress_copy would be easier to monitor. > >> >> > >> >> > >> >> What do you think? > >> > > >> > +1 > >> > > >> > The patch is pretty simple. Here is a comment: > >> > > >> > + (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise > >> > zero). > >> > + </para></entry> > >> > + </row> > >> > > >> > To be precise, this counter only advances when a value other than > >> > 'ERROR' is specified to SAVE_ERROR_TO option. > >> > >> Thanks for your comment and review! > >> > >> Updated the patch according to your comment and option name change by > >> b725b7eec. > > > > Thanks! The patch looks good to me. I'm going to push it tomorrow, > > barring any objections. > > Thanks! Pushed (commit 729439607). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com