Thread: Add tuples_skipped to pg_stat_progress_copy

Add tuples_skipped to pg_stat_progress_copy

From
torikoshia
Date:
Hi,

132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to 
skip malformed data, but there is no way to watch the number of skipped 
rows during COPY.

Attached patch adds tuples_skipped to pg_stat_progress_copy, which 
counts the number of skipped tuples because source data is malformed.
If SAVE_ERROR_TO is not specified, this column remains zero.

The advantage would be that users can quickly notice and stop COPYing 
when there is a larger amount of skipped data than expected, for 
example.

As described in commit log, it is expected to add more choices for 
SAVE_ERROR_TO like 'log' and using such options may enable us to know 
the number of skipped tuples during COPY, but exposed in 
pg_stat_progress_copy would be easier to monitor.


What do you think?

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment

Re: Add tuples_skipped to pg_stat_progress_copy

From
Masahiko Sawada
Date:
On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> Hi,
>
> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
> skip malformed data, but there is no way to watch the number of skipped
> rows during COPY.
>
> Attached patch adds tuples_skipped to pg_stat_progress_copy, which
> counts the number of skipped tuples because source data is malformed.
> If SAVE_ERROR_TO is not specified, this column remains zero.
>
> The advantage would be that users can quickly notice and stop COPYing
> when there is a larger amount of skipped data than expected, for
> example.
>
> As described in commit log, it is expected to add more choices for
> SAVE_ERROR_TO like 'log' and using such options may enable us to know
> the number of skipped tuples during COPY, but exposed in
> pg_stat_progress_copy would be easier to monitor.
>
>
> What do you think?

+1

The patch is pretty simple. Here is a comment:

+       (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise zero).
+      </para></entry>
+     </row>

To be precise, this counter only advances when a value other than
'ERROR' is specified to SAVE_ERROR_TO option.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Add tuples_skipped to pg_stat_progress_copy

From
torikoshia
Date:
On 2024-01-17 14:47, Masahiko Sawada wrote:
> On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com> 
> wrote:
>> 
>> Hi,
>> 
>> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
>> skip malformed data, but there is no way to watch the number of 
>> skipped
>> rows during COPY.
>> 
>> Attached patch adds tuples_skipped to pg_stat_progress_copy, which
>> counts the number of skipped tuples because source data is malformed.
>> If SAVE_ERROR_TO is not specified, this column remains zero.
>> 
>> The advantage would be that users can quickly notice and stop COPYing
>> when there is a larger amount of skipped data than expected, for
>> example.
>> 
>> As described in commit log, it is expected to add more choices for
>> SAVE_ERROR_TO like 'log' and using such options may enable us to know
>> the number of skipped tuples during COPY, but exposed in
>> pg_stat_progress_copy would be easier to monitor.
>> 
>> 
>> What do you think?
> 
> +1
> 
> The patch is pretty simple. Here is a comment:
> 
> +       (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise 
> zero).
> +      </para></entry>
> +     </row>
> 
> To be precise, this counter only advances when a value other than
> 'ERROR' is specified to SAVE_ERROR_TO option.

Thanks for your comment and review!

Updated the patch according to your comment and option name change by 
b725b7eec.


BTW, based on this patch, I think we can add another option which 
specifies the maximum tolerable number of malformed rows.
I remember this was discussed in [1], and feel it would be useful when 
loading 'dirty' data but there is a limit to how dirty it can be.
Attached 0002 is WIP patch for this(I haven't added doc yet).

This may be better discussed in another thread, but any comments(e.g. 
necessity of this option, option name) are welcome.


[1] 
https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment

Re: Add tuples_skipped to pg_stat_progress_copy

From
Masahiko Sawada
Date:
On Tue, Jan 23, 2024 at 1:02 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2024-01-17 14:47, Masahiko Sawada wrote:
> > On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
> >> skip malformed data, but there is no way to watch the number of
> >> skipped
> >> rows during COPY.
> >>
> >> Attached patch adds tuples_skipped to pg_stat_progress_copy, which
> >> counts the number of skipped tuples because source data is malformed.
> >> If SAVE_ERROR_TO is not specified, this column remains zero.
> >>
> >> The advantage would be that users can quickly notice and stop COPYing
> >> when there is a larger amount of skipped data than expected, for
> >> example.
> >>
> >> As described in commit log, it is expected to add more choices for
> >> SAVE_ERROR_TO like 'log' and using such options may enable us to know
> >> the number of skipped tuples during COPY, but exposed in
> >> pg_stat_progress_copy would be easier to monitor.
> >>
> >>
> >> What do you think?
> >
> > +1
> >
> > The patch is pretty simple. Here is a comment:
> >
> > +       (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise
> > zero).
> > +      </para></entry>
> > +     </row>
> >
> > To be precise, this counter only advances when a value other than
> > 'ERROR' is specified to SAVE_ERROR_TO option.
>
> Thanks for your comment and review!
>
> Updated the patch according to your comment and option name change by
> b725b7eec.

Thanks! The patch looks good to me. I'm going to push it tomorrow,
barring any objections.

>
>
> BTW, based on this patch, I think we can add another option which
> specifies the maximum tolerable number of malformed rows.
> I remember this was discussed in [1], and feel it would be useful when
> loading 'dirty' data but there is a limit to how dirty it can be.
> Attached 0002 is WIP patch for this(I haven't added doc yet).

Yeah, it could be a good option.

> This may be better discussed in another thread, but any comments(e.g.
> necessity of this option, option name) are welcome.

I'd recommend forking a new thread for this option. As far as I
remember, there also was an opinion that "reject limit" stuff is not
very useful.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Add tuples_skipped to pg_stat_progress_copy

From
torikoshia
Date:
On 2024-01-24 17:05, Masahiko Sawada wrote:
> On Tue, Jan 23, 2024 at 1:02 AM torikoshia <torikoshia@oss.nttdata.com> 
> wrote:
>> 
>> On 2024-01-17 14:47, Masahiko Sawada wrote:
>> > On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
>> >> skip malformed data, but there is no way to watch the number of
>> >> skipped
>> >> rows during COPY.
>> >>
>> >> Attached patch adds tuples_skipped to pg_stat_progress_copy, which
>> >> counts the number of skipped tuples because source data is malformed.
>> >> If SAVE_ERROR_TO is not specified, this column remains zero.
>> >>
>> >> The advantage would be that users can quickly notice and stop COPYing
>> >> when there is a larger amount of skipped data than expected, for
>> >> example.
>> >>
>> >> As described in commit log, it is expected to add more choices for
>> >> SAVE_ERROR_TO like 'log' and using such options may enable us to know
>> >> the number of skipped tuples during COPY, but exposed in
>> >> pg_stat_progress_copy would be easier to monitor.
>> >>
>> >>
>> >> What do you think?
>> >
>> > +1
>> >
>> > The patch is pretty simple. Here is a comment:
>> >
>> > +       (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise
>> > zero).
>> > +      </para></entry>
>> > +     </row>
>> >
>> > To be precise, this counter only advances when a value other than
>> > 'ERROR' is specified to SAVE_ERROR_TO option.
>> 
>> Thanks for your comment and review!
>> 
>> Updated the patch according to your comment and option name change by
>> b725b7eec.
> 
> Thanks! The patch looks good to me. I'm going to push it tomorrow,
> barring any objections.

Thanks!

>> 
>> BTW, based on this patch, I think we can add another option which
>> specifies the maximum tolerable number of malformed rows.
>> I remember this was discussed in [1], and feel it would be useful when
>> loading 'dirty' data but there is a limit to how dirty it can be.
>> Attached 0002 is WIP patch for this(I haven't added doc yet).
> 
> Yeah, it could be a good option.
> 
>> This may be better discussed in another thread, but any comments(e.g.
>> necessity of this option, option name) are welcome.
> 
> I'd recommend forking a new thread for this option. As far as I
> remember, there also was an opinion that "reject limit" stuff is not
> very useful.

OK, I'll make another thread for this.


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Add tuples_skipped to pg_stat_progress_copy

From
Masahiko Sawada
Date:
On Thu, Jan 25, 2024 at 11:25 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2024-01-24 17:05, Masahiko Sawada wrote:
> > On Tue, Jan 23, 2024 at 1:02 AM torikoshia <torikoshia@oss.nttdata.com>
> > wrote:
> >>
> >> On 2024-01-17 14:47, Masahiko Sawada wrote:
> >> > On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com>
> >> > wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
> >> >> skip malformed data, but there is no way to watch the number of
> >> >> skipped
> >> >> rows during COPY.
> >> >>
> >> >> Attached patch adds tuples_skipped to pg_stat_progress_copy, which
> >> >> counts the number of skipped tuples because source data is malformed.
> >> >> If SAVE_ERROR_TO is not specified, this column remains zero.
> >> >>
> >> >> The advantage would be that users can quickly notice and stop COPYing
> >> >> when there is a larger amount of skipped data than expected, for
> >> >> example.
> >> >>
> >> >> As described in commit log, it is expected to add more choices for
> >> >> SAVE_ERROR_TO like 'log' and using such options may enable us to know
> >> >> the number of skipped tuples during COPY, but exposed in
> >> >> pg_stat_progress_copy would be easier to monitor.
> >> >>
> >> >>
> >> >> What do you think?
> >> >
> >> > +1
> >> >
> >> > The patch is pretty simple. Here is a comment:
> >> >
> >> > +       (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise
> >> > zero).
> >> > +      </para></entry>
> >> > +     </row>
> >> >
> >> > To be precise, this counter only advances when a value other than
> >> > 'ERROR' is specified to SAVE_ERROR_TO option.
> >>
> >> Thanks for your comment and review!
> >>
> >> Updated the patch according to your comment and option name change by
> >> b725b7eec.
> >
> > Thanks! The patch looks good to me. I'm going to push it tomorrow,
> > barring any objections.
>
> Thanks!

Pushed (commit 729439607).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com