Re: Add tuples_skipped to pg_stat_progress_copy - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Add tuples_skipped to pg_stat_progress_copy
Date
Msg-id CAD21AoAxavoPmOZLQOvGm+T8x+ht4FbOL_EV4kUJR=+PK9A7kg@mail.gmail.com
Whole thread Raw
In response to Re: Add tuples_skipped to pg_stat_progress_copy  (torikoshia <torikoshia@oss.nttdata.com>)
Responses Re: Add tuples_skipped to pg_stat_progress_copy
List pgsql-hackers
On Tue, Jan 23, 2024 at 1:02 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2024-01-17 14:47, Masahiko Sawada wrote:
> > On Wed, Jan 17, 2024 at 2:22 PM torikoshia <torikoshia@oss.nttdata.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
> >> skip malformed data, but there is no way to watch the number of
> >> skipped
> >> rows during COPY.
> >>
> >> Attached patch adds tuples_skipped to pg_stat_progress_copy, which
> >> counts the number of skipped tuples because source data is malformed.
> >> If SAVE_ERROR_TO is not specified, this column remains zero.
> >>
> >> The advantage would be that users can quickly notice and stop COPYing
> >> when there is a larger amount of skipped data than expected, for
> >> example.
> >>
> >> As described in commit log, it is expected to add more choices for
> >> SAVE_ERROR_TO like 'log' and using such options may enable us to know
> >> the number of skipped tuples during COPY, but exposed in
> >> pg_stat_progress_copy would be easier to monitor.
> >>
> >>
> >> What do you think?
> >
> > +1
> >
> > The patch is pretty simple. Here is a comment:
> >
> > +       (if <literal>SAVE_ERROR_TO</literal> is specified, otherwise
> > zero).
> > +      </para></entry>
> > +     </row>
> >
> > To be precise, this counter only advances when a value other than
> > 'ERROR' is specified to SAVE_ERROR_TO option.
>
> Thanks for your comment and review!
>
> Updated the patch according to your comment and option name change by
> b725b7eec.

Thanks! The patch looks good to me. I'm going to push it tomorrow,
barring any objections.

>
>
> BTW, based on this patch, I think we can add another option which
> specifies the maximum tolerable number of malformed rows.
> I remember this was discussed in [1], and feel it would be useful when
> loading 'dirty' data but there is a limit to how dirty it can be.
> Attached 0002 is WIP patch for this(I haven't added doc yet).

Yeah, it could be a good option.

> This may be better discussed in another thread, but any comments(e.g.
> necessity of this option, option name) are welcome.

I'd recommend forking a new thread for this option. As far as I
remember, there also was an opinion that "reject limit" stuff is not
very useful.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: partitioning and identity column
Next
From: Michael Paquier
Date:
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations