Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date
Msg-id CAD21AoANTGuF++Lo9PMQXNi88f=TSBahrRMw3XEB7J-5kw=N8g@mail.gmail.com
Whole thread Raw
In response to Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)  (jian he <jian.universality@gmail.com>)
List pgsql-hackers
On Mon, Dec 18, 2023 at 9:16 AM jian he <jian.universality@gmail.com> wrote:
>
> On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Hi,
> >
> > I've read this thread and the latest patch. IIUC with SAVE_ERROR
> > option, COPY FROM creates an error table for the target table and
> > writes error information there.
> >
> > While I agree that the final shape of this feature would be something
> > like that design, I'm concerned some features are missing in order to
> > make this feature useful in practice. For instance, error logs are
> > inserted to error tables without bounds, meaning that users who want
> > to tolerate errors during COPY FROM  will have to truncate or drop the
> > error tables periodically, or the database will grow with error logs
> > without limit. Ideally such maintenance work should be done by the
> > database. There might be some users who want to log such conversion
> > errors in server logs to avoid such maintenance work. I think we
> > should provide an option for where to write, at least. Also, since the
> > error tables are normal user tables internally, error logs are also
> > replicated to subscribers if there is a publication FOR ALL TABLES,
> > unlike system catalogs. I think some users would not like such
> > behavior.
>
> save the error metadata to  system catalogs would be more expensive,
> please see below explanation.
> I have no knowledge of publications.
> but i feel there is a feature request: publication FOR ALL TABLES
> exclude regex_pattern.
> Anyway, that would be another topic.

I don't think the new regex idea would be a good solution for the
existing users who are using FOR ALL TABLES publication. It's not
desirable that they have to change the publication because of this
feature. With the current patch, a logical replication using FOR ALL
TABLES publication will stop immediately after an error information is
inserted into a new error table unless the same error table is created
on subscribers.

>
> > Looking at SAVE_ERROR feature closely, I think it consists of two
> > separate features. That is, it enables COPY FROM to load data while
> > (1) tolerating errors and (2) logging errors to somewhere (i.e., an
> > error table). If we implement only (1), it would be like COPY FROM
> > tolerate errors infinitely and log errors to /dev/null. The user
> > cannot see the error details but I guess it could still help some
> > cases as Andres mentioned[1] (it might be a good idea to send the
> > number of rows successfully loaded in a NOTICE message if some rows
> > could not be loaded). Then with (2), COPY FROM can log error
> > information to somewhere such as tables and server logs and the user
> > can select it. So I'm thinking we may be able to implement this
> > feature incrementally. The first step would be something like an
> > option to ignore all errors or an option to specify the maximum number
> > of errors to tolerate before raising an ERROR. The second step would
> > be to support logging destinations such as server logs and tables.
> >
> > Regards,
> >
> > [1] https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de
> >
> > --
> > Masahiko Sawada
> > Amazon Web Services: https://aws.amazon.com
>
> > feature incrementally. The first step would be something like an
> > option to ignore all errors or an option to specify the maximum number
> > of errors to tolerate before raising an ERROR. The second step would
>
> I don't think "specify the maximum number  of errors to tolerate
> before raising an ERROR." is very useful....
>
> QUOTE from [1]
> MAXERROR [AS] error_count
> If the load returns the error_count number of errors or greater, the
> load fails. If the load returns fewer errors, it continues and returns
> an INFO message that states the number of rows that could not be
> loaded. Use this parameter to allow loads to continue when certain
> rows fail to load into the table because of formatting errors or other
> inconsistencies in the data.
> Set this value to 0 or 1 if you want the load to fail as soon as the
> first error occurs. The AS keyword is optional. The MAXERROR default
> value is 0 and the limit is 100000.
> The actual number of errors reported might be greater than the
> specified MAXERROR because of the parallel nature of Amazon Redshift.
> If any node in the Amazon Redshift cluster detects that MAXERROR has
> been exceeded, each node reports all of the errors it has encountered.
> END OF QUOTE
>
> option MAXERROR error_count. iiuc, it fails while validating line
> error_count + 1, else it raises a notice, tells you how many rows have
> errors.
>
> * case when error_count is small, and the copy fails, it only tells
> you that at least the error_count line has malformed data. but what if
> the actual malformed rows are very big. In this case, this failure
> error message is not that helpful.
> * case when error_count is very big, and the copy does not fail. then
> the actual malformed data rows are very big (still less than
> error_count). but there is no error report, you don't know which line
> has an error.
>
> Either way, if the file has a large portion of malformed rows, then
> the MAXERROR option does not make sense.
> so maybe we don't need a threshold for tolerating errors.
>
> however, we can have an option, not actually copy to the table, but
> only validate, similar to NOLOAD in [1]

I'm fine even if the feature is not like MAXERROR. If we want a
feature to tolerate errors during COPY FROM, I just thought it might
be a good idea to have a tuning knob for better flexibility, not just
like a on/off switch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Synchronizing slots from primary to standby
Next
From: Masahiko Sawada
Date:
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)