Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) - Mailing list pgsql-hackers

From torikoshia
Subject Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date
Msg-id 19551e8c2717c24689913083f841ddb5@oss.nttdata.com
In response to Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)  (torikoshia <torikoshia@oss.nttdata.com>)
Responses Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)  (Damir Belyalov <dam.bel07@gmail.com>)
List pgsql-hackers
On 2023-03-23 02:50, Andres Freund wrote:
Thanks again for your review.
Attached is the v5 patch.

> Have you measured whether this has negative performance effects when 
> *NOT*
> using the new option?

I loaded 10,000,000 rows into pgbench_accounts on my laptop and compared the 
elapsed time.
The only GUCs changed from their defaults were logging_collector = on and 
log_error_verbosity = verbose.

Each condition was run three times; the median run is listed below:

- patch NOT applied(36f40ce2dc66): 35299ms
- patch applied, without IGNORE_DATATYPE_ERRORS: 34409ms
- patch applied, with IGNORE_DATATYPE_ERRORS: 35510ms

There seems to be no significant degradation.

I also tested the elapsed time when loading data containing datatype errors, 
with IGNORE_DATATYPE_ERRORS enabled:
- data with 100 error rows: 35269ms
- data with 1000 error rows: 34577ms
- data with 5,000,000 error rows: 48925ms

The run with 5,000,000 error rows takes considerably longer, but that seems 
to be dominated by logging time.
Here are the results with log_min_messages and client_min_messages set to 
'error':
- data with 5,000,000 datatype errors: 23972ms
- data with 0 datatype errors: 34320ms

Conversely, with many datatype errors the load now takes less time.
This seems like a reasonable result, since more rows are skipped rather than 
inserted.


> As-is this does not work with FORMAT BINARY - and converting the binary 
> input
> functions to support soft errors won't happen for 16. So I think you 
> need to
> raise an error if BINARY and IGNORE_DATATYPE_ERRORS are specified.

Added the option check.
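
For reference, the check could look roughly like this; this is only a sketch, 
assuming it sits next to the existing binary-format checks in 
ProcessCopyOptions() and that the flag lives in 
opts_out->ignore_datatype_errors (the actual hunk may differ):

    /* sketch: reject the combination, mirroring the other BINARY checks */
    if (opts_out->binary && opts_out->ignore_datatype_errors)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("cannot specify IGNORE_DATATYPE_ERRORS in BINARY mode")));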

> On 2023-03-22 22:34:20 +0900, torikoshia wrote:
> > @@ -985,9 +986,28 @@ CopyFrom(CopyFromState cstate)
> >
> >               ExecClearTuple(myslot);
> >
> > +             if (cstate->opts.ignore_datatype_errors)
> > +             {
> > +                     escontext.details_wanted = true;
> > +                     cstate->escontext = escontext;
> > +             }
> 
> I think it might be worth pulling this out of the loop. That does mean 
> you'd
> have to reset escontext.error_occurred after an error, but that doesn't 
> seem
> too bad, you need to do other cleanup anyway.

Pulled this out of the loop and added code to reset escontext after an error.
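
Roughly, the setup now happens once before the per-row loop, and the context 
is cleared again after a soft error has been handled. A sketch based on the 
quoted hunk, not the literal v5 code; the exact reset point may differ:

    /* once, before the row-processing loop in CopyFrom() */
    if (cstate->opts.ignore_datatype_errors)
    {
        escontext.details_wanted = true;
        cstate->escontext = escontext;
    }

    ...

    /* after a row with a soft error has been reported and skipped */
    cstate->escontext.error_occurred = false;
    cstate->escontext.error_data = NULL;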

> > @@ -956,10 +957,20 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
> >                               values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
> >                       }
> >                       else
> > -                             values[m] = InputFunctionCall(&in_functions[m],
> > -                                                                                       string,
> > -                                                                                       typioparams[m],
> > -                                                                                       att->atttypmod);
> > +                             /* If IGNORE_DATATYPE_ERRORS is enabled skip rows with datatype errors */
> > +                             if (!InputFunctionCallSafe(&in_functions[m],
> > +                                                                                string,
> > +                                                                                typioparams[m],
> > +                                                                                att->atttypmod,
> > +                                                                                (Node *) &cstate->escontext,
> > +                                                                                &values[m]))
> > +                             {
> > +                                     cstate->ignored_errors++;
> > +
> > +                                     ereport(WARNING,
> > +                                                     errmsg("%s", cstate->escontext.error_data->message));
> 
> That isn't right - you lose all the details of the message. As is 
> you'd also
> leak the error context.
> 
> I think the best bet for now is to do something like
>     /* adjust elevel so we don't jump out */
>     cstate->escontext.error_data->elevel = WARNING;
>     /* despite the name, this won't raise an error if elevel < ERROR */
>     ThrowErrorData(cstate->escontext.error_data);

As I mentioned in a previous email, I added the above code for now.

> I wonder if we ought to provide a wrapper for this? It could e.g. know 
> to
> mention the original elevel and such?
> 
> 
> I don't think NextCopyFrom() is the right place to emit this warning - 
> it
> e.g. is also called from file_fdw.c, which might want to do something 
> else
> with the error. From a layering POV it seems cleaner to do this in
> CopyFrom(). You already have a check for escontext.error_occurred there
> anyway.

Agreed.
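
So the per-row handling now lives in CopyFrom(); roughly like the following 
(a sketch, not the literal patch; freeing the saved ErrorData is my reading 
of the cleanup that was pointed out, and the placement may differ):

    /* in CopyFrom(), after NextCopyFrom() has filled the slot */
    if (cstate->opts.ignore_datatype_errors &&
        cstate->escontext.error_occurred)
    {
        cstate->ignored_errors++;

        /* adjust elevel so ThrowErrorData() reports it without jumping out */
        cstate->escontext.error_data->elevel = WARNING;
        ThrowErrorData(cstate->escontext.error_data);

        /* clean up and reset so the next row starts fresh */
        FreeErrorData(cstate->escontext.error_data);
        cstate->escontext.error_data = NULL;
        cstate->escontext.error_occurred = false;

        continue;       /* skip this row */
    }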

> > @@ -3378,6 +3378,10 @@ copy_opt_item:
> >                               {
> >                                       $$ = makeDefElem("freeze", (Node *) makeBoolean(true), @1);
> >                               }
> > +                     | IGNORE_DATATYPE_ERRORS
> > +                             {
> > +                                     $$ = makeDefElem("ignore_datatype_errors", (Node *)makeBoolean(true), @1);
> > +                             }
> >                       | DELIMITER opt_as Sconst
> >                               {
> >                                       $$ = makeDefElem("delimiter", (Node *) makeString($3), @1);
> 
> 
> I think we shouldn't add a new keyword for this, but only support this 
> via
> /* new COPY option syntax */
> copy_generic_opt_list:
>                         copy_generic_opt_elem
> 
> Further increasing the size of the grammar with random keywords when we 
> have
> more generic ways to represent them seems unnecessary.

Agreed.
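
With the generic syntax the option is written as
COPY ... WITH (ignore_datatype_errors), so no new keyword is needed; it only 
takes another branch in the defel loop of ProcessCopyOptions(), roughly like 
this (sketch only; the duplicate-check flag name is just illustrative):

    else if (strcmp(defel->defname, "ignore_datatype_errors") == 0)
    {
        if (ignore_datatype_errors_specified)
            errorConflictingDefElem(defel, pstate);
        ignore_datatype_errors_specified = true;
        opts_out->ignore_datatype_errors = defGetBoolean(defel);
    }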

> > +-- tests for IGNORE_DATATYPE_ERRORS option
> > +CREATE TABLE check_ign_err (n int, m int[], k int);
> > +COPY check_ign_err FROM STDIN WITH IGNORE_DATATYPE_ERRORS;
> > +1    {1}     1
> > +a    {2}     2
> > +3    {3}     3333333333
> > +4    {a, 4}  4
> > +
> > +5    {5}     5
> > +\.
> > +SELECT * FROM check_ign_err;
> > +
> 
> I suggest adding a few more tests:
> 
> - COPY with a datatype error that can't be handled as a soft error

Added a test for the case of missing columns.
However, that is not a datatype error, so it may not be what you expected.

> - test documenting that COPY FORMAT BINARY is incompatible with 
> IGNORE_DATATYPE_ERRORS

Added it.

> - a soft error showing the error context - although that will require 
> some
>   care to avoid the function name + line in the output

I assume you mean a test that checks the server log, but I haven't come up 
with a way to do that.
A TAP test might be able to, but adding one just for this seems like 
overkill.


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
