Re: Conflict handling for COPY FROM - Mailing list pgsql-hackers

From Surafel Temesgen
Subject Re: Conflict handling for COPY FROM
Date
Msg-id CALAY4q_5y3TZYkC5nCYJr1Lufbb3kykd6KAedYyW3J3rJ_XsAA@mail.gmail.com
Whole thread Raw
In response to Re: Conflict handling for COPY FROM  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Conflict handling for COPY FROM  (David Steele <david@pgmasters.net>)
List pgsql-hackers


On Fri, Mar 27, 2020 at 3:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Surafel Temesgen <surafel3000@gmail.com> writes:
> [ conflict-handling-copy-from-v16.patch ]

I took a quick look at this patch, since it was marked "ready for
committer", but I don't see how it can possibly be considered committable.

1. Covering only the errors that are thrown in DoCopy itself doesn't
seem to me to pass the smell test.  Yes, I'm sure there's some set of
use-cases for which that'd be helpful, but I think most people would
expect a "skip errors" option to be able to handle cases like malformed
numbers or bad encoding.  I understand the implementation reasons that
make it impractical to cover other errors, but do we really want a
feature that most people will see as much less than half-baked?  I fear
it'll be an embarrassment.

I did small research and most major database management system didn't claim they handle every problem in loading file at least in every usage scenario.
 

2. If I'm reading the patch correctly, (some) rejected rows are actually
sent back to the client.  This is a wire protocol break of the first
magnitude, and can NOT be accepted.  At least not without some provisions
for not doing it with a client that isn't prepared for it.  I also am
fairly worried about the possibilities for deadlock (ie, both ends stuck
waiting for the other one to absorb data) if the return traffic volume is
high enough.


if my understanding is correct copy_in occur in sub protocol inside simple or
extended query protocol that know and handle the responds
 
3. I don't think enough thought has been put into the reporting, either.

+                ereport(WARNING,
+                        (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+                         errmsg("skipping \"%s\" --- extra data after last expected column",
+                                cstate->line_buf.data)));

That's not going to be terribly helpful if the input line runs to many
megabytes.  Or even if no individual line is very long, but you get
millions of such warnings.  It's pretty much at variance with our
message style guidelines (among other things, those caution you to keep
the primary error message short); and it's randomly different from
COPY's existing practice, which is to show the faulty line as CONTEXT.
Plus it seems plenty weird that some errors are reported this way while
others are reported by sending back the bad tuple (with, it looks like,
no mention of what the specific problem is ... what if you have a lot of
unique indexes?).

Currently we can’t get problem description in speculative insertion infrastructure  and am afraid adding problem description to return tuple will make the data less usable without further processing.Warning raised for error that happen before tuple contracted. Other option is to skip those record silently but reporting to user give user the chance to correct it.  

BTW, while I don't know much about the ON CONFLICT (speculative
insertion) infrastructure, I wonder how well it really works to
not specify an arbiter index.  I see that you're getting away with
it in a trivial test case that has exactly one index, but that's
not stressing the point very hard.

If arbiter index is not specified that means all indexes as the comment in ExecCheckIndexConstraints stated

/* ----------------------------------------------------------------

*             ExecCheckIndexConstraints

*

*             This routine checks if a tuple violates any unique or

*             exclusion constraints.  Returns true if there is no conflict.

*             Otherwise returns false, and the TID of the conflicting

*             tuple is returned in *conflictTid.

*

*             If 'arbiterIndexes' is given, only those indexes are checked.

*             NIL means all indexes. 


regards

Surafel

pgsql-hackers by date:

Previous
From: Julien Rouhaud
Date:
Subject: Re: WAL usage calculation patch
Next
From: Ahsan Hadi
Date:
Subject: Re: WIP/PoC for parallel backup