Re: Conflict handling for COPY FROM - Mailing list pgsql-hackers

From David G. Johnston
Subject Re: Conflict handling for COPY FROM
Date
Msg-id CAKFQuway9O6kzLCTDgO4fS6pFyk75a1HqtvTfzKfC9diJ+473g@mail.gmail.com
Whole thread Raw
In response to Re: Conflict handling for COPY FROM  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Fri, Mar 27, 2020 at 3:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Surafel Temesgen <surafel3000@gmail.com> writes:
> [ conflict-handling-copy-from-v16.patch ]

I took a quick look at this patch, since it was marked "ready for
committer", but I don't see how it can possibly be considered committable.
 
[...]
 

Looking at this patch in terms of whether the functionality is
available in that approach, it seems like you might want two parts
of it:

1. A COPY option to be flexible about the number of columns in the
input, say by filling omitted columns with NULLs.

2. INSERT ... ON CONFLICT can be used to transfer data to permanent
tables with rejection of duplicate keys, but it doesn't have much
flexibility about just what to do with duplicates.  Maybe we could
add more ON CONFLICT actions?  Sending conflicted rows to some other
table, or updating the source table to show which rows were copied
and which not, might be useful things to think about.

tl/dr: patch 1: change the option to something like "processing_mode = row {default = file}" and relay existing errors across all rows in the table in the error detail message.

tl/dr patch 2: add an "IGNORE_CONFLICTS = {-1, 0 default, 1+}" option that converts the errors that appear for speculative insertions into warnings (in the error detail message) and either aborts the copy (cleanly, no inserted rows) if the count exceeds the user specified value) or commits if it is able (i.e., no errors in the error message detail - which would come from other problems besides conflicts).  I don't really get why a number is needed here though - its not like ON CONFLICT DO NOTHING has that ability and all this seems to be wanting to do is enable a subset of ON CONFLICT DO NOTHING for COPY - in which case no warning or error would appear in the first place.

Sorry for the rambling below, a decent chuck of the material is stream of consciousness but it all generally relates to the behaviors that this patch is touching.

My desired take-away from this would be to have COPY's error response include one line for each input line that fails to be inserted with the line number and the underlying error message passed along verbatim.  Importing, fixing one error, trying again, fixing another error, repeat is a first level goal for me.  Including at most the first 30 characters of the problematic line would facilitate the common case where the first field is an identifier which is more useful that a pure line number.  But including the entire line doesn't seem worth the risk.  Though it should be considered whether to include the error detail of the underlying error message - probably as a subsequent line.

After that, consider having a facility for telling COPY that you are OK if it ignores the problem rows and inserts the ones it is able - and converting the final exit code to be a warning instead of an error (whatever that means, if anything, in this context).

The speculative insertion stuff is outside my experience but are we trying to just make it work, give the COPY user control of whether its used or not, have an independent facility just for copy, or something else?  Teaching copy to use speculative insertion infrastructure that already exists is a patch in its own right and seems to be fixing a current POLA violation - this other stuff about reporting and ignoring errors is besides the point.  The basic inter-operability fix seems like it could be made to work under today's copy error handling and report framework just fine.

If there is value in someone augmenting the error handling and response framework to work in a tabular format instead of a free-form error message text body that too could be presented and discussed as its own commit.  Granted it may be the case that such a verbose error message from COPY as described above is undesirable but would be palatable in some other presentation format.  I'm inclined to believe, however, that even if COPY is made an exception to the general error message rules (and the detail information would be part of the errdetail, not the main message, it could just include a summary count) that the added functionality would make the exception acceptable.

To be honest I haven't dived into the assumptions baked into the regression tests to know whether the tabular stuff that was discussed and that is shown in the regression expected outputs is normal or not.  I assume the part in question is:

+COPY x from stdin WITH(ERROR_LIMIT 5);
+WARNING:  skipping "70001 22 32" --- missing data for column "d"
+WARNING:  skipping "70002 23 33 43 53 54" --- extra data after last expected column
+WARNING:  skipping "70003 24 34 44" --- missing data for column "e"
+
+   a   | b  | c  | d  |          e          
+-------+----+----+----+----------------------
+ 70005 | 27 | 37 | 47 | before trigger fired
+(1 row)

Right now COPY would just fail.  What would be nice is for it to say:

ERROR: Line 1?  missing data for column "d"
ERROR: Line 2?  --- extra data after last expected column
ERROR: Line 3?  --- missing data for column "e"

Making process go as far along as possible, to get better (more specific) messages from deeper in the system, but still just saying "ERROR: Line # failed"

Then,

WARNING: ...
WARNING: ...
WARNING: ...

COPY 1

But I'm not even sure why the tabular representation is showing up in that stanza...

From the OP:

"If you assume that users re-execute COPY FROM with the output lines as input, these columns are obstacles."

I really don't care right now to make that assumption nor is facilitating it necessarily a goal.  While this may be the current motivation there are other parts that are useful in their own right whether or not we proceed toward this final outcome.  As Tom said its unclear whether its realistic to make this work directly and there is no documentation that describes how this patch would support that use case.   I haven't delved that deeply into this potential aspect of the patch but it is something worth of its own patch on a thread with an appropriate title.

Given the current thread's title (conflict handling - not clear on whether "make it work" or "response alterations when it doesn't" is the goal) all this should be doing is making it so errors don't happen when all rows would be inserted correctly if existing speculative insertion routines were being used for the exact same data coming from a table and moved via INSERT (ON CONFLICT DO NOTHING).  That implies that any issues in populating said equivalent table are out of scope for this patch - as is changing the error handling mechanics.  I don't know if, with those pieces cut out, whether the core of the patch with speculative insertions is ready to commit.

Looking only at the patch it seems as if speculative insertion is only attempted if we are willing to ignore errors.  Thus ignoring errors is an indirect way of saying that you want to bypass the bulk insertion performance trade-offs and once you do that you might as well check for conflicts and let them resolve.  It is desirable, though, to be able to turn off those performance trade-offs simply in order to attempt to insert every single row and be informed of all the errors in the input in one pass.  Future flags like ERROR_LIMIT would then simply imply that state.

With the two tl/dr patches in place features like making actions besides "DO NOTHING" (which is what "ignore" basically does but you can choose a maximum count to accept a do nothing outcome) can be considered.  Also, adding an option to "IGNORE_PARSING_ERRORS" or IGNORE_STRUCTURE_ERRORS" (or creating an enum and a multi-valued option...).  I don't have a problem with limited scope here but the naming should be chosen to match.  You name something "ERRORS" and it should cover everything.  In addition, being more precise, besides aiding development scope, gives the user the ability to choose which kinds of errors that are willing to ignore and which they really want to be fatal.

David J.


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Commitfest 2020-03 Now in Progress
Next
From: Justin Pryzby
Date:
Subject: Re: Vacuum o/p with (full 1, parallel 0) option throwing an error