Re: Conflict handling for COPY FROM - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Conflict handling for COPY FROM |
Date | |
Msg-id | 28444.1585348052@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Conflict handling for COPY FROM (Surafel Temesgen <surafel3000@gmail.com>) |
Responses |
Re: Conflict handling for COPY FROM
Re: Conflict handling for COPY FROM |
List | pgsql-hackers |
Surafel Temesgen <surafel3000@gmail.com> writes: > [ conflict-handling-copy-from-v16.patch ] I took a quick look at this patch, since it was marked "ready for committer", but I don't see how it can possibly be considered committable. 1. Covering only the errors that are thrown in DoCopy itself doesn't seem to me to pass the smell test. Yes, I'm sure there's some set of use-cases for which that'd be helpful, but I think most people would expect a "skip errors" option to be able to handle cases like malformed numbers or bad encoding. I understand the implementation reasons that make it impractical to cover other errors, but do we really want a feature that most people will see as much less than half-baked? I fear it'll be an embarrassment. 2. If I'm reading the patch correctly, (some) rejected rows are actually sent back to the client. This is a wire protocol break of the first magnitude, and can NOT be accepted. At least not without some provisions for not doing it with a client that isn't prepared for it. I also am fairly worried about the possibilities for deadlock (ie, both ends stuck waiting for the other one to absorb data) if the return traffic volume is high enough. 3. I don't think enough thought has been put into the reporting, either. + ereport(WARNING, + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), + errmsg("skipping \"%s\" --- extra data after last expected column", + cstate->line_buf.data))); That's not going to be terribly helpful if the input line runs to many megabytes. Or even if no individual line is very long, but you get millions of such warnings. It's pretty much at variance with our message style guidelines (among other things, those caution you to keep the primary error message short); and it's randomly different from COPY's existing practice, which is to show the faulty line as CONTEXT. Plus it seems plenty weird that some errors are reported this way while others are reported by sending back the bad tuple (with, it looks like, no mention of what the specific problem is ... what if you have a lot of unique indexes?). BTW, while I don't know much about the ON CONFLICT (speculative insertion) infrastructure, I wonder how well it really works to not specify an arbiter index. I see that you're getting away with it in a trivial test case that has exactly one index, but that's not stressing the point very hard. On the whole, I feel like adding this sort of functionality to COPY itself is a dead end. COPY is meant for fast bulk transfer and not much else; trying to load more functionality onto it can only end in serving multiple masters poorly. What we normally recommend if you have data that needs to be cleaned is to import it into a permissively-defined staging table (eg, with all columns declared as text) and then transfer cleaned data to your tables-of-record. Looking at this patch in terms of whether the functionality is available in that approach, it seems like you might want two parts of it: 1. A COPY option to be flexible about the number of columns in the input, say by filling omitted columns with NULLs. 2. INSERT ... ON CONFLICT can be used to transfer data to permanent tables with rejection of duplicate keys, but it doesn't have much flexibility about just what to do with duplicates. Maybe we could add more ON CONFLICT actions? Sending conflicted rows to some other table, or updating the source table to show which rows were copied and which not, might be useful things to think about. regards, tom lane
pgsql-hackers by date: