Re: COPY enhancements - Mailing list pgsql-hackers

From: Simon Riggs
Subject: Re: COPY enhancements
Msg-id: 1255005636.26302.357.camel@ebony.2ndQuadrant
In response to: Re: COPY enhancements (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Wed, 2009-10-07 at 22:30 -0400, Robert Haas wrote:
> On Fri, Sep 25, 2009 at 10:01 AM, Emmanuel Cecchet <manu@asterdata.com> wrote:
> > Robert,
> >
> > Here is the new version of the patch that applies to CVS HEAD as of this
> > morning.
> >
> > Emmanuel
> 
> I took a look at this patch tonight and, having now read through some
> of it, I have some more detailed comments.
> 
> It seems quite odd to me that when COPY succeeds but there are errors,
> the transaction commits.  The only indication that some of my data
> didn't end up in the table is that the output says "COPY n" where n is
> less than the total number of rows I attempted to copy.  On the other
> hand, it would be equally bad if the transaction aborted, because then
> my error logging table would not get populated - I note that that's
> the behavior we do get if the max errors threshold is exceeded.  I'm
> not sure what the right answer is here, but it needs some thought and
> discussion.  I think at a minimum the caller needs some indication of
> the number of FAILED rows, not just the number of successful ones.
> 
> What's really bad about this is that a flag called "error_logging" is
> actually changing the behavior of the command in a way that is far
> more dramatic than (and doesn't actually have much to do with) error
> logging.  It's actually making a COPY command succeed that would
> otherwise have failed.  That's not just a logging change.
> 
> I think the underlying problem is that there is more than one feature
> here, and they're kind of mixed together.  The fact that the
> partitioning code needs to be separated from the error logging stuff
> has already been discussed, but I think it actually needs to be broken
> down even further.  I think there are three separate features in the
> error logging code:
> 
> A. Ability to ignore a certain number of errors (of certain types?)
> and still get the other tuples into the table anyway.
> B. Ability to return error information in a structured format rather
> than as an exception message.
> C. Ability to direct the structured error messages to a table.
> 
> Each of those features deserves a separate discussion to decide
> whether we want it and how best to implement it.  Personally, I think
> we should skip (C), at least as a starting point.  Instead of logging
> to a table, I think we should consider making COPY return the tuples
> indicating the error as a query result, the same way EXPLAIN returns
> the query plan.  It looks like this would eliminate the need for at
> least three of the new COPY options without losing much functionality.
> 
> I also think that, if possible, (A) and (B) should be implemented as
> separate features, so that each one can be used separately or both can
> be used in combination.

I don't see any logical issues with what has been proposed. The current
COPY allows k = 0 cumulative errors before it aborts. This patch lets us
provide a parameter to vary k. If we get nerrors > k then the COPY
should abort; if nerrors <= k then the COPY can commit. Exactly what we
need. Perhaps the parameter should simply be called max_errors rather
than error_logging, so the meaning is clear.
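
Something like this, where the table, file and option names are purely
illustrative and not the patch's actual syntax:

    COPY measurements FROM '/data/load.csv' CSV max_errors 10;
    -- nerrors <= 10: the COPY commits and the good rows are loaded
    -- nerrors  > 10: the COPY aborts, exactly as it does today with k = 0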

We must have detailed error reporting for each row individually,
otherwise we will not be able to correct the errors. "125346 rows
loaded, 257 errors" is not really useful information, so I wouldn't
bother trying to change the return info to supply the number of errors.
If you defined an error file, you know you need to check it; if it
doesn't exist, there were no errors. Simple.

For me, A and B are inseparable. Dimitri's idea to have an error file
and an error reason file seems very good to me.
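
Sketching the workflow that gives you (option names and file paths
invented for illustration only):

    COPY measurements FROM '/data/load.csv' CSV
        max_errors 100
        error_file '/data/load.rejects'          -- raw rejected input lines
        error_reason_file '/data/load.reasons';  -- one reason per rejected line

    -- Because the reject file holds the original input lines unmodified,
    -- you can fix them in place and then reload just those rows:
    COPY measurements FROM '/data/load.rejects' CSV;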

C may be important because of security concerns around writing error
files, but it isn't critical for a first commit.

(Not really in reply to Robert: most database loaders define
max_num_bad_rows as a percentage of the size of the file. Otherwise you
need to read the file to see how big it is before you can decide on an
appropriate setting, which, if the file is very big, is a bad idea.)
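
For COPY that might look something like this (again, purely illustrative
syntax):

    COPY measurements FROM '/data/load.csv' CSV max_error_rate 0.05;
    -- abort once more than 5% of the rows read so far have been
    -- rejected, without knowing the file's total row count up front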

-- 
Simon Riggs           www.2ndQuadrant.com


