Re: VLDB Features - Mailing list pgsql-hackers

From Trent Shipley
Subject Re: VLDB Features
Date
Msg-id 200712141830.16877.trent_shipley@qwest.net
Whole thread Raw
In response to Re: VLDB Features  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)  (Alex Shulgin <ash@commandprompt.com>)
List pgsql-hackers
On Friday 2007-12-14 16:22, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
> > to drop (and log) rows that contain malformed data. That is, rows with
> > too many or too few columns, rows that result in constraint violations,
> > and rows containing columns where the data type's input function raises
> > an error. The last case is the only thing that would be a bit tricky to
> > implement, I think: you could use PG_TRY() around the InputFunctionCall,
> > but I guess you'd need a subtransaction to ensure that you reset your
> > state correctly after catching an error.
>
> Yeah.  It's the subtransaction per row that's daunting --- not only the
> cycles spent for that, but the ensuing limitation to 4G rows imported
> per COPY.

You could extend the COPY FROM syntax with a COMMIT EVERY n clause.  This 
would help with the 4G subtransaction limit.  The cost to the ETL process is 
that a simple rollback would not be guaranteed send the process back to it's 
initial state.  There are easy ways to deal with the rollback issue though.  

A {NO} RETRY {USING algorithm} clause might be useful.   If the NO RETRY 
option is selected then the COPY FROM can run without subtransactions and in 
excess of the 4G per transaction limit.  NO RETRY should be the default since 
it preserves the legacy behavior of COPY FROM.

You could have an EXCEPTIONS TO {filename|STDERR} clause. I would not give the 
option of sending exceptions to a table since they are presumably malformed, 
otherwise they would not be exceptions.  (Users should re-process exception 
files if they want an if good then table a else exception to table b ...)

EXCEPTIONS TO and NO RETRY would be mutually exclusive.


> If we could somehow only do a subtransaction per failure, things would
> be much better, but I don't see how.




pgsql-hackers by date:

Previous
From: Neil Conway
Date:
Subject: Re: EXPLAIN ANALYZE printing logical and hardware I/O per-node
Next
From: Tom Lane
Date:
Subject: Re: VLDB Features