POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) - Mailing list pgsql-hackers
From | Alex Shulgin |
---|---|
Subject | POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) |
Date | |
Msg-id | 87k31ftoe0.fsf_-_@commandprompt.com Whole thread Raw |
In response to | Re: VLDB Features (Trent Shipley <trent_shipley@qwest.net>) |
Responses |
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
|
List | pgsql-hackers |
Trent Shipley <trent_shipley@qwest.net> writes: > On Friday 2007-12-14 16:22, Tom Lane wrote: >> Neil Conway <neilc@samurai.com> writes: >> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY >> > to drop (and log) rows that contain malformed data. That is, rows with >> > too many or too few columns, rows that result in constraint violations, >> > and rows containing columns where the data type's input function raises >> > an error. The last case is the only thing that would be a bit tricky to >> > implement, I think: you could use PG_TRY() around the InputFunctionCall, >> > but I guess you'd need a subtransaction to ensure that you reset your >> > state correctly after catching an error. >> >> Yeah. It's the subtransaction per row that's daunting --- not only the >> cycles spent for that, but the ensuing limitation to 4G rows imported >> per COPY. > > You could extend the COPY FROM syntax with a COMMIT EVERY n clause. This > would help with the 4G subtransaction limit. The cost to the ETL process is > that a simple rollback would not be guaranteed send the process back to it's > initial state. There are easy ways to deal with the rollback issue though. > > A {NO} RETRY {USING algorithm} clause might be useful. If the NO RETRY > option is selected then the COPY FROM can run without subtransactions and in > excess of the 4G per transaction limit. NO RETRY should be the default since > it preserves the legacy behavior of COPY FROM. > > You could have an EXCEPTIONS TO {filename|STDERR} clause. I would not give the > option of sending exceptions to a table since they are presumably malformed, > otherwise they would not be exceptions. (Users should re-process exception > files if they want an if good then table a else exception to table b ...) > > EXCEPTIONS TO and NO RETRY would be mutually exclusive. > > >> If we could somehow only do a subtransaction per failure, things would >> be much better, but I don't see how. Hello, Attached is a proof of concept patch for this TODO item. There is no docs yet, I just wanted to know if approach is sane. The added syntax is like the following: COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout] The way it's done it is abusing Copy Both mode and from my limited testing, that seems to just work. The error trapping itself is done using PG_TRY/PG_CATCH and can only catch formatting or before-insert trigger errors, no attempt is made to recover from a failed unique constraint, etc. Example in action: postgres=# \d test_copy2 Table "public.test_copy2" Column | Type | Modifiers --------+---------+----------- id | integer | val | integer | postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout; 1 NOTICE: missing data for column "val" CONTEXT: COPY test_copy2, line 1: "1" 2 NOTICE: missing data for column "val" CONTEXT: COPY test_copy2, line 2: "2" 3 NOTICE: missing data for column "val" CONTEXT: COPY test_copy2, line 3: "3" NOTICE: total exceptions ignored: 3 postgres=# \d test_copy1 Table "public.test_copy1" Column | Type | Modifiers --------+---------+----------- id | integer | not null postgres=# set client_min_messages to warning; SET postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout; ... vmstat zoneinfo postgres=# Limited performance testing shows no significant difference between error-catching and plain code path. For example, timing copy test_copy1 from program 'seq 1000000' [exceptions to stdout] shows similar numbers with or without the added "exceptions to" clause. Now that I'm sending this I wonder if the original comment about the need for subtransaction around every loaded line still holds. Any example of what would be not properly rolled back by just PG_TRY? Happy hacking! -- Alex
Attachment
pgsql-hackers by date: