Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) - Mailing list pgsql-hackers
From | Pavel Stehule |
---|---|
Subject | Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) |
Date | |
Msg-id | CAFj8pRC3ksKhaSxDUo3Y1DekvV12OcEERryT8gPGdUAVEe2SLg@mail.gmail.com Whole thread Raw |
In response to | POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) (Alex Shulgin <ash@commandprompt.com>) |
Responses |
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
|
List | pgsql-hackers |
2014-12-25 22:23 GMT+01:00 Alex Shulgin <ash@commandprompt.com>:
Trent Shipley <trent_shipley@qwest.net> writes:
> On Friday 2007-12-14 16:22, Tom Lane wrote:
>> Neil Conway <neilc@samurai.com> writes:
>> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
>> > to drop (and log) rows that contain malformed data. That is, rows with
>> > too many or too few columns, rows that result in constraint violations,
>> > and rows containing columns where the data type's input function raises
>> > an error. The last case is the only thing that would be a bit tricky to
>> > implement, I think: you could use PG_TRY() around the InputFunctionCall,
>> > but I guess you'd need a subtransaction to ensure that you reset your
>> > state correctly after catching an error.
>>
>> Yeah. It's the subtransaction per row that's daunting --- not only the
>> cycles spent for that, but the ensuing limitation to 4G rows imported
>> per COPY.
>
> You could extend the COPY FROM syntax with a COMMIT EVERY n clause. This
> would help with the 4G subtransaction limit. The cost to the ETL process is
> that a simple rollback would not be guaranteed send the process back to it's
> initial state. There are easy ways to deal with the rollback issue though.
>
> A {NO} RETRY {USING algorithm} clause might be useful. If the NO RETRY
> option is selected then the COPY FROM can run without subtransactions and in
> excess of the 4G per transaction limit. NO RETRY should be the default since
> it preserves the legacy behavior of COPY FROM.
>
> You could have an EXCEPTIONS TO {filename|STDERR} clause. I would not give the
> option of sending exceptions to a table since they are presumably malformed,
> otherwise they would not be exceptions. (Users should re-process exception
> files if they want an if good then table a else exception to table b ...)
>
> EXCEPTIONS TO and NO RETRY would be mutually exclusive.
>
>
>> If we could somehow only do a subtransaction per failure, things would
>> be much better, but I don't see how.
Hello,
Attached is a proof of concept patch for this TODO item. There is no
docs yet, I just wanted to know if approach is sane.
The added syntax is like the following:
COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]
The way it's done it is abusing Copy Both mode and from my limited
testing, that seems to just work. The error trapping itself is done
using PG_TRY/PG_CATCH and can only catch formatting or before-insert
trigger errors, no attempt is made to recover from a failed unique
constraint, etc.
Example in action:
postgres=# \d test_copy2
Table "public.test_copy2"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
val | integer |
postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
1
NOTICE: missing data for column "val"
CONTEXT: COPY test_copy2, line 1: "1"
2
NOTICE: missing data for column "val"
CONTEXT: COPY test_copy2, line 2: "2"
3
NOTICE: missing data for column "val"
CONTEXT: COPY test_copy2, line 3: "3"
NOTICE: total exceptions ignored: 3
postgres=# \d test_copy1
Table "public.test_copy1"
Column | Type | Modifiers
--------+---------+-----------
id | integer | not null
postgres=# set client_min_messages to warning;
SET
postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
...
vmstat
zoneinfo
postgres=#
Limited performance testing shows no significant difference between
error-catching and plain code path. For example, timing
copy test_copy1 from program 'seq 1000000' [exceptions to stdout]
shows similar numbers with or without the added "exceptions to" clause.
Now that I'm sending this I wonder if the original comment about the
need for subtransaction around every loaded line still holds. Any
example of what would be not properly rolled back by just PG_TRY?
this method is unsafe .. exception handlers doesn't free memory usually - there is risk of memory leaks, source leaks
you can enforce same performance with block subtransactions - when you use subtransaction for 1000 rows, then impact of subtransactions is minimal
when block fails, then you can use row level subtransaction - it works well when you expect almost correct data.
Regards
Pavel
Pavel
Happy hacking!
--
Alex
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
pgsql-hackers by date: