Re: COPY enhancements - Mailing list pgsql-hackers
From | Emmanuel Cecchet |
---|---|
Subject | Re: COPY enhancements |
Date | |
Msg-id | 4AA9705C.5020108@asterdata.com Whole thread Raw |
In response to | Re: COPY enhancements (Josh Berkus <josh@agliodbs.com>) |
Responses |
Re: COPY enhancements
|
List | pgsql-hackers |
Josh, See the answers inlined. > Thank you for tackling this very long-time TODO. > >> Error logging is described here: >> http://wiki.postgresql.org/wiki/Error_logging_in_COPY >> > > Questions & Comments: > > A) Why would someone want to turn error_logging on, but leave > error_logging_skip_tuples off? The pg_log already logs errors which copy throws by default. > When error_logging is on and skip_tuples is off, errors are logged in the error table. If skip_tuples is on, tuples are not logged in the error table. > B) As I mentioned earlier, we'll want to provide the option of logging > to a file instead of to a table. That's not a reason to reject this > patch, but probably a TODO for 8.5. > Ok but what should be the format of that file? > C) Are we sure we want to handle this via GUCs rather than extensions to > COPY syntax? It seems like fairly often users would want to log > different COPY sources to different tables/files. > I agree that new COPY options could be easier to use, the implementation is just more complex. However, the labels allows you to select the tuples related to specific COPY commands. > D) These GUCs are userset, I hope? (haven't dug into the code far > enough to tell yet). > Yes. > E) What is error_logging_tuple_label for? You don't explain/give > examples. And how is error_logging_tuple_partition_key used? > We use the label and partition key in Aster products to easily retrieve which COPY command on which partition did generate the bad tuples. By default, the tuple_label contains the COPY command that was executed (see example on Wiki) and the key contains the index of the tuple in the source file (see example on Wiki). > F) Rawdata for rejected tuples is presumably BYTEA? > Yes. I forgot to put back the table description that can be seen in the unit tests. I have updated the Wiki with the table definition. > G) We should probably have a default for error_logging_table_name, such > as pg_copy_errors. Does that table get automatically created if it > doesn't exist? > Yes, as indicated on the wiki the table is created automatically (see config variable section). > H) Finally, one request of the TODO is some way to halt import after a > specified number of bad tuples because it probably means you have the > wrong file or wrong table. Do we still want that? > We can still do that. It can be another GUC variable or an option to COPY. If the COPY command fails, everything gets rolled back (data in the destination table and error table). That would be harder to implement with a file (the rollback part). >> Autopartitioning is described here: >> http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY >> > > M) tuple_routing_in_copy should take "on" or "off", not 0 or 1. > Ok. > N) Have you measured the overhead & speed of this kind of COPY as > opposed to COPY into a single table? Have you checked the overhead if > tuple_routing_in_copy is on, but you are not loading into a partitioned > table? > Yes. There is no noticeable overhead if there is no routing to do (but routing is on). If routing is involved, the overhead depends on how sorted your input data is. If it all goes to the same partition, the caching effect works well and there is no noticeable overhead. The cost is in the constraint check and it depends on the complexity of the constraint. The more constraints you have to check and the more complex they are, the more overhead on each tuple routing. > O) Is this capable of dealing with partitioning by more than one column, > or by an expression? > Yes, we just use a brute force technique where we try all child tables 1-by-1 and rely on the existing Postgres constraint checking mechanism (no new or duplicated code there). > Finally, I'm going to suggest different names for the GUCs, as the names > you've chosen don't group well and would likely cause confusion. Here > are my suggestions, which all begin with "copy_" for prefix matching: > > error_logging --> probaby not needed, see able > error_logging_skip_tuples --> copy_skip_bad_rows > error_logging_schema_name --> copy_logging_schema_name > error_logging_relation_name --> copy_logging_table_name > error_logging_tuple_label --> don't know what this is for, see above > error_logging_tuple_partition_key --> don't know what this is for, see above > > tuple_routing_in_copy --> copy_partitioning > tuple_routing_cache_size --> copy_partitioning_cache_size > This makes sense. I'll add that on my todo list. Emmanuel -- Emmanuel Cecchet Aster Data Systems Web: http://www.asterdata.com
pgsql-hackers by date: