Home > mailing lists

Re: COPY enhancements - Mailing list pgsql-hackers

From	Emmanuel Cecchet
Subject	Re: COPY enhancements
Date	September 10, 2009 18:32:25
Msg-id	4AA9705C.5020108@asterdata.com Whole thread Raw
In response to	Re: COPY enhancements (Josh Berkus <josh@agliodbs.com>)
Responses	Re: COPY enhancements
List	pgsql-hackers

Tree view

Josh,

See the answers inlined.
> Thank you for tackling this very long-time TODO.
>   
>> Error logging is described here:
>> http://wiki.postgresql.org/wiki/Error_logging_in_COPY
>>     
>
> Questions & Comments:
>
> A) Why would someone want to turn error_logging on, but leave
> error_logging_skip_tuples off?  The pg_log already logs errors which copy throws by default.
>   
When error_logging is on and skip_tuples is off, errors are logged in 
the error table. If skip_tuples is on, tuples are not logged in the 
error table.
> B) As I mentioned earlier, we'll want to provide the option of logging
> to a file instead of to a table.  That's not a reason to reject this
> patch, but probably a TODO for 8.5.
>   
Ok but what should be the format of that file?
> C) Are we sure we want to handle this via GUCs rather than extensions to
> COPY syntax?  It seems like fairly often users would want to log
> different COPY sources to different tables/files.
>   
I agree that new COPY options could be easier to use, the implementation 
is just more complex. However, the labels allows you to select the 
tuples related to specific COPY commands.
> D) These GUCs are userset, I hope?  (haven't dug into the code far
> enough to tell yet).
>   
Yes.
> E) What is error_logging_tuple_label for?   You don't explain/give
> examples.  And how is error_logging_tuple_partition_key used?
>   
We use the label and partition key in Aster products to easily retrieve 
which COPY command on which partition did generate the bad tuples. By 
default, the tuple_label contains the COPY command that was executed 
(see example on Wiki) and the key contains the index of the tuple in the 
source file (see example on Wiki).
> F) Rawdata for rejected tuples is presumably BYTEA?
>   
Yes. I forgot to put back the table description that can be seen in the 
unit tests. I have updated the Wiki with the table definition.
> G) We should probably have a default for error_logging_table_name, such
> as pg_copy_errors.  Does that table get automatically created if it
> doesn't exist?
>   
Yes, as indicated on the wiki the table is created automatically (see 
config variable section).
> H) Finally, one request of the TODO is some way to halt import after a
> specified number of bad tuples because it probably means you have the
> wrong file or wrong table.  Do we still want that?
>   
We can still do that. It can be another GUC variable or an option to 
COPY. If the COPY command fails, everything gets rolled back (data in 
the destination table and error table). That would be harder to 
implement with a file (the rollback part).
>> Autopartitioning is described here:
>> http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY
>>     
>
> M) tuple_routing_in_copy should take "on" or "off", not 0 or 1.
>   
Ok.
> N) Have you measured the overhead & speed of this kind of COPY as
> opposed to COPY into a single table?  Have you checked the overhead if
> tuple_routing_in_copy is on, but you are not loading into a partitioned
> table?
>   
Yes. There is no noticeable overhead if there is no routing to do (but 
routing is on).
If routing is involved, the overhead depends on how sorted your input 
data is. If it all goes to the same partition, the caching effect works 
well and there is no noticeable overhead. The cost is in the constraint 
check and it depends on the complexity of the constraint. The more 
constraints you have to check and the more complex they are, the more 
overhead on each tuple routing.
> O) Is this capable of dealing with partitioning by more than one column,
> or by an expression?
>   
Yes, we just use a brute force technique where we try all child tables 
1-by-1 and rely on the existing Postgres constraint checking mechanism 
(no new or duplicated code there).
> Finally, I'm going to suggest different names for the GUCs, as the names
> you've chosen don't group well and would likely cause confusion.  Here
> are my suggestions, which all begin with "copy_" for prefix matching:
>
> error_logging --> probaby not needed, see able
> error_logging_skip_tuples --> copy_skip_bad_rows
> error_logging_schema_name --> copy_logging_schema_name
> error_logging_relation_name --> copy_logging_table_name
> error_logging_tuple_label --> don't know what this is for, see above
> error_logging_tuple_partition_key --> don't know what this is for, see above
>
> tuple_routing_in_copy --> copy_partitioning
> tuple_routing_cache_size --> copy_partitioning_cache_size
>   
This makes sense. I'll add that on my todo list.

Emmanuel

-- 
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com

pgsql-hackers by date:

From: Pavel Stehule
Date: 10 September 2009, 18:14:43
Subject: Re: RfD: more powerful "any" types

From: Josh Berkus
Date: 10 September 2009, 19:11:25
Subject: Re: COPY enhancements

Re: COPY enhancements - Mailing list pgsql-hackers

Previous

Next